Feat: Observability Enhancements #19

Merged
RndmCodeGuy20 merged 11 commits into staging from feat/track-03-observability on Jul 1, 2026
@RndmCodeGuy20 (Owner)

No description provided.

Persist the producer span context in the event_outbox.traceparent column,
re-activate it in the relay (so enqueue rejoins the request trace), and inject
traceparent into the Redis stream message so the worker can continue the trace.
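
The value carried through the outbox column and the stream message is a W3C Trace Context `traceparent` string (`version-traceid-parentid-flags`). The following is a minimal, stdlib-only sketch of that round trip, not the PR's code: the helper names `format_traceparent` and `parse_traceparent`, and the dicts standing in for the outbox row and the Redis message, are hypothetical. The real implementation presumably relies on the OpenTelemetry propagator instead.

```python
import re
import secrets

# W3C Trace Context, version 00: 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize a span context into a traceparent string (hypothetical helper)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return (trace_id, span_id, sampled), or None if the value is malformed."""
    match = _TRACEPARENT_RE.fullmatch(value or "")
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Producer side: store the value next to the event (a dict stands in for the DB row).
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
outbox_row = {"payload": "{}", "traceparent": format_traceparent(trace_id, span_id)}

# Relay side: copy the stored value into the stream message fields.
stream_message = {
    "payload": outbox_row["payload"],
    "traceparent": outbox_row["traceparent"],
}

# Worker side: recover the producer context before starting its own span.
parsed = parse_traceparent(stream_message["traceparent"])
```

Returning `None` for malformed input lets the consumer fall back to starting a fresh trace rather than failing the job.
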
Add an OTel tracer (worker/utils/tracing.py), extract the producer context in
consume() and start worker.consume as a child-with-link, and span the pipeline
stages (dispatch, download, dedup, image variants, ffmpeg). Also wire the
previously-uncalled init_metrics and record job/asset/queue metrics so the
worker SLIs have data.
Stamp trace_id/span_id onto API request logs (TracingMiddleware now runs before
the logger) and worker logs, and broaden the Loki derived-field regex to link
log lines to their Tempo trace across formats.
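
As an illustration of what "across formats" can mean, here is a rough sketch of a single-capture-group pattern that finds a trace ID in JSON, logfmt, and plain-text log lines. The pattern and the sample lines are assumptions made for this example and are not taken from the PR. Loki evaluates derived-field regexes with Go's RE2 engine, so this Python version only approximates the behaviour.

```python
import re

# Hypothetical pattern: key name, optional closing quote, ':' or '=',
# optional opening quote, then a 32-hex trace ID in the single capture group.
TRACE_ID_RE = re.compile(r'(?:trace_id|traceID)"?\s*[:=]\s*"?([0-9a-f]{32})')


def extract_trace_id(line: str):
    """Return the trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(line)
    return match.group(1) if match else None


TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"

json_line = '{"level":"info","trace_id":"' + TRACE + '","span_id":"00f067aa0ba902b7"}'
logfmt_line = "level=info trace_id=" + TRACE + " msg=done"
plain_line = "INFO worker traceID: " + TRACE + " job finished"
no_trace_line = 'level=info msg="no trace here"'
```
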
Read the chi route pattern after routing (http_route was always 'unknown'),
cut the metric export interval to 15s, add finer histogram buckets for the
queue-lag SLI, and export DB connection-pool gauges via db.Stats().
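
To show why bucket resolution matters for a latency-style SLI such as queue lag, the sketch below counts the same samples into coarse and fine cumulative buckets, Prometheus-style (`le` upper bounds plus `+Inf`). The bucket boundaries and sample values are invented for illustration; the PR does not list the actual ones.

```python
from bisect import bisect_left


def bucket_counts(samples, bounds):
    """Return cumulative counts per upper bound, with a final +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)
    for sample in samples:
        # bisect_left finds the first bound >= sample, i.e. the 'le' bucket.
        per_bucket[bisect_left(bounds, sample)] += 1
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


# Hypothetical queue-lag samples in seconds.
lag_samples = [0.08, 0.2, 0.3, 0.45, 0.9, 3.0]

coarse_bounds = [1.0, 5.0, 10.0]
fine_bounds = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

coarse = bucket_counts(lag_samples, coarse_bounds)
fine = bucket_counts(lag_samples, fine_bounds)
```

With the coarse bounds, five of the six samples land in the first bucket, so any quantile below one second is interpolated across the whole 0 to 1 s range. The finer bounds separate those samples, which is what makes a sub-second lag threshold measurable.
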
Add the loadtest overlay (CPU/mem pinning + full sampling), Prometheus SLO
recording rules, remote-write receiver and exemplar storage, and pin Tempo to
2.6.1 (latest had an incompatible config schema).
Move the dashboard provider config into the dashboards provisioning dir (it was
misplaced under datasources/, so no dashboards ever loaded) and repair the
legacy metrics dashboard's datasource binding and stale metric names.
Add dashboards for API RED, worker/app saturation (USE), pipeline funnel, and
queue health, plus a consolidated experiment overview that combines k6 client
load with server-side pipeline, worker, queue, and DB metrics.
Add closed- and open-model load-test scripts that run the real
presign→upload→complete client flow with per-iteration unique bytes (to defeat
dedup), SLO-mapped thresholds, a host-run wrapper, and Prometheus remote-write
of client metrics.
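
The "unique bytes" idea can be sketched as follows, assuming the pipeline deduplicates on a content hash (SHA-256 is used here purely as an assumption; the PR does not say which key is used). The helper `unique_payload` is hypothetical, and the actual load-test scripts are not shown in this PR text and may vary the bytes differently, for example to keep an image file valid.

```python
import hashlib
import os


def unique_payload(base: bytes, nonce_len: int = 16) -> bytes:
    """Append a random nonce so every upload has a distinct content hash."""
    return base + os.urandom(nonce_len)


# The same base bytes are reused on every iteration; only the nonce changes.
base = b"\x00" * 1024

first = unique_payload(base)
second = unique_payload(base)

digest_first = hashlib.sha256(first).hexdigest()
digest_second = hashlib.sha256(second).hexdigest()
```

Because the two digests differ, a hash-based dedup stage treats each iteration as new work, so the test exercises the full pipeline instead of the short-circuit path.
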
Add the first bottleneck-analysis writeup: the single-threaded worker saturates
at ~1.1 jobs/s while the API stays idle, which motivates Track 1.
@RndmCodeGuy20 changed the base branch from master to staging on June 30, 2026, 08:47
@RndmCodeGuy20 merged commit 74c8091 into staging on Jul 1, 2026
4 checks passed