
observability: structured OTel-aligned JSON logging across frontend and workers#193

Merged
jameshawkes merged 1 commit into upstream from feat/observability-structured-logging
May 18, 2026

@jameshawkes jameshawkes commented May 18, 2026

Summary

Introduces a single internal-path `polytope-observability` crate consumed by `frontend` and every `workers/*` binary, replacing ad-hoc `tracing::info!` lines with a locked event taxonomy aligned with the ECMWF Codex observability guidelines and the OTel log data model.

What changes

New crate: observability/

  • `formatter`: OTel-aligned JSON event writer. Fields: `timestamp` (RFC 3339 UTC), `severityText`, `severityNumber` (TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17), `body`, `resource`, `attributes`.
  • `resource`: stable per-service block: `service.name`, `service.version`, `deployment.environment` (`POLYTOPE_ENV`), `k8s.namespace.name`, `k8s.pod.name`.
  • `redaction`: in-app secret scrub. Covers `Authorization` / `Bearer`, `password|token|api_key=` assignments, JWT-shaped tokens, URL userinfo, and known test credentials. Applied recursively to both strings and JSON values.
  • `bounded_request*`: port of the Python `_bound_logging_value` helper. Strings are truncated at 1000 chars; lists longer than 100 items are summarised as `{_summary, count, preview}` (first 10 shown). A 32 KiB hard cap acts as a backstop.
  • `env_filter`: defaults to `info`, accepts standard `RUST_LOG`.
  • `test_helper`: capturing layer for per-crate tests.
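
To make the `formatter` field contract concrete, here is a minimal std-only sketch of the event shape and the severity-number mapping listed above. It is not the crate's actual code: `Level`, `json_escape`, `render_object`, and `format_event` are hypothetical names, the timestamp is passed in as a pre-formatted string, and all attribute values are assumed to be strings.

```rust
use std::fmt::Write;

// Hypothetical level type; the numbers follow the mapping in the list above.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Level { Trace, Debug, Info, Warn, Error }

impl Level {
    fn severity_text(self) -> &'static str {
        match self {
            Level::Trace => "TRACE",
            Level::Debug => "DEBUG",
            Level::Info => "INFO",
            Level::Warn => "WARN",
            Level::Error => "ERROR",
        }
    }

    fn severity_number(self) -> u8 {
        match self {
            Level::Trace => 1,
            Level::Debug => 5,
            Level::Info => 9,
            Level::Warn => 13,
            Level::Error => 17,
        }
    }
}

// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out
}

// Render string key/value pairs as a flat JSON object.
fn render_object(pairs: &[(&str, &str)]) -> String {
    let fields: Vec<String> = pairs
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", json_escape(k), json_escape(v)))
        .collect();
    format!("{{{}}}", fields.join(","))
}

// Build one JSON log line with the six top-level fields named in the PR.
fn format_event(
    timestamp: &str,
    level: Level,
    body: &str,
    resource: &[(&str, &str)],
    attributes: &[(&str, &str)],
) -> String {
    format!(
        "{{\"timestamp\":\"{}\",\"severityText\":\"{}\",\"severityNumber\":{},\"body\":\"{}\",\"resource\":{},\"attributes\":{}}}",
        json_escape(timestamp),
        level.severity_text(),
        level.severity_number(),
        json_escape(body),
        render_object(resource),
        render_object(attributes),
    )
}

fn main() {
    // Illustrative values only; the job id and timestamp are made up.
    let line = format_event(
        "2026-05-18T20:40:00Z",
        Level::Info,
        "job submitted",
        &[("service.name", "frontend")],
        &[("event.name", "api.job.submitted"), ("job.id", "example-job-id")],
    );
    println!("{line}");
}
```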

Wiring

  • Frontend (`frontend/src/api/{v1,v2,openmeteo/mod,mod}.rs`, `auth/middleware.rs`, `main.rs`): every `api.*` event emits via the new formatter; `enduser.id` / `enduser.realm` attached from the authenticated `AuthUser` (absent on unauthenticated paths).
  • Workers (`workers/common/src/lib.rs`, `workers/common/src/delivery/{mod,bobs,s3}.rs`, every worker `main.rs`): every `worker.*` event emits via the formatter; user fields extracted from `WorkItem.user.auth.{username,realm}` — no propagation changes needed.
  • Worker → BOBS: `X-Polytope-Job-Id` header forwarded with the broker-issued id (server-side strict-validated in BOBS).
  • Dockerfiles for frontend + every worker updated to copy the new `observability/` crate.
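
As an illustration of the Worker → BOBS step, the sketch below builds the `X-Polytope-Job-Id` header only when the id passes a strict check. The PR does not state the actual validation rule used in BOBS, so the rule here (non-empty, at most 128 bytes, ASCII alphanumerics plus `-` and `_`) is purely an assumption, as are the names `is_valid_job_id` and `job_id_header`.

```rust
const JOB_ID_HEADER: &str = "X-Polytope-Job-Id";
// Hypothetical limit; the real bound (if any) is not given in the PR.
const MAX_JOB_ID_LEN: usize = 128;

// Assumed rule: a conservative character set that also rules out CR/LF,
// so the value cannot be used to inject extra header lines.
fn is_valid_job_id(id: &str) -> bool {
    !id.is_empty()
        && id.len() <= MAX_JOB_ID_LEN
        && id
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_')
}

// Return the header pair to attach to the outgoing request, or None if the
// id fails validation and should not be forwarded.
fn job_id_header(job_id: &str) -> Option<(&'static str, String)> {
    if is_valid_job_id(job_id) {
        Some((JOB_ID_HEADER, job_id.to_string()))
    } else {
        None
    }
}

fn main() {
    println!("{:?}", job_id_header("example-job-id"));
    println!("{:?}", job_id_header("bad\r\nInjected: 1"));
}
```

Validating on the sending side as well as in BOBS is a design choice of this sketch, not something the PR claims.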

Locked event taxonomy

| Family | Events |
| --- | --- |
| api | `api.collection.list`, `api.auth.{rejected,mock_accepted}`, `api.job.{submitted,rejected,cancelled}`, `api.job.poll.{completed,pending(DEBUG),failed,cancelled}` (with `reason` attribute), `api.openmeteo.{processed,failed}` |
| worker | `worker.job.{started,completed,failed,rejected}`, `worker.delivery.{completed(DEBUG),failed}`, `worker.heartbeat.failed`, `worker.broker.poll.failed` |
| startup | `startup.config.{loaded,failed}`, `startup.server.listening`, `startup.shutdown.{received,complete}` |

`polytope.request` payload is captured only on `api.job.submitted` (bounded and redacted).

`job.id` is the single cross-service correlator (frontend → worker → BOBS).

…nd workers

Introduces a single internal-path `polytope-observability` crate consumed
by `frontend` and every `workers/*` binary, replacing the prior ad-hoc
`tracing::info!` lines with a locked event taxonomy.

Crate (`observability/`):
- `formatter`: OTel-aligned JSON event writer. Fields: `timestamp`
  (RFC 3339 UTC), `severityText`, `severityNumber` (TRACE=1, DEBUG=5,
  INFO=9, WARN=13, ERROR=17), `body`, `resource`, `attributes`.
- `resource`: stable per-service block — `service.name`, `service.version`,
  `deployment.environment` (`POLYTOPE_ENV`), `k8s.namespace.name`
  (`K8S_NAMESPACE_NAME`), `k8s.pod.name` (`K8S_POD_NAME`).
- `redaction`: in-app secret scrub before emission. Covers Authorization
  / Bearer, password / token / api_key assignments, JWT-shaped tokens,
  URL userinfo, and known test credentials.
- `bounded_request*`: port of the Python `_bound_logging_value` helper —
  strings truncated at 1000 chars, lists summarised with `{_summary,
  count, preview}` past 100 items (first 10 shown), recursive on nested
  objects. 32 KiB hard cap as backstop.
- `env_filter`: default `info`, accepts standard `RUST_LOG` overrides.
- `test_helper`: capturing layer for per-crate tests + helpers that
  assert required fields, event names, and absence of probe strings.
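
The bounding rules above can be sketched as follows. This is an illustrative std-only version, not the crate's code: the `Value` enum stands in for a JSON value, "chars" is assumed to mean Unicode scalar values, the `_summary` text is made up, and the 32 KiB hard cap is left out.

```rust
// Hypothetical stand-in for a JSON value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Int(i64),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

const MAX_STRING_CHARS: usize = 1000;
const MAX_LIST_ITEMS: usize = 100;
const PREVIEW_ITEMS: usize = 10;

// Cut at a char boundary so multibyte UTF-8 sequences are never split.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

// Recursively bound a value: truncate strings, summarise long lists.
fn bound(value: &Value) -> Value {
    match value {
        Value::Str(s) => Value::Str(truncate_chars(s, MAX_STRING_CHARS).to_string()),
        Value::Int(n) => Value::Int(*n),
        Value::List(items) if items.len() > MAX_LIST_ITEMS => Value::Object(vec![
            // The summary wording is an assumption; only the key names come from the PR.
            ("_summary".to_string(), Value::Str("list truncated".to_string())),
            ("count".to_string(), Value::Int(items.len() as i64)),
            (
                "preview".to_string(),
                Value::List(items.iter().take(PREVIEW_ITEMS).map(bound).collect()),
            ),
        ]),
        Value::List(items) => Value::List(items.iter().map(bound).collect()),
        Value::Object(fields) => Value::Object(
            fields.iter().map(|(k, v)| (k.clone(), bound(v))).collect(),
        ),
    }
}

fn main() {
    let request = Value::Object(vec![
        ("param".to_string(), Value::Str("x".repeat(1500))),
        ("step".to_string(), Value::List((0..150).map(Value::Int).collect())),
    ]);
    println!("{:?}", bound(&request));
}
```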

Events emitted (locked taxonomy, attribute key `event.name` per OTel
semantic convention, dotted form):

- `api.collection.list`, `api.auth.rejected`, `api.auth.mock_accepted`
- `api.job.submitted` (only event carrying `polytope.request`),
  `api.job.rejected`, `api.job.cancelled`,
  `api.job.poll.completed`, `api.job.poll.pending` (DEBUG),
  `api.job.poll.failed{reason=not_found}` (DEBUG),
  `api.job.poll.failed{reason=job_lost}` (WARN), `api.job.poll.cancelled`
- `api.openmeteo.processed`, `api.openmeteo.failed`
- `worker.job.started`, `worker.job.completed`, `worker.job.failed`,
  `worker.job.rejected`, `worker.delivery.completed` (DEBUG),
  `worker.delivery.failed`, `worker.heartbeat.failed`,
  `worker.broker.poll.failed`
- `startup.config.loaded`, `startup.config.failed`,
  `startup.server.listening`, `startup.shutdown.received`,
  `startup.shutdown.complete`
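
The explicitly stated severities in the list above can be captured in a small lookup. This sketch only encodes the levels the commit message actually names and returns `None` for every other event, since their levels are not given here. The function name `documented_severity` and the `Severity` enum are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Debug,
    Warn,
}

// Severity for an event name plus optional `reason` attribute, restricted to
// the combinations whose level is spelled out in the taxonomy above.
fn documented_severity(event_name: &str, reason: Option<&str>) -> Option<Severity> {
    match (event_name, reason) {
        ("api.job.poll.pending", _) => Some(Severity::Debug),
        ("worker.delivery.completed", _) => Some(Severity::Debug),
        ("api.job.poll.failed", Some("not_found")) => Some(Severity::Debug),
        ("api.job.poll.failed", Some("job_lost")) => Some(Severity::Warn),
        // Level not stated in this PR; left undecided rather than guessed.
        _ => None,
    }
}

fn main() {
    println!("{:?}", documented_severity("api.job.poll.failed", Some("job_lost")));
    println!("{:?}", documented_severity("worker.job.started", None));
}
```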

Per-request attributes: `job.id` (cross-service correlator),
`enduser.id` + `enduser.realm` (from authenticated AuthUser; absent on
unauthenticated paths). Workers pick these up from
`WorkItem.user.auth.{username,realm}` — no new propagation needed.
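
A rough sketch of how a worker might derive these attributes from a work item. Only the field path `WorkItem.user.auth.{username,realm}` and the attribute keys come from this PR; the struct definitions, the `Option` around `auth`, and the `request_attributes` helper are assumptions made for illustration.

```rust
// Hypothetical shapes mirroring the path WorkItem.user.auth.{username,realm}.
struct Auth {
    username: String,
    realm: String,
}

struct User {
    // Assumed optional so the unauthenticated case can be shown.
    auth: Option<Auth>,
}

struct WorkItem {
    job_id: String,
    user: User,
}

// Collect per-request attributes: `job.id` always, end-user fields only when
// authentication details are present.
fn request_attributes(item: &WorkItem) -> Vec<(&'static str, String)> {
    let mut attrs = vec![("job.id", item.job_id.clone())];
    if let Some(auth) = &item.user.auth {
        attrs.push(("enduser.id", auth.username.clone()));
        attrs.push(("enduser.realm", auth.realm.clone()));
    }
    attrs
}

fn main() {
    let item = WorkItem {
        job_id: "example-job-id".to_string(),
        user: User {
            auth: Some(Auth {
                username: "alice".to_string(),
                realm: "example-realm".to_string(),
            }),
        },
    };
    for (key, value) in request_attributes(&item) {
        println!("{key}={value}");
    }
}
```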

Worker → BOBS: `X-Polytope-Job-Id` header forwarded with the
broker-issued id (validated server-side in BOBS).

Dockerfiles updated to copy the new `observability/` crate into every
worker build stage.

Tests: per-crate unit + integration tests cover formatter shape,
redaction probes, severity-number contract, request truncation,
multibyte UTF-8 boundary safety, and event-name presence on lifecycle
events. mars-worker tests still need native MARS/eckit headers (no
change from before).

Plan and taxonomy live in polytope-config/docs/observability.md.

Reviewed by Warp (security audit): APPROVE_WITH_NOTES, no blocking
findings. Follow-ups: tighten the `assignment_re` word boundary so
`my_token=…` is redacted; cookie / x-api-key / proxy-authorization are
not currently treated as header-named secret keys (theoretical).
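
To illustrate the first follow-up, here is one possible way to redact assignments whose key merely ends in a sensitive name, such as `my_token`. It is a std-only sketch and not the crate's `assignment_re`: the suffix-matching rule, the `[REDACTED]` marker, and the function names are assumptions, and it only handles space-separated `key=value` words.

```rust
const SENSITIVE_KEYS: [&str; 3] = ["password", "token", "api_key"];

// A key is treated as sensitive when it ends with one of the names above,
// so `my_token` matches but `tokenizer` does not.
fn is_sensitive_key(key: &str) -> bool {
    let key = key.to_ascii_lowercase();
    SENSITIVE_KEYS.iter().any(|s| key.ends_with(s))
}

// Replace the value of every sensitive `key=value` word with a marker,
// leaving all other words and the spacing between them unchanged.
fn redact_assignments(line: &str) -> String {
    line.split(' ')
        .map(|word| match word.split_once('=') {
            Some((key, _value)) if is_sensitive_key(key) => format!("{key}=[REDACTED]"),
            _ => word.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", redact_assignments("my_token=abc123 user=bob tokenizer=fast"));
}
```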
@jameshawkes jameshawkes merged commit 70f1214 into upstream May 18, 2026
4 checks passed
@jameshawkes jameshawkes deleted the feat/observability-structured-logging branch May 18, 2026 20:40