observability: structured OTel-aligned JSON logging across frontend and workers #193
Merged
Introduces a single internal-path `polytope-observability` crate consumed
by `frontend` and every `workers/*` binary, replacing the prior ad-hoc
`tracing::info!` lines with a locked event taxonomy.
Crate (`observability/`):
- `formatter`: OTel-aligned JSON event writer. Fields: `timestamp`
(RFC 3339 UTC), `severityText`, `severityNumber` (TRACE=1, DEBUG=5,
INFO=9, WARN=13, ERROR=17), `body`, `resource`, `attributes`.
- `resource`: stable per-service block — `service.name`, `service.version`,
`deployment.environment` (`POLYTOPE_ENV`), `k8s.namespace.name`
(`K8S_NAMESPACE_NAME`), `k8s.pod.name` (`K8S_POD_NAME`).
- `redaction`: in-app secret scrub before emission. Covers Authorization
/ Bearer, password / token / api_key assignments, JWT-shaped tokens,
URL userinfo, and known test credentials.
- `bounded_request*`: port of the Python `_bound_logging_value` helper —
strings truncated at 1000 chars, lists summarised with `{_summary,
count, preview}` past 100 items (first 10 shown), recursive on nested
objects. 32 KiB hard cap as backstop.
- `env_filter`: default `info`, accepts standard `RUST_LOG` overrides.
- `test_helper`: capturing layer for per-crate tests + helpers that
assert required fields, event names, and absence of probe strings.
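The severity contract listed for `formatter` can be sketched as a plain mapping. This is a minimal, std-only illustration rather than the crate's code: the `Level` enum and both function names are hypothetical, and the uppercase `severityText` spellings are an assumption. The numbers are the ones stated above, which are the first value of each OTel severity range.

```rust
/// Hypothetical stand-in for the logging level type (not the crate's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// `severityNumber` per level, as listed in the formatter description.
pub fn severity_number(level: Level) -> u8 {
    match level {
        Level::Trace => 1,
        Level::Debug => 5,
        Level::Info => 9,
        Level::Warn => 13,
        Level::Error => 17,
    }
}

/// `severityText` per level. Uppercase spelling is an assumption.
pub fn severity_text(level: Level) -> &'static str {
    match level {
        Level::Trace => "TRACE",
        Level::Debug => "DEBUG",
        Level::Info => "INFO",
        Level::Warn => "WARN",
        Level::Error => "ERROR",
    }
}
```

Keeping the mapping in one exhaustive `match` means that adding a level without assigning it a number is a compile error, which suits a contract that tests are expected to pin down.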
Events emitted (locked taxonomy, attribute key `event.name` per OTel
semantic convention, dotted form):
- `api.collection.list`, `api.auth.rejected`, `api.auth.mock_accepted`
- `api.job.submitted` (only event carrying `polytope.request`),
`api.job.rejected`, `api.job.cancelled`,
`api.job.poll.completed`, `api.job.poll.pending` (DEBUG),
`api.job.poll.failed{reason=not_found}` (DEBUG),
`api.job.poll.failed{reason=job_lost}` (WARN), `api.job.poll.cancelled`
- `api.openmeteo.processed`, `api.openmeteo.failed`
- `worker.job.started`, `worker.job.completed`, `worker.job.failed`,
`worker.job.rejected`, `worker.delivery.completed` (DEBUG),
`worker.delivery.failed`, `worker.heartbeat.failed`,
`worker.broker.poll.failed`
- `startup.config.loaded`, `startup.config.failed`,
`startup.server.listening`, `startup.shutdown.received`,
`startup.shutdown.complete`
Per-request attributes: `job.id` (cross-service correlator),
`enduser.id` + `enduser.realm` (from authenticated AuthUser; absent on
unauthenticated paths). Workers pick these up from
`WorkItem.user.auth.{username,realm}` — no new propagation needed.
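As a rough illustration of the per-request attributes, the sketch below builds an attribute map in which the `enduser.*` keys are present only when a user is known. The `AuthUser` struct shown here and the `request_attributes` function are hypothetical stand-ins, and mapping `username` to `enduser.id` is an assumption based on the `WorkItem.user.auth.{username,realm}` note above.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the authenticated user type.
pub struct AuthUser {
    pub username: String,
    pub realm: String,
}

/// Builds the attribute block for one event (illustrative helper, not the crate's API).
/// `job.id` and the `enduser.*` keys are omitted, not set to empty strings,
/// when the corresponding input is absent.
pub fn request_attributes(
    event_name: &str,
    job_id: Option<&str>,
    user: Option<&AuthUser>,
) -> BTreeMap<&'static str, String> {
    let mut attrs = BTreeMap::new();
    attrs.insert("event.name", event_name.to_string());
    if let Some(id) = job_id {
        attrs.insert("job.id", id.to_string());
    }
    if let Some(u) = user {
        attrs.insert("enduser.id", u.username.clone());
        attrs.insert("enduser.realm", u.realm.clone());
    }
    attrs
}
```

Omitting the keys on unauthenticated paths lets a test assert their absence directly, instead of having to distinguish an empty value from a missing one.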
Worker → BOBS: `X-Polytope-Job-Id` header forwarded with the
broker-issued id (validated server-side in BOBS).
Dockerfiles updated to copy the new `observability/` crate into every
worker build stage.
Tests: per-crate unit + integration tests cover formatter shape,
redaction probes, severity-number contract, request truncation,
multibyte UTF-8 boundary safety, and event-name presence on lifecycle
events. mars-worker tests still need native MARS/eckit headers (no
change from before).
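To show what "multibyte UTF-8 boundary safety" means for truncation, here is a minimal std-only sketch. Both function names are hypothetical, and whether the real helper counts characters or bytes is not stated in this PR, so both variants are shown. The point is that neither slices a string in the middle of a multibyte character, which would panic in Rust.

```rust
/// Truncates to at most `max_chars` Unicode scalar values.
pub fn truncate_chars(input: &str, max_chars: usize) -> &str {
    // `char_indices` yields byte offsets that are always valid boundaries.
    match input.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &input[..byte_idx],
        None => input, // already short enough
    }
}

/// Truncates to at most `max_bytes` bytes, backing off to the nearest
/// character boundary at or below the limit.
pub fn truncate_bytes(input: &str, max_bytes: usize) -> &str {
    if input.len() <= max_bytes {
        return input;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    &input[..end]
}
```

A test in this spirit would feed a string of two-byte characters longer than the limit and check both the resulting character count and that the call returns without panicking.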
Plan and taxonomy live in polytope-config/docs/observability.md.
Reviewed by Warp (security audit): APPROVE_WITH_NOTES — no blocking
findings. Follow-ups: tighten the `assignment_re` word boundary so
`my_token=…` is redacted; cookie / x-api-key / proxy-authorization
are not currently treated as header-named secret keys (theoretical).
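The first follow-up is about key matching: a key such as `my_token` should count as secret even though the sensitive word is only a suffix. The sketch below illustrates that behaviour with std-only string handling. It is not the crate's `assignment_re` and uses no regex; the suffix list, the `[REDACTED]` placeholder, and both function names are assumptions made for illustration.

```rust
/// Key suffixes treated as secret (assumed list, taken from the names in this PR).
const SECRET_SUFFIXES: [&str; 3] = ["password", "token", "api_key"];
/// Placeholder written in place of a secret value (assumption).
const PLACEHOLDER: &str = "[REDACTED]";

/// True when the key ends with a secret suffix, ignoring ASCII case,
/// so `my_token` and `API_KEY` both match.
pub fn is_secret_key(key: &str) -> bool {
    let key = key.to_ascii_lowercase();
    SECRET_SUFFIXES.iter().any(|suffix| key.ends_with(suffix))
}

/// Replaces the value of every space-separated `key=value` word whose key is secret.
pub fn redact_assignments(line: &str) -> String {
    line.split(' ')
        .map(|word| match word.split_once('=') {
            Some((key, value)) if !value.is_empty() && is_secret_key(key) => {
                format!("{}={}", key, PLACEHOLDER)
            }
            _ => word.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Suffix matching errs on the side of over-redaction (any key ending in `token` is scrubbed), which is usually the safer failure mode for a log scrubber. A regex-based version would need an equivalent rule in place of a strict leading word boundary.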
Summary
Introduces a single internal-path `polytope-observability` crate consumed by `frontend` and every `workers/*` binary, replacing ad-hoc `tracing::info!` lines with a locked event taxonomy aligned with the ECMWF Codex observability guidelines and the OTel log data model.
What changes
New crate: `observability/`
- `formatter`: OTel-aligned JSON event writer. Fields: `timestamp` (RFC 3339 UTC), `severityText`, `severityNumber` (TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17), `body`, `resource`, `attributes`.
- `resource`: stable per-service block: `service.name`, `service.version`, `deployment.environment` (`POLYTOPE_ENV`), `k8s.namespace.name`, `k8s.pod.name`.
- `redaction`: in-app secret scrub. Authorization / Bearer, `password|token|api_key=` assignments, JWT-shaped tokens, URL userinfo, known test credentials. Applied recursively to strings AND JSON values.
- `bounded_request*`: port of the Python `_bound_logging_value` helper. Strings truncated at 1000 chars, lists summarised as `{_summary, count, preview}` past 100 items (first 10 shown). 32 KiB hard cap as backstop.
- `env_filter`: defaults to `info`, accepts standard `RUST_LOG`.
- `test_helper`: capturing layer for per-crate tests.
Wiring
Locked event taxonomy
`polytope.request` payload is captured only on `api.job.submitted` (bounded and redacted).
`job.id` is the single cross-service correlator (frontend → worker → BOBS).