Skip to content

sdk-telemetry: tighten cross-repo counter contract + auto atexit + scheme handling#91

Closed
mmercuri wants to merge 1 commit into
metrics-analytics-dashboardfrom
fix/sdk-telemetry-cross-repo
Closed

sdk-telemetry: tighten cross-repo counter contract + auto atexit + scheme handling#91
mmercuri wants to merge 1 commit into
metrics-analytics-dashboardfrom
fix/sdk-telemetry-cross-repo

Conversation

@mmercuri

Copy link
Copy Markdown
Contributor

Stacks on #90. F24+F26+F27 from cross-PR panel audit. See commit message for detail. Tests: 16/16 PASS.

…heme

Three findings from the cross-PR panel audit of the metrics stack:

F24 — Counter label-set divergence from atlas-app Go declaration
  atlas-app/apps/shared/observability/metrics.go declares
  atlas_sdk_events_total with EXACTLY two labels (surface, event). The
  SDK previously carried an "allowlist" that forwarded command +
  resource + outcome + status_code to the OTel counter, producing a
  3-or-4-label Prometheus series server-side. Dashboards, cardinality
  budgets, and TestEmitSiteArity are all built against the 2-label
  declaration — the divergence meant some series didn't match the
  intended query shape.
  Fix: strip ALL attributes from the counter emission. Keep the
  ``attributes=`` parameter on the event() signature for forward
  compatibility, but drop it before counter.add(). Extra dimensions
  belong on OTel SPAN attributes, not the counter — and atlas-app's Go
  declaration stays the single source of truth for the label schema.
  Test renamed from test_attributes_allowlist to
  test_counter_labels_match_atlas_app_declaration and asserts the
  exact 2-key contract.

F26 — OTLP endpoint scheme vs LAYERLENS_OTLP_INSECURE ambiguity
  Previously:
      LAYERLENS_OTLP_ENDPOINT=https://otel.layerlens.ai:4317
      LAYERLENS_OTLP_INSECURE=true
  would pass both to OTLPMetricExporter, and the gRPC exporter's
  resolution was version-dependent — sometimes honouring the scheme,
  sometimes the flag. Customers couldn't predict which would win.
  Fix: detect scheme from the endpoint. https:// → insecure=False,
  http:// → insecure=True. The LAYERLENS_OTLP_INSECURE env var is now
  only consulted for scheme-less endpoints (the pure "host:port"
  form). Strip the scheme before handing the endpoint to the exporter
  since the gRPC exporter expects schemeless host:port.

F27 — Non-CLI SDK consumers lost up to 30s of telemetry on exit
  PeriodicExportingMetricReader flushes every 30s and on
  meter_provider.shutdown(). CLI callers registered
  atexit.register(_telemetry.shutdown) explicitly; library consumers
  that just imported Stratix and opted in via env had nothing hooking
  their process exit. Up to 30s of init/cmd_run events were silently
  dropped on a clean exit.
  Fix: auto-register atexit.register(shutdown) on first successful
  _try_init(). _atexit_registered guard keeps it idempotent, so the
  CLI's explicit atexit.register is still a no-op second registration.
  Library consumers now flush automatically.

Verification:
  PYTHONPATH=src python -m pytest tests/test_telemetry.py -v
  => 16 passed in 0.52s
@m-peko m-peko closed this May 21, 2026
@m-peko m-peko deleted the fix/sdk-telemetry-cross-repo branch July 2, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants