
sdk-telemetry: opt-in atlas_sdk_events_total emission (off by default)#90

Closed
mmercuri wants to merge 1 commit into main from metrics-analytics-dashboard

@mmercuri (Contributor)

Summary

Adds a small client-side telemetry module (layerlens._telemetry) that mirrors the atlas_sdk_events_total{surface, event} and atlas_sdk_request_duration_seconds shapes the atlas-app server side declares (see the metrics-analytics-dashboard PR stack on LayerLens/atlas-app, starting at #1752).

Contract: off by default. Customers opt in by setting LAYERLENS_TELEMETRY=on.
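A minimal sketch of how the opt-in gate could be read from the environment. The function name `telemetry_enabled` and the set of truthy spellings are assumptions for illustration; the PR only confirms that `LAYERLENS_TELEMETRY=on` enables telemetry and that an unset variable leaves it off.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings. Only "on" is confirmed by the PR description;
# the others are illustrative guesses at what a parser might accept.
_TRUTHY = frozenset({"on", "1", "true", "yes"})


def telemetry_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when LAYERLENS_TELEMETRY holds a truthy value.

    An unset or empty variable, or any unrecognised value, counts as off,
    so the default stays "no telemetry".
    """
    env = os.environ if environ is None else environ
    raw = env.get("LAYERLENS_TELEMETRY", "")
    return raw.strip().lower() in _TRUTHY
```

Treating every unrecognised value as off keeps a typo in the variable from silently enabling emission.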

Key properties

  • Opt-in default: LAYERLENS_TELEMETRY unset → zero-overhead no-op. OpenTelemetry SDK is not even imported until telemetry is enabled (lazy import in _try_init).
  • PII-safe attribute allowlist: only command, resource, outcome, status_code are forwarded on top of the required surface + event keys. Anything else a caller might accidentally pass is dropped.
  • Fail-isolated: every emit is wrapped in try/except; telemetry cannot break the customer's program. Init failures (missing SDK, network) silently disable telemetry for the lifetime of the process.
  • Atexit flush: CLI installs atexit.register(shutdown) so the last batch flushes cleanly even on short-lived commands.

What's wired

  • Stratix(...).__init__ → emits atlas_sdk_events_total{surface=sdk_python, event=init} once per client construction.
  • CLI root command → emits atlas_sdk_events_total{surface=cli, event=cmd_run} once per CLI invocation.

Verification

| Check | Result |
| --- | --- |
| pytest tests/test_telemetry.py | 16/16 PASS (0.67s) |
| Tests cover | disabled-by-default no-op, OTel-not-imported-when-off, attribute allowlist, exception swallow, atexit safety, truthy value parsing (9 parametrized) |

Cross-repo coupling

The server side of this counter lands in LayerLens/atlas-app PR #1752 (atlas_sdk_events_total in apps/shared/observability/metrics.go). Labels match exactly: surface + event. Suggested merge order: atlas-app #1752 first, then this. If this merges first, opted-in customers emit to a metric that doesn't have a scrape target yet. That is harmless (the OTLP exporter silently drops data when the endpoint is unreachable) but wasteful.

Docs

docs/telemetry.md (82 lines) — explains what's collected, how to disable, why none of it is PII.

Test plan

  • CI green
  • Local opt-in smoke: LAYERLENS_TELEMETRY=on python -c "from layerlens import Stratix; Stratix(api_key='test')" → confirm 1 counter emit to the configured OTLP endpoint (or silent drop if endpoint unreachable — that's the contract)
  • Confirm LAYERLENS_TELEMETRY=off python -c "import layerlens; print('ok')" does not import opentelemetry (covered by test_event_with_telemetry_off_doesnt_import_otel)
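The "does not import opentelemetry" check in the test plan can also be run by hand with a subprocess probe along these lines. The helper name `imports_module` is hypothetical, and this is a sketch of the general technique, not the body of test_event_with_telemetry_off_doesnt_import_otel. It runs a statement in a fresh interpreter and reports whether the named package ended up in `sys.modules`.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional


def imports_module(
    statement: str, module: str, extra_env: Optional[Mapping[str, str]] = None
) -> bool:
    """Run `statement` in a fresh interpreter and report whether `module`
    (or any of its submodules) was imported as a side effect."""
    probe = (
        f"{statement}\n"
        "import sys\n"
        f"name = {module!r}\n"
        "print(any(m == name or m.startswith(name + '.') for m in sys.modules))\n"
    )
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    # The statement itself may print; the probe's answer is the last line.
    return result.stdout.strip().splitlines()[-1] == "True"
```

Assuming the package is installed, the check from the test plan would then be `imports_module("import layerlens", "opentelemetry", {"LAYERLENS_TELEMETRY": "off"})`, which should return `False`. A fresh interpreter is used because `sys.modules` in a long-running test process may already contain `opentelemetry` from an earlier test.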

…otal

Adds a small client-side telemetry module that mirrors the
`atlas_sdk_events_total{surface, event}` counter exported by atlas-app
on the metrics-analytics-dashboard branch. Off by default; flip
LAYERLENS_TELEMETRY=on to opt in. Allowlists attribute keys so PII can
never leak; failure-isolated so telemetry can never break customer apps.

Wires `init` event into Stratix.__init__ and `cmd_run` event into the
CLI root command. 16 unit tests pass under PYTHONPATH=src pytest.