feat(instrument): Multi-tenancy hardening — cache + queue + OTel correlation + logging#123

Closed
mmercuri wants to merge 1 commit into feat/instrument-multitenancy-org-id-propagation from feat/instrument-multitenancy-hardening

@mmercuri

Stacks on PR #118 (per-event org_id propagation). Closes 4 adjacent CLAUDE.md gaps that #118 did not address.

Gaps closed

Gap 1 — Cache audit + sweep

Adds tests/instrument/adapters/_base/test_cache_tenant_isolation.py (14 tests). Proves every BaseAdapter in-memory cache is per-instance and inherits the single-tenant binding from #118:

  • circuit breaker counter / _circuit_open flag / _circuit_opened_at
  • _trace_events buffer
  • _event_sinks registry
  • _lock
  • org_id binding (immutability check)

Concurrent stress: 2-tenant @ 500 events each, 3-tenant @ 200 events each, 8-thread × 100-event single-tenant burst. All assert zero cross-tenant pollution under contention. Two structural guard tests forbid future class-level OR module-level mutable containers that would silently merge tenants.
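A structural guard of the kind described above might look roughly like the sketch below. It is an illustration only: `find_class_level_mutables`, `ToyAdapter`, and `LeakyAdapter` are hypothetical names, not taken from test_cache_tenant_isolation.py, and the check covers only class-level attributes (a module-level guard would scan `vars(module)` in the same way).

```python
from typing import Any, List, Type

# Container types that would be shared across instances if defined on a class.
MUTABLE_TYPES = (list, dict, set, bytearray)


def find_class_level_mutables(cls: Type[Any]) -> List[str]:
    """Return 'Class.attr' names of mutable containers defined at class level."""
    offenders: List[str] = []
    for klass in cls.__mro__:
        if klass is object:
            continue
        for name, value in vars(klass).items():
            # Skip dunder entries such as __annotations__ or __dict__.
            if name.startswith("__") and name.endswith("__"):
                continue
            if isinstance(value, MUTABLE_TYPES):
                offenders.append(f"{klass.__name__}.{name}")
    return offenders


class ToyAdapter:
    """Hypothetical adapter: all mutable state is created per instance."""

    def __init__(self, org_id: str) -> None:
        self.org_id = org_id
        self._trace_events: list = []


class LeakyAdapter(ToyAdapter):
    """Hypothetical regression: a class-level dict shared by every tenant."""

    _shared_cache: dict = {}
```

A guard test would then assert that `find_class_level_mutables(ToyAdapter)` is empty, so a change like `LeakyAdapter` fails the suite instead of silently merging tenants.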

Gap 2 — Per-tenant stream isolation in IngestionPipelineSink

Replaces single global buffer with dict[org_id, list[event]]. Each tenant gets:

  • Independent buffer with its own max_per_tenant_buffer_size cap (default 1000)
  • FIFO eviction scoped to THAT tenant only — a noisy tenant cannot displace a quieter tenant's events
  • dropped_per_tenant counter for the sink_per_tenant_dropped{org_id} gauge
  • buffer_size_per_tenant() snapshot for the sink_per_tenant_buffer_size{org_id} gauge

flush() issues one ingest() call per tenant under THAT tenant's org_id — never a mixed batch. tests/instrument/adapters/_base/test_sinks_per_tenant.py (10 tests) covers per-tenant partitioning, isolated FIFO eviction under burst contention, defensive snapshot copies, and per-tenant flush call shape.
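The buffering behaviour above can be sketched as follows. This is a simplified stand-in, not the code in sinks.py: the class name `PerTenantBufferSketch`, the `send()` method, and the `ingest(org_id, events)` callback signature are assumptions made for illustration.

```python
import threading
from collections import defaultdict, deque
from typing import Any, Callable, Deque, Dict, List


class PerTenantBufferSketch:
    """Toy per-tenant buffer with tenant-scoped FIFO eviction (hypothetical)."""

    def __init__(
        self,
        ingest: Callable[[str, List[Any]], None],
        max_per_tenant_buffer_size: int = 1000,
    ) -> None:
        self._ingest = ingest
        self._cap = max_per_tenant_buffer_size
        self._buffers: Dict[str, Deque[Any]] = {}
        self.dropped_per_tenant: Dict[str, int] = defaultdict(int)
        self._lock = threading.Lock()

    def send(self, org_id: str, event: Any) -> None:
        with self._lock:
            buf = self._buffers.setdefault(org_id, deque())
            if len(buf) >= self._cap:
                # Evict the oldest event of THIS tenant only.
                buf.popleft()
                self.dropped_per_tenant[org_id] += 1
            buf.append(event)

    def buffer_size_per_tenant(self) -> Dict[str, int]:
        with self._lock:
            # A fresh dict, so callers cannot mutate internal state.
            return {org_id: len(buf) for org_id, buf in self._buffers.items()}

    def flush(self) -> None:
        with self._lock:
            pending, self._buffers = self._buffers, {}
        # One ingest() call per tenant; batches are never mixed.
        for org_id, buf in pending.items():
            if buf:
                self._ingest(org_id, list(buf))
```

With a cap of 2, five events from one tenant leave that tenant's two newest events in its buffer and raise its drop count by three, while another tenant's buffer is untouched.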

Gap 3 — OTel ↔ SDK org_id correlation

Adds _set_current_span_org_id() that stamps layerlens.org_id on the active OTel span on every emit_event / emit_dict_event. No-op when:

  • OpenTelemetry isn't installed
  • The active span is the no-op INVALID_SPAN
  • is_recording() returns False
  • The OTel API raises (defensive)

The hot path never faults on observability failures. tests/instrument/adapters/_base/test_otel_correlation.py (8 tests) covers dict + typed payload paths, missing OTel, non-recording spans, set_attribute failures, and per-span tenant attribution.
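A helper with the four no-op conditions listed above could be written along these lines. This is a sketch under assumptions, not the body of `_set_current_span_org_id()` in adapter.py: the function name `set_current_span_org_id_sketch` is hypothetical, and only the public `opentelemetry.trace` API is used.

```python
try:
    from opentelemetry import trace as _otel_trace
except ImportError:  # OpenTelemetry is treated as an optional dependency
    _otel_trace = None


def set_current_span_org_id_sketch(org_id: str) -> None:
    """Stamp layerlens.org_id on the active span; never raise."""
    if _otel_trace is None:
        return  # OpenTelemetry isn't installed
    try:
        span = _otel_trace.get_current_span()
        if span is _otel_trace.INVALID_SPAN or not span.is_recording():
            return  # no real span, or the span is not being recorded
        span.set_attribute("layerlens.org_id", org_id)
    except Exception:
        # Defensive: an observability failure must not fault the emit path.
        return
```

Calling it from an emit path is then safe whether or not a tracer is configured, since every failure mode falls through to a plain return.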

Gap 4 — Tenant-aware logging propagation

New module src/layerlens/instrument/adapters/_base/logging.py with TenantContextLogAdapter and get_tenant_logger(). Every log record carries org_id in extra AND in a [org_id=...] message prefix. BaseAdapter exposes self.tlogger bound to its org_id; subclasses can use it as a drop-in replacement for logging.getLogger(__name__).

Caller-supplied extra={'org_id': ...} cannot impersonate another tenant — adapter binding always wins. tests/instrument/adapters/_base/test_tenant_logger.py (12 tests) covers fail-fast on empty/whitespace/non-string org_id, per-instance binding under shared logger names, and end-to-end wiring through BaseAdapter.
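The binding rule can be illustrated with a small `logging.LoggerAdapter` subclass. The sketch below is an assumption about the general shape, not the contents of the new logging.py module; `TenantLogAdapterSketch` is a hypothetical name.

```python
import logging
from typing import Any, MutableMapping, Tuple


class TenantLogAdapterSketch(logging.LoggerAdapter):
    """Toy tenant-bound log adapter (hypothetical, for illustration)."""

    def __init__(self, logger: logging.Logger, org_id: str) -> None:
        # Fail fast on empty, whitespace-only, or non-string org_id.
        if not isinstance(org_id, str) or not org_id.strip():
            raise ValueError("org_id must be a non-empty string")
        super().__init__(logger, {"org_id": org_id})
        self.org_id = org_id

    def process(
        self, msg: Any, kwargs: MutableMapping[str, Any]
    ) -> Tuple[Any, MutableMapping[str, Any]]:
        extra = dict(kwargs.get("extra") or {})
        # The bound org_id overrides any caller-supplied value.
        extra["org_id"] = self.org_id
        kwargs["extra"] = extra
        return f"[org_id={self.org_id}] {msg}", kwargs
```

Two adapters created over the same logger name keep separate `org_id` bindings, because the binding lives on the adapter instance rather than on the shared `logging.Logger`.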

Test plan

  • pytest tests/instrument/adapters/_base/test_{cache_tenant_isolation,otel_correlation,sinks_per_tenant,tenant_logger}.py -x — 44 passed
  • pytest tests/instrument/adapters/_base/ — 61 passed (44 new + 17 from #118, "fix(instrument): Propagate org_id through all event emissions (multi-tenancy CLAUDE.md fix)"; no regression)
  • mypy --strict src/.../adapters/_base/{adapter,sinks,logging}.py — Success: no issues
  • ruff check src/.../adapters/_base/ tests/.../adapters/_base/ — All checks passed

Files modified per gap

  • Gap 1: tests/instrument/adapters/_base/test_cache_tenant_isolation.py (new), comments in src/layerlens/instrument/adapters/_base/adapter.py
  • Gap 2: src/layerlens/instrument/adapters/_base/sinks.py (per-tenant buffer + cap + metrics), tests/instrument/adapters/_base/test_sinks_per_tenant.py (new)
  • Gap 3: src/layerlens/instrument/adapters/_base/adapter.py (_set_current_span_org_id helper + emit hooks), tests/instrument/adapters/_base/test_otel_correlation.py (new)
  • Gap 4: src/layerlens/instrument/adapters/_base/logging.py (new), src/layerlens/instrument/adapters/_base/adapter.py (tlogger property + wiring), tests/instrument/adapters/_base/test_tenant_logger.py (new)

…elation + logging

Closes 4 adjacent CLAUDE.md gaps that PR #118 (per-event org_id
propagation) did not address. Stacks on
feat/instrument-multitenancy-org-id-propagation.

Gap 1 — Cache audit + sweep
  Adds tests/instrument/adapters/_base/test_cache_tenant_isolation.py
  (14 tests). Proves every BaseAdapter in-memory cache (circuit
  breaker counters, _trace_events buffer, sink registry, lock,
  org_id binding) is per-instance and therefore inherits the
  single-tenant binding established in PR #118. Concurrent stress
  tests with 2 / 3 tenants and 8 emit threads prove no cross-tenant
  pollution under contention. Two guard tests forbid future
  introduction of class-level OR module-level mutable containers
  that would silently merge tenants.

Gap 2 — Per-tenant stream isolation in IngestionPipelineSink
  Replaces the single global buffer with dict[org_id, list[event]].
  Each tenant gets an independent buffer with its own
  max_per_tenant_buffer_size cap (default 1000) and FIFO eviction
  scoped to THAT tenant — a noisy tenant can never displace a
  quieter tenant's events. Adds a buffer_size_per_tenant() snapshot
  for the sink_per_tenant_buffer_size{org_id} gauge and a
  dropped_per_tenant counter for the sink_per_tenant_dropped{org_id}
  gauge. flush() now issues one ingest() call per tenant under that
  tenant's org_id.
  tests/instrument/adapters/_base/test_sinks_per_tenant.py
  (10 tests) covers: per-tenant partitioning, isolated FIFO
  eviction under burst contention, defensive snapshot copies, and
  per-tenant flush call shape.

Gap 3 — OTel ↔ SDK org_id correlation
  Adds _set_current_span_org_id() that stamps layerlens.org_id on
  the active OTel span on every emit_event / emit_dict_event.
  No-op when OTel is absent, when the active span is the no-op
  INVALID_SPAN, or when is_recording() returns False — never
  faults the adapter hot path. tests/instrument/adapters/_base/
  test_otel_correlation.py (8 tests) covers: dict + typed payload
  paths, missing OTel, non-recording spans, set_attribute
  failures, and per-span tenant attribution.

Gap 4 — Tenant-aware logging propagation
  New module src/layerlens/instrument/adapters/_base/logging.py
  with TenantContextLogAdapter and get_tenant_logger() factory.
  Every log record carries org_id in extras AND in a
  '[org_id=...]' message prefix. BaseAdapter exposes self.tlogger
  bound to its org_id; subclasses can drop in self.tlogger for
  logging.getLogger(__name__). Caller-supplied
  extra={'org_id': ...} cannot impersonate another tenant.
  tests/instrument/adapters/_base/test_tenant_logger.py
  (12 tests) covers: fail-fast on empty/whitespace/non-string
  org_id, per-instance binding under shared logger names, and
  end-to-end wiring through BaseAdapter.

44 new tests, all green (61 tests in tests/instrument/adapters/_base
total). mypy --strict + ruff check clean for all changed source files.

mmercuri requested a review from m-peko on April 27, 2026 04:19
m-peko closed this on May 21, 2026