LayerLens · mmercuri · Apr 23, 2026 · Apr 23, 2026 · Apr 23, 2026 · Apr 24, 2026
diff --git a/docs/adapters/memory-contract.md b/docs/adapters/memory-contract.md
@@ -0,0 +1,194 @@
+# Memory persistence contract for adapters
+
+LayerLens adapters carry per-conversation, per-agent recall — episodic,
+procedural, and semantic memory — alongside the trace events they emit.
+This module ports the ad-hoc memory plumbing that the four mature
+framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry
+on the `ateam` monorepo into a shared, replay-safe primitive that any
+adapter on the `stratix-python` SDK can plug into. Without this
+plumbing, lighter adapters behave as "goldfish agents" — every run
+starts from a blank slate, which is the difference between a usable
+production agent and a demo.
+
+This document defines the binding contract every adapter that integrates
+the recorder must satisfy. It is enforced at runtime by
+`MemoryRecorder.__init__` (fail-fast on missing tenant), by
+`MemoryRecorder.restore` (cross-tenant rejection + content-hash
+integrity check), and at CI time by
+`tests/instrument/adapters/_base/test_memory.py`.
+
+Cross-pollination audit reference:
+[`A:/tmp/adapter-cross-pollination-audit.md`](../../../tmp/adapter-cross-pollination-audit.md)
+§2 item #1.
+
+## The three buckets
+
+The memory model is the canonical agent-memory split that appears
+across the literature (LangChain memory module; CrewAI procedural
+memory; AutoGen episodic/semantic split):
+
+| Bucket         | Lifetime              | Bounded by            | Eviction              |
+|----------------|-----------------------|------------------------|------------------------|
+| **Episodic**   | per-turn              | `max_episodic` (200)   | FIFO (oldest dropped)  |
+| **Procedural** | recurring patterns    | `max_procedural` (16)  | least-frequent + ties broken by oldest `last_seen_turn` |
+| **Semantic**   | long-lived facts      | `max_semantic` (64)    | least-recently-set     |
+
+* **Episodic** — per-turn `(input, output, error?, tools?, extra?)`
+  records, ordered by `turn_index`. The detector for procedural
+  patterns reads this stream.
+* **Procedural** — derived: each entry is
+  `{"pattern": [[prev_turn_tools], [cur_turn_tools]], "count": int,
+  "last_seen_turn": int}`. Detected automatically from the recent
+  episodic window every time `record_turn` is called.
+* **Semantic** — caller-controlled key/value store
+  (`set_semantic(key, value)`). Both keys and values are stringified.
+  Callers wanting structured semantic data should JSON-encode their
+  value.
+
+All three buckets are **bounded** (CLAUDE.md "every cache must be
+bounded"). The defaults are conservative; callers wanting a different
+size construct the recorder with explicit `max_*` kwargs.
+
+## The contract
+
+1. **Every adapter owns exactly one `MemoryRecorder`.** Constructed in
+   `BaseAdapter.__init__` and exposed via `adapter.memory_recorder`
+   (read-only property). The recorder is bound to the same `org_id` as
+   the adapter — multi-tenancy is propagated.
+
+2. **Construction without a tenant raises.** `MemoryRecorder(org_id="")`
+   raises `ValueError`. There is no "default" tenant, no blank
+   fallback. `BaseAdapter.__init__` already fails fast on missing
+   `org_id` (see [multi-tenancy.md](multi-tenancy.md)) so the recorder
+   inherits that guarantee.
+
+3. **Every recorded turn is bounded.** A single oversized turn cannot
+   blow past the bucket caps: per-field strings longer than 8 KB are
+   truncated to a deterministic suffix (`<...truncated:orig_len=N>`).
+   The truncation is defence-in-depth, not a substitute for the
+   adapter-level truncation policy (cross-poll #3). Detection of
+   recurring tool patterns runs in O(window) per turn.
+
+4. **Cross-tenant restore is prohibited.** `restore(snapshot)` raises
+   `ValueError` if `snapshot.org_id != recorder.org_id`. This mirrors
+   the `BaseAdapter.org_id` contract — a tenant-A snapshot cannot land
+   in a tenant-B recorder, even if both happen to share a process.
+
+5. **Snapshots are tamper-evident.** `restore(snapshot)` recomputes the
+   SHA-256 content hash and rejects the snapshot if the recorded hash
+   does not match. Guards against accidentally-mutated dicts in
+   transit and against forged snapshots reconstructed without the
+   `MemorySnapshot` factory.
+
+6. **Snapshots are replay-safe.** The round-trip
+   `snapshot() → restore() → snapshot()` produces a snapshot with the
+   identical `content_hash` (deterministic reconstruction). This is the
+   foundation of replay-safe memory: the replay engine restores the
+   recorder, then the adapter re-runs the agent and produces the same
+   next-turn snapshot. The `record_turn` method stamps a wall-clock
+   `timestamp_ns` into the new turn — replay engines suppress this
+   drift by capturing the original `timestamp_ns` from the source
+   trace and seeding the recorder's clock at restore time.
+
+7. **Snapshot serialisation is dict-shaped.** `snapshot.to_dict()`
+   returns a JSON-serialisable dict; `MemorySnapshot.from_dict(data)`
+   round-trips. Adapters embed the dict under
+   `ReplayableTrace.metadata["memory_snapshot"]` so the replay engine
+   can reconstruct via `MemoryRecorder.restore(MemorySnapshot.from_dict(...))`.
+
+8. **Recording is best-effort.** `BaseAdapter.record_memory_turn`
+   catches and logs all exceptions at DEBUG. A failure inside the
+   recorder MUST NOT propagate into the host framework's call stack —
+   tracing never breaks user code (CLAUDE.md). The trade-off is that a
+   recorder bug shows up as missing memory in the replay rather than a
+   crash in production.
+
+9. **Thread-safe.** `record_turn`, `set_semantic`, `clear`, and
+   `restore` are all guarded by an internal lock. Many concurrent
+   `record_turn` calls produce a snapshot whose `episodic` indices
+   form an unbroken `1..N` sequence.
+
+## Wiring at the lifecycle hook
+
+Every adapter wires `record_memory_turn(...)` into its **agent-output
+boundary** — the point at which the framework reports a completed
+agent step / chat turn / invocation. The exact hook varies by
+framework:
+
+| Adapter             | Hook                                                                | Episodic input            | Tool list                                                   |
+|---------------------|---------------------------------------------------------------------|---------------------------|--------------------------------------------------------------|
+| `agno`              | `Agent.run` / `arun` finally-block                                   | `args[0] / kwargs["input"]`| `_collect_tool_names(agent, result)` from `result.messages`  |
+| `ms_agent_framework`| `Chat.invoke` / `invoke_stream` finally-block                        | `kwargs["input"]/["message"]`| `_collect_tool_names_from_messages(seen)` from streamed items|
+| `openai_agents`     | `_on_agent_span_end` (TraceProcessor) + `on_run_end` (Runner wrap)    | cached at `_on_agent_span_start` per `span_id` | rolled up from `_on_function_span_end` per `parent_id` |
+| `llama_index`       | `_on_agent_step_end`                                                  | cached at `_on_agent_step_start` per thread id| rolled up from `_on_tool_call` per thread id |
+| `google_adk`        | `after_agent_callback` + `on_agent_end`                                | cached at `before_agent_callback` per thread id| rolled up from `after_tool_callback` per thread id |
+| `bedrock_agents`    | `_after_invoke_agent` (boto3 hook)                                    | cached at `_before_invoke_agent` per thread id| rolled up from `_process_trace` action-group / KB step names |
+
+Each adapter also embeds its memory snapshot in `serialize_for_replay`
+output via `ReplayableTrace.metadata["memory_snapshot"] =
+self.memory_snapshot_dict()` — so a downstream replay engine can
+reconstruct the full episodic + procedural + semantic state before
+re-execution.
+
+## Honest scope disclosure (target adapter coverage)
+
+The cross-pollination audit §2 item #1 enumerates **seven** target
+adapters: `agno`, `ms_agent_framework`, `openai_agents`, `llama_index`,
+`google_adk`, `bedrock_agents`, **`browser_use`**.
+
+Six are wired in this PR. The seventh — `browser_use` — does not exist
+on this branch's base (`feat/instrument-multitenancy-org-id-propagation`);
+it lives on the parallel `feat/instrument-frameworks-browser-use-full`
+history. It will be wired when that adapter is ported to this base or
+when the histories merge. This follows the same honest-disclosure
+pattern as PR #120 (state filters, which omitted `ms_agent_framework`
+for the same reason — adapter not on its base).
+
+For `browser_use`, the eventual wiring (per the cross-pollination
+audit) will be:
+
+* **Episodic** — page navigation events (`url`, `action`, `selector`)
+  per turn.
+* **Procedural** — recurring `(prev_action, current_action)` patterns
+  (e.g. `"click[search]"` → `"type[query]"` → `"click[submit]"`).
+* **Semantic** — long-lived page-content cache keyed by URL or DOM
+  hash, so a re-visit can short-circuit page reload during replay.
+
+## Audit hooks
+
+* **Construction failures** — `MemoryRecorder.__init__` raises with a
+  message naming the missing field (`"non-empty org_id"`,
+  `"bounded buffer sizes"`).
+* **Cross-tenant restore** — raises with the explicit
+  `"Cross-tenant restore is prohibited (CLAUDE.md multi-tenancy)"`
+  message.
+* **Tampered snapshots** — raises with
+  `"snapshot content_hash mismatch"` and includes the recorded vs
+  recomputed hashes for triage.
+* **Best-effort recording failures** — logged at DEBUG via
+  `BaseAdapter.record_memory_turn` with `exc_info=True` so the failing
+  call site is preserved without escalating.
+
+## Replay engine integration
+
+A replay flow looks like:
+
+```python
+# Original run captures both events and memory.
+adapter = AgnoAdapter(stratix=client, org_id="tenant-A")
+# ... agent runs, on_run_end fires record_memory_turn() ...
+trace = adapter.serialize_for_replay()
+trace.metadata["memory_snapshot"]  # serialised MemorySnapshot dict.
+
+# Replay reconstructs the recorder before re-execution.
+replay_adapter = AgnoAdapter(stratix=client, org_id="tenant-A")
+snapshot = MemorySnapshot.from_dict(trace.metadata["memory_snapshot"])
+replay_adapter.memory_recorder.restore(snapshot)
+# Re-run the agent — it sees the original recall state.
+```
+
+The next-turn snapshot taken from `replay_adapter` will match the
+original (modulo the wall-clock `timestamp_ns` of the new turn — see
+contract item 6). This is what makes memory persistence "replay-safe":
+the replay engine can drive an adapter through the same agent state
+the original run reached.
diff --git a/docs/adapters/multi-tenancy.md b/docs/adapters/multi-tenancy.md
@@ -0,0 +1,148 @@
+# Multi-tenancy contract for adapters
+
+LayerLens is a multi-tenant SaaS platform. Every event emitted by an
+adapter MUST be tagged with the originating tenant's `org_id`. Cache
+keys, queue topics, ingest streams, RLS policies, and downstream
+attestation chains all read this field to scope data to a single
+tenant.
+
+This document defines the binding contract that every framework /
+protocol / provider adapter must satisfy. It is enforced at runtime by
+`BaseAdapter.__init__` (fail-fast) and at CI time by the test suite at
+`tests/instrument/adapters/_base/test_org_id_propagation.py` plus the
+parametrized `tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.
+
+## The contract
+
+1. **Every adapter is bound to exactly one tenant at construction.**
+   The tenant binding (`org_id`) is stored as `self._org_id` and
+   exposed as the read-only property `adapter.org_id`. The bound value
+   is a non-empty string — there is no null sentinel, no empty
+   fallback, no `"default"` placeholder.
+
+2. **Construction without a resolvable `org_id` raises.** Resolution
+   order at `__init__`:
+
+   1. Explicit `org_id=...` keyword to the adapter constructor.
+   2. `stratix.org_id` attribute on the attached client (if not blank).
+   3. `stratix.organization_id` attribute on the attached client — the
+      public `layerlens.Stratix` client uses this name (if not blank).
+
+   If none of the three resolve to a non-empty string,
+   `BaseAdapter.__init__` raises `ValueError`. This is a fail-fast.
+   Callers cannot opt out, suppress, or work around it. There is no
+   silent fallback. A blank `org_id` is rejected with the same error
+   as an absent one.
+
+3. **Every emission is stamped.** Both `emit_event` (typed payload)
+   and `emit_dict_event` (dict payload) call `BaseAdapter._stamp_org_id`
+   before forwarding to the client. The bound `self._org_id` is
+   written to the payload's `org_id` field unconditionally — any
+   caller-supplied value (including a wrong tenant's id) is
+   overwritten. The adapter binding is the source of truth.
+
+4. **Every trace record carries `org_id`.** The replay event records
+   stored in `self._trace_events` include `org_id` at the envelope
+   level *and* inside the payload dict, so replay round-trips and
+   downstream re-ingest preserve the binding.
+
+5. **Every sink dispatch carries `org_id`.** The `EventSink.send`
+   ABC requires the keyword: `send(event_type, payload, timestamp_ns,
+   *, org_id: str)`. Sinks that omit it are flagged at the type-check
+   layer (mypy `--strict`). The `IngestionPipelineSink` uses the
+   per-event `org_id` as the `tenant_id` for downstream ingest.
+
+## Wiring a new adapter
+
+Subclasses of `BaseAdapter` (and `BaseProtocolAdapter` /
+`LLMProviderAdapter`) get the contract for free **as long as their
+`__init__` forwards `org_id` to `super().__init__`**. The canonical
+shape:
+
+```python
+class MyAdapter(BaseAdapter):
+    FRAMEWORK = "my_framework"
+    VERSION = "0.1.0"
+
+    def __init__(
+        self,
+        stratix: Any | None = None,
+        capture_config: CaptureConfig | None = None,
+        # framework-specific args here ...
+        *,
+        org_id: str | None = None,
+    ) -> None:
+        super().__init__(
+            stratix=stratix,
+            capture_config=capture_config,
+            org_id=org_id,
+        )
+        # adapter-specific state ...
+```
+
+Note the keyword-only `*` separator for `org_id`. The rest of
+`__init__` is unchanged from the pre-multi-tenancy era.
+
+Adapter helper functions (the `instrument_*` convenience exports in
+each adapter's `__init__.py`) should also accept and forward `org_id`:
+
+```python
+def instrument_agent(
+    agent: Any,
+    stratix: Any = None,
+    capture_config: dict[str, Any] | None = None,
+    org_id: str | None = None,
+) -> MyAdapter:
+    adapter = MyAdapter(
+        stratix=stratix,
+        capture_config=capture_config,
+        org_id=org_id,
+    )
+    adapter.connect()
+    adapter.instrument_agent(agent)
+    return adapter
+```
+
+## Test obligations
+
+Every new framework adapter must:
+
+1. Have its class added to `_all_adapter_classes()` in
+   `tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.
+   The two parametrized tests there assert (a) the adapter accepts
+   `org_id` and exposes the bound value via the property, and (b) the
+   adapter raises without an `org_id`.
+2. If the adapter ships its own dedicated test file, every test that
+   constructs the adapter must pass `org_id` (typically via the
+   shared `_RecordingStratix` test stand-in, which carries
+   `org_id = "test-org"` as a class attribute).
+3. The cross-tenant isolation guarantee is covered centrally in
+   `tests/instrument/adapters/_base/test_org_id_propagation.py`. New
+   adapters do not need to re-prove cross-tenant isolation if they
+   route emissions through the standard `BaseAdapter` path; they MUST
+   add a per-adapter cross-tenant test if they bypass the base path.
+
+## What changed (April 2026)
+
+Prior to this change, all adapter emissions in the stratix-python SDK
+shipped without `org_id` propagation. The 2026-04-25 audit
+(`A:/tmp/adapter-depth-audit.md`, cross-cutting finding #3) flagged
+this as a CLAUDE.md violation. The fix:
+
+- `BaseAdapter.__init__` now requires a resolvable `org_id` and
+  stores it on the instance.
+- `emit_event` and `emit_dict_event` stamp `org_id` into every
+  payload before forwarding to the client.
+- `EventSink.send` now requires the `org_id` keyword.
+- Every shipped adapter (17 framework + protocol + provider) was
+  updated to thread `org_id` through to `super().__init__`.
+
+## References
+
+- CLAUDE.md, "Multi-Tenancy" section — the platform-wide mandate.
+- `A:/tmp/adapter-depth-audit.md` — the audit that surfaced the gap.
+- `src/layerlens/instrument/adapters/_base/adapter.py` — `_resolve_org_id`,
+  `BaseAdapter.__init__`, `_stamp_org_id`, `emit_event`,
+  `emit_dict_event`, `_post_emit_success`.
+- `src/layerlens/instrument/adapters/_base/sinks.py` — `EventSink`
+  ABC, `TraceStoreSink.send`, `IngestionPipelineSink.send`.
diff --git a/docs/samples-guide.md b/docs/samples-guide.md
@@ -92,11 +92,13 @@ See the [MCP README](../samples/mcp/README.md) for setup instructions.
 
 Located in [`samples/copilotkit/`](../samples/copilotkit/). Full-stack integration with CopilotKit using LangGraph CoAgents and generative UI card components.
 
-- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows
+- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows (human-in-the-loop judge confirmation via `interrupt()`)
 - [`agents/investigator_agent.py`](../samples/copilotkit/agents/investigator_agent.py) -- LangGraph CoAgent for trace investigation
 - [`components/*.tsx`](../samples/copilotkit/components/) -- React card components for rendering results
 - [`hooks/*.ts`](../samples/copilotkit/hooks/) -- CopilotKit hooks for wiring LayerLens actions
 
+> **Checkpointer note:** Any LangGraph CoAgent that calls `interrupt()` (such as `evaluator_agent.py`) **must** be compiled with a checkpointer. Without one, the AG-UI stream ends without emitting `RUN_FINISHED` and CopilotKit blocks all subsequent messages. The sample ships with `InMemorySaver` for a zero-setup local run and documents Postgres / SQLite / Redis / LangGraph Platform alternatives for production in its [README](../samples/copilotkit/README.md#human-in-the-loop-checkpointers).
+
 See the [CopilotKit README](../samples/copilotkit/README.md) for the full list.
 
 ### Claude Code Skills (6 skills)