Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
a2e9fd6
fix(copilotkit): compile evaluator_graph with a checkpointer for inte…
mmercuri Apr 23, 2026
c40c03f
fix(copilotkit): register DTOs with serde, add AG-UI wire e2e test
mmercuri Apr 23, 2026
6421a9e
fix(copilotkit): resolve error/field collision + browser e2e harness
mmercuri Apr 23, 2026
542002b
fix(copilotkit): mirror CopilotKit's sample + workaround for ag-ui-la…
mmercuri Apr 24, 2026
e8d5f04
fix(copilotkit): also suppress duplicate RUN_STARTED on re-entry inte…
mmercuri Apr 24, 2026
796d562
fix(copilotkit): re-pin to latest published versions (DevRel env) and…
mmercuri Apr 24, 2026
fb43742
fix(copilotkit): wire useLangGraphInterrupt on frontend; drop backend…
mmercuri Apr 24, 2026
45fa80f
feat(copilotkit): rearchitect evaluator as create_agent + frontend HI…
mmercuri Apr 24, 2026
f8dd7db
test(copilotkit): add live-LLM e2e + OpenAI-compatible endpoint support
mmercuri Apr 25, 2026
a1eebe7
fix(copilotkit harness): unblock React hydration on Next 16 dev server
mmercuri Apr 25, 2026
268a273
fix(harness picker): resilient render while LLM streams tool args
mmercuri Apr 25, 2026
c953aee
feat(harness): per-tool render cards + Tailwind 4 + production Evalua…
mmercuri Apr 25, 2026
55b1f63
copilotkit sample: ship customer-facing app with shadcn-aligned cards
mmercuri Apr 26, 2026
bf1d20e
copilotkit sample: chat tokens, severity pills, trace status, markdow…
mmercuri Apr 26, 2026
05516be
copilotkit sample: render pending verdicts as skeleton cards
mmercuri Apr 26, 2026
a72a3ec
copilotkit sample: extract MarkdownLite, render judge goals
mmercuri Apr 26, 2026
2c7b60b
copilotkit sample: rewrite README to match the shipped architecture
mmercuri Apr 26, 2026
b0aa842
ci: replace walrus + reformat test_samples_e2e for ruff py37 target
mmercuri Apr 26, 2026
1a30622
copilotkit sample: drop OpenRouter-specific references
mmercuri Apr 26, 2026
d8601e3
ci: wrap server.py uvicorn boot in main() + extend test mocks
mmercuri Apr 26, 2026
569abdb
copilotkit sample: stop polling on any terminal eval state, not just …
mmercuri Apr 26, 2026
7228526
feat(assistant): add Stratix Assistant SDK resource
mmercuri Apr 26, 2026
e12398f
fix(instrument): propagate org_id through all event emissions
mmercuri Apr 27, 2026
9d54477
feat(instrument): Memory persistence (episodic/procedural/semantic) f…
mmercuri May 12, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
194 changes: 194 additions & 0 deletions docs/adapters/memory-contract.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
# Memory persistence contract for adapters

LayerLens adapters carry per-conversation, per-agent recall — episodic,
procedural, and semantic memory — alongside the trace events they emit.
This module ports the ad-hoc memory plumbing that the four mature
framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry
on the `ateam` monorepo into a shared, replay-safe primitive that any
adapter on the `stratix-python` SDK can plug into. Without this
plumbing, lighter adapters behave as "goldfish agents" — every run
starts from a blank slate, which is the difference between a usable
production agent and a demo.

This document defines the binding contract every adapter that integrates
the recorder must satisfy. It is enforced at runtime by
`MemoryRecorder.__init__` (fail-fast on missing tenant), by
`MemoryRecorder.restore` (cross-tenant rejection + content-hash
integrity check), and at CI time by
`tests/instrument/adapters/_base/test_memory.py`.

Cross-pollination audit reference:
[`A:/tmp/adapter-cross-pollination-audit.md`](../../../tmp/adapter-cross-pollination-audit.md)
§2 item #1.

## The three buckets

The memory model is the canonical agent-memory split that appears
across the literature (LangChain memory module; CrewAI procedural
memory; AutoGen episodic/semantic split):

| Bucket | Lifetime | Bounded by | Eviction |
|----------------|-----------------------|------------------------|------------------------|
| **Episodic** | per-turn | `max_episodic` (200) | FIFO (oldest dropped) |
| **Procedural** | recurring patterns | `max_procedural` (16) | least-frequent + ties broken by oldest `last_seen_turn` |
| **Semantic** | long-lived facts | `max_semantic` (64) | least-recently-set |

* **Episodic** — per-turn `(input, output, error?, tools?, extra?)`
records, ordered by `turn_index`. The detector for procedural
patterns reads this stream.
* **Procedural** — derived: each entry is
`{"pattern": [[prev_turn_tools], [cur_turn_tools]], "count": int,
"last_seen_turn": int}`. Detected automatically from the recent
episodic window every time `record_turn` is called.
* **Semantic** — caller-controlled key/value store
(`set_semantic(key, value)`). Both keys and values are stringified.
Callers wanting structured semantic data should JSON-encode their
value.

All three buckets are **bounded** (CLAUDE.md "every cache must be
bounded"). The defaults are conservative; callers wanting a different
size construct the recorder with explicit `max_*` kwargs.

## The contract

1. **Every adapter owns exactly one `MemoryRecorder`.** Constructed in
`BaseAdapter.__init__` and exposed via `adapter.memory_recorder`
(read-only property). The recorder is bound to the same `org_id` as
the adapter — multi-tenancy is propagated.

2. **Construction without a tenant raises.** `MemoryRecorder(org_id="")`
raises `ValueError`. There is no "default" tenant, no blank
fallback. `BaseAdapter.__init__` already fails fast on missing
`org_id` (see [multi-tenancy.md](multi-tenancy.md)) so the recorder
inherits that guarantee.

3. **Every recorded turn is bounded.** A single oversized turn cannot
blow past the bucket caps: per-field strings longer than 8 KB are
truncated to a deterministic suffix (`<...truncated:orig_len=N>`).
The truncation is defence-in-depth, not a substitute for the
adapter-level truncation policy (cross-poll #3). Detection of
recurring tool patterns runs in O(window) per turn.

4. **Cross-tenant restore is prohibited.** `restore(snapshot)` raises
`ValueError` if `snapshot.org_id != recorder.org_id`. This mirrors
the `BaseAdapter.org_id` contract — a tenant-A snapshot cannot land
in a tenant-B recorder, even if both happen to share a process.

5. **Snapshots are tamper-evident.** `restore(snapshot)` recomputes the
SHA-256 content hash and rejects the snapshot if the recorded hash
does not match. Guards against accidentally-mutated dicts in
transit and against forged snapshots reconstructed without the
`MemorySnapshot` factory.

6. **Snapshots are replay-safe.** The round-trip
`snapshot() → restore() → snapshot()` produces a snapshot with the
identical `content_hash` (deterministic reconstruction). This is the
foundation of replay-safe memory: the replay engine restores the
recorder, then the adapter re-runs the agent and produces the same
next-turn snapshot. The `record_turn` method stamps a wall-clock
`timestamp_ns` into the new turn — replay engines suppress this
drift by capturing the original `timestamp_ns` from the source
trace and seeding the recorder's clock at restore time.

7. **Snapshot serialisation is dict-shaped.** `snapshot.to_dict()`
returns a JSON-serialisable dict; `MemorySnapshot.from_dict(data)`
round-trips. Adapters embed the dict under
`ReplayableTrace.metadata["memory_snapshot"]` so the replay engine
can reconstruct via `MemoryRecorder.restore(MemorySnapshot.from_dict(...))`.

8. **Recording is best-effort.** `BaseAdapter.record_memory_turn`
catches and logs all exceptions at DEBUG. A failure inside the
recorder MUST NOT propagate into the host framework's call stack —
tracing never breaks user code (CLAUDE.md). The trade-off is that a
recorder bug shows up as missing memory in the replay rather than a
crash in production.

9. **Thread-safe.** `record_turn`, `set_semantic`, `clear`, and
`restore` are all guarded by an internal lock. Many concurrent
`record_turn` calls produce a snapshot whose `episodic` indices
form an unbroken `1..N` sequence.

## Wiring at the lifecycle hook

Every adapter wires `record_memory_turn(...)` into its **agent-output
boundary** — the point at which the framework reports a completed
agent step / chat turn / invocation. The exact hook varies by
framework:

| Adapter | Hook | Episodic input | Tool list |
|---------------------|---------------------------------------------------------------------|---------------------------|--------------------------------------------------------------|
| `agno` | `Agent.run` / `arun` finally-block | `args[0] / kwargs["input"]`| `_collect_tool_names(agent, result)` from `result.messages` |
| `ms_agent_framework`| `Chat.invoke` / `invoke_stream` finally-block | `kwargs["input"]/["message"]`| `_collect_tool_names_from_messages(seen)` from streamed items|
| `openai_agents` | `_on_agent_span_end` (TraceProcessor) + `on_run_end` (Runner wrap) | cached at `_on_agent_span_start` per `span_id` | rolled up from `_on_function_span_end` per `parent_id` |
| `llama_index` | `_on_agent_step_end` | cached at `_on_agent_step_start` per thread id| rolled up from `_on_tool_call` per thread id |
| `google_adk` | `after_agent_callback` + `on_agent_end` | cached at `before_agent_callback` per thread id| rolled up from `after_tool_callback` per thread id |
| `bedrock_agents` | `_after_invoke_agent` (boto3 hook) | cached at `_before_invoke_agent` per thread id| rolled up from `_process_trace` action-group / KB step names |

Each adapter also embeds its memory snapshot in `serialize_for_replay`
output via `ReplayableTrace.metadata["memory_snapshot"] =
self.memory_snapshot_dict()` — so a downstream replay engine can
reconstruct the full episodic + procedural + semantic state before
re-execution.

## Honest scope disclosure (target adapter coverage)

The cross-pollination audit §2 item #1 enumerates **seven** target
adapters: `agno`, `ms_agent_framework`, `openai_agents`, `llama_index`,
`google_adk`, `bedrock_agents`, **`browser_use`**.

Six are wired in this PR. The seventh — `browser_use` — does not exist
on this branch's base (`feat/instrument-multitenancy-org-id-propagation`);
it lives on the parallel `feat/instrument-frameworks-browser-use-full`
history. It will be wired when that adapter is ported to this base or
when the histories merge. This follows the same honest-disclosure
pattern as PR #120 (state filters, which omitted `ms_agent_framework`
for the same reason — adapter not on its base).

For `browser_use`, the eventual wiring (per the cross-pollination
audit) will be:

* **Episodic** — page navigation events (`url`, `action`, `selector`)
per turn.
* **Procedural** — recurring `(prev_action, current_action)` patterns
(e.g. `"click[search]"` → `"type[query]"` → `"click[submit]"`).
* **Semantic** — long-lived page-content cache keyed by URL or DOM
hash, so a re-visit can short-circuit page reload during replay.

## Audit hooks

* **Construction failures** — `MemoryRecorder.__init__` raises with a
message naming the missing field (`"non-empty org_id"`,
`"bounded buffer sizes"`).
* **Cross-tenant restore** — raises with the explicit
`"Cross-tenant restore is prohibited (CLAUDE.md multi-tenancy)"`
message.
* **Tampered snapshots** — raises with
`"snapshot content_hash mismatch"` and includes the recorded vs
recomputed hashes for triage.
* **Best-effort recording failures** — logged at DEBUG via
`BaseAdapter.record_memory_turn` with `exc_info=True` so the failing
call site is preserved without escalating.

## Replay engine integration

A replay flow looks like:

```python
# Original run captures both events and memory.
adapter = AgnoAdapter(stratix=client, org_id="tenant-A")
# ... agent runs, on_run_end fires record_memory_turn() ...
trace = adapter.serialize_for_replay()
trace.metadata["memory_snapshot"] # serialised MemorySnapshot dict.

# Replay reconstructs the recorder before re-execution.
replay_adapter = AgnoAdapter(stratix=client, org_id="tenant-A")
snapshot = MemorySnapshot.from_dict(trace.metadata["memory_snapshot"])
replay_adapter.memory_recorder.restore(snapshot)
# Re-run the agent — it sees the original recall state.
```

The next-turn snapshot taken from `replay_adapter` will match the
original (modulo the wall-clock `timestamp_ns` of the new turn — see
contract item 6). This is what makes memory persistence "replay-safe":
the replay engine can drive an adapter through the same agent state
the original run reached.
148 changes: 148 additions & 0 deletions docs/adapters/multi-tenancy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
# Multi-tenancy contract for adapters

LayerLens is a multi-tenant SaaS platform. Every event emitted by an
adapter MUST be tagged with the originating tenant's `org_id`. Cache
keys, queue topics, ingest streams, RLS policies, and downstream
attestation chains all read this field to scope data to a single
tenant.

This document defines the binding contract that every framework /
protocol / provider adapter must satisfy. It is enforced at runtime by
`BaseAdapter.__init__` (fail-fast) and at CI time by the test suite at
`tests/instrument/adapters/_base/test_org_id_propagation.py` plus the
parametrized `tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.

## The contract

1. **Every adapter is bound to exactly one tenant at construction.**
The tenant binding (`org_id`) is stored as `self._org_id` and
exposed as the read-only property `adapter.org_id`. The bound value
is a non-empty string — there is no null sentinel, no empty
fallback, no `"default"` placeholder.

2. **Construction without a resolvable `org_id` raises.** Resolution
order at `__init__`:

1. Explicit `org_id=...` keyword to the adapter constructor.
2. `stratix.org_id` attribute on the attached client (if not blank).
3. `stratix.organization_id` attribute on the attached client — the
public `layerlens.Stratix` client uses this name (if not blank).

If none of the three resolve to a non-empty string,
`BaseAdapter.__init__` raises `ValueError`. This is a fail-fast.
Callers cannot opt out, suppress, or work around it. There is no
silent fallback. A blank `org_id` is rejected with the same error
as an absent one.

3. **Every emission is stamped.** Both `emit_event` (typed payload)
and `emit_dict_event` (dict payload) call `BaseAdapter._stamp_org_id`
before forwarding to the client. The bound `self._org_id` is
written to the payload's `org_id` field unconditionally — any
caller-supplied value (including a wrong tenant's id) is
overwritten. The adapter binding is the source of truth.

4. **Every trace record carries `org_id`.** The replay event records
stored in `self._trace_events` include `org_id` at the envelope
level *and* inside the payload dict, so replay round-trips and
downstream re-ingest preserve the binding.

5. **Every sink dispatch carries `org_id`.** The `EventSink.send`
ABC requires the keyword: `send(event_type, payload, timestamp_ns,
*, org_id: str)`. Sinks that omit it are flagged at the type-check
layer (mypy `--strict`). The `IngestionPipelineSink` uses the
per-event `org_id` as the `tenant_id` for downstream ingest.

## Wiring a new adapter

Subclasses of `BaseAdapter` (and `BaseProtocolAdapter` /
`LLMProviderAdapter`) get the contract for free **as long as their
`__init__` forwards `org_id` to `super().__init__`**. The canonical
shape:

```python
class MyAdapter(BaseAdapter):
FRAMEWORK = "my_framework"
VERSION = "0.1.0"

def __init__(
self,
stratix: Any | None = None,
capture_config: CaptureConfig | None = None,
# framework-specific args here ...
*,
org_id: str | None = None,
) -> None:
super().__init__(
stratix=stratix,
capture_config=capture_config,
org_id=org_id,
)
# adapter-specific state ...
```

Note the keyword-only `*` separator for `org_id`. The rest of
`__init__` is unchanged from the pre-multi-tenancy era.

Adapter helper functions (the `instrument_*` convenience exports in
each adapter's `__init__.py`) should also accept and forward `org_id`:

```python
def instrument_agent(
agent: Any,
stratix: Any = None,
capture_config: dict[str, Any] | None = None,
org_id: str | None = None,
) -> MyAdapter:
adapter = MyAdapter(
stratix=stratix,
capture_config=capture_config,
org_id=org_id,
)
adapter.connect()
adapter.instrument_agent(agent)
return adapter
```

## Test obligations

Every new framework adapter must:

1. Have its class added to `_all_adapter_classes()` in
`tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.
The two parametrized tests there assert (a) the adapter accepts
`org_id` and exposes the bound value via the property, and (b) the
adapter raises without an `org_id`.
2. If the adapter ships its own dedicated test file, every test that
constructs the adapter must pass `org_id` (typically via the
shared `_RecordingStratix` test stand-in, which carries
`org_id = "test-org"` as a class attribute).
3. The cross-tenant isolation guarantee is covered centrally in
`tests/instrument/adapters/_base/test_org_id_propagation.py`. New
adapters do not need to re-prove cross-tenant isolation if they
route emissions through the standard `BaseAdapter` path; they MUST
add a per-adapter cross-tenant test if they bypass the base path.

## What changed (April 2026)

Prior to this change, all adapter emissions in the stratix-python SDK
shipped without `org_id` propagation. The 2026-04-25 audit
(`A:/tmp/adapter-depth-audit.md`, cross-cutting finding #3) flagged
this as a CLAUDE.md violation. The fix:

- `BaseAdapter.__init__` now requires a resolvable `org_id` and
stores it on the instance.
- `emit_event` and `emit_dict_event` stamp `org_id` into every
payload before forwarding to the client.
- `EventSink.send` now requires the `org_id` keyword.
- Every shipped adapter (17 framework + protocol + provider) was
updated to thread `org_id` through to `super().__init__`.

## References

- CLAUDE.md, "Multi-Tenancy" section — the platform-wide mandate.
- `A:/tmp/adapter-depth-audit.md` — the audit that surfaced the gap.
- `src/layerlens/instrument/adapters/_base/adapter.py` — `_resolve_org_id`,
`BaseAdapter.__init__`, `_stamp_org_id`, `emit_event`,
`emit_dict_event`, `_post_emit_success`.
- `src/layerlens/instrument/adapters/_base/sinks.py` — `EventSink`
ABC, `TraceStoreSink.send`, `IngestionPipelineSink.send`.
4 changes: 3 additions & 1 deletion docs/samples-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,11 +92,13 @@ See the [MCP README](../samples/mcp/README.md) for setup instructions.

Located in [`samples/copilotkit/`](../samples/copilotkit/). Full-stack integration with CopilotKit using LangGraph CoAgents and generative UI card components.

- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows
- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows (human-in-the-loop judge confirmation via `interrupt()`)
- [`agents/investigator_agent.py`](../samples/copilotkit/agents/investigator_agent.py) -- LangGraph CoAgent for trace investigation
- [`components/*.tsx`](../samples/copilotkit/components/) -- React card components for rendering results
- [`hooks/*.ts`](../samples/copilotkit/hooks/) -- CopilotKit hooks for wiring LayerLens actions

> **Checkpointer note:** Any LangGraph CoAgent that calls `interrupt()` (such as `evaluator_agent.py`) **must** be compiled with a checkpointer. Without one, the AG-UI stream ends without emitting `RUN_FINISHED` and CopilotKit blocks all subsequent messages. The sample ships with `InMemorySaver` for a zero-setup local run and documents Postgres / SQLite / Redis / LangGraph Platform alternatives for production in its [README](../samples/copilotkit/README.md#human-in-the-loop-checkpointers).

See the [CopilotKit README](../samples/copilotkit/README.md) for the full list.

### Claude Code Skills (6 skills)
Expand Down
Loading