Closed
24 commits
a2e9fd6
fix(copilotkit): compile evaluator_graph with a checkpointer for inte…
mmercuri Apr 23, 2026
c40c03f
fix(copilotkit): register DTOs with serde, add AG-UI wire e2e test
mmercuri Apr 23, 2026
6421a9e
fix(copilotkit): resolve error/field collision + browser e2e harness
mmercuri Apr 23, 2026
542002b
fix(copilotkit): mirror CopilotKit's sample + workaround for ag-ui-la…
mmercuri Apr 24, 2026
e8d5f04
fix(copilotkit): also suppress duplicate RUN_STARTED on re-entry inte…
mmercuri Apr 24, 2026
796d562
fix(copilotkit): re-pin to latest published versions (DevRel env) and…
mmercuri Apr 24, 2026
fb43742
fix(copilotkit): wire useLangGraphInterrupt on frontend; drop backend…
mmercuri Apr 24, 2026
45fa80f
feat(copilotkit): rearchitect evaluator as create_agent + frontend HI…
mmercuri Apr 24, 2026
f8dd7db
test(copilotkit): add live-LLM e2e + OpenAI-compatible endpoint support
mmercuri Apr 25, 2026
a1eebe7
fix(copilotkit harness): unblock React hydration on Next 16 dev server
mmercuri Apr 25, 2026
268a273
fix(harness picker): resilient render while LLM streams tool args
mmercuri Apr 25, 2026
c953aee
feat(harness): per-tool render cards + Tailwind 4 + production Evalua…
mmercuri Apr 25, 2026
55b1f63
copilotkit sample: ship customer-facing app with shadcn-aligned cards
mmercuri Apr 26, 2026
bf1d20e
copilotkit sample: chat tokens, severity pills, trace status, markdow…
mmercuri Apr 26, 2026
05516be
copilotkit sample: render pending verdicts as skeleton cards
mmercuri Apr 26, 2026
a72a3ec
copilotkit sample: extract MarkdownLite, render judge goals
mmercuri Apr 26, 2026
2c7b60b
copilotkit sample: rewrite README to match the shipped architecture
mmercuri Apr 26, 2026
b0aa842
ci: replace walrus + reformat test_samples_e2e for ruff py37 target
mmercuri Apr 26, 2026
1a30622
copilotkit sample: drop OpenRouter-specific references
mmercuri Apr 26, 2026
d8601e3
ci: wrap server.py uvicorn boot in main() + extend test mocks
mmercuri Apr 26, 2026
569abdb
copilotkit sample: stop polling on any terminal eval state, not just …
mmercuri Apr 26, 2026
7228526
feat(assistant): add Stratix Assistant SDK resource
mmercuri Apr 26, 2026
e12398f
fix(instrument): propagate org_id through all event emissions
mmercuri Apr 27, 2026
c8c8422
feat(instrument): factory-based replay re-execution for 8 lighter ada…
mmercuri Apr 27, 2026
148 changes: 148 additions & 0 deletions docs/adapters/multi-tenancy.md
@@ -0,0 +1,148 @@
# Multi-tenancy contract for adapters

LayerLens is a multi-tenant SaaS platform. Every event emitted by an
adapter MUST be tagged with the originating tenant's `org_id`. Cache
keys, queue topics, ingest streams, RLS policies, and downstream
attestation chains all read this field to scope data to a single
tenant.

This document defines the binding contract that every framework /
protocol / provider adapter must satisfy. It is enforced at runtime by
`BaseAdapter.__init__` (fail-fast) and at CI time by the test suite at
`tests/instrument/adapters/_base/test_org_id_propagation.py` plus the
parametrized `tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.

## The contract

1. **Every adapter is bound to exactly one tenant at construction.**
   The tenant binding (`org_id`) is stored as `self._org_id` and
   exposed as the read-only property `adapter.org_id`. The bound value
   is a non-empty string: there is no null sentinel, no empty
   fallback, no `"default"` placeholder.

2. **Construction without a resolvable `org_id` raises.** Resolution
   order at `__init__`:

   1. Explicit `org_id=...` keyword to the adapter constructor.
   2. `stratix.org_id` attribute on the attached client (if not blank).
   3. `stratix.organization_id` attribute on the attached client (if
      not blank); the public `layerlens.Stratix` client uses this name.

   If none of the three resolve to a non-empty string,
   `BaseAdapter.__init__` raises `ValueError`. The check is fail-fast:
   callers cannot opt out, suppress, or work around it. There is no
   silent fallback. A blank `org_id` is rejected with the same error
   as an absent one.

3. **Every emission is stamped.** Both `emit_event` (typed payload)
   and `emit_dict_event` (dict payload) call `BaseAdapter._stamp_org_id`
   before forwarding to the client. The bound `self._org_id` is
   written to the payload's `org_id` field unconditionally; any
   caller-supplied value (including a wrong tenant's id) is
   overwritten. The adapter binding is the source of truth.

4. **Every trace record carries `org_id`.** The replay event records
   stored in `self._trace_events` include `org_id` at the envelope
   level *and* inside the payload dict, so replay round-trips and
   downstream re-ingest preserve the binding.

5. **Every sink dispatch carries `org_id`.** The `EventSink.send`
   ABC requires the keyword:
   `send(event_type, payload, timestamp_ns, *, org_id: str)`.
   Sinks that omit it are flagged at the type-check layer
   (mypy `--strict`). The `IngestionPipelineSink` uses the
   per-event `org_id` as the `tenant_id` for downstream ingest.
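
The resolution order (item 2) and the unconditional stamping (item 3) can be sketched in a few lines. This is an illustration under stated assumptions, not the SDK source: `resolve_org_id`, `stamp_org_id`, and `FakeClient` are hypothetical names, the real logic lives in `BaseAdapter._resolve_org_id` and `BaseAdapter._stamp_org_id`, and the sketch assumes that whitespace-only values count as blank and that stamping returns a copy of the payload.

```python
from __future__ import annotations

from typing import Any


def _non_blank(value: Any) -> str | None:
    """Return value if it is a non-blank string, else None (assumed blank rule)."""
    if isinstance(value, str) and value.strip():
        return value
    return None


def resolve_org_id(explicit: str | None, client: Any | None) -> str:
    """Hypothetical sketch of the three-step resolution order."""
    candidates = (
        explicit,                                  # 1. explicit org_id=... keyword
        getattr(client, "org_id", None),           # 2. client.org_id
        getattr(client, "organization_id", None),  # 3. client.organization_id
    )
    for candidate in candidates:
        resolved = _non_blank(candidate)
        if resolved is not None:
            return resolved
    # Absent and blank values fail the same way; there is no fallback.
    raise ValueError("adapter requires a non-empty org_id")


def stamp_org_id(payload: dict[str, Any], bound_org_id: str) -> dict[str, Any]:
    """Return a copy of payload whose org_id is the adapter's bound tenant."""
    stamped = dict(payload)
    stamped["org_id"] = bound_org_id  # any caller-supplied value is discarded
    return stamped


class FakeClient:
    """Stand-in for a client that only exposes organization_id."""

    organization_id = "tenant-acme"
```

With this shape, a payload that arrives carrying another tenant's id leaves the adapter carrying the bound id, which is the property the central cross-tenant test checks.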

## Wiring a new adapter

Subclasses of `BaseAdapter` (and `BaseProtocolAdapter` /
`LLMProviderAdapter`) get the contract for free **as long as their
`__init__` forwards `org_id` to `super().__init__`**. The canonical
shape:

```python
class MyAdapter(BaseAdapter):
    FRAMEWORK = "my_framework"
    VERSION = "0.1.0"

    def __init__(
        self,
        stratix: Any | None = None,
        capture_config: CaptureConfig | None = None,
        # framework-specific args here ...
        *,
        org_id: str | None = None,
    ) -> None:
        super().__init__(
            stratix=stratix,
            capture_config=capture_config,
            org_id=org_id,
        )
        # adapter-specific state ...
```

Note the keyword-only `*` separator for `org_id`. The rest of
`__init__` is unchanged from the pre-multi-tenancy era.

Adapter helper functions (the `instrument_*` convenience exports in
each adapter's `__init__.py`) should also accept and forward `org_id`:

```python
def instrument_agent(
    agent: Any,
    stratix: Any = None,
    capture_config: dict[str, Any] | None = None,
    org_id: str | None = None,
) -> MyAdapter:
    adapter = MyAdapter(
        stratix=stratix,
        capture_config=capture_config,
        org_id=org_id,
    )
    adapter.connect()
    adapter.instrument_agent(agent)
    return adapter
```

## Test obligations

Every new framework adapter must meet the following obligations:

1. Its class must be added to `_all_adapter_classes()` in
   `tests/instrument/adapters/frameworks/test_per_adapter_org_id.py`.
   The two parametrized tests there assert (a) the adapter accepts
   `org_id` and exposes the bound value via the property, and (b) the
   adapter raises without an `org_id`.
2. If the adapter ships its own dedicated test file, every test that
   constructs the adapter must pass `org_id` (typically via the
   shared `_RecordingStratix` test stand-in, which carries
   `org_id = "test-org"` as a class attribute).
3. Cross-tenant isolation is covered centrally in
   `tests/instrument/adapters/_base/test_org_id_propagation.py`. New
   adapters do not need to re-prove it if they route emissions
   through the standard `BaseAdapter` path; they MUST add a
   per-adapter cross-tenant test if they bypass the base path.

## What changed (April 2026)

Prior to this change, all adapter emissions in the stratix-python SDK
shipped without `org_id` propagation. The 2026-04-25 audit
(`A:/tmp/adapter-depth-audit.md`, cross-cutting finding #3) flagged
this as a CLAUDE.md violation. The fix:

- `BaseAdapter.__init__` now requires a resolvable `org_id` and
stores it on the instance.
- `emit_event` and `emit_dict_event` stamp `org_id` into every
payload before forwarding to the client.
- `EventSink.send` now requires the `org_id` keyword.
- Every shipped adapter (17 framework + protocol + provider) was
updated to thread `org_id` through to `super().__init__`.

## References

- CLAUDE.md, "Multi-Tenancy" section — the platform-wide mandate.
- `A:/tmp/adapter-depth-audit.md` — the audit that surfaced the gap.
- `src/layerlens/instrument/adapters/_base/adapter.py` — `_resolve_org_id`,
`BaseAdapter.__init__`, `_stamp_org_id`, `emit_event`,
`emit_dict_event`, `_post_emit_success`.
- `src/layerlens/instrument/adapters/_base/sinks.py` — `EventSink`
ABC, `TraceStoreSink.send`, `IngestionPipelineSink.send`.
218 changes: 218 additions & 0 deletions docs/adapters/replay-execution.md
@@ -0,0 +1,218 @@
# Adapter Replay Re-execution

This document covers the **factory-based replay** path on
LayerLens framework adapters. It implements cross-pollination audit
item §2.6, which brings the LangChain `execute_replay` pattern
to the eight lighter framework adapters.

It is a companion to:

* [`docs/adapters/multi-tenancy.md`](multi-tenancy.md): the tenant
  binding contract that `ReplayResult` propagates.
* `A:/tmp/adapter-cross-pollination-audit.md` §2.6: the audit row
  that scopes this lift.

## When to use which replay path

`BaseAdapter` exposes **two** replay entry points:

| Method | Caller | Inputs | Returns |
| -------------------------------- | --------------------- | ----------------------------------------------------- | ----------------------------- |
| `execute_replay()` | LayerLens replay engine | `(inputs, original_trace, request, replay_trace_id)` | A `SerializedTrace` |
| `execute_replay_via_factory()` | Adapter SDK / CI tests | `(trace: ReplayableTrace, agent_factory: Callable)` | A `ReplayResult` (this doc) |

The **engine** path stays untouched — that is the integration with
the platform replay service that owns trace storage and result routing.

The **factory** path is what this document covers. It is the
self-contained option you reach for when:

* You want to re-run a captured trace through a fresh agent
*inside the same Python process* (CI, integration tests, debugging).
* You want a divergence report rather than a single pass/fail.
* You want a uniform `ReplayResult` shape across every framework so
dashboards and `assert` lines do not need adapter-specific branches.

## The pieces

```text
┌─────────────────────────┐  builds   ┌──────────────────┐
│ ReplayExecutor          │──────────>│ ReplayResult     │
│ (shared)                │           │ - trace_id       │
│                         │           │ - source_trace_id│
│ + execute_replay()      │           │ - org_id         │
│                         │           │ - framework      │
│                         │           │ - outputs        │
│                         │           │ - captured_events│
│                         │           │ - divergences[]  │
│                         │           │ - duration_ns    │
│                         │           │ - execution_error│
└─────────────┬───────────┘           └──────────────────┘
              │ uses
              v
┌─────────────────────────┐
│ StubInjector            │ (optional, adapter-specific)
│ + build_patches()       │
└─────────────────────────┘
```

* `ReplayExecutor` lives at
`layerlens.instrument.adapters._base.replay`. It is intentionally
*narrow* — it does not know how to invoke a framework agent.
* Adapters provide `_invoke_for_replay(agent, inputs, trace)` to
invoke their framework's `run` / `arun` / `__call__` entry point.
* Adapters expose `execute_replay_via_factory(trace, agent_factory)`
as the public surface — typically a 2-line delegate to the
base helper `_replay_via_executor`.

## Eight wired adapters (cross-pollination audit §2.6)

| Adapter | `instrument_*` | Invocation |
| -------------------------- | --------------------- | ---------------------- |
| `agno` | `instrument_agent` | `arun` / `run` |
| `openai_agents` | `instrument_runner` | `Runner.run(agent, x)` |
| `llama_index` | `instrument_workflow` | `arun` / `run` |
| `google_adk` | `instrument_agent` | `run_async` / `run` |
| `strands` | `instrument_agent` | `__call__` / `invoke` |
| `pydantic_ai` | `instrument_agent` | `run` / `run_sync` |
| `smolagents` | `instrument_agent` | `run(task)` |
| `ms_agent_framework` | `instrument_chat` | `invoke(input=)` |

LangChain already had its own `execute_replay()` (the original audit
reference) and is not in this lift. Bedrock Agents, browser_use,
embedding, and langfuse are excluded by audit rationale (see §2.6 row
notes — "MAYBE — requires Bedrock-side state, harder",
"N/A — importer / single-agent / no agent concept").

## Honest divergence detection

Per CLAUDE.md ("Honest divergence detection — if replay can't
reproduce exactly, surface it"), `ReplayResult.divergences` is the
*authoritative* report of every event mismatch. The executor never
silently "passes" a divergent replay.

Five divergence kinds are surfaced:

| Kind | When it fires |
| --------------------- | -------------------------------------------------------------- |
| `MISSING_EVENT` | Original had an event at index N; replay's sequence is shorter |
| `EXTRA_EVENT` | Replay emitted an event the original did not contain |
| `EVENT_TYPE_MISMATCH` | Same position, different `event_type` |
| `PAYLOAD_MISMATCH` | Same `event_type`, different meaningful payload field |
| `EXECUTION_ERROR` | Framework raised before producing a comparable trace |

`PAYLOAD_MISMATCH` is deliberately conservative: it compares only
fields whose mismatch genuinely means the agent did something different
(`model`, `provider`, `tool_name`, `agent_name`, `from_agent`,
`to_agent`). Run-specific fields like `timestamp_ns`, `duration_ns`,
and `run_id` are *expected* to differ between runs and are not
flagged. Flagging them would make every replay "divergent" and
hide real regressions.

`is_exact` reports zero divergences. `succeeded` reports no execution
error. They are orthogonal:

| `succeeded` | `is_exact` | Outcome |
| ----------- | ---------- | -------------------------------------- |
| `True` | `True` | Perfect reproduction |
| `True` | `False` | Replay ran but diverged |
| `False` | (any) | Framework crashed during replay |
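
The divergence rules and the two flags can be made concrete with a small sketch. It is not the executor's implementation: `diff_events`, `Divergence`, and `ResultSketch` are hypothetical names, the enum values are assumed to equal the kind names, events are assumed to be dicts with `event_type` and `payload` keys, and the comparison is assumed to be purely positional.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any

# Fields whose mismatch means the agent behaved differently (listed in the text).
MEANINGFUL_FIELDS = (
    "model", "provider", "tool_name", "agent_name", "from_agent", "to_agent",
)


class DivergenceKind(Enum):
    MISSING_EVENT = "MISSING_EVENT"
    EXTRA_EVENT = "EXTRA_EVENT"
    EVENT_TYPE_MISMATCH = "EVENT_TYPE_MISMATCH"
    PAYLOAD_MISMATCH = "PAYLOAD_MISMATCH"
    EXECUTION_ERROR = "EXECUTION_ERROR"


@dataclass
class Divergence:
    kind: DivergenceKind
    index: int
    detail: str


@dataclass
class ResultSketch:
    """Hypothetical subset of ReplayResult: just enough for the two flags."""

    divergences: list[Divergence] = field(default_factory=list)
    execution_error: str | None = None

    @property
    def is_exact(self) -> bool:
        return not self.divergences

    @property
    def succeeded(self) -> bool:
        return self.execution_error is None


def diff_events(
    original: list[dict[str, Any]], replay: list[dict[str, Any]]
) -> list[Divergence]:
    """Compare two event sequences position by position."""
    found: list[Divergence] = []
    for i in range(max(len(original), len(replay))):
        if i >= len(replay):
            found.append(Divergence(DivergenceKind.MISSING_EVENT, i, "replay ended early"))
        elif i >= len(original):
            found.append(Divergence(DivergenceKind.EXTRA_EVENT, i, "not in original"))
        elif original[i]["event_type"] != replay[i]["event_type"]:
            detail = f"{original[i]['event_type']} != {replay[i]['event_type']}"
            found.append(Divergence(DivergenceKind.EVENT_TYPE_MISMATCH, i, detail))
        else:
            before = original[i].get("payload", {})
            after = replay[i].get("payload", {})
            # Timestamps, durations, and run ids are never inspected here.
            changed = [f for f in MEANINGFUL_FIELDS if before.get(f) != after.get(f)]
            if changed:
                found.append(
                    Divergence(DivergenceKind.PAYLOAD_MISMATCH, i, "fields: " + ", ".join(changed))
                )
    return found
```

Restricting the payload comparison to an allow-list, rather than excluding a deny-list of noisy fields, means a newly added timing field cannot start producing false divergences.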

## Multi-tenancy

`ReplayResult.org_id` carries the bound tenant from the originating
adapter, set by `BaseAdapter._resolve_org_id` at construction
(see `multi-tenancy.md`). Per-event records inside
`ReplayResult.captured_events` also carry `org_id` because they are
emitted through `_post_emit_success`, which always stamps the field.

A replay started on `adapter_a` (tenant A) cannot leak events into
`adapter_b`'s (tenant B) trace stream — the executor binds to the
adapter at construction and never crosses adapters mid-replay.

## Usage

### Minimal example (agno)

```python
# Run this inside an async context (an `async def` test or an asyncio REPL),
# because the replay call is awaited.
from layerlens.instrument.adapters.frameworks.agno import AgnoAdapter

adapter = AgnoAdapter(org_id="tenant-acme")
adapter.connect()

# 1. Capture an original run.
# build_my_agno_agent() stands in for your own agent constructor.
original_agent = build_my_agno_agent()
adapter.instrument_agent(original_agent)
original_agent.run("Plan a trip to Tokyo")

trace = adapter.serialize_for_replay()

# 2. Replay through a fresh agent built by a factory.
def factory():
    return build_my_agno_agent()  # fresh instance every replay

result = await adapter.execute_replay_via_factory(trace, factory)

assert result.org_id == "tenant-acme"
assert result.framework == "agno"
if not result.is_exact:
    for div in result.divergences:
        print(f"[{div.kind.value}] index={div.index} {div.detail}")
```

### Async factory

The factory may return either an agent instance or an awaitable
resolving to one — the executor inspects the return value and awaits
when needed:

```python
async def async_factory():
    config = await load_config_from_db()
    return AgnoAgent.from_config(config)

result = await adapter.execute_replay_via_factory(trace, async_factory)
```
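
One way an executor could accept both factory styles is to call the factory and await the result only when it is awaitable. The sketch below assumes that approach; `build_agent` is a hypothetical helper name, not part of the SDK, and the real executor may inspect the factory differently.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Union

AgentFactory = Callable[[], Union[Any, Awaitable[Any]]]


async def build_agent(agent_factory: AgentFactory) -> Any:
    """Call the factory, then await its result only if it is awaitable."""
    candidate = agent_factory()
    if inspect.isawaitable(candidate):
        candidate = await candidate
    return candidate


def sync_factory() -> str:
    return "sync-agent"  # placeholder for a real agent instance


async def async_factory() -> str:
    await asyncio.sleep(0)  # placeholder for async setup such as a DB read
    return "async-agent"
```

Checking the return value instead of the factory itself (for example with `inspect.iscoroutinefunction`) also covers plain functions that return a coroutine or a future.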

### Adapter-specific stub injection

For LLM-deterministic replay, supply a `StubInjector` that returns
patches the executor will apply for the duration of the replay run.
The base case (no stubs) works for fixture-based tests where the
agent itself is deterministic.

```python
from layerlens.instrument.adapters._base.replay import (
    ReplayExecutor,
    StubInjector,
)


class MyOpenAIStubs(StubInjector):
    def build_patches(self, adapter, trace):
        # Replace ChatCompletions.create with a deterministic fake.
        return [
            (
                "openai.resources.chat.completions.Completions.create",
                my_fake_create,
            ),
        ]


executor = ReplayExecutor(adapter, stub_injector=MyOpenAIStubs())
result = await executor.execute_replay(trace, factory)
```

Stub teardown is guaranteed even on framework error — patches are
unwound in a `finally` block so a failed replay leaves no global
monkey-patches behind.
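
The teardown guarantee can be illustrated with the standard library alone. The sketch below assumes each patch is a `(dotted_target, replacement)` tuple, as in the example above, and applies it with `unittest.mock.patch`; `run_with_patches` is a hypothetical name, and the real `ReplayExecutor` may apply and unwind patches differently.

```python
import logging
from contextlib import ExitStack
from typing import Any, Callable, List, Tuple
from unittest import mock

logger = logging.getLogger(__name__)

# (dotted target, replacement): the assumed shape of build_patches() entries.
Patch = Tuple[str, Any]


def run_with_patches(patches: List[Patch], run: Callable[[], Any]) -> Any:
    """Apply patches, call run(), and always unwind the patches afterwards."""
    stack = ExitStack()
    try:
        for target, replacement in patches:
            stack.enter_context(mock.patch(target, replacement))
        return run()
    finally:
        try:
            stack.close()  # restores every patched attribute, newest first
        except Exception:
            # A teardown failure is logged and must not mask run()'s outcome.
            logger.warning("stub teardown failed", exc_info=True)
```

Because the unwinding sits in `finally`, an exception raised by `run()` still propagates to the caller, but only after the original attributes are back in place.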

## Failure modes (what is *not* swallowed)

| Failure | Handling |
| ------------------------------ | ------------------------------------------------------------ |
| Agent raises mid-execution | Captured into `result.execution_error`; replay marked failed |
| Agent factory itself raises | Captured into `result.execution_error` |
| Stub teardown raises | Logged at WARNING; original execution outcome preserved |
| `org_id` cannot be resolved | `BaseAdapter.__init__` raises `ValueError` (fail-fast) |
| Adapter never overrode the method | `NotImplementedError` from `BaseAdapter.execute_replay_via_factory` |

The first two are intentional — collecting them as data lets a
replay-batch consumer aggregate partial results across many traces
without exception-handling boilerplate.
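
A batch consumer built on that property might look like the sketch below. `ReplayOutcome` is a hypothetical stand-in carrying only the fields used here (the real `ReplayResult` exposes `divergences` and `execution_error`), and `summarize` is an illustrative name rather than an SDK function.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Dict, List, Optional


@dataclass
class ReplayOutcome:
    """Hypothetical stand-in for the ReplayResult fields used below."""

    trace_id: str
    execution_error: Optional[str] = None
    divergence_count: int = 0


async def summarize(replays: List[Awaitable[ReplayOutcome]]) -> Dict[str, List[str]]:
    """Bucket replay outcomes; failures arrive as data, so no try/except."""
    buckets: Dict[str, List[str]] = {"exact": [], "diverged": [], "failed": []}
    for outcome in await asyncio.gather(*replays):
        if outcome.execution_error is not None:
            buckets["failed"].append(outcome.trace_id)
        elif outcome.divergence_count:
            buckets["diverged"].append(outcome.trace_id)
        else:
            buckets["exact"].append(outcome.trace_id)
    return buckets
```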
4 changes: 3 additions & 1 deletion docs/samples-guide.md
@@ -92,11 +92,13 @@ See the [MCP README](../samples/mcp/README.md) for setup instructions.

Located in [`samples/copilotkit/`](../samples/copilotkit/). Full-stack integration with CopilotKit using LangGraph CoAgents and generative UI card components.

- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows
- [`agents/evaluator_agent.py`](../samples/copilotkit/agents/evaluator_agent.py) -- LangGraph CoAgent for evaluation workflows (human-in-the-loop judge confirmation via `interrupt()`)
- [`agents/investigator_agent.py`](../samples/copilotkit/agents/investigator_agent.py) -- LangGraph CoAgent for trace investigation
- [`components/*.tsx`](../samples/copilotkit/components/) -- React card components for rendering results
- [`hooks/*.ts`](../samples/copilotkit/hooks/) -- CopilotKit hooks for wiring LayerLens actions

> **Checkpointer note:** Any LangGraph CoAgent that calls `interrupt()` (such as `evaluator_agent.py`) **must** be compiled with a checkpointer. Without one, the AG-UI stream ends without emitting `RUN_FINISHED` and CopilotKit blocks all subsequent messages. The sample ships with `InMemorySaver` for a zero-setup local run and documents Postgres / SQLite / Redis / LangGraph Platform alternatives for production in its [README](../samples/copilotkit/README.md#human-in-the-loop-checkpointers).

See the [CopilotKit README](../samples/copilotkit/README.md) for the full list.

### Claude Code Skills (6 skills)