feat(instrument): State include/exclude key filters for 6 multi-agent adapters (cross-poll #6) #120

Closed
mmercuri wants to merge 2 commits into development from feat/instrument-state-filters

@mmercuri
Contributor

Cross-pollination item #6 — state include/exclude key filters

Implements cross-pollination item #6 from A:/tmp/adapter-cross-pollination-audit.md §2.12. LangGraph's LangGraphStateAdapter (mature) supports include/exclude key filters at the state-snapshot level so customers can scrub sensitive state (api_keys, tokens, PII) WITHOUT modifying their agent code or doing post-hoc redaction. This PR brings the same contract to the lighter multi-agent framework adapters present on this base.

What's new

Shared filter module

src/layerlens/instrument/adapters/_base/state_filters.py — new

  • StateFilter — frozen dataclass with include_keys, exclude_keys, mask_keys, recursive. Three operations applied in order: exclude (drop), mask (replace value with [REDACTED]), include (allowlist). Matching is case-insensitive substring after alphanumeric-only normalisation, so X-Api-Key, USER_API_KEY, customer.email_address all match without enumerating every variant.
  • DEFAULT_PII_EXCLUDE_KEYS — conservative denylist covering 49 common credential / PII / financial / contact field names. Customers who do nothing still get baseline protection out of the box per CLAUDE.md ("never silently leak customer data").
  • default_state_filter() — factory installed by every adapter unless the caller passes a custom state_filter.
  • filter_state(state, filter) — pure function returning (filtered_state, filtered_keys) so adapters can surface the clipped key names as _filtered_keys event metadata for audit.
  • filter_payload_fields(payload, filter, fields) — surgical helper that filters only the named dict-shaped fields of a mixed-shape payload (so scalar metadata like model, latency_ms is preserved).
  • StateFilter.permissive() — opt-out factory for tests / explicit disablement. The active filter snapshot is surfaced under adapter_info().metadata['state_filter'] so operators can detect accidental disablement.
  • StateFilter.with_extra_excludes() — default + caller's additions.

FrameworkAdapter integration

src/layerlens/instrument/adapters/frameworks/_base_framework.py

  • Constructor accepts optional state_filter (defaults to default_state_filter()).
  • self._state_filter reachable on every subclass.
  • New _filter_payload(payload, *fields) helper used by adapters immediately before each _emit(...) call for any payload that may contain user-controlled state.
  • New serialize_state_filter_for_replay() — replay engine uses this to reconstruct an equivalent filter on the other side, so the captured payload shapes match between original run and replay.
  • adapter_info().metadata['state_filter'] surfaces the active config.

Per-adapter wiring (6 multi-agent adapters)

| Adapter | Constructor state_filter | Filter applied at emit |
|---------|--------------------------|------------------------|
| agno | YES | agent.input/output, tool.call/result |
| openai_agents | YES | agent.input (tools/handoffs/output_type), generation messages, function args + parameters_schema + mcp_data, tool result |
| llamaindex | YES | LLM messages + output_message, tool args, retrieval input/output, query input/output, agent_step input/output |
| google_adk | YES | run user_content + agent_tree, agent user_content, tool args, tool result |
| strands | YES | invocation messages, after-invocation output, tool input, tool result |
| pydantic_ai | YES | agent input + deps_summary, agent output, tool.call args, tool.result output, streaming output_message |

The audit §2.12 enumerates 7 targets including ms_agent_framework — that adapter doesn't exist on this branch's base (feat/instrument-callback-resilience); it lives only on the parallel feat/instrument-multitenancy-org-id-propagation history. It will be wired when the ms_agent_framework adapter is ported to this base or the histories merge.

Tests (53 new)

tests/instrument/adapters/_base/test_state_filters.py

  • TestStateFilterConstruction — defaults are PII-aware, lowercasing, permissive factory, with_extra_excludes factory.
  • TestStateFilterMetadata — default snapshot shape, allowlist surfaces in metadata.
  • TestFilterStateExclude — default PII keys removed, vendor variants caught (X-Api-Key, USER_API_KEY, stripe_customer_email), permissive opt-out.
  • TestFilterStateMask — keeps key visible, masking runs before recurse so nested PII can't leak through a masked field.
  • TestFilterStateInclude — allowlist semantics, exclude wins over include when both match.
  • TestFilterStateRecursive — nested dicts, lists of dicts, non-recursive flag.
  • TestFilterStatePassthrough — primitives + empty dict pass through.
  • TestFilterPayloadFields — surgical filter (scalars untouched), missing fields skipped, scalar field is no-op, accumulating _filtered_keys across multiple passes.
  • TestFrameworkAdapterStateFilterDefaults — default installed, custom override, end-to-end PII drop, replay snapshot, adapter_info.
  • TestPerAdapterStateFilterWiring — parametrized across all 6 adapters: constructor accepts state_filter, default is PII-aware, state_filter surfaces in adapter_info.
  • TestEndToEndAgnoFilter — filter actually runs at the emit boundary (not just sits idle on the adapter).

Existing test suites unchanged

Documentation

docs/adapters/state-filters.md — explains default behaviour, three filter operations, configuration recipes, recursion, auditability via _filtered_keys, replay reproducibility, and the per-adapter wiring matrix.

Acceptance

pytest tests/instrument/adapters/_base/test_state_filters.py -x
# 53 passed in 0.10s

pytest tests/instrument/adapters/_base/
# 110 passed, 7 skipped in 0.26s

pytest tests/instrument/adapters/frameworks/
# 114 passed, 12 skipped, 1 pre-existing flake

mypy --strict src/layerlens/instrument/adapters/_base/state_filters.py
# Success: no issues found in 1 source file

mypy src/layerlens/instrument/adapters/frameworks/{_base_framework,agno,openai_agents,llamaindex,google_adk,strands,pydantic_ai}.py
# Success: no issues found in 7 source files

ruff check ...
# All checks passed!

Test plan

  • pytest tests/instrument/adapters/_base/test_state_filters.py -x
  • pytest tests/instrument/adapters/_base/ (no regressions on resilience tests)
  • pytest tests/instrument/adapters/frameworks/ (no regressions on adapter tests)
  • mypy --strict src/layerlens/instrument/adapters/_base/state_filters.py
  • mypy src/.../{adapters wired in this PR}.py
  • ruff check
  • CI must pass on Linux / 3.10-3.13 matrix (developed on Windows / 3.9)

…0 lighter adapters

Introduces a shared @resilient_callback decorator + ResilienceTracker
under `src/layerlens/instrument/adapters/_base/`, then applies it to every
callback method on the 10 lighter framework adapters (agno, llamaindex,
google_adk, strands, pydantic_ai, smolagents, bedrock_agents, openai_agents,
haystack, langfuse) so an exception in our observability code can never
crash the customer's framework execution.

What the decorator does on failure:
1. Catches Exception (NOT BaseException — KeyboardInterrupt / SystemExit
   still propagate so users can Ctrl-C their agent).
2. Logs the exception via the wrapped function's module logger with
   adapter_name + callback_name + truncated traceback.
3. Increments the adapter's per-instance ResilienceTracker counter.
4. Returns the framework's expected default value — None for void
   handlers, or the value of `passthrough_arg` for mutating hooks
   (Pydantic-AI's `after_model_request` returns the response object;
   `before_tool_execute` returns the args tuple).
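
A minimal sketch of the four failure steps above, for illustration only. The
name `resilient_callback_sketch`, the `failure_counts` dict and `DemoAdapter`
are hypothetical; the real decorator's signature and its tracker wiring are
not shown in this message, so treat every detail below as an assumption.

```python
import functools
import inspect
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def resilient_callback_sketch(passthrough_arg: Optional[str] = None) -> Callable:
    """Hypothetical stand-in for @resilient_callback (not the real signature)."""

    def decorator(fn: Callable) -> Callable:
        signature = inspect.signature(fn)

        @functools.wraps(fn)  # keep __name__ / __doc__ of the wrapped hook
        def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
            try:
                return fn(self, *args, **kwargs)
            except Exception:  # KeyboardInterrupt / SystemExit are not caught
                logger.exception(
                    "callback %s failed on %s", fn.__name__, type(self).__name__
                )
                # Assumed counter shape: a plain dict on the adapter instance.
                counts = self.__dict__.setdefault("failure_counts", {})
                counts[fn.__name__] = counts.get(fn.__name__, 0) + 1
                if passthrough_arg is None:
                    return None  # default for void handlers
                try:
                    bound = signature.bind_partial(self, *args, **kwargs)
                except TypeError:
                    return None
                # Hand the named argument back so a mutating hook stays a no-op.
                return bound.arguments.get(passthrough_arg)

        return wrapper

    return decorator


class DemoAdapter:
    @resilient_callback_sketch()
    def on_run_start(self, event: Any) -> None:
        raise ValueError("telemetry bug")

    @resilient_callback_sketch(passthrough_arg="response")
    def after_model_request(self, response: Any) -> Any:
        raise RuntimeError("telemetry bug")

    @resilient_callback_sketch()
    def on_interrupt(self) -> None:
        raise KeyboardInterrupt
```

In this sketch `DemoAdapter().after_model_request("resp")` hands back `"resp"`
unchanged, while `on_interrupt()` still raises `KeyboardInterrupt`.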

Health surfacing:
- FrameworkAdapter now owns a `_resilience: ResilienceTracker` attribute
  set in `__init__` so every framework adapter inherits the contract.
- `adapter_info().metadata` merges the live resilience snapshot
  (`resilience_status`, `resilience_failures_total`,
  `resilience_failure_threshold`, per-callback breakdown, last error).
- After DEFAULT_FAILURE_THRESHOLD (5) failures the adapter reports
  `resilience_status: "degraded"` so monitoring can alert.
- `disconnect()` resets the tracker so reconnects start clean.
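
One way the health snapshot could be modelled, as a rough sketch. The class
name `TrackerSketch`, the `"healthy"` status string, the `>=` comparison and
the `resilience_failures_by_callback` / `resilience_last_error` keys are
assumptions; only `resilience_status`, `resilience_failures_total`,
`resilience_failure_threshold` and the threshold of 5 come from the text above.

```python
from collections import Counter
from typing import Any, Dict, Optional

DEFAULT_FAILURE_THRESHOLD = 5


class TrackerSketch:
    """Hypothetical per-adapter failure tracker (not the real ResilienceTracker)."""

    def __init__(self, failure_threshold: int = DEFAULT_FAILURE_THRESHOLD) -> None:
        self._threshold = failure_threshold
        self._per_callback: Counter = Counter()
        self._last_error: Optional[str] = None

    def record_failure(self, callback_name: str, exc: Exception) -> None:
        self._per_callback[callback_name] += 1
        self._last_error = f"{type(exc).__name__}: {exc}"

    def reset(self) -> None:
        # What disconnect() would call so a reconnect starts clean.
        self._per_callback.clear()
        self._last_error = None

    def snapshot(self) -> Dict[str, Any]:
        total = sum(self._per_callback.values())
        # Assumption: "after 5 failures" is read as total >= threshold.
        status = "degraded" if total >= self._threshold else "healthy"
        return {
            "resilience_status": status,
            "resilience_failures_total": total,
            "resilience_failure_threshold": self._threshold,
            "resilience_failures_by_callback": dict(self._per_callback),
            "resilience_last_error": self._last_error,
        }
```

A monitor would then only need to alert on `resilience_status == "degraded"`
in the merged `adapter_info().metadata`.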

Per-adapter callback audit + fixes:

| Adapter         | Callbacks wrapped | Notes |
|-----------------|-------------------|-------|
| agno            | 2                 | _on_run_start, _on_run_end |
| llamaindex      | 16                | 3 span lifecycle + dispatcher + 12 events |
| google_adk      | 11                | All adapter _on_* + simplified plugin shims |
| strands         | 7                 | All hook handlers (replaces manual try/except) |
| pydantic_ai     | 9 (incl 3 split)  | Error hooks split: telemetry resilient, re-raise unconditional |
| smolagents      | 6                 | Run/step handlers (replaces manual try/except) |
| bedrock_agents  | 2                 | _before_invoke + _after_invoke (with try/finally for _end_run) |
| openai_agents   | 3                 | on_trace_start/_end + on_span_end (replaces manual try/except) |
| haystack        | 1                 | _on_span_end (replaces manual try/except) |
| langfuse        | 5                 | _import_single_trace, _import_observation, _import_score, _export_single_trace, plus inner emit fallbacks |
| TOTAL           | 62                |       |

Pydantic-AI error-callback split: `_on_run_error`,
`_on_model_request_error`, `_on_tool_execute_error` MUST always re-raise
the framework's original error (per Pydantic-AI's contract). The
telemetry side is moved into a `_emit_*_error_telemetry` helper wrapped
with @resilient_callback; the public hook calls it then unconditionally
`raise error`. So adapter-side telemetry bugs can never swallow a real
framework error.
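
The split can be pictured roughly as follows. This is a sketch, not the
adapter's code: `ErrorHookSketch`, `BrokenEmit`, the `agent.error` event name
and the inline `try/except` (standing in for `@resilient_callback`) are all
assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple


class ErrorHookSketch:
    """Hypothetical adapter fragment showing telemetry / re-raise separation."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, Any]]] = []
        self.telemetry_failures = 0

    def _emit(self, event_type: str, payload: Dict[str, Any]) -> None:
        self.events.append((event_type, payload))

    def _emit_run_error_telemetry(self, error: BaseException) -> None:
        # Resilient half: a telemetry failure is counted, never propagated.
        try:
            self._emit("agent.error", {"error": repr(error)})
        except Exception:
            self.telemetry_failures += 1

    def _on_run_error(self, error: BaseException) -> None:
        # Public hook: emit telemetry, then always re-raise the original error.
        self._emit_run_error_telemetry(error)
        raise error


class BrokenEmit(ErrorHookSketch):
    """Simulates an adapter-side bug in the emit path."""

    def _emit(self, event_type: str, payload: Dict[str, Any]) -> None:
        raise RuntimeError("exporter down")
```

Even with `BrokenEmit`, the caller still sees the framework's own exception
rather than the `RuntimeError` from the telemetry path.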

Tests:
- `tests/instrument/adapters/_base/test_resilience.py` — 34 tests
  covering tracker mechanics, decorator behaviour, passthrough args,
  KeyboardInterrupt propagation, FrameworkAdapter integration, package
  re-exports, and decorator metadata preservation.
- `tests/instrument/adapters/_base/test_per_adapter_resilience.py` —
  per-adapter smoke tests (one per lighter adapter) that simulate a
  callback exception by sabotaging an inner helper, plus a parametrized
  health-degradation test across all 10 adapters.

Refactor: `_base.py` (the AdapterInfo + BaseAdapter module) becomes
`_base/` package with `__init__.py` re-exporting from `_core.py` (moved
via `git mv`) and the new `resilience.py`. All existing
`from .._base import AdapterInfo, BaseAdapter` imports continue working
unchanged.

Acceptance:
- pytest tests/instrument/adapters/_base/test_resilience.py -x — 34 passed
- pytest tests/instrument/adapters/frameworks/ -x — 146 passed (12 skipped for missing optional deps; 2 deselected pre-existing Windows clock-resolution flakes in test_haystack)
- mypy --strict src/layerlens/instrument/adapters/_base/resilience.py — Success
- mypy src — Success: 169 source files
- ruff check — All checks passed
- Full test suite: 1090 passed
… adapters (cross-poll #6)

Implements cross-pollination item #6 from `A:/tmp/adapter-cross-pollination-audit.md` §2.12.
LangGraph's `LangGraphStateAdapter` (mature) supports include/exclude key
filters at the state-snapshot level so customers can scrub sensitive
state (api_keys, tokens, PII) WITHOUT modifying their agent code or
doing post-hoc redaction. This PR brings the same contract to the
lighter multi-agent framework adapters present on this base.

## What's new

### Shared filter module
`src/layerlens/instrument/adapters/_base/state_filters.py` — new

- `StateFilter` — frozen dataclass with `include_keys`, `exclude_keys`,
  `mask_keys`, `recursive`. Three operations applied in order:
  exclude (drop), mask (replace value with `[REDACTED]`), include
  (allowlist). Matching is case-insensitive substring after
  alphanumeric-only normalisation, so `X-Api-Key`, `USER_API_KEY`,
  `customer.email_address` all match without enumerating every variant.
- `DEFAULT_PII_EXCLUDE_KEYS` — conservative denylist covering 49
  common credential / PII / financial / contact field names. Customers
  who do nothing still get baseline protection out of the box per
  CLAUDE.md ("never silently leak customer data").
- `default_state_filter()` — factory installed by every adapter unless
  the caller passes a custom `state_filter`.
- `filter_state(state, filter)` — pure function returning
  `(filtered_state, filtered_keys)` so adapters can surface the
  clipped key names as `_filtered_keys` event metadata for audit.
- `filter_payload_fields(payload, filter, fields)` — surgical helper
  that filters only the named dict-shaped fields of a mixed-shape
  payload (so scalar metadata like `model`, `latency_ms` is preserved).
- `StateFilter.permissive()` — opt-out factory for tests / explicit
  disablement. The active filter snapshot is surfaced under
  `adapter_info().metadata['state_filter']` so operators can detect
  accidental disablement.
- `StateFilter.with_extra_excludes()` — default + caller's additions.
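
To make the matching rule and operation order concrete, here is a small
sketch. `SketchFilter` and `sketch_filter_state` are hypothetical stand-ins
written for this example; the real `StateFilter` and `filter_state` may differ,
in particular in how the allowlist interacts with recursion (this sketch
applies it at every nesting level, which is an assumption).

```python
import re
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Optional, Tuple

REDACTED = "[REDACTED]"


def _normalise(key: str) -> str:
    """Lowercase and keep only alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def _matches(key: str, patterns: FrozenSet[str]) -> bool:
    norm = _normalise(key)
    # Skip patterns that normalise to "" so they cannot match every key.
    return any(p and p in norm for p in map(_normalise, patterns))


@dataclass(frozen=True)
class SketchFilter:
    """Hypothetical stand-in for StateFilter; field names follow the text above."""

    exclude_keys: FrozenSet[str] = frozenset()
    mask_keys: FrozenSet[str] = frozenset()
    include_keys: Optional[FrozenSet[str]] = None  # None means no allowlist
    recursive: bool = True


def sketch_filter_state(state: Any, flt: SketchFilter) -> Tuple[Any, List[str]]:
    filtered_keys: List[str] = []

    def walk(value: Any) -> Any:
        if isinstance(value, dict):
            out = {}
            for k, v in value.items():
                key = str(k)
                if _matches(key, flt.exclude_keys):  # 1. exclude: drop
                    filtered_keys.append(key)
                elif _matches(key, flt.mask_keys):  # 2. mask: no recursion
                    filtered_keys.append(key)
                    out[k] = REDACTED
                elif flt.include_keys is not None and not _matches(
                    key, flt.include_keys
                ):  # 3. include: allowlist
                    filtered_keys.append(key)
                else:
                    out[k] = walk(v) if flt.recursive else v
            return out
        if isinstance(value, list) and flt.recursive:
            return [walk(v) for v in value]
        return value  # primitives pass through

    return walk(state), filtered_keys
```

With `exclude_keys={"api_key", "email"}` this drops both `X-Api-Key` and a
nested `customer.email_address`, since both normalise to strings containing
`apikey` or `email`.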

### FrameworkAdapter integration
`src/layerlens/instrument/adapters/frameworks/_base_framework.py`

- Constructor accepts optional `state_filter` (defaults to
  `default_state_filter()`).
- `self._state_filter` reachable on every subclass.
- New `_filter_payload(payload, *fields)` helper used by adapters
  immediately before each `_emit(...)` call for any payload that may
  contain user-controlled state.
- New `serialize_state_filter_for_replay()` — replay engine uses this
  to reconstruct an equivalent filter on the other side, so the
  captured payload shapes match between original run and replay.
- `adapter_info().metadata['state_filter']` surfaces the active config.
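
As an illustration of the "filter only the named fields" idea behind
`_filter_payload`, a sketch under assumptions: `filter_payload_fields_sketch`
takes a plain callable instead of a `StateFilter`, and `drop_secret_keys` is a
toy filter invented for the example.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

FilterFn = Callable[[Dict[str, Any]], Tuple[Dict[str, Any], List[str]]]


def drop_secret_keys(state: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Toy filter: drops any top-level key whose name contains 'secret'."""
    kept = {k: v for k, v in state.items() if "secret" not in k.lower()}
    dropped = [k for k in state if k not in kept]
    return kept, dropped


def filter_payload_fields_sketch(
    payload: Dict[str, Any], filter_fn: FilterFn, fields: Iterable[str]
) -> Dict[str, Any]:
    out = dict(payload)  # shallow copy; the caller's payload is left untouched
    clipped = list(out.get("_filtered_keys", []))  # accumulate across passes
    for name in fields:
        value = out.get(name)
        if not isinstance(value, dict):
            continue  # missing or scalar field: no-op
        filtered, dropped = filter_fn(value)
        out[name] = filtered
        clipped.extend(k for k in dropped if k not in clipped)
    if clipped:
        out["_filtered_keys"] = clipped
    return out
```

Scalar metadata such as `model` and `latency_ms` is never inspected here, so
it reaches the emit call unchanged.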

### Per-adapter wiring (6 multi-agent adapters)

| Adapter         | Constructor `state_filter` | Filter applied at emit         |
|-----------------|----------------------------|--------------------------------|
| `agno`          | YES                        | `agent.input/output`, `tool.call/result` |
| `openai_agents` | YES                        | `agent.input` (tools/handoffs/output_type), generation messages, function args + parameters_schema + mcp_data, tool result |
| `llamaindex`    | YES                        | LLM messages + output_message, tool args, retrieval input/output, query input/output, agent_step input/output |
| `google_adk`    | YES                        | run user_content + agent_tree, agent user_content, tool args, tool result |
| `strands`       | YES                        | invocation messages, after-invocation output, tool input, tool result |
| `pydantic_ai`   | YES                        | agent input + deps_summary, agent output, tool.call args, tool.result output, streaming output_message |

The audit §2.12 enumerates 7 targets including `ms_agent_framework` —
that adapter doesn't exist on this branch's base
(`feat/instrument-callback-resilience`); it lives only on the parallel
`feat/instrument-multitenancy-org-id-propagation` history. It will be
wired when the ms_agent_framework adapter is ported to this base or
the histories merge.

## Tests (53 new + integration)

### `tests/instrument/adapters/_base/test_state_filters.py` — 53 tests

- `TestStateFilterConstruction` — defaults are PII-aware, lowercasing,
  permissive factory, with_extra_excludes factory.
- `TestStateFilterMetadata` — default snapshot shape, allowlist
  surfaces in metadata.
- `TestFilterStateExclude` — default PII keys removed, vendor variants
  caught (`X-Api-Key`, `USER_API_KEY`, `stripe_customer_email`),
  permissive opt-out.
- `TestFilterStateMask` — keeps key visible, masking runs before
  recurse so nested PII can't leak through a masked field.
- `TestFilterStateInclude` — allowlist semantics, exclude wins over
  include when both match.
- `TestFilterStateRecursive` — nested dicts, lists of dicts,
  non-recursive flag.
- `TestFilterStatePassthrough` — primitives + empty dict pass through.
- `TestFilterPayloadFields` — surgical filter (scalars untouched),
  missing fields skipped, scalar field is no-op, accumulating
  `_filtered_keys` across multiple passes.
- `TestFrameworkAdapterStateFilterDefaults` — default installed,
  custom override, end-to-end PII drop, replay snapshot, adapter_info.
- `TestPerAdapterStateFilterWiring` — parametrized across all 6
  adapters: constructor accepts state_filter, default is PII-aware,
  state_filter surfaces in adapter_info.
- `TestEndToEndAgnoFilter` — filter actually runs at the emit
  boundary (not just sits idle on the adapter).

### Existing test suites unchanged
- `tests/instrument/adapters/_base/` — 110 passed, 7 skipped.
- `tests/instrument/adapters/frameworks/` — 114 passed (langchain,
  langgraph, langfuse, agentforce — adapters with deps installed in
  CI venv), 12 skipped (optional deps), 1 pre-existing Windows
  clock-resolution flake on test_haystack (documented in PR #117).

## Documentation

`docs/adapters/state-filters.md` — explains default behaviour, three
filter operations, configuration recipes, recursion, auditability via
`_filtered_keys`, replay reproducibility, and the per-adapter wiring
matrix.

## Acceptance

```
pytest tests/instrument/adapters/_base/test_state_filters.py -x
# 53 passed in 0.10s

pytest tests/instrument/adapters/_base/
# 110 passed, 7 skipped in 0.26s

pytest tests/instrument/adapters/frameworks/  # adapters with installed deps
# 114 passed, 12 skipped, 1 pre-existing flake (test_haystack.test_input_and_output)

mypy --strict src/layerlens/instrument/adapters/_base/state_filters.py
# Success: no issues found in 1 source file

mypy src/layerlens/instrument/adapters/frameworks/{_base_framework,agno,openai_agents,llamaindex,google_adk,strands,pydantic_ai}.py
# Success: no issues found in 7 source files

ruff check src/layerlens/instrument/adapters/_base/state_filters.py src/layerlens/instrument/adapters/frameworks/{_base_framework,agno,openai_agents,llamaindex,google_adk,strands,pydantic_ai}.py tests/instrument/adapters/_base/test_state_filters.py
# All checks passed!
```

mmercuri requested a review from m-peko on April 27, 2026 03:49

m-peko pushed a commit that referenced this pull request on May 12, 2026
…or 6 lighter adapters (cross-poll #1) (#130)

Implements cross-pollination item #1 from
A:/tmp/adapter-cross-pollination-audit.md section 2 #1. The four mature
framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry
ad-hoc memory plumbing — episodic recent turns, procedural learned
patterns, semantic long-lived facts — that lets agents recall context
across runs. The lighter adapters (agno, ms_agent_framework,
openai_agents, llama_index, google_adk, bedrock_agents, browser_use)
all behave as goldfish agents — every run starts from a blank slate.
This PR ports the pattern into a shared, replay-safe primitive that
the lighter adapters plug into uniformly.

## What is new

### Shared memory primitive
src/layerlens/instrument/adapters/_base/memory.py — new

- MemorySnapshot — frozen dataclass with turn_index, episodic
  (recent turns), procedural (detected patterns), semantic
  (key/value facts), content_hash (SHA-256 of canonical-JSON
  encoding), org_id (tenant binding). to_dict / from_dict
  round-trip preserves identity.
- MemoryRecorder — thread-safe accumulator. record_turn(...) is the
  per-turn entry point; set_semantic(key, value) for long-lived
  facts; snapshot() returns the immutable view; restore(snap)
  rebuilds state from a previous snapshot. All buckets bounded
  (defaults 200/16/64); episodic FIFO eviction, semantic LRU,
  procedural keep-top-by-count.
- Procedural pattern detection: O(window) per turn, scans the recent
  episodic window for recurring (prev_tools, current_tools) pairs.
- Multi-tenant: recorder requires non-empty org_id at construction;
  restore() rejects cross-tenant snapshots and tampered snapshots
  (content-hash mismatch).
- Replay-safe: snapshot -> restore -> snapshot round-trip
  produces byte-identical content_hash.
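
The tamper-evident, tenant-bound round trip can be sketched as below.
`RecorderSketch` is hypothetical and much smaller than the described
`MemoryRecorder`: it uses plain dicts instead of a frozen `MemorySnapshot`,
omits the procedural bucket, locking and LRU eviction, and assumes that 200 is
the episodic cap. The exception types are also assumptions.

```python
import copy
import hashlib
import json
from collections import deque
from typing import Any, Deque, Dict, List


def _content_hash(body: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RecorderSketch:
    def __init__(self, org_id: str, max_episodic: int = 200) -> None:
        if not isinstance(org_id, str) or not org_id:
            raise ValueError("org_id must be a non-empty string")
        self.org_id = org_id
        self._turn_index = 0
        # deque(maxlen=...) gives FIFO eviction once the cap is reached.
        self._episodic: Deque[Dict[str, Any]] = deque(maxlen=max_episodic)
        self._semantic: Dict[str, Any] = {}

    def record_turn(self, user_input: str, tools: List[str]) -> None:
        self._turn_index += 1
        self._episodic.append(
            {"turn": self._turn_index, "input": user_input, "tools": list(tools)}
        )

    def set_semantic(self, key: str, value: Any) -> None:
        self._semantic[key] = value

    def snapshot(self) -> Dict[str, Any]:
        body = {
            "org_id": self.org_id,
            "turn_index": self._turn_index,
            "episodic": copy.deepcopy(list(self._episodic)),
            "semantic": copy.deepcopy(self._semantic),
        }
        return {**body, "content_hash": _content_hash(body)}

    def restore(self, snap: Dict[str, Any]) -> None:
        body = {k: v for k, v in snap.items() if k != "content_hash"}
        if body.get("org_id") != self.org_id:
            raise PermissionError("cross-tenant snapshot rejected")
        if _content_hash(body) != snap.get("content_hash"):
            raise ValueError("snapshot content hash mismatch")
        self._turn_index = body["turn_index"]
        self._episodic = deque(
            copy.deepcopy(body["episodic"]), maxlen=self._episodic.maxlen
        )
        self._semantic = copy.deepcopy(body["semantic"])
```

Because the hash covers `org_id` along with the buckets, the same content
recorded under two tenants yields different hashes, and editing any field of a
serialised snapshot makes `restore()` fail in this sketch.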

### BaseAdapter integration
src/layerlens/instrument/adapters/_base/adapter.py

- Constructor builds self._memory_recorder = MemoryRecorder(org_id=self._org_id).
- New record_memory_turn(...) helper — best-effort wrapper that swallows
  recorder failures so memory persistence never breaks the host
  framework call stack (CLAUDE.md "tracing never breaks user code").
- memory_recorder property, memory_snapshot() and
  memory_snapshot_dict() convenience accessors.

### Per-adapter wiring (6 adapters)

- agno: Agent.run/arun finally-block; episodic input from args/kwargs;
  tool list from _collect_tool_names(result.messages).
- ms_agent_framework: Chat.invoke/invoke_stream finally-block;
  episodic input from kwargs; tool list from streamed message items.
- openai_agents: _on_agent_span_end (TraceProcessor) + on_run_end
  (Runner wrap); episodic input cached at span_start per span_id;
  tool list rolled up from _on_function_span_end per parent_id.
- llama_index: _on_agent_step_end; episodic input cached at
  step_start per thread id; tool list rolled up from _on_tool_call.
- google_adk: after_agent_callback + on_agent_end; episodic input
  cached at before_agent_callback per thread id; tool list rolled
  up from after_tool_callback per thread id.
- bedrock_agents: _after_invoke_agent (boto3 hook); episodic input
  cached at _before_invoke_agent per thread id; tool list rolled up
  from _process_trace action-group / KB step names.

Each adapter's serialize_for_replay() now embeds the snapshot under
ReplayableTrace.metadata["memory_snapshot"] so replay engines can
reconstruct memory state via MemorySnapshot.from_dict(...) ->
recorder.restore(snapshot) before re-execution.

## Tests (57 new)

### tests/instrument/adapters/_base/test_memory.py — 27 tests

- Recorder construction: empty/non-string org_id rejected; zero buffer
  sizes rejected; initial state empty.
- Snapshot determinism: identical content -> identical hash; different
  org_id -> different hash; mutating the recorder doesn't affect a prior
  snapshot; to_dict/from_dict round-trip preserves hash; from_dict
  rejects missing required fields.
- Replay round-trip: snapshot -> restore -> snapshot byte-identical
  hash; deterministic next-state under matching inputs; cross-tenant
  restore raises; tampered-content-hash restore raises.
- Bounded eviction: episodic FIFO at cap; semantic LRU at cap; semantic
  overwrite refreshes LRU; procedural cap.
- Procedural detection: repeated tool sequences accumulate count;
  no-tool turns produce no patterns.
- Per-turn truncation: multi-megabyte values capped with deterministic
  suffix.
- Thread safety: 8 threads x 50 turns produces an unbroken 1..400
  sequence.
- Clear preserves binding; defaults positive; extra metadata sorted for
  hash determinism.

### tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py — 30 tests (5 x 6 adapters parametrized)

- Each adapter exposes a recorder bound to its org_id.
- record_memory_turn advances the episodic buffer.
- serialize_for_replay() embeds metadata["memory_snapshot"].
- Replay engine can restore the recorder from the serialised trace
  (content-hash match end-to-end).
- Cross-tenant snapshot is rejected at the per-adapter recorder
  boundary.

## Documentation

docs/adapters/memory-contract.md — explains the three buckets,
the contract (tenant binding, bounded buffers, tamper-evident
snapshots, replay-safe round-trip, best-effort recording, thread
safety), per-adapter wiring matrix, and audit hooks. Includes the
replay-engine integration recipe and the honest scope disclosure
for browser_use.

## Honest scope disclosure

The cross-pollination audit section 2 #1 enumerates seven target
adapters. Six are wired here. The seventh — browser_use — does
NOT exist on this PR's base branch
(feat/instrument-multitenancy-org-id-propagation); it lives on the
parallel feat/instrument-frameworks-browser-use-full history. It
will be wired when that adapter is ported to this base or when the
histories merge. This follows the same honest-disclosure pattern as
PR #120 (state filters, which omitted ms_agent_framework for the
same reason).

The future browser_use wiring (per audit section 2 #1) will be:
- Episodic: page navigation events (URL, action, selector)
- Procedural: recurring (prev_action, current_action) patterns
- Semantic: long-lived page-content cache keyed by URL/DOM hash

## Acceptance

uv run pytest tests/instrument/adapters/_base/test_memory.py -x
  -> 27 passed

uv run pytest tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -x
  -> 30 passed

uv run pytest tests/instrument/adapters/_base/
  -> 44 passed (no regressions)

uv run pytest tests/instrument/adapters/frameworks/{agno,bedrock_agents,google_adk,llama_index,ms_agent_framework,openai_agents}_adapter.py
  -> 72 passed (no regressions)

uv run mypy --strict src/layerlens/instrument/adapters/_base/memory.py
  -> Success: no issues found in 1 source file

uv run mypy src/layerlens/instrument/adapters/_base/adapter.py src/layerlens/instrument/adapters/frameworks/{6 adapters}/lifecycle.py
  -> Success: no issues found in 7 source files

uv run ruff check src/layerlens/instrument/adapters/_base/memory.py tests/instrument/adapters/_base/test_memory.py tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py
  -> All checks passed!

m-peko closed this on May 21, 2026
2 participants