feat(instrument): State include/exclude key filters for 6 multi-agent adapters (cross-poll #6) #120
Closed
mmercuri wants to merge 2 commits into
…0 lighter adapters

Introduces a shared @resilient_callback decorator + ResilienceTracker under `src/layerlens/instrument/adapters/_base/`, then applies it to every callback method on the 10 lighter framework adapters (agno, llamaindex, google_adk, strands, pydantic_ai, smolagents, bedrock_agents, openai_agents, haystack, langfuse) so an exception in our observability code can never crash the customer's framework execution.

What the decorator does on failure:

1. Catches Exception (NOT BaseException: KeyboardInterrupt / SystemExit still propagate so users can Ctrl-C their agent).
2. Logs the exception via the wrapped function's module logger with adapter_name + callback_name + truncated traceback.
3. Increments the adapter's per-instance ResilienceTracker counter.
4. Returns the framework's expected default value: None for void handlers, or the value of `passthrough_arg` for mutating hooks (Pydantic-AI's `after_model_request` returns the response object; `before_tool_execute` returns the args tuple).

Health surfacing:

- FrameworkAdapter now owns a `_resilience: ResilienceTracker` attribute set in `__init__` so every framework adapter inherits the contract.
- `adapter_info().metadata` merges the live resilience snapshot (`resilience_status`, `resilience_failures_total`, `resilience_failure_threshold`, per-callback breakdown, last error).
- After DEFAULT_FAILURE_THRESHOLD (5) failures the adapter reports `resilience_status: "degraded"` so monitoring can alert.
- `disconnect()` resets the tracker so reconnects start clean.
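The failure path listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SketchResilienceTracker`, `DemoAdapter`, the `"healthy"` status name, the "at or above the threshold" rule, and the keyword-only lookup of `passthrough_arg` are all assumptions made for the example.

```python
import functools
import logging
from collections import Counter
from typing import Any, Callable, Dict, Optional

DEFAULT_FAILURE_THRESHOLD = 5  # value stated in the commit message


class SketchResilienceTracker:
    """Hypothetical per-adapter failure counter."""

    def __init__(self, threshold: int = DEFAULT_FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self.failures: Counter = Counter()
        self.last_error: Optional[str] = None

    def record(self, callback_name: str, exc: Exception) -> None:
        self.failures[callback_name] += 1
        self.last_error = repr(exc)

    def snapshot(self) -> Dict[str, Any]:
        total = sum(self.failures.values())
        # Assumption: "degraded" starts once the total reaches the threshold.
        status = "degraded" if total >= self.threshold else "healthy"
        return {
            "resilience_status": status,
            "resilience_failures_total": total,
            "resilience_failure_threshold": self.threshold,
        }

    def reset(self) -> None:
        self.failures.clear()
        self.last_error = None


def resilient_callback(passthrough_arg: Optional[str] = None) -> Callable:
    """Swallow Exception in a callback, count it, return a safe default."""

    def decorator(fn: Callable) -> Callable:
        logger = logging.getLogger(fn.__module__)

        @functools.wraps(fn)  # keeps __name__ / __doc__ of the wrapped hook
        def wrapper(self, *args: Any, **kwargs: Any) -> Any:
            try:
                return fn(self, *args, **kwargs)
            except Exception as exc:  # KeyboardInterrupt / SystemExit are not caught
                logger.warning("%s.%s failed: %r", type(self).__name__, fn.__name__, exc)
                self._resilience.record(fn.__name__, exc)
                if passthrough_arg is None:
                    return None
                # Simplification: only keyword arguments are looked up here.
                return kwargs.get(passthrough_arg)

        return wrapper

    return decorator


class DemoAdapter:
    def __init__(self) -> None:
        self._resilience = SketchResilienceTracker()

    @resilient_callback()
    def on_run_start(self, payload: dict) -> None:
        raise ValueError("telemetry bug")

    @resilient_callback(passthrough_arg="response")
    def after_model_request(self, *, response: Any) -> Any:
        raise RuntimeError("telemetry bug")


adapter = DemoAdapter()
start_result = adapter.on_run_start({"x": 1})
passthrough_result = adapter.after_model_request(response="resp")
status_after_two = adapter._resilience.snapshot()["resilience_status"]
for _ in range(3):
    adapter.on_run_start({})
status_after_five = adapter._resilience.snapshot()["resilience_status"]
```

In this sketch a void handler yields `None`, a mutating hook hands back its `response` argument untouched, and the fifth recorded failure flips the snapshot to `degraded`.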
Per-adapter callback audit + fixes:

| Adapter | Callbacks wrapped | Notes |
|-----------------|-------------------|----------------------------------------|
| agno | 2 | _on_run_start, _on_run_end |
| llamaindex | 16 | 3 span lifecycle + dispatcher + 12 events |
| google_adk | 11 | All adapter _on_* + simplified plugin shims |
| strands | 7 | All hook handlers (replaces manual try/except) |
| pydantic_ai | 9 (incl 3 split) | Error hooks split: telemetry resilient, re-raise unconditional |
| smolagents | 6 | Run/step handlers (replaces manual try/except) |
| bedrock_agents | 2 | _before_invoke + _after_invoke (with try/finally for _end_run) |
| openai_agents | 3 | on_trace_start/_end + on_span_end (replaces manual try/except) |
| haystack | 1 | _on_span_end (replaces manual try/except) |
| langfuse | 5 | _import_single_trace, _import_observation, _import_score, _export_single_trace, plus inner emit fallbacks |
| TOTAL | 62 | |

Pydantic-AI error-callback split: `_on_run_error`, `_on_model_request_error`, `_on_tool_execute_error` MUST always re-raise the framework's original error (per Pydantic-AI's contract). The telemetry side is moved into a `_emit_*_error_telemetry` helper wrapped with @resilient_callback; the public hook calls it then unconditionally `raise error`. So adapter-side telemetry bugs can never swallow a real framework error.

Tests:

- `tests/instrument/adapters/_base/test_resilience.py`: 34 tests covering tracker mechanics, decorator behaviour, passthrough args, KeyboardInterrupt propagation, FrameworkAdapter integration, package re-exports, and decorator metadata preservation.
- `tests/instrument/adapters/_base/test_per_adapter_resilience.py`: per-adapter smoke tests (one per lighter adapter) that simulate a callback exception by sabotaging an inner helper, plus a parametrized health-degradation test across all 10 adapters.
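The Pydantic-AI error-callback split described in this commit can be illustrated with a small stand-alone sketch. The class, the `_emit` stub, and the plain try/except (standing in for @resilient_callback) are hypothetical; only the shape (resilient telemetry helper, then an unconditional `raise error`) comes from the description.

```python
class SketchErrorHooks:
    """Hypothetical adapter fragment showing the telemetry / re-raise split."""

    def __init__(self) -> None:
        self.telemetry_failures = 0

    def _emit(self, event: str, payload: dict) -> None:
        # Stand-in for a broken telemetry backend.
        raise RuntimeError("telemetry backend down")

    def _emit_run_error_telemetry(self, error: BaseException) -> None:
        # In the PR this helper is wrapped with @resilient_callback;
        # a plain try/except plays that role in this sketch.
        try:
            self._emit("agent.error", {"error": repr(error)})
        except Exception:
            self.telemetry_failures += 1

    def _on_run_error(self, error: BaseException) -> None:
        self._emit_run_error_telemetry(error)  # may fail silently
        raise error  # the framework's original error always propagates


hooks = SketchErrorHooks()
original = ValueError("framework failure")
try:
    hooks._on_run_error(original)
    caught = None
except Exception as exc:
    caught = exc
```

Even though the telemetry call fails here, the caller still sees the original `ValueError` rather than the `RuntimeError` from the telemetry path.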
Refactor: `_base.py` (the AdapterInfo + BaseAdapter module) becomes `_base/` package with `__init__.py` re-exporting from `_core.py` (moved via `git mv`) and the new `resilience.py`. All existing `from .._base import AdapterInfo, BaseAdapter` imports continue working unchanged.

Acceptance:

- pytest tests/instrument/adapters/_base/test_resilience.py -x: 34 passed
- pytest tests/instrument/adapters/frameworks/ -x: 146 passed (12 skipped for missing optional deps; 2 deselected pre-existing Windows clock-resolution flakes in test_haystack)
- mypy --strict src/layerlens/instrument/adapters/_base/resilience.py: Success
- mypy src: Success: 169 source files
- ruff check: All checks passed
- Full test suite: 1090 passed
… adapters (cross-poll #6)

Implements cross-pollination item #6 from `A:/tmp/adapter-cross-pollination-audit.md` §2.12. LangGraph's `LangGraphStateAdapter` (mature) supports include/exclude key filters at the state-snapshot level so customers can scrub sensitive state (api_keys, tokens, PII) WITHOUT modifying their agent code or doing post-hoc redaction. This PR brings the same contract to the lighter multi-agent framework adapters present on this base.

## What's new

### Shared filter module

`src/layerlens/instrument/adapters/_base/state_filters.py` (new)

- `StateFilter`: frozen dataclass with `include_keys`, `exclude_keys`, `mask_keys`, `recursive`. Three operations applied in order: exclude (drop), mask (replace value with `[REDACTED]`), include (allowlist). Matching is case-insensitive substring after alphanumeric-only normalisation, so `X-Api-Key`, `USER_API_KEY`, `customer.email_address` all match without enumerating every variant.
- `DEFAULT_PII_EXCLUDE_KEYS`: conservative denylist covering 49 common credential / PII / financial / contact field names. Customers who do nothing still get baseline protection out of the box per CLAUDE.md ("never silently leak customer data").
- `default_state_filter()`: factory installed by every adapter unless the caller passes a custom `state_filter`.
- `filter_state(state, filter)`: pure function returning `(filtered_state, filtered_keys)` so adapters can surface the clipped key names as `_filtered_keys` event metadata for audit.
- `filter_payload_fields(payload, filter, fields)`: surgical helper that filters only the named dict-shaped fields of a mixed-shape payload (so scalar metadata like `model`, `latency_ms` is preserved).
- `StateFilter.permissive()`: opt-out factory for tests / explicit disablement. The active filter snapshot is surfaced under `adapter_info().metadata['state_filter']` so operators can detect accidental disablement.
- `StateFilter.with_extra_excludes()`: default + caller's additions.
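The three filter operations and the normalised substring matching described above could look roughly like this. It is a sketch under stated assumptions, not the module's code: `SketchStateFilter`, `_norm`, `_recurse`, and the choice to report excluded, masked, and allowlist-rejected keys together in the second return value are all illustrative.

```python
import re
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Tuple

REDACTED = "[REDACTED]"  # replacement value named in the description


def _norm(key: str) -> str:
    """Lowercase and keep only alphanumerics, so X-Api-Key -> xapikey."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


@dataclass(frozen=True)
class SketchStateFilter:
    exclude_keys: FrozenSet[str] = frozenset()
    mask_keys: FrozenSet[str] = frozenset()
    include_keys: FrozenSet[str] = frozenset()  # empty means "no allowlist"
    recursive: bool = True

    def hits(self, key: str, patterns: FrozenSet[str]) -> bool:
        normalised = _norm(key)
        return any(_norm(p) in normalised for p in patterns)


def _recurse(value: Any, flt: SketchStateFilter) -> Tuple[Any, List[str]]:
    if isinstance(value, dict):
        return filter_state(value, flt)
    if isinstance(value, list):
        items, clipped = [], []
        for item in value:
            item, nested = _recurse(item, flt)
            items.append(item)
            clipped.extend(nested)
        return items, clipped
    return value, []


def filter_state(state: Any, flt: SketchStateFilter) -> Tuple[Any, List[str]]:
    """Return (filtered_state, filtered_keys); non-dicts pass through."""
    if not isinstance(state, dict):
        return state, []
    kept, clipped = {}, []
    for key, value in state.items():
        name = str(key)
        if flt.hits(name, flt.exclude_keys):      # 1. exclude: drop the key
            clipped.append(name)
            continue
        if flt.hits(name, flt.mask_keys):         # 2. mask: keep key, hide value
            kept[key] = REDACTED                  #    (no recursion into masked values)
            clipped.append(name)
            continue
        if flt.include_keys and not flt.hits(name, flt.include_keys):
            clipped.append(name)                  # 3. include: allowlist miss
            continue
        if flt.recursive:
            value, nested = _recurse(value, flt)
            clipped.extend(nested)
        kept[key] = value
    return kept, clipped


flt = SketchStateFilter(
    exclude_keys=frozenset({"api_key", "email"}),
    mask_keys=frozenset({"token"}),
)
state = {
    "X-Api-Key": "sk-123",
    "session_token": "abc",
    "model": "m1",
    "user": {"customer.email_address": "a@b.c", "plan": "pro"},
}
filtered, clipped = filter_state(state, flt)
allow_only, _ = filter_state(
    {"model": "m", "other": 1},
    SketchStateFilter(include_keys=frozenset({"model"})),
)
```

Masking before recursion means a masked value is never walked, which matches the stated intent that nested PII cannot leak through a masked field.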
### FrameworkAdapter integration

`src/layerlens/instrument/adapters/frameworks/_base_framework.py`

- Constructor accepts optional `state_filter` (defaults to `default_state_filter()`).
- `self._state_filter` reachable on every subclass.
- New `_filter_payload(payload, *fields)` helper used by adapters immediately before each `_emit(...)` call for any payload that may contain user-controlled state.
- New `serialize_state_filter_for_replay()`: replay engine uses this to reconstruct an equivalent filter on the other side, so the captured payload shapes match between original run and replay.
- `adapter_info().metadata['state_filter']` surfaces the active config.

### Per-adapter wiring (6 multi-agent adapters)

| Adapter | Constructor `state_filter` | Filter applied at emit |
|-----------------|----------------------------|--------------------------------|
| `agno` | YES | `agent.input/output`, `tool.call/result` |
| `openai_agents` | YES | `agent.input` (tools/handoffs/output_type), generation messages, function args + parameters_schema + mcp_data, tool result |
| `llamaindex` | YES | LLM messages + output_message, tool args, retrieval input/output, query input/output, agent_step input/output |
| `google_adk` | YES | run user_content + agent_tree, agent user_content, tool args, tool result |
| `strands` | YES | invocation messages, after-invocation output, tool input, tool result |
| `pydantic_ai` | YES | agent input + deps_summary, agent output, tool.call args, tool.result output, streaming output_message |

The audit §2.12 enumerates 7 targets including `ms_agent_framework`. That adapter doesn't exist on this branch's base (`feat/instrument-callback-resilience`); it lives only on the parallel `feat/instrument-multitenancy-org-id-propagation` history. It will be wired when the ms_agent_framework adapter is ported to this base or the histories merge.
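As a rough illustration of the surgical, field-level filtering that adapters apply just before emitting, the sketch below filters only the named dict-shaped fields and accumulates clipped key names under `_filtered_keys`. It deliberately takes a plain callable instead of a `StateFilter`, so the signature differs from the `filter_payload_fields(payload, filter, fields)` helper in the PR; `drop_secrets` is a hypothetical stand-in filter.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical filter signature: dict in, (filtered dict, clipped key names) out.
FilterFn = Callable[[Dict[str, Any]], Tuple[Dict[str, Any], List[str]]]


def filter_payload_fields_sketch(
    payload: Dict[str, Any], filter_fn: FilterFn, fields: Iterable[str]
) -> Dict[str, Any]:
    out = dict(payload)  # shallow copy; the caller's payload is not mutated
    clipped = list(out.get("_filtered_keys", []))  # accumulate across passes
    for field in fields:
        value = out.get(field)
        if not isinstance(value, dict):
            continue  # missing or scalar fields are left alone
        out[field], dropped = filter_fn(value)
        clipped.extend(dropped)
    if clipped:
        out["_filtered_keys"] = clipped
    return out


def drop_secrets(state: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Toy filter: drop any key whose name contains 'key'."""
    dropped = [k for k in state if "key" in k.lower()]
    return {k: v for k, v in state.items() if k not in dropped}, dropped


payload = {
    "model": "m1",
    "latency_ms": 12,
    "input": {"api_key": "x", "q": "hi"},
    "output": "text",
}
result = filter_payload_fields_sketch(payload, drop_secrets, ["input", "output", "missing"])
```

Scalar metadata such as `model` and `latency_ms` is untouched, and the original `payload` dict still holds its `api_key` because only the copy is rewritten.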
## Tests (53 new + integration)

### `tests/instrument/adapters/_base/test_state_filters.py` (53 tests)

- `TestStateFilterConstruction`: defaults are PII-aware, lowercasing, permissive factory, with_extra_excludes factory.
- `TestStateFilterMetadata`: default snapshot shape, allowlist surfaces in metadata.
- `TestFilterStateExclude`: default PII keys removed, vendor variants caught (`X-Api-Key`, `USER_API_KEY`, `stripe_customer_email`), permissive opt-out.
- `TestFilterStateMask`: keeps key visible, masking runs before recurse so nested PII can't leak through a masked field.
- `TestFilterStateInclude`: allowlist semantics, exclude wins over include when both match.
- `TestFilterStateRecursive`: nested dicts, lists of dicts, non-recursive flag.
- `TestFilterStatePassthrough`: primitives + empty dict pass through.
- `TestFilterPayloadFields`: surgical filter (scalars untouched), missing fields skipped, scalar field is no-op, accumulating `_filtered_keys` across multiple passes.
- `TestFrameworkAdapterStateFilterDefaults`: default installed, custom override, end-to-end PII drop, replay snapshot, adapter_info.
- `TestPerAdapterStateFilterWiring`: parametrized across all 6 adapters: constructor accepts state_filter, default is PII-aware, state_filter surfaces in adapter_info.
- `TestEndToEndAgnoFilter`: filter actually runs at the emit boundary (not just sits idle on the adapter).

### Existing test suites unchanged

- `tests/instrument/adapters/_base/`: 110 passed, 7 skipped.
- `tests/instrument/adapters/frameworks/`: 114 passed (langchain, langgraph, langfuse, agentforce; adapters with deps installed in CI venv), 12 skipped (optional deps), 1 pre-existing Windows clock-resolution flake on test_haystack (documented in PR #117).

## Documentation

`docs/adapters/state-filters.md` explains default behaviour, three filter operations, configuration recipes, recursion, auditability via `_filtered_keys`, replay reproducibility, and the per-adapter wiring matrix.
## Acceptance

```
pytest tests/instrument/adapters/_base/test_state_filters.py -x
# 53 passed in 0.10s

pytest tests/instrument/adapters/_base/
# 110 passed, 7 skipped in 0.26s

pytest tests/instrument/adapters/frameworks/   # adapters with installed deps
# 114 passed, 12 skipped, 1 pre-existing flake (test_haystack.test_input_and_output)

mypy --strict src/layerlens/instrument/adapters/_base/state_filters.py
# Success: no issues found in 1 source file

mypy src/layerlens/instrument/adapters/frameworks/{_base_framework,agno,openai_agents,llamaindex,google_adk,strands,pydantic_ai}.py
# Success: no issues found in 7 source files

ruff check src/layerlens/instrument/adapters/_base/state_filters.py src/layerlens/instrument/adapters/frameworks/{_base_framework,agno,openai_agents,llamaindex,google_adk,strands,pydantic_ai}.py tests/instrument/adapters/_base/test_state_filters.py
# All checks passed!
```
m-peko pushed a commit that referenced this pull request on May 12, 2026
…or 6 lighter adapters (cross-poll #1) (#130)

Implements cross-pollination item #1 from A:/tmp/adapter-cross-pollination-audit.md section 2 #1. The four mature framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry ad-hoc memory plumbing (episodic recent turns, procedural learned patterns, semantic long-lived facts) that lets agents recall context across runs. The lighter adapters (agno, ms_agent_framework, openai_agents, llama_index, google_adk, bedrock_agents, browser_use) all behave as goldfish agents: every run starts from a blank slate. This PR ports the pattern into a shared, replay-safe primitive that the lighter adapters plug into uniformly.

## What is new

### Shared memory primitive

src/layerlens/instrument/adapters/_base/memory.py (new)

- MemorySnapshot: frozen dataclass with turn_index, episodic (recent turns), procedural (detected patterns), semantic (key/value facts), content_hash (SHA-256 of canonical-JSON encoding), org_id (tenant binding). to_dict / from_dict round-trip preserves identity.
- MemoryRecorder: thread-safe accumulator. record_turn(...) is the per-turn entry point; set_semantic(key, value) for long-lived facts; snapshot() returns the immutable view; restore(snap) rebuilds state from a previous snapshot. All buckets bounded (defaults 200/16/64); episodic FIFO eviction, semantic LRU, procedural keep-top-by-count.
- Procedural pattern detection: O(window) per turn, scans the recent episodic window for recurring (prev_tools, current_tools) pairs.
- Multi-tenant: recorder requires non-empty org_id at construction; restore() rejects cross-tenant snapshots and tampered snapshots (content-hash mismatch).
- Replay-safe: snapshot -> restore -> snapshot round-trip produces byte-identical content_hash.

### BaseAdapter integration

src/layerlens/instrument/adapters/_base/adapter.py

- Constructor builds self._memory_recorder = MemoryRecorder(org_id=self._org_id).
- New record_memory_turn(...) helper: best-effort wrapper that swallows recorder failures so memory persistence never breaks the host framework call stack (CLAUDE.md "tracing never breaks user code").
- memory_recorder property, memory_snapshot() and memory_snapshot_dict() convenience accessors.

### Per-adapter wiring (6 adapters)

- agno: Agent.run/arun finally-block; episodic input from args/kwargs; tool list from _collect_tool_names(result.messages).
- ms_agent_framework: Chat.invoke/invoke_stream finally-block; episodic input from kwargs; tool list from streamed message items.
- openai_agents: _on_agent_span_end (TraceProcessor) + on_run_end (Runner wrap); episodic input cached at span_start per span_id; tool list rolled up from _on_function_span_end per parent_id.
- llama_index: _on_agent_step_end; episodic input cached at step_start per thread id; tool list rolled up from _on_tool_call.
- google_adk: after_agent_callback + on_agent_end; episodic input cached at before_agent_callback per thread id; tool list rolled up from after_tool_callback per thread id.
- bedrock_agents: _after_invoke_agent (boto3 hook); episodic input cached at _before_invoke_agent per thread id; tool list rolled up from _process_trace action-group / KB step names.

Each adapter's serialize_for_replay() now embeds the snapshot under ReplayableTrace.metadata["memory_snapshot"] so replay engines can reconstruct memory state via MemorySnapshot.from_dict(...) -> recorder.restore(snapshot) before re-execution.

## Tests (57 new)

### tests/instrument/adapters/_base/test_memory.py (27 tests)

- Recorder construction: empty/non-string org_id rejected; zero buffer sizes rejected; initial state empty.
- Snapshot determinism: identical content -> identical hash; different org_id -> different hash; mutating recorder doesn't affect prior snapshot; to_dict/from_dict round-trip preserves hash; from_dict rejects missing required fields.
- Replay round-trip: snapshot -> restore -> snapshot byte-identical hash; deterministic next-state under matching inputs; cross-tenant restore raises; tampered-content-hash restore raises.
- Bounded eviction: episodic FIFO at cap; semantic LRU at cap; semantic overwrite refreshes LRU; procedural cap.
- Procedural detection: repeated tool sequences accumulate count; no-tool turns produce no patterns.
- Per-turn truncation: multi-megabyte values capped with deterministic suffix.
- Thread safety: 8 threads x 50 turns produces unbroken 1..400 sequence.
- Clear preserves binding; defaults positive; extra metadata sorted for hash determinism.

### tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py (30 tests, 5 x 6 adapters parametrized)

- Each adapter exposes a recorder bound to its org_id.
- record_memory_turn advances the episodic buffer.
- serialize_for_replay() embeds metadata["memory_snapshot"].
- Replay engine can restore the recorder from the serialised trace (content-hash match end-to-end).
- Cross-tenant snapshot is rejected at the per-adapter recorder boundary.

## Documentation

docs/adapters/memory-contract.md explains the three buckets, the contract (tenant binding, bounded buffers, tamper-evident snapshots, replay-safe round-trip, best-effort recording, thread safety), per-adapter wiring matrix, and audit hooks. Includes the replay-engine integration recipe and the honest scope disclosure for browser_use.

## Honest scope disclosure

The cross-pollination audit section 2 #1 enumerates seven target adapters. Six are wired here. The seventh, browser_use, does NOT exist on this PR's base branch (feat/instrument-multitenancy-org-id-propagation); it lives on the parallel feat/instrument-frameworks-browser-use-full history. It will be wired when that adapter is ported to this base or when the histories merge. This follows the same honest-disclosure pattern as PR #120 (state filters, which omitted ms_agent_framework for the same reason).

The future browser_use wiring (per audit section 2 #1) will be:

- Episodic: page navigation events (URL, action, selector)
- Procedural: recurring (prev_action, current_action) patterns
- Semantic: long-lived page-content cache keyed by URL/DOM hash

## Acceptance

- uv run pytest tests/instrument/adapters/_base/test_memory.py -x -> 27 passed
- uv run pytest tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -x -> 30 passed
- uv run pytest tests/instrument/adapters/_base/ -> 44 passed (no regressions)
- uv run pytest tests/instrument/adapters/frameworks/{agno,bedrock_agents,google_adk,llama_index,ms_agent_framework,openai_agents}_adapter.py -> 72 passed (no regressions)
- uv run mypy --strict src/layerlens/instrument/adapters/_base/memory.py -> Success: no issues found in 1 source file
- uv run mypy src/layerlens/instrument/adapters/_base/adapter.py src/layerlens/instrument/adapters/frameworks/{6 adapters}/lifecycle.py -> Success: no issues found in 7 source files
- uv run ruff check src/layerlens/instrument/adapters/_base/memory.py tests/instrument/adapters/_base/test_memory.py tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -> All checks passed!
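The snapshot contract described in the memory commit above (SHA-256 over a canonical JSON encoding, tenant binding, tamper rejection on restore, FIFO-bounded episodic buffer) can be sketched as follows. This is an assumption-laden reduction, not the real `MemoryRecorder`: it tracks only the episodic bucket, and `SketchSnapshot`, `SketchRecorder`, `content_hash`, `rejected`, and the buffer size of 4 are invented for the example.

```python
import hashlib
import json
import threading
from collections import deque
from dataclasses import dataclass, replace
from typing import Tuple


def content_hash(org_id: str, turn_index: int, episodic: Tuple[str, ...]) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        {"org_id": org_id, "turn_index": turn_index, "episodic": list(episodic)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class SketchSnapshot:
    org_id: str
    turn_index: int
    episodic: Tuple[str, ...]
    content_hash: str


class SketchRecorder:
    def __init__(self, org_id: str, max_episodic: int = 4) -> None:
        if not isinstance(org_id, str) or not org_id:
            raise ValueError("org_id must be a non-empty string")
        if max_episodic <= 0:
            raise ValueError("buffer size must be positive")
        self.org_id = org_id
        self._lock = threading.Lock()
        self._turn_index = 0
        self._episodic: deque = deque(maxlen=max_episodic)  # FIFO eviction

    def record_turn(self, text: str) -> None:
        with self._lock:
            self._turn_index += 1
            self._episodic.append(text)

    def snapshot(self) -> SketchSnapshot:
        with self._lock:
            episodic = tuple(self._episodic)
            digest = content_hash(self.org_id, self._turn_index, episodic)
            return SketchSnapshot(self.org_id, self._turn_index, episodic, digest)

    def restore(self, snap: SketchSnapshot) -> None:
        if snap.org_id != self.org_id:
            raise ValueError("cross-tenant snapshot rejected")
        if content_hash(snap.org_id, snap.turn_index, snap.episodic) != snap.content_hash:
            raise ValueError("content hash mismatch (tampered snapshot)")
        with self._lock:
            self._turn_index = snap.turn_index
            self._episodic.clear()
            self._episodic.extend(snap.episodic)


def rejected(recorder: SketchRecorder, snapshot: SketchSnapshot) -> bool:
    try:
        recorder.restore(snapshot)
    except ValueError:
        return True
    return False


source = SketchRecorder("org-a")
for i in range(6):
    source.record_turn(f"turn {i}")
snap = source.snapshot()

target = SketchRecorder("org-a")
target.restore(snap)
round_trip_ok = target.snapshot().content_hash == snap.content_hash
cross_tenant_rejected = rejected(SketchRecorder("org-b"), snap)
tampered_rejected = rejected(target, replace(snap, episodic=("forged",)))
```

Because the hash covers `org_id` as well as the content, two tenants with identical turns still produce different hashes in this sketch, in line with the determinism tests listed above.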
## Test plan

- `pytest tests/instrument/adapters/_base/test_state_filters.py -x`
- `pytest tests/instrument/adapters/_base/` (no regressions on resilience tests)
- `pytest tests/instrument/adapters/frameworks/` (no regressions on adapter tests)
- `mypy --strict src/layerlens/instrument/adapters/_base/state_filters.py`
- `mypy src/.../{adapters wired in this PR}.py`
- `ruff check`