feat(instrument): Error-aware event emission on framework exceptions across 10 lighter adapters (cross-poll #2) by mmercuri · Pull Request #115 · LayerLens/stratix-python

mmercuri · 2026-04-27T00:08:20Z

Summary

Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3.

Mature adapters (LangChain on_*_error, AutoGen wrappers.py:94-108, Agentforce trust layer) emit a structured event when framework callbacks raise — so the trace dashboard shows a real failure with exception type, message, and bounded traceback, instead of a hung "start" event with no matching "end". This PR ports that contract to all 10 lighter runtime adapters: agno, openai_agents, llama_index, google_adk, strands, pydantic_ai, smolagents, bedrock_agents, ms_agent_framework, embedding.

What's in the PR

Shared helper

src/layerlens/instrument/adapters/_base/errors.py — emit_error_event(adapter, exc, context, severity, event_type, org_id) with:
- PII-safe message truncation (500 chars) + traceback truncation (8 frames / 4000 chars)
- Secret-pattern scrubbing (api_key=, Bearer …, sk-…)
- Allow-listed SAFE_CONTEXT_KEYS — only framework attribution keys propagate; PII-shaped keys are dropped
- Multi-tenant org_id automatically resolved from adapter._stratix.org_id (or tenant_id); explicit override supported
- Best-effort emission inside BaseAdapter.emit_dict_event — errors in error-handling never mask the original framework error
- Preserves re-raise semantics — caller is responsible for raise
_base/capture.py — adds agent.error / tool.error / model.error to ALWAYS_ENABLED_EVENT_TYPES so error events are never silently dropped by capture-config gating
_base/__init__.py — re-exports the public surface
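
The sanitisation steps listed for the shared helper can be sketched roughly as follows. This is a minimal illustration, not the code in errors.py: the regexes, the contents of SAFE_CONTEXT_KEYS, the function signatures, and the choice to keep the innermost traceback frames are all assumptions; only the caps (500 chars, 8 frames, 4000 chars) and the attribute names `_stratix`, `org_id`, and `tenant_id` come from the description above.

```python
import re
import traceback
from typing import Any, Dict, Mapping, Optional

# Caps quoted in the PR description.
MAX_MESSAGE_CHARS = 500
MAX_TRACEBACK_FRAMES = 8
MAX_TRACEBACK_CHARS = 4000

# Hypothetical allow-list; the real SAFE_CONTEXT_KEYS may contain other keys.
SAFE_CONTEXT_KEYS = frozenset({"framework", "phase", "tool_name", "model"})

# Hypothetical regexes for the three secret shapes the PR names.
_SECRET_PATTERNS = (
    (re.compile(r"(?i)(api_key\s*=\s*)\S+"), r"\1[REDACTED]"),
    (re.compile(r"(?i)(bearer\s+)\S+"), r"\1[REDACTED]"),
    (re.compile(r"\bsk-[A-Za-z0-9_-]+"), "[REDACTED]"),
)


def scrub(text: str) -> str:
    """Replace secret-shaped substrings with a fixed marker."""
    for pattern, replacement in _SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def safe_message(exc: BaseException) -> str:
    """Scrub first, then truncate, so a cut-off secret cannot slip through."""
    try:
        raw = str(exc)
    except Exception:  # defends against exceptions with a broken __str__
        raw = "<unprintable exception>"
    return scrub(raw)[:MAX_MESSAGE_CHARS]


def safe_traceback(exc: BaseException) -> str:
    """Return a bounded, scrubbed traceback, or "" when none is attached."""
    if exc.__traceback__ is None:
        return ""
    # A negative limit keeps the innermost frames (an assumption).
    frames = traceback.format_tb(exc.__traceback__, limit=-MAX_TRACEBACK_FRAMES)
    return scrub("".join(frames))[:MAX_TRACEBACK_CHARS]


def resolve_org_id(adapter: Any, override: Optional[str] = None) -> Optional[str]:
    """Explicit override wins; otherwise read org_id, then tenant_id."""
    if override is not None:
        return override
    client = getattr(adapter, "_stratix", None)
    return getattr(client, "org_id", None) or getattr(client, "tenant_id", None)


def build_error_payload(
    adapter: Any,
    exc: BaseException,
    context: Optional[Mapping[str, Any]] = None,
    org_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a new payload dict; the caller's context is never mutated."""
    payload: Dict[str, Any] = {
        "exception_type": type(exc).__name__,
        "message": safe_message(exc),
        "traceback": safe_traceback(exc),
        # Only allow-listed keys propagate; everything else is dropped.
        "context": {k: v for k, v in (context or {}).items() if k in SAFE_CONTEXT_KEYS},
    }
    resolved = resolve_org_id(adapter, org_id)
    if resolved is not None:
        payload["org_id"] = resolved
    return payload
```

Scrubbing before truncation is a deliberate ordering in this sketch: truncating first could cut a secret in half and leave a fragment that no longer matches any pattern.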

Per-adapter wiring (all 10 targets)

agno: both traced_run (async) + traced_run_sync; on_tool_use(error=…)
openai_agents: span-end handlers detect span.error on agent / generation / function spans (shared _emit_span_error helper); on_run_end / on_tool_use
llama_index: _handle_event checks event.exception on every routed event and emits model.error / tool.error / agent.error based on event-type prefix; on_agent_end / on_tool_use
google_adk: new _maybe_emit_callback_error helper inspects error / exception / error_message on callback context, llm_response, and tool_output; on_agent_end / on_tool_use
strands: _create_traced_call and on_tool_use
pydantic_ai: both async and sync run wrappers, plus on_tool_use
smolagents: _create_traced_run and on_tool_use
bedrock_agents: new _maybe_emit_invoke_error reads parsed.trace.failureTrace.failureReason and top-level errorMessage / errorCode; on_invoke_end / on_tool_use
ms_agent_framework: both traced_invoke (async generator) and traced_invoke_stream; on_tool_use
embedding: all three provider wrappers (OpenAI, Cohere, SentenceTransformer) catch upstream exceptions, emit model.error, and re-raise
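
The wiring pattern shared by these adapters (catch, emit, re-raise) might look like the sketch below. It is illustrative only: `RecordingAdapter`, the `traced` decorator, and the two-argument `emit_dict_event(event_type, payload)` signature are hypothetical stand-ins, and the real `emit_error_event` takes more parameters than shown here. Only the synchronous case is covered.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple


class RecordingAdapter:
    """Hypothetical stand-in for an adapter; it just records emitted events."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, Any]]] = []

    def emit_dict_event(self, event_type: str, payload: Dict[str, Any]) -> None:
        self.events.append((event_type, payload))


def emit_error_event(
    adapter: Any, exc: BaseException, context: Dict[str, Any], event_type: str
) -> None:
    """Best-effort emission: a failing sink must not mask the original error."""
    try:
        payload = {"exception_type": type(exc).__name__, **context}
        adapter.emit_dict_event(event_type, payload)
    except Exception:
        pass  # swallow errors raised while reporting an error


def traced(adapter: Any, framework: str, event_type: str) -> Callable:
    """Wrap a framework entry point so failures become structured events."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                emit_error_event(
                    adapter, exc, {"framework": framework, "phase": "run"}, event_type
                )
                raise  # the helper only emits; the caller re-raises

        return wrapper

    return decorator


if __name__ == "__main__":
    adapter = RecordingAdapter()

    @traced(adapter, framework="agno", event_type="agent.error")
    def run_agent() -> None:
        raise RuntimeError("rate limited")

    try:
        run_agent()
    except RuntimeError:
        print(adapter.events)
```

The bare `raise` inside the `except` block keeps the original exception and traceback intact, so the customer's own error handling sees exactly what the framework raised.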

Tests

tests/instrument/adapters/_base/test_errors.py — 28 helper tests (default event type, exception metadata, org_id propagation/override/omission, message + traceback truncation, context-key filtering, secret redaction in message + tb, idempotency, re-raise pattern, custom event type, broken __str__ defense, no-traceback defense, circuit-breaker respect, trace-buffer participation, build_error_payload immutability, empty-framework defense, SAFE_CONTEXT_KEYS lint, internal-helper white-box)
10 per-adapter test extensions + new test_embedding_adapter.py — 29 new tests covering both framework-callback and explicit on_* lifecycle paths, asserting event type / framework / phase / exception_type fields

CLAUDE.md commercial-grade compliance

No TODOs
All 10 adapters wired (no "planned backlog")
Errors RE-RAISED after emit
PII-safe (truncation + secret scrub + key allow-list)
Multi-tenant (org_id on every event)
No co-author trailers

Test plan

pytest tests/instrument/adapters/_base/test_errors.py -x — 28/28 pass
pytest tests/instrument/adapters/frameworks/ -x — 149/149 pass (no regression on 120 baseline + 29 new)
mypy --strict src/layerlens/instrument/adapters/_base/errors.py — clean
ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/ — clean
Reviewer @m-peko: confirm error-event payload schema matches policy.violation consumer expectations on the atlas-app side
Reviewer: confirm the SAFE_CONTEXT_KEYS allow-list needs no additions for any internal, tenant-specific adapters

Cross-references

Audit item: A:/tmp/adapter-cross-pollination-audit.md §2.3 (origin: AutoGen wrappers.py:94-108, LangChain callbacks.py:402-436)
Sister PRs from the same wave (handoff std, importer hardening, SSE parser, manifest tier fix) are independent and can land in any order

Notes for reviewer

This PR targets development (not main) because main does not yet contain src/layerlens/instrument/. It shares the same base as the other adapter PRs.

* SDK samples: 70+ production-ready samples, docs, and tests (rebased on main)

  Rebased onto latest main (e8a8033) which includes:
  - CLI with auth (PR #72)
  - layerlens.instrument tracing + adapters (PR #66, #69)
  - Scorers resource, integrations resource
  - API naming convention fixes (PR #61)

  No impact on samples: Stratix() constructor is backward-compatible, use_bearer_auth defaults to False, all existing API signatures unchanged.

  Samples include: core (18), industry (10), cowork (5), modalities (3), integrations (2), cicd (2+workflow), openclaw (10+skill), mcp (1), copilotkit (2+UI), claude-code skills (6), sample data (23 files).

  469 non-live tests passing. 54 live tests available.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove marc-only/ from tracking, add to .gitignore

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move examples/cli/ to samples/cli/

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add instrumentation and integration management samples from examples/

  Copy 3 new files from examples/ that had no equivalent in samples/:
  - samples/integrations/openai_instrumented.py (instrument_openai + @trace + span)
  - samples/integrations/langchain_instrumented.py (LangChainCallbackHandler)
  - samples/core/integration_management.py (client.integrations CRUD)

  Update docs/instrumentation/providers.md and frameworks.md with Related Samples links.
  Update samples/integrations/README.md and samples/core/README.md.
  Update samples/README.md integrations count (2 → 4).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Consolidate examples/ into samples/: remove duplicates, integrate unique patterns

  - Remove 14 examples/ files already covered by samples/core equivalents
  - Create samples/core/benchmark_evaluation.py for model+benchmark workflow (evaluations.create → wait_for_completion → results.get/get_all)
  - Integrate 12 unique patterns from remaining examples/ into samples/:
    - trace_evaluation.py: add get_results().steps iteration, get_many() without filter
    - compare_evaluations.py: add compare_models(), outcome_filter, result field access
    - judge_optimization.py: add BadRequestError catch, optimization result fields
    - model_benchmark_management.py: add models.add/remove, benchmarks.add/remove, filters
    - evaluation_filtering.py: document both camelCase and snake_case sort_by conventions
    - paginated_results.py: add results.get_by_id() alternative
    - public_catalog.py: add evaluation summary fields, get_prompts search/sort params
    - async_workflow.py: add evaluation instance methods (wait_for_completion_async, etc)
  - Add Related Samples to docs/examples/creating-evaluations.md
  - Add Related Samples to docs/instrumentation/providers.md and frameworks.md
  - Update all READMEs for new files

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove hardcoded retrieval score from rag_assessment.py (CLAUDE.md Rule 3)

  The "0.92" similarity score was fabricated and displayed as if computed by a real retrieval engine. Removed the fake score -- retrieval is by document ID, and actual quality scoring comes from the judge evaluation.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add per-sample SDK call assertions for all 58 samples (10/10 compliance)

  Every sample now has specific assertions verifying which SDK methods it calls (not just "didn't crash"). Covers:
  - 20 core samples (benchmark_evaluation, integration_management added)
  - 5 cowork samples (code_review, pair_programming, rag_assessment, etc)
  - 3 modality samples (text, brand, document evaluation)
  - 4 integration samples (openai/anthropic traced + instrumented)
  - 2 cicd samples

  Also adds mock setup for client.integrations and client.trace_evaluations.get_many.

  495 non-live tests passing, 58 live tests deselected.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove examples/ entirely, remap all 53 doc references to samples/

  All example files have been either:
  - Removed (14 duplicates already covered by samples/core equivalents)
  - Removed after integrating unique patterns into samples/ (12 files)
  - Replaced by samples/core/benchmark_evaluation.py (3 client workflow files)

  Updated all 53 doc references in docs/examples/ to point to samples/core/.
  Updated docs/examples/README.md with new file table.
  examples/ directory no longer exists.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add comprehensive MCP server tests (29 tests)

  Tests cover all 6 tool handlers, dispatch logic, error handling, asyncio.to_thread wrapping, and helper functions:
  - TestToolCatalogue: server creation and handler existence
  - TestHandleListTraces: summary output, default limit, empty/null responses
  - TestHandleGetTrace: detail output, not-found handling
  - TestHandleRunEvaluation: creation output, failure handling
  - TestHandleGetEvaluation: status+results, not-found, pending state
  - TestHandleCreateJudge: creation output, failure handling
  - TestHandleListJudges: list output, empty/null responses
  - TestDispatchAndErrors: unknown tool, SDK exceptions, helper functions
  - TestAsyncWrapping: all 5 handlers verified to use asyncio.to_thread

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix samples

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: m-peko <marinpeko5@gmail.com>

Bootstraps the LayerLens instrument layer with the abstract base classes, adapter registry, capture configuration, event sinks, vendored event schemas, and pydantic v1/v2 compatibility shim that every concrete adapter (frameworks, protocols, providers) will depend on.

Scope
-----
- src/layerlens/instrument/__init__.py: lean re-export surface
- src/layerlens/instrument/_vendored/: frozen ateam event schemas (no runtime ateam dependency)
- src/layerlens/instrument/adapters/_base/: BaseAdapter, AdapterRegistry, AdapterStatus, AdapterHealth, AdapterCapability, ReplayableTrace, CaptureConfig, EventSink, TraceStoreSink, IngestionPipelineSink, PydanticCompat
- src/layerlens/_compat/pydantic.py: model_dump/model_validate shim spanning pydantic v1 + v2
- scripts/{port_adapter,port_protocol,emit_adapter_manifest,regen_dep_baselines}.py: codegen helpers used to port the rest of M1
- tests/instrument/{test_base_layer,test_lazy_imports,test_default_install,test_resolved_dep_tree}.py + _baselines/
- .github/workflows/dep-tree-guard.yaml: CI gate that locks the default install footprint
- docs/adapters/: CONTRIBUTING, STATUS, pydantic-compatibility, testing, PERSONA_REVIEW

Blast radius
------------
- Pure additions. No public surface changes outside the new layerlens.instrument namespace.
- Default `pip install layerlens` install set is unchanged (verified by test_default_install.py against the new baseline).
- Lazy adapter discovery: importing layerlens.instrument MUST NOT pull in any optional adapter dep (verified by test_lazy_imports.py).

Test plan
---------
- uv run pytest tests/instrument/test_base_layer.py tests/instrument/test_lazy_imports.py -x -> 45 passed
- The dep-tree-guard workflow exercises test_default_install.py and test_resolved_dep_tree.py against the new baselines on every PR.

LAY-3400 umbrella: this PR is the prerequisite for the M1.B/M1.C/M1.D adapter ports, M7 protocol certification, and M8 Cohere/Mistral.

Ports the twelve agent-tier framework adapters from the ateam reference implementation onto the new layerlens.instrument base layer: Semantic Kernel, LlamaIndex, OpenAI Agents, Pydantic-AI, Agno, Strands, SmolAgents, MS Agent Framework, Google ADK, Bedrock Agents, Embedding (vector store hooks), Benchmark Import.

Pairs with feat/instrument-frameworks-orchestration (M1.C part 1) which lands LangChain, LangGraph, CrewAI, AutoGen, Langfuse, and Agentforce. Together they complete M1.C.

Scope
-----
- src/layerlens/instrument/adapters/frameworks/{semantic_kernel,llama_index,openai_agents,pydantic_ai,agno,strands,smolagents,ms_agent_framework,google_adk,bedrock_agents,embedding,benchmark_import}/: per-framework packages
- tests/instrument/adapters/frameworks/test_*_adapter.py + the test_bulk_ported_smoke.py harness (which exercises every ported adapter against canned trace fixtures so partial framework SDKs on a given runner don't drop coverage to zero)
- samples/instrument/<framework>/: runnable per-framework samples
- docs/adapters/frameworks-<framework>.md: per-framework integration guide
- pyproject.toml: twelve new optional extras (semantic-kernel, llama-index, openai-agents, pydantic-ai, agno, strands, smolagents, ms-agent-framework, google-adk, bedrock-agents, embedding, benchmark-import) with python_version markers; pyright/ruff exclusions for the dynamic monkey-patching framework code

Blast radius
------------
- Default `pip install layerlens` install set is unchanged. Each framework's heavy deps are gated behind their own extra.
- No changes to existing public API surface.
- Importing layerlens.instrument still does NOT pull in any framework module (lazy registry lookup).

Test plan
---------
- uv run pytest tests/instrument/adapters/frameworks/ -x -> 184 passed, 1 skipped (test_bulk_ported_smoke.py covers all 12 agent-tier adapters plus the orchestration-tier ones from part 1 via the same harness)

Stacks on
---------
- feat/instrument-base-foundation (M1.A): required for the BaseAdapter surface this PR consumes.

Sibling of
----------
- feat/instrument-frameworks-orchestration (M1.C part 1): both branches stack on the base foundation independently and don't conflict; they can land in either order.

LAY-3400 umbrella (M1.C part 2).

Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3.

When a framework callback raises (rate limit, API down, malformed prompt, tool exception), the corresponding lifecycle event used to appear as a "start" with no matching "end"; dashboards rendered this as a hung request, not a failure. The mature adapters (LangChain on_*_error, AutoGen wrappers.py:94-108, Agentforce trust layer) already surface the exception as a structured event before re-raise; this PR ports that contract to the ten lighter runtime adapters.

Shared helper
-------------
- `src/layerlens/instrument/adapters/_base/errors.py`: `emit_error_event()` with PII-safe message truncation (500 chars), traceback truncation (8 frames / 4000 chars), secret-pattern scrubbing (api_key=, Bearer, sk-*), allow-listed framework-context keys, and multi-tenant org_id propagation from the adapter's stratix client.
- `_base/capture.py`: adds `agent.error`, `tool.error`, `model.error` to ALWAYS_ENABLED_EVENT_TYPES so error events bypass capture-config gating (silent error drops are exactly the failure mode the helper exists to prevent).
- `_base/__init__.py`: re-exports the public surface.

Per-adapter wiring (all 10 targets)
-----------------------------------
- agno: `_create_traced_run{,_sync}` emit policy.violation on raise; `on_tool_use` emits tool.error when the optional `error` kwarg is set.
- openai_agents: span-end handlers detect `span.error` on agent / generation / function spans and route through a shared `_emit_span_error` helper; `on_run_end` / `on_tool_use` likewise.
- llama_index: `_handle_event` checks for `event.exception` on every routed event and emits model.error / tool.error / agent.error based on the LlamaIndex event type prefix.
- google_adk: new `_maybe_emit_callback_error` helper inspects the callback context / llm_response / tool_output for `error`, `exception`, `error_message` attributes (or dict keys); also wires `on_agent_end` / `on_tool_use`.
- strands: `_create_traced_call` and `on_tool_use` mirror the agno pattern.
- pydantic_ai: both async and sync run wrappers, plus `on_tool_use`.
- smolagents: `_create_traced_run` and `on_tool_use`.
- bedrock_agents: `_after_invoke_agent` detects `parsed.failureTrace` and top-level SDK error keys via new `_maybe_emit_invoke_error`; `on_invoke_end` / `on_tool_use` also wired.
- ms_agent_framework: both `traced_invoke` (async generator) and `traced_invoke_stream` emit on raise; `on_tool_use` covers the programmatic surface.
- embedding: all three provider wrappers (OpenAI, Cohere, SentenceTransformer) catch upstream exceptions, emit model.error, and re-raise.

Re-raise semantics are preserved everywhere: the helper only emits; callers are responsible for `raise`. PII safety is enforced by the allow-list of context keys (no raw user input ever propagates), by the secret pattern scrubber on message + traceback, and by hard length caps. Multi-tenant deployments get org_id automatically from the stratix client.

Tests
-----
- `tests/instrument/adapters/_base/test_errors.py`: 28 tests covering helper correctness: default event type, exception type / message / module, org_id propagation + override + omission, message + traceback truncation, context-key filtering, secret redaction (api_key, Bearer, sk-*, idempotency), re-raise pattern, custom event type, exception with broken __str__, exception with no traceback, circuit-breaker respect, trace-buffer participation, build_error_payload immutability, empty-framework defense, SAFE_CONTEXT_KEYS lint, internal helper white-box tests.
- 10 per-adapter test extensions adding 32 new tests across the lighter adapters, verifying both the framework-callback path (raise → policy.violation / agent.error / tool.error / model.error) and the explicit `on_*` lifecycle hook path. New `test_embedding_adapter.py` is the first test file for the embedding wrapper.

Acceptance gates (all passing)
------------------------------
- `pytest tests/instrument/adapters/_base/test_errors.py -x`: 28/28
- `pytest tests/instrument/adapters/frameworks/ -x`: 149/149 (no regression on the 120 baseline tests + 29 new)
- `mypy --strict src/layerlens/instrument/adapters/_base/errors.py`: clean
- `ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/`: clean

…laceholder from #116) (#126)

Replaces the M7 placeholder shipped in PR #116 (truncation policy) with the full BrowserUseAdapter: every lifecycle hook wired, every event emitted, and every cross-cutting CLAUDE.md contract enforced from day one.

What changed
------------
Full lifecycle adapter (src/layerlens/instrument/adapters/frameworks/browser_use/lifecycle.py):
* connect / disconnect / health_check / get_adapter_info / serialize_for_replay (all five abstract BaseAdapter methods).
* on_session_start, on_session_end, on_navigation, on_action, on_screenshot, on_dom_extraction, on_llm_call (every spec'd hook).
* Capability declaration: TRACE_TOOLS + TRACE_MODELS + TRACE_STATE + STREAMING + REPLAY (no longer the placeholder's TRACE_TOOLS-only set).
* Canonical events: browser.session.start, browser.navigate, browser.action, browser.screenshot, browser.dom.extract, tool.call, model.invoke, agent.input/output/state.change, cost.record, environment.config, plus agent.error / tool.error / model.error per the PR #115 error-aware emission contract.
* Per-callback resilience wrapper per PR #117: observability errors NEVER crash the customer's agent, surfaced via resilience_snapshot().
* Multi-tenant org_id propagation per PR #118: bound at construction (kwarg or resolved from stratix.org_id), stamped defensively on every emit, caller-supplied values overwritten to prevent cross-tenant leaks.
* Truncation policy from day one (DEFAULT_POLICY): screenshot bytes DROPPED to deterministic SHA-256 references, DOM/HTML capped at 16 KiB, prompts/completions/tool I/O at 4/2 KiB.
* Browser-event layer mapping (_BROWSER_EVENT_LAYERS) so unknown browser.* event types respect CaptureConfig gating without falling through the unknown-event-drops-by-default path.
* requires_pydantic = PydanticCompat.V2_ONLY (browser_use is a v2 lib).

Public surface (src/layerlens/instrument/adapters/frameworks/browser_use/__init__.py):
* ADAPTER_CLASS = BrowserUseAdapter (registry).
* instrument_agent(agent, stratix=, capture_config=, org_id=) one-liner returning the connected, wrapping adapter.
* STRATIXBrowserUseAdapter top-level binding (legacy alias): fires DeprecationWarning on construction. Exposed as a static binding so the manifest consistency lint's AST walk finds it.

Pyproject:
* Adds 'browser-use' optional extra: browser-use>=0.1.0,<2 with the python_version >= '3.11' marker (browser_use's own constraint).

Tests (tests/instrument/adapters/frameworks/test_browser_use_adapter.py):
* Replaces the 7-test scaffold from #116 with 40 tests covering: wiring + alias + lifecycle round-trip + truncation (screenshot drop, hash determinism, HTML cap, short-payload no-audit) + multi-tenancy (kwarg, client attribute, defensive overwrite) + resilience (poison stratix, exploding agent attribute access) + error-aware emission (agent.error / tool.error / model.error) + per-hook coverage + sync + async wrapping + replay round-trip + 10-case provider detection table.

Sample (samples/instrument/browser_use/{main.py,__init__.py,README.md}):
* Runs OFFLINE: no browser-use install, no Playwright, no API key, no network. Three-step duck-typed agent + happy/--fail paths exercise the full event surface and demonstrate screenshot drop + org_id stamping + agent.error emission before re-raise.

Doc (docs/adapters/frameworks-browser_use.md):
* Install + quickstart + capabilities matrix + 14-event reference table + truncation policy table + multi-tenancy + resilience + error-aware emission + capture config + browser_use specifics + BYOK + replay sections.

Manifest (scripts/emit_adapter_manifest.py):
* Promotes browser_use from _LIFECYCLE_PREVIEW to _MATURE: every required artifact (test file with >= 12 funcs, sample, doc, STRATIX→LayerLens deprecation alias) ships in this PR.

Verification
------------
* uv run pytest tests/instrument/adapters/frameworks/test_browser_use_adapter.py → 40 passed
* mypy --strict src/layerlens/instrument/adapters/frameworks/browser_use → Success: no issues found in 2 source files
* ruff check on src + test + script → All checks passed!
* Sample runs cleanly offline (happy + --fail)
* pip install -e .[browser-use] resolves cleanly (browser-use only pulled on Python 3.11+ per the env marker)
* tests/instrument/adapters/test_manifest_consistency.py::test_mature_adapters_have_required_artifacts[browser_use] passes
* Full instrument suite (excl. pre-existing crewai/protocols references not on this branch): 312 passed, 1 skipped, 12 xfailed

mmercuri and others added 4 commits April 21, 2026 18:12

mmercuri requested a review from m-peko April 27, 2026 00:08

mmercuri mentioned this pull request Apr 27, 2026

feat(instrument): browser_use adapter full implementation (replaces placeholder from #116) #126

Merged

7 tasks

m-peko closed this May 21, 2026

mmercuri wants to merge 4 commits into development from feat/instrument-error-aware-emission

2 participants
