feat(instrument): Error-aware event emission on framework exceptions across 10 lighter adapters (cross-poll #2) #115

Closed
mmercuri wants to merge 4 commits into development from feat/instrument-error-aware-emission

@mmercuri

Contributor

Summary

Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3.

Mature adapters (LangChain on_*_error, AutoGen wrappers.py:94-108, Agentforce trust layer) emit a structured event when framework callbacks raise — so the trace dashboard shows a real failure with exception type, message, and bounded traceback, instead of a hung "start" event with no matching "end". This PR ports that contract to all 10 lighter runtime adapters: agno, openai_agents, llama_index, google_adk, strands, pydantic_ai, smolagents, bedrock_agents, ms_agent_framework, embedding.

What's in the PR

Shared helper

  • src/layerlens/instrument/adapters/_base/errors.py: emit_error_event(adapter, exc, context, severity, event_type, org_id) with:
    • PII-safe message truncation (500 chars) + traceback truncation (8 frames / 4000 chars)
    • Secret-pattern scrubbing (api_key=, Bearer …, sk-…)
    • Allow-listed SAFE_CONTEXT_KEYS — only framework attribution keys propagate; PII-shaped keys are dropped
    • Multi-tenant org_id automatically resolved from adapter._stratix.org_id (or tenant_id); explicit override supported
    • Best-effort emission inside BaseAdapter.emit_dict_event — errors in error-handling never mask the original framework error
    • Preserves re-raise semantics — caller is responsible for raise
  • _base/capture.py — adds agent.error / tool.error / model.error to ALWAYS_ENABLED_EVENT_TYPES so error events are never silently dropped by capture-config gating
  • _base/__init__.py — re-exports the public surface
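
The helper's contract can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in errors.py: the allow-list contents, the regular expressions, the "agent.error" default, and the `emit_dict_event(event_type, payload)` call shape are all hypothetical, and the `severity` parameter is omitted.

```python
import re
import traceback
from typing import Any, Dict, Optional

MAX_MESSAGE_CHARS = 500
MAX_TB_FRAMES = 8
MAX_TB_CHARS = 4000

# Hypothetical allow-list; the real SAFE_CONTEXT_KEYS may contain other keys.
SAFE_CONTEXT_KEYS = frozenset({"framework", "phase", "tool_name", "model"})

# Illustrative patterns for the three secret shapes named above.
_SECRET_PATTERNS = [
    re.compile(r"(api_key\s*=\s*)\S+", re.IGNORECASE),
    re.compile(r"(Bearer\s+)\S+"),
    re.compile(r"()\bsk-[A-Za-z0-9_-]+"),
]


def scrub(text: str) -> str:
    """Replace secret-shaped substrings; running it twice changes nothing."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\g<1>[REDACTED]", text)
    return text


def build_error_payload(
    exc: BaseException,
    context: Optional[Dict[str, Any]] = None,
    org_id: Optional[str] = None,
) -> Dict[str, Any]:
    try:
        message = str(exc)
    except Exception:  # defend against a broken __str__
        message = "<unprintable exception>"
    # Keep only the innermost frames; an exception without a traceback yields "".
    frames = traceback.extract_tb(exc.__traceback__)[-MAX_TB_FRAMES:]
    tb_text = "".join(traceback.format_list(frames))
    payload: Dict[str, Any] = {
        "exception_type": type(exc).__name__,
        "message": scrub(message)[:MAX_MESSAGE_CHARS],
        "traceback": scrub(tb_text)[:MAX_TB_CHARS],
        # Keys outside the allow-list are dropped, never copied.
        "context": {k: v for k, v in (context or {}).items() if k in SAFE_CONTEXT_KEYS},
    }
    if org_id is not None:
        payload["org_id"] = org_id
    return payload


def emit_error_event(adapter, exc, context=None, event_type="agent.error", org_id=None):
    """Best-effort emit. Never raises and never re-raises: the caller does `raise`."""
    try:
        adapter.emit_dict_event(event_type, build_error_payload(exc, context, org_id))
    except Exception:
        pass  # a failure while reporting must not mask the framework error
```

Scrubbing runs before truncation so that a secret straddling the length cap cannot survive as a partial match.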

Per-adapter wiring (all 10 targets)

  • agno: both traced_run (async) + traced_run_sync; on_tool_use(error=…)
  • openai_agents: span-end handlers detect span.error on agent / generation / function spans (shared _emit_span_error helper); on_run_end / on_tool_use
  • llama_index: _handle_event checks event.exception on every routed event and emits model.error / tool.error / agent.error based on event-type prefix; on_agent_end / on_tool_use
  • google_adk: new _maybe_emit_callback_error helper inspects error / exception / error_message on callback context, llm_response, and tool_output; on_agent_end / on_tool_use
  • strands: _create_traced_call and on_tool_use
  • pydantic_ai: both async and sync run wrappers, plus on_tool_use
  • smolagents: _create_traced_run and on_tool_use
  • bedrock_agents: new _maybe_emit_invoke_error reads parsed.trace.failureTrace.failureReason and top-level errorMessage / errorCode; on_invoke_end / on_tool_use
  • ms_agent_framework: both traced_invoke (async generator) and traced_invoke_stream; on_tool_use
  • embedding: all three provider wrappers (OpenAI, Cohere, SentenceTransformer) catch upstream exceptions and re-raise
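
The emit-then-re-raise shape shared by the run wrappers can be sketched like this. The names `traced_sync`, `traced_async`, and the `emit(event_type, exc)` callback are hypothetical stand-ins for the per-adapter wrappers and the shared helper; only the control flow is the point.

```python
import functools
from typing import Any, Awaitable, Callable

EmitFn = Callable[[str, BaseException], None]


def traced_sync(fn: Callable[..., Any], emit: EmitFn) -> Callable[..., Any]:
    """Wrap a sync framework entry point: emit on failure, then re-raise."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            emit("agent.error", exc)
            raise  # bare raise keeps the original exception and traceback
    return wrapper


def traced_async(fn: Callable[..., Awaitable[Any]], emit: EmitFn) -> Callable[..., Awaitable[Any]]:
    """Same contract for a coroutine function."""
    @functools.wraps(fn)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return await fn(*args, **kwargs)
        except Exception as exc:
            emit("agent.error", exc)
            raise
    return wrapper
```

Because the wrapper uses a bare `raise`, the caller sees exactly the exception the framework produced; the emitted event is a side effect only.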

Tests

  • tests/instrument/adapters/_base/test_errors.py — 28 helper tests (default event type, exception metadata, org_id propagation/override/omission, message + traceback truncation, context-key filtering, secret redaction in message + tb, idempotency, re-raise pattern, custom event type, broken __str__ defense, no-traceback defense, circuit-breaker respect, trace-buffer participation, build_error_payload immutability, empty-framework defense, SAFE_CONTEXT_KEYS lint, internal-helper white-box)
  • 10 per-adapter test extensions + new test_embedding_adapter.py — 29 new tests covering both framework-callback and explicit on_* lifecycle paths, asserting event type / framework / phase / exception_type fields

CLAUDE.md commercial-grade compliance

  • No TODOs
  • All 10 adapters wired (no "planned backlog")
  • Errors RE-RAISED after emit
  • PII-safe (truncation + secret scrub + key allow-list)
  • Multi-tenant (org_id on every event)
  • No co-author trailers

Test plan

  • pytest tests/instrument/adapters/_base/test_errors.py -x — 28/28 pass
  • pytest tests/instrument/adapters/frameworks/ -x — 149/149 pass (no regression on 120 baseline + 29 new)
  • mypy --strict src/layerlens/instrument/adapters/_base/errors.py — clean
  • ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/ — clean
  • Reviewer @m-peko: confirm error-event payload schema matches policy.violation consumer expectations on the atlas-app side
  • Reviewer: confirm SAFE_CONTEXT_KEYS allow-list does not need additions for any tenant-specific adapter the team has internally

Cross-references

  • Audit item: A:/tmp/adapter-cross-pollination-audit.md §2.3 (origin: AutoGen wrappers.py:94-108, LangChain callbacks.py:402-436)
  • Sister PRs from the same wave (handoff std, importer hardening, SSE parser, manifest tier fix) are independent and can land in any order

Notes for reviewer

  • This PR is targeted at development (not main) since main does not yet contain src/layerlens/instrument/. Same base as the other adapter PRs.

mmercuri and others added 4 commits April 21, 2026 18:12
)

* SDK samples: 70+ production-ready samples, docs, and tests (rebased on main)

Rebased onto latest main (e8a8033) which includes:
- CLI with auth (PR #72)
- layerlens.instrument tracing + adapters (PR #66, #69)
- Scorers resource, integrations resource
- API naming convention fixes (PR #61)

No impact on samples: Stratix() constructor is backward-compatible,
use_bearer_auth defaults to False, all existing API signatures unchanged.

Samples include: core (18), industry (10), cowork (5), modalities (3),
integrations (2), cicd (2+workflow), openclaw (10+skill), mcp (1),
copilotkit (2+UI), claude-code skills (6), sample data (23 files).

469 non-live tests passing. 54 live tests available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove marc-only/ from tracking, add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move examples/cli/ to samples/cli/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add instrumentation and integration management samples from examples/

Copy 3 new files from examples/ that had no equivalent in samples/:
- samples/integrations/openai_instrumented.py (instrument_openai + @trace + span)
- samples/integrations/langchain_instrumented.py (LangChainCallbackHandler)
- samples/core/integration_management.py (client.integrations CRUD)

Update docs/instrumentation/providers.md and frameworks.md with Related Samples links.
Update samples/integrations/README.md and samples/core/README.md.
Update samples/README.md integrations count (2 → 4).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Consolidate examples/ into samples/: remove duplicates, integrate unique patterns

- Remove 14 examples/ files already covered by samples/core equivalents
- Create samples/core/benchmark_evaluation.py for model+benchmark workflow
  (evaluations.create → wait_for_completion → results.get/get_all)
- Integrate 12 unique patterns from remaining examples/ into samples/:
  - trace_evaluation.py: add get_results().steps iteration, get_many() without filter
  - compare_evaluations.py: add compare_models(), outcome_filter, result field access
  - judge_optimization.py: add BadRequestError catch, optimization result fields
  - model_benchmark_management.py: add models.add/remove, benchmarks.add/remove, filters
  - evaluation_filtering.py: document both camelCase and snake_case sort_by conventions
  - paginated_results.py: add results.get_by_id() alternative
  - public_catalog.py: add evaluation summary fields, get_prompts search/sort params
  - async_workflow.py: add evaluation instance methods (wait_for_completion_async, etc)
- Add Related Samples to docs/examples/creating-evaluations.md
- Add Related Samples to docs/instrumentation/providers.md and frameworks.md
- Update all READMEs for new files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove hardcoded retrieval score from rag_assessment.py (CLAUDE.md Rule 3)

The "0.92" similarity score was fabricated and displayed as if computed
by a real retrieval engine. Removed the fake score -- retrieval is by
document ID, and actual quality scoring comes from the judge evaluation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add per-sample SDK call assertions for all 58 samples (10/10 compliance)

Every sample now has specific assertions verifying which SDK methods
it calls (not just "didn't crash"). Covers:
- 20 core samples (benchmark_evaluation, integration_management added)
- 5 cowork samples (code_review, pair_programming, rag_assessment, etc)
- 3 modality samples (text, brand, document evaluation)
- 4 integration samples (openai/anthropic traced + instrumented)
- 2 cicd samples

Also adds mock setup for client.integrations and client.trace_evaluations.get_many.
495 non-live tests passing, 58 live tests deselected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove examples/ entirely, remap all 53 doc references to samples/

All example files have been either:
- Removed (14 duplicates already covered by samples/core equivalents)
- Removed after integrating unique patterns into samples/ (12 files)
- Replaced by samples/core/benchmark_evaluation.py (3 client workflow files)

Updated all 53 doc references in docs/examples/ to point to samples/core/.
Updated docs/examples/README.md with new file table.
examples/ directory no longer exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add comprehensive MCP server tests (29 tests)

Tests cover all 6 tool handlers, dispatch logic, error handling,
asyncio.to_thread wrapping, and helper functions:

- TestToolCatalogue: server creation and handler existence
- TestHandleListTraces: summary output, default limit, empty/null responses
- TestHandleGetTrace: detail output, not-found handling
- TestHandleRunEvaluation: creation output, failure handling
- TestHandleGetEvaluation: status+results, not-found, pending state
- TestHandleCreateJudge: creation output, failure handling
- TestHandleListJudges: list output, empty/null responses
- TestDispatchAndErrors: unknown tool, SDK exceptions, helper functions
- TestAsyncWrapping: all 5 handlers verified to use asyncio.to_thread

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix samples

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: m-peko <marinpeko5@gmail.com>

Bootstraps the LayerLens instrument layer with the abstract base classes,
adapter registry, capture configuration, event sinks, vendored event
schemas, and pydantic v1/v2 compatibility shim that every concrete
adapter (frameworks, protocols, providers) will depend on.

Scope
-----
- src/layerlens/instrument/__init__.py: lean re-export surface
- src/layerlens/instrument/_vendored/: frozen ateam event schemas (no
  runtime ateam dependency)
- src/layerlens/instrument/adapters/_base/: BaseAdapter, AdapterRegistry,
  AdapterStatus, AdapterHealth, AdapterCapability, ReplayableTrace,
  CaptureConfig, EventSink, TraceStoreSink, IngestionPipelineSink,
  PydanticCompat
- src/layerlens/_compat/pydantic.py: model_dump/model_validate shim
  spanning pydantic v1 + v2
- scripts/{port_adapter,port_protocol,emit_adapter_manifest,
  regen_dep_baselines}.py: codegen helpers used to port the rest of M1
- tests/instrument/{test_base_layer,test_lazy_imports,
  test_default_install,test_resolved_dep_tree}.py + _baselines/
- .github/workflows/dep-tree-guard.yaml: CI gate that locks the default
  install footprint
- docs/adapters/: CONTRIBUTING, STATUS, pydantic-compatibility, testing,
  PERSONA_REVIEW
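
A shim of the kind listed for _compat/pydantic.py can be sketched with duck typing. This is an assumption about the approach, not the file's contents; it relies only on the method names pydantic itself defines (`model_dump` / `model_validate` in v2, `dict` / `parse_obj` in v1).

```python
from typing import Any, Dict, Type


def model_dump(model: Any, **kwargs: Any) -> Dict[str, Any]:
    """Serialize a pydantic model under either major version."""
    if hasattr(model, "model_dump"):      # pydantic v2
        return model.model_dump(**kwargs)
    return model.dict(**kwargs)           # pydantic v1


def model_validate(model_cls: Type[Any], data: Dict[str, Any]) -> Any:
    """Build a model instance from a dict under either major version."""
    if hasattr(model_cls, "model_validate"):  # pydantic v2
        return model_cls.model_validate(data)
    return model_cls.parse_obj(data)          # pydantic v1
```

Checking for the v2 method first matters because v2 models still expose a deprecated `dict()` that emits warnings.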

Blast radius
------------
- Pure additions. No public surface changes outside the new
  layerlens.instrument namespace.
- Default `pip install layerlens` install set is unchanged (verified by
  test_default_install.py against the new baseline).
- Lazy adapter discovery: importing layerlens.instrument MUST NOT pull
  in any optional adapter dep (verified by test_lazy_imports.py).
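
One way to check the lazy-import guarantee is to import the package in a fresh interpreter and inspect sys.modules. The sketch below is an assumption about how such a check could look, not the contents of test_lazy_imports.py; `modules_pulled_in` is a hypothetical name.

```python
import json
import subprocess
import sys
from typing import Iterable, List


def modules_pulled_in(module: str, forbidden: Iterable[str]) -> List[str]:
    """Import `module` in a fresh interpreter; return the forbidden modules it loaded."""
    script = (
        "import importlib, json, sys\n"
        f"importlib.import_module({module!r})\n"
        f"forbidden = {sorted(forbidden)!r}\n"
        "print(json.dumps([m for m in forbidden if m in sys.modules]))\n"
    )
    # A subprocess is used so that modules already imported by the test
    # runner cannot hide or fake a violation. The child imports json and
    # importlib itself, so those two cannot be checked as forbidden names.
    result = subprocess.run(
        [sys.executable, "-c", script],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

A test would then assert that `modules_pulled_in("layerlens.instrument", [...])` returns an empty list for each optional framework package name.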

Test plan
---------
- uv run pytest tests/instrument/test_base_layer.py
  tests/instrument/test_lazy_imports.py -x  -> 45 passed
- The dep-tree-guard workflow exercises test_default_install.py and
  test_resolved_dep_tree.py against the new baselines on every PR.

LAY-3400 umbrella: this PR is the prerequisite for the M1.B/M1.C/M1.D
adapter ports, M7 protocol certification, and M8 Cohere/Mistral.

Ports the twelve agent-tier framework adapters from the ateam
reference implementation onto the new layerlens.instrument base layer:

  Semantic Kernel, LlamaIndex, OpenAI Agents, Pydantic-AI, Agno,
  Strands, SmolAgents, MS Agent Framework, Google ADK,
  Bedrock Agents, Embedding (vector store hooks), Benchmark Import

Pairs with feat/instrument-frameworks-orchestration (M1.C part 1)
which lands LangChain, LangGraph, CrewAI, AutoGen, Langfuse, and
Agentforce. Together they complete M1.C.

Scope
-----
- src/layerlens/instrument/adapters/frameworks/{semantic_kernel,
  llama_index,openai_agents,pydantic_ai,agno,strands,smolagents,
  ms_agent_framework,google_adk,bedrock_agents,embedding,
  benchmark_import}/: per-framework packages
- tests/instrument/adapters/frameworks/test_*_adapter.py + the
  test_bulk_ported_smoke.py harness (which exercises every ported
  adapter against canned trace fixtures so partial framework SDKs
  on a given runner don't drop coverage to zero)
- samples/instrument/<framework>/: runnable per-framework samples
- docs/adapters/frameworks-<framework>.md: per-framework integration
  guide
- pyproject.toml: twelve new optional extras
  (semantic-kernel, llama-index, openai-agents, pydantic-ai, agno,
  strands, smolagents, ms-agent-framework, google-adk,
  bedrock-agents, embedding, benchmark-import) with python_version
  markers; pyright/ruff exclusions for the dynamic monkey-patching
  framework code

Blast radius
------------
- Default `pip install layerlens` install set is unchanged. Each
  framework's heavy deps are gated behind their own extra.
- No changes to existing public API surface.
- Importing layerlens.instrument still does NOT pull in any framework
  module (lazy registry lookup).

Test plan
---------
- uv run pytest tests/instrument/adapters/frameworks/ -x  ->
  184 passed, 1 skipped (test_bulk_ported_smoke.py covers all 12
  agent-tier adapters plus the orchestration-tier ones from part 1
  via the same harness)

Stacks on
---------
- feat/instrument-base-foundation (M1.A) — required for the
  BaseAdapter surface this PR consumes.

Sibling of
----------
- feat/instrument-frameworks-orchestration (M1.C part 1) — both
  branches stack on the base foundation independently and don't
  conflict; they can land in either order.

LAY-3400 umbrella (M1.C part 2).

Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3.

When a framework callback raises (rate limit, API down, malformed prompt,
tool exception), the corresponding lifecycle event used to appear as a
"start" with no matching "end" — dashboards rendered this as a hung
request, not a failure. The mature adapters (LangChain on_*_error,
AutoGen wrappers.py:94-108, Agentforce trust layer) already surface the
exception as a structured event before re-raise; this PR ports that
contract to the ten lighter runtime adapters.

Shared helper
-------------
- `src/layerlens/instrument/adapters/_base/errors.py` — `emit_error_event()`
  with PII-safe message truncation (500 chars), traceback truncation
  (8 frames / 4000 chars), secret-pattern scrubbing (api_key=, Bearer,
  sk-*), allow-listed framework-context keys, and multi-tenant org_id
  propagation from the adapter's stratix client.
- `_base/capture.py` — adds `agent.error`, `tool.error`, `model.error`
  to ALWAYS_ENABLED_EVENT_TYPES so error events bypass capture-config
  gating (silent error drops are exactly the failure mode the helper
  exists to prevent).
- `_base/__init__.py` — re-exports the public surface.
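
The gating bypass can be illustrated with a simplified stand-in. The `CaptureConfig` below is hypothetical and far smaller than the real class; it only shows that the always-enabled check runs before any configurable gate.

```python
from dataclasses import dataclass
from typing import FrozenSet

ALWAYS_ENABLED_EVENT_TYPES: FrozenSet[str] = frozenset(
    {"agent.error", "tool.error", "model.error"}
)


@dataclass(frozen=True)
class CaptureConfig:
    """Simplified stand-in: a flat set of event types the user opted into."""
    enabled_event_types: FrozenSet[str] = frozenset()

    def is_enabled(self, event_type: str) -> bool:
        # Error events are checked first so no configuration can drop them.
        if event_type in ALWAYS_ENABLED_EVENT_TYPES:
            return True
        return event_type in self.enabled_event_types
```

With an empty configuration, `is_enabled("tool.error")` is still true while an ordinary event type such as "tool.call" is gated off.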

Per-adapter wiring (all 10 targets)
-----------------------------------
- agno: `_create_traced_run{,_sync}` emit policy.violation on raise;
  `on_tool_use` emits tool.error when the optional `error` kwarg is set.
- openai_agents: span-end handlers detect `span.error` on agent /
  generation / function spans and route through a shared
  `_emit_span_error` helper; `on_run_end` / `on_tool_use` likewise.
- llama_index: `_handle_event` checks for `event.exception` on every
  routed event and emits model.error / tool.error / agent.error based
  on the LlamaIndex event type prefix.
- google_adk: new `_maybe_emit_callback_error` helper inspects the
  callback context / llm_response / tool_output for `error`,
  `exception`, `error_message` attributes (or dict keys); also wires
  `on_agent_end` / `on_tool_use`.
- strands: `_create_traced_call` and `on_tool_use` mirror the agno
  pattern.
- pydantic_ai: both async and sync run wrappers, plus `on_tool_use`.
- smolagents: `_create_traced_run` and `on_tool_use`.
- bedrock_agents: `_after_invoke_agent` detects `parsed.failureTrace`
  and top-level SDK error keys via new `_maybe_emit_invoke_error`;
  `on_invoke_end` / `on_tool_use` also wired.
- ms_agent_framework: both `traced_invoke` (async generator) and
  `traced_invoke_stream` emit on raise; `on_tool_use` covers the
  programmatic surface.
- embedding: all three provider wrappers (OpenAI, Cohere,
  SentenceTransformer) catch upstream exceptions, emit model.error,
  and re-raise.

Re-raise semantics are preserved everywhere — the helper only emits;
callers are responsible for `raise`. PII safety is enforced by the
allow-list of context keys (no raw user input ever propagates), by the
secret pattern scrubber on message + traceback, and by hard length
caps. Multi-tenant deployments get org_id automatically from the
stratix client.
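
The org_id resolution described above might look like the following sketch. The function name `resolve_org_id` and the precedence (explicit override, then `org_id`, then `tenant_id`) are assumptions drawn from the PR summary, not a copy of the helper.

```python
from typing import Any, Optional


def resolve_org_id(adapter: Any, explicit: Optional[str] = None) -> Optional[str]:
    """Return the tenant identifier to stamp on an event, or None if unknown."""
    if explicit:
        return explicit  # an explicit override always wins
    client = getattr(adapter, "_stratix", None)
    for attr in ("org_id", "tenant_id"):
        value = getattr(client, attr, None)
        if value:
            return str(value)
    return None
```

Returning None rather than a placeholder lets the payload builder omit the field entirely when no tenant is known.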

Tests
-----
- `tests/instrument/adapters/_base/test_errors.py` — 28 tests covering
  helper correctness: default event type, exception type / message /
  module, org_id propagation + override + omission, message + traceback
  truncation, context-key filtering, secret redaction (api_key, Bearer,
  sk-*, idempotency), re-raise pattern, custom event type, exception
  with broken __str__, exception with no traceback, circuit-breaker
  respect, trace-buffer participation, build_error_payload immutability,
  empty-framework defense, SAFE_CONTEXT_KEYS lint, internal helper
  white-box tests.
- 10 per-adapter test extensions adding 29 new tests across the lighter
  adapters, verifying both the framework-callback path (raise →
  policy.violation / agent.error / tool.error / model.error) and the
  explicit `on_*` lifecycle hook path. New `test_embedding_adapter.py`
  is the first test file for the embedding wrapper.

Acceptance gates (all passing)
------------------------------
- `pytest tests/instrument/adapters/_base/test_errors.py -x` — 28/28
- `pytest tests/instrument/adapters/frameworks/ -x` — 149/149 (no
  regression on the 120 baseline tests + 29 new)
- `mypy --strict src/layerlens/instrument/adapters/_base/errors.py` —
  clean
- `ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/` —
  clean

mmercuri requested a review from m-peko April 27, 2026 00:08
m-peko pushed a commit that referenced this pull request May 12, 2026
…laceholder from #116) (#126)

Replaces the M7 placeholder shipped in PR #116 (truncation policy) with
the full BrowserUseAdapter — every lifecycle hook wired, every event
emitted, and every cross-cutting CLAUDE.md contract enforced from day one.

What changed
------------

Full lifecycle adapter (src/layerlens/instrument/adapters/frameworks/
browser_use/lifecycle.py):

* connect / disconnect / health_check / get_adapter_info /
  serialize_for_replay (all five abstract BaseAdapter methods).
* on_session_start, on_session_end, on_navigation, on_action,
  on_screenshot, on_dom_extraction, on_llm_call (every spec'd hook).
* Capability declaration: TRACE_TOOLS + TRACE_MODELS + TRACE_STATE +
  STREAMING + REPLAY (no longer the placeholder's TRACE_TOOLS-only set).
* Canonical events: browser.session.start, browser.navigate,
  browser.action, browser.screenshot, browser.dom.extract, tool.call,
  model.invoke, agent.input/output/state.change, cost.record,
  environment.config — plus agent.error / tool.error / model.error
  per the PR #115 error-aware emission contract.
* Per-callback resilience wrapper per PR #117 — observability errors
  NEVER crash the customer's agent, surfaced via resilience_snapshot().
* Multi-tenant org_id propagation per PR #118 — bound at construction
  (kwarg or resolved from stratix.org_id), stamped defensively on
  every emit, caller-supplied values overwritten to prevent
  cross-tenant leaks.
* Truncation policy from day one (DEFAULT_POLICY) — screenshot bytes
  DROPPED to deterministic SHA-256 references, DOM/HTML capped at
  16 KiB, prompts/completions/tool I/O at 4/2 KiB.
* Browser-event layer mapping (_BROWSER_EVENT_LAYERS) so unknown
  browser.* event types respect CaptureConfig gating without falling
  through the unknown-event-drops-by-default path.
* requires_pydantic = PydanticCompat.V2_ONLY (browser_use is a v2 lib).
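
The screenshot-drop and DOM-cap rules can be sketched as below. The function names and the shape of the reference dict are hypothetical; only SHA-256 hashing and the 16 KiB cap come from the description above.

```python
import hashlib
from typing import Dict, Tuple, Union

DOM_CAP_BYTES = 16 * 1024


def screenshot_reference(data: bytes) -> Dict[str, Union[str, int]]:
    """Replace raw screenshot bytes with a deterministic content reference."""
    return {
        "screenshot_ref": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


def cap_text(text: str, cap_bytes: int = DOM_CAP_BYTES) -> Tuple[str, bool]:
    """Cap text at cap_bytes of UTF-8; the flag reports whether it was cut."""
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True
```

Hashing gives the same reference for the same image across runs, so replayed traces can still correlate screenshots without storing the bytes.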

Public surface (src/layerlens/instrument/adapters/frameworks/
browser_use/__init__.py):

* ADAPTER_CLASS = BrowserUseAdapter (registry).
* instrument_agent(agent, stratix=, capture_config=, org_id=)
  one-liner returning the connected, wrapping adapter.
* STRATIXBrowserUseAdapter top-level binding (legacy alias) — fires
  DeprecationWarning on construction. Exposed as a static binding so
  the manifest consistency lint's AST walk finds it.
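
A legacy alias that warns on construction can be written as a plain subclass, which also keeps the name a static top-level binding. The classes below are stand-ins with a hypothetical constructor, not the shipped adapter.

```python
import warnings
from typing import Any, Optional


class BrowserUseAdapter:
    """Stand-in for the real adapter; only the constructor matters here."""

    def __init__(self, org_id: Optional[str] = None) -> None:
        self.org_id = org_id


class STRATIXBrowserUseAdapter(BrowserUseAdapter):
    """Deprecated alias kept for callers that still use the old name."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        warnings.warn(
            "STRATIXBrowserUseAdapter is deprecated; use BrowserUseAdapter",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(*args, **kwargs)
```

Defining the alias with a `class` statement, rather than building it dynamically, means a tool that walks the module's AST can see the name without importing anything.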

Pyproject:

* Adds 'browser-use' optional extra: browser-use>=0.1.0,<2 with the
  python_version >= '3.11' marker (browser_use's own constraint).

Tests (tests/instrument/adapters/frameworks/test_browser_use_adapter.py):

* Replaces the 7-test scaffold from #116 with 40 tests covering:
  wiring + alias + lifecycle round-trip + truncation (screenshot
  drop, hash determinism, HTML cap, short-payload no-audit) +
  multi-tenancy (kwarg, client attribute, defensive overwrite) +
  resilience (poison stratix, exploding agent attribute access) +
  error-aware emission (agent.error / tool.error / model.error) +
  per-hook coverage + sync + async wrapping + replay round-trip +
  10-case provider detection table.

Sample (samples/instrument/browser_use/{main.py,__init__.py,README.md}):

* Runs OFFLINE — no browser-use install, no Playwright, no API key,
  no network. Three-step duck-typed agent + happy/--fail paths
  exercise the full event surface and demonstrate screenshot drop +
  org_id stamping + agent.error emission before re-raise.

Doc (docs/adapters/frameworks-browser_use.md):

* Install + quickstart + capabilities matrix + 14-event reference
  table + truncation policy table + multi-tenancy + resilience +
  error-aware emission + capture config + browser_use specifics +
  BYOK + replay sections.

Manifest (scripts/emit_adapter_manifest.py):

* Promotes browser_use from _LIFECYCLE_PREVIEW to _MATURE — every
  required artifact (test file with >= 12 funcs, sample, doc,
  STRATIX→LayerLens deprecation alias) ships in this PR.

Verification
------------

* uv run pytest tests/instrument/adapters/frameworks/test_browser_use_adapter.py
  → 40 passed
* mypy --strict src/layerlens/instrument/adapters/frameworks/browser_use
  → Success: no issues found in 2 source files
* ruff check on src + test + script
  → All checks passed!
* Sample runs cleanly offline (happy + --fail)
* pip install -e .[browser-use] resolves cleanly (browser-use only
  pulled on Python 3.11+ per the env marker)
* tests/instrument/adapters/test_manifest_consistency.py::
  test_mature_adapters_have_required_artifacts[browser_use] passes
* Full instrument suite (excl. pre-existing crewai/protocols
  references not on this branch): 312 passed, 1 skipped, 12 xfailed

m-peko closed this May 21, 2026
2 participants