Explain why your agent failed.
AgentLens is a local-first debugger for LLM agents. It is built for the moment when a trace tells you what happened, but not where the run actually started going wrong.
The screenshot above is based on a real LangGraph-backed demo run traced by AgentLens:
- the model decides to call `weather_snapshot`
- the tool returns fresh evidence: `Shanghai: rain`
- the final answer updates to match the tool result
That is the core product idea:
- failure explanation — where the run likely went wrong
- tool and memory evidence — what influenced the outcome
- run divergence — where run B started behaving differently from run A
Most tools help you log traces. AgentLens is being built to help you answer the harder question:
Why did this agent make the wrong decision?
The latest alpha can already trace a real LangGraph runtime and render it into a debugging view that surfaces:
- runtime overview
- model turns
- tool evidence
- final answer
- failure chain and suspicious signals when they exist
To reproduce the LangGraph demo:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
python3 -m pip install -e ".[langgraph]"
python3 cli.py demo langgraph
python3 cli.py view

Most LLM observability tools are good at showing traces. Agent systems need more than traces — they need failure explanation.
Real agent failures are often caused by:
- tool output being misinterpreted
- recalled memory conflicting with fresh evidence
- the run diverging from a previously good trajectory
- the true failure starting earlier than the final bad answer
External signals point in the same direction:
- practitioners increasingly talk about agent debugging and replayability as missing layers
- recent work like Microsoft Research's AgentRx frames the problem as locating the critical failure step and the root cause in agent trajectories
AgentLens is built around that wedge: explain where an agent run went wrong, and why.
AgentLens is not trying to replace Langfuse, LangSmith, or Helicone head-on.
Instead, it starts with a sharper wedge:
- agent runtime debugging
- memory observability
- run replay + regression diff
- capture one agent run as structured events
- store traces locally
- generate a failure summary for the latest run
- render a small local HTML debugging view
- compare two runs and surface the first divergence
- highlight suspicious signals such as memory/tool conflicts
Planned next:
- stronger failure heuristics
- explicit stale-memory / conflicting-evidence summaries
- improved divergence rendering
- better replay-oriented inspection flow
- richer memory attribution
- run bundles for sharing/debugging
- framework adapters
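The "capture one agent run as structured events, store traces locally" wedge can be illustrated with a tiny local JSONL sketch. The event shape here (`type` / `run_id` / `ts` / `payload`) is a hypothetical approximation for illustration, not AgentLens's actual schema:

```python
import json
import time
import uuid
from pathlib import Path

TRACE = Path("traces/demo_run.jsonl")  # hypothetical local trace location
TRACE.parent.mkdir(exist_ok=True)
TRACE.write_text("")  # start a fresh trace for the demo

def emit(event_type: str, run_id: str, payload: dict) -> None:
    """Append one structured event to the local JSONL trace."""
    event = {"type": event_type, "run_id": run_id, "ts": time.time(), "payload": payload}
    with TRACE.open("a") as f:
        f.write(json.dumps(event) + "\n")

run_id = uuid.uuid4().hex
emit("run.start", run_id, {"task": "answer a question"})
emit("tool.result", run_id, {"tool": "weather_snapshot", "result": "Shanghai: rain"})
emit("run.end", run_id, {"final_answer": "Bring an umbrella."})

# Reading the trace back is just as simple: one JSON object per line.
events = [json.loads(line) for line in TRACE.read_text().splitlines()]
```

Append-only JSONL keeps the trace greppable and diffable with ordinary text tools, which is the point of a local-first design.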
AgentLens is for:
- solo builders shipping AI agents
- small teams building internal copilots
- developers working with tool-calling agents
- engineers debugging memory-enabled agent systems
The goal is not only to log a run, but to identify the likely failure point and suspicious signals.
Most tools treat memory as metadata. AgentLens treats memory as part of the decision path and highlights when memory conflicts with fresh tool evidence.
Not just dashboards — the system should help answer where run B started behaving differently from run A.
The SDK should work with:
- OpenAI SDK
- custom agent loops
- lightweight wrappers
- eventually LangGraph / AutoGen / CrewAI adapters
- `sdk/python/` — Python SDK for event capture
- `server/` — ingestion + storage API
- `web/` — trace viewer UI
- `docs/` — architecture, schema, roadmap
- `examples/` — minimal instrumented agents
- Python 3.10+
- A local shell environment that can run python3
- No external model/API dependency is required for the current alpha demos
git clone https://github.com/Exploreunive/agentlens.git
cd agentlens

Optional: create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e .

If you prefer not to use a virtual environment:
python3 -m pip install -e .

Optional: install the OpenAI SDK integration extra
python3 -m pip install -e ".[openai]"

Run tests:
pytest -q

agentlens/
├── sdk/python/agentlens/ # trace capture SDK
├── examples/ # demo agent runs
├── docs/ # architecture, schema, launch notes
├── tests/ # automated tests
├── analyzer.py # run analysis heuristics
├── explain.py # root-cause card builder
├── viewer.py # local HTML trace viewer
├── diff_runs.py # run divergence report
└── cli.py # minimal CLI entrypoint
cd agentlens
python3 examples/divergent_agent.py # generate two runs with hidden memory/tool conflict
python3 viewer.py # render latest run with root-cause card
python3 diff_runs.py # show where the two runs diverged
pytest -q # run tests

python3 cli.py demo # minimal run
python3 cli.py demo divergent # hidden degradation demo
python3 cli.py demo failure # visible failure demo
python3 cli.py demo openai-wrapper # minimal OpenAI-compatible wrapper demo
python3 cli.py demo langgraph # LangGraph-backed agent runtime demo
python3 cli.py view # latest trace -> HTML
python3 cli.py diff # latest two runs -> Markdown diff
python3 cli.py explain # generate both HTML + diff artifacts
python3 cli.py baseline save good-run
python3 cli.py baseline list
python3 cli.py regression check good-run
python3 cli.py bundle export

The trace viewer now also highlights:
- the first suspicious step
- the likely failure step
- error events directly in the timeline
- event-type filters for narrowing the trace quickly
1. Hidden degradation
python3 examples/divergent_agent.py
python3 diff_runs.py
python3 viewer.py

Shows a case where the final answer still looks acceptable, but the run already contains a `memory_conflict` signal.
2. Visible failure
python3 examples/failure_answer_agent.py
python3 diff_runs.py
python3 viewer.py

Shows a case where stale recalled memory overrides fresh tool evidence and the final answer visibly degrades.
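Both demos hinge on spotting a memory/tool conflict before the answer breaks. A toy version of that heuristic, with a hand-written contradiction table and hypothetical event shapes, might look like this:

```python
CONTRADICTIONS = {("sunny", "rain"), ("rain", "sunny")}  # toy domain knowledge

def find_memory_conflicts(events: list[dict]) -> list[dict]:
    """Flag recalled memories that contradict fresher tool evidence."""
    recalls = [e for e in events if e["type"] == "memory.recall"]
    results = [e for e in events if e["type"] == "tool.result"]
    signals = []
    for mem in recalls:
        for tool in results:
            pair = (mem["payload"].get("condition"), tool["payload"].get("condition"))
            if pair in CONTRADICTIONS:
                signals.append({
                    "signal": "memory_conflict",
                    "memory": mem["payload"],
                    "tool_evidence": tool["payload"],
                })
    return signals

events = [
    {"type": "memory.recall", "payload": {"condition": "sunny"}},
    {"type": "tool.result", "payload": {"condition": "rain"}},
    {"type": "run.end", "payload": {"final_answer": "Enjoy your jog!"}},
]
signals = find_memory_conflicts(events)  # the run is suspect even if the answer looks fine
```

The key property: the signal fires on the conflict itself, not on the final answer, which is what lets hidden degradation surface before the answer visibly breaks.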
Current alpha prototype can already:
- emit structured JSONL traces
- generate a root-cause style failure card
- surface suspicious signals such as `memory_conflict` and `stale_memory_override`
- compare two runs and show the first divergence
- render a local HTML debugging view
- demonstrate both hidden degradation and visible failure scenarios
- instrument agent runs with higher-level SDK helpers for spans, LLM calls, tool calls, and memory events
- use a minimal OpenAI-compatible wrapper for lower-friction LLM tracing
- trace a real LangGraph-backed agent runtime through LangChain's `create_agent`
- save named baselines and generate regression reports against newer runs
- support privacy-safe local tracing with optional redaction
The point is not just to collect events; it is to help answer questions like:
- Why did this run become unreliable?
- Which suspicious step showed up before the final answer visibly degraded?
- Did the latest run regress against a known-good baseline?
- Did stale memory or fresh tool evidence change the outcome?
New in the latest alpha:
- failure chains that connect memory recall / tool evidence / suspicious signal / final answer
- answer risk labels to distinguish hidden degradation from visible failure
- divergence timelines with severity, not just a single first-diff blob
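One reading of "divergence timelines with severity": label every differing step rather than stopping at the first diff. This sketch uses an invented severity map and hypothetical event shapes; the real labels may differ:

```python
SEVERITY = {
    "error": "high",
    "signal": "high",
    "tool.result": "medium",
    "llm.response": "medium",
    "memory.recall": "low",
}

def divergence_timeline(run_a: list[dict], run_b: list[dict]) -> list[dict]:
    """List every step where two runs differ, with a coarse severity label."""
    timeline = []
    for i, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:
            timeline.append({
                "step": i,
                "type": b["type"],
                "severity": SEVERITY.get(b["type"], "low"),
            })
    return timeline

run_a = [{"type": "run.start"}, {"type": "tool.result", "payload": {"c": "rain"}}]
run_b = [{"type": "run.start"}, {"type": "tool.result", "payload": {"c": "sun"}}]
timeline = divergence_timeline(run_a, run_b)
```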
Example: hidden failure before obvious answer degradation
A useful debugging tool should catch this situation:
- the final answer still looks acceptable
- but recalled memory conflicts with fresh tool evidence
- the run is already unreliable even before the answer visibly breaks
That is the kind of failure AgentLens is trying to surface.
See also: docs/EXAMPLE_FAILURE.md
The Python SDK now supports a more ergonomic, local-first instrumentation style:
from agentlens import AgentLensClient
client = AgentLensClient()
run_id = client.new_run()
client.emit(type='run.start', run_id=run_id, payload={'task': 'answer a question'})
with client.span(run_id=run_id, name='research_and_answer') as span:
    llm = client.record_llm_call(
        run_id=run_id,
        model='gpt-4o-mini',
        prompt='Should we call the weather tool?',
        decision='call_weather_tool',
        reason='Need fresh evidence',
        metrics={'latency_ms': 42, 'input_tokens': 30, 'output_tokens': 16},
        parent_span_id=span.span_id,
    )
    client.record_tool_call(
        run_id=run_id,
        tool_name='weather.get_forecast',
        args={'city': 'Shanghai'},
        result={'condition': 'rain'},
        parent_span_id=llm['response'].span_id,
    )
    client.record_memory_recall(
        run_id=run_id,
        content='User usually jogs when it is sunny',
        parent_span_id=span.span_id,
    )

This keeps the local JSONL event model explicit, while reducing repetitive boilerplate for common agent flows.
AgentLens now includes a minimal wrapper that can either:
- trace a simulated OpenAI-compatible call with no API dependency
- trace a real OpenAI Responses API call when OPENAI_API_KEY is set
python3 cli.py demo openai-wrapper
python3 cli.py view

For a real OpenAI call:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
export AGENTLENS_OPENAI_API_STYLE=chat
python3 -m pip install -e ".[openai]"
python3 cli.py demo openai-wrapper

The wrapper lives in sdk/python/agentlens/openai_wrapper.py and is intentionally small. The goal is not to replace the OpenAI SDK, but to make real Responses API or Chat Completions calls traceable with very little glue code.
from openai import OpenAI
from agentlens import AgentLensClient, OpenAIResponsesTracer
client = AgentLensClient(redact_sensitive=True)
tracer = OpenAIResponsesTracer(client)
sdk_client = OpenAI()
run_id = client.new_run()
client.emit(type='run.start', run_id=run_id, payload={'task': 'answer a user question'})
response = tracer.trace_responses_create(
    run_id=run_id,
    client=sdk_client,
    model='gpt-4.1-mini',
    input='Should I jog tomorrow morning in Shanghai if rain is likely?',
)
client.emit(
    type='run.end',
    run_id=run_id,
    payload={'final_answer': response.output_text},
)

For OpenAI-compatible providers, initialize the SDK with base_url=... and choose the API shape your provider supports:
- AGENTLENS_OPENAI_API_STYLE=responses
- AGENTLENS_OPENAI_API_STYLE=chat
AgentLens now also includes a real agent runtime example built with LangChain's create_agent, which runs on LangGraph.
Install the optional runtime dependencies:
python3 -m pip install -e ".[langgraph]"

Run the demo against an OpenAI-compatible provider:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
python3 cli.py demo langgraph
python3 cli.py view

The adapter lives in sdk/python/agentlens/langgraph_adapter.py and emits:
- `run.start` / `run.end`
- `llm.request` / `llm.response`
- `tool.call` / `tool.result`
- `error` when model or tool execution fails
This gives AgentLens a path from toy demos into a real agent runtime that developers already use.
AgentLens now also supports a simple local baseline workflow:
python3 cli.py demo minimal
python3 cli.py baseline save good-run
python3 cli.py demo failure
python3 cli.py regression check good-run

This writes a Markdown regression report that makes it easier to answer a higher-value debugging question:
Did the latest run get worse than the baseline, and where did it diverge?
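That question can be sketched as a comparison of suspicious-signal counts and final answers between two traces (event shapes are hypothetical; the real report also includes divergence details):

```python
def regression_check(baseline: list[dict], latest: list[dict]) -> dict:
    """Toy regression report: did the latest run pick up new suspicious
    signals, or change its final answer relative to a known-good baseline?"""
    def count_signals(events):
        return sum(1 for e in events if e["type"] == "signal")

    def final_answer(events):
        ends = [e for e in events if e["type"] == "run.end"]
        return ends[-1]["payload"].get("final_answer") if ends else None

    new_signals = count_signals(latest) - count_signals(baseline)
    answer_changed = final_answer(latest) != final_answer(baseline)
    return {
        "new_signals": max(0, new_signals),
        "answer_changed": answer_changed,
        "regressed": new_signals > 0 or answer_changed,
    }

baseline = [{"type": "run.end", "payload": {"final_answer": "Bring an umbrella."}}]
latest = [
    {"type": "signal", "payload": {"signal": "memory_conflict"}},
    {"type": "run.end", "payload": {"final_answer": "Go jog."}},
]
report = regression_check(baseline, latest)
```

Even this crude version catches the common case: a run whose answer changed, or which accumulated new suspicious signals, is worth a look before shipping.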
AgentLens can also export a shareable local bundle for a run:
python3 cli.py bundle export

That writes a zip file under artifacts/bundles/ containing:
- the raw JSONL trace
- the rendered HTML trace viewer
- a manifest with summary metadata
- the latest diff report when a comparison run is available
This is useful for bug reports, async teammate debugging, or preserving a regression case without sending your whole repo around.
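A minimal version of such a bundle exporter, using only the standard library (paths and manifest fields here are illustrative, not AgentLens's actual layout):

```python
import json
import zipfile
from pathlib import Path

def export_bundle(trace_path: Path, html_path: Path, out_dir: Path) -> Path:
    """Zip a run's trace, its rendered viewer, and a small manifest together."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle_path = out_dir / f"{trace_path.stem}_bundle.zip"
    manifest = {
        "trace": trace_path.name,
        "viewer": html_path.name,
        "event_count": len(trace_path.read_text().splitlines()),
    }
    with zipfile.ZipFile(bundle_path, "w") as zf:
        zf.write(trace_path, trace_path.name)
        zf.write(html_path, html_path.name)
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return bundle_path

# Demo with throwaway files:
work = Path("bundle_demo")
work.mkdir(exist_ok=True)
(work / "run.jsonl").write_text('{"type": "run.start"}\n{"type": "run.end"}\n')
(work / "run.html").write_text("<html><body>trace view</body></html>")
bundle = export_bundle(work / "run.jsonl", work / "run.html", work / "bundles")
names = zipfile.ZipFile(bundle).namelist()
```

A single zip keeps the failure case self-contained: the recipient needs nothing but a browser and a JSONL-aware tool to reproduce the debugging view.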
AgentLens now also supports an optional local redaction mode for sensitive payloads:
from agentlens import AgentLensClient
client = AgentLensClient(redact_sensitive=True)

When enabled, AgentLens will automatically:
- redact common sensitive keys like `api_key`, `token`, and `password`
- scrub common secrets such as `sk-...` and `ghp_...`
- mask email addresses and phone numbers in captured strings
This is especially useful when developers want to trace real agent runs locally without dumping obvious secrets into JSONL artifacts.
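The redaction behaviors listed above can be approximated with a few regexes. The patterns here are simplified stand-ins for illustration, not the library's actual rules:

```python
import re

SENSITIVE_KEYS = {"api_key", "token", "password"}
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]+"), "[REDACTED_SECRET]"),
    (re.compile(r"ghp_[A-Za-z0-9]+"), "[REDACTED_SECRET]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\+?\d[\d\s-]{7,}\d"), "[REDACTED_PHONE]"),
]

def redact(value):
    """Recursively scrub sensitive keys and secret-looking strings."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        for pattern, replacement in PATTERNS:
            value = pattern.sub(replacement, value)
    return value

payload = {
    "api_key": "sk-abc123",
    "note": "ping dev@example.com or call +86 138-0000-0000",
}
clean = redact(payload)
```

Applying redaction at capture time, before anything hits disk, is what makes the JSONL artifacts safe to attach to bug reports.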
Make agent systems debuggable, replayable, and trustworthy.
- stronger root-cause heuristics
- better divergence explanation wording
- richer memory attribution
- replay-oriented run inspection
- framework adapters beyond the built-in OpenAI SDK wrapper
- stronger LangGraph and agent-runtime integrations
This is still an alpha project.
Current limitations:
- local-first only
- no hosted service
- no production-grade replay engine yet
- no AutoGen / CrewAI adapters yet (the LangGraph integration is still early)
- root-cause analysis is heuristic-based, not model-judged or formally verified
- current UI is a minimal local HTML viewer, not a polished multi-page app
