Explain why your agent failed.
AgentLens is a local-first debugger for LLM agents. It is built for the moment when a trace tells you what happened, but not where the run actually started going wrong.
The screenshot above is based on a real LangGraph-backed demo run traced by AgentLens:
- the model decides to call `weather_snapshot`
- the tool returns fresh evidence: `Shanghai: rain`
- the final answer updates to match the tool result
That is the core product idea:
- failure explanation — where the run likely went wrong
- tool and memory evidence — what influenced the outcome
- run divergence — where run B started behaving differently from run A
Most tools help you log traces. AgentLens is being built to help you answer the harder question:
Why did this agent make the wrong decision?
The latest alpha can already trace a real LangGraph runtime and render it into a debugging view that surfaces:
- runtime overview
- model turns
- tool evidence
- final answer
- failure chain and suspicious signals when they exist
To reproduce the LangGraph demo:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
python3 -m pip install -e ".[langgraph]"
python3 cli.py demo langgraph
python3 cli.py view

Most LLM observability tools are good at showing traces. Agent systems need more than traces — they need failure explanation.
Real agent failures are often caused by:
- tool output being misinterpreted
- recalled memory conflicting with fresh evidence
- the run diverging from a previously good trajectory
- the true failure starting earlier than the final bad answer
External signals point in the same direction:
- practitioners increasingly talk about agent debugging and replayability as missing layers
- recent work like Microsoft Research's AgentRx frames the problem as locating the critical failure step and the root cause in agent trajectories
AgentLens is built around that wedge: explain where an agent run went wrong, and why.
AgentLens is not trying to replace Langfuse, LangSmith, or Helicone head-on.
Instead, it starts with a sharper wedge:
- agent runtime debugging
- memory observability
- run replay + regression diff
- capture one agent run as structured events
- store traces locally
- generate a failure summary for the latest run
- render a small local HTML debugging view
- compare two runs and surface the first divergence
- highlight suspicious signals such as memory/tool conflicts
Planned next:
- stronger failure heuristics
- explicit stale-memory / conflicting-evidence summaries
- improved divergence rendering
- better replay-oriented inspection flow
- richer memory attribution
- run bundles for sharing/debugging
- framework adapters
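The "capture one agent run as structured events, store traces locally" wedge can be illustrated with a tiny local JSONL sketch. The event shape here (`type` / `run_id` / `ts` / `payload`) is a hypothetical approximation for illustration, not AgentLens's actual schema:

```python
import json
import time
import uuid
from pathlib import Path

TRACE = Path("traces/demo_run.jsonl")  # hypothetical local trace location
TRACE.parent.mkdir(exist_ok=True)
TRACE.write_text("")  # start a fresh trace for the demo

def emit(event_type: str, run_id: str, payload: dict) -> None:
    """Append one structured event to the local JSONL trace."""
    event = {"type": event_type, "run_id": run_id, "ts": time.time(), "payload": payload}
    with TRACE.open("a") as f:
        f.write(json.dumps(event) + "\n")

run_id = uuid.uuid4().hex
emit("run.start", run_id, {"task": "answer a question"})
emit("tool.result", run_id, {"tool": "weather_snapshot", "result": "Shanghai: rain"})
emit("run.end", run_id, {"final_answer": "Bring an umbrella."})

# Reading the trace back is just as simple: one JSON object per line.
events = [json.loads(line) for line in TRACE.read_text().splitlines()]
```

Append-only JSONL keeps the trace greppable and diffable with ordinary text tools, which is the point of a local-first design.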
AgentLens is for:
- solo builders shipping AI agents
- small teams building internal copilots
- developers working with tool-calling agents
- engineers debugging memory-enabled agent systems
The goal is not only to log a run, but to identify the likely failure point and suspicious signals.
Most tools treat memory as metadata. AgentLens treats memory as part of the decision path and highlights when memory conflicts with fresh tool evidence.
Not just dashboards — the system should help answer where run B started behaving differently from run A.
The SDK should work with:
- OpenAI SDK
- custom agent loops
- lightweight wrappers
- eventually LangGraph / AutoGen / CrewAI adapters
- `sdk/python/` — Python SDK for event capture
- `server/` — ingestion + storage API
- `web/` — trace viewer UI
- `docs/` — architecture, schema, roadmap
- `examples/` — minimal instrumented agents
- Python 3.10+
- A local shell environment that can run python3
- No external model/API dependency is required for the current alpha demos
git clone https://github.com/Exploreunive/agentlens.git
cd agentlens

Optional: create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e .

If you prefer not to use a virtual environment:
python3 -m pip install -e .

Optional: install the OpenAI SDK integration extra
python3 -m pip install -e ".[openai]"

Run tests:
pytest -q

agentlens/
├── sdk/python/agentlens/ # trace capture SDK
├── examples/ # demo agent runs
├── docs/ # architecture, schema, launch notes
├── tests/ # automated tests
├── analyzer.py # run analysis heuristics
├── explain.py # root-cause card builder
├── viewer.py # local HTML trace viewer
├── diff_runs.py # run divergence report
└── cli.py # minimal CLI entrypoint
cd agentlens
python3 examples/divergent_agent.py # generate two runs with hidden memory/tool conflict
python3 viewer.py # render latest run with root-cause card
python3 diff_runs.py # show where the two runs diverged
pytest -q # run tests

python3 cli.py demo # minimal run
python3 cli.py demo divergent # hidden degradation demo
python3 cli.py demo failure # visible failure demo
python3 cli.py demo openai-wrapper # minimal OpenAI-compatible wrapper demo
python3 cli.py demo langgraph # LangGraph-backed agent runtime demo
python3 cli.py view # latest trace -> HTML
python3 cli.py diff # latest two runs -> Markdown diff
python3 cli.py explain # generate both HTML + diff artifacts
python3 cli.py baseline save good-run
python3 cli.py baseline list
python3 cli.py regression check good-run
python3 cli.py bundle export

The trace viewer now also highlights:
- the first suspicious step
- the likely failure step
- error events directly in the timeline
- event-type filters for narrowing the trace quickly
1. Hidden degradation
python3 examples/divergent_agent.py
python3 diff_runs.py
python3 viewer.py

Shows a case where the final answer still looks acceptable, but the run already contains a `memory_conflict` signal.
2. Visible failure
python3 examples/failure_answer_agent.py
python3 diff_runs.py
python3 viewer.py

Shows a case where stale recalled memory overrides fresh tool evidence and the final answer visibly degrades.
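Both demos hinge on spotting a memory/tool conflict before the answer breaks. A toy version of that heuristic, with a hand-written contradiction table and hypothetical event shapes, might look like this:

```python
CONTRADICTIONS = {("sunny", "rain"), ("rain", "sunny")}  # toy domain knowledge

def find_memory_conflicts(events: list[dict]) -> list[dict]:
    """Flag recalled memories that contradict fresher tool evidence."""
    recalls = [e for e in events if e["type"] == "memory.recall"]
    results = [e for e in events if e["type"] == "tool.result"]
    signals = []
    for mem in recalls:
        for tool in results:
            pair = (mem["payload"].get("condition"), tool["payload"].get("condition"))
            if pair in CONTRADICTIONS:
                signals.append({
                    "signal": "memory_conflict",
                    "memory": mem["payload"],
                    "tool_evidence": tool["payload"],
                })
    return signals

events = [
    {"type": "memory.recall", "payload": {"condition": "sunny"}},
    {"type": "tool.result", "payload": {"condition": "rain"}},
    {"type": "run.end", "payload": {"final_answer": "Enjoy your jog!"}},
]
signals = find_memory_conflicts(events)  # the run is suspect even if the answer looks fine
```

The key property: the signal fires on the conflict itself, not on the final answer, which is what lets hidden degradation surface before the answer visibly breaks.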
Current alpha prototype can already:
- emit structured JSONL traces
- generate a root-cause style failure card
- surface suspicious signals such as `memory_conflict` and `stale_memory_override`
- compare two runs and show the first divergence
- render a local HTML debugging view
- demonstrate both hidden degradation and visible failure scenarios
- instrument agent runs with higher-level SDK helpers for spans, LLM calls, tool calls, and memory events
- use a minimal OpenAI-compatible wrapper for lower-friction LLM tracing
- trace a real LangGraph-backed agent runtime through LangChain's `create_agent`
- save named baselines and generate regression reports against newer runs
- support privacy-safe local tracing with optional redaction
The point is not just to collect events; it is to help answer questions like:
- Why did this run become unreliable?
- Which suspicious step showed up before the final answer visibly degraded?
- Did the latest run regress against a known-good baseline?
- Did stale memory or fresh tool evidence change the outcome?
New in the latest alpha:
- failure chains that connect memory recall / tool evidence / suspicious signal / final answer
- answer risk labels to distinguish hidden degradation from visible failure
- divergence timelines with severity, not just a single first-diff blob
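One reading of "divergence timelines with severity": label every differing step rather than stopping at the first diff. This sketch uses an invented severity map and hypothetical event shapes; the real labels may differ:

```python
SEVERITY = {
    "error": "high",
    "signal": "high",
    "tool.result": "medium",
    "llm.response": "medium",
    "memory.recall": "low",
}

def divergence_timeline(run_a: list[dict], run_b: list[dict]) -> list[dict]:
    """List every step where two runs differ, with a coarse severity label."""
    timeline = []
    for i, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:
            timeline.append({
                "step": i,
                "type": b["type"],
                "severity": SEVERITY.get(b["type"], "low"),
            })
    return timeline

run_a = [{"type": "run.start"}, {"type": "tool.result", "payload": {"c": "rain"}}]
run_b = [{"type": "run.start"}, {"type": "tool.result", "payload": {"c": "sun"}}]
timeline = divergence_timeline(run_a, run_b)
```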
Example: hidden failure before obvious answer degradation
A useful debugging tool should catch this situation:
- the final answer still looks acceptable
- but recalled memory conflicts with fresh tool evidence
- the run is already unreliable even before the answer visibly breaks
That is the kind of failure AgentLens is trying to surface.
See also: docs/EXAMPLE_FAILURE.md
The Python SDK now supports a more ergonomic, local-first instrumentation style:
from agentlens import AgentLensClient
client = AgentLensClient()
run_id = client.new_run()
client.emit(type='run.start', run_id=run_id, payload={'task': 'answer a question'})
with client.span(run_id=run_id, name='research_and_answer') as span:
    llm = client.record_llm_call(
        run_id=run_id,
        model='gpt-4o-mini',
        prompt='Should we call the weather tool?',
        decision='call_weather_tool',
        reason='Need fresh evidence',
        metrics={'latency_ms': 42, 'input_tokens': 30, 'output_tokens': 16},
        parent_span_id=span.span_id,
    )
    client.record_tool_call(
        run_id=run_id,
        tool_name='weather.get_forecast',
        args={'city': 'Shanghai'},
        result={'condition': 'rain'},
        parent_span_id=llm['response'].span_id,
    )
    client.record_memory_recall(
        run_id=run_id,
        content='User usually jogs when it is sunny',
        parent_span_id=span.span_id,
    )

This keeps the local JSONL event model explicit, while reducing repetitive boilerplate for common agent flows.
AgentLens now includes a minimal wrapper that can either:
- trace a simulated OpenAI-compatible call with no API dependency
- trace a real OpenAI Responses API call when OPENAI_API_KEY is set
python3 cli.py demo openai-wrapper
python3 cli.py view

For a real OpenAI call:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
export AGENTLENS_OPENAI_API_STYLE=chat
python3 -m pip install -e ".[openai]"
python3 cli.py demo openai-wrapper

The wrapper lives in sdk/python/agentlens/openai_wrapper.py and is intentionally small. The goal is not to replace the OpenAI SDK, but to make real Responses API or Chat Completions calls traceable with very little glue code.
from openai import OpenAI
from agentlens import AgentLensClient, OpenAIResponsesTracer
client = AgentLensClient(redact_sensitive=True)
tracer = OpenAIResponsesTracer(client)
sdk_client = OpenAI()
run_id = client.new_run()
client.emit(type='run.start', run_id=run_id, payload={'task': 'answer a user question'})
response = tracer.trace_responses_create(
    run_id=run_id,
    client=sdk_client,
    model='gpt-4.1-mini',
    input='Should I jog tomorrow morning in Shanghai if rain is likely?',
)
client.emit(
    type='run.end',
    run_id=run_id,
    payload={'final_answer': response.output_text},
)

For OpenAI-compatible providers, initialize the SDK with base_url=... and choose the API shape your provider supports:
- AGENTLENS_OPENAI_API_STYLE=responses
- AGENTLENS_OPENAI_API_STYLE=chat
AgentLens now also includes a real agent runtime example built with LangChain's create_agent, which runs on LangGraph.
Install the optional runtime dependencies:
python3 -m pip install -e ".[langgraph]"

Run the demo against an OpenAI-compatible provider:
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-openai-compatible-host/v1
export AGENTLENS_OPENAI_MODEL=gpt-5.2
python3 cli.py demo langgraph
python3 cli.py view

The adapter lives in sdk/python/agentlens/langgraph_adapter.py and emits:
- `run.start` / `run.end`
- `llm.request` / `llm.response`
- `tool.call` / `tool.result`
- `error` when model or tool execution fails
This gives AgentLens a path from toy demos into a real agent runtime that developers already use.
AgentLens now also supports a simple local baseline workflow:
python3 cli.py demo minimal
python3 cli.py baseline save good-run
python3 cli.py demo failure
python3 cli.py regression check good-run

This writes a Markdown regression report that makes it easier to answer a higher-value debugging question:
Did the latest run get worse than the baseline, and where did it diverge?
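That question can be sketched as a comparison of suspicious-signal counts and final answers between two traces (event shapes are hypothetical; the real report also includes divergence details):

```python
def regression_check(baseline: list[dict], latest: list[dict]) -> dict:
    """Toy regression report: did the latest run pick up new suspicious
    signals, or change its final answer relative to a known-good baseline?"""
    def count_signals(events):
        return sum(1 for e in events if e["type"] == "signal")

    def final_answer(events):
        ends = [e for e in events if e["type"] == "run.end"]
        return ends[-1]["payload"].get("final_answer") if ends else None

    new_signals = count_signals(latest) - count_signals(baseline)
    answer_changed = final_answer(latest) != final_answer(baseline)
    return {
        "new_signals": max(0, new_signals),
        "answer_changed": answer_changed,
        "regressed": new_signals > 0 or answer_changed,
    }

baseline = [{"type": "run.end", "payload": {"final_answer": "Bring an umbrella."}}]
latest = [
    {"type": "signal", "payload": {"signal": "memory_conflict"}},
    {"type": "run.end", "payload": {"final_answer": "Go jog."}},
]
report = regression_check(baseline, latest)
```

Even this crude version catches the common case: a run whose answer changed, or which accumulated new suspicious signals, is worth a look before shipping.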
AgentLens can also export a shareable local bundle for a run:
python3 cli.py bundle export

That writes a zip file under artifacts/bundles/ containing:
- the raw JSONL trace
- the rendered HTML trace viewer
- a manifest with summary metadata
- the latest diff report when a comparison run is available
This is useful for bug reports, async teammate debugging, or preserving a regression case without sending your whole repo around.
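A minimal version of such a bundle exporter, using only the standard library (paths and manifest fields here are illustrative, not AgentLens's actual layout):

```python
import json
import zipfile
from pathlib import Path

def export_bundle(trace_path: Path, html_path: Path, out_dir: Path) -> Path:
    """Zip a run's trace, its rendered viewer, and a small manifest together."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle_path = out_dir / f"{trace_path.stem}_bundle.zip"
    manifest = {
        "trace": trace_path.name,
        "viewer": html_path.name,
        "event_count": len(trace_path.read_text().splitlines()),
    }
    with zipfile.ZipFile(bundle_path, "w") as zf:
        zf.write(trace_path, trace_path.name)
        zf.write(html_path, html_path.name)
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return bundle_path

# Demo with throwaway files:
work = Path("bundle_demo")
work.mkdir(exist_ok=True)
(work / "run.jsonl").write_text('{"type": "run.start"}\n{"type": "run.end"}\n')
(work / "run.html").write_text("<html><body>trace view</body></html>")
bundle = export_bundle(work / "run.jsonl", work / "run.html", work / "bundles")
names = zipfile.ZipFile(bundle).namelist()
```

A single zip keeps the failure case self-contained: the recipient needs nothing but a browser and a JSONL-aware tool to reproduce the debugging view.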
AgentLens now also supports an optional local redaction mode for sensitive payloads:
from agentlens import AgentLensClient
client = AgentLensClient(redact_sensitive=True)

When enabled, AgentLens will automatically:
- redact common sensitive keys like `api_key`, `token`, and `password`
- scrub common secrets such as `sk-...` and `ghp_...`
- mask email addresses and phone numbers in captured strings
This is especially useful when developers want to trace real agent runs locally without dumping obvious secrets into JSONL artifacts.
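The redaction behaviors listed above can be approximated with a few regexes. The patterns here are simplified stand-ins for illustration, not the library's actual rules:

```python
import re

SENSITIVE_KEYS = {"api_key", "token", "password"}
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]+"), "[REDACTED_SECRET]"),
    (re.compile(r"ghp_[A-Za-z0-9]+"), "[REDACTED_SECRET]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\+?\d[\d\s-]{7,}\d"), "[REDACTED_PHONE]"),
]

def redact(value):
    """Recursively scrub sensitive keys and secret-looking strings."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        for pattern, replacement in PATTERNS:
            value = pattern.sub(replacement, value)
    return value

payload = {
    "api_key": "sk-abc123",
    "note": "ping dev@example.com or call +86 138-0000-0000",
}
clean = redact(payload)
```

Applying redaction at capture time, before anything hits disk, is what makes the JSONL artifacts safe to attach to bug reports.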
Make agent systems debuggable, replayable, and trustworthy.
- stronger root-cause heuristics
- better divergence explanation wording
- richer memory attribution
- replay-oriented run inspection
- framework adapters beyond the built-in OpenAI SDK wrapper
- stronger LangGraph and agent-runtime integrations
This is still an alpha project.
Current limitations:
- local-first only
- no hosted service
- no production-grade replay engine yet
- no AutoGen / CrewAI adapters yet (the LangGraph integration is still early)
- root-cause analysis is heuristic-based, not model-judged or formally verified
- current UI is a minimal local HTML viewer, not a polished multi-page app
