This document explains the architecture, workflows, and how to run the autopsy discovery and governance flows locally.
- `NeuralProxyRuntime`: orchestrates inference, tracing, and OpenMetadata mapping.
- `NeuralInferenceEngine`: runs model generations with tracing hooks and supports masking individual heads via `set_masked_heads()`.
- `OpenMetadataNeuralMapper`: maps model topology (layers→tables, heads→columns) and can tag columns/tables as `DEFECTIVE`/`QUARANTINED`.
- `HeadMaskStore`: in-memory store of head masks that the runtime snapshots at generation start.
Mask semantics: masks are applied at generation start, so changes affect subsequent generations only. In-flight generations are not retroactively changed.
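The snapshot behavior can be illustrated with a minimal sketch. This is not the repo's `HeadMaskStore`; the method names `add` and `snapshot` and the `(layer_index, head_index)` tuple layout are assumptions made for illustration.

```python
import threading
from typing import FrozenSet, Iterable, Set, Tuple

Head = Tuple[int, int]  # (layer_index, head_index); layout is an assumption


class HeadMaskStore:
    """Illustrative in-memory mask store with snapshot-at-start semantics."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._masked: Set[Head] = set()

    def add(self, heads: Iterable[Head]) -> None:
        with self._lock:
            self._masked.update(heads)

    def snapshot(self) -> FrozenSet[Head]:
        # Immutable copy: later add() calls cannot change a snapshot
        # that a running generation already holds.
        with self._lock:
            return frozenset(self._masked)


store = HeadMaskStore()
in_flight = store.snapshot()   # generation A starts with no masks
store.add([(1, 2)])            # a head is quarantined while A is running
next_run = store.snapshot()    # generation B starts afterwards

print(sorted(in_flight))  # []
print(sorted(next_run))   # [(1, 2)]
```

Returning a `frozenset` is one way to make the "no retroactive change" rule hold by construction rather than by convention.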
- `POST /api/v1/autopsy/discover_circuit`: run a causal autopsy and ablation sweep. Returns `discovered_circuit` and `combined_causal_effect`. Includes ablated `trace` objects for overlay in the UI.
- `POST /api/v1/openmetadata/quarantine`: take a list of head objects and push `DEFECTIVE`/`QUARANTINED` tags to OpenMetadata; updates the runtime mask table so subsequent generations mask quarantined heads.
Each model layer is represented as a table in OpenMetadata. Heads within a layer map to columns named Head_1, Head_2, etc. The client will attempt to apply column-level tags first and fall back to table-level tagging if column operations fail.
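A minimal sketch of the naming rule and the column-then-table fallback follows. It assumes zero-based indices mapped to one-based names (consistent with `layer_index` 1 and `Layer_2` in the quarantine example in this document); `tag_column` and `tag_table` are hypothetical callables standing in for the real OpenMetadata client calls.

```python
from typing import Callable, Tuple


def asset_names(layer_index: int, head_index: int) -> Tuple[str, str]:
    """Map zero-based indices to one-based catalog names (assumed convention)."""
    return f"Layer_{layer_index + 1}", f"Head_{head_index + 1}"


def tag_head(
    layer_index: int,
    head_index: int,
    tag_column: Callable[[str, str], None],  # hypothetical client call
    tag_table: Callable[[str], None],        # hypothetical client call
) -> str:
    """Try a column-level tag first; fall back to tagging the whole table."""
    table, column = asset_names(layer_index, head_index)
    try:
        tag_column(table, column)
        return "column"
    except Exception:
        # Broad catch only for the sketch; a real client would catch
        # its own specific error type.
        tag_table(table)
        return "table"


def failing_column_tagger(table: str, column: str) -> None:
    raise RuntimeError("column tagging unavailable")


tagged_tables = []
print(asset_names(1, 2))  # ('Layer_2', 'Head_3')
print(tag_head(1, 2, failing_column_tagger, tagged_tables.append))  # table
```

Note that a table-level fallback is coarser than the intended decision: it marks every head in the layer, so it is worth surfacing which level was actually applied.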
- Start the backend:

```bash
python -m uvicorn backend.app.main:app --reload --port 8000
```

- Start the frontend:

```bash
cd frontend
npm install
npm run dev
```

- Use the UI at `http://localhost:3000` (default) or call APIs directly.
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/autopsy/discover_circuit \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Why is the sky green?","target_hallucination_token":"green","trace_model_name":"gpt2","max_new_tokens":32}'
```

The response includes `discovered_circuit` (list of heads), `combined_causal_effect`, and `sweep_results` with ablated trace objects.
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/openmetadata/quarantine \
  -H 'Content-Type: application/json' \
  -H 'X-OpenMetadata-Secret: <your-secret>' \
  -d '{"heads":[{"layer_index":1,"layer_name":"Layer_2","head_index":2,"head_name":"Head_3","activation_score":0.12}],"reason":"discovered_via_ui"}'
```

This tags the head(s) in OpenMetadata and updates the runtime's mask table so subsequent generations have these heads masked.
- Webhook endpoints validate the `X-OpenMetadata-Secret` header against the configured `OPENMETADATA_WEBHOOK_SECRET` environment variable. Rotate secrets regularly.
- If exposing these endpoints externally, gate them behind an API gateway and restrict access via authentication.
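A header check of this kind might look like the sketch below. The function name `is_authorized` is hypothetical and the repo's actual validation code may differ; the sketch assumes a plain mapping of headers with the exact header name as key.

```python
import hmac
import os
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected: Optional[str] = None) -> bool:
    """Return True only when the header matches the configured secret."""
    if expected is None:
        expected = os.environ.get("OPENMETADATA_WEBHOOK_SECRET", "")
    provided = headers.get("X-OpenMetadata-Secret", "")
    if not expected or not provided:
        return False  # fail closed when either side is missing
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(provided.encode(), expected.encode())


print(is_authorized({"X-OpenMetadata-Secret": "s3cret"}, "s3cret"))  # True
print(is_authorized({"X-OpenMetadata-Secret": "wrong"}, "s3cret"))   # False
print(is_authorized({}, "s3cret"))                                   # False
```

Failing closed when no secret is configured avoids the case where an unset environment variable silently disables the check.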
Run Python tests from the repository root:

```bash
pytest -q
```

The integration test `backend/tests/test_discover_quarantine_integration.py` mocks generation to avoid heavy Hugging Face downloads while asserting the end-to-end flow.
- Add the provided GitHub Actions workflow to run tests and TypeScript typecheck on push and PRs (see `.github/workflows/ci.yml`).
- Masking changes are snapshot-based; they do not affect in-flight generations.
- Discovery sweeps can be time-consuming; tune `top_k_heads` and the pair/triple sweep limits for pragmatic runs.
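To see why these limits matter, the number of ablation runs grows combinatorially with `top_k_heads`. The sketch below counts runs under caps; `max_pairs` and `max_triples` are hypothetical parameter names, not the repo's.

```python
from itertools import combinations
from math import comb


def sweep_sizes(top_k_heads: int, max_pairs: int, max_triples: int) -> dict:
    """Count ablation runs for single, pair, and triple sweeps under caps."""
    return {
        "singles": top_k_heads,
        "pairs": min(comb(top_k_heads, 2), max_pairs),
        "triples": min(comb(top_k_heads, 3), max_triples),
    }


def sweep_plan(heads, max_pairs, max_triples):
    """Enumerate the head groups to ablate, truncated to the caps."""
    pairs = list(combinations(heads, 2))[:max_pairs]
    triples = list(combinations(heads, 3))[:max_triples]
    return [(h,) for h in heads] + pairs + triples


print(sweep_sizes(8, 100, 100))  # {'singles': 8, 'pairs': 28, 'triples': 56}
print(sweep_sizes(8, 20, 10))    # {'singles': 8, 'pairs': 20, 'triples': 10}
print(len(sweep_plan(list(range(8)), 20, 10)))  # 38
```

With 8 candidate heads an uncapped sweep already needs 92 generations, so capping pairs and triples is usually the first knob to turn.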
AI Autopsy Engine attacks the AI black-box problem by turning each model run into an auditable record: prompt, model identity, traced internal activity, generated output, OpenMetadata lineage, and governance actions. The strongest part of the current solution is the live transformer tracing path: attention layers and heads are captured from a hooked Hugging Face model, mapped into OpenMetadata as model -> layer -> head assets, and rendered in the dashboard as a neural lineage route.
The main gap is also important: attention activity is evidence, not complete causal proof. A head receiving high attention does not automatically mean it caused the final answer. The improved implementation now marks every trace with explicit evidence_quality, black_box_gaps, and recommended next actions so the product does not overclaim. Exact tracing, shadow tracing, ablation, replay, provenance, and governance are separated clearly.
Most AI observability tools show surface behavior:
- Prompt and response text.
- Latency, errors, and token counts.
- Sometimes retrieval sources or confidence scores.
Those signals do not answer deeper operational questions:
- Which model version produced this answer?
- Which prompt tokens were most influential?
- Which internal layers and attention heads were active?
- Was this evidence captured from the same model run, or from a proxy tracer?
- Can an operator isolate a suspicious neural component and verify the effect?
- Can an auditor inspect the decision path later?
AI Autopsy Engine is built to make those questions answerable through trace capture, metadata lineage, and controlled intervention.
```mermaid
flowchart LR
    U[User Prompt] --> FE[Next.js Dashboard]
    FE --> API[FastAPI Neural Proxy]
    API --> OLLAMA[Ollama Fast Generator]
    API --> HF[Hooked Hugging Face Tracer]
    HF --> TRACE[Attention + Head Activity]
    TRACE --> API
    API --> OM[OpenMetadata Catalog]
    OM --> API
    API --> FE
```
Core files:
- `backend/app/inference.py`: generation, Hugging Face hooks, attention capture, trace fidelity, evidence quality, and head masking.
- `backend/app/main.py`: FastAPI endpoints, streaming, session state, OpenMetadata sync, and lineage ingestion scheduling.
- `backend/app/om_client.py`: OpenMetadata mapping, synthetic model/layer/head assets, lineage edges, and `DEFECTIVE` tag sync.
- `frontend/components/synapse-dashboard.tsx`: operator dashboard, streaming response, graph, evidence quality, and governance status.
- `frontend/lib/types.ts`: shared trace and evidence-quality types.
- A user sends a prompt from the dashboard or API.
- The backend selects an execution mode:
  - `faithful`: the Hugging Face model generates tokens while hooks capture the same run. This is exact evidence.
  - `auto`: Ollama generates quickly when available while the Hugging Face model may trace in shadow mode. This is proxy evidence unless output matching is extremely close.
- During traced generation, hooks capture per-layer, per-head attention for the newest token.
- The backend summarizes each token step:
- active layers
- top heads
- source tokens receiving attention
- high-activation path
- masked heads applied
- The trace is converted into an OpenMetadata lineage graph:
- model as database
- layers as tables
- heads as columns
- prompt ingress and response egress as synthetic tables
- Operators can tag a layer/head as `DEFECTIVE` in OpenMetadata.
- The backend syncs those tags and masks matching attention heads in later traced generation.
- The solution moves beyond normal prompt/response logging and captures internal transformer signals.
- It distinguishes `exact` tracing from `proxy` tracing.
- It reuses OpenMetadata instead of inventing a new catalog or governance UI.
- It supports an intervention loop: tag a head as defective, sync, then mask it during future generation.
- The frontend presents model internals as an operator console rather than a generic chat UI.
- The backend now exposes evidence quality directly in the trace instead of hiding uncertainty.
The previous explanation and product framing had several gaps:
- It overclaimed causality from attention. Attention weights are mechanism telemetry, but attention alone is not a complete causal explanation. A high-attention head can be correlated with an output without being necessary for that output.
- It mixed two product stories. The old `explain.md` described both a generic ML provenance SDK and the Synapse-Graph transformer tracer. That made the system look unfocused. The cleaned explanation now centers on the real repo implementation.
- Proxy and exact evidence were not prominent enough. Fast Ollama generation plus shadow Hugging Face tracing is useful, but it is exact only when the outputs match closely. The implementation now emits `evidence_quality` with exactness, causal-validation status, gaps, and next actions.
- It did not explain how to validate a suspected defective head. Masking is useful, but governance should be based on ablation or replay evidence, not only on high attention. The roadmap below adds a stricter validation loop.
- It lacked a clear privacy story. Prompt text and trace artifacts can contain sensitive information. A production system needs redaction, retention policy, encryption, and access control.
- It lacked reproducibility metadata. To fully replay an inference, the system should persist model revision, tokenizer revision, seed, decoding config, dependency versions, and hardware/runtime details.
- It lacked evaluation metrics. To prove that the engine reduces black-box risk, it should measure trace coverage, proxy/exact ratio, replay success, ablation effect size, ingestion success, and false-positive quarantine rate.
backend/app/inference.py now includes an EvidenceQuality model on every final trace when possible:
- `score`: 0.0 to 1.0 confidence-style quality score for the trace evidence.
- `label`: low, medium, or high.
- `exactness`: explains whether generation and tracing came from the same run.
- `causal_validation`: marks whether the run is already same-run-hook validated or still needs ablation/replay.
- `black_box_gaps`: machine-readable list of remaining explanation gaps.
- `recommended_next_actions`: concrete actions such as faithful rerun, shadow model preload, or ablation replay.
This prevents the system from pretending that all traces are equally trustworthy.
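As a rough illustration of the shape of this record, the sketch below uses a dataclass with the same field names. The scoring rule, thresholds, and string values are assumptions chosen for the example, not the values used in `backend/app/inference.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceQuality:
    score: float
    label: str
    exactness: str
    causal_validation: str
    black_box_gaps: List[str] = field(default_factory=list)
    recommended_next_actions: List[str] = field(default_factory=list)


def label_for(score: float) -> str:
    # Thresholds are illustrative assumptions.
    if score >= 0.75:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


def build_quality(trace_fidelity: str, ablation_done: bool) -> EvidenceQuality:
    exact = trace_fidelity == "exact"
    score = 0.8 if exact else 0.3  # hypothetical base scores
    gaps: List[str] = []
    actions: List[str] = []
    if not exact:
        gaps.append("proxy_trace")
        actions.append("rerun in faithful mode")
    if not ablation_done:
        gaps.append("no_ablation_evidence")
        actions.append("run ablation replay")
    return EvidenceQuality(
        score=score,
        label=label_for(score),
        exactness="same_run" if exact else "shadow_model",
        causal_validation=(
            "same_run_hook_validated" if exact else "needs_ablation_or_replay"
        ),
        black_box_gaps=gaps,
        recommended_next_actions=actions,
    )


quality = build_quality("proxy", ablation_done=False)
print(quality.label, quality.black_box_gaps)
# low ['proxy_trace', 'no_ablation_evidence']
```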
frontend/components/synapse-dashboard.tsx now displays evidence quality in the Glassbox Summary panel. Operators can see:
- evidence score
- exact/proxy explanation
- causal validation status
- top black-box gaps
This makes uncertainty visible where decisions are made.
This explain.md now describes:
- what the system actually implements
- what it solves
- where it is still weak
- how to test it
- how to make it stronger from all angles
| Evidence level | Meaning | Trust |
|---|---|---|
| Surface log | prompt, response, latency only | Low |
| Proxy trace | separate shadow model captures similar behavior | Medium if outputs match |
| Exact hooked trace | same model run generates and captures activations | High for mechanism telemetry |
| Ablation validated | masking/removing a head changes output as predicted | Strong causal evidence |
| Replay reproducible | frozen model/config can recreate the run | Strong audit evidence |
The current repo supports surface logs, proxy traces, exact hooked traces, and masking. The next major step is systematic ablation validation and replay.
OpenMetadata is used as the operational memory of the system:
- Model = synthetic database.
- Transformer layer = table.
- Attention head = column.
- Prompt and response = ingress/egress tables.
- Active routes = lineage edges.
- `DEFECTIVE` tag = governance control.
This makes model internals searchable, taggable, and auditable by the same kind of metadata platform data teams already use.
When the OpenMetadata integration is connected, the backend writes catalog and lineage objects, not raw model weights:
- Database service: `Synapse_Neural_Service`.
- Database: one synthetic database per traced model, for example `gpt2`.
- Schema: `Transformer_Graph`.
- Prompt table: `Prompt_Ingress`, with prompt text/token-count columns.
- Response table: `Response_Egress`, with response text column.
- Layer tables: one table per transformer layer, for example `Layer_1`, `Layer_2`, ... `Layer_12` for GPT-2.
- Head columns: one column per attention head, for example `Head_1` ... `Head_12` inside each GPT-2 layer table.
- Classification/tag: `SynapseQuarantine.DEFECTIVE`.
- Lineage edges: prompt -> active layer/head columns -> response, using the current token step's high-activation path.
OpenMetadata does not store every tensor by default. The app stores compact lineage and metadata anchors there. Full attention matrices would be too large for normal catalog storage and should live in an artifact store if enabled.
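The catalog objects above can be addressed by dot-separated fully qualified names. The sketch below builds such names and a simple prompt -> heads -> response chain; it assumes the service/database/schema/table/column ordering, ignores the quoting needed for names that themselves contain dots, and the chain shape of the edges is an assumption rather than the repo's exact lineage payload.

```python
from typing import List, Tuple

SERVICE = "Synapse_Neural_Service"
SCHEMA = "Transformer_Graph"


def table_fqn(model: str, table: str) -> str:
    """service.database.schema.table, with the model as the database."""
    return f"{SERVICE}.{model}.{SCHEMA}.{table}"


def column_fqn(model: str, layer: int, head: int) -> str:
    """layer and head are one-based, matching Layer_1 / Head_1 naming."""
    layer_table = table_fqn(model, f"Layer_{layer}")
    return f"{layer_table}.Head_{head}"


def lineage_edges(model: str, path: List[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Chain prompt -> each (layer, head) on the path -> response."""
    nodes = [table_fqn(model, "Prompt_Ingress")]
    nodes += [column_fqn(model, layer, head) for layer, head in path]
    nodes.append(table_fqn(model, "Response_Egress"))
    return list(zip(nodes, nodes[1:]))


print(column_fqn("gpt2", 1, 1))
# Synapse_Neural_Service.gpt2.Transformer_Graph.Layer_1.Head_1
for source, target in lineage_edges("gpt2", [(1, 3), (2, 5)]):
    print(source, "->", target)
```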
OpenMetadata is not running the model and it is not extracting neural activations. The backend does that. OpenMetadata is the catalog and governance layer:
- It stores a synthetic model catalog: model -> transformer layers -> attention heads.
- It receives lineage edges for active attention routes when tracing is available.
- It stores the `SynapseQuarantine.DEFECTIVE` tag.
- It lets an operator tag a layer or head, then the backend syncs those tags and masks matching heads on later traced runs.
So OpenMetadata answers "what was observed, where is it cataloged, and what governance decision should apply?" It does not answer "what happened inside the model" by itself. That evidence comes from the Hugging Face hook tracer.
The error `Not Authorized! Token not present` means the OpenMetadata server at `http://127.0.0.1:8585` is reachable, but it requires authentication. The backend was trying to create database service/table/column entities without a bearer token.
There was also a config mismatch:
- The backend settings originally expected `SYNAPSE_OPENMETADATA_HOST`, `SYNAPSE_OPENMETADATA_EMAIL`, `SYNAPSE_OPENMETADATA_PASSWORD`, or `SYNAPSE_OPENMETADATA_JWT_TOKEN`.
- Your `.env` used `OPENMETADATA_HOST`, `OPENMETADATA_USERNAME`, and `OPENMETADATA_PASSWORD`.
The backend now accepts both naming styles and normalizes http://localhost:8585 to http://localhost:8585/api. With the default local OpenMetadata credentials in .env, it should be able to log in and create the synthetic catalog.
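A minimal sketch of that normalization, assuming the `SYNAPSE_`-prefixed names take precedence over the legacy ones (the precedence order and the helper names `normalize_host` and `first_set` are assumptions):

```python
from typing import Mapping, Optional


def normalize_host(host: str) -> str:
    """Ensure the OpenMetadata base URL ends with /api exactly once."""
    host = host.rstrip("/")
    return host if host.endswith("/api") else host + "/api"


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among several accepted names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


env = {
    "OPENMETADATA_HOST": "http://localhost:8585",
    "OPENMETADATA_USERNAME": "admin@open-metadata.org",
}
host = normalize_host(first_set(env, "SYNAPSE_OPENMETADATA_HOST", "OPENMETADATA_HOST"))
email = first_set(env, "SYNAPSE_OPENMETADATA_EMAIL", "OPENMETADATA_USERNAME")

print(host)   # http://localhost:8585/api
print(email)  # admin@open-metadata.org
```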
If OpenMetadata still reports offline, verify one of these auth paths:
```bash
SYNAPSE_OPENMETADATA_JWT_TOKEN=<token>
```

or:

```bash
SYNAPSE_OPENMETADATA_EMAIL=admin@open-metadata.org
SYNAPSE_OPENMETADATA_PASSWORD=admin
```

The legacy names also work:

```bash
OPENMETADATA_USERNAME=admin@open-metadata.org
OPENMETADATA_PASSWORD=admin
```

With the old config:

```bash
SYNAPSE_OLLAMA_MODEL=phi3:latest
SYNAPSE_HF_MODEL_NAME=phi3:latest
```

the answer is: it is not doing real internal tracing of `phi3:latest`.
Reason: phi3:latest is an Ollama model name. Ollama exposes generated tokens through its HTTP API, but it does not expose per-layer attention tensors or head activations. The Hugging Face tracer needs a real Hugging Face repo id and tokenizer, not an Ollama tag.
The runtime evidence levels are:
- Ollama only: real generated text, but no real internal black-box trace. Evidence is proxy/minimal.
- Ollama + different HF shadow model: real Ollama output plus real internals from a different model. Useful for demo telemetry, but not exact for Phi-3.
- Hugging Face faithful mode with a valid HF model: real generation and real internal trace from the same hooked model run. This is the real black-box tracking path.
The local config has now been changed to:

```bash
SYNAPSE_OLLAMA_MODEL=phi3:latest
SYNAPSE_HF_MODEL_NAME=gpt2
SYNAPSE_PRELOAD_SHADOW_MODEL=true
```

This gives you a real exact tracing path for the Hugging Face model. In faithful mode, generation happens inside the hooked Hugging Face model and the graph is built from actual captured attention tensors. It is not fake, but it is exact for `gpt2`, not for Ollama `phi3:latest`. GPT-2 has 12 transformer layers and 144 total attention heads, which is likely the larger number you saw earlier.
The frontend now lets the user switch trace models:
| Trace model | Layers | Heads per layer | Total heads | What to expect |
|---|---|---|---|---|
| `gpt2` | 12 | 12 | 144 | Bigger real graph, slower exact tracing, strange base-model prose |
| `sshleifer/tiny-gpt2` | 2 | 2 | 4 | Very fast real graph, tiny topology, poor prose |
To get real exact tracing, set `SYNAPSE_HF_MODEL_NAME` to a Hugging Face causal language model that can load locally with `output_attentions=True`. For example:

```bash
SYNAPSE_HF_MODEL_NAME=gpt2
SYNAPSE_PRELOAD_SHADOW_MODEL=true
```

For same-family Phi-3 tracing, use a real Hugging Face Phi-3 repo id, but expect much higher CPU/RAM cost:

```bash
SYNAPSE_HF_MODEL_NAME=microsoft/Phi-3-mini-4k-instruct
SYNAPSE_HF_TRUST_REMOTE_CODE=true
```

If exact tracing matters more than speed, run with:

```json
"execution_mode": "faithful"
```

Faithful mode must not fall back to Ollama. If the Hugging Face tokenizer/model is not loaded, the backend should return an error instead of producing proxy text and calling it exact.

If speed matters and you accept proxy evidence, run with:

```json
"execution_mode": "auto"
```

The Synapse Visualizer draws nodes from `topology.layers`. If `topology.layers` is empty, the graph can only show:
- Prompt Ingress
- Response Egress
That happened because SYNAPSE_HF_MODEL_NAME=phi3:latest was not loadable by Hugging Face. No tokenizer, no model object, no transformer layer inspection, and no attention hooks meant there were no real layer nodes to draw.
The graph is generated like this:
- Backend loads a Hugging Face model.
- Backend inspects attention modules to build `ModelTopology`.
- Frontend receives `topology.layers` and creates one visual node per layer.
- During generation, backend captures each token step as `AttentionTrace.steps`.
- Frontend highlights the layer/head route from the latest step.
- OpenMetadata receives the same route as lineage when connected.
If the UI says 0 layers, it is not showing a real black-box layer graph yet. If it says 12 layers / 144 total heads for gpt2, it is showing a real traced GPT-2 model. If you switch to another real Hugging Face model, the visualizer will show that model's real layers and heads.
Faithful mode uses the traced Hugging Face model as the generator. That is necessary for exact token-level evidence, because the output and the internal activations must come from the same forward passes. GPT-2 is real and traceable, but it is not an instruction-tuned assistant model. If you ask it to explain a system design, it may continue text in a strange completion style.
The probe console now separates the two goals:
- Exact Trace: short deterministic Hugging Face generation for real layer/head evidence.
- Readable Answer: longer Ollama generation for better prose, with proxy or shadow evidence.
Use Exact Trace when you care about real internals. Use Readable Answer when you care about natural language quality.
- Trace Model: chooses the real Hugging Face model whose layers and heads will be traced.
- Tokens: limits how many new tokens are generated. Use low values such as 16-64 for exact tracing.
- Temp: controls randomness. Use `0` for repeatable tracing.
- Top P: controls sampling nucleus. Leave near `0.95` unless testing generation behavior.
- System Prompt: sets role/instruction text. Base GPT-2 models may not obey it well.
- User Prompt: the actual input whose token route is traced.
- Exact Trace preset: short, deterministic Hugging Face run for real evidence.
- Readable Answer preset: longer Ollama run for better prose, but proxy evidence unless a matching shadow trace exists.
This panel is empty until something is actually quarantined. There are two ways a head can appear there:
- OpenMetadata governance: tag a layer table or head column with `SynapseQuarantine.DEFECTIVE`, then click Sync Defects.
- Local demo mask: select a layer with active heads and click Quarantine Top Head.
The mask is applied to future traced runs, not retroactively to the trace already displayed. After quarantining a head, run another Faithful probe. The selected head should appear as masked in the layer trace and activation chart.
If the panel says "No heads are currently quarantined", that is not an error. It means OpenMetadata has no DEFECTIVE tag for this model/head and no local demo mask has been set.
The header metric:
Masked Heads
1
OM synchronized
means the runtime currently has one head in its mask list and OpenMetadata is reachable. It does not necessarily mean the already-visible trace was generated with that mask. Masks affect the next generation. Run another Faithful probe to see the selected layer report the masked head.
There are two related states:
- Runtime mask list: heads that will be masked on the next traced run.
- Current trace masked heads: heads that were actually masked during the displayed token steps.
If Masked Heads is 1 but the selected layer says no masked heads, usually one of these is true:
- the masked head belongs to a different layer;
- the trace on screen was generated before the mask was added;
- the mask came from OpenMetadata after the trace completed.
Sync Defects reads OpenMetadata for SynapseQuarantine.DEFECTIVE tags on layer tables or head columns. It then converts those tags into the backend runtime mask list. On the next faithful generation, the backend zeroes the matching attention-head output before projection, and the trace marks that head as masked.
Sync Defects does not create tags by itself. It only imports existing governance decisions from OpenMetadata into the running model proxy.
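The masking step itself can be sketched without any tensor library. The example below assumes the per-head outputs for one token are available as a `[num_heads][head_dim]` list and shows zeroing a head before the heads are concatenated for the output projection; the function names are hypothetical and the real hook operates on model tensors.

```python
from typing import List, Set


def mask_heads(per_head_output: List[List[float]], masked: Set[int]) -> List[List[float]]:
    """Zero the output vector of every masked head; copy the others unchanged."""
    return [
        [0.0] * len(vec) if head in masked else list(vec)
        for head, vec in enumerate(per_head_output)
    ]


def concat_heads(per_head_output: List[List[float]]) -> List[float]:
    """Flatten [num_heads][head_dim] into the vector fed to the output projection."""
    return [x for vec in per_head_output for x in vec]


heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 heads, head_dim = 2
projected_input = concat_heads(mask_heads(heads, masked={1}))

print(projected_input)  # [1.0, 2.0, 0.0, 0.0, 5.0, 6.0]
```

Because the zeroing happens before projection, the masked head contributes nothing to the residual stream while the other heads in the layer are left intact.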
Prompt:
List inventions or practical devices Albert Einstein is credited with, and briefly explain each.
Faithful mode flow:
- Backend loads the Hugging Face model and tokenizer.
- The prompt is rendered with the configured chat template.
- Each generated token runs through the hooked transformer.
- Hooks capture the last-token attention matrix for every supported layer.
- Top heads and source tokens are summarized into `TokenStepCapture`.
- Final `AttentionTrace` is returned with `trace_fidelity="exact"`.
- Evidence quality should be higher because generation and tracing came from the same run.
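The summarization step in this flow can be sketched as follows. It assumes `attn[layer][head]` holds the newest token's attention weights over the source tokens and scores each head by its peak weight; that scoring rule and the dictionary layout are assumptions, not the repo's `TokenStepCapture` schema.

```python
from typing import Dict, List


def summarize_step(attn: List[List[List[float]]], tokens: List[str], top_k: int = 2) -> List[Dict]:
    """Rank heads by peak last-token attention and name the attended source token."""
    scored = []
    for layer, heads in enumerate(attn):
        for head, weights in enumerate(heads):
            # Index of the source token receiving the most attention.
            peak = max(range(len(weights)), key=weights.__getitem__)
            scored.append({
                "layer": layer,
                "head": head,
                "score": weights[peak],
                "source_token": tokens[peak],
            })
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]


tokens = ["The", "sky", "is"]
attn = [
    [[0.2, 0.7, 0.1], [0.4, 0.3, 0.3]],    # layer 0, heads 0 and 1
    [[0.1, 0.1, 0.8], [0.5, 0.25, 0.25]],  # layer 1, heads 0 and 1
]
top = summarize_step(attn, tokens)
for item in top:
    print(item)
# {'layer': 1, 'head': 0, 'score': 0.8, 'source_token': 'is'}
# {'layer': 0, 'head': 0, 'score': 0.7, 'source_token': 'sky'}
```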
Fast mode flow:
- Ollama streams the answer quickly.
- Hugging Face may run a shadow trace.
- The backend compares Ollama output with shadow output using `match_score`.
- If the score is extremely high, evidence may be promoted. Otherwise it remains proxy.
- Evidence quality lists the remaining gaps.
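How `match_score` is computed is not specified here. One plausible stand-in is a token-level similarity ratio from the standard library's `difflib`, shown below; both the metric and the 0.98 promotion threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def match_score(generated: str, shadow: str) -> float:
    """Whitespace-token similarity in [0, 1] between two outputs (assumed metric)."""
    return SequenceMatcher(None, generated.split(), shadow.split()).ratio()


def may_promote(score: float, threshold: float = 0.98) -> bool:
    """Promotion is considered only when the outputs nearly coincide."""
    return score >= threshold


same = match_score("the sky is blue", "the sky is blue")
close = match_score("the sky is blue", "the sky is green")

print(same, may_promote(same))    # 1.0 True
print(close, may_promote(close))  # 0.75 False
```

A single differing token out of four already drops the ratio to 0.75, which shows why shadow traces usually stay at proxy level.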
Backend:

```bash
cd backend
python -m uvicorn app.main:app --reload --port 8000
```

State:

```bash
curl -sS http://127.0.0.1:8000/api/v1/state | jq .
```

Faithful exact run:
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "List inventions or practical devices Albert Einstein is credited with, and briefly explain each.",
    "max_new_tokens": 120,
    "temperature": 0.1,
    "top_p": 0.95,
    "stop": [],
    "stream": false,
    "execution_mode": "faithful"
  }' | jq '.response.trace | {trace_fidelity, match_score, evidence_quality, summary}'
```

Fast/proxy run:
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "List inventions or practical devices Albert Einstein is credited with, and briefly explain each.",
    "max_new_tokens": 120,
    "temperature": 0.1,
    "top_p": 0.95,
    "stop": [],
    "stream": false,
    "execution_mode": "auto"
  }' | jq '.response.trace | {generation_backend, trace_fidelity, match_score, evidence_quality}'
```

Preload Hugging Face tracer:
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/hf/preload | jq .
```

Sync defective heads from OpenMetadata:

```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/openmetadata/sync-defects | jq .
```

- Does `/api/v1/state` report topology when the Hugging Face tracer is loaded?
- Does faithful mode return `trace_fidelity="exact"`?
- Does fast Ollama mode return `proxy` unless the shadow output matches?
- Does every final trace include `evidence_quality`?
- Do captured steps include layers, top heads, source tokens, and high-activation paths?
- Does OpenMetadata create model/layer/head assets?
- Does tagging a head as `DEFECTIVE` appear in the backend masked-head list after sync?
- Does a later traced run mark that head as masked?
- Does the UI show evidence quality and black-box gaps?
To solve the black-box problem more completely, add these capabilities:
- Ablation validation service
  - Rerun the same prompt with selected heads masked.
  - Measure output divergence, logit delta, and answer-quality change.
  - Mark a head as suspicious only when effect size crosses a threshold.
- Replay service
  - Persist model revision, tokenizer revision, seed, decoding settings, library versions, and hardware.
  - Recreate any inference by `session_id`.
- Strong artifact governance
  - Redact or hash sensitive prompt tokens.
  - Encrypt trace artifacts.
  - Use short-lived signed URLs.
  - Define retention policies for full tensors vs summaries.
- Multi-method explanations
  - Combine attention telemetry with gradient saliency, integrated gradients, perturbation tests, and counterfactuals.
  - Store method-specific artifacts separately and compare agreement.
- Evaluation dashboard
  - Trace coverage percentage.
  - Proxy vs exact ratio.
  - Average evidence-quality score.
  - OpenMetadata ingestion failures.
  - Quarantine false positives.
  - Replay success rate.
- Safer governance workflow
  - Require ablation evidence before auto-masking in production.
  - Add approvals for high-impact model changes.
  - Keep an audit trail of who tagged, synced, and masked each head.
The solution is no longer a simple implementation. It has the core pieces of an AI black-box investigation platform:
- internal neural telemetry
- exact vs proxy trace labeling
- metadata lineage
- live dashboard
- governance tags
- head masking
- explicit evidence-quality reporting
The main remaining work is to move from "this head was active" to "this head was causally necessary." That requires ablation, replay, and multi-method agreement. The current implementation now names that gap clearly and gives operators the data needed to close it.
A fully honest black-box solution cannot mean "the system magically knows every reason for every neural computation." For modern LLMs, that claim would be fake. In this project, "full" means a stricter operational standard:
- Run the same Hugging Face model that is being traced.
- Capture the exact layer/head route for each generated token.
- Deterministically replay the same prompt.
- Mask a selected attention head.
- Compare baseline vs ablated output.
- Return a causal effect score and verdict.
The backend exposes this as:
```bash
curl -sS -X POST http://127.0.0.1:8000/api/v1/autopsy/causal \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Explain why masking a head can change a model response.",
    "trace_model_name": "gpt2",
    "max_new_tokens": 32
  }' | jq .
```

If `layer_index` and `head_index` are omitted, the service selects the highest-attention head from the baseline trace. If they are provided, it ablates that exact head.
The response includes:
- baseline exact trace
- ablated exact trace
- selected target head
- text similarity
- causal effect score
- verdict
This is the point where the solution moves from visualization to causal testing. A head is no longer merely "active"; it is tested by intervention.
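The target selection and scoring described above can be sketched as follows. The similarity metric (`difflib` over whitespace tokens), the definition of the effect score as one minus similarity, the 0.2 threshold, and the verdict strings are all assumptions; the service's actual formula may differ.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

Head = Tuple[int, int]  # (layer_index, head_index)


def select_target(
    head_scores: Dict[Head, float],
    layer_index: Optional[int] = None,
    head_index: Optional[int] = None,
) -> Head:
    """Use the explicit head when given, else the highest-scoring baseline head."""
    if layer_index is not None and head_index is not None:
        return layer_index, head_index
    return max(head_scores, key=head_scores.get)


def causal_verdict(baseline: str, ablated: str, threshold: float = 0.2) -> dict:
    """Compare baseline and ablated text; a larger divergence means a larger effect."""
    similarity = SequenceMatcher(None, baseline.split(), ablated.split()).ratio()
    effect = 1.0 - similarity
    verdict = "causal_effect_observed" if effect >= threshold else "no_significant_effect"
    return {
        "text_similarity": similarity,
        "causal_effect_score": effect,
        "verdict": verdict,
    }


baseline_scores = {(0, 1): 0.4, (3, 7): 0.9}
print(select_target(baseline_scores))        # (3, 7)
print(select_target(baseline_scores, 1, 2))  # (1, 2)
print(causal_verdict("the sky is blue", "the sky is green"))
# {'text_similarity': 0.75, 'causal_effect_score': 0.25, 'verdict': 'causal_effect_observed'}
```

Text similarity alone is a coarse signal; the logit delta mentioned in the roadmap would give a finer effect measure for heads whose ablation shifts probabilities without changing the sampled text.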