This repo is a collection of notebooks I've generated to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in agent-system-level interpretability, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.
The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.
Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.
| Part | Directory | What it is |
|---|---|---|
| 1: Techniques | notebook-syllabus/ | 12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, and model editing, all on toy, inspectable tasks. |
| 2: Reproductions | Reproductions/ | 7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing. |
| Capstone target | projects/tact-reproduction/ | The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data. |
The full paper list, with public sources, is in reading-list.md.
| Area | Technique | Where |
|---|---|---|
| Probing | linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes | syllabus 03, Reproductions 02 |
| Steering / RepE | contrast-vector steering via forward hooks, side-effect analysis | syllabus 06, Reproductions 03 |
| Sparse autoencoders | dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope) | syllabus 05, Reproductions 04 |
| Activation patching | path & attribution patching, causal interventions | syllabus 04 |
| Circuit discovery | ACDC-style pruning, induction-head case study | syllabus 07, 11 |
| Readout methods | logit lens, tuned lens, direct logit attribution | syllabus 02 |
| Knowledge localisation | ROME-style factual editing | syllabus 08 |
| Attribution graphs | transcoders + circuit tracing (Anthropic 2025) | Reproductions 07 |
| Black-box agent evals | Inspect / Petri, LLM-as-judge, scenario design | Reproductions 01 |
| Trajectory monitoring | per-turn projection, representation drift over long context | Reproductions 05 |
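
As a rough illustration of the "contrast-vector" and "per-turn projection" rows above, here is a minimal pure-Python sketch on made-up 2-D vectors. It is not the code from the notebooks or the TACT paper: the function names (`contrast_axis`, `project_turns`) and the toy data are hypothetical, and a difference-of-means axis is assumed as the simplest stand-in for a learned direction.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def contrast_axis(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("class means coincide; no contrast direction")
    return [x / norm for x in diff]


def project_turns(turn_activations, axis):
    """Scalar projection of each turn's activation vector onto the axis."""
    return [sum(a * u for a, u in zip(act, axis)) for act in turn_activations]


# Toy stand-ins for residual-stream activations (hypothetical data).
on_task = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2]]
drifted = [[-1.0, 0.0], [-1.0, 0.2], [-1.0, -0.2]]

# The axis points toward the "drifted" cluster.
axis = contrast_axis(drifted, on_task)

# One activation per conversation turn, moving from on-task toward drifted.
trajectory = [[1.0, 0.1], [0.5, 0.0], [-0.5, 0.3], [-1.0, 0.0]]
scores = project_turns(trajectory, axis)
print(scores)  # scores increase as the trajectory moves toward the drifted cluster
```

With real models the vectors would come from a chosen layer's activations rather than hand-written lists, and the axis would be validated (for example by AUROC on held-out data) before being used as a monitor.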
One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).
```bash
# from the repo root
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python -r requirements.txt
.venv/bin/python -m ipykernel install --user --name interpretability --display-name "Interpretability"
```

(Plain `python3.12 -m venv .venv && .venv/bin/pip install -r requirements.txt` works too.) Open any notebook and select the Interpretability kernel.
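
To confirm that the shared environment landed inside the `[4.51, 5.0)` window described above, a small check like the following can be run from the new kernel. This is a sketch, not part of the repo: the helper names (`release_tuple`, `in_shared_range`) are hypothetical, and it compares only the leading numeric release components, so a pre-release such as `4.51.0rc1` is treated like `4.51.0`.

```python
import re
from importlib import metadata

LOWER = (4, 51)  # inclusive lower bound: Qwen3 support
UPPER = (5, 0)   # exclusive upper bound: TransformerLens compatibility


def release_tuple(version):
    """Leading numeric components of a version string, e.g. '4.57.6' -> (4, 57, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def in_shared_range(version):
    """True if the version falls in the half-open window [4.51, 5.0)."""
    return LOWER <= release_tuple(version) < UPPER


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        status = "inside" if in_shared_range(installed) else "outside"
        print(f"transformers {installed} is {status} the [4.51, 5.0) window")
```

A separate environment pinned for the Part 1 notebooks would intentionally report "outside" here.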
A few notebooks pull heavier or GPU-oriented tools installed separately — circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), inspect-petri (rung 01). Each notebook says so above the relevant cell.
If a Part 1 notebook misbehaves on `transformers` 4.57.x, pin `transformers==4.46.3` (the version those notebooks were originally verified against) in a separate environment.
```bash
# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q
```

```text
notebook-syllabus/            Part 1: 12 white-box technique notebooks + shared helpers
Reproductions/                Part 2: 7 reproduction rungs (each: notebook, README, tests)
projects/tact-reproduction/   capstone target: synthetic TACT mechanism + tested axis math
reading-list.md               every paper, with public sources
requirements.txt              the shared environment
```