tbtommyb/interpretability

Interpretability & Evals — a self-directed course

This repo is a collection of notebooks I've generated to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in agent-system-level interpretability, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.

The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.

Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.

What's here

| Part | Directory | What it is |
| --- | --- | --- |
| 1: Techniques | notebook-syllabus/ | 12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, model editing; all on toy, inspectable tasks. |
| 2: Reproductions | Reproductions/ | 7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing. |
| Capstone target | projects/tact-reproduction/ | The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data. |

The full paper list, with public sources, is in reading-list.md.

Techniques and papers covered

| Area | Technique | Where |
| --- | --- | --- |
| Probing | linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes | syllabus 03, Reproductions 02 |
| Steering / RepE | contrast-vector steering via forward hooks, side-effect analysis | syllabus 06, Reproductions 03 |
| Sparse autoencoders | dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope) | syllabus 05, Reproductions 04 |
| Activation patching | path & attribution patching, causal interventions | syllabus 04 |
| Circuit discovery | ACDC-style pruning, induction-head case study | syllabus 07, 11 |
| Readout methods | logit lens, tuned lens, direct logit attribution | syllabus 02 |
| Knowledge localisation | ROME-style factual editing | syllabus 08 |
| Attribution graphs | transcoders + circuit tracing (Anthropic 2025) | Reproductions 07 |
| Black-box agent evals | Inspect / Petri, LLM-as-judge, scenario design | Reproductions 01 |
| Trajectory monitoring | per-turn projection, representation drift over long context | Reproductions 05 |
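
The steering and trajectory-monitoring rows share one core operation: build a direction from two contrasting sets of activations, then project later states onto it. The sketch below is a toy illustration in plain Python with made-up 3-dimensional "activations". The function names (`mean_vector`, `contrast_direction`, `project`) and all data are hypothetical and not taken from the notebooks, which work on real model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def contrast_direction(positive, negative):
    """Unit-length difference of the two class means (a simple contrast vector)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; there is no direction to extract")
    return [d / norm for d in diff]

def project(state, direction):
    """Scalar projection of one state onto a unit direction."""
    return sum(s * d for s, d in zip(state, direction))

# Hypothetical toy data: two "on-task" and two "drifted" activation vectors.
on_task = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
drifted = [[-1.0, 0.0, 0.0], [-3.0, 0.0, 0.0]]
direction = contrast_direction(on_task, drifted)

# One vector per conversation turn; the score falls as the state drifts.
trajectory = [[2.0, 1.0, 0.0], [0.5, 1.0, 0.0], [-1.5, 1.0, 0.0]]
scores = [project(turn, direction) for turn in trajectory]
```

With this toy data the direction is the first axis, so each turn's score is simply its first coordinate. A monitor in this style would flag a turn when the score crosses a chosen threshold.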

Setup

One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).

# from the repo root
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python -r requirements.txt
.venv/bin/python -m ipykernel install --user --name interpretability --display-name "Interpretability"

(Plain python3.12 -m venv .venv && .venv/bin/pip install -r requirements.txt works too.) Open any notebook and select the Interpretability kernel.

A few notebooks pull in heavier or GPU-oriented tools that are installed separately: circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), and inspect-petri (rung 01). Each notebook says so above the relevant cell.

If a Part 1 notebook misbehaves on transformers 4.57.x, pin transformers==4.46.3 (the version those notebooks were originally verified against) in a separate environment.
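
The version constraint above is a half-open range, [4.51, 5.0). As a quick sanity check, a minimal sketch of that comparison follows. The helper names `parse_release` and `in_supported_range` are hypothetical and not part of this repo. The parser is deliberately simplified: it reads only the leading digits of the first three components, so pre-release and dev suffixes are ignored.

```python
def parse_release(version):
    """Return the leading numeric (major, minor, patch) of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1" or "dev0"
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat "4.51" as "4.51.0"
    return tuple(parts)

def in_supported_range(version, lower=(4, 51, 0), upper=(5, 0, 0)):
    """True when lower <= version < upper, i.e. the range [4.51, 5.0)."""
    return lower <= parse_release(version) < upper
```

To check the live environment, pass in the installed version string, for example `in_supported_range(importlib.metadata.version("transformers"))` after `import importlib.metadata`.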

Running the tests

# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q
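
To run every rung's tests in one go, something like the following sketch could work. It assumes each rung directory under Reproductions/ contains a tests/ folder, as the layout below describes. The helper names `find_rungs`, `run_rung_tests`, and `run_all` are hypothetical and not part of the repo.

```python
import subprocess
import sys
from pathlib import Path

def find_rungs(root):
    """Return directories under `root` that contain a tests/ folder, sorted by name."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if (p / "tests").is_dir())

def run_rung_tests(rung):
    """Run `python -m pytest tests -q` from inside the rung; return the exit code."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests", "-q"],
        cwd=rung,
    )
    return result.returncode

def run_all(root="Reproductions"):
    """Run every rung's tests and return the names of rungs that failed."""
    return [rung.name for rung in find_rungs(root) if run_rung_tests(rung) != 0]
```

Calling `run_all()` from the repo root with the course environment active returns an empty list when every rung passes.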

Layout

notebook-syllabus/       Part 1 — 12 white-box technique notebooks + shared helpers
Reproductions/           Part 2 — 7 reproduction rungs (each: notebook, README, tests)
projects/tact-reproduction/   capstone target — synthetic TACT mechanism + tested axis math
reading-list.md          every paper, with public sources
requirements.txt         the shared environment

About

Notebooks to develop skills and reproduce key LLM interpretability papers
