Interpretability & Evals — a self-directed course

This repo is a collection of notebooks I've generated to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in agent-system-level interpretability, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.

The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.

Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.

What's here

| Part | Directory | What it is |
| --- | --- | --- |
| 1: Techniques | `notebook-syllabus/` | 12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, model editing, all on toy, inspectable tasks. |
| 2: Reproductions | `Reproductions/` | 7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing. |
| Capstone target | `projects/tact-reproduction/` | The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data. |
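
To give a feel for what "synthetic drift geometry" and "axis math" can mean, here is a minimal sketch in plain Python. It is not the code in `projects/tact-reproduction/` and not the construction from the TACT paper: the difference-of-means axis, the 8-dimensional Gaussian clusters, and every function name below are assumptions chosen for illustration.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_axis(on_task, drifted):
    """Unit-length difference-of-means direction pointing from on-task to drifted."""
    diff = [d - o for o, d in zip(mean_vector(on_task), mean_vector(drifted))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide, so no axis can be defined")
    return [x / norm for x in diff]


def project(vec, axis, origin):
    """Signed distance of vec from origin, measured along axis."""
    return sum((v - o) * a for v, o, a in zip(vec, origin, axis))


def sample(rng, centre, n, noise=0.1):
    """Draw n noisy points around centre (isotropic Gaussian noise)."""
    return [[c + rng.gauss(0.0, noise) for c in centre] for _ in range(n)]


# Hypothetical synthetic setup: drift is a unit shift along coordinate 0.
DIM = 8
rng = random.Random(0)
on_task = sample(rng, [0.0] * DIM, 50)
drifted = sample(rng, [1.0] + [0.0] * (DIM - 1), 50)

axis = unit_axis(on_task, drifted)
origin = mean_vector(on_task)

scores_on = [project(v, axis, origin) for v in on_task]
scores_drift = [project(v, axis, origin) for v in drifted]

print("mean on-task score:", sum(scores_on) / len(scores_on))
print("mean drifted score:", sum(scores_drift) / len(scores_drift))
```

With real data, the synthetic `on_task` and `drifted` lists would be replaced by activation vectors collected from a model; the projection step itself would stay the same.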

The full paper list, with public sources, is in reading-list.md.

Techniques and papers covered

| Area | Technique | Where |
| --- | --- | --- |
| Probing | linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes | syllabus 03, Reproductions 02 |
| Steering / RepE | contrast-vector steering via forward hooks, side-effect analysis | syllabus 06, Reproductions 03 |
| Sparse autoencoders | dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope) | syllabus 05, Reproductions 04 |
| Activation patching | path & attribution patching, causal interventions | syllabus 04 |
| Circuit discovery | ACDC-style pruning, induction-head case study | syllabus 07, 11 |
| Readout methods | logit lens, tuned lens, direct logit attribution | syllabus 02 |
| Knowledge localisation | ROME-style factual editing | syllabus 08 |
| Attribution graphs | transcoders + circuit tracing (Anthropic 2025) | Reproductions 07 |
| Black-box agent evals | Inspect / Petri, LLM-as-judge, scenario design | Reproductions 01 |
| Trajectory monitoring | per-turn projection, representation drift over long context | Reproductions 05 |
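
AUROC appears in the Probing row as the metric for probe quality. As a reminder of what it measures, here is a small dependency-free sketch: the probability that a randomly chosen positive example gets a higher probe score than a randomly chosen negative one, with ties counted as half. The function name and the toy scores are illustrative; the notebooks may well use a library implementation instead.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    scores: probe outputs, higher meaning "more likely positive".
    labels: 0/1 ground truth, same length as scores.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy probe scores: one positive (0.35) is ranked below one negative (0.4).
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The double loop is quadratic in the number of examples, which is fine for small evaluation sets; a rank-based formulation gives the same value faster on large ones.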

Setup

One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).

# from the repo root
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python -r requirements.txt
.venv/bin/python -m ipykernel install --user --name interpretability --display-name "Interpretability"

(Plain python3.12 -m venv .venv && pip install -r requirements.txt works too.) Open any notebook and select the Interpretability kernel.
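
If an import fails after installation, it can help to confirm that the installed transformers version really falls inside the [4.51, 5.0) window described above. The helper below is a rough sketch using only the standard library; `parse_release` and `in_supported_range` are names made up for this example, and the parser only reads the leading numeric part of the version string (it ignores suffixes such as `rc1` or `.dev0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version

LOWER = (4, 51)  # inclusive lower bound, from the Qwen3 requirement
UPPER = (5, 0)   # exclusive upper bound, from the TransformerLens requirement


def parse_release(version_string):
    """Return the leading numeric components of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_supported_range(version_string):
    """True if the version lies in [4.51, 5.0)."""
    return LOWER <= parse_release(version_string) < UPPER


try:
    installed = version("transformers")
    print(installed, "supported" if in_supported_range(installed) else "outside [4.51, 5.0)")
except PackageNotFoundError:
    print("transformers is not installed in this environment")
```

Run it with the environment's interpreter (for example `.venv/bin/python`) so that it inspects the same packages the notebooks will see.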

A few notebooks need heavier or GPU-oriented tools that are installed separately: circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), and inspect-petri (rung 01). Each notebook says so above the relevant cell.

If a Part 1 notebook misbehaves on transformers 4.57.x, pin transformers==4.46.3 (the version those notebooks were originally verified against) in a separate environment.

Running the tests

# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q

Layout

notebook-syllabus/       Part 1 — 12 white-box technique notebooks + shared helpers
Reproductions/           Part 2 — 7 reproduction rungs (each: notebook, README, tests)
projects/tact-reproduction/   capstone target — synthetic TACT mechanism + tested axis math
reading-list.md          every paper, with public sources
requirements.txt         the shared environment
