A series of private 1-on-1 tutoring sessions in the style of Richard Feynman, covering how large language models adapt their behavior (from prompting to fine-tuning to cutting-edge parameter-efficient methods), along with composable AI review systems, cryptographic protocols, and the statistical foundations of machine learning.
- First-principles reasoning with vivid, mechanism-based analogies
- One concept at a time, with conceptual questions to verify understanding
- Socratic method: wrong answers are met with new analogies, not corrections
- Analogies adjusted to the student's learning style (mechanical/engineering-oriented)
| Analogy | Concept |
|---|---|
| The Sculpture | Pre-trained model weights — carved during training, frozen afterward |
| The Stone | Parametric memory (weights) — permanent knowledge |
| The Water | Contextual memory (prompts) — temporary, flows through the sculpture |
| The Chisel | Training / gradient descent — reshapes the stone |
| The Ballroom Musician | The model's ability to read context and adapt output |
| The Piano | Pre-trained model; tuning = fine-tuning specific strings |
| The Bottleneck | LoRA's low-rank constraint — compressing adaptation through r dimensions |
| The Attachment | LoRA adapter — bolted onto the sculpture, removable |
| The TV Signal | Pre-trained weights as a complex broadcast; fine-tuning correction as a few knobs |
| The Dart Board | Bias-variance tradeoff — cluster center (bias) vs spread (variance) vs shaking wall (noise) |
| Eigenvectors / PCA | Orthogonal adapter training — decomposing useful adaptation into independent components |
| The Diagonal Bug | Interaction effects invisible to orthogonal specialists looking along single axes |
| The Ballroom Crowd vs Committee | Unstructured multi-reviewer output vs architected review pipeline |
| The Diamond vs Circle | L1 vs L2 constraint geometry — corners induce sparsity, smooth surfaces don't |
| The Orange Peel | Curse of dimensionality — in high dimensions, all volume is in the skin |
5 lessons — What are the two fundamental ways to change an LLM's behavior?
- What Does an LLM Actually "Know"? — Parametric vs contextual memory; the sculpture analogy
- In-Context Learning: The Art of the Reminder — Pattern activation, not learning; the ballroom musician
- Fine-Tuning: Rewiring the Brain — Picking up the chisel; catastrophic forgetting; the tug-of-war
- The Great Trade-Off — When to pour water vs carve stone; includes interlude on overfitting & dataset guidelines
- The Frontier: Where the Line Gets Blurry — The spectrum between ICL and fine-tuning; includes interlude on transformer weight anatomy (MLP vs attention)
5 lessons — How do you efficiently adapt a frozen model?
- The Core Intuition: Why Low-Rank? — Low-rank weight updates; A × B decomposition; includes bias-variance refresher
- The Mechanics: How LoRA Actually Works — Parallel path, bottleneck, initialization, scaling factor alpha, merge vs swap
- The Hyperparameters That Matter — Rank, alpha, layer targeting, learning rate; the practical starting recipe
- LoRA in Practice — QLoRA, adapter merging, multi-adapter serving, production decision tree
- The Frontier of Parameter-Efficient Methods — Prompt tuning, prefix tuning, adapter layers, DoRA, MoLoRA; why LoRA won
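The parallel-path mechanics from this module can be sketched in a few lines of numpy. The shapes, the zero-init convention for B, and the alpha/r scaling follow the standard LoRA recipe; all concrete values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4               # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))         # frozen pre-trained weight (the sculpture)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection (random init)
B = np.zeros((d, r))                # trainable up-projection (zero init)

def lora_forward(x):
    # Parallel path: frozen W plus the low-rank update, scaled by alpha/r.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.normal(size=(1, d))
# Zero-init B makes the adapter a no-op at step 0: output == frozen model.
assert np.allclose(lora_forward(x), x @ W.T)

B = rng.normal(size=(d, r))         # stand-in for B after some training
# Merge for deployment: fold the adapter into the base weight, so inference
# costs exactly one matmul again. Keeping B and A separate instead allows
# hot-swapping adapters over one shared base model.
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(x @ W_merged.T, lora_forward(x))
```

The last assertion is the merge-vs-swap point in miniature: since (B @ A).T == A.T @ B.T, folding the adapter in changes nothing about the outputs.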
5 lessons — How do you build diverse AI review systems with guaranteed unique perspectives?
- The Blind Spot Problem — Why LLM review of LLM code has inherent limits; the shared manifold problem; diversity hypothesis
- Orthogonality: What It Means and Why It Guarantees Diversity — Projection constraints; the diagonal bug problem; includes interlude on training data & eigenvector analogy
- Building the Committee — Three-tier architecture (specialists, composition, prioritization); bootstrapping via mutation testing
- When Reviewers Disagree — Three categories of disagreement; confidence calibration; the disagreement matrix; composition model bias
- Quis Custodiet Ipsos Custodes? — Third-order blind spots; defense in depth; the Kegan developmental parallel
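One way to picture the projection constraints in this module is a plain Gram-Schmidt step in weight space; the flattened adapter directions and the dimension are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative: a flattened adapter-update direction

def orthogonalize(update, prior_directions):
    # Project a new specialist's update off the span of prior specialists
    # (a Gram-Schmidt step), forcing it to learn along an independent
    # direction in weight space rather than rediscovering the same skills.
    u = update.copy()
    for v in prior_directions:       # priors are unit-norm and orthogonal
        u -= (u @ v) * v
    return u / np.linalg.norm(u)

v1 = orthogonalize(rng.normal(size=d), [])
v2 = orthogonalize(rng.normal(size=d), [v1])
v3 = orthogonalize(rng.normal(size=d), [v1, v2])
```

The guarantee is geometric, not data-dependent: whatever the three specialists trained on, their directions are mutually orthogonal by construction.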
5 lessons — How do you build a defensible AI startup on the verification committee thesis?
- The Generalization: From Code Review to Universal Verification Committees — Domain suitability framework; scoring domains; why crypto is the best first vertical
- The TAM: Mapping the Opportunity Space — Bottom-up TAM, continuous monitoring market creation, wedge strategy, data flywheel, pricing paradox
- Formal Verification of Cryptographic Protocols with Lean 4 — The four levels of correctness; why LLMs fail at Level 4; orthogonal adapters for verification gaps; mutation of specs and proofs
- The Moat Question: What Frontier Labs Can't Copy — Five layers of moat; horizontal tax; reputation ratchet; winner-take-most dynamics
- Building the Company: Architecture as Strategy — Architecture-to-strategy mapping; go-to-market; team; fundraising; 18-month roadmap; the deepest risk
6 lessons — How do you build a system that bootstraps from cloud dependence to local autonomy?
- The Harness: Orchestrating Cloud and Local — Cascade routing, cost-quality curve, sending failures to the teacher
- Distillation from First Principles — Dark knowledge, soft targets, temperature; selective distillation into LoRA specialists; sequence-level distillation from cloud APIs
- Distillation into LoRA: Merging Teacher Knowledge with Task Adaptation — Rank expansion for dual signals; four combination strategies; LR/epoch/sampling balance
- Micro Fine-Tuning: Learning While Serving — Quality filter, replay buffer, micro learning rate, EWC anchoring, validation gate, three firewalls against model collapse
- The Full Architecture — Seven data flows, graceful degradation, build sequence, composition model co-evolution
- The Bootstrap Paradox and the Economic Inflection — Learning starvation, sawtooth improvement, proactive red-teaming, when student surpasses teacher, AGI as civilization
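The soft-target machinery in the distillation lessons can be sketched in numpy. The logits are invented, and the T-squared scaling follows the convention from Hinton et al.'s distillation paper:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing "dark knowledge"
    # in the teacher's near-miss logits.
    z = z / T
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([6.0, 2.5, 1.0])   # illustrative

hard = softmax(teacher_logits, T=1.0)        # near one-hot
soft = softmax(teacher_logits, T=4.0)        # runner-up classes get real mass

def distill_loss(student_logits, teacher_logits, T):
    # Cross-entropy of student soft predictions against teacher soft targets,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -(T ** 2) * np.sum(p * np.log(q + 1e-12))
```

At T=4 the second class holds roughly eight times the probability mass it had at T=1, which is exactly the inter-class similarity signal a student cannot get from hard labels.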
5 lessons — Math foundations through core crypto primitives (for understanding the BABE protocol)
- The Lock and Key Universe — Security parameter, negligible functions, PPT adversaries, security games, why negligible must be exponential
- Finite Fields and Modular Arithmetic — Clock arithmetic, why primes give division, generators, discrete log problem
- Hash Functions and the Random Oracle — Hash properties, ROM, notation boot camp (sampling, probability, oracle access)
- Digital Signatures and Security Games — EUF-CMA, Lamport one-time signatures, the Lamport=GC coincidence, dot notation for oracles
- Polynomials and Arithmetic Circuits — Schwartz-Zippel, R1CS, the polynomial bridge to Groth16, CRS and trusted setup
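A runnable toy of the Lamport one-time scheme from the signatures lesson, with SHA-256 standing in for the generic hash; key sizes and encoding details are simplified:

```python
import hashlib
import os

H = lambda b: hashlib.sha256(b).digest()
N = 256  # one secret pair per bit of the message digest

def keygen():
    # Secret key: 256 pairs of random preimages; public key: their hashes.
    sk = [(os.urandom(32), os.urandom(32)) for _ in range(N)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def sign(sk, msg):
    bits = bin(int.from_bytes(H(msg), "big"))[2:].zfill(N)
    # Reveal exactly one preimage per digest bit. Signing a second message
    # leaks preimages from both halves, which is why the scheme is one-time.
    return [sk[i][int(b)] for i, b in enumerate(bits)]

def verify(pk, msg, sig):
    bits = bin(int.from_bytes(H(msg), "big"))[2:].zfill(N)
    return all(H(s) == pk[i][int(b)] for i, (b, s) in enumerate(zip(bits, sig)))

sk, pk = keygen()
sig = sign(sk, b"attack at dawn")
```

Forging a signature on a new message requires inverting the hash on every bit position where the two digests differ.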
5 lessons — The algebraic machinery BABE is built from
- Elliptic Curves — Point addition, scalar multiplication, ECDLP, BN254, implicit notation [x]_s
- Bilinear Pairings — The pairing map, bilinearity, Groth16 verification equation, one-shot limitation
- SNARKs and Groth16 — Complete Groth16 flow (Gen/Prove/Verify), knowledge soundness, extractors, role in BABE
- Garbled Circuits — Wire labels, gate garbling, evaluation, free-XOR, half-gates, adaptive privacy
- Witness Encryption — Encrypt under NP statement, BABE's pairing-based WE, extractable security, the Lemma 10 fix, WE+GC=BABE
5 lessons — The complete protocol, its security proof, and the Lean mechanization
- Bitcoin as a Cryptographic Platform — UTXO model, locking scripts, 6-transaction graph, unstoppable transactions
- The BABE Construction — WE+GC split, randomized encodings, DRE, linearization by lifting, 1000x size reduction
- The Security Proof — Robustness, knowledge soundness, hybrid arguments, reduction chain, why four proof assistants
- The Mechanization — Four proof assistants, axiom boundaries, audit findings, trust surface, stopping criterion
- The Full Circle — Recursive verification, the complete map, biological parallels, "billions of years of pre-training followed by a lifetime of LoRAs"
5 lessons — Focused drilling on reductions, precise definitions, and component boundaries
- Reductions: The Technique — Four worked examples, the three-step recipe, student constructs a case-split reduction
- Reduction Drills (upcoming)
- The Confusion Matrix — Six confused concept pairs with precise distinctions and tests
- The BABE Component Map — Which operation belongs to which system, the math type test
- Remedial Exam — 15 focused questions, 12/15 clean, all major gaps closed
5 lessons — The mathematical foundations of generalization, estimation, and high-dimensional learning
- Bias, Variance, and Empirical Risk Minimization — Bias-variance decomposition derivation, dart board analogy, ERM pathologies (overfitting, distribution shift)
- Regularization — Ridge Regression and Sparsity — Ridge derivation with eigenvalue analysis, Bayesian interpretation, L1 sparsity (geometric and subdifferential arguments)
- VC Dimension and PAC Learning — Shattering, VC dimension examples, generalization bounds, PAC framework, sample complexity
- MLE, MAP, and Consistency — MLE consistency conditions, failure cases (mixtures, Neyman-Scott), MAP as regularized MLE, prior-penalty duality
- Dimensionality and KL Divergence — Curse of dimensionality (orange peel, distance concentration), KL asymmetry, forward vs reverse KL, JS divergence
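The diamond-vs-circle geometry shows up numerically in the two penalties' proximal operators on a scalar least-squares objective; the coefficients and penalty weight are illustrative:

```python
import numpy as np

def l1_prox(z, lam):
    # Soft-thresholding: argmin_w 0.5*(w - z)^2 + lam*|w|.
    # The corner of the L1 "diamond" at zero snaps small values exactly to 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_prox(z, lam):
    # Ridge shrinkage: argmin_w 0.5*(w - z)^2 + 0.5*lam*w^2.
    # The smooth L2 ball only rescales; nothing ever lands exactly on 0.
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -2.5])
lam = 0.5
sparse = l1_prox(z, lam)    # small entries become exactly zero
shrunk = l2_prox(z, lam)    # every entry survives, just smaller
```

This is the subdifferential argument in one line each: L1's objective has a kink at zero wide enough to absorb any |z| below lam, while the ridge objective is differentiable everywhere.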
5 lessons — How do optimizers navigate loss landscapes, and why does the "wrong" method often win?
- SGD, Generalization, and Saddle Points — SGD noise as implicit regularization, flat minima, SDE approximation, saddle points vs local minima in high dimensions
- Adam vs SGD, and Convergence Guarantees — Adam update rules, sharp minima with adaptive methods, AMSGrad, convergence conditions (L-smoothness, strong convexity, PL condition)
- Gradient Pathologies — Vanishing gradients (chain rule, sigmoid, ReLU, residuals, init), exploding gradients (clipping, batch norm, LSTM gating, transformers)
- Second-Order Methods — Hessian and curvature, Newton's method, why O(n²) is infeasible, L-BFGS, Hessian-free, natural gradient, Fisher information, KFAC
- Loss Landscape Geometry — Sharp vs flat minima, PAC-Bayes, Dinh et al. controversy, SAM, line search vs schedules, warmup mystery for transformers
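A tiny numeric illustration of why curvature matters, using a hand-picked ill-conditioned quadratic; Newton's one-step exactness holds only for quadratics:

```python
import numpy as np

# Ill-conditioned quadratic: f(w) = 0.5 * w^T H w, condition number 100.
H = np.diag([1.0, 100.0])
w0 = np.array([1.0, 1.0])

def gd(w, lr, steps):
    for _ in range(steps):
        w = w - lr * (H @ w)        # gradient of f is H @ w
    return w

# Stability forces lr < 2/L with L = 100, so the flat direction crawls:
# after 100 steps the stiff coordinate is gone but the flat one lingers.
w_gd = gd(w0, lr=0.015, steps=100)

# Newton rescales by the inverse Hessian and lands at the minimum in one
# step (exactly, because f is quadratic).
w_newton = w0 - np.linalg.solve(H, H @ w0)
```

The point of the second-order lesson is that `np.linalg.solve` against the full Hessian is exactly what is infeasible at neural-network scale, motivating L-BFGS, Hessian-free, and KFAC approximations.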
5 lessons — EM, MCMC, variational inference, Bayesian foundations, and when the machinery breaks
- The EM Algorithm — MLE with latent variables, Jensen's inequality, ELBO, E-step/M-step, monotonic improvement, GMM example, failure modes (local optima, singular covariances, intractable E-step)
- MCMC Methods — Metropolis-Hastings (propose-accept/reject, detailed balance, proposal tuning), Gibbs sampling (full conditionals, MH with acceptance=1, correlated variables problem), VI overview (optimization vs sampling, ELBO, mean-field, speed vs accuracy tradeoffs)
- The ELBO and VI Bias — Full ELBO derivation, log P(x) = ELBO + KL(Q||P), reverse KL is mode-seeking, mean-field misses correlations, normalizing flows as fix
- Bayesian Foundations — Exchangeability, de Finetti's theorem, Bayesian justification for parametric models, Dirichlet Process (CRP, stick-breaking), Gaussian Process (kernel as function prior)
- Priors and Posterior Collapse — Prior sensitivity, Bernstein-von Mises, Jeffreys prior, posterior collapse in VAEs (powerful decoder problem, KL annealing, free bits)
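A compact EM loop for a two-component 1-D GMM, asserting the monotonic-improvement guarantee on every iteration; the data are synthetic and the initialization is deliberately poor:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians; EM must recover them without labels.
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])          # poor init
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def loglik(x, pi, mu, sigma):
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

prev = -np.inf
for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE updates.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
    # Each iteration can only raise the ELBO, hence the log-likelihood.
    ll = loglik(x, pi, mu, sigma)
    assert ll >= prev - 1e-9
    prev = ll
```

The failure modes from the lesson live just outside this happy path: start both means on the same point and EM stalls at a local optimum; let one component claim a single data point and its variance collapses toward zero.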
5 lessons — Transformers, residual connections, overparameterization, scaling laws, and the phenomena that defy classical intuition (Q31–Q40)
- Transformers vs RNNs — Sequential bottleneck, hidden state compression, parallel processing, attention O(n²d) complexity, sparse/linear/FlashAttention solutions
- Positional Encoding and Normalization — Permutation equivariance without PE, sinusoidal/learned/RoPE/ALiBi, layer norm vs batch norm, pre-norm vs post-norm, RMSNorm
- Residual Connections and Overparameterization — y=F(x)+x gradient math, 2^L paths, ensemble interpretation, ODE connection, interpolation threshold, NTK, implicit regularization
- Lottery Tickets and Double Descent — Sparse subnetworks matching full performance, iterative magnitude pruning, supermasks, three regimes of double descent, epoch-wise double descent
- Scaling Laws and Why ReLU Dominates — Kaplan/Chinchilla power laws, compute-optimal frontier, sigmoid→tanh→ReLU→GELU arc, dying ReLU, why GELU won in transformers
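The dying-ReLU point can be checked numerically with finite-difference gradients, comparing ReLU against the tanh approximation of GELU that many transformer codebases use:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU (common in transformer implementations).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-3.0, -0.5, 0.5, 3.0])
eps = 1e-5
relu_grad = (relu(x + eps) - relu(x - eps)) / (2 * eps)
gelu_grad = (gelu(x + eps) - gelu(x - eps)) / (2 * eps)
# ReLU's gradient is exactly 0 for x < 0: a unit stuck there passes no
# learning signal and can stay dead forever. GELU leaks a small, smooth
# gradient on the negative side, so units can recover.
```

This is the mechanical core of the sigmoid to tanh to ReLU to GELU arc: each step trades a saturation or dead-zone pathology for a better-behaved gradient.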
5 lessons — From convolution priors to tokenization failures and hallucination (Q41–Q50)
- Why Convolution Works — Locality and stationarity priors, weight sharing (FC N⁴ vs conv k²), inductive bias, translation equivariance vs invariance, proof sketch
- Pooling and Dilated Convolutions — Receptive field growth, max vs avg pooling, strided conv, global average pooling, dilated convolutions, WaveNet, gridding artifact
- Feature Hierarchies and Word2Vec — Edges→textures→parts→objects, GradCAM, transfer learning, Skip-gram/CBOW, softmax bottleneck, king−man+woman≈queen
- Negative Sampling and Pretraining — Binary classification reframing, noise distribution P(w)∝freq^(3/4), NCE connection, BERT vs GPT (bidirectional vs causal), convergence toward autoregressive scale
- Tokenization and Hallucination — Character/word/subword (BPE), arithmetic and multilingual impact, byte-level BPE, five causes of hallucination, mitigation strategies (RAG, CoT, calibration)
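A toy version of the BPE training loop, on a tiny invented corpus; real byte-level BPE additionally handles raw bytes, special tokens, and tie-breaking rules:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Each word is a tuple of symbols with a frequency; greedily merge the
    # most frequent adjacent pair (the core of byte-pair encoding).
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, c in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, c in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += c
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "lowest": 3}   # invented frequencies
merges = bpe_train(corpus, 3)
```

Frequent substrings like "low" fuse into single tokens while rare suffixes stay split, which is exactly why digit strings and low-resource languages get carved up in ways that hurt arithmetic and multilingual performance.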
5 lessons — Bellman equations to deep RL instability, exploration, and sample efficiency (Q51–Q60)
- Bellman Equations and Dynamic Programming — Value function derivation, Bellman recursion, optimality equation, Q-function, policy iteration vs value iteration, contraction mapping convergence
- Q-Learning Instability and Exploration — Tabular convergence, three sources of DQN instability (correlated samples, non-stationary targets, maximization bias), the deadly triad, ε-greedy vs UCB vs Thompson Sampling
- Actor-Critic and Credit Assignment — REINFORCE variance, advantage function, TD error as advantage estimate, A2C/PPO/SAC, temporal credit assignment, eligibility traces, λ-return, HER
- On-Policy vs Off-Policy and Function Approximation Issues — Behavioral vs target policy, importance sampling variance, PPO clipping, SAC entropy, convergence hierarchy (tabular → linear → neural), Baird's counterexample
- Reward Shaping and Sample Efficiency — Potential-based shaping (Ng et al. theorem), telescoping argument, reward hacking, five root causes of sample inefficiency, model-based RL, offline RL
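The telescoping argument behind potential-based shaping can be verified numerically on an arbitrary trajectory, with random states, rewards, and potential function:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
T = 10
states = rng.integers(0, 5, size=T + 1)   # arbitrary trajectory over 5 states
rewards = rng.normal(size=T)
phi = rng.normal(size=5)                  # arbitrary potential function

# Potential-based shaping reward: F(s, s') = gamma * phi(s') - phi(s).
shaped = rewards + gamma * phi[states[1:]] - phi[states[:-1]]

disc = gamma ** np.arange(T)
G = (disc * rewards).sum()                # original discounted return
G_shaped = (disc * shaped).sum()          # shaped discounted return

# The shaping terms telescope: sum_t gamma^t (gamma*phi(s_{t+1}) - phi(s_t))
# collapses to gamma^T * phi(s_T) - phi(s_0), a policy-independent constant,
# which is why Ng et al.'s theorem says the optimal policy is unchanged.
```

Any shaping reward that is not of this potential-based form breaks the telescoping and opens the door to reward hacking.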
5 lessons — Why deep nets generalize, compression and stability views, information bottleneck, and the expressivity limits of GNNs (Q61–Q70)
- Why Deep Nets Generalize — Zhang et al. random labels puzzle, SGD's simplicity bias, function space perspective, implicit regularization (min-norm, min nuclear norm, LR as regularizer, early stopping)
- Compression and Stability — MDL principle, compression bounds, noise stability, PAC-Bayes, algorithmic stability, Hardt et al. SGD stability proof, connection to differential privacy
- Information Bottleneck and GNN Oversmoothing — Tishby's IB framework, two-phase training, Saxe et al. criticism, oversmoothing as low-pass filtering, eigenvalue convergence, DropEdge/PairNorm fixes
- Message Passing and Spectral vs Spatial GNNs — MPNN framework, 1-WL expressivity ceiling, graph Laplacian Fourier domain, ChebNet, GCN as 1st-order Chebyshev, spectral vs spatial comparison
- Graph Isomorphism and GNN Expressivity — GI complexity status, Babai's algorithm, GIN (injective aggregation = 1-WL), beyond-1-WL strategies (higher-order, random features, positional encodings)
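A minimal 1-WL color-refinement pass, plus the classic pair of graphs it cannot distinguish; Python's built-in hash stands in for an injective color-compression function within a single run:

```python
def wl_colors(adj, rounds=3):
    # 1-WL color refinement: repeatedly hash each node's color together
    # with the sorted multiset of its neighbors' colors.
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in adj}
    return sorted(colors.values())

# Two 6-node graphs 1-WL cannot tell apart: one 6-cycle vs two triangles.
# Both are 2-regular, so every node sees an identical neighbor multiset in
# every round, and the color histograms never diverge.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Since GIN's aggregation is at most as powerful as this refinement, it also cannot separate these two graphs; that ceiling is what the beyond-1-WL strategies in the lesson are designed to break.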
5 lessons — Causal inference, Simpson's paradox, class imbalance, data leakage, and the engineering of reliable ML systems (Q71–Q80)
- Causation Formalized — Pearl's SCMs, structural equations, DAGs, do-operator, adjustment formula, backdoor criterion, frontdoor criterion
- Instrumental Variables and Counterfactuals — Unmeasured confounders, IV conditions, 2SLS, weak instruments, potential outcomes Y(1)/Y(0), fundamental problem of causal inference, Pearl's three-level ladder
- Simpson's Paradox and Class Imbalance — Berkeley admissions, aggregate vs stratify (causal structure decides), SMOTE, focal loss, AUPRC vs accuracy, threshold tuning
- Data Leakage and Feature Engineering — Target/temporal/train-test leakage, detection and prevention, feature engineering techniques, feature selection methods
- Cross-Validation and Hyperparameter Tuning — K-fold, stratified, time series split, nested CV, grid vs random (Bergstra & Bengio), Bayesian optimization, learning rate as most important hyperparameter
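Bergstra and Bengio's argument in miniature: with a fixed budget of 9 trials on a toy objective where only one hyperparameter matters, a 3x3 grid tests just 3 distinct values of the important axis while random search tests 9. The objective and its optimum are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy objective: only hyperparameter a matters; b is irrelevant.
score = lambda a, b: -(a - 0.73) ** 2

n = 9
grid_a = np.repeat(np.linspace(0, 1, 3), 3)   # 3x3 grid: a in {0, 0.5, 1}
grid_b = np.tile(np.linspace(0, 1, 3), 3)
rand_a = rng.uniform(size=n)                  # 9 distinct values of a
rand_b = rng.uniform(size=n)

best_grid = max(score(a, b) for a, b in zip(grid_a, grid_b))
best_rand = max(score(a, b) for a, b in zip(rand_a, rand_b))
```

The grid burns two thirds of its budget re-testing the same three values of the parameter that matters; random search never repeats a value on any axis.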
6 lessons — Distributed training, memory/precision, adversarial robustness, fairness, privacy, generative models, RLHF, alignment, and open questions (Q81–Q100)
- Distributed Training — AllReduce communication overhead, sync vs async SGD, linear scaling rule, LARS/LAMB, data/tensor/pipeline parallelism, ZeRO, 3D parallelism
- Memory, Precision & Latency — Memory budget (params, gradients, optimizer, activations), gradient checkpointing, FP16/BF16 mixed precision, GPTQ/AWQ quantization, continuous batching, speculative decoding, paged attention, GQA
- Adversarial Robustness — Goodfellow linearity hypothesis, FGSM/PGD attacks, features-vs-bugs debate, adversarial training min-max, accuracy-robustness tradeoff, certified defenses, randomized smoothing
- Fairness and Bias — Demographic parity, equalized odds, calibration, Chouldechova impossibility theorem, bias taxonomy, disaggregated metrics, counterfactual fairness, proxy variables
- Privacy and Generative Models — (ε,δ)-differential privacy, DP-SGD, composition theorem, federated learning, GANs vs diffusion (stability, mode coverage, conditioning), why diffusion training is stable
- RLHF, Alignment & the Frontier — RLHF pipeline (SFT→reward model→PPO), DPO, Goodhart's law, mesa-optimization, Constitutional AI, CLIP, test-time compute, mechanistic interpretability, scaling laws, open questions
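The DPO objective mentioned above reduces to a few lines. The log-probabilities here are invented scalars; a real implementation would sum per-token log-probs of each response under the policy and a frozen reference model:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: push the policy's log-prob margin on (chosen, rejected) pairs
    # above the reference model's margin, with no explicit reward model
    # and no PPO loop. beta controls how far the policy may drift.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin))
```

With zero margin the loss is log 2; it falls as the policy favors the chosen response more than the reference does, which is the sense in which DPO collapses the reward-model-plus-PPO pipeline into a single classification-style loss.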
- An LLM's knowledge lives in its weights (stone) — permanent patterns carved during training
- In-context learning activates existing capabilities via the prompt (water) — nothing changes, nothing persists
- Fine-tuning modifies weights (picks up the chisel) — powerful but expensive and risky
- "Activating, not carving" — the student's own summary of ICL vs fine-tuning
- LoRA represents weight changes as low-rank matrices — because fine-tuning updates are empirically low-rank
- Orthogonal adapter training guarantees diverse perspectives — the constraint is in weight space, not data space
- Orthogonality is like PCA/eigenvectors — each adapter captures the next most important independent direction of useful adaptation
- A review committee needs both specialists (orthogonal) AND generalists (composition layer) to catch diagonal bugs
- The distinction between "activation" and "learning" is a matter of where you draw the line — it's learning all the way down
- The holy grail is continual learning — if LoRA adapters can be continually learned, merged, and composed, the line between training and adaptation dissolves
- Every prediction error decomposes into bias² + variance + noise — the pattern of failure tells you what to fix
- Regularization is prior knowledge in disguise — L2 = Gaussian prior, L1 = Laplace prior, MAP = MLE + regularization
- Generalization depends on the ratio of model capacity to data, not capacity alone — VC theory and PAC learning formalize this
- KL divergence is asymmetric by design — forward KL covers modes (blurry), reverse KL seeks modes (sharp) — this explains the VAE vs GAN divide
- EM is the gateway to variational methods — bound what you can't compute, optimize the bound — but it only works when the E-step is tractable
- Exchangeability justifies Bayesian modeling — de Finetti's theorem shows that if data order is irrelevant, the data behave as i.i.d. draws given a latent parameter, so a parameter and a prior over it must exist
- Posterior collapse in VAEs is a degenerate equilibrium, not a bug — the ELBO is doing what it's told, but a powerful decoder makes z unnecessary
- Mean-field VI underestimates uncertainty — factorization + reverse KL = mode-seeking with no correlations — calibrated uncertainty requires richer families or MCMC
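The error decomposition in these takeaways can be checked by simulation: fit a deliberately underpowered model many times on fresh data and compare measured error against bias squared plus variance plus noise. This is a toy setup, and the match is Monte Carlo approximate rather than exact:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)
noise_sd = 0.3
x_test = 1.0

# Fit many degree-1 polynomials on fresh noisy samples. The model is too
# simple for sin(x), so bias should dominate at this test point.
preds = []
for _ in range(2000):
    x = rng.uniform(0, np.pi, 20)
    y = true_f(x) + rng.normal(0, noise_sd, 20)
    coefs = np.polyfit(x, y, 1)
    preds.append(np.polyval(coefs, x_test))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x_test)) ** 2   # systematic miss, squared
variance = preds.var()                           # spread across retrainings
noise = noise_sd ** 2                            # irreducible label noise

# Measured expected squared error on fresh noisy labels at x_test:
y_test = true_f(x_test) + rng.normal(0, noise_sd, 2000)
mse = ((preds - y_test) ** 2).mean()
# mse should come out close to bias_sq + variance + noise.
```

The dart board reading: here the cluster center is far from the bullseye (high bias) while the cluster itself is tight (low variance), which is the failure pattern that says "use a bigger model", not "get more data".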
- Experience level: understands neural networks and transformer architecture
- Thinks mechanically — reasons about tokens, weights, and machinery rather than anthropomorphizing
- Prefers practical, engineering-grounded explanations over abstract theory
- Strong at synthesis — naturally connects ideas across lessons and extrapolates to frontier implications
- Pushes back on imprecise claims, leading to productive deeper discussions