A series of private 1-on-1 tutoring sessions in the style of Richard Feynman, covering how large language models adapt their behavior (from prompting to fine-tuning to cutting-edge parameter-efficient methods), along with composable AI review systems, cryptographic protocols, and the statistical foundations of machine learning.
- First-principles reasoning with vivid, mechanism-based analogies
- One concept at a time, with conceptual questions to verify understanding
- Socratic method: wrong answers are met with new analogies, not corrections
- Analogies adjusted to the student's learning style (mechanical/engineering-oriented)
| Analogy | Concept |
|---|---|
| The Sculpture | Pre-trained model weights — carved during training, frozen afterward |
| The Stone | Parametric memory (weights) — permanent knowledge |
| The Water | Contextual memory (prompts) — temporary, flows through the sculpture |
| The Chisel | Training / gradient descent — reshapes the stone |
| The Ballroom Musician | The model's ability to read context and adapt output |
| The Piano | Pre-trained model; tuning = fine-tuning specific strings |
| The Bottleneck | LoRA's low-rank constraint — compressing adaptation through r dimensions |
| The Attachment | LoRA adapter — bolted onto the sculpture, removable |
| The TV Signal | Pre-trained weights as a complex broadcast; fine-tuning correction as a few knobs |
| The Dart Board | Bias-variance tradeoff — cluster center (bias) vs spread (variance) vs shaking wall (noise) |
| Eigenvectors / PCA | Orthogonal adapter training — decomposing useful adaptation into independent components |
| The Diagonal Bug | Interaction effects invisible to orthogonal specialists looking along single axes |
| The Ballroom Crowd vs Committee | Unstructured multi-reviewer output vs architected review pipeline |
| The Diamond vs Circle | L1 vs L2 constraint geometry — corners induce sparsity, smooth surfaces don't |
| The Orange Peel | Curse of dimensionality — in high dimensions, all volume is in the skin |
5 lessons — What are the two fundamental ways to change an LLM's behavior?
- What Does an LLM Actually "Know"? — Parametric vs contextual memory; the sculpture analogy
- In-Context Learning: The Art of the Reminder — Pattern activation, not learning; the ballroom musician
- Fine-Tuning: Rewiring the Brain — Picking up the chisel; catastrophic forgetting; the tug-of-war
- The Great Trade-Off — When to pour water vs carve stone; includes interlude on overfitting & dataset guidelines
- The Frontier: Where the Line Gets Blurry — The spectrum between ICL and fine-tuning; includes interlude on transformer weight anatomy (MLP vs attention)
5 lessons — How do you efficiently adapt a frozen model?
- The Core Intuition: Why Low-Rank? — Low-rank weight updates; A × B decomposition; includes bias-variance refresher
- The Mechanics: How LoRA Actually Works — Parallel path, bottleneck, initialization, scaling factor alpha, merge vs swap
- The Hyperparameters That Matter — Rank, alpha, layer targeting, learning rate; the practical starting recipe
- LoRA in Practice — QLoRA, adapter merging, multi-adapter serving, production decision tree
- The Frontier of Parameter-Efficient Methods — Prompt tuning, prefix tuning, adapter layers, DoRA, MoLoRA; why LoRA won
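The parallel-path mechanics from this module can be sketched in a few lines of numpy. The shapes, the zero-init convention for B, and the alpha/r scaling follow the standard LoRA recipe; all concrete values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4               # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))         # frozen pre-trained weight (the sculpture)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection (random init)
B = np.zeros((d, r))                # trainable up-projection (zero init)

def lora_forward(x):
    # Parallel path: frozen W plus the low-rank update, scaled by alpha/r.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.normal(size=(1, d))
# Zero-init B makes the adapter a no-op at step 0: output == frozen model.
assert np.allclose(lora_forward(x), x @ W.T)

B = rng.normal(size=(d, r))         # stand-in for B after some training
# Merge for deployment: fold the adapter into the base weight, so inference
# costs exactly one matmul again. Keeping B and A separate instead allows
# hot-swapping adapters over one shared base model.
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(x @ W_merged.T, lora_forward(x))
```

The last assertion is the merge-vs-swap point in miniature: since (B @ A).T == A.T @ B.T, folding the adapter in changes nothing about the outputs.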
5 lessons — How do you build diverse AI review systems with guaranteed unique perspectives?
- The Blind Spot Problem — Why LLM review of LLM code has inherent limits; the shared manifold problem; diversity hypothesis
- Orthogonality: What It Means and Why It Guarantees Diversity — Projection constraints; the diagonal bug problem; includes interlude on training data & eigenvector analogy
- Building the Committee — Three-tier architecture (specialists, composition, prioritization); bootstrapping via mutation testing
- When Reviewers Disagree — Three categories of disagreement; confidence calibration; the disagreement matrix; composition model bias
- Quis Custodiet Ipsos Custodes? — Third-order blind spots; defense in depth; the Kegan developmental parallel
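One way to picture the projection constraints in this module is a plain Gram-Schmidt step in weight space; the flattened adapter directions and the dimension are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative: a flattened adapter-update direction

def orthogonalize(update, prior_directions):
    # Project a new specialist's update off the span of prior specialists
    # (a Gram-Schmidt step), forcing it to learn along an independent
    # direction in weight space rather than rediscovering the same skills.
    u = update.copy()
    for v in prior_directions:       # priors are unit-norm and orthogonal
        u -= (u @ v) * v
    return u / np.linalg.norm(u)

v1 = orthogonalize(rng.normal(size=d), [])
v2 = orthogonalize(rng.normal(size=d), [v1])
v3 = orthogonalize(rng.normal(size=d), [v1, v2])
```

The guarantee is geometric, not data-dependent: whatever the three specialists trained on, their directions are mutually orthogonal by construction.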
5 lessons — How do you build a defensible AI startup on the verification committee thesis?
- The Generalization: From Code Review to Universal Verification Committees — Domain suitability framework; scoring domains; why crypto is the best first vertical
- The TAM: Mapping the Opportunity Space — Bottom-up TAM, continuous monitoring market creation, wedge strategy, data flywheel, pricing paradox
- Formal Verification of Cryptographic Protocols with Lean 4 — The four levels of correctness; why LLMs fail at Level 4; orthogonal adapters for verification gaps; mutation of specs and proofs
- The Moat Question: What Frontier Labs Can't Copy — Five layers of moat; horizontal tax; reputation ratchet; winner-take-most dynamics
- Building the Company: Architecture as Strategy — Architecture-to-strategy mapping; go-to-market; team; fundraising; 18-month roadmap; the deepest risk
6 lessons — How do you build a system that bootstraps from cloud dependence to local autonomy?
- The Harness: Orchestrating Cloud and Local — Cascade routing, cost-quality curve, sending failures to the teacher
- Distillation from First Principles — Dark knowledge, soft targets, temperature; selective distillation into LoRA specialists; sequence-level distillation from cloud APIs
- Distillation into LoRA: Merging Teacher Knowledge with Task Adaptation — Rank expansion for dual signals; four combination strategies; LR/epoch/sampling balance
- Micro Fine-Tuning: Learning While Serving — Quality filter, replay buffer, micro learning rate, EWC anchoring, validation gate, three firewalls against model collapse
- The Full Architecture — Seven data flows, graceful degradation, build sequence, composition model co-evolution
- The Bootstrap Paradox and the Economic Inflection — Learning starvation, sawtooth improvement, proactive red-teaming, when student surpasses teacher, AGI as civilization
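The soft-target machinery in the distillation lessons can be sketched in numpy. The logits are invented, and the T-squared scaling follows the convention from Hinton et al.'s distillation paper:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing "dark knowledge"
    # in the teacher's near-miss logits.
    z = z / T
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([6.0, 2.5, 1.0])   # illustrative

hard = softmax(teacher_logits, T=1.0)        # near one-hot
soft = softmax(teacher_logits, T=4.0)        # runner-up classes get real mass

def distill_loss(student_logits, teacher_logits, T):
    # Cross-entropy of student soft predictions against teacher soft targets,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -(T ** 2) * np.sum(p * np.log(q + 1e-12))
```

At T=4 the second class holds roughly eight times the probability mass it had at T=1, which is exactly the inter-class similarity signal a student cannot get from hard labels.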
5 lessons — Math foundations through core crypto primitives (for understanding the BABE protocol)
- The Lock and Key Universe — Security parameter, negligible functions, PPT adversaries, security games, why negligible must be exponential
- Finite Fields and Modular Arithmetic — Clock arithmetic, why primes give division, generators, discrete log problem
- Hash Functions and the Random Oracle — Hash properties, ROM, notation boot camp (sampling, probability, oracle access)
- Digital Signatures and Security Games — EUF-CMA, Lamport one-time signatures, the Lamport=GC coincidence, dot notation for oracles
- Polynomials and Arithmetic Circuits — Schwartz-Zippel, R1CS, the polynomial bridge to Groth16, CRS and trusted setup
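A runnable toy of the Lamport one-time scheme from the signatures lesson, with SHA-256 standing in for the generic hash; key sizes and encoding details are simplified:

```python
import hashlib
import os

H = lambda b: hashlib.sha256(b).digest()
N = 256  # one secret pair per bit of the message digest

def keygen():
    # Secret key: 256 pairs of random preimages; public key: their hashes.
    sk = [(os.urandom(32), os.urandom(32)) for _ in range(N)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def sign(sk, msg):
    bits = bin(int.from_bytes(H(msg), "big"))[2:].zfill(N)
    # Reveal exactly one preimage per digest bit. Signing a second message
    # leaks preimages from both halves, which is why the scheme is one-time.
    return [sk[i][int(b)] for i, b in enumerate(bits)]

def verify(pk, msg, sig):
    bits = bin(int.from_bytes(H(msg), "big"))[2:].zfill(N)
    return all(H(s) == pk[i][int(b)] for i, (b, s) in enumerate(zip(bits, sig)))

sk, pk = keygen()
sig = sign(sk, b"attack at dawn")
```

Forging a signature on a new message requires inverting the hash on every bit position where the two digests differ.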
5 lessons — The algebraic machinery BABE is built from
- Elliptic Curves — Point addition, scalar multiplication, ECDLP, BN254, implicit notation [x]_s
- Bilinear Pairings — The pairing map, bilinearity, Groth16 verification equation, one-shot limitation
- SNARKs and Groth16 — Complete Groth16 flow (Gen/Prove/Verify), knowledge soundness, extractors, role in BABE
- Garbled Circuits — Wire labels, gate garbling, evaluation, free-XOR, half-gates, adaptive privacy
- Witness Encryption — Encrypt under NP statement, BABE's pairing-based WE, extractable security, the Lemma 10 fix, WE+GC=BABE
5 lessons — The complete protocol, its security proof, and the Lean mechanization
- Bitcoin as a Cryptographic Platform — UTXO model, locking scripts, 6-transaction graph, unstoppable transactions
- The BABE Construction — WE+GC split, randomized encodings, DRE, linearization by lifting, 1000x size reduction
- The Security Proof — Robustness, knowledge soundness, hybrid arguments, reduction chain, why four proof assistants
- The Mechanization — Four proof assistants, axiom boundaries, audit findings, trust surface, stopping criterion
- The Full Circle — Recursive verification, the complete map, biological parallels, "billions of years of pre-training followed by a lifetime of LoRAs"
5 lessons — Focused drilling on reductions, precise definitions, and component boundaries
- Reductions: The Technique — Four worked examples, the three-step recipe, student constructs a case-split reduction
- Reduction Drills (upcoming)
- The Confusion Matrix — Six confused concept pairs with precise distinctions and tests
- The BABE Component Map — Which operation belongs to which system, the math type test
- Remedial Exam — 15 focused questions, 12/15 clean, all major gaps closed
5 lessons — The mathematical foundations of generalization, estimation, and high-dimensional learning
- Bias, Variance, and Empirical Risk Minimization — Bias-variance decomposition derivation, dart board analogy, ERM pathologies (overfitting, distribution shift)
- Regularization — Ridge Regression and Sparsity — Ridge derivation with eigenvalue analysis, Bayesian interpretation, L1 sparsity (geometric and subdifferential arguments)
- VC Dimension and PAC Learning — Shattering, VC dimension examples, generalization bounds, PAC framework, sample complexity
- MLE, MAP, and Consistency — MLE consistency conditions, failure cases (mixtures, Neyman-Scott), MAP as regularized MLE, prior-penalty duality
- Dimensionality and KL Divergence — Curse of dimensionality (orange peel, distance concentration), KL asymmetry, forward vs reverse KL, JS divergence
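The diamond-vs-circle geometry shows up numerically in the two penalties' proximal operators on a scalar least-squares objective; the coefficients and penalty weight are illustrative:

```python
import numpy as np

def l1_prox(z, lam):
    # Soft-thresholding: argmin_w 0.5*(w - z)^2 + lam*|w|.
    # The corner of the L1 "diamond" at zero snaps small values exactly to 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_prox(z, lam):
    # Ridge shrinkage: argmin_w 0.5*(w - z)^2 + 0.5*lam*w^2.
    # The smooth L2 ball only rescales; nothing ever lands exactly on 0.
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -2.5])
lam = 0.5
sparse = l1_prox(z, lam)    # small entries become exactly zero
shrunk = l2_prox(z, lam)    # every entry survives, just smaller
```

This is the subdifferential argument in one line each: L1's objective has a kink at zero wide enough to absorb any |z| below lam, while the ridge objective is differentiable everywhere.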
5 lessons — How do optimizers navigate loss landscapes, and why does the "wrong" method often win?
- SGD, Generalization, and Saddle Points — SGD noise as implicit regularization, flat minima, SDE approximation, saddle points vs local minima in high dimensions
- Adam vs SGD, and Convergence Guarantees — Adam update rules, sharp minima with adaptive methods, AMSGrad, convergence conditions (L-smoothness, strong convexity, PL condition)
- Gradient Pathologies — Vanishing gradients (chain rule, sigmoid, ReLU, residuals, init), exploding gradients (clipping, batch norm, LSTM gating, transformers)
- Second-Order Methods — Hessian and curvature, Newton's method, why O(n²) is infeasible, L-BFGS, Hessian-free, natural gradient, Fisher information, KFAC
- Loss Landscape Geometry — Sharp vs flat minima, PAC-Bayes, Dinh et al. controversy, SAM, line search vs schedules, warmup mystery for transformers
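A tiny numeric illustration of why curvature matters, using a hand-picked ill-conditioned quadratic; Newton's one-step exactness holds only for quadratics:

```python
import numpy as np

# Ill-conditioned quadratic: f(w) = 0.5 * w^T H w, condition number 100.
H = np.diag([1.0, 100.0])
w0 = np.array([1.0, 1.0])

def gd(w, lr, steps):
    for _ in range(steps):
        w = w - lr * (H @ w)        # gradient of f is H @ w
    return w

# Stability forces lr < 2/L with L = 100, so the flat direction crawls:
# after 100 steps the stiff coordinate is gone but the flat one lingers.
w_gd = gd(w0, lr=0.015, steps=100)

# Newton rescales by the inverse Hessian and lands at the minimum in one
# step (exactly, because f is quadratic).
w_newton = w0 - np.linalg.solve(H, H @ w0)
```

The point of the second-order lesson is that `np.linalg.solve` against the full Hessian is exactly what is infeasible at neural-network scale, motivating L-BFGS, Hessian-free, and KFAC approximations.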
5 lessons — EM, MCMC, variational inference, Bayesian foundations, and when the machinery breaks
- The EM Algorithm — MLE with latent variables, Jensen's inequality, ELBO, E-step/M-step, monotonic improvement, GMM example, failure modes (local optima, singular covariances, intractable E-step)
- MCMC Methods — Metropolis-Hastings (propose-accept/reject, detailed balance, proposal tuning), Gibbs sampling (full conditionals, MH with acceptance=1, correlated variables problem), VI overview (optimization vs sampling, ELBO, mean-field, speed vs accuracy tradeoffs)
- The ELBO and VI Bias — Full ELBO derivation, log P(x) = ELBO + KL(Q||P), reverse KL is mode-seeking, mean-field misses correlations, normalizing flows as fix
- Bayesian Foundations — Exchangeability, de Finetti's theorem, Bayesian justification for parametric models, Dirichlet Process (CRP, stick-breaking), Gaussian Process (kernel as function prior)
- Priors and Posterior Collapse — Prior sensitivity, Bernstein-von Mises, Jeffreys prior, posterior collapse in VAEs (powerful decoder problem, KL annealing, free bits)
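A compact EM loop for a two-component 1-D GMM, asserting the monotonic-improvement guarantee on every iteration; the data are synthetic and the initialization is deliberately poor:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians; EM must recover them without labels.
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])          # poor init
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def loglik(x, pi, mu, sigma):
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

prev = -np.inf
for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE updates.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
    # Each iteration can only raise the ELBO, hence the log-likelihood.
    ll = loglik(x, pi, mu, sigma)
    assert ll >= prev - 1e-9
    prev = ll
```

The failure modes from the lesson live just outside this happy path: start both means on the same point and EM stalls at a local optimum; let one component claim a single data point and its variance collapses toward zero.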
5 lessons — Transformers, residual connections, overparameterization, scaling laws, and the phenomena that defy classical intuition (Q31–Q40)
- Transformers vs RNNs — Sequential bottleneck, hidden state compression, parallel processing, attention O(n²d) complexity, sparse/linear/FlashAttention solutions
- Positional Encoding and Normalization — Permutation equivariance without PE, sinusoidal/learned/RoPE/ALiBi, layer norm vs batch norm, pre-norm vs post-norm, RMSNorm
- Residual Connections and Overparameterization — y=F(x)+x gradient math, 2^L paths, ensemble interpretation, ODE connection, interpolation threshold, NTK, implicit regularization
- Lottery Tickets and Double Descent — Sparse subnetworks matching full performance, iterative magnitude pruning, supermasks, three regimes of double descent, epoch-wise double descent
- Scaling Laws and Why ReLU Dominates — Kaplan/Chinchilla power laws, compute-optimal frontier, sigmoid→tanh→ReLU→GELU arc, dying ReLU, why GELU won in transformers
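The dying-ReLU point can be checked numerically with finite-difference gradients, comparing ReLU against the tanh approximation of GELU that many transformer codebases use:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU (common in transformer implementations).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-3.0, -0.5, 0.5, 3.0])
eps = 1e-5
relu_grad = (relu(x + eps) - relu(x - eps)) / (2 * eps)
gelu_grad = (gelu(x + eps) - gelu(x - eps)) / (2 * eps)
# ReLU's gradient is exactly 0 for x < 0: a unit stuck there passes no
# learning signal and can stay dead forever. GELU leaks a small, smooth
# gradient on the negative side, so units can recover.
```

This is the mechanical core of the sigmoid to tanh to ReLU to GELU arc: each step trades a saturation or dead-zone pathology for a better-behaved gradient.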
5 lessons — From convolution priors to tokenization failures and hallucination (Q41–Q50)
- Why Convolution Works — Locality and stationarity priors, weight sharing (FC N⁴ vs conv k²), inductive bias, translation equivariance vs invariance, proof sketch
- Pooling and Dilated Convolutions — Receptive field growth, max vs avg pooling, strided conv, global average pooling, dilated convolutions, WaveNet, gridding artifact
- Feature Hierarchies and Word2Vec — Edges→textures→parts→objects, GradCAM, transfer learning, Skip-gram/CBOW, softmax bottleneck, king−man+woman≈queen
- Negative Sampling and Pretraining — Binary classification reframing, noise distribution P(w)∝freq^(3/4), NCE connection, BERT vs GPT (bidirectional vs causal), convergence toward autoregressive scale
- Tokenization and Hallucination — Character/word/subword (BPE), arithmetic and multilingual impact, byte-level BPE, five causes of hallucination, mitigation strategies (RAG, CoT, calibration)
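A toy version of the BPE training loop, on a tiny invented corpus; real byte-level BPE additionally handles raw bytes, special tokens, and tie-breaking rules:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Each word is a tuple of symbols with a frequency; greedily merge the
    # most frequent adjacent pair (the core of byte-pair encoding).
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, c in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, c in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += c
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "lowest": 3}   # invented frequencies
merges = bpe_train(corpus, 3)
```

Frequent substrings like "low" fuse into single tokens while rare suffixes stay split, which is exactly why digit strings and low-resource languages get carved up in ways that hurt arithmetic and multilingual performance.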
5 lessons — Bellman equations to deep RL instability, exploration, and sample efficiency (Q51–Q60)
- Bellman Equations and Dynamic Programming — Value function derivation, Bellman recursion, optimality equation, Q-function, policy iteration vs value iteration, contraction mapping convergence
- Q-Learning Instability and Exploration — Tabular convergence, three sources of DQN instability (correlated samples, non-stationary targets, maximization bias), the deadly triad, ε-greedy vs UCB vs Thompson Sampling
- Actor-Critic and Credit Assignment — REINFORCE variance, advantage function, TD error as advantage estimate, A2C/PPO/SAC, temporal credit assignment, eligibility traces, λ-return, HER
- On-Policy vs Off-Policy and Function Approximation Issues — Behavioral vs target policy, importance sampling variance, PPO clipping, SAC entropy, convergence hierarchy (tabular → linear → neural), Baird's counterexample
- Reward Shaping and Sample Efficiency — Potential-based shaping (Ng et al. theorem), telescoping argument, reward hacking, five root causes of sample inefficiency, model-based RL, offline RL
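The telescoping argument behind potential-based shaping can be verified numerically on an arbitrary trajectory, with random states, rewards, and potential function:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
T = 10
states = rng.integers(0, 5, size=T + 1)   # arbitrary trajectory over 5 states
rewards = rng.normal(size=T)
phi = rng.normal(size=5)                  # arbitrary potential function

# Potential-based shaping reward: F(s, s') = gamma * phi(s') - phi(s).
shaped = rewards + gamma * phi[states[1:]] - phi[states[:-1]]

disc = gamma ** np.arange(T)
G = (disc * rewards).sum()                # original discounted return
G_shaped = (disc * shaped).sum()          # shaped discounted return

# The shaping terms telescope: sum_t gamma^t (gamma*phi(s_{t+1}) - phi(s_t))
# collapses to gamma^T * phi(s_T) - phi(s_0), a policy-independent constant,
# which is why Ng et al.'s theorem says the optimal policy is unchanged.
```

Any shaping reward that is not of this potential-based form breaks the telescoping and opens the door to reward hacking.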
5 lessons — Why deep nets generalize, compression and stability views, information bottleneck, and the expressivity limits of GNNs (Q61–Q70)
- Why Deep Nets Generalize — Zhang et al. random labels puzzle, SGD's simplicity bias, function space perspective, implicit regularization (min-norm, min nuclear norm, LR as regularizer, early stopping)
- Compression and Stability — MDL principle, compression bounds, noise stability, PAC-Bayes, algorithmic stability, Hardt et al. SGD stability proof, connection to differential privacy
- Information Bottleneck and GNN Oversmoothing — Tishby's IB framework, two-phase training, Saxe et al. criticism, oversmoothing as low-pass filtering, eigenvalue convergence, DropEdge/PairNorm fixes
- Message Passing and Spectral vs Spatial GNNs — MPNN framework, 1-WL expressivity ceiling, graph Laplacian Fourier domain, ChebNet, GCN as 1st-order Chebyshev, spectral vs spatial comparison
- Graph Isomorphism and GNN Expressivity — GI complexity status, Babai's algorithm, GIN (injective aggregation = 1-WL), beyond-1-WL strategies (higher-order, random features, positional encodings)
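A minimal 1-WL color-refinement pass, plus the classic pair of graphs it cannot distinguish; Python's built-in hash stands in for an injective color-compression function within a single run:

```python
def wl_colors(adj, rounds=3):
    # 1-WL color refinement: repeatedly hash each node's color together
    # with the sorted multiset of its neighbors' colors.
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in adj}
    return sorted(colors.values())

# Two 6-node graphs 1-WL cannot tell apart: one 6-cycle vs two triangles.
# Both are 2-regular, so every node sees an identical neighbor multiset in
# every round, and the color histograms never diverge.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Since GIN's aggregation is at most as powerful as this refinement, it also cannot separate these two graphs; that ceiling is what the beyond-1-WL strategies in the lesson are designed to break.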
5 lessons — Causal inference, Simpson's paradox, class imbalance, data leakage, and the engineering of reliable ML systems (Q71–Q80)
- Causation Formalized — Pearl's SCMs, structural equations, DAGs, do-operator, adjustment formula, backdoor criterion, frontdoor criterion
- Instrumental Variables and Counterfactuals — Unmeasured confounders, IV conditions, 2SLS, weak instruments, potential outcomes Y(1)/Y(0), fundamental problem of causal inference, Pearl's three-level ladder
- Simpson's Paradox and Class Imbalance — Berkeley admissions, aggregate vs stratify (causal structure decides), SMOTE, focal loss, AUPRC vs accuracy, threshold tuning
- Data Leakage and Feature Engineering — Target/temporal/train-test leakage, detection and prevention, feature engineering techniques, feature selection methods
- Cross-Validation and Hyperparameter Tuning — K-fold, stratified, time series split, nested CV, grid vs random (Bergstra & Bengio), Bayesian optimization, learning rate as most important hyperparameter
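Bergstra and Bengio's argument in miniature: with a fixed budget of 9 trials on a toy objective where only one hyperparameter matters, a 3x3 grid tests just 3 distinct values of the important axis while random search tests 9. The objective and its optimum are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy objective: only hyperparameter a matters; b is irrelevant.
score = lambda a, b: -(a - 0.73) ** 2

n = 9
grid_a = np.repeat(np.linspace(0, 1, 3), 3)   # 3x3 grid: a in {0, 0.5, 1}
grid_b = np.tile(np.linspace(0, 1, 3), 3)
rand_a = rng.uniform(size=n)                  # 9 distinct values of a
rand_b = rng.uniform(size=n)

best_grid = max(score(a, b) for a, b in zip(grid_a, grid_b))
best_rand = max(score(a, b) for a, b in zip(rand_a, rand_b))
```

The grid burns two thirds of its budget re-testing the same three values of the parameter that matters; random search never repeats a value on any axis.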
6 lessons — Distributed training, memory/precision, adversarial robustness, fairness, privacy, generative models, RLHF, alignment, and open questions (Q81–Q100)
- Distributed Training — AllReduce communication overhead, sync vs async SGD, linear scaling rule, LARS/LAMB, data/tensor/pipeline parallelism, ZeRO, 3D parallelism
- Memory, Precision & Latency — Memory budget (params, gradients, optimizer, activations), gradient checkpointing, FP16/BF16 mixed precision, GPTQ/AWQ quantization, continuous batching, speculative decoding, paged attention, GQA
- Adversarial Robustness — Goodfellow linearity hypothesis, FGSM/PGD attacks, features-vs-bugs debate, adversarial training min-max, accuracy-robustness tradeoff, certified defenses, randomized smoothing
- Fairness and Bias — Demographic parity, equalized odds, calibration, Chouldechova impossibility theorem, bias taxonomy, disaggregated metrics, counterfactual fairness, proxy variables
- Privacy and Generative Models — (ε,δ)-differential privacy, DP-SGD, composition theorem, federated learning, GANs vs diffusion (stability, mode coverage, conditioning), why diffusion training is stable
- RLHF, Alignment & the Frontier — RLHF pipeline (SFT→reward model→PPO), DPO, Goodhart's law, mesa-optimization, Constitutional AI, CLIP, test-time compute, mechanistic interpretability, scaling laws, open questions
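The DPO objective mentioned above reduces to a few lines. The log-probabilities here are invented scalars; a real implementation would sum per-token log-probs of each response under the policy and a frozen reference model:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: push the policy's log-prob margin on (chosen, rejected) pairs
    # above the reference model's margin, with no explicit reward model
    # and no PPO loop. beta controls how far the policy may drift.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin))
```

With zero margin the loss is log 2; it falls as the policy favors the chosen response more than the reference does, which is the sense in which DPO collapses the reward-model-plus-PPO pipeline into a single classification-style loss.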
- An LLM's knowledge lives in its weights (stone) — permanent patterns carved during training
- In-context learning activates existing capabilities via the prompt (water) — nothing changes, nothing persists
- Fine-tuning modifies weights (picks up the chisel) — powerful but expensive and risky
- "Activating, not carving" — the student's own summary of ICL vs fine-tuning
- LoRA represents weight changes as low-rank matrices — because fine-tuning updates are empirically low-rank
- Orthogonal adapter training guarantees diverse perspectives — the constraint is in weight space, not data space
- Orthogonality is like PCA/eigenvectors — each adapter captures the next most important independent direction of useful adaptation
- A review committee needs both specialists (orthogonal) AND generalists (composition layer) to catch diagonal bugs
- The distinction between "activation" and "learning" is a matter of where you draw the line — it's learning all the way down
- The holy grail is continual learning — if LoRA adapters can be continually learned, merged, and composed, the line between training and adaptation dissolves
- Every prediction error decomposes into bias² + variance + noise — the pattern of failure tells you what to fix
- Regularization is prior knowledge in disguise — L2 = Gaussian prior, L1 = Laplace prior, MAP = MLE + regularization
- Generalization depends on the ratio of model capacity to data, not capacity alone — VC theory and PAC learning formalize this
- KL divergence is asymmetric by design — forward KL covers modes (blurry), reverse KL seeks modes (sharp) — this explains the VAE vs GAN divide
- EM is the gateway to variational methods — bound what you can't compute, optimize the bound — but it only works when the E-step is tractable
- Exchangeability justifies Bayesian modeling — de Finetti's theorem shows that if data order is irrelevant, the data behave as i.i.d. draws given a latent parameter, so a parameter and a prior over it must exist
- Posterior collapse in VAEs is a degenerate equilibrium, not a bug — the ELBO is doing what it's told, but a powerful decoder makes z unnecessary
- Mean-field VI underestimates uncertainty — factorization + reverse KL = mode-seeking with no correlations — calibrated uncertainty requires richer families or MCMC
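The error decomposition in these takeaways can be checked by simulation: fit a deliberately underpowered model many times on fresh data and compare measured error against bias squared plus variance plus noise. This is a toy setup, and the match is Monte Carlo approximate rather than exact:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)
noise_sd = 0.3
x_test = 1.0

# Fit many degree-1 polynomials on fresh noisy samples. The model is too
# simple for sin(x), so bias should dominate at this test point.
preds = []
for _ in range(2000):
    x = rng.uniform(0, np.pi, 20)
    y = true_f(x) + rng.normal(0, noise_sd, 20)
    coefs = np.polyfit(x, y, 1)
    preds.append(np.polyval(coefs, x_test))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x_test)) ** 2   # systematic miss, squared
variance = preds.var()                           # spread across retrainings
noise = noise_sd ** 2                            # irreducible label noise

# Measured expected squared error on fresh noisy labels at x_test:
y_test = true_f(x_test) + rng.normal(0, noise_sd, 2000)
mse = ((preds - y_test) ** 2).mean()
# mse should come out close to bias_sq + variance + noise.
```

The dart board reading: here the cluster center is far from the bullseye (high bias) while the cluster itself is tight (low variance), which is the failure pattern that says "use a bigger model", not "get more data".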
- Experience level: understands neural networks and transformer architecture
- Thinks mechanically — reasons about tokens, weights, and machinery rather than anthropomorphizing
- Prefers practical, engineering-grounded explanations over abstract theory
- Strong at synthesis — naturally connects ideas across lessons and extrapolates to frontier implications
- Pushes back on imprecise claims, leading to productive deeper discussions