20 Theory Notebooks + 10 Real-World Projects + 10 Interactive Streamlit Apps
Every algorithm implemented from scratch using only NumPy and Matplotlib. No black boxes.
Notebooks • Projects • Quick Start • Streamlit Apps • Learning Path
Most RL resources either drown you in theory or hand you a black-box library. This repo does neither.
- Every algorithm is implemented from scratch -- you see every gradient, every update, every Q-value
- Every concept comes with LaTeX math and plain-English explanations side by side
- Every project has an interactive Streamlit app you can play with in your browser
- Zero dependency on OpenAI Gym, Stable Baselines, or any RL library -- just NumPy
"What I cannot create, I do not understand." -- Richard Feynman
Build your RL foundation from the ground up. Each notebook contains detailed math, step-by-step derivations, and working code with visualizations.
| # | Topic | Key Concepts |
|---|---|---|
| 01 | Introduction to RL | Agent-Environment loop, Return, Discount factor |
| 02 | Markov Decision Processes | Markov property, MDP tuple, V/Q functions |
| 03 | Bellman Equations | Expectation & Optimality equations, Matrix form |
| 04 | Dynamic Programming | Policy Evaluation, Policy Iteration, Value Iteration |
| 05 | Monte Carlo Methods | First-Visit MC, Importance Sampling |
| 06 | Temporal Difference Learning | TD(0), Bias-Variance tradeoff, TD vs MC |
| 07 | SARSA | On-policy TD control, Expected SARSA |
| 08 | Q-Learning | Off-policy TD, Double Q-Learning |
| 09 | N-Step & Eligibility Traces | N-step returns, TD(lambda), Backward view |
| 10 | Function Approximation | Linear FA, Semi-gradient TD, Tile Coding |
| 11 | Policy Gradient (REINFORCE) | Policy Gradient Theorem, Baseline |
| 12 | Actor-Critic | TD error as advantage, A2C |
| 13 | GAE | Generalized Advantage Estimation |
| 14 | PPO | Clipped objective, KL penalty |
| 15 | DQN | Experience Replay, Target Networks |
| 16 | Double & Dueling DQN | Overestimation fix, V+A decomposition |
| 17 | Experience Replay | Uniform, Prioritized (SumTree), HER |
| 18 | Exploration vs Exploitation | UCB, Thompson Sampling, Intrinsic Motivation |
| 19 | Multi-Armed Bandits | Regret, Gradient Bandit, Contextual Bandits |
| 20 | Model-Based RL | Dyna-Q, MCTS, World Models |
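As a taste of the from-scratch style, here is a minimal Value Iteration sketch in the spirit of notebook 04 -- the 2-state MDP (transition tensor `P`, rewards `R`) is made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * (P @ V)          # shape (2, 2): one value per (s, a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
```

The whole algorithm is three NumPy lines per sweep -- exactly the level of transparency the notebooks aim for.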
Each project applies RL to a real domain with a detailed notebook, interactive Streamlit app, and comprehensive documentation.
| # | Project | Domain | Key Algorithm | App |
|---|---|---|---|---|
| 01 | RLHF for LLM Alignment | AI Safety | PPO + Bradley-Terry Reward Model | Launch |
| 02 | Offline RL for Healthcare | Medicine | Conservative Q-Learning (CQL) | Launch |
| 03 | Multi-Agent RL | Robotics | Independent Q-Learning, Predator-Prey | Launch |
| 04 | Safe RL | Autonomous Systems | Lagrangian Constrained MDP | Launch |
| 05 | World Models | Planning | Dyna-Q, Learned Dynamics | Launch |
| 06 | RLAIF | AI Safety | AI Feedback vs Human Feedback | Launch |
| 07 | Hierarchical RL | Navigation | Options Framework, Four Rooms | Launch |
| 08 | Meta-RL | Few-Shot Learning | MAML, Learning to Learn | Launch |
| 09 | Drug Discovery | Pharma | Multi-Objective Policy Gradient | Launch |
| 10 | Sim-to-Real Transfer | Robotics | Domain Randomization | Launch |
```bash
pip install numpy matplotlib streamlit jupyter
```

That's it. No complex dependencies.

```bash
cd Notebooks
jupyter notebook
# Open any of the 20 notebooks
```

```bash
# Example: RLHF Demo
cd Project_01_RLHF_LLM_Alignment
streamlit run app.py

# Example: Drug Discovery
cd Project_09_Drug_Discovery_RL
streamlit run app.py
```

To execute all notebooks in one pass:

```bash
cd Notebooks
jupyter nbconvert --to notebook --execute --inplace *.ipynb
```

Not sure where to start? Follow this path:
START HERE
|
v
+-------------------------+
| 01-04: Foundations |
| MDPs, Bellman, DP |
+-------------------------+
|
+---------+---------+
v v
+----------------+ +----------------+
| 05-08: Tabular | | 18-19: Bandits |
| MC, TD, SARSA | | Exploration |
| Q-Learning | +----------------+
+----------------+
|
v
+-------------------+
| 09-10: Scaling Up |
| N-Step, Func Approx|
+-------------------+
|
+--------+--------+
v v
+-----------+ +-------------+
| 11-14: | | 15-17: |
| Policy | | Value-Based |
| Gradient | | DQN Family |
| REINFORCE | | Replay |
| AC, PPO | +-------------+
+-----------+
|
v
+-----------------------------------+
| 20: Model-Based RL |
+-----------------------------------+
|
v
+-----------------------------------+
| PROJECTS: Pick your interest! |
| |
| AI Safety --> 01 (RLHF), 06 |
| Healthcare --> 02, 09 |
| Robotics --> 03, 04, 10 |
| Planning --> 05, 07 |
| Meta-Learning --> 08 |
+-----------------------------------+
Project 01: RLHF for LLM Alignment
Train language models to follow human preferences using the same pipeline behind ChatGPT and Claude.
The 3-Step Pipeline:
- Supervised Fine-Tuning (SFT) -- Train on human demonstrations
- Reward Model -- Learn preferences from human rankings (Bradley-Terry model)
- PPO Fine-Tuning -- Optimize policy with KL-constrained PPO
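A minimal NumPy sketch of step 2's Bradley-Terry objective, using hypothetical reward-model scores for chosen/rejected response pairs:

```python
import numpy as np

# Hypothetical scalar scores from a reward model for (chosen, rejected) pairs.
r_chosen = np.array([2.1, 0.5, 1.3])
r_rejected = np.array([0.4, 0.9, -0.2])

def bradley_terry_loss(r_c, r_r):
    """Negative log-likelihood that the chosen response beats the rejected one:
    loss = -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    margin = r_c - r_r
    return np.mean(np.log1p(np.exp(-margin)))  # -log(sigmoid(x)) = log(1 + e^{-x})

loss = bradley_terry_loss(r_chosen, r_rejected)
```

Minimizing this loss pushes the reward model to score preferred responses above rejected ones; at zero margin the loss is exactly log 2.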
Key Equation:

$$\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta}\!\left[ r_\phi(x, y) \right] - \beta \, D_{KL}\!\left( \pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x) \right)$$
Project 02: Offline RL for Healthcare
Learn optimal treatment policies from patient records without experimenting on real patients.
Why Offline RL? You can't reset a patient. Offline RL learns from fixed datasets.
Key Algorithm: Conservative Q-Learning (CQL) penalizes out-of-distribution actions:
$$Q_{CQL} = Q - \alpha \cdot \mathbb{E}_{\pi}[Q(s,a)] + \alpha \cdot \mathbb{E}_{data}[Q(s,a)]$$
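A toy NumPy illustration of the penalty -- the Q-table and offline batch are hypothetical, and a greedy max stands in for the policy expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular Q-table and an offline batch of (state, action) pairs.
n_states, n_actions, alpha = 5, 3, 1.0
Q = rng.normal(size=(n_states, n_actions))
batch_states = np.array([0, 1, 2, 3])
batch_actions = np.array([1, 0, 2, 1])   # actions actually taken in the dataset

# CQL penalty: push down Q on actions the learned policy would pick
# (here: greedy max), push up Q on actions seen in the dataset.
pushed_down = Q[batch_states].max(axis=1).mean()      # E_pi[Q(s,a)]
pushed_up = Q[batch_states, batch_actions].mean()     # E_data[Q(s,a)]
cql_penalty = alpha * (pushed_down - pushed_up)
```

Since the per-state max is never below the dataset action's value, the penalty is non-negative, and minimizing it keeps Q conservative on out-of-distribution actions.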
Project 03: Multi-Agent RL
Multiple agents learning simultaneously in a predator-prey environment.
Challenge: Each agent's environment is non-stationary because other agents are also learning.
Project 04: Safe RL
Learn policies that maximize reward while satisfying safety constraints.
Lagrangian Approach:

$$\max_\pi \min_{\lambda \ge 0}\; J_R(\pi) - \lambda \left( J_C(\pi) - d \right)$$
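A sketch of the dual-variable half of that saddle point -- the cost budget, learning rate, and per-iteration costs below are hypothetical:

```python
import numpy as np

# Dual ascent for a constrained MDP:
#   max_pi min_{lambda >= 0}  J_R(pi) - lambda * (J_C(pi) - d)
# The policy ascends the Lagrangian; lambda rises while the constraint is violated.
d = 1.0            # cost budget (hypothetical)
lam = 0.0          # Lagrange multiplier
eta = 0.1          # dual learning rate

episode_costs = np.array([2.0, 1.8, 1.5, 1.1, 0.9, 0.8])  # hypothetical J_C per iteration
for J_cost in episode_costs:
    # Raise lambda when J_cost > d, lower it (floored at 0) otherwise.
    lam = max(0.0, lam + eta * (J_cost - d))
```

As the policy's cost falls below the budget, the multiplier decays back toward zero, so the reward objective dominates again.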
Project 05: World Models
Learn the environment dynamics and plan in imagination.
Key Idea: learn the transition dynamics from real experience, then generate imagined rollouts from the learned model for extra planning updates (Dyna-Q).
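A compact Dyna-Q sketch of that loop (also covered in notebook 20) -- the tabular deterministic model and sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dyna-Q: after each real step, replay n imagined transitions from a learned model.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
model = {}                     # (s, a) -> (reward, next_state), learned from real steps
alpha, gamma = 0.1, 0.95

def real_step(s, a, r, s_next):
    """One real environment transition: Q-learning update + model update."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    model[(s, a)] = (r, s_next)

def planning(n=10):
    """n imagined updates replayed from previously seen (s, a) pairs."""
    seen = list(model.keys())
    for _ in range(n):
        s, a = seen[rng.integers(len(seen))]
        r, s_next = model[(s, a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

real_step(0, 1, 1.0, 2)   # one real transition...
planning(n=20)            # ...amplified by 20 imagined ones
```

One real transition funds many value updates -- the core sample-efficiency argument for model-based RL.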
Project 06: RLAIF (AI Feedback)
Replace expensive human feedback with AI-generated feedback. Compare convergence and quality.
Project 07: Hierarchical RL
Break complex tasks into subtasks using the Options Framework in a Four Rooms environment.
Project 08: Meta-RL (Learning to Learn)
Train agents that can adapt to new tasks in just a few episodes using MAML.
MAML Update:
$$\theta^* = \theta - \alpha \nabla_\theta \mathcal{L}_{task}\!\left(\theta - \alpha \nabla_\theta \mathcal{L}_{task}(\theta)\right)$$
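A first-order sketch of the idea on hypothetical 1-D linear-regression tasks. Full MAML differentiates through the inner step; the version below drops that second-order term (first-order MAML), and all hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, x, y):
    """Gradient of mean squared error L = mean((theta*x - y)^2) w.r.t. theta."""
    return np.mean(2 * (theta * x - y) * x)

theta = 0.0
alpha, beta = 0.05, 0.05        # inner / outer learning rates

for _ in range(200):
    w_task = rng.uniform(1.0, 3.0)           # sample a task: y = w_task * x
    x = rng.normal(size=20)
    y = w_task * x
    theta_adapted = theta - alpha * loss_grad(theta, x, y)   # inner adaptation step
    theta -= beta * loss_grad(theta_adapted, x, y)           # outer (meta) step
```

The meta-parameter drifts toward the center of the task distribution, so a single inner gradient step adapts it quickly to any sampled task.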
Project 09: Drug Discovery
Use RL to design molecules that satisfy multiple drug properties (LogP, toxicity, drug-likeness).
Project 10: Sim-to-Real Transfer
Train in simulation, deploy in the real world using Domain Randomization.
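A sketch of the episode-level randomization loop -- the parameter names and ranges are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Domain randomization: resample simulator physics every episode so the
# policy cannot overfit any single parameter setting.
def sample_sim_params():
    return {
        "mass": rng.uniform(0.8, 1.2),           # +/-20% around nominal
        "friction": rng.uniform(0.5, 1.5),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

# One parameter draw per training episode.
episodes = [sample_sim_params() for _ in range(100)]
masses = np.array([p["mass"] for p in episodes])
```

A policy that succeeds across all these perturbed simulators is more likely to treat the real world as just one more sample from the distribution.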
No OpenAI Gym dependency. Every environment is hand-crafted:
| Environment | Used In | States | Actions |
|---|---|---|---|
| GridWorld (4x4) | Notebooks 01-04 | 16 discrete | 4 (up/right/down/left) |
| Blackjack | Notebook 05 | Player sum x Dealer | Hit / Stick |
| Random Walk | Notebook 06, 09 | 5-19 states | Left / Right |
| Cliff Walking | Notebook 07 | 4x12 grid | 4 directions |
| Mountain Car | Notebook 10 | Position x Velocity | 3 (reverse/neutral/forward) |
| CartPole | Notebooks 11-16 | 4D continuous | 2 (left/right) |
| 10-Armed Bandit | Notebooks 18-19 | None | K arms |
| Maze (6x9) | Notebook 20 | 54 cells | 4 directions |
| Predator-Prey | Project 03 | Grid positions | 5 (4 dirs + stay) |
| Four Rooms | Project 07 | Multi-room grid | 4 directions |
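To illustrate what "hand-crafted" means here, a minimal GridWorld in the notebooks' spirit -- the interface below is a sketch, not the repo's exact API:

```python
# Minimal 4x4 GridWorld: start at (0,0), reach the bottom-right goal.
class GridWorld:
    ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)   # clip to the grid
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, (0.0 if done else -1.0), done     # -1 per step until the goal
```

Roughly twenty lines replace an entire environment dependency -- every transition and reward is visible.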
| Component | Technology |
|---|---|
| Core Math | NumPy |
| Visualization | Matplotlib |
| Interactive Apps | Streamlit |
| Notebooks | Jupyter |
| Language | Python 3.10+ |
Philosophy: Zero abstraction layers. When you read Q[state][action] += alpha * td_error, that's exactly what's happening. No hidden magic.
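That update, shown in a minimal surrounding context -- the table sizes and hyperparameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 16-state, 4-action Q-table (GridWorld-sized) with epsilon-greedy control.
Q = np.zeros((16, 4))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def q_update(state, action, reward, next_state):
    td_error = reward + gamma * Q[next_state].max() - Q[state][action]
    Q[state][action] += alpha * td_error     # the exact update quoted above
    return td_error

def act(state):
    if rng.random() < epsilon:
        return int(rng.integers(4))          # explore
    return int(Q[state].argmax())            # exploit

td = q_update(0, 2, 1.0, 1)
```

There is nowhere for magic to hide: the TD error and the table write are the whole algorithm.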
- Milan Amrut Joshi -- Project Lead & Core Author (all 20 notebooks, 10 projects, Streamlit apps)
- Antonin Raffin -- RL Expert & Advisor (creator of Stable-Baselines3; DLR Robotics)
- Costa Huang -- RL Expert & Advisor (creator of CleanRL; Hugging Face RL Team)
| Project | Expert 1 | Expert 2 |
|---|---|---|
| 01 — RLHF LLM Alignment | John Schulman (co-creator of PPO, TRPO & RLHF) | Leandro von Werra (creator of TRL, Hugging Face) |
| 02 — Offline RL Healthcare | Aviral Kumar (creator of CQL, UC Berkeley) | Justin Fu (D4RL benchmark creator, UC Berkeley) |
| 03 — Multi-Agent RL | Shariq Iqbal (MARL researcher, USC) | Christian Schroeder de Witt (QMIX co-author, Oxford) |
| 04 — Safe RL | Joshua Achiam (creator of CPO, OpenAI) | Alex Ray (Safety Gym co-author, OpenAI) |
| 05 — World Models | David Ha (World Models paper, Google Brain) | Danijar Hafner (Dreamer/V2/V3, Google DeepMind) |
| 06 — RLAIF | Leandro von Werra (TRL library, Hugging Face) | Edward Beeching (HF RL researcher) |
| 07 — Hierarchical RL | Ofir Nachum (hierarchical RL, Google Brain) | Pierre-Luc Bacon (Options framework, Mila) |
| 08 — Meta-RL | Chelsea Finn (creator of MAML, Stanford) | Kate Rakelly (PEARL meta-RL, UC Berkeley) |
| 09 — Drug Discovery RL | Wengong Jin (molecular generation, MIT) | Bharath Ramsundar (creator of DeepChem) |
| 10 — Sim-to-Real Transfer | Josh Tobin (domain randomization, OpenAI) | Xue Bin Peng (sim-to-real transfer, UC Berkeley) |
Found a bug? Want to add a project? PRs welcome!
- Fork the repo
- Create your branch (`git checkout -b feature/amazing-feature`)
- Commit your changes
- Push and open a PR
| Resource | Author |
|---|---|
| Reinforcement Learning: An Introduction | Sutton & Barto (2018) |
| Deep RL Course | Hugging Face |
| Spinning Up in Deep RL | OpenAI |
| Training LLMs to Follow Instructions with Human Feedback | Ouyang et al. (2022) |
| Proximal Policy Optimization Algorithms | Schulman et al. (2017) |
| Playing Atari with Deep Reinforcement Learning | Mnih et al. (2013) |
If you find this useful, please give it a star!
It helps others discover this resource.
Made with determination by Milan Amrut Joshi