20 Theory Notebooks + 10 Real-World Projects + 10 Interactive Streamlit Apps
Every algorithm implemented from scratch using only NumPy and Matplotlib. No black boxes.
Notebooks • Projects • Quick Start • Streamlit Apps • Learning Path
Most RL resources either drown you in theory or hand you a black-box library. This repo does neither.
- Every algorithm is implemented from scratch -- you see every gradient, every update, every Q-value
- Every concept comes with LaTeX math and plain-English explanations side by side
- Every project has an interactive Streamlit app you can play with in your browser
- Zero dependency on OpenAI Gym, Stable Baselines, or any RL library -- just NumPy
"What I cannot create, I do not understand." -- Richard Feynman
Build your RL foundation from the ground up. Each notebook contains detailed math, step-by-step derivations, and working code with visualizations.
| # | Topic | Key Concepts |
|---|---|---|
| 01 | Introduction to RL | Agent-Environment loop, Return, Discount factor |
| 02 | Markov Decision Processes | Markov property, MDP tuple, V/Q functions |
| 03 | Bellman Equations | Expectation & Optimality equations, Matrix form |
| 04 | Dynamic Programming | Policy Evaluation, Policy Iteration, Value Iteration |
| 05 | Monte Carlo Methods | First-Visit MC, Importance Sampling |
| 06 | Temporal Difference Learning | TD(0), Bias-Variance tradeoff, TD vs MC |
| 07 | SARSA | On-policy TD control, Expected SARSA |
| 08 | Q-Learning | Off-policy TD, Double Q-Learning |
| 09 | N-Step & Eligibility Traces | N-step returns, TD(lambda), Backward view |
| 10 | Function Approximation | Linear FA, Semi-gradient TD, Tile Coding |
| 11 | Policy Gradient (REINFORCE) | Policy Gradient Theorem, Baseline |
| 12 | Actor-Critic | TD error as advantage, A2C |
| 13 | GAE | Generalized Advantage Estimation |
| 14 | PPO | Clipped objective, KL penalty |
| 15 | DQN | Experience Replay, Target Networks |
| 16 | Double & Dueling DQN | Overestimation fix, V+A decomposition |
| 17 | Experience Replay | Uniform, Prioritized (SumTree), HER |
| 18 | Exploration vs Exploitation | UCB, Thompson Sampling, Intrinsic Motivation |
| 19 | Multi-Armed Bandits | Regret, Gradient Bandit, Contextual Bandits |
| 20 | Model-Based RL | Dyna-Q, MCTS, World Models |
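As a taste of the from-scratch style, here is a minimal Value Iteration sketch in the spirit of notebook 04 -- the 2-state MDP (transition tensor `P`, rewards `R`) is made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * (P @ V)          # shape (2, 2): one value per (s, a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
```

The whole algorithm is three NumPy lines per sweep -- exactly the level of transparency the notebooks aim for.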
Each project applies RL to a real domain with a detailed notebook, interactive Streamlit app, and comprehensive documentation.
| # | Project | Domain | Key Algorithm | App |
|---|---|---|---|---|
| 01 | RLHF for LLM Alignment | AI Safety | PPO + Bradley-Terry Reward Model | Launch |
| 02 | Offline RL for Healthcare | Medicine | Conservative Q-Learning (CQL) | Launch |
| 03 | Multi-Agent RL | Robotics | Independent Q-Learning, Predator-Prey | Launch |
| 04 | Safe RL | Autonomous Systems | Lagrangian Constrained MDP | Launch |
| 05 | World Models | Planning | Dyna-Q, Learned Dynamics | Launch |
| 06 | RLAIF | AI Safety | AI Feedback vs Human Feedback | Launch |
| 07 | Hierarchical RL | Navigation | Options Framework, Four Rooms | Launch |
| 08 | Meta-RL | Few-Shot Learning | MAML, Learning to Learn | Launch |
| 09 | Drug Discovery | Pharma | Multi-Objective Policy Gradient | Launch |
| 10 | Sim-to-Real Transfer | Robotics | Domain Randomization | Launch |
```bash
pip install numpy matplotlib streamlit jupyter
```

That's it. No complex dependencies.

```bash
cd Notebooks
jupyter notebook
# Open any of the 20 notebooks
```

```bash
# Example: RLHF Demo
cd Project_01_RLHF_LLM_Alignment
streamlit run app.py

# Example: Drug Discovery
cd Project_09_Drug_Discovery_RL
streamlit run app.py
```

To execute all notebooks in one pass:

```bash
cd Notebooks
jupyter nbconvert --to notebook --execute --inplace *.ipynb
```

Not sure where to start? Follow this path:
START HERE
|
v
+-------------------------+
| 01-04: Foundations |
| MDPs, Bellman, DP |
+-------------------------+
|
+---------+---------+
v v
+----------------+ +----------------+
| 05-08: Tabular | | 18-19: Bandits |
| MC, TD, SARSA | | Exploration |
| Q-Learning | +----------------+
+----------------+
|
v
+-------------------+
| 09-10: Scaling Up |
| N-Step, Func Approx|
+-------------------+
|
+--------+--------+
v v
+-----------+ +-------------+
| 11-14: | | 15-17: |
| Policy | | Value-Based |
| Gradient | | DQN Family |
| REINFORCE | | Replay |
| AC, PPO | +-------------+
+-----------+
|
v
+-----------------------------------+
| 20: Model-Based RL |
+-----------------------------------+
|
v
+-----------------------------------+
| PROJECTS: Pick your interest! |
| |
| AI Safety --> 01 (RLHF), 06 |
| Healthcare --> 02, 09 |
| Robotics --> 03, 04, 10 |
| Planning --> 05, 07 |
| Meta-Learning --> 08 |
+-----------------------------------+
Project 01: RLHF for LLM Alignment
Train language models to follow human preferences using the same pipeline behind ChatGPT and Claude.
The 3-Step Pipeline:
- Supervised Fine-Tuning (SFT) -- Train on human demonstrations
- Reward Model -- Learn preferences from human rankings (Bradley-Terry model)
- PPO Fine-Tuning -- Optimize policy with KL-constrained PPO
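A minimal NumPy sketch of step 2's Bradley-Terry objective, using hypothetical reward-model scores for chosen/rejected response pairs:

```python
import numpy as np

# Hypothetical scalar scores from a reward model for (chosen, rejected) pairs.
r_chosen = np.array([2.1, 0.5, 1.3])
r_rejected = np.array([0.4, 0.9, -0.2])

def bradley_terry_loss(r_c, r_r):
    """Negative log-likelihood that the chosen response beats the rejected one:
    loss = -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    margin = r_c - r_r
    return np.mean(np.log1p(np.exp(-margin)))  # -log(sigmoid(x)) = log(1 + e^{-x})

loss = bradley_terry_loss(r_chosen, r_rejected)
```

Minimizing this loss pushes the reward model to score preferred responses above rejected ones; at zero margin the loss is exactly log 2.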
Key Equation:

$$\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta}\!\left[ r_\phi(x, y) \right] - \beta \, D_{KL}\!\left( \pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x) \right)$$
Project 02: Offline RL for Healthcare
Learn optimal treatment policies from patient records without experimenting on real patients.
Why Offline RL? You can't reset a patient. Offline RL learns from fixed datasets.
Key Algorithm: Conservative Q-Learning (CQL) penalizes out-of-distribution actions:
$$Q_{CQL} = Q - \alpha \cdot \mathbb{E}_{\pi}[Q(s,a)] + \alpha \cdot \mathbb{E}_{data}[Q(s,a)]$$
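A toy NumPy illustration of the penalty -- the Q-table and offline batch are hypothetical, and a greedy max stands in for the policy expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular Q-table and an offline batch of (state, action) pairs.
n_states, n_actions, alpha = 5, 3, 1.0
Q = rng.normal(size=(n_states, n_actions))
batch_states = np.array([0, 1, 2, 3])
batch_actions = np.array([1, 0, 2, 1])   # actions actually taken in the dataset

# CQL penalty: push down Q on actions the learned policy would pick
# (here: greedy max), push up Q on actions seen in the dataset.
pushed_down = Q[batch_states].max(axis=1).mean()      # E_pi[Q(s,a)]
pushed_up = Q[batch_states, batch_actions].mean()     # E_data[Q(s,a)]
cql_penalty = alpha * (pushed_down - pushed_up)
```

Since the per-state max is never below the dataset action's value, the penalty is non-negative, and minimizing it keeps Q conservative on out-of-distribution actions.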
Project 03: Multi-Agent RL
Multiple agents learning simultaneously in a predator-prey environment.
Challenge: Each agent's environment is non-stationary because other agents are also learning.
Project 04: Safe RL
Learn policies that maximize reward while satisfying safety constraints.
Lagrangian Approach:

$$\max_\pi \min_{\lambda \ge 0}\; J_R(\pi) - \lambda \left( J_C(\pi) - d \right)$$
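A sketch of the dual-variable half of that saddle point -- the cost budget, learning rate, and per-iteration costs below are hypothetical:

```python
import numpy as np

# Dual ascent for a constrained MDP:
#   max_pi min_{lambda >= 0}  J_R(pi) - lambda * (J_C(pi) - d)
# The policy ascends the Lagrangian; lambda rises while the constraint is violated.
d = 1.0            # cost budget (hypothetical)
lam = 0.0          # Lagrange multiplier
eta = 0.1          # dual learning rate

episode_costs = np.array([2.0, 1.8, 1.5, 1.1, 0.9, 0.8])  # hypothetical J_C per iteration
for J_cost in episode_costs:
    # Raise lambda when J_cost > d, lower it (floored at 0) otherwise.
    lam = max(0.0, lam + eta * (J_cost - d))
```

As the policy's cost falls below the budget, the multiplier decays back toward zero, so the reward objective dominates again.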
Project 05: World Models
Learn the environment dynamics and plan in imagination.
Key Idea: learn the transition dynamics from real experience, then generate imagined rollouts from the learned model for extra planning updates (Dyna-Q).
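A compact Dyna-Q sketch of that loop (also covered in notebook 20) -- the tabular deterministic model and sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dyna-Q: after each real step, replay n imagined transitions from a learned model.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
model = {}                     # (s, a) -> (reward, next_state), learned from real steps
alpha, gamma = 0.1, 0.95

def real_step(s, a, r, s_next):
    """One real environment transition: Q-learning update + model update."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    model[(s, a)] = (r, s_next)

def planning(n=10):
    """n imagined updates replayed from previously seen (s, a) pairs."""
    seen = list(model.keys())
    for _ in range(n):
        s, a = seen[rng.integers(len(seen))]
        r, s_next = model[(s, a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

real_step(0, 1, 1.0, 2)   # one real transition...
planning(n=20)            # ...amplified by 20 imagined ones
```

One real transition funds many value updates -- the core sample-efficiency argument for model-based RL.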
Project 06: RLAIF (AI Feedback)
Replace expensive human feedback with AI-generated feedback. Compare convergence and quality.
Project 07: Hierarchical RL
Break complex tasks into subtasks using the Options Framework in a Four Rooms environment.
Project 08: Meta-RL (Learning to Learn)
Train agents that can adapt to new tasks in just a few episodes using MAML.
MAML Update:
$$\theta^* = \theta - \alpha \nabla_\theta \mathcal{L}_{task}\!\left(\theta - \alpha \nabla_\theta \mathcal{L}_{task}(\theta)\right)$$
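A first-order sketch of the idea on hypothetical 1-D linear-regression tasks. Full MAML differentiates through the inner step; the version below drops that second-order term (first-order MAML), and all hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, x, y):
    """Gradient of mean squared error L = mean((theta*x - y)^2) w.r.t. theta."""
    return np.mean(2 * (theta * x - y) * x)

theta = 0.0
alpha, beta = 0.05, 0.05        # inner / outer learning rates

for _ in range(200):
    w_task = rng.uniform(1.0, 3.0)           # sample a task: y = w_task * x
    x = rng.normal(size=20)
    y = w_task * x
    theta_adapted = theta - alpha * loss_grad(theta, x, y)   # inner adaptation step
    theta -= beta * loss_grad(theta_adapted, x, y)           # outer (meta) step
```

The meta-parameter drifts toward the center of the task distribution, so a single inner gradient step adapts it quickly to any sampled task.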
Project 09: Drug Discovery
Use RL to design molecules that satisfy multiple drug properties (LogP, toxicity, drug-likeness).
Project 10: Sim-to-Real Transfer
Train in simulation, deploy in the real world using Domain Randomization.
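A sketch of the episode-level randomization loop -- the parameter names and ranges are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Domain randomization: resample simulator physics every episode so the
# policy cannot overfit any single parameter setting.
def sample_sim_params():
    return {
        "mass": rng.uniform(0.8, 1.2),           # +/-20% around nominal
        "friction": rng.uniform(0.5, 1.5),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

# One parameter draw per training episode.
episodes = [sample_sim_params() for _ in range(100)]
masses = np.array([p["mass"] for p in episodes])
```

A policy that succeeds across all these perturbed simulators is more likely to treat the real world as just one more sample from the distribution.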
No OpenAI Gym dependency. Every environment is hand-crafted:
| Environment | Used In | States | Actions |
|---|---|---|---|
| GridWorld (4x4) | Notebooks 01-04 | 16 discrete | 4 (up/right/down/left) |
| Blackjack | Notebook 05 | Player sum x Dealer | Hit / Stick |
| Random Walk | Notebook 06, 09 | 5-19 states | Left / Right |
| Cliff Walking | Notebook 07 | 4x12 grid | 4 directions |
| Mountain Car | Notebook 10 | Position x Velocity | 3 (reverse/neutral/forward) |
| CartPole | Notebooks 11-16 | 4D continuous | 2 (left/right) |
| 10-Armed Bandit | Notebooks 18-19 | None | K arms |
| Maze (6x9) | Notebook 20 | 54 cells | 4 directions |
| Predator-Prey | Project 03 | Grid positions | 5 (4 dirs + stay) |
| Four Rooms | Project 07 | Multi-room grid | 4 directions |
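To illustrate what "hand-crafted" means here, a minimal GridWorld in the notebooks' spirit -- the interface below is a sketch, not the repo's exact API:

```python
# Minimal 4x4 GridWorld: start at (0,0), reach the bottom-right goal.
class GridWorld:
    ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)   # clip to the grid
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, (0.0 if done else -1.0), done     # -1 per step until the goal
```

Roughly twenty lines replace an entire environment dependency -- every transition and reward is visible.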
| Component | Technology |
|---|---|
| Core Math | NumPy |
| Visualization | Matplotlib |
| Interactive Apps | Streamlit |
| Notebooks | Jupyter |
| Language | Python 3.10+ |
Philosophy: Zero abstraction layers. When you read Q[state][action] += alpha * td_error, that's exactly what's happening. No hidden magic.
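That update, shown in a minimal surrounding context -- the table sizes and hyperparameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 16-state, 4-action Q-table (GridWorld-sized) with epsilon-greedy control.
Q = np.zeros((16, 4))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def q_update(state, action, reward, next_state):
    td_error = reward + gamma * Q[next_state].max() - Q[state][action]
    Q[state][action] += alpha * td_error     # the exact update quoted above
    return td_error

def act(state):
    if rng.random() < epsilon:
        return int(rng.integers(4))          # explore
    return int(Q[state].argmax())            # exploit

td = q_update(0, 2, 1.0, 1)
```

There is nowhere for magic to hide: the TD error and the table write are the whole algorithm.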
- Milan Amrut Joshi -- Project Lead & Core Author (all 20 notebooks, 10 projects, Streamlit apps)
- Antonin Raffin -- RL Expert & Advisor (creator of Stable-Baselines3; DLR Robotics)
- Costa Huang -- RL Expert & Advisor (creator of CleanRL; Hugging Face RL Team)
| Project | Expert 1 | Expert 2 |
|---|---|---|
| 01 — RLHF LLM Alignment | John Schulman (co-creator of PPO, TRPO & RLHF) | Leandro von Werra (creator of TRL, Hugging Face) |
| 02 — Offline RL Healthcare | Aviral Kumar (creator of CQL, UC Berkeley) | Justin Fu (D4RL benchmark creator, UC Berkeley) |
| 03 — Multi-Agent RL | Shariq Iqbal (MARL researcher, USC) | Christian Schroeder de Witt (QMIX co-author, Oxford) |
| 04 — Safe RL | Joshua Achiam (creator of CPO, OpenAI) | Alex Ray (Safety Gym co-author, OpenAI) |
| 05 — World Models | David Ha (World Models paper, Google Brain) | Danijar Hafner (Dreamer/V2/V3, Google DeepMind) |
| 06 — RLAIF | Leandro von Werra (TRL library, Hugging Face) | Edward Beeching (HF RL researcher) |
| 07 — Hierarchical RL | Ofir Nachum (hierarchical RL, Google Brain) | Pierre-Luc Bacon (Options framework, Mila) |
| 08 — Meta-RL | Chelsea Finn (creator of MAML, Stanford) | Kate Rakelly (PEARL meta-RL, UC Berkeley) |
| 09 — Drug Discovery RL | Wengong Jin (molecular generation, MIT) | Bharath Ramsundar (creator of DeepChem) |
| 10 — Sim-to-Real Transfer | Josh Tobin (domain randomization, OpenAI) | Xue Bin Peng (sim-to-real transfer, UC Berkeley) |
Found a bug? Want to add a project? PRs welcome!
- Fork the repo
- Create your branch (`git checkout -b feature/amazing-feature`)
- Commit your changes
- Push and open a PR
| Resource | Author |
|---|---|
| Reinforcement Learning: An Introduction | Sutton & Barto (2018) |
| Deep RL Course | Hugging Face |
| Spinning Up in Deep RL | OpenAI |
| Training LLMs to Follow Instructions with Human Feedback | Ouyang et al. (2022) |
| Proximal Policy Optimization Algorithms | Schulman et al. (2017) |
| Playing Atari with Deep Reinforcement Learning | Mnih et al. (2013) |
If you find this useful, please give it a star!
It helps others discover this resource.
Made with determination by Milan Amrut Joshi