nano-vLLM Deep Dive

An 11-chapter study series on LLM inference engineering — from tokens to PagedAttention — built around the nano-vLLM codebase.

What This Is

A beginner-friendly yet technically rigorous guide to how large language model inference actually works. Each chapter is a self-contained HTML page with:

Real-world analogies before every new concept
Interactive visualizations and simulators
Annotated code from the actual nano-vLLM implementation
A 3-question quiz with detailed feedback

No ML, linear algebra, or CUDA knowledge required. Basic Python reading comfort helps — all code is annotated line by line.

Chapters

#	Title	Topic	~Time
01	What Is LLM Inference?	Tokens, autoregressive generation, Q/K/V, HBM, bottlenecks	22 min
02	nano-vLLM Architecture	File structure, CPU control plane vs GPU data plane	18 min
03	KV Cache	Physical tensor layout, Triton kernel, GQA	20 min
04	PagedAttention	Virtual memory for KV cache, block tables, free lists	22 min
05	The Scheduler	Continuous batching, state machine, preemption	20 min
06	Prefill vs Decode	Compute-bound vs memory-bound, TTFT/TPOT, chunked prefill	18 min
07	Prefix Caching	xxhash content-addressable blocks, reference counting	18 min
08	Sampling Strategies	Greedy, temperature, top-k, top-p	16 min
09	Tensor Parallelism	Column/Row parallel layers, all-reduce, NCCL	20 min
10	Optimization Stack	FlashAttention, CUDA Graphs, torch.compile, kernel fusion	24 min
11	Benchmarks	Throughput vs latency, nano-vLLM vs vLLM, reading benchmarks honestly	14 min

Total reading time: ~3.5 hours (212 minutes across all 11 chapters)

Running Locally

No build step required. Serve the static files:

cd nano-vllm-guide
python3 -m http.server 8000

Then open http://localhost:8000 in your browser.
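If you would rather start the server from a Python script than from the shell, the same step can be sketched with the standard library alone. This is a minimal sketch, not part of the guide itself: the helper name make_server is hypothetical, and binding to 127.0.0.1 only is an assumption (the command above listens on all interfaces by default).

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Build a static-file HTTP server rooted at `directory`.

    `make_server` is a hypothetical helper name used for illustration.
    Pass port=0 to let the OS pick any free port.
    """
    # SimpleHTTPRequestHandler serves files relative to `directory`
    # (the `directory` keyword is available from Python 3.7 onward).
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # Assumption: bind to loopback only, so the pages are reachable
    # from this machine but not from the rest of the network.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server("nano-vllm-guide").serve_forever() from the parent directory should then serve the chapters at http://localhost:8000 until you interrupt it with Ctrl+C.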

Who This Is For

Software engineers curious about how LLMs work under the hood
ML practitioners who use LLM APIs and want to understand what happens beneath generate()
Beginners who've heard "KV cache" or "PagedAttention" and want a real explanation

Based On

All code examples come from GeeeekExplorer/nano-vllm, a ~1,200-line reimplementation of vLLM's core algorithms in pure Python and Triton, created by Xingkai Yu (DeepSeek).

License

Apache 2.0

