A compact, readable GPT-style decoder-only Transformer implemented in pure PyTorch. The goal is to expose the essential architectural pieces with minimal scaffolding so you can train and tinker quickly.
- `model.py`: GPT building blocks (token/position embeddings, masked attention, MLP, blocks, and the full model with `generate()`).
- `train.py`: Small training loop on character-level data with optional sampling at the end.
- Token + position embedding
  - Tokens are mapped via `nn.Embedding(vocab_size, n_embd)`.
  - Absolute positions use a learned embedding `nn.Embedding(block_size, n_embd)`.
  - Both embeddings are summed and passed through dropout.
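The embedding step above can be sketched as follows. This is a minimal illustration, not the code in `model.py`: the class name `Embeddings` and the attribute names are hypothetical, and the length check is an assumption based on the `block_size` description.

```python
import torch
import torch.nn as nn


class Embeddings(nn.Module):
    """Hypothetical helper: token + learned absolute position embeddings."""

    def __init__(self, vocab_size: int, block_size: int, n_embd: int, dropout: float = 0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.drop = nn.Dropout(dropout)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"sequence length {T} exceeds block_size {self.block_size}")
        pos = torch.arange(T, device=idx.device)  # (T,)
        # (B, T, n_embd) + (T, n_embd): positions broadcast over the batch.
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.drop(x)


emb = Embeddings(vocab_size=65, block_size=8, n_embd=16)
idx = torch.randint(0, 65, (4, 8))  # batch of 4 sequences, 8 token IDs each
out = emb(idx)                      # shape (4, 8, 16)
```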
- Stack of Transformer decoder blocks (repeated `n_layer` times)

  Structure inside each block:
  - LayerNorm (pre-norm)
  - Causal multi-head self-attention
    - Q, K, V are linear projections of shape `(B, T, n_head, head_size)`.
    - Attention weights = softmax((QK^T) / sqrt(head_size)) with a lower-triangular mask so tokens only attend to past and present.
    - Weighted sum over V, then projected back to model width and dropped out.
  - Residual connection to input.
  - LayerNorm.
  - MLP: Linear -> GELU -> Linear with hidden width `4 * n_embd`, plus dropout.
  - Residual connection again.
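A pre-norm block along these lines might look like the sketch below. The class names `CausalSelfAttention` and `Block` are hypothetical, and the single fused Q/K/V projection is an assumption made for brevity; `model.py` may use separate projections.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.head_size = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # assumption: fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        # Lower-triangular mask: position t may see positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)  # (B, n_head, T, T)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = self.attn_drop(F.softmax(att, dim=-1))
        y = att @ v                                       # (B, n_head, T, head_size)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back to model width
        return self.resid_drop(self.proj(y))


class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))  # pre-norm attention + residual
        x = x + self.mlp(self.ln2(x))   # pre-norm MLP + residual
        return x


block = Block(n_embd=16, n_head=2, block_size=8, dropout=0.0).eval()
y = block(torch.randn(2, 8, 16))  # shape is preserved: (2, 8, 16)
```

Because of the mask, perturbing the last position of the input leaves the outputs at all earlier positions unchanged, which is a quick way to check causality.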
- Output head
  - Final LayerNorm.
  - Linear projection to vocab logits (`n_embd -> vocab_size`).
  - Loss (when targets are provided) is cross-entropy over flattened logits and targets.
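The flattening in the loss computation can be illustrated as below. The variable names (`ln_f`, `lm_head`) and the random stand-in for the block output are assumptions for the example; whether the real projection uses a bias is not stated here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, vocab_size = 16, 65
ln_f = nn.LayerNorm(n_embd)               # final LayerNorm
lm_head = nn.Linear(n_embd, vocab_size)   # n_embd -> vocab_size

x = torch.randn(4, 8, n_embd)                     # stand-in for the last block's output
targets = torch.randint(0, vocab_size, (4, 8))    # next-token IDs, shape (B, T)

logits = lm_head(ln_f(x))                         # (B, T, vocab_size)
B, T, V = logits.shape
# cross_entropy expects (N, classes) and (N,), so batch and time are merged.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
```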
- Autoregressive generation
  - Maintain the last `block_size` tokens as context.
  - Predict next-token distribution, sample, append, and repeat for `max_new_tokens` steps.
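The sampling loop can be sketched as a standalone function. This is an illustration only: in the repository `generate()` is a method on the model, and the signature below, along with the assumption that calling the model returns logits of shape `(B, T, vocab_size)`, is hypothetical. A tiny untrained stand-in model is used so the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Append max_new_tokens sampled token IDs to idx, which has shape (B, T)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]             # keep only the last block_size tokens
        logits = model(idx_cond)                    # assumed shape (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1) # distribution for the next token
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


# Stand-in model (not a GPT): maps token IDs straight to logits.
toy = nn.Sequential(nn.Embedding(65, 16), nn.Linear(16, 65))
start = torch.zeros((1, 1), dtype=torch.long)       # a single token with ID 0
out = generate(toy, start, max_new_tokens=20, block_size=8)  # shape (1, 21)
```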
- `vocab_size`: Number of unique tokens (characters in the toy setup).
- `block_size`: Maximum context length; inputs longer than this are rejected.
- `n_layer`, `n_head`, `n_embd`: Depth, attention heads, and model width.
- `dropout`: Shared dropout rate for embeddings, attention weights, and MLP outputs.
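As a rough sketch, these fields could be grouped in a dataclass like the one below. The field layout and the validation are assumptions; only the `dropout` default of 0.1 is stated in this README, and the other defaults are illustrative tiny values, not necessarily those of the real `GPTConfig`.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    vocab_size: int          # number of unique tokens
    block_size: int = 64     # illustrative default: maximum context length
    n_layer: int = 2         # illustrative default: number of decoder blocks
    n_head: int = 2          # illustrative default: attention heads per block
    n_embd: int = 64         # illustrative default: model width
    dropout: float = 0.1     # shared dropout rate

    def __post_init__(self):
        # Each head gets n_embd // n_head channels, so the split must be exact.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")


cfg = GPTConfig(vocab_size=65)
head_size = cfg.n_embd // cfg.n_head
```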
- Tokenizer: Simple character-level encoder/decoder built from the provided text.
- Data prep: 90/10 train/val split on tokenized data. Ensures `len(text) > block_size + 1`.
- Batching: Random contiguous chunks of length `block_size`; targets are shifted by one token.
- Optimization: AdamW with constant learning rate (default `3e-4`).
- Evaluation: Periodic loss estimates on train/val via `estimate_loss`.
- Sampling: After training, generates 128 tokens from a zero context and decodes to text.
- Device: Uses CUDA when available, else CPU.
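The tokenizer, split, and batching steps might be sketched as follows. `build_tokenizer` is named in this README, but its signature and return values here are assumptions, and `get_batch` is a hypothetical helper name.

```python
import torch


def build_tokenizer(text):
    """Character-level tokenizer: returns (encode, decode, vocab_size)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(s):
        return [stoi[c] for c in s]

    def decode(ids):
        return "".join(itos[int(i)] for i in ids)

    return encode, decode, len(chars)


def get_batch(data, block_size, batch_size):
    """Sample random contiguous chunks; targets are inputs shifted by one token."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y


text = "hello world, " * 50
encode, decode, vocab_size = build_tokenizer(text)
data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9 * len(data))                 # 90/10 train/val split
train_data, val_data = data[:n], data[n:]

xb, yb = get_batch(train_data, block_size=8, batch_size=4)  # both have shape (4, 8)
```

Each row of `yb` is the matching row of `xb` moved one position to the right, so `xb[:, 1:]` equals `yb[:, :-1]`.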
- Install dependencies (PyTorch only): `pip install torch`
- (Optional) Provide your own training text: save a `.txt` file with enough characters for the block size.
- Run the toy trainer (uses a built-in sample if `--text-file` is omitted): `python train.py --max-iters 200 --block-size 64 --batch-size 32`

  Adjust `--eval-interval`, `--n-layer`, `--n-head`, `--n-embd`, and `--dropout` to explore size/performance trade-offs.
- After training, the script samples 128 tokens from the model and prints the decoded text.
- The training text path is set with `--text-file`. Make sure your text is longer than `block_size + 1`, or lower `--block-size`.
- `model.generate()` expects token IDs; in `train.py` the decode function maps IDs back to characters for inspection.
- To plug in a different tokenizer, swap out `build_tokenizer` in `train.py` and update `vocab_size` in `GPTConfig`.
- For experimentation, start tiny (`n_embd=64`, `n_head=2`, `n_layer=2`) and grow once things work.
- Dropout defaults to 0.1. Lower it if a tiny model underfits; on very small datasets overfitting is the more likely problem, and more dropout may help.
- Mirrors modern GPT decoder blocks while staying short enough to read in one sitting.
- Uses explicit causal masking to make attention behavior transparent.
- Keeps initialization and training loop minimal (AdamW, constant LR) to highlight architecture over optimization tricks.