Mini GPT (PyTorch)

A compact, readable GPT-style decoder-only Transformer implemented in pure PyTorch. The goal is to expose the essential architectural pieces with minimal scaffolding so you can train and tinker quickly.

Layout

model.py – GPT building blocks (token/position embeddings, masked attention, MLP, blocks, and the full model with generate()).
train.py – Small training loop on character-level data with optional sampling at the end.

End-to-end architecture

Token + position embedding
- Tokens are mapped via nn.Embedding(vocab_size, n_embd).
- Absolute positions use a learned embedding nn.Embedding(block_size, n_embd).
- Both embeddings are summed and passed through dropout.
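
The embedding step above can be sketched as follows. This is a minimal illustration, not the code in model.py: the class name `Embeddings`, its attribute names, and the example sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn


class Embeddings(nn.Module):
    """Hypothetical module: token + learned absolute position embeddings."""

    def __init__(self, vocab_size: int, block_size: int, n_embd: int, dropout: float):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)  # token ID -> vector
        self.pos_emb = nn.Embedding(block_size, n_embd)  # position index -> vector
        self.drop = nn.Dropout(dropout)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"sequence length {T} exceeds block_size {self.block_size}")
        pos = torch.arange(T, device=idx.device)   # (T,)
        x = self.tok_emb(idx) + self.pos_emb(pos)  # (B, T, n_embd); positions broadcast over batch
        return self.drop(x)


# Example sizes are arbitrary.
emb = Embeddings(vocab_size=65, block_size=8, n_embd=16, dropout=0.0)
out = emb(torch.randint(0, 65, (4, 8)))  # shape (4, 8, 16)
```
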
Stack of Transformer decoder blocks (repeated n_layer times)
Structure inside each block:
- LayerNorm (pre-norm)
- Causal multi-head self-attention
  - Q, K, V are linear projections of the input, each reshaped to (B, T, n_head, head_size), where head_size = n_embd // n_head.
  - Attention weights = softmax(QK^T / sqrt(head_size)), with a lower-triangular mask applied to the scores before the softmax so each token attends only to itself and earlier positions.
  - The weighted sum over V is projected back to the model width, then passed through dropout.
- Residual connection to input.
- LayerNorm.
- MLP: Linear -> GELU -> Linear with hidden width 4 * n_embd, plus dropout.
- Residual connection again.
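
A pre-norm block with the structure listed above might look like the sketch below. The class names, the fused Q/K/V projection, and applying dropout both to the attention weights and to the projected output are assumptions for illustration; model.py may organize this differently.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        if n_embd % n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # assumption: one fused Q/K/V projection
        self.proj = nn.Linear(n_embd, n_embd)     # back to model width
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        # Lower-triangular mask: True where attention is allowed.
        mask = torch.tril(torch.ones(block_size, block_size)).bool()
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        hs = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, n_head, hs) -> (B, n_head, T, hs) so matmul runs per head
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)            # (B, n_head, T, T)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))  # hide future positions
        att = self.attn_drop(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # merge heads
        return self.resid_drop(self.proj(y))


class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))  # pre-norm attention + residual
        x = x + self.mlp(self.ln2(x))   # pre-norm MLP + residual
        return x


torch.manual_seed(0)
block = Block(n_embd=16, n_head=2, block_size=8, dropout=0.0).eval()
x = torch.randn(1, 8, 16)
y = block(x)

# Causality check: perturbing the last token must not change earlier outputs.
x_perturbed = x.clone()
x_perturbed[:, -1] += 1.0
y_perturbed = block(x_perturbed)
```

Because the mask removes every connection to later positions, `y` and `y_perturbed` should agree everywhere except at the final position.
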
Output head
- Final LayerNorm.
- Linear projection to vocab logits (n_embd -> vocab_size).
- Loss (when targets provided) is cross-entropy over flattened logits and targets.
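
The flattening in the loss step can be illustrated with random tensors. The shapes and variable names here are made up for the example; the uniform-logits check is a common sanity test, not something taken from the repository.

```python
import math

import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 8, 65  # arbitrary example sizes
logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# cross_entropy expects (N, C) logits and (N,) class indices,
# so batch and time are merged into one dimension.
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

# Sanity check: with all-equal logits every token has probability
# 1 / vocab_size, so the loss equals ln(vocab_size).
uniform_loss = F.cross_entropy(torch.zeros(B * T, vocab_size), targets.view(B * T))
expected_uniform = math.log(vocab_size)
```

An untrained model should start near `expected_uniform`; a much higher initial loss usually points to an initialization or data problem.
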
Autoregressive generation
- Maintain the last block_size tokens as context.
- Predict next-token distribution, sample, append, and repeat for max_new_tokens.
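
The generation loop above can be sketched as a standalone function. This is an illustration under stated assumptions: the model is assumed to return logits of shape (B, T, vocab_size) when called on token IDs, whereas the real model may also return a loss. `UniformModel` is a hypothetical stand-in used only to make the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model: nn.Module, idx: torch.Tensor, max_new_tokens: int, block_size: int) -> torch.Tensor:
    """Append max_new_tokens sampled token IDs to idx of shape (B, T)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                    # keep only the last block_size tokens
        logits = model(idx_cond)                           # assumed shape: (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)        # distribution for the next token
        next_id = torch.multinomial(probs, num_samples=1)  # sample one ID per sequence
        idx = torch.cat([idx, next_id], dim=1)
    return idx


class UniformModel(nn.Module):
    """Hypothetical stand-in that predicts a uniform distribution."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        return torch.zeros(B, T, self.vocab_size)


context = torch.zeros((1, 1), dtype=torch.long)  # a single token with ID 0
out = generate(UniformModel(10), context, max_new_tokens=20, block_size=8)
```

The output keeps the full history, while only the last block_size tokens are fed back to the model at each step.
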

Key configuration knobs (`GPTConfig`)

vocab_size: Number of unique tokens (characters in the toy setup).
block_size: Maximum context length; inputs longer than this are rejected.
n_layer, n_head, n_embd: Depth, attention heads, and model width.
dropout: Shared dropout rate for embeddings, attention weights, and MLP outputs.
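
For orientation, a configuration object with these fields could be written as a dataclass. The class below is a hypothetical stand-in named `GPTConfigSketch`. The default values other than dropout, the divisibility check, and the `head_size` property are assumptions for this example and may not match `GPTConfig` in model.py.

```python
from dataclasses import dataclass


@dataclass
class GPTConfigSketch:
    """Illustrative stand-in for GPTConfig; defaults are example values."""

    vocab_size: int
    block_size: int = 64
    n_layer: int = 2
    n_head: int = 2
    n_embd: int = 64
    dropout: float = 0.1

    def __post_init__(self) -> None:
        # Each head gets an equal slice of the model width.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_size(self) -> int:
        return self.n_embd // self.n_head


cfg = GPTConfigSketch(vocab_size=65)
```
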

Training loop (train.py)

Tokenizer: Simple character-level encoder/decoder built from the provided text.
Data prep: 90/10 train/val split on the tokenized data. Requires len(text) > block_size + 1.
Batching: Random contiguous chunks of length block_size; targets are shifted by one token.
Optimization: AdamW with constant learning rate (default 3e-4).
Evaluation: Periodic loss estimates on train/val via estimate_loss.
Sampling: After training, generates 128 tokens from a zero context and decodes to text.
Device: Uses CUDA when available, else CPU.
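
The tokenizer, split, and batching steps above can be sketched together. The function signatures and return values here are assumptions for illustration: train.py has a `build_tokenizer`, but its exact interface is not shown in this README, and `split_data` and `get_batch` are hypothetical names.

```python
import torch


def build_tokenizer(text: str):
    """Character-level tokenizer: returns (encode, decode, vocab_size)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(s: str) -> list:
        return [stoi[c] for c in s]

    def decode(ids) -> str:
        return "".join(itos[int(i)] for i in ids)

    return encode, decode, len(chars)


def split_data(data: torch.Tensor, train_frac: float = 0.9):
    """Contiguous split: the first 90% for training, the rest for validation."""
    n = int(train_frac * len(data))
    return data[:n], data[n:]


def get_batch(data: torch.Tensor, block_size: int, batch_size: int):
    """Sample random contiguous chunks; targets are inputs shifted by one token."""
    # The largest valid start leaves room for block_size inputs plus one target.
    starts = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    x = torch.stack([data[i : i + block_size] for i in starts])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])
    return x, y


text = "hello world, hello transformer! " * 20  # toy text for the example
encode, decode, vocab_size = build_tokenizer(text)
data = torch.tensor(encode(text), dtype=torch.long)
train_data, val_data = split_data(data)
xb, yb = get_batch(train_data, block_size=16, batch_size=4)
```

Each row of `yb` is the matching row of `xb` moved one position to the right, which is what lets the model learn next-token prediction at every position in the chunk.
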

Quickstart

Install dependencies (PyTorch only):
```
pip install torch
```
(Optional) Provide your own training text: save a .txt file longer than block_size + 1 characters and pass it with --text-file.
Run the toy trainer (uses a built-in sample if --text-file is omitted):
```
python train.py --max-iters 200 --block-size 64 --batch-size 32
```
Adjust --eval-interval, --n-layer, --n-head, --n-embd, and --dropout to explore size/performance trade-offs.
After training, the script samples 128 tokens from the model and prints the decoded text.

Notes and guidance

Pass your own dataset with --text-file. Make sure the text is longer than block_size + 1 characters, or lower --block-size.
model.generate() expects token IDs; in train.py the decode function maps IDs back to characters for inspection.
To plug in a different tokenizer, swap out build_tokenizer in train.py and update vocab_size in GPTConfig.
For experimentation, start tiny (n_embd=64, n_head=2, n_layer=2) and grow once things work.
Dropout defaults to 0.1; consider lowering it if a very small model underfits. On very small datasets the more common problem is overfitting, where a higher dropout rate can help.
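
As one example of a different tokenizer, a byte-level scheme needs no vocabulary built from the text. The function name `build_byte_tokenizer` and its return signature are hypothetical; adapt them to whatever interface `build_tokenizer` in train.py actually exposes.

```python
def build_byte_tokenizer():
    """Hypothetical byte-level tokenizer: returns (encode, decode, vocab_size)."""

    def encode(s: str) -> list:
        return list(s.encode("utf-8"))  # each byte becomes one token ID in [0, 255]

    def decode(ids) -> str:
        # errors="replace" keeps decoding from failing on partial
        # multi-byte sequences, which sampled output can contain.
        return bytes(int(i) for i in ids).decode("utf-8", errors="replace")

    return encode, decode, 256


encode_bytes, decode_bytes, byte_vocab_size = build_byte_tokenizer()
ids = encode_bytes("héllo")
roundtrip = decode_bytes(ids)
```

With this scheme vocab_size is always 256 regardless of the training text, so the value passed to GPTConfig no longer depends on the dataset.
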

Why this design

Mirrors modern GPT decoder blocks while staying short enough to read in one sitting.
Uses explicit causal masking to make attention behavior transparent.
Keeps initialization and training loop minimal (AdamW, constant LR) to highlight architecture over optimization tricks.
