nadeem4/mini-gpt

Mini GPT (PyTorch)

A compact, readable GPT-style decoder-only Transformer implemented in pure PyTorch. The goal is to expose the essential architectural pieces with minimal scaffolding so you can train and tinker quickly.

Layout

  • model.py – GPT building blocks (token/position embeddings, masked attention, MLP, blocks, and the full model with generate()).
  • train.py – Small training loop on character-level data with optional sampling at the end.

End-to-end architecture

  1. Token + position embedding

    • Tokens are mapped via nn.Embedding(vocab_size, n_embd).
    • Absolute positions use a learned embedding nn.Embedding(block_size, n_embd).
    • Both embeddings are summed and passed through dropout.
  2. Stack of Transformer decoder blocks (repeated n_layer times)
    Structure inside each block:

    • LayerNorm (pre-norm)
    • Causal multi-head self-attention
      • Q, K, V are linear projections of the block input, each reshaped to (B, T, n_head, head_size), where head_size = n_embd / n_head.
      • Attention weights = softmax((QK^T) / sqrt(head_size)) with a lower-triangular mask so tokens only attend to past and present.
      • Weighted sum over V, then projected back to model width and dropped out.
    • Residual connection to input.
    • LayerNorm.
    • MLP: Linear -> GELU -> Linear with hidden width 4 * n_embd, plus dropout.
    • Residual connection again.
  3. Output head

    • Final LayerNorm.
    • Linear projection to vocab logits (n_embd -> vocab_size).
    • Loss (when targets provided) is cross-entropy over flattened logits and targets.
  4. Autoregressive generation

    • Maintain the last block_size tokens as context.
    • Predict next-token distribution, sample, append, and repeat for max_new_tokens.
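
The block structure described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in model.py: the class names CausalSelfAttention and Block, the fused qkv projection, and the constructor signatures are assumptions made for the example. It follows the listed steps: pre-norm, masked multi-head attention, residual, second LayerNorm, a 4 * n_embd MLP with GELU, and a second residual.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a lower-triangular (causal) mask."""

    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float = 0.0):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.head_size = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection (assumption)
        self.proj = nn.Linear(n_embd, n_embd)     # projects back to model width
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        # True where attention is allowed: position t may see positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        # Scaled dot-product scores, shape (B, n_head, T, T).
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        # Future positions get -inf so softmax assigns them zero weight.
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = self.attn_drop(F.softmax(att, dim=-1))
        # Weighted sum over V, then merge heads back to (B, T, C).
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_drop(self.proj(out))


class Block(nn.Module):
    """Pre-norm decoder block: x + attn(ln(x)), then x + mlp(ln(x))."""

    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float = 0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```

Because LayerNorm and the MLP act on each position independently and the attention is masked, changing a later token should leave the outputs at earlier positions unchanged. That property is a handy sanity check when modifying the attention code.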

Key configuration knobs (GPTConfig)

  • vocab_size: Number of unique tokens (characters in the toy setup).
  • block_size: Maximum context length; inputs longer than this are rejected.
  • n_layer, n_head, n_embd: Depth, attention heads, and model width.
  • dropout: Shared dropout rate for embeddings, attention weights, and MLP outputs.
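
To make the relationship between these knobs concrete, here is a small stand-in for the config object. It is only a sketch: the class name GPTConfigSketch, the head_size property, the validation, and every default except dropout=0.1 are assumptions for illustration, and the real GPTConfig in model.py may differ.

```python
from dataclasses import dataclass


@dataclass
class GPTConfigSketch:
    """Hypothetical stand-in for GPTConfig; field names follow the list above."""

    vocab_size: int          # number of unique tokens
    block_size: int = 64     # maximum context length (assumed default)
    n_layer: int = 2         # number of decoder blocks (assumed default)
    n_head: int = 2          # attention heads per block (assumed default)
    n_embd: int = 64         # model width (assumed default)
    dropout: float = 0.1     # shared dropout rate

    def __post_init__(self) -> None:
        # Each head gets an equal slice of the model width.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_size(self) -> int:
        return self.n_embd // self.n_head


cfg = GPTConfigSketch(vocab_size=65)
print(cfg.head_size)
```

Growing n_embd without adjusting n_head changes head_size, which in turn changes the sqrt(head_size) scaling used in attention.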

Training loop (train.py)

  • Tokenizer: Simple character-level encoder/decoder built from the provided text.
  • Data prep: 90/10 train/val split on the tokenized data. The script checks that len(text) > block_size + 1 so at least one full input/target pair fits.
  • Batching: Random contiguous chunks of length block_size; targets are shifted by one token.
  • Optimization: AdamW with constant learning rate (default 3e-4).
  • Evaluation: Periodic loss estimates on train/val via estimate_loss.
  • Sampling: After training, generates 128 tokens from a zero context and decodes to text.
  • Device: Uses CUDA when available, else CPU.
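
The data side of the loop is small enough to sketch in full. The helper names below (build_char_tokenizer, split_data, get_batch) and their signatures are assumptions for illustration and may not match train.py; the behavior follows the bullets above: character-level encode/decode, a 90/10 split, and random contiguous chunks whose targets are the inputs shifted by one token.

```python
import torch


def build_char_tokenizer(text: str):
    """Return (encode, decode, vocab_size) for a character-level vocabulary."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(s: str) -> list:
        return [stoi[c] for c in s]

    def decode(ids) -> str:
        return "".join(itos[int(i)] for i in ids)

    return encode, decode, len(chars)


def split_data(data: torch.Tensor, train_frac: float = 0.9):
    """Split a 1-D token tensor into train and validation parts."""
    n = int(train_frac * len(data))
    return data[:n], data[n:]


def get_batch(data: torch.Tensor, block_size: int, batch_size: int, device: str = "cpu"):
    """Sample random contiguous chunks; y is x shifted right by one token."""
    # randint's upper bound is exclusive, so i + 1 + block_size <= len(data).
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)


device = "cuda" if torch.cuda.is_available() else "cpu"
text = "hello world, hello gpt! " * 10
encode, decode, vocab_size = build_char_tokenizer(text)
data = torch.tensor(encode(text), dtype=torch.long)
train_data, val_data = split_data(data)
xb, yb = get_batch(train_data, block_size=8, batch_size=4, device=device)
print(xb.shape, yb.shape)
```

Each row of yb holds the next character for every position in the matching row of xb, which is why the split being used must contain more than block_size tokens.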

Quickstart

  1. Install dependencies (PyTorch only):
    pip install torch
  2. (Optional) Provide your own training text: save a .txt file containing more than block_size + 1 characters.
  3. Run the toy trainer (uses a built-in sample if --text-file is omitted):
    python train.py --max-iters 200 --block-size 64 --batch-size 32
    Adjust --eval-interval, --n-layer, --n-head, --n-embd, and --dropout to explore size/performance trade-offs.
  4. After training, the script samples 128 tokens from the model and prints the decoded text.
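
The sampling step can be pictured as the loop below. It is a sketch under stated assumptions, not the generate() method from model.py: the function name sample_tokens is hypothetical, and the model is assumed to be any callable that maps token IDs of shape (B, T) to logits of shape (B, T, vocab_size). The toy model at the bottom exists only to make the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def sample_tokens(model, idx: torch.Tensor, max_new_tokens: int, block_size: int) -> torch.Tensor:
    """Append max_new_tokens sampled token IDs to idx, one at a time."""
    for _ in range(max_new_tokens):
        # Keep only the last block_size tokens as context.
        idx_cond = idx[:, -block_size:]
        # Assumed interface: model returns logits of shape (B, T, vocab_size).
        logits = model(idx_cond)[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # shape (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


# Toy stand-in model (not a GPT): embedding followed by a linear head.
toy_model = nn.Sequential(nn.Embedding(10, 16), nn.Linear(16, 10))
context = torch.zeros((1, 1), dtype=torch.long)  # start from token ID 0
out = sample_tokens(toy_model, context, max_new_tokens=20, block_size=8)
print(out.shape)
```

The returned tensor contains token IDs, including the starting context, so it still has to pass through the tokenizer's decode function before it reads as text.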

Notes and guidance

  • Pass your training text with --text-file. Make sure the text is longer than block_size + 1 characters, or lower --block-size.
  • model.generate() expects token IDs; in train.py the decode function maps IDs back to characters for inspection.
  • To plug in a different tokenizer, swap out build_tokenizer in train.py and update vocab_size in GPTConfig.
  • For experimentation, start tiny (n_embd=64, n_head=2, n_layer=2) and grow once things work.
  • Dropout defaults to 0.1; very small datasets tend to overfit, so consider raising it there, and lower it only if the model underfits.

Why this design

  • Mirrors modern GPT decoder blocks while staying short enough to read in one sitting.
  • Uses explicit causal masking to make attention behavior transparent.
  • Keeps initialization and training loop minimal (AdamW, constant LR) to highlight architecture over optimization tricks.
