A ~190M parameter language model built from scratch in PyTorch.
MicroLM is a GPT-style decoder-only transformer trained on the TinyStories dataset. Built to understand modern LLM architectures by implementing every component from the ground up.
- Modern architecture — RMSNorm, SwiGLU, multi-head causal attention
- Clean implementation — ~200 lines of core model code
- Trained from scratch — On 474M tokens from TinyStories
- Consumer GPU friendly — Designed for 8GB VRAM
```
                Input Tokens
                     │
                     ▼
┌─────────────────────────────────────────┐
│  Token Embedding + Position Embedding   │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  Transformer Block (×12)                │
│   ├── RMSNorm → Multi-Head Attention    │
│   └── RMSNorm → SwiGLU FFN              │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  RMSNorm → Linear (→ vocab_size)        │
└─────────────────────────────────────────┘
                     │
                     ▼
                Output Logits
```
| Parameter | Value |
|---|---|
| Vocabulary | 50,257 (GPT-2 BPE) |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| FFN hidden dim | 3,072 |
| Context length | 512 |
| Total parameters | ~190M |
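The ~190M total can be reproduced from the table with back-of-the-envelope arithmetic. The sketch below assumes bias-free projections, learned position embeddings, and an untied output head; these are assumptions about the implementation, not confirmed details.

```python
vocab, dim, layers, hidden, ctx = 50_257, 768, 12, 3_072, 512

tok_emb = vocab * dim            # token embedding table
pos_emb = ctx * dim              # learned position embeddings
attn = 4 * dim * dim             # q, k, v, and output projections
ffn = 3 * dim * hidden           # SwiGLU: w1, w2, w3
blocks = layers * (attn + ffn)
lm_head = dim * vocab            # untied output projection

total = tok_emb + pos_emb + blocks + lm_head
print(f"{total / 1e6:.1f}M")     # 190.8M (norm weights add only a few thousand more)
```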
Root Mean Square normalization, which is simpler and faster than LayerNorm and is used in Llama.
```python
def forward(self, x):
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
    return (x / rms) * self.weight
```

Gated activation function that outperforms a standard GELU FFN.

```python
def forward(self, x):
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Multi-head attention with causal masking to prevent attending to future tokens.
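The masking idea can be shown in a few lines of scored-dot-product attention. This is a minimal sketch, not the repo's attention module; `causal_attention` is a hypothetical helper, and the repo may instead use `F.scaled_dot_product_attention(..., is_causal=True)`.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Each position attends only to itself and the past."""
    T = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Upper-triangular mask (above the diagonal) marks future positions.
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # -inf -> zero weight after softmax
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 12, 8, 64)
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 8, 64])
```

A useful sanity check: position 0 can only attend to itself, so its output equals its value vector exactly.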
```
microlm/
├── data/
│   ├── data_download.py       # Download TinyStories
│   ├── data_preprocessing.py  # Tokenize with GPT-2 BPE
│   └── dataset.py             # PyTorch Dataset class
├── model/
│   ├── bigram.py              # Baseline model
│   └── transformer.py         # Transformer implementation
├── train.py                   # Training script
└── checkpoints/               # Saved weights
```
```bash
# Create environment
conda create -n microlm python=3.10
conda activate microlm
pip install -r requirements.txt

# Install PyTorch (see https://pytorch.org)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

```bash
# Download and tokenize TinyStories
python data/data_download.py
python data/data_preprocessing.py
```

```bash
# Train
python train.py
```

- RoPE — Rotary Position Embeddings for better length generalization
- KV-Cache — Efficient autoregressive inference
- Mixed Precision — FP16 training with `torch.cuda.amp`
- Learning Rate Schedule — Cosine decay with warmup
- Text Generation — Temperature, top-k, and top-p sampling
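The planned sampling strategies combine naturally in one decoding step. The sketch below is a hypothetical helper, not the repo's API: `sample_next` and its signature are illustrative, and `logits` is the model's output for the last position.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token id from logits of shape (vocab_size,)."""
    logits = logits / max(temperature, 1e-8)       # sharpen (<1) or flatten (>1)
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep k largest
    if top_p < 1.0:
        sorted_logits, idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum > top_p
        cut[1:] = cut[:-1].clone()                 # shift so the crossing token is kept
        cut[0] = False                             # always keep the top token
        logits[idx[cut]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()

# Usage with toy logits; top_k=1 degenerates to greedy decoding.
tok = sample_next(torch.tensor([1.0, 5.0, 2.0]), temperature=0.8, top_k=1)
print(tok)  # 1
```

Top-k truncates to a fixed number of candidates, while top-p (nucleus) truncates to the smallest set whose probability mass exceeds `p`, so the candidate count adapts to how peaked the distribution is.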