
MicroLM

A ~190M parameter language model built from scratch in PyTorch.

Overview

MicroLM is a GPT-style decoder-only transformer trained on the TinyStories dataset. It was built as a way to understand modern LLM architectures by implementing every component from the ground up.

Highlights

  • Modern architecture — RMSNorm, SwiGLU, multi-head causal attention
  • Clean implementation — ~200 lines of core model code
  • Trained from scratch — On 474M tokens from TinyStories
  • Consumer GPU friendly — Designed for 8GB VRAM

Architecture

Input Tokens
     │
     ▼
┌─────────────────────────────────────────┐
│  Token Embedding + Position Embedding   │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│  Transformer Block (×12)                │
│  ├── RMSNorm → Multi-Head Attention     │
│  └── RMSNorm → SwiGLU FFN               │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│  RMSNorm → Linear (→ vocab_size)        │
└─────────────────────────────────────────┘
     │
     ▼
Output Logits

Model Configuration

Parameter          Value
Vocabulary         50,257 (GPT-2 BPE)
Embedding dim      768
Layers             12
Attention heads    12
FFN hidden dim     3,072
Context length     512
Total parameters   ~190M
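The total in the table can be sanity-checked with a back-of-envelope count. This sketch assumes untied input/output embeddings, bias-free linear layers, and a three-projection SwiGLU FFN; the exact figure depends on the actual implementation:

```python
# Back-of-envelope parameter count from the configuration table.
vocab, d, layers, ffn, ctx = 50257, 768, 12, 3072, 512

emb = vocab * d + ctx * d   # token + position embeddings
attn = 4 * d * d            # Q, K, V, and output projections
swiglu = 3 * d * ffn        # w1, w2, w3
block = attn + swiglu       # per transformer block (norms are negligible)
head = d * vocab            # final linear layer (assuming untied weights)

total = emb + layers * block + head
print(f"{total / 1e6:.1f}M parameters")  # → 190.8M parameters
```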

Key Components

RMSNorm

Root Mean Square normalization, as used in Llama. It is simpler and faster than LayerNorm because it normalizes by the RMS alone, skipping mean subtraction and the bias term.

def forward(self, x):
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
    return (x / rms) * self.weight
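For reference, a self-contained version of the snippet above (the class name, constructor signature, and the eps default are assumptions, not necessarily the repo's code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps  # added inside the sqrt for numerical stability
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale

    def forward(self, x):
        # Normalize by the root-mean-square over the feature dimension.
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```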

SwiGLU

A gated feed-forward activation that typically outperforms a standard GELU feed-forward network.

def forward(self, x):
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
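Made self-contained, with the three projections declared. The w1/w2/w3 names follow the snippet above; the bias-free layers and default sizes are assumptions based on the configuration table:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        # SiLU-activated gate multiplied elementwise with the value path.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```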

Causal Self-Attention

Multi-head attention with causal masking to prevent attending to future tokens.
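A minimal sketch of such a layer using PyTorch's built-in scaled_dot_product_attention. The class name, n_head argument, and bias-free projections are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_head, T, head_dim) for per-head attention.
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # is_causal=True masks out attention to future positions.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```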

Project Structure

microlm/
├── data/
│   ├── data_download.py       # Download TinyStories
│   ├── data_preprocessing.py  # Tokenize with GPT-2 BPE
│   └── dataset.py             # PyTorch Dataset class
├── model/
│   ├── bigram.py              # Baseline model
│   └── transformer.py         # Transformer implementation
├── train.py                   # Training script
└── checkpoints/               # Saved weights

Setup

# Create environment
conda create -n microlm python=3.10
conda activate microlm
pip install -r requirements.txt

# Install PyTorch (see https://pytorch.org)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126

Usage

Prepare Data

python data/data_download.py
python data/data_preprocessing.py

Train

python train.py

Future Improvements

  • RoPE — Rotary Position Embeddings for better length generalization
  • KV-Cache — Efficient autoregressive inference
  • Mixed Precision — FP16 training with torch.cuda.amp
  • Learning Rate Schedule — Cosine decay with warmup
  • Text Generation — Temperature, top-k, and top-p sampling
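The sampling item above can be sketched as follows; the function name and defaults are illustrative assumptions for the planned generation loop:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50):
    # Higher temperature flattens the distribution; lower sharpens it.
    logits = logits / temperature
    if top_k is not None:
        # Mask everything below the k-th largest logit.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```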
