A ~190M parameter language model built from scratch in PyTorch.
MicroLM is a GPT-style decoder-only transformer trained on the TinyStories dataset. Built to understand modern LLM architectures by implementing every component from the ground up.
- Modern architecture — RMSNorm, SwiGLU, multi-head causal attention
- Clean implementation — ~200 lines of core model code
- Trained from scratch — On 474M tokens from TinyStories
- Consumer GPU friendly — Designed for 8GB VRAM
```
                Input Tokens
                     │
                     ▼
┌─────────────────────────────────────────┐
│  Token Embedding + Position Embedding   │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  Transformer Block (×12)                │
│   ├── RMSNorm → Multi-Head Attention    │
│   └── RMSNorm → SwiGLU FFN              │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  RMSNorm → Linear (→ vocab_size)        │
└─────────────────────────────────────────┘
                     │
                     ▼
                Output Logits
```
| Parameter | Value |
|---|---|
| Vocabulary | 50,257 (GPT-2 BPE) |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| FFN hidden dim | 3,072 |
| Context length | 512 |
| Total parameters | ~190M |
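The ~190M total can be reproduced from the table with back-of-the-envelope arithmetic. The sketch below assumes bias-free projections, learned position embeddings, and an untied output head; these are assumptions about the implementation, not confirmed details.

```python
vocab, dim, layers, hidden, ctx = 50_257, 768, 12, 3_072, 512

tok_emb = vocab * dim            # token embedding table
pos_emb = ctx * dim              # learned position embeddings
attn = 4 * dim * dim             # q, k, v, and output projections
ffn = 3 * dim * hidden           # SwiGLU: w1, w2, w3
blocks = layers * (attn + ffn)
lm_head = dim * vocab            # untied output projection

total = tok_emb + pos_emb + blocks + lm_head
print(f"{total / 1e6:.1f}M")     # 190.8M (norm weights add only a few thousand more)
```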
Root Mean Square normalization, which is simpler and faster than LayerNorm and is used in Llama.
```python
def forward(self, x):
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
    return (x / rms) * self.weight
```

Gated activation function that outperforms a standard GELU FFN.

```python
def forward(self, x):
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Multi-head attention with causal masking to prevent attending to future tokens.
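The masking idea can be shown in a few lines of scored-dot-product attention. This is a minimal sketch, not the repo's attention module; `causal_attention` is a hypothetical helper, and the repo may instead use `F.scaled_dot_product_attention(..., is_causal=True)`.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Each position attends only to itself and the past."""
    T = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Upper-triangular mask (above the diagonal) marks future positions.
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # -inf -> zero weight after softmax
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 12, 8, 64)
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 8, 64])
```

A useful sanity check: position 0 can only attend to itself, so its output equals its value vector exactly.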
```
microlm/
├── data/
│   ├── data_download.py       # Download TinyStories
│   ├── data_preprocessing.py  # Tokenize with GPT-2 BPE
│   └── dataset.py             # PyTorch Dataset class
├── model/
│   ├── bigram.py              # Baseline model
│   └── transformer.py         # Transformer implementation
├── train.py                   # Training script
└── checkpoints/               # Saved weights
```
```bash
# Create environment
conda create -n microlm python=3.10
conda activate microlm
pip install -r requirements.txt

# Install PyTorch (see https://pytorch.org)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

```bash
# Download and tokenize TinyStories
python data/data_download.py
python data/data_preprocessing.py
```

```bash
# Train
python train.py
```

- RoPE — Rotary Position Embeddings for better length generalization
- KV-Cache — Efficient autoregressive inference
- Mixed Precision — FP16 training with `torch.cuda.amp`
- Learning Rate Schedule — Cosine decay with warmup
- Text Generation — Temperature, top-k, and top-p sampling
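The planned sampling strategies combine naturally in one decoding step. The sketch below is a hypothetical helper, not the repo's API: `sample_next` and its signature are illustrative, and `logits` is the model's output for the last position.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token id from logits of shape (vocab_size,)."""
    logits = logits / max(temperature, 1e-8)       # sharpen (<1) or flatten (>1)
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep k largest
    if top_p < 1.0:
        sorted_logits, idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum > top_p
        cut[1:] = cut[:-1].clone()                 # shift so the crossing token is kept
        cut[0] = False                             # always keep the top token
        logits[idx[cut]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()

# Usage with toy logits; top_k=1 degenerates to greedy decoding.
tok = sample_next(torch.tensor([1.0, 5.0, 2.0]), temperature=0.8, top_k=1)
print(tok)  # 1
```

Top-k truncates to a fixed number of candidates, while top-p (nucleus) truncates to the smallest set whose probability mass exceeds `p`, so the candidate count adapts to how peaked the distribution is.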