
Tesserack

Compiling strategy guides into reward functions for reinforcement learning. Tesserack uses Claude Vision to extract unit tests from game guides, then trains agents with dense, interpretable rewards.

Live Demo

What is this?

Most RL game agents learn from scratch with sparse rewards ("you won" / "you lost"). Tesserack takes a different approach: it uses an LLM to read a strategy guide and extract structured "unit tests" that fire as dense rewards throughout gameplay.

The strategy guide becomes a curriculum. Instead of stumbling randomly until it accidentally beats Brock, the agent gets rewarded for each step along the way (a minimal sketch of one such test follows the list):

  • Walking toward the gym (+0.1)
  • Entering the gym door (+2.0)
  • Winning the badge (+50.0)
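
Concretely, a compiled test is just a predicate over consecutive emulator states plus a scalar reward, matching the tests(prev, curr) → r shape in the architecture diagram below. A minimal TypeScript sketch; the GameState and UnitTest shapes here are illustrative assumptions, not the repo's actual schema:

// Hypothetical shapes (illustrative, not the repo's actual schema).
interface GameState {
  map: string;     // Current map/room identifier
  x: number;       // Player tile coordinates
  y: number;
  badges: number;  // Badge count
}

interface UnitTest {
  id: string;
  tier: 1 | 2 | 3;
  reward: number;
  // Fires when the transition from prev to curr satisfies the test.
  check: (prev: GameState, curr: GameState) => boolean;
}

// Tier 2 landmark test: the agent just entered the Pewter Gym.
const enterPewterGym: UnitTest = {
  id: "pewter-gym-door",
  tier: 2,
  reward: 2.0,
  check: (prev, curr) => prev.map !== "PEWTER_GYM" && curr.map === "PEWTER_GYM",
};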

How it works

Human Knowledge     →  LLM Compiler  →  Unit Test Rewards  →  RL Agent
(Prima Guide PDF)      (Claude Vision)   (test-bundles.json)   (REINFORCE)
  1. Extract: Claude Vision reads pages from the Prima Strategy Guide and extracts locations, objectives, and map coordinates
  2. Compile: Extractions become tiered unit tests (movement → landmarks → objectives)
  3. Train: REINFORCE policy network gets dense rewards as tests fire

The LLM acts as a "compiler" that translates human-readable instructions into machine-executable reward signals.
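
For concreteness, here is what one compiled entry in test-bundles.json might look like, written as a TypeScript literal; the field names and coordinates are assumptions for illustration (see the Extraction Pipeline section for how the real bundles are generated):

// Assumed shape of one entry in test-bundles.json (illustrative only).
const pewterCityBundle = {
  location: "Pewter City",
  mapId: "PEWTER_CITY",
  tests: [
    // Tier 1: micro movement toward a target tile
    { id: "approach-gym", tier: 1, reward: 0.1, kind: "move-toward", target: { x: 16, y: 17 } },
    // Tier 2: landmark (map transition into the gym)
    { id: "enter-gym", tier: 2, reward: 2.0, kind: "map-change", target: "PEWTER_GYM" },
    // Tier 3: objective (badge earned)
    { id: "boulder-badge", tier: 3, reward: 50.0, kind: "badge", target: "BOULDER" },
  ],
};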

Reward Tiers

Tier       What It Rewards  Example                                      Reward
Tier 1     Micro movement   Coordinates changed, moved toward objective  0.1 - 0.2
Tier 2     Landmarks        Reached Oak's Lab region, entered a door     2.0 - 5.0
Tier 3     Objectives       Got starter Pokemon, earned badge            10.0 - 50.0
Penalties  Bad behavior     Stuck for 30+ frames                         -0.5
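
Putting the tiers together, the per-step reward is simply the sum of every test that fires on a state transition, which is what keeps the signal dense. A sketch reusing the assumed GameState and UnitTest shapes from above:

// Sum the rewards of all tests that fire on this transition, plus the
// stuck penalty from the table above. Returning the fired test IDs is
// what makes the reward interpretable.
function stepReward(
  tests: UnitTest[],
  prev: GameState,
  curr: GameState,
  framesStuck: number,
): { reward: number; fired: string[] } {
  const fired = tests.filter((t) => t.check(prev, curr));
  let reward = fired.reduce((sum, t) => sum + t.reward, 0);
  if (framesStuck >= 30) reward -= 0.5;
  return { reward, fired: fired.map((t) => t.id) };
}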

Inspired by OLMoCR-2

OLMoCR-2 showed that unit tests make excellent reward signals: deterministic, interpretable, and dense. Tesserack applies that insight to game playing: the strategy guide's objectives become the "unit tests" that shape agent behavior.

Quick Start

cd app
npm install
npm run dev

Open http://localhost:5173, drop a Pokemon Red ROM, and switch to Train mode.

Requirements: Chrome/Edge 113+ (WebGPU), Pokemon Red ROM

Features

  • Pure RL Mode: REINFORCE policy network with unit test rewards (no LLM at runtime)
  • LLM Mode: Browser-based language model for task decomposition
  • 675 pre-compiled tests across 41 locations from the Prima Guide
  • Real-time visualization: reward breakdown, policy entropy, training metrics
  • Export/Import: backup and restore all training data and save states

Extraction Pipeline

To regenerate test bundles from the Prima Guide (requires Anthropic API key):

# Download guide from archive.org
npm run guide:download

# Convert PDF pages to images
npm run guide:extract-pages

# Extract structured data via Claude Vision
ANTHROPIC_API_KEY=sk-... npm run guide:extract-claude

# Compile into test bundles
npm run guide:generate-bundles

# Validate output
npm run guide:validate

See docs/EXTRACTION.md for details.
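
The guide:extract-claude step is where Claude Vision comes in. A minimal sketch of a vision call over one rendered guide page using the official @anthropic-ai/sdk; the model ID, file path, and prompt are assumptions here, not the repo's actual script:

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic(); // Reads ANTHROPIC_API_KEY from the environment

// Assumed page path; each PDF page was rendered to a PNG earlier in the pipeline.
const page = readFileSync("guide/pages/page-042.png").toString("base64");

const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022", // Assumed model choice
  max_tokens: 2048,
  messages: [
    {
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: page } },
        { type: "text", text: "Extract every location, objective, and map coordinate on this page as JSON." },
      ],
    },
  ],
});

const block = response.content[0];
if (block.type === "text") console.log(block.text);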

Architecture

Pure RL Mode (Unit Test Rewards)

┌─────────────────────────────────────────────────────────────┐
│   ┌─────────────┐         ┌─────────────┐                   │
│   │   Policy    │ action  │  Emulator   │  new state        │
│   │   πθ(a|s)   │────────▶│  (binjgb)   │─────────┐         │
│   └─────────────┘         └─────────────┘         │         │
│         ▲                                         │         │
│         │ REINFORCE                               ▼         │
│   ┌─────────────┐         ┌─────────────────────────────┐   │
│   │  Rollout    │◀────────│  Unit Test Rewards          │   │
│   │  Buffer     │  reward │  tests(prev, curr) → r      │   │
│   └─────────────┘         └─────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
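
The REINFORCE update itself is standard: compute discounted returns over a rollout, then scale each step's log-probability by its baseline-subtracted return. A self-contained sketch of the return/loss computation the rollout buffer would feed; the Step shape and the mean-return baseline are assumptions:

// REINFORCE loss: L = -sum_t log pi_theta(a_t|s_t) * (G_t - b),
// with discounted return-to-go G_t = r_t + gamma * G_{t+1}.
interface Step { logProb: number; reward: number; }

function reinforceLoss(rollout: Step[], gamma = 0.99): number {
  const returns = new Array<number>(rollout.length);
  let g = 0;
  for (let t = rollout.length - 1; t >= 0; t--) {
    g = rollout[t].reward + gamma * g;
    returns[t] = g;
  }
  // Mean-return baseline to reduce gradient variance.
  const mean = returns.reduce((a, b) => a + b, 0) / returns.length;
  return rollout.reduce((loss, step, t) => loss - step.logProb * (returns[t] - mean), 0);
}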

LLM Mode (Guide-Enhanced)

┌─────────────────────────────────────────────────────────────┐
│   ┌─────────────┐         ┌─────────────┐                   │
│   │  Browser    │  tasks  │   Policy    │  actions          │
│   │  LLM        │────────▶│  (Executor) │─────────▶ Game    │
│   └─────────────┘         └─────────────┘                   │
│         ▲                       │                           │
│         │                       │ learns                    │
│         │ context               ▼                           │
│   ┌─────────────┐         ┌─────────────┐                   │
│   │  Walkthrough│         │  Reward     │                   │
│   │  Graph      │         │  System     │                   │
│   └─────────────┘         └─────────────┘                   │
└─────────────────────────────────────────────────────────────┘
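
A sketch of the control loop this diagram implies: the browser LLM decomposes walkthrough context into tasks, and the executor policy carries them out while learning from the reward system. Every name here (proposeTasks, walkthroughContext, act) is a hypothetical stand-in for the repo's actual modules:

// Hypothetical LLM-mode loop (names are illustrative stand-ins).
interface Task {
  description: string;
  done: (s: GameState) => boolean;
}

async function llmModeLoop(
  proposeTasks: (context: string) => Promise<Task[]>, // Browser LLM decomposition
  walkthroughContext: () => string,                   // Walkthrough graph lookup
  act: (task: Task) => Promise<GameState>,            // One executor step in the game
): Promise<never> {
  while (true) {
    const tasks = await proposeTasks(walkthroughContext());
    for (const task of tasks) {
      let state = await act(task);
      while (!task.done(state)) state = await act(task); // Rewards train the executor
    }
  }
}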

Why This Matters

  1. Reward engineering is hard - Tesserack automates it by mining existing guides
  2. Curriculum is implicit - The guide's structure naturally provides learning progression
  3. Transferable method - Any game with a strategy guide could use this approach
  4. Interpretable rewards - You can see exactly which tests fired and why

License

MIT


Built by Sid Mohan
