LLM Quest Benchmark

Benchmark for evaluating LLM context scaffolds on interactive fiction quests. Measures how prompt context, compact memory, tools, and planning loops affect sequential decision-making across models and tasks.

Project Site | Leaderboard | About / Write-up

See the About page for the project narrative, taxonomy, metrics, caveats, and model selection rationale.

Quick Start (Docker)

git clone --recursive https://github.com/yourconscience/llm_quest_benchmark.git
cd llm_quest_benchmark

cp .env.template .env
# Add your API keys to .env

# Run a single quest
docker compose run llm-quest run --quest quests/Boat.qm --model gemini-3-flash-preview

# Run a benchmark matrix
docker compose run llm-quest benchmark --config configs/benchmarks/memory_full_transcript.yaml

Local Development

Prerequisites

Python 3.11+, Node.js 18+, uv, pnpm

Install

git clone --recursive https://github.com/yourconscience/llm_quest_benchmark.git
cd llm_quest_benchmark
uv sync --extra dev
pnpm install

cp .env.template .env
# Add your API keys

./download_quests.sh --refresh

Usage

# Run one quest
uv run llm-quest run --quest quests/Boat.qm --model gemini-3-flash-preview --timeout 120

# Run benchmark matrix
uv run llm-quest benchmark --config configs/benchmarks/memory_full_transcript.yaml

# Generate report from benchmark results
uv run llm-quest benchmark-report --benchmark-id <id> --output report.md

# Analyze a single run
uv run llm-quest analyze-run --run-summary results/<harness>/<quest>/run_<id>/run_summary.json

# Play as human in terminal
uv run llm-quest play --quest quests/Boat.qm

# Build static site JS assets
pnpm run build

# Rebuild compressed Play quest assets after refreshing local quest files
pnpm run build:play-assets
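
The analyze-run command above takes a path of the form results/<harness>/<quest>/run_<id>/run_summary.json. As a minimal sketch, assuming that layout holds for every run, the summaries under a results directory could be listed as follows. The names RunRef and find_run_summaries are hypothetical helpers for illustration and are not part of the llm-quest CLI.

```python
from pathlib import Path
from typing import List, NamedTuple


class RunRef(NamedTuple):
    """Hypothetical record describing one run found on disk."""
    harness: str
    quest: str
    run_id: str
    summary_path: Path


def find_run_summaries(results_dir: Path) -> List[RunRef]:
    """List run summaries, assuming results/<harness>/<quest>/run_<id>/run_summary.json."""
    if not results_dir.is_dir():
        return []
    refs = []
    for path in sorted(results_dir.glob("*/*/run_*/run_summary.json")):
        run_dir = path.parent
        refs.append(
            RunRef(
                harness=run_dir.parent.parent.name,
                quest=run_dir.parent.name,
                run_id=run_dir.name[len("run_"):],  # strip the "run_" prefix
                summary_path=path,
            )
        )
    return refs


if __name__ == "__main__":
    # Print one path per run; each can be passed to analyze-run --run-summary.
    for ref in find_run_summaries(Path("results")):
        print(ref.harness, ref.quest, ref.run_id, ref.summary_path)
```

Each printed path can then be handed to uv run llm-quest analyze-run --run-summary.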

API Configuration

Set provider API keys in .env (see .env.template).

OpenRouter

Any model on OpenRouter can be used with the openrouter: prefix:

uv run llm-quest run --model openrouter:openai/gpt-5.4-mini --quest quests/Boat.qm

Requires OPENROUTER_API_KEY in .env.
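
To illustrate the prefix convention, here is a minimal sketch of how a --model value might be split into a provider and a model name. This is an assumption about the format, not the project's actual parser: parse_model_spec, ModelSpec, and the KNOWN_PREFIXES set are hypothetical, and only the openrouter: and openai: prefixes mentioned in this README are recognized.

```python
from typing import NamedTuple, Optional

# Assumption: only the prefixes named in this README; the real list may differ.
KNOWN_PREFIXES = {"openrouter", "openai"}


class ModelSpec(NamedTuple):
    provider: Optional[str]  # None when no recognized prefix is present
    model: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider:model' on the first colon if the prefix is recognized."""
    prefix, sep, rest = spec.partition(":")
    if sep and rest and prefix in KNOWN_PREFIXES:
        return ModelSpec(prefix, rest)
    # No recognized prefix: keep the whole string as the model name.
    return ModelSpec(None, spec)


if __name__ == "__main__":
    print(parse_model_spec("openrouter:openai/gpt-5.4-mini"))
    print(parse_model_spec("gemini-3-flash-preview"))
```

Splitting on the first colon only keeps any later colons inside the model name intact.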

OpenAI-compatible proxies (OAuth, local LLMs)

Set OPENAI_BASE_URL in .env to route the openai: provider through any compatible endpoint:

# .env - local OAuth proxy (CLIProxyAPI, codex-proxy, openai-oauth, etc.)
OPENAI_BASE_URL=http://localhost:8318/v1

No OPENAI_API_KEY is needed when using a local proxy:

uv run llm-quest run --model gpt-5-mini --quest quests/Boat.qm

When OPENAI_BASE_URL is set, the client will not forward OPENAI_API_KEY to that endpoint. If your proxy requires bearer auth, set OPENAI_BASE_URL_API_KEY explicitly.
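
The credential rule described above can be sketched as a small function. This is an illustration of the documented behavior under stated assumptions, not the project's implementation: resolve_openai_endpoint is a hypothetical name, and only the three environment variables come from this README.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_openai_endpoint(env: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (base_url, api_key) for the openai: provider.

    With OPENAI_BASE_URL set, OPENAI_API_KEY is deliberately ignored so the
    real OpenAI key is never sent to a proxy; only OPENAI_BASE_URL_API_KEY
    (if any) is used. Without it, the default endpoint and OPENAI_API_KEY apply.
    """
    base_url = env.get("OPENAI_BASE_URL") or None
    if base_url:
        return base_url, env.get("OPENAI_BASE_URL_API_KEY") or None
    return None, env.get("OPENAI_API_KEY") or None


if __name__ == "__main__":
    base_url, api_key = resolve_openai_endpoint(os.environ)
    # Avoid printing the key itself; only report whether one was selected.
    print("base_url:", base_url or "(default)", "| key set:", api_key is not None)
```

Keeping the proxy key in a separate variable means a misconfigured base URL cannot leak the main OpenAI key.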

Direct API keys

Provider-specific keys in .env:

OPENAI_API_KEY - OpenAI API
GOOGLE_API_KEY / GEMINI_API_KEY - Google AI Studio
ANTHROPIC_API_KEY - Anthropic API
DEEPSEEK_API_KEY - DeepSeek API
OPENROUTER_API_KEY - OpenRouter (multi-provider gateway)
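
Before launching a benchmark matrix, it can help to check which of the keys listed above are actually set. The following is a minimal sketch: configured_providers and the PROVIDER_KEYS table are hypothetical, the variable names are taken from the list above, and the script assumes the variables are already exported into the process environment (it does not read .env itself).

```python
import os
from typing import List, Mapping

# Provider label -> environment variables that can supply its key.
PROVIDER_KEYS = {
    "OpenAI": ("OPENAI_API_KEY",),
    "Google AI Studio": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
    "Anthropic": ("ANTHROPIC_API_KEY",),
    "DeepSeek": ("DEEPSEEK_API_KEY",),
    "OpenRouter": ("OPENROUTER_API_KEY",),
}


def configured_providers(env: Mapping[str, str]) -> List[str]:
    """Return providers with at least one non-empty key variable in env."""
    return [
        name
        for name, variables in PROVIDER_KEYS.items()
        if any(env.get(var) for var in variables)
    ]


if __name__ == "__main__":
    found = configured_providers(os.environ)
    print("Configured providers:", ", ".join(found) if found else "none")
```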

Project Structure

llm_quest_benchmark/harnesses/ - LLM harness implementations for prompt, memory, tools, and planning experiments
llm_quest_benchmark/players/ - Non-LLM player primitives (human, random_choice)
llm_quest_benchmark/prompt_templates/ - Jinja2 prompt templates for the public context-scaffold taxonomy
llm_quest_benchmark/executors/ - CLI, benchmark orchestration, TS bridge
configs/benchmarks/ - YAML benchmark configurations
quests/ - Quest files (downloaded via download_quests.sh)
space-rangers-quest/ - TypeScript quest engine (submodule)
docs/ARCHITECTURE.md - Runtime architecture and taxonomy mapping
docs/DATASHEET.md - Dataset and public leaderboard slice documentation
research/ - Error analysis, landscape comparison (gitignored)

Validation

uv run llm-quest --help
uv run ruff check .
uv run ruff format --check .
uv run pytest
pnpm run build
python3 -m json.tool site/leaderboard.json >/tmp/llm_quest_leaderboard_check.json

# Report-only for now: the broad pre-existing typing backlog is not a release gate.
uv run mypy llm_quest_benchmark

License

MIT
