Benchmark for evaluating LLM context scaffolds on interactive fiction quests. Measures how prompt context, compact memory, tools, and planning loops affect sequential decision-making across models and tasks.
See the About page for the project narrative, taxonomy, metrics, caveats, and model selection rationale.
git clone --recursive https://github.com/yourconscience/llm_quest_benchmark.git
cd llm_quest_benchmark
cp .env.template .env
# Add your API keys in .env file
# Run a single quest
docker compose run llm-quest run --quest quests/Boat.qm --model gemini-3-flash-preview
# Run a benchmark matrix
docker compose run llm-quest benchmark --config configs/benchmarks/memory_full_transcript.yaml
- Python 3.11+, Node.js 18+, uv, pnpm
git clone --recursive https://github.com/yourconscience/llm_quest_benchmark.git
cd llm_quest_benchmark
uv sync --extra dev
pnpm install
cp .env.template .env
# Add your API keys
./download_quests.sh --refresh
# Run one quest
uv run llm-quest run --quest quests/Boat.qm --model gemini-3-flash-preview --timeout 120
# Run benchmark matrix
uv run llm-quest benchmark --config configs/benchmarks/memory_full_transcript.yaml
# Generate report from benchmark results
uv run llm-quest benchmark-report --benchmark-id <id> --output report.md
# Analyze a single run
uv run llm-quest analyze-run --run-summary results/<harness>/<quest>/run_<id>/run_summary.json
# Play as human in terminal
uv run llm-quest play --quest quests/Boat.qm
# Build static site JS assets
pnpm run build
# Rebuild compressed Play quest assets after refreshing local quest files
pnpm run build:play-assets
Set provider API keys in .env (see .env.template).
Any model on OpenRouter can be used with the openrouter: prefix:
uv run llm-quest run --model openrouter:openai/gpt-5.4-mini --quest quests/Boat.qm
Requires OPENROUTER_API_KEY in .env.
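The prefix convention above can be illustrated with a minimal sketch. This is not the project's actual parser: `ModelSpec` and `parse_model_spec` are hypothetical names, and the rule "split on the first colon, treat unprefixed names as having no explicit provider" is an assumption based only on the example command.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    """Hypothetical container for a parsed --model value."""

    provider: Optional[str]  # None when no explicit prefix was given
    model: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider:model' on the first colon only (assumed convention)."""
    provider, sep, model = spec.partition(":")
    if sep and provider and model:
        return ModelSpec(provider, model)
    # No usable prefix: leave provider resolution to the caller.
    return ModelSpec(None, spec)


if __name__ == "__main__":
    print(parse_model_spec("openrouter:openai/gpt-5.4-mini"))
    print(parse_model_spec("gemini-3-flash-preview"))
```

Splitting on the first colon only keeps any later colons inside the model name itself, so the part after `openrouter:` is passed through unchanged.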
Set OPENAI_BASE_URL in .env to route the openai: provider through any compatible endpoint:
# .env - local OAuth proxy (CLIProxyAPI, codex-proxy, openai-oauth, etc.)
OPENAI_BASE_URL=http://localhost:8318/v1
No OPENAI_API_KEY needed when using a local proxy:
uv run llm-quest run --model gpt-5-mini --quest quests/Boat.qm
When OPENAI_BASE_URL is set, the client will not forward OPENAI_API_KEY to that endpoint.
If your proxy requires bearer auth, set OPENAI_BASE_URL_API_KEY explicitly.
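The key-forwarding rule described above can be sketched as a small resolution function. This is an illustration under stated assumptions, not the project's client code: `resolve_openai_endpoint` is a hypothetical name, and falling back to the public OpenAI endpoint when `OPENAI_BASE_URL` is unset is assumed rather than confirmed by this README.

```python
from typing import Mapping, Optional, Tuple

# Assumption: the public OpenAI endpoint is the default when no override is set.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_openai_endpoint(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (base_url, api_key) following the rule described above."""
    base_url = env.get("OPENAI_BASE_URL", "").strip()
    if base_url:
        # Custom endpoint: OPENAI_API_KEY is never forwarded here.
        # Only the dedicated OPENAI_BASE_URL_API_KEY is used, if present.
        return base_url, env.get("OPENAI_BASE_URL_API_KEY") or None
    # No override: use the regular key against the default endpoint.
    return DEFAULT_BASE_URL, env.get("OPENAI_API_KEY") or None


if __name__ == "__main__":
    import os

    url, key = resolve_openai_endpoint(os.environ)
    print(url, "(with key)" if key else "(no key)")
```

Keeping the proxy key in a separate variable means a real OpenAI key left in .env cannot leak to a third-party or local endpoint by accident.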
Provider-specific keys in .env:
- OPENAI_API_KEY - OpenAI API
- GOOGLE_API_KEY / GEMINI_API_KEY - Google AI Studio
- ANTHROPIC_API_KEY - Anthropic API
- DEEPSEEK_API_KEY - DeepSeek API
- OPENROUTER_API_KEY - OpenRouter (multi-provider gateway)
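Before starting a long benchmark run it can help to check which of the keys above are actually set. The following is a rough sketch, not part of the `llm-quest` CLI: `parse_dotenv` and `configured_providers` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and treating GOOGLE_API_KEY and GEMINI_API_KEY as interchangeable is an assumption drawn from how the list pairs them.

```python
from pathlib import Path
from typing import Dict, List, Mapping, Tuple

# Key names taken from the list above; either Google key is assumed to suffice.
PROVIDER_KEYS: Dict[str, Tuple[str, ...]] = {
    "OpenAI API": ("OPENAI_API_KEY",),
    "Google AI Studio": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
    "Anthropic API": ("ANTHROPIC_API_KEY",),
    "DeepSeek API": ("DEEPSEEK_API_KEY",),
    "OpenRouter": ("OPENROUTER_API_KEY",),
}


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def configured_providers(env: Mapping[str, str]) -> List[str]:
    """List providers that have at least one non-empty key set."""
    return [
        name
        for name, keys in PROVIDER_KEYS.items()
        if any(env.get(key) for key in keys)
    ]


if __name__ == "__main__":
    env_file = Path(".env")
    env = parse_dotenv(env_file.read_text()) if env_file.exists() else {}
    print("Configured providers:", configured_providers(env) or "none")
```

Empty values such as `ANTHROPIC_API_KEY=` count as unset, which matches the state of a freshly copied template.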
- llm_quest_benchmark/harnesses/ - LLM harness implementations for prompt, memory, tools, and planning experiments
- llm_quest_benchmark/players/ - Non-LLM player primitives (human, random_choice)
- llm_quest_benchmark/prompt_templates/ - Jinja2 prompt templates for the public context-scaffold taxonomy
- llm_quest_benchmark/executors/ - CLI, benchmark orchestration, TS bridge
- configs/benchmarks/ - YAML benchmark configurations
- quests/ - Quest files (downloaded via download_quests.sh)
- space-rangers-quest/ - TypeScript quest engine (submodule)
- docs/ARCHITECTURE.md - Runtime architecture and taxonomy mapping
- docs/DATASHEET.md - Dataset and public leaderboard slice documentation
- research/ - Error analysis, landscape comparison (gitignored)
uv run llm-quest --help
uv run ruff check .
uv run ruff format --check .
uv run pytest
pnpm run build
python3 -m json.tool site/leaderboard.json >/tmp/llm_quest_leaderboard_check.json
# Report-only for now: broad pre-existing typing backlog is not a release gate.
uv run mypy llm_quest_benchmark
License: MIT