GRACE: GRPO-based Reasoning Assistance Calling Efficiently.
LLM Meta-Policy for Overcooked: learning when to call the LLM high-level planner so that overall LLM calls are reduced without losing task performance.
Most LLM-as-planner work calls the LLM at fixed intervals. GRACE learns when to call, training a small meta-policy with GRPO. Result: same task performance with fewer LLM calls (Pareto improvement).
Placeholder — figures/pareto.png will go here once Phase 11 sweeps run.
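The difference between a fixed call schedule and a learned one can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the repository's implementation: the function names, the logistic gate, the feature vector, and the reading of `fixed_k100` as "call every 100 steps". The last function shows the group-relative advantage that GRPO is generally built on (each rollout's return standardized within its group); how GRACE shapes the return itself, for example with a penalty per LLM call, is not specified here.

```python
import math
import random
from statistics import mean, pstdev


def fixed_k_should_call(step: int, k: int = 100) -> bool:
    """Baseline schedule: call the LLM planner every k environment steps."""
    return step % k == 0


def learned_should_call(features, weights, bias, rng: random.Random) -> bool:
    """Hypothetical meta-policy: sample call / no-call from a logistic gate."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        p_call = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p_call = e / (1.0 + e)
    return rng.random() < p_call


def group_relative_advantages(returns, eps: float = 1e-8):
    """GRPO-style advantage: standardize each return within its rollout group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


if __name__ == "__main__":
    # Fixed-K over a 400-step episode calls at steps 0, 100, 200, 300.
    n_calls = sum(fixed_k_should_call(t, k=100) for t in range(400))
    print("fixed-K calls:", n_calls)

    # Toy returns for one group of rollouts (made-up numbers, not results).
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

The fixed schedule spends the same number of calls regardless of state, whereas the gate can in principle skip calls when the features suggest the current plan is still adequate.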
git clone https://github.com/idaun/grace.git
cd grace
uv sync --extra dev --extra overcooked
.venv/bin/pytest -v # all 99+ tests should pass

Optional extras:
- --extra play: pygame for human-play mode (Phase 9)
- --extra unity: mlagents-envs for Unity environments (Phase 6)
docker build -t grace:latest .
docker run --rm grace:latest pytest -v

.venv/bin/python scripts/play_human.py --env dummy --mode coop
# Player 1: WASD + Space (interact) + E (stay)
# Player 2: arrows + RShift (interact) + RCtrl (stay)

For the Unity build, see unity_env/README.md.
# Plain PPO baseline (no LLM)
PYTHONPATH=$(pwd) python scripts/train.py env=cramped_room policy=ppo meta=never seed=0
# LLM-augmented with fixed-K calls
PYTHONPATH=$(pwd) python scripts/train.py \
env=cramped_room policy=llm_augmented meta=fixed_k100 llm=qwen3.6_35b seed=0
# Learned meta-policy (the contribution)
PYTHONPATH=$(pwd) python scripts/train_meta.py \
env=cramped_room policy=llm_augmented meta=learned llm=qwen3.6_35b seed=0

PYTHONPATH=$(pwd) python scripts/eval.py +run_dir=runs/<run_dir>/ +n_episodes=20
PYTHONPATH=$(pwd) python scripts/eval_transfer.py \
+train_run=runs/<run_dir>/ \
'+test_layouts=[asymmetric_advantages]' \
+n_episodes=10

See docs/REPRODUCIBILITY.md for the three hypothesis sweeps (H1, H2, H3) and expected wallclock.
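To compare the three training configurations over several seeds, the command lines can be generated programmatically. The sketch below is only an illustration and is not `scripts/sweep.py`: the helper names and the seed list are hypothetical, and the override values are copied from the training commands above. It prints the commands without running them; the actual sweep definitions are in docs/REPRODUCIBILITY.md.

```python
import shlex


def hydra_command(script: str, overrides: dict) -> list:
    """Build an argv list of the form: python <script> key=value ..."""
    return ["python", script] + [f"{k}={v}" for k, v in overrides.items()]


# The three configurations from the training commands (script, extra overrides).
VARIANTS = {
    "ppo": ("scripts/train.py", {"policy": "ppo", "meta": "never"}),
    "fixed_k100": (
        "scripts/train.py",
        {"policy": "llm_augmented", "meta": "fixed_k100", "llm": "qwen3.6_35b"},
    ),
    "learned": (
        "scripts/train_meta.py",
        {"policy": "llm_augmented", "meta": "learned", "llm": "qwen3.6_35b"},
    ),
}


def sweep_commands(seeds, env: str = "cramped_room") -> list:
    """Return one argv list per (seed, variant) pair."""
    commands = []
    for seed in seeds:
        for script, extra in VARIANTS.values():
            overrides = {"env": env, **extra, "seed": seed}
            commands.append(hydra_command(script, overrides))
    return commands


if __name__ == "__main__":
    # Seeds 0..2 are an arbitrary choice for illustration.
    for argv in sweep_commands(seeds=[0, 1, 2]):
        print("PYTHONPATH=$(pwd) " + shlex.join(argv))
```

Keeping the variants in one table makes it harder for the baselines and the learned meta-policy to drift apart in their shared overrides (environment, LLM, seed).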
| Path | Contents |
|---|---|
| src/ | Reusable library (envs, llm, policies, training, eval) |
| configs/ | Hydra configs for experiments |
| scripts/ | Entry points (train.py, eval.py, sweep.py, plot_results.py, ...) |
| tests/ | Unit + smoke tests |
| unity_env/ | Unity ML-Agents project (C#) |
| docs/ | Versioned prompts and experiment journal |
- Phase 0-7 — Scaffolding (LLM client, env, PPO, GRPO, eval)
- Phase 8 — Real Carroll's overcooked-ai integration + checkpoints
- Phase 9 — Human-play (pygame + Unity) + BC warm-start
- Phase 10 — Prompt v2 + latency diagnostics
- Phase 11 — Sweep harness + statistics
- Phase 12 — Public-readiness polish
- Phase 13 — Full experimental sweep (compute-bound — user runs)
Placeholder — to be filled once the manuscript is on arXiv.
@misc{grace2026,
  title  = {GRACE: GRPO-based Reasoning Assistance Calling Efficiently},
  author = {GRACE Authors},
  year   = {2026},
  note   = {Preprint}
}

MIT. See LICENSE.