# LongMemEval benchmark runner

This runner executes LongMemEval against a live OpenCode instance with the magic-context plugin enabled.

It uses the real OpenCode HTTP API via `@opencode-ai/sdk`, creates real sessions, replays dataset user turns, asks the benchmark question in a new session, and judges the answer with the official LongMemEval prompt logic.

## What it benchmarks

- One OpenCode session per haystack conversation session
- All sessions share the same project/directory so magic-context shares project identity and memory state
- Final benchmark question is asked in a new session to test cross-session memory retrieval
- No dataset assistant responses are injected
- All user messages go through the real OpenCode API

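The replay flow above can be sketched as follows. The `Client` interface and `Haystack` shape here are illustrative stand-ins, not the real `@opencode-ai/sdk` surface or the dataset schema, which differ in detail:

```typescript
// Hypothetical sketch of the replay flow. The real runner uses
// @opencode-ai/sdk and also sends banners/date markers and persists state.
interface Client {
  createSession(directory: string): string;
  sendUserMessage(sessionId: string, text: string): void;
}

interface Haystack {
  sessions: { turns: { role: "user" | "assistant"; content: string }[] }[];
  question: string;
}

function replay(client: Client, haystack: Haystack, directory: string): string {
  // One OpenCode session per haystack session, all in the same directory
  // so magic-context resolves the same project identity.
  for (const session of haystack.sessions) {
    const id = client.createSession(directory);
    for (const turn of session.turns) {
      if (turn.role !== "user") continue; // dataset assistant turns are never injected
      client.sendUserMessage(id, turn.content);
    }
  }
  // The benchmark question goes into a fresh session to force
  // cross-session memory retrieval.
  const finalId = client.createSession(directory);
  client.sendUserMessage(finalId, haystack.question);
  return finalId;
}
```
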
## Files

- `runner.ts` — end-to-end orchestration, replay, resume, logging, summary
- `judge.ts` — official LongMemEval judge prompt templates and GPT-4o evaluation
- `types.ts` — dataset, state, and result types
- `config.ts` — CLI parsing and runtime configuration

## Prerequisites

1. OpenCode running locally with magic-context enabled
2. Dataset JSON downloaded locally
3. Judge API key available in `OPENAI_API_KEY` by default

Default OpenCode URL: `http://127.0.0.1:21354`

## Run

From `packages/plugin/`:

```bash
# Full run
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json

# Subset by index
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --start 0 --end 10

# Specific categories
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --types temporal-reasoning,knowledge-update

# Resume
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --resume

# Fast mode
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --fast
```

## Important flags

```bash
--parallel <n>             # questions in parallel, default 1
--cleanup                  # delete OpenCode sessions after judging
--output-dir <path>        # override run artifact directory
--opencode-url <url>       # override OpenCode base URL
--turn-delay-ms <ms>       # delay between replayed user turns
--session-delay-ms <ms>    # delay between haystack sessions
--final-delay-ms <ms>      # delay before asking final question
--question-ids <a,b>       # explicit question id filter
--max-attempts <n>         # retry attempts for request failures
--retry-base-delay-ms <ms> # base exponential backoff delay
```

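A minimal sketch of the retry behavior the last two flags imply, assuming an exponential backoff that doubles from the base delay (the function name and exact curve are assumptions, not the runner's actual internals):

```typescript
// Retry a request up to maxAttempts times, doubling the delay each attempt.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 1x, 2x, 4x, ... the base delay between attempts.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```
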
## Resumability

The runner is crash-safe at per-question execution granularity and persists state after each meaningful step.

Saved state includes:

- created OpenCode session IDs
- current haystack session index
- current turn index within the active haystack session
- whether per-session banner/date marker was sent
- whether the final question session exists
- whether the final question banner was sent
- whether the final question was asked
- captured hypothesis text
- accumulated OpenCode token usage and costs
- accumulated judge token usage and costs
- last error snapshot

Artifacts written under the run output directory:

- `runner-state.json` — resumable in-progress state
- `results.jsonl` — append-only completed results
- `summary.json` — latest aggregated summary
- `runner.log` — timestamped runner log

On `--resume`, the runner validates that the saved selection matches the current dataset/filter selection.

## Question prompt

The final benchmark question is wrapped as:

```text
Based on our previous conversations, please answer this question. Do not search files or use tools — answer purely from what you remember about our past interactions.

Question: {question}
```

The final question also sends a system instruction reinforcing that it must answer from conversational memory only.

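As a sketch, the wrapper above reduces to a small template function (the name `wrapFinalQuestion` is illustrative, not the runner's actual helper):

```typescript
// Wrap the dataset question in the memory-only instruction shown above.
function wrapFinalQuestion(question: string): string {
  return [
    "Based on our previous conversations, please answer this question. " +
      "Do not search files or use tools — answer purely from what you " +
      "remember about our past interactions.",
    "",
    `Question: ${question}`,
  ].join("\n");
}
```
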
## Judge behavior

`judge.ts` implements the official LongMemEval category-specific prompts for:

- `single-session-user`
- `single-session-assistant`
- `single-session-preference`
- `multi-session`
- `temporal-reasoning`
- `knowledge-update`
- `_abs` abstention questions

Scoring follows the official evaluator behavior: the label is `true` if the judge response contains `yes` case-insensitively.

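That scoring rule is small enough to state directly; a sketch matching the described behavior (function name is illustrative):

```typescript
// A response is labeled correct iff it contains "yes", case-insensitively.
function judgeLabel(response: string): boolean {
  return response.toLowerCase().includes("yes");
}
```
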
## Cost tracking

The runner records:

- actual OpenCode cost reported by assistant messages
- estimated OpenCode cost from optional CLI pricing inputs
- estimated judge cost from configured judge pricing

If OpenCode pricing flags are omitted, OpenCode estimated cost defaults to zero while actual OpenCode cost still uses the API-reported value.

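A sketch of how such an estimate can be computed from token counts and per-million-token rates (the `Pricing` shape and field names are assumptions, not the runner's actual config type):

```typescript
interface Pricing {
  inputPerMTok?: number;  // USD per 1M input tokens
  outputPerMTok?: number; // USD per 1M output tokens
}

// Missing rates default to zero, matching the behavior described above
// when pricing flags are omitted.
function estimateCost(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  const inRate = pricing.inputPerMTok ?? 0;
  const outRate = pricing.outputPerMTok ?? 0;
  return (inputTokens * inRate + outputTokens * outRate) / 1_000_000;
}
```
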
## Notes

- This runner is designed to build the benchmark harness; it does not modify plugin runtime code.
- It does not inject dataset assistant messages.
- It does not bypass the OpenCode API.