Python 3.10+ pipeline:
data/<model>/dialog<NN>_*.jsonl → adapter + DESIGN.md ground truth →
DialogRecord → automatic eval → LLM judge → aggregation →
leaderboard + comparison figures
data/
├── DESIGN.md # 16 dialog scenario specs (ground truth source)
├── DESIGN_annotated.md # optional: richer per-dialog ground truth (merged when present)
├── dialog_specs.json # structured ground truth (regenerated by run-all)
├── model_a/ # baseline / first agent (neutral folder name)
│ └── dialog<N>.jsonl
└── model_b/ # second agent (same naming; use any neutral names you like)
  └── dialog<N>.jsonl
Folder names: use neutral directory names (model_a / model_b, run_1 / run_2, …). The leaderboard Model column uses the subdirectory name — avoid labels like “improved” that can bias informal reading. Under the hood the LLM judge also uses blind scoring by default: the rollout folder name is replaced by a placeholder (anonymous_agent) in the judge prompt so scores are comparable when traces are identical. Use --no-blind-model-name only if you intentionally want the judge to see folder names.
Optional DESIGN_annotated.md: When this file sits next to DESIGN.md, run-all / parse_design_md.py merge its ### **Ground Truth** blocks (characteristic form, required tool sequence, success-criteria bullets) into dialog_specs.json for the LLM judge. Use run-all --no-annotated-overlay to skip, or --annotated-design PATH to point at a specific file.
Filename convention: files should match dialog<N>...jsonl where <N> aligns with dialogs in DESIGN.md (e.g. dialog1.jsonl, dialog01_case_xyz.jsonl).
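The filename convention above can be checked with a small regular expression. This is a minimal sketch, not the pipeline's own matcher: the pattern and the helper name `dialog_number` are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed pattern: "dialog", one or more digits, any suffix, then ".jsonl".
DIALOG_FILE_RE = re.compile(r"^dialog(\d+).*\.jsonl$")


def dialog_number(filename: str) -> Optional[int]:
    """Return the dialog number encoded in a rollout filename, or None if it does not match."""
    match = DIALOG_FILE_RE.match(filename)
    return int(match.group(1)) if match else None


for name in ("dialog1.jsonl", "dialog01_case_xyz.jsonl", "notes.jsonl"):
    print(name, dialog_number(name))
```

Leading zeros are dropped by `int(...)`, so both example names from the convention map to dialog 1 in this sketch.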
JSONL rollout format: all model rollouts use the AssetOps event-stream schema:
- one timestamped agent event per line;
- turn_id groups events into dialog turns;
- tool_calls[] records tool invocations, including optional server plus tool_name;
- planner/executor logs may include plan.steps and plan_kind=replan;
- final text is carried by an event with task_type=final_response.
Force with --adapter assetops_event_stream_v1 if needed. The adapter attaches DESIGN ground truth via dialog_specs.json.
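To make the event-stream fields concrete, here is a rough sketch of reading such a rollout. It is not the bundled adapter. The `text` key and the tool and server names in the sample are hypothetical, and the sketch always prefixes the server when one is given rather than resolving unique bare names.

```python
import json
from collections import defaultdict


def summarize_rollout(lines):
    """Group events by turn_id, list tool names, and find the final_response event."""
    turns = defaultdict(list)
    tool_names = []
    final_event = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        event = json.loads(line)
        turns[event.get("turn_id")].append(event)
        for call in event.get("tool_calls") or []:
            server = call.get("server")
            name = call.get("tool_name")
            # Simplification: prefix the server whenever it is present.
            tool_names.append(f"{server}.{name}" if server else name)
        if event.get("task_type") == "final_response":
            final_event = event
    return dict(turns), tool_names, final_event


# Hypothetical sample events; real rollouts carry more fields (e.g. timestamps).
sample = [
    '{"turn_id": 1, "tool_calls": [{"server": "iot", "tool_name": "list_assets"}]}',
    '{"turn_id": 1, "tool_calls": [{"tool_name": "get_history"}]}',
    '{"turn_id": 2, "task_type": "final_response", "text": "done"}',
]
turns, tool_names, final_event = summarize_rollout(sample)
print(sorted(turns), tool_names, final_event["task_type"])
```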
The adapter builds an architecture-neutral intermediate_plan for fair comparison across plan-execute and supervisor-specialist traces. It summarizes workflows, planned tool sequences, executed tool sequences, and recovery markers such as failed tool calls followed by replanning/re-execution.
run-all (see §2) regenerates dialog_specs.json automatically every run; you can also do it manually:
python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json
The leaderboard always emits a Model column followed by 7 metric columns. Subjective scores are in [0,1]; automatic metrics are rates in [0,1] or n/a.
For full formulas, per-dialog definitions, aggregation rules, and recovery/schema edge cases, see METRICS.md.
| Metric | Meaning | Per-dialog | Corpus aggregation |
|---|---|---|---|
| Planning Effectiveness | Subjective planning | from llm_evaluation / human_evaluation | macro avg $\tfrac{1}{N}\sum_i s_i$ |
| Tool Usage Quality | Subjective tool use | same | same |
| Task Completion | Subjective completion | same | same |
| Tool Name Validity | Legal tool-name rate | legal / total | micro avg over all calls |
| Schema Compliance | Argument structure valid among legal names | schema_ok / legal | micro avg |
| Execution Success Rate | Tool exec success | exec_ok / total | micro avg |
| Recovery Success Rate | Recovery → final success | dialogs with recovery_triggered: last turn succeeded | dialog-level rate; n/a when no recovery was triggered |
Rows are sorted alphabetically by model name. Subjective source: leaderboard --metric-source llm|human|automatic selects llm_evaluation, human_evaluation, or leaves the first three columns blank (automatic-only view).
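The two corpus aggregation styles in the table differ in when the division happens. The sketch below illustrates that difference under simple assumptions (per-dialog scores as floats, per-dialog counts as `(numerator, denominator)` pairs). The function names are illustrative only, and METRICS.md remains the reference for the exact rules.

```python
def macro_average(scores):
    """Mean of per-dialog scores, (1/N) * sum(s_i); None stands in for n/a."""
    return sum(scores) / len(scores) if scores else None


def micro_average(counts):
    """Pool (numerator, denominator) pairs over all dialogs, then divide once."""
    numerator = sum(n for n, _ in counts)
    denominator = sum(d for _, d in counts)
    return numerator / denominator if denominator else None


# Two dialogs: 3 of 4 tool calls succeeded in the first, 9 of 9 in the second.
exec_counts = [(3, 4), (9, 9)]
per_dialog_rates = [n / d for n, d in exec_counts]

print(round(micro_average(exec_counts), 6))     # pooled over all 13 calls
print(round(macro_average(per_dialog_rates), 6))  # average of the two per-dialog rates
```

Micro averaging weights every tool call equally, so a dialog with many calls dominates. Macro averaging weights every dialog equally.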
Schema Compliance details: tool existence and argument schemas are validated against schema_compliance_schema.yaml. Bare tool names are resolved when unique; otherwise the adapter combines server + tool_name as server.tool. Compliance checks are structural: arguments must be an object, required arguments must be present, unknown arguments are rejected, and JSON value types must match the canonical input schema.
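The four structural checks listed above can be sketched as follows. The schema layout used here (a `required` list plus a `properties` map from argument name to JSON type name) is an assumption made for illustration. The actual layout of schema_compliance_schema.yaml may differ, and the tool arguments shown are hypothetical.

```python
# Map JSON type names to Python types (assumed subset, no "null").
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def is_schema_compliant(arguments, schema):
    """Apply the structural checks: object, required present, no unknowns, matching types."""
    if not isinstance(arguments, dict):
        return False  # arguments must be an object
    properties = schema.get("properties", {})
    if any(name not in arguments for name in schema.get("required", [])):
        return False  # a required argument is missing
    for name, value in arguments.items():
        if name not in properties:
            return False  # unknown arguments are rejected
        type_name = properties[name]
        # bool is a subclass of int in Python, so exclude it from numeric types.
        if isinstance(value, bool) and type_name != "boolean":
            return False
        if not isinstance(value, JSON_TYPES[type_name]):
            return False
    return True


# Hypothetical tool schema and calls.
schema = {"required": ["asset_id"], "properties": {"asset_id": "string", "limit": "integer"}}
print(is_schema_compliant({"asset_id": "pump-7", "limit": 5}, schema))
print(is_schema_compliant({"asset_id": "pump-7", "extra": 1}, schema))
```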
Recovery details: plan-execute traces trigger recovery only on failed tool calls, not on skipped/unexecuted plan artifacts. Supervisor-specialist traces scope recovery to the failing specialist role and require later successful work from that same role.
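At the corpus level, Recovery Success Rate is a dialog-level rate over the dialogs where recovery was triggered. A minimal sketch of that aggregation, assuming each dialog has already been reduced to two booleans (the key `last_turn_succeeded` is a hypothetical name, and None stands in for n/a):

```python
def recovery_success_rate(dialogs):
    """Share of recovery-triggering dialogs whose last turn succeeded; None when none triggered."""
    triggered = [d for d in dialogs if d["recovery_triggered"]]
    if not triggered:
        return None  # reported as n/a on the leaderboard
    recovered = sum(1 for d in triggered if d["last_turn_succeeded"])
    return recovered / len(triggered)


dialogs = [
    {"recovery_triggered": True, "last_turn_succeeded": True},
    {"recovery_triggered": True, "last_turn_succeeded": False},
    {"recovery_triggered": False, "last_turn_succeeded": True},  # excluded from the rate
]
print(recovery_success_rate(dialogs))
```

The per-architecture trigger rules described above (failed tool calls for plan-execute, same-role success for supervisor-specialist) would decide how `recovery_triggered` is set. They are not modeled here.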
pip install -r requirements.txt
# API key: either export once per shell, OR create ./.env (see .env.example) — main.py loads it automatically.
export OPENAI_API_KEY=sk-...
python main.py run-all
# Optional tweaks
python main.py run-all --model gpt-4.1 --runs 3
# Let judge see real folder names in the prompt (not recommended for blind A/B):
# python main.py run-all --no-blind-model-name
# Offline / CI without OpenAI credentials:
python main.py run-all --judge mock
What run-all does, in order:
- parses data/DESIGN.md → data/dialog_specs.json (re-runs every time so spec edits are picked up automatically);
- discovers per-model rollouts under data/<model>/, groups same-stem dialog files across model folders, runs the paired LLM judge by default, and writes canonical DialogRecord JSON to data/_judged/<model>/<dialog_id>.json;
- aggregates the corpus and writes data/_leaderboard/leaderboard_metrics.{txt,csv,json} plus the table to stdout;
- writes metrics_radar.png (radar overlay of the seven [0,1] metrics under automatic_evaluation) into data/_leaderboard/figures/. Skip this step with --no-figures when matplotlib is unavailable.
Drop a second model's rollouts under data/model_b/ (next to data/model_a/) to compare in one leaderboard/radar chart.
If you want to step through the pipeline manually:
python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json
python main.py auto-evaluate --input ./data
python main.py llm-judge --input ./data/_evaluated --out ./data/_judged
python main.py leaderboard --input ./data/_judged --metric-source llm
python main.py visualize --input ./data/_judged --out ./data/_leaderboard/figures
Sample leaderboard:
Model | Planning Effectiveness | Tool Usage Quality | Task Completion | Tool Name Validity | Schema Compliance | Execution Success Rate | Recovery Success Rate
model_a | 0.95 | 0.85 | 0.95 | 1 | 0.923077 | 0.923077 | 1
Sample figure: metrics_radar.png overlays one polygon per model on the seven unit-bounded automatic metrics (closer to the outer ring is better).
The judge consumes a DialogRecord (already enriched with ground_truth.expected_tools / expected_plan / expected_final_answer / task_success_criteria), prompts an LLM in JSON mode, and writes the three subjective scores back into llm_evaluation.
Directory inputs use paired mode by default. Same-stem files from different model folders (for example model_a/dialog1.jsonl and model_b/dialog1.jsonl) are scored in one shared prompt as candidate_a, candidate_b, … so the subjective scores are calibrated against the same ground truth and rubric context. Single-candidate groups automatically fall back to single scoring.
Use --judge-mode single when you intentionally want to score each file independently.
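The pairing rule can be pictured as grouping rollout paths by file stem and then labelling each group's members. The sketch below is an approximation under stated assumptions: the helper names are hypothetical, and assigning labels in sorted folder order is a guess rather than documented behaviour.

```python
import string
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_stem(paths):
    """Map each file stem (e.g. 'dialog1') to the model folders that contain it."""
    groups = defaultdict(list)
    for path in sorted(paths):
        p = PurePosixPath(path)
        groups[p.stem].append(p.parent.name)
    return dict(groups)


def label_candidates(models):
    """Assign candidate_a, candidate_b, ... in list order (assumed ordering)."""
    return {f"candidate_{string.ascii_lowercase[i]}": m for i, m in enumerate(models)}


paths = [
    "data/model_b/dialog1.jsonl",
    "data/model_a/dialog1.jsonl",
    "data/model_a/dialog2.jsonl",
]
for stem, models in group_by_stem(paths).items():
    # Groups with a single candidate fall back to single scoring.
    mode = "paired" if len(models) > 1 else "single"
    print(stem, mode, label_candidates(models))
```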
| --judge | Module | Network | Use case |
|---|---|---|---|
| openai (default) | OpenAIJudge | OpenAI Chat Completions or compatible | production scoring (--model defaults to gpt-4o-mini) |
| mock | MockJudge | no | offline pipeline / CI (--model ignored) |
--base-url lets you point the OpenAI client at vLLM, LiteLLM, Azure proxies, or any OpenAI-compatible gateway.
| Variable | Required | Notes |
|---|---|---|
| OPENAI_API_KEY | unless --judge mock | OpenAI / compatible service API key |
| OPENAI_BASE_URL | optional | custom base URL |
| OPENAI_ORG / OPENAI_ORGANIZATION | optional | OpenAI organization id |
# Default judge is OpenAI; --model defaults to gpt-4o-mini
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
--input ./data/_evaluated --out ./data/_judged \
--runs 3 --seed 1337
# Explicit backend / model overrides
python main.py llm-judge \
--input ./data/_evaluated --out ./data/_judged \
--judge openai --model gpt-4.1 \
--runs 3 --seed 1337
# Independent per-file scoring instead of paired same-dialog scoring
python main.py llm-judge \
--input ./data/_evaluated --out ./data/_judged \
--judge-mode single
# Reasoning-class models (o1/o3/o4 ignore --temperature automatically)
python main.py llm-judge --input ./data/_evaluated --model o4-mini
# Self-hosted endpoint
python main.py llm-judge \
--input ./data/_evaluated --judge openai \
--model my-judge-finetune --base-url http://localhost:8000/v1
# No API credentials
python main.py llm-judge --input ./data/_evaluated --out ./data/_judged --judge mock
Use --continue-on-error to keep going on per-file failures; the run exits with code 1 if any failed.
Request: one system message ("return only JSON") + one user message produced by evaluators/llm_judge/prompt.py, response_format={"type": "json_object"}.
Single-mode response:
{
"planning_effectiveness": 0.0,
"tool_usage_quality": 0.0,
"task_completion": 0.0,
"reasoning": {
"planning_comment": "",
"tool_comment": "",
"task_comment": ""
}
}
Paired-mode response:
{
"candidate_a": {
"planning_effectiveness": 0.0,
"tool_usage_quality": 0.0,
"task_completion": 0.0,
"reasoning": {
"planning_comment": "",
"tool_comment": "",
"task_comment": ""
}
},
"candidate_b": {
"planning_effectiveness": 0.0,
"tool_usage_quality": 0.0,
"task_completion": 0.0,
"reasoning": {
"planning_comment": "",
"tool_comment": "",
"task_comment": ""
}
}
}
Validation lives in evaluators/llm_judge/parser.py.
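As a rough idea of what validating a single-mode response involves, the sketch below checks that the three scores are numbers in [0, 1]. It is not the code in parser.py. The function name and the error messages are assumptions, and the reasoning block is left unchecked.

```python
import json

SCORE_KEYS = ("planning_effectiveness", "tool_usage_quality", "task_completion")


def parse_single_response(raw):
    """Parse a single-mode judge reply and return the three scores as floats."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge response must be a JSON object")
    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        # Reject missing values, strings, and booleans (bool is an int subclass).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must lie in [0, 1]")
        scores[key] = float(value)
    return scores


reply = '{"planning_effectiveness": 0.5, "tool_usage_quality": 1, "task_completion": 0.0, "reasoning": {}}'
print(parse_single_response(reply))
```

A paired-mode reply could be handled by applying the same check to each candidate_* object in turn.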
Two supervision sources:
- Teacher distillation (--source llm, recommended): run llm-judge with a strong model (e.g. gpt-4o, --runs 5), then fine-tune a smaller student to imitate it.
- Human gold (--source human): export-annotation → label → import-annotation → human_evaluation → fine-tune.
python main.py auto-evaluate --input ./data
# Option A — teacher distillation
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
--input ./data/_evaluated --out ./data/_judged_teacher \
--judge openai --model gpt-4o --runs 5
python main.py prepare-judge-training \
--input ./data/_judged_teacher \
--out ./judge_train.jsonl \
--source llm \
--val-fraction 0.1
# Option B — human gold
python main.py export-annotation --input ./data/_evaluated
# … fill annotations_template.csv …
python main.py import-annotation \
--input ./data/_evaluated/_annotations/annotations_template.csv \
--results ./data/_evaluated
python main.py prepare-judge-training \
--input ./data/_evaluated \
--out ./judge_train.jsonl \
--source human \
--val-fraction 0.1
The bundled helper takes care of file upload + job submission + status polling:
python scripts/finetune_judge.py \
--train ./judge_train.jsonl --val ./judge_train.val.jsonl \
--base-model gpt-4o-mini-2024-07-18 \
--suffix dialog-judge-v1
You will eventually receive a ft:gpt-4o-mini-2024-07-18:my-org:dialog-judge-v1:abc123 model id.
python main.py llm-judge \
--input ./data/_evaluated --out ./data/_judged_student \
--judge openai --model ft:...id
python main.py leaderboard --input ./data/_judged_student --metric-source llm
- Aim for ≥ 50 high-quality examples to start; 200–1000 for a focused judge.
- Keep prompt.py consistent between training-data construction and inference, since fine-tuning bakes the prompt shape into the model.
python main.py export-annotation --input ./data/_evaluated
# Edit ./data/_evaluated/_annotations/annotations_template.csv
# Columns: annotator_id, planning_effectiveness, tool_usage_quality, task_completion (0..1), comments
# Do NOT touch dialog_id / model_name
python main.py import-annotation \
--input ./data/_evaluated/_annotations/annotations_template.csv \
--results ./data/_evaluated
python main.py leaderboard --input ./data/_evaluated --metric-source human
- New dialog scenario: add a section to data/DESIGN.md; run-all re-parses it on every invocation.
- New trace format: subclass BaseDialogAdapter and register it in adapters/registry.py:default_registry.
- Dependencies: Python ≥ 3.10, pydantic>=2.5, openai>=1.0, and matplotlib>=3.7 (default pipeline uses the OpenAI SDK; use --judge mock --no-figures to trim both).