AssetOps multi-turn dialog evaluation system

Python 3.10+ pipeline:

data/<model>/dialog<NN>_*.jsonl  →  adapter + DESIGN.md ground truth  →
DialogRecord  →  automatic eval  →  LLM judge  →  aggregation  →
leaderboard + comparison figures

0. Data layout

data/
├── DESIGN.md                       # 16 dialog scenario specs (ground truth source)
├── DESIGN_annotated.md             # optional: richer per-dialog ground truth (merged when present)
├── dialog_specs.json               # structured ground truth (regenerated by run-all)
├── model_a/                        # baseline / first agent (neutral folder name)
│   └── dialog<N>.jsonl
└── model_b/                       # second agent (same naming; use any neutral names you like)
    └── dialog<N>.jsonl

Folder names: use neutral directory names (model_a / model_b, run_1 / run_2, …). The leaderboard Model column uses the subdirectory name — avoid labels like “improved” that can bias informal reading. Under the hood the LLM judge also uses blind scoring by default: the rollout folder name is replaced by a placeholder (anonymous_agent) in the judge prompt so scores are comparable when traces are identical. Use --no-blind-model-name only if you intentionally want the judge to see folder names.

Optional DESIGN_annotated.md: When this file sits next to DESIGN.md, run-all / parse_design_md.py merge its ### **Ground Truth** blocks (characteristic form, required tool sequence, success-criteria bullets) into dialog_specs.json for the LLM judge. Use run-all --no-annotated-overlay to skip, or --annotated-design PATH to point at a specific file.

Filename convention: files should match dialog<N>...jsonl where <N> aligns with dialogs in DESIGN.md (e.g. dialog1.jsonl, dialog01_case_xyz.jsonl).

JSONL rollout format: all model rollouts use the AssetOps event-stream schema:

one timestamped agent event per line;
turn_id groups events into dialog turns;
tool_calls[] records tool invocations, including optional server plus tool_name;
planner/executor logs may include plan.steps and plan_kind=replan;
final text is carried by an event with task_type=final_response.

Force with --adapter assetops_event_stream_v1 if needed. The adapter attaches DESIGN ground truth via dialog_specs.json.

The adapter builds an architecture-neutral intermediate_plan for fair comparison across plan-execute and supervisor-specialist traces. It summarizes workflows, planned tool sequences, executed tool sequences, and recovery markers such as failed tool calls followed by replanning/re-execution.

run-all (see §2) regenerates dialog_specs.json automatically every run; you can also do it manually:

python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json

1. Leaderboard metrics

The leaderboard always emits Model | 7 metric columns. Subjective scores are in ([0,1]); automatic metrics are rates in ([0,1]) or n/a.

For full formulas, per-dialog definitions, aggregation rules, and recovery/schema edge cases, see METRICS.md.

Metric	Meaning	Per-dialog	Corpus aggregation
Planning Effectiveness	Subjective planning	from `llm_evaluation` / `human_evaluation`	macro avg (\tfrac{1}{N}\sum_i s_i)
Tool Usage Quality	Subjective tool use	same	same
Task Completion	Subjective completion	same	same
Tool Name Validity	Legal tool-name rate	legal / total	micro avg over all calls
Schema Compliance	Argument structure valid among legal names	schema_ok / legal	micro avg
Execution Success Rate	Tool exec success	exec_ok / total	micro avg
Recovery Success Rate	Recovery → final success	dialogs with recovery_triggered: last turn succeeded	dialog-level rate; `n/a` when no recovery was triggered

Rows are sorted alphabetically by model name. Subjective source: leaderboard --metric-source llm|human|automatic selects llm_evaluation, human_evaluation, or leaves the first three columns blank (automatic-only view).

Schema Compliance details: tool existence and argument schemas are validated against schema_compliance_schema.yaml. Bare tool names are resolved when unique; otherwise the adapter combines server + tool_name as server.tool. Compliance checks are structural: arguments must be an object, required arguments must be present, unknown arguments are rejected, and JSON value types must match the canonical input schema.

Recovery details: plan-execute traces trigger recovery only on failed tool calls, not on skipped/unexecuted plan artifacts. Supervisor-specialist traces scope recovery to the failing specialist role and require later successful work from that same role.

2. End-to-end pipeline — one command

pip install -r requirements.txt

# API key: either export once per shell, OR create ./.env (see .env.example) — main.py loads it automatically.
export OPENAI_API_KEY=sk-...
python main.py run-all

# Optional tweaks
python main.py run-all --model gpt-4.1 --runs 3

# Let judge see real folder names in the prompt (not recommended for blind A/B):
# python main.py run-all --no-blind-model-name

# Offline / CI without OpenAI credentials:
python main.py run-all --judge mock

What run-all does, in order:

parses data/DESIGN.md → data/dialog_specs.json (re-runs every time so spec edits are picked up automatically);
discovers per-model rollouts under data/<model>/, groups same-stem dialog files across model folders, runs the paired LLM judge by default, and writes canonical DialogRecord JSON to data/_judged/<model>/<dialog_id>.json;
aggregates the corpus and writes data/_leaderboard/leaderboard_metrics.{txt,csv,json} plus the table to stdout;
writes metrics_radar.png (radar overlay of the seven [0,1] metrics under automatic_evaluation) into data/_leaderboard/figures/. Skip this step with --no-figures when matplotlib is unavailable.

Drop a second model's rollouts under data/model_b/ (next to data/model_a/) to compare in one leaderboard/radar chart.

If you want to step through the pipeline manually:

python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json
python main.py auto-evaluate --input ./data
python main.py llm-judge      --input ./data/_evaluated --out ./data/_judged
python main.py leaderboard    --input ./data/_judged    --metric-source llm
python main.py visualize      --input ./data/_judged    --out ./data/_leaderboard/figures

Sample leaderboard:

Model   | Planning Effectiveness | Tool Usage Quality | Task Completion | Tool Name Validity | Schema Compliance | Execution Success Rate | Recovery Success Rate
model_a | 0.95                   | 0.85               | 0.95            | 1                  | 0.923077          | 0.923077               | 1

Sample figure: metrics_radar.png overlays one polygon per model on the seven unit-bounded automatic metrics (closer to the outer ring is better).

3. LLM-as-Judge

The judge consumes a DialogRecord (already enriched with ground_truth.expected_tools / expected_plan / expected_final_answer / task_success_criteria), prompts an LLM in JSON mode, and writes the three subjective scores back into llm_evaluation.

Directory inputs use paired mode by default. Same-stem files from different model folders (for example model_a/dialog1.jsonl and model_b/dialog1.jsonl) are scored in one shared prompt as candidate_a, candidate_b, … so the subjective scores are calibrated against the same ground truth and rubric context. Single-candidate groups automatically fall back to single scoring.

Use --judge-mode single when you intentionally want to score each file independently.

3.1 Backends

`--judge`	Module	Network	Use case
`openai` (default)	`OpenAIJudge`	OpenAI Chat Completions or compatible	production scoring (`--model` defaults to `gpt-4o-mini`)
`mock`	`MockJudge`	no	offline pipeline / CI (`--model` ignored)

--base-url lets you point the OpenAI client at vLLM, LiteLLM, Azure proxies, or any OpenAI-compatible gateway.

3.2 Environment variables

Variable	Required	Notes
`OPENAI_API_KEY`	unless `--judge mock`	OpenAI / compatible service API key
`OPENAI_BASE_URL`	optional	custom base URL
`OPENAI_ORG` / `OPENAI_ORGANIZATION`	optional	OpenAI organization id

3.3 CLI

# Default judge is OpenAI; --model defaults to gpt-4o-mini
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --runs 3 --seed 1337

# Explicit backend / model overrides
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --judge openai --model gpt-4.1 \
  --runs 3 --seed 1337

# Independent per-file scoring instead of paired same-dialog scoring
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --judge-mode single

# Reasoning-class models (o1/o3/o4 ignore --temperature automatically)
python main.py llm-judge --input ./data/_evaluated --model o4-mini

# Self-hosted endpoint
python main.py llm-judge \
  --input ./data/_evaluated --judge openai \
  --model my-judge-finetune --base-url http://localhost:8000/v1

# No API credentials
python main.py llm-judge --input ./data/_evaluated --out ./data/_judged --judge mock

Use --continue-on-error to keep going on per-file failures; the run exits with code 1 if any failed.

3.4 Request / response contract

Request: one system message ("return only JSON") + one user message produced by evaluators/llm_judge/prompt.py, response_format={"type": "json_object"}.

Single-mode response:

{
  "planning_effectiveness": 0.0,
  "tool_usage_quality": 0.0,
  "task_completion": 0.0,
  "reasoning": {
    "planning_comment": "",
    "tool_comment": "",
    "task_comment": ""
  }
}

Paired-mode response:

{
  "candidate_a": {
    "planning_effectiveness": 0.0,
    "tool_usage_quality": 0.0,
    "task_completion": 0.0,
    "reasoning": {
      "planning_comment": "",
      "tool_comment": "",
      "task_comment": ""
    }
  },
  "candidate_b": {
    "planning_effectiveness": 0.0,
    "tool_usage_quality": 0.0,
    "task_completion": 0.0,
    "reasoning": {
      "planning_comment": "",
      "tool_comment": "",
      "task_comment": ""
    }
  }
}

Validation lives in evaluators/llm_judge/parser.py.

4. Train your own LLM judge (OpenAI fine-tuning)

Two supervision sources:

Teacher distillation (--source llm, recommended): run llm-judge with a strong model (e.g. gpt-4o, --runs 5), then fine-tune a smaller student to imitate it.
Human gold (--source human): export-annotation → label → import-annotation → human_evaluation → fine-tune.

4.1 Workflow

python main.py auto-evaluate --input ./data

# Option A — teacher distillation
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged_teacher \
  --judge openai --model gpt-4o --runs 5

python main.py prepare-judge-training \
  --input ./data/_judged_teacher \
  --out ./judge_train.jsonl \
  --source llm \
  --val-fraction 0.1

# Option B — human gold
python main.py export-annotation --input ./data/_evaluated
# … fill annotations_template.csv …
python main.py import-annotation \
  --input ./data/_evaluated/_annotations/annotations_template.csv \
  --results ./data/_evaluated

python main.py prepare-judge-training \
  --input ./data/_evaluated \
  --out ./judge_train.jsonl \
  --source human \
  --val-fraction 0.1

4.2 Submit the OpenAI fine-tuning job

The bundled helper takes care of file upload + job submission + status polling:

python scripts/finetune_judge.py \
  --train ./judge_train.jsonl --val ./judge_train.val.jsonl \
  --base-model gpt-4o-mini-2024-07-18 \
  --suffix dialog-judge-v1

You will eventually receive a ft:gpt-4o-mini-2024-07-18:my-org:dialog-judge-v1:abc123 model id.

4.3 Evaluate the trained judge

python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged_student \
  --judge openai --model ft:...id

python main.py leaderboard --input ./data/_judged_student --metric-source llm

4.4 Rules of thumb

Aim for ≥ 50 high-quality examples to start; 200–1000 for a focused judge.
Keep prompt.py consistent between training-data construction and inference — fine-tuning bakes the prompt shape into the model.

5. Human annotation (optional)

python main.py export-annotation --input ./data/_evaluated
# Edit ./data/_evaluated/_annotations/annotations_template.csv
# Columns: annotator_id, planning_effectiveness, tool_usage_quality, task_completion (0..1), comments
# Do NOT touch dialog_id / model_name

python main.py import-annotation \
  --input ./data/_evaluated/_annotations/annotations_template.csv \
  --results ./data/_evaluated

python main.py leaderboard --input ./data/_evaluated --metric-source human

6. Extension points

New dialog scenario: add a section to data/DESIGN.md; run-all re-parses it on every invocation.
New trace format: subclass BaseDialogAdapter and register it in adapters/registry.py:default_registry.
Dependencies: Python ≥ 3.10, pydantic>=2.5, openai>=1.0 + matplotlib>=3.7 (default pipeline uses the OpenAI SDK; use --judge mock --no-figures to trim both).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AssetOps multi-turn dialog evaluation system

0. Data layout

1. Leaderboard metrics

2. End-to-end pipeline — one command

3. LLM-as-Judge

3.1 Backends

3.2 Environment variables

3.3 CLI

3.4 Request / response contract

4. Train your own LLM judge (OpenAI fine-tuning)

4.1 Workflow

4.2 Submit the OpenAI fine-tuning job

4.3 Evaluate the trained judge

4.4 Rules of thumb

5. Human annotation (optional)

6. Extension points

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
evaluation_system		evaluation_system
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
METRICS.md		METRICS.md
METRICS.zh.md		METRICS.zh.md
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
schema_compliance_schema.yaml		schema_compliance_schema.yaml

Folders and files

Latest commit

History

Repository files navigation

AssetOps multi-turn dialog evaluation system

0. Data layout

1. Leaderboard metrics

2. End-to-end pipeline — one command

3. LLM-as-Judge

3.1 Backends

3.2 Environment variables

3.3 CLI

3.4 Request / response contract

4. Train your own LLM judge (OpenAI fine-tuning)

4.1 Workflow

4.2 Submit the OpenAI fine-tuning job

4.3 Evaluate the trained judge

4.4 Rules of thumb

5. Human annotation (optional)

6. Extension points

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages