Skip to content

Coderlicr/Multi-Turn-AssetOps-Evaluation

 
 

Repository files navigation

AssetOps multi-turn dialog evaluation system

Python 3.10+ pipeline:

data/<model>/dialog<NN>_*.jsonl  →  adapter + DESIGN.md ground truth  →
DialogRecord  →  automatic eval  →  LLM judge  →  aggregation  →
leaderboard + comparison figures

0. Data layout

data/
├── DESIGN.md                       # 16 dialog scenario specs (ground truth source)
├── DESIGN_annotated.md             # optional: richer per-dialog ground truth (merged when present)
├── dialog_specs.json               # structured ground truth (regenerated by run-all)
├── model_a/                        # baseline / first agent (neutral folder name)
│   └── dialog<N>.jsonl
└── model_b/                       # second agent (same naming; use any neutral names you like)
    └── dialog<N>.jsonl

Folder names: use neutral directory names (model_a / model_b, run_1 / run_2, …). The leaderboard Model column uses the subdirectory name — avoid labels like “improved” that can bias informal reading. Under the hood the LLM judge also uses blind scoring by default: the rollout folder name is replaced by a placeholder (anonymous_agent) in the judge prompt so scores are comparable when traces are identical. Use --no-blind-model-name only if you intentionally want the judge to see folder names.

Optional DESIGN_annotated.md: When this file sits next to DESIGN.md, run-all / parse_design_md.py merge its ### **Ground Truth** blocks (characteristic form, required tool sequence, success-criteria bullets) into dialog_specs.json for the LLM judge. Use run-all --no-annotated-overlay to skip, or --annotated-design PATH to point at a specific file.

Filename convention: files should match dialog<N>...jsonl where <N> aligns with dialogs in DESIGN.md (e.g. dialog1.jsonl, dialog01_case_xyz.jsonl).

JSONL rollout format: all model rollouts use the AssetOps event-stream schema:

  • one timestamped agent event per line;
  • turn_id groups events into dialog turns;
  • tool_calls[] records tool invocations, including optional server plus tool_name;
  • planner/executor logs may include plan.steps and plan_kind=replan;
  • final text is carried by an event with task_type=final_response.

Force with --adapter assetops_event_stream_v1 if needed. The adapter attaches DESIGN ground truth via dialog_specs.json.

The adapter builds an architecture-neutral intermediate_plan for fair comparison across plan-execute and supervisor-specialist traces. It summarizes workflows, planned tool sequences, executed tool sequences, and recovery markers such as failed tool calls followed by replanning/re-execution.

run-all (see §2) regenerates dialog_specs.json automatically every run; you can also do it manually:

python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json

1. Leaderboard metrics

The leaderboard always emits Model | 7 metric columns. Subjective scores are in ([0,1]); automatic metrics are rates in ([0,1]) or n/a.

For full formulas, per-dialog definitions, aggregation rules, and recovery/schema edge cases, see METRICS.md.

Metric Meaning Per-dialog Corpus aggregation
Planning Effectiveness Subjective planning from llm_evaluation / human_evaluation macro avg (\tfrac{1}{N}\sum_i s_i)
Tool Usage Quality Subjective tool use same same
Task Completion Subjective completion same same
Tool Name Validity Legal tool-name rate legal / total micro avg over all calls
Schema Compliance Argument structure valid among legal names schema_ok / legal micro avg
Execution Success Rate Tool exec success exec_ok / total micro avg
Recovery Success Rate Recovery → final success dialogs with recovery_triggered: last turn succeeded dialog-level rate; n/a when no recovery was triggered

Rows are sorted alphabetically by model name. Subjective source: leaderboard --metric-source llm|human|automatic selects llm_evaluation, human_evaluation, or leaves the first three columns blank (automatic-only view).

Schema Compliance details: tool existence and argument schemas are validated against schema_compliance_schema.yaml. Bare tool names are resolved when unique; otherwise the adapter combines server + tool_name as server.tool. Compliance checks are structural: arguments must be an object, required arguments must be present, unknown arguments are rejected, and JSON value types must match the canonical input schema.

Recovery details: plan-execute traces trigger recovery only on failed tool calls, not on skipped/unexecuted plan artifacts. Supervisor-specialist traces scope recovery to the failing specialist role and require later successful work from that same role.


2. End-to-end pipeline — one command

pip install -r requirements.txt

# API key: either export once per shell, OR create ./.env (see .env.example) — main.py loads it automatically.
export OPENAI_API_KEY=sk-...
python main.py run-all

# Optional tweaks
python main.py run-all --model gpt-4.1 --runs 3

# Let judge see real folder names in the prompt (not recommended for blind A/B):
# python main.py run-all --no-blind-model-name

# Offline / CI without OpenAI credentials:
python main.py run-all --judge mock

What run-all does, in order:

  1. parses data/DESIGN.mddata/dialog_specs.json (re-runs every time so spec edits are picked up automatically);
  2. discovers per-model rollouts under data/<model>/, groups same-stem dialog files across model folders, runs the paired LLM judge by default, and writes canonical DialogRecord JSON to data/_judged/<model>/<dialog_id>.json;
  3. aggregates the corpus and writes data/_leaderboard/leaderboard_metrics.{txt,csv,json} plus the table to stdout;
  4. writes metrics_radar.png (radar overlay of the seven [0,1] metrics under automatic_evaluation) into data/_leaderboard/figures/. Skip this step with --no-figures when matplotlib is unavailable.

Drop a second model's rollouts under data/model_b/ (next to data/model_a/) to compare in one leaderboard/radar chart.

If you want to step through the pipeline manually:

python scripts/parse_design_md.py --design ./data/DESIGN.md --out ./data/dialog_specs.json
python main.py auto-evaluate --input ./data
python main.py llm-judge      --input ./data/_evaluated --out ./data/_judged
python main.py leaderboard    --input ./data/_judged    --metric-source llm
python main.py visualize      --input ./data/_judged    --out ./data/_leaderboard/figures

Sample leaderboard:

Model   | Planning Effectiveness | Tool Usage Quality | Task Completion | Tool Name Validity | Schema Compliance | Execution Success Rate | Recovery Success Rate
model_a | 0.95                   | 0.85               | 0.95            | 1                  | 0.923077          | 0.923077               | 1

Sample figure: metrics_radar.png overlays one polygon per model on the seven unit-bounded automatic metrics (closer to the outer ring is better).


3. LLM-as-Judge

The judge consumes a DialogRecord (already enriched with ground_truth.expected_tools / expected_plan / expected_final_answer / task_success_criteria), prompts an LLM in JSON mode, and writes the three subjective scores back into llm_evaluation.

Directory inputs use paired mode by default. Same-stem files from different model folders (for example model_a/dialog1.jsonl and model_b/dialog1.jsonl) are scored in one shared prompt as candidate_a, candidate_b, … so the subjective scores are calibrated against the same ground truth and rubric context. Single-candidate groups automatically fall back to single scoring.

Use --judge-mode single when you intentionally want to score each file independently.

3.1 Backends

--judge Module Network Use case
openai (default) OpenAIJudge OpenAI Chat Completions or compatible production scoring (--model defaults to gpt-4o-mini)
mock MockJudge no offline pipeline / CI (--model ignored)

--base-url lets you point the OpenAI client at vLLM, LiteLLM, Azure proxies, or any OpenAI-compatible gateway.

3.2 Environment variables

Variable Required Notes
OPENAI_API_KEY unless --judge mock OpenAI / compatible service API key
OPENAI_BASE_URL optional custom base URL
OPENAI_ORG / OPENAI_ORGANIZATION optional OpenAI organization id

3.3 CLI

# Default judge is OpenAI; --model defaults to gpt-4o-mini
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --runs 3 --seed 1337

# Explicit backend / model overrides
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --judge openai --model gpt-4.1 \
  --runs 3 --seed 1337

# Independent per-file scoring instead of paired same-dialog scoring
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged \
  --judge-mode single

# Reasoning-class models (o1/o3/o4 ignore --temperature automatically)
python main.py llm-judge --input ./data/_evaluated --model o4-mini

# Self-hosted endpoint
python main.py llm-judge \
  --input ./data/_evaluated --judge openai \
  --model my-judge-finetune --base-url http://localhost:8000/v1

# No API credentials
python main.py llm-judge --input ./data/_evaluated --out ./data/_judged --judge mock

Use --continue-on-error to keep going on per-file failures; the run exits with code 1 if any failed.

3.4 Request / response contract

Request: one system message ("return only JSON") + one user message produced by evaluators/llm_judge/prompt.py, response_format={"type": "json_object"}.

Single-mode response:

{
  "planning_effectiveness": 0.0,
  "tool_usage_quality": 0.0,
  "task_completion": 0.0,
  "reasoning": {
    "planning_comment": "",
    "tool_comment": "",
    "task_comment": ""
  }
}

Paired-mode response:

{
  "candidate_a": {
    "planning_effectiveness": 0.0,
    "tool_usage_quality": 0.0,
    "task_completion": 0.0,
    "reasoning": {
      "planning_comment": "",
      "tool_comment": "",
      "task_comment": ""
    }
  },
  "candidate_b": {
    "planning_effectiveness": 0.0,
    "tool_usage_quality": 0.0,
    "task_completion": 0.0,
    "reasoning": {
      "planning_comment": "",
      "tool_comment": "",
      "task_comment": ""
    }
  }
}

Validation lives in evaluators/llm_judge/parser.py.


4. Train your own LLM judge (OpenAI fine-tuning)

Two supervision sources:

  1. Teacher distillation (--source llm, recommended): run llm-judge with a strong model (e.g. gpt-4o, --runs 5), then fine-tune a smaller student to imitate it.
  2. Human gold (--source human): export-annotation → label → import-annotationhuman_evaluation → fine-tune.

4.1 Workflow

python main.py auto-evaluate --input ./data

# Option A — teacher distillation
export OPENAI_API_KEY=sk-...
python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged_teacher \
  --judge openai --model gpt-4o --runs 5

python main.py prepare-judge-training \
  --input ./data/_judged_teacher \
  --out ./judge_train.jsonl \
  --source llm \
  --val-fraction 0.1

# Option B — human gold
python main.py export-annotation --input ./data/_evaluated
# … fill annotations_template.csv …
python main.py import-annotation \
  --input ./data/_evaluated/_annotations/annotations_template.csv \
  --results ./data/_evaluated

python main.py prepare-judge-training \
  --input ./data/_evaluated \
  --out ./judge_train.jsonl \
  --source human \
  --val-fraction 0.1

4.2 Submit the OpenAI fine-tuning job

The bundled helper takes care of file upload + job submission + status polling:

python scripts/finetune_judge.py \
  --train ./judge_train.jsonl --val ./judge_train.val.jsonl \
  --base-model gpt-4o-mini-2024-07-18 \
  --suffix dialog-judge-v1

You will eventually receive a ft:gpt-4o-mini-2024-07-18:my-org:dialog-judge-v1:abc123 model id.

4.3 Evaluate the trained judge

python main.py llm-judge \
  --input ./data/_evaluated --out ./data/_judged_student \
  --judge openai --model ft:...id

python main.py leaderboard --input ./data/_judged_student --metric-source llm

4.4 Rules of thumb

  • Aim for ≥ 50 high-quality examples to start; 200–1000 for a focused judge.
  • Keep prompt.py consistent between training-data construction and inference — fine-tuning bakes the prompt shape into the model.

5. Human annotation (optional)

python main.py export-annotation --input ./data/_evaluated
# Edit ./data/_evaluated/_annotations/annotations_template.csv
# Columns: annotator_id, planning_effectiveness, tool_usage_quality, task_completion (0..1), comments
# Do NOT touch dialog_id / model_name

python main.py import-annotation \
  --input ./data/_evaluated/_annotations/annotations_template.csv \
  --results ./data/_evaluated

python main.py leaderboard --input ./data/_evaluated --metric-source human

6. Extension points

  • New dialog scenario: add a section to data/DESIGN.md; run-all re-parses it on every invocation.
  • New trace format: subclass BaseDialogAdapter and register it in adapters/registry.py:default_registry.
  • Dependencies: Python ≥ 3.10, pydantic>=2.5, openai>=1.0 + matplotlib>=3.7 (default pipeline uses the OpenAI SDK; use --judge mock --no-figures to trim both).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%