180 changes: 180 additions & 0 deletions .apm/skills/benchmark-qed-autoe/SKILL.md
@@ -0,0 +1,180 @@
---
name: benchmark-qed-autoe
description: >
Evaluate RAG system outputs using benchmark-qed scoring methods. Use when:
running pairwise comparisons, reference-based scoring, assertion-based
evaluation (flat or hierarchical), retrieval metrics, or statistical
significance tests on RAG outputs. Also use when the user wants to score,
compare, or evaluate RAG methods, measure retrieval quality, or run
significance tests on benchmark results — even if they don't say "autoe"
explicitly.
---

# Benchmark-QED Evaluation (autoe)

Evaluate and compare RAG system outputs using LLM-judged scoring, assertion-based evaluation, and retrieval metrics — all with built-in statistical significance testing.

## Prerequisites

- Generated questions/assertions from the autoq pipeline (or your own)
- RAG method answer files (JSON, one per method per question set)
- A configured workspace with a valid `settings.yaml` for the evaluation type (use the `benchmark-qed-setup` skill to initialize and configure)
- LLM API key configured

Run all commands with:
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed <command>
```

## Evaluation Methods Overview

| Method | Command | Best for |
|--------|---------|----------|
| Pairwise comparison | `autoe pairwise-scores` | Comparing two RAG methods head-to-head |
| Reference scoring | `autoe reference-scores` | Scoring against gold-standard answers |
| Assertion scoring | `autoe assertion-scores` | Evaluating with ground-truth assertions (single or multi-RAG) |
| Hierarchical assertions | `autoe hierarchical-assertion-scores` | Global + local assertion hierarchies |
| Retrieval metrics | `autoe retrieval-scores` | Precision, recall, fidelity of retrieval |
| Significance tests | `autoe assertion-significance` | Post-hoc significance on existing scores |

## Commands

### 1. Pairwise Scores

Compare RAG methods using LLM-judged pairwise comparisons.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe pairwise-scores <config.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--alpha` | `0.05` | P-value threshold for significance |
| `--exclude-criteria` | `[]` | Criteria to exclude (repeatable) |
| `--print-model-usage` | `false` | Print LLM token usage |

**Config requires**: `base` (reference method), `others` (methods to compare), `question_sets`, `criteria`, `trials` (must be even), `llm_config`, `prompt_config`

Default criteria: `comprehensiveness`, `diversity`, `empowerment`, `relevance`

**Output**: `{question_set}_{base}--{other}.csv`, `win_rates.csv`, `winrates_sig_tests.csv`
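
A minimal config sketch with the required top-level keys. The nested fields (method names, answer paths, model) are illustrative placeholders rather than the exact benchmark-qed schema; use the `benchmark-qed-setup` skill to generate the real template.

```yaml
# Illustrative pairwise config; nested field names and values are assumptions.
base:
  name: vector_rag                        # reference method (placeholder name)
  answers_path: ./answers/vector_rag      # placeholder path
others:
  - name: graph_rag                       # methods compared against base
    answers_path: ./answers/graph_rag
question_sets: [data_local, data_global]  # question sets to evaluate
criteria: []                              # if unspecified, assumed to fall back to the defaults listed above
trials: 4                                 # must be even (counterbalancing)
llm_config:
  model: gpt-4.1                          # placeholder model
prompt_config: {}                         # built-in prompts
```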

### 2. Reference Scores

Score generated answers against reference (gold-standard) answers.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe reference-scores <config.yaml> <output_dir> [OPTIONS]
```

**Config requires**: `reference`, `generated` (list), `criteria`, `score_min`/`score_max`, `trials`, `llm_config`

Default criteria: `correctness`, `completeness`. Default score range: 1–10.

**Output**: `reference_scores-{name}.csv`, `model_usage.json`
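
A sketch of the required keys for reference scoring; paths and the model name are placeholders, not the exact schema.

```yaml
# Illustrative reference-scores config; nested fields are placeholders.
reference:
  name: gold_standard
  answers_path: ./answers/gold            # placeholder path to gold-standard answers
generated:
  - name: vector_rag
    answers_path: ./answers/vector_rag
  - name: graph_rag
    answers_path: ./answers/graph_rag
criteria: []                              # if unspecified, assumed to default to correctness and completeness
score_min: 1
score_max: 10
trials: 4
llm_config:
  model: gpt-4.1                          # placeholder model
```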

### 3. Assertion Scores

Evaluate RAG methods using assertion-based scoring. Auto-detects single-RAG vs multi-RAG config.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores <config.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--alpha` | `0.05` | Significance threshold (multi-RAG) |
| `--print-model-usage` | `false` | Print LLM token usage |

**Auto-detection**: If the YAML contains a `rag_methods` key, it runs in multi-RAG mode with automated significance testing. Otherwise, single-RAG mode.

**Single-RAG output**: `assertion_scores.csv`, `assertion_summary_by_question.csv`, `eval_summary.json`

**Multi-RAG output**: Per-method scores + significance tests in structured `output_dir/`
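
The auto-detection hinges only on whether a top-level `rag_methods` key is present. The sketch below illustrates that switch; all other keys are elided because their exact schema is not documented here.

```yaml
# Single-RAG mode: no rag_methods key
# ... per-method answer and assertion settings (elided; see the setup skill for the full schema) ...

# Multi-RAG mode: the presence of rag_methods triggers per-method scoring plus significance tests
rag_methods:
  - vector_rag        # placeholder names; the real entry format may also carry per-method paths
  - graph_rag
# ... shared question_sets / assertion / llm_config settings (elided) ...
```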

### 4. Hierarchical Assertion Scores

Score hierarchical assertions (global assertions with supporting local assertions).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe hierarchical-assertion-scores <config.yaml> <output_dir> [OPTIONS]
```

**Modes**: `staged` (default — evaluate local first, then global) or `joint` (evaluate together)

**Extra field**: `detect_discovery: true` enables detection of novel findings not covered by assertions.

Also auto-detects single vs multi-RAG config (same as assertion-scores).
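
A sketch of where the hierarchical-specific fields might sit; the key name for the mode and its placement are assumptions, not the confirmed schema.

```yaml
# Hierarchical-specific fields (key names and placement assumed); other keys as in assertion-scores.
mode: staged            # or joint, which evaluates local and global assertions together
detect_discovery: true  # also surface novel findings not covered by any assertion
# rag_methods / question_sets / llm_config ... (elided)
```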

### 5. Assertion Significance

Run statistical significance tests on existing assertion scores (no LLM calls).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-significance <config.yaml>
```

**Config requires**: `output_dir`, `rag_methods`, `question_sets`, `alpha`, `correction_method`

**Correction methods**: `holm` (default, recommended), `bonferroni`, `fdr_bh`
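
A sketch assembled from the required keys above; directory paths and method names are placeholders.

```yaml
# Illustrative assertion-significance config (no LLM settings: it only re-reads existing score files).
output_dir: ./eval_output        # assumed to point at the directory holding the existing score CSVs
rag_methods:
  - vector_rag                   # placeholder names, matching those used during scoring
  - graph_rag
question_sets:
  - data_local
  - data_global
alpha: 0.05
correction_method: holm          # or bonferroni, fdr_bh
```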

### 6. Hierarchical Assertion Significance

Significance tests on hierarchical assertion scores.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe hierarchical-assertion-significance <config.yaml>
```

**Config requires**: `scores_dir`, `rag_methods`, `scores_filename_template`, `alpha`, `correction_method`, `output_dir`
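
The hierarchical variant reads scores from a separate directory via a filename template; the template value below is a guess, not the library's actual pattern.

```yaml
# Illustrative hierarchical-assertion-significance config; the template value is an assumption.
scores_dir: ./eval_output/hierarchical
rag_methods: [vector_rag, graph_rag]             # placeholder names
scores_filename_template: "{method}_scores.csv"  # assumed pattern; match your real score filenames
alpha: 0.05
correction_method: holm
output_dir: ./eval_output/significance
```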

### 7. Generate Retrieval Reference

Generate cluster relevance reference data for retrieval evaluation (one-off prep step).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe generate-retrieval-reference <config.yaml>
```

**Config requires**: `llm_config`, `embedding_config`, question source (`questions_path` or `question_sets`), `text_units_path`

**Key settings**: `num_clusters`, `assessor_type` (`rationale` or `bing`), `semantic_neighbors`, `centroid_neighbors`
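
A sketch combining the required keys and key settings; all nested model fields, paths, and numeric values are placeholders.

```yaml
# Illustrative generate-retrieval-reference config; nested fields and values are placeholders.
llm_config:
  model: gpt-4.1                          # placeholder model
embedding_config:
  model: text-embedding-3-large           # placeholder embedding model
questions_path: ./output/data_local_questions/selected_questions.json  # or use question_sets instead
text_units_path: ./input/text_units.parquet   # placeholder path and format
num_clusters: 50                          # illustrative value
assessor_type: rationale                  # or bing
semantic_neighbors: 10                    # illustrative value
centroid_neighbors: 10                    # illustrative value
```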

### 8. Retrieval Scores

Evaluate retrieval precision, recall, and fidelity for RAG methods.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe retrieval-scores <config.yaml>
```

**Config requires**: `rag_methods`, `question_sets`, `reference_dir`, `text_units_path`, `output_dir`

**Fidelity metrics**: `js` (Jensen-Shannon divergence) or `tvd` (total variation distance)
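
A sketch built from the required keys; the fidelity-metric key name and the per-method entry format are assumptions.

```yaml
# Illustrative retrieval-scores config; key names beyond those listed above are assumptions.
rag_methods:
  - vector_rag          # placeholder; real entries likely also point at each method's retrieved contexts
  - graph_rag
question_sets: [data_local, data_global]
reference_dir: ./retrieval_reference          # output of generate-retrieval-reference
text_units_path: ./input/text_units.parquet   # placeholder path
output_dir: ./retrieval_scores
fidelity_metric: js     # assumed key name; js (Jensen-Shannon) or tvd (total variation distance)
```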

## Workflow

### Quick Evaluation (Assertion-Based)

- [ ] Step 1: Verify questions and answers exist — list the workspace and confirm a `settings.yaml` (or `config.yaml`), question JSON files (typically under `output/`), and your RAG method answer JSONs are present.
- [ ] Step 2: Initialize eval config — use the `benchmark-qed-setup` skill to create and configure an assertion evaluation workspace.
- [ ] Step 3: Configure `settings.yaml` with answer paths and assertion paths
- [ ] Step 4: Run evaluation — `uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores ./eval_workspace/settings.yaml ./eval_output`
- [ ] Step 5: Summarize results — read the CSVs in `<output_dir>` (e.g. `assertion_scores.csv`, `assertion_summary_by_question.csv`) and `eval_summary.json` directly.

### Multi-RAG Comparison

For comparing multiple RAG methods, use multi-RAG config format (include `rag_methods` key in YAML). This gives you automated pairwise significance testing.

## Gotchas

- **Config auto-detection**: `assertion-scores` and `hierarchical-assertion-scores` detect single vs multi-RAG based on the `rag_methods` key in YAML. Ensure your config matches your intent.
- **Trials must be even**: For pairwise scores, `trials` must be even (for counterbalancing). Use 4 as the default.
- **Stale outputs**: Several commands skip existing output files. Use a fresh output directory or delete specific files to force re-evaluation.
- **Output is in files**: All scores are written to CSV/JSON files. Parse output files, not CLI stdout.
- **Long-running**: Evaluation with many questions and trials can take hours. Use background execution.
- **No `config init` for hierarchical/retrieval**: The `benchmark-qed-setup` skill only initializes `autoe_assertion`, `autoe_pairwise`, and `autoe_reference` configs. For hierarchical, multi-RAG, and retrieval configs, create the YAML manually; the setup skill can still be consulted for configuration guidance on these advanced config types.
165 changes: 165 additions & 0 deletions .apm/skills/benchmark-qed-autoq/SKILL.md
@@ -0,0 +1,165 @@
---
name: benchmark-qed-autoq
description: >
Generate benchmark questions and assertions from input data using
benchmark-qed. Use when: generating local, global, linked, or activity
questions for RAG benchmarking, creating assertions for existing questions,
computing assertion statistics, or running the autoq question generation
pipeline. Also use when the user wants to create a benchmark question set,
build evaluation questions from a dataset, or generate ground-truth
assertions — even if they don't say "autoq" explicitly.
---

# Benchmark-QED Question Generation (autoq)

Generate benchmark questions and assertions from input data for RAG evaluation.

## Prerequisites

- A configured workspace with a valid `settings.yaml` (use the `benchmark-qed-setup` skill to initialize and configure)
- Input data (CSV or JSON) in the workspace `input/` directory
- Valid LLM API key in `.env`

Run all commands with:
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed <command>
```

## Commands

### 1. Generate Questions (`autoq`)

The main question generation pipeline. Generates benchmark questions from input data.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq <settings.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--generation-types` | Specific types to generate (repeatable). CLI default: all except `data_linked`, but this skill always includes `data_linked` |
| `--print-model-usage` | Print LLM token usage stats |

**Generation types and dependencies:**

```
data_local ← runs first (no dependencies)
├── data_global ← requires data_local candidates
└── data_linked ← requires data_local candidates (not in CLI default, but this skill always includes it)

activity_local ← auto-generates activity_context first
└── activity_global ← requires activity_local
```

> **Important**: `data_linked` is NOT included in the CLI's default generation types, but this skill always generates it by passing all types explicitly. If running the CLI manually, you must add `--generation-types data_linked`.

> **Gotcha**: `data_global` and `data_linked` silently return empty results if `data_local` hasn't been run first. Always run `data_local` before these types.

**Examples:**
```bash
# Run from the workspace directory (paths resolve relative to settings.yaml location)
cd ./workspace

# Generate all types including data_linked (skill default)
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
--generation-types data_local --generation-types data_global --generation-types data_linked \
--generation-types activity_local --generation-types activity_global

# Generate only local questions
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output --generation-types data_local

# Generate local + linked questions
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
--generation-types data_local --generation-types data_linked
```

**Output structure:**
```
output_dir/
├── sample_texts.parquet # Intermediate: clustered text samples
├── data_local_questions/
│ ├── selected_questions.json # Final curated questions
│ ├── selected_questions_text.json # Human-readable version
│ └── candidate_questions.json # All generated candidates
├── data_global_questions/ # Same structure
├── data_linked_questions/ # Same structure + question_stats.json
├── activity_local_questions/ # Same structure
├── activity_global_questions/ # Same structure
├── context/
│ └── activity_context_full.json # Generated activity context
└── model_usage.json # LLM token/cost tracking
```

### 2. Generate Assertions (`generate-assertions`)

Generate ground-truth assertions for existing questions (decoupled from question generation). This is a **top-level** command, not a subcommand of `autoq`.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions <settings.yaml> <questions.json> <output_dir> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--type` / `-t` | Assertion type: `local`, `global`, or `linked` (default: `local`) |
| `--print-model-usage` | Print LLM token usage stats |

**Examples:**
```bash
# Run from the workspace directory (paths resolve relative to settings.yaml location)
cd ./workspace

uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions \
settings.yaml \
./output/data_local_questions/candidate_questions.json \
./output/data_local_questions/ \
--type local

uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions \
settings.yaml \
./output/data_global_questions/candidate_questions.json \
./output/data_global_questions/ \
--type global
```

### 3. Assertion Statistics (`assertion-stats`)

Compute quality statistics for assertion files. This is a **top-level** command, not a subcommand of `autoq`.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats <assertions_path> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--output` / `-o` | Output path for stats JSON (auto-generated if omitted) |
| `--type` / `-t` | `global`, `map`, or `local` (auto-inferred if omitted) |
| `--quiet` / `-q` | Suppress console output |

**Examples:**
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats ./output/assertions.json
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats ./output/data_global_questions/ -q
```

## Workflow

### Standard Question Generation Flow

- [ ] Step 1: Initialize workspace if needed — use the `benchmark-qed-setup` skill to create and configure the workspace. Verify `settings.yaml`, `.env`, and `input/` exist.
- [ ] Step 2: `cd <workspace_dir>` then run question generation — `uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output --generation-types data_local --generation-types data_global --generation-types data_linked --generation-types activity_local --generation-types activity_global`
- [ ] Step 3: Verify output artifacts — list `<output_dir>` and confirm the per-type `selected_questions.json` files (see "Output structure" above) plus `model_usage.json` exist.
- [ ] Step 4: (Optional) Generate additional assertions — use `generate-assertions`
- [ ] Step 5: (Optional) Check assertion quality — use `assertion-stats`

## Gotchas

- **Path resolution**: The `autoq` and `generate-assertions` commands resolve `output_dir` (and other relative paths) **relative to the settings.yaml file's directory**, not the current working directory. Always `cd` into the workspace directory first, or use absolute paths. For example, running `benchmark-qed autoq workspace/settings.yaml workspace/output` from the repo root creates output at `workspace/workspace/output/` (not `workspace/output/`).
- **Stale outputs**: The pipeline skips steps if output files already exist (`sample_texts.parquet`, `activity_context_full.json`). Use a fresh output directory for clean runs, or delete specific files to re-run a step.
- **Long-running**: Question generation with large datasets can take hours. Use background execution, and monitor progress by checking whether `model_usage.json` has been written.
- **Output is in files, not stdout**: All results are written to JSON/CSV/Parquet files. Parse the output files, not CLI stdout.
- **Generation ordering**: `data_global` and `data_linked` depend on `data_local`. `activity_global` depends on `activity_local`. Running dependent types without their prerequisites produces silent empty results.
- **`data_linked` CLI opt-in**: The CLI excludes `data_linked` by default, but this skill always includes it. If running the CLI manually outside this skill, add `--generation-types data_linked`.