180 changes: 180 additions & 0 deletions .apm/skills/benchmark-qed-autoe/SKILL.md
@@ -0,0 +1,180 @@
---
name: benchmark-qed-autoe
description: >
Evaluate RAG system outputs using benchmark-qed scoring methods. Use when:
running pairwise comparisons, reference-based scoring, assertion-based
evaluation (flat or hierarchical), retrieval metrics, or statistical
significance tests on RAG outputs. Also use when the user wants to score,
compare, or evaluate RAG methods, measure retrieval quality, or run
significance tests on benchmark results — even if they don't say "autoe"
explicitly.
---

# Benchmark-QED Evaluation (autoe)

Evaluate and compare RAG system outputs using LLM-judged scoring, assertion-based evaluation, and retrieval metrics — all with built-in statistical significance testing.

## Prerequisites

- Generated questions/assertions from the autoq pipeline (or your own)
- RAG method answer files (JSON, one per method per question set)
- A configured workspace with a valid `settings.yaml` for the evaluation type (use the `benchmark-qed-setup` skill to initialize and configure)
- LLM API key configured

Run all commands with:
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed <command>
```

## Evaluation Methods Overview

| Method | Command | Best for |
|--------|---------|----------|
| Pairwise comparison | `autoe pairwise-scores` | Comparing two RAG methods head-to-head |
| Reference scoring | `autoe reference-scores` | Scoring against gold-standard answers |
| Assertion scoring | `autoe assertion-scores` | Evaluating with ground-truth assertions (single or multi-RAG) |
| Hierarchical assertions | `autoe hierarchical-assertion-scores` | Global + local assertion hierarchies |
| Retrieval metrics | `autoe retrieval-scores` | Precision, recall, fidelity of retrieval |
| Significance tests | `autoe assertion-significance` | Post-hoc significance on existing scores |

## Commands

### 1. Pairwise Scores

Compare RAG methods using LLM-judged pairwise comparisons.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe pairwise-scores <config.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--alpha` | `0.05` | P-value threshold for significance |
| `--exclude-criteria` | `[]` | Criteria to exclude (repeatable) |
| `--print-model-usage` | `false` | Print LLM token usage |

**Config requires**: `base` (reference method), `others` (methods to compare), `question_sets`, `criteria`, `trials` (must be even), `llm_config`, `prompt_config`

Default criteria: `comprehensiveness`, `diversity`, `empowerment`, `relevance`

**Output**: `{question_set}_{base}--{other}.csv`, `win_rates.csv`, `winrates_sig_tests.csv`
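
A minimal config sketch with the required top-level keys. The nested fields (method names, answer paths, model) are illustrative placeholders rather than the exact benchmark-qed schema; use the `benchmark-qed-setup` skill to generate the real template.

```yaml
# Illustrative pairwise config; nested field names and values are assumptions.
base:
  name: vector_rag                        # reference method (placeholder name)
  answers_path: ./answers/vector_rag      # placeholder path
others:
  - name: graph_rag                       # methods compared against base
    answers_path: ./answers/graph_rag
question_sets: [data_local, data_global]  # question sets to evaluate
criteria: []                              # if unspecified, assumed to fall back to the defaults listed above
trials: 4                                 # must be even (counterbalancing)
llm_config:
  model: gpt-4.1                          # placeholder model
prompt_config: {}                         # built-in prompts
```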

### 2. Reference Scores

Score generated answers against reference (gold-standard) answers.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe reference-scores <config.yaml> <output_dir> [OPTIONS]
```

**Config requires**: `reference`, `generated` (list), `criteria`, `score_min`/`score_max`, `trials`, `llm_config`

Default criteria: `correctness`, `completeness`. Default score range: 1–10.

**Output**: `reference_scores-{name}.csv`, `model_usage.json`
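
A sketch of the required keys for reference scoring; paths and the model name are placeholders, not the exact schema.

```yaml
# Illustrative reference-scores config; nested fields are placeholders.
reference:
  name: gold_standard
  answers_path: ./answers/gold            # placeholder path to gold-standard answers
generated:
  - name: vector_rag
    answers_path: ./answers/vector_rag
  - name: graph_rag
    answers_path: ./answers/graph_rag
criteria: []                              # if unspecified, assumed to default to correctness and completeness
score_min: 1
score_max: 10
trials: 4
llm_config:
  model: gpt-4.1                          # placeholder model
```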

### 3. Assertion Scores

Evaluate RAG methods using assertion-based scoring. Auto-detects single-RAG vs multi-RAG config.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores <config.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--alpha` | `0.05` | Significance threshold (multi-RAG) |
| `--print-model-usage` | `false` | Print LLM token usage |

**Auto-detection**: If the YAML contains a `rag_methods` key, it runs in multi-RAG mode with automated significance testing. Otherwise, single-RAG mode.

**Single-RAG output**: `assertion_scores.csv`, `assertion_summary_by_question.csv`, `eval_summary.json`

**Multi-RAG output**: Per-method scores + significance tests in structured `output_dir/`
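
The auto-detection hinges only on whether a top-level `rag_methods` key is present. The sketch below illustrates that switch; all other keys are elided because their exact schema is not documented here.

```yaml
# Single-RAG mode: no rag_methods key
# ... per-method answer and assertion settings (elided; see the setup skill for the full schema) ...

# Multi-RAG mode: the presence of rag_methods triggers per-method scoring plus significance tests
rag_methods:
  - vector_rag        # placeholder names; the real entry format may also carry per-method paths
  - graph_rag
# ... shared question_sets / assertion / llm_config settings (elided) ...
```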

### 4. Hierarchical Assertion Scores

Score hierarchical assertions (global assertions with supporting local assertions).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe hierarchical-assertion-scores <config.yaml> <output_dir> [OPTIONS]
```

**Modes**: `staged` (default — evaluate local first, then global) or `joint` (evaluate together)

**Extra field**: `detect_discovery: true` enables detection of novel findings not covered by assertions.

Also auto-detects single vs multi-RAG config (same as assertion-scores).
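
A sketch of where the hierarchical-specific fields might sit; the key name for the mode and its placement are assumptions, not the confirmed schema.

```yaml
# Hierarchical-specific fields (key names and placement assumed); other keys as in assertion-scores.
mode: staged            # or joint, which evaluates local and global assertions together
detect_discovery: true  # also surface novel findings not covered by any assertion
# rag_methods / question_sets / llm_config ... (elided)
```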

### 5. Assertion Significance

Run statistical significance tests on existing assertion scores (no LLM calls).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-significance <config.yaml>
```

**Config requires**: `output_dir`, `rag_methods`, `question_sets`, `alpha`, `correction_method`

**Correction methods**: `holm` (default, recommended), `bonferroni`, `fdr_bh`
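
A sketch assembled from the required keys above; directory paths and method names are placeholders.

```yaml
# Illustrative assertion-significance config (no LLM settings: it only re-reads existing score files).
output_dir: ./eval_output        # assumed to point at the directory holding the existing score CSVs
rag_methods:
  - vector_rag                   # placeholder names, matching those used during scoring
  - graph_rag
question_sets:
  - data_local
  - data_global
alpha: 0.05
correction_method: holm          # or bonferroni, fdr_bh
```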

### 6. Hierarchical Assertion Significance

Significance tests on hierarchical assertion scores.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe hierarchical-assertion-significance <config.yaml>
```

**Config requires**: `scores_dir`, `rag_methods`, `scores_filename_template`, `alpha`, `correction_method`, `output_dir`
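
The hierarchical variant reads scores from a separate directory via a filename template; the template value below is a guess, not the library's actual pattern.

```yaml
# Illustrative hierarchical-assertion-significance config; the template value is an assumption.
scores_dir: ./eval_output/hierarchical
rag_methods: [vector_rag, graph_rag]             # placeholder names
scores_filename_template: "{method}_scores.csv"  # assumed pattern; match your real score filenames
alpha: 0.05
correction_method: holm
output_dir: ./eval_output/significance
```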

### 7. Generate Retrieval Reference

Generate cluster relevance reference data for retrieval evaluation (one-off prep step).

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe generate-retrieval-reference <config.yaml>
```

**Config requires**: `llm_config`, `embedding_config`, question source (`questions_path` or `question_sets`), `text_units_path`

**Key settings**: `num_clusters`, `assessor_type` (`rationale` or `bing`), `semantic_neighbors`, `centroid_neighbors`
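
A sketch combining the required keys and key settings; all nested model fields, paths, and numeric values are placeholders.

```yaml
# Illustrative generate-retrieval-reference config; nested fields and values are placeholders.
llm_config:
  model: gpt-4.1                          # placeholder model
embedding_config:
  model: text-embedding-3-large           # placeholder embedding model
questions_path: ./output/data_local_questions/selected_questions.json  # or use question_sets instead
text_units_path: ./input/text_units.parquet   # placeholder path and format
num_clusters: 50                          # illustrative value
assessor_type: rationale                  # or bing
semantic_neighbors: 10                    # illustrative value
centroid_neighbors: 10                    # illustrative value
```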

### 8. Retrieval Scores

Evaluate retrieval precision, recall, and fidelity for RAG methods.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe retrieval-scores <config.yaml>
```

**Config requires**: `rag_methods`, `question_sets`, `reference_dir`, `text_units_path`, `output_dir`

**Fidelity metrics**: `js` (Jensen-Shannon divergence) or `tvd` (total variation distance)
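
A sketch built from the required keys; the fidelity-metric key name and the per-method entry format are assumptions.

```yaml
# Illustrative retrieval-scores config; key names beyond those listed above are assumptions.
rag_methods:
  - vector_rag          # placeholder; real entries likely also point at each method's retrieved contexts
  - graph_rag
question_sets: [data_local, data_global]
reference_dir: ./retrieval_reference          # output of generate-retrieval-reference
text_units_path: ./input/text_units.parquet   # placeholder path
output_dir: ./retrieval_scores
fidelity_metric: js     # assumed key name; js (Jensen-Shannon) or tvd (total variation distance)
```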

## Workflow

### Quick Evaluation (Assertion-Based)

- [ ] Step 1: Verify questions and answers exist — list the workspace and confirm a `settings.yaml` (or `config.yaml`), question JSON files (typically under `output/`), and your RAG method answer JSONs are present.
- [ ] Step 2: Initialize eval config — use the `benchmark-qed-setup` skill to create and configure an assertion evaluation workspace.
- [ ] Step 3: Configure `settings.yaml` with answer paths and assertion paths
- [ ] Step 4: Run evaluation — `uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores ./eval_workspace/settings.yaml ./eval_output`
- [ ] Step 5: Summarize results — read the CSVs in `<output_dir>` (e.g. `assertion_scores.csv`, `assertion_summary_by_question.csv`) and `eval_summary.json` directly.

### Multi-RAG Comparison

For comparing multiple RAG methods, use multi-RAG config format (include `rag_methods` key in YAML). This gives you automated pairwise significance testing.

## Gotchas

- **Config auto-detection**: `assertion-scores` and `hierarchical-assertion-scores` detect single vs multi-RAG based on the `rag_methods` key in YAML. Ensure your config matches your intent.
- **Trials must be even**: For pairwise scores, `trials` must be even (for counterbalancing). Use 4 as the default.
- **Stale outputs**: Several commands skip existing output files. Use a fresh output directory or delete specific files to force re-evaluation.
- **Output is in files**: All scores are written to CSV/JSON files. Parse output files, not CLI stdout.
- **Long-running**: Evaluation with many questions and trials can take hours. Use background execution.
- **No `config init` for hierarchical/retrieval**: The `benchmark-qed-setup` skill only initializes `autoe_assertion`, `autoe_pairwise`, and `autoe_reference` configs. For hierarchical, multi-RAG, and retrieval configs, create the YAML manually; the setup skill can still be consulted for configuration guidance on these advanced config types.
165 changes: 165 additions & 0 deletions .apm/skills/benchmark-qed-autoq/SKILL.md
@@ -0,0 +1,165 @@
---
name: benchmark-qed-autoq
description: >
Generate benchmark questions and assertions from input data using
benchmark-qed. Use when: generating local, global, linked, or activity
questions for RAG benchmarking, creating assertions for existing questions,
computing assertion statistics, or running the autoq question generation
pipeline. Also use when the user wants to create a benchmark question set,
build evaluation questions from a dataset, or generate ground-truth
assertions — even if they don't say "autoq" explicitly.
---

# Benchmark-QED Question Generation (autoq)

Generate benchmark questions and assertions from input data for RAG evaluation.

## Prerequisites

- A configured workspace with a valid `settings.yaml` (use the `benchmark-qed-setup` skill to initialize and configure)
- Input data (CSV or JSON) in the workspace `input/` directory
- Valid LLM API key in `.env`

Run all commands with:
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed <command>
```

## Commands

### 1. Generate Questions (`autoq`)

The main question generation pipeline. Generates benchmark questions from input data.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq <settings.yaml> <output_dir> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--generation-types` | Specific types to generate (repeatable). CLI default: all except `data_linked`, but this skill always includes `data_linked` |
| `--print-model-usage` | Print LLM token usage stats |

**Generation types and dependencies:**

```
data_local ← runs first (no dependencies)
├── data_global ← requires data_local candidates
└── data_linked ← requires data_local candidates (not in CLI default, but this skill always includes it)

activity_local ← auto-generates activity_context first
└── activity_global ← requires activity_local
```

> **Important**: `data_linked` is NOT included in the CLI's default generation types, but this skill always generates it by passing all types explicitly. If running the CLI manually, you must add `--generation-types data_linked`.

> **Gotcha**: `data_global` and `data_linked` silently return empty results if `data_local` hasn't been run first. Always run `data_local` before these types.

**Examples:**
```bash
# Run from the workspace directory (paths resolve relative to settings.yaml location)
cd ./workspace

# Generate all types including data_linked (skill default)
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
--generation-types data_local --generation-types data_global --generation-types data_linked \
--generation-types activity_local --generation-types activity_global

# Generate only local questions
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output --generation-types data_local

# Generate local + linked questions
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
--generation-types data_local --generation-types data_linked
```

**Output structure:**
```
output_dir/
├── sample_texts.parquet # Intermediate: clustered text samples
├── data_local_questions/
│ ├── selected_questions.json # Final curated questions
│ ├── selected_questions_text.json # Human-readable version
│ └── candidate_questions.json # All generated candidates
├── data_global_questions/ # Same structure
├── data_linked_questions/ # Same structure + question_stats.json
├── activity_local_questions/ # Same structure
├── activity_global_questions/ # Same structure
├── context/
│ └── activity_context_full.json # Generated activity context
└── model_usage.json # LLM token/cost tracking
```

### 2. Generate Assertions (`generate-assertions`)

Generate ground-truth assertions for existing questions (decoupled from question generation). This is a **top-level** command, not a subcommand of `autoq`.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions <settings.yaml> <questions.json> <output_dir> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--type` / `-t` | Assertion type: `local`, `global`, or `linked` (default: `local`) |
| `--print-model-usage` | Print LLM token usage stats |

**Examples:**
```bash
# Run from the workspace directory (paths resolve relative to settings.yaml location)
cd ./workspace

uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions \
settings.yaml \
./output/data_local_questions/candidate_questions.json \
./output/data_local_questions/ \
--type local

uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed generate-assertions \
settings.yaml \
./output/data_global_questions/candidate_questions.json \
./output/data_global_questions/ \
--type global
```

### 3. Assertion Statistics (`assertion-stats`)

Compute quality statistics for assertion files. This is a **top-level** command, not a subcommand of `autoq`.

```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats <assertions_path> [OPTIONS]
```

**Options:**
| Option | Description |
|--------|-------------|
| `--output` / `-o` | Output path for stats JSON (auto-generated if omitted) |
| `--type` / `-t` | `global`, `map`, or `local` (auto-inferred if omitted) |
| `--quiet` / `-q` | Suppress console output |

**Examples:**
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats ./output/assertions.json
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assertion-stats ./output/data_global_questions/ -q
```

## Workflow

### Standard Question Generation Flow

- [ ] Step 1: Initialize workspace if needed — use the `benchmark-qed-setup` skill to create and configure the workspace. Verify `settings.yaml`, `.env`, and `input/` exist.
- [ ] Step 2: `cd <workspace_dir>` then run question generation — `uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output --generation-types data_local --generation-types data_global --generation-types data_linked --generation-types activity_local --generation-types activity_global`
- [ ] Step 3: Verify output artifacts — list `<output_dir>` and confirm the per-type `selected_questions.json` files (see "Output structure" above) plus `model_usage.json` exist.
- [ ] Step 4: (Optional) Generate additional assertions — use `generate-assertions`
- [ ] Step 5: (Optional) Check assertion quality — use `assertion-stats`

## Gotchas

- **Path resolution**: The `autoq` and `generate-assertions` commands resolve `output_dir` (and other relative paths) **relative to the settings.yaml file's directory**, not the current working directory. Always `cd` into the workspace directory first, or use absolute paths. For example, running `benchmark-qed autoq workspace/settings.yaml workspace/output` from the repo root creates output at `workspace/workspace/output/` (not `workspace/output/`).
- **Stale outputs**: The pipeline skips steps if output files already exist (`sample_texts.parquet`, `activity_context_full.json`). Use a fresh output directory for clean runs, or delete specific files to re-run a step.
- **Long-running**: Question generation with large datasets can take hours. Use background execution, and monitor progress by checking whether `model_usage.json` has been written.
- **Output is in files, not stdout**: All results are written to JSON/CSV/Parquet files. Parse the output files, not CLI stdout.
- **Generation ordering**: `data_global` and `data_linked` depend on `data_local`. `activity_global` depends on `activity_local`. Running dependent types without their prerequisites produces silent empty results.
- **`data_linked` CLI opt-in**: The CLI excludes `data_linked` by default, but this skill always includes it. If running the CLI manually outside this skill, add `--generation-types data_linked`.