37 changes: 37 additions & 0 deletions .apm/skills/benchmark-qed-autoe/SKILL.md
Expand Up @@ -54,6 +54,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe
| `--alpha` | `0.05` | P-value threshold for significance |
| `--exclude-criteria` | `[]` | Criteria to exclude (repeatable) |
| `--print-model-usage` | `false` | Print LLM token usage |
| `--account-url` | `null` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
| `--connection-string` | `null` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |

**Config requires**: `base` (reference method), `others` (methods to compare), `question_sets`, `criteria`, `trials` (must be even), `llm_config`, `prompt_config`

Expand Down Expand Up @@ -88,6 +90,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe
|--------|---------|-------------|
| `--alpha` | `0.05` | Significance threshold (multi-RAG) |
| `--print-model-usage` | `false` | Print LLM token usage |
| `--account-url` | `null` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
| `--connection-string` | `null` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |

**Auto-detection**: If the YAML contains a `rag_methods` key, it runs in multi-RAG mode with automated significance testing. Otherwise, single-RAG mode.

Expand Down Expand Up @@ -178,3 +182,36 @@ For comparing multiple RAG methods, use multi-RAG config format (include `rag_me
- **Long-running**: Evaluation with many questions and trials can take hours. Use background execution.
- **No `config init` for hierarchical/retrieval**: The `benchmark-qed-setup` skill only supports `autoe_assertion`, `autoe_pairwise`, and `autoe_reference`. For hierarchical, multi-RAG, and retrieval configs, create YAML manually.
- **Advanced config types**: Use the `benchmark-qed-setup` skill for configuration guidance on advanced config types.

## Azure Blob Storage

All `autoe` commands support reading config files from Azure Blob Storage using `blob://` URIs:

```bash
# Config file in blob storage
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores \
blob://my-container/eval/settings.yaml ./eval_output \
--account-url https://myaccount.blob.core.windows.net
```

In addition, `settings.yaml` supports `input_storage` and `output_storage` blocks so the evaluation pipeline can read answers/assertions from and write results to Azure Blob Storage:

```yaml
# Read answers and assertions from blob storage
input_storage:
  type: blob
  container_name: my-datasets
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  # Or use managed identity:
  # account_url: https://myaccount.blob.core.windows.net

# Write evaluation output to blob storage
output_storage:
  type: blob
  container_name: my-output
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
```

When using storage blocks, `answer_base_path` and `assertions_path` in the config are resolved relative to the storage container (not the local filesystem).

See [references/config-reference.md](../benchmark-qed-setup/references/config-reference.md) for full `StorageConfig` fields.
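
The `${AZURE_STORAGE_CONNECTION_STRING}` placeholder in the storage blocks keeps the secret out of `settings.yaml`. How benchmark-qed expands the placeholder is not shown in this diff; the sketch below assumes shell-style substitution and uses Python's `string.Template` only as an illustration. `expand_env_placeholders` is a hypothetical helper, not part of the library.

```python
import os
from string import Template
from typing import Mapping, Optional


def expand_env_placeholders(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} placeholders with values from env (hypothetical helper).

    Raises KeyError when a referenced variable is missing, so a forgotten
    secret fails loudly instead of being sent to Azure as literal text.
    """
    env = os.environ if env is None else env
    return Template(value).substitute(env)


# A storage block as it would look after YAML parsing.
storage_block = {
    "type": "blob",
    "container_name": "my-datasets",
    "connection_string": "${AZURE_STORAGE_CONNECTION_STRING}",
}

# A stand-in environment so the example does not depend on real credentials.
fake_env = {"AZURE_STORAGE_CONNECTION_STRING": "UseDevelopmentStorage=true"}

resolved = {key: expand_env_placeholders(val, fake_env) for key, val in storage_block.items()}
```

Values without a placeholder, such as `type` and `container_name`, pass through unchanged.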
44 changes: 44 additions & 0 deletions .apm/skills/benchmark-qed-autoq/SKILL.md
Expand Up @@ -41,6 +41,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq
|--------|-------------|
| `--generation-types` | Specific types to generate (repeatable). CLI default: all except `data_linked`, but this skill always includes `data_linked` |
| `--print-model-usage` | Print LLM token usage stats |
| `--account-url` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. Falls back to `$AZURE_STORAGE_ACCOUNT_URL`. |
| `--connection-string` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. Falls back to `$AZURE_STORAGE_CONNECTION_STRING`. |

**Generation types and dependencies:**

Expand Down Expand Up @@ -73,6 +75,12 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq
# Generate local + linked questions
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
--generation-types data_local --generation-types data_linked

# Use a config stored in Azure Blob Storage
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq \
blob://my-container/configs/settings.yaml ./output \
--account-url https://myaccount.blob.core.windows.net \
--generation-types data_local
```

**Output structure:**
Expand Down Expand Up @@ -105,6 +113,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed genera
|--------|-------------|
| `--type` / `-t` | Assertion type: `local`, `global`, or `linked` (default: `local`) |
| `--print-model-usage` | Print LLM token usage stats |
| `--account-url` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
| `--connection-string` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |

**Examples:**
```bash
Expand Down Expand Up @@ -155,6 +165,39 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assert
- [ ] Step 4: (Optional) Generate additional assertions — use `generate-assertions`
- [ ] Step 5: (Optional) Check assertion quality — use `assertion-stats`

## Azure Blob Storage

All `autoq` and `generate-assertions` commands support reading config files from Azure Blob Storage using `blob://` URIs:

```bash
# Config file in blob storage — the CLI downloads the config (and sibling prompt files) to a temp directory
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq \
blob://my-container/project/settings.yaml ./output \
--connection-string "$AZURE_STORAGE_CONNECTION_STRING"
```

In addition, `settings.yaml` supports an `input.storage` block (nested under `input`) and a top-level `output_storage` block, so AutoQ can read input data from and write output to Azure Blob Storage:

```yaml
# Read input data from blob storage
input:
  dataset_path: data/input.csv
  storage:
    type: blob
    container_name: my-datasets
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    # Or use managed identity:
    # account_url: https://myaccount.blob.core.windows.net

# Write output to blob storage
output_storage:
  type: blob
  container_name: my-output
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
```

See [references/config-reference.md](../benchmark-qed-setup/references/config-reference.md) for full `StorageConfig` fields.

## Gotchas

- **Path resolution**: The `autoq` and `generate-assertions` commands resolve `output_dir` (and other relative paths) **relative to the settings.yaml file's directory**, not the current working directory. Always `cd` into the workspace directory first, or use absolute paths. For example, running `benchmark-qed autoq workspace/settings.yaml workspace/output` from the repo root creates output at `workspace/workspace/output/` (not `workspace/output/`).
Expand All @@ -163,3 +206,4 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assert
- **Output is in files, not stdout**: All results are written to JSON/CSV/Parquet files. Parse the output files, not CLI stdout.
- **Generation ordering**: `data_global` and `data_linked` depend on `data_local`. `activity_global` depends on `activity_local`. Running dependent types without their prerequisites produces silent empty results.
- **`data_linked` CLI opt-in**: The CLI excludes `data_linked` by default, but this skill always includes it. If running the CLI manually outside this skill, add `--generation-types data_linked`.
- **Blob URI format**: Use `blob://<container>/<key>` for config paths. The CLI downloads the config and all sibling files (prompt templates) to a temp directory so relative paths resolve correctly.
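
The `blob://<container>/<key>` shape from the last gotcha can be illustrated with a small parser. This is a minimal sketch, not the CLI's actual resolver (which is not shown in this section); `parse_blob_uri` and `sibling_prefix` are hypothetical names, and the "sibling files" rule is assumed to mean "blobs sharing the config's directory prefix".

```python
import posixpath
from typing import Tuple

BLOB_SCHEME = "blob://"


def parse_blob_uri(uri: str) -> Tuple[str, str]:
    """Split blob://<container>/<key> into (container, key)."""
    if not uri.startswith(BLOB_SCHEME):
        raise ValueError(f"not a blob URI: {uri!r}")
    container, _, key = uri[len(BLOB_SCHEME):].partition("/")
    if not container or not key:
        raise ValueError(f"blob URI needs both a container and a key: {uri!r}")
    return container, key


def sibling_prefix(key: str) -> str:
    """Return the prefix shared by blobs in the same 'directory' as key."""
    directory = posixpath.dirname(key)
    return directory + "/" if directory else ""


container, key = parse_blob_uri("blob://my-container/project/settings.yaml")
prefix = sibling_prefix(key)
```

Under that assumption, every blob whose name starts with `prefix` would be downloaded next to the config so relative prompt paths keep resolving.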
56 changes: 56 additions & 0 deletions .apm/skills/benchmark-qed-setup/SKILL.md
Expand Up @@ -70,6 +70,26 @@ Example:
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed config init autoq ./my_workspace
```

**Storage options** for `config init`:
| Option | Description |
|--------|-------------|
| `--storage-type` / `-s` | `local` (default) or `blob`. When `blob`, storage config sections are scaffolded as active YAML (not commented out). |
| `--container-name` | Pre-fill the blob container name in generated storage config. |
| `--account-url` | Pre-fill the account URL (managed-identity auth) in generated storage config. |
| `--connection-string` | Pre-fill the connection string in generated storage config. |
| `--base-dir` | Pre-fill a base prefix path within the container. |

When `--storage-type blob` is combined with `--account-url` or `--connection-string`, the generated config and prompt files are also uploaded directly to the blob container.

Example (blob):
```bash
uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed config init autoq ./my_workspace \
--storage-type blob \
--container-name my-datasets \
--account-url https://myaccount.blob.core.windows.net \
--base-dir experiments/run1
```

This creates:
```
root/
Expand All @@ -95,6 +115,23 @@ echo y | uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-q

**Available datasets**: `AP_news`, `podcast`, `example_answers`

**Storage options** for `data download`:
| Option | Description |
|--------|-------------|
| `--storage-type` | Set to `blob` to upload the dataset to Azure Blob Storage instead of extracting locally. |
| `--container-name` | The blob container name. |
| `--account-url` | Azure storage account URL (managed-identity auth). |
| `--connection-string` | Azure storage connection string (alternative to `--account-url`). |
| `--base-dir` | Base prefix in blob storage. Files are stored as `{base_dir}/{output_dir}/`. |
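
The `--base-dir` row gives the upload layout as `{base_dir}/{output_dir}/`. A minimal sketch of how such a prefix could be composed, assuming plain forward-slash joining; `blob_prefix` is a hypothetical helper, not a benchmark-qed function.

```python
import posixpath
from typing import Optional


def blob_prefix(output_dir: str, base_dir: Optional[str] = None) -> str:
    """Compose the blob name prefix {base_dir}/{output_dir}/ (hypothetical helper)."""
    # Drop empty parts and stray slashes so the prefix never contains "//".
    parts = [part.strip("/") for part in (base_dir, output_dir) if part and part.strip("/")]
    if not parts:
        raise ValueError("output_dir must not be empty")
    return posixpath.join(*parts) + "/"


with_base = blob_prefix("datasets", base_dir="experiments/run1")
without_base = blob_prefix("datasets")
```

Here `with_base` is `experiments/run1/datasets/` and `without_base` is `datasets/`.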

Example (download to blob):
```bash
echo y | uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed data download AP_news datasets \
--storage-type blob \
--container-name my-datasets \
--account-url https://myaccount.blob.core.windows.net
```

### Step 3 — Gather Configuration Choices from the User

Before writing any values into `settings.yaml`, **prompt the user with `ask_user`** to collect the LLM / auth / endpoint settings. Do not guess — these decisions are environment-specific and getting them wrong wastes downstream LLM calls. Use enum/boolean fields whenever possible so the user picks from a known set rather than typing free-form text.
Expand Down Expand Up @@ -138,6 +175,22 @@ Only ask the questions relevant to the chosen `config_type`:
- `autoe_reference`: `reference.name` + `reference.answer_base_path`, list of `generated`, and `question_sets`.
- `autoe_assertion`: in single-RAG mode, `generated.name` + `generated.answer_base_path` and `assertions.assertions_path`. In multi-RAG mode (`rag_methods` provided), ask for `input_dir`, `output_dir`, `rag_methods` list, and `question_sets`.

#### Storage fields (all config types, optional)

Ask the user if they want to use Azure Blob Storage for input/output. If yes, collect:

| Field | Type | Notes |
|-------|------|-------|
| `use_blob_storage` | boolean | Whether to configure cloud storage. |
| `storage_container_name` | string | Azure Blob container name (e.g. `my-datasets`). |
| `storage_auth_method` | enum (`connection_string`, `managed_identity`) | How to authenticate to Azure. |
| `storage_connection_string_env_var` | string | Env var name for connection string (default: `AZURE_STORAGE_CONNECTION_STRING`). Only when `storage_auth_method=connection_string`. |
| `storage_account_url` | string (uri) | Storage account URL. Only when `storage_auth_method=managed_identity`. |
| `storage_base_dir` | string | Optional prefix path within the container. |
| `separate_output_container` | boolean | Whether output uses a different container than input. |

If storage is enabled, write the appropriate blocks into `settings.yaml`: `input.storage` (nested under `input`) for AutoQ, top-level `input_storage` for AutoE, and top-level `output_storage` for either.
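
Turning the collected answers into storage blocks can be sketched as follows. This is an illustration under stated assumptions: `build_storage_block` and `storage_sections` are hypothetical helpers, and the `base_dir` YAML key is assumed from the `--base-dir` CLI option rather than confirmed against `StorageConfig`.

```python
from typing import Any, Dict, Optional


def build_storage_block(
    container_name: str,
    auth_method: str,
    *,
    connection_string_env_var: str = "AZURE_STORAGE_CONNECTION_STRING",
    account_url: Optional[str] = None,
    base_dir: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one blob storage block from the Step 3 answers (hypothetical helper)."""
    block: Dict[str, Any] = {"type": "blob", "container_name": container_name}
    if auth_method == "connection_string":
        # Store a placeholder, never the secret itself.
        block["connection_string"] = "${" + connection_string_env_var + "}"
    elif auth_method == "managed_identity":
        if not account_url:
            raise ValueError("managed_identity requires storage_account_url")
        block["account_url"] = account_url
    else:
        raise ValueError(f"unknown storage_auth_method: {auth_method!r}")
    if base_dir:
        block["base_dir"] = base_dir  # assumed key name, see lead-in
    return block


def storage_sections(
    config_type: str, input_block: Dict[str, Any], output_block: Dict[str, Any]
) -> Dict[str, Any]:
    """Place the blocks where each tool expects them."""
    if config_type == "autoq":
        # AutoQ nests input storage under `input`.
        return {"input": {"storage": input_block}, "output_storage": output_block}
    # AutoE config types keep both blocks at the top level.
    return {"input_storage": input_block, "output_storage": output_block}


inp = build_storage_block("my-datasets", "connection_string")
out = build_storage_block(
    "my-output", "managed_identity", account_url="https://myaccount.blob.core.windows.net"
)
autoq_sections = storage_sections("autoq", inp, out)
autoe_sections = storage_sections("autoe_assertion", inp, out)
```

The resulting dictionaries can be merged into the parsed `settings.yaml` before it is written back.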

If the user declines a field, fall back to the documented default and call out the assumption in your response.

### Step 4 — Apply the Answers
Expand Down Expand Up @@ -212,3 +265,6 @@ Key highlights:
- Config types `autoe_pairwise`, `autoe_reference`, and `autoe_assertion` generate different settings.yaml templates — use the correct type for your evaluation method.
- Prompts are copied as `.txt` files using Python `string.Template` syntax (`$variable` or `${variable}`).
- **`prompt_config` key**: The runtime expects `prompt_config` (singular) for all autoe config types. Both `benchmark-qed init` and `config init` now generate the correct key. If you hand-edit YAML, ensure you use `prompt_config`, not `prompts_config`.
- **`config init --storage-type blob`**: When combined with `--account-url` or `--connection-string`, the command uploads the generated `settings.yaml` and prompt files directly to blob storage. Without those auth options, it only scaffolds the storage YAML sections locally.
- **Blob URI format**: CLI commands accept `blob://<container>/<key>` for config paths. The CLI downloads the config and all sibling files (prompt templates) to a temp directory so relative paths resolve correctly. Credentials can be passed via `--account-url`/`--connection-string` or the environment variables `AZURE_STORAGE_ACCOUNT_URL`/`AZURE_STORAGE_CONNECTION_STRING`.
- **Storage config in YAML**: AutoQ uses `input.storage` (nested under `input`) and `output_storage` (top-level). AutoE uses `input_storage` and `output_storage` (both top-level). When storage is omitted, local filesystem is used.
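
The credential fallback described in the Blob URI note (CLI options first, then environment variables) can be sketched as below. This is not the `resolve_config_path` implementation, which this section does not show; `pick_blob_credentials` is a hypothetical helper, and the precedence of a connection string over an account URL at the same level is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple

ENV_ACCOUNT_URL = "AZURE_STORAGE_ACCOUNT_URL"
ENV_CONNECTION_STRING = "AZURE_STORAGE_CONNECTION_STRING"


def pick_blob_credentials(
    account_url: Optional[str] = None,
    connection_string: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (kind, value) for the first credential found (hypothetical helper).

    Explicit CLI options win over environment variables. Preferring the
    connection string within each level is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    candidates = [
        ("connection_string", connection_string),
        ("account_url", account_url),
        ("connection_string", env.get(ENV_CONNECTION_STRING)),
        ("account_url", env.get(ENV_ACCOUNT_URL)),
    ]
    for kind, value in candidates:
        if value:
            return kind, value
    raise ValueError(
        "blob:// config paths need --account-url or --connection-string, "
        f"or {ENV_ACCOUNT_URL} / {ENV_CONNECTION_STRING} in the environment"
    )


# The explicit option beats the environment variable.
explicit = pick_blob_credentials(
    account_url="https://myaccount.blob.core.windows.net",
    env={ENV_CONNECTION_STRING: "UseDevelopmentStorage=true"},
)
# With no options, the environment is consulted.
from_env = pick_blob_credentials(env={ENV_ACCOUNT_URL: "https://other.blob.core.windows.net"})
```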
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20260506063439936521.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "storage improvements"
}
64 changes: 64 additions & 0 deletions benchmark_qed/autoe/cli.py
Expand Up @@ -41,6 +41,11 @@
)
from benchmark_qed.autoe.pairwise import analyze_criteria, get_pairwise_scores
from benchmark_qed.autoe.reference import get_reference_scores
from benchmark_qed.cli.config_resolver import (
    AccountUrlOption,
    ConnectionStringOption,
    resolve_config_path,
)
from benchmark_qed.cli.utils import print_df
from benchmark_qed.llm.factory import ModelFactory

Expand Down Expand Up @@ -148,10 +153,17 @@ def pairwise_scores(
            help="The key in the JSON file that contains the question ID. This is used to match questions across different conditions."
        ),
    ] = "question_id",
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Generate scores for the different conditions provided in the JSON file."""
    if exclude_criteria is None:
        exclude_criteria = []
    comparison_spec = resolve_config_path(
        comparison_spec,
        account_url=account_url,
        connection_string=connection_string,
    )
    config = load_config(PairwiseConfig, comparison_spec)

    config.criteria = [
Expand Down Expand Up @@ -282,10 +294,17 @@ def reference_scores(
            help="The key in the JSON file that contains the question ID. This is used to match questions across different conditions."
        ),
    ] = "question_id",
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Generate scores for the generated answers provided in the JSON file."""
    if exclude_criteria is None:
        exclude_criteria = []
    comparison_spec = resolve_config_path(
        comparison_spec,
        account_url=account_url,
        connection_string=connection_string,
    )
    config = load_config(ReferenceConfig, comparison_spec)

    config.criteria = [
Expand Down Expand Up @@ -397,6 +416,8 @@ def assertion_scores(
        str,
        typer.Option(help="Assertions key in JSON (single-RAG mode only)."),
    ] = "assertions",
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Score assertions for RAG method(s).

Expand Down Expand Up @@ -425,6 +446,12 @@ def assertion_scores(
    """
    import yaml

    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )

    # Load raw YAML to detect format
    with Path(config_path).open(encoding="utf-8") as f:
        raw_config = yaml.safe_load(f)
Expand Down Expand Up @@ -672,6 +699,8 @@ def hierarchical_assertion_scores(
            help="The key in assertions that contains the supporting assertions list."
        ),
    ] = "supporting_assertions",
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Score hierarchical assertions with supporting assertions.

Expand Down Expand Up @@ -708,6 +737,11 @@ def hierarchical_assertion_scores(
    """
    import yaml

    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )
    # Load raw YAML to detect format
    with Path(config_path).open(encoding="utf-8") as f:
        raw_config = yaml.safe_load(f)
Expand Down Expand Up @@ -933,6 +967,9 @@ def assertion_significance(
            help="Path to the assertion significance configuration YAML file."
        ),
    ],
    *,
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Run statistical significance tests on standard assertion scores.

Expand All @@ -954,6 +991,11 @@ def assertion_significance(
    """
    from benchmark_qed.autoe.assertion import compare_assertion_scores_significance

    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )
    config = load_config(AssertionSignificanceConfig, config_path)

    rich_print("[bold]Running assertion significance tests[/bold]")
Expand Down Expand Up @@ -996,6 +1038,9 @@ def hierarchical_assertion_significance(
            help="Path to the hierarchical assertion significance config YAML."
        ),
    ],
    *,
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Run statistical significance tests on hierarchical assertion scores.

Expand All @@ -1018,6 +1063,11 @@ def hierarchical_assertion_significance(
        compare_hierarchical_assertion_scores_significance,
    )

    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )
    config = load_config(HierarchicalAssertionSignificanceConfig, config_path)

    rich_print("[bold]Running hierarchical assertion significance tests[/bold]")
Expand Down Expand Up @@ -1073,6 +1123,8 @@ def generate_retrieval_reference(
        bool,
        typer.Option(help="Whether to print the model usage statistics."),
    ] = False,
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Generate retrieval reference data (cluster relevance) for a question set.

Expand All @@ -1097,6 +1149,11 @@ def generate_retrieval_reference(
    Otherwise, text units will be loaded from text_units_path and clustered.
    """
    # Run all async work in a single event loop
    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )
    asyncio.run(_generate_retrieval_reference_async(config_path, print_model_usage))


Expand Down Expand Up @@ -1410,6 +1467,8 @@ def retrieval_scores(
        int,
        typer.Option(help="Maximum concurrent relevance assessments."),
    ] = 8,
    account_url: AccountUrlOption = None,
    connection_string: ConnectionStringOption = None,
) -> None:
    """Evaluate retrieval metrics (precision, recall, fidelity) for RAG methods.

Expand All @@ -1428,6 +1487,11 @@ def retrieval_scores(
        RationaleRelevanceRater,
    )

    config_path = resolve_config_path(
        config_path,
        account_url=account_url,
        connection_string=connection_string,
    )
    config = load_config(RetrievalScoresConfig, config_path)
    config.output_dir.mkdir(parents=True, exist_ok=True)
