microsoft · gaudyb · May 6, 2026
diff --git a/.apm/skills/benchmark-qed-autoe/SKILL.md b/.apm/skills/benchmark-qed-autoe/SKILL.md
@@ -54,6 +54,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe
 | `--alpha` | `0.05` | P-value threshold for significance |
 | `--exclude-criteria` | `[]` | Criteria to exclude (repeatable) |
 | `--print-model-usage` | `false` | Print LLM token usage |
+| `--account-url` | `null` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
+| `--connection-string` | `null` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |
 
 **Config requires**: `base` (reference method), `others` (methods to compare), `question_sets`, `criteria`, `trials` (must be even), `llm_config`, `prompt_config`
 
@@ -88,6 +90,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe
 |--------|---------|-------------|
 | `--alpha` | `0.05` | Significance threshold (multi-RAG) |
 | `--print-model-usage` | `false` | Print LLM token usage |
+| `--account-url` | `null` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
+| `--connection-string` | `null` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |
 
 **Auto-detection**: If the YAML contains a `rag_methods` key, it runs in multi-RAG mode with automated significance testing. Otherwise, single-RAG mode.
 
@@ -178,3 +182,36 @@ For comparing multiple RAG methods, use multi-RAG config format (include `rag_me
 - **Long-running**: Evaluation with many questions and trials can take hours. Use background execution.
 - **No `config init` for hierarchical/retrieval**: The `benchmark-qed-setup` skill only supports `autoe_assertion`, `autoe_pairwise`, and `autoe_reference`. For hierarchical, multi-RAG, and retrieval configs, create YAML manually.
 - **Advanced config types**: Use the `benchmark-qed-setup` skill for configuration guidance on advanced config types.
+
+## Azure Blob Storage
+
+All `autoe` commands support reading config files from Azure Blob Storage using `blob://` URIs:
+
+```bash
+# Config file in blob storage
+uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoe assertion-scores \
+  blob://my-container/eval/settings.yaml ./eval_output \
+  --account-url https://myaccount.blob.core.windows.net
+```
+
+In addition, `settings.yaml` supports `input_storage` and `output_storage` blocks so the evaluation pipeline can read answers/assertions from and write results to Azure Blob Storage:
+
+```yaml
+# Read answers and assertions from blob storage
+input_storage:
+  type: blob
+  container_name: my-datasets
+  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
+  # Or use managed identity:
+  # account_url: https://myaccount.blob.core.windows.net
+
+# Write evaluation output to blob storage
+output_storage:
+  type: blob
+  container_name: my-output
+  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
+```
+
+When using storage blocks, `answer_base_path` and `assertions_path` in the config are resolved relative to the storage container (not the local filesystem).
+
+See [references/config-reference.md](../benchmark-qed-setup/references/config-reference.md) for full `StorageConfig` fields.
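Reviewer sketch (not part of the patch): the `blob://my-container/eval/settings.yaml` argument in the example above follows the `blob://<container>/<key>` shape documented elsewhere in this PR. A minimal way to split such a URI is shown below. `parse_blob_uri` is a hypothetical name used only for illustration; the project's actual `resolve_config_path` is not shown in this diff and may behave differently.

```python
from urllib.parse import urlparse


def parse_blob_uri(uri: str) -> tuple[str, str]:
    """Split a blob://<container>/<key> URI into (container, key).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "blob":
        raise ValueError(f"not a blob:// URI: {uri!r}")
    container = parsed.netloc          # the part right after blob://
    key = parsed.path.lstrip("/")      # everything after the container
    if not container or not key:
        raise ValueError(f"expected blob://<container>/<key>, got {uri!r}")
    return container, key


container, key = parse_blob_uri("blob://my-container/eval/settings.yaml")
print(container, key)
```

A plain local path such as `./settings.yaml` has no `blob` scheme, so this sketch rejects it; a real resolver would presumably return local paths unchanged.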
diff --git a/.apm/skills/benchmark-qed-autoq/SKILL.md b/.apm/skills/benchmark-qed-autoq/SKILL.md
@@ -41,6 +41,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq
 |--------|-------------|
 | `--generation-types` | Specific types to generate (repeatable). CLI default: all except `data_linked`, but this skill always includes `data_linked` |
 | `--print-model-usage` | Print LLM token usage stats |
+| `--account-url` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. Falls back to `$AZURE_STORAGE_ACCOUNT_URL`. |
+| `--connection-string` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. Falls back to `$AZURE_STORAGE_CONNECTION_STRING`. |
 
 **Generation types and dependencies:**
 
@@ -73,6 +75,12 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq
 # Generate local + linked questions
 uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq settings.yaml ./output \
   --generation-types data_local --generation-types data_linked
+
+# Use a config stored in Azure Blob Storage
+uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq \
+  blob://my-container/configs/settings.yaml ./output \
+  --account-url https://myaccount.blob.core.windows.net \
+  --generation-types data_local
 ```
 
 **Output structure:**
@@ -105,6 +113,8 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed genera
 |--------|-------------|
 | `--type` / `-t` | Assertion type: `local`, `global`, or `linked` (default: `local`) |
 | `--print-model-usage` | Print LLM token usage stats |
+| `--account-url` | Azure Blob Storage account URL (managed-identity auth). Use when the config path is a `blob://` URI. |
+| `--connection-string` | Azure Blob Storage connection string. Use when the config path is a `blob://` URI. |
 
 **Examples:**
 ```bash
@@ -155,6 +165,39 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assert
 - [ ] Step 4: (Optional) Generate additional assertions — use `generate-assertions`
 - [ ] Step 5: (Optional) Check assertion quality — use `assertion-stats`
 
+## Azure Blob Storage
+
+The `autoq` and `generate-assertions` commands support reading config files from Azure Blob Storage using `blob://` URIs:
+
+```bash
+# Config file in blob storage — the CLI downloads the config (and sibling prompt files) to a temp directory
+uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed autoq \
+  blob://my-container/project/settings.yaml ./output \
+  --connection-string "$AZURE_STORAGE_CONNECTION_STRING"
+```
+
+In addition, `settings.yaml` supports `input.storage` and `output_storage` blocks to read input data from and write output to Azure Blob Storage:
+
+```yaml
+# Read input data from blob storage
+input:
+  dataset_path: data/input.csv
+  storage:
+    type: blob
+    container_name: my-datasets
+    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
+    # Or use managed identity:
+    # account_url: https://myaccount.blob.core.windows.net
+
+# Write output to blob storage
+output_storage:
+  type: blob
+  container_name: my-output
+  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
+```
+
+See [references/config-reference.md](../benchmark-qed-setup/references/config-reference.md) for full `StorageConfig` fields.
+
 ## Gotchas
 
 - **Path resolution**: The `autoq` and `generate-assertions` commands resolve `output_dir` (and other relative paths) **relative to the settings.yaml file's directory**, not the current working directory. Always `cd` into the workspace directory first, or use absolute paths. For example, running `benchmark-qed autoq workspace/settings.yaml workspace/output` from the repo root creates output at `workspace/workspace/output/` (not `workspace/output/`).
@@ -163,3 +206,4 @@ uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed assert
 - **Output is in files, not stdout**: All results are written to JSON/CSV/Parquet files. Parse the output files, not CLI stdout.
 - **Generation ordering**: `data_global` and `data_linked` depend on `data_local`. `activity_global` depends on `activity_local`. Running dependent types without their prerequisites produces silent empty results.
 - **`data_linked` CLI opt-in**: The CLI excludes `data_linked` by default, but this skill always includes it. If running the CLI manually outside this skill, add `--generation-types data_linked`.
+- **Blob URI format**: Use `blob://<container>/<key>` for config paths. The CLI downloads the config and all sibling files (prompt templates) to a temp directory so relative paths resolve correctly.
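Reviewer sketch (not part of the patch): the option table above says `--account-url` and `--connection-string` fall back to `$AZURE_STORAGE_ACCOUNT_URL` and `$AZURE_STORAGE_CONNECTION_STRING`. One plausible reading of that fallback is sketched below. `resolve_blob_credentials` is a hypothetical name, and two behaviors here are assumptions not confirmed by this diff: explicit options win over environment variables, and a connection string is preferred when both values are present at the same level.

```python
import os
from collections.abc import Mapping
from typing import Optional

URL_VAR = "AZURE_STORAGE_ACCOUNT_URL"
CONN_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def resolve_blob_credentials(
    account_url: Optional[str] = None,
    connection_string: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Pick blob credentials: explicit options first, then environment variables.

    Hypothetical helper; the precedence order is an assumption.
    """
    candidates = (
        (account_url, connection_string),        # values passed on the CLI
        (env.get(URL_VAR), env.get(CONN_VAR)),   # environment fallback
    )
    for url, conn in candidates:
        if conn:
            return {"connection_string": conn}
        if url:
            return {"account_url": url}
    raise ValueError(
        f"no blob credentials: pass --account-url/--connection-string "
        f"or set {URL_VAR}/{CONN_VAR}"
    )
```

Passing `env` as a parameter keeps the sketch testable without touching the real process environment.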
diff --git a/.apm/skills/benchmark-qed-setup/SKILL.md b/.apm/skills/benchmark-qed-setup/SKILL.md
@@ -70,6 +70,26 @@ Example:
 uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed config init autoq ./my_workspace
 ```
 
+**Storage options** for `config init`:
+| Option | Description |
+|--------|-------------|
+| `--storage-type` / `-s` | `local` (default) or `blob`. When `blob`, storage config sections are scaffolded as active YAML (not commented out). |
+| `--container-name` | Pre-fill the blob container name in generated storage config. |
+| `--account-url` | Pre-fill the account URL (managed-identity auth) in generated storage config. |
+| `--connection-string` | Pre-fill the connection string in generated storage config. |
+| `--base-dir` | Pre-fill a base prefix path within the container. |
+
+When `--storage-type blob` is combined with `--account-url` or `--connection-string`, the generated config and prompt files are also uploaded directly to the blob container.
+
+Example (blob):
+```bash
+uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed config init autoq ./my_workspace \
+  --storage-type blob \
+  --container-name my-datasets \
+  --account-url https://myaccount.blob.core.windows.net \
+  --base-dir experiments/run1
+```
+
 This creates:
 ```
 root/
@@ -95,6 +115,23 @@ echo y | uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-q
 
 **Available datasets**: `AP_news`, `podcast`, `example_answers`
 
+**Storage options** for `data download`:
+| Option | Description |
+|--------|-------------|
+| `--storage-type` | Set to `blob` to upload the dataset to Azure Blob Storage instead of extracting locally. |
+| `--container-name` | The blob container name. |
+| `--account-url` | Azure storage account URL (managed-identity auth). |
+| `--connection-string` | Azure storage connection string (alternative to `--account-url`). |
+| `--base-dir` | Base prefix in blob storage. Files are stored as `{base_dir}/{output_dir}/`. |
+
+Example (download to blob):
+```bash
+echo y | uvx --from "git+https://github.com/microsoft/benchmark-qed" benchmark-qed data download AP_news datasets \
+  --storage-type blob \
+  --container-name my-datasets \
+  --account-url https://myaccount.blob.core.windows.net
+```
+
 ### Step 3 — Gather Configuration Choices from the User
 
 Before writing any values into `settings.yaml`, **prompt the user with `ask_user`** to collect the LLM / auth / endpoint settings. Do not guess — these decisions are environment-specific and getting them wrong wastes downstream LLM calls. Use enum/boolean fields whenever possible so the user picks from a known set rather than typing free-form text.
@@ -138,6 +175,22 @@ Only ask the questions relevant to the chosen `config_type`:
 - `autoe_reference`: `reference.name` + `reference.answer_base_path`, list of `generated`, and `question_sets`.
 - `autoe_assertion`: in single-RAG mode, `generated.name` + `generated.answer_base_path` and `assertions.assertions_path`. In multi-RAG mode (`rag_methods` provided), ask for `input_dir`, `output_dir`, `rag_methods` list, and `question_sets`.
 
+#### Storage fields (all config types, optional)
+
+Ask the user if they want to use Azure Blob Storage for input/output. If yes, collect:
+
+| Field | Type | Notes |
+|-------|------|-------|
+| `use_blob_storage` | boolean | Whether to configure cloud storage. |
+| `storage_container_name` | string | Azure Blob container name (e.g. `my-datasets`). |
+| `storage_auth_method` | enum (`connection_string`, `managed_identity`) | How to authenticate to Azure. |
+| `storage_connection_string_env_var` | string | Env var name for connection string (default: `AZURE_STORAGE_CONNECTION_STRING`). Only when `storage_auth_method=connection_string`. |
+| `storage_account_url` | string (uri) | Storage account URL. Only when `storage_auth_method=managed_identity`. |
+| `storage_base_dir` | string | Optional prefix path within the container. |
+| `separate_output_container` | boolean | Whether output uses a different container than input. |
+
+If storage is enabled, write the appropriate `input.storage`, `input_storage`, and/or `output_storage` blocks into `settings.yaml`.
+
 If the user declines a field, fall back to the documented default and call out the assumption in your response.
 
 ### Step 4 — Apply the Answers
@@ -212,3 +265,6 @@ Key highlights:
 - Config types `autoe_pairwise`, `autoe_reference`, and `autoe_assertion` generate different settings.yaml templates — use the correct type for your evaluation method.
 - Prompts are copied as `.txt` files using Python `string.Template` syntax (`$variable` or `${variable}`).
 - **`prompt_config` key**: The runtime expects `prompt_config` (singular) for all autoe config types. Both `benchmark-qed init` and `config init` now generate the correct key. If you hand-edit YAML, ensure you use `prompt_config`, not `prompts_config`.
+- **`config init --storage-type blob`**: When combined with `--account-url` or `--connection-string`, the command uploads the generated `settings.yaml` and prompt files directly to blob storage. Without those auth options, it only scaffolds the storage YAML sections locally.
+- **Blob URI format**: CLI commands accept `blob://<container>/<key>` for config paths. The CLI downloads the config and all sibling files (prompt templates) to a temp directory so relative paths resolve correctly. Credentials can be passed via `--account-url`/`--connection-string` or the environment variables `AZURE_STORAGE_ACCOUNT_URL`/`AZURE_STORAGE_CONNECTION_STRING`.
+- **Storage config in YAML**: AutoQ uses `input.storage` (nested under `input`) and `output_storage` (top-level). AutoE uses `input_storage` and `output_storage` (both top-level). When storage is omitted, local filesystem is used.
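Reviewer sketch (not part of the patch): the last bullet says AutoQ nests input storage under `input.storage` while AutoE uses a top-level `input_storage`, with `output_storage` top-level in both. The snippet below shows how Step 3 answers could be turned into those blocks on a settings dictionary. `storage_block` and `add_storage_blocks` are hypothetical names; only the keys `type`, `container_name`, `connection_string`, and `account_url` come from the YAML examples in this PR.

```python
from typing import Any, Optional


def storage_block(
    container_name: str,
    *,
    connection_string_env_var: Optional[str] = None,
    account_url: Optional[str] = None,
) -> dict[str, str]:
    """Build one blob storage block (hypothetical helper)."""
    block = {"type": "blob", "container_name": container_name}
    if account_url:
        # Managed-identity auth
        block["account_url"] = account_url
    elif connection_string_env_var:
        # Keep the secret out of the file: write a ${VAR} placeholder instead
        block["connection_string"] = "${" + connection_string_env_var + "}"
    else:
        raise ValueError("need account_url or connection_string_env_var")
    return block


def add_storage_blocks(
    settings: dict[str, Any],
    tool: str,
    input_block: dict[str, str],
    output_block: dict[str, str],
) -> dict[str, Any]:
    """Attach storage blocks where each tool expects them (tool: 'autoq' or 'autoe')."""
    if tool == "autoq":
        settings.setdefault("input", {})["storage"] = input_block
    elif tool == "autoe":
        settings["input_storage"] = input_block
    else:
        raise ValueError(f"unknown tool: {tool!r}")
    settings["output_storage"] = output_block
    return settings


autoq_settings = add_storage_blocks(
    {"input": {"dataset_path": "data/input.csv"}},
    "autoq",
    storage_block("my-datasets", connection_string_env_var="AZURE_STORAGE_CONNECTION_STRING"),
    storage_block("my-output", connection_string_env_var="AZURE_STORAGE_CONNECTION_STRING"),
)
```

Writing a `${...}` placeholder rather than the secret itself mirrors the YAML examples in the skill docs; how the placeholder is expanded at load time is outside this sketch.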
diff --git a/.semversioner/next-release/minor-20260506063439936521.json b/.semversioner/next-release/minor-20260506063439936521.json
@@ -0,0 +1,4 @@
+{
+  "type": "minor",
+  "description": "Add Azure Blob Storage support: blob:// config paths and --account-url/--connection-string CLI options"
+}
diff --git a/benchmark_qed/autoe/cli.py b/benchmark_qed/autoe/cli.py
@@ -41,6 +41,11 @@
 )
 from benchmark_qed.autoe.pairwise import analyze_criteria, get_pairwise_scores
 from benchmark_qed.autoe.reference import get_reference_scores
+from benchmark_qed.cli.config_resolver import (
+    AccountUrlOption,
+    ConnectionStringOption,
+    resolve_config_path,
+)
 from benchmark_qed.cli.utils import print_df
 from benchmark_qed.llm.factory import ModelFactory
 
@@ -148,10 +153,17 @@ def pairwise_scores(
             help="The key in the JSON file that contains the question ID. This is used to match questions across different conditions."
         ),
     ] = "question_id",
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Generate scores for the different conditions provided in the JSON file."""
     if exclude_criteria is None:
         exclude_criteria = []
+    comparison_spec = resolve_config_path(
+        comparison_spec,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     config = load_config(PairwiseConfig, comparison_spec)
 
     config.criteria = [
@@ -282,10 +294,17 @@ def reference_scores(
             help="The key in the JSON file that contains the question ID. This is used to match questions across different conditions."
         ),
     ] = "question_id",
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Generate scores for the generated answers provided in the JSON file."""
     if exclude_criteria is None:
         exclude_criteria = []
+    comparison_spec = resolve_config_path(
+        comparison_spec,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     config = load_config(ReferenceConfig, comparison_spec)
 
     config.criteria = [
@@ -397,6 +416,8 @@ def assertion_scores(
         str,
         typer.Option(help="Assertions key in JSON (single-RAG mode only)."),
     ] = "assertions",
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Score assertions for RAG method(s).
 
@@ -425,6 +446,12 @@ def assertion_scores(
     """
     import yaml
 
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
+
     # Load raw YAML to detect format
     with Path(config_path).open(encoding="utf-8") as f:
         raw_config = yaml.safe_load(f)
@@ -672,6 +699,8 @@ def hierarchical_assertion_scores(
             help="The key in assertions that contains the supporting assertions list."
         ),
     ] = "supporting_assertions",
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Score hierarchical assertions with supporting assertions.
 
@@ -708,6 +737,11 @@ def hierarchical_assertion_scores(
     """
     import yaml
 
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     # Load raw YAML to detect format
     with Path(config_path).open(encoding="utf-8") as f:
         raw_config = yaml.safe_load(f)
@@ -933,6 +967,9 @@ def assertion_significance(
             help="Path to the assertion significance configuration YAML file."
         ),
     ],
+    *,
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Run statistical significance tests on standard assertion scores.
 
@@ -954,6 +991,11 @@ def assertion_significance(
     """
     from benchmark_qed.autoe.assertion import compare_assertion_scores_significance
 
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     config = load_config(AssertionSignificanceConfig, config_path)
 
     rich_print("[bold]Running assertion significance tests[/bold]")
@@ -996,6 +1038,9 @@ def hierarchical_assertion_significance(
             help="Path to the hierarchical assertion significance config YAML."
         ),
     ],
+    *,
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Run statistical significance tests on hierarchical assertion scores.
 
@@ -1018,6 +1063,11 @@ def hierarchical_assertion_significance(
         compare_hierarchical_assertion_scores_significance,
     )
 
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     config = load_config(HierarchicalAssertionSignificanceConfig, config_path)
 
     rich_print("[bold]Running hierarchical assertion significance tests[/bold]")
@@ -1073,6 +1123,8 @@ def generate_retrieval_reference(
         bool,
         typer.Option(help="Whether to print the model usage statistics."),
     ] = False,
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Generate retrieval reference data (cluster relevance) for a question set.
 
@@ -1097,6 +1149,11 @@ def generate_retrieval_reference(
     Otherwise, text units will be loaded from text_units_path and clustered.
     """
     # Run all async work in a single event loop
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     asyncio.run(_generate_retrieval_reference_async(config_path, print_model_usage))
 
 
@@ -1410,6 +1467,8 @@ def retrieval_scores(
         int,
         typer.Option(help="Maximum concurrent relevance assessments."),
     ] = 8,
+    account_url: AccountUrlOption = None,
+    connection_string: ConnectionStringOption = None,
 ) -> None:
     """Evaluate retrieval metrics (precision, recall, fidelity) for RAG methods.
 
@@ -1428,6 +1487,11 @@ def retrieval_scores(
         RationaleRelevanceRater,
     )
 
+    config_path = resolve_config_path(
+        config_path,
+        account_url=account_url,
+        connection_string=connection_string,
+    )
     config = load_config(RetrievalScoresConfig, config_path)
     config.output_dir.mkdir(parents=True, exist_ok=True)