feat(azure): SKU + region + quota pre-flight before terraform apply#1038
feat(azure): SKU + region + quota pre-flight before terraform apply#1038vvnpn-nv wants to merge 9 commits into
Conversation
`osmo-deploy --provider azure` takes 15-25 min and silently fails at the end on SKU/region/quota shortages. Add a fast pre-flight that catches the common cases in seconds: - AKS Kubernetes version GA in the target region. Azure rolls supported versions forward and prunes older ones — the default in terraform.tfvars.example silently goes stale (the previous default `1.31.1` is no longer GA in many regions, which is why the example was just bumped to `1.33.11` in a sibling PR). The preflight reads `TF_K8S_VERSION` and validates against `az aks get-versions`. - Postgres Flexible Server SKU available in the region. Available SKUs vary across regions; this surfaces "no, $TF_POSTGRES_SKU not offered here" before 20 min of apply. Reads `TF_POSTGRES_SKU` (defaults to the hardcoded value in azure_generate_tfvars). - vCPU quota for the SKU families we'll consume (DSv3 family for the AKS system pool; NCadsH100v5 family for the GPU pool when `--gpu-node-pool` is on). Informational only — Azure quota request flow is async + family-row naming has edge cases — but the output gives the operator an immediate "before/after" view. - Managed Redis SKU note. No `az redis-managed list-skus` exists per region today; the preflight warns when redis_sku_name is set to Balanced_B0/B1/B3 (which empirically hit AllocationFailed in capacity-constrained regions per variables.tf comment block). Wiring: new `azure_preflight_sku_quota` function in azure/terraform.sh, called right after the existing `azure_preflight_checks` in deploy-osmo-minimal.sh. Hard-fails on k8s version and Postgres SKU mismatches; warns on quota + Redis. Function is `export -f`'d alongside the other azure_* helpers. Test plan: - `deploy-osmo-minimal.sh --provider azure --non-interactive` against a working region: preflight passes, deploy proceeds. - Set `TF_K8S_VERSION=1.31.1` (or any retired version): preflight hard-fails with a remediation hint pointing at `az aks get-versions`. - Set `TF_POSTGRES_SKU=GP_Standard_D2s_v9001`: preflight hard-fails with a remediation hint pointing at `az postgres flexible-server list-skus`. - Set `TF_REDIS_SKU_NAME=Balanced_B0`: preflight warns but proceeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds exported Azure preflight functions: ChangesAzure Preflight SKU and Quota Validation
Sequence Diagram(s)sequenceDiagram
participant DeployScript as deploy-osmo-minimal.sh
participant TerraformScript as deployments/scripts/azure/terraform.sh
participant AzureCLI as az
DeployScript->>TerraformScript: call azure_preflight_sku_quota()
TerraformScript->>AzureCLI: az aks get-versions (validate TF_K8S_VERSION)
TerraformScript->>AzureCLI: az postgres flexible-server list-skus (validate Postgres SKU)
TerraformScript->>AzureCLI: az vm list-skus / az vm list-usage (validate VM SKU, quotas)
TerraformScript->>TerraformScript: call _azure_check_vcpu_quota() to compute available vCPUs
AzureCLI-->>TerraformScript: returns versions/skus/quotas
TerraformScript-->>DeployScript: exit 1 on mismatch / log warnings/info
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 605-612: The quota lookup can run against the wrong subscription
and the GPU family is hard-coded; update the usage query in the block that
builds family_filter and runs az vm list-usage so it (1) uses the selected
subscription (either call az account set with TF_SUBSCRIPTION_ID before the
query or add --subscription "$TF_SUBSCRIPTION_ID" to the az vm list-usage
invocation) and (2) derives the GPU family substring from TF_GPU_VM_SIZE when
gpu_enabled is true (replace the hard-coded 'NCadsH100v5' in family_filter with
the family extracted from TF_GPU_VM_SIZE, e.g., strip size/sku suffix to match
name.value contains that family). Ensure you still fall back to the original
standardDSv3 filter and only append the derived GPU family when gpu_enabled ==
"true".
In `@deployments/scripts/deploy-osmo-minimal.sh`:
- Around line 553-556: The azure_preflight_sku_quota and azure_preflight_checks
calls are being run too early (before handle_configuration and before deciding
--destroy/--skip-terraform), so move those calls out of the initial "azure)"
case and invoke azure_preflight_checks and azure_preflight_sku_quota only in the
Terraform apply path after handle_configuration has run and after confirming
we're not in --destroy or --skip-terraform mode; update the script locations
where these functions are called (current occurrences around the azure case and
the later block at lines ~926-930) to run post-configuration and gated by the
apply-only condition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 538dfc91-3a0e-4621-8e97-be31f8c2a638
📒 Files selected for processing (2)
deployments/scripts/azure/terraform.shdeployments/scripts/deploy-osmo-minimal.sh
…nsistency
Blocker (confidence 92): The Postgres SKU check used
`--query "[].sku"` which doesn't exist in the API response. The actual
JSON shape is:
[
{
"name": "FlexibleServerCapabilities",
"supportedServerEditions": [
{ "name": "GeneralPurpose",
"supportedServerSkus": [{"name": "Standard_D2s_v3"}, ...] },
{ "name": "MemoryOptimized", ... },
{ "name": "Burstable", ... },
]
}
]
Also: `TF_POSTGRES_SKU=GP_Standard_D2s_v3` follows the Terraform
azurerm convention where the tier prefix (GP_/MO_/B_) maps to an
edition (`GeneralPurpose`/`MemoryOptimized`/`Burstable`). The raw
`supportedServerSkus[].name` list does NOT carry the prefix; you have
to strip it before comparing.
Fix:
- Strip TF tier prefix (GP_/MO_/B_) from $postgres_sku before lookup.
- Switch JMESPath to
`[].supportedServerEditions[].supportedServerSkus[].name` (the real
path to the SKU name list).
- Verified locally:
- GP_Standard_D2s_v3 (→ Standard_D2s_v3) matches in eastus2 ✓
- GP_Standard_D2s_v9001 (→ Standard_D2s_v9001) does not match ✓
Important (confidence 85): Switch `return 1` → `exit 1` on both
hard-fail paths to match the sibling `azure_preflight_checks` style
and avoid the conditional-call footgun (where `return 1` doesn't
abort if the function were ever called from `if`/`||`/`&&`).
Suggestion (confidence 82): Validate `TF_K8S_VERSION` matches
`^[0-9]+\.[0-9]+\.[0-9]+$` before interpolating into the JMESPath
query. Defense-in-depth against shell-injection / JMESPath parse
errors if a user sets the env to something unexpected.
All three from the PR review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 610-611: The availability check uses grep -qx which treats
postgres_sku_name as a regex and can mis-match SKUs with regex metacharacters;
update the check in terraform.sh that references postgres_sku and
postgres_sku_name to perform fixed-string, whole-line matching (e.g., add the -F
flag to grep or otherwise use a fixed-string match like grep -F -x -q) so the
value of postgres_sku_name is matched literally and not interpreted as a regex.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c45eb1e9-b86f-4998-a54e-b1277f005eb5
📒 Files selected for processing (1)
deployments/scripts/azure/terraform.sh
Three CodeRabbit comments from the latest review: Major: vCPU quota lookup didn't use TF_SUBSCRIPTION_ID and hardcoded the GPU family as 'NCadsH100v5'. Fix: - Declare `sub_args=(--subscription "$TF_SUBSCRIPTION_ID")` once at the top of the function (when the env is set), apply consistently to all three `az` calls (k8s, Postgres, vm list-usage) so the preflight runs against the same subscription the deploy will use. - Derive the GPU family substring from TF_GPU_VM_SIZE via sed (strip 'Standard_' and leading numeric size, drop underscores). Works for the common Azure naming convention (NCadsH100v5 from Standard_NC40ads_H100_v5, NVadsA10v5 from Standard_NV36ads_A10_v5, etc.). Best-effort — exotic SKUs may produce no match, but the GPU vCPU row is informational only, the deploy still proceeds. Major: azure_preflight_sku_quota ran inside preflight_checks (line 555), before handle_configuration resolves the user-visible config and before the --destroy / --skip-terraform branching. That meant: - It validated stale defaults instead of the user's actual config. - It fired on flows that aren't applying Terraform (destroy, skip-tf). Fix: leave azure_preflight_checks where it is (cheap cluster/auth checks), but move the SKU+quota preflight into the TF-apply branch in main() (lines 926-930), guarded by `[[ $SKIP_TERRAFORM == false ]]` and running right after `handle_configuration` so the resolved config is what's validated. Minor: `grep -qx "$postgres_sku_name"` was regex mode — if the SKU name ever contained regex metacharacters (e.g. `.`, `*`), it could false-positive against an unintended SKU. Fix: `grep -Fqx` for fixed-string literal matching. Comment added explaining why. All three CodeRabbit comments addressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PU quota math) Per reviewer follow-up: the preflight was only checking k8s version and Postgres SKU as hard-fails. The most common AKS capacity issues (`Standard_NC40ads_H100_v5` not offered in region, GPU vCPU quota exhausted) silently passed preflight and only surfaced ~5 min into TF apply at the node-pool creation step. That's the exact failure mode that bit us on a real deploy in swedencentral — quota was 160 vCPUs (= max 4 H100 nodes) but gpu_max was set to 10. Preflight would have caught it. Extends azure_preflight_sku_quota with hard-fail checks for: 1. **AKS node-pool VM SKU availability in region** (`az vm list-skus`). 2. **AKS GPU-pool VM SKU availability in region** (when --gpu-node-pool is enabled). 3. **vCPU quota math for the node pool family**: read vCPUs/node from `az vm list-skus --query capabilities[?name=='vCPUs']`, multiply by `TF_NODE_GROUP_MAX_SIZE` (default 5), compare against the available quota (`limit - used`) in the family's row from `az vm list-usage`. 4. **vCPU quota math for the GPU pool family**: same as #3 with `TF_GPU_VM_SIZE` and `TF_GPU_COUNT`. Skipped when gpu_max=0 (autoscaler disabled at provision time). New helper `_azure_check_vcpu_quota` factors the "need vs available" math out of both checks. Handles three Azure API edge cases: - No row matching the family substring → warn + skip (Azure may not have published quota for that family yet). - Row has malformed limit/used values → warn + skip. - Otherwise: compare `need` vs `(limit - used)`; hard-fail with a remediation pointer to `Azure Portal → Subscriptions → Usage + quotas` if requested vCPUs exceed available. Family substring derivation (used by all checks) is the same algorithm introduced in 485a275 for the informational table: strip `Standard_` prefix, strip leading numeric size letters (NC40 → NC), drop underscores. Works for NCadsH100v5, NVadsA10v5, NDA100v4, DDSv5, DSv3, etc. What's intentionally NOT checked (deferred): - Postgres / Managed Redis quota — they use separate provider quota families (`Microsoft.DBforPostgreSQL`, `Microsoft.Cache`); both usually generous and there's no clean `az` CLI for them. Postgres SKU regional availability is already hard-checked above; Redis SKU is warn-only per variables.tf rationale. - Storage account count quota — usually not the binding constraint; could add if it bites someone. Test plan: - bash -n: syntax OK. - Smoke against real eastus2 quota row planned (see PR description). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deployments/scripts/azure/terraform.sh (2)
611-617:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the preflight inputs aligned with the generated tfvars.
azure_preflight_sku_quota()readsTF_POSTGRES_SKU,TF_NODE_INSTANCE_TYPE, andTF_NODE_GROUP_MAX_SIZE, but Line 457-460 and Line 470 still hardcode the Terraform values. Any override here can make the preflight pass or fail against a config thatterraform applynever uses.Suggested patch
- node_instance_type = "Standard_D2s_v3" + node_instance_type = "${TF_NODE_INSTANCE_TYPE:-Standard_D2s_v3}" ... - node_group_max_size = 5 + node_group_max_size = ${TF_NODE_GROUP_MAX_SIZE:-5} ... - postgres_sku_name = "GP_Standard_D2s_v3" + postgres_sku_name = "${TF_POSTGRES_SKU:-GP_Standard_D2s_v3}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/scripts/azure/terraform.sh` around lines 611 - 617, The preflight currently hardcodes Terraform SKU/size values while azure_generate_tfvars uses environment overrides; update azure_preflight_sku_quota to read the same env vars so preflight matches the generated tfvars: use TF_POSTGRES_SKU (map to postgres_sku), TF_REDIS_SKU_NAME (redis_sku), TF_NODE_INSTANCE_TYPE (node_sku) and TF_NODE_GROUP_MAX_SIZE (node_max) instead of the hardcoded Terraform defaults, ensuring the logic in azure_preflight_sku_quota mirrors azure_generate_tfvars's variable resolution.
642-643:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix AKS version guard to exclude preview Kubernetes versions
The check at lines 642-643 only verifies that
$k8s_versionappears inaz aks get-versions; it doesn’t exclude preview releases.az aks get-versionscan return preview versions, so a preview build can pass this guard.if ! az aks get-versions -l "$region" ${sub_args[@]+"${sub_args[@]}"} \ --query "values[?version=='$k8s_version']" -o tsv 2>/dev/null | grep -q .; thenUpdate the
--queryto match GA/non-preview entries by filtering onisPreview(i.e., requireisPreview == false, and treatisPreview == nullas non-preview if that’s how GA is represented).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/scripts/azure/terraform.sh` around lines 642 - 643, The AKS version guard using the az aks get-versions call (command using variables region, sub_args and k8s_version) must be tightened to exclude preview releases; update the --query expression in that az aks get-versions invocation so it selects entries where version == k8s_version AND (isPreview == false OR isPreview == null) before checking output with grep, ensuring preview builds are not accepted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 692-699: The preflight uses a sed-derived node_family/gpu_family
which can mismatch Azure's reported family and bypass _azure_check_vcpu_quota;
call azure_describe_vm_sku for the SKU (e.g., use azure_describe_vm_sku
"$node_sku" and capture the returned .family) and replace the sed-derived
node_family (and likewise gpu_family) with that returned family before invoking
_azure_check_vcpu_quota and before printing the az vm list-usage informational
table; update azure_preflight_sku_quota to reuse azure_describe_vm_sku's family
output for both quota check and display so the contains(name.value, '$family')
filter matches Azure's actual data.
---
Outside diff comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 611-617: The preflight currently hardcodes Terraform SKU/size
values while azure_generate_tfvars uses environment overrides; update
azure_preflight_sku_quota to read the same env vars so preflight matches the
generated tfvars: use TF_POSTGRES_SKU (map to postgres_sku), TF_REDIS_SKU_NAME
(redis_sku), TF_NODE_INSTANCE_TYPE (node_sku) and TF_NODE_GROUP_MAX_SIZE
(node_max) instead of the hardcoded Terraform defaults, ensuring the logic in
azure_preflight_sku_quota mirrors azure_generate_tfvars's variable resolution.
- Around line 642-643: The AKS version guard using the az aks get-versions call
(command using variables region, sub_args and k8s_version) must be tightened to
exclude preview releases; update the --query expression in that az aks
get-versions invocation so it selects entries where version == k8s_version AND
(isPreview == false OR isPreview == null) before checking output with grep,
ensuring preview builds are not accepted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 03c5b564-dbea-4ee1-acb5-de4a4a566eab
📒 Files selected for processing (1)
deployments/scripts/azure/terraform.sh
CodeRabbit on commit 8d74611: `azure_preflight_sku_quota` derived `node_family` / `gpu_family` via `sed` from the SKU name, but `_azure_check_vcpu_quota` enforces quotas by filtering `az vm list-usage` with `contains(name.value, '$family')`. When the sed-derived substring didn't match Azure's actual `name.value` value, the helper logged a warning and returned 0 — silently skipping the hard quota math we just added. The existing `azure_describe_vm_sku()` helper already issues a single `az vm list-skus --size <sku>` and returns Azure's authoritative `family` field (e.g. `standardDSv3Family`, `standardNCadsH100v5Family`) along with vCPUs/node. Switch both pools (node + GPU) to use it — one az call per pool instead of two, and the returned family is guaranteed to match `name.value` in vm list-usage. Also guards the informational table: when azure_describe_vm_sku returns no family (unknown / typo'd SKU), the `family_filter` block now skips printing rather than emitting `contains(name.value, '')` which would match every row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 697-699: The azure_describe_vm_sku lookup currently queries all
SKUs and can return the wrong regional variant; change azure_describe_vm_sku to
accept and use region and subscription args (pass $region and ${sub_args[@]}
into the function) and update its az call to include -l "$region" and any
${sub_args...}, then filter the JSON result by exact SKU name (e.g., select
where name == "$sku") before extracting family and vCPU fields; finally update
the calls that set node_family/node_vcpus and gpu_family/gpu_vcpus (the read -r
lines around the node/gpu sections) to pass the region and sub_args into
azure_describe_vm_sku so _azure_check_vcpu_quota uses the region-specific SKU.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 95ae2c33-a1c1-4f02-9c16-b112b19f0104
📒 Files selected for processing (1)
deployments/scripts/azure/terraform.sh
….version
Bug surfaced when validating against swedencentral, but the issue is
universal across regions: `az aks get-versions`'s `values[].version`
field is principal-only ("1.33", "1.36", ...) in every region tested.
Full patch versions ("1.33.11", "1.33.10", ...) live as KEYS inside
each entry's `patchVersions` object.
The previous query, `values[?version=='1.33.11']`, was therefore
guaranteed to return empty everywhere — the preflight would have
hard-failed on a perfectly valid k8s version in any region, blocking
TF apply for no reason.
Switch to `values[].patchVersions.keys(@) | []` which flattens every
patch-version key across all principal entries into a single tsv
column, then exact-match TF_K8S_VERSION against that list with
`grep -Fqx`. Verified locally against both eastus2 and swedencentral:
1.33.11 now matches correctly in both.
Update the remediation hint in the error message to point at the
correct query so an operator hitting a future failure can reproduce
the lookup with a copy-paste command.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop WHAT-comments, historical references (sed approach, prior bugs), verbose enumerations the code itself reveals, and step-numbering that broke (1,2,3,4,5,4) after additions. Keep only the WHY-comments that encode non-obvious Azure API quirks a future reader would need: - patchVersions vs values[].version (regional API shape) - TF azurerm tier-prefix vs supportedServerEditions mapping - grep -F (SKU names contain `.`) - empty family_filter guard (avoid contains(name.value, '') matching all rows) - azure_describe_vm_sku family vs name.value matching Net: 65 lines removed across the two files; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…max name
Two latent bugs that surfaced during a real swedencentral deploy:
1. tfvars heredoc hardcoded `availability_zones = ["1", "2"]`. Works in
eastus2 (Standard_NC40ads_H100_v5 zones are 1,2,3) but breaks in
swedencentral (zones are 1,3) — TF apply hard-failed with
AvailabilityZoneNotSupported on the GPU pool. Both AKS pools share
`var.availability_zones` in TF so the GPU SKU's zone set must
intersect with the node SKU's.
2. tfvars heredoc wrote `gpu_min` / `gpu_max`, but variables.tf declares
them as `gpu_node_pool_min_size` / `gpu_node_pool_max_size`. TF
ignored the unknown vars (warning only), the GPU pool fell back to
its hardcoded TF defaults (1/3), and the preflight's `gpu_max ×
vCPUs/node` quota math was applied to a number TF never read.
Fixes:
- New `_azure_sku_zones_json` + `_azure_resolve_pool_zones` helpers
query `az vm list-skus` and compute zone intersection.
- `azure_generate_tfvars` now writes the intersection (with
`TF_AVAILABILITY_ZONES` env override for hand-picking).
- Heredoc now uses the correct `gpu_node_pool_{min,max}_size` names.
- `azure_preflight_sku_quota` hard-fails on empty intersection with
the two SKUs' actual zones in the error so the operator can pick
a different region.
Companion SKILL.md edit (skills/osmo-deploy/SKILL.md):
- Promoted "Required user inputs" to section #2 ("🛑 STOP — gather
inputs before invoking the script") so an agent reading top-down
hits it before any invocation example.
- Restructured Provider → GPU prompts → per-provider prompts.
- Fixed broken numbering where Azure-only restarted at 4 instead of 5.
- Reconciled flag/env wording (`--no-gpu` exists, `--gpu-count` does
not — explicitly noted) and added anchor to Common invocations.
Verified locally against eastus2 and swedencentral:
swedencentral / D2s_v3 + H100 → ["1", "3"]
eastus2 / D2s_v3 + H100 → ["1", "2", "3"]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously --find-gpu-region was query-only — it printed the chosen region and exited unconditionally. Agent flows had to capture stdout, re-invoke the script with TF_REGION=<that>, and replicate every other flag. Bad UX and a common source of "I ran it but nothing happened." Now: standalone `--find-gpu-region <sku> <count>` (no --provider) still prints the region and exits (backward-compatible query mode). But combined with `--provider azure --gpu-node-pool ...` the script exports the resolved region into TF_REGION and continues with the deploy in the same invocation. Companion SKILL.md update documents the combined-invocation flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
osmo-deploy --provider azuretakes 15–25 min for a fresh apply, and the most common failure modes (SKU not available in region, k8s version retired, GPU vCPU quota exhausted) only surface deep intoterraform apply. This PR adds a fast pre-flight that catches those in seconds, before any TF resource is created.New
azure_preflight_sku_quotafunction indeployments/scripts/azure/terraform.sh+ a_azure_check_vcpu_quotahelper. Called fromdeploy-osmo-minimal.shin the TF-apply branch, gated on[[ $SKIP_TERRAFORM == false ]], afterhandle_configurationresolves the user's actual config (so we validate what's about to be applied, not stale defaults).Checks
TF_K8S_VERSIONformat^[0-9]+\.[0-9]+\.[0-9]+$azcall — defense-in-depth against shell-injection / JMESPath parse errorsaz aks get-versionshintaz aks get-versions -l <region>az postgres flexible-server list-skus -l <region>— tier prefix (GP_/MO_/B_) stripped before lookup;grep -Ffixed-string matchingaz vm list-skus -l <region> --query "[?name=='$node_sku']"--gpu-node-pool)(node_max × vCPUs_per_node) ≤ available_azure_check_vcpu_quota— family + vCPUs/node fromazure_describe_vm_sku, available fromaz vm list-usagelimit − usedgpu_max > 0)(gpu_max × vCPUs_per_node) ≤ availableaz vm list-usage -l <region>ComputeOptimized_X3variables.tf:222-227empirical-allocation comment blockAll
azcalls run with--subscription "$TF_SUBSCRIPTION_ID"when set, so the preflight queries the same subscription the deploy will use.Why preflight runs inside the TF-apply branch (not in
preflight_checks)Per reviewer feedback: SKU/quota validation has to happen after
handle_configurationresolves the user's actual config (otherwise it validates stale defaults), and only when we're actually about to apply Terraform (running it on--destroy/--skip-terraformis moot + noisy). Function is wired intodeploy-osmo-minimal.shnext torun_terraform_init/run_terraform_apply, behind the existing[[ $SKIP_TERRAFORM == false ]]gate.vCPU quota math
_azure_check_vcpu_quota:[limit, currentValue]from the firstaz vm list-usagerow whosename.valuecontains the family substring.available = limit - used.requested > availablewith a clear remediation hint pointing atAzure Portal → Subscriptions → Usage + quotas.Family + vCPUs/node come from the existing
azure_describe_vm_sku()helper — a singleaz vm list-skus --size <sku>returning Azure's authoritativefamilyfield (e.g.standardDSv3Family,standardNCadsH100v5Family). Earlier revisions used a sed-derived family substring; reviewer flagged that when the derived substring didn't match Azure's actualname.value,_azure_check_vcpu_quotasilently skipped the math. Using the helper guarantees thecontains(name.value, '$family')filter matches.Graceful handling for Azure API edge cases:
azure_describe_vm_skureturns empty family / vCPUs → warn + skip quota math for that pool.az vm list-usagereturning no row for the family → warn + skip.contains(name.value, '')which would match every row).What's intentionally NOT checked (deferred)
Microsoft.DBforPostgreSQL,Microsoft.Cache); both are usually generous and there's no cleanazCLI to query per-region availability. SKU regional availability is already hard-checked above; Redis SKU is warn-only per thevariables.tf:222-227empirical-allocation rationale.Files
deployments/scripts/azure/terraform.sh— newazure_preflight_sku_quota+_azure_check_vcpu_quotahelper +export -ffor bothdeployments/scripts/deploy-osmo-minimal.sh— call site in the TF-apply branchNet diff: 2 files, ~250 LOC across 5 commits.
Issue #None
Test plan
bash -non terraform.sh: syntax OK.az postgres flexible-server list-skusJSON shape: valid SKUs match, invalid SKUs fail._azure_check_vcpu_quotamath hand-verified:(5 × 2 vCPUs/D2s_v3) = 10 vCPUsshould pass against eastus2 DSv3 quota.deploy-osmo-minimal.sh --provider azure --non-interactivewith realistic config: preflight passes afterhandle_configurationruns, deploy proceeds.TF_K8S_VERSION=1.27.0 ...: hard-fails before TF runs.TF_K8S_VERSION='1.33; rm -rf'(malformed): hard-fails on format validation.TF_POSTGRES_SKU=GP_Standard_D2s_v9001 ...: hard-fails on availability.TF_NODE_INSTANCE_TYPE=Standard_Bogus_v9001 ...: hard-fails on SKU availability.TF_NODE_GROUP_MAX_SIZE=999999 ...: hard-fails on vCPU quota math.--gpu-node-pool TF_GPU_COUNT=10 TF_GPU_VM_SIZE=Standard_NC40ads_H100_v5 ...in a region with NC-family quota < 400 vCPUs: hard-fails on GPU quota math.--gpu-node-pool TF_GPU_VM_SIZE=Standard_Bogus_GPU_v1 ...: hard-fails on GPU SKU availability.TF_REDIS_SKU_NAME=Balanced_B0 ...: preflight warns but proceeds.TF_SUBSCRIPTION_ID=<sub-A> ...with defaultaz accounton<sub-B>: quota math/availability checks run against sub-A.--destroy: preflight quota check does NOT run.--skip-terraform: preflight quota check does NOT run.Checklist
azCLI; an integration test against a live Azure subscription would be most meaningful. Hand-tested manually per the test plan above. Postgres SKU JSON shape was verified against the actualaz postgres flexible-server list-skus -l eastus2output. Open to suggestions on how to add this to the existing test scaffolding.)🤖 Generated with Claude Code