Skip to content

feat(azure): SKU + region + quota pre-flight before terraform apply#1038

Open
vvnpn-nv wants to merge 9 commits into
mainfrom
vivianp/azure-preflight-sku-quota
Open

feat(azure): SKU + region + quota pre-flight before terraform apply#1038
vvnpn-nv wants to merge 9 commits into
mainfrom
vivianp/azure-preflight-sku-quota

Conversation

@vvnpn-nv

@vvnpn-nv vvnpn-nv commented May 22, 2026

Copy link
Copy Markdown
Contributor

Description

osmo-deploy --provider azure takes 15–25 min for a fresh apply, and the most common failure modes (SKU not available in region, k8s version retired, GPU vCPU quota exhausted) only surface deep into terraform apply. This PR adds a fast pre-flight that catches those in seconds, before any TF resource is created.

New azure_preflight_sku_quota function in deployments/scripts/azure/terraform.sh + a _azure_check_vcpu_quota helper. Called from deploy-osmo-minimal.sh in the TF-apply branch, gated on [[ $SKIP_TERRAFORM == false ]], after handle_configuration resolves the user's actual config (so we validate what's about to be applied, not stale defaults).

Checks

Resource Check Outcome Source
TF_K8S_VERSION format matches ^[0-9]+\.[0-9]+\.[0-9]+$ Hard-fail before any az call — defense-in-depth against shell-injection / JMESPath parse errors regex
AKS k8s version GA in region Hard-fail with az aks get-versions hint az aks get-versions -l <region>
Postgres Flexible Server SKU offered in region Hard-fail az postgres flexible-server list-skus -l <region> — tier prefix (GP_/MO_/B_) stripped before lookup; grep -F fixed-string matching
AKS node-pool VM SKU offered in region Hard-fail az vm list-skus -l <region> --query "[?name=='$node_sku']"
AKS GPU-pool VM SKU (when --gpu-node-pool) offered in region Hard-fail same
AKS node-pool vCPU quota (node_max × vCPUs_per_node) ≤ available Hard-fail with quota request hint _azure_check_vcpu_quota — family + vCPUs/node from azure_describe_vm_sku, available from az vm list-usage limit − used
GPU pool vCPU quota (when gpu_max > 0) (gpu_max × vCPUs_per_node) ≤ available Hard-fail with quota request hint same helper — this is the check that would have caught the swedencentral 160-vCPU H100 quota issue (gpu_max=10 → 400 vCPUs needed → fails in seconds instead of ~5 min into TF apply)
vCPU usage informational table covers node + GPU families Print az vm list-usage -l <region>
Managed Redis SKU not on Balanced_B0/B1/B3 Warn + recommend ComputeOptimized_X3 variables.tf:222-227 empirical-allocation comment block

All az calls run with --subscription "$TF_SUBSCRIPTION_ID" when set, so the preflight queries the same subscription the deploy will use.

Why preflight runs inside the TF-apply branch (not in preflight_checks)

Per reviewer feedback: SKU/quota validation has to happen after handle_configuration resolves the user's actual config (otherwise it validates stale defaults), and only when we're actually about to apply Terraform (running it on --destroy / --skip-terraform is moot + noisy). Function is wired into deploy-osmo-minimal.sh next to run_terraform_init/run_terraform_apply, behind the existing [[ $SKIP_TERRAFORM == false ]] gate.

vCPU quota math

_azure_check_vcpu_quota:

  1. Projects [limit, currentValue] from the first az vm list-usage row whose name.value contains the family substring.
  2. Computes available = limit - used.
  3. Hard-fails if requested > available with a clear remediation hint pointing at Azure Portal → Subscriptions → Usage + quotas.

Family + vCPUs/node come from the existing azure_describe_vm_sku() helper — a single az vm list-skus --size <sku> returning Azure's authoritative family field (e.g. standardDSv3Family, standardNCadsH100v5Family). Earlier revisions used a sed-derived family substring; reviewer flagged that when the derived substring didn't match Azure's actual name.value, _azure_check_vcpu_quota silently skipped the math. Using the helper guarantees the contains(name.value, '$family') filter matches.

Graceful handling for Azure API edge cases:

  • azure_describe_vm_sku returns empty family / vCPUs → warn + skip quota math for that pool.
  • az vm list-usage returning no row for the family → warn + skip.
  • The informational table also skips emitting when both families resolved empty (avoids contains(name.value, '') which would match every row).

What's intentionally NOT checked (deferred)

  • Postgres / Managed Redis CPU/storage quota — both use separate provider quota families (Microsoft.DBforPostgreSQL, Microsoft.Cache); both are usually generous and there's no clean az CLI to query per-region availability. SKU regional availability is already hard-checked above; Redis SKU is warn-only per the variables.tf:222-227 empirical-allocation rationale.
  • Storage account count quota — usually not the binding constraint; could add later if it bites.

Files

  • deployments/scripts/azure/terraform.sh — new azure_preflight_sku_quota + _azure_check_vcpu_quota helper + export -f for both
  • deployments/scripts/deploy-osmo-minimal.sh — call site in the TF-apply branch

Net diff: 2 files, ~250 LOC across 5 commits.

Issue #None

Test plan

  • bash -n on terraform.sh: syntax OK.
  • Postgres SKU lookup verified locally against actual az postgres flexible-server list-skus JSON shape: valid SKUs match, invalid SKUs fail.
  • _azure_check_vcpu_quota math hand-verified: (5 × 2 vCPUs/D2s_v3) = 10 vCPUs should pass against eastus2 DSv3 quota.
  • End-to-end deploy-osmo-minimal.sh --provider azure --non-interactive with realistic config: preflight passes after handle_configuration runs, deploy proceeds.
  • TF_K8S_VERSION=1.27.0 ...: hard-fails before TF runs.
  • TF_K8S_VERSION='1.33; rm -rf' (malformed): hard-fails on format validation.
  • TF_POSTGRES_SKU=GP_Standard_D2s_v9001 ...: hard-fails on availability.
  • TF_NODE_INSTANCE_TYPE=Standard_Bogus_v9001 ...: hard-fails on SKU availability.
  • TF_NODE_GROUP_MAX_SIZE=999999 ...: hard-fails on vCPU quota math.
  • --gpu-node-pool TF_GPU_COUNT=10 TF_GPU_VM_SIZE=Standard_NC40ads_H100_v5 ... in a region with NC-family quota < 400 vCPUs: hard-fails on GPU quota math.
  • --gpu-node-pool TF_GPU_VM_SIZE=Standard_Bogus_GPU_v1 ...: hard-fails on GPU SKU availability.
  • TF_REDIS_SKU_NAME=Balanced_B0 ...: preflight warns but proceeds.
  • TF_SUBSCRIPTION_ID=<sub-A> ... with default az account on <sub-B>: quota math/availability checks run against sub-A.
  • --destroy: preflight quota check does NOT run.
  • --skip-terraform: preflight quota check does NOT run.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes. (Bash preflight calling out to az CLI; an integration test against a live Azure subscription would be most meaningful. Hand-tested manually per the test plan above. Postgres SKU JSON shape was verified against the actual az postgres flexible-server list-skus -l eastus2 output. Open to suggestions on how to add this to the existing test scaffolding.)
  • The documentation is up to date with these changes. (Function-level docstring covers the new behavior; no SKILL.md change needed since preflight runs automatically.)

🤖 Generated with Claude Code

`osmo-deploy --provider azure` takes 15-25 min and silently fails at
the end on SKU/region/quota shortages. Add a fast pre-flight that
catches the common cases in seconds:

- AKS Kubernetes version GA in the target region. Azure rolls
  supported versions forward and prunes older ones — the default in
  terraform.tfvars.example silently goes stale (the previous default
  `1.31.1` is no longer GA in many regions, which is why the example
  was just bumped to `1.33.11` in a sibling PR). The preflight reads
  `TF_K8S_VERSION` and validates against `az aks get-versions`.
- Postgres Flexible Server SKU available in the region. Available SKUs
  vary across regions; this surfaces "no, $TF_POSTGRES_SKU not offered
  here" before 20 min of apply. Reads `TF_POSTGRES_SKU` (defaults to
  the hardcoded value in azure_generate_tfvars).
- vCPU quota for the SKU families we'll consume (DSv3 family for the
  AKS system pool; NCadsH100v5 family for the GPU pool when
  `--gpu-node-pool` is on). Informational only — Azure quota request
  flow is async + family-row naming has edge cases — but the output
  gives the operator an immediate "before/after" view.
- Managed Redis SKU note. No `az redis-managed list-skus` exists per
  region today; the preflight warns when redis_sku_name is set to
  Balanced_B0/B1/B3 (which empirically hit AllocationFailed in
  capacity-constrained regions per variables.tf comment block).

Wiring: new `azure_preflight_sku_quota` function in azure/terraform.sh,
called right after the existing `azure_preflight_checks` in
deploy-osmo-minimal.sh. Hard-fails on k8s version and Postgres SKU
mismatches; warns on quota + Redis. Function is `export -f`'d
alongside the other azure_* helpers.

Test plan:
- `deploy-osmo-minimal.sh --provider azure --non-interactive` against
  a working region: preflight passes, deploy proceeds.
- Set `TF_K8S_VERSION=1.31.1` (or any retired version): preflight
  hard-fails with a remediation hint pointing at `az aks get-versions`.
- Set `TF_POSTGRES_SKU=GP_Standard_D2s_v9001`: preflight hard-fails
  with a remediation hint pointing at `az postgres flexible-server
  list-skus`.
- Set `TF_REDIS_SKU_NAME=Balanced_B0`: preflight warns but proceeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vvnpn-nv vvnpn-nv requested a review from a team as a code owner May 22, 2026 19:35
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds exported Azure preflight functions: _azure_check_vcpu_quota() computes available vCPUs from az vm list-usage; azure_preflight_sku_quota() validates AKS version, PostgreSQL SKU, node/GPU VM SKU availability and vCPU quota, prints usage tables, warns on Redis Balanced tiers, and is invoked from the deploy script before Terraform apply.

Changes

Azure Preflight SKU and Quota Validation

Layer / File(s) Summary
vCPU quota helper
deployments/scripts/azure/terraform.sh
Adds _azure_check_vcpu_quota(region, family, need, pool_label, math_label) which queries az vm list-usage, computes available vCPUs (limit - currentValue), returns non-zero on shortfall.
azure_preflight_sku_quota implementation
deployments/scripts/azure/terraform.sh
New azure_preflight_sku_quota() normalizes TF_REGION, validates AKS TF_K8S_VERSION GA availability and PostgreSQL Flexible Server SKU availability, validates node-pool (and optional GPU-pool) VM SKU availability, enforces vCPU quotas via _azure_check_vcpu_quota, prints az vm list-usage filtered tables, warns for TF_REDIS_SKU_NAME Balanced tiers, and exports functions.
Integration into deployment provisioning and preflight checks
deployments/scripts/deploy-osmo-minimal.sh
Adds comments deferring SKU/quota checks in early preflight; calls azure_preflight_sku_quota after handle_configuration when PROVIDER=azure, before Terraform init/apply.

Sequence Diagram(s)

sequenceDiagram
  participant DeployScript as deploy-osmo-minimal.sh
  participant TerraformScript as deployments/scripts/azure/terraform.sh
  participant AzureCLI as az
  DeployScript->>TerraformScript: call azure_preflight_sku_quota()
  TerraformScript->>AzureCLI: az aks get-versions (validate TF_K8S_VERSION)
  TerraformScript->>AzureCLI: az postgres flexible-server list-skus (validate Postgres SKU)
  TerraformScript->>AzureCLI: az vm list-skus / az vm list-usage (validate VM SKU, quotas)
  TerraformScript->>TerraformScript: call _azure_check_vcpu_quota() to compute available vCPUs
  AzureCLI-->>TerraformScript: returns versions/skus/quotas
  TerraformScript-->>DeployScript: exit 1 on mismatch / log warnings/info
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/OSMO#1031: Related Azure GPU SKU/quota discovery changes and region quota selection logic.

Suggested reviewers

  • cypres
  • RyaliNvidia

Poem

🐰 I hop through scripts at break of dawn,
Counting vCPUs and SKUs with a yawn,
AKS versions checked, Postgres scanned,
GPU quotas tallied by careful hand,
Preflight clears — the cluster's drawn! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically describes the main change: adding Azure SKU, region, and quota pre-flight validation before terraform apply.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch vivianp/azure-preflight-sku-quota

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 605-612: The quota lookup can run against the wrong subscription
and the GPU family is hard-coded; update the usage query in the block that
builds family_filter and runs az vm list-usage so it (1) uses the selected
subscription (either call az account set with TF_SUBSCRIPTION_ID before the
query or add --subscription "$TF_SUBSCRIPTION_ID" to the az vm list-usage
invocation) and (2) derives the GPU family substring from TF_GPU_VM_SIZE when
gpu_enabled is true (replace the hard-coded 'NCadsH100v5' in family_filter with
the family extracted from TF_GPU_VM_SIZE, e.g., strip size/sku suffix to match
name.value contains that family). Ensure you still fall back to the original
standardDSv3 filter and only append the derived GPU family when gpu_enabled ==
"true".

In `@deployments/scripts/deploy-osmo-minimal.sh`:
- Around line 553-556: The azure_preflight_sku_quota and azure_preflight_checks
calls are being run too early (before handle_configuration and before deciding
--destroy/--skip-terraform), so move those calls out of the initial "azure)"
case and invoke azure_preflight_checks and azure_preflight_sku_quota only in the
Terraform apply path after handle_configuration has run and after confirming
we're not in --destroy or --skip-terraform mode; update the script locations
where these functions are called (current occurrences around the azure case and
the later block at lines ~926-930) to run post-configuration and gated by the
apply-only condition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 538dfc91-3a0e-4621-8e97-be31f8c2a638

📥 Commits

Reviewing files that changed from the base of the PR and between 62625f7 and 42521d6.

📒 Files selected for processing (2)
  • deployments/scripts/azure/terraform.sh
  • deployments/scripts/deploy-osmo-minimal.sh

Comment thread deployments/scripts/azure/terraform.sh Outdated
Comment thread deployments/scripts/deploy-osmo-minimal.sh
…nsistency

Blocker (confidence 92): The Postgres SKU check used
`--query "[].sku"` which doesn't exist in the API response. The actual
JSON shape is:

  [
    {
      "name": "FlexibleServerCapabilities",
      "supportedServerEditions": [
        { "name": "GeneralPurpose",
          "supportedServerSkus": [{"name": "Standard_D2s_v3"}, ...] },
        { "name": "MemoryOptimized", ... },
        { "name": "Burstable", ... },
      ]
    }
  ]

Also: `TF_POSTGRES_SKU=GP_Standard_D2s_v3` follows the Terraform
azurerm convention where the tier prefix (GP_/MO_/B_) maps to an
edition (`GeneralPurpose`/`MemoryOptimized`/`Burstable`). The raw
`supportedServerSkus[].name` list does NOT carry the prefix; you have
to strip it before comparing.

Fix:
- Strip TF tier prefix (GP_/MO_/B_) from $postgres_sku before lookup.
- Switch JMESPath to
  `[].supportedServerEditions[].supportedServerSkus[].name` (the real
  path to the SKU name list).
- Verified locally:
  - GP_Standard_D2s_v3 (→ Standard_D2s_v3) matches in eastus2 ✓
  - GP_Standard_D2s_v9001 (→ Standard_D2s_v9001) does not match ✓

Important (confidence 85): Switch `return 1` → `exit 1` on both
hard-fail paths to match the sibling `azure_preflight_checks` style
and avoid the conditional-call footgun (where `return 1` doesn't
abort if the function were ever called from `if`/`||`/`&&`).

Suggestion (confidence 82): Validate `TF_K8S_VERSION` matches
`^[0-9]+\.[0-9]+\.[0-9]+$` before interpolating into the JMESPath
query. Defense-in-depth against shell-injection / JMESPath parse
errors if a user sets the env to something unexpected.

All three from the PR review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 610-611: The availability check uses grep -qx which treats
postgres_sku_name as a regex and can mis-match SKUs with regex metacharacters;
update the check in terraform.sh that references postgres_sku and
postgres_sku_name to perform fixed-string, whole-line matching (e.g., add the -F
flag to grep or otherwise use a fixed-string match like grep -F -x -q) so the
value of postgres_sku_name is matched literally and not interpreted as a regex.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c45eb1e9-b86f-4998-a54e-b1277f005eb5

📥 Commits

Reviewing files that changed from the base of the PR and between 42521d6 and 94b6264.

📒 Files selected for processing (1)
  • deployments/scripts/azure/terraform.sh

Comment thread deployments/scripts/azure/terraform.sh Outdated
vvnpn-nv and others added 2 commits May 22, 2026 13:11
Three CodeRabbit comments from the latest review:

Major: vCPU quota lookup didn't use TF_SUBSCRIPTION_ID and hardcoded the
GPU family as 'NCadsH100v5'. Fix:
- Declare `sub_args=(--subscription "$TF_SUBSCRIPTION_ID")` once at the
  top of the function (when the env is set), apply consistently to all
  three `az` calls (k8s, Postgres, vm list-usage) so the preflight runs
  against the same subscription the deploy will use.
- Derive the GPU family substring from TF_GPU_VM_SIZE via sed
  (strip 'Standard_' and leading numeric size, drop underscores). Works
  for the common Azure naming convention (NCadsH100v5 from
  Standard_NC40ads_H100_v5, NVadsA10v5 from Standard_NV36ads_A10_v5,
  etc.). Best-effort — exotic SKUs may produce no match, but the GPU
  vCPU row is informational only, the deploy still proceeds.

Major: azure_preflight_sku_quota ran inside preflight_checks (line 555),
before handle_configuration resolves the user-visible config and before
the --destroy / --skip-terraform branching. That meant:
- It validated stale defaults instead of the user's actual config.
- It fired on flows that aren't applying Terraform (destroy, skip-tf).

Fix: leave azure_preflight_checks where it is (cheap cluster/auth
checks), but move the SKU+quota preflight into the TF-apply branch in
main() (lines 926-930), guarded by `[[ $SKIP_TERRAFORM == false ]]`
and running right after `handle_configuration` so the resolved config
is what's validated.

Minor: `grep -qx "$postgres_sku_name"` was regex mode — if the SKU
name ever contained regex metacharacters (e.g. `.`, `*`), it could
false-positive against an unintended SKU. Fix: `grep -Fqx` for
fixed-string literal matching. Comment added explaining why.

All three CodeRabbit comments addressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PU quota math)

Per reviewer follow-up: the preflight was only checking k8s version and
Postgres SKU as hard-fails. The most common AKS capacity issues
(`Standard_NC40ads_H100_v5` not offered in region, GPU vCPU quota
exhausted) silently passed preflight and only surfaced ~5 min into TF
apply at the node-pool creation step. That's the exact failure mode
that bit us on a real deploy in swedencentral — quota was 160 vCPUs
(= max 4 H100 nodes) but gpu_max was set to 10. Preflight would have
caught it.

Extends azure_preflight_sku_quota with hard-fail checks for:

1. **AKS node-pool VM SKU availability in region** (`az vm list-skus`).
2. **AKS GPU-pool VM SKU availability in region** (when
   --gpu-node-pool is enabled).
3. **vCPU quota math for the node pool family**: read vCPUs/node from
   `az vm list-skus --query capabilities[?name=='vCPUs']`, multiply by
   `TF_NODE_GROUP_MAX_SIZE` (default 5), compare against the available
   quota (`limit - used`) in the family's row from `az vm list-usage`.
4. **vCPU quota math for the GPU pool family**: same as #3 with
   `TF_GPU_VM_SIZE` and `TF_GPU_COUNT`. Skipped when gpu_max=0
   (autoscaler disabled at provision time).

New helper `_azure_check_vcpu_quota` factors the "need vs available"
math out of both checks. Handles three Azure API edge cases:
- No row matching the family substring → warn + skip (Azure may not
  have published quota for that family yet).
- Row has malformed limit/used values → warn + skip.
- Otherwise: compare `need` vs `(limit - used)`; hard-fail with a
  remediation pointer to `Azure Portal → Subscriptions → Usage + quotas`
  if requested vCPUs exceed available.

Family substring derivation (used by all checks) is the same algorithm
introduced in 485a275 for the informational table: strip `Standard_`
prefix, strip leading numeric size letters (NC40 → NC), drop underscores.
Works for NCadsH100v5, NVadsA10v5, NDA100v4, DDSv5, DSv3, etc.

What's intentionally NOT checked (deferred):
- Postgres / Managed Redis quota — they use separate provider quota
  families (`Microsoft.DBforPostgreSQL`, `Microsoft.Cache`); both
  usually generous and there's no clean `az` CLI for them. Postgres
  SKU regional availability is already hard-checked above; Redis SKU
  is warn-only per variables.tf rationale.
- Storage account count quota — usually not the binding constraint;
  could add if it bites someone.

Test plan:
- bash -n: syntax OK.
- Smoke against real eastus2 quota row planned (see PR description).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deployments/scripts/azure/terraform.sh (2)

611-617: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the preflight inputs aligned with the generated tfvars.

azure_preflight_sku_quota() reads TF_POSTGRES_SKU, TF_NODE_INSTANCE_TYPE, and TF_NODE_GROUP_MAX_SIZE, but Line 457-460 and Line 470 still hardcode the Terraform values. Any override here can make the preflight pass or fail against a config that terraform apply never uses.

Suggested patch
- node_instance_type                  = "Standard_D2s_v3"
+ node_instance_type                  = "${TF_NODE_INSTANCE_TYPE:-Standard_D2s_v3}"
...
- node_group_max_size                 = 5
+ node_group_max_size                 = ${TF_NODE_GROUP_MAX_SIZE:-5}
...
- postgres_sku_name                      = "GP_Standard_D2s_v3"
+ postgres_sku_name                      = "${TF_POSTGRES_SKU:-GP_Standard_D2s_v3}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deployments/scripts/azure/terraform.sh` around lines 611 - 617, The preflight
currently hardcodes Terraform SKU/size values while azure_generate_tfvars uses
environment overrides; update azure_preflight_sku_quota to read the same env
vars so preflight matches the generated tfvars: use TF_POSTGRES_SKU (map to
postgres_sku), TF_REDIS_SKU_NAME (redis_sku), TF_NODE_INSTANCE_TYPE (node_sku)
and TF_NODE_GROUP_MAX_SIZE (node_max) instead of the hardcoded Terraform
defaults, ensuring the logic in azure_preflight_sku_quota mirrors
azure_generate_tfvars's variable resolution.

642-643: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix AKS version guard to exclude preview Kubernetes versions

The check at lines 642-643 only verifies that $k8s_version appears in az aks get-versions; it doesn’t exclude preview releases. az aks get-versions can return preview versions, so a preview build can pass this guard.

if ! az aks get-versions -l "$region" ${sub_args[@]+"${sub_args[@]}"} \
        --query "values[?version=='$k8s_version']" -o tsv 2>/dev/null | grep -q .; then

Update the --query to match GA/non-preview entries by filtering on isPreview (i.e., require isPreview == false, and treat isPreview == null as non-preview if that’s how GA is represented).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deployments/scripts/azure/terraform.sh` around lines 642 - 643, The AKS
version guard using the az aks get-versions call (command using variables
region, sub_args and k8s_version) must be tightened to exclude preview releases;
update the --query expression in that az aks get-versions invocation so it
selects entries where version == k8s_version AND (isPreview == false OR
isPreview == null) before checking output with grep, ensuring preview builds are
not accepted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 692-699: The preflight uses a sed-derived node_family/gpu_family
which can mismatch Azure's reported family and bypass _azure_check_vcpu_quota;
call azure_describe_vm_sku for the SKU (e.g., use azure_describe_vm_sku
"$node_sku" and capture the returned .family) and replace the sed-derived
node_family (and likewise gpu_family) with that returned family before invoking
_azure_check_vcpu_quota and before printing the az vm list-usage informational
table; update azure_preflight_sku_quota to reuse azure_describe_vm_sku's family
output for both quota check and display so the contains(name.value, '$family')
filter matches Azure's actual data.

---

Outside diff comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 611-617: The preflight currently hardcodes Terraform SKU/size
values while azure_generate_tfvars uses environment overrides; update
azure_preflight_sku_quota to read the same env vars so preflight matches the
generated tfvars: use TF_POSTGRES_SKU (map to postgres_sku), TF_REDIS_SKU_NAME
(redis_sku), TF_NODE_INSTANCE_TYPE (node_sku) and TF_NODE_GROUP_MAX_SIZE
(node_max) instead of the hardcoded Terraform defaults, ensuring the logic in
azure_preflight_sku_quota mirrors azure_generate_tfvars's variable resolution.
- Around line 642-643: The AKS version guard using the az aks get-versions call
(command using variables region, sub_args and k8s_version) must be tightened to
exclude preview releases; update the --query expression in that az aks
get-versions invocation so it selects entries where version == k8s_version AND
(isPreview == false OR isPreview == null) before checking output with grep,
ensuring preview builds are not accepted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 03c5b564-dbea-4ee1-acb5-de4a4a566eab

📥 Commits

Reviewing files that changed from the base of the PR and between 485a275 and 8d74611.

📒 Files selected for processing (1)
  • deployments/scripts/azure/terraform.sh

Comment thread deployments/scripts/azure/terraform.sh Outdated
CodeRabbit on commit 8d74611: `azure_preflight_sku_quota` derived
`node_family` / `gpu_family` via `sed` from the SKU name, but
`_azure_check_vcpu_quota` enforces quotas by filtering
`az vm list-usage` with `contains(name.value, '$family')`. When the
sed-derived substring didn't match Azure's actual `name.value` value,
the helper logged a warning and returned 0 — silently skipping the
hard quota math we just added.

The existing `azure_describe_vm_sku()` helper already issues a single
`az vm list-skus --size <sku>` and returns Azure's authoritative
`family` field (e.g. `standardDSv3Family`, `standardNCadsH100v5Family`)
along with vCPUs/node. Switch both pools (node + GPU) to use it —
one az call per pool instead of two, and the returned family is
guaranteed to match `name.value` in vm list-usage.

Also guards the informational table: when azure_describe_vm_sku
returns no family (unknown / typo'd SKU), the `family_filter` block
now skips printing rather than emitting `contains(name.value, '')`
which would match every row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deployments/scripts/azure/terraform.sh`:
- Around line 697-699: The azure_describe_vm_sku lookup currently queries all
SKUs and can return the wrong regional variant; change azure_describe_vm_sku to
accept and use region and subscription args (pass $region and ${sub_args[@]}
into the function) and update its az call to include -l "$region" and any
${sub_args...}, then filter the JSON result by exact SKU name (e.g., select
where name == "$sku") before extracting family and vCPU fields; finally update
the calls that set node_family/node_vcpus and gpu_family/gpu_vcpus (the read -r
lines around the node/gpu sections) to pass the region and sub_args into
azure_describe_vm_sku so _azure_check_vcpu_quota uses the region-specific SKU.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 95ae2c33-a1c1-4f02-9c16-b112b19f0104

📥 Commits

Reviewing files that changed from the base of the PR and between 8d74611 and 33729e3.

📒 Files selected for processing (1)
  • deployments/scripts/azure/terraform.sh

Comment thread deployments/scripts/azure/terraform.sh
vvnpn-nv and others added 4 commits May 22, 2026 14:52
….version

Bug surfaced when validating against swedencentral, but the issue is
universal across regions: `az aks get-versions`'s `values[].version`
field is principal-only ("1.33", "1.36", ...) in every region tested.
Full patch versions ("1.33.11", "1.33.10", ...) live as KEYS inside
each entry's `patchVersions` object.

The previous query, `values[?version=='1.33.11']`, was therefore
guaranteed to return empty everywhere — the preflight would have
hard-failed on a perfectly valid k8s version in any region, blocking
TF apply for no reason.

Switch to `values[].patchVersions.keys(@) | []` which flattens every
patch-version key across all principal entries into a single tsv
column, then exact-match TF_K8S_VERSION against that list with
`grep -Fqx`. Verified locally against both eastus2 and swedencentral:
1.33.11 now matches correctly in both.

Update the remediation hint in the error message to point at the
correct query so an operator hitting a future failure can reproduce
the lookup with a copy-paste command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop WHAT-comments, historical references (sed approach, prior bugs),
verbose enumerations the code itself reveals, and step-numbering that
broke (1,2,3,4,5,4) after additions. Keep only the WHY-comments that
encode non-obvious Azure API quirks a future reader would need:
- patchVersions vs values[].version (regional API shape)
- TF azurerm tier-prefix vs supportedServerEditions mapping
- grep -F (SKU names contain `.`)
- empty family_filter guard (avoid contains(name.value, '') matching all rows)
- azure_describe_vm_sku family vs name.value matching

Net: 65 lines removed across the two files; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…max name

Two latent bugs that surfaced during a real swedencentral deploy:

1. tfvars heredoc hardcoded `availability_zones = ["1", "2"]`. Works in
   eastus2 (Standard_NC40ads_H100_v5 zones are 1,2,3) but breaks in
   swedencentral (zones are 1,3) — TF apply hard-failed with
   AvailabilityZoneNotSupported on the GPU pool. Both AKS pools share
   `var.availability_zones` in TF so the GPU SKU's zone set must
   intersect with the node SKU's.

2. tfvars heredoc wrote `gpu_min` / `gpu_max`, but variables.tf declares
   them as `gpu_node_pool_min_size` / `gpu_node_pool_max_size`. TF
   ignored the unknown vars (warning only), the GPU pool fell back to
   its hardcoded TF defaults (1/3), and the preflight's `gpu_max ×
   vCPUs/node` quota math was applied to a number TF never read.

Fixes:
- New `_azure_sku_zones_json` + `_azure_resolve_pool_zones` helpers
  query `az vm list-skus` and compute zone intersection.
- `azure_generate_tfvars` now writes the intersection (with
  `TF_AVAILABILITY_ZONES` env override for hand-picking).
- Heredoc now uses the correct `gpu_node_pool_{min,max}_size` names.
- `azure_preflight_sku_quota` hard-fails on empty intersection with
  the two SKUs' actual zones in the error so the operator can pick
  a different region.

Companion SKILL.md edit (skills/osmo-deploy/SKILL.md):
- Promoted "Required user inputs" to section #2 ("🛑 STOP — gather
  inputs before invoking the script") so an agent reading top-down
  hits it before any invocation example.
- Restructured Provider → GPU prompts → per-provider prompts.
- Fixed broken numbering where Azure-only restarted at 4 instead of 5.
- Reconciled flag/env wording (`--no-gpu` exists, `--gpu-count` does
  not — explicitly noted) and added anchor to Common invocations.

Verified locally against eastus2 and swedencentral:
  swedencentral / D2s_v3 + H100 → ["1", "3"]
  eastus2       / D2s_v3 + H100 → ["1", "2", "3"]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously --find-gpu-region was query-only — it printed the chosen
region and exited unconditionally. Agent flows had to capture stdout,
re-invoke the script with TF_REGION=<that>, and replicate every other
flag. Bad UX and a common source of "I ran it but nothing happened."

Now: standalone `--find-gpu-region <sku> <count>` (no --provider)
still prints the region and exits (backward-compatible query mode).
But combined with `--provider azure --gpu-node-pool ...` the script
exports the resolved region into TF_REGION and continues with the
deploy in the same invocation.

Companion SKILL.md update documents the combined-invocation flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant