
ClusterMesh scale: Phase 3 +4 — scale tiers + parallel CL2 fan-out + add all scenarios #1168

Draft

skosuri1 wants to merge 50 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2

Conversation


skosuri1 commented May 6, 2026

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

  • 20-node baseline cluster size (spec line 24): current clusters are 3 nodes (2 default + 1 prompool), sized for harness validation, not real scale measurement.
  • Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, and azure-20.tfvars with corresponding pipeline matrix entries. For each tier: validate quota, validate the peering count (N·(N-1) in separate-VNet mode, i.e. 380 at N=20; see the tier-math sketch after this list), tune CL2 timeouts, and document breaking points.
  • Parallel CL2 fan-out: replace the sequential per-cluster CL2 runs with bounded concurrency (default 4; sketched after this list). This requires an async wrapper around utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming that the AzDO agent has CPU/RAM headroom for N concurrent CL2 runs plus Prometheus.
  • etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 = 560 watchers; verify that the Prometheus scrape budget holds.
  • Scaling-curve dashboards from cluster-attributed results (Kusto).
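
The quota, peering, and watcher checks above are closed-form arithmetic. A minimal sketch of the tier math, assuming the 28-watchers-per-cluster figure cited above (the script itself is illustrative, not part of the repo):

```python
# Back-of-envelope tier math for the checks above. WATCHERS_PER_CLUSTER is
# the figure cited in this PR description; everything else follows from N.
WATCHERS_PER_CLUSTER = 28

for n in (5, 10, 20):
    peerings = n * (n - 1)               # directed VNet peerings in separate-VNet mode
    watchers = WATCHERS_PER_CLUSTER * n  # total etcd watchers Prometheus must scrape
    print(f"N={n:2d}: peerings={peerings:3d}, watchers={watchers}")

# N=20 -> peerings=380, watchers=560, matching the numbers above.
```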
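For the parallel fan-out bullet, here is a minimal sketch of bounded concurrency, assuming the existing synchronous runner can be off-loaded to worker threads; the names and signature below are illustrative, not the repo's actual API:

```python
import asyncio

MAX_CONCURRENT_CL2 = 4  # default fan-out bound proposed above


def run_cl2_command(cluster: str) -> int:
    """Stand-in for the synchronous runner in
    modules/python/clusterloader2/utils.py; signature is assumed."""
    raise NotImplementedError


async def run_cl2_fanout(clusters: list[str],
                         limit: int = MAX_CONCURRENT_CL2) -> list[int]:
    sem = asyncio.Semaphore(limit)

    async def one(cluster: str) -> int:
        async with sem:
            # Off-load the blocking CL2 invocation to a worker thread so at
            # most `limit` runs are in flight at once.
            return await asyncio.to_thread(run_cl2_command, cluster)

    return await asyncio.gather(*(one(c) for c in clusters))


# Example: asyncio.run(run_cl2_fanout([f"mesh-{i}" for i in range(20)]))
```

asyncio.to_thread leaves the existing synchronous implementation untouched while the semaphore enforces the fan-out bound, so the agent-headroom question reduces to whether `limit` concurrent CL2 processes plus Prometheus fit on one AzDO agent.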

Out of Scope (deferred to later phases / pre-merge of #1157)

skosuri and others added 30 commits May 6, 2026 13:59
…idn't fix root cause); fix n5 condition syntax
… referenced it but variables.tf didn't declare)
skosuri1 changed the title from "ClusterMesh scale: Phase 3 — scale tiers + parallel CL2 fan-out" to "ClusterMesh scale: Phase 3 +4 — scale tiers + parallel CL2 fan-out + add all scenarios" on May 12, 2026
skosuri added 16 commits May 12, 2026 17:20
…ibe + events) + pre-pull CL2 image to avoid ghcr.io parallel-pull rate-limit
… budget) replaces brittle 5min kubectl wait
…tl jsonpath nested filter is broken; switch to shell-side filter
… + killer fix for HA + clustermesh-apiserver pod resource metrics; restore 4-scenario share-infra matrix
… pod-churn loop, peer sleep-observes; collect target-aware churn knobs; restore 5-scenario share-infra matrix
…current==mesh_size so every peer's Prometheus window overlaps target's churn
… scale + vmss delete-instances driven from execute.yml in parallel with CL2 observers
…fig parse + drop broken NewNodesAppearedInWindow PromQL + proactive failure debug dumps in node-churner.sh and execute.yml
…of agentpool label (label drift on k8s 1.34.7) + use kubeconfig-augmented JSON in scenario_failure_diag + tee kubectl stderr to debug log + debug_dump on every replace-phase abort path