Add Cilium ClusterMesh scale-test scenario#1157
Draft
skosuri1 wants to merge 46 commits into
Draft PR for the Cilium ClusterMesh scale-test scenario on AKS. Phase 1 + Phase 2 validated end-to-end on consecutive green runs; pre-merge cleanup pending (see end).
What this adds
A new perf-eval scenario, clustermesh-scale, that scale-tests Cilium ClusterMesh on AKS end-to-end through Telescope.
The scenario provisions N AKS clusters in separate VNets (with peering), joins them into a Cilium ClusterMesh via Azure Fleet Manager's ClusterMeshProfile, runs a ClusterLoader2 workload on every cluster, and aggregates per-cluster results into a single JSONL keyed by source/target cluster.
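As a rough illustration of the aggregation step, a minimal sketch in Python: it merges per-cluster JSONL outputs into one file and tags each record with the cluster it came from. The function name, file layout, and the `source_cluster` field are illustrative assumptions, not the PR's actual schema.

```python
import json

def aggregate_results(per_cluster_files, out_path):
    """Merge per-cluster CL2 JSONL outputs into a single JSONL.

    Illustrative sketch (not the PR's real harness): assumes each
    non-empty line is a JSON object, and tags every record with the
    cluster whose run produced it so downstream queries can key on it.
    """
    with open(out_path, "w") as out:
        for cluster_name, path in per_cluster_files.items():
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines between records
                    record = json.loads(line)
                    record.setdefault("source_cluster", cluster_name)
                    out.write(json.dumps(record) + "\n")
```

The real scenario additionally keys records by target cluster for cross-cluster measurements; this sketch only shows the source-side attribution.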
Reference: #1053 (CNL) — same structural patterns, but #1053 is single-cluster. This PR introduces the multi-cluster execution + aggregation path.
Work is split into phases:

- Phase 1 adds the vertical slice with all wiring (terraform modules, Python harness, topology/engine YAMLs, pipeline YAML, cross-cluster data-path smoke test) for N=2.
- Phase 2 adds full observability (Cilium + clustermesh-apiserver + etcd + logs per spec testing.txt lines 34–38, 131–135) and the first real scenario (Cross-Cluster Event Throughput, scenario #1 in the spec).
- Phase 3 runs at 5 / 10 / 20 clusters (and bumps the baseline cluster size to 20 nodes per spec line 24) and builds Kusto dashboards.
- Phase 4 adds the remaining six scenarios.
- Phase 5 is polish, tests, and the official pipeline request.

Vertical-slice ordering is intentional: every later phase reuses Phase 1's plumbing, so debugging at N=2 is far cheaper than at N=20.
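The Phase 1 fan-out shape can be sketched in a few lines; this is a hypothetical stand-in, with the actual CL2 invocation injected as a callable so the control flow is visible (`run_cl2` and the returned JSONL paths are assumptions, not the real engine API).

```python
from typing import Callable, Dict, List

def run_cl2_across_clusters(clusters: List[str],
                            run_cl2: Callable[[str], str]) -> Dict[str, str]:
    """Sequential CL2 fan-out (the Phase 1/2 behavior).

    `clusters` is the list of cluster names (in the real scenario these
    are discovered via Azure resource tags); `run_cl2` stands in for the
    ClusterLoader2 invocation and returns the JSONL path it wrote.
    """
    results: Dict[str, str] = {}
    for name in clusters:
        # Sequential on purpose: failures surface one cluster at a time;
        # parallel fan-out is deferred to Phase 3.
        results[name] = run_cl2(name)
    return results
```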
Phase 1 contents:

- New terraform submodules `fleet/` and `vnet-peering/`, plus a `--pod-subnet-id` flag in `aks-cli/`.
- Scenario tfvars under `scenarios/perf-eval/clustermesh-scale/`, with a vendored fleet-2.0.4 wheel.
- A per-cluster `prompool` `extra_node_pool` (Standard_D8s_v3 × 1, label `prometheus=true`) for Prometheus isolation, mirroring the `prompool` pattern in cnl-azurecni-overlay-cilium; `--max-pods=110` on the default pool to fit the 200-pod workload (Phase 2).
- A single-cluster `scale.py` harness with `--cluster-name` for per-cluster JSONL attribution.
- A PodMonitor scraping ports 9963 (etcd) and 9964 (kvstoremesh) on the clustermesh-apiserver pod.
- Topology and engine YAMLs that discover clusters via Azure resource tags, fan out CL2 sequentially across clusters (parallel fan-out is Phase 3), and aggregate the per-cluster JSONLs.
- A cross-cluster data-path smoke test (global Service curl from mesh-2 to mesh-1) gating CL2, so we don't ship a "green" Phase 1 with a healthy control plane but a broken data plane.
- A conditional Fleet-wheel install in setup-tests.yml, gated on the scenario name (no impact on other scenarios).
- A pipeline YAML in `pipelines/perf-eval/Network Benchmark/clustermesh-scale.yml`, running weekly in eastus2euap with manual triggers always available.

External prerequisite: the pipeline service principal needs `Microsoft.Authorization/roleAssignments/write` at subscription scope, granted as RBAC Administrator with an ABAC condition restricting it to the Network Contributor role GUID. The Cilium-managed clustermesh-apiserver provisions an internal LB on each cluster's VNet and requires this role on the underlying VNet to mutate it.

Phase 2 contents:

- Scale scenario #1 (Cross-Cluster Event Throughput): a new `event-throughput.yaml` CL2 config and supporting modules that deploy N namespaces × M deployments × R replicas (default 5/4/10 = 200 pods) across the mesh, then exercise create / warmup / rolling-restart burst / settle / delete to drive measurable kvstore traffic.
- Reused `cilium.yaml` and `control-plane.yaml` from PR #1053 (cilium agent/operator CPU/mem, container restarts, API responsiveness, pod startup latency).
- New `clustermesh-metrics.yaml` for always-on mesh measurements: remote clusters connected, remote cluster failures, kvstore events rate (aggregate AND per-type by `scope` label: Identities / Services / Endpoints), kvstore operation duration p50/p90/p99, watch queue depth, identity count.
- New `clustermesh-throughput.yaml` for scenario-specific measurements: event backlog rate, global services count, kvstore op-duration p95 split.
- New `etcd-metrics.yaml` covering the embedded etcd inside clustermesh-apiserver (watch count, slow watchers, pending events, MVCC keys, compaction keys + duration, backend write latency), sourced via the existing PodMonitor on port 9963 with no new scrape target, so all of spec testing.txt line 34 (Cilium / clustermesh-apiserver / etcd) is covered with one PodMonitor.
- Pod logs (clustermesh-apiserver × 3 containers + cilium-agent + cilium-operator) archived to `$report_dir/logs` per spec line 35.
- Network bytes per component (Tx/Rx) added to `cilium.yaml` per spec line 38.
- A junit-aware success gate in execute.yml that distinguishes CL2 logic failures from infra failures.
- Python unit tests + mock-data fixtures covering single-cluster, multi-cluster aggregation, and failure paths.

Fleet members with labels and ClusterMeshProfile have no native Terraform support today, so both go through `terraform_data` + `local-exec` az calls, following the `aks-cli/main.tf` precedent. The az fleet 2.0.4 extension also exposes no detach/remove-member API, so destroying a clustermesh hits a chicken-and-egg: `member delete` is rejected while the member is in any profile, and `clustermeshprofile delete` is rejected while members exist. The destroy provisioner on `terraform_data.clustermeshprofile` breaks this by relabeling members off the profile selector (`az fleet member update --labels` REPLACES the labels map, dropping `mesh=true`), re-applying the profile, then polling list-members until the applied set drains to 0 (10-minute budget, with a periodic re-apply nudge in case the first apply was a no-op), then deleting the profile with a 30×5s backstop retry. Tested across multiple consecutive runs; the post-timeout backstop covers cases where Fleet RP's list-members view lags the actual deletable state.

Known limitations / deferred to Phase 3+:

- `cilium_kvstoremesh_kvstore_operations_duration_seconds_bucket` p99. A synthetic probe (pod created on A → visible on B at time T) is not implemented, because CL2's per-cluster execution model doesn't natively support cross-cluster timing.
- `--auto-compaction-retention=1h`; the metric is wired and will populate at Phase 3 long runs.

Known infra flakes (accepted, not fixed):

- `--enable-acns` extension reconcile racing the prompool add → `OperationNotAllowed: PutExtensionAddonHandler.PUT in progress`. Recurs ~1 in 5 runs; AzDO RetryHelper absorbs it in 2–3 attempts.
- `412 PreconditionFailed` on the clustermesh-apiserver Service (concurrent VMSS modification race). Rare; observed once, resolved by the next build.

Changes:
- ClusterMesh / Fleet / VNet peering setup
- Terraform / AKS provisioning
- Mesh validation + cross-cluster data path
- Telescope pipeline / step conventions
- ClusterLoader2 harness + multi-cluster aggregation
- CL2 measurement modules (Phase 2)
- Tests
- Vendored binary
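For reviewers, the relabel → drain → delete destroy ordering described earlier can be sketched as plain control flow. This is an illustrative model, not the PR's actual provisioner: the `az` invocations are injected as a callable (argument spellings like `clustermeshprofile apply` are stand-ins, not verified CLI syntax), and timings mirror the described 10-minute budget and 30×5s backstop.

```python
import time

def destroy_clustermesh(az, members, budget_s=600, poll_s=15, sleep=time.sleep):
    """Illustrative destroy ordering: relabel members off the profile
    selector, re-apply the profile, poll until the applied set drains,
    then delete the profile with a backstop retry.

    `az` is an injected callable az(*args) -> parsed JSON (or None on
    rejection), standing in for `az fleet ...`; `sleep` is injectable
    for testing.
    """
    # `az fleet member update --labels` REPLACES the labels map, so a map
    # without mesh=true drops the member from the profile selector.
    for m in members:
        az("fleet", "member", "update", "--name", m, "--labels", "mesh=false")
    az("fleet", "clustermeshprofile", "apply")  # re-apply with drained selector

    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        applied = az("fleet", "clustermeshprofile", "list-members")
        if not applied:
            break  # applied set drained to 0
        # Nudge: re-apply in case the first apply was a no-op.
        az("fleet", "clustermeshprofile", "apply")
        sleep(poll_s)

    # 30 x 5s backstop: Fleet RP's list-members view can lag the
    # actually deletable state, so keep retrying the delete briefly.
    for _ in range(30):
        if az("fleet", "clustermeshprofile", "delete") is not None:
            return True
        sleep(5)
    return False
```

The key design point the sketch captures is that draining is observed via list-members rather than assumed from the relabel succeeding, which is what makes the post-timeout backstop safe.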
Pre-merge cleanup pending: