Pull request overview
This PR adds a new ClusterLoader2-based performance test scenario to measure container image pull performance on AKS clusters. The scenario deploys 10 small deployments with 1 replica each (totaling 10 pods) using imagePullPolicy: Always to force image pulls, then collects detailed metrics from kubelet and containerd via Prometheus.
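To make the workload shape concrete, here is a minimal sketch of the manifests such a scenario might generate: 10 Deployments with 1 replica each, every container using imagePullPolicy: Always. The helper names (`build_deployment`, `build_scenario`), the `image-pull-<i>` naming scheme, and the image reference are assumptions for illustration only; the PR's actual deployment_template.yaml is rendered by ClusterLoader2 and may differ.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 1) -> dict:
    """Return a Deployment manifest whose container always re-pulls its image."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Always forces kubelet to pull on every pod start,
                            # so each pod contributes to the pull metrics.
                            "imagePullPolicy": "Always",
                        }
                    ]
                },
            },
        },
    }


def build_scenario(count: int = 10,
                   image: str = "registry.example.com/sample:latest") -> list:
    """Hypothetical helper: one single-replica Deployment per index."""
    return [build_deployment(f"image-pull-{i}", image) for i in range(count)]


deployments = build_scenario()
total_pods = sum(d["spec"]["replicas"] for d in deployments)
print(f"{len(deployments)} deployments, {total_pods} pods")
print(json.dumps(deployments[0]["spec"]["template"]["spec"], indent=2))
```

With the default arguments this yields 10 Deployments and 10 pods in total, matching the counts stated in the overview.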
Key Changes
- New image-pull-n10 scenario with terraform configuration for 3-node default pool, 1-node Prometheus pool, and 10-node user pool
- ClusterLoader2 test configuration that measures image pulling throughput, kubelet runtime operations, and pod startup latency
- Python module with execute/collect functions following established patterns from cri and other modules
- Comprehensive unit tests for the new Python module
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `steps/topology/image-pull/*.yml` | Topology step templates for validation, execution, and collection |
| `steps/engine/clusterloader2/image_pull/*.yml` | Engine-level execute and collect steps with environment configuration |
| `modules/python/clusterloader2/image_pull/image_pull.py` | Main Python module implementing CL2 config override, execution, and results collection |
| `modules/python/tests/test_image_pull.py` | Unit tests covering execute, collect, and CLI functions |
| `modules/python/clusterloader2/image_pull/config/image-pull.yaml` | Main CL2 test configuration with pod deployment and measurement steps |
| `modules/python/clusterloader2/image_pull/config/deployment_template.yaml` | Kubernetes deployment template with imagePullPolicy: Always |
| `modules/python/clusterloader2/image_pull/config/kubelet-measurement.yaml` | Kubelet runtime operation duration measurements |
| `modules/python/clusterloader2/image_pull/config/containerd-measurements.yaml` | Containerd CRI metrics for image pulling throughput and network operations |
| `scenarios/perf-eval/image-pull-n10/terraform-inputs/azure.tfvars` | Infrastructure configuration for Azure AKS cluster with 3 node pools |
| `scenarios/perf-eval/image-pull-n10/terraform-test-inputs/azure.json` | Test input parameters for scenario |
| `scenarios/perf-eval/image-pull-n10/README.md` | Documentation of scenario purpose, infrastructure, and usage |
We already have an image pull benchmark in the cri module. Is it possible to use the same module (adding the extra metrics there) instead of creating a new module?
Do you have a test run result?
johnsonshi left a comment
Looks good so far. Please address comments.
xgugeng left a comment
LGTM, please request approval from the maintainers.
Addressed all requested changes in the latest commits. Dismissing to unblock merge due to dependent work.
Summary
Adds containerd image pull metrics collection to the existing CRI module using a `scrape_containerd` toggle.
Approach
Instead of creating a new module, this PR extends the existing CRI infrastructure:
- `scrape_containerd` parameter to enable containerd metrics collection
- `cri-resource-consume` topology
- `containerd-measurements.yaml` config to CRI module
Metrics Collected
ContainerdCriImagePullingThroughput
Image pull throughput (MB/s) with the following aggregations:
Results
scrape_containerd=True
Pipeline: #20260106.12 • Add 20s pod_startup_latency_threshold
JSON Result: perf-eval/image-pull-n10/image_pull_prototype/47996-233c2bdc-3a73-570f-5588-ad30a97c6402.json
scrape_containerd=False
Pipeline: #20260106.27 • test with scrape_containerd=False
JSON Result: perf-eval/image-pull-n10/image_pull_prototype/48089-233c2bdc-3a73-570f-5588-ad30a97c6402.json
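The two runs above differ only in the scrape_containerd toggle. As a rough sketch of how a boolean toggle like this might gate the containerd measurements and feed a CL2 overrides file, consider the following. The function names, the `CL2_SCRAPE_CONTAINERD` and `CL2_POD_STARTUP_LATENCY_THRESHOLD` keys, and the assumption that kubelet measurements are always collected are hypothetical and not taken from the CRI module's actual code; the 20s default merely echoes the pipeline title.

```python
# Hypothetical file names; the PR adds containerd-measurements.yaml to the
# CRI module's config directory, the rest is assumed for illustration.
BASE_MEASUREMENTS = ["kubelet-measurement.yaml"]
CONTAINERD_MEASUREMENTS = "containerd-measurements.yaml"


def select_measurements(scrape_containerd: bool) -> list:
    """Return measurement configs to load; containerd ones only when enabled."""
    files = list(BASE_MEASUREMENTS)
    if scrape_containerd:
        files.append(CONTAINERD_MEASUREMENTS)
    return files


def render_overrides(scrape_containerd: bool,
                     pod_startup_latency_threshold: str = "20s") -> str:
    """Render 'KEY: value' lines in the style of a CL2 overrides file."""
    overrides = {
        # Assumed key names, not confirmed by the PR.
        "CL2_SCRAPE_CONTAINERD": str(scrape_containerd).lower(),
        "CL2_POD_STARTUP_LATENCY_THRESHOLD": pod_startup_latency_threshold,
    }
    return "".join(f"{key}: {value}\n" for key, value in overrides.items())


for enabled in (True, False):
    print(f"scrape_containerd={enabled}")
    print("  measurements:", select_measurements(enabled))
    print("  overrides:", render_overrides(enabled).strip().replace("\n", ", "))
```

Keeping the toggle off by default would leave existing CRI scenarios unchanged, which is presumably why both settings were exercised in separate pipeline runs.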