Add image pull 10 nodes scenario #963
Merged
32 commits
b394e66 Add image pull scenario prototype (jasminetMSFT)
695a5d6 Restructure image-pull scenario for ADO pipeline execution (jasminetMSFT)
1ae1992 Fix terraform test input - use minimal JSON (jasminetMSFT)
f6c5608 Fix K8s version to 1.33 and revert gitignore (jasminetMSFT)
f393815 Fix Prometheus setup for image-pull test (jasminetMSFT)
3d7f271 Rename image-pull scenario (jasminetMSFT)
0430f5a Fix image-pull-n10 scenario to follow existing conventions (jasminetMSFT)
ef74ffb Update pipeline to use image-pull-n10 scenario name (jasminetMSFT)
2940ef8 Add image-pull-n10 pipeline and revert new-pipeline-test.yml (jasminetMSFT)
59bf82f refactor: reuse CRI module with scrape_containerd toggle for image-pu… (jasminetMSFT)
57d043a fix: add missing CRI matrix parameters (jasminetMSFT)
b106a5b fix: enable scrape_kubelets for kubelet metrics (jasminetMSFT)
1cff449 fix: use kubernetes_version 1.33 (jasminetMSFT)
31d18c7 fix: use cri-resource-consume label expected by CRI module (jasminetMSFT)
24157e5 Remove irrelevant CNI metrics from CRI containerd measurements, use a… (jasminetMSFT)
612a547 Add percentile metrics for image pull throughput (jasminetMSFT)
6d1eaa1 Revert new-pipeline-test.yml (jasminetMSFT)
860c7e6 Add scrape_containerd parameter to collect phase for CRI engine (jasminetMSFT)
52463ab Add CONTAINERD_SCRAPE_INTERVAL: 30s for CRI scenario (jasminetMSFT)
7b29721 Remove scrape_containerd from collect phase as it has no effect (jasminetMSFT)
a8e19ff Merge remote-tracking branch 'origin/main' into jasminet/image_pull_p… (jasminetMSFT)
4fb9205 Add comment explaining why containerd metrics are not in verify_measu… (jasminetMSFT)
e267e0e Add test workload details to image-pull-n10 README (jasminetMSFT)
64fdd4e Move pipeline to ACR Benchmark folder and revert new-pipeline-test.yml (jasminetMSFT)
0ee61fb Add 20s pod_startup_latency_threshold (jasminetMSFT)
a4a5f7a test with scrape_containerd=False (jasminetMSFT)
12fc728 Revert new-pipeline-test.yml to template (jasminetMSFT)
8d285c5 feat: add 4-hour schedule trigger to image-pull-n10 pipeline (jasminetMSFT)
78c9338 modify interval and vm size (jasminetMSFT)
5906cea Merge main into branch (jasminetMSFT)
2b4af6e Test image-pull-n10 scenario in new-pipeline-test (jasminetMSFT)
87f847a Revert new-pipeline-test.yml to template (jasminetMSFT)
29 changes: 29 additions & 0 deletions
modules/python/clusterloader2/cri/config/containerd-measurements.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| {{$action := .action}} # start, gather | ||
|
|
||
| steps: | ||
| - name: {{$action}} Containerd Measurements | ||
| measurements: | ||
| - Identifier: ContainerdCriImagePullingThroughput | ||
| Method: GenericPrometheusQuery | ||
| Params: | ||
| action: {{$action}} | ||
| metricName: ContainerdCriImagePullingThroughput | ||
| metricVersion: v1 | ||
| unit: MB/s | ||
| queries: | ||
| # Weighted average throughput per image pull (nodes with more pulls have more weight) | ||
| - name: Avg | ||
| query: sum(rate(containerd_cri_image_pulling_throughput_sum{nodepool=~"userpool.*"}[%v])) / sum(rate(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}[%v])) | ||
|
||
| # Unweighted average - each node contributes equally regardless of pull count | ||
| - name: AvgPerNode | ||
| query: avg(sum by (instance) (rate(containerd_cri_image_pulling_throughput_sum{nodepool=~"userpool.*"}[%v])) / sum by (instance) (rate(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}[%v]))) | ||
| # Number of successful image pull observations | ||
| - name: Count | ||
| query: sum(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}) | ||
| # Cluster level percentiles - throughput distribution across nodes | ||
| - name: Perc50 | ||
| query: quantile(0.5, sum by (instance) (rate(containerd_cri_image_pulling_throughput_sum{nodepool=~"userpool.*"}[%v])) / sum by (instance) (rate(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}[%v]))) | ||
| - name: Perc90 | ||
| query: quantile(0.9, sum by (instance) (rate(containerd_cri_image_pulling_throughput_sum{nodepool=~"userpool.*"}[%v])) / sum by (instance) (rate(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}[%v]))) | ||
| - name: Perc99 | ||
| query: quantile(0.99, sum by (instance) (rate(containerd_cri_image_pulling_throughput_sum{nodepool=~"userpool.*"}[%v])) / sum by (instance) (rate(containerd_cri_image_pulling_throughput_count{nodepool=~"userpool.*"}[%v]))) | ||
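The difference between the `Avg`, `AvgPerNode`, and `Perc*` queries above can be checked offline with a small Python sketch. It assumes each node is reduced to a pair `(rate of _sum, rate of _count)`; the node values are hypothetical, and the `quantile` helper mirrors the linear interpolation used by the PromQL `quantile` aggregator rather than calling Prometheus itself.

```python
import math


def weighted_avg(nodes):
    """Avg: sum of per-node _sum rates divided by sum of per-node _count rates."""
    return sum(s for s, _ in nodes) / sum(c for _, c in nodes)


def per_node_means(nodes):
    """Mean throughput per pull on each node (_sum rate / _count rate)."""
    return [s / c for s, c in nodes]


def unweighted_avg(nodes):
    """AvgPerNode: plain mean of the per-node means."""
    means = per_node_means(nodes)
    return sum(means) / len(means)


def quantile(q, values):
    """Linear-interpolation quantile across values, as in PromQL quantile()."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical data: node A did 9 pulls/s at 10 MB/s each, node B 1 pull/s at 20 MB/s.
nodes = [(90.0, 9.0), (20.0, 1.0)]

print(weighted_avg(nodes))                   # 11.0, dominated by the busy node
print(unweighted_avg(nodes))                 # 15.0, each node counts once
print(quantile(0.5, per_node_means(nodes)))  # 15.0, median of node means
```

With only two nodes the median equals the unweighted mean; on the 10-node user pool the percentiles describe how node-level throughput is spread, not how individual pulls are spread.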
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| trigger: none | ||
|
||
| schedules: | ||
| - cron: "0 */4 * * *" | ||
| displayName: "Every 4 Hours" | ||
| branches: | ||
| include: | ||
| - main | ||
| always: true | ||
|
|
||
|
||
| variables: | ||
| SCENARIO_TYPE: perf-eval | ||
| SCENARIO_NAME: image-pull-n10 | ||
|
|
||
| stages: | ||
| - stage: azure_eastus2_image_pull | ||
| dependsOn: [] | ||
|
||
| jobs: | ||
| - template: /jobs/competitive-test.yml | ||
| parameters: | ||
| cloud: azure | ||
| regions: | ||
| - eastus2 | ||
| engine: clusterloader2 | ||
| engine_input: | ||
| image: "ghcr.io/azure/clusterloader2:v20250513" | ||
| topology: cri-resource-consume | ||
|
||
| matrix: | ||
| image-pull-10pods: | ||
| node_count: 10 | ||
| max_pods: 30 | ||
| repeats: 1 | ||
| operation_timeout: 3m | ||
| load_type: memory | ||
| scrape_containerd: True | ||
| scrape_kubelets: True | ||
| kubernetes_version: "1.34" | ||
| pod_startup_latency_threshold: 20s | ||
| max_parallel: 1 | ||
| credential_type: service_connection | ||
| ssh_key_enabled: false | ||
| timeout_in_minutes: 60 | ||
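As a quick check of the schedule in this pipeline, the sketch below expands the cron expression `0 */4 * * *` into its daily start times. The helper names are hypothetical and the parser only handles the `*`, `*/N`, and plain-integer forms, so it is not a general cron implementation. Azure Pipelines evaluates scheduled-trigger cron expressions in UTC.

```python
def expand_field(field, upper):
    """Expand a cron field ('*', '*/N', or an integer) into values in [0, upper)."""
    if field == "*":
        return list(range(upper))
    if field.startswith("*/"):
        return list(range(0, upper, int(field[2:])))
    return [int(field)]


def daily_run_times(cron):
    """Return HH:MM start times (UTC) from the minute and hour fields only."""
    minute, hour, *_ = cron.split()
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_field(hour, 24)
        for m in expand_field(minute, 60)
    ]


runs = daily_run_times("0 */4 * * *")
print(runs)  # ['00:00', '04:00', '08:00', '12:00', '16:00', '20:00']
```

Six runs per day, each bounded by `timeout_in_minutes: 60`, leaves three idle hours between consecutive runs on this schedule.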
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| # image-pull-n10 | ||
|
|
||
| ## Overview | ||
|
|
||
| Measures containerd image pulling throughput (MB/s) using the CRI module with `scrape_containerd: True`. Uses the `cri-resource-consume` topology. | ||
|
|
||
| ## Infrastructure | ||
|
|
||
| | Component | Configuration | | ||
| |-----------|---------------| | ||
| | Cloud Provider | Azure | | ||
| | Cluster SKU | Standard | | ||
| | Network Plugin | Azure CNI Overlay | | ||
| | Default Node Pool | 3 x Standard_D4s_v3 | | ||
| | Prometheus Pool | 1 x Standard_D8s_v3 | | ||
| | User Pool | 10 x Standard_D4s_v3 | | ||
|
|
||
| ## Test Workload | ||
|
|
||
| | Component | Value | | ||
| |-----------|-------| | ||
| | Registry | Azure Container Registry (`akscritelescope.azurecr.io`) | | ||
| | Image | `e2e-test-images/resource-consumer:1.13` | | ||
| | Image Size | ~50MB | | ||
|
|
||
| ## Metrics Collected | ||
|
|
||
| ### ContainerdCriImagePullingThroughput | ||
|
|
||
| Image pull throughput (MB/s) with the following aggregations: | ||
|
|
||
| | Metric | Description | | ||
| |--------|-------------| | ||
| | **Avg** | Weighted average throughput per image pull | | ||
|
||
| | **AvgPerNode** | Unweighted average - each node contributes equally | | ||
| | **Count** | Total number of image pulls | | ||
| | **Perc50** | 50th percentile (median) throughput across nodes | | ||
| | **Perc90** | 90th percentile throughput across nodes | | ||
| | **Perc99** | 99th percentile throughput across nodes | | ||
|
|
||
| ## Known Limitations | ||
|
|
||
| ### Cannot Use histogram_quantile() Per Node | ||
|
|
||
| Using Prometheus `histogram_quantile()` on per-node throughput data returns `10` (the highest finite bucket boundary) whenever actual throughput exceeds 10 MB/s, regardless of the real values. This happens because: | ||
|
|
||
| - The histogram has fixed bucket boundaries: `0.5, 1, 2, 4, 6, 8, 10` MB/s | ||
| - When actual throughput exceeds 10 MB/s, all samples fall into the `+Inf` bucket | ||
| - `histogram_quantile()` cannot interpolate inside the `+Inf` bucket, so it returns the upper bound of the highest finite bucket, `10` | ||
|
|
||
| **Current Approach**: Instead of `histogram_quantile()` per node, we use weighted average (`_sum / _count`) per node, then compute percentiles across the node averages. | ||
|
|
||
| ### Per-Node Metrics May Return "no samples" | ||
|
|
||
| The per-node metrics (`AvgPerNode`, `Perc50`, `Perc90`, `Perc99`) may return "no samples" while aggregate metrics (`Avg`, `Count`) work correctly. This is caused by the Prometheus `rate()` function requiring **at least 2 data points** within the query window. | ||
|
|
||
| **Root Cause**: If image pulls complete faster than the Prometheus scrape interval (default 15s), only one data point is collected per pull operation. The `rate()` function cannot compute a rate from a single sample, resulting in empty per-node results. | ||
|
|
||
| **Why Aggregate Metrics Work**: `Count` reads the raw counter and does not use `rate()` at all. `Avg` applies `rate()` to each series and then sums the results across all nodes before dividing, so it returns a value as long as at least one node has two or more samples in the window. | ||
|
|
||
| **Workaround Options**: | ||
| - Increase scrape frequency (may impact cluster performance) | ||
| - Use larger images that take longer to pull | ||
| - Rely on aggregate metrics (`Avg`, `Count`) for throughput analysis | ||
|
|
||
| ### Metric Includes Unpack Time | ||
|
|
||
| The `containerd_cri_image_pulling_throughput` metric measures **total image size divided by total pull time**, which includes both: | ||
| - Image layer download time | ||
| - Image layer decompression/unpack time | ||
|
|
||
| This is not a pure network throughput metric. See [containerd source](https://github.com/containerd/containerd/blob/main/internal/cri/server/images/image_pull.go). | ||
|
|
||
| ### verify_measurement() Cannot Check Containerd Metrics | ||
|
|
||
| The CRI module's `verify_measurement()` function only validates kubelet metrics (accessible via Kubernetes node proxy endpoint at `/api/v1/nodes/{node}/proxy/metrics`). Containerd metrics are only available through the Prometheus server and cannot be verified through this endpoint. | ||
|
|
||
| ## References | ||
|
|
||
| - [Best Practices](../../../docs/best-practices.md) | ||
| - [Test Scenario Implementation Guide](../../../docs/test-scenario-implementation-guide.md) | ||
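Two of the Known Limitations in this README can be reproduced without a cluster. The sketch below is a simplified Python model, not Prometheus code: `histogram_quantile` follows the documented bucket interpolation using the bucket boundaries listed in the README, and `simple_rate` ignores the extrapolation that the real `rate()` performs. All observation values are hypothetical.

```python
import math

# Bucket upper bounds from the README, plus the implicit +Inf bucket.
BOUNDS = [0.5, 1, 2, 4, 6, 8, 10, math.inf]


def cumulative_counts(observations, bounds=BOUNDS):
    """Cumulative bucket counts, as exposed by the `_bucket{le=...}` series."""
    return [sum(1 for o in observations if o <= b) for b in bounds]


def histogram_quantile(q, bounds, counts):
    """Simplified model of PromQL histogram_quantile() for 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    idx = next(i for i, c in enumerate(counts) if c >= rank)
    if idx == len(bounds) - 1:
        # Quantile falls in +Inf: report the highest finite upper bound.
        return bounds[-2]
    lower_bound = bounds[idx - 1] if idx > 0 else 0.0
    lower_count = counts[idx - 1] if idx > 0 else 0
    in_bucket = counts[idx] - lower_count
    return lower_bound + (bounds[idx] - lower_bound) * (rank - lower_count) / in_bucket


def simple_rate(samples):
    """Per-second increase over (timestamp, value) samples; None if fewer than 2."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


fast_pulls = [25.0, 40.0, 55.0]  # hypothetical MB/s values, all above 10
print(histogram_quantile(0.5, BOUNDS, cumulative_counts(fast_pulls)))   # 10
print(histogram_quantile(0.99, BOUNDS, cumulative_counts(fast_pulls)))  # 10
print(simple_rate([(0, 5.0)]))  # None: a single sample gives no rate
```

Every quantile collapses to `10` once all observations exceed the last finite bucket, which is why the scenario computes `_sum / _count` per node and takes percentiles across nodes instead.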