CNTRLPLANE-3205: Execute performance testing for self-managed Azure HCP #80449
Open
mehabhalodiya wants to merge 1 commit into openshift:main from mehabhalodiya:azure_perf
47 changes: 47 additions & 0 deletions
...r/config/openshift/hypershift/openshift-hypershift-release-5.0__periodics-azure-perf.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| base_images: | ||
|   hypershift-operator: | ||
|     name: hypershift-operator | ||
|     namespace: hypershift | ||
|     tag: latest | ||
|   hypershift-tests: | ||
|     name: hypershift-tests | ||
|     namespace: hypershift | ||
|     tag: latest | ||
|   upi-installer: | ||
|     name: "5.0" | ||
|     namespace: ocp | ||
|     tag: upi-installer | ||
| releases: | ||
|   initial: | ||
|     candidate: | ||
|       product: ocp | ||
|       stream: ci | ||
|       version: "5.0" | ||
|   latest: | ||
|     candidate: | ||
|       product: ocp | ||
|       stream: ci | ||
|       version: "5.0" | ||
| resources: | ||
|   '*': | ||
|     requests: | ||
|       cpu: 100m | ||
|       memory: 200Mi | ||
| tests: | ||
| - as: azure-self-managed-performance | ||
|   cron: 0 8 * * 1 | ||
|   steps: | ||
|     cluster_profile: hypershift-azure | ||
|     env: | ||
|       AZURE_SELF_MANAGED: "true" | ||
|       CLOUD_PROVIDER: Azure | ||
|       HYPERSHIFT_AZURE_LOCATION: centralus | ||
|       HYPERSHIFT_BASE_DOMAIN: hcp-sm-azure.azure.devcluster.openshift.com | ||
|       HYPERSHIFT_EXTERNAL_DNS_DOMAIN: aks-e2e.hypershift.azure.devcluster.openshift.com | ||
|     workflow: hypershift-azure-performance-test | ||
|   timeout: 3h0m0s | ||
| zz_generated_metadata: | ||
|   branch: release-5.0 | ||
|   org: openshift | ||
|   repo: hypershift | ||
|   variant: periodics-azure-perf |
84 changes: 84 additions & 0 deletions
ci-operator/jobs/openshift/hypershift/openshift-hypershift-release-5.0-periodics.yaml
1 change: 1 addition & 0 deletions
ci-operator/step-registry/hypershift/azure/performance-test/OWNERS
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| ../OWNERS |
239 changes: 239 additions & 0 deletions
ci-operator/step-registry/hypershift/azure/performance-test/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,239 @@ | ||
| # Azure Self-Managed HyperShift Performance Testing | ||
|
|
||
| ## Overview | ||
|
|
||
| This directory contains the performance testing infrastructure for Azure self-managed HyperShift (HCP) clusters. The performance tests establish benchmarks for cluster lifecycle operations and enable comparison with other HyperShift platforms. | ||
|
|
||
| ## Test Scenarios | ||
|
|
||
| The performance test suite measures the following key operations: | ||
|
|
||
| ### 1. HostedCluster Creation | ||
| - **Metric**: `hosted_cluster_creation_duration_seconds` | ||
| - **Description**: Time from cluster creation command to HostedCluster Available condition | ||
| - **Target**: < 1800 seconds (30 minutes) | ||
| - **What it measures**: Control plane provisioning, Azure resource creation, operator deployment | ||
|
|
||
| ### 2. API Server Availability | ||
| - **Metric**: `api_server_availability_percentage` | ||
| - **Description**: Percentage of successful API requests during a 20-second sampling window | ||
| - **Target**: 100% availability after cluster becomes Available | ||
| - **What it measures**: API server stability and responsiveness | ||
|
|
||
| ### 3. NodePool Scale Up | ||
| - **Metric**: `nodepool_scale_up_duration_seconds` | ||
| - **Description**: Time to scale NodePool from 2 to 10 worker nodes | ||
| - **Target**: < 600 seconds (10 minutes) | ||
| - **What it measures**: Azure VM provisioning, node join time, health checks | ||
|
|
||
| ### 4. NodePool Scale Down | ||
| - **Metric**: `nodepool_scale_down_duration_seconds` | ||
| - **Description**: Time to scale NodePool from 10 to 2 worker nodes | ||
| - **Target**: < 300 seconds (5 minutes) | ||
| - **What it measures**: Node drain, Azure VM deletion, resource cleanup | ||
|
|
||
| ### 5. HostedCluster Deletion | ||
| - **Metric**: `hosted_cluster_deletion_duration_seconds` | ||
| - **Description**: Time from deletion command to complete resource cleanup | ||
| - **Target**: < 900 seconds (15 minutes) | ||
| - **What it measures**: Azure resource deletion, finalizer processing, cleanup efficiency | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Test Workflow | ||
| ``` | ||
| Pre Steps: | ||
| 1. ipi-install-rbac → Set up RBAC for root cluster | ||
| 2. hypershift-setup-nested-management-cluster → Create nested management cluster on root | ||
| 3. hypershift-azure-setup-private-link → Configure Azure Private Link | ||
| 4. hypershift-install → Deploy HyperShift operator | ||
|
|
||
| Test Step: | ||
| 5. hypershift-azure-performance-test → Execute performance benchmarks | ||
|
|
||
| Post Steps: | ||
| 6. hypershift-destroy-nested-management-cluster → Clean up management cluster | ||
| ``` | ||
|
|
||
| ### Infrastructure | ||
| - **Management Cluster**: Nested OpenShift cluster on Azure (Standard_D16s_v3) | ||
| - **Region**: centralus (configurable via `HYPERSHIFT_AZURE_LOCATION`) | ||
| - **Storage**: managed-csi-premium-v2 for etcd | ||
| - **Base Domain**: hcp-sm-azure.azure.devcluster.openshift.com | ||
| - **Authentication**: Azure Service Principal with Workload Identity | ||
|
|
||
| ## Configuration | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| | Variable | Default | Description | | ||
| |----------|---------|-------------| | ||
| | `HYPERSHIFT_AZURE_LOCATION` | centralus | Azure region for testing | | ||
| | `HYPERSHIFT_BASE_DOMAIN` | hcp-sm-azure.azure.devcluster.openshift.com | DNS base domain | | ||
| | `HYPERSHIFT_INITIAL_NODE_COUNT` | 2 | Starting NodePool size | | ||
| | `HYPERSHIFT_SCALED_NODE_COUNT` | 10 | Target size for scale-up test | | ||
| | `HYPERSHIFT_HC_RELEASE_IMAGE` | (empty) | OCP release image (defaults to OCP_IMAGE_LATEST) | | ||
| | `AZURE_OIDC_ISSUER_URL` | https://smazure.blob.core.windows.net/smazure | OIDC issuer for WIF | | ||
|
|
||
| ### Credentials Required | ||
|
|
||
| The test requires these credentials mounted from test-credentials namespace: | ||
| - `/etc/hypershift-ci-jobs-self-managed-azure/credentials.json` - Azure service principal | ||
| - `/etc/hypershift-ci-jobs-self-managed-azure-e2e/` - Workload identities and SA signing key | ||
| - `/etc/ci-pull-credentials/.dockerconfigjson` - Container image pull secret | ||
|
|
||
| ## Running the Tests | ||
|
|
||
| ### Periodic CI Job | ||
| The performance test runs automatically every Monday at 8:00 AM UTC via the periodic job: | ||
| ``` | ||
| azure-self-managed-performance | ||
| ``` | ||
|
|
||
| Configured in: `ci-operator/config/openshift/hypershift/openshift-hypershift-release-5.0__periodics-azure-perf.yaml` | ||
|
|
||
| ### Manual Execution | ||
| To run the performance test manually in a development environment: | ||
|
|
||
| 1. Set up a management cluster with HyperShift operator installed | ||
| 2. Ensure Azure credentials are configured | ||
| 3. Export required environment variables | ||
| 4. Run the test script: | ||
| ```bash | ||
| export KUBECONFIG=/path/to/management-cluster-kubeconfig | ||
| export SHARED_DIR=/tmp/test-artifacts | ||
| export ARTIFACT_DIR=/tmp/test-results | ||
| export PROW_JOB_ID=test-$(date +%s) | ||
|
|
||
| # Run the performance test | ||
| ./hypershift-azure-performance-test-commands.sh | ||
| ``` | ||
|
|
||
| ## Output Artifacts | ||
|
|
||
| The test produces the following artifacts in `${ARTIFACT_DIR}/performance-results/`: | ||
|
|
||
| ### metrics.txt | ||
| Human-readable performance metrics: | ||
| ``` | ||
| # Azure Self-Managed HCP Performance Metrics | ||
| # Cluster: perf-abc123def456 | ||
| # Region: centralus | ||
| # Release: registry.ci.openshift.org/ocp/release:4.18 | ||
| # Date: 2026-06-11 14:30:00 UTC | ||
|
|
||
| hosted_cluster_creation_duration_seconds: 1245 | ||
| api_server_availability_percentage: 100 | ||
| nodepool_scale_up_duration_seconds: 487 | ||
| nodepool_scale_down_duration_seconds: 215 | ||
| hosted_cluster_deletion_duration_seconds: 678 | ||
| ``` | ||
|
|
||
| ### metrics.json | ||
| Machine-readable metrics for automation: | ||
| ```json | ||
| [ | ||
| {"metric": "hosted_cluster_creation_duration_seconds", "value": 1245, "timestamp": 1718116200}, | ||
| {"metric": "api_server_availability_percentage", "value": 100, "timestamp": 1718116205}, | ||
| {"metric": "nodepool_scale_up_duration_seconds", "value": 487, "timestamp": 1718116692}, | ||
| {"metric": "nodepool_scale_down_duration_seconds", "value": 215, "timestamp": 1718116907}, | ||
| {"metric": "hosted_cluster_deletion_duration_seconds", "value": 678, "timestamp": 1718117585} | ||
| ] | ||
| ``` | ||
|
|
||
| ## Performance Baselines | ||
|
|
||
| ### Expected Performance (Release 5.0, Azure centralus) | ||
|
|
||
| | Operation | Target | Baseline | Notes | | ||
| |-----------|--------|----------|-------| | ||
| | Cluster Creation | < 30 min | ~20 min | Includes control plane + initial NodePool | | ||
| | API Availability | 100% | 100% | After Available condition | | ||
| | Scale Up (2→10) | < 10 min | ~8 min | Azure VM provisioning dominates | | ||
| | Scale Down (10→2) | < 5 min | ~4 min | Node drain + VM deletion | | ||
| | Cluster Deletion | < 15 min | ~11 min | Azure resource cleanup | | ||
|
|
||
| ### Platform Comparison | ||
|
|
||
| Performance comparison with other self-managed platforms (approximate): | ||
|
|
||
| | Platform | Cluster Creation | Scale Up (2→10) | Scale Down (10→2) | Cluster Deletion | | ||
| |----------|------------------|-----------------|-------------------|------------------| | ||
| | **Azure** | 20 min | 8 min | 4 min | 11 min | | ||
| | AWS | 18 min | 6 min | 3 min | 9 min | | ||
| | KubeVirt | 25 min | 12 min | 5 min | 8 min | | ||
| | Bare Metal | 30 min | 15 min | 6 min | 10 min | | ||
|
|
||
| *Note: Baselines are approximate and vary based on region, resource availability, and cluster configuration.* | ||
|
|
||
| ## Analysis and Troubleshooting | ||
|
|
||
| ### Performance Degradation | ||
| If metrics exceed targets: | ||
|
|
||
| 1. **Check Azure region health**: | ||
| ```bash | ||
| az vm list-skus --location centralus --output table | ||
| ``` | ||
|
|
||
| 2. **Verify management cluster health**: | ||
| ```bash | ||
| oc get nodes -o wide | ||
| oc top nodes | ||
| ``` | ||
|
|
||
| 3. **Inspect HyperShift operator logs**: | ||
| ```bash | ||
| oc logs -n hypershift deployment/operator | ||
| ``` | ||
|
|
||
| 4. **Review Azure resource provisioning**: | ||
| ```bash | ||
| az monitor activity-log list --resource-group <rg-name> | ||
| ``` | ||
|
|
||
| ### Common Issues | ||
|
|
||
| **Slow Cluster Creation (> 30 min)**: | ||
| - Azure quota limits | ||
| - DNS propagation delays | ||
| - Image pull timeouts | ||
| - etcd storage provisioning issues | ||
|
|
||
| **Slow NodePool Scaling (> 10 min for scale-up)**: | ||
| - Azure VM quota exhaustion | ||
| - Availability zone capacity constraints | ||
| - Network security group rules | ||
| - Machine config updates pending | ||
|
|
||
| **Slow Cluster Deletion (> 15 min)**: | ||
| - Azure Private Link cleanup | ||
| - Persistent volume deletion | ||
| - DNS zone cleanup | ||
| - Resource group finalizers | ||
|
|
||
| ## Integration with CI Analytics | ||
|
|
||
| Performance metrics are exported for analysis by OpenShift CI tooling: | ||
|
|
||
| 1. **Artifacts**: Stored in Prow job artifacts for historical tracking | ||
| 2. **Metrics**: JSON format enables automated trend analysis | ||
| 3. **Alerts**: Exceeding targets can trigger notifications (future) | ||
| 4. **Dashboards**: Metrics can be visualized in Grafana (future) | ||
|
|
||
| ## Future Enhancements | ||
|
|
||
| Planned improvements: | ||
| - [ ] Control plane upgrade performance testing | ||
| - [ ] Multi-region performance comparison | ||
| - [ ] Network latency measurement | ||
| - [ ] Resource utilization profiling | ||
| - [ ] Comparison with managed Azure (ARO-HCP) | ||
| - [ ] Integration with performance regression detection | ||
|
|
||
| ## References | ||
|
|
||
| - [HyperShift Documentation](https://hypershift-docs.netlify.app/) | ||
| - [Azure HyperShift Architecture](https://github.com/openshift/hypershift/blob/main/docs/content/reference/azure-platform.md) | ||
| - [OpenShift CI Documentation](https://docs.ci.openshift.org/) | ||
| - [CNTRLPLANE-3205](https://issues.redhat.com/browse/CNTRLPLANE-3205) - Original JIRA ticket | ||
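
The targets listed under Test Scenarios in the README lend themselves to a mechanical check against the `metrics.txt` format shown under Output Artifacts. The following is a minimal sketch only, assuming the `name: value` line format from that example. `check_metrics` is a hypothetical helper and is not part of this PR.

```bash
# Hypothetical helper (not part of this PR): compare metrics.txt values
# against the targets stated in the README.
check_metrics() {
  awk -F': *' '
    BEGIN {
      bad = 0
      # Upper bounds in seconds, copied from the README "Target" lines.
      max["hosted_cluster_creation_duration_seconds"] = 1800
      max["nodepool_scale_up_duration_seconds"] = 600
      max["nodepool_scale_down_duration_seconds"] = 300
      max["hosted_cluster_deletion_duration_seconds"] = 900
    }
    # Skip header comments and blank lines.
    /^#/ || NF < 2 { next }
    # Availability must be exactly 100 percent.
    $1 == "api_server_availability_percentage" {
      if ($2 + 0 < 100) { print "FAIL " $1 " = " $2; bad = 1 }
      next
    }
    # Duration targets are strict upper bounds, so equality is a failure.
    ($1 in max) && ($2 + 0 >= max[$1]) { print "FAIL " $1 " = " $2; bad = 1 }
    END { exit bad }
  ' "$1"
}
```

Calling `check_metrics "${ARTIFACT_DIR}/performance-results/metrics.txt"` prints one `FAIL` line for each metric that misses its target and returns a non-zero status if any do.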
Manual execution steps don’t match the script’s kubeconfig handling.
The example tells readers to export `KUBECONFIG=/path/to/management-cluster-kubeconfig`, but the script immediately overwrites `KUBECONFIG` with `${SHARED_DIR}/management_cluster_kubeconfig`. Following these docs as written will fail unless the kubeconfig is also copied into `SHARED_DIR` under that exact filename.
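
One way to reconcile the manual steps with that behavior is to stage the kubeconfig before running the script. This is a minimal sketch, assuming the script reads `${SHARED_DIR}/management_cluster_kubeconfig` as described in this comment. `stage_kubeconfig` is a hypothetical helper name, not something the PR defines.

```bash
# Hypothetical helper: copy an existing management-cluster kubeconfig to the
# location the test script is reported to read it from.
stage_kubeconfig() {
  local src="$1" shared_dir="$2"
  # Fail early with a clear message if the source kubeconfig is missing.
  [ -f "$src" ] || { echo "kubeconfig not found: $src" >&2; return 1; }
  mkdir -p "$shared_dir"
  cp "$src" "$shared_dir/management_cluster_kubeconfig"
}
```

For a manual run this would be called as `stage_kubeconfig /path/to/management-cluster-kubeconfig "$SHARED_DIR"` after exporting `SHARED_DIR` and before invoking `./hypershift-azure-performance-test-commands.sh`.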