CNTRLPLANE-2678: add HCPEtcdBackup controller to HyperShift Operator #8139

jparrill wants to merge 2 commits into openshift:main from
Conversation
…rator

Implements the HCPEtcdBackup reconciler that orchestrates etcd snapshot and upload Jobs. The controller watches HCPEtcdBackup CRs, validates etcd health, manages temporary RBAC and NetworkPolicy resources for cross-namespace access, creates 3-container backup Jobs (fetch-certs, snapshot, upload), tracks Job status via pod termination messages, and enforces count-based retention of completed backups. Adds the EtcdBackupActive condition type for HCP-to-HC status propagation. The controller is registered behind the HCPEtcdBackup feature gate.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ller

Add integration tests that validate the HCPEtcdBackup controller end-to-end against a live management cluster with S3 and Azure Blob storage backends.

Tests:
- S3 happy path: backup completes with snapshotURL, RBAC/NetworkPolicy cleaned up, EtcdBackupActive condition propagated to HCP
- Azure Blob happy path: same validations for Azure storage
- Invalid credentials: controller sets BackupFailed instead of retrying indefinitely when the credential Secret is missing

Controller fix: Reconcile now catches IsNotFound errors from createBackupJob (missing credential/pull Secret) and sets BackupCompleted=False/BackupFailed as a terminal condition with resource cleanup, rather than returning an error that causes infinite requeue.

run.sh gains a `controller` subcommand that creates/destroys all cloud resources (S3 buckets, Azure RG/storage/SP, K8s Secrets) automatically via setup/teardown functions with an EXIT trap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Pipeline controller notification: For optional jobs, comment. This repository is configured in LGTM mode.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jparrill. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@jparrill: This pull request references CNTRLPLANE-2678 which is a valid jira issue.

Details: In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/label tide/merge-method-squash
📝 Walkthrough

The changes introduce a new etcd backup feature for hosted control planes. A new condition type (EtcdBackupActive) is added.

Sequence Diagram(s)

sequenceDiagram
participant User as User/Operator
participant K8s as Kubernetes API
participant Reconciler as HCPEtcdBackupReconciler
participant HCP as HostedControlPlane
participant Etcd as Etcd StatefulSet
participant Job as Backup Job
participant Storage as Object Storage
User->>K8s: Create HCPEtcdBackup CR
K8s->>Reconciler: Reconcile triggered
Reconciler->>K8s: Fetch HCPEtcdBackup
K8s-->>Reconciler: CR retrieved
Reconciler->>HCP: Lookup HostedControlPlane
HCP-->>Reconciler: HCP found
Reconciler->>Etcd: Check ReadyReplicas
Etcd-->>Reconciler: Healthy status
Reconciler->>K8s: Find active backup Jobs
K8s-->>Reconciler: None found
Reconciler->>K8s: Create ServiceAccount (HO namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create Role + RoleBinding (HCP namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create NetworkPolicy (HCP namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create Backup Job with init containers
K8s-->>Reconciler: Job created
K8s->>Job: Fetch etcd certs
Job->>Job: Run etcdctl snapshot save
Job->>Storage: Upload snapshot (etcd-upload container)
Storage-->>Job: Snapshot URL returned
K8s->>Reconciler: Job completion event
Reconciler->>Job: Extract snapshot URL from logs
Job-->>Reconciler: URL extracted
Reconciler->>K8s: Update HCPEtcdBackup status (succeeded)
K8s-->>Reconciler: Status updated
Reconciler->>HCP: Set EtcdBackupActive=True condition
HCP-->>Reconciler: Condition propagated
Reconciler->>K8s: Delete NetworkPolicy + RBAC
K8s-->>Reconciler: Cleanup completed
sequenceDiagram
participant HostedCluster as HostedCluster Controller
participant HCP as HostedControlPlane
participant K8s as Kubernetes API
participant User as HostedCluster Status
K8s->>HCP: Condition updated (EtcdBackupActive=True)
HCP-->>K8s: Status condition set
HostedCluster->>K8s: Watch HCP conditions
K8s-->>HostedCluster: Change detected
HostedCluster->>HCP: Read status.conditions
HCP-->>HostedCluster: Conditions retrieved
HostedCluster->>K8s: Update HostedCluster status
Note over HostedCluster: Include EtcdBackupActive condition
K8s-->>HostedCluster: Status updated
K8s->>User: EtcdBackupActive condition visible
/uncc csrwng
/uncc @cblecker
/auto-cc
Actionable comments posted: 3
🧹 Nitpick comments (1)
hypershift-operator/controllers/etcdbackup/reconciler.go (1)
374-378: Consider failing backup if snapshot URL extraction fails.

If getSnapshotURLFromPod fails or returns empty, the backup is still marked as successful but without a usable snapshotURL. This could result in "successful" backups that cannot be used for restore operations. Consider whether a missing snapshot URL should cause the backup to be marked as failed, or at minimum add a warning condition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 374 - 378, The code currently swallows errors/empty results from getSnapshotURLFromPod causing backups to appear successful without a usable SnapshotURL; change the logic in reconciler.go around the getSnapshotURLFromPod call so that when err != nil or url == "" you set backup.Status.Phase to a failing state (e.g., "Failed"), populate backup.Status.Conditions or a warning condition with a clear Reason and Message that includes the error or "empty snapshot URL", and persist the status via the same status update path used elsewhere (e.g., r.client.Status().Update or r.updateStatus). Keep the existing successful branch to set backup.Status.SnapshotURL when present. Ensure the change references getSnapshotURLFromPod, backup.Status.SnapshotURL, backup.Status.Phase, and the status update method so the backup is marked failed (or has a warning condition) when URL extraction fails or is empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/hypershift/v1beta1/hostedcluster_conditions.go`:
- Around line 198-202: The EtcdBackupActive condition is semantically inverted;
change the API to express success/completion instead of activity by renaming the
ConditionType EtcdBackupActive to something like EtcdBackupSucceeded (or
EtcdBackupComplete) and update its docstring so True means the last backup
succeeded and False means it failed/in progress, then update all uses that set
or check EtcdBackupActive (e.g., HCP/HostedCluster status setters, getters, and
any callers) to use the new EtcdBackupSucceeded name and ensure the boolean
polarity is adjusted where values are set or evaluated so consumers reading
HostedCluster status see True as success.
In `@hypershift-operator/main.go`:
- Line 173: Validate opts.EtcdBackupMaxCount immediately after parsing flags
(where cmd.Flags().IntVar is used) and before the controller is wired: if the
value is less than 1 (unless 0 is intentionally allowed and documented), return
an error and exit startup so invalid bounds never reach MaxBackupCount in the
reconciler; update the startup path that wires the controller to check
opts.EtcdBackupMaxCount (and mirror this validation for the other flag group
mentioned around lines 516-523) and include a clear error message referencing
the flag name.
In `@test/integration/oadp/run.sh`:
- Around line 426-449: The fallback currently copies the entire
~/.aws/credentials into creds_content; change it to extract only the active
profile (use $AWS_PROFILE if set, otherwise "default") and serialize just that
profile's keys (aws_access_key_id, aws_secret_access_key, aws_session_token if
present) into creds_content instead of cat'ing the whole file; update the branch
that sets creds_content (the code touching access_key/secret_key/session_token,
creds_content, tmp_creds, and CTRL_AWS_CREDS_SECRET) to parse ~/.aws/credentials
for the selected profile and include only those three entries so unrelated
profiles/accounts are not included in the Secret.
⛔ Files ignored due to path filters (3)
- docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
- docs/content/reference/api.md is excluded by !docs/content/reference/api.md
- vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (7)
- api/hypershift/v1beta1/hostedcluster_conditions.go
- hypershift-operator/controllers/etcdbackup/reconciler.go
- hypershift-operator/controllers/etcdbackup/reconciler_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/main.go
- test/integration/oadp/controller/controller_test.go
- test/integration/oadp/run.sh
```go
// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
// or the last backup failed.
EtcdBackupActive ConditionType = "EtcdBackupActive"
```
EtcdBackupActive is semantically inverted.
The doc says True means "last backup succeeded" and False means "backup in progress or failed". On an exported *Active* condition, consumers will read True as "a backup is currently running", especially once this is surfaced on HostedCluster status. Please rename this to a success/completion condition or flip the polarity before this API contract ships.
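The success-polarity rename suggested here can be sketched in isolation. This is a minimal, hedged Go sketch, not shipped code: the `ConditionStatus`/`BackupPhase` types are simplified stand-ins for the hyperv1 API, and `EtcdBackupSucceeded` is the proposed name, not an existing constant.

```go
package main

import "fmt"

// Simplified stand-ins for the hyperv1 condition machinery (assumption:
// the real types live in the apimachinery/hypershift API packages).
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type BackupPhase string

const (
	PhaseInProgress BackupPhase = "InProgress"
	PhaseFailed     BackupPhase = "Failed"
	PhaseSucceeded  BackupPhase = "Succeeded"
)

// etcdBackupSucceededCondition maps a backup phase to the proposed
// EtcdBackupSucceeded condition: True only when the last backup succeeded,
// False otherwise, with the Reason disambiguating in-progress from failed.
func etcdBackupSucceededCondition(p BackupPhase) (ConditionStatus, string) {
	switch p {
	case PhaseSucceeded:
		return ConditionTrue, "BackupSucceeded"
	case PhaseFailed:
		return ConditionFalse, "BackupFailed"
	default:
		return ConditionFalse, "BackupInProgress"
	}
}

func main() {
	for _, p := range []BackupPhase{PhaseInProgress, PhaseFailed, PhaseSucceeded} {
		s, r := etcdBackupSucceededCondition(p)
		fmt.Printf("%s => %s (%s)\n", p, s, r)
	}
}
```

With this polarity, True always means success, so HostedCluster consumers never have to know whether a False reading is "running" or "failed" without checking the Reason.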
```go
cmd.Flags().BoolVar(&opts.EnableDedicatedRequestServingIsolation, "enable-dedicated-request-serving-isolation", true, "If true, enables scheduling of request serving components to dedicated nodes")
cmd.Flags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws)")
cmd.Flags().StringVar(&opts.ScaleFromZeroCreds, "scale-from-zero-creds", opts.ScaleFromZeroCreds, "Path to credentials file for scale-from-zero instance type queries")
cmd.Flags().IntVar(&opts.EtcdBackupMaxCount, "etcd-backup-max-count", 5, "Maximum number of completed HCPEtcdBackup CRs to retain per HostedControlPlane")
```
Validate --etcd-backup-max-count before wiring the controller.
IntVar accepts negative values, and this option is passed straight into MaxBackupCount. For backup retention, the operator should fail fast on invalid bounds instead of leaving behavior to reconciler internals. If 0 is intentional, document that explicitly; otherwise reject values below 1.
Also applies to: 516-523
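A fail-fast check along these lines might look like the sketch below. It is only an illustration: `validateEtcdBackupMaxCount` is a hypothetical helper, and the real operator wires its flags through cobra/pflag in main.go rather than a bare loop.

```go
package main

import (
	"fmt"
	"os"
)

// validateEtcdBackupMaxCount rejects out-of-range values for
// --etcd-backup-max-count before the controller is wired, so invalid
// bounds never reach MaxBackupCount in the reconciler.
func validateEtcdBackupMaxCount(n int) error {
	if n < 1 {
		return fmt.Errorf("--etcd-backup-max-count must be >= 1, got %d", n)
	}
	return nil
}

func main() {
	// Demonstrate the check on a few candidate values.
	for _, n := range []int{-1, 0, 5} {
		if err := validateEtcdBackupMaxCount(n); err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Println("ok:", n)
	}
}
```

In the real startup path the error would abort the process (e.g. return from `main` with a non-zero exit) instead of continuing the loop.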
```bash
local creds_content=""
if [[ -n "${access_key}" && -n "${secret_key}" ]]; then
  creds_content="[default]
aws_access_key_id = ${access_key}
aws_secret_access_key = ${secret_key}"
  if [[ -n "${session_token}" ]]; then
    creds_content="${creds_content}
aws_session_token = ${session_token}"
  fi
elif [[ -f "${HOME}/.aws/credentials" ]]; then
  creds_content=$(cat "${HOME}/.aws/credentials")
else
  err "Cannot determine AWS credentials."
fi

# Create K8s Secret in HO namespace
CTRL_AWS_CREDS_SECRET="etcd-backup-aws-test-${SUFFIX}"
log "Creating K8s Secret ${CTRL_AWS_CREDS_SECRET} in ${ho_namespace}..."
local tmp_creds
tmp_creds=$(mktemp /tmp/etcd-ctrl-aws-creds.XXXXXX)
echo "${creds_content}" > "${tmp_creds}"
oc create secret generic "${CTRL_AWS_CREDS_SECRET}" \
  --from-file=credentials="${tmp_creds}" \
  -n "${ho_namespace}"
```
Don't copy the whole local AWS credentials file into the cluster.
The fallback at Lines 435-436 reads ~/.aws/credentials verbatim and then stores it in a Secret. That can include unrelated profiles/accounts and exposes far more credential material than these tests need. Serialize only the active/default profile into the Secret.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #8139 +/- ##
==========================================
+ Coverage 29.44% 29.76% +0.31%
==========================================
Files 1043 1044 +1
Lines 95594 96270 +676
==========================================
+ Hits 28150 28657 +507
- Misses 65027 65153 +126
- Partials 2417 2460 +43
@jparrill: all tests passed! Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
```
who consumes this?
EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?
// True means a backup completed successfully; False means a backup is in progress
how does this relate to BackupCompleted? why do they live in different places?
Could you present all possible permutations for these two conditions with their reasons?
Who consumes this?
EtcdBackupActive lives on the HCP/HC so users watching HostedCluster status can see backup activity without looking at HCPEtcdBackup CRs directly. The HC controller propagates it from HCP → HC via the existing condition bubble-up mechanism.
EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?
I agree it's confusing. EtcdBackupActive=False meaning "in progress" is counter-intuitive. I followed the existing EtcdRecoveryActive pattern, but the semantics don't map well here since EtcdRecoveryActive=True means "recovery is happening" while here True means "backup succeeded".
Maybe I can rename to EtcdBackupSucceeded:

- True = last backup succeeded
- False = in progress or failed
How does it relate to BackupCompleted?
BackupCompleted lives on the HCPEtcdBackup CR (per-backup). EtcdBackupActive lives on the HCP/HC (cluster-level summary). The controller sets EtcdBackupActive on the HCP as a mirror of the latest backup's status.
Permutations
| HCPEtcdBackup BackupCompleted | Reason | HCP EtcdBackupActive | Reason | Meaning |
|---|---|---|---|---|
| (not set) | — | (not set) | — | CR just created, not yet reconciled |
| False | EtcdUnhealthy | (not set) | — | etcd StatefulSet not ready, waiting to retry |
| False | BackupAlreadyInProgress | False | BackupAlreadyInProgress | backup Job created and running |
| False | BackupFailed | False | BackupFailed | permanent failure (missing creds, Job failed) |
| True | BackupSucceeded | True | BackupSucceeded | backup completed successfully |
> Maybe I can rename to EtcdBackupSucceeded:
> True = last backup succeeded; False = in progress or failed

That sounds better to me as well.
I also think that BackupAlreadyInProgressReason could be just BackupInProgressReason with value BackupInProgress
It's a minor detail but since we're just defining the API it's worth mentioning that having "Already" in the name sounds redundant.
```go
hyperv1.HostedClusterRestoredFromBackup,
hyperv1.DataPlaneConnectionAvailable,
hyperv1.ControlPlaneConnectionAvailable,
hyperv1.ConditionType("EtcdBackupActive"),
```
Why not use hyperv1.EtcdBackupActive directly?
```go
pod := &podList.Items[i]
for _, cs := range pod.Status.ContainerStatuses {
	if cs.Name == "upload" && cs.State.Terminated != nil && cs.State.Terminated.Message != "" {
		return strings.TrimSpace(cs.State.Terminated.Message), nil
```
Maybe it is not necessary, but would it be more stable if etcd-upload wrote a structured message into the termination log (like SNAPSHOT_URL=XXX on a separate line) and we parsed that from the message?
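The structured-message idea could look like this minimal sketch. Note `SNAPSHOT_URL=` is the hypothetical key from the comment above, not an existing contract in the controller.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSnapshotURL scans a container termination message for a
// "SNAPSHOT_URL=" line, so free-form log text around it cannot be
// mistaken for the snapshot URL.
func parseSnapshotURL(terminationMessage string) (string, bool) {
	for _, line := range strings.Split(terminationMessage, "\n") {
		line = strings.TrimSpace(line)
		if v, ok := strings.CutPrefix(line, "SNAPSHOT_URL="); ok && v != "" {
			return v, true
		}
	}
	return "", false
}

func main() {
	msg := "upload finished in 3s\nSNAPSHOT_URL=s3://bucket/etcd-backups/snap.db"
	url, ok := parseSnapshotURL(msg)
	fmt.Println(ok, url)
}
```

The upload container would then emit that key on its own line in the termination message, and the reconciler's trim-the-whole-message logic would be replaced by this key lookup.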
```go
	logger.Error(err, "failed to update HCP backup condition")
}

return ctrl.Result{RequeueAfter: requeueInterval}, nil
```
Why do we requeue the reconciliation? The controller watches the Job as well, so Job status changes should trigger it automatically IMO.
```go
mountPathCredentials = "/etc/etcd-backup-creds"

// etcdClientPort is the etcd client port used in the NetworkPolicy and etcdctl endpoint.
etcdClientPort int32 = 2379
```
Nit: It would be cool to put the port in constants, next to https://github.com/openshift/hypershift/blob/main/support/config/constants.go#L30
Summary
- Adds the HCPEtcdBackup controller to the HyperShift Operator, implementing the reconciliation flow defined in Enhancement PR #1945
- Watches HCPEtcdBackup CRs and orchestrates a 3-container backup Job (fetch-etcd-certs → etcdctl snapshot → etcd-upload) with temporary RBAC and NetworkPolicy for cross-namespace etcd access
- Tracks Job status (BackupCompleted) and propagates the EtcdBackupActive condition to the HostedControlPlane
- Sets a BackupFailed terminal condition instead of retrying indefinitely
- Adds the --etcd-backup-max-count flag and the run.sh controller subcommand

Dependencies
- etcd-backup CPO subcommand (merged)
- fetch-etcd-certs CPO subcommand (merged)
- etcd-upload CPO subcommand (merged)

Jira
CNTRLPLANE-2678
Test plan
- Unit tests pass (go test ./hypershift-operator/controllers/etcdbackup/...)
- make run-gitlint passes
- make verify passes

Testing notes
Unit tests

go test ./hypershift-operator/controllers/etcdbackup/... -v

Integration tests
Integration tests run against a live management cluster with a HostedCluster deployed. They require:
- KUBECONFIG pointing to the management cluster
- ETCD_BACKUP_TEST_HCP_NAMESPACE set to the HCP namespace (e.g. clusters-my-hcp)
- The HCPEtcdBackup feature gate enabled (TechPreviewNoUpgrade feature set)
- AWS CLI authenticated (aws sts get-caller-identity) and/or Azure CLI authenticated (az account show)

The run.sh controller subcommand handles all cloud resource lifecycle (S3 buckets, Azure storage accounts, K8s Secrets) automatically.

Alternatively, run Go tests directly with manual env vars (useful for debugging):
```bash
KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
ETCD_BACKUP_TEST_S3_BUCKET=my-bucket \
ETCD_BACKUP_TEST_S3_REGION=us-east-2 \
ETCD_BACKUP_TEST_S3_KEY_PREFIX=etcd-backups/test \
ETCD_BACKUP_TEST_S3_CREDENTIALS_SECRET=my-aws-creds \
go test -tags integration -v -timeout 10m ./test/integration/oadp/controller/...
```

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests