
CNTRLPLANE-2678: add HCPEtcdBackup controller to HyperShift Operator #8139

Open
jparrill wants to merge 2 commits into openshift:main from jparrill:CNTRLPLANE-2678-controller

Conversation

@jparrill
Contributor

@jparrill jparrill commented Mar 31, 2026

Summary

  • Adds the HCPEtcdBackup controller to the HyperShift Operator, implementing the reconciliation flow defined in Enhancement PR #1945
  • The controller watches HCPEtcdBackup CRs and orchestrates a 3-container backup Job (fetch-etcd-certs → etcdctl snapshot → etcd-upload) with temporary RBAC and NetworkPolicy for cross-namespace etcd access
  • Reports backup status through CR conditions (BackupCompleted) and propagates EtcdBackupActive condition to the HostedControlPlane
  • Handles permanent failures (missing credential/pull Secrets) by setting a terminal BackupFailed condition instead of retrying indefinitely
  • Includes count-based retention to delete oldest completed backups when exceeding --etcd-backup-max-count
  • Integration tests for S3, Azure Blob, and invalid-credentials scenarios with self-contained cloud resource setup/teardown via run.sh controller

Dependencies

Jira

CNTRLPLANE-2678

Test plan

  • Unit tests pass (go test ./hypershift-operator/controllers/etcdbackup/...)
  • Integration test: S3 backup completes successfully with snapshotURL, RBAC/NetworkPolicy cleanup, HCP condition propagation
  • Integration test: Azure Blob backup completes successfully
  • Integration test: Invalid credentials set the BackupFailed terminal condition
  • make run-gitlint passes
  • make verify passes
  • CI green

Testing notes

Unit tests

go test ./hypershift-operator/controllers/etcdbackup/... -v

Integration tests

Integration tests run against a live management cluster with a HostedCluster deployed. They require:

  • KUBECONFIG pointing to the management cluster
  • ETCD_BACKUP_TEST_HCP_NAMESPACE set to the HCP namespace (e.g. clusters-my-hcp)
  • The HCPEtcdBackup controller running on the cluster (with the TechPreviewNoUpgrade feature set)
  • AWS CLI authenticated (aws sts get-caller-identity) and/or Azure CLI authenticated (az account show)

The run.sh controller subcommand handles the full cloud resource lifecycle (S3 buckets, Azure storage accounts, K8s Secrets) automatically:

# Run all providers (auto-detects authenticated CLIs)
KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
  ./test/integration/oadp/run.sh controller

# Run S3 tests only
KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
  ./test/integration/oadp/run.sh controller aws

# Run Azure Blob tests only
KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
  ./test/integration/oadp/run.sh controller azure

Alternatively, run Go tests directly with manual env vars (useful for debugging):

KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
ETCD_BACKUP_TEST_S3_BUCKET=my-bucket \
ETCD_BACKUP_TEST_S3_REGION=us-east-2 \
ETCD_BACKUP_TEST_S3_KEY_PREFIX=etcd-backups/test \
ETCD_BACKUP_TEST_S3_CREDENTIALS_SECRET=my-aws-creds \
  go test -tags integration -v -timeout 10m ./test/integration/oadp/controller/...

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added etcd backup capability for Hosted Control Planes with status tracking.
    • Support for S3 and Azure Blob storage backends.
    • Automated backup retention with configurable maximum backup count.
    • Backup status conditions propagated to Hosted Clusters.
  • Tests

    • Added comprehensive integration tests validating etcd backup workflows across storage providers and error scenarios.

jparrill and others added 2 commits March 31, 2026 16:56
…rator

Implements the HCPEtcdBackup reconciler that orchestrates etcd snapshot
and upload Jobs. The controller watches HCPEtcdBackup CRs, validates
etcd health, manages temporary RBAC and NetworkPolicy resources for
cross-namespace access, creates 3-container backup Jobs (fetch-certs,
snapshot, upload), tracks Job status via pod termination messages, and
enforces count-based retention of completed backups.

Adds EtcdBackupActive condition type for HCP-to-HC status propagation.
Controller is registered behind the HCPEtcdBackup feature gate.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…ller

Add integration tests that validate the HCPEtcdBackup controller
end-to-end against a live management cluster with S3 and Azure Blob
storage backends.

Tests:
- S3 happy path: backup completes with snapshotURL, RBAC/NetworkPolicy
  cleaned up, EtcdBackupActive condition propagated to HCP
- Azure Blob happy path: same validations for Azure storage
- Invalid credentials: controller sets BackupFailed instead of
  retrying indefinitely when credential Secret is missing

Controller fix: Reconcile now catches IsNotFound errors from
createBackupJob (missing credential/pull Secret) and sets
BackupCompleted=False/BackupFailed as a terminal condition with
resource cleanup, rather than returning an error that causes
infinite requeue.

run.sh gains a `controller` subcommand that creates/destroys all
cloud resources (S3 buckets, Azure RG/storage/SP, K8s Secrets)
automatically via setup/teardown functions with EXIT trap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jparrill
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from cblecker and csrwng March 31, 2026 15:12
@openshift-ci openshift-ci bot added the labels area/api, area/documentation, area/hypershift-operator, and area/testing, and removed the do-not-merge/needs-area label Mar 31, 2026
@jparrill jparrill changed the title feat(CNTRLPLANE-2678): add HCPEtcdBackup controller to HyperShift Operator CNTRLPLANE-2678: add HCPEtcdBackup controller to HyperShift Operator Mar 31, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Mar 31, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 31, 2026

@jparrill: This pull request references CNTRLPLANE-2678 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jparrill
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash label (denotes a PR that should be squashed by tide when it merges) Mar 31, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

The changes introduce a new etcd backup feature for hosted control planes. A new condition type EtcdBackupActive is added to the API specification. A new HCPEtcdBackupReconciler controller is implemented to manage HCPEtcdBackup resources, orchestrating the full backup lifecycle: validating etcd health, enforcing single active backups per HCP, creating temporary RBAC and NetworkPolicy resources, launching backup jobs with appropriate image resolution, monitoring job completion, extracting snapshot URLs from job logs, propagating completion status to owning HostedControlPlane resources, and enforcing retention policies when backup count exceeds configured limits. The controller is feature-gated and wired into the operator's startup. The backup condition is propagated from HostedControlPlane to HostedCluster resources. Comprehensive test coverage and integration tests validate controller behavior with S3 and Azure storage backends.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Operator
    participant K8s as Kubernetes API
    participant Reconciler as HCPEtcdBackupReconciler
    participant HCP as HostedControlPlane
    participant Etcd as Etcd StatefulSet
    participant Job as Backup Job
    participant Storage as Object Storage
    
    User->>K8s: Create HCPEtcdBackup CR
    K8s->>Reconciler: Reconcile triggered
    
    Reconciler->>K8s: Fetch HCPEtcdBackup
    K8s-->>Reconciler: CR retrieved
    
    Reconciler->>HCP: Lookup HostedControlPlane
    HCP-->>Reconciler: HCP found
    
    Reconciler->>Etcd: Check ReadyReplicas
    Etcd-->>Reconciler: Healthy status
    
    Reconciler->>K8s: Find active backup Jobs
    K8s-->>Reconciler: None found
    
    Reconciler->>K8s: Create ServiceAccount (HO namespace)
    K8s-->>Reconciler: Created
    
    Reconciler->>K8s: Create Role + RoleBinding (HCP namespace)
    K8s-->>Reconciler: Created
    
    Reconciler->>K8s: Create NetworkPolicy (HCP namespace)
    K8s-->>Reconciler: Created
    
    Reconciler->>K8s: Create Backup Job with init containers
    K8s-->>Reconciler: Job created
    
    K8s->>Job: Fetch etcd certs
    Job->>Job: Run etcdctl snapshot save
    
    Job->>Storage: Upload snapshot (etcd-upload container)
    Storage-->>Job: Snapshot URL returned
    
    K8s->>Reconciler: Job completion event
    Reconciler->>Job: Extract snapshot URL from logs
    Job-->>Reconciler: URL extracted
    
    Reconciler->>K8s: Update HCPEtcdBackup status (succeeded)
    K8s-->>Reconciler: Status updated
    
    Reconciler->>HCP: Set EtcdBackupActive=True condition
    HCP-->>Reconciler: Condition propagated
    
    Reconciler->>K8s: Delete NetworkPolicy + RBAC
    K8s-->>Reconciler: Cleanup completed
sequenceDiagram
    participant HostedCluster as HostedCluster Controller
    participant HCP as HostedControlPlane
    participant K8s as Kubernetes API
    participant User as HostedCluster Status
    
    K8s->>HCP: Condition updated (EtcdBackupActive=True)
    HCP-->>K8s: Status condition set
    
    HostedCluster->>K8s: Watch HCP conditions
    K8s-->>HostedCluster: Change detected
    
    HostedCluster->>HCP: Read status.conditions
    HCP-->>HostedCluster: Conditions retrieved
    
    HostedCluster->>K8s: Update HostedCluster status
    Note over HostedCluster: Include EtcdBackupActive condition
    K8s-->>HostedCluster: Status updated
    
    K8s->>User: EtcdBackupActive condition visible

Comment @coderabbitai help to get the list of available commands and usage tips.

@jparrill
Contributor Author

/uncc csrwng

@openshift-ci openshift-ci bot removed the request for review from csrwng March 31, 2026 15:14
@jparrill
Contributor Author

/uncc @cblecker

@openshift-ci openshift-ci bot removed the request for review from cblecker March 31, 2026 15:14
@jparrill
Contributor Author

/auto-cc

@openshift-ci openshift-ci bot requested review from enxebre and muraee March 31, 2026 15:15

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
hypershift-operator/controllers/etcdbackup/reconciler.go (1)

374-378: Consider failing backup if snapshot URL extraction fails.

If getSnapshotURLFromPod fails or returns empty, the backup is still marked as successful but without a usable snapshotURL. This could result in "successful" backups that cannot be used for restore operations.

Consider whether a missing snapshot URL should cause the backup to be marked as failed, or at minimum add a warning condition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 374 -
378, The code currently swallows errors/empty results from getSnapshotURLFromPod
causing backups to appear successful without a usable SnapshotURL; change the
logic in reconciler.go around the getSnapshotURLFromPod call so that when err !=
nil or url == "" you set backup.Status.Phase to a failing state (e.g.,
"Failed"), populate backup.Status.Conditions or a warning condition with a clear
Reason and Message that includes the error or "empty snapshot URL", and persist
the status via the same status update path used elsewhere (e.g.,
r.client.Status().Update or r.updateStatus). Keep the existing successful branch
to set backup.Status.SnapshotURL when present. Ensure the change references
getSnapshotURLFromPod, backup.Status.SnapshotURL, backup.Status.Phase, and the
status update method so the backup is marked failed (or has a warning condition)
when URL extraction fails or is empty.


📥 Commits

Reviewing files that changed from the base of the PR and between 7c04ffa and 3762a05.

⛔ Files ignored due to path filters (3)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (7)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/main.go
  • test/integration/oadp/controller/controller_test.go
  • test/integration/oadp/run.sh

Comment on lines +198 to +202
// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
// or the last backup failed.
EtcdBackupActive ConditionType = "EtcdBackupActive"
Contributor


⚠️ Potential issue | 🟠 Major

EtcdBackupActive is semantically inverted.

The doc says True means “last backup succeeded” and False means “backup in progress or failed”. On an exported *Active condition, consumers will read True as “a backup is currently running”, especially once this is surfaced on HostedCluster status. Please rename this to a success/completion condition or flip the polarity before this API contract ships.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/hypershift/v1beta1/hostedcluster_conditions.go` around lines 198 - 202,
The EtcdBackupActive condition is semantically inverted; change the API to
express success/completion instead of activity by renaming the ConditionType
EtcdBackupActive to something like EtcdBackupSucceeded (or EtcdBackupComplete)
and update its docstring so True means the last backup succeeded and False means
it failed/in progress, then update all uses that set or check EtcdBackupActive
(e.g., HCP/HostedCluster status setters, getters, and any callers) to use the
new EtcdBackupSucceeded name and ensure the boolean polarity is adjusted where
values are set or evaluated so consumers reading HostedCluster status see True
as success.

cmd.Flags().BoolVar(&opts.EnableDedicatedRequestServingIsolation, "enable-dedicated-request-serving-isolation", true, "If true, enables scheduling of request serving components to dedicated nodes")
cmd.Flags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws)")
cmd.Flags().StringVar(&opts.ScaleFromZeroCreds, "scale-from-zero-creds", opts.ScaleFromZeroCreds, "Path to credentials file for scale-from-zero instance type queries")
cmd.Flags().IntVar(&opts.EtcdBackupMaxCount, "etcd-backup-max-count", 5, "Maximum number of completed HCPEtcdBackup CRs to retain per HostedControlPlane")
Contributor


⚠️ Potential issue | 🟠 Major

Validate --etcd-backup-max-count before wiring the controller.

IntVar accepts negative values, and this option is passed straight into MaxBackupCount. For backup retention, the operator should fail fast on invalid bounds instead of leaving behavior to reconciler internals. If 0 is intentional, document that explicitly; otherwise reject values below 1.

Also applies to: 516-523
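The suggested fail-fast check is a one-liner; a hedged sketch with an invented helper name (`validateEtcdBackupMaxCount`), leaving open whether 0 should be an allowed "retention disabled" value:

```go
package main

import (
	"fmt"
	"os"
)

// validateEtcdBackupMaxCount rejects out-of-range values at startup, as the
// review suggests, instead of letting them reach the reconciler. Whether 0
// should mean "retention disabled" is an assumption to settle in the API.
func validateEtcdBackupMaxCount(v int) error {
	if v < 1 {
		return fmt.Errorf("--etcd-backup-max-count must be >= 1, got %d", v)
	}
	return nil
}

func main() {
	if err := validateEtcdBackupMaxCount(-3); err != nil {
		// In the operator this would abort startup before wiring the controller.
		fmt.Fprintln(os.Stderr, err)
	}
}
```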

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/main.go` at line 173, Validate opts.EtcdBackupMaxCount
immediately after parsing flags (where cmd.Flags().IntVar is used) and before
the controller is wired: if the value is less than 1 (unless 0 is intentionally
allowed and documented), return an error and exit startup so invalid bounds
never reach MaxBackupCount in the reconciler; update the startup path that wires
the controller to check opts.EtcdBackupMaxCount (and mirror this validation for
the other flag group mentioned around lines 516-523) and include a clear error
message referencing the flag name.

Comment on lines +426 to +449
local creds_content=""
if [[ -n "${access_key}" && -n "${secret_key}" ]]; then
creds_content="[default]
aws_access_key_id = ${access_key}
aws_secret_access_key = ${secret_key}"
if [[ -n "${session_token}" ]]; then
creds_content="${creds_content}
aws_session_token = ${session_token}"
fi
elif [[ -f "${HOME}/.aws/credentials" ]]; then
creds_content=$(cat "${HOME}/.aws/credentials")
else
err "Cannot determine AWS credentials."
fi

# Create K8s Secret in HO namespace
CTRL_AWS_CREDS_SECRET="etcd-backup-aws-test-${SUFFIX}"
log "Creating K8s Secret ${CTRL_AWS_CREDS_SECRET} in ${ho_namespace}..."
local tmp_creds
tmp_creds=$(mktemp /tmp/etcd-ctrl-aws-creds.XXXXXX)
echo "${creds_content}" > "${tmp_creds}"
oc create secret generic "${CTRL_AWS_CREDS_SECRET}" \
--from-file=credentials="${tmp_creds}" \
-n "${ho_namespace}"
Contributor


⚠️ Potential issue | 🟠 Major

Don't copy the whole local AWS credentials file into the cluster.

The fallback at Lines 435-436 reads ~/.aws/credentials verbatim and then stores it in a Secret. That can include unrelated profiles/accounts and exposes far more credential material than these tests need. Serialize only the active/default profile into the Secret.
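A sketch of the idea in Go rather than the script's own shell: copy only the three relevant keys of the selected profile, re-labeled as [default] since that is what the Secret's consumer reads. `extractAWSProfile` is a name invented for this sketch, not run.sh code:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// extractAWSProfile serializes only the requested profile's keys from an
// ~/.aws/credentials-style file, so the resulting Secret never carries
// unrelated profiles. Output is re-labeled [default] for the consumer.
func extractAWSProfile(credentials, profile string) string {
	wanted := map[string]bool{
		"aws_access_key_id":     true,
		"aws_secret_access_key": true,
		"aws_session_token":     true,
	}
	var out []string
	in := false
	sc := bufio.NewScanner(strings.NewReader(credentials))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			in = line == "["+profile+"]" // entering/leaving the wanted section
			continue
		}
		if !in || line == "" {
			continue
		}
		key := strings.TrimSpace(strings.SplitN(line, "=", 2)[0])
		if wanted[key] {
			out = append(out, line)
		}
	}
	if len(out) == 0 {
		return ""
	}
	return "[default]\n" + strings.Join(out, "\n")
}

func main() {
	creds := "[default]\naws_access_key_id = AKIA1\naws_secret_access_key = s1\n\n[other]\naws_access_key_id = AKIA2\naws_secret_access_key = s2\n"
	fmt.Println(extractAWSProfile(creds, "default"))
}
```

In run.sh itself the same selection could be done with awk over $AWS_PROFILE (defaulting to "default") before writing the temp file fed to oc create secret.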

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/oadp/run.sh` around lines 426 - 449, The fallback currently
copies the entire ~/.aws/credentials into creds_content; change it to extract
only the active profile (use $AWS_PROFILE if set, otherwise "default") and
serialize just that profile's keys (aws_access_key_id, aws_secret_access_key,
aws_session_token if present) into creds_content instead of cat'ing the whole
file; update the branch that sets creds_content (the code touching
access_key/secret_key/session_token, creds_content, tmp_creds, and
CTRL_AWS_CREDS_SECRET) to parse ~/.aws/credentials for the selected profile and
include only those three entries so unrelated profiles/accounts are not included
in the Secret.

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 169 lines in your changes missing coverage. Please review.
✅ Project coverage is 29.76%. Comparing base (7c04ffa) to head (3762a05).
⚠️ Report is 33 commits behind head on main.

Files with missing lines Patch % Lines
...hift-operator/controllers/etcdbackup/reconciler.go 76.31% 114 Missing and 43 partials ⚠️
hypershift-operator/main.go 0.00% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8139      +/-   ##
==========================================
+ Coverage   29.44%   29.76%   +0.31%     
==========================================
  Files        1043     1044       +1     
  Lines       95594    96270     +676     
==========================================
+ Hits        28150    28657     +507     
- Misses      65027    65153     +126     
- Partials     2417     2460      +43     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...trollers/hostedcluster/hostedcluster_controller.go | 42.99% <100.00%> (+0.01%) ⬆️ |
| hypershift-operator/main.go | 0.00% <0.00%> (ø) |
| ...hift-operator/controllers/etcdbackup/reconciler.go | 76.31% <76.31%> (ø) |

@openshift-ci
Contributor

openshift-ci bot commented Mar 31, 2026

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
Member

who consumes this?

EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?

// True means a backup completed successfully; False means a backup is in progress

how does this relate to BackupCompleted? why do they live in different places?

Could you present all possible permutations for these two conditions with their reasons?

Contributor Author

@jparrill jparrill Mar 31, 2026

Who consumes this?

EtcdBackupActive lives on the HCP/HC so users watching HostedCluster status can see backup activity without looking at HCPEtcdBackup CRs directly. The HC controller propagates it from HCP → HC via the existing condition bubble-up mechanism.

EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?

I agree it's confusing. EtcdBackupActive=False meaning "in progress" is counter-intuitive. I followed the existing EtcdRecoveryActive pattern, but the semantics don't map well here since EtcdRecoveryActive=True means "recovery is happening" while here True means "backup succeeded".

Maybe I can rename to EtcdBackupSucceeded:

  • True = last backup succeeded
  • False = in progress or failed

How does it relate to BackupCompleted?

BackupCompleted lives on the HCPEtcdBackup CR (per-backup). EtcdBackupActive lives on the HCP/HC (cluster-level summary). The controller sets EtcdBackupActive on the HCP as a mirror of the latest backup's status.

Permutations

| HCPEtcdBackup BackupCompleted | Reason | HCP EtcdBackupActive | Reason | Meaning |
| --- | --- | --- | --- | --- |
| (not set) | (not set) | (not set) | (not set) | CR just created, not yet reconciled |
| False | EtcdUnhealthy | (not set) | (not set) | etcd StatefulSet not ready, waiting to retry |
| False | BackupAlreadyInProgress | False | BackupAlreadyInProgress | backup Job created and running |
| False | BackupFailed | False | BackupFailed | permanent failure (missing creds, Job failed) |
| True | BackupSucceeded | True | BackupSucceeded | backup completed successfully |
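The permutations above can be sketched as a pure-Go lookup; the phase names and the `conditionPair` type here are illustrative, not the controller's actual API (empty strings stand for "condition not set"):

```go
package main

import "fmt"

// backupPhase enumerates the reconciliation states from the table above.
type backupPhase int

const (
	phasePending       backupPhase = iota // CR created, not yet reconciled
	phaseEtcdUnhealthy                    // etcd StatefulSet not ready
	phaseInProgress                       // backup Job created and running
	phaseFailed                           // permanent failure (missing creds, Job failed)
	phaseSucceeded                        // backup completed successfully
)

// conditionPair holds status/reason for BackupCompleted (on the HCPEtcdBackup
// CR) and EtcdBackupActive (on the HostedControlPlane).
type conditionPair struct {
	backupCompleted, completedReason string
	etcdBackupActive, activeReason   string
}

// conditionsFor maps each phase to the condition pair the controller would set.
func conditionsFor(p backupPhase) conditionPair {
	switch p {
	case phaseEtcdUnhealthy:
		return conditionPair{"False", "EtcdUnhealthy", "", ""}
	case phaseInProgress:
		return conditionPair{"False", "BackupAlreadyInProgress", "False", "BackupAlreadyInProgress"}
	case phaseFailed:
		return conditionPair{"False", "BackupFailed", "False", "BackupFailed"}
	case phaseSucceeded:
		return conditionPair{"True", "BackupSucceeded", "True", "BackupSucceeded"}
	default: // phasePending: neither condition has been set yet
		return conditionPair{}
	}
}

func main() {
	fmt.Println(conditionsFor(phaseSucceeded).etcdBackupActive) // prints "True"
}
```

Encoding the mapping in one place like this also makes the "only two terminal reasons" invariant easy to unit-test.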

Contributor

Maybe I can rename to EtcdBackupSucceeded:

True = last backup succeeded
False = in progress or failed

That sounds better to me as well.

Contributor

I also think that BackupAlreadyInProgressReason could be just BackupInProgressReason with value BackupInProgress

It's a minor detail but since we're just defining the API it's worth mentioning that having "Already" in the name sounds redundant.

hyperv1.HostedClusterRestoredFromBackup,
hyperv1.DataPlaneConnectionAvailable,
hyperv1.ControlPlaneConnectionAvailable,
hyperv1.ConditionType("EtcdBackupActive"),
Contributor

Why not use hyperv1.EtcdBackupActive directly?

pod := &podList.Items[i]
for _, cs := range pod.Status.ContainerStatuses {
if cs.Name == "upload" && cs.State.Terminated != nil && cs.State.Terminated.Message != "" {
return strings.TrimSpace(cs.State.Terminated.Message), nil
Contributor

Maybe it is not necessary, but would it be more stable if etcd-upload wrote a structured message into the termination log (e.g. SNAPSHOT_URL=XXX on a separate line) and we parsed that from the message?
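A sketch of that suggestion, assuming the structured `SNAPSHOT_URL=` line format; `parseSnapshotURL` is a hypothetical helper, and the real reconciler would feed it `cs.State.Terminated.Message`:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSnapshotURL scans a container termination message for a
// "SNAPSHOT_URL=..." line and returns its value, ignoring any other output
// the upload container happened to write.
func parseSnapshotURL(msg string) string {
	for _, line := range strings.Split(msg, "\n") {
		if v, ok := strings.CutPrefix(strings.TrimSpace(line), "SNAPSHOT_URL="); ok {
			return v
		}
	}
	return "" // no structured line found; caller can fall back or report an error
}

func main() {
	msg := "snapshot saved\nSNAPSHOT_URL=s3://bucket/etcd-snapshot.db\n"
	fmt.Println(parseSnapshotURL(msg)) // prints "s3://bucket/etcd-snapshot.db"
}
```

This keeps the controller tolerant of extra log lines in the termination message, instead of assuming the whole trimmed message is the URL.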

logger.Error(err, "failed to update HCP backup condition")
}

return ctrl.Result{RequeueAfter: requeueInterval}, nil
Contributor

Why do we requeue the reconciliation? The controller watches the Job as well, so a Job status change should trigger it automatically, IMO.

mountPathCredentials = "/etc/etcd-backup-creds"

// etcdClientPort is the etcd client port used in the NetworkPolicy and etcdctl endpoint.
etcdClientPort int32 = 2379
Contributor

Nit: It would be cool to put the port in constants, next to https://github.com/openshift/hypershift/blob/main/support/config/constants.go#L30



Labels

area/api: Indicates the PR includes changes for the API
area/documentation: Indicates the PR includes changes for documentation
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
area/testing: Indicates the PR includes changes for e2e testing
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants