CNTRLPLANE-2678: add HCPEtcdBackup controller to HyperShift Operator #8139

jparrill wants to merge 2 commits into openshift:main from
Conversation
…rator

Implements the HCPEtcdBackup reconciler that orchestrates etcd snapshot and upload Jobs. The controller watches HCPEtcdBackup CRs, validates etcd health, manages temporary RBAC and NetworkPolicy resources for cross-namespace access, creates 3-container backup Jobs (fetch-certs, snapshot, upload), tracks Job status via pod termination messages, and enforces count-based retention of completed backups. Adds the EtcdBackupActive condition type for HCP-to-HC status propagation. The controller is registered behind the HCPEtcdBackup feature gate.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ller

Add integration tests that validate the HCPEtcdBackup controller end-to-end against a live management cluster with S3 and Azure Blob storage backends.

Tests:
- S3 happy path: backup completes with snapshotURL, RBAC/NetworkPolicy cleaned up, EtcdBackupActive condition propagated to HCP
- Azure Blob happy path: same validations for Azure storage
- Invalid credentials: controller sets BackupFailed instead of retrying indefinitely when the credential Secret is missing

Controller fix: Reconcile now catches IsNotFound errors from createBackupJob (missing credential/pull Secret) and sets BackupCompleted=False/BackupFailed as a terminal condition with resource cleanup, rather than returning an error that causes infinite requeue.

run.sh gains a `controller` subcommand that creates/destroys all cloud resources (S3 buckets, Azure RG/storage/SP, K8s Secrets) automatically via setup/teardown functions with an EXIT trap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Pipeline controller notification: For optional jobs, comment. This repository is configured in LGTM mode.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jparrill. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@jparrill: This pull request references CNTRLPLANE-2678 which is a valid jira issue.

Details: In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/label tide/merge-method-squash
📝 Walkthrough

The changes introduce a new etcd backup feature for hosted control planes. A new condition type (EtcdBackupActive) is added.

Sequence Diagram(s)

sequenceDiagram
participant User as User/Operator
participant K8s as Kubernetes API
participant Reconciler as HCPEtcdBackupReconciler
participant HCP as HostedControlPlane
participant Etcd as Etcd StatefulSet
participant Job as Backup Job
participant Storage as Object Storage
User->>K8s: Create HCPEtcdBackup CR
K8s->>Reconciler: Reconcile triggered
Reconciler->>K8s: Fetch HCPEtcdBackup
K8s-->>Reconciler: CR retrieved
Reconciler->>HCP: Lookup HostedControlPlane
HCP-->>Reconciler: HCP found
Reconciler->>Etcd: Check ReadyReplicas
Etcd-->>Reconciler: Healthy status
Reconciler->>K8s: Find active backup Jobs
K8s-->>Reconciler: None found
Reconciler->>K8s: Create ServiceAccount (HO namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create Role + RoleBinding (HCP namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create NetworkPolicy (HCP namespace)
K8s-->>Reconciler: Created
Reconciler->>K8s: Create Backup Job with init containers
K8s-->>Reconciler: Job created
K8s->>Job: Fetch etcd certs
Job->>Job: Run etcdctl snapshot save
Job->>Storage: Upload snapshot (etcd-upload container)
Storage-->>Job: Snapshot URL returned
K8s->>Reconciler: Job completion event
Reconciler->>Job: Extract snapshot URL from logs
Job-->>Reconciler: URL extracted
Reconciler->>K8s: Update HCPEtcdBackup status (succeeded)
K8s-->>Reconciler: Status updated
Reconciler->>HCP: Set EtcdBackupActive=True condition
HCP-->>Reconciler: Condition propagated
Reconciler->>K8s: Delete NetworkPolicy + RBAC
K8s-->>Reconciler: Cleanup completed
sequenceDiagram
participant HostedCluster as HostedCluster Controller
participant HCP as HostedControlPlane
participant K8s as Kubernetes API
participant User as HostedCluster Status
K8s->>HCP: Condition updated (EtcdBackupActive=True)
HCP-->>K8s: Status condition set
HostedCluster->>K8s: Watch HCP conditions
K8s-->>HostedCluster: Change detected
HostedCluster->>HCP: Read status.conditions
HCP-->>HostedCluster: Conditions retrieved
HostedCluster->>K8s: Update HostedCluster status
Note over HostedCluster: Include EtcdBackupActive condition
K8s-->>HostedCluster: Status updated
K8s->>User: EtcdBackupActive condition visible
/uncc csrwng
/uncc @cblecker
/auto-cc
Actionable comments posted: 3
🧹 Nitpick comments (1)
hypershift-operator/controllers/etcdbackup/reconciler.go (1)
374-378: Consider failing backup if snapshot URL extraction fails.

If getSnapshotURLFromPod fails or returns empty, the backup is still marked as successful but without a usable snapshotURL. This could result in "successful" backups that cannot be used for restore operations. Consider whether a missing snapshot URL should cause the backup to be marked as failed, or at minimum add a warning condition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 374 - 378, The code currently swallows errors/empty results from getSnapshotURLFromPod causing backups to appear successful without a usable SnapshotURL; change the logic in reconciler.go around the getSnapshotURLFromPod call so that when err != nil or url == "" you set backup.Status.Phase to a failing state (e.g., "Failed"), populate backup.Status.Conditions or a warning condition with a clear Reason and Message that includes the error or "empty snapshot URL", and persist the status via the same status update path used elsewhere (e.g., r.client.Status().Update or r.updateStatus). Keep the existing successful branch to set backup.Status.SnapshotURL when present. Ensure the change references getSnapshotURLFromPod, backup.Status.SnapshotURL, backup.Status.Phase, and the status update method so the backup is marked failed (or has a warning condition) when URL extraction fails or is empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/hypershift/v1beta1/hostedcluster_conditions.go`:
- Around line 198-202: The EtcdBackupActive condition is semantically inverted;
change the API to express success/completion instead of activity by renaming the
ConditionType EtcdBackupActive to something like EtcdBackupSucceeded (or
EtcdBackupComplete) and update its docstring so True means the last backup
succeeded and False means it failed/in progress, then update all uses that set
or check EtcdBackupActive (e.g., HCP/HostedCluster status setters, getters, and
any callers) to use the new EtcdBackupSucceeded name and ensure the boolean
polarity is adjusted where values are set or evaluated so consumers reading
HostedCluster status see True as success.
In `@hypershift-operator/main.go`:
- Line 173: Validate opts.EtcdBackupMaxCount immediately after parsing flags
(where cmd.Flags().IntVar is used) and before the controller is wired: if the
value is less than 1 (unless 0 is intentionally allowed and documented), return
an error and exit startup so invalid bounds never reach MaxBackupCount in the
reconciler; update the startup path that wires the controller to check
opts.EtcdBackupMaxCount (and mirror this validation for the other flag group
mentioned around lines 516-523) and include a clear error message referencing
the flag name.
In `@test/integration/oadp/run.sh`:
- Around line 426-449: The fallback currently copies the entire
~/.aws/credentials into creds_content; change it to extract only the active
profile (use $AWS_PROFILE if set, otherwise "default") and serialize just that
profile's keys (aws_access_key_id, aws_secret_access_key, aws_session_token if
present) into creds_content instead of cat'ing the whole file; update the branch
that sets creds_content (the code touching access_key/secret_key/session_token,
creds_content, tmp_creds, and CTRL_AWS_CREDS_SECRET) to parse ~/.aws/credentials
for the selected profile and include only those three entries so unrelated
profiles/accounts are not included in the Secret.
⛔ Files ignored due to path filters (3)
- docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
- docs/content/reference/api.md is excluded by !docs/content/reference/api.md
- vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (7)
- api/hypershift/v1beta1/hostedcluster_conditions.go
- hypershift-operator/controllers/etcdbackup/reconciler.go
- hypershift-operator/controllers/etcdbackup/reconciler_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/main.go
- test/integration/oadp/controller/controller_test.go
- test/integration/oadp/run.sh
```go
// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
// or the last backup failed.
EtcdBackupActive ConditionType = "EtcdBackupActive"
```
EtcdBackupActive is semantically inverted.
The doc says True means "last backup succeeded" and False means "backup in progress or failed". On an exported *Active* condition, consumers will read True as "a backup is currently running", especially once this is surfaced on HostedCluster status. Please rename this to a success/completion condition or flip the polarity before this API contract ships.
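The success-polarity rename suggested here can be sketched in isolation. This is a minimal, hedged Go sketch, not shipped code: the `ConditionStatus`/`BackupPhase` types are simplified stand-ins for the hyperv1 API, and `EtcdBackupSucceeded` is the proposed name, not an existing constant.

```go
package main

import "fmt"

// Simplified stand-ins for the hyperv1 condition machinery (assumption:
// the real types live in the apimachinery/hypershift API packages).
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type BackupPhase string

const (
	PhaseInProgress BackupPhase = "InProgress"
	PhaseFailed     BackupPhase = "Failed"
	PhaseSucceeded  BackupPhase = "Succeeded"
)

// etcdBackupSucceededCondition maps a backup phase to the proposed
// EtcdBackupSucceeded condition: True only when the last backup succeeded,
// False otherwise, with the Reason disambiguating in-progress from failed.
func etcdBackupSucceededCondition(p BackupPhase) (ConditionStatus, string) {
	switch p {
	case PhaseSucceeded:
		return ConditionTrue, "BackupSucceeded"
	case PhaseFailed:
		return ConditionFalse, "BackupFailed"
	default:
		return ConditionFalse, "BackupInProgress"
	}
}

func main() {
	for _, p := range []BackupPhase{PhaseInProgress, PhaseFailed, PhaseSucceeded} {
		s, r := etcdBackupSucceededCondition(p)
		fmt.Printf("%s => %s (%s)\n", p, s, r)
	}
}
```

With this polarity, True always means success, so HostedCluster consumers never have to know whether a False reading is "running" or "failed" without checking the Reason.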
```go
cmd.Flags().BoolVar(&opts.EnableDedicatedRequestServingIsolation, "enable-dedicated-request-serving-isolation", true, "If true, enables scheduling of request serving components to dedicated nodes")
cmd.Flags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws)")
cmd.Flags().StringVar(&opts.ScaleFromZeroCreds, "scale-from-zero-creds", opts.ScaleFromZeroCreds, "Path to credentials file for scale-from-zero instance type queries")
cmd.Flags().IntVar(&opts.EtcdBackupMaxCount, "etcd-backup-max-count", 5, "Maximum number of completed HCPEtcdBackup CRs to retain per HostedControlPlane")
```
Validate --etcd-backup-max-count before wiring the controller.
IntVar accepts negative values, and this option is passed straight into MaxBackupCount. For backup retention, the operator should fail fast on invalid bounds instead of leaving behavior to reconciler internals. If 0 is intentional, document that explicitly; otherwise reject values below 1.
Also applies to: 516-523
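A fail-fast check along these lines might look like the sketch below. It is only an illustration: `validateEtcdBackupMaxCount` is a hypothetical helper, and the real operator wires its flags through cobra/pflag in main.go rather than a bare loop.

```go
package main

import (
	"fmt"
	"os"
)

// validateEtcdBackupMaxCount rejects out-of-range values for
// --etcd-backup-max-count before the controller is wired, so invalid
// bounds never reach MaxBackupCount in the reconciler.
func validateEtcdBackupMaxCount(n int) error {
	if n < 1 {
		return fmt.Errorf("--etcd-backup-max-count must be >= 1, got %d", n)
	}
	return nil
}

func main() {
	// Demonstrate the check on a few candidate values.
	for _, n := range []int{-1, 0, 5} {
		if err := validateEtcdBackupMaxCount(n); err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Println("ok:", n)
	}
}
```

In the real startup path the error would abort the process (e.g. return from `main` with a non-zero exit) instead of continuing the loop.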
```bash
local creds_content=""
if [[ -n "${access_key}" && -n "${secret_key}" ]]; then
  creds_content="[default]
aws_access_key_id = ${access_key}
aws_secret_access_key = ${secret_key}"
  if [[ -n "${session_token}" ]]; then
    creds_content="${creds_content}
aws_session_token = ${session_token}"
  fi
elif [[ -f "${HOME}/.aws/credentials" ]]; then
  creds_content=$(cat "${HOME}/.aws/credentials")
else
  err "Cannot determine AWS credentials."
fi

# Create K8s Secret in HO namespace
CTRL_AWS_CREDS_SECRET="etcd-backup-aws-test-${SUFFIX}"
log "Creating K8s Secret ${CTRL_AWS_CREDS_SECRET} in ${ho_namespace}..."
local tmp_creds
tmp_creds=$(mktemp /tmp/etcd-ctrl-aws-creds.XXXXXX)
echo "${creds_content}" > "${tmp_creds}"
oc create secret generic "${CTRL_AWS_CREDS_SECRET}" \
  --from-file=credentials="${tmp_creds}" \
  -n "${ho_namespace}"
```
Don't copy the whole local AWS credentials file into the cluster.
The fallback at Lines 435-436 reads ~/.aws/credentials verbatim and then stores it in a Secret. That can include unrelated profiles/accounts and exposes far more credential material than these tests need. Serialize only the active/default profile into the Secret.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #8139 +/- ##
==========================================
+ Coverage 29.44% 29.76% +0.31%
==========================================
Files 1043 1044 +1
Lines 95594 96270 +676
==========================================
+ Hits 28150 28657 +507
- Misses 65027 65153 +126
- Partials 2417 2460 +43
@jparrill: all tests passed! Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
// EtcdBackupActive bubbles up from HCP. It indicates whether an etcd backup
// is currently in progress or reports the result of the most recent backup.
// True means a backup completed successfully; False means a backup is in progress
```
who consumes this?
EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?
// True means a backup completed successfully; False means a backup is in progress
how does this relate to BackupCompleted? why do they live in different places?
Could you present all possible permutations for these two conditions with their reasons?
Who consumes this?
EtcdBackupActive lives on the HCP/HC so users watching HostedCluster status can see backup activity without looking at HCPEtcdBackup CRs directly. The HC controller propagates it from HCP → HC via the existing condition bubble-up mechanism.
EtcdBackupActive=false meaning a backup is in progress seems pretty counter intuitive?
I agree it's confusing. EtcdBackupActive=False meaning "in progress" is counter-intuitive. I followed the existing EtcdRecoveryActive pattern, but the semantics don't map well here since EtcdRecoveryActive=True means "recovery is happening" while here True means "backup succeeded".
Maybe I can rename to EtcdBackupSucceeded:

- True = last backup succeeded
- False = in progress or failed
How does it relate to BackupCompleted?
BackupCompleted lives on the HCPEtcdBackup CR (per-backup). EtcdBackupActive lives on the HCP/HC (cluster-level summary). The controller sets EtcdBackupActive on the HCP as a mirror of the latest backup's status.
Permutations
| HCPEtcdBackup BackupCompleted | Reason | HCP EtcdBackupActive | Reason | Meaning |
|---|---|---|---|---|
| (not set) | — | (not set) | — | CR just created, not yet reconciled |
| False | EtcdUnhealthy | (not set) | — | etcd StatefulSet not ready, waiting to retry |
| False | BackupAlreadyInProgress | False | BackupAlreadyInProgress | backup Job created and running |
| False | BackupFailed | False | BackupFailed | permanent failure (missing creds, Job failed) |
| True | BackupSucceeded | True | BackupSucceeded | backup completed successfully |
> Maybe I can rename to EtcdBackupSucceeded:
> True = last backup succeeded; False = in progress or failed

That sounds better to me as well.
I also think that BackupAlreadyInProgressReason could be just BackupInProgressReason with value BackupInProgress
It's a minor detail but since we're just defining the API it's worth mentioning that having "Already" in the name sounds redundant.
```go
hyperv1.HostedClusterRestoredFromBackup,
hyperv1.DataPlaneConnectionAvailable,
hyperv1.ControlPlaneConnectionAvailable,
hyperv1.ConditionType("EtcdBackupActive"),
```
Why not use hyperv1.EtcdBackupActive directly?
```go
pod := &podList.Items[i]
for _, cs := range pod.Status.ContainerStatuses {
	if cs.Name == "upload" && cs.State.Terminated != nil && cs.State.Terminated.Message != "" {
		return strings.TrimSpace(cs.State.Terminated.Message), nil
```
Maybe it is not necessary, but would it be more stable if etcd-upload wrote a structured message into the termination log (like SNAPSHOT_URL=XXX on a separate line) and we parsed that from the message?
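The structured-message idea could look like this minimal sketch. Note `SNAPSHOT_URL=` is the hypothetical key from the comment above, not an existing contract in the controller.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSnapshotURL scans a container termination message for a
// "SNAPSHOT_URL=" line, so free-form log text around it cannot be
// mistaken for the snapshot URL.
func parseSnapshotURL(terminationMessage string) (string, bool) {
	for _, line := range strings.Split(terminationMessage, "\n") {
		line = strings.TrimSpace(line)
		if v, ok := strings.CutPrefix(line, "SNAPSHOT_URL="); ok && v != "" {
			return v, true
		}
	}
	return "", false
}

func main() {
	msg := "upload finished in 3s\nSNAPSHOT_URL=s3://bucket/etcd-backups/snap.db"
	url, ok := parseSnapshotURL(msg)
	fmt.Println(ok, url)
}
```

The upload container would then emit that key on its own line in the termination message, and the reconciler's trim-the-whole-message logic would be replaced by this key lookup.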
```go
	logger.Error(err, "failed to update HCP backup condition")
}

return ctrl.Result{RequeueAfter: requeueInterval}, nil
```
Why do we requeue the reconciliation? The controller watches the Job as well, so Job status changes should trigger it automatically IMO.
```go
mountPathCredentials = "/etc/etcd-backup-creds"

// etcdClientPort is the etcd client port used in the NetworkPolicy and etcdctl endpoint.
etcdClientPort int32 = 2379
```
Nit: It would be cool to put the port in constants, next to https://github.com/openshift/hypershift/blob/main/support/config/constants.go#L30
Summary
- Adds the HCPEtcdBackup controller to the HyperShift Operator, implementing the reconciliation flow defined in Enhancement PR #1945
- Watches HCPEtcdBackup CRs and orchestrates a 3-container backup Job (fetch-etcd-certs → etcdctl snapshot → etcd-upload) with temporary RBAC and NetworkPolicy for cross-namespace etcd access
- Tracks Job status (BackupCompleted) and propagates the EtcdBackupActive condition to the HostedControlPlane
- Sets a BackupFailed terminal condition instead of retrying indefinitely
- Adds the --etcd-backup-max-count flag and the run.sh controller subcommand

Dependencies
- etcd-backup CPO subcommand (merged)
- fetch-etcd-certs CPO subcommand (merged)
- etcd-upload CPO subcommand (merged)

Jira
CNTRLPLANE-2678
Test plan
- Unit tests pass (go test ./hypershift-operator/controllers/etcdbackup/...)
- make run-gitlint passes
- make verify passes

Testing notes
Unit tests

go test ./hypershift-operator/controllers/etcdbackup/... -v

Integration tests
Integration tests run against a live management cluster with a HostedCluster deployed. They require:
- KUBECONFIG pointing to the management cluster
- ETCD_BACKUP_TEST_HCP_NAMESPACE set to the HCP namespace (e.g. clusters-my-hcp)
- The HCPEtcdBackup feature gate enabled (TechPreviewNoUpgrade feature set)
- AWS CLI authenticated (aws sts get-caller-identity) and/or Azure CLI authenticated (az account show)

The run.sh controller subcommand handles all cloud resource lifecycle (S3 buckets, Azure storage accounts, K8s Secrets) automatically.

Alternatively, run Go tests directly with manual env vars (useful for debugging):
```bash
KUBECONFIG=/path/to/kubeconfig \
ETCD_BACKUP_TEST_HCP_NAMESPACE=clusters-my-hcp \
ETCD_BACKUP_TEST_S3_BUCKET=my-bucket \
ETCD_BACKUP_TEST_S3_REGION=us-east-2 \
ETCD_BACKUP_TEST_S3_KEY_PREFIX=etcd-backups/test \
ETCD_BACKUP_TEST_S3_CREDENTIALS_SECRET=my-aws-creds \
go test -tags integration -v -timeout 10m ./test/integration/oadp/controller/...
```

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests