refactor(hostedcluster): segregate reconcile loop into error-collecting blocks #7908

muraee wants to merge 1 commit into openshift:main
Conversation
Pipeline controller notification: this repository is configured in LGTM mode.
Important: Review skipped. Auto reviews are limited based on label configuration; only excluded labels are configured. Please check the settings in the CodeRabbit UI or the run configuration. Configuration used: Repository YAML (base), Organization UI (inherited). Review profile: CHILL. Plan: Pro.
📝 Walkthrough

This pull request introduces a phased, modular refactoring of the HostedCluster reconciliation loop. It adds a design document analyzing reconciliation segregation, restructures the controller into nine sequential phases with independent error aggregation, and introduces multiple helper functions to isolate functionality. It also includes a signature change.
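The core idea, collecting errors per phase instead of returning on the first failure, can be sketched in plain Go. This toy uses the stdlib `errors.Join` in place of `utilerrors.NewAggregate`, and the helper names (`syncSSHKey`, `deployCPO`) are illustrative stand-ins, not the controller's real methods:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the real sub-reconcilers (not HyperShift code).
var cpoDeployed bool

func syncSSHKey() error { return errors.New("ssh key secret not found") }
func deployCPO() error  { cpoDeployed = true; return nil }

func reconcile() error {
	var errs []error
	// Phase 6: independent sync. Collect the error instead of returning early.
	if err := syncSSHKey(); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync ssh key: %w", err))
	}
	// Phase 8: unrelated component work still runs despite the phase-6 failure.
	if err := deployCPO(); err != nil {
		errs = append(errs, err)
	}
	// errors.Join stands in for utilerrors.NewAggregate in this sketch.
	return errors.Join(errs...)
}

func main() {
	err := reconcile()
	fmt.Println(cpoDeployed) // the CPO was still deployed
	fmt.Println(err != nil)  // and the ssh failure is still surfaced
}
```

Under the old short-circuiting style, `deployCPO` would never have run once `syncSSHKey` failed; here both outcomes are visible in the aggregated error.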
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~50 minutes.

🚥 Pre-merge checks: ✅ 3 passed, ❌ 2 failed (2 warnings).
/test e2e-aws
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/hostedcluster-reconcile-segregation-analysis.md`:
- Around line 112-190: The markdown fenced block containing the ASCII diagram
(the block that begins with
"+-----------------------------------------------------+" and includes "CRITICAL
PREREQUISITES (must succeed first)") needs a language label to satisfy
markdownlint: change the opening fence from ``` to ```text so the diagram is
fenced as a text block; update the single fenced block in the file (the ASCII
diagram between the backticks) accordingly.
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 1726-1728: The code currently checks and then removes the
HostedClusterRestoredFromBackupAnnotation from hcluster before writing the
durable status (ReconciliationSucceeded/HostedClusterRestoredFromBackup
condition), which can lose the trigger if the status update fails; change the
flow so you do not consume/remove HostedClusterRestoredFromBackupAnnotation
until after the status write is confirmed: first set the
HostedClusterRestoredFromBackup condition on the HostedCluster status and
perform the status update (updateStatus on hcluster), retrying on conflict as
needed, and only after the status update succeeds remove the
HostedClusterRestoredFromBackupAnnotation (or perform the annotation removal in
a separate patch/update with proper conflict handling) so the reconcile will
retry if the status write failed.
- Around line 1456-1483: If reconcileCoreHCPChain failed and hcp is nil, phase‑8
helpers will dereference hcp and panic; guard the entire phase‑8 block by
checking if hcp == nil and, if so, append a recoverable error to componentErrs
(e.g. fmt.Errorf("skipping phase 8: HostedControlPlane is nil due to earlier
error")) and skip calling reconcileKubeconfigAndPasswordSync,
reconcileOperatorDeployments, reconcileRBACAndPolicies,
reconcilePlatformSpecific, and reconcileAuxiliary; otherwise run the existing
calls as before.
- Around line 2161-2174: The status fields holding secret references
(hcluster.Status.CustomKubeconfig and hcluster.Status.KubeadminPassword) are
only being cleared in memory; after deleting the Secrets you must also persist
those changes to the API by clearing the fields on the HostedCluster status and
calling the Status().Update (or Client.Status().Update) to save them. Modify the
branch that deletes the custom kubeconfig (and the other branch mentioned around
the KubeadminPassword) to set hcluster.Status.CustomKubeconfig = nil and/or
hcluster.Status.KubeadminPassword = nil as appropriate and then call
r.Status().Update(ctx, hcluster) (handling and returning any error) so the API
no longer holds dangling secret refs; use the existing DeleteIfNeeded flow and
ensure both branches behave the same way.
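The phase-8 nil guard suggested in the findings above can be sketched as follows. The type and helper are simplified stand-ins, not the controller's actual signatures:

```go
package main

import "fmt"

// Simplified stand-in for hyperv1.HostedControlPlane.
type HostedControlPlane struct{ Name string }

// Illustrative phase-8 helper that dereferences hcp.
func reconcileOperatorDeployments(hcp *HostedControlPlane) []error {
	_ = hcp.Name // would panic if hcp were nil
	return nil
}

// runPhase8 skips the whole block when the core HCP chain left hcp nil,
// recording a recoverable error instead of panicking.
func runPhase8(hcp *HostedControlPlane, componentErrs []error) []error {
	if hcp == nil {
		return append(componentErrs,
			fmt.Errorf("skipping phase 8: HostedControlPlane is nil due to earlier error"))
	}
	return append(componentErrs, reconcileOperatorDeployments(hcp)...)
}

func main() {
	errs := runPhase8(nil, nil) // no panic: the guard records the skip
	fmt.Println(len(errs))
	fmt.Println(errs[0])
}
```

The guard sits in front of the whole block, so every phase-8 helper (not just the one shown) is skipped together when the prerequisite is missing.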
ℹ️ Review info

- Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
- Review profile: CHILL
- Plan: Pro
- Run ID: 33f6ef88-9d24-48cc-8609-666d2cca5d82
📒 Files selected for processing (3)

- docs/design/hostedcluster-reconcile-segregation-analysis.md
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
```text
+-----------------------------------------------------+
| CRITICAL PREREQUISITES (must succeed first)         |
| P1: Get HCP                                         |
| S2: ICSP/IDMS registry                              |
| S3: Pull secret bytes                               |
| S4+S5: CPO image + labels                           |
| S27: Lookup release image -> S26: payload arch      |
| R22: Parse release image version                    |
| R10: Get platform                                   |
+------------------------+----------------------------+
                         |
      +------------------+---------------------------+
      |                  |                           |
      v                  v                           v
+-----------------+ +--------------------------+ +--------------------------+
| BLOCK A:        | | BLOCK B:                 | | BLOCK C:                 |
| Status Updates  | | Namespace & Foundation   | | Validation Gates         |
| (S1-S30)        | | R1: Finalizer            | | R7: Block on invalid     |
| Mostly reads,   | | R2: CAPI finalizers      | |     config/release       |
| independent     | | R4: Cluster IDs          | |                          |
| from each other | | R5: CLI secrets          | | (depends on S18,S19,S25) |
|                 | | R6: AWS resource tags    | |                          |
|                 | | R9: CP namespace --------+->|                        |
+-----------------+ +-----------+--------------+ +-----------+--------------+
                                |                            |
                                | (R9: CP namespace exists)  |
                                v                            |
            +------------------------------------------+     |
            | BLOCK D: Secret & ConfigMap Sync         |     |
            | R11: Platform credentials                |     |
            | R13: Pull secret sync                    |     |
            | R14: Secret encryption                   |     |
            | R15: Audit webhook                       |     |
            | R16: SSH key                             |     |
            | R17: AdditionalTrustBundle               |     |
            | R18: SA signing key                      |     |
            | R19: Unmanaged etcd MTLS                 |     |
            | R21: Global config CMs/secrets           |     |
            |                                          |     |
            | *** ALL INDEPENDENT OF EACH OTHER ***    |     |
            +---------------------+--------------------+     |
                                  |                          |
                                  v                          |
            +------------------------------------------+     |
            | BLOCK E: Core Control Plane              |<----+
            | R23: Reconcile HCP object                |
            | R24: Reconcile CAPI Infra CR (-> R23)    |
            | R25: Reconcile AWS subnets (-> R24)      |
            | R27: Reconcile CAPI Cluster (-> R23,R24) |
            |                                          |
            | *** STRICTLY SEQUENTIAL ***              |
            +---------------------+--------------------+
                                  |
            +--------------+------+--------+---------------+
            v              v               v               v
+--------------+   +------------+   +-------------+   +--------------+
| BLOCK F:     |   | BLOCK G:   |   | BLOCK H:    |   | BLOCK I:     |
| Operators    |   | RBAC &     |   | Kubeconfig  |   | Platform-    |
|              |   | Policies   |   | & Secrets   |   | specific     |
| R36: CPO     |   | R26: Prom  |   | R29: kube-  |   | R41: OIDC/   |
| R37: CAPI    |   |      RBAC  |   |      config |   |  SecretProv  |
|     manager  |   | R39: PKI   |   | R30: custom |   |  KubeVirt    |
| R38: CAPI    |   |      RBAC  |   |    kubeconf |   |              |
|     provider |   | R40: Net   |   | R31: admin  |   |              |
| R42: Karpent |   |      Pol.  |   |    password |   |              |
+--------------+   +------------+   +-------------+   +--------------+
                                  |
                                  v
+-------------------------------------------------+
| BLOCK J: Auxiliary / Monitoring                 |
| R12: Restored-from-backup condition             |
| R20: ETCD member recovery                       |
| R28: Monitoring dashboard                       |
| R33: SRE metrics config                         |
| R34: OpenShift trusted CAs                      |
|                                                 |
| *** LARGELY INDEPENDENT ***                     |
+-------------------------------------------------+
```
Add a language to this fenced block.
The diagram fence is currently unlabeled, which trips markdownlint and makes rendering/tooling less predictable. Use a language like text for this block.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
```go
fakeClient := fake.NewClientBuilder().
	WithScheme(api.Scheme).
	WithObjects(hcluster, hcp).
	Build()

r := &HostedClusterReconciler{
	Client:                        fakeClient,
	EnableOCPClusterMonitoring:    true,
	ManagementClusterCapabilities: &fakecapabilities.FakeSupportAllCapabilities{},
}
createOrUpdate := upsert.New(false).CreateOrUpdate
log := zap.New(zap.UseDevMode(true), zap.Level(zapcore.InfoLevel))
releaseVersion := semver.MustParse("4.16.0")

_ = r.reconcileRBACAndPolicies(t.Context(), log, createOrUpdate, hcluster, hcp,
	true, // controlPlanePKIOperatorSignsCSRs — enables PKI RBAC
	false, releaseVersion)
```
This test never actually exercises the PKI-RBAC failure path.
Nothing in this setup forces reconcilePKIOperatorRBAC to fail, and the collected errors are discarded, so the test still passes when every sub-reconcile succeeds. That means it doesn't prove the isolation behavior in the test name. Make the PKI RBAC branch fail deterministically and assert that failure is returned while the Prometheus Role/RoleBinding are still created.
```go
if _, exists := hcluster.Annotations[hyperv1.HostedClusterRestoredFromBackupAnnotation]; !exists {
	return nil
}
```
Don't consume the restore annotation before the condition is durable.
Line 1757 removes HostedClusterRestoredFromBackupAnnotation before the status write. If that update conflicts/fails and ReconciliationSucceeded is unchanged, the next reconcile exits at Line 1726 and never retries publishing HostedClusterRestoredFromBackup.
Also applies to: 1756-1770
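The persist-then-consume ordering this comment asks for can be sketched with a toy cluster object. Everything here (the map-based types, the `failWrite` knob) is illustrative, not the controller's real API:

```go
package main

import "fmt"

// Toy cluster: annotations are triggers; conditions are durable status.
type cluster struct {
	annotations map[string]bool
	conditions  map[string]bool
}

// statusUpdate simulates the API status write; fail controls the outcome.
func statusUpdate(c *cluster, cond string, fail bool) error {
	if fail {
		return fmt.Errorf("conflict: status update failed")
	}
	c.conditions[cond] = true
	return nil
}

// Persist the condition first; only consume the annotation after the write
// succeeded, so a failed write leaves the trigger in place for retry.
func reconcileRestore(c *cluster, failWrite bool) error {
	if !c.annotations["restored-from-backup"] {
		return nil // nothing to do
	}
	if err := statusUpdate(c, "HostedClusterRestoredFromBackup", failWrite); err != nil {
		return err // annotation untouched, so the next reconcile retries
	}
	delete(c.annotations, "restored-from-backup")
	return nil
}

func main() {
	c := &cluster{
		annotations: map[string]bool{"restored-from-backup": true},
		conditions:  map[string]bool{},
	}
	_ = reconcileRestore(c, true) // simulated write failure
	fmt.Println(c.annotations["restored-from-backup"]) // trigger preserved
	_ = reconcileRestore(c, false)
	fmt.Println(c.conditions["HostedClusterRestoredFromBackup"])
}
```

Deleting the annotation before the status write inverts this: a conflict would drop the trigger and the condition would never be published.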
```go
} else {
	// Delete the custom external kubeconfig secret if it exists and the external name is not set
	if hcluster.Status.CustomKubeconfig != nil {
		customKubeconfig := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Namespace: hcluster.Namespace,
				Name:      hcluster.Status.CustomKubeconfig.Name,
			},
		}
		if _, err := hyperutil.DeleteIfNeeded(ctx, r.Client, customKubeconfig); err != nil {
			return fmt.Errorf("failed to delete custom external kubeconfig secret %q: %w", client.ObjectKeyFromObject(customKubeconfig), err)
		}
		hcluster.Status.CustomKubeconfig = nil
	}
```
Clear and persist the HostedCluster secret refs when these synced secrets are removed.
These branches run after phase 4 has already written status. CustomKubeconfig is only nilled in memory, and KubeadminPassword is never cleared at all, so the API can keep dangling references to deleted secrets indefinitely.
Also applies to: 2213-2225
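Why the in-memory clear is not enough can be shown with a toy API server. The `api` struct and `statusUpdate` method are illustrative stand-ins for the real client's `Status().Update`:

```go
package main

import "fmt"

type secretRef struct{ Name string }

type status struct{ CustomKubeconfig *secretRef }

// api stands in for the API server's stored copy of the status.
type api struct{ persisted status }

func (a *api) statusUpdate(s status) error { a.persisted = s; return nil }

func main() {
	srv := &api{persisted: status{CustomKubeconfig: &secretRef{Name: "custom-kubeconfig"}}}
	local := srv.persisted // controller's in-memory copy

	// ... the Secret itself has been deleted via DeleteIfNeeded ...

	// Clearing only the local copy leaves a dangling ref on the server:
	local.CustomKubeconfig = nil
	fmt.Println(srv.persisted.CustomKubeconfig != nil) // still dangling

	// The fix: persist the cleared field with a status update.
	if err := srv.statusUpdate(local); err == nil {
		fmt.Println(srv.persisted.CustomKubeconfig == nil) // ref removed
	}
}
```

The same pattern applies to `KubeadminPassword`: nil the field, then write status, so the API never keeps a reference to a deleted Secret.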
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: muraee. The full list of commands accepted by this bot can be found here; the pull request process is described here.

Needs approval from an approver in each of these files.
> - AWS resource tags error -> HCP never updated
>
> **After (proposed)**: Only genuinely dependent operations block each other. Examples:
> - SSH key secret missing -> SSH key not synced (reported via condition), but CPO still
As of today, I see SSH as a dependency for the data plane, since changing it would trigger a rollout. Is this still the case?
No behavior change; this only lets other control-plane operations that are not affected by the missing SSH key proceed.
This should also help decouple the data-plane/control-plane upgrades we are working on. Otherwise, a control-plane upgrade will not proceed if the SSH key (a data-plane input) is missing.
> **After (proposed)**: Only genuinely dependent operations block each other. Examples:
> - SSH key secret missing -> SSH key not synced (reported via condition), but CPO still deploys, HCP still reconciles, CAPI still works
> - Monitoring dashboard error -> reported, but karpenter and network policies proceed normally
Can we include more examples and have test cases for all of them?
Was there a Jira bug we can reference that reports the scenario where this was problematic for managed?
> ## Key Problems Identified
>
> 1. **Secret sync operations block each other** (R13-R21): A missing SSH key secret prevents pull secret sync, secret encryption setup, ETCD MTLS, and global config sync. These are
Why would we want to proceed with the CPO if the pull secret is invalid?
Two reasons:

- The image could already be available in the cluster, so an invalid pull secret doesn't necessarily block the deployment.
- The image pull error would be reported in the CPO deployment, making it easier to discover than going through the HO logs.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report

```text
@@            Coverage Diff             @@
##             main    #7908      +/-  ##
=========================================
+ Coverage   26.85%   27.03%   +0.17%
=========================================
  Files        1090     1091       +1
  Lines      105254   105800     +546
=========================================
+ Hits        28263    28598     +335
- Misses      74563    74767     +204
- Partials     2428     2435       +7
```
```text
refactor(hostedcluster): segregate reconcile loop into error-collecting blocks

The reconcile() method executes ~50 sequential operations where every error
causes an early return, short-circuiting all remaining work. An unrelated
failure (e.g., a missing SSH key secret) prevents critical operations like
deploying the CPO or reconciling the HCP object.

This refactoring:
- Extracts 12 inline blocks into named methods (reconcilePullSecretSync,
  reconcileSSHKeySync, reconcileSecretEncryptionSync, etc.)
- Groups operations into phased error-collecting blocks:
  - Phase 6: Independent secret/config sync (syncErrs)
  - Phase 7: Core HCP chain (coreErrs)
  - Phase 8: Component groups via wrapper methods (componentErrs)
- Introduces 5 wrapper methods returning []error:
  reconcileKubeconfigAndPasswordSync, reconcileOperatorDeployments,
  reconcileRBACAndPolicies, reconcileAuxiliary, reconcilePlatformSpecific
- Aggregates all errors with utilerrors.NewAggregate at the end
- Moves reconcileCLISecrets into reconcileAuxiliary (error-collecting)

After this change, failures in one phase no longer block unrelated phases.
For example, a missing SSH key no longer prevents CPO deployment or HCP
object creation.

Includes the analysis document and integration tests that verify
non-blocking behavior across phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
@muraee: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary
- Refactors `reconcile()` in the HostedCluster controller to use error-collecting blocks instead of sequential short-circuiting. Previously, any single failure among ~50 operations would block all subsequent work; for example, a missing SSH key secret prevented CPO deployment and HCP creation.
- Extracts inline logic into named helper methods and five wrapper methods (`reconcileOperatorDeployments`, `reconcileRBACAndPolicies`, `reconcileKubeconfigAndPasswordSync`, `reconcileAuxiliary`, `reconcilePlatformSpecific`) that collect errors independently.
- Aggregates all collected errors with `utilerrors.NewAggregate()`, ensuring unrelated failures don't cascade. Phases 0-5 (prerequisites, deletion, validation gates) retain short-circuiting since they are genuine dependencies.

Key changes

- Phase 6: independent secret/config sync (`syncErrs`)
- Phase 7: core HCP chain (`coreErrs`)
- Phase 8: component groups via wrapper methods (`componentErrs`)
See
docs/design/hostedcluster-reconcile-segregation-analysis.mdfor the full dependency analysis and operation map.Test plan
Test plan

- `go test -count=1 -race ./hypershift-operator/controllers/hostedcluster/` passes
- `make verify` passes (the codespell failure is pre-existing, caused by untracked files)
- New unit tests for the wrapper methods (`TestReconcileKubeconfigAndPasswordSync_*`, `TestReconcileRBACAndPolicies_*`)
- An integration test (`TestReconcile_WhenPhase6SSHKeyFails_ItShouldStillCreateHCPAndCPODeployment`) runs the full `reconcile()` with a broken Phase 6 and verifies that the HCP object (Phase 7) and the CPO Deployment (Phase 8) are still created

🤖 Generated with Claude Code
Summary by CodeRabbit

- Documentation
- Refactor
- Tests