
refactor(hostedcluster): segregate reconcile loop into error-collecting blocks#7908

Open
muraee wants to merge 1 commit into openshift:main from muraee:refactor/hostedcluster-reconcile-error-collecting

Conversation

@muraee
Contributor

@muraee muraee commented Mar 10, 2026

Summary

  • Refactors reconcile() in the HostedCluster controller to use error-collecting blocks instead of sequential short-circuiting. Previously, any single failure among ~50 operations would block all subsequent work — e.g., a missing SSH key secret prevented CPO deployment and HCP creation.
  • Extracts 12 inline code blocks into named methods and introduces 5 wrapper methods (reconcileOperatorDeployments, reconcileRBACAndPolicies, reconcileKubeconfigAndPasswordSync, reconcileAuxiliary, reconcilePlatformSpecific) that collect errors independently.
  • Phases 6–8 now aggregate errors via utilerrors.NewAggregate(), ensuring unrelated failures don't cascade. Phases 0–5 (prerequisites, deletion, validation gates) retain short-circuiting since they are genuine dependencies.

Key changes

| Phase | Behavior | Operations |
| --- | --- | --- |
| 0–5 | Short-circuit (prerequisites) | HCP get, deletion, platform defaults, status, finalizers, namespace, platform |
| 6 | Error-collecting (`syncErrs`) | Platform creds, pull secret, SSH key, secret encryption, audit webhook, trust bundle, SA signing key, etcd MTLS, global config (11 operations) |
| 7 | Error-collecting (`coreErrs`) | HCP object → CAPI InfraCR → CAPI Cluster (sequential within, independent from Phases 6/8) |
| 8 | Error-collecting (`componentErrs`) | CPO/CAPI/Karpenter deployments, RBAC/network policies, kubeconfig/password sync, monitoring, platform-specific (5 independent wrapper groups) |

Analysis

See docs/design/hostedcluster-reconcile-segregation-analysis.md for the full dependency analysis and operation map.

Test plan

  • All existing unit tests pass (go test -count=1 -race ./hypershift-operator/controllers/hostedcluster/)
  • make verify passes (the codespell failure is pre-existing, from untracked files)
  • New unit tests for wrapper method isolation (TestReconcileKubeconfigAndPasswordSync_*, TestReconcileRBACAndPolicies_*)
  • New integration test (TestReconcile_WhenPhase6SSHKeyFails_ItShouldStillCreateHCPAndCPODeployment) runs the full reconcile() with a broken Phase 6, verifies HCP object (Phase 7) and CPO Deployment (Phase 8) are still created

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Added comprehensive design documentation analyzing HostedCluster reconciliation flow, dependency chains, and operational bottlenecks.
  • Refactor

    • Restructured reconciliation into modular phases enabling independent operations to continue despite partial failures; improved error aggregation and reporting.
  • Tests

    • Added test cases validating that component failures don't block unrelated operations, ensuring robust partial-failure handling.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all jobs from the second stage, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 77172037-4e72-4b98-b1bc-f259e5744919

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request introduces a phased, modular refactoring of the HostedCluster reconciliation loop. It adds a design document analyzing reconciliation segregation, restructures the controller into nine sequential phases with independent error aggregation, and introduces multiple helper functions to isolate functionality. A signature change removes the defaultIngressDomain parameter from reconcileControlPlaneOperator, and new tests validate partial progress when operations fail.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design Documentation**<br>`docs/design/hostedcluster-reconcile-segregation-analysis.md` | New design document detailing reconciliation segregation analysis, including operation map splits (Pre-requisite, Part One, Part Two), dependency graphs, identified blocking issues, and an impact assessment showing partial-progress scenarios. |
| **Controller Refactoring**<br>`hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` | Major restructuring into 9 phases: initialization, pre-deletion propagation, deletion handling, conversion fixes, status updates, prerequisites, and three independent phase blocks. Introduces 15+ new modular helper functions (e.g., reconcileCoreHCPChain, reconcileOperatorDeployments, reconcilePlatformCredentialsWithStatus), changes the reconcileControlPlaneOperator signature (removes the defaultIngressDomain parameter), and implements aggregated error collection across independent syncs. |
| **Test Coverage**<br>`hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` | Adds three new tests validating resilient reconciliation: kubeconfig sync failure with continued kubeadmin-password sync, RBAC failure with continued Prometheus RBAC creation, and Phase 6 SSH key failure with continued Phase 7–8 completion. Includes the rbacv1 import for RBAC assertions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 2 failed

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Three new tests lack context timeouts, have incomplete coverage for failure scenarios, and include inconsistent assertion messages. | Add context timeouts using context.WithTimeout(), mock dependencies to force failures in the PKI RBAC test, and add meaningful messages to all assertions. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically describes the main refactoring: segregating the reconcile loop into error-collecting blocks to prevent full short-circuiting on errors. |
| Stable And Deterministic Test Names | ✅ Passed | The pull request uses Go's standard testing package (func TestXxx) rather than Ginkgo, so the Ginkgo test-title stability check does not apply. Test function names are descriptive and static, with no dynamic values. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@muraee
Contributor Author

muraee commented Mar 10, 2026

/test e2e-aws

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/hostedcluster-reconcile-segregation-analysis.md`:
- Around line 112-190: The markdown fenced block containing the ASCII diagram
(the block that begins with
"+-----------------------------------------------------+" and includes "CRITICAL
PREREQUISITES (must succeed first)") needs a language label to satisfy
markdownlint: change the opening fence from ``` to ```text so the diagram is
fenced as a text block; update the single fenced block in the file (the ASCII
diagram between the backticks) accordingly.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 1726-1728: The code currently checks and then removes the
HostedClusterRestoredFromBackupAnnotation from hcluster before writing the
durable status (ReconciliationSucceeded/HostedClusterRestoredFromBackup
condition), which can lose the trigger if the status update fails; change the
flow so you do not consume/remove HostedClusterRestoredFromBackupAnnotation
until after the status write is confirmed: first set the
HostedClusterRestoredFromBackup condition on the HostedCluster status and
perform the status update (updateStatus on hcluster), retrying on conflict as
needed, and only after the status update succeeds remove the
HostedClusterRestoredFromBackupAnnotation (or perform the annotation removal in
a separate patch/update with proper conflict handling) so the reconcile will
retry if the status write failed.
- Around line 1456-1483: If reconcileCoreHCPChain failed and hcp is nil, phase‑8
helpers will dereference hcp and panic; guard the entire phase‑8 block by
checking if hcp == nil and, if so, append a recoverable error to componentErrs
(e.g. fmt.Errorf("skipping phase 8: HostedControlPlane is nil due to earlier
error")) and skip calling reconcileKubeconfigAndPasswordSync,
reconcileOperatorDeployments, reconcileRBACAndPolicies,
reconcilePlatformSpecific, and reconcileAuxiliary; otherwise run the existing
calls as before.
- Around line 2161-2174: The status fields holding secret references
(hcluster.Status.CustomKubeconfig and hcluster.Status.KubeadminPassword) are
only being cleared in memory; after deleting the Secrets you must also persist
those changes to the API by clearing the fields on the HostedCluster status and
calling the Status().Update (or Client.Status().Update) to save them. Modify the
branch that deletes the custom kubeconfig (and the other branch mentioned around
the KubeadminPassword) to set hcluster.Status.CustomKubeconfig = nil and/or
hcluster.Status.KubeadminPassword = nil as appropriate and then call
r.Status().Update(ctx, hcluster) (handling and returning any error) so the API
no longer holds dangling secret refs; use the existing DeleteIfNeeded flow and
ensure both branches behave the same way.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33f6ef88-9d24-48cc-8609-666d2cca5d82

📥 Commits

Reviewing files that changed from the base of the PR and between cc479bc and 04ede4b.

📒 Files selected for processing (3)
  • docs/design/hostedcluster-reconcile-segregation-analysis.md
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

Comment on lines +112 to +190
```
+-----------------------------------------------------+
| CRITICAL PREREQUISITES (must succeed first) |
| P1: Get HCP |
| S2: ICSP/IDMS registry |
| S3: Pull secret bytes |
| S4+S5: CPO image + labels |
| S27: Lookup release image -> S26: payload arch |
| R22: Parse release image version |
| R10: Get platform |
+------------------------+----------------------------+
|
+------------------------------+------------------------------+
| | |
v v v
+-----------------+ +--------------------------+ +--------------------------+
| BLOCK A: | | BLOCK B: | | BLOCK C: |
| Status Updates | | Namespace & Foundation | | Validation Gates |
| (S1-S30) | | R1: Finalizer | | R7: Block on invalid |
| Mostly reads, | | R2: CAPI finalizers | | config/release |
| independent | | R4: Cluster IDs | | |
| from each other | | R5: CLI secrets | | (depends on S18,S19,S25) |
| | | R6: AWS resource tags | | |
| | | R9: CP namespace --------+-->| |
+-----------------+ +-----------+--------------+ +-----------+--------------+
| |
| (R9: CP namespace exists) |
v |
+------------------------------------------+ |
| BLOCK D: Secret & ConfigMap Sync | |
| R11: Platform credentials | |
| R13: Pull secret sync | |
| R14: Secret encryption | |
| R15: Audit webhook | |
| R16: SSH key | |
| R17: AdditionalTrustBundle | |
| R18: SA signing key | |
| R19: Unmanaged etcd MTLS | |
| R21: Global config CMs/secrets | |
| | |
| *** ALL INDEPENDENT OF EACH OTHER *** | |
+---------------------+--------------------+ |
| |
v |
+------------------------------------------+ |
| BLOCK E: Core Control Plane |<----------+
| R23: Reconcile HCP object |
| R24: Reconcile CAPI Infra CR (-> R23) |
| R25: Reconcile AWS subnets (-> R24) |
| R27: Reconcile CAPI Cluster (-> R23,R24) |
| |
| *** STRICTLY SEQUENTIAL *** |
+---------------------+--------------------+
|
+--------------+------+--------+---------------+
v v v v
+--------------+ +------------+ +-------------+ +--------------+
| BLOCK F: | | BLOCK G: | | BLOCK H: | | BLOCK I: |
| Operators | | RBAC & | | Kubeconfig | | Platform- |
| | | Policies | | & Secrets | | specific |
| R36: CPO | | R26: Prom | | R29: kube- | | R41: OIDC/ |
| R37: CAPI | | RBAC | | config | | SecretProv |
| manager | | R39: PKI | | R30: custom | | KubeVirt |
| R38: CAPI | | RBAC | | kubeconf | | |
| provider| | R40: Net | | R31: admin | | |
| R42: Karpent | | Pol. | | password | | |
+--------------+ +------------+ +-------------+ +--------------+

+-------------------------------------------------+
| BLOCK J: Auxiliary / Monitoring |
| R12: Restored-from-backup condition |
| R20: ETCD member recovery |
| R28: Monitoring dashboard |
| R33: SRE metrics config |
| R34: OpenShift trusted CAs |
| |
| *** LARGELY INDEPENDENT *** |
+-------------------------------------------------+
```
Contributor


⚠️ Potential issue | 🟡 Minor

Add a language to this fenced block.

The diagram fence is currently unlabeled, which trips markdownlint and makes rendering/tooling less predictable. Use a language like text for this block.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +6039 to +6055
```go
fakeClient := fake.NewClientBuilder().
	WithScheme(api.Scheme).
	WithObjects(hcluster, hcp).
	Build()

r := &HostedClusterReconciler{
	Client:                        fakeClient,
	EnableOCPClusterMonitoring:    true,
	ManagementClusterCapabilities: &fakecapabilities.FakeSupportAllCapabilities{},
}
createOrUpdate := upsert.New(false).CreateOrUpdate
log := zap.New(zap.UseDevMode(true), zap.Level(zapcore.InfoLevel))
releaseVersion := semver.MustParse("4.16.0")

_ = r.reconcileRBACAndPolicies(t.Context(), log, createOrUpdate, hcluster, hcp,
	true, // controlPlanePKIOperatorSignsCSRs — enables PKI RBAC
	false, releaseVersion)
```
Contributor


⚠️ Potential issue | 🟠 Major

This test never actually exercises the PKI-RBAC failure path.

Nothing in this setup forces reconcilePKIOperatorRBAC to fail, and the collected errors are discarded, so the test still passes when every sub-reconcile succeeds. That means it doesn't prove the isolation behavior in the test name. Make the PKI RBAC branch fail deterministically and assert that failure is returned while the Prometheus Role/RoleBinding are still created.

Comment on lines +1726 to 1728
```go
if _, exists := hcluster.Annotations[hyperv1.HostedClusterRestoredFromBackupAnnotation]; !exists {
	return nil
}
```
Contributor


⚠️ Potential issue | 🟠 Major

Don't consume the restore annotation before the condition is durable.

Line 1757 removes HostedClusterRestoredFromBackupAnnotation before the status write. If that update conflicts/fails and ReconciliationSucceeded is unchanged, the next reconcile exits at Line 1726 and never retries publishing HostedClusterRestoredFromBackup.

Also applies to: 1756-1770


Comment on lines +2161 to 2174
```go
} else {
	// Delete the custom external kubeconfig secret if it exists and the external name is not set
	if hcluster.Status.CustomKubeconfig != nil {
		customKubeconfig := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Namespace: hcluster.Namespace,
				Name:      hcluster.Status.CustomKubeconfig.Name,
			},
		}
		if _, err := hyperutil.DeleteIfNeeded(ctx, r.Client, customKubeconfig); err != nil {
			return fmt.Errorf("failed to delete custom external kubeconfig secret %q: %w", client.ObjectKeyFromObject(customKubeconfig), err)
		}
		hcluster.Status.CustomKubeconfig = nil
	}
```
Contributor


⚠️ Potential issue | 🟠 Major

Clear and persist the HostedCluster secret refs when these synced secrets are removed.

These branches run after phase 4 has already written status. CustomKubeconfig is only nilled in memory, and KubeadminPassword is never cleared at all, so the API can keep dangling references to deleted secrets indefinitely.

Also applies to: 2213-2225


@openshift-ci openshift-ci bot requested review from enxebre and sjenning March 10, 2026 16:26
@openshift-ci openshift-ci bot added the area/documentation Indicates the PR includes changes for documentation label Mar 10, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@openshift-ci openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Mar 10, 2026
- AWS resource tags error -> HCP never updated

**After (proposed)**: Only genuinely dependent operations block each other. Examples:
- SSH key secret missing -> SSH key not synced (reported via condition), but CPO still
Member


As of today, I see SSH as a dependency for the data plane, since changing it would trigger a rollout. Is this still the case?

Contributor Author


No behavior change; this only lets other control-plane operations that are not affected by the missing SSH key proceed.

Contributor Author

@muraee muraee Mar 12, 2026


This should help with the data-plane/control-plane upgrade decoupling we are working on. Otherwise, a control-plane upgrade will not proceed if the SSH key (a data-plane input) is missing.

**After (proposed)**: Only genuinely dependent operations block each other. Examples:
- SSH key secret missing -> SSH key not synced (reported via condition), but CPO still
deploys, HCP still reconciles, CAPI still works
- Monitoring dashboard error -> reported, but karpenter and network policies proceed normally
Member


can we include more examples and have test cases for all of them?

Contributor Author


done

@enxebre
Member

enxebre commented Mar 12, 2026

Was there a Jira bug we can reference that reports the scenario where this was problematic for managed?

## Key Problems Identified

1. **Secret sync operations block each other** (R13-R21): A missing SSH key secret prevents
pull secret sync, secret encryption setup, ETCD MTLS, and global config sync. These are
Member


why would we want to proceed with the CPO if the pull secret is invalid?

Contributor Author

@muraee muraee Mar 12, 2026


Two reasons:

  1. The image could already be available in the cluster, so an invalid pull secret doesn't necessarily block the deployment.
  2. The image-pull error would be reported on the CPO deployment, making it easier to discover than digging through the HO logs.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 15, 2026
@openshift-merge-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 04ede4b to 168e67a on March 31, 2026 11:16
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 168e67a to 15e6a0c on March 31, 2026 11:23
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 15e6a0c to e7bc83c on March 31, 2026 11:27
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from e7bc83c to 5bc8ca5 on March 31, 2026 11:40
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 5bc8ca5 to 922ef75 on March 31, 2026 11:47
@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 37.87611% with 351 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.03%. Comparing base (7ce6015) to head (44fa50d).
⚠️ Report is 27 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `...trollers/hostedcluster/hostedcluster_controller.go` | 37.94% | 306 Missing and 44 partials ⚠️ |
| `...rollers/hostedcluster/internal/platform/aws/aws.go` | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main    #7908      +/-   ##
==========================================
+ Coverage   26.85%   27.03%   +0.17%
==========================================
  Files        1090     1091       +1
  Lines      105254   105800     +546
==========================================
+ Hits        28263    28598     +335
- Misses      74563    74767     +204
- Partials     2428     2435       +7
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| `...rollers/hostedcluster/internal/platform/aws/aws.go` | 14.09% <0.00%> (ø) |
| `...trollers/hostedcluster/hostedcluster_controller.go` | 47.46% <37.94%> (+4.48%) ⬆️ |

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 922ef75 to 84b3624 on March 31, 2026 13:42
@openshift-ci openshift-ci bot added the area/platform/aws PR/issue for AWS (AWSPlatform) platform label Mar 31, 2026
refactor(hostedcluster): segregate reconcile loop into error-collecting blocks

The reconcile() method executes ~50 sequential operations where every
error causes an early return, short-circuiting all remaining work. An
unrelated failure (e.g., missing SSH key secret) prevents critical
operations like deploying the CPO or reconciling the HCP object.

This refactoring:

- Extracts 12 inline blocks into named methods (reconcilePullSecretSync,
  reconcileSSHKeySync, reconcileSecretEncryptionSync, etc.)
- Groups operations into phased error-collecting blocks:
  - Phase 6: Independent secret/config sync (syncErrs)
  - Phase 7: Core HCP chain (coreErrs)
  - Phase 8: Component groups via wrapper methods (componentErrs)
- Introduces 5 wrapper methods returning []error:
  reconcileKubeconfigAndPasswordSync, reconcileOperatorDeployments,
  reconcileRBACAndPolicies, reconcileAuxiliary, reconcilePlatformSpecific
- Aggregates all errors with utilerrors.NewAggregate at the end
- Moves reconcileCLISecrets into reconcileAuxiliary (error-collecting)

After this change, failures in one phase no longer block unrelated
phases. For example, a missing SSH key no longer prevents CPO deployment
or HCP object creation.

Includes the analysis document and integration tests that verify
non-blocking behavior across phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 84b3624 to 44fa50d on March 31, 2026 16:55
@openshift-ci
Contributor

openshift-ci bot commented Mar 31, 2026

@muraee: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

