Skip to content

OCPBUGS-87962: Boot image skew check silently passes when MachineSets are reconcile-skipped#6162

Open
djoshy wants to merge 3 commits into
openshift:mainfrom
djoshy:fix-skew-add-rm-machineset
Open

OCPBUGS-87962: Boot image skew check silently passes when MachineSets are reconcile-skipped#6162
djoshy wants to merge 3 commits into
openshift:mainfrom
djoshy:fix-skew-add-rm-machineset

Conversation

@djoshy

@djoshy djoshy commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

- What I did
When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs; and so the skew check correctly surfaces a violation if the cluster is genuinely out of date. We fall back to 0.0.0 as a conservative default if the version could not be queried from cluster resources; this would only be needed if the cluster is in a broken state and is very unlikely.

I also refactored a couple of things to simplify and improve readablity of the code, they have been broken out into separate commits.

- How to verify it

  1. Create a cluster on AWS/GCP/Azure/vSphere on a recent 4.22 release.
  2. Upgrade to a 5.0 release with this fix. The cluster bootimage version stored in the MachineConfiguration value should indicate 5.0 when the upgrade is complete and the bootimage controller has completed a loop
  3. Add or Update the MachineSet to a bootimage with a non-reconcilable value. This should reset the cluster bootimage back to the installversion of the cluster, 4.22.z
  4. Revert the machineset to a reconcilable value, this should set the boot image version back to 5.0.

Summary by CodeRabbit

  • New Features

    • Derives and records OCP install-time version for boot-image skew enforcement, with a safe "0.0.0" fallback.
  • Bug Fixes

    • Removed redundant operator-version gating during boot-image reconciliation.
    • Only persists status when it actually changes; improved handling when ClusterVersion or ConfigMap are unavailable.
  • Refactor

    • Streamlined ConfigMap fetch and reconciliation flow with clearer skip/error semantics and explicit reset/update logic.
  • Tests

    • Added tests covering cluster boot-image reset behavior across history and lookup scenarios.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 9, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did
When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs; and so the skew check correctly surfaces a violation if the cluster is genuinely out of date.

I also refactored a couple of things to simplify and improve readablity of the code, they have been broken out into separate commits.

- How to verify it

  1. Create a cluster on AWS/GCP/Azure/vSphere on a recent 4.22 release.
  2. Upgrade to a 5.0 release with this fix. The cluster bootimage version stored in the MachineConfiguration value should indicate 5.0 when the upgrade is complete and the bootimage controller has completed a loop
  3. Add or Update the MachineSet to a bootimage with a non-reconcilable value. This should reset the cluster bootimage back to the installversion of the cluster, 4.22.z
  4. Revert the machineset to a reconcilable value, this should set the boot image version back to 5.0.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a9f66a85-0c6d-4710-857f-2b7a8e3336ef

📥 Commits

Reviewing files that changed from the base of the PR and between 0b5082d and a2697ed.

📒 Files selected for processing (5)
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go
  • pkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/ms_helpers.go
  • pkg/controller/bootimage/boot_image_controller_test.go

Walkthrough

Centralizes install-time OCP version extraction into apihelpers, uses it in operator and controller reset logic, prefetches coreos-bootimages for MachineSet sync, removes ConfigMap stamping checks from CPMS/MS flows, and adds tests for the reset behavior.

Changes

Boot image skew enforcement refactor

Layer / File(s) Summary
Install version extraction
pkg/apihelpers/apihelpers.go
Adds GetOCPInstallVersion to derive install-time OCP version from ClusterVersion.Status.History, parsing to major.minor.patch or returning "0.0.0" on empty history/parse failure.
Operator wiring for install version
pkg/operator/sync.go
Operator sync now fetches ClusterVersion and delegates install-version extraction to apihelpers.GetOCPInstallVersion.
Controller reset and status update tightening
pkg/controller/bootimage/boot_image_controller.go
Adds resetClusterBootImage, tighter deep-equality check before persisting BootImageSkewEnforcementStatus, and uses apihelpers-derived install version when resetting Automatic.OCPVersion.
Reset controller tests and test imports
pkg/controller/bootimage/boot_image_controller_test.go
Adds TestResetClusterBootImage, updates test imports and fake client usage, and validates reset/fallback behavior across ClusterVersion history scenarios.
MachineSet reconciliation refactor
pkg/controller/bootimage/ms_helpers.go
Prefetches coreos-bootimages ConfigMap, passes it into per-MachineSet reconciliation, removes in-function operator-stamp checks, aggregates reconcileSkipped/errors, reports degraded on ConfigMap fetch failure, and gates cluster boot-image record update vs reset.
ControlPlaneMachineSet reconcile update
pkg/controller/bootimage/cpms_helpers.go
Removes operator-version stamping/validation from CPMS reconcile; proceeds to checkControlPlaneMachineSet after fetching coreos-bootimages.
Platform helpers semantics
pkg/controller/bootimage/platform_helpers.go
Clarifies reconcileSkipped meaning, renames patchSkipped to reconcileSkipped, and returns reconcileSkipped on early non-patch paths.

Sequence Diagram

sequenceDiagram
  participant MSController
  participant ClusterVersionLister
  participant APIHelpers
  participant StatusUpdater
  MSController->>ClusterVersionLister: Get("version")
  ClusterVersionLister->>APIHelpers: GetOCPInstallVersion(ClusterVersion)
  APIHelpers-->>MSController: "major.minor.patch" / "0.0.0"
  MSController->>StatusUpdater: updateMachineConfigurationStatus(new SkewStatus)
  StatusUpdater-->>MSController: persist result / no-op
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

lgtm, verified

Suggested reviewers

  • dkhater-redhat
🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive The custom check specifies "Review Ginkgo test code" but TestResetClusterBootImage uses standard Go testing (testing.T, t.Run), not Ginkgo (no Describe/It/BeforeEach blocks). The check instructions... Clarify whether quality requirements apply to standard Go tests using t.Run table-driven patterns or only Ginkgo-style BDD tests. If applied to standard tests: TestResetClusterBootImage lacks assertion failure messages (though consistent...
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the bug being fixed and concisely summarizes the main change: resetting boot image when MachineSets are reconcile-skipped.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are stable and deterministic with no dynamic content (timestamps, UUIDs, generated IDs, IPs, etc.). The new TestResetClusterBootImage test uses descriptive static names wit...
Microshift Test Compatibility ✅ Passed The test added (TestResetClusterBootImage) is a standard Go unit test, not a Ginkgo e2e test. The custom check applies only to Ginkgo e2e tests, so it is not applicable to this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The new test (TestResetClusterBootImage) is a standard Go unit test using testing.T with mocked clients, not an e2e test; check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only controller logic and helpers (GetOCPInstallVersion, resetClusterBootImage), not deployment manifests or scheduling constraints. No topology-unsafe pod affinity, PDBs, replica count...
Ote Binary Stdout Contract ✅ Passed All code changes use klog which writes to stderr by default, and all logging occurs in worker/reconciliation loops, not process-level code. No direct stdout writes detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR does not add Ginkgo e2e tests; only adds standard Go unit test TestResetClusterBootImage(). Check is not applicable.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms, custom crypto implementations, or non-constant-time secret comparisons detected. Changes focus on boot image versioning and controller logic without cryptographic...
Container-Privileges ✅ Passed The PR only modifies Go source code files related to boot image skew enforcement logic (controllers, helpers, and tests). No Kubernetes manifests, container security configurations, or pod specific...
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data exposed in logs. New logging statements only log version information (e.g., "5.0.0") and generic operational messages; no passwords, tokens, API keys, PII, or customer data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

- What I did
When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs; and so the skew check correctly surfaces a violation if the cluster is genuinely out of date.

I also refactored a couple of things to simplify and improve readablity of the code, they have been broken out into separate commits.

- How to verify it

  1. Create a cluster on AWS/GCP/Azure/vSphere on a recent 4.22 release.
  2. Upgrade to a 5.0 release with this fix. The cluster bootimage version stored in the MachineConfiguration value should indicate 5.0 when the upgrade is complete and the bootimage controller has completed a loop
  3. Add or Update the MachineSet to a bootimage with a non-reconcilable value. This should reset the cluster bootimage back to the installversion of the cluster, 4.22.z
  4. Revert the machineset to a reconcilable value, this should set the boot image version back to 5.0.

Summary by CodeRabbit

  • New Features

  • Improved boot image skew enforcement to track the OCP install-time version from cluster history.

  • Bug Fixes

  • Removed redundant operator version validation during boot image reconciliation.

  • Refactor

  • Enhanced boot image handling with optimized ConfigMap management and reconciliation flow.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy djoshy marked this pull request as ready for review June 9, 2026 18:47
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/apihelpers/apihelpers.go`:
- Around line 667-670: When clusterVersion.Status.History is empty the code
currently returns an empty string; change that fallback to the stable
conservative version "0.0.0" instead so downstream skew comparisons have a
usable version. Update the branch that checks if
len(clusterVersion.Status.History) == 0 (and the similar block around the later
parse-failure handling) to return "0.0.0" instead of "" and ensure any code
paths that previously relied on "" now receive the same "0.0.0" default used on
parse failure.

In `@pkg/controller/bootimage/boot_image_controller.go`:
- Around line 653-657: The current early return after
ctrl.clusterVersionLister.Get("version") can leave ClusterBootImageAutomatic
stale; instead, when clusterVersion lookup fails capture and log the error but
set a conservative fallback version (e.g. a minimum supported/zero version
string) and continue the status reset logic for ClusterBootImageAutomatic in the
reconcile-skipped path so the controller proceeds to clear/reset status; update
the block around clusterVersion, err := ctrl.clusterVersionLister.Get("version")
to not return on error but assign the fallback and continue using clusterVersion
(or the fallback) for subsequent status update logic.

In `@pkg/controller/bootimage/ms_helpers.go`:
- Around line 71-76: The controller currently fetches the BootImages ConfigMap
even when there are no MAPI MachineSets, which can cause an unnecessary
degradation via ctrl.updateConditions; update the logic in ms_helpers.go to
check len(mapiMachineSets) and return early when it is zero before calling
ctrl.mcoCmLister.ConfigMaps(...).Get (and avoid invoking
ctrl.updateConditions/degrading in that path); ensure the check references the
existing mapiMachineSets variable and that any subsequent code that uses
configMap only runs when the ConfigMap fetch is actually needed.

In `@pkg/operator/sync.go`:
- Around line 2607-2611: When optr.clusterVersionLister.Get("version") fails in
sync.go the code currently returns an empty string; change this to return a
conservative comparable version string (e.g. "0.0.0") so skew enforcement
comparisons remain robust. Keep the existing klog.Warningf call but replace the
return "" with return "0.0.0" (refer to the clusterVersion variable and
optr.clusterVersionLister.Get("version") in this function).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d76e6a14-632c-45cd-a965-15d0263af03e

📥 Commits

Reviewing files that changed from the base of the PR and between ee994cd and 7645991.

📒 Files selected for processing (6)
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/cpms_helpers.go
  • pkg/controller/bootimage/ms_helpers.go
  • pkg/controller/bootimage/platform_helpers.go
  • pkg/operator/sync.go

Comment thread pkg/apihelpers/apihelpers.go
Comment thread pkg/controller/bootimage/boot_image_controller.go
Comment thread pkg/controller/bootimage/ms_helpers.go Outdated
Comment thread pkg/operator/sync.go
@djoshy djoshy force-pushed the fix-skew-add-rm-machineset branch 2 times, most recently from cc61586 to bd7589e Compare June 9, 2026 19:47
djoshy added 2 commits June 9, 2026 15:47
isClusterStable() already gates the entire sync loop — the operator stamps the configmap with the MCO version hash and OCP release version as soon as the first master node updates, which always happens before ClusterVersion marks the upgrade complete. By the time syncMAPIMachineSets and syncControlPlaneMachineSet run, the configmap is guaranteed to be up to date.
@djoshy djoshy force-pushed the fix-skew-add-rm-machineset branch from bd7589e to 34a431d Compare June 9, 2026 19:47

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/apihelpers/apihelpers.go (1)

663-665: ⚡ Quick win

Update the function comment to reflect the actual return value.

The comment states the function returns an empty string on failure, but the implementation returns "0.0.0" (lines 670, 687). Update the comment to match the code.

📝 Suggested comment fix
 // GetOCPInstallVersion extracts the install-time OCP version from
 // ClusterVersion history — the oldest completed update entry, normalised to Major.Minor.Patch.
-// Returns an empty string if the history is empty or the version cannot be parsed.
+// Returns "0.0.0" as a conservative fallback if the history is empty or the version cannot be parsed.
 func GetOCPInstallVersion(clusterVersion *configv1.ClusterVersion) string {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/apihelpers/apihelpers.go` around lines 663 - 665, Update the function
comment for GetOCPInstallVersion to correctly describe the actual return value
on failure: it returns "0.0.0" (not an empty string). Modify the docstring to
state that the function returns the normalized Major.Minor.Patch version or
"0.0.0" if the history is empty or parsing fails, so the comment matches the
implementation in GetOCPInstallVersion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/apihelpers/apihelpers.go`:
- Around line 663-665: Update the function comment for GetOCPInstallVersion to
correctly describe the actual return value on failure: it returns "0.0.0" (not
an empty string). Modify the docstring to state that the function returns the
normalized Major.Minor.Patch version or "0.0.0" if the history is empty or
parsing fails, so the comment matches the implementation in
GetOCPInstallVersion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: caebce40-b738-4116-bd00-2b1675f9d096

📥 Commits

Reviewing files that changed from the base of the PR and between 7645991 and 34a431d.

📒 Files selected for processing (6)
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/cpms_helpers.go
  • pkg/controller/bootimage/ms_helpers.go
  • pkg/controller/bootimage/platform_helpers.go
  • pkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/controller/bootimage/cpms_helpers.go
  • pkg/operator/sync.go
  • pkg/controller/bootimage/platform_helpers.go

@djoshy djoshy force-pushed the fix-skew-add-rm-machineset branch 2 times, most recently from ec09df4 to 0b5082d Compare June 10, 2026 16:48

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/controller/bootimage/boot_image_controller_test.go (1)

977-1067: 💤 Low value

Consider adding a test case for ClusterVersion retrieval failure.

The test currently covers scenarios where GetOCPInstallVersion returns "0.0.0" (empty history, case 4), but doesn't test the code path where clusterVersionLister.Get("version") itself fails. From the implementation, when Get() fails, it logs a warning and keeps the default "0.0.0".

Adding a test case where the ClusterVersion is not added to the indexer would verify this fallback path and improve completeness.

📝 Example test case for Get() failure
{
    name:              "falls back to 0.0.0 when ClusterVersion unavailable",
    history:           nil, // ClusterVersion won't be added to indexer
    mode:              opv1.BootImageSkewEnforcementModeStatusAutomatic,
    initialOCPVersion: "4.19.0",
    expectedVersion:   "0.0.0",
},

Then in the test body, conditionally skip adding the ClusterVersion when tc.history == nil:

if tc.history != nil {
    cv := &osconfigv1.ClusterVersion{
        ObjectMeta: v1.ObjectMeta{Name: "version"},
        Status:     osconfigv1.ClusterVersionStatus{History: tc.history},
    }
    cvIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
    require.NoError(t, cvIndexer.Add(cv))
    ctrl.clusterVersionLister = configlistersv1.NewClusterVersionLister(cvIndexer)
} else {
    // Empty indexer to trigger Get() failure
    cvIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
    ctrl.clusterVersionLister = configlistersv1.NewClusterVersionLister(cvIndexer)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/bootimage/boot_image_controller_test.go` around lines 977 -
1067, Add a test case to TestResetClusterBootImage that covers the
ClusterVersion Get() failure path by supplying a case where history is nil
(e.g., name "falls back to 0.0.0 when ClusterVersion unavailable", history: nil,
mode automatic, initialOCPVersion: "4.19.0", expectedVersion: "0.0.0") and
modify the test setup in the loop so that when tc.history == nil you do NOT add
a ClusterVersion to the cvIndexer before creating ctrl.clusterVersionLister;
this forces clusterVersionLister.Get("version") to fail and verifies
resetClusterBootImage still falls back to "0.0.0".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/controller/bootimage/boot_image_controller_test.go`:
- Around line 977-1067: Add a test case to TestResetClusterBootImage that covers
the ClusterVersion Get() failure path by supplying a case where history is nil
(e.g., name "falls back to 0.0.0 when ClusterVersion unavailable", history: nil,
mode automatic, initialOCPVersion: "4.19.0", expectedVersion: "0.0.0") and
modify the test setup in the loop so that when tc.history == nil you do NOT add
a ClusterVersion to the cvIndexer before creating ctrl.clusterVersionLister;
this forces clusterVersionLister.Get("version") to fail and verifies
resetClusterBootImage still falls back to "0.0.0".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49f116c7-53d6-49fb-b605-b7cd0dbe1c16

📥 Commits

Reviewing files that changed from the base of the PR and between ec09df4 and 0b5082d.

📒 Files selected for processing (5)
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go
  • pkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/operator/sync.go
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootimage/boot_image_controller.go
  • pkg/controller/bootimage/ms_helpers.go

When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs without an accompanying error — the most conservative value we can reliably claim — so the skew check correctly surfaces a violation if the cluster is genuinely out of date.
@djoshy djoshy force-pushed the fix-skew-add-rm-machineset branch from 0b5082d to a2697ed Compare June 10, 2026 17:20

// Only make an API call if there is an update to the skew enforcement status
if !reflect.DeepEqual(mcop.Status.BootImageSkewEnforcementStatus, newBootImageSkewEnforcementStatus) {
if !reflect.DeepEqual(mcop.Status.BootImageSkewEnforcementStatus, *newBootImageSkewEnforcementStatus) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was something caught by the new units; the old code wasn't dereferencing the newBootImageSkewEnforcementStatus pointer, meaning the API call was always being made - no bueno! 🫨

@djoshy

djoshy commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-fips-proxy-longduration

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-fips-proxy-longduration-1of2
  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-fips-proxy-longduration-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/73f76660-64f3-11f1-9601-524e54fbd0f7-0

@djoshy

djoshy commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
✅ Action performed

Reviews resumed.

@ptalgulk01

ptalgulk01 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Pre-merge verified:

OCP Version: 5.0.0-0-2026-06-15-062252-test-ci-ln-p1kbq5b-latest
Platform: AWS

Steps:

  • Get the 4.22 cluster, upgrade the cluster to fix
  • After upgrade check the bootimage version
$ oc get machineconfiguration cluster \
    -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
5.0.0
  • Make a copy of Machineset and then patch the fake value
oc get machinesets.machine.openshift.io -n openshift-machine-api ppt-1506a-cvhj4-worker-us-east-2a -o yaml > ms-backup.yaml

  oc patch machineset  openshift-machine-api ppt-1506a-cvhj4-worker-us-east-2a -n openshift-machine-api --type merge \
    -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"ami":{"id":"ami-FAKE123456789"}}}}}}}'
machineset.machine.openshift.io/ppt-1506a-cvhj4-worker-us-east-2a patched
  • Check the bootimage value is 4.22
  oc get machineconfiguration cluster \
    -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
4.22.0
  • Revert the change and check bootimage value is same as before.
oc get machineconfiguration cluster \
    -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
5.0.0
 passed: (1m10s) 2026-06-15T12:56:51 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-81403-[P1][OnCLayer] In BootImages Machineset should update by default [Disruptive] [Serial]"
 passed: (1m28s) 2026-06-15T12:58:19 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-80435-[OnCLayer] Bootimage no json data error upgrading stub ignition to spec 3 [Disruptive] [Serial]"
  passed: (5m47s) 2026-06-15T13:04:06 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-High-80437-[P1] Bootimage upgrade stub ignition to spec 3 [Disruptive] [Serial]"
  passed: (1m39s) 2026-06-15T13:05:45 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-80434-[OnCLayer] Bootimage wrong version error upgrading stub ignition to spec 3 [Disruptive] [Serial]"
  passed: (53.2s) 2026-06-15T13:06:39 "[sig-mco] MCO Bootimages Author:ptalgulk-NonHyperShiftHOST-NonPreRelease-High-81395-[P1] Verify in boot-image by default update is opt-in. [Disruptive] [Serial]"
passed: (1m28s) 2026-06-15T13:08:07 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-80436-Bootimage secret doesn't exist error upgrading stub ignition to spec 3 [Disruptive] [Serial]"
 passed: (49.3s) 2026-06-15T13:08:56 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74751-[P2][OnCLayer] ManagedBootImages. Fix errors [Disruptive] [Serial]"
 passed: (7m25s) 2026-06-15T13:16:22 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Critical-82747-Correctly handle marketplace bootimages [Disruptive] [Serial]"
  passed: (2m9s) 2026-06-15T13:18:30 "[sig-mco] MCO Bootimages Author:ptalgulk-NonHyperShiftHOST-NonPreRelease-High-83998-Check in the boot image controller to work with multiple labels for annotation [Disruptive] [Serial]"
  passed: (6m48s) 2026-06-15T13:25:18 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74240-[P2][OnCLayer] ManagedBootImages. Restore All MachineSet images [Disruptive] [Serial]"
  passed: (8m50s) 2026-06-15T13:34:08 "[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74239-[OnCLayer] ManagedBootImages. Restore Partial MachineSet images [Disruptive] [Serial]"

/label qe-approved
/verified by @ptalgulk01

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@ptalgulk01: This PR has been marked as verified by @ptalgulk01.

Details

In response to this:

Pre-merge verified:

OCP Version: 5.0.0-0-2026-06-15-062252-test-ci-ln-p1kbq5b-latest
Platform: AWS

Steps:

  • Get the 4.22 cluster, upgrade the cluster to fix
  • After upgrade check the bootimage version
$ oc get machineconfiguration cluster \
   -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
5.0.0
  • Make a copy of Machineset and then patch the fake value
oc get machinesets.machine.openshift.io -n openshift-machine-api ppt-1506a-cvhj4-worker-us-east-2a -o yaml > ms-backup.yaml

 oc patch machineset  openshift-machine-api ppt-1506a-cvhj4-worker-us-east-2a -n openshift-machine-api --type merge \
   -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"ami":{"id":"ami-FAKE123456789"}}}}}}}'
machineset.machine.openshift.io/ppt-1506a-cvhj4-worker-us-east-2a patched
  • Check the bootimage value is 4.22
 oc get machineconfiguration cluster \
   -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
4.22.0
  • Revert the change and check bootimage value is same as before.
oc get machineconfiguration cluster \
   -o jsonpath='{.status.bootImageSkewEnforcementStatus.automatic.ocpVersion}'
5.0.0

/label qe-approved
/verified by @ptalgulk01

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Jun 15, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

- What I did
When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs; and so the skew check correctly surfaces a violation if the cluster is genuinely out of date. We fall back to 0.0.0 as a conservative default if the version could not be queried from cluster resources; this would only be needed if the cluster is in a broken state and is very unlikely.

I also refactored a couple of things to simplify and improve readablity of the code, they have been broken out into separate commits.

- How to verify it

  1. Create a cluster on AWS/GCP/Azure/vSphere on a recent 4.22 release.
  2. Upgrade to a 5.0 release with this fix. The cluster bootimage version stored in the MachineConfiguration value should indicate 5.0 when the upgrade is complete and the bootimage controller has completed a loop
  3. Add or Update the MachineSet to a bootimage with a non-reconcilable value. This should reset the cluster bootimage back to the installversion of the cluster, 4.22.z
  4. Revert the machineset to a reconcilable value, this should set the boot image version back to 5.0.

Summary by CodeRabbit

  • New Features

  • Derives and records OCP install-time version for boot-image skew enforcement, with a safe "0.0.0" fallback.

  • Bug Fixes

  • Removed redundant operator-version gating during boot-image reconciliation.

  • Only persists status when it actually changes; improved handling when ClusterVersion or ConfigMap are unavailable.

  • Refactor

  • Streamlined ConfigMap fetch and reconciliation flow with clearer skip/error semantics and explicit reset/update logic.

  • Tests

  • Added tests covering cluster boot-image reset behavior across history and lookup scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@isabella-janssen isabella-janssen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

The implementation seems sane, the PR has been verified, and the new units created for the functionality are passing, so this should be safe to send.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op-ocl-part1
/test e2e-gcp-op-ocl-part2
/test e2e-gcp-op-part1
/test e2e-gcp-op-part2
/test e2e-gcp-op-single-node
/test e2e-hypershift

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [djoshy,isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 49eaf75 and 2 for PR HEAD a2697ed in total

@isabella-janssen

Copy link
Copy Markdown
Member

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn a2697ed link true /test e2e-aws-ovn
ci/prow/e2e-gcp-op-part1 a2697ed link true /test e2e-gcp-op-part1
ci/prow/e2e-gcp-op-ocl-part2 a2697ed link true /test e2e-gcp-op-ocl-part2
ci/prow/e2e-gcp-op-single-node a2697ed link true /test e2e-gcp-op-single-node
ci/prow/e2e-gcp-op-part2 a2697ed link true /test e2e-gcp-op-part2
ci/prow/e2e-gcp-op-ocl-part1 a2697ed link true /test e2e-gcp-op-ocl-part1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants