OCPBUGS-87962: Boot image skew check silently passes when MachineSets are reconcile-skipped#6162
OCPBUGS-87962: Boot image skew check silently passes when MachineSets are reconcile-skipped#6162djoshy wants to merge 3 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (4)
WalkthroughCentralizes install-time OCP version extraction into apihelpers, uses it in operator and controller reset logic, prefetches coreos-bootimages for MachineSet sync, removes ConfigMap stamping checks from CPMS/MS flows, and adds tests for the reset behavior. ChangesBoot image skew enforcement refactor
Sequence DiagramsequenceDiagram
participant MSController
participant ClusterVersionLister
participant APIHelpers
participant StatusUpdater
MSController->>ClusterVersionLister: Get("version")
ClusterVersionLister->>APIHelpers: GetOCPInstallVersion(ClusterVersion)
APIHelpers-->>MSController: "major.minor.patch" / "0.0.0"
MSController->>StatusUpdater: updateMachineConfigurationStatus(new SkewStatus)
StatusUpdater-->>MSController: persist result / no-op
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/apihelpers/apihelpers.go`:
- Around line 667-670: When clusterVersion.Status.History is empty the code
currently returns an empty string; change that fallback to the stable
conservative version "0.0.0" instead so downstream skew comparisons have a
usable version. Update the branch that checks if
len(clusterVersion.Status.History) == 0 (and the similar block around the later
parse-failure handling) to return "0.0.0" instead of "" and ensure any code
paths that previously relied on "" now receive the same "0.0.0" default used on
parse failure.
In `@pkg/controller/bootimage/boot_image_controller.go`:
- Around line 653-657: The current early return after
ctrl.clusterVersionLister.Get("version") can leave ClusterBootImageAutomatic
stale; instead, when clusterVersion lookup fails capture and log the error but
set a conservative fallback version (e.g. a minimum supported/zero version
string) and continue the status reset logic for ClusterBootImageAutomatic in the
reconcile-skipped path so the controller proceeds to clear/reset status; update
the block around clusterVersion, err := ctrl.clusterVersionLister.Get("version")
to not return on error but assign the fallback and continue using clusterVersion
(or the fallback) for subsequent status update logic.
In `@pkg/controller/bootimage/ms_helpers.go`:
- Around line 71-76: The controller currently fetches the BootImages ConfigMap
even when there are no MAPI MachineSets, which can cause an unnecessary
degradation via ctrl.updateConditions; update the logic in ms_helpers.go to
check len(mapiMachineSets) and return early when it is zero before calling
ctrl.mcoCmLister.ConfigMaps(...).Get (and avoid invoking
ctrl.updateConditions/degrading in that path); ensure the check references the
existing mapiMachineSets variable and that any subsequent code that uses
configMap only runs when the ConfigMap fetch is actually needed.
In `@pkg/operator/sync.go`:
- Around line 2607-2611: When optr.clusterVersionLister.Get("version") fails in
sync.go the code currently returns an empty string; change this to return a
conservative comparable version string (e.g. "0.0.0") so skew enforcement
comparisons remain robust. Keep the existing klog.Warningf call but replace the
return "" with return "0.0.0" (refer to the clusterVersion variable and
optr.clusterVersionLister.Get("version") in this function).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d76e6a14-632c-45cd-a965-15d0263af03e
📒 Files selected for processing (6)
pkg/apihelpers/apihelpers.gopkg/controller/bootimage/boot_image_controller.gopkg/controller/bootimage/cpms_helpers.gopkg/controller/bootimage/ms_helpers.gopkg/controller/bootimage/platform_helpers.gopkg/operator/sync.go
cc61586 to
bd7589e
Compare
isClusterStable() already gates the entire sync loop — the operator stamps the configmap with the MCO version hash and OCP release version as soon as the first master node updates, which always happens before ClusterVersion marks the upgrade complete. By the time syncMAPIMachineSets and syncControlPlaneMachineSet run, the configmap is guaranteed to be up to date.
bd7589e to
34a431d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/apihelpers/apihelpers.go (1)
663-665: ⚡ Quick winUpdate the function comment to reflect the actual return value.
The comment states the function returns an empty string on failure, but the implementation returns
"0.0.0"(lines 670, 687). Update the comment to match the code.📝 Suggested comment fix
// GetOCPInstallVersion extracts the install-time OCP version from // ClusterVersion history — the oldest completed update entry, normalised to Major.Minor.Patch. -// Returns an empty string if the history is empty or the version cannot be parsed. +// Returns "0.0.0" as a conservative fallback if the history is empty or the version cannot be parsed. func GetOCPInstallVersion(clusterVersion *configv1.ClusterVersion) string {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/apihelpers/apihelpers.go` around lines 663 - 665, Update the function comment for GetOCPInstallVersion to correctly describe the actual return value on failure: it returns "0.0.0" (not an empty string). Modify the docstring to state that the function returns the normalized Major.Minor.Patch version or "0.0.0" if the history is empty or parsing fails, so the comment matches the implementation in GetOCPInstallVersion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/apihelpers/apihelpers.go`:
- Around line 663-665: Update the function comment for GetOCPInstallVersion to
correctly describe the actual return value on failure: it returns "0.0.0" (not
an empty string). Modify the docstring to state that the function returns the
normalized Major.Minor.Patch version or "0.0.0" if the history is empty or
parsing fails, so the comment matches the implementation in
GetOCPInstallVersion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: caebce40-b738-4116-bd00-2b1675f9d096
📒 Files selected for processing (6)
pkg/apihelpers/apihelpers.gopkg/controller/bootimage/boot_image_controller.gopkg/controller/bootimage/cpms_helpers.gopkg/controller/bootimage/ms_helpers.gopkg/controller/bootimage/platform_helpers.gopkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/controller/bootimage/cpms_helpers.go
- pkg/operator/sync.go
- pkg/controller/bootimage/platform_helpers.go
ec09df4 to
0b5082d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controller/bootimage/boot_image_controller_test.go (1)
977-1067: 💤 Low valueConsider adding a test case for ClusterVersion retrieval failure.
The test currently covers scenarios where
GetOCPInstallVersionreturns "0.0.0" (empty history, case 4), but doesn't test the code path whereclusterVersionLister.Get("version")itself fails. From the implementation, when Get() fails, it logs a warning and keeps the default "0.0.0".Adding a test case where the ClusterVersion is not added to the indexer would verify this fallback path and improve completeness.
📝 Example test case for Get() failure
{ name: "falls back to 0.0.0 when ClusterVersion unavailable", history: nil, // ClusterVersion won't be added to indexer mode: opv1.BootImageSkewEnforcementModeStatusAutomatic, initialOCPVersion: "4.19.0", expectedVersion: "0.0.0", },Then in the test body, conditionally skip adding the ClusterVersion when
tc.history == nil:if tc.history != nil { cv := &osconfigv1.ClusterVersion{ ObjectMeta: v1.ObjectMeta{Name: "version"}, Status: osconfigv1.ClusterVersionStatus{History: tc.history}, } cvIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{}) require.NoError(t, cvIndexer.Add(cv)) ctrl.clusterVersionLister = configlistersv1.NewClusterVersionLister(cvIndexer) } else { // Empty indexer to trigger Get() failure cvIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{}) ctrl.clusterVersionLister = configlistersv1.NewClusterVersionLister(cvIndexer) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/bootimage/boot_image_controller_test.go` around lines 977 - 1067, Add a test case to TestResetClusterBootImage that covers the ClusterVersion Get() failure path by supplying a case where history is nil (e.g., name "falls back to 0.0.0 when ClusterVersion unavailable", history: nil, mode automatic, initialOCPVersion: "4.19.0", expectedVersion: "0.0.0") and modify the test setup in the loop so that when tc.history == nil you do NOT add a ClusterVersion to the cvIndexer before creating ctrl.clusterVersionLister; this forces clusterVersionLister.Get("version") to fail and verifies resetClusterBootImage still falls back to "0.0.0".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/controller/bootimage/boot_image_controller_test.go`:
- Around line 977-1067: Add a test case to TestResetClusterBootImage that covers
the ClusterVersion Get() failure path by supplying a case where history is nil
(e.g., name "falls back to 0.0.0 when ClusterVersion unavailable", history: nil,
mode automatic, initialOCPVersion: "4.19.0", expectedVersion: "0.0.0") and
modify the test setup in the loop so that when tc.history == nil you do NOT add
a ClusterVersion to the cvIndexer before creating ctrl.clusterVersionLister;
this forces clusterVersionLister.Get("version") to fail and verifies
resetClusterBootImage still falls back to "0.0.0".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 49f116c7-53d6-49fb-b605-b7cd0dbe1c16
📒 Files selected for processing (5)
pkg/apihelpers/apihelpers.gopkg/controller/bootimage/boot_image_controller.gopkg/controller/bootimage/boot_image_controller_test.gopkg/controller/bootimage/ms_helpers.gopkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/operator/sync.go
- pkg/apihelpers/apihelpers.go
- pkg/controller/bootimage/boot_image_controller.go
- pkg/controller/bootimage/ms_helpers.go
When one or more MachineSets are reconcile-skipped (e.g. due to an OwnerReference, missing arch annotation, or an unrecognisable boot image), the ClusterBootImageAutomatic record becomes stale: a newly added or modified MachineSet may be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not. resetClusterBootImage() addresses this by writing the cluster's install version into ClusterBootImageAutomatic whenever a reconcile-skip occurs without an accompanying error — the most conservative value we can reliably claim — so the skew check correctly surfaces a violation if the cluster is genuinely out of date.
0b5082d to
a2697ed
Compare
|
|
||
| // Only make an API call if there is an update to the skew enforcement status | ||
| if !reflect.DeepEqual(mcop.Status.BootImageSkewEnforcementStatus, newBootImageSkewEnforcementStatus) { | ||
| if !reflect.DeepEqual(mcop.Status.BootImageSkewEnforcementStatus, *newBootImageSkewEnforcementStatus) { |
There was a problem hiding this comment.
This was something caught by the new units; the old code wasn't dereferencing the newBootImageSkewEnforcementStatus pointer, meaning the API call was always being made - no bueno! 🫨
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-fips-proxy-longduration |
|
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/73f76660-64f3-11f1-9601-524e54fbd0f7-0 |
|
@coderabbitai resume |
✅ Action performedReviews resumed. |
|
Pre-merge verified: OCP Version: 5.0.0-0-2026-06-15-062252-test-ci-ln-p1kbq5b-latest Steps:
/label qe-approved |
|
@ptalgulk01: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@djoshy: This pull request references Jira Issue OCPBUGS-87962, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
isabella-janssen
left a comment
There was a problem hiding this comment.
/lgtm
The implementation seems sane, the PR has been verified, and the new units created for the functionality are passing, so this should be safe to send.
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy, isabella-janssen The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
@djoshy: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- What I did
When one or more
MachineSetsare reconcile-skipped (e.g. due to anOwnerReference, missing arch annotation, or an unrecognisable boot image), theClusterBootImageAutomaticrecord becomes stale: a newly added or modifiedMachineSetmay be running an older image, but the recorded OCP version no longer reflects that. The skew check then silently passes when it should not.resetClusterBootImage()addresses this by writing the cluster's install version intoClusterBootImageAutomaticwhenever a reconcile-skip occurs; and so the skew check correctly surfaces a violation if the cluster is genuinely out of date. We fall back to0.0.0as a conservative default if the version could not be queried from cluster resources; this would only be needed if the cluster is in a broken state and is very unlikely.I also refactored a couple of things to simplify and improve readablity of the code, they have been broken out into separate commits.
- How to verify it
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests