RFE-9359: Cordon before rebooting SNO clusters#6192
Conversation
Introduces a "cordon" value for the drainer annotation that indicates that the node should be cordoned but not drained, and adds an additional case to the verb switch in syncNode to handle that scenario. Ref: https://redhat.atlassian.net/browse/RFE-9359 Signed-off-by: Sam Lucidi <slucidi@redhat.com> Assisted-by: Claude Opus 4.6
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@mansam: This pull request references RFE-9359 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughIntroduces a "cordon-only" update path for single-node topology (SNO). A new ChangesSNO Cordon-Only Update Flow
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mansam The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @mansam. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/drain/drain_controller.go (1)
364-427:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRe-check node schedulability before writing cordon-only completion.
The new cordon-only branch can still write
LastAppliedDrainerAnnotationKeyeven if the node is externally uncordoned right after the cordon call. That makes the daemon’sdesired == lastAppliedconvergence check pass while the node is schedulable.Suggested fix
@@ case daemonconsts.DrainerStateCordon: ctrl.logNode(node, "cordoning without drain") if err := ctrl.cordonOrUncordonNode(true, node, drainer); err != nil { @@ if err != nil { klog.Errorf("Error making MCN for Cordon-only Success: %v", err) } @@ + // Mirror the drain-path safety check: do not mark completion if the node is no longer cordoned. + if desiredVerb == daemonconsts.DrainerStateCordon { + node, err = ctrl.nodeLister.Get(name) + if err != nil { + return err + } + if !node.Spec.Unschedulable { + klog.Infof("node %s: externally uncordoned during cordon-only, skipping completion annotation", name) + return nil + } + } + ctrl.logNode(node, "operation successful; applying completion annotation")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/drain/drain_controller.go` around lines 364 - 427, The cordon-only branch in the DrainerStateCordon case writes the LastAppliedDrainerAnnotationKey annotation without verifying the node is still cordoned, creating a race condition where an external uncordon between the cordon call and annotation write would be masked. After the successful cordon call in the cordonOrUncordonNode invocation, re-fetch the node using ctrl.nodeLister.Get (similar to how it's done in the DrainerStateDrain case), check if node.Spec.Unschedulable is still true, and only proceed to write the completion annotation at the end if it remains cordoned. If the node was externally uncordoned, log an informational message and return nil early, skipping the annotation write.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/controller/drain/drain_controller.go`:
- Around line 364-427: The cordon-only branch in the DrainerStateCordon case
writes the LastAppliedDrainerAnnotationKey annotation without verifying the node
is still cordoned, creating a race condition where an external uncordon between
the cordon call and annotation write would be masked. After the successful
cordon call in the cordonOrUncordonNode invocation, re-fetch the node using
ctrl.nodeLister.Get (similar to how it's done in the DrainerStateDrain case),
check if node.Spec.Unschedulable is still true, and only proceed to write the
completion annotation at the end if it remains cordoned. If the node was
externally uncordoned, log an informational message and return nil early,
skipping the annotation write.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 37aa86bf-c775-4f35-a893-39e614907060
📒 Files selected for processing (5)
pkg/controller/drain/drain_controller.gopkg/controller/drain/drain_controller_test.gopkg/daemon/constants/constants.gopkg/daemon/drain.gotest/e2e-single-node/sno_mcd_test.go
Introduces a
cordonvalue for the drainer annotation that indicates that the node should be cordoned but not drained, and adds an additional case to the verb switch in syncNode to handle that scenario. Uses this newcordonmode for SNO to cause the node to be cordoned before rebooting.This was drafted by Claude and then cleaned up by me. I have run it on a CRC cluster and I believe it works as intended.
Ref: https://redhat.atlassian.net/browse/RFE-9359
Assisted-by: Claude Opus 4.6
- What I did
cordonvalue for the drainer annotation.cordoncase tosyncNode()in the drain controller to handle cordoning- How to verify it
Create a new MachineConfig on an SNO environment and observe that the node becomes cordoned before restarting.
- Description for the changelog
Cordon before rebooting single-node clusters.
Summary by CodeRabbit
Release Notes
New Features
Tests