RFE-9359: Cordon before rebooting SNO clusters by mansam · Pull Request #6192 · openshift/machine-config-operator

mansam · 2026-06-15T20:13:01Z

Introduces a cordon value for the drainer annotation that indicates that the node should be cordoned but not drained, and adds an additional case to the verb switch in syncNode to handle that scenario. Uses this new cordon mode for SNO to cause the node to be cordoned before rebooting.

This was drafted by Claude and then cleaned up by me. I have run it on a CRC cluster and I believe it works as intended.

Ref: https://redhat.atlassian.net/browse/RFE-9359

Assisted-by: Claude Opus 4.6

- What I did

Added a new cordon value for the drainer annotation.
Added a cordon case to syncNode() in the drain controller to handle cordoning.
Added a cordon-only path to the daemon for when drain is not required.
Added an e2e test.

- How to verify it

Create a new MachineConfig on an SNO environment and observe that the node becomes cordoned before restarting.

- Description for the changelog

Cordon before rebooting single-node clusters.

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new cordon-only operation for single-node clusters that marks nodes as unschedulable during machine config updates without draining workloads, optimizing the update process.
Tests
- Added comprehensive test coverage for cordon-only behavior, including scenarios where cordoning is already completed and e2e validation during machine config updates.

Introduces a "cordon" value for the drainer annotation that indicates that the node should be cordoned but not drained, and adds an additional case to the verb switch in syncNode to handle that scenario.

Ref: https://redhat.atlassian.net/browse/RFE-9359
Signed-off-by: Sam Lucidi <slucidi@redhat.com>
Assisted-by: Claude Opus 4.6

openshift-merge-bot · 2026-06-15T20:13:04Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-06-15T20:13:06Z

@mansam: This pull request references RFE-9359 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set.


coderabbitai · 2026-06-15T20:13:17Z

Walkthrough

Introduces a "cordon-only" update path for single-node topology (SNO). A new DrainerStateCordon = "cordon" constant is added. The daemon's performDrain() delegates to a new performCordonOnly() when drain is not required, polling until the cordon annotation is applied. The drain controller gains a matching switch case that cordons the node without draining and emits upgrade-monitor conditions.

Changes

SNO Cordon-Only Update Flow

Layer / File(s)	Summary
DrainerStateCordon constant and daemon performCordonOnly() `pkg/daemon/constants/constants.go`, `pkg/daemon/drain.go`	Adds `DrainerStateCordon = "cordon"`, updates imports, and rewires `performDrain()` to call a new `performCordonOnly()` helper that sets the desired drainer annotation and polls (2-min timeout) until `DesiredDrainerAnnotationKey` matches `LastAppliedDrainerAnnotationKey` on SNO.
Drain controller cordon handler `pkg/controller/drain/drain_controller.go`	Adds a `daemonconsts.DrainerStateCordon` switch case that calls `cordonOrUncordonNode(true, ...)` and emits `GenerateAndApplyMachineConfigNodes` conditions for cordon failure and success, returning immediately on cordon error.
Unit and e2e tests `pkg/controller/drain/drain_controller_test.go`, `test/e2e-single-node/sno_mcd_test.go`	Adds `testCordonState` constant and two `TestSyncNode` subtests asserting correct patch operations for cordon-requested and already-completed cordon states; adds `TestSNOCordonDuringUpdate` e2e test validating the full cordon-during-update and rollback lifecycle on SNO.
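To make the controller-side change concrete, here is a simplified sketch of a verb switch with a cordon-only case. It assumes a hypothetical `nodeOps` interface and a `fakeNode` in place of the real `cordonOrUncordonNode` helper, and it omits the MachineConfigNode condition reporting. The `"drain"` and `"uncordon"` verb strings are assumptions; only `"cordon"` is stated in the summary above.

```go
package main

import "fmt"

// Verb values for illustration. "cordon" matches the constant described in
// the summary; "drain" and "uncordon" are assumed here.
const (
	verbDrain    = "drain"
	verbUncordon = "uncordon"
	verbCordon   = "cordon"
)

// nodeOps is a hypothetical abstraction over the node operations the
// controller performs.
type nodeOps interface {
	SetUnschedulable(bool) error
	EvictPods() error
}

// syncVerb dispatches on the requested verb. The cordon case marks the
// node unschedulable and returns without evicting anything.
func syncVerb(verb string, n nodeOps) error {
	switch verb {
	case verbUncordon:
		return n.SetUnschedulable(false)
	case verbCordon:
		// Cordon only: any error is returned immediately, and no pods
		// are evicted.
		return n.SetUnschedulable(true)
	case verbDrain:
		if err := n.SetUnschedulable(true); err != nil {
			return err
		}
		return n.EvictPods()
	default:
		return fmt.Errorf("unknown drainer verb %q", verb)
	}
}

// fakeNode records what was done to it.
type fakeNode struct {
	unschedulable bool
	evictions     int
}

func (f *fakeNode) SetUnschedulable(v bool) error { f.unschedulable = v; return nil }
func (f *fakeNode) EvictPods() error              { f.evictions++; return nil }

func main() {
	n := &fakeNode{}
	if err := syncVerb(verbCordon, n); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("unschedulable:", n.unschedulable, "evictions:", n.evictions)
}
```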

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	TestSNOCordonDuringUpdate violates requirement `#2` (Setup/cleanup): test creates cluster-scoped MachineConfig resource without guaranteed cleanup if assertions fail before explicit deletion at line...	Wrap MachineConfig creation in defer or use t.Cleanup() to ensure deletion executes even if an assertion fails (e.g., at line 411-412 WaitForRenderedConfig). This prevents resource leaks in test failures.
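The cleanup recommendation in the second warning can be illustrated with a small sketch. It assumes a hypothetical `fakeCluster` in place of the real API client and a narrow `cleanupRegistrar` interface (which `*testing.T` satisfies through its `Cleanup(func())` method) so the example runs outside `go test`. The helper name `createMachineConfig` is likewise illustrative and not taken from the PR.

```go
package main

import "fmt"

// cleanupRegistrar is the subset of *testing.T used by the helper below.
type cleanupRegistrar interface {
	Cleanup(func())
}

// fakeCluster is a hypothetical stand-in for the cluster API client.
type fakeCluster struct {
	configs map[string]bool
}

func (c *fakeCluster) Create(name string) { c.configs[name] = true }
func (c *fakeCluster) Delete(name string) { delete(c.configs, name) }

// createMachineConfig creates the resource and registers its deletion
// right away, so the deletion runs even if a later assertion fails.
func createMachineConfig(t cleanupRegistrar, c *fakeCluster, name string) {
	c.Create(name)
	t.Cleanup(func() { c.Delete(name) })
}

// fakeT mimics testing.T's behaviour of running cleanups in
// last-registered, first-run order when the test finishes.
type fakeT struct {
	cleanups []func()
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	cluster := &fakeCluster{configs: map[string]bool{}}
	t := &fakeT{}

	createMachineConfig(t, cluster, "sno-cordon-test")
	fmt.Println("during test:", len(cluster.configs))

	// A failed assertion would end the test here; cleanups still run.
	t.finish()
	fmt.Println("after test:", len(cluster.configs))
}
```

Registering the deletion in the same helper that creates the resource keeps the two steps together, which is the main point of the suggested resolution.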

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'RFE-9359: Cordon before rebooting SNO clusters' clearly and specifically describes the main change—adding cordoning functionality for single-node OpenShift clusters before reboot.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	Tests added in this PR use standard Go testing framework (not Ginkgo). The custom check applies only to Ginkgo tests, so it is not applicable to these tests.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests (It(), Describe(), Context(), etc.) were added in this PR. The new test TestSNOCordonDuringUpdate is a standard Go test function, not a Ginkgo test, so the check does not ap...
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The new TestSNOCordonDuringUpdate test is in test/e2e-single-node/ (SNO-specific directory), uses GetSingleNodeByRole() which asserts exactly 1 master node, and is explicitly documented as SNO-spec...
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces topology-aware cordon-only mechanism for SNO. All changes check ControlPlaneTopology before applying SNO-specific behavior. No scheduling constraints (affinity, PDBs, topology spread,...
Ote Binary Stdout Contract	✅ Passed	The PR adds test file with only t.Logf() calls (test-scoped, not process-level) and production code using klog.Info(). No process-level stdout writes that would violate OTE binary JSON contract.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The new test TestSNOCordonDuringUpdate is a standard Go test (not Ginkgo), uses only cluster-internal Kubernetes APIs, and contains no IPv4 assumptions or external connectivity requirements.
No-Weak-Crypto	✅ Passed	No weak cryptography detected. PR adds cordon feature without crypto imports, algorithms, custom implementations, or non-constant-time secret comparisons.
Container-Privileges	✅ Passed	PR modifies only Go source files (daemon and controller logic), not Kubernetes manifests or container configs that the container-privileges check targets.
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data exposed in logs. New logging contains only operational messages (node names, status booleans, generic error descriptions), no passwords, tokens, API keys, PII, or credentials.


openshift-ci · 2026-06-15T20:14:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mansam
Once this PR has been reviewed and has the lgtm label, please assign pablintino for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-06-15T20:14:08Z

Hi @mansam. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/controller/drain/drain_controller.go (1)

364-427: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-check node schedulability before writing cordon-only completion.

The new cordon-only branch can still write LastAppliedDrainerAnnotationKey even if the node is externally uncordoned right after the cordon call. That makes the daemon’s desired == lastApplied convergence check pass while the node is schedulable.

Suggested fix

@@
 	case daemonconsts.DrainerStateCordon:
 		ctrl.logNode(node, "cordoning without drain")
 		if err := ctrl.cordonOrUncordonNode(true, node, drainer); err != nil {
@@
 		if err != nil {
 			klog.Errorf("Error making MCN for Cordon-only Success: %v", err)
 		}
@@
+	// Mirror the drain-path safety check: do not mark completion if the node is no longer cordoned.
+	if desiredVerb == daemonconsts.DrainerStateCordon {
+		node, err = ctrl.nodeLister.Get(name)
+		if err != nil {
+			return err
+		}
+		if !node.Spec.Unschedulable {
+			klog.Infof("node %s: externally uncordoned during cordon-only, skipping completion annotation", name)
+			return nil
+		}
+	}
+
 	ctrl.logNode(node, "operation successful; applying completion annotation")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/drain/drain_controller.go` around lines 364 - 427, The
cordon-only branch in the DrainerStateCordon case writes the
LastAppliedDrainerAnnotationKey annotation without verifying the node is still
cordoned, creating a race condition where an external uncordon between the
cordon call and annotation write would be masked. After the successful cordon
call in the cordonOrUncordonNode invocation, re-fetch the node using
ctrl.nodeLister.Get (similar to how it's done in the DrainerStateDrain case),
check if node.Spec.Unschedulable is still true, and only proceed to write the
completion annotation at the end if it remains cordoned. If the node was
externally uncordoned, log an informational message and return nil early,
skipping the annotation write.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 37aa86bf-c775-4f35-a893-39e614907060

📥 Commits

Reviewing files that changed from the base of the PR and between 49eaf75 and b88df89.

📒 Files selected for processing (5)

pkg/controller/drain/drain_controller.go
pkg/controller/drain/drain_controller_test.go
pkg/daemon/constants/constants.go
pkg/daemon/drain.go
test/e2e-single-node/sno_mcd_test.go

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 15, 2026

openshift-ci Bot requested review from HarshwardhanPatil07 and ptalgulk01 June 15, 2026 20:14

openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

RFE-9359: Cordon before rebooting SNO clusters #6192

mansam wants to merge 1 commit into openshift:main from mansam:cordon-sno

2 participants
