
Add optional restart-only predicate to inplace upgrade flow #145

Open
rajathagasthya wants to merge 1 commit into NVIDIA:main from rajathagasthya:restart-only-predicate

rajathagasthya commented Jun 11, 2026

Consumers occasionally need a DaemonSet's pods rolled to a new revision without the disruptive pod-eviction and drain sequence. This happens, for example, when a pod-template change does not affect the driver (such as a label-only change) but still changes the controller revision hash, so the node would otherwise be driven through the full upgrade flow.

Add an optional RestartOnlyPredicate hook, registered via WithRestartOnlyPredicate. In ProcessUpgradeRequiredNodes of the inplace upgrade flow, when the predicate reports that a full upgrade is not needed, cordon the node and route it straight to pod-restart-required, skipping wait-for-jobs, pod-deletion, and drain. The pod is still restarted so the DaemonSet converges to the new revision. Cordoning keeps the node unschedulable if the restart fails, matching the full flow; the uncordon-required state uncordons it on success.

Upgrade state transitions

Today, an out-of-sync node always takes the full flow:

upgrade-required → cordon-required → wait-for-jobs-required → pod-deletion-required
                 → drain-required → pod-restart-required → validation-required
                 → uncordon-required → upgrade-done

With this change, when a registered predicate returns true for the node:

upgrade-required → pod-restart-required → validation-required
                 → uncordon-required → upgrade-done

The node never enters cordon-required, wait-for-jobs-required, pod-deletion-required, or drain-required. It is still cordoned, but directly in the routing step rather than via the cordon-required state, so it stays unschedulable until uncordon-required, exactly as in the full flow. When no predicate is registered, or when the predicate returns false, the full flow above is unchanged.
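A small Go model of the two routes, with the state names from the diagrams above written as string constants (`path` and `walk` are hypothetical helpers, not library code). It checks that the restart-only route skips the four disruptive states while the node is still unschedulable when it reaches pod-restart-required and schedulable again after uncordon-required.

```go
package main

import "fmt"

// State mirrors the upgrade state names in the diagrams; modeling them as
// Go string constants is an assumption of this sketch.
type State string

const (
	UpgradeRequired     State = "upgrade-required"
	CordonRequired      State = "cordon-required"
	WaitForJobsRequired State = "wait-for-jobs-required"
	PodDeletionRequired State = "pod-deletion-required"
	DrainRequired       State = "drain-required"
	PodRestartRequired  State = "pod-restart-required"
	ValidationRequired  State = "validation-required"
	UncordonRequired    State = "uncordon-required"
	UpgradeDone         State = "upgrade-done"
)

// path returns the states an out-of-sync node visits on each route.
func path(restartOnly bool) []State {
	if restartOnly {
		return []State{UpgradeRequired, PodRestartRequired, ValidationRequired,
			UncordonRequired, UpgradeDone}
	}
	return []State{UpgradeRequired, CordonRequired, WaitForJobsRequired,
		PodDeletionRequired, DrainRequired, PodRestartRequired,
		ValidationRequired, UncordonRequired, UpgradeDone}
}

// walk follows a route and reports whether the node was unschedulable when it
// reached pod-restart-required, and whether it is still unschedulable at the end.
func walk(restartOnly bool) (cordonedAtRestart, unschedulableAtEnd bool) {
	// On the restart-only route the node is cordoned in the routing step
	// itself, before it leaves upgrade-required.
	unschedulable := restartOnly
	for _, s := range path(restartOnly) {
		switch s {
		case CordonRequired:
			unschedulable = true
		case PodRestartRequired:
			cordonedAtRestart = unschedulable
		case UncordonRequired:
			unschedulable = false
		}
	}
	return cordonedAtRestart, unschedulable
}

func main() {
	for _, restartOnly := range []bool{false, true} {
		atRestart, atEnd := walk(restartOnly)
		fmt.Printf("restartOnly=%v states=%d cordonedAtRestart=%v unschedulableAtEnd=%v\n",
			restartOnly, len(path(restartOnly)), atRestart, atEnd)
	}
}
```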

The change is additive and backward compatible: a nil predicate (the default, and what every existing consumer gets) preserves current behavior, and podInSyncWithDS is unchanged. If the predicate returns an error or the cordon fails, the node is kept in upgrade-required and retried on a later reconcile, and a Warning event is recorded; an upgrade is never started on an unknown answer. The predicate is not consulted for orphaned pods, for nodes with an explicit upgrade-requested annotation, or for nodes waiting for safe driver load, because the safe-load handshake relies on the full flow to evict workloads before the driver load is unblocked at pod-restart-required. The maxParallelUpgrades throttle applies to both paths.

Example: the GPU Operator uses this to restart driver pods in place when only cosmetic pod-template metadata changed (the helm.sh/chart label bumped by a patch chart release) while DRIVER_CONFIG_DIGEST is unchanged; see NVIDIA/gpu-operator#2527 (draft).
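A rough sketch of what such a predicate body might compare, using minimal stand-ins for the Kubernetes pod-spec types and a hypothetical container name; the actual GPU Operator code in the linked draft may differ. Only the `DRIVER_CONFIG_DIGEST` value is compared, so a bumped `helm.sh/chart` label alone does not force the full flow. As a design choice of this sketch, a missing or empty digest falls back to the full flow rather than guessing.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes pod-spec types; real code would use
// corev1.PodSpec, corev1.Container, and corev1.EnvVar.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct{ Containers []Container }

const digestEnv = "DRIVER_CONFIG_DIGEST"

// envValue returns the value of env var name in the named container, if set.
func envValue(spec PodSpec, container, name string) (string, bool) {
	for _, c := range spec.Containers {
		if c.Name != container {
			continue
		}
		for _, e := range c.Env {
			if e.Name == name {
				return e.Value, true
			}
		}
	}
	return "", false
}

// digestUnchanged is a hypothetical predicate body: restart-only is allowed
// only when the running pod and the DaemonSet's current template carry the
// same non-empty digest. Labels such as helm.sh/chart are not compared.
func digestUnchanged(running, desired PodSpec, container string) bool {
	cur, okCur := envValue(running, container, digestEnv)
	want, okWant := envValue(desired, container, digestEnv)
	return okCur && okWant && cur != "" && cur == want
}

func main() {
	const ctr = "nvidia-driver-ctr" // hypothetical container name
	spec := func(digest string) PodSpec {
		return PodSpec{Containers: []Container{{
			Name: ctr,
			Env:  []EnvVar{{Name: digestEnv, Value: digest}},
		}}}
	}
	running := spec("abc123")
	fmt.Println(digestUnchanged(running, spec("abc123"), ctr)) // same digest: restart-only
	fmt.Println(digestUnchanged(running, spec("def456"), ctr)) // digest changed: full flow
}
```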

rajathagasthya force-pushed the restart-only-predicate branch 2 times, most recently from 328f701 to dfc93f8, on June 11, 2026 20:04
rajathagasthya marked this pull request as ready for review on June 11, 2026 20:57
rajathagasthya (Contributor, Author) commented:

/ok-to-test dfc93f8

Consumers occasionally need a DaemonSet's pods rolled to a new revision
without the disruptive pod-eviction and drain sequence -- for example
when a pod-template change does not affect the managed software (a
label-only change) yet still changes the controller revision hash, so
the node is otherwise driven through the full upgrade flow.

Add an optional RestartOnlyPredicate hook, registered via
WithRestartOnlyPredicate. In the inplace ProcessUpgradeRequiredNodes,
when the predicate reports that a full upgrade is not needed, cordon the
node and route it straight to pod-restart-required, skipping
wait-for-jobs, pod-deletion, and drain. The pod is still restarted so
the DaemonSet converges to the new revision. Cordoning keeps the node
unschedulable if the restart fails, matching the full flow; the
uncordon-required state uncordons it on success.

The change is additive and backward compatible: a nil predicate (the
default, and every existing consumer) preserves current behavior, and
podInSyncWithDS is unchanged. If the predicate returns an error or the
cordon fails, the node is kept in upgrade-required and retried on a
later reconcile, with a Warning event recorded -- an upgrade is never
started on an unknown answer. The predicate is not consulted for
orphaned pods, nodes with an explicit upgrade-requested annotation, or
nodes waiting for safe driver load -- the safe-load handshake relies on
the full flow to evict workloads before the driver load is unblocked at
pod-restart-required. The maxParallelUpgrades throttle applies to both
paths. The upgrade-requested state is captured before the annotation is
cleared, because the node provider re-fetches and overwrites the
in-memory node object.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
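The last point of the commit message is an ordering hazard, sketched below in Go. The `provider` type is hypothetical: it mimics a node provider that re-fetches the node after a write and overwrites the caller's object, and the annotation key and its `"true"` value are placeholders rather than the library's real ones. Reading the annotation after `RemoveAnnotation` would always report false, so the value is captured first; this is what lets upgrade-requested nodes bypass the predicate even though the annotation is gone.

```go
package main

import "fmt"

// upgradeRequestedKey is a hypothetical placeholder, not the real annotation key.
const upgradeRequestedKey = "example.com/upgrade-requested"

type Node struct {
	Name        string
	Annotations map[string]string
}

// provider mimics a node provider that writes to the API server and then
// re-fetches the node, overwriting the caller's in-memory object.
type provider struct{ server map[string]Node }

func (p *provider) RemoveAnnotation(node *Node, key string) error {
	stored, ok := p.server[node.Name]
	if !ok {
		return fmt.Errorf("node %q not found", node.Name)
	}
	fresh := Node{Name: stored.Name, Annotations: map[string]string{}}
	for k, v := range stored.Annotations {
		if k != key {
			fresh.Annotations[k] = v
		}
	}
	p.server[node.Name] = fresh
	*node = fresh // the in-memory object no longer carries the annotation
	return nil
}

// consumeUpgradeRequest captures the annotation first, then clears it, and
// returns the captured value.
func consumeUpgradeRequest(p *provider, node *Node) (bool, error) {
	requested := node.Annotations[upgradeRequestedKey] == "true"
	if err := p.RemoveAnnotation(node, upgradeRequestedKey); err != nil {
		return false, err
	}
	return requested, nil
}

func main() {
	n := Node{Name: "gpu-node-1", Annotations: map[string]string{upgradeRequestedKey: "true"}}
	p := &provider{server: map[string]Node{n.Name: n}}
	requested, err := consumeUpgradeRequest(p, &n)
	fmt.Println("captured:", requested, err)
	fmt.Println("still annotated:", n.Annotations[upgradeRequestedKey] == "true")
}
```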
rajathagasthya force-pushed the restart-only-predicate branch from dfc93f8 to 80df349 on June 16, 2026 20:01