nodetask: add RestartSeid kind (sidecar restart-seid), remove RestartPod #389

Merged
bdchatham merged 2 commits into main from feat/seinodetask-restartnode
Jun 7, 2026

@bdchatham bdchatham commented Jun 7, 2026

Linear: PLT-438 · Consumes seictl#199 (v0.0.56)

What

RestartSeid — the SeiNode-scoped, sidecar-backed successor to RestartPod. Dispatches the seictl restart-seid task, which restarts the seid process in place (SIGTERM the process → kubelet restarts the container) so seid re-reads config.toml without bouncing the sidecar. Removes RestartPod entirely.

Named RestartSeid (not RestartNode) to avoid conflating a Kubernetes Node with a SeiNode, and to align the kind with the restart-seid sidecar task + RestartSeidTask client.

Why

RestartPod deleted the whole pod → restarted the sidecar → lost its in-process readiness flag → seid's start-gate + rbac-proxy probe (both on /v0/healthz) waited for the controller's ~30s mark-ready reapproval → ~30–40s not-signing gap per restart. Restarting only the seid process keeps the sidecar (and its ready flag) alive → no gap. Validated on harbor arctic-1/syncer-0-0-0 (2026-06-07).
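
The difference between the two restart paths can be illustrated with a toy model. This is only a sketch: the `sidecar`, `pod`, `restartPod`, and `restartSeid` names are hypothetical and model nothing beyond the in-process readiness flag described above.

```go
package main

import "fmt"

// sidecar models only the in-process readiness flag. The type and field names
// are hypothetical, not the real sidecar implementation.
type sidecar struct{ ready bool }

// pod pairs one sidecar process with a counter of seid process starts.
type pod struct {
	sidecar    *sidecar
	seidStarts int
}

// newPod starts both containers; a fresh sidecar has not been marked ready.
func newPod() *pod { return &pod{sidecar: &sidecar{}, seidStarts: 1} }

// restartPod models the old RestartPod path: the whole pod is replaced, so the
// sidecar process (and its in-memory flag) starts from scratch.
func restartPod(_ *pod) *pod { return newPod() }

// restartSeid models the RestartSeid path: only seid restarts, the sidecar
// process is left untouched.
func restartSeid(p *pod) *pod {
	p.seidStarts++
	return p
}

func main() {
	p := newPod()
	p.sidecar.ready = true // stands in for the controller's mark-ready approval

	afterSeid := restartSeid(p)
	fmt.Println("after RestartSeid, sidecar ready:", afterSeid.sidecar.ready)

	afterPod := restartPod(p)
	fmt.Println("after RestartPod, sidecar ready:", afterPod.sidecar.ready)
}
```

In this model the flag survives only the in-place restart, which is the property the PR relies on to close the not-signing gap.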

How

  • New RestartSeid kind, empty RestartSeidPayload, CEL union + has(self.restartSeid).
  • restartSeidParams(sidecar.TaskTypeRestartSeid, sidecar.RestartSeidTask{}).
  • Registry (false) = poll-to-completion — the sidecar task waits for seid's RPC to come back and fails loud (no SIGKILL) if seid outlives the grace window; that failure surfaces as a Failed SeiNodeTask (tested). effectiveTimeout 10m (envelope > the sidecar's ~6.5m worst case).
  • requirePhase=Running (default), deadlock-safe — in-place seid restart doesn't change the SeiNode phase.
  • Removed RestartPod: kind, RestartPodPayload/podUID, spec field, the three podUID CEL rules, restartPodParams, restart_pod.go + tests, registry entry. pod_cycle.go kept (used by replace_pod).

Completion = "seid RPC serving again", NOT caught-up/voting — gate height with a downstream AwaitNodesAtHeight.
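
As a rough sketch of the shape described in the list above (not the controller's actual code), the kind-to-task mapping and registry entry might look like the following. Every identifier here (`Spec`, `registryEntry`, `fireAndForget`, `paramsFor`, the error values) is a hypothetical stand-in, and reading the registry's `false` as "not fire-and-forget" is an assumption; only the `restart-seid` task name, the empty payload, and the 10m timeout come from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// All names below are illustrative stand-ins, not the controller's real types.
type Kind string

const KindRestartSeid Kind = "RestartSeid"

// RestartSeidPayload is empty: the caller supplies no parameters.
type RestartSeidPayload struct{}

type Spec struct {
	Kind        Kind
	RestartSeid *RestartSeidPayload
}

// registryEntry captures the two per-kind properties the PR describes.
type registryEntry struct {
	fireAndForget    bool          // assumed meaning: false = poll to completion
	effectiveTimeout time.Duration // envelope around the sidecar's worst case
}

var registry = map[Kind]registryEntry{
	KindRestartSeid: {fireAndForget: false, effectiveTimeout: 10 * time.Minute},
}

var (
	errParamsBuildFailed = errors.New("ParamsBuildFailed")
	errUnsupportedKind   = errors.New("UnsupportedKind")
)

// paramsFor returns the sidecar task type for a spec. A nil payload is a
// reasoned build failure, and an unknown kind fails closed.
func paramsFor(s Spec) (string, error) {
	switch s.Kind {
	case KindRestartSeid:
		if s.RestartSeid == nil {
			return "", fmt.Errorf("%w: restartSeid payload is nil", errParamsBuildFailed)
		}
		return "restart-seid", nil
	default:
		return "", fmt.Errorf("%w: %q", errUnsupportedKind, s.Kind)
	}
}

func main() {
	taskType, err := paramsFor(Spec{Kind: KindRestartSeid, RestartSeid: &RestartSeidPayload{}})
	fmt.Println(taskType, err)

	entry := registry[KindRestartSeid]
	fmt.Println("poll to completion:", !entry.fireAndForget, "timeout:", entry.effectiveTimeout)

	_, err = paramsFor(Spec{Kind: KindRestartSeid})
	fmt.Println(err)
}
```

The default branch is what makes a controller that does not know a kind fail loudly rather than silently no-op.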

Breaking change — verified safe

Removing RestartPod from the kind enum is a one-way door. Confirmed zero live kind: RestartPod SeiNodeTask CRs on harbor/prod/dev (the RestartPod-carrying controller image was never rolled out), and no platform/workflow/runbook reference. kind: RestartPod is now rejected by the enum (regression-tested).
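
The admission behavior can be approximated in plain Go for illustration. The real rules are the kind enum and CEL expressions on the CRD; this sketch only mirrors their effect. The `Spec` shape and the `validate` function are hypothetical, the kind list is abbreviated, and `ReplacePod` as a kind name is an assumption inferred from the replace_pod references.

```go
package main

import "fmt"

// Spec is a hypothetical, abbreviated mirror of the SeiNodeTask spec.
type Spec struct {
	Kind        string
	RestartSeid *struct{}
	ReplacePod  *struct{} // assumed second kind, used only to exercise the union
}

// allowedKinds stands in for the CRD enum. RestartPod is intentionally absent.
var allowedKinds = map[string]bool{"RestartSeid": true, "ReplacePod": true}

// validate mimics three checks: kind is in the enum, at most one payload is
// set, and the payload matching the kind is present.
func validate(s Spec) error {
	if !allowedKinds[s.Kind] {
		return fmt.Errorf("kind %q is not in the enum", s.Kind)
	}
	set := 0
	if s.RestartSeid != nil {
		set++
	}
	if s.ReplacePod != nil {
		set++
	}
	if set > 1 {
		return fmt.Errorf("union violated: %d payloads set", set)
	}
	if s.Kind == "RestartSeid" && s.RestartSeid == nil {
		return fmt.Errorf("kind RestartSeid requires restartSeid")
	}
	if s.Kind == "ReplacePod" && s.ReplacePod == nil {
		return fmt.Errorf("kind ReplacePod requires replacePod")
	}
	return nil
}

func main() {
	fmt.Println(validate(Spec{Kind: "RestartSeid", RestartSeid: &struct{}{}}))
	fmt.Println(validate(Spec{Kind: "RestartSeid"}))
	fmt.Println(validate(Spec{Kind: "RestartSeid", RestartSeid: &struct{}{}, ReplacePod: &struct{}{}}))
	fmt.Println(validate(Spec{Kind: "RestartPod"}))
}
```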

Rollout order

Requires the controller image rebuilt against seictl v0.0.56. Roll out the controller image before/with the CRD: a new-CRD + old-controller window fails closed (UnsupportedKind; a RestartSeid CR fails synthesis, never silently no-ops), so the ordering is safe either way; no live consumer depends on RestartSeid yet.

Test

  • SeiNodeTaskParamsFor maps RestartSeid → restart-seid + RestartSeidTask{}; nil payload → reasoned ParamsBuildFailed.
  • envtest CEL: RestartSeid requires restartSeid; union rejects multi-payload; kind: RestartPod rejected by the enum.
  • controller e2e: RestartSeid drives the sidecar task to completion (poll shape); never-completes → times out; sidecar fail-loud → Failed SeiNodeTask (Reason=TaskFailed).
  • make manifests generate, make test, make test-integration, golangci-lint --new-from-rev=origin/main → 0.
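
The three e2e outcomes listed above (completion, timeout, sidecar fail-loud) can be sketched with a minimal blocking poll loop. This is an illustration under stated assumptions, not the controller's implementation: a real controller would more likely poll across reconcile passes, and `statusFunc`, `pollToCompletion`, `outcome`, and the returned strings are all hypothetical. The 50ms timeout stands in for the 10m effectiveTimeout so the example runs quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type taskState int

const (
	statePending taskState = iota
	stateCompleted
	stateFailed
)

// statusFunc is a hypothetical stand-in for querying the sidecar task status.
type statusFunc func() taskState

var errTaskFailed = errors.New("TaskFailed")

// pollToCompletion checks the task immediately and then once per interval,
// returning when the task is terminal or the context deadline expires.
func pollToCompletion(ctx context.Context, interval time.Duration, status statusFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		switch status() {
		case stateCompleted:
			return nil
		case stateFailed:
			return errTaskFailed
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// outcome runs one poll with a short deadline and maps the result to a label.
func outcome(status statusFunc) string {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := pollToCompletion(ctx, 5*time.Millisecond, status)
	switch {
	case err == nil:
		return "Complete"
	case errors.Is(err, errTaskFailed):
		return "Failed (TaskFailed)"
	case errors.Is(err, context.DeadlineExceeded):
		return "Failed (timed out)"
	default:
		return "Failed (unknown)"
	}
}

// completesAfter returns a status source that reports pending for n polls.
func completesAfter(n int) statusFunc {
	polls := 0
	return func() taskState {
		polls++
		if polls > n {
			return stateCompleted
		}
		return statePending
	}
}

func main() {
	fmt.Println(outcome(completesAfter(3)))
	fmt.Println(outcome(func() taskState { return statePending }))
	fmt.Println(outcome(func() taskState { return stateFailed }))
}
```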

🤖 Generated with Claude Code

cursor Bot commented Jun 7, 2026

PR Summary

High Risk
Breaking CRD/API change removes RestartPod and changes how config.toml restarts are applied (sidecar in-place vs pod delete), affecting validator signing gaps and rollout order with seictl v0.0.56.

Overview
Replaces the RestartPod SeiNodeTask kind with RestartSeid: an empty-payload CRD kind that drives the sidecar restart-seid task (seictl v0.0.56) so seid restarts in place and re-reads config.toml without deleting the pod or resetting sidecar readiness. RestartPod is removed from the enum, along with RestartPodPayload/podUID, controller-side restart-pod execution, shared pod_cycle plumbing (inlined into replace-pod only), and related CEL rules.

The nodetask controller now maps RestartSeid to sidecar.TaskTypeRestartSeid with poll-to-completion semantics, a 10m default execution timeout, and tests for happy path, timeout, and sidecar fail-loud failures. DiscoverPeers docs and CEL/envtest coverage are updated; kind: RestartPod is rejected at admission.

Reviewed by Cursor Bugbot for commit 6f7fba7.

RestartSeid is the SeiNode-scoped, sidecar-backed successor to RestartPod. It
dispatches the seictl restart-seid task (v0.0.56), which restarts the seid
process in place — seid re-reads config.toml WITHOUT bouncing the sidecar, so
the sidecar's ready flag survives and there is no ~30-40s mark-ready reapproval
gap. Empty payload, no caller-supplied pod UID. Poll-to-completion (registry
false): the controller polls until restart-seid reports terminal (the sidecar
waits for seid's RPC to come back, and fails loud — no SIGKILL — if seid
outlives the grace window; that failure surfaces as a Failed SeiNodeTask).
Completion means "seid RPC serving again", not caught-up/voting — gate height
with a downstream AwaitNodesAtHeight.

Named RestartSeid (not RestartNode) to avoid conflating a Kubernetes Node with
a SeiNode, and to align the kind with the restart-seid sidecar task.

RestartPod is removed entirely (kind, RestartPodPayload/podUID, spec field, the
three podUID CEL rules, restartPodParams, restart_pod.go + tests, registry
entry). pod_cycle.go stays — replace_pod still uses it. Verified zero live
kind:RestartPod CRs on harbor/prod/dev before removal.

Requires seictl v0.0.56. Roll out the controller image (built on v0.0.56)
before/with the CRD; a new-CRD + old-controller window fails closed
(UnsupportedKind), never silently.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/seinodetask-restartnode branch from d428fab to bbd20be Compare June 7, 2026 03:26
@bdchatham bdchatham changed the title nodetask: add RestartNode kind (sidecar restart-seid), remove RestartPod nodetask: add RestartSeid kind (sidecar restart-seid), remove RestartPod Jun 7, 2026
podCycle was extracted to share the StatefulSet-fetch / owned-pod / delete
helpers between replace-pod and restart-pod. With RestartPod removed, replace-pod
is the sole user, so the shared abstraction is no longer justified
(CLAUDE.md: no premature helpers). Dissolve it:

- fetchStatefulSet/ownedPods/deletePod become methods on replacePodExecution
  (which now holds cfg ExecutionConfig directly, no embedded podCycle);
  guardSelectorAndReplicas/ownedByStatefulSet are plain unexported funcs.
- Delete pod_cycle.go / pod_cycle_test.go; the still-relevant unit tests move
  into replace_pod_test.go and the leftover restart-named fixtures are renamed
  to replace-pod equivalents.
- Drop podReady (dead — only restart-pod used it).

Pure internal refactor: replace-pod's revision-gated, readiness-blind behavior
is byte-for-byte preserved; no API/CRD change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham merged commit d4c69a9 into main Jun 7, 2026
5 checks passed