feat(nodetask): add DiscoverPeers + RestartPod SeiNodeTask kinds by bdchatham · Pull Request #386 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-06T20:22:58Z

What

Two new composable SeiNodeTask kinds to refresh a running node's persistent-peers without baking peer-drift logic into the core SeiNode reconcile:

DiscoverPeers (sidecar-backed) — re-resolves the target SeiNode's spec.peers and writes persistent-peers into config.toml on the PVC via the existing sidecar discover-peers task. Scalar merge → preserves the rest of the config, so no config-apply needed. Disk-only: the running seid picks it up on its next restart.
RestartPod (controller-side) — restarts the target's pod so seid re-reads config.toml.

Sequenced DiscoverPeers → RestartPod applies fresh peers to a live node (the arctic-1 EC2→K8s validator migration use case), composable into Chaos Mesh Workflows.

Why

Migrated K8s validators need their persistent-peers updated after a spec.peers change (e.g. adding the proposer cohort). We established that:

peers is not in the controller's rollout templateHash, so a peers edit doesn't trigger a plan;
the NodeUpdate (image) plan runs neither discover-peers nor config-apply, and mark-ready is a no-op — so an image bump restarts seid onto the stale config.toml;
there is no working hot-reload of persistent-peers into a live seid (seictl config-reload's seid-signal is a TODO).

Rather than enshrine a peer-drift trigger in the core reconcile (a heavier, harder-to-reverse change with fleet-wide blast radius), this delivers the capability as opt-in, composable tasks. Promote to an automatic controller trigger later if it proves worth it (tracked separately).

Design notes

RestartPod completion is content-addressed by pod UID. Execute deletes only the captured-UID pod; Status completes when an owned Ready pod has a different UID. The UID is captured at synthesis and persisted on status.task (survives reconcile reconstruction). This is immune to the creationTimestamp 1s-resolution race a wall-clock epoch would have. It does not reuse replace-pod (which keys on StatefulSet revision and no-ops when the spec is unchanged).
Bounded by default. RestartPod defaults to a 10m timeout, DiscoverPeers 2m (explicit spec.timeoutSeconds always wins; sidecar gov/await kinds stay unbounded). A hand-authored RestartPod whose pod never goes Ready transitions to Failed(reason=Timeout) rather than requeuing forever.
Fail-fast on empty peers. A target with no spec.peers — including a label source whose status.resolvedPeers is empty — fails with ParamsBuildFailed rather than submitting a zero-source discover-peers that would silently wipe persistent-peers.
Double-sign safe. RestartPod is the same single-replica + OnDelete + RWO-PVC stop-then-start as the existing replace-pod; multi-replica is a terminal guard.
Atomicity caveat (documented on the kind): DiscoverPeers succeeds then RestartPod fails → fresh config on disk + old seid still running; transient, observable, self-healing on the next successful RestartPod.

Tests

Unit: taskParamsForKind for both kinds (ec2Tags/label/static/mixed-empty-label/empty-peers/nil-target; RestartPod params + UID threading), restart-pod execution (UID match/no-op/missing/terminating/multi-replica guard, same-second replacement regression), CEL union (reject zero/multiple payloads, kind immutability).
Reconcile (fake client): both kinds happy-path, empty-peers→Failed, label-empty-resolved→Failed, pod-missing→waits→completes, never-Ready→Timeout.
envtest: CEL validation against a real apiserver.
make test + make test-integration green; make lint 0 new findings vs main.

Follow-up (deferred)

Promote peer-drift to an automatic controller trigger once proven.
seid pod has no explicit terminationGracePeriodSeconds (inherits 30s) — if seid needs longer to flush on shutdown, worth setting in the StatefulSet generator. Independent of this PR.

🤖 Generated with Claude Code

Two composable SeiNodeTask kinds to refresh a running node's persistent peers without baking peer-drift logic into the core SeiNode reconcile: - DiscoverPeers (sidecar-backed): re-resolves the target's spec.peers and writes persistent-peers into config.toml on the PVC via the sidecar discover-peers task (scalar merge — no config-apply needed; disk-only, the running seid picks it up on next restart). - RestartPod (controller-side): restarts the target's pod so seid re-reads config.toml. Completion is content-addressed by pod UID — Execute deletes only the captured-UID pod; Status completes when an owned Ready pod has a different UID. UID is captured at synthesis and persisted on status.task, so it survives reconcile reconstruction and is immune to creationTimestamp second-resolution races (unlike a wall-clock epoch). Sequenced DiscoverPeers -> RestartPod applies fresh peers to a live validator (the arctic-1 EC2->K8s migration use case) via declarative task CRs, composable into Chaos Mesh Workflows. The DiscoverPeers-ok/RestartPod-fails window is a transient, self-healing divergence (documented on the kind), and RestartPod carries a bounded default timeout (10m; DiscoverPeers 2m) so a hand-authored CR cannot hang silently. Empty spec.peers (incl. a label source with no resolvedPeers) fails fast rather than submitting a zero-source discover-peers that would wipe persistent-peers. Conditions follow the Ready/Failed latch + ObservedGeneration conventions; status patches use optimistic concurrency. Deferred (tracked separately): promoting peer-drift to an automatic controller trigger once this proves valuable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor · 2026-06-06T20:23:03Z

PR Summary

Medium Risk
Touches live validator pods (delete/restart) and on-disk peer config; mistakes could affect connectivity or cause brief downtime, though single-replica/RWO guards and fail-fast peer validation limit blast radius.

Overview
Adds two SeiNodeTask kinds for opt-in peer refresh on running nodes: DiscoverPeers (sidecar writes persistent-peers to disk from the target’s spec.peers) and RestartPod (deletes a caller-supplied pod UID so seid reloads config.toml). CRD/API gains union validation, executionStartedAt / restartedPodUID status fields, and clarified timeout semantics (execution budget starts after requirePhase, with per-kind defaults for the new kinds).

The nodetask reconciler delegates param building to task.SeiNodeTaskParamsFor, uses FailureReason for stable fail reasons, stamps execution metadata at synthesis, and gains pod delete RBAC. restart-pod is a new controller task (UID-based completion, shared podCycle with refactored replace-pod). DiscoverPeers fails fast when peer sources are empty/unresolved (unlike the planner init path).

Tests: broad unit/reconcile coverage, new nodetask envtest for CEL rules, and make test-integration extended to that package.

^{Reviewed by Cursor Bugbot for commit 467ea31. Bugbot is set up for automated code reviews on this repo. Configure here.}

…out, sweep Cursor Bugbot (HIGH) + CI + maintainer review: - Timeout measured from execution start, not task creation: new status.task.executionStartedAt (stamped once the target is ready + the task is synthesized) so the requirePhase wait isn't charged against the per-kind execution budget. spec.timeoutSeconds is now execution-only. - lint goconst: hoisted repeated test literals to package consts. - Share peer-source building: new task.DiscoverPeerSources is the single impl; planner.discoverPeersTask now wraps it (removed its duplicate). The shared empty-label-skip also removes a latent planner edge that could write an empty persistent-peers for a not-yet-resolved label source. - taskParamsForKind is now a thin dispatcher; all per-kind param constructors live in the task package (internal/task/seinodetask_params.go). Behavior- preserving for existing kinds. - Moved per-kind timeout consts + the unsupported-kind sentinel (task.ErrUnsupportedKind) to their proper homes; trimmed verbose doc comments to load-bearing rationale. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-06T20:55:02Z

Pushed 2a91ca9 addressing all review feedback:

Cursor Bugbot (HIGH, timeout includes target wait): execution budget now measured from a new status.task.executionStartedAt (stamped once the target is ready), so the requirePhase wait is no longer charged against the per-kind timeout. spec.timeoutSeconds is now execution-only (documented).
CI lint (goconst): repeated test literals hoisted to consts.
Your 4 comments: consts → top of file; ErrUnsupportedKind → task package; shared peer-source builder (planner de-duped); taskParamsForKind → thin dispatcher with per-kind constructors in the task package; verbose doc comments trimmed.

Two expert cross-review rounds + a regression pass signed off on the refactor. Note: spec.timeoutSeconds semantics changed from total→execution-only (additive executionStartedAt status field, back-compat).

… empty sources - DRY: extract a shared podCycle helper (StatefulSet fetch + selector/multi-replica guards + owned-pod lookup/delete) used by both replace-pod and restart-pod. Behavior-preserving — replace-pod keeps its revision gate (runs before the shared guards) + synchronous no-readiness-wait completion; restart-pod keeps its captured-UID delete predicate + different-UID-Ready readiness wait. - Fix Bugbot MEDIUM "init plan empty discover-peers": the shared DiscoverPeerSources skips an unresolved label source, so the planner's init path could submit a zero-source discover-peers that wipes persistent-peers. New controller-side discover-peers-init task re-derives sources from the LIVE node each reconcile: empty sources -> Failed (consumes one retry-budget unit, re-pends, re-evaluates), so it never writes zero sources and converges once status.resolvedPeers populates, bounded by discoverPeersMaxRetries. A build-time gate can't converge (executor freezes params; buildRunningPlan has no discover-peers). Nodetask path unchanged (synthesis-time fail-fast). Note: discover-peers-init uses setFailed-in-Execute(returns nil) deliberately — only the Status-failed path is governed by the retry budget; returning an error would be unbounded. Don't "fix" it to return err. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-06T21:26:35Z

Pushed 1aaaefa:

Bugbot MEDIUM (init plan empty discover-peers): fixed — new controller-side discover-peers-init task re-derives peer sources from the live node each reconcile; empty/unresolved sources defer via the retry budget (bounded) instead of submitting a zero-source write that could wipe persistent-peers. Converges once status.resolvedPeers populates.
Bugbot HIGH (timeout includes target wait): already resolved in 2a91ca9 (execution budget from executionStartedAt); the re-anchored inline was the original-commit comment.
DRY (maintainer follow-up): extracted a shared podCycle helper so restart-pod reuses replace-pod's pod-deletion plumbing — both behavior-preserving.

Regression cross-review signed off on both. CI + Bugbot re-running.

When no owned pod exists at synthesis (STS cache-lag / brief gap), targetPodUID returned empty and restart-pod then deleted nothing and completed on any owned Ready pod — a silent no-op (seid never restarts, DiscoverPeers peers never apply). - Synthesis: for RestartPod, if targetPodUID is empty, defer (requeue every targetWaitInterval, TargetReady=False/PodNotObserved) until an owned pod is observed and its UID captured, bounded by requirePhaseTimeout (anchored on status.startedAt) → Failed/RestartTargetUnresolved if it never appears. A real UID proceeds as before (delete it; complete only on a different-UID Ready pod). - Backstop: restart-pod Status with an empty UID now stays Running (never completes); a stray empty UID surfaces via the execution timeout as Failed, never success. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-06T21:52:52Z

Ready for human review — automated bar met (`f477e34`)

All CI checks green and Cursor Bugbot passes with no new findings:

Cursor Bugbot   pass     lint   pass     test   pass     test-integration   pass     verify-generated   pass

Review trail (each fix had a regression cross-review before push):

Finding	Resolution
Bugbot HIGH — timeout includes requirePhase wait	`executionStartedAt` (exec-only budget) — `2a91ca9`
CI lint goconst	hoisted test literals to consts — `2a91ca9`
Maintainer ×4 (consts→top, sentinel→task pkg, shared peer builder, thin dispatcher)	`2a91ca9`
Bugbot MEDIUM — init zero-source discover-peers	`discover-peers-init` defers via retry budget — `1aaaefa`
Maintainer — DRY restart/replace	shared `podCycle` helper — `1aaaefa`
Bugbot MEDIUM — empty-UID silent skip	synthesis-defer + task backstop — `f477e34`

The three inline comments still shown are all from earlier commits and resolved above (the HIGH @387 is the original first-commit comment re-anchored to the diff; Bugbot's per-commit check is green).

Note: spec.timeoutSeconds is now execution-only (additive status.task.executionStartedAt, back-compat). Deferred follow-up (separate): promote peer-drift to an automatic controller trigger once proven.

Over to you for the final look, @bdchatham.

…nd; restart fail-fast Maintainer review: the shared peer builder forced one interface across the planner (frozen plan params) and the nodetask (live, fail-fast) paths, which spawned the discover-peers-init task. Unwind it. - Delete discover-peers-init (task + test + registry). Revert the planner's init discover-peers to main — byte-identical (own inline builder, frozen sidecar discover-peers). The nodetask DiscoverPeers builds its own peer sources with its own empty/label-unresolved fail-fast (never submits a zero-source write that would wipe persistent-peers). Slight, deliberate duplication; no shared interface to conflict. - ErrUnsupportedKind is now a typed error carrying the kind (errors.As), not a sentinel. - RestartPod: no owned pod at synthesis -> fail fast with reason RestartTargetPodNotFound (drop the awaitRestartTarget bounded-defer). targetPodUID reads uncached (APIReader) to avoid cache-lag false negatives. Task-side backstop kept (empty UID never completes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-06T22:23:09Z

Pushed 39bd88c — unwound the over-built complexity per your call + addressed the two new comments:

Unwind: deleted discover-peers-init; reverted the planner's init discover-peers to main (byte-identical — git diff origin/main -- internal/planner/planner.go is empty); the nodetask DiscoverPeers now builds its own peer sources with its own empty/label-unresolved fail-fast (slight, deliberate duplication — no shared interface to conflict). The nodetask still can't submit a zero-source write that would wipe persistent-peers.
:149 — ErrUnsupportedKind is now a typed error carrying the kind (errors.As).
:159 — RestartPod no-pod-at-synthesis → fail fast (RestartTargetPodNotFound), dropping the awaitRestartTarget defer; uncached read avoids cache-lag false negatives; task-side backstop kept.

Net: the init path is untouched vs main, and the two new tasks are self-contained. Expert regression cross-review signed off. CI + Bugbot re-running.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 39bd88c. Configure here.}

…asks The execution-timeout deadline is gated on status.task.executionStartedAt (added this PR). A SeiNodeTask already Running when the controller upgrades to this build has a populated status.task but a nil anchor, so no timeout applied → it could run unbounded after the upgrade. Lazy-stamp executionStartedAt=now (idempotent, only when nil) before the deadline check, giving in-flight legacy tasks a bounded fresh budget from the upgrade moment. Not anchored to startedAt (would put the deadline in the past for a long-running task and spuriously fail it). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-06T22:37:46Z

Pushed ec2b591. Status of the Bugbot findings on the latest commit:

In-flight tasks lose timeout (HIGH) — fixed: driveTask lazy-stamps executionStartedAt=now when a populated status.task has a nil anchor (a task synthesized by a pre-this-change controller), so in-flight legacy tasks get a bounded fresh budget on upgrade instead of running unbounded. Idempotent; not anchored to startedAt (would back-date the deadline).
Task timeout includes target wait (HIGH) — stale/re-anchored, resolved in 2a91ca9: the budget measures from executionStartedAt (stamped after the target is ready), not startedAt.
Empty pod UID skips restart (MEDIUM) — resolved in 39bd88c: no owned pod at synthesis → fail fast (RestartTargetPodNotFound); task-side backstop keeps an empty UID in Running (never completes). The defer machinery is gone.

CI + Bugbot re-running on ec2b591.

…ram errors Address two review comments on the SeiNodeTask DiscoverPeers/RestartPod work: 1. Param-build errors now carry their own condition reason. A ReasonedError interface + FailureReason(err) helper replace the synthesis-site `if errors.As(...)` that special-cased only UnsupportedKind. ErrUnsupportedKind reports UnsupportedKind; a new paramsError reports ParamsBuildFailed; both call sites (synthesis + driveTask) derive the reason via FailureReason with no conditional. 2. RestartPod takes the pod UID as a required spec field (spec.restartPod.podUID, CEL-enforced non-empty) instead of the controller discovering it. The synthesis path copies it verbatim into status.task.restartedPodUID; the targetPodUID fetch, the APIReader/reader() plumbing, the GetAPIReader() wiring in main.go, and the RestartTargetPodNotFound fail-fast are removed. The caller-owns-UID contract is documented on the PodUID field and the RestartPod kind doc: fetch the UID as late as possible; a stale UID completes as a no-op (the controller does not re-validate against the live pod). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go

bdchatham commented Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go Outdated

Comment thread internal/controller/nodetask/controller.go Outdated

Comment thread internal/controller/nodetask/controller.go Outdated

Comment thread internal/controller/nodetask/controller.go

cursor Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread internal/planner/planner.go Outdated

cursor Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go

bdchatham commented Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go Outdated

Comment thread internal/controller/nodetask/controller.go Outdated

cursor Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go

Comment thread internal/controller/nodetask/controller.go Outdated

bdchatham commented Jun 6, 2026

View reviewed changes

Comment thread internal/controller/nodetask/controller.go Outdated

Comment thread internal/controller/nodetask/controller.go Outdated

bdchatham merged commit 204c1c0 into main Jun 6, 2026
5 checks passed

Conversation

bdchatham commented Jun 6, 2026

What

Why

Design notes

Tests

Follow-up (deferred)

Uh oh!

cursor Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdchatham commented Jun 6, 2026

Uh oh!

Uh oh!

bdchatham commented Jun 6, 2026

Uh oh!

Uh oh!

bdchatham commented Jun 6, 2026

Ready for human review — automated bar met (f477e34)

Uh oh!

Uh oh!

Uh oh!

bdchatham commented Jun 6, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bdchatham commented Jun 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 6, 2026 •

edited

Loading

Ready for human review — automated bar met (`f477e34`)