feat(nodetask): add DiscoverPeers + RestartPod SeiNodeTask kinds#386
Conversation
Two composable SeiNodeTask kinds to refresh a running node's persistent peers without baking peer-drift logic into the core SeiNode reconcile: - DiscoverPeers (sidecar-backed): re-resolves the target's spec.peers and writes persistent-peers into config.toml on the PVC via the sidecar discover-peers task (scalar merge — no config-apply needed; disk-only, the running seid picks it up on next restart). - RestartPod (controller-side): restarts the target's pod so seid re-reads config.toml. Completion is content-addressed by pod UID — Execute deletes only the captured-UID pod; Status completes when an owned Ready pod has a different UID. UID is captured at synthesis and persisted on status.task, so it survives reconcile reconstruction and is immune to creationTimestamp second-resolution races (unlike a wall-clock epoch). Sequenced DiscoverPeers -> RestartPod applies fresh peers to a live validator (the arctic-1 EC2->K8s migration use case) via declarative task CRs, composable into Chaos Mesh Workflows. The DiscoverPeers-ok/RestartPod-fails window is a transient, self-healing divergence (documented on the kind), and RestartPod carries a bounded default timeout (10m; DiscoverPeers 2m) so a hand-authored CR cannot hang silently. Empty spec.peers (incl. a label source with no resolvedPeers) fails fast rather than submitting a zero-source discover-peers that would wipe persistent-peers. Conditions follow the Ready/Failed latch + ObservedGeneration conventions; status patches use optimistic concurrency. Deferred (tracked separately): promoting peer-drift to an automatic controller trigger once this proves valuable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR SummaryMedium Risk Overview The nodetask reconciler delegates param building to Tests: broad unit/reconcile coverage, new nodetask envtest for CEL rules, and Reviewed by Cursor Bugbot for commit 467ea31. Bugbot is set up for automated code reviews on this repo. Configure here. |
…out, sweep Cursor Bugbot (HIGH) + CI + maintainer review: - Timeout measured from execution start, not task creation: new status.task.executionStartedAt (stamped once the target is ready + the task is synthesized) so the requirePhase wait isn't charged against the per-kind execution budget. spec.timeoutSeconds is now execution-only. - lint goconst: hoisted repeated test literals to package consts. - Share peer-source building: new task.DiscoverPeerSources is the single impl; planner.discoverPeersTask now wraps it (removed its duplicate). The shared empty-label-skip also removes a latent planner edge that could write an empty persistent-peers for a not-yet-resolved label source. - taskParamsForKind is now a thin dispatcher; all per-kind param constructors live in the task package (internal/task/seinodetask_params.go). Behavior- preserving for existing kinds. - Moved per-kind timeout consts + the unsupported-kind sentinel (task.ErrUnsupportedKind) to their proper homes; trimmed verbose doc comments to load-bearing rationale. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed
Two expert cross-review rounds + a regression pass signed off on the refactor. Note: |
… empty sources - DRY: extract a shared podCycle helper (StatefulSet fetch + selector/multi-replica guards + owned-pod lookup/delete) used by both replace-pod and restart-pod. Behavior-preserving — replace-pod keeps its revision gate (runs before the shared guards) + synchronous no-readiness-wait completion; restart-pod keeps its captured-UID delete predicate + different-UID-Ready readiness wait. - Fix Bugbot MEDIUM "init plan empty discover-peers": the shared DiscoverPeerSources skips an unresolved label source, so the planner's init path could submit a zero-source discover-peers that wipes persistent-peers. New controller-side discover-peers-init task re-derives sources from the LIVE node each reconcile: empty sources -> Failed (consumes one retry-budget unit, re-pends, re-evaluates), so it never writes zero sources and converges once status.resolvedPeers populates, bounded by discoverPeersMaxRetries. A build-time gate can't converge (executor freezes params; buildRunningPlan has no discover-peers). Nodetask path unchanged (synthesis-time fail-fast). Note: discover-peers-init uses setFailed-in-Execute(returns nil) deliberately — only the Status-failed path is governed by the retry budget; returning an error would be unbounded. Don't "fix" it to return err. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed
Regression cross-review signed off on both. CI + Bugbot re-running. |
When no owned pod exists at synthesis (STS cache-lag / brief gap), targetPodUID returned empty and restart-pod then deleted nothing and completed on any owned Ready pod — a silent no-op (seid never restarts, DiscoverPeers peers never apply). - Synthesis: for RestartPod, if targetPodUID is empty, defer (requeue every targetWaitInterval, TargetReady=False/PodNotObserved) until an owned pod is observed and its UID captured, bounded by requirePhaseTimeout (anchored on status.startedAt) → Failed/RestartTargetUnresolved if it never appears. A real UID proceeds as before (delete it; complete only on a different-UID Ready pod). - Backstop: restart-pod Status with an empty UID now stays Running (never completes); a stray empty UID surfaces via the execution timeout as Failed, never success. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ready for human review — automated bar met (
|
| Finding | Resolution |
|---|---|
| Bugbot HIGH — timeout includes requirePhase wait | executionStartedAt (exec-only budget) — 2a91ca9 |
| CI lint goconst | hoisted test literals to consts — 2a91ca9 |
| Maintainer ×4 (consts→top, sentinel→task pkg, shared peer builder, thin dispatcher) | 2a91ca9 |
| Bugbot MEDIUM — init zero-source discover-peers | discover-peers-init defers via retry budget — 1aaaefa |
| Maintainer — DRY restart/replace | shared podCycle helper — 1aaaefa |
| Bugbot MEDIUM — empty-UID silent skip | synthesis-defer + task backstop — f477e34 |
The three inline comments still shown are all from earlier commits and resolved above (the HIGH @387 is the original first-commit comment re-anchored to the diff; Bugbot's per-commit check is green).
Note: spec.timeoutSeconds is now execution-only (additive status.task.executionStartedAt, back-compat). Deferred follow-up (separate): promote peer-drift to an automatic controller trigger once proven.
Over to you for the final look, @bdchatham.
…nd; restart fail-fast Maintainer review: the shared peer builder forced one interface across the planner (frozen plan params) and the nodetask (live, fail-fast) paths, which spawned the discover-peers-init task. Unwind it. - Delete discover-peers-init (task + test + registry). Revert the planner's init discover-peers to main — byte-identical (own inline builder, frozen sidecar discover-peers). The nodetask DiscoverPeers builds its own peer sources with its own empty/label-unresolved fail-fast (never submits a zero-source write that would wipe persistent-peers). Slight, deliberate duplication; no shared interface to conflict. - ErrUnsupportedKind is now a typed error carrying the kind (errors.As), not a sentinel. - RestartPod: no owned pod at synthesis -> fail fast with reason RestartTargetPodNotFound (drop the awaitRestartTarget bounded-defer). targetPodUID reads uncached (APIReader) to avoid cache-lag false negatives. Task-side backstop kept (empty UID never completes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed
Net: the init path is untouched vs main, and the two new tasks are self-contained. Expert regression cross-review signed off. CI + Bugbot re-running. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 39bd88c. Configure here.
…asks The execution-timeout deadline is gated on status.task.executionStartedAt (added this PR). A SeiNodeTask already Running when the controller upgrades to this build has a populated status.task but a nil anchor, so no timeout applied → it could run unbounded after the upgrade. Lazy-stamp executionStartedAt=now (idempotent, only when nil) before the deadline check, giving in-flight legacy tasks a bounded fresh budget from the upgrade moment. Not anchored to startedAt (would put the deadline in the past for a long-running task and spuriously fail it). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed
CI + Bugbot re-running on ec2b591. |
…ram errors Address two review comments on the SeiNodeTask DiscoverPeers/RestartPod work: 1. Param-build errors now carry their own condition reason. A ReasonedError interface + FailureReason(err) helper replace the synthesis-site `if errors.As(...)` that special-cased only UnsupportedKind. ErrUnsupportedKind reports UnsupportedKind; a new paramsError reports ParamsBuildFailed; both call sites (synthesis + driveTask) derive the reason via FailureReason with no conditional. 2. RestartPod takes the pod UID as a required spec field (spec.restartPod.podUID, CEL-enforced non-empty) instead of the controller discovering it. The synthesis path copies it verbatim into status.task.restartedPodUID; the targetPodUID fetch, the APIReader/reader() plumbing, the GetAPIReader() wiring in main.go, and the RestartTargetPodNotFound fail-fast are removed. The caller-owns-UID contract is documented on the PodUID field and the RestartPod kind doc: fetch the UID as late as possible; a stale UID completes as a no-op (the controller does not re-validate against the live pod). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

What
Two new composable
SeiNodeTaskkinds to refresh a running node'spersistent-peerswithout baking peer-drift logic into the coreSeiNodereconcile:DiscoverPeers(sidecar-backed) — re-resolves the target SeiNode'sspec.peersand writespersistent-peersintoconfig.tomlon the PVC via the existing sidecardiscover-peerstask. Scalar merge → preserves the rest of the config, so noconfig-applyneeded. Disk-only: the running seid picks it up on its next restart.RestartPod(controller-side) — restarts the target's pod so seid re-readsconfig.toml.Sequenced
DiscoverPeers→RestartPodapplies fresh peers to a live node (the arctic-1 EC2→K8s validator migration use case), composable into Chaos Mesh Workflows.Why
Migrated K8s validators need their
persistent-peersupdated after aspec.peerschange (e.g. adding the proposer cohort). We established that:peersis not in the controller's rollouttemplateHash, so a peers edit doesn't trigger a plan;discover-peersnorconfig-apply, andmark-readyis a no-op — so an image bump restarts seid onto the staleconfig.toml;persistent-peersinto a live seid (seictlconfig-reload's seid-signal is aTODO).Rather than enshrine a peer-drift trigger in the core reconcile (a heavier, harder-to-reverse change with fleet-wide blast radius), this delivers the capability as opt-in, composable tasks. Promote to an automatic controller trigger later if it proves worth it (tracked separately).
Design notes
RestartPodcompletion is content-addressed by pod UID. Execute deletes only the captured-UID pod; Status completes when an ownedReadypod has a different UID. The UID is captured at synthesis and persisted onstatus.task(survives reconcile reconstruction). This is immune to thecreationTimestamp1s-resolution race a wall-clock epoch would have. It does not reusereplace-pod(which keys on StatefulSet revision and no-ops when the spec is unchanged).RestartPoddefaults to a 10m timeout,DiscoverPeers2m (explicitspec.timeoutSecondsalways wins; sidecar gov/await kinds stay unbounded). A hand-authoredRestartPodwhose pod never goes Ready transitions toFailed(reason=Timeout)rather than requeuing forever.spec.peers— including alabelsource whosestatus.resolvedPeersis empty — fails withParamsBuildFailedrather than submitting a zero-sourcediscover-peersthat would silently wipepersistent-peers.RestartPodis the same single-replica + OnDelete + RWO-PVC stop-then-start as the existingreplace-pod; multi-replica is a terminal guard.DiscoverPeerssucceeds thenRestartPodfails → fresh config on disk + old seid still running; transient, observable, self-healing on the next successfulRestartPod.Tests
taskParamsForKindfor both kinds (ec2Tags/label/static/mixed-empty-label/empty-peers/nil-target; RestartPod params + UID threading),restart-podexecution (UID match/no-op/missing/terminating/multi-replica guard, same-second replacement regression), CEL union (reject zero/multiple payloads, kind immutability).make test+make test-integrationgreen;make lint0 new findings vs main.Follow-up (deferred)
seidpod has no explicitterminationGracePeriodSeconds(inherits 30s) — if seid needs longer to flush on shutdown, worth setting in the StatefulSet generator. Independent of this PR.🤖 Generated with Claude Code