
feat(sidecar): add restart-seid task for in-place seid restart #199

Merged
bdchatham merged 2 commits into main from brandon2/plt-438-restart-seid
Jun 7, 2026

@bdchatham

Contributor

Linear: PLT-438

What

Adds a restart-seid sidecar task that restarts the co-located seid process in place — seid re-reads config.toml on the restart without bouncing the sidecar.

Why

Today the only way to make a running node re-read config (e.g. a refreshed persistent-peers set from discover-peers) is to delete the whole pod (the controller's RestartPod kind). That restarts the sidecar too, which loses its in-process readiness flag; seid's start-gate and the rbac-proxy readiness probe both sit on /v0/healthz, which only returns 200 after mark-ready — re-marked by the controller on a ~30s poll. Net: a ~30–40s not-signing gap per restart on a validator.

Restarting only the seid process keeps the sidecar (and its ready flag) alive → /v0/healthz stays 200 → seid reboots immediately, no gap. Validated on harbor arctic-1/syncer-0-0-0 (2026-06-07): seid restarts in place, sidecar restarts=0/ready throughout, pod UID unchanged.

How

  • Find the running seid start process via /proc: comm == seid, corroborated with the start subcommand, so it never matches seid-init or the bash wait-loop wrapper. (The sidecar image is distroless; this is done in Go, not via ps.)
  • Drive the existing actions.GracefulStop: SIGTERM → 30s grace → SIGKILL. Works because seid + sidecar share the pod PID namespace (shareProcessNamespace: true) and run as the same UID (65532) — no CAP_KILL.
  • Complete when seid's local RPC serves /status again (the sidecar's own /v0/healthz stays 200, so completion probes seid directly). The kubelet restarts the seid container (restartPolicy: Always) once its process exits.
  • Does not flip the engine ready flag — not a readiness op.
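
The process-matching rule in the first bullet can be sketched as follows. This is an illustration only, not the handler's actual code: the function name isSeidStart is borrowed from the test list in this PR, but its signature and body here are assumptions. The sketch parses a raw /proc/<pid>/cmdline value (NUL-separated argv) and assumes that "start" is the first argument after the binary name.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// isSeidStart reports whether a raw /proc/<pid>/cmdline value looks like
// "seid start ...". Hypothetical signature; the real handler may differ.
func isSeidStart(cmdline []byte) bool {
	// /proc cmdline is NUL-separated and usually ends with a trailing NUL.
	argv := bytes.Split(bytes.TrimRight(cmdline, "\x00"), []byte{0})
	if len(argv) < 2 {
		return false
	}
	// argv[0] may be an absolute path such as /usr/bin/seid.
	if filepath.Base(string(argv[0])) != "seid" {
		return false
	}
	// Assumption: the subcommand is the first argument.
	return string(argv[1]) == "start"
}

func main() {
	samples := []string{
		"seid\x00start\x00--home\x00/sei\x00",
		"/usr/bin/seid\x00start\x00",
		"seid-init\x00start\x00",
		"bash\x00-c\x00until seid start; do sleep 1; done\x00",
		"seid\x00version\x00",
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, isSeidStart([]byte(s)))
	}
}
```

Matching on the parsed argv rather than a substring is what keeps the bash wrapper out: its cmdline contains the text "seid start" but its argv[0] is bash.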

The three OS interactions (find-pid / signal / probe-rpc-up) are injectable for unit tests.
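
One way to make those three interactions injectable is a struct of function fields that tests replace with fakes. The sketch below is hypothetical throughout: osHooks, requestStop and fakeHooks are illustrative names, not identifiers from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// osHooks groups the three OS interactions (find-pid, signal, probe-rpc-up)
// so unit tests can substitute in-memory fakes. Hypothetical type.
type osHooks struct {
	findPID func() (pid int, found bool)
	signal  func(pid int, sig syscall.Signal) error
	rpcUp   func() bool
}

var errNotFound = errors.New("seid start process not found")

// requestStop locates seid and sends SIGTERM through the injected hooks.
func requestStop(h osHooks) (int, error) {
	pid, ok := h.findPID()
	if !ok {
		return 0, errNotFound
	}
	if err := h.signal(pid, syscall.SIGTERM); err != nil {
		return 0, fmt.Errorf("signal pid %d: %w", pid, err)
	}
	return pid, nil
}

// fakeHooks returns hooks backed by fixed values, plus a log of every
// signal that was sent, so a test can assert on it.
func fakeHooks(pid int, found bool) (osHooks, *[]syscall.Signal) {
	sent := &[]syscall.Signal{}
	return osHooks{
		findPID: func() (int, bool) { return pid, found },
		signal: func(_ int, sig syscall.Signal) error {
			*sent = append(*sent, sig)
			return nil
		},
		rpcUp: func() bool { return true },
	}, sent
}

func main() {
	h, sent := fakeHooks(4242, true)
	pid, err := requestStop(h)
	fmt.Println(pid, err, *sent)
}
```

In production the same struct would be filled with functions that read /proc, call the kernel and hit the local RPC; the handler logic itself never touches the OS directly.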

Changes

  • sidecar/engine/types.go: TaskRestartSeid task type
  • sidecar/tasks/restart_seid.go (+ test) — the handler; reuses actions.GracefulStop / SignalPID / PIDAlive
  • serve.go — register the handler
  • sidecar/client/tasks.go: TaskTypeRestartSeid + RestartSeidTask{} (mirrors MarkReadyTask); sidecar/client/client.go: SubmitRestartSeidTask
  • version.json — v0.0.55 → v0.0.56 (cuts the release + container build on merge)

Test

  • Handler: happy path (found → SIGTERM → gone → RPC up), grace-timeout → SIGKILL escalation, seid-not-found → wait-for-up, RPC-never-up → timeout error; isSeidStart cmdline table.
  • Client: RestartSeidTask round-trip.
  • go build ./..., go test ./sidecar/... green.

Consumed by

sei-k8s-controller RestartNode SeiNodeTask kind (supersedes RestartPod) — wired after this releases.

🤖 Generated with Claude Code

Adds a `restart-seid` sidecar task that restarts the co-located seid process
in place — seid re-reads config.toml on the restart WITHOUT bouncing the
sidecar, so the sidecar's in-process readiness flag survives and /v0/healthz
stays 200 (no mark-ready reapproval gap).

The handler finds the running `seid start` process via /proc (corroborating
argv[0]==seid with the "start" subcommand so it never matches seid-init or the
bash wrapper), drives the existing actions.GracefulStop (SIGTERM → 30s grace →
SIGKILL), and completes when seid's local RPC serves /status again. The kubelet
restarts the seid container (restartPolicy: Always) once its main process exits;
this works because seid and the sidecar share the pod PID namespace and run as
the same UID. The handler never starts seid and never flips the engine ready
flag — it is not a readiness operation.

The three OS interactions (find-pid, signal, probe-rpc-up) are injectable for
unit testing. Adds the RestartSeidTask client struct + SubmitRestartSeidTask
helper. Bumps version.json v0.0.55 → v0.0.56.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 7, 2026

PR Summary

High Risk
The task sends SIGTERM to a live validator and can leave signing interrupted for minutes if shutdown or RPC recovery fails; graceful-only policy avoids SIGKILL but increases stuck-process risk.

Overview
Adds a restart-seid sidecar task so operators can recycle the co-located seid start process without restarting the sidecar—intended to reload config.toml (e.g. after peer discovery) while keeping the in-process ready flag and /v0/healthz behavior unchanged.

The new RestartSeider handler locates seid start via /proc (not generic seid / init / bash wrappers), sends SIGTERM, waits up to 90s for exit without SIGKILL, then polls local CometBFT /status for up to 5m. It fails if RPC is up but the process is invisible in /proc, and does not treat “RPC up” as caught-up or voting.
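
The "RPC up" check described above could look roughly like this. It is a sketch under two assumptions: that /status returns the standard CometBFT JSON-RPC envelope with result.sync_info.latest_block_height as a decimal string, and that a response counts as serving when that field parses (as the follow-up commit describes for seidRPCUp). The type and function names are hypothetical, and Sei's RPC may shape the response differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResponse models only the field the probe reads. Assumption: the
// standard CometBFT JSON-RPC envelope.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
		} `json:"sync_info"`
	} `json:"result"`
}

// rpcServing reports whether a /status body carries a parseable
// latest_block_height. It says nothing about being caught up or voting.
func rpcServing(body []byte) bool {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return false
	}
	_, err := strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
	return err == nil
}

func main() {
	bodies := []string{
		`{"jsonrpc":"2.0","id":-1,"result":{"sync_info":{"latest_block_height":"12345"}}}`,
		`{"result":{"sync_info":{"latest_block_height":"abc"}}}`,
		`{}`,
		`not json`,
	}
	for _, b := range bodies {
		fmt.Printf("%-90s serving=%v\n", b, rpcServing([]byte(b)))
	}
}
```

A height of "0" still parses, so a freshly restarted node that has not synced counts as serving; height gating belongs downstream.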

Wiring: TaskRestartSeid in the engine, handler registration in serve.go, client RestartSeidTask + SubmitRestartSeidTask, unit tests, and version v0.0.56.

Reviewed by Cursor Bugbot for commit 967c90e. Bugbot is set up for automated code reviews on this repo. Configure here.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.

Reviewed by Cursor Bugbot for commit a7b70b3. Configure here.

Comment thread sidecar/tasks/restart_seid.go Outdated
Comment thread sidecar/tasks/restart_seid.go
Comment thread sidecar/tasks/restart_seid.go
Address cross-review (k8s + platform + sei-network):

- sei-network BLOCKER: never silently force-kill a validator. Grace 30s→90s
  (the ~3s figure was idle-only; loaded shutdown = WAL flush + PebbleDB/IAVL
  close, possibly mid-compaction). Replace the inherited unconditional-SIGKILL
  GracefulStop with a graceful-only stop: SIGTERM, poll until exit or the grace
  deadline; if seid is still alive at the deadline, FAIL the task and leave seid
  running (a stuck-but-alive validator is safer than a force-kill mid-commit).
  No SIGKILL path remains; no force opt-in added (deferred, YAGNI).
- k8s: close the silent no-op — if the seid process isn't found but its RPC is
  already serving, return a hard error rather than completing a restart that
  didn't happen. Genuinely-down (RPC not serving) still proceeds to wait-for-up.
- platform: fix the inaccurate seidRPCUp comment (it checks latest_block_height
  parses, not node_info.network).
- Document the completion contract: complete = "seid RPC serving again", NOT
  caught-up/voting; gate height downstream (AwaitNodesAtHeight).

Tests: grace-timeout → fail-without-SIGKILL (asserts only SIGTERM sent);
not-found+RPC-down → wait; not-found+RPC-up → hard error; waitForUp
context-cancellation; plus happy-path, timeout, isSeidStart table.
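
A minimal sketch of the graceful-only stop and the not-found decision described in this commit, assuming injected signal and liveness functions. All names (gracefulOnlyStop, planRestart, fakeProc, demo) are hypothetical, and the demo uses a 100ms grace period instead of the real 90s.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"syscall"
	"time"
)

var (
	errStillAlive  = errors.New("seid still alive after grace period; left running")
	errNoOpRestart = errors.New("seid process not found but RPC is serving")
)

// planRestart decides what to do before signalling: stop a found process,
// fail on the silent no-op case, or (genuinely down) only wait for RPC.
func planRestart(found, rpcServing bool) (needStop bool, err error) {
	switch {
	case found:
		return true, nil
	case rpcServing:
		return false, errNoOpRestart
	default:
		return false, nil
	}
}

// gracefulOnlyStop sends SIGTERM and polls until the process exits or the
// grace period ends. It never escalates to SIGKILL.
func gracefulOnlyStop(ctx context.Context, pid int, grace, poll time.Duration,
	signal func(int, syscall.Signal) error, alive func(int) bool) error {
	if err := signal(pid, syscall.SIGTERM); err != nil {
		return fmt.Errorf("SIGTERM pid %d: %w", pid, err)
	}
	deadline := time.NewTimer(grace)
	defer deadline.Stop()
	tick := time.NewTicker(poll)
	defer tick.Stop()
	for {
		if !alive(pid) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-deadline.C:
			// One last check so an exit right at the deadline is not an error.
			if alive(pid) {
				return errStillAlive
			}
			return nil
		case <-tick.C:
		}
	}
}

// fakeProc exits after pollsUntilExit liveness checks (negative: never)
// and records every signal it receives.
type fakeProc struct {
	pollsUntilExit int
	signals        []syscall.Signal
}

func (f *fakeProc) signal(_ int, s syscall.Signal) error {
	f.signals = append(f.signals, s)
	return nil
}

func (f *fakeProc) alive(_ int) bool {
	if f.pollsUntilExit == 0 {
		return false
	}
	if f.pollsUntilExit > 0 {
		f.pollsUntilExit--
	}
	return true
}

// demo runs the stop against a fakeProc with short timings.
func demo(pollsUntilExit int) ([]syscall.Signal, error) {
	p := &fakeProc{pollsUntilExit: pollsUntilExit}
	err := gracefulOnlyStop(context.Background(), 4242,
		100*time.Millisecond, 5*time.Millisecond, p.signal, p.alive)
	return p.signals, err
}

func main() {
	sent, err := demo(3)
	fmt.Println("exits in time:", sent, err)
	sent, err = demo(-1)
	fmt.Println("never exits:", sent, err)
	_, err = planRestart(false, true)
	fmt.Println("not found, RPC up:", err)
}
```

Returning an error at the deadline, rather than escalating, leaves the decision to force-kill with a human or a later task.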

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 7, 2026

Suggested version: v0.0.56

Comparing to: v0.0.55 (diff)

Changes in go.mod file(s):

(empty)

gorelease says:

gorelease: preparing to load packages for github.com/sei-protocol/seictl: looking for missing dependencies: go: -d flag is deprecated. -d=true is a no-op
go: github.com/gogo/protobuf@v1.3.3: reading github.com/gogo/protobuf/go.mod at revision v1.3.3: unknown revision v1.3.3

gocompat says:

Your branch is up to date with 'origin/main'.

Cutting a Release (and modifying non-markdown files)

This PR is modifying both version.json and non-markdown files.
The Release Checker is not able to analyse files that are not checked in to main. This might cause the above analysis to be inaccurate.
Please consider performing all the code changes in a separate PR before cutting the release.

Automatically created GitHub Release

A draft GitHub Release has been created.
It is going to be published when this PR is merged.
You can modify its body to include any release notes you wish to include with the release.

@bdchatham bdchatham merged commit 640a599 into main Jun 7, 2026
4 checks passed
@bdchatham bdchatham deleted the brandon2/plt-438-restart-seid branch June 7, 2026 02:50