21 changes: 21 additions & 0 deletions internal/controller/nodedeployment/envtest/inplace_rollout_test.go
@@ -202,6 +202,27 @@ func TestInPlaceRollout_EndToEnd(t *testing.T) {
return cond.Reason == "RolloutComplete"
}, "rollout plan complete, status.rollout cleared, RolloutInProgress=False/RolloutComplete")

// 6b. No image flip-flop after rollout. The major-upgrade scenario depends
// on the SND template being the single source of child image: once the
// rollout settles, every child's spec.image must stay pinned to the
// template image and never revert. The pre-fix per-child UpdateNodeImage
// path drove children away from the template, and ensureSeiNode would
// re-assert it every reconcile -> oscillation. We assert spec.image
// (the actual invariant) rather than metadata.generation, which takes
// one benign post-rollout bump when the revision podLabel resyncs.
settled := getSND(t, key)
g.Expect(listChildren(t, settled)).To(HaveLen(replicas))
g.Consistently(func() bool {
for _, kid := range listChildren(t, getSND(t, key)) {
if kid.Spec.Image != newImage {
t.Logf("child %s image reverted: %s", kid.Name, kid.Spec.Image)
return false
}
}
return true
}, 3*time.Second, pollInterval).Should(BeTrue(),
"child spec.image stays pinned to template image after rollout (no flip-flop)")

// 7. Deleting the SND removes all children. envtest has no kube-
// controller-manager to perform garbage-collection by owner-ref,
// so the SND controller's finalizer path is what we exercise:
6 changes: 4 additions & 2 deletions runner/rbac.yaml
@@ -39,10 +39,12 @@ rules:
verbs: ["get", "list", "watch"]
# SeiNodeDeployment: create + read for provision-snd. Polls .status.phase
# until Ready and reads .status.endpoints to publish role-scoped TM/REST/
# EVM URLs into workflow-vars.
# EVM URLs into workflow-vars. patch covers the major-upgrade bump-snd-image
# step, which runs `kubectl patch --type=merge` on spec.template.spec.image to roll
# all validators onto the post-upgrade binary in a single write.
- apiGroups: ["sei.io"]
resources: ["seinodedeployments"]
verbs: ["create", "get", "list", "watch"]
verbs: ["create", "get", "list", "watch", "patch"]
- apiGroups: ["sei.io"]
resources: ["seinodedeployments/status"]
verbs: ["get"]
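
One way to confirm that the added `patch` verb took effect is `kubectl auth can-i`, impersonating the runner's ServiceAccount. The sketch below only prints the command rather than running it. The namespace `sei-test` and the ServiceAccount name `runner` are placeholders, not values taken from this change.

```shell
#!/bin/sh
# can_i_patch_cmd is a hypothetical helper: it prints (does not execute) a
# kubectl permission check for the patch verb on seinodedeployments.sei.io.
can_i_patch_cmd() {
  ns=$1   # namespace the Role/RoleBinding lives in (placeholder)
  sa=$2   # ServiceAccount the runner pod uses (placeholder)
  printf 'kubectl auth can-i patch seinodedeployments.sei.io --namespace %s --as system:serviceaccount:%s:%s\n' \
    "$ns" "$ns" "$sa"
}

# Example with placeholder names; run the printed command against a real cluster.
can_i_patch_cmd sei-test runner
```

Against a cluster with the updated rules applied, the printed command should answer `yes`; with the old verb list it should answer `no`.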
27 changes: 13 additions & 14 deletions scenarios/README.md
@@ -16,7 +16,7 @@ the acceptance test for one capability surface.

| File | Mirrors | Purpose |
|---|---|---|
| `major-upgrade.yaml` | `sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml` | 4-validator software-upgrade flow with early-upgrade panic, at-height panic, downgrade-and-catchup, and final convergence. MVP acceptance for the SeiNodeTask CRD. |
| `major-upgrade.yaml` | `sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml` | 4-validator software-upgrade flow: gov proposal, vote, then a single SND template image bump that rolls all validators onto the new binary at the upgrade height. MVP acceptance for the SeiNodeTask CRD. |
| `testnet-deployment.yaml` | n/a | Reference 4-validator `SeiNodeDeployment` the Workflow can target. |

## Where this runs
@@ -145,19 +145,13 @@ Per-step interpretation:

| Step | What success means |
|---|---|
| `compute-target-height` | Created `workflow-vars-${SEI_WORKFLOW_RUN_ID}` ConfigMap with `TARGET_HEIGHT` / `UPGRADE_HEIGHT` / `POST_UPGRADE_HEIGHT` / `PANIC_BOUNDARY`. |
| `compute-target-height` | Created `workflow-vars-${SEI_WORKFLOW_RUN_ID}` ConfigMap with `TARGET_HEIGHT` / `UPGRADE_HEIGHT` / `POST_UPGRADE_HEIGHT`. |
| `submit-upgrade-proposal` | SeiNodeTask `.status.phase=Complete`. proposalId is NOT extracted here (sidecar structured outputs are intentionally empty post-PR 3); `resolve-proposal-id` derives it from the chain. |
| `resolve-proposal-id` | Polled gov REST for a voting-period proposal whose plan name matches `$SEI_UPGRADE_NAME`, merged `PROPOSAL_ID` into the workflow-vars ConfigMap. |
| `vote-yes-all-validators` | All 4 vote tasks Complete. |
| `wait-for-proposal-to-pass` | Proposal observed `PROPOSAL_STATUS_PASSED`. |
| `early-upgrade-node-0` | SeiNode status.currentImage observed equal to post-upgrade image (NOT readiness -- see LLD). |
| `wait-for-target-height-nodes-1-2-3` | Sidecar AwaitNodesAtHeight observed local height >= `TARGET_HEIGHT` on each of nodes 1/2/3. |
| `upgrade-nodes-1-2-3` | Image patch landed on each (same semantics as early-upgrade). |
| `await-post-upgrade-progress-nodes-1-2-3` | Post-upgrade height-advance check: each of nodes 1/2/3 advanced past `POST_UPGRADE_HEIGHT` (= `TARGET_HEIGHT + 10`) via AwaitCondition. This is the liveness assertion. |
| `downgrade-node-0` | Image reverted to pre-upgrade (same semantics as upgrade). |
| `wait-for-target-height-node-0` | Node-0 caught up to `TARGET_HEIGHT - 1` (will panic at `TARGET_HEIGHT` on the pre-upgrade binary). |
| `upgrade-node-0` | Final image patch to post-upgrade on node-0. |
| `await-post-upgrade-progress-node-0` | Post-upgrade height-advance check on node-0 past `POST_UPGRADE_HEIGHT`. Final liveness assertion. |
| `bump-snd-image` | `kubectl patch seinodedeployment` set `spec.template.spec.image` to the post-upgrade build. The SND controller (InPlace) re-asserts the image onto every child and rolls all validators onto the new binary. |
| `await-post-upgrade-progress` | Post-upgrade height-advance check: each of nodes 0/1/2/3 advanced past `POST_UPGRADE_HEIGHT` (= `TARGET_HEIGHT + 10`) via AwaitCondition. This is the liveness assertion -- a node that crosses the boundary has survived the upgrade. |

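For reference, the merge patch that `bump-snd-image` sends can be sketched as below. `build_image_patch` is a hypothetical helper and the image reference is a placeholder. The sketch assumes the image string contains no characters that need JSON escaping.

```shell
#!/bin/sh
# build_image_patch is a hypothetical helper, not part of the runner.
# It prints a JSON merge patch body that sets only spec.template.spec.image,
# so every other template field is left as-is by the API server.
build_image_patch() {
  image=$1
  printf '{"spec":{"template":{"spec":{"image":"%s"}}}}' "$image"
}

# Placeholder image reference (assumption, not a real build).
patch=$(build_image_patch "example.registry/sei:v2")
echo "$patch"
# prints {"spec":{"template":{"spec":{"image":"example.registry/sei:v2"}}}}

# Applying it, under the same assumptions as the bump-snd-image step:
#   kubectl patch seinodedeployment "$SEI_CHAIN_ID" --type=merge --patch "$patch"
```

Building the body in one place keeps the patch limited to a single field, which matches the "single write" intent of the step.
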
### 5. Cleanup

@@ -236,10 +230,15 @@ namespace as the Workflow:
steps to a structured kind also lets us delete the `configmaps` RBAC
verbs (only the runner's outputs ConfigMap-write would remain).

3. **`UpdateNodeImage` completes on image-applied, not Ready.** Required
by this scenario (early-upgrade is expected to CrashLoop), but
surprising for happy-path users. Documented on the kind's CRD
description.
3. **Upgrade rolls the whole fleet, not staggered per-node.** This
Workflow bumps the SND template image once and lets the SND controller
roll all validators together. It does NOT exercise the staggered
early-upgrade-one-node-then-the-rest path the source
`major_upgrade_test.yaml` does. Per-child `UpdateNodeImage` against an
SND-owned node fights the controller's template re-assertion (the child
image flip-flops, the StatefulSet churns, `observe-image` never settles),
so staggered rollout needs a different primitive (e.g. SND-level
partition/maxUnavailable) before it can return.

4. **The runner image is not yet auto-published.** Add a `runner` step to
`.github/workflows/ecr.yml` once this scenario is wired into a CI job.
202 changes: 82 additions & 120 deletions scenarios/major-upgrade.yaml
@@ -10,6 +10,12 @@
# all carry ownerRef to this Workflow CR, so the wrapper's only cleanup duty
# is `kubectl delete workflow`.
#
# Upgrade mechanism: a single bump-snd-image step patches the SND template
# image; the SND controller rolls all validators onto the new binary. The
# SND template is the one source of truth for child image -- per-child
# UpdateNodeImage would fight the controller's template re-assertion and
# churn the StatefulSet so the rollout never settles.
#
# Workflow-vars producers/consumers
# ---------------------------------
# provision-validator-chain seeds CHAIN_ID + VALIDATOR_TM_RPC + VALIDATOR_REST.
@@ -54,10 +60,9 @@ spec:
- resolve-proposal-id
- vote-yes-all-validators
- wait-for-proposal-to-pass
- early-upgrade-node-0
- wait-for-target-height-nodes-1-2-3
- upgrade-nodes-1-2-3
- await-post-upgrade-progress-nodes-1-2-3
- settle-into-halt
- bump-snd-image
- await-post-upgrade-progress
- upload-report

# Every seitask container projects Workflow identity via downward API:
@@ -369,127 +374,94 @@ spec:
- configMapRef:
name: workflow-vars-major-upgrade-$SEI_WORKFLOW_RUN_ID

# Patches node-0 image to the post-upgrade build. UpdateNodeImage
# completes on observed currentImage, NOT readiness -- nodes are
# expected to CrashLoop after early upgrade.
- name: early-upgrade-node-0
# Waits for the chain to reach UPGRADE_HEIGHT and halt before the binary
# swap. The old binary panics ("UPGRADE NEEDED") at UPGRADE_HEIGHT; the new
# binary panics ("BINARY UPDATED BEFORE TRIGGER", sei-cosmos x/upgrade
# abci.go) if it processes ANY block below UPGRADE_HEIGHT. So bump-snd-image
# must land only after every validator has committed UPGRADE_HEIGHT-1 and
# halted. The height can't be polled at that point -- all validators halt
# together and stop serving RPC exactly when the predicate would be true --
# so this is a fixed wait, not an AwaitCondition. UPGRADE_HEIGHT is current
# + 200 blocks measured at compute-target-height, but the proposal flow
# (~60s voting period + tally) burns roughly half of that budget first, so only
# ~100 blocks (~60s at ~600ms blocks) remain once the proposal has passed.
# Over-waiting is free (the chain just sits halted until the swap); the only
# failure mode is waiting too short. The full wall-clock from height
# measurement to swap (~60s voting + 150s here) must exceed 200 x block_time,
# so block time above ~1s would break it -- raise this if a cold chain's
# early blocks run slow.
- name: settle-into-halt
templateType: Task
deadline: 10m
deadline: 8m
task:
container:
name: runner
image: $SEITASK_IMAGE
name: settle-into-halt
image: alpine/k8s:1.31.0
command: ["/bin/sh", "-c"]
args:
- runner
- --template=/templates/update-node-image.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-0
- --var=IMAGE=$SEI_POST_UPGRADE_IMG
- --var=REQUIRE_PHASE=Running
- --timeout=8m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['chaos-mesh.org/workflow']
- name: SEI_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
envFrom:
- configMapRef:
name: workflow-vars-major-upgrade-$SEI_WORKFLOW_RUN_ID
- |
set -eu
echo "waiting 150s for the chain to reach UPGRADE_HEIGHT and halt before swapping the binary"
sleep 150
echo "settle window elapsed; proceeding to bump-snd-image"

# Nodes 1-3 panic-halt at the upgrade height and their RPC dies with them,
# so RPC-polled height-awaits stall indefinitely. Sleep long enough that
# the chain has provably passed the upgrade height (TARGET = CUR+200 ≈
# 120s at ~600ms blocks, plus voting/tally margin).
- name: wait-for-target-height-nodes-1-2-3
# Bumps the SeiNodeDeployment template image to the post-upgrade build in
# a single patch. The SND controller (InPlace strategy) re-asserts the new
# image onto every child SeiNode and drives each node's NodeUpdate plan;
# the validators roll together onto the new binary at the upgrade height.
#
# Patches spec.template.spec.image only -- a JSON merge patch (--type=merge)
# leaves the rest of the template untouched. Per-child UpdateNodeImage is NOT used
# here: the SND controller would re-assert the template image every
# reconcile, flip-flopping the child spec.image and churning the
# StatefulSet so the rollout never settles (observe-image never completes).
# The SND template is the single source of truth for child image.
- name: bump-snd-image
templateType: Task
deadline: 5m
task:
container:
name: sleep
name: bump-snd-image
image: alpine/k8s:1.31.0
command: ["/bin/sh", "-c", "sleep 180"]

# Upstream major_upgrade_test.yaml runs each as its own sequential input,
# so we serialize here too. Avoids stampeding the SeiNode reconciler.
- name: upgrade-nodes-1-2-3
templateType: Serial
deadline: 30m
children:
- upgrade-node-1
- upgrade-node-2
- upgrade-node-3

- name: upgrade-node-1
templateType: Task
deadline: 10m
task:
container:
name: runner
image: $SEITASK_IMAGE
command: ["/bin/sh", "-c"]
args:
- runner
- --template=/templates/update-node-image.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-1
- --var=IMAGE=$SEI_POST_UPGRADE_IMG
- --var=REQUIRE_PHASE=Running
- --timeout=8m
- |
set -eu
kubectl patch seinodedeployment "${SEI_CHAIN_ID}" \
--type=merge \
--patch "{\"spec\":{\"template\":{\"spec\":{\"image\":\"${SEI_POST_UPGRADE_IMG}\"}}}}"
echo "patched seinodedeployment/${SEI_CHAIN_ID} template image to ${SEI_POST_UPGRADE_IMG}"
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['chaos-mesh.org/workflow']
- name: SEI_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
envFrom:
- configMapRef:
name: workflow-vars-major-upgrade-$SEI_WORKFLOW_RUN_ID
- name: SEI_CHAIN_ID
value: "$SEI_CHAIN_ID"
- name: SEI_POST_UPGRADE_IMG
value: "$SEI_POST_UPGRADE_IMG"

- name: upgrade-node-2
templateType: Task
deadline: 10m
task:
container:
name: runner
image: $SEITASK_IMAGE
args:
- runner
- --template=/templates/update-node-image.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-2
- --var=IMAGE=$SEI_POST_UPGRADE_IMG
- --var=REQUIRE_PHASE=Running
- --timeout=8m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['chaos-mesh.org/workflow']
- name: SEI_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
envFrom:
- configMapRef:
name: workflow-vars-major-upgrade-$SEI_WORKFLOW_RUN_ID
# Liveness: each validator advances past TARGET_HEIGHT+10
# (= POST_UPGRADE_HEIGHT) after the SND rolls all nodes onto the new
# binary. AwaitCondition over the height predicate, one per validator.
- name: await-post-upgrade-progress
templateType: Parallel
deadline: 15m
children:
- await-post-upgrade-progress-node-0
- await-post-upgrade-progress-node-1
- await-post-upgrade-progress-node-2
- await-post-upgrade-progress-node-3

- name: upgrade-node-3
- name: await-post-upgrade-progress-node-0
templateType: Task
deadline: 10m
deadline: 12m
task:
container:
name: runner
image: $SEITASK_IMAGE
args:
- runner
- --template=/templates/update-node-image.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-3
- --var=IMAGE=$SEI_POST_UPGRADE_IMG
- --var=REQUIRE_PHASE=Running
- --timeout=8m
- --template=/templates/await-condition.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-0
- --var=TARGET_HEIGHT=$(POST_UPGRADE_HEIGHT)
- --timeout=10m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
@@ -503,19 +475,9 @@ spec:
- configMapRef:
name: workflow-vars-major-upgrade-$SEI_WORKFLOW_RUN_ID

# Liveness: each upgraded node advances past TARGET_HEIGHT+10
# (= POST_UPGRADE_HEIGHT). AwaitCondition over the height predicate.
- name: await-post-upgrade-progress-nodes-1-2-3
templateType: Parallel
deadline: 10m
children:
- await-post-upgrade-progress-node-1
- await-post-upgrade-progress-node-2
- await-post-upgrade-progress-node-3

- name: await-post-upgrade-progress-node-1
templateType: Task
deadline: 8m
deadline: 12m
task:
container:
name: runner
@@ -525,7 +487,7 @@ spec:
- --template=/templates/await-condition.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-1
- --var=TARGET_HEIGHT=$(POST_UPGRADE_HEIGHT)
- --timeout=6m
- --timeout=10m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
@@ -541,7 +503,7 @@ spec:

- name: await-post-upgrade-progress-node-2
templateType: Task
deadline: 8m
deadline: 12m
task:
container:
name: runner
@@ -551,7 +513,7 @@ spec:
- --template=/templates/await-condition.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-2
- --var=TARGET_HEIGHT=$(POST_UPGRADE_HEIGHT)
- --timeout=6m
- --timeout=10m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
@@ -567,7 +529,7 @@ spec:

- name: await-post-upgrade-progress-node-3
templateType: Task
deadline: 8m
deadline: 12m
task:
container:
name: runner
@@ -577,7 +539,7 @@ spec:
- --template=/templates/await-condition.yaml.tmpl
- --var=NODE=$SEI_CHAIN_ID-3
- --var=TARGET_HEIGHT=$(POST_UPGRADE_HEIGHT)
- --timeout=6m
- --timeout=10m
env:
- name: SEI_WORKFLOW_NAME
valueFrom:
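
The timing argument in the `settle-into-halt` comment can be checked with integer arithmetic. This is a minimal sketch, assuming the figures quoted there: 200 blocks of headroom, about 60s of proposal flow, a 150s sleep, and roughly 600ms blocks. The function names are illustrative only and do not exist in the scenario.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking the settle-into-halt budget.

# Blocks left before UPGRADE_HEIGHT after elapsed_s seconds at block_ms per block.
blocks_remaining() {
  total=$1       # blocks between height measurement and UPGRADE_HEIGHT
  elapsed_s=$2   # seconds already spent (proposal flow)
  block_ms=$3    # average block time in milliseconds
  echo $(( total - elapsed_s * 1000 / block_ms ))
}

# Average block time (ms) at which the wall-clock budget exactly equals the
# time needed to produce all blocks; block time must stay below this value.
max_block_time_ms() {
  total=$1       # blocks between height measurement and UPGRADE_HEIGHT
  voting_s=$2    # approximate proposal flow duration in seconds
  settle_s=$3    # fixed settle-into-halt sleep in seconds
  echo $(( (voting_s + settle_s) * 1000 / total ))
}

# Exit status 0 when the budget covers the blocks at the given block time.
budget_ok() {
  [ "$1" -lt "$(max_block_time_ms 200 60 150)" ]
}

blocks_remaining 200 60 600    # prints 100
max_block_time_ms 200 60 150   # prints 1050
```

With these inputs the break-even block time is 1050ms, which agrees with the comment's "block time above ~1s would break it"; raising the sleep raises that ceiling proportionally.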