Add KB: Cluster Recovery After Full Power Outage (closes #441) by CarlRodabaugh · Pull Request #442 · verge-io/docs

CarlRodabaugh · 2026-04-28T01:33:50Z

Summary

Adds a new KB article addressing #441: a consolidated how-to / troubleshooting guide for recovering a VergeOS cluster after an unplanned full power loss.

Changes

New file: docs/knowledge-base/posts/cluster-recovery-after-power-outage.md
Patterned after kb-template.md (frontmatter, Key Points, Prerequisites, Steps, Troubleshooting, Prevention, Additional Resources, Feedback, Document Information)
Cross-links to existing docs: Proper Power Sequence, Proper Shutdown Procedure, Journal Walks, Generating System Diagnostics, Repair Server (ioGuardian), vSAN Diagnostics Guide, System Diagnostics, Sizing & Hardware Requirements

Issue coverage (per #441 "Suggested Content")

✅ Expected cluster behavior after simultaneous full power loss (What to Expect subsection)
✅ Recommended host power-on sequence — including the verbatim Waiting for the vSAN to mount console prompt as the operator's go-signal between Node1 and Node2
✅ Rejoin order and how nodes resync vSAN tiers (auto-reconciliation via Journal Walks once quorum is reached)
✅ Step-by-step recovery procedure (pre-checks → power-on → verification)
✅ How to verify vSAN health and tier sync status post-recovery (Status tile fields, Repairs/Bad Drives interpretation, vSAN Diagnostics CLI equivalents)
✅ Recommendations to prevent data inconsistency or corruption — UPS sizing, graceful shutdown automation (UI / API / VRG CLI), On Power Loss VM settings, ioGuardian repair server, off-site snapshots, fencing-handled-internally explainer
✅ Troubleshooting: node fails to rejoin, vSAN won't mount, stuck/growing repairs, split-brain
✅ When to engage VergeIO support, with explicit sysdiag generation steps and support@verge.io for the air-gapped/manual path

Verification

All UI nav paths verified against existing operational KBs (System → vSAN → Tiers, System → vSAN → Drives, System → Nodes, System → vSAN Diagnostics, System → System Diagnostics)
API payload POST /v4/cluster_actions { action: shutdown } matches the existing Proper Shutdown Procedure doc and API Tables reference
Cluster reconciliation, Journal Walk, and quorum behavior cross-checked against the Journal Walks KB and stuck-repairs internal doc
Sysdiag UI label and parent/root requirement cross-checked against both Generating System Diagnostics and System Diagnostics docs
Version footer aligned with currently supported releases (26.0+)
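
The cluster shutdown call checked in the Verification list can be sketched as a request builder. This is a minimal sketch, not the documented API contract: the `action` value comes from this PR, while `BASE_URL`, the `cluster` and `params` field names, and the empty `params` object are assumptions for illustration. Authentication is left out, and the request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your own system.
BASE_URL = "https://vergeos.example.com/api"


def build_shutdown_request(cluster_id: int) -> urllib.request.Request:
    """Build (but do not send) a POST /v4/cluster_actions shutdown request."""
    payload = {
        "action": "shutdown",   # value quoted in the PR description
        "cluster": cluster_id,  # assumed field name for the cluster id
        "params": {},           # assumed: empty params object
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/cluster_actions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_shutdown_request(cluster_id=1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it lets a reviewer inspect the exact payload before anything touches a live cluster.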

Test plan

Render preview in mkdocs and verify all admonitions, code blocks, and internal links resolve
Confirm the Waiting for the vSAN to mount console string matches what customers see on 26.x (verified against an in-the-wild console screenshot)
Tech review by support / engineering for any internal-only details that shouldn't be public

Closes #441

Adds a consolidated how-to / troubleshooting guide for recovering a VergeOS cluster after an unplanned full power loss. Covers expected behavior, pre-power-on checks, the Node1 → Node2 → remaining-nodes sequence (including the "Waiting for the vSAN to mount" prompt as the operator's go-signal), post-recovery verification, troubleshooting (stuck repairs, split-brain, failed rejoin), prevention (UPS sizing, graceful shutdown automation via API/VRG, ioGuardian repair server, fencing handled internally), and when to engage support with sysdiag generation steps.

Patterned after kb-template.md; cross-links to existing power-sequence, shutdown-procedure, journal-walks, repair-server, vSAN-diagnostics, sizing, and system-diagnostics docs.

- draft: false (frontmatter blocker)
- Promote Prevention, When to Engage Support, Generating a System Diagnostic for Support from h3 to h2
- Drop unverified verbatim "Waiting for the vSAN to mount" console string; describe the wait-state behavior instead
- Drop undefined "Kill Mode" term; describe IPMI hard power-off inline
- Link Maintenance Mode to /product-guide/operations/maintenance-mode/
- Align slug to filename (cluster-recovery-after-power-outage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
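
Two of the fixes in this commit (the `draft: false` frontmatter blocker and the slug aligned to the filename) are mechanical enough to check automatically. The following is a minimal sketch under stated assumptions: it treats frontmatter as flat `key: value` lines between `---` fences instead of using a real YAML parser, the function names are hypothetical, and the sample `title` is illustrative.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> dict:
    """Parse simple `key: value` pairs between the leading `---` fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("'\"")
    return fields


def frontmatter_problems(filename: str, text: str) -> list:
    """Return human-readable problems; an empty list means the checks pass."""
    fields = parse_frontmatter(text)
    problems = []
    if fields.get("draft", "").lower() != "false":
        problems.append("draft is not false")
    if fields.get("slug") != Path(filename).stem:
        problems.append("slug does not match filename")
    return problems


sample = """---
title: Cluster Recovery After Full Power Outage
slug: cluster-recovery-after-power-outage
draft: false
---
Body text.
"""
print(frontmatter_problems("cluster-recovery-after-power-outage.md", sample))
```

A check like this could run in CI next to the mkdocs preview, so a KB post still marked as a draft is caught before review.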

…w fixes

- Replace 'at least 2 nodes' quorum framing with N-1 vSAN nodes (per Jason Yaeger).
- Simplify power-on sequence to 'Node1 first, then power on the rest paced ~1 min apart'; vSAN mounts on its own when N-1 is reached.
- Reframe Bad Drives: count of drives the cluster currently can't see; persistent non-zero is a real fault, not transient walk noise.
- Fix cluster shutdown API payload to include cluster id and params.
- Soften 'no built-in NUT/UPS' claim ('does not currently document').
- Drop unsupported 'often auto-created' claim about ioGuardian.
- Reframe split-brain as a recovery-time network-partition risk, not 'during the outage'.
- Trim Pro Tip; 'stoplights' -> 'status lights'; bump last-updated.
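
The N-1 quorum framing in this commit can be expressed as a small predicate. This is an illustrative sketch only: the function name is hypothetical, and requiring at least one online node for a single-node system is an assumption that the commit message does not address.

```python
def vsan_quorum_reached(total_vsan_nodes: int, online_vsan_nodes: int) -> bool:
    """Return True once N-1 of the N vSAN nodes are online."""
    if total_vsan_nodes < 1:
        raise ValueError("total_vsan_nodes must be at least 1")
    if not 0 <= online_vsan_nodes <= total_vsan_nodes:
        raise ValueError("online_vsan_nodes must be between 0 and the total")
    # Assumption: never report quorum with zero nodes online.
    required = max(1, total_vsan_nodes - 1)
    return online_vsan_nodes >= required


# A four-node cluster reaches quorum when the third node comes online.
for online in range(5):
    print(online, vsan_quorum_reached(4, online))
```

Under this framing, a four-node cluster with only two nodes powered on is still below quorum, which the earlier 'at least 2 nodes' wording would have reported as sufficient.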

…truction guide

bcampbellverg

Multiple changes to the initial new KB article:

Added stronger encouragement to prevent abrupt power loss to a cluster when possible
Changed the power-on sequence instructions: no need to wait 1 minute between each node
Added information about post-power-on verifications
Changed the vague wording "before any destructive action" to the more specific "before rebooting nodes or making significant changes"
Renamed the "Prevention" section to "Prevention/Mitigation", because many of the listed items mitigate issues caused by an ungraceful shutdown rather than prevent the situation

Modified the existing KB article on proper power-on procedures:

Changed the power-on sequence instructions: no need to wait 1 minute between each node
Claude grammar changes

CarlRodabaugh and others added 5 commits April 27, 2026 21:32

need to verify delay between nodes - and update existing power up ins…

92babc6

…truction guide

ready for review/push

85973cf

bcampbellverg reviewed May 27, 2026

CarlRodabaugh wants to merge 5 commits into main from carl/cluster-recovery-after-power-outage
