DST: corrupt-manifest-and-restart op + fail-closed coverage #25

Merged: pbudzik merged 1 commit into main from feat/dst-corruption-ops on May 16, 2026
pbudzik (Owner) commented on May 16, 2026

Summary

  • Adds Op::CorruptLatestManifestAndRestart to the DST state machine. Overwrites the highest-numbered manifest-NNNNNN.json with garbage, then restarts; recovery falls back to generation N-1 per the existing fail-closed semantics. The reference model resyncs from the SUT post-restart (re-derives acked, per-account sums, and closed periods from the actual on-disk state), so subsequent ops in the sequence remain correctly modeled.
  • Adds a dedicated corrupt_only_manifest_generation_is_fatal unit test for the case the state machine intentionally skips: when there's no older generation to fall back to, Recovery must return Err rather than silently start with an empty DB.
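The corrupt-then-recover behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: every function name below is hypothetical, the validity check is a stand-in for real manifest parsing, the no-op guard counts manifest files rather than reading `current_generation`, and walking back through all older generations (rather than only N-1) is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Parse the generation number out of a `manifest-NNNNNN.json` file name.
fn manifest_generation(path: &Path) -> Option<u64> {
    let name = path.file_name()?.to_str()?;
    let digits = name.strip_prefix("manifest-")?.strip_suffix(".json")?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// List every manifest file in `dir`, sorted by ascending generation.
fn list_manifests(dir: &Path) -> io::Result<Vec<(u64, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if let Some(generation) = manifest_generation(&path) {
            found.push((generation, path));
        }
    }
    found.sort();
    Ok(found)
}

/// Overwrite the highest-numbered manifest with bytes that cannot parse.
/// Returns the corrupted generation, or `None` when fewer than two
/// generations exist (no fallback target, so the op does nothing).
fn corrupt_latest_manifest(dir: &Path) -> io::Result<Option<u64>> {
    let manifests = list_manifests(dir)?;
    if manifests.len() < 2 {
        return Ok(None);
    }
    let (generation, path) = &manifests[manifests.len() - 1];
    fs::write(path, b"\x00garbage, not a manifest\xff")?;
    Ok(Some(*generation))
}

/// Stand-in validity check (assumption): valid UTF-8 that looks like a JSON object.
fn manifest_is_valid(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(text) => {
            let trimmed = text.trim();
            trimmed.starts_with('{') && trimmed.ends_with('}')
        }
        Err(_) => false,
    }
}

/// Pick the newest valid generation.
/// No manifests at all: `Ok(None)` (fresh DB).
/// Manifests exist but none is valid: `Err` (fail closed).
fn recover_generation(dir: &Path) -> io::Result<Option<u64>> {
    let manifests = list_manifests(dir)?;
    if manifests.is_empty() {
        return Ok(None);
    }
    for (generation, path) in manifests.iter().rev() {
        if manifest_is_valid(&fs::read(path)?) {
            return Ok(Some(*generation));
        }
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "no readable manifest generation",
    ))
}
```

With two generations on disk, corrupting the latest and then recovering yields generation 1. With a single corrupted generation, recovery returns `Err` instead of reporting a fresh database.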

Why this matters

Manifest corruption is the production failure mode billing operators worry about most: a partial write, disk bitrot, or an fsck-style intervention that leaves the on-disk state inconsistent. Phase A already wired up generation rollback; this PR shows under random op sequences that the rollback keeps the database internally consistent (raw == rollup post-recovery, no panics, no ghost events) and that the truly unrecoverable case fails loudly rather than quietly.
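One way to read the `raw == rollup` invariant is that per-account sums recomputed from the raw events must equal the stored rollup exactly, so a ghost entry on either side is a mismatch. The sketch below encodes that reading. The event shape and both function names are hypothetical, and the real check in the DST harness may compare more than per-account sums.

```rust
use std::collections::BTreeMap;

/// Hypothetical raw event: (account id, amount in minor units).
type RawEvent = (u32, i64);

/// Re-derive per-account sums from the raw events.
fn rollup_from_raw(events: &[RawEvent]) -> BTreeMap<u32, i64> {
    let mut sums = BTreeMap::new();
    for &(account, amount) in events {
        *sums.entry(account).or_insert(0) += amount;
    }
    sums
}

/// The invariant: the stored rollup equals the sums re-derived from raw events.
/// An extra account in either map counts as a mismatch.
fn raw_matches_rollup(events: &[RawEvent], rollup: &BTreeMap<u32, i64>) -> bool {
    rollup_from_raw(events) == *rollup
}
```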

Notes

  • The state machine op is a no-op when current_generation < 2 (no fallback target), so a corruption-restart never produces an unrecoverable state mid-sequence.
  • The fail-closed test had to extract the TempDir guard from the harness before dropping it — otherwise drop(harness) would have deleted db_root before Recovery ran, and the test would pass for the wrong reason (Recovery sees no manifest → fresh DB).
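The TempDir pitfall in the second note can be reproduced with a small stand-in guard. This is a sketch under assumptions: `TempDirGuard`, `Harness`, and `into_db_root` are hypothetical names, and the real harness presumably uses a temp-dir guard type from a crate rather than this hand-rolled one.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Minimal stand-in for a temp-dir guard: deletes the directory on drop.
struct TempDirGuard {
    path: PathBuf,
}

impl TempDirGuard {
    fn new(label: &str) -> io::Result<Self> {
        let path = std::env::temp_dir()
            .join(format!("dst-sketch-{}-{}", std::process::id(), label));
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempDirGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored in this sketch.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Hypothetical harness that owns the DB root (engine state omitted).
struct Harness {
    db_root: TempDirGuard,
}

impl Harness {
    fn new(label: &str) -> io::Result<Self> {
        Ok(Self { db_root: TempDirGuard::new(label)? })
    }

    /// Consume the harness but hand the directory guard to the caller,
    /// so the directory outlives the harness.
    fn into_db_root(self) -> TempDirGuard {
        self.db_root
    }
}
```

Calling `drop(harness)` removes the directory along with any corrupted manifest in it, so a later recovery would see an empty root. Calling `harness.into_db_root()` instead keeps the files on disk until the returned guard is dropped, which is what the fail-closed test needs.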

Test plan

  • cargo test --test dst — 2 tests pass (state machine + fail-closed)
  • cargo test — full suite passes
  • DST state-machine cases observed to include CorruptLatestManifestAndRestart interleavings (op weight = 1, so most 5–30-op sequences exercise it)
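The weight-based reasoning in the last bullet can be made concrete. If each op in a sequence is drawn independently in proportion to its weight, an op with weight w out of total weight W appears at least once in n draws with probability 1 - (1 - w/W)^n. The total weight across all ops is not stated in this PR, so it is a parameter here, and the function name is hypothetical. The sketch counts selection only; a selected op can still be a no-op when there is no fallback generation.

```rust
/// Probability that an op with `weight` (out of `total_weight`) is selected
/// at least once in `n` independent weighted draws.
fn p_drawn_at_least_once(weight: u32, total_weight: u32, n: u32) -> f64 {
    assert!(total_weight > 0 && weight <= total_weight);
    // Chance that a single draw picks some other op.
    let miss = 1.0 - f64::from(weight) / f64::from(total_weight);
    // Chance that at least one of `n` draws picks this op.
    1.0 - miss.powi(n as i32)
}
```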

🤖 Generated with Claude Code

Adds an Op variant that overwrites the highest-numbered manifest
generation file with garbage and restarts. Recovery falls back to
generation N-1 per the existing fail-closed semantics; the reference
model resyncs from the SUT post-restart so subsequent ops in the
sequence are still modeled correctly. The op is a no-op when only
one generation exists (no fallback target).

A separate `corrupt_only_manifest_generation_is_fatal` unit test
covers the case the state machine deliberately skips: corruption of
the only generation file must fail closed (Recovery returns Err).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pbudzik pbudzik merged commit 2599dd3 into main May 16, 2026
1 check passed