DST: corrupt-manifest-and-restart op + fail-closed coverage by pbudzik · Pull Request #25 · pbudzik/usagedb

pbudzik · 2026-05-16T20:53:18Z

Summary

Adds Op::CorruptLatestManifestAndRestart to the DST state machine. Overwrites the highest-numbered manifest-NNNNNN.json with garbage, then restarts; recovery falls back to generation N-1 per the existing fail-closed semantics. The reference model resyncs from the SUT post-restart (re-derives acked, per-account sums, and closed periods from the actual on-disk state), so subsequent ops in the sequence remain correctly modeled.
Adds a dedicated corrupt_only_manifest_generation_is_fatal unit test for the case the state machine intentionally skips: when there's no older generation to fall back to, Recovery must return Err rather than silently start with an empty DB.
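
The corruption op and the fallback it exercises can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the helper names (`generation_of`, `manifests`, `corrupt_latest_manifest`, `is_valid`, `recover`) are hypothetical, the validity check is a stand-in for real manifest parsing, the no-op guard counts manifest files rather than reading current_generation, and the fallback walks every older generation rather than only N-1.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Parse `manifest-NNNNNN.json` into its generation number (hypothetical helper).
fn generation_of(path: &Path) -> Option<u64> {
    let name = path.file_name()?.to_str()?;
    let digits = name.strip_prefix("manifest-")?.strip_suffix(".json")?;
    digits.parse().ok()
}

/// All manifest files directly under `root`, sorted by ascending generation.
fn manifests(root: &Path) -> io::Result<Vec<(u64, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if let Some(generation) = generation_of(&path) {
            found.push((generation, path));
        }
    }
    found.sort();
    Ok(found)
}

/// Overwrite the newest manifest with garbage.
/// Returns Ok(false) without touching anything when fewer than two
/// generations exist, because there would be no fallback target.
fn corrupt_latest_manifest(root: &Path) -> io::Result<bool> {
    let all = manifests(root)?;
    if all.len() < 2 {
        return Ok(false);
    }
    let (_, newest) = &all[all.len() - 1];
    fs::write(newest, b"\x00not json\xff")?;
    Ok(true)
}

/// Stand-in validity check (assumption): a real manifest would be parsed
/// and verified, not just inspected for braces.
fn is_valid(bytes: &[u8]) -> bool {
    bytes.first() == Some(&b'{') && bytes.last() == Some(&b'}')
}

/// Fail-closed recovery sketch.
/// Ok(None): no manifest at all, so a fresh DB.
/// Ok(Some(g)): newest generation that passes the validity check.
/// Err: manifests exist but none is valid.
fn recover(root: &Path) -> io::Result<Option<u64>> {
    let all = manifests(root)?;
    if all.is_empty() {
        return Ok(None);
    }
    for (generation, path) in all.iter().rev() {
        if is_valid(&fs::read(path)?) {
            return Ok(Some(*generation));
        }
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "no valid manifest generation",
    ))
}
```

Keeping "no manifest" (fresh DB) distinct from "manifests exist but none is valid" (error) is what lets the only-generation case fail closed instead of silently starting empty.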

Why this matters

Manifest corruption is the production failure mode billing operators worry about most: a partial write, disk bitrot, or an fsck-style intervention that leaves the on-disk state inconsistent. Phase A already wired up generation rollback; this PR checks under random op sequences that the rollback keeps the database internally consistent (raw == rollup post-recovery, no panics, no ghost events) and that the truly unrecoverable case fails loudly rather than silently.
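
As a rough illustration of the post-recovery invariants named above, the check below recomputes per-account sums from raw events and compares them with the rollup, then looks for ghost events. Everything here is hypothetical: the `Event` shape, the `rollup_of` and `check_invariants` helpers, and in particular the reading of "ghost event" as an event present after recovery that was never written.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical raw usage event; the real schema is not shown in this PR.
#[derive(Clone, Debug)]
struct Event {
    id: u64,
    account: String,
    amount: u64,
}

/// Recompute per-account sums from the raw events.
fn rollup_of(raw: &[Event]) -> BTreeMap<String, u64> {
    let mut sums = BTreeMap::new();
    for event in raw {
        *sums.entry(event.account.clone()).or_insert(0) += event.amount;
    }
    sums
}

/// Check the two invariants after a restart:
/// 1. the stored rollup equals the sums recomputed from raw events;
/// 2. no recovered event has an id that was never written (assumed
///    meaning of "ghost event").
fn check_invariants(
    raw: &[Event],
    rollup: &BTreeMap<String, u64>,
    ever_written: &BTreeSet<u64>,
) -> Result<(), String> {
    if &rollup_of(raw) != rollup {
        return Err("raw != rollup".to_string());
    }
    if let Some(ghost) = raw.iter().find(|e| !ever_written.contains(&e.id)) {
        return Err(format!("ghost event {}", ghost.id));
    }
    Ok(())
}
```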

Notes

The state machine op is a no-op when current_generation < 2 (no fallback target), so a corruption-restart never produces an unrecoverable state mid-sequence.
The fail-closed test has to extract the TempDir guard from the harness before dropping the harness. Otherwise drop(harness) would delete db_root before Recovery runs, and the test would pass for the wrong reason (Recovery sees no manifest → fresh DB).
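
The drop-order hazard in the second note can be reproduced with a small stand-in. This sketch assumes the guard deletes its directory on drop, as the note describes; `DirGuard`, `Harness`, and `into_guard` are hypothetical names and not the harness API in this repository.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a temp-dir guard: removes the directory on drop.
struct DirGuard {
    path: PathBuf,
}

impl DirGuard {
    fn create(path: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(DirGuard { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for DirGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored in this sketch.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Hypothetical test harness that owns the guard (the DB handle is elided).
struct Harness {
    guard: DirGuard,
}

impl Harness {
    /// Consume the harness but hand the guard back to the caller, so the
    /// directory outlives the harness and recovery can still read it.
    fn into_guard(self) -> DirGuard {
        self.guard
    }
}
```

Dropping a `Harness` directly also drops its guard and removes the directory, which is the situation the test avoids; calling `into_guard()` first keeps the corrupted manifest on disk until the test has run recovery against it.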

Test plan

cargo test --test dst — 2 tests pass (state machine + fail-closed)
cargo test — full suite passes
DST state-machine cases observed to include CorruptLatestManifestAndRestart interleavings (op weight = 1, so most 5–30-op sequences exercise it)

🤖 Generated with Claude Code

Adds an Op variant that overwrites the highest-numbered manifest generation file with garbage and restarts. Recovery falls back to generation N-1 per the existing fail-closed semantics; the reference model resyncs from the SUT post-restart so subsequent ops in the sequence are still modeled correctly. The op is a no-op when only one generation exists (no fallback target). A separate `corrupt_only_manifest_generation_is_fatal` unit test covers the case the state machine deliberately skips: corruption of the only generation file must fail closed (Recovery returns Err).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pbudzik merged commit 2599dd3 into main May 16, 2026
1 check passed

pbudzik merged 1 commit into main from feat/dst-corruption-ops