
feat: consensus fixes and block sync#72

Merged
jorgeantonio21 merged 6 commits into main from feat/ja/consensus-fixes-and-block-sync
Feb 19, 2026

@jorgeantonio21
Collaborator

Summary

Fixes multiple interconnected consensus bugs that caused the test_multi_node_happy_path E2E test to be flaky (~70% pass rate → 100% pass rate across 40+ consecutive runs). The
root causes were:

  1. Cascade nullification corrupted chain integrity — intermediate M-notarized views were incorrectly nullified during cascade, breaking SelectParent consistency across nodes
  2. Stale pending state diffs caused InvalidNonce block validation failures — cascade nullification didn't clean up StateDiff entries from intermediate views, leaving stale
    nonce increments in pending state
  3. Block recovery was unreliable — only the view leader responded to recovery requests (single point of failure), and blocks in nullified/non-finalized storage were unreachable
  4. Finalization GC didn't trace the canonical chain — non-canonical fork views were persisted as finalized, and canonical ancestors could be silently dropped

Changes by file

consensus/src/consensus_manager/state_machine.rs

  • Cascade nullification (Algorithm 1, Step 8): Stop nullifying intermediate views between start_view and current_view. Per the paper, intermediate views were already
    processed through normal consensus flow — nullifying them corrupts their M-notarization state and causes SelectParent to return inconsistent results across nodes. Only the current
    view is now nullified.
  • Pending state rollback: After cascade nullification, call rollback_pending_diffs_in_range(start_view, current_view) to remove stale StateDiff entries. All correct nodes
    eventually cascade from the same start_view (2f+1 guarantee), so this converges to consistent pending state.
  • Force nullification: Added force parameter to nullify_view() to distinguish cascade nullification (bypasses has_voted/evidence checks) from normal timeout/Byzantine
    nullification. New create_nullify_for_cascade() on ViewContext supports this.
  • Deferred finalization loop: Added a loop in progress_to_next_view() that retries pending finalizations, capped at MAX_FINALIZATIONS_PER_PASS = 5 per pass. The cap avoids
    starving message processing (each finalization involves storage writes).
  • Current-view finalization guard: Defer finalization if the target is the current view — removing it from non_finalized_views would panic on the next tick().
  • Block recovery request/response handling: Added ShouldRequestBlock and ShouldRequestBlocks event handlers that broadcast BlockRecoveryRequest messages.
  • Shutdown check in inner loop: Added shutdown signal check between message processing iterations for timely exit.
  • ShouldNullifyRange: Same fix as cascade — only nullify current view, roll back pending diffs in range.
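
The pending state rollback described above can be illustrated with a minimal sketch. It is not the PR's implementation: the StateDiff fields, the PendingState container, and the expected_nonce() helper are hypothetical stand-ins chosen only to show why a stale diff from an intermediate view inflates the nonce that block validation expects, and how removing the diffs in [start_view, current_view] restores it.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the PR's `StateDiff`: only a per-account nonce bump.
#[derive(Debug, Clone, PartialEq)]
pub struct StateDiff {
    pub account: u32,
    pub nonce_increment: u64,
}

/// Hypothetical pending-state container, keyed by view number.
#[derive(Debug, Default)]
pub struct PendingState {
    diffs: BTreeMap<u64, StateDiff>,
}

impl PendingState {
    /// Record the diff published for `view` (assumed to happen on M-notarization).
    pub fn publish(&mut self, view: u64, diff: StateDiff) {
        self.diffs.insert(view, diff);
    }

    /// Remove every pending diff for views in the inclusive range
    /// [start_view, end_view]. Returns how many entries were dropped.
    pub fn rollback_pending_diffs_in_range(&mut self, start_view: u64, end_view: u64) -> usize {
        // BTreeMap::range panics on an inverted range, so guard it explicitly.
        if start_view > end_view {
            return 0;
        }
        let stale: Vec<u64> = self
            .diffs
            .range(start_view..=end_view)
            .map(|(view, _)| *view)
            .collect();
        for view in &stale {
            self.diffs.remove(view);
        }
        stale.len()
    }

    /// Nonce a validator would expect next for `account`: the finalized nonce
    /// plus every increment still sitting in pending state.
    pub fn expected_nonce(&self, account: u32, finalized_nonce: u64) -> u64 {
        let pending: u64 = self
            .diffs
            .values()
            .filter(|diff| diff.account == account)
            .map(|diff| diff.nonce_increment)
            .sum();
        finalized_nonce + pending
    }
}
```

In this sketch, if views 3, 4 and 5 each published one increment for the same account and the cascade starts at view 4, rolling back the range [4, 5] leaves only the increment from view 3 in pending state. Without the rollback, a block built on view 3 would be validated against a nonce that is two too high.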

consensus/src/consensus_manager/view_chain.rs

  • Canonical chain tracing: finalize_with_l_notarization now traces parent hashes backwards from the finalized view to build a canonical_views set. Views in the canonical
    chain are persisted as finalized; non-canonical forks are persisted as nullified metadata. Previously, all non-nullified views were assumed to be canonical, leading to incorrect
    persistence.
  • Proactive canonical block recovery: When finalization defers because a canonical ancestor is missing its block, the view is added to pending_canonical_recovery. This is
    drained by tick() to trigger targeted recovery requests.
  • add_recovered_block(): New method that accepts recovered blocks for any view (including nullified ones). Validates block hash against M-notarization. Returns whether
    L-notarization is available.
  • get_block_for_recovery(): New method that searches ALL storage locations — non-finalized chain, FINALIZED_BLOCKS, NULLIFIED_BLOCKS, and NON_FINALIZED_BLOCKS — using
    both view height and block hash.
  • oldest_finalizable_view(): New method that finds the oldest view with L-notarization (n-f votes), a valid block, and a valid parent chain. Parent nullification check relaxed
    — M-notarization + nullification can coexist (n >= 5f+1).
  • progress_after_cascade(): New method for view progression after cascade. Unlike progress_with_nullification, does NOT require an aggregated nullification proof — only local
    nullification (has_nullified).
  • rollback_pending_diffs_in_range(): Removes pending StateDiff entries for views in [start_view, end_view] to prevent InvalidNonce errors.
  • remove_pending_diff(): Single-view variant called during mark_nullified().
  • GC improvements: Retain canonical/M-notarized views missing blocks (for later recovery). Accept locally-nullified views (has_nullified) in intermediate view checks, not just
    aggregated nullification.
  • non_finalized_view_numbers_range(): Compute range from actual HashMap keys instead of arithmetic (handles gaps from retained recovery views).
  • persist_m_notarized_view(): Changed to persist as FINALIZED_BLOCKS (was NON_FINALIZED_BLOCKS) — M-notarized canonical views are committed transitively via the
    descendant's L-notarization.
  • persist_nullified_view_or_metadata(): New method for non-canonical views without nullification artifacts.
  • Vote rejection for nullified views: route_vote() now rejects votes for nullified views early (Lemma 5.3: no L-notarization possible).
  • on_m_notarization(): Added M-notarization guard — only publishes StateDiff to pending state if M-notarization actually exists.
  • Late block M-notarization: add_new_view_block() now calls on_m_notarization() retroactively when a block arrives after votes already crossed the M-notarization threshold.
  • find_parent_view(): Changed visibility from fn to pub(crate) for use in leader proposal gating.
  • Lemma 5.3 guard in finalize_with_l_notarization(): Explicitly rejects finalization of nullified views.
  • Parent block deferral: If parent view exists but block hasn't been recovered yet, defer instead of error.
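
To make the canonical chain tracing concrete, here is a small sketch of walking parent hashes backwards from a finalized view. The ViewRecord type, the Hash alias, and the canonical_views() function are assumptions made for illustration; the actual finalize_with_l_notarization also handles missing blocks, nullified views, and persistence, none of which is modelled here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical 32-byte block hash.
pub type Hash = [u8; 32];

/// Hypothetical minimal view record: only what the backward walk needs.
pub struct ViewRecord {
    pub block_hash: Hash,
    pub parent_hash: Hash,
}

/// Walk parent hashes backwards from `finalized_view` and collect the views on
/// that chain. The walk stops when the parent is not among the non-finalized
/// views (for example because it is already finalized).
pub fn canonical_views(views: &HashMap<u64, ViewRecord>, finalized_view: u64) -> HashSet<u64> {
    // Index block hash -> view number so each parent lookup is O(1).
    let by_hash: HashMap<Hash, u64> = views
        .iter()
        .map(|(view, record)| (record.block_hash, *view))
        .collect();

    let mut canonical = HashSet::new();
    let mut cursor = Some(finalized_view);
    while let Some(view) = cursor {
        let Some(record) = views.get(&view) else { break };
        if !canonical.insert(view) {
            break; // defensive: never revisit a view
        }
        // A parent must live in a strictly earlier view.
        cursor = by_hash
            .get(&record.parent_hash)
            .copied()
            .filter(|parent| *parent < view);
    }
    canonical
}
```

With views 1 -> 2 -> 4 forming one chain and view 3 forking off view 1, tracing from view 4 yields {1, 2, 4}. View 3 is left out of the set, which corresponds to the non-canonical case that the PR persists as nullified metadata rather than as finalized.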

consensus/src/consensus_manager/view_manager.rs

  • Block recovery in tick(): Detects M-notarized views missing blocks and emits ShouldRequestBlock/ShouldRequestBlocks events. Uses a per-view 500 ms cooldown to avoid
    flooding, and caps requests at MAX_BATCH_RECOVERY = 5 per tick.
  • F+1 recovery responders: handle_block_recovery_request() allows the leader + F backup responders (next in round-robin). With at most F faulty nodes, this guarantees at least
    one honest responder. Previously only the leader responded (single point of failure).
  • handle_block_recovery_response(): Adds recovered block, clears cooldown, triggers finalization if L-notarization available or checks for deferred finalization via
    oldest_finalizable_view().
  • Leader proposal gating (Section 6.1, Modification 2): Leader waits for parent M-notarization and intermediate view nullifications before proposing. Reduces the window where
    replicas miss blocks.
  • Canonical recovery propagation: After finalize_view(), moves entries from ViewChain::pending_canonical_recovery to persistent canonical_recovery_pending set. Cleaned up
    each tick when blocks arrive or views are GC'd.
  • mark_nullified(): Now also removes the view's pending StateDiff.
  • rollback_pending_diffs_in_range(): Delegates to ViewChain.
  • progress_after_cascade(): Creates new view context and advances chain without requiring aggregated nullification proof.
  • Genesis cleanup: Removed unused genesis vote creation (create_genesis_vote), genesis M-notarization, and genesis ViewContext at view 0. ViewChain now starts directly at view
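
The F+1 responder rule from the list above can be sketched as a pure function. This assumes replicas are indexed 0..n and that the leader of a view is view % n; the PR only says the backups are "next in round-robin", so the exact leader selection and the is_recovery_responder() name are hypothetical.

```rust
/// Returns true if replica `me` should answer a block recovery request for
/// `view`: the view's leader plus the next `f` replicas in round-robin order.
///
/// Assumptions (not taken from the PR): replicas are indexed 0..n and the
/// leader of a view is `view % n`.
pub fn is_recovery_responder(view: u64, me: usize, n: usize, f: usize) -> bool {
    assert!(n > 0 && me < n, "replica index must be in 0..n");
    let leader = (view % n as u64) as usize;
    // Forward distance from the leader to `me`, wrapping around the ring.
    let offset = (me + n - leader) % n;
    offset <= f
}
```

For n = 6 and f = 1, view 5 has leader 5 and a single backup, replica 0, so exactly f + 1 = 2 replicas respond. Since at most f of them can be faulty, at least one responder is honest, which is the property the PR relies on.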

consensus/src/consensus_manager/events.rs

  • Added ShouldRequestBlock { view, block_hash } and ShouldRequestBlocks { requests } variants.

consensus/src/consensus.rs

  • Added BlockRecoveryRequest { view, block_hash } and BlockRecoveryResponse { view, block } to ConsensusMessage.
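
As a rough illustration of how the two new message variants fit together, the sketch below models a request/response round trip. Only the variant and field names come from the PR; the Hash and Block types, the in-memory stores, and the two helper functions are hypothetical, and the block's hash field is trusted here rather than recomputed from its contents.

```rust
use std::collections::HashMap;

/// Hypothetical 32-byte block hash.
pub type Hash = [u8; 32];

/// Hypothetical block type; the real one carries far more than this.
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub view: u64,
    pub hash: Hash,
    pub payload: Vec<u8>,
}

/// Only the two recovery variants are shown; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ConsensusMessage {
    BlockRecoveryRequest { view: u64, block_hash: Hash },
    BlockRecoveryResponse { view: u64, block: Block },
}

/// Responder side: answer only if we hold a block for that view with the
/// requested hash.
pub fn respond_to_recovery_request(
    store: &HashMap<u64, Block>,
    msg: &ConsensusMessage,
) -> Option<ConsensusMessage> {
    let ConsensusMessage::BlockRecoveryRequest { view, block_hash } = msg else {
        return None;
    };
    let block = store.get(view)?;
    if block.hash != *block_hash {
        return None;
    }
    Some(ConsensusMessage::BlockRecoveryResponse { view: *view, block: block.clone() })
}

/// Requester side: accept a recovered block only if its hash matches the hash
/// the local M-notarization committed to for that view.
pub fn accept_recovered_block(
    m_notarized: &HashMap<u64, Hash>,
    msg: &ConsensusMessage,
) -> Option<Block> {
    let ConsensusMessage::BlockRecoveryResponse { view, block } = msg else {
        return None;
    };
    let expected = m_notarized.get(view)?;
    (block.view == *view && block.hash == *expected).then(|| block.clone())
}
```

Checking the hash on both sides means a responder never serves a block from a different fork at the same view, and a requester never stores a block that its own M-notarization does not vouch for.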

consensus/src/consensus_manager/view_context.rs

  • Added create_nullify_for_cascade() — creates nullify message for cascade without requiring Byzantine evidence or timeout.

p2p/src/service.rs

  • Reordered tokio::select! branches to prioritize message reception over sleep timeout during bootstrap. Prevents delayed peer discovery when both are ready simultaneously.

tests/src/e2e_consensus/scenarios.rs

  • Minor formatting cleanup (no behavioral change).

Test plan

  • cargo build --release — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo test --package consensus — 563/563 passed
  • test_multi_node_happy_path E2E — 20/20 passed (×2 consecutive runs = 40/40)
  • Verified that InvalidNonce block validation errors no longer occur
  • Verified block recovery completes under cascade nullification scenarios

@jorgeantonio21 jorgeantonio21 merged commit 5ff61e3 into main Feb 19, 2026
6 checks passed
@jorgeantonio21 jorgeantonio21 deleted the feat/ja/consensus-fixes-and-block-sync branch February 19, 2026 13:01