snapshots: fix pipeline state machine edge cases #7570

amass-jump · 2025-12-09T17:27:27Z

No description provided.

amass-jump · 2025-12-09T17:29:19Z

src/discof/restore/fd_snapdc_tile.c

-      if( FD_UNLIKELY( ctx->is_zstd && ctx->dirty ) ) {
-        FD_LOG_WARNING(( "encountered end-of-file in the middle of a compressed frame" ));
-        ctx->state = FD_SNAPSHOT_STATE_ERROR;
-        fd_stem_publish( stem, 0UL, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL );
-        return;
-      }


This check could never trigger before, because we always reset `dirty` to 0 above. Returning here also deadlocks the pipeline.

amass-jump · 2025-12-09T17:31:44Z

src/discof/restore/fd_snapdc_tile.c

-      FD_TEST( ctx->state==FD_SNAPSHOT_STATE_PROCESSING ||
-               ctx->state==FD_SNAPSHOT_STATE_ERROR );


Tiles can also receive the FAIL control message while in the IDLE state. This is the cause of several of the FD_TEST assertion failures we've seen.

This happens when snapct sends out a DONE and an early tile immediately handles it and goes to IDLE. A later tile may then fail and generate ERROR, which causes snapct to send out a FAIL control message that is looped back through the pipeline and reaches the early tile while it is already IDLE.
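The race described above can be illustrated with a minimal sketch. All names here (`sk_state`, `sk_ctrl`, `sk_tile_t`, `sk_handle_ctrl`) are hypothetical stand-ins, not the Firedancer definitions, and the assumption that FAIL resets a tile to IDLE is made only for illustration. The point is that the FAIL handler must not assert on the prior state, because IDLE is a legitimate state in which to receive it.

```c
#include <assert.h>

/* Hypothetical stand-ins for the FD_SNAPSHOT_STATE_* values; the real
   set of states in the Firedancer sources may differ. */
enum sk_state { SK_STATE_IDLE, SK_STATE_PROCESSING, SK_STATE_ERROR };

/* Hypothetical control messages, modelled on DONE and FAIL. */
enum sk_ctrl { SK_CTRL_DONE, SK_CTRL_FAIL };

typedef struct { enum sk_state state; } sk_tile_t;

/* Handle one control message on one tile. */
static void
sk_handle_ctrl( sk_tile_t * tile, enum sk_ctrl msg ) {
  assert( tile );
  switch( msg ) {
  case SK_CTRL_DONE:
    /* An early tile finishes and goes idle right away. */
    tile->state = SK_STATE_IDLE;
    break;
  case SK_CTRL_FAIL:
    /* Deliberately no assertion that the state is PROCESSING or
       ERROR: a tile that already handled DONE is IDLE when a FAIL
       caused by a later tile arrives.  Assumption: FAIL resets the
       tile to IDLE. */
    tile->state = SK_STATE_IDLE;
    break;
  }
}
```

With a state assertion in the FAIL case, the early tile in this scenario would abort even though nothing is wrong with it.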

amass-jump · 2025-12-09T17:32:53Z

src/discof/restore/fd_snapin_tile.c

 transition_malformed( fd_snapin_tile_t *  ctx,
                      fd_stem_context_t * stem ) {
+  if( FD_UNLIKELY( ctx->state==FD_SNAPSHOT_STATE_ERROR ) ) return;
  ctx->state = FD_SNAPSHOT_STATE_ERROR;
  fd_stem_publish( stem, ctx->out_ct_idx, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL );
 }


Not a bug, but there is no need to generate extra ERROR messages when we are already in that state.

amass-jump · 2025-12-09T17:37:32Z

src/discof/restore/fd_snapls_tile.c

-static void
-after_credit( fd_snapls_tile_t *  ctx,
-              fd_stem_context_t *  stem,
-              int *                opt_poll_in FD_PARAM_UNUSED,
-              int *                charge_busy FD_PARAM_UNUSED ) {
-  if( FD_UNLIKELY( ctx->hash_accum.received_lthashes==ctx->num_hash_tiles && ctx->hash_accum.awaiting_ack ) ) {
-    fd_lthash_sub( &ctx->hash_accum.calculated_lthash, &ctx->running_lthash );
-    if( FD_UNLIKELY( memcmp( &ctx->hash_accum.expected_lthash, &ctx->hash_accum.calculated_lthash, sizeof(fd_lthash_value_t) ) ) ) {


This doesn't need to be asynchronous: receiving all the NEXT/DONE "acks" from the snapla tiles implies that we have already seen the HASH_RESULT message from each of them. So as soon as we get the last ack, we can check the lthash for correctness.
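A minimal sketch of checking on the last ack, under stated assumptions: `acc_t`, `acc_on_hash_result`, and `acc_on_ack` are hypothetical names, the four-element `uint16_t` vector with wrapping addition is only a toy stand-in for an lthash value, and the subtraction of the running lthash seen in the removed code is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ACC_LEN 4 /* toy vector length, not the real lthash size */

typedef struct {
  unsigned num_hash_tiles;  /* number of hashing tiles expected to report */
  unsigned received_hashes; /* HASH_RESULT-style messages seen so far */
  unsigned received_acks;   /* NEXT/DONE-style acks seen so far */
  uint16_t expected[ ACC_LEN ];
  uint16_t calculated[ ACC_LEN ];
} acc_t;

/* Fold one tile's partial hash into the accumulator (wrapping add). */
static void
acc_on_hash_result( acc_t * acc, uint16_t const * partial ) {
  for( unsigned i=0U; i<ACC_LEN; i++ ) {
    acc->calculated[ i ] = (uint16_t)( acc->calculated[ i ] + partial[ i ] );
  }
  acc->received_hashes++;
}

/* Record one ack.  Returns -1 while acks are outstanding, 1 if the
   accumulated hash matches the expected one, 0 if it does not. */
static int
acc_on_ack( acc_t * acc ) {
  acc->received_acks++;
  if( acc->received_acks<acc->num_hash_tiles ) return -1;
  /* Assumption: each tile sends its hash result before its ack, so
     every result has been folded in by the time the last ack lands. */
  assert( acc->received_hashes==acc->num_hash_tiles );
  return !memcmp( acc->expected, acc->calculated, sizeof(acc->expected) );
}
```

Because the comparison happens inside the ack handler, no separate polling callback is needed to notice that all results have arrived.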

github-actions · 2025-12-10T10:29:49Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.073258 s` | `0.074401 s` | `1.560%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `3.141 s` | `3.155 s` | `0.446%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `73.258409 s` | `74.401362 s` | `1.560%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1023.23 GiB` | `1023.23 GiB` | `0.000%` ✅ |

github-actions · 2025-12-10T11:05:18Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.073616 s` | `0.073573 s` | `-0.058%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `3.151 s` | `3.209 s` | `1.841%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `73.615818 s` | `73.573048 s` | `-0.058%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1023.23 GiB` | `1023.23 GiB` | `0.000%` ✅ |

github-actions · 2025-12-17T13:52:20Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.050826 s` | `0.050693 s` | `-0.262%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `1.67 s` | `1.659 s` | `-0.659%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `50.826305 s` | `50.692915 s` | `-0.262%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1005.23 GiB` | `1005.23 GiB` | `0.000%` ✅ |

src/discof/restore/utils/fd_ssctrl.h

src/discof/restore/fd_snapdc_tile.c

cali-jumptrading · 2025-12-19T20:52:24Z

src/discof/restore/fd_snapin_tile.c

      if( FD_UNLIKELY( ctx->state!=FD_SNAPSHOT_STATE_FINISHING ) ) {
        transition_malformed( ctx, stem );
-        return;
+        break;


Why the break here instead of return? Do we still need to forward the control message if we are generating an error?

Yes, every control message needs to be either forwarded down the pipeline immediately or deferred for later (as snapls does). We can't outright drop a control message, or the pipeline will wait forever for that message to be flushed: each control message is generated only once, and we do not generate new control messages until the current one has been flushed.
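The forwarding rule above can be sketched as follows. This is a minimal illustration, not the snapin code: `fwd_tile_t`, `fwd_handle_ctrl`, and the `FWD_*` constants are hypothetical, and the `forwarded_cnt` counter stands in for publishing the control message to the next tile.

```c
#include <assert.h>

/* Hypothetical states and control message, modelled on the discussion. */
enum fwd_state { FWD_IDLE, FWD_PROCESSING, FWD_FINISHING, FWD_ERROR };
enum fwd_ctrl  { FWD_CTRL_DONE };

typedef struct {
  enum fwd_state state;
  unsigned long  forwarded_cnt; /* stand-in for messages sent downstream */
} fwd_tile_t;

static void
fwd_handle_ctrl( fwd_tile_t * tile, enum fwd_ctrl msg ) {
  assert( tile );
  switch( msg ) {
  case FWD_CTRL_DONE:
    if( tile->state!=FWD_FINISHING ) {
      /* Malformed sequence: enter the error state, but use break
         rather than return so the forwarding step below still runs. */
      tile->state = FWD_ERROR;
      break;
    }
    tile->state = FWD_IDLE;
    break;
  }
  /* Every control message is forwarded exactly once.  An early return
     above would skip this and leave the pipeline waiting for a message
     that never arrives. */
  tile->forwarded_cnt++;
}
```

Placing the forwarding step after the `switch` makes dropping a message a matter of writing `return` by mistake, which is exactly the pattern the diff replaces with `break`.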

github-actions · 2026-01-06T15:10:41Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.050357 s` | `0.050345 s` | `-0.024%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `1.627 s` | `1.616 s` | `-0.676%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `50.357409 s` | `50.345456 s` | `-0.024%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `993.23 GiB` | `993.23 GiB` | `0.000%` ✅ |

github-actions · 2026-01-06T15:26:00Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.050164 s` | `0.050232 s` | `0.136%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `1.627 s` | `1.641 s` | `0.860%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `50.164372 s` | `50.231918 s` | `0.135%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `993.23 GiB` | `993.23 GiB` | `0.000%` ✅ |

github-actions · 2026-01-06T16:42:07Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.050662 s` | `0.050433 s` | `-0.452%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `1.636 s` | `1.646 s` | `0.611%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `50.66171 s` | `50.433074 s` | `-0.451%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `993.23 GiB` | `993.23 GiB` | `0.000%` ✅ |

amass-jump commented Dec 9, 2025

amass-jump force-pushed the amass/snap-error branch from d435810 to 9ac6b51 Compare December 10, 2025 07:32

amass-jump changed the title ~~snapshots: fix pipeline state machine edge cases~~ [DO NOT MERGE] snapshots: fix pipeline state machine edge cases Dec 10, 2025

amass-jump requested review from cali-jumptrading and ripatel-fd December 10, 2025 10:24

amass-jump self-assigned this Dec 10, 2025

amass-jump marked this pull request as ready for review December 10, 2025 10:25

amass-jump force-pushed the amass/snap-error branch from d9a07ab to 61e85bc Compare December 10, 2025 11:00

amass-jump changed the title ~~[DO NOT MERGE] snapshots: fix pipeline state machine edge cases~~ [DO NOT MERGE] snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing Dec 10, 2025

amass-jump changed the title ~~[DO NOT MERGE] snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing~~ snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing Dec 12, 2025

amass-jump force-pushed the amass/snap-error branch from 61e85bc to 4536805 Compare December 17, 2025 13:44

cali-jumptrading reviewed Dec 19, 2025

amass-jump force-pushed the amass/snap-error branch from 4536805 to c5191c3 Compare January 6, 2026 15:04

amass-jump force-pushed the amass/snap-error branch from c5191c3 to b304768 Compare January 6, 2026 15:15

amass-jump changed the title ~~snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing~~ snapshots: fix pipeline state machine edge cases Jan 6, 2026

snapshots: fix pipeline state machine edge cases

95e973c

amass-jump force-pushed the amass/snap-error branch from b304768 to 95e973c Compare January 6, 2026 16:31

cali-jumptrading approved these changes Jan 6, 2026

amass-jump merged commit e6fab15 into main Jan 6, 2026
13 checks passed

amass-jump deleted the amass/snap-error branch January 6, 2026 17:39


3 participants