-
Notifications
You must be signed in to change notification settings - Fork 389
snapshots: fix pipeline state machine edge cases #7570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if( FD_UNLIKELY( ctx->is_zstd && ctx->dirty ) ) { | ||
| FD_LOG_WARNING(( "encountered end-of-file in the middle of a compressed frame" )); | ||
| ctx->state = FD_SNAPSHOT_STATE_ERROR; | ||
| fd_stem_publish( stem, 0UL, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check could never trigger before (we always reset dirty to 0 above), but also returning here deadlocks the pipeline
| FD_TEST( ctx->state==FD_SNAPSHOT_STATE_PROCESSING || | ||
| ctx->state==FD_SNAPSHOT_STATE_ERROR ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also possible for tiles to receive the FAIL control message in the IDLE state. This is the cause of several of the FD_TEST assertions we've seen.
This happens when snapct sends out a DONE, and an early tile immediately handles it and goes to IDLE. But a later tile may fail and generate ERROR, which causes snapct to send out a FAIL control message which is looped back through the pipeline
| transition_malformed( fd_snapin_tile_t * ctx, | ||
| fd_stem_context_t * stem ) { | ||
| if( FD_UNLIKELY( ctx->state==FD_SNAPSHOT_STATE_ERROR ) ) return; | ||
| ctx->state = FD_SNAPSHOT_STATE_ERROR; | ||
| fd_stem_publish( stem, ctx->out_ct_idx, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bug, but no need to generate extra ERROR messages when we are already in that state
| static void | ||
| after_credit( fd_snapls_tile_t * ctx, | ||
| fd_stem_context_t * stem, | ||
| int * opt_poll_in FD_PARAM_UNUSED, | ||
| int * charge_busy FD_PARAM_UNUSED ) { | ||
| if( FD_UNLIKELY( ctx->hash_accum.received_lthashes==ctx->num_hash_tiles && ctx->hash_accum.awaiting_ack ) ) { | ||
| fd_lthash_sub( &ctx->hash_accum.calculated_lthash, &ctx->running_lthash ); | ||
| if( FD_UNLIKELY( memcmp( &ctx->hash_accum.expected_lthash, &ctx->hash_accum.calculated_lthash, sizeof(fd_lthash_value_t) ) ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be asynchronous, because receiving all the NEXT/DONE "acks" from the snapla's implies that we must have already seen the HASH_RESULT messages from each of them. So as soon as we get the last ack we can check the lthash for correctness.
d435810 to
9ac6b51
Compare
Performance Measurements ⏳
|
d9a07ab to
61e85bc
Compare
Performance Measurements ⏳
|
61e85bc to
4536805
Compare
Performance Measurements ⏳
|
| if( FD_UNLIKELY( ctx->state!=FD_SNAPSHOT_STATE_FINISHING ) ) { | ||
| transition_malformed( ctx, stem ); | ||
| return; | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the break here instead of return? Do we still need to forward the control message if we are generating an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, every control message needs to be forwarded down the pipeline immediately, or deferred for later (such as with snapls). But we can't outright drop a control message or the pipeline will be locked on waiting for that message to be flushed forever (each control message is only generated a single time and we do not generate new control messages until it's been flushed).
4536805 to
c5191c3
Compare
Performance Measurements ⏳
|
c5191c3 to
b304768
Compare
Performance Measurements ⏳
|
b304768 to
95e973c
Compare
Performance Measurements ⏳
|
No description provided.