DF-23489 multinode: tighten polling criteria for liveness#101
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
6a44457 to
e242962
Compare
| case <-time.After(pollInterval): | ||
| } | ||
| n.metrics.IncrementPolls(ctx, n.name) | ||
| pollCtx, cancel := context.WithTimeout(ctx, pollInterval) |
There was a problem hiding this comment.
Is there a config check that prevents the pollInterval from being 0? If not, in the case where PollSuccessThreshold != 0 && PollInterval == 0 this will enter an infinite failure loop.
There was a problem hiding this comment.
There isn't indeed. Added it to the guard condition.
efcf658 to
fae57ef
Compare
Successful polls now decrement the failure count, don't fully reset it. So that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.
fae57ef to
3536a64
Compare
dimriou
left a comment
There was a problem hiding this comment.
I think the pollFailures change is too focused on throttled RPCs. I'm wondering if we're slowing down an RPC that might have a few transient errors during polling by decreasing the counter. I guess one explanation is that it's better to be slower at detecting healthy RPCs than producing false positives. Let's see how this change performs. Perhaps a middle ground solution could be to decrease the counter by 2 instead of one, in order to be more optimistic.
|
@dhaidashenko @DylanTinianov What do you think? |
33da723 to
0dc41d0
Compare
Description
Having ship-shape RPCs is crucial for keeping the odds of missing a transmit as low as possible, which is itself crucial for SVR. It is our suspicion (and telemetry concurs) that our nodes are too lenient on RPCs with respect to their polling failures and that there are gains to be had in booting those out of the alive pool. In particular, our nodes do not detect grey-failures — when RPCs fail polling intermittently under a certain rate.
This changes the behaviour in both directions between the unreachable and alive states:
Alive, mirroring the other direction and ensuring a node doesn't make it back into alive in the same condition that got it kicked out in the first place. However, instead of following the same "decaying" counter we opted for a stricter flow where a single poll failure resets the node back to square one (dialing). This new phase/step is governed by a new config propertyPollSuccessThresholdthat, defaulting to0matches current behavior and allows for progressive, explicit rollout.Implements DF-23489.
Requires Dependencies
None.
Resolves Dependencies
Waiting for tag: