Skip to content

DF-23489 multinode: tighten polling criteria for liveness#101

Merged
vlfig merged 6 commits intomainfrom
hysteretic-pollfailures
Apr 23, 2026
Merged

DF-23489 multinode: tighten polling criteria for liveness#101
vlfig merged 6 commits intomainfrom
hysteretic-pollfailures

Conversation

@vlfig
Copy link
Copy Markdown
Contributor

@vlfig vlfig commented Apr 13, 2026

Description

Having ship-shape RPCs is crucial for keeping the odds of missing a transmit as low as possible, which is itself crucial for SVR. It is our suspicion (and telemetry concurs) that our nodes are too lenient on RPCs with respect to their polling failures and that there are gains to be had in booting those out of the alive pool. In particular, our nodes do not detect grey-failures — when RPCs fail polling intermittently under a certain rate.

This changes the behaviour in both directions between the unreachable and alive states:

  1. In the alive loop, instead of resetting the failure counter, successful polls now only decrement it. This way, we keep the same behavior for error rates above 5/6=0.8333, where an RPC gets booted after 5 probes (on average), but we now eventually declare unreachable nodes that sustain error rates between that and 1:1, which we'd previously tolerate;
  2. In the unreachable loop, we add a polling phase to the journey back into Alive, mirroring the other direction and ensuring a node doesn't make it back into alive in the same condition that got it kicked out in the first place. However, instead of following the same "decaying" counter we opted for a stricter flow where a single poll failure resets the node back to square one (dialing). This new phase/step is governed by a new config property PollSuccessThreshold that, defaulting to 0 matches current behavior and allows for progressive, explicit rollout.

Implements DF-23489.

Requires Dependencies

None.

Resolves Dependencies

Waiting for tag:

@github-actions

This comment was marked as outdated.

@vlfig vlfig force-pushed the hysteretic-pollfailures branch 2 times, most recently from 6a44457 to e242962 Compare April 14, 2026 13:07
@vlfig vlfig changed the base branch from main to rm-racy-assertions April 14, 2026 13:11
@vlfig vlfig marked this pull request as ready for review April 14, 2026 13:12
@vlfig vlfig requested a review from a team as a code owner April 14, 2026 13:12
Base automatically changed from rm-racy-assertions to main April 15, 2026 09:45
case <-time.After(pollInterval):
}
n.metrics.IncrementPolls(ctx, n.name)
pollCtx, cancel := context.WithTimeout(ctx, pollInterval)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a config check that prevents the pollInterval from being 0? If not, in the case where PollSuccessThreshold != 0 && PollInterval == 0 this will enter an infinite failure loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't indeed. Added it to the guard condition.

@vlfig vlfig force-pushed the hysteretic-pollfailures branch 2 times, most recently from efcf658 to fae57ef Compare April 17, 2026 13:57
vlfig added 4 commits April 17, 2026 15:01
Successful polls now decrement the failure count, don't fully reset it.
So that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.
dimriou
dimriou previously approved these changes Apr 20, 2026
Copy link
Copy Markdown
Contributor

@dimriou dimriou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pollFailures change is too focused on throttled RPCs. I'm wondering if we're slowing down an RPC that might have a few transient errors during polling by decreasing the counter. I guess one explanation is that it's better to be slower at detecting healthy RPCs than producing false positives. Let's see how this change performs. Perhaps a middle ground solution could be to decrease the counter by 2 instead of one, in order to be more optimistic.

@vlfig
Copy link
Copy Markdown
Contributor Author

vlfig commented Apr 20, 2026

@dhaidashenko @DylanTinianov What do you think?

Comment thread multinode/config/config.go Outdated
dhaidashenko
dhaidashenko previously approved these changes Apr 21, 2026
@vlfig vlfig dismissed stale reviews from dhaidashenko and dimriou via 33da723 April 21, 2026 15:29
@vlfig vlfig requested review from dhaidashenko and dimriou April 21, 2026 15:44
@vlfig vlfig force-pushed the hysteretic-pollfailures branch from 33da723 to 0dc41d0 Compare April 21, 2026 15:45
@vlfig vlfig merged commit 5a427a1 into main Apr 23, 2026
24 checks passed
@vlfig vlfig deleted the hysteretic-pollfailures branch April 23, 2026 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants