DF-23489 multinode: tighten polling criteria for liveness by vlfig · Pull Request #101 · smartcontractkit/chainlink-framework

vlfig · 2026-04-13T17:14:20Z

Description

Having ship-shape RPCs is crucial for keeping the odds of missing a transmit as low as possible, which is itself crucial for SVR. It is our suspicion (and telemetry concurs) that our nodes are too lenient on RPCs with respect to their polling failures and that there are gains to be had in booting those out of the alive pool. In particular, our nodes do not detect grey-failures — when RPCs fail polling intermittently under a certain rate.

This changes the behaviour in both directions between the unreachable and alive states:

In the alive loop, instead of resetting the failure counter, successful polls now only decrement it. This way, we keep the same behavior for error rates above 5/6=0.8333, where an RPC gets booted after 5 probes (on average), but we now eventually declare unreachable nodes that sustain error rates between that and 1:1, which we'd previously tolerate;
In the unreachable loop, we add a polling phase to the journey back into Alive, mirroring the other direction and ensuring a node doesn't make it back into alive in the same condition that got it kicked out in the first place. However, instead of following the same "decaying" counter we opted for a stricter flow where a single poll failure resets the node back to square one (dialing). This new phase/step is governed by a new config property PollSuccessThreshold that, defaulting to 0 matches current behavior and allows for progressive, explicit rollout.

Implements DF-23489.

Requires Dependencies

None.

Resolves Dependencies

Waiting for tag:

dimriou · 2026-04-15T14:38:26Z

+		case <-time.After(pollInterval):
+		}
+		n.metrics.IncrementPolls(ctx, n.name)
+		pollCtx, cancel := context.WithTimeout(ctx, pollInterval)


Is there a config check that prevents the pollInterval from being 0? If not, in the case where PollSuccessThreshold != 0 && PollInterval == 0 this will enter an infinite failure loop.

There isn't indeed. Added it to the guard condition.

Successful polls now decrement the failure count, don't fully reset it. So that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.

dimriou

I think the pollFailures change is too focused on throttled RPCs. I'm wondering if we're slowing down an RPC that might have a few transient errors during polling by decreasing the counter. I guess one explanation is that it's better to be slower at detecting healthy RPCs than producing false positives. Let's see how this change performs. Perhaps a middle ground solution could be to decrease the counter by 2 instead of one, in order to be more optimistic.

vlfig · 2026-04-20T10:49:07Z

@dhaidashenko @DylanTinianov What do you think?

This comment was marked as outdated.

Sign in to view

vlfig force-pushed the hysteretic-pollfailures branch 2 times, most recently from 6a44457 to e242962 Compare April 14, 2026 13:07

vlfig changed the base branch from main to rm-racy-assertions April 14, 2026 13:11

vlfig marked this pull request as ready for review April 14, 2026 13:12

vlfig requested a review from a team as a code owner April 14, 2026 13:12

product-security-plaid-production Bot requested a review from Unheilbar April 14, 2026 13:12

Base automatically changed from rm-racy-assertions to main April 15, 2026 09:45

dimriou reviewed Apr 15, 2026

View reviewed changes

vlfig force-pushed the hysteretic-pollfailures branch 2 times, most recently from efcf658 to fae57ef Compare April 17, 2026 13:57

vlfig added 4 commits April 17, 2026 15:01

DF-23489 boyscout: fix Context() deprecation in tests

eb173f5

DF-23489 multinode: make PollFailure hysteretic

3245ce4

Successful polls now decrement the failure count, don't fully reset it. So that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.

DF-23489 unreachable node recovery depends on successful polls

e83ad43

DF-23489 multinode: short-circuit recovery polling on 0 PollInterval

3536a64

vlfig force-pushed the hysteretic-pollfailures branch from fae57ef to 3536a64 Compare April 17, 2026 14:02

This was referenced Apr 17, 2026

DF-23489 Bump chainlink-framework for multinode & metrics smartcontractkit/chainlink-evm#432

Merged

DF-23489 WIP bump multinode, retain behaviour smartcontractkit/chainlink-solana#1533

Closed

dimriou previously approved these changes Apr 20, 2026

View reviewed changes

dhaidashenko reviewed Apr 21, 2026

View reviewed changes

Comment thread multinode/config/config.go Outdated

dhaidashenko previously approved these changes Apr 21, 2026

View reviewed changes

vlfig dismissed stale reviews from dhaidashenko and dimriou via 33da723 April 21, 2026 15:29

vlfig requested review from dhaidashenko and dimriou April 21, 2026 15:44

fixup! DF-23489 unreachable node recovery depends on successful polls

0dc41d0

vlfig force-pushed the hysteretic-pollfailures branch from 33da723 to 0dc41d0 Compare April 21, 2026 15:45

dhaidashenko approved these changes Apr 21, 2026

View reviewed changes

silaslenihan approved these changes Apr 22, 2026

View reviewed changes

Merge branch 'main' into hysteretic-pollfailures

e26952c

vlfig merged commit 5a427a1 into main Apr 23, 2026
24 checks passed

vlfig deleted the hysteretic-pollfailures branch April 23, 2026 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DF-23489 multinode: tighten polling criteria for liveness#101

DF-23489 multinode: tighten polling criteria for liveness#101
vlfig merged 6 commits intomainfrom
hysteretic-pollfailures

vlfig commented Apr 13, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

dimriou Apr 15, 2026

Uh oh!

vlfig Apr 15, 2026

Uh oh!

dimriou left a comment

Uh oh!

vlfig commented Apr 20, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

vlfig commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Requires Dependencies

Resolves Dependencies

Uh oh!

This comment was marked as outdated.

dimriou Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

vlfig Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

dimriou left a comment

Choose a reason for hiding this comment

Uh oh!

vlfig commented Apr 20, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vlfig commented Apr 13, 2026 •

edited

Loading