fix(client): add grpc.ConnectParams with Backoff override to avoid conflicting retry/backoff strategies #175
Merged
djsly reviewed and approved these changes on May 13, 2026.
What this PR does / why we need it:
Add `grpc.ConnectParams` with a `Backoff` override to avoid conflicting retry/backoff strategies.
The gRPC subchannel reconnection backoff uses a default `MaxDelay` of 120 seconds.
After ~8 consecutive connection failures (common during cluster startup when the network
path isn't yet established), the backoff reaches 120s between TCP connection attempts.
Meanwhile, the RPC-level retry interceptor fires every ~2s but only polls the cached
`TRANSIENT_FAILURE` state; it does NOT trigger new connections. This means the client
can miss a 2+ minute window where the path becomes available.

This is the root cause of the `connection reset by peer` / `context deadline exceeded` failures
observed across production and E2E clusters.
`MaxDelay: 3s` caps the subchannel reconnection backoff so the client retries a fresh
TCP+TLS connection every ~3s instead of escalating to 120s. This keeps retry frequency
aligned with the RPC-level retry interval (~2s) and ensures the client catches the narrow
window when the network path becomes available.

`MinConnectTimeout: 5s` reduces the per-attempt connection timeout from the 20s default.
When the LB path isn't ready, SYNs are black-holed (no RST, no response), so each attempt
wastes the full timeout before retrying. 5s yields ~8s retry cycles (a 5s timeout plus a ~3s backoff)
while providing ample headroom for the actual handshake (<1s for healthy intra-Azure TCP + TLS 1.3).
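For readers unfamiliar with the API, the two settings above could be wired up roughly as follows. This is a sketch rather than the diff from this PR: the package name and the helper `connectParamsDialOption` are hypothetical, and starting from `backoff.DefaultConfig` (so that `BaseDelay`, `Multiplier`, and `Jitter` keep their defaults) is an assumption about the intent, not something stated in the description.

```go
package client

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// connectParamsDialOption is a hypothetical helper that returns a dial option
// carrying the connection parameters described in this PR.
func connectParamsDialOption() grpc.DialOption {
	// Assumption: keep gRPC's default BaseDelay, Multiplier, and Jitter,
	// and override only the backoff ceiling.
	cfg := backoff.DefaultConfig
	cfg.MaxDelay = 3 * time.Second

	return grpc.WithConnectParams(grpc.ConnectParams{
		Backoff: cfg,
		// Give up on a black-holed connection attempt after 5s
		// instead of waiting for the 20s default.
		MinConnectTimeout: 5 * time.Second,
	})
}
```

The returned option would be passed alongside the client's other dial options when the connection is created.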
Which issue(s) this PR fixes:
Fixes #