fix(client): add grpc.ConnectParams with Backoff override to avoid conflicting retry/backoff strategies #175

Merged: cameronmeissner merged 5 commits into main from cameissner/retries on May 13, 2026

@cameronmeissner commented May 11, 2026

What this PR does / why we need it:

Adds grpc.ConnectParams with a Backoff override to avoid conflicting retry/backoff strategies.

The gRPC subchannel reconnection backoff uses a default MaxDelay of 120 seconds.
After roughly ten consecutive connection failures (common during cluster startup when the network
path isn't yet established), the backoff approaches 120s between TCP connection attempts.
Meanwhile, the RPC-level retry interceptor fires every ~2s but only polls the cached
TRANSIENT_FAILURE state; it does NOT trigger new connections. This means the client
can sit idle for 2+ minutes after the path becomes available before it attempts a new connection.

This is the root cause of the "connection reset by peer" and "context deadline exceeded" failures
observed across production and E2E clusters.

MaxDelay: 3s — Caps subchannel reconnection backoff so the client retries a fresh
TCP+TLS connection every ~3s instead of escalating to 120s. This keeps retry frequency
aligned with the RPC-level retry interval (~2s) and ensures the client catches the narrow
window when the network path becomes available.

MinConnectTimeout: 5s — Reduces per-attempt connection timeout from the 20s default.
When the LB path isn't ready, SYNs are black-holed (no RST, no response), so each attempt
wastes the full timeout before retrying. 5s yields ~8s retry cycles while providing ample
headroom for the actual handshake (<1s for healthy intra-Azure TCP + TLS 1.3).
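
For readers unfamiliar with the API, the two settings above could be expressed with grpc-go's `grpc.WithConnectParams` roughly as follows. This is a sketch, not the code merged in this PR: the package name is guessed from the file path in the review thread, the helper name `connectParamsOption` is hypothetical, and starting from `backoff.DefaultConfig` (so that BaseDelay, Multiplier, and Jitter keep their defaults) is an assumption.

```go
package bootstrap

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// connectParamsOption is a hypothetical helper that builds the dial option
// described above. Only MaxDelay and MinConnectTimeout come from the PR
// description; everything else is left at the library defaults.
func connectParamsOption() grpc.DialOption {
	cfg := backoff.DefaultConfig   // keep default BaseDelay, Multiplier, Jitter
	cfg.MaxDelay = 3 * time.Second // cap subchannel reconnection backoff

	return grpc.WithConnectParams(grpc.ConnectParams{
		Backoff:           cfg,
		MinConnectTimeout: 5 * time.Second, // per-attempt connection timeout
	})
}
```

The returned option would be passed alongside the client's other dial options when the connection is created. Setting MinConnectTimeout explicitly matters here, since a `grpc.ConnectParams` value supplies both fields together.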

Which issue(s) this PR fixes:

Fixes #

Comment thread client/internal/bootstrap/grpc.go Outdated
@cameronmeissner merged commit 17142ab into main on May 13, 2026
9 of 10 checks passed
@cameronmeissner deleted the cameissner/retries branch on May 13, 2026 23:04