fix(client): add grpc.ConnectParams with Backoff override to avoid conflicting retry/backoff strategies by cameronmeissner · Pull Request #175 · Azure/aks-secure-tls-bootstrap

cameronmeissner · 2026-05-11T23:24:33Z

What this PR does / why we need it:

add grpc.ConnectParams with Backoff override to avoid conflicting retry/backoff strategies

The gRPC subchannel reconnection backoff uses a default MaxDelay of 120 seconds.
After roughly a dozen consecutive connection failures with the default 1s base delay and
1.6x multiplier (common during cluster startup when the network path isn't yet
established), the backoff reaches 120s between TCP connection attempts.
Meanwhile, the RPC-level retry interceptor fires every ~2s but only polls the cached
TRANSIENT_FAILURE state; it does NOT trigger new connections. This means the client
can go 2+ minutes without a fresh connection attempt and miss the window in which the
path becomes available.

This is the root cause of the "connection reset by peer" and "context deadline exceeded"
failures observed across production and E2E clusters.

MaxDelay: 3s — Caps subchannel reconnection backoff so the client retries a fresh
TCP+TLS connection every ~3s instead of escalating to 120s. This keeps retry frequency
aligned with the RPC-level retry interval (~2s) and ensures the client catches the narrow
window when the network path becomes available.

MinConnectTimeout: 5s — Reduces per-attempt connection timeout from the 20s default.
When the LB path isn't ready, SYNs are black-holed (no RST, no response), so each attempt
wastes the full timeout before retrying. A 5s timeout plus the ~3s capped backoff yields
~8s retry cycles while providing ample headroom for the actual handshake (<1s for healthy
intra-Azure TCP + TLS 1.3).

Which issue(s) this PR fixes:

Fixes #

cameronmeissner added 3 commits May 11, 2026 15:56

fix(client): add explicit grpc.ConnectParams to avoid conflicts retry…

33bad70

…/backoff strategies

chore: cleanup

71bef3d

chore: comments

a31046a

cameronmeissner requested review from Devinwong, bcho, djsly, ganeshkumarashok, lilypan26, phealy and zachary-bailey as code owners May 11, 2026 23:24

djsly reviewed May 13, 2026

Comment thread client/internal/bootstrap/grpc.go Outdated

cameronmeissner added 2 commits May 13, 2026 15:12

chore: update MinConnectTimeout

595ba48

chore: changes

148c2de

djsly approved these changes May 13, 2026

cameronmeissner merged commit 17142ab into main May 13, 2026
9 of 10 checks passed

cameronmeissner deleted the cameissner/retries branch May 13, 2026 23:04

cameronmeissner merged 5 commits into main from cameissner/retries

2 participants
