client/v3: fix double resolver update in EtcdManualResolver.Build #21662

Open
BootstrapperSBL wants to merge 1 commit into etcd-io:main from BootstrapperSBL:fix/resolver-single-update-on-build

@BootstrapperSBL

Fixes #21660.

Background

EtcdManualResolver.Build used to invoke the embedded manual.Resolver.Build first and only then push the endpoints plus the round_robin ServiceConfig via a follow-up updateState call. gRPC therefore observed an initial resolver state without the ServiceConfig and a second update shortly after that carried it, which forced gRPC to switch load balancers mid-connection and tear down the in-flight SubChannel. The warning visible in client logs was:

[core] [Channel #2 SubChannel #5] grpc: addrConn.createTransport failed to connect to
{Addr: "127.0.0.1:2379", ...}. Err: connection error: desc = "transport: Error while dialing:
dial tcp 127.0.0.1:2379: operation was canceled"

Every new grpc.ClientConn (every newClient, every per-endpoint maintenance.Status / HashKV / etc.) paid for a throwaway TCP dial and TLS handshake on top of the noise in the logs. The root cause was flagged in the issue by @zyriljamez and the proposed ordering change was endorsed by @ahrtr.

Fix

Seed the initial resolver state via manual.Resolver.InitialState before calling manual.Resolver.Build. The underlying resolver then dispatches the first UpdateState to the ClientConn itself, already carrying the endpoints and the round_robin ServiceConfig as one atomic update, so the trailing updateState call is no longer needed on the Build path.

r.serviceConfig = cc.ParseServiceConfig(...)
r.InitialState(r.buildState())
return r.Resolver.Build(target, cc, opts)

A small buildState helper is extracted so that updateState (used by SetEndpoints) keeps sharing the exact same state construction.
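
The ordering can be illustrated with a small stdlib-only model. This is a sketch under stated assumptions, not the code in this PR: `state`, `recordingConn`, `manualResolver` and `etcdResolver` are hypothetical stand-ins for gRPC's resolver.State, resolver.ClientConn, manual.Resolver and EtcdManualResolver, and the ServiceConfig is reduced to the plain label "round_robin" instead of a parsed config.

```go
package main

import "fmt"

// state is a simplified stand-in for gRPC's resolver.State.
type state struct {
	Endpoints     []string
	ServiceConfig string // empty means no ServiceConfig attached
}

// recordingConn is a hypothetical stand-in for resolver.ClientConn that
// records every update it receives.
type recordingConn struct {
	updates []state
}

func (c *recordingConn) UpdateState(s state) { c.updates = append(c.updates, s) }

// manualResolver models only the behaviour the fix relies on:
// InitialState stores a state and Build delivers it to the conn.
type manualResolver struct {
	cc      *recordingConn
	initial *state
}

func (m *manualResolver) InitialState(s state) { m.initial = &s }

func (m *manualResolver) Build(cc *recordingConn) {
	m.cc = cc
	if m.initial != nil {
		cc.UpdateState(*m.initial)
	}
}

func (m *manualResolver) UpdateState(s state) {
	if m.cc != nil {
		m.cc.UpdateState(s)
	}
}

// etcdResolver models the fixed EtcdManualResolver.
type etcdResolver struct {
	manualResolver
	endpoints     []string
	serviceConfig string
}

// buildState is the single place where endpoints and ServiceConfig are combined.
func (r *etcdResolver) buildState() state {
	eps := append([]string(nil), r.endpoints...)
	return state{Endpoints: eps, ServiceConfig: r.serviceConfig}
}

// Build seeds the initial state first, so the embedded Build sends one update.
func (r *etcdResolver) Build(cc *recordingConn) {
	r.serviceConfig = "round_robin" // simplified label, not a parsed config
	r.InitialState(r.buildState())
	r.manualResolver.Build(cc)
}

// SetEndpoints reuses buildState for later updates.
func (r *etcdResolver) SetEndpoints(eps []string) {
	r.endpoints = eps
	r.UpdateState(r.buildState())
}

func main() {
	cc := &recordingConn{}
	r := &etcdResolver{endpoints: []string{"127.0.0.1:2379"}}
	r.Build(cc)
	fmt.Println(len(cc.updates), cc.updates[0].ServiceConfig != "") // prints "1 true"
	r.SetEndpoints([]string{"127.0.0.1:2379", "127.0.0.1:22379"})
	fmt.Println(len(cc.updates), len(cc.updates[1].Endpoints)) // prints "2 2"
}
```

In this model, Build produces exactly one recorded update that already carries the config label, and a later SetEndpoints adds a second update built by the same helper.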

Before / after

  • Before: two resolver updates per Build (an initial state without the ServiceConfig, then a second one carrying it); gRPC balancer switch; addrConn.createTransport ... operation was canceled warnings; one wasted TCP dial + TLS handshake per client.
  • After: one resolver update per Build, already carrying endpoints and ServiceConfig; no balancer switch; no warnings; no wasted dial.
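
The before/after difference can be made concrete with a toy model of the channel side. This is a rough sketch, not gRPC's implementation: `update`, `channel` and `replay` are hypothetical names, the model assumes the first update already lists the endpoint (consistent with the warning showing a canceled dial to it), and it assumes dials are still in flight when the next update arrives. The only gRPC fact it relies on is that pick_first is the default policy when no ServiceConfig selects one.

```go
package main

import "fmt"

// update is a simplified resolver update; Policy "" means no ServiceConfig.
type update struct {
	Addrs  []string
	Policy string
}

// channel is a toy model of a client channel that counts dials,
// canceled dials and balancer switches.
type channel struct {
	policy   string
	inFlight map[string]bool // addresses with a dial assumed to be in progress
	dials    int
	canceled int
	switches int
}

func newChannel() *channel { return &channel{inFlight: map[string]bool{}} }

func (c *channel) UpdateState(u update) {
	want := u.Policy
	if want == "" {
		want = "pick_first" // gRPC's default when no policy is configured
	}
	if c.policy != "" && c.policy != want {
		// Balancer switch: the old balancer's in-flight dials are canceled.
		c.switches++
		c.canceled += len(c.inFlight)
		c.inFlight = map[string]bool{}
	}
	c.policy = want
	for _, a := range u.Addrs {
		if !c.inFlight[a] {
			c.inFlight[a] = true
			c.dials++
		}
	}
}

// replay feeds a sequence of updates to a fresh channel and returns the counters.
func replay(updates ...update) (dials, canceled, switches int) {
	c := newChannel()
	for _, u := range updates {
		c.UpdateState(u)
	}
	return c.dials, c.canceled, c.switches
}

func main() {
	addrs := []string{"127.0.0.1:2379"}

	// Before: one update without a policy, then one with round_robin.
	d, x, s := replay(update{Addrs: addrs}, update{Addrs: addrs, Policy: "round_robin"})
	fmt.Println("before:", d, x, s) // prints "before: 2 1 1"

	// After: a single update that already selects round_robin.
	d, x, s = replay(update{Addrs: addrs, Policy: "round_robin"})
	fmt.Println("after:", d, x, s) // prints "after: 1 0 0"
}
```

Under these assumptions the two-update sequence costs one extra dial and one canceled dial, while the single-update sequence dials once and never switches balancers.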

Tests

Added client/v3/internal/resolver/resolver_test.go with:

  • TestBuildSendsSingleUpdateWithServiceConfig — asserts Build forwards exactly one UpdateState call and that call already carries both the endpoints and the ServiceConfig.
  • TestSetEndpointsAfterBuild — asserts SetEndpoints keeps propagating updates through the shared buildState path.
$ go test ./client/v3/internal/resolver/... -race
ok  	go.etcd.io/etcd/client/v3/internal/resolver	1.706s
$ golangci-lint run ./client/v3/internal/resolver/...
0 issues.
$ go test ./client/v3/...
ok  	go.etcd.io/etcd/client/v3	... (all packages pass)

cc @ahrtr for review — the approach here matches the sequencing change you endorsed on the issue.

EtcdManualResolver.Build called the embedded manual.Resolver.Build first
and then pushed the endpoints and the round_robin ServiceConfig through
a follow-up updateState. gRPC therefore saw an initial resolver state
without the ServiceConfig and then a second update carrying it, which
forced it to switch balancers mid-connection and tear down an in-flight
SubChannel. The resulting "grpc: addrConn.createTransport failed ...
operation was canceled" warnings are noisy, and every new grpc.ClientConn
pays for a throwaway TCP dial and TLS handshake.

Seed the initial state via manual.Resolver.InitialState before calling
manual.Resolver.Build so the first and only UpdateState dispatched to
gRPC already carries both the endpoints and the ServiceConfig. The
endpoint-to-state conversion is factored into a small buildState helper
so SetEndpoints keeps sharing the same code path.

Fixes etcd-io#21660

Signed-off-by: BootstrapperSBL <yvanwww01@gmail.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BootstrapperSBL
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @BootstrapperSBL. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Member

ahrtr commented Apr 25, 2026

/ok-to-test

@k8s-ci-robot

@BootstrapperSBL: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-etcd-verify | 55382b0 | link | true | /test pull-etcd-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@codecov

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.47%. Comparing base (57e50fd) to head (55382b0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
| Files with missing lines | Coverage Δ |
| --- | --- |
| client/v3/internal/resolver/resolver.go | 93.33% <100.00%> (+5.83%) ⬆️ |

... and 149 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21662      +/-   ##
==========================================
+ Coverage   61.81%   68.47%   +6.66%     
==========================================
  Files         418      432      +14     
  Lines       34276    35408    +1132     
==========================================
+ Hits        21188    24247    +3059     
+ Misses      11522     9756    -1766     
+ Partials     1566     1405     -161     

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Development

Successfully merging this pull request may close these issues.

EtcdManualResolver.Build() sends two resolver updates causing unnecessary gRPC balancer switch and wasted connections

3 participants