[Duplicate] Add --dns-cache-ttl to avoid DNS storm at high concurrency#494
[Duplicate] Add --dns-cache-ttl to avoid DNS storm at high concurrency#494kojiwell wants to merge 1 commit into
Conversation
With CGO_ENABLED=0 (warp's release build), Go uses the pure resolver
which does no in-process DNS caching and performs a fresh lookup on
every dial. At high --concurrent against an FQDN host this produces a
burst of DNS queries per connection ("DNS storm").
Add an opt-in --dns-cache-ttl flag backed by a process-wide TTL cache:
the host FQDN is resolved at most once per TTL regardless of concurrency
or the number of clients created by host selection. The FQDN is kept for
TLS SNI, certificate validation and the Host header; only the TCP dial
address is substituted with a cached IP.
Details:
- Lazy re-resolution with single-flight on TTL expiry; on resolution
failure the last successful set is kept so the benchmark is not
interrupted.
- Multiple A/AAAA records are cached and handed out round-robin per dial.
- Literal IP hosts are a no-op; ttl 0 (default) disables caching and
preserves the previous behavior exactly.
- Applied to all transports (default / TLS / kTLS-h2 via DialContext,
kTLS-non-h2 via a DialTLSContext that pins ServerName=FQDN).
- DurationFlag propagates automatically to distributed clients.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds a shared, process-wide DNS cache to warp's CLI client transports, controlled by a new ChangesShared DNS cache and transport integration
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant Client as Client Transport
participant Cache as sharedDNSCache
participant Resolver as net.DefaultResolver
participant Server as Remote Host
Client->>Cache: dialContext(host:port)
alt cache disabled or literal IP
Client->>Server: dial(original address)
else cache enabled
Cache->>Cache: lookup(host)
alt entry fresh
Cache-->>Client: cached IP (round-robin)
else entry expired
Cache->>Resolver: resolveHostIPs(host)
Resolver-->>Cache: resolved IPs or error
alt resolution succeeds
Cache-->>Client: new cached IP
else resolution fails
Cache-->>Client: stale IP (if available)
end
end
Client->>Server: dial(cached IP, pinned SNI)
end
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cli/client_transport.go`:
- Around line 36-53: The DNS cache TTL is currently captured once in getDNSCache
via dnsCacheOnce and sharedDNSCache, so the first cli.Context seen by
client_transport.go locks in the value for all later runs. Thread the
dns-cache-ttl value through the DNS cache creation path explicitly (for example
via the transport/client constructor or a cache factory) instead of reading it
once from ctx.Duration("dns-cache-ttl"), and update any callers that rely on
getDNSCache to pass the TTL they intend to use.
In `@cli/dnscache_test.go`:
- Around line 91-103: The test TestDNSCacheKeepsStaleOnFailure still depends on
real DNS resolution via dnsCache.lookup and net.DefaultResolver.LookupIPAddr,
which can make CI flaky. Make resolution injectable on dnsCache (for example
through a resolver function or interface used by lookup), then update this test
to install a fake resolver that always returns a deterministic error so the
stale-entry behavior is verified without any network dependency.
- Around line 67-89: The TTL and single-flight cache tests are only using
pre-seeded entries, so they never exercise the actual resolver path or verify
call counts. Update the DNS cache tests around dnsCache.lookup/newDNSCache to
use an injectable fake resolver, then assert the resolver is called zero times
for fresh entries, once after expiry, and only once across concurrent cache
misses. Keep the round-robin assertions in TestDNSCacheResolvesOncePerTTL, but
make the setup drive resolution through the real lookup path instead of manually
populating entries.
In `@cli/dnscache.go`:
- Around line 115-128: The cold-cache path in dnscache.go still allows
concurrent callers to bypass single-flight when a host has no cached IPs, so fix
the lookup flow in the resolver logic around dnsEntry and resolveHostIPs to make
cache-miss waiters block on the in-flight resolution instead of falling through.
Introduce a wait channel, condition, or similar synchronization on dnsEntry so
that when e.resolving is already true and len(e.ips) == 0, other goroutines wait
for the first resolver to finish and then reuse its result or error, while
preserving the existing stale-IP fast path.
- Around line 137-144: The stale-result path in getIP is returning a cached IP
after a DNS refresh failure, but e.expires is left expired so every subsequent
dial retries DNS immediately. Update the retry/backoff state in this stale
branch before returning from getIP, using the existing expiry/backoff logic in
the cache entry (e and its stale handling) so serving stale entries temporarily
suppresses repeated resolver calls during outages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 77c1daa2-c033-4ec8-8c7c-27c7711c4bff
📒 Files selected for processing (5)
cli/client_ktls.gocli/client_transport.gocli/dnscache.gocli/dnscache_test.gocli/flags.go
| // sharedDNSCache is the process-wide DNS cache, initialized once from the | ||
| // --dns-cache-ttl flag. It is shared by every transport/client so a single | ||
| // FQDN is resolved at most once per TTL regardless of concurrency or the | ||
| // number of clients created by host selection. | ||
| var ( | ||
| dnsCacheOnce sync.Once | ||
| sharedDNSCache *dnsCache | ||
| ) | ||
|
|
||
| // getDNSCache returns the shared DNS cache, initializing it on first use from | ||
| // the --dns-cache-ttl flag value. | ||
| func getDNSCache(ctx *cli.Context) *dnsCache { | ||
| dnsCacheOnce.Do(func() { | ||
| sharedDNSCache = newDNSCache(ctx.Duration("dns-cache-ttl")) | ||
| }) | ||
| return sharedDNSCache | ||
| } | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Find all call sites of getDNSCache to confirm the ctx passed is the
# top-level process context and not reconstructed per benchmark run/RPC call.
rg -n -B3 -A3 'getDNSCache\(' --type=goRepository: minio/warp
Length of output: 148
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== file list =='
git ls-files '*.go' | sed -n '1,200p'
echo
echo '== getDNSCache references =='
rg -n -B2 -A4 'getDNSCache\(' --glob '*.go' .
echo
echo '== dns-cache-ttl references =='
rg -n -B2 -A4 'dns-cache-ttl' --glob '*.go' --glob '*.md' .Repository: minio/warp
Length of output: 4995
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== outline: cli/client_transport.go =='
ast-grep outline cli/client_transport.go --view expanded
echo
echo '== outline: cli/client_ktls.go =='
ast-grep outline cli/client_ktls.go --view expanded
echo
echo '== outline: cli/client.go =='
ast-grep outline cli/client.go --view expanded
echo
echo '== outline: cli/benchclient.go =='
ast-grep outline cli/benchclient.go --view expandedRepository: minio/warp
Length of output: 3052
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== client_transport.go relevant section =='
sed -n '1,180p' cli/client_transport.go
echo
echo '== client_ktls.go relevant section =='
sed -n '1,140p' cli/client_ktls.go
echo
echo '== client.go relevant section =='
sed -n '1,220p' cli/client.go
echo
echo '== benchclient.go relevant section =='
sed -n '1,220p' cli/benchclient.goRepository: minio/warp
Length of output: 20946
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== getDNSCache callers with surrounding functions =='
rg -n -B8 -A20 'getDNSCache\(' cli/*.go
echo
echo '== functions that build HTTP clients / transports =='
rg -n -B4 -A20 'http.Client|Transport|new.*Client|new.*Transport|http.Transport' cli/*.goRepository: minio/warp
Length of output: 23386
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== direct construction or propagation of *cli.Context =='
rg -n -B3 -A5 'cli\.NewContext|*cli\.Context|Context\)' cli/*.go main.go
echo
echo '== benchmark/run entry points =='
rg -n -B4 -A12 'func (.*) Run|app\.Action|Commands|benchmark' main.go cli/*.goRepository: minio/warp
Length of output: 336
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== cli/cli.go relevant section =='
sed -n '1,240p' cli/cli.go
echo
echo '== cli/run.go relevant section =='
sed -n '1,220p' cli/run.go
echo
echo '== cli/benchmark.go relevant section =='
sed -n '1,240p' cli/benchmark.goRepository: minio/warp
Length of output: 19437
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== transport/client constructors in cli =='
rg -n -B6 -A16 'func .*new.*(Transport|Client)|func .*transport|http\.Transport|http\.Client' cli/*.go
echo
echo '== any setup of dns-cache-ttl beyond flags =='
rg -n -B3 -A6 'dns-cache-ttl' cli/*.go main.goRepository: minio/warp
Length of output: 10055
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== call graph for getDNSCache and client construction =='
python3 - <<'PY'
import re, pathlib, json
paths = [p for p in pathlib.Path('cli').glob('*.go')]
for path in paths:
text = path.read_text()
if 'getDNSCache(' in text or 'newTransport(' in text or 'newClient(' in text or 'newHTTP' in text:
print(f'## {path}')
for m in re.finditer(r'(^.*(getDNSCache\(|newTransport\(|newClient\().*$)', text, re.M):
line = text[:m.start()].count('\n') + 1
print(f'{line}: {m.group(1).strip()}')
PYRepository: minio/warp
Length of output: 588
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -B3 -A6 'NewContext\(|cli\.NewContext\(' --glob '*.go' .Repository: minio/warp
Length of output: 1149
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== serverRequest / benchmark data structures =='
rg -n -B6 -A30 'type serverRequest|type .*Benchmark|Benchmark\.Flags|Flags map' cli/*.go pkg/*.go
echo
echo '== where benchmark flags are populated =='
rg -n -B6 -A20 's\.Benchmark\.Flags|Benchmark\.Flags|Flags:|flags :' cli/*.goRepository: minio/warp
Length of output: 46633
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== top-level command actions and context flow =='
rg -n -B4 -A18 'mainExec|runBench|executeBenchmark|newClient\(|newAdminClient\(|clientTransport\(' cli/*.go
echo
echo '== any runtime mutation of dns-cache-ttl =='
rg -n -B4 -A8 'Set\("dns-cache-ttl"|dns-cache-ttl".*Set|ctx\.Set\(.*dns-cache-ttl|String\("dns-cache-ttl"\)|Duration\("dns-cache-ttl"\)' cli/*.goRepository: minio/warp
Length of output: 29472
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import pathlib, re
for path in pathlib.Path('cli').glob('*.go'):
txt = path.read_text()
if 'dns-cache-ttl' in txt or 'getDNSCache(' in txt or 'newClientTransport(' in txt or 'clientTransport(' in txt:
print(f'## {path}')
for i, line in enumerate(txt.splitlines(), 1):
if 'dns-cache-ttl' in line or 'getDNSCache(' in line or 'newClientTransport(' in line or 'clientTransport(' in line:
print(f'{i}: {line}')
PYRepository: minio/warp
Length of output: 1135
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== inspect server/client request payload types =='
sed -n '1,220p' cli/benchclient.go
echo
echo '== inspect run.go around ctx2 creation =='
sed -n '220,330p' cli/run.goRepository: minio/warp
Length of output: 8106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== excludeFlags / global flags handling =='
rg -n -B6 -A20 'excludeFlags|globalFlags|dns-cache-ttl|combineFlags\(globalFlags' cli/*.go
echo
echo '== client mode benchmark execution path =='
sed -n '1,260p' cli/benchserver.go
echo
echo '== run.go benchmark reconstruction path =='
sed -n '220,280p' cli/run.goRepository: minio/warp
Length of output: 45749
Make dns-cache-ttl a constructor argument, not a one-shot global. getDNSCache is reached from rebuilt cli.Contexts during remote benchmark execution, so the first benchmark request in a long-lived warp-client process fixes the TTL for every later run. If that process can handle multiple jobs with different --dns-cache-ttl values, later values are ignored; thread the TTL through explicitly or document the process-wide contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/client_transport.go` around lines 36 - 53, The DNS cache TTL is currently
captured once in getDNSCache via dnsCacheOnce and sharedDNSCache, so the first
cli.Context seen by client_transport.go locks in the value for all later runs.
Thread the dns-cache-ttl value through the DNS cache creation path explicitly
(for example via the transport/client constructor or a cache factory) instead of
reading it once from ctx.Duration("dns-cache-ttl"), and update any callers that
rely on getDNSCache to pass the TTL they intend to use.
| func TestDNSCacheResolvesOncePerTTL(t *testing.T) { | ||
| c := newDNSCache(time.Hour) | ||
| // Prime with a fresh entry directly to avoid depending on real DNS. | ||
| c.mu.Lock() | ||
| c.entries["host.example"] = &dnsEntry{ | ||
| ips: []string{"192.0.2.1", "192.0.2.2"}, | ||
| expires: time.Now().Add(time.Hour), | ||
| } | ||
| c.mu.Unlock() | ||
|
|
||
| // Many lookups within TTL must not re-resolve and must round-robin. | ||
| seen := map[string]int{} | ||
| for i := 0; i < 100; i++ { | ||
| ip, err := c.lookup(context.Background(), "host.example") | ||
| if err != nil { | ||
| t.Fatal(err) | ||
| } | ||
| seen[ip]++ | ||
| } | ||
| if seen["192.0.2.1"] == 0 || seen["192.0.2.2"] == 0 { | ||
| t.Fatalf("expected round-robin across both IPs, got %v", seen) | ||
| } | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Exercise the resolver path in the TTL and single-flight tests.
Both tests pre-seed fresh entries, so they never verify that resolution happens once per TTL or that concurrent cache misses are single-flighted. With an injectable fake resolver, assert resolver call counts for fresh, expired, and concurrent-miss cases.
Also applies to: 112-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/dnscache_test.go` around lines 67 - 89, The TTL and single-flight cache
tests are only using pre-seeded entries, so they never exercise the actual
resolver path or verify call counts. Update the DNS cache tests around
dnsCache.lookup/newDNSCache to use an injectable fake resolver, then assert the
resolver is called zero times for fresh entries, once after expiry, and only
once across concurrent cache misses. Keep the round-robin assertions in
TestDNSCacheResolvesOncePerTTL, but make the setup drive resolution through the
real lookup path instead of manually populating entries.
| func TestDNSCacheKeepsStaleOnFailure(t *testing.T) { | ||
| c := newDNSCache(time.Millisecond) | ||
| // An already-expired entry with a stale set for a name that will fail to | ||
| // resolve (reserved .invalid TLD). The stale value must still be served | ||
| // rather than propagating the DNS error. | ||
| c.mu.Lock() | ||
| c.entries["nonexistent.invalid"] = &dnsEntry{ | ||
| ips: []string{"203.0.113.5"}, | ||
| expires: time.Now().Add(-time.Hour), | ||
| } | ||
| c.mu.Unlock() | ||
|
|
||
| ip, err := c.lookup(context.Background(), "nonexistent.invalid") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Remove the real DNS dependency from this unit test.
This path calls net.DefaultResolver.LookupIPAddr with context.Background(), so CI can block or flake based on resolver behavior. Make resolution injectable on dnsCache and use a fake resolver that returns a deterministic error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/dnscache_test.go` around lines 91 - 103, The test
TestDNSCacheKeepsStaleOnFailure still depends on real DNS resolution via
dnsCache.lookup and net.DefaultResolver.LookupIPAddr, which can make CI flaky.
Make resolution injectable on dnsCache (for example through a resolver function
or interface used by lookup), then update this test to install a fake resolver
that always returns a deterministic error so the stale-entry behavior is
verified without any network dependency.
| // Expired (or new) but another dial is already re-resolving: serve the | ||
| // stale set if we have one, otherwise fall through to resolve inline. | ||
| if e != nil && len(e.ips) > 0 && e.resolving { | ||
| ip := e.pick() | ||
| c.mu.Unlock() | ||
| return ip, nil | ||
| } | ||
|
|
||
| // We are the resolver for this host. | ||
| if e == nil { | ||
| e = &dnsEntry{} | ||
| c.entries[host] = e | ||
| } | ||
| e.resolving = true |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift
Serialize cold-cache resolutions too.
Line 117 only single-flights refreshes when stale IPs exist. On the first lookup for a host, concurrent goroutines see resolving == true but len(e.ips) == 0, fall through, and each calls resolveHostIPs, recreating the DNS burst this cache is meant to prevent.
Add a wait channel/condition or equivalent so cache-miss waiters block until the in-flight resolver returns, then reuse its success/error instead of starting another lookup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/dnscache.go` around lines 115 - 128, The cold-cache path in dnscache.go
still allows concurrent callers to bypass single-flight when a host has no
cached IPs, so fix the lookup flow in the resolver logic around dnsEntry and
resolveHostIPs to make cache-miss waiters block on the in-flight resolution
instead of falling through. Introduce a wait channel, condition, or similar
synchronization on dnsEntry so that when e.resolving is already true and
len(e.ips) == 0, other goroutines wait for the first resolver to finish and then
reuse its result or error, while preserving the existing stale-IP fast path.
| if err != nil || len(ips) == 0 { | ||
| // Keep serving the last successful set; never stop the benchmark | ||
| // on a transient DNS failure. | ||
| if len(stale) > 0 { | ||
| ip := e.pick() | ||
| c.mu.Unlock() | ||
| return ip, nil | ||
| } |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win
Advance expiry when serving stale after refresh failure.
When re-resolution fails, e.expires stays expired, so the next dial immediately tries DNS again. During resolver outages this can degrade back into one DNS query per connection. Bump the stale entry’s expiry/backoff before returning it.
Proposed fix
if len(stale) > 0 {
+ e.expires = time.Now().Add(c.ttl)
ip := e.pick()
c.mu.Unlock()
return ip, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if err != nil || len(ips) == 0 { | |
| // Keep serving the last successful set; never stop the benchmark | |
| // on a transient DNS failure. | |
| if len(stale) > 0 { | |
| ip := e.pick() | |
| c.mu.Unlock() | |
| return ip, nil | |
| } | |
| if err != nil || len(ips) == 0 { | |
| // Keep serving the last successful set; never stop the benchmark | |
| // on a transient DNS failure. | |
| if len(stale) > 0 { | |
| e.expires = time.Now().Add(c.ttl) | |
| ip := e.pick() | |
| c.mu.Unlock() | |
| return ip, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/dnscache.go` around lines 137 - 144, The stale-result path in getIP is
returning a cached IP after a DNS refresh failure, but e.expires is left expired
so every subsequent dial retries DNS immediately. Update the retry/backoff state
in this stale branch before returning from getIP, using the existing
expiry/backoff logic in the cache entry (e and its stale handling) so serving
stale entries temporarily suppresses repeated resolver calls during outages.
|
Sorry, this is a duplicate of #454. Please ignore. |
With CGO_ENABLED=0 (warp's release build), Go uses the pure resolver which does no in-process DNS caching and performs a fresh lookup on every dial. At high --concurrent against an FQDN host this produces a burst of DNS queries per connection ("DNS storm").
Add an opt-in --dns-cache-ttl flag backed by a process-wide TTL cache: the host FQDN is resolved at most once per TTL regardless of concurrency or the number of clients created by host selection. The FQDN is kept for TLS SNI, certificate validation and the Host header; only the TCP dial address is substituted with a cached IP.
Details:
Summary by CodeRabbit
New Features
--dns-cache-ttlCLI option to control DNS caching duration.Bug Fixes