Skip to content

[Duplicate] Add --dns-cache-ttl to avoid DNS storm at high concurrency#494

Closed
kojiwell wants to merge 1 commit into
minio:masterfrom
kojiwell:feat/dns-cache-ttl
Closed

[Duplicate] Add --dns-cache-ttl to avoid DNS storm at high concurrency#494
kojiwell wants to merge 1 commit into
minio:masterfrom
kojiwell:feat/dns-cache-ttl

Conversation

@kojiwell

@kojiwell kojiwell commented Jul 2, 2026

Copy link
Copy Markdown

With CGO_ENABLED=0 (warp's release build), Go uses the pure resolver which does no in-process DNS caching and performs a fresh lookup on every dial. At high --concurrent against an FQDN host this produces a burst of DNS queries per connection ("DNS storm").

Add an opt-in --dns-cache-ttl flag backed by a process-wide TTL cache: the host FQDN is resolved at most once per TTL regardless of concurrency or the number of clients created by host selection. The FQDN is kept for TLS SNI, certificate validation and the Host header; only the TCP dial address is substituted with a cached IP.

Details:

  • Lazy re-resolution with single-flight on TTL expiry; on resolution failure the last successful set is kept so the benchmark is not interrupted.
  • Multiple A/AAAA records are cached and handed out round-robin per dial.
  • Literal IP hosts are a no-op; ttl 0 (default) disables caching and preserves the previous behavior exactly.
  • Applied to all transports (default / TLS / kTLS-h2 via DialContext, kTLS-non-h2 via a DialTLSContext that pins ServerName=FQDN).
  • DurationFlag propagates automatically to distributed clients.

Summary by CodeRabbit

  • New Features

    • Added a new --dns-cache-ttl CLI option to control DNS caching duration.
    • Introduced shared DNS caching for outbound connections to reduce repeated lookups under high concurrency.
    • Improved TLS connection handling so cached IP dialing still preserves the original hostname for secure connections.
  • Bug Fixes

    • Avoids unnecessary DNS resolution for literal IP addresses.
    • Preserves previously resolved addresses when a refresh fails, helping connections stay stable.

With CGO_ENABLED=0 (warp's release build), Go uses the pure resolver
which does no in-process DNS caching and performs a fresh lookup on
every dial. At high --concurrent against an FQDN host this produces a
burst of DNS queries per connection ("DNS storm").

Add an opt-in --dns-cache-ttl flag backed by a process-wide TTL cache:
the host FQDN is resolved at most once per TTL regardless of concurrency
or the number of clients created by host selection. The FQDN is kept for
TLS SNI, certificate validation and the Host header; only the TCP dial
address is substituted with a cached IP.

Details:
- Lazy re-resolution with single-flight on TTL expiry; on resolution
  failure the last successful set is kept so the benchmark is not
  interrupted.
- Multiple A/AAAA records are cached and handed out round-robin per dial.
- Literal IP hosts are a no-op; ttl 0 (default) disables caching and
  preserves the previous behavior exactly.
- Applied to all transports (default / TLS / kTLS-h2 via DialContext,
  kTLS-non-h2 via a DialTLSContext that pins ServerName=FQDN).
- DurationFlag propagates automatically to distributed clients.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds a shared, process-wide DNS cache to warp's CLI client transports, controlled by a new --dns-cache-ttl flag. The cache is wired into both standard and KTLS dial paths, resolving hostnames at most once per TTL, with round-robin IP selection, stale-serving on resolution failure, and TLS SNI preserved via cloned configs.

Changes

Shared DNS cache and transport integration

Layer / File(s) Summary
DNS cache core
cli/dnscache.go
Implements dnsCache/dnsEntry types, TTL-based lookup with single-flight re-resolution and stale fallback, round-robin pick, dialContext wrapper, and resolveHostIPs helper.
DNS cache tests
cli/dnscache_test.go
Adds tests for disabled cache passthrough, literal IP bypass, resolve-once-per-TTL round-robin, stale-on-failure serving, and concurrent single-flight lookups.
dns-cache-ttl flag
cli/flags.go
Adds a DurationFlag --dns-cache-ttl to configure caching duration for the target host FQDN, disabled by default (0).
Standard transport wiring
cli/client_transport.go
Adds a lazily-initialized shared DNS cache (sync.Once) via getDNSCache, and wraps DialContext in newClientTransport when caching is enabled and DialTLSContext is unset.
KTLS transport wiring
cli/client_ktls.go
Retrieves the shared DNS cache in clientTransportKTLS, applies cache-aware dialing for HTTP/2 and non-HTTP/2 paths, and adds cachedDialTLS, which pins TLS ServerName and dials via the cached IP.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client Transport
  participant Cache as sharedDNSCache
  participant Resolver as net.DefaultResolver
  participant Server as Remote Host

  Client->>Cache: dialContext(host:port)
  alt cache disabled or literal IP
    Client->>Server: dial(original address)
  else cache enabled
    Cache->>Cache: lookup(host)
    alt entry fresh
      Cache-->>Client: cached IP (round-robin)
    else entry expired
      Cache->>Resolver: resolveHostIPs(host)
      Resolver-->>Cache: resolved IPs or error
      alt resolution succeeds
        Cache-->>Client: new cached IP
      else resolution fails
        Cache-->>Client: stale IP (if available)
      end
    end
    Client->>Server: dial(cached IP, pinned SNI)
  end
Loading

Poem

A hare hops fast, no time to waste,
DNS lookups now cached with taste 🐇
One query per TTL, round we spin,
Stale IPs saved when storms roll in,
SNI pinned tight, the tunnel's set —
Fewer queries, no regret!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding the --dns-cache-ttl flag to reduce DNS storms at high concurrency.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cli/client_transport.go`:
- Around line 36-53: The DNS cache TTL is currently captured once in getDNSCache
via dnsCacheOnce and sharedDNSCache, so the first cli.Context seen by
client_transport.go locks in the value for all later runs. Thread the
dns-cache-ttl value through the DNS cache creation path explicitly (for example
via the transport/client constructor or a cache factory) instead of reading it
once from ctx.Duration("dns-cache-ttl"), and update any callers that rely on
getDNSCache to pass the TTL they intend to use.

In `@cli/dnscache_test.go`:
- Around line 91-103: The test TestDNSCacheKeepsStaleOnFailure still depends on
real DNS resolution via dnsCache.lookup and net.DefaultResolver.LookupIPAddr,
which can make CI flaky. Make resolution injectable on dnsCache (for example
through a resolver function or interface used by lookup), then update this test
to install a fake resolver that always returns a deterministic error so the
stale-entry behavior is verified without any network dependency.
- Around line 67-89: The TTL and single-flight cache tests are only using
pre-seeded entries, so they never exercise the actual resolver path or verify
call counts. Update the DNS cache tests around dnsCache.lookup/newDNSCache to
use an injectable fake resolver, then assert the resolver is called zero times
for fresh entries, once after expiry, and only once across concurrent cache
misses. Keep the round-robin assertions in TestDNSCacheResolvesOncePerTTL, but
make the setup drive resolution through the real lookup path instead of manually
populating entries.

In `@cli/dnscache.go`:
- Around line 115-128: The cold-cache path in dnscache.go still allows
concurrent callers to bypass single-flight when a host has no cached IPs, so fix
the lookup flow in the resolver logic around dnsEntry and resolveHostIPs to make
cache-miss waiters block on the in-flight resolution instead of falling through.
Introduce a wait channel, condition, or similar synchronization on dnsEntry so
that when e.resolving is already true and len(e.ips) == 0, other goroutines wait
for the first resolver to finish and then reuse its result or error, while
preserving the existing stale-IP fast path.
- Around line 137-144: The stale-result path in getIP is returning a cached IP
after a DNS refresh failure, but e.expires is left expired so every subsequent
dial retries DNS immediately. Update the retry/backoff state in this stale
branch before returning from getIP, using the existing expiry/backoff logic in
the cache entry (e and its stale handling) so serving stale entries temporarily
suppresses repeated resolver calls during outages.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 77c1daa2-c033-4ec8-8c7c-27c7711c4bff

📥 Commits

Reviewing files that changed from the base of the PR and between c985a44 and 9489628.

📒 Files selected for processing (5)
  • cli/client_ktls.go
  • cli/client_transport.go
  • cli/dnscache.go
  • cli/dnscache_test.go
  • cli/flags.go

Comment thread cli/client_transport.go
Comment on lines +36 to +53
// sharedDNSCache is the process-wide DNS cache, initialized once from the
// --dns-cache-ttl flag. It is shared by every transport/client so a single
// FQDN is resolved at most once per TTL regardless of concurrency or the
// number of clients created by host selection.
var (
dnsCacheOnce sync.Once
sharedDNSCache *dnsCache
)

// getDNSCache returns the shared DNS cache, initializing it on first use from
// the --dns-cache-ttl flag value.
func getDNSCache(ctx *cli.Context) *dnsCache {
dnsCacheOnce.Do(func() {
sharedDNSCache = newDNSCache(ctx.Duration("dns-cache-ttl"))
})
return sharedDNSCache
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Find all call sites of getDNSCache to confirm the ctx passed is the
# top-level process context and not reconstructed per benchmark run/RPC call.
rg -n -B3 -A3 'getDNSCache\(' --type=go

Repository: minio/warp

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== file list =='
git ls-files '*.go' | sed -n '1,200p'

echo
echo '== getDNSCache references =='
rg -n -B2 -A4 'getDNSCache\(' --glob '*.go' .

echo
echo '== dns-cache-ttl references =='
rg -n -B2 -A4 'dns-cache-ttl' --glob '*.go' --glob '*.md' .

Repository: minio/warp

Length of output: 4995


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== outline: cli/client_transport.go =='
ast-grep outline cli/client_transport.go --view expanded

echo
echo '== outline: cli/client_ktls.go =='
ast-grep outline cli/client_ktls.go --view expanded

echo
echo '== outline: cli/client.go =='
ast-grep outline cli/client.go --view expanded

echo
echo '== outline: cli/benchclient.go =='
ast-grep outline cli/benchclient.go --view expanded

Repository: minio/warp

Length of output: 3052


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== client_transport.go relevant section =='
sed -n '1,180p' cli/client_transport.go

echo
echo '== client_ktls.go relevant section =='
sed -n '1,140p' cli/client_ktls.go

echo
echo '== client.go relevant section =='
sed -n '1,220p' cli/client.go

echo
echo '== benchclient.go relevant section =='
sed -n '1,220p' cli/benchclient.go

Repository: minio/warp

Length of output: 20946


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== getDNSCache callers with surrounding functions =='
rg -n -B8 -A20 'getDNSCache\(' cli/*.go

echo
echo '== functions that build HTTP clients / transports =='
rg -n -B4 -A20 'http.Client|Transport|new.*Client|new.*Transport|http.Transport' cli/*.go

Repository: minio/warp

Length of output: 23386


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== direct construction or propagation of *cli.Context =='
rg -n -B3 -A5 'cli\.NewContext|*cli\.Context|Context\)' cli/*.go main.go

echo
echo '== benchmark/run entry points =='
rg -n -B4 -A12 'func (.*) Run|app\.Action|Commands|benchmark' main.go cli/*.go

Repository: minio/warp

Length of output: 336


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== cli/cli.go relevant section =='
sed -n '1,240p' cli/cli.go

echo
echo '== cli/run.go relevant section =='
sed -n '1,220p' cli/run.go

echo
echo '== cli/benchmark.go relevant section =='
sed -n '1,240p' cli/benchmark.go

Repository: minio/warp

Length of output: 19437


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== transport/client constructors in cli =='
rg -n -B6 -A16 'func .*new.*(Transport|Client)|func .*transport|http\.Transport|http\.Client' cli/*.go

echo
echo '== any setup of dns-cache-ttl beyond flags =='
rg -n -B3 -A6 'dns-cache-ttl' cli/*.go main.go

Repository: minio/warp

Length of output: 10055


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== call graph for getDNSCache and client construction =='
python3 - <<'PY'
import re, pathlib, json
paths = [p for p in pathlib.Path('cli').glob('*.go')]
for path in paths:
    text = path.read_text()
    if 'getDNSCache(' in text or 'newTransport(' in text or 'newClient(' in text or 'newHTTP' in text:
        print(f'## {path}')
        for m in re.finditer(r'(^.*(getDNSCache\(|newTransport\(|newClient\().*$)', text, re.M):
            line = text[:m.start()].count('\n') + 1
            print(f'{line}: {m.group(1).strip()}')
PY

Repository: minio/warp

Length of output: 588


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -B3 -A6 'NewContext\(|cli\.NewContext\(' --glob '*.go' .

Repository: minio/warp

Length of output: 1149


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== serverRequest / benchmark data structures =='
rg -n -B6 -A30 'type serverRequest|type .*Benchmark|Benchmark\.Flags|Flags map' cli/*.go pkg/*.go

echo
echo '== where benchmark flags are populated =='
rg -n -B6 -A20 's\.Benchmark\.Flags|Benchmark\.Flags|Flags:|flags :' cli/*.go

Repository: minio/warp

Length of output: 46633


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== top-level command actions and context flow =='
rg -n -B4 -A18 'mainExec|runBench|executeBenchmark|newClient\(|newAdminClient\(|clientTransport\(' cli/*.go

echo
echo '== any runtime mutation of dns-cache-ttl =='
rg -n -B4 -A8 'Set\("dns-cache-ttl"|dns-cache-ttl".*Set|ctx\.Set\(.*dns-cache-ttl|String\("dns-cache-ttl"\)|Duration\("dns-cache-ttl"\)' cli/*.go

Repository: minio/warp

Length of output: 29472


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import pathlib, re
for path in pathlib.Path('cli').glob('*.go'):
    txt = path.read_text()
    if 'dns-cache-ttl' in txt or 'getDNSCache(' in txt or 'newClientTransport(' in txt or 'clientTransport(' in txt:
        print(f'## {path}')
        for i, line in enumerate(txt.splitlines(), 1):
            if 'dns-cache-ttl' in line or 'getDNSCache(' in line or 'newClientTransport(' in line or 'clientTransport(' in line:
                print(f'{i}: {line}')
PY

Repository: minio/warp

Length of output: 1135


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== inspect server/client request payload types =='
sed -n '1,220p' cli/benchclient.go

echo
echo '== inspect run.go around ctx2 creation =='
sed -n '220,330p' cli/run.go

Repository: minio/warp

Length of output: 8106


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== excludeFlags / global flags handling =='
rg -n -B6 -A20 'excludeFlags|globalFlags|dns-cache-ttl|combineFlags\(globalFlags' cli/*.go

echo
echo '== client mode benchmark execution path =='
sed -n '1,260p' cli/benchserver.go

echo
echo '== run.go benchmark reconstruction path =='
sed -n '220,280p' cli/run.go

Repository: minio/warp

Length of output: 45749


Make dns-cache-ttl a constructor argument, not a one-shot global. getDNSCache is reached from rebuilt cli.Contexts during remote benchmark execution, so the first benchmark request in a long-lived warp-client process fixes the TTL for every later run. If that process can handle multiple jobs with different --dns-cache-ttl values, later values are ignored; thread the TTL through explicitly or document the process-wide contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/client_transport.go` around lines 36 - 53, The DNS cache TTL is currently
captured once in getDNSCache via dnsCacheOnce and sharedDNSCache, so the first
cli.Context seen by client_transport.go locks in the value for all later runs.
Thread the dns-cache-ttl value through the DNS cache creation path explicitly
(for example via the transport/client constructor or a cache factory) instead of
reading it once from ctx.Duration("dns-cache-ttl"), and update any callers that
rely on getDNSCache to pass the TTL they intend to use.

Comment thread cli/dnscache_test.go
Comment on lines +67 to +89
func TestDNSCacheResolvesOncePerTTL(t *testing.T) {
c := newDNSCache(time.Hour)
// Prime with a fresh entry directly to avoid depending on real DNS.
c.mu.Lock()
c.entries["host.example"] = &dnsEntry{
ips: []string{"192.0.2.1", "192.0.2.2"},
expires: time.Now().Add(time.Hour),
}
c.mu.Unlock()

// Many lookups within TTL must not re-resolve and must round-robin.
seen := map[string]int{}
for i := 0; i < 100; i++ {
ip, err := c.lookup(context.Background(), "host.example")
if err != nil {
t.Fatal(err)
}
seen[ip]++
}
if seen["192.0.2.1"] == 0 || seen["192.0.2.2"] == 0 {
t.Fatalf("expected round-robin across both IPs, got %v", seen)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Exercise the resolver path in the TTL and single-flight tests.

Both tests pre-seed fresh entries, so they never verify that resolution happens once per TTL or that concurrent cache misses are single-flighted. With an injectable fake resolver, assert resolver call counts for fresh, expired, and concurrent-miss cases.

Also applies to: 112-132

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/dnscache_test.go` around lines 67 - 89, The TTL and single-flight cache
tests are only using pre-seeded entries, so they never exercise the actual
resolver path or verify call counts. Update the DNS cache tests around
dnsCache.lookup/newDNSCache to use an injectable fake resolver, then assert the
resolver is called zero times for fresh entries, once after expiry, and only
once across concurrent cache misses. Keep the round-robin assertions in
TestDNSCacheResolvesOncePerTTL, but make the setup drive resolution through the
real lookup path instead of manually populating entries.

Comment thread cli/dnscache_test.go
Comment on lines +91 to +103
func TestDNSCacheKeepsStaleOnFailure(t *testing.T) {
c := newDNSCache(time.Millisecond)
// An already-expired entry with a stale set for a name that will fail to
// resolve (reserved .invalid TLD). The stale value must still be served
// rather than propagating the DNS error.
c.mu.Lock()
c.entries["nonexistent.invalid"] = &dnsEntry{
ips: []string{"203.0.113.5"},
expires: time.Now().Add(-time.Hour),
}
c.mu.Unlock()

ip, err := c.lookup(context.Background(), "nonexistent.invalid")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Remove the real DNS dependency from this unit test.

This path calls net.DefaultResolver.LookupIPAddr with context.Background(), so CI can block or flake based on resolver behavior. Make resolution injectable on dnsCache and use a fake resolver that returns a deterministic error.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/dnscache_test.go` around lines 91 - 103, The test
TestDNSCacheKeepsStaleOnFailure still depends on real DNS resolution via
dnsCache.lookup and net.DefaultResolver.LookupIPAddr, which can make CI flaky.
Make resolution injectable on dnsCache (for example through a resolver function
or interface used by lookup), then update this test to install a fake resolver
that always returns a deterministic error so the stale-entry behavior is
verified without any network dependency.

Comment thread cli/dnscache.go
Comment on lines +115 to +128
// Expired (or new) but another dial is already re-resolving: serve the
// stale set if we have one, otherwise fall through to resolve inline.
if e != nil && len(e.ips) > 0 && e.resolving {
ip := e.pick()
c.mu.Unlock()
return ip, nil
}

// We are the resolver for this host.
if e == nil {
e = &dnsEntry{}
c.entries[host] = e
}
e.resolving = true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift

Serialize cold-cache resolutions too.

Line 117 only single-flights refreshes when stale IPs exist. On the first lookup for a host, concurrent goroutines see resolving == true but len(e.ips) == 0, fall through, and each calls resolveHostIPs, recreating the DNS burst this cache is meant to prevent.

Add a wait channel/condition or equivalent so cache-miss waiters block until the in-flight resolver returns, then reuse its success/error instead of starting another lookup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/dnscache.go` around lines 115 - 128, The cold-cache path in dnscache.go
still allows concurrent callers to bypass single-flight when a host has no
cached IPs, so fix the lookup flow in the resolver logic around dnsEntry and
resolveHostIPs to make cache-miss waiters block on the in-flight resolution
instead of falling through. Introduce a wait channel, condition, or similar
synchronization on dnsEntry so that when e.resolving is already true and
len(e.ips) == 0, other goroutines wait for the first resolver to finish and then
reuse its result or error, while preserving the existing stale-IP fast path.

Comment thread cli/dnscache.go
Comment on lines +137 to +144
if err != nil || len(ips) == 0 {
// Keep serving the last successful set; never stop the benchmark
// on a transient DNS failure.
if len(stale) > 0 {
ip := e.pick()
c.mu.Unlock()
return ip, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Advance expiry when serving stale after refresh failure.

When re-resolution fails, e.expires stays expired, so the next dial immediately tries DNS again. During resolver outages this can degrade back into one DNS query per connection. Bump the stale entry’s expiry/backoff before returning it.

Proposed fix
 		if len(stale) > 0 {
+			e.expires = time.Now().Add(c.ttl)
 			ip := e.pick()
 			c.mu.Unlock()
 			return ip, nil
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err != nil || len(ips) == 0 {
// Keep serving the last successful set; never stop the benchmark
// on a transient DNS failure.
if len(stale) > 0 {
ip := e.pick()
c.mu.Unlock()
return ip, nil
}
if err != nil || len(ips) == 0 {
// Keep serving the last successful set; never stop the benchmark
// on a transient DNS failure.
if len(stale) > 0 {
e.expires = time.Now().Add(c.ttl)
ip := e.pick()
c.mu.Unlock()
return ip, nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/dnscache.go` around lines 137 - 144, The stale-result path in getIP is
returning a cached IP after a DNS refresh failure, but e.expires is left expired
so every subsequent dial retries DNS immediately. Update the retry/backoff state
in this stale branch before returning from getIP, using the existing
expiry/backoff logic in the cache entry (e and its stale handling) so serving
stale entries temporarily suppresses repeated resolver calls during outages.

@kojiwell

kojiwell commented Jul 2, 2026

Copy link
Copy Markdown
Author

Sorry, this is a duplicate of #454. Please ignore.

@kojiwell kojiwell closed this Jul 2, 2026
@kojiwell kojiwell changed the title Add --dns-cache-ttl to avoid DNS storm at high concurrency [Duplicate] Add --dns-cache-ttl to avoid DNS storm at high concurrency Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant