feat(interaction-store): add optional client gRPC keepalive#379
feat(interaction-store): add optional client gRPC keepalive#379kanhaiya-garg-meesho wants to merge 1 commit into
Conversation
Add opt-in client keepalive to the interaction-store gRPC client. Three new optional Config fields (KeepaliveTimeMs, KeepaliveTimeoutMs, KeepalivePermitWithoutStream) drive grpc.WithKeepaliveParams, applied only when KeepaliveTimeMs > 0. When unset (the zero value) the dial options are identical to before, so existing consumers are unaffected. No exported-signature change and no go.mod change (keepalive ships with google.golang.org/grpc). This lets latency-sensitive callers detect a silently dead connection (e.g. a black-holed TCP path) via HTTP/2 keepalive PINGs, instead of waiting for a per-RPC deadline or the kernel's TCP retransmit give-up. Validation: when keepalive is enabled, a non-positive KeepaliveTimeoutMs is rejected (error -> panic at construction via NewConnFromConfig) rather than passed to gRPC as a zero/negative timer that would fail keepalive immediately. Callers must also keep KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy MinTime to avoid the server closing the connection with GOAWAY "too_many_pings" (documented on Config). Dial-option assembly is extracted into dialOptions / keepaliveParamsFromConfig and unit-tested: the opt-in gate, ms->Duration mapping, timeout validation, and a backward-compat check that the default option set keeps round_robin and adds no keepalive option. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>
| "(KeepaliveTimeMs=%d ms), got %d ms", config.KeepaliveTimeMs, config.KeepaliveTimeoutMs) | ||
| } | ||
| return keepalive.ClientParameters{ | ||
| Time: time.Duration(config.KeepaliveTimeMs) * time.Millisecond, |
There was a problem hiding this comment.
Very large positive millisecond values can overflow the duration multiplication into a negative or otherwise incorrect timer, bypassing the current non-positive validation and passing invalid keepalive settings to gRPC.
Also flagged on this line:
- 💡 SUGGESTION — KeepaliveTimeMs is multiplied into a time.Duration without checking for overflow, which can produce a negative or nonsensical ping interval.
| } | ||
| return &GRPCClient{Conn: gConn, DeadLine: int64(config.DeadLine)}, nil | ||
| if enabled { | ||
| opts = append(opts, grpc.WithKeepaliveParams(params)) |
There was a problem hiding this comment.
The new dial option makes gRPC send periodic HTTP/2 PINGs on each client connection according to caller-provided millisecond values. With no enforced minimum interval or guard for PermitWithoutStream, a misconfigured service can generate continuous idle-connection traffic and server ACK work across every pod and resolved backend connection; the SDK should clamp/reject intervals below the server policy or provide a safe named preset rather than accepting any positive value.
| if config.KeepaliveTimeMs <= 0 { | ||
| return keepalive.ClientParameters{}, false, nil | ||
| } | ||
| if config.KeepaliveTimeoutMs <= 0 { |
There was a problem hiding this comment.
The new gate treats a zero KeepaliveTimeoutMs as fatal whenever KeepaliveTimeMs is positive. Because these fields are advertised as optional and gRPC keepalive parameters normally have default timeout behavior, a partial rollout that sets only the ping interval turns into a startup panic via NewConnFromConfig.
| } | ||
| return keepalive.ClientParameters{ | ||
| Time: time.Duration(config.KeepaliveTimeMs) * time.Millisecond, | ||
| Timeout: time.Duration(config.KeepaliveTimeoutMs) * time.Millisecond, |
There was a problem hiding this comment.
The validation only rejects non-positive timeouts, then converts the raw millisecond value directly into the gRPC keepalive timeout. A timeout of a few milliseconds can expire before an ACK during routine network jitter, server scheduling delay, or client/server GC pauses, causing reconnects and retries that hurt tail latency; enforce a sane minimum such as seconds-scale or document/configure a safe default when keepalive is enabled.
| assert.Equal(t, "round_robin", sc.LoadBalancingPolicy) | ||
| } | ||
|
|
||
| func TestDialOptions(t *testing.T) { |
There was a problem hiding this comment.
💡 SUGGESTION — Convert TestDialOptions to a table-driven test instead of separate hand-written subtests for each input case.
This test covers multiple input cases for the same function, and the repository rule requires table-driven tests for that pattern; using a table also makes future dial option cases less repetitive and less error-prone.
FTR:AI-BLITZAI Blitz check0 passed, 3 failed, 6 skipped Failing checks: ❌ Unit-test coverage pipeline — pipeline did not run |
🟢 LOW — safe to fast-track · confidence 0.91This only adds optional gRPC keepalive settings to the interaction-store client; with defaults unset, connection setup is unchanged so nothing activates on merge. Top risks
Check before approving
8 tool(s) ran · DRS agentic v2 · #379 |
Context:
The
interaction-storegRPC client dials with no client keepalive. When a connection silently black-holes (stale conntrack / dead TCP path with no FIN or RST), the gRPC channel staysREADYand the client keeps writing onto a dead connection — failures persist until a per-RPC deadline or the kernel's TCP retransmit give-up (minutes). Latency-sensitive consumers (e.g. recently-viewed catalog recommendations on a tight per-call budget) need to detect and recycle such a connection quickly. This adds an opt-in keepalive.Describe your changes:
Configgains three optional fields:KeepaliveTimeMs,KeepaliveTimeoutMs,KeepalivePermitWithoutStream.getGRPCConnectionsassembles its dial options via a newdialOptions(Config)helper, appendinggrpc.WithKeepaliveParams(...)only whenKeepaliveTimeMs > 0(via the purekeepaliveParamsFromConfig).go.mod/go.sumchange (keepaliveships withgoogle.golang.org/grpc).KeepaliveTimeoutMsis rejected (error → panic at construction viaNewConnFromConfig) instead of being passed to gRPC as a zero/negative timer that would fail keepalive immediately.Config: callers must keepKeepaliveTimeMsat or above the server's keepaliveEnforcementPolicy.MinTime, or the server closes the connection withGOAWAY "too_many_pings".Testing:
grpc_test.go(testify) covering:keepaliveParamsFromConfigopt-in gate, ms→Durationmapping,PermitWithoutStream, and timeout validation (negative and zero rejected);dialOptionsbackward-compat (default keepsround_robinand adds no keepalive option; valid keepalive adds exactly one; invalid timeout → error);TestDefaultServiceConfigIsRoundRobinparses and asserts the LB policy; extendedTestNewClientV1(inclient_test.go) with a keepalive-enabled construction case.gofmt -lclean,go build,go vet, andgo test -race ./pkg/interaction-store/...all green; the existing suite is unaffected.too_many_pings ENHANCE_YOUR_CALMGOAWAY when pinging faster than the server'sMinTime; observed gRPC's 10s client-interval clamp; and confirmed the timeout-validation panic. Behavioural testing against the real server's enforcement policy is done in the consuming service's integration environment.Monitoring:
No new metrics are emitted by the SDK. Consumers that enable keepalive should watch their existing gRPC connection-state / error signals for
too_many_pingsGOAWAYs after rollout (would indicateKeepaliveTimeMsis below the server'sMinTime). The feature is inert unless explicitly configured.Rollback plan
Opt-in and config-driven at the consumer: unset / zero the keepalive fields to return to the prior no-keepalive behaviour with no code change. Reverting this PR is equally safe — it only adds optional
Configfields and one conditional dial option.Checklist before requesting a review
📂 Modules Affected
go-sdk(interaction-store client)✅ Type of Change
📊 Benchmark / Metrics (if applicable)
N/A — no hot-path or performance change. With keepalive disabled (the default) the dial options are identical to before; when enabled it adds background HTTP/2 PINGs at the configured interval.
🤖 Generated with Claude Code