feat(interaction-store): add optional client gRPC keepalive#378
feat(interaction-store): add optional client gRPC keepalive#378kanhaiya-garg-meesho wants to merge 2 commits into
Conversation
Add opt-in client keepalive to the interaction-store gRPC client. Three new optional Config fields (KeepaliveTimeMs, KeepaliveTimeoutMs, KeepalivePermitWithoutStream) drive grpc.WithKeepaliveParams, applied only when KeepaliveTimeMs > 0. When unset (the zero value) the dial options are identical to before, so existing consumers are unaffected. This lets latency-sensitive callers detect a silently dead connection (e.g. a black-holed TCP path) via HTTP/2 keepalive PINGs, instead of waiting for a per-RPC deadline or the kernel's TCP retransmit give-up. Callers must keep KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy MinTime to avoid the server closing the connection with GOAWAY "too_many_pings". Dial-option assembly is extracted into dialOptions / keepaliveParamsFromConfig so the opt-in gate, millisecond-to-Duration mapping, and backward-compatible option set are unit tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>
🟢 LOW — safe to fast-track · confidence 0.92This SDK change only adds optional gRPC keepalive settings that stay off unless a caller explicitly sets them, so existing services keep the same connection behavior after merge. Tests in Top risks
Check before approving
8 tool(s) ran · DRS agentic v2 · #378 |
…d-compat test Address review feedback on the optional keepalive change: - Reject a non-positive KeepaliveTimeoutMs when keepalive is enabled (KeepaliveTimeMs > 0) instead of converting it into a zero/negative gRPC timer, which would make keepalive PINGs fail immediately and churn otherwise-healthy connections. keepaliveParamsFromConfig now returns an error, propagated through dialOptions and getGRPCConnections so the misconfiguration fails loudly at client construction (consistent with the existing validateConfig behaviour). - Strengthen the backward-compatibility test to assert the effective default dial behaviour rather than only the option count: the round_robin load-balancing policy is now a named constant that is parsed and verified, and the default path is asserted to produce no keepalive option. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>
FTR:AI-BLITZAI Blitz check0 passed, 3 failed, 6 skipped Failing checks: ❌ Unit-test coverage pipeline — pipeline did not run |
|
Closing this PR — the change is being raised in our internal source-of-truth repository instead (the public SDK is published from there). Apologies for the noise here. |
Context:
The
interaction-storegRPC client dials with no client keepalive. When a connection silently black-holes (e.g. a stale conntrack entry / dead TCP path with no FIN or RST), the gRPC channel staysREADYand the client keeps writing onto a dead connection — failures persist until a per-RPC deadline fires or the kernel's TCP retransmit logic gives up (on the order of minutes). Latency-sensitive consumers (e.g. recently-viewed catalog recommendations on a tight per-call budget) have no way to detect and recycle such a connection quickly. This change adds an opt-in keepalive so consumers that need it can enable it via config, without affecting anyone else.Describe your changes:
Configgains three optional fields:KeepaliveTimeMs,KeepaliveTimeoutMs,KeepalivePermitWithoutStream.getGRPCConnectionsnow assembles its dial options via a newdialOptions(Config)helper, which appendsgrpc.WithKeepaliveParams(...)only whenKeepaliveTimeMs > 0(derived by the pure helperkeepaliveParamsFromConfig).NewClientV1signature is unchanged, and there is nogo.mod/go.sumchange (keepaliveships withgoogle.golang.org/grpc).Configfields: callers must keepKeepaliveTimeMsat or above the server's keepaliveEnforcementPolicy.MinTime, or the server will close the connection withGOAWAY "too_many_pings".Testing:
grpc_test.gounit tests (table-driven, testify): the opt-in gate (disabled whenKeepaliveTimeMs <= 0, including negative), millisecond→time.Durationmapping,PermitWithoutStreamhandling, and a backward-compatibility assertion that the default dial-option set is unchanged and that enabling keepalive appends exactly one option.TestNewClientV1with a keepalive-enabled construction case.gofmt -lclean,go build,go vet, andgo test -race ./pkg/interaction-store/...all green; existing suite unaffected.EnforcementPolicy(thetoo_many_pingsGOAWAY edge) — is exercised separately in the consuming service's integration environment against the real server's enforcement policy. It is intentionally out of scope for these SDK unit tests, which cover the configuration/option-assembly logic only.Monitoring:
No new metrics are emitted by the SDK. Consumers that enable keepalive should watch their existing gRPC connection-state / error signals for
too_many_pingsGOAWAYs after rollout, which would indicate the configuredKeepaliveTimeMsis below the server'sMinTime. The feature is inert unless explicitly configured.Rollback plan
Opt-in and config-driven at the consumer: unset / zero the keepalive fields to return to the prior no-keepalive behaviour with no code change. Reverting this PR is equally safe — it only adds optional
Configfields and one conditional dial option.Checklist before requesting a review
📂 Modules Affected
go-sdk(interaction-store client)✅ Type of Change
📊 Benchmark / Metrics (if applicable)
N/A — no hot-path or performance change. With keepalive disabled (the default) the dial options are identical to before; when enabled it adds background HTTP/2 PINGs at the configured interval.
🤖 Generated with Claude Code