Skip to content

feat(interaction-store): add optional client gRPC keepalive#379

Open
kanhaiya-garg-meesho wants to merge 1 commit into
developfrom
feat/interaction-store-client-optional-keepalive
Open

feat(interaction-store): add optional client gRPC keepalive#379
kanhaiya-garg-meesho wants to merge 1 commit into
developfrom
feat/interaction-store-client-optional-keepalive

Conversation

@kanhaiya-garg-meesho

Copy link
Copy Markdown

Context:

The interaction-store gRPC client dials with no client keepalive. When a connection silently black-holes (stale conntrack / dead TCP path with no FIN or RST), the gRPC channel stays READY and the client keeps writing onto a dead connection — failures persist until a per-RPC deadline or the kernel's TCP retransmit give-up (minutes). Latency-sensitive consumers (e.g. recently-viewed catalog recommendations on a tight per-call budget) need to detect and recycle such a connection quickly. This adds an opt-in keepalive.

Describe your changes:

  • Config gains three optional fields: KeepaliveTimeMs, KeepaliveTimeoutMs, KeepalivePermitWithoutStream.
  • getGRPCConnections assembles its dial options via a new dialOptions(Config) helper, appending grpc.WithKeepaliveParams(...) only when KeepaliveTimeMs > 0 (via the pure keepaliveParamsFromConfig).
  • Opt-in / backward-compatible: with the fields unset (zero value), the dial-option set is byte-for-byte identical to before — existing consumers are unaffected. No exported-signature change; no go.mod / go.sum change (keepalive ships with google.golang.org/grpc).
  • Validation: when keepalive is enabled, a non-positive KeepaliveTimeoutMs is rejected (error → panic at construction via NewConnFromConfig) instead of being passed to gRPC as a zero/negative timer that would fail keepalive immediately.
  • Documented on Config: callers must keep KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy.MinTime, or the server closes the connection with GOAWAY "too_many_pings".

Testing:

  • Added grpc_test.go (testify) covering: keepaliveParamsFromConfig opt-in gate, ms→Duration mapping, PermitWithoutStream, and timeout validation (negative and zero rejected); dialOptions backward-compat (default keeps round_robin and adds no keepalive option; valid keepalive adds exactly one; invalid timeout → error); TestDefaultServiceConfigIsRoundRobin parses and asserts the LB policy; extended TestNewClientV1 (in client_test.go) with a keepalive-enabled construction case.
  • gofmt -l clean, go build, go vet, and go test -race ./pkg/interaction-store/... all green; the existing suite is unaffected.
  • Validated end-to-end locally over a real socket (separate client/server processes against this SDK): keepalive PINGs confirmed by reproducing the too_many_pings ENHANCE_YOUR_CALM GOAWAY when pinging faster than the server's MinTime; observed gRPC's 10s client-interval clamp; and confirmed the timeout-validation panic. Behavioural testing against the real server's enforcement policy is done in the consuming service's integration environment.

Monitoring:

No new metrics are emitted by the SDK. Consumers that enable keepalive should watch their existing gRPC connection-state / error signals for too_many_pings GOAWAYs after rollout (would indicate KeepaliveTimeMs is below the server's MinTime). The feature is inert unless explicitly configured.

Rollback plan

Opt-in and config-driven at the consumer: unset / zero the keepalive fields to return to the prior no-keepalive behaviour with no code change. Reverting this PR is equally safe — it only adds optional Config fields and one conditional dial option.

Checklist before requesting a review

  • I have reviewed my own changes?
  • Relevant or critical functionality is covered by tests?
  • Monitoring needs have been evaluated?
  • Any necessary documentation updates have been considered?

📂 Modules Affected

  • Other: go-sdk (interaction-store client)

✅ Type of Change

  • Feature addition

📊 Benchmark / Metrics (if applicable)

N/A — no hot-path or performance change. With keepalive disabled (the default) the dial options are identical to before; when enabled it adds background HTTP/2 PINGs at the configured interval.

🤖 Generated with Claude Code

Add opt-in client keepalive to the interaction-store gRPC client. Three new
optional Config fields (KeepaliveTimeMs, KeepaliveTimeoutMs,
KeepalivePermitWithoutStream) drive grpc.WithKeepaliveParams, applied only when
KeepaliveTimeMs > 0. When unset (the zero value) the dial options are identical to
before, so existing consumers are unaffected. No exported-signature change and no
go.mod change (keepalive ships with google.golang.org/grpc).

This lets latency-sensitive callers detect a silently dead connection (e.g. a
black-holed TCP path) via HTTP/2 keepalive PINGs, instead of waiting for a per-RPC
deadline or the kernel's TCP retransmit give-up.

Validation: when keepalive is enabled, a non-positive KeepaliveTimeoutMs is rejected
(error -> panic at construction via NewConnFromConfig) rather than passed to gRPC as
a zero/negative timer that would fail keepalive immediately. Callers must also keep
KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy MinTime to avoid
the server closing the connection with GOAWAY "too_many_pings" (documented on Config).

Dial-option assembly is extracted into dialOptions / keepaliveParamsFromConfig and
unit-tested: the opt-in gate, ms->Duration mapping, timeout validation, and a
backward-compat check that the default option set keeps round_robin and adds no
keepalive option.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>

@m-agentic-review m-agentic-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Code Review — PR #379

Language: Go  |  Files reviewed: 4

Severity Count
SUGGESTION 💡 2
WARNING ⚠️ 4

"(KeepaliveTimeMs=%d ms), got %d ms", config.KeepaliveTimeMs, config.KeepaliveTimeoutMs)
}
return keepalive.ClientParameters{
Time: time.Duration(config.KeepaliveTimeMs) * time.Millisecond,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ WARNING — Validate that KeepaliveTimeMs and KeepaliveTimeoutMs fit in time.Duration before multiplying by time.Millisecond, and reject values that would overflow.

Very large positive millisecond values can overflow the duration multiplication into a negative or otherwise incorrect timer, bypassing the current non-positive validation and passing invalid keepalive settings to gRPC.

Also flagged on this line:

  • 💡 SUGGESTION — KeepaliveTimeMs is multiplied into a time.Duration without checking for overflow, which can produce a negative or nonsensical ping interval.

}
return &GRPCClient{Conn: gConn, DeadLine: int64(config.DeadLine)}, nil
if enabled {
opts = append(opts, grpc.WithKeepaliveParams(params))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ WARNING — Applying keepalive parameters without any SDK-side lower bound can turn idle connection pools into a fleet-wide background PING load when consumers enable aggressive keepalive, especially with PermitWithoutStream.

The new dial option makes gRPC send periodic HTTP/2 PINGs on each client connection according to caller-provided millisecond values. With no enforced minimum interval or guard for PermitWithoutStream, a misconfigured service can generate continuous idle-connection traffic and server ACK work across every pod and resolved backend connection; the SDK should clamp/reject intervals below the server policy or provide a safe named preset rather than accepting any positive value.

if config.KeepaliveTimeMs <= 0 {
return keepalive.ClientParameters{}, false, nil
}
if config.KeepaliveTimeoutMs <= 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ WARNING — Enabling keepalive with only KeepaliveTimeMs set will now fail client construction instead of using gRPC's default keepalive timeout.

The new gate treats a zero KeepaliveTimeoutMs as fatal whenever KeepaliveTimeMs is positive. Because these fields are advertised as optional and gRPC keepalive parameters normally have default timeout behavior, a partial rollout that sets only the ping interval turns into a startup panic via NewConnFromConfig.

}
return keepalive.ClientParameters{
Time: time.Duration(config.KeepaliveTimeMs) * time.Millisecond,
Timeout: time.Duration(config.KeepaliveTimeoutMs) * time.Millisecond,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ WARNING — KeepaliveTimeoutMs is accepted as any positive duration, so very small values can churn otherwise healthy connections under normal RTT or GC pauses.

The validation only rejects non-positive timeouts, then converts the raw millisecond value directly into the gRPC keepalive timeout. A timeout of a few milliseconds can expire before an ACK during routine network jitter, server scheduling delay, or client/server GC pauses, causing reconnects and retries that hurt tail latency; enforce a sane minimum such as seconds-scale or document/configure a safe default when keepalive is enabled.

assert.Equal(t, "round_robin", sc.LoadBalancingPolicy)
}

func TestDialOptions(t *testing.T) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 SUGGESTION — Convert TestDialOptions to a table-driven test instead of separate hand-written subtests for each input case.

This test covers multiple input cases for the same function, and the repository rule requires table-driven tests for that pattern; using a table also makes future dial option cases less repetitive and less error-prone.

@turbo-turtle-github

Copy link
Copy Markdown

FTR:AI-BLITZ

AI Blitz check

0 passed, 3 failed, 6 skipped

Failing checks:

Unit-test coverage pipeline — pipeline did not run
Service-test pipeline — pipeline did not run
PR-diff mutation kill rate — no mutation data available for this PR (pipeline produced no score)

@m-agentic-review

Copy link
Copy Markdown

🟢 LOW — safe to fast-track · confidence 0.91

This only adds optional gRPC keepalive settings to the interaction-store client; with defaults unset, connection setup is unchanged so nothing activates on merge. keepaliveParamsFromConfig returns disabled when KeepaliveTimeMs &lt;= 0, and grpc_test.go asserts the default dial path still uses only transport credentials plus round-robin config. Cleared: no dependency bumps, schema changes, or new production config.

Top risks

  • go-sdk/pkg/interaction-store/grpc.go#dialOptions — refactors dial-option assembly; keepalive is appended only when KeepaliveTimeMs &gt; 0, and tests confirm the default two-option path matches pre-PR behavior

Check before approving

  • Keepalive is inert until a consumer sets KeepaliveTimeMs — re-review when a service enables it in config
  • If enabled, confirm KeepaliveTimeMs is at or above the server's keepalive MinTime to avoid too_many_pings disconnects

8 tool(s) ran · DRS agentic v2 · #379

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant