Skip to content

feat(interaction-store): add optional client gRPC keepalive#378

Closed
kanhaiya-garg-meesho wants to merge 2 commits into
developfrom
feat/interaction-store-client-optional-keepalive
Closed

feat(interaction-store): add optional client gRPC keepalive#378
kanhaiya-garg-meesho wants to merge 2 commits into
developfrom
feat/interaction-store-client-optional-keepalive

Conversation

@kanhaiya-garg-meesho

Copy link
Copy Markdown

Context:

The interaction-store gRPC client dials with no client keepalive. When a connection silently black-holes (e.g. a stale conntrack entry / dead TCP path with no FIN or RST), the gRPC channel stays READY and the client keeps writing onto a dead connection — failures persist until a per-RPC deadline fires or the kernel's TCP retransmit logic gives up (on the order of minutes). Latency-sensitive consumers (e.g. recently-viewed catalog recommendations on a tight per-call budget) have no way to detect and recycle such a connection quickly. This change adds an opt-in keepalive so consumers that need it can enable it via config, without affecting anyone else.

Describe your changes:

  • Config gains three optional fields: KeepaliveTimeMs, KeepaliveTimeoutMs, KeepalivePermitWithoutStream.
  • getGRPCConnections now assembles its dial options via a new dialOptions(Config) helper, which appends grpc.WithKeepaliveParams(...) only when KeepaliveTimeMs > 0 (derived by the pure helper keepaliveParamsFromConfig).
  • Fully backward-compatible / opt-in: with the fields unset (zero value), the dial-option set is byte-for-byte identical to before, so every existing consumer is unaffected. The exported NewClientV1 signature is unchanged, and there is no go.mod / go.sum change (keepalive ships with google.golang.org/grpc).
  • Documented on the Config fields: callers must keep KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy.MinTime, or the server will close the connection with GOAWAY "too_many_pings".

Testing:

  • New grpc_test.go unit tests (table-driven, testify): the opt-in gate (disabled when KeepaliveTimeMs <= 0, including negative), millisecond→time.Duration mapping, PermitWithoutStream handling, and a backward-compatibility assertion that the default dial-option set is unchanged and that enabling keepalive appends exactly one option.
  • Extended TestNewClientV1 with a keepalive-enabled construction case.
  • gofmt -l clean, go build, go vet, and go test -race ./pkg/interaction-store/... all green; existing suite unaffected.
  • Behavioural validation — that keepalive actually emits PINGs and behaves correctly against a server's EnforcementPolicy (the too_many_pings GOAWAY edge) — is exercised separately in the consuming service's integration environment against the real server's enforcement policy. It is intentionally out of scope for these SDK unit tests, which cover the configuration/option-assembly logic only.

Monitoring:

No new metrics are emitted by the SDK. Consumers that enable keepalive should watch their existing gRPC connection-state / error signals for too_many_pings GOAWAYs after rollout, which would indicate the configured KeepaliveTimeMs is below the server's MinTime. The feature is inert unless explicitly configured.

Rollback plan

Opt-in and config-driven at the consumer: unset / zero the keepalive fields to return to the prior no-keepalive behaviour with no code change. Reverting this PR is equally safe — it only adds optional Config fields and one conditional dial option.

Checklist before requesting a review

  • I have reviewed my own changes?
  • Relevant or critical functionality is covered by tests?
  • Monitoring needs have been evaluated?
  • Any necessary documentation updates have been considered?

📂 Modules Affected

  • Other: go-sdk (interaction-store client)

✅ Type of Change

  • Feature addition

📊 Benchmark / Metrics (if applicable)

N/A — no hot-path or performance change. With keepalive disabled (the default) the dial options are identical to before; when enabled it adds background HTTP/2 PINGs at the configured interval.

🤖 Generated with Claude Code

Add opt-in client keepalive to the interaction-store gRPC client. Three new
optional Config fields (KeepaliveTimeMs, KeepaliveTimeoutMs,
KeepalivePermitWithoutStream) drive grpc.WithKeepaliveParams, applied only when
KeepaliveTimeMs > 0. When unset (the zero value) the dial options are identical
to before, so existing consumers are unaffected.

This lets latency-sensitive callers detect a silently dead connection (e.g. a
black-holed TCP path) via HTTP/2 keepalive PINGs, instead of waiting for a
per-RPC deadline or the kernel's TCP retransmit give-up. Callers must keep
KeepaliveTimeMs at or above the server's keepalive EnforcementPolicy MinTime to
avoid the server closing the connection with GOAWAY "too_many_pings".

Dial-option assembly is extracted into dialOptions / keepaliveParamsFromConfig so
the opt-in gate, millisecond-to-Duration mapping, and backward-compatible option
set are unit tested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>

@m-agentic-review m-agentic-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Code Review — PR #378

Language: Go  |  Files reviewed: 4

Severity Count
SUGGESTION 💡 1
WARNING ⚠️ 3

Comment thread go-sdk/pkg/interaction-store/grpc.go
Comment thread go-sdk/pkg/interaction-store/grpc.go Outdated
Comment thread go-sdk/pkg/interaction-store/grpc_test.go Outdated
@m-agentic-review

m-agentic-review Bot commented Jun 15, 2026

Copy link
Copy Markdown

🟢 LOW — safe to fast-track · confidence 0.92

This SDK change only adds optional gRPC keepalive settings that stay off unless a caller explicitly sets them, so existing services keep the same connection behavior after merge. Tests in grpc_test.go confirm the default dial path still builds exactly two options (transport + round-robin) with no keepalive, and NewClientV1 is unchanged. Cleared: no breaking API, schema, or dependency changes.

Top risks

  • go-sdk/pkg/interaction-store/grpc.go#dialOptions — keepalive is appended only when KeepaliveTimeMs > 0; default zero config leaves dial options identical to before

Check before approving

  • Keepalive is inactive at merge — only services that opt in via Config will send PINGs
  • If a consumer enables keepalive, confirm KeepaliveTimeMs meets the server's minimum interval to avoid too_many_pings disconnects

8 tool(s) ran · DRS agentic v2 · #378

…d-compat test

Address review feedback on the optional keepalive change:

- Reject a non-positive KeepaliveTimeoutMs when keepalive is enabled
  (KeepaliveTimeMs > 0) instead of converting it into a zero/negative gRPC timer,
  which would make keepalive PINGs fail immediately and churn otherwise-healthy
  connections. keepaliveParamsFromConfig now returns an error, propagated through
  dialOptions and getGRPCConnections so the misconfiguration fails loudly at client
  construction (consistent with the existing validateConfig behaviour).
- Strengthen the backward-compatibility test to assert the effective default dial
  behaviour rather than only the option count: the round_robin load-balancing policy
  is now a named constant that is parsed and verified, and the default path is asserted
  to produce no keepalive option.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: kanhaiya-garg-meesho <kanhaiya.garg@meesho.com>

@m-agentic-review m-agentic-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Code Review — PR #378

Language: Go  |  Files reviewed: 4

Severity Count
SUGGESTION 💡 1
WARNING ⚠️ 3

Comment thread go-sdk/pkg/interaction-store/grpc.go
Comment thread go-sdk/pkg/interaction-store/grpc.go
Comment thread go-sdk/pkg/interaction-store/grpc.go
Comment thread go-sdk/pkg/interaction-store/grpc.go
@turbo-turtle-github

Copy link
Copy Markdown

FTR:AI-BLITZ

AI Blitz check

0 passed, 3 failed, 6 skipped

Failing checks:

Unit-test coverage pipeline — pipeline did not run
Service-test pipeline — pipeline did not run
PR-diff mutation kill rate — no mutation data available for this PR (pipeline produced no score)

@kanhaiya-garg-meesho

Copy link
Copy Markdown
Author

Closing this PR — the change is being raised in our internal source-of-truth repository instead (the public SDK is published from there). Apologies for the noise here.

@kanhaiya-garg-meesho kanhaiya-garg-meesho deleted the feat/interaction-store-client-optional-keepalive branch June 15, 2026 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant