Background
storage.WithRetry() currently exists as a no-op stub in storage.go. The question came up from a TCM use case: TCM has a GetWithRetry helper that issues up to len(endpoints) * 2 requests against the storage to ride out transient failures. We don't have an equivalent right now, and connection-pool failover alone isn't enough — the pool handles "instance is dead", not "this request just timed out / the leader changed".
What "retry" means here
Operation-level retry on transient errors (timeouts, Unavailable, network/leader-change errors) for safe-to-replay calls:
- Execute is safe by construction — predicates make CAS-style transactions naturally idempotent.
- Range is a pure read.
- Watch: at most we'd retry the initial subscribe, never re-emit inside the stream — that's a caller concern.
What "retry" doesn't cover: business-logic errors (predicate-fail), validation errors, or anything the driver classifies as permanent.
Design axes
Independent of the chosen placement, every design needs to answer:
- Policy shape: max attempts vs. unlimited+ctx, backoff curve, jitter, per-driver retriability classifier.
- Default classifier per driver:
  - etcd: gRPC Unavailable, DeadlineExceeded, leader/election errors.
  - TCS: tarantool transient error codes, network errors, pool errors.
- TCM parity heuristic: do we expose MaxAttempts = len(endpoints) * 2 out of the box, or just a flat number?
A reasonable starting policy:
```go
type Policy struct {
	MaxAttempts    int // 0 = unlimited, gated by context
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
	Multiplier     float64
	Jitter         float64 // 0..1, full jitter
	IsRetriable    func(error) bool
}
```
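As a concrete example of the per-driver classifier slot, an etcd default could key off gRPC status codes. This is a sketch, assuming the driver surfaces raw gRPC errors; the exact code set and etcd's leader-change errors would be pinned down per driver:

```go
package retry

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// etcdIsRetriable sketches a default IsRetriable for the etcd driver:
// treat gRPC Unavailable/DeadlineExceeded and plain context timeouts as
// transient. Leader/election errors would be added once the driver
// exposes them in a classifiable form.
func etcdIsRetriable(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	switch status.Code(err) {
	case codes.Unavailable, codes.DeadlineExceeded:
		return true
	default:
		return false
	}
}
```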
Design A — Driver decorator + storage option
New driver/retry package; storage.WithRetry(...) wires it in.
```go
package retry

func Wrap(d driver.Driver, policy Policy) driver.Driver

// usage
s := storage.NewStorage(d, storage.WithRetry(retry.WithMaxAttempts(5)))
```
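To make the decorator concrete, here's a sketch of the wrapped type with Range funneled through a generic attempt loop. The wrapped type and the driver.RangeRequest/driver.RangeResponse shapes are placeholders, and the loop is the same Do sketched under Design D below:

```go
// Sketch only: the decorated type and one method, assuming the current
// driver.Driver surface. Request/response shapes are placeholders.
type wrapped struct {
	next   driver.Driver
	policy Policy
}

func (w *wrapped) Range(ctx context.Context, req driver.RangeRequest) (resp driver.RangeResponse, err error) {
	// Re-invoke the closure on retriable errors, backing off with
	// jitter in between (see the Do sketch under Design D).
	err = Do(ctx, w.policy, func() error {
		resp, err = w.next.Range(ctx, req)
		return err
	})
	return resp, err
}
```

Watch would bypass this path, or retry only the initial subscribe, per the semantics above.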
Pros:
- Orthogonal to the rest of the codebase, trivially testable with a fake driver.
- Composes cleanly with pool failover (pool: dead instance, retry: flaky request).
- Single retry surface for every consumer of the driver.
Cons:
- Adds a layer between Storage and the underlying driver — slightly more indirection in stack traces.
- Watch retry has to be handled specially or skipped.
Design B — Per-call option
Retry config travels with the call instead of the storage instance.
```go
s.Range(ctx, storage.WithPrefix("/foo"), storage.WithRetry(...))
tx.Commit(storage.WithRetry(...))
```
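For scale, a sketch of the plumbing; Option, rangeOptions, and the s.driver.Range call are illustrative names rather than the existing API, and the attempt loop is the Do helper sketched under Design D:

```go
// Hypothetical per-call option carrying a retry policy.
type rangeOptions struct {
	prefix string
	retry  *retry.Policy // nil = single attempt
}

type Option func(*rangeOptions)

func WithPrefix(p string) Option { return func(o *rangeOptions) { o.prefix = p } }

func WithRetry(p retry.Policy) Option { return func(o *rangeOptions) { o.retry = &p } }

func (s *Storage) Range(ctx context.Context, opts ...Option) (resp RangeResponse, err error) {
	var o rangeOptions
	for _, opt := range opts {
		opt(&o)
	}
	if o.retry == nil {
		return s.driver.Range(ctx, o.prefix) // no option, no retry
	}
	err = retry.Do(ctx, *o.retry, func() error {
		resp, err = s.driver.Range(ctx, o.prefix)
		return err
	})
	return resp, err
}
```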
Pros:
- Fine-grained: callers pick retry per operation (e.g., aggressive on reads, none on writes).
- No hidden behavior — what you see at the call site is what runs.
Cons:
- Verbose; same config repeated at every call site.
- Tx is a builder; threading retry through Commit() adds API surface to the tx package.
- Two retry knobs (storage-level + call-level) if we ever add A on top — precedence rules get awkward.
Design C — Connection-level only
Push retry into the connect package and the underlying pool/etcd client configuration. No new abstraction at the driver/storage layer.
```go
cfg := connect.Config{Endpoints: ..., Retry: connect.RetryConfig{...}}
```
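A sketch of what that single RetryConfig might have to hold; field names are invented, and the split below is exactly the divergence the cons point at:

```go
// Illustrative only: one knob serving two different retry models.
type RetryConfig struct {
	// etcd side: would translate to client/gRPC-level retry and backoff.
	MaxAttempts int
	Backoff     time.Duration

	// tarantool side: the pool retries via reconnect/re-dispatch with its
	// own timing knobs; these fields have no 1:1 etcd counterpart.
	ReconnectInterval time.Duration
	ReconnectRetries  int
}
```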
Pros:
- Closest to where the transient errors physically occur.
- No new package; no decorator.
Cons:
- Tarantool pool and etcd client have different retry models — we'd end up with two divergent configs hidden under one knob, or leak the difference.
- Doesn't cover request-level retry that crosses pool semantics (e.g., timeout on a healthy instance) uniformly.
- Roman's TCM use case explicitly motivates request-level retries that pool failover doesn't catch — this design alone may not solve it.
Design D — Helper functions, no built-in option
Drop the stub entirely. Provide retry.Do(ctx, policy, fn) as a standalone helper and let callers wrap their own calls.
```go
err := retry.Do(ctx, policy, func() error {
	_, err := s.Range(ctx, ...)
	return err
})
```
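A minimal Do under the Policy above could look like this (a sketch, not a final implementation: capped exponential backoff with full jitter, gated by the context; a nil IsRetriable retries everything):

```go
package retry

import (
	"context"
	"math/rand"
	"time"
)

// Do retries fn on retriable errors until it succeeds, the attempt
// budget runs out, or ctx is done.
func Do(ctx context.Context, p Policy, fn func() error) error {
	backoff := p.InitialBackoff
	for attempt := 1; ; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		if p.IsRetriable != nil && !p.IsRetriable(err) {
			return err // permanent: surface immediately
		}
		if p.MaxAttempts > 0 && attempt >= p.MaxAttempts {
			return err // budget exhausted
		}
		// Jittered sleep somewhere in [backoff*(1-Jitter), backoff].
		sleep := backoff - time.Duration(p.Jitter*rand.Float64()*float64(backoff))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff = time.Duration(float64(backoff) * p.Multiplier); backoff > p.MaxBackoff {
			backoff = p.MaxBackoff
		}
	}
}
```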
Pros:
- Smallest API surface; easiest to ship.
- Caller has full control, no hidden behavior.
Cons:
- Every consumer reimplements the same wrapper around storage calls.
- No driver-aware default classifier — caller has to know which errors are transient for which backend.
- Doesn't deliver on the existing WithRetry option's implied promise.