
Design discussion: WithRetry option for transient operation failures #76

@bigbes

Description


Background

storage.WithRetry() currently exists as a no-op stub in storage.go. The question came up from a TCM use case: TCM has a GetWithRetry helper that issues up to len(endpoints) * 2 requests against the storage to ride out transient failures. We don't have an equivalent right now, and connection-pool failover alone isn't enough — the pool handles "instance is dead", not "this request just timed out / the leader changed".

What "retry" means here

Operation-level retry on transient errors (timeouts, Unavailable, network/leader-change errors) for safe-to-replay calls:

  • Execute is safe by construction — predicates make CAS-style transactions naturally idempotent.
  • Range is a pure read.
  • Watch: at most we'd retry the initial subscribe, never re-emit inside the stream — that's a caller concern.

What "retry" doesn't cover: business-logic errors (predicate-fail), validation errors, or anything the driver classifies as permanent.

Design axes

Regardless of where retry ends up living, every design needs to answer:

  • Policy shape: max attempts vs. unlimited+ctx, backoff curve, jitter, per-driver retriability classifier.
  • Default classifier per driver:
    • etcd: gRPC Unavailable, DeadlineExceeded, leader/election errors.
    • TCS: tarantool transient error codes, network errors, pool errors.
  • TCM parity heuristic: do we expose MaxAttempts = len(endpoints) * 2 out of the box, or just a flat number?

A reasonable starting policy:

type Policy struct {
    MaxAttempts    int           // 0 = unlimited, gated by context
    InitialBackoff time.Duration
    MaxBackoff     time.Duration
    Multiplier     float64
    Jitter         float64       // 0..1, full jitter
    IsRetriable    func(error) bool
}

Design A — Driver decorator + storage option

New driver/retry package; storage.WithRetry(...) wires it in.

package retry
func Wrap(d driver.Driver, policy Policy) driver.Driver

// usage
s := storage.NewStorage(d, storage.WithRetry(retry.WithMaxAttempts(5)))

Pros:

  • Orthogonal to the rest of the codebase, trivially testable with a fake driver.
  • Composes cleanly with pool failover (pool: dead instance, retry: flaky request).
  • Single retry surface for every consumer of the driver.

Cons:

  • Adds a layer between Storage and the underlying driver — slightly more indirection in stack traces.
  • Watch retry has to be handled specially or skipped.

Design B — Per-call option

Retry config travels with the call instead of the storage instance.

s.Range(ctx, storage.WithPrefix("/foo"), storage.WithRetry(...))
tx.Commit(storage.WithRetry(...))

Pros:

  • Fine-grained: callers pick retry per operation (e.g., aggressive on reads, none on writes).
  • No hidden behavior — what you see at the call site is what runs.

Cons:

  • Verbose; same config repeated at every call site.
  • Tx is a builder; threading retry through Commit() adds API surface to the tx package.
  • Two retry knobs (storage-level + call-level) if we ever add A on top — precedence rules get awkward.
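
For concreteness, the per-call plumbing would be the usual functional-option pattern, sketched here with hypothetical names (`Option`, `callConfig`) and a trivial stand-in for `Range`:

```go
package main

import "fmt"

// Policy stands in for the retry policy above; only MaxAttempts matters here.
type Policy struct{ MaxAttempts int }

// callConfig collects per-call options; Option is the usual functional-option shape.
type callConfig struct{ retry *Policy }
type Option func(*callConfig)

// WithRetry is the per-call variant: the policy travels with this one call only.
func WithRetry(p Policy) Option { return func(c *callConfig) { c.retry = &p } }

// Range is a hypothetical storage method showing where the option lands.
func Range(prefix string, opts ...Option) string {
	var cfg callConfig
	for _, o := range opts {
		o(&cfg)
	}
	if cfg.retry == nil {
		return fmt.Sprintf("Range(%s) without retry", prefix)
	}
	return fmt.Sprintf("Range(%s) with up to %d attempts", prefix, cfg.retry.MaxAttempts)
}

func main() {
	fmt.Println(Range("/foo"))
	fmt.Println(Range("/foo", WithRetry(Policy{MaxAttempts: 5})))
}
```

The verbosity con is visible at the call site: every aggressive-read path repeats the same `WithRetry(...)` argument.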

Design C — Connection-level only

Push retry into the connect package and the underlying pool/etcd client configuration. No new abstraction at the driver/storage layer.

cfg := connect.Config{Endpoints: ..., Retry: connect.RetryConfig{...}}

Pros:

  • Closest to where the transient errors physically occur.
  • No new package; no decorator.

Cons:

  • Tarantool pool and etcd client have different retry models — we'd end up with two divergent configs hidden under one knob, or leak the difference.
  • Doesn't uniformly cover request-level retries that cut across pool semantics (e.g., a timeout against a healthy instance).
  • Roman's TCM use case explicitly motivates request-level retries that pool failover doesn't catch — this design alone may not solve it.

Design D — Helper functions, no built-in option

Drop the stub entirely. Provide retry.Do(ctx, policy, fn) as a standalone helper and let callers wrap their own calls.

err := retry.Do(ctx, policy, func() error {
    _, err := s.Range(ctx, ...)
    return err
})

Pros:

  • Smallest API surface; easiest to ship.
  • Caller has full control, no hidden behavior.

Cons:

  • Every consumer reimplements the same wrapper around storage calls.
  • No driver-aware default classifier — caller has to know which errors are transient for which backend.
  • Doesn't deliver on the existing WithRetry option's implied promise.
