
Design discussion: WithRetry option for transient operation failures #76

@bigbes

Description


Background

storage.WithRetry() currently exists as a no-op stub in storage.go. The question came up from a TCM use case: TCM has a GetWithRetry helper that issues up to len(endpoints) * 2 requests against the storage to ride out transient failures. We don't have an equivalent right now, and connection-pool failover alone isn't enough — the pool handles "instance is dead", not "this request just timed out / the leader changed".

What "retry" means here

Operation-level retry on transient errors (timeouts, Unavailable, network/leader-change errors) for safe-to-replay calls:

  • Execute is safe by construction — predicates make CAS-style transactions naturally idempotent.
  • Range is a pure read.
  • Watch: at most we'd retry the initial subscribe, never re-emit inside the stream — that's a caller concern.

What "retry" doesn't cover: business-logic errors (predicate-fail), validation errors, or anything the driver classifies as permanent.

Design axes

Regardless of where retry ends up living, every design needs to answer:

  • Policy shape: max attempts vs. unlimited+ctx, backoff curve, jitter, per-driver retriability classifier.
  • Default classifier per driver:
    • etcd: gRPC Unavailable, DeadlineExceeded, leader/election errors.
    • TCS: tarantool transient error codes, network errors, pool errors.
  • TCM parity heuristic: do we expose MaxAttempts = len(endpoints) * 2 out of the box, or just a flat number?

A reasonable starting policy:

type Policy struct {
    MaxAttempts    int           // 0 = unlimited, gated by context
    InitialBackoff time.Duration
    MaxBackoff     time.Duration
    Multiplier     float64
    Jitter         float64       // 0..1, full jitter
    IsRetriable    func(error) bool
}

Design A — Driver decorator + storage option

New driver/retry package; storage.WithRetry(...) wires it in.

package retry
func Wrap(d driver.Driver, policy Policy) driver.Driver

// usage
s := storage.NewStorage(d, storage.WithRetry(retry.WithMaxAttempts(5)))

Pros:

  • Orthogonal to the rest of the codebase, trivially testable with a fake driver.
  • Composes cleanly with pool failover (pool: dead instance, retry: flaky request).
  • Single retry surface for every consumer of the driver.

Cons:

  • Adds a layer between Storage and the underlying driver — slightly more indirection in stack traces.
  • Watch retry has to be handled specially or skipped.

Design B — Per-call option

Retry config travels with the call instead of the storage instance.

s.Range(ctx, storage.WithPrefix("/foo"), storage.WithRetry(...))
tx.Commit(storage.WithRetry(...))

Pros:

  • Fine-grained: callers pick retry per operation (e.g., aggressive on reads, none on writes).
  • No hidden behavior — what you see at the call site is what runs.

Cons:

  • Verbose; same config repeated at every call site.
  • Tx is a builder; threading retry through Commit() adds API surface to the tx package.
  • Two retry knobs (storage-level + call-level) if we ever add A on top — precedence rules get awkward.
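
For concreteness, the per-call plumbing would be the usual functional-option pattern, sketched here with hypothetical names (`Option`, `callConfig`) and a trivial stand-in for `Range`:

```go
package main

import "fmt"

// Policy stands in for the retry policy above; only MaxAttempts matters here.
type Policy struct{ MaxAttempts int }

// callConfig collects per-call options; Option is the usual functional-option shape.
type callConfig struct{ retry *Policy }
type Option func(*callConfig)

// WithRetry is the per-call variant: the policy travels with this one call only.
func WithRetry(p Policy) Option { return func(c *callConfig) { c.retry = &p } }

// Range is a hypothetical storage method showing where the option lands.
func Range(prefix string, opts ...Option) string {
	var cfg callConfig
	for _, o := range opts {
		o(&cfg)
	}
	if cfg.retry == nil {
		return fmt.Sprintf("Range(%s) without retry", prefix)
	}
	return fmt.Sprintf("Range(%s) with up to %d attempts", prefix, cfg.retry.MaxAttempts)
}

func main() {
	fmt.Println(Range("/foo"))
	fmt.Println(Range("/foo", WithRetry(Policy{MaxAttempts: 5})))
}
```

The verbosity con is visible at the call site: every aggressive-read path repeats the same `WithRetry(...)` argument.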

Design C — Connection-level only

Push retry into the connect package and the underlying pool/etcd client configuration. No new abstraction at the driver/storage layer.

cfg := connect.Config{Endpoints: ..., Retry: connect.RetryConfig{...}}

Pros:

  • Closest to where the transient errors physically occur.
  • No new package; no decorator.

Cons:

  • Tarantool pool and etcd client have different retry models — we'd end up with two divergent configs hidden under one knob, or leak the difference.
  • Doesn't uniformly cover request-level retries that cut across pool semantics (e.g., a timeout against a healthy instance).
  • Roman's TCM use case explicitly motivates request-level retries that pool failover doesn't catch — this design alone may not solve it.

Design D — Helper functions, no built-in option

Drop the stub entirely. Provide retry.Do(ctx, policy, fn) as a standalone helper and let callers wrap their own calls.

err := retry.Do(ctx, policy, func() error {
    _, err := s.Range(ctx, ...)
    return err
})

Pros:

  • Smallest API surface; easiest to ship.
  • Caller has full control, no hidden behavior.

Cons:

  • Every consumer reimplements the same wrapper around storage calls.
  • No driver-aware default classifier — caller has to know which errors are transient for which backend.
  • Doesn't deliver on the existing WithRetry option's implied promise.
