Implement retry logic for JetStream publish in clustered environments #131

@lalinsky

Description

JetStream publish operations should implement retry logic for NoResponders (503) errors to handle transient failures in clustered NATS environments, particularly during leadership elections or stream metadata propagation.

Background

According to ADR-22, when NATS Server runs JetStream in cluster mode, brief leadership changes can produce "no responders available" errors while an election is in progress. In addition, immediately after a stream is created there is a short window in which its metadata has not yet propagated to all cluster nodes.

Current Behavior

  • No retry logic implemented (see src/jetstream.zig:1578-1587)
  • NoResponders errors are immediately converted to NoStreamResponse and returned
  • Tests work around this with manual delays (e.g., tests/jetstream_test.zig:711)

Expected Behavior

Implement automatic retry logic with configurable parameters:

  • Default backoff: 250ms between retries
  • Default retry attempts: 2 (3 attempts in total, counting the initial publish)
  • Respect the overall request timeout

Reference Implementations

  • nats.go: 250ms backoff, 2 retries by default (configurable via RetryWait and RetryAttempts options)
  • nats.c: retry behavior still to be checked
  • nats.java: retry behavior still to be checked

Related
