Feature flag evaluation request has no retry on transient network errors #195

Description

@haacked

The feature flag evaluation request (PostHog::FeatureFlagsPoller#_request, used by _request_feature_flag_evaluation for POST /flags/?v=2) has no retry. On any transient network error it rescues and immediately re-raises, surfacing a hard error to the caller.

By contrast, the event-capture path (PostHog::Transport) already retries with backoff (retry_with_backoff). The flags path is the inconsistent one.

Why this matters

The flags request:

  • opens a fresh connection per request (Net::HTTP.start { ... }, no keep-alive),
  • has a 3s default timeout (feature_flag_request_timeout_seconds),
  • and does not retry.

So a single transient stall anywhere in the network path (packet loss + TCP retransmit, TLS setup jitter, an edge/proxy hiccup) that exceeds the 3s budget becomes a customer-visible Net::ReadTimeout, even when the PostHog flags service itself is healthy and responding in single-digit milliseconds.

We investigated a report of intermittent Net::ReadTimeout bursts against POST /flags/?v=2. Server-side, the flags service was healthy throughout (fast 2xx responses, no deploy or pod churn during the window), and the gateway logged the customer's traffic as 2xx. The failures left no server-side trace, which is consistent with transient loss in the client → CDN → gateway path that a retry would have absorbed.

The error message the user sees comes from this path: Unable to complete request to https://<host>/flags/?v=2.

Proposed fix

Add a bounded retry (e.g. one retry attempt with a small backoff) to _request (or just the flag evaluation call) for the transient errors that are already rescued there:

  • Net::ReadTimeout
  • Net::OpenTimeout / Timeout::Error
  • Errno::ECONNRESET
  • EOFError

Flag evaluation is side-effect-free, so retrying is safe. The existing BackoffPolicy used by Transport could be reused.
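
A minimal sketch of what the bounded retry could look like, for discussion only. The helper name with_transient_retry, its parameters, and the TRANSIENT_ERRORS constant are hypothetical and not part of posthog-ruby; a real patch would more likely reuse the existing BackoffPolicy than compute the delay inline as done here.

```ruby
require 'net/http'

# Hypothetical list of errors treated as transient (assumption, mirrors the
# list above). Timeout::Error is the parent class of both Net::OpenTimeout
# and Net::ReadTimeout, so it covers all three timeout cases.
TRANSIENT_ERRORS = [
  Timeout::Error,
  Errno::ECONNRESET,
  EOFError
].freeze

# Hypothetical helper: runs the block, retrying up to max_retries times on a
# transient error, sleeping base_delay * 2^(n-1) seconds before the n-th retry.
# The sleeper argument exists only so tests can skip the real sleep.
def with_transient_retry(max_retries: 1, base_delay: 0.1, sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    yield
  rescue *TRANSIENT_ERRORS
    # Retries exhausted: re-raise the original error to the caller.
    raise if attempt >= max_retries

    attempt += 1
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage sketch: wrap whatever performs the POST /flags/?v=2 call.
# `perform_flags_request` is a placeholder, not a real method.
#
#   response = with_transient_retry { perform_flags_request }
```

With the default of one retry, a single stalled attempt costs at most one extra timeout window plus the backoff delay, and non-transient errors (anything not in the list) propagate immediately without a retry.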

Notes

  • Still reproduces on main (no retry on _request). Recent versions did add open_timeout/write_timeout, but not retry.
  • Affected user was on 3.5.4; upgrading helps with the timeout settings but does not add retry.

Labels: enhancement, feature/flags, ruby
