The feature flag evaluation request (PostHog::FeatureFlagsPoller#_request, used by _request_feature_flag_evaluation for POST /flags/?v=2) has no retry. On any transient network error it rescues and immediately re-raises, surfacing a hard error to the caller.
By contrast, the event-capture path (PostHog::Transport) already retries with backoff (retry_with_backoff). The flags path is the inconsistent one.
Why this matters
The flags request:
- opens a fresh connection per request (Net::HTTP.start { ... }, no keep-alive),
- has a 3s default timeout (feature_flag_request_timeout_seconds),
- and does not retry.
So a single transient stall anywhere in the network path (packet loss + TCP retransmit, TLS setup jitter, an edge/proxy hiccup) that exceeds the 3s budget becomes a customer-visible Net::ReadTimeout, even when the PostHog flags service itself is healthy and responding in single-digit milliseconds.
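The failure mode can be reproduced locally without touching PostHog at all. The sketch below is not the SDK's code: one_shot_post is a hypothetical stand-in for a fresh-connection POST, and the "server" is a local listening socket that never replies, simulating a stall that outlasts the read timeout. Note that Net::HTTP's built-in max_retries only applies to idempotent HTTP methods, so a POST gets no automatic retry from the standard library either.

```ruby
require 'net/http'
require 'socket'

# Listening socket that never answers: the kernel completes the TCP
# handshake, but no HTTP response is ever written (simulated stall).
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

# Hypothetical helper: one fresh connection per call, no retry.
# Passing nil as the third argument disables any proxy from ENV.
def one_shot_post(host, port, path, read_timeout:)
  Net::HTTP.start(host, port, nil, open_timeout: 1, read_timeout: read_timeout) do |http|
    http.post(path, '{}', 'Content-Type' => 'application/json')
  end
end

begin
  one_shot_post('127.0.0.1', port, '/flags/?v=2', read_timeout: 0.2)
rescue Net::ReadTimeout => e
  # A single stall longer than the budget surfaces directly to the caller.
  puts "timed out: #{e.class}"
end
```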
We investigated a report of intermittent Net::ReadTimeout bursts against POST /flags/?v=2. Server-side the flags service was healthy throughout (fast 2xx responses, no deploy or pod churn during the window), and the gateway logged the customer's traffic as 2xx. The failures had no server-side trace, consistent with transient loss in the client → CDN → gateway path that a retry would have absorbed.
The error message the user sees comes from this path: Unable to complete request to https://<host>/flags/?v=2.
Proposed fix
Add a bounded retry (e.g. 1 retry with a small backoff) to _request (or just the flag evaluation call) for the transient errors that are already rescued there:
- Net::ReadTimeout
- Net::OpenTimeout / Timeout::Error
- Errno::ECONNRESET
- EOFError
Flag evaluation is side-effect-free, so retrying is safe. The existing BackoffPolicy used by Transport could be reused.
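A minimal sketch of what such a bounded retry could look like. Everything named here (with_bounded_retry, TRANSIENT_ERRORS, the sleeper parameter, the default delay) is hypothetical and for illustration only; it does not use the SDK's BackoffPolicy, whose interface is not shown in this issue.

```ruby
require 'net/http'

# Hypothetical list, mirroring the errors named in this issue.
TRANSIENT_ERRORS = [
  Net::OpenTimeout,
  Net::ReadTimeout,
  Timeout::Error,
  Errno::ECONNRESET,
  EOFError
].freeze

# Hypothetical helper: runs the block, retrying at most max_retries times
# on a transient error, with exponential delay between attempts.
# sleeper is injectable so the wait can be stubbed out in tests.
def with_bounded_retry(max_retries: 1, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    yield attempt
  rescue *TRANSIENT_ERRORS
    # Out of retries: re-raise the original error unchanged.
    raise if attempt >= max_retries

    attempt += 1
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

In this sketch the flag evaluation call would be wrapped as with_bounded_retry { ... }, so the first transient failure is absorbed and only a second consecutive failure reaches the caller. Errors outside the list are not retried.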
Notes
- Still reproduces on main (no retry on _request). Recent versions did add open_timeout/write_timeout, but not retry.
- Affected user was on 3.5.4; upgrading helps with the timeout settings but does not add retry.