fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M) by sanil-23 · Pull Request #1568 · tinyhumansai/openhuman

sanil-23 · 2026-05-12T15:21:34Z

Summary

Downgrade ConnectionOutcome::Failed to warn for the first 5 consecutive attempts, escalating to error only on sustained failure — kills the retry-storm Sentry noise from a single backend incident.
Reset the failure streak on ConnectionOutcome::Lost, which is only reachable after a successful SIO CONNECT ACK and therefore proves the backend is healthy.
Add an empty-token guard at the top of ws_loop: if the bare connect(url, token) RPC is ever called with "" or whitespace, return immediately rather than spin a doomed reconnect loop.

Problem

OPENHUMAN-TAURI-8M — [socket] Connection failed: WebSocket connect: HTTP error: 503 Service Unavailable. 549 events on 2026-05-11, all on a single release (openhuman@0.53.22+90e9f48f3ada), spread across macOS 26.x (407), macOS 15.x (42), and Windows (48). Not OS-specific — a backend gateway 503 incident multiplied by every running client retrying.

The throw site is src/openhuman/socket/ws_loop.rs:62:

ConnectionOutcome::Failed(reason) => {
    log::error!("[socket] Connection failed: {}", reason);
    // keep growing backoff
}

connect_async returns Err at the HTTP layer (the gateway 503'd before any SIO/auth handshake), the loop wraps it as Failed, and log::error! bridges into the sentry-tracing layer (src/core/logging.rs::sentry_tracing_layer) — every retry fires another event. The exponential backoff (1s→30s) keeps retrying forever during an outage, so a sustained 5xx period multiplies into hundreds of events per client.

Separately: the public RPC openhuman.socket_connect_with_session correctly refuses to spawn without a session token (schemas.rs:228 — "no session token stored — user must log in first"), but the bare openhuman.socket_connect(url, token) RPC trusts its caller. ws_loop itself had zero defensive checks — an empty token was sent verbatim in the SIO CONNECT payload at line 142-143, producing exactly the kind of retry-storm noise this module is otherwise designed to suppress.

Solution

1. Consecutive-failure escalation in the loop. Track consecutive_failures: u32 across iterations; increment on Failed, reset on Lost. Log at warn until FAIL_ESCALATE_THRESHOLD = 5 is reached, then escalate to error. The Lost reset is sound because run_connection only returns Lost after a successful SIO CONNECT ACK — reaching that arm proves the backend was reachable and the token was valid in this attempt cycle.

Net effect on the original 8M incident: 549 events become ~1 per affected client (when it crosses the threshold).

2. Empty-token guard. First thing the function does — if token.trim().is_empty() { … return; }. Sets status to Disconnected, clears the socket id, returns. The log::error! on this branch is intentional: an empty token reaching ws_loop indicates a programming error at the RPC caller side and should be loud, not quiet.

Behavior on the happy path is unchanged. A successful connect → SIO CONNECT ACK → Lost cycle keeps logging at warn and resets the streak as before.

Submission Checklist

N/A: behaviour-only log-level routing + a guard; no new feature surface. New test covers the empty-token guard (ws_loop_refuses_to_start_with_empty_token). The consecutive-failure logic is internal log-level routing and is covered by inspection on the small diff.
N/A: behaviour-only change; the only new code paths are a guard branch (covered by the new test) and an if over a counter (no new branches outside the logging call).
N/A: behaviour-only change — no new/removed/renamed features.
N/A: no feature surface changes.
N/A: no new external network dependencies — pure log-level routing + guard.
N/A: no release-cut surface changes — socket reconnect is internal observability.
N/A: no GitHub issue to close — driven by Sentry OPENHUMAN-TAURI-8M.

Impact

Runtime: desktop only. No protocol or behavior change on the happy path. Socket reconnect cadence (1s→30s exponential backoff) is identical; only the log level of the Failed arm is reclassified for sub-threshold retries.
Observability: ~99% reduction in Sentry events from this issue under transient backend incidents. Sustained outages still surface exactly once per client once the threshold is crossed. Reviewers may want to tune FAIL_ESCALATE_THRESHOLD (currently 5) — at 1s initial backoff with exponential growth, 5 attempts ≈ 31 seconds of sustained failure before Sentry pages.
Security: empty-token guard prevents a class of pre-auth retry-storm bugs at the bare-RPC level.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/sentry-8m-ws-noise-and-empty-token-guard
Commit SHA: 2494dc7

Validation Run

N/A: no frontend changes in this PR.
N/A: no frontend changes in this PR.
Focused tests: cargo test --manifest-path Cargo.toml --lib openhuman::socket::ws_loop — 18 passed, 0 failed (includes new ws_loop_refuses_to_start_with_empty_token).
Rust fmt/check: cargo check --manifest-path Cargo.toml clean; cargo fmt applied to changed files.
N/A: no Tauri shell changes.

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: ConnectionOutcome::Failed logs at warn for the first 5 consecutive attempts before escalating to error; empty session token aborts ws_loop immediately instead of spinning.
User-visible effect: none. Sentry events from transient gateway 5xx during a single outage drop ~99%.

Parity Contract

Legacy behavior preserved: happy-path connect/lost/shutdown sequences unchanged; reconnect cadence unchanged; SIO/EIO handshake unchanged.
Guard/fallback/dispatch parity checks: existing ws_loop_* tests (5 e2e + 13 unit) still pass; new test covers the new guard branch.

Duplicate / Superseded PR Handling

Duplicate PR(s): none
Canonical PR: this one
Resolution: N/A

Summary by CodeRabbit

Release Notes

Bug Fixes
- WebSocket connections now validate authentication tokens at startup and disconnect if invalid
- Improved connection failure tracking with escalated error reporting once threshold is reached
- Connection loss properly resets failure counters and backoff intervals
Tests
- Added test coverage for WebSocket initialization with invalid tokens

…PENHUMAN-TAURI-8M) The ws_loop reconnect loop always logged `ConnectionOutcome::Failed` at `log::error!` level, which the tracing-sentry layer captures as a Sentry event. A single backend gateway 503 incident on 2026-05-11 produced 549 events on OPENHUMAN-TAURI-8M — one per retry attempt across every running client — because the exponential-backoff loop (1s→30s) keeps retrying forever during an outage and each attempt fires another event. Two complementary fixes: 1. **Consecutive-failure escalation**: `Failed` outcomes now log at `warn` for the first N (= 5) consecutive attempts, escalating to `error` only on sustained failure. The streak resets on `ConnectionOutcome::Lost`, which is only reachable after a successful SIO CONNECT ACK and therefore proves the backend is healthy and the token is valid. Net effect: a brief 5xx blip stays out of Sentry; a real sustained outage still pages exactly once per affected client. 2. **Empty-token guard**: the public `connect_with_session` RPC already refuses to spawn without a session token, but the bare `connect(url, token)` RPC trusts its caller. Treating an empty token as a programming error and returning immediately (instead of spinning a doomed reconnect cycle) prevents a class of pre-auth retry-storm bugs from reaching production. Both changes are scoped to `src/openhuman/socket/ws_loop.rs`; behavior on the happy path is unchanged. New test covers the empty-token guard; the consecutive-failure logic is internal log-level routing and stays covered by inspection on the small diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-12T15:21:50Z

📝 Walkthrough

Walkthrough

This PR strengthens WebSocket connection resilience by adding configurable failure escalation. The loop now validates session tokens upfront and refuses to start with empty credentials. Consecutive reconnection failures are tracked and logged at escalating severity, switching from warn to error once a threshold is crossed, replacing the previous uniform error reporting.

Changes

WebSocket Connection Resilience

Layer / File(s)	Summary
Failure escalation threshold constant `src/openhuman/socket/ws_loop.rs`	Introduced `FAIL_ESCALATE_THRESHOLD` constant and documentation defining when consecutive failed reconnection attempts trigger error-level logging.
Empty token validation and failure tracking initialization `src/openhuman/socket/ws_loop.rs`, `src/openhuman/socket/ws_loop_tests.rs`	Added early guard that rejects empty tokens by setting state to `Disconnected`, clearing socket id, and returning immediately; initialized per-loop `consecutive_failures` counter; new test verifies the loop exits cleanly when token is empty or whitespace-only.
Reconnection outcome handling with escalating logging `src/openhuman/socket/ws_loop.rs`	Updated `ConnectionOutcome::Lost` to reset both failure counter and backoff; upgraded `ConnectionOutcome::Failed` to increment counter (saturating) and switch logging from warn to error once threshold is met.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hopped through tangled webs, 🐰
Where tokens fail and loops upset,
But now with thresholds, guards, and care,
The WebSocket breathes fresher air! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: reducing Sentry noise from retry storms and adding an empty-token guard, matching the primary modifications in ws_loop.rs.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/socket/ws_loop_tests.rs`:
- Around line 419-423: The test currently only asserts the timeout result
(`res.is_ok()`), which allows a panicking task to slip by as `Ok(Err(_))`;
update the assertion to also verify the task joined successfully (i.e., the
JoinHandle returned Ok) and that `ws_loop` completed without error. Concretely,
after awaiting `tokio::time::timeout(..., handle)`, unwrap the timeout result
and then assert the join result is Ok (or pattern-match to `Ok(Ok(()))`) so you
catch both timeouts and task panics; reference the `handle` variable and the
`ws_loop` invocation when locating the test code to change.

In `@src/openhuman/socket/ws_loop.rs`:
- Around line 96-109: The current escalation logs an error every retry once
consecutive_failures >= FAIL_ESCALATE_THRESHOLD, causing repeated error events;
change the logic to escalate only once when the threshold is first reached (e.g.
check consecutive_failures == FAIL_ESCALATE_THRESHOLD) or add an escalated flag
to track that an error has already been emitted, so subsequent retries log at
warn level; update the block around consecutive_failures,
FAIL_ESCALATE_THRESHOLD, and the log::error!/log::warn! calls (and reset the
flag when the connection succeeds) to ensure a single error event per sustained
outage.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1567f28c-0288-4573-a02e-476228c5a2ac

📥 Commits

Reviewing files that changed from the base of the PR and between 15c7442 and 2494dc7.

📒 Files selected for processing (2)

src/openhuman/socket/ws_loop.rs
src/openhuman/socket/ws_loop_tests.rs

coderabbitai · 2026-05-12T15:25:21Z

+    let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;
+    assert!(
+        res.is_ok(),
+        "ws_loop must return immediately on empty token, not spin"
+    );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen the join assertion to catch task panics

At Line [421], res.is_ok() only verifies no timeout. It still passes if ws_loop panics (Ok(Err(_))).

Suggested adjustment

- assert!( - res.is_ok(), - "ws_loop must return immediately on empty token, not spin" - ); + assert!( + matches!(res, Ok(Ok(()))), + "ws_loop must return cleanly on empty token (no timeout, no panic)" + );

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;

assert!(

res.is_ok(),

"ws_loop must return immediately on empty token, not spin"

);

let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;

assert!(

matches!(res, Ok(Ok(()))),

"ws_loop must return cleanly on empty token (no timeout, no panic)"

);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/socket/ws_loop_tests.rs` around lines 419 - 423, The test currently only asserts the timeout result (`res.is_ok()`), which allows a panicking task to slip by as `Ok(Err(_))`; update the assertion to also verify the task joined successfully (i.e., the JoinHandle returned Ok) and that `ws_loop` completed without error. Concretely, after awaiting `tokio::time::timeout(..., handle)`, unwrap the timeout result and then assert the join result is Ok (or pattern-match to `Ok(Ok(()))`) so you catch both timeouts and task panics; reference the `handle` variable and the `ws_loop` invocation when locating the test code to change.

coderabbitai · 2026-05-12T15:25:21Z

+                consecutive_failures = consecutive_failures.saturating_add(1);
+                if consecutive_failures >= FAIL_ESCALATE_THRESHOLD {
+                    log::error!(
+                        "[socket] Connection failed {}x (sustained): {}",
+                        consecutive_failures,
+                        reason
+                    );
+                } else {
+                    log::warn!(
+                        "[socket] Connection failed (attempt {}): {}",
+                        consecutive_failures,
+                        reason
+                    );
+                }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Error escalation still allows unbounded error events after threshold

At Line [97], using >= FAIL_ESCALATE_THRESHOLD logs error on every retry after threshold. That still creates repeated Sentry events during sustained outages, especially since the message changes with attempt count.

Suggested adjustment

- if consecutive_failures >= FAIL_ESCALATE_THRESHOLD { + if consecutive_failures == FAIL_ESCALATE_THRESHOLD { log::error!( - "[socket] Connection failed {}x (sustained): {}", - consecutive_failures, + "[socket] Connection failed (sustained outage): {}", reason ); } else { log::warn!( "[socket] Connection failed (attempt {}): {}",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

consecutive_failures = consecutive_failures.saturating_add(1);

if consecutive_failures >= FAIL_ESCALATE_THRESHOLD {

log::error!(

"[socket] Connection failed {}x (sustained): {}",

consecutive_failures,

reason

);

} else {

log::warn!(

"[socket] Connection failed (attempt {}): {}",

consecutive_failures,

reason

);

}

consecutive_failures = consecutive_failures.saturating_add(1);

if consecutive_failures == FAIL_ESCALATE_THRESHOLD {

log::error!(

"[socket] Connection failed (sustained outage): {}",

reason

);

} else {

log::warn!(

"[socket] Connection failed (attempt {}): {}",

consecutive_failures,

reason

);

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/socket/ws_loop.rs` around lines 96 - 109, The current escalation logs an error every retry once consecutive_failures >= FAIL_ESCALATE_THRESHOLD, causing repeated error events; change the logic to escalate only once when the threshold is first reached (e.g. check consecutive_failures == FAIL_ESCALATE_THRESHOLD) or add an escalated flag to track that an error has already been emitted, so subsequent retries log at warn level; update the block around consecutive_failures, FAIL_ESCALATE_THRESHOLD, and the log::error!/log::warn! calls (and reset the flag when the connection succeeds) to ensure a single error event per sustained outage.

graycyrus

Review — beyond CodeRabbit

Core logic is sound. The Sentry noise fix genuinely addresses OPENHUMAN-TAURI-8M and the empty-token guard is clean. A few items below — two major, one minor, two nitpicks.

Design question (for the author): The >= vs == threshold check (already flagged by CodeRabbit) is actually a design ambiguity — the doc-comment says "sustained outages still surface as error" (implies >=), but the ticket says "at most one event per affected client" (implies ==). Which is intended? The answer determines whether CodeRabbit's == suggestion is correct.

Pre-existing (outside the diff): run_connection line 143 has a bare .unwrap() on serde_json::to_string. Can't panic on a json!({}) literal, but worth noting for a future cleanup pass.

graycyrus · 2026-05-12T18:44:05Z

+/// status must end up `Disconnected` and the function must return without
+/// ever opening a socket.
+#[tokio::test]
+async fn ws_loop_refuses_to_start_with_empty_token() {


[major] _emit_tx keeps the sender alive for the full scope. If the early-exit guard is ever removed or bypassed in a future change, emit_rx.recv() in the main loop never returns None, and the test silently hangs for 2s instead of failing clearly.

Suggest dropping the sender after spawning, matching the existing pattern in ws_loop_completes_handshake_and_shuts_down_cleanly:

// before let (_emit_tx, emit_rx) = mpsc::unbounded_channel::<String>(); // after let (emit_tx, emit_rx) = mpsc::unbounded_channel::<String>(); drop(emit_tx); // so ws_loop exits via Shutdown if guard is bypassed

graycyrus · 2026-05-12T18:44:05Z

+                        consecutive_failures,
+                        reason
+                    );
+                } else {


[major] The warn branch shows the attempt number but not how close it is to escalating. Operators watching logs during a transient outage can't tell if they're 2/5 or 4/5 from Sentry escalation.

Suggest including the threshold:

log::warn!( "[socket] Connection failed (attempt {}/{}): {}", consecutive_failures, FAIL_ESCALATE_THRESHOLD, reason );

graycyrus · 2026-05-12T18:44:05Z

                break;
            }
            ConnectionOutcome::Lost(reason) => {
+                // `Lost` is only returned after a successful SIO CONNECT ACK


[minor] After a retry storm logged at warn, the consecutive_failures = 0 reset is silent. A log::debug! here noting the streak cleared would satisfy the project's "log state transitions" requirement and make it visible that recovery happened:

consecutive_failures = 0; log::debug!("[socket] Connection re-established; resetting failure streak");

graycyrus · 2026-05-12T18:44:05Z

 }
+
+/// Empty-token guard: if a caller invokes the bare `connect(url, token)` RPC
+/// with an empty string the loop must bail immediately rather than spin a


[nitpick] Doc-comment says "without ever opening a socket" but there's no mock server to verify zero connection attempts. The test only proves the function returned quickly and set the right state. Either soften to "without completing the reconnect loop" or add a TcpListener bound to a random port and assert zero connections.

graycyrus · 2026-05-12T18:44:05Z

+    // gateway, producing exactly the kind of retry-storm noise this module
+    // is otherwise designed to suppress.
+    if token.trim().is_empty() {
+        log::error!("[socket] ws_loop: refusing to start — empty session token");


[nitpick] The guard's cleanup (Disconnected + socket_id = None + emit_state_change) is identical to the tail of the function (lines 86–88 in the base). Harmless, but a one-line comment explaining why it doesn't just fall through to the normal exit path would help future readers understand the intent:

// Replicate the tail cleanup here because we return early, // bypassing the reconnect loop that normally handles teardown.

sanil-23 requested a review from a team May 12, 2026 15:21

coderabbitai Bot requested changes May 12, 2026

View reviewed changes

graycyrus reviewed May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M)#1568

fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M)#1568
sanil-23 wants to merge 1 commit into
tinyhumansai:mainfrom
sanil-23:fix/sentry-8m-ws-noise-and-empty-token-guard

sanil-23 commented May 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 12, 2026

Uh oh!

coderabbitai Bot May 12, 2026

Uh oh!

graycyrus left a comment

Uh oh!

graycyrus May 12, 2026

Uh oh!

graycyrus May 12, 2026

Uh oh!

graycyrus May 12, 2026

Uh oh!

graycyrus May 12, 2026

Uh oh!

graycyrus May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sanil-23 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Review — beyond CodeRabbit

Uh oh!

graycyrus May 12, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus May 12, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus May 12, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus May 12, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sanil-23 commented May 12, 2026 •

edited

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading