Skip to content

fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M)#1568

Open
sanil-23 wants to merge 1 commit into
tinyhumansai:mainfrom
sanil-23:fix/sentry-8m-ws-noise-and-empty-token-guard
Open

fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M)#1568
sanil-23 wants to merge 1 commit into
tinyhumansai:mainfrom
sanil-23:fix/sentry-8m-ws-noise-and-empty-token-guard

Conversation

@sanil-23
Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 commented May 12, 2026

Summary

  • Downgrade ConnectionOutcome::Failed to warn for the first 5 consecutive attempts, escalating to error only on sustained failure — kills the retry-storm Sentry noise from a single backend incident.
  • Reset the failure streak on ConnectionOutcome::Lost, which is only reachable after a successful SIO CONNECT ACK and therefore proves the backend is healthy.
  • Add an empty-token guard at the top of ws_loop: if the bare connect(url, token) RPC is ever called with "" or whitespace, return immediately rather than spin a doomed reconnect loop.

Problem

OPENHUMAN-TAURI-8M[socket] Connection failed: WebSocket connect: HTTP error: 503 Service Unavailable. 549 events on 2026-05-11, all on a single release (openhuman@0.53.22+90e9f48f3ada), spread across macOS 26.x (407), macOS 15.x (42), and Windows (48). Not OS-specific — a backend gateway 503 incident multiplied by every running client retrying.

The throw site is src/openhuman/socket/ws_loop.rs:62:

ConnectionOutcome::Failed(reason) => {
    log::error!("[socket] Connection failed: {}", reason);
    // keep growing backoff
}

connect_async returns Err at the HTTP layer (the gateway 503'd before any SIO/auth handshake), the loop wraps it as Failed, and log::error! bridges into the sentry-tracing layer (src/core/logging.rs::sentry_tracing_layer) — every retry fires another event. The exponential backoff (1s→30s) keeps retrying forever during an outage, so a sustained 5xx period multiplies into hundreds of events per client.

Separately: the public RPC openhuman.socket_connect_with_session correctly refuses to spawn without a session token (schemas.rs:228"no session token stored — user must log in first"), but the bare openhuman.socket_connect(url, token) RPC trusts its caller. ws_loop itself had zero defensive checks — an empty token was sent verbatim in the SIO CONNECT payload at line 142-143, producing exactly the kind of retry-storm noise this module is otherwise designed to suppress.

Solution

1. Consecutive-failure escalation in the loop. Track consecutive_failures: u32 across iterations; increment on Failed, reset on Lost. Log at warn until FAIL_ESCALATE_THRESHOLD = 5 is reached, then escalate to error. The Lost reset is sound because run_connection only returns Lost after a successful SIO CONNECT ACK — reaching that arm proves the backend was reachable and the token was valid in this attempt cycle.

Net effect on the original 8M incident: 549 events become ~1 per affected client (when it crosses the threshold).

2. Empty-token guard. First thing the function does — if token.trim().is_empty() { … return; }. Sets status to Disconnected, clears the socket id, returns. The log::error! on this branch is intentional: an empty token reaching ws_loop indicates a programming error at the RPC caller side and should be loud, not quiet.

Behavior on the happy path is unchanged. A successful connect → SIO CONNECT ACK → Lost cycle keeps logging at warn and resets the streak as before.

Submission Checklist

  • N/A: behaviour-only log-level routing + a guard; no new feature surface. New test covers the empty-token guard (ws_loop_refuses_to_start_with_empty_token). The consecutive-failure logic is internal log-level routing and is covered by inspection on the small diff.
  • N/A: behaviour-only change; the only new code paths are a guard branch (covered by the new test) and an if over a counter (no new branches outside the logging call).
  • N/A: behaviour-only change — no new/removed/renamed features.
  • N/A: no feature surface changes.
  • N/A: no new external network dependencies — pure log-level routing + guard.
  • N/A: no release-cut surface changes — socket reconnect is internal observability.
  • N/A: no GitHub issue to close — driven by Sentry OPENHUMAN-TAURI-8M.

Impact

  • Runtime: desktop only. No protocol or behavior change on the happy path. Socket reconnect cadence (1s→30s exponential backoff) is identical; only the log level of the Failed arm is reclassified for sub-threshold retries.
  • Observability: ~99% reduction in Sentry events from this issue under transient backend incidents. Sustained outages still surface exactly once per client once the threshold is crossed. Reviewers may want to tune FAIL_ESCALATE_THRESHOLD (currently 5) — at 1s initial backoff with exponential growth, 5 attempts ≈ 31 seconds of sustained failure before Sentry pages.
  • Security: empty-token guard prevents a class of pre-auth retry-storm bugs at the bare-RPC level.

Related

  • Closes: N/A — no tracked GitHub issue; resolves Sentry OPENHUMAN-TAURI-8M.
  • Follow-up PR(s)/TODOs: none.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/sentry-8m-ws-noise-and-empty-token-guard
  • Commit SHA: 2494dc7

Validation Run

  • N/A: no frontend changes in this PR.
  • N/A: no frontend changes in this PR.
  • Focused tests: cargo test --manifest-path Cargo.toml --lib openhuman::socket::ws_loop — 18 passed, 0 failed (includes new ws_loop_refuses_to_start_with_empty_token).
  • Rust fmt/check: cargo check --manifest-path Cargo.toml clean; cargo fmt applied to changed files.
  • N/A: no Tauri shell changes.

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: ConnectionOutcome::Failed logs at warn for the first 5 consecutive attempts before escalating to error; empty session token aborts ws_loop immediately instead of spinning.
  • User-visible effect: none. Sentry events from transient gateway 5xx during a single outage drop ~99%.

Parity Contract

  • Legacy behavior preserved: happy-path connect/lost/shutdown sequences unchanged; reconnect cadence unchanged; SIO/EIO handshake unchanged.
  • Guard/fallback/dispatch parity checks: existing ws_loop_* tests (5 e2e + 13 unit) still pass; new test covers the new guard branch.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this one
  • Resolution: N/A

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • WebSocket connections now validate authentication tokens at startup and disconnect if invalid
    • Improved connection failure tracking with escalated error reporting once threshold is reached
    • Connection loss properly resets failure counters and backoff intervals
  • Tests

    • Added test coverage for WebSocket initialization with invalid tokens

Review Change Stack

…PENHUMAN-TAURI-8M)

The ws_loop reconnect loop always logged `ConnectionOutcome::Failed` at
`log::error!` level, which the tracing-sentry layer captures as a Sentry
event. A single backend gateway 503 incident on 2026-05-11 produced 549
events on OPENHUMAN-TAURI-8M — one per retry attempt across every
running client — because the exponential-backoff loop (1s→30s) keeps
retrying forever during an outage and each attempt fires another event.

Two complementary fixes:

1. **Consecutive-failure escalation**: `Failed` outcomes now log at
   `warn` for the first N (= 5) consecutive attempts, escalating to
   `error` only on sustained failure. The streak resets on
   `ConnectionOutcome::Lost`, which is only reachable after a successful
   SIO CONNECT ACK and therefore proves the backend is healthy and the
   token is valid. Net effect: a brief 5xx blip stays out of Sentry; a
   real sustained outage still pages exactly once per affected client.

2. **Empty-token guard**: the public `connect_with_session` RPC already
   refuses to spawn without a session token, but the bare
   `connect(url, token)` RPC trusts its caller. Treating an empty token
   as a programming error and returning immediately (instead of spinning
   a doomed reconnect cycle) prevents a class of pre-auth retry-storm
   bugs from reaching production.

Both changes are scoped to `src/openhuman/socket/ws_loop.rs`; behavior
on the happy path is unchanged. New test covers the empty-token guard;
the consecutive-failure logic is internal log-level routing and stays
covered by inspection on the small diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team May 12, 2026 15:21
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

This PR strengthens WebSocket connection resilience by adding configurable failure escalation. The loop now validates session tokens upfront and refuses to start with empty credentials. Consecutive reconnection failures are tracked and logged at escalating severity, switching from warn to error once a threshold is crossed, replacing the previous uniform error reporting.

Changes

WebSocket Connection Resilience

Layer / File(s) Summary
Failure escalation threshold constant
src/openhuman/socket/ws_loop.rs
Introduced FAIL_ESCALATE_THRESHOLD constant and documentation defining when consecutive failed reconnection attempts trigger error-level logging.
Empty token validation and failure tracking initialization
src/openhuman/socket/ws_loop.rs, src/openhuman/socket/ws_loop_tests.rs
Added early guard that rejects empty tokens by setting state to Disconnected, clearing socket id, and returning immediately; initialized per-loop consecutive_failures counter; new test verifies the loop exits cleanly when token is empty or whitespace-only.
Reconnection outcome handling with escalating logging
src/openhuman/socket/ws_loop.rs
Updated ConnectionOutcome::Lost to reset both failure counter and backoff; upgraded ConnectionOutcome::Failed to increment counter (saturating) and switch logging from warn to error once threshold is met.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hopped through tangled webs, 🐰
Where tokens fail and loops upset,
But now with thresholds, guards, and care,
The WebSocket breathes fresher air! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main changes: reducing Sentry noise from retry storms and adding an empty-token guard, matching the primary modifications in ws_loop.rs.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/socket/ws_loop_tests.rs`:
- Around line 419-423: The test currently only asserts the timeout result
(`res.is_ok()`), which allows a panicking task to slip by as `Ok(Err(_))`;
update the assertion to also verify the task joined successfully (i.e., the
JoinHandle returned Ok) and that `ws_loop` completed without error. Concretely,
after awaiting `tokio::time::timeout(..., handle)`, unwrap the timeout result
and then assert the join result is Ok (or pattern-match to `Ok(Ok(()))`) so you
catch both timeouts and task panics; reference the `handle` variable and the
`ws_loop` invocation when locating the test code to change.

In `@src/openhuman/socket/ws_loop.rs`:
- Around line 96-109: The current escalation logs an error every retry once
consecutive_failures >= FAIL_ESCALATE_THRESHOLD, causing repeated error events;
change the logic to escalate only once when the threshold is first reached (e.g.
check consecutive_failures == FAIL_ESCALATE_THRESHOLD) or add an escalated flag
to track that an error has already been emitted, so subsequent retries log at
warn level; update the block around consecutive_failures,
FAIL_ESCALATE_THRESHOLD, and the log::error!/log::warn! calls (and reset the
flag when the connection succeeds) to ensure a single error event per sustained
outage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1567f28c-0288-4573-a02e-476228c5a2ac

📥 Commits

Reviewing files that changed from the base of the PR and between 15c7442 and 2494dc7.

📒 Files selected for processing (2)
  • src/openhuman/socket/ws_loop.rs
  • src/openhuman/socket/ws_loop_tests.rs

Comment on lines +419 to +423
let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;
assert!(
res.is_ok(),
"ws_loop must return immediately on empty token, not spin"
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen the join assertion to catch task panics

At Line [421], res.is_ok() only verifies no timeout. It still passes if ws_loop panics (Ok(Err(_))).

Suggested adjustment
-    assert!(
-        res.is_ok(),
-        "ws_loop must return immediately on empty token, not spin"
-    );
+    assert!(
+        matches!(res, Ok(Ok(()))),
+        "ws_loop must return cleanly on empty token (no timeout, no panic)"
+    );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;
assert!(
res.is_ok(),
"ws_loop must return immediately on empty token, not spin"
);
let res = tokio::time::timeout(tokio::time::Duration::from_secs(2), handle).await;
assert!(
matches!(res, Ok(Ok(()))),
"ws_loop must return cleanly on empty token (no timeout, no panic)"
);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/socket/ws_loop_tests.rs` around lines 419 - 423, The test
currently only asserts the timeout result (`res.is_ok()`), which allows a
panicking task to slip by as `Ok(Err(_))`; update the assertion to also verify
the task joined successfully (i.e., the JoinHandle returned Ok) and that
`ws_loop` completed without error. Concretely, after awaiting
`tokio::time::timeout(..., handle)`, unwrap the timeout result and then assert
the join result is Ok (or pattern-match to `Ok(Ok(()))`) so you catch both
timeouts and task panics; reference the `handle` variable and the `ws_loop`
invocation when locating the test code to change.

Comment on lines +96 to +109
consecutive_failures = consecutive_failures.saturating_add(1);
if consecutive_failures >= FAIL_ESCALATE_THRESHOLD {
log::error!(
"[socket] Connection failed {}x (sustained): {}",
consecutive_failures,
reason
);
} else {
log::warn!(
"[socket] Connection failed (attempt {}): {}",
consecutive_failures,
reason
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Error escalation still allows unbounded error events after threshold

At Line [97], using >= FAIL_ESCALATE_THRESHOLD logs error on every retry after threshold. That still creates repeated Sentry events during sustained outages, especially since the message changes with attempt count.

Suggested adjustment
-                if consecutive_failures >= FAIL_ESCALATE_THRESHOLD {
+                if consecutive_failures == FAIL_ESCALATE_THRESHOLD {
                     log::error!(
-                        "[socket] Connection failed {}x (sustained): {}",
-                        consecutive_failures,
+                        "[socket] Connection failed (sustained outage): {}",
                         reason
                     );
                 } else {
                     log::warn!(
                         "[socket] Connection failed (attempt {}): {}",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
consecutive_failures = consecutive_failures.saturating_add(1);
if consecutive_failures >= FAIL_ESCALATE_THRESHOLD {
log::error!(
"[socket] Connection failed {}x (sustained): {}",
consecutive_failures,
reason
);
} else {
log::warn!(
"[socket] Connection failed (attempt {}): {}",
consecutive_failures,
reason
);
}
consecutive_failures = consecutive_failures.saturating_add(1);
if consecutive_failures == FAIL_ESCALATE_THRESHOLD {
log::error!(
"[socket] Connection failed (sustained outage): {}",
reason
);
} else {
log::warn!(
"[socket] Connection failed (attempt {}): {}",
consecutive_failures,
reason
);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/socket/ws_loop.rs` around lines 96 - 109, The current
escalation logs an error every retry once consecutive_failures >=
FAIL_ESCALATE_THRESHOLD, causing repeated error events; change the logic to
escalate only once when the threshold is first reached (e.g. check
consecutive_failures == FAIL_ESCALATE_THRESHOLD) or add an escalated flag to
track that an error has already been emitted, so subsequent retries log at warn
level; update the block around consecutive_failures, FAIL_ESCALATE_THRESHOLD,
and the log::error!/log::warn! calls (and reset the flag when the connection
succeeds) to ensure a single error event per sustained outage.

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — beyond CodeRabbit

Core logic is sound. The Sentry noise fix genuinely addresses OPENHUMAN-TAURI-8M and the empty-token guard is clean. A few items below — two major, one minor, two nitpicks.

Design question (for the author): The >= vs == threshold check (already flagged by CodeRabbit) is actually a design ambiguity — the doc-comment says "sustained outages still surface as error" (implies >=), but the ticket says "at most one event per affected client" (implies ==). Which is intended? The answer determines whether CodeRabbit's == suggestion is correct.

Pre-existing (outside the diff): run_connection line 143 has a bare .unwrap() on serde_json::to_string. Can't panic on a json!({}) literal, but worth noting for a future cleanup pass.

/// status must end up `Disconnected` and the function must return without
/// ever opening a socket.
#[tokio::test]
async fn ws_loop_refuses_to_start_with_empty_token() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] _emit_tx keeps the sender alive for the full scope. If the early-exit guard is ever removed or bypassed in a future change, emit_rx.recv() in the main loop never returns None, and the test silently hangs for 2s instead of failing clearly.

Suggest dropping the sender after spawning, matching the existing pattern in ws_loop_completes_handshake_and_shuts_down_cleanly:

// before
let (_emit_tx, emit_rx) = mpsc::unbounded_channel::<String>();

// after
let (emit_tx, emit_rx) = mpsc::unbounded_channel::<String>();
drop(emit_tx); // so ws_loop exits via Shutdown if guard is bypassed

consecutive_failures,
reason
);
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The warn branch shows the attempt number but not how close it is to escalating. Operators watching logs during a transient outage can't tell if they're 2/5 or 4/5 from Sentry escalation.

Suggest including the threshold:

log::warn!(
    "[socket] Connection failed (attempt {}/{}): {}",
    consecutive_failures,
    FAIL_ESCALATE_THRESHOLD,
    reason
);

break;
}
ConnectionOutcome::Lost(reason) => {
// `Lost` is only returned after a successful SIO CONNECT ACK
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] After a retry storm logged at warn, the consecutive_failures = 0 reset is silent. A log::debug! here noting the streak cleared would satisfy the project's "log state transitions" requirement and make it visible that recovery happened:

consecutive_failures = 0;
log::debug!("[socket] Connection re-established; resetting failure streak");

}

/// Empty-token guard: if a caller invokes the bare `connect(url, token)` RPC
/// with an empty string the loop must bail immediately rather than spin a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Doc-comment says "without ever opening a socket" but there's no mock server to verify zero connection attempts. The test only proves the function returned quickly and set the right state. Either soften to "without completing the reconnect loop" or add a TcpListener bound to a random port and assert zero connections.

// gateway, producing exactly the kind of retry-storm noise this module
// is otherwise designed to suppress.
if token.trim().is_empty() {
log::error!("[socket] ws_loop: refusing to start — empty session token");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The guard's cleanup (Disconnected + socket_id = None + emit_state_change) is identical to the tail of the function (lines 86–88 in the base). Harmless, but a one-line comment explaining why it doesn't just fall through to the normal exit path would help future readers understand the intent:

// Replicate the tail cleanup here because we return early,
// bypassing the reconnect loop that normally handles teardown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants