Skip to content

Add try-catch in _replicate() to log exceptions instead of crashing silently#2501

Closed
MelvinBot wants to merge 1 commit intomainfrom
claude-replicateThreadExceptionHandling
Closed

Add try-catch in _replicate() to log exceptions instead of crashing silently#2501
MelvinBot wants to merge 1 commit intomainfrom
claude-replicateThreadExceptionHandling

Conversation

@MelvinBot
Copy link

Explanation of Change

The _replicate() thread in SQLiteNode.cpp had no exception handling. Any STHROW (from SQLITE_BUSY_SNAPSHOT, hash mismatch, commit count mismatch, etc.) would kill the thread silently. The network thread would continue queuing messages into the unbounded _replicateQueue, but nothing would process them — causing the follower to fall further behind with no recovery and no diagnostic logging.

This wraps the message processing block in a try-catch that:

  1. Logs SALERT with full diagnostic context (message type, error, commit count, peer name)
  2. Rolls back any in-progress transaction
  3. Exits the thread cleanly via return

This is a diagnostic-only change to confirm whether uncaught exceptions in the replicate thread are the root cause of follower replication stalls (as seen with virt1.sjc falling ~390K commits behind).

The behavior change: instead of the thread crashing silently from an unhandled exception, it now logs the cause at SALERT level and exits gracefully. The thread still exits — no recovery is attempted yet. Future work can add recovery (e.g., transitioning to SEARCHING state) once we confirm this is the root cause.

Fixed Issues

$ https://github.com/Expensify/Expensify/issues/597997

Tests

  • Verified the change follows the existing catch (const SException& e) pattern used throughout SQLiteNode.cpp (lines 1503, 1578, 1642, 1714, 1750)
  • The change is purely additive (wraps existing code in try-catch) with no logic changes to the happy path
  • Compilation will be validated by CI

Web QA

No web QA needed — this is a Bedrock infrastructure change that adds logging to the replication thread. The functional behavior (thread exits on error) is unchanged; the only difference is that the error is now logged via SALERT before exiting instead of crashing silently.

Mobile QA

No mobile QA needed.

…ilently

The _replicate() thread had no exception handling, so any STHROW (from
SQLITE_BUSY_SNAPSHOT, hash mismatch, commit count mismatch, etc.) would
kill the thread silently. The network thread would keep queuing messages
into the unbounded _replicateQueue, causing the follower to fall further
behind with no recovery and no diagnostic logging.

This wraps the message processing in a try-catch that logs SALERT with
full diagnostic context (message type, error, commit count, peer name),
rolls back any in-progress transaction, and exits the thread cleanly.

This is a diagnostic change to confirm whether uncaught exceptions in
the replicate thread are the root cause of follower replication stalls
(as seen with virt1.sjc falling ~390K commits behind).

Co-authored-by: Robert Chen <robertjchen@users.noreply.github.com>
@robertjchen
Copy link
Contributor

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@robertjchen robertjchen closed this Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants