Skip to content

Recover chat sends after fetch failures#308

Merged
lewtun merged 4 commits into
mainfrom
codex/recover-chat-fetch-failures
Jun 8, 2026
Merged

Recover chat sends after fetch failures#308
lewtun merged 4 commits into
mainfrom
codex/recover-chat-fetch-failures

Conversation

@lewtun
Copy link
Copy Markdown
Member

@lewtun lewtun commented Jun 8, 2026

Summary

  • recover chat POST fetch failures by probing session state and reconnecting to live SSE when backend accepted the turn
  • hydrate backend messages after completed turns so transient browser/network failures do not surface as hard message failures
  • keep aborts and HTTP/provider errors on their existing paths

Checks

  • npm run build
  • npm run lint (passes with existing warnings in main.tsx and logger.ts)
  • uv run ruff check .
  • uv run ruff format --check .

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

Claude finished @lewtun's task in 3m 26s —— View job


Review: Recover chat sends after fetch failures

1 P0, 1 P1 — the recovery flow is well-structured (probe → reconnect-live → hydrate → finish), aborts and HTTP/provider errors correctly keep their existing paths, and the optimistic-message dedup via setMessages + finish-only stream avoids duplicate turns on the success path. One blocking issue: the failure path drops the user's just-typed message.


P0 — Unrecoverable failure silently drops the user's message from the UI

frontend/src/hooks/useAgentChat.ts:397-400

onRecoverMessages overwrites the chat message list with backend-only state unconditionally, before it knows whether recovery succeeded:

const setMsgs = chatActionsRef.current.setMessages;
if (setMsgs && uiMsgs.length > 0) {
  setMsgs(uiMsgs);          // <-- drops the optimistic user message
  saveMessages(sessionId, uiMsgs);
}

In the genuine-failure case (POST never reached the backend → is_processing false, backend has no record of the new message), llmMessagesToUIMessages rebuilds the list purely from backend rows (convert-llm-messages.ts:33; the existingUIMessages arg is only used for ID reuse, not merging). So uiMsgs excludes the message the user just typed. setMessages(uiMsgs) then evicts it from the SDK state, and recoverFailedSend immediately throws "…Please retry." (sse-chat-transport.ts:465). Result: the user sees an error toast but their composed text is gone from the message list, with nothing to retry from.

This diverges from the established guard the repo already uses for exactly this overwrite — the 3s poller at useAgentChat.ts:788-791:

const currentCount = chatActionsRef.current.messages.length;
if (msgs.length > currentCount || currentCount === 0) {
  chat.setMessages(msgs);   // only overwrite when backend has >= local
  saveMessages(sessionId, msgs);
}

That same guard fixes this: when recovery fails, uiMsgs.length < currentMessageCount (the latter counts the optimistic message), so the overwrite would be skipped and the pending message preserved. Pre-PR, a POST failure threw and the SDK kept the message in an error/retryable state — this PR regresses that.

Suggested fix: gate the setMessages call on the recovery outcome (only overwrite when uiMsgs.length >= currentMessageCount), so a failed probe leaves the optimistic message intact.

Fix this →


P1 — isRecoverableFetchError is broad enough to swallow genuine code errors

frontend/src/lib/sse-chat-transport.ts:84-98

if (error.name === 'TypeError') return true;
const message = error.message.toLowerCase();
return ['load failed','failed to fetch','networkerror','network error',
        'network request failed','connection','fetch'].some(p => message.includes(p));

Any TypeError (e.g. a real bug like reading a property of undefined inside apiFetch/body serialization) and any message merely containing 'fetch' or 'connection' is treated as a recoverable network failure. Such a bug then gets routed into recoverFailedSend and re-surfaces as the generic "Connection… Please retry." message, masking the real error from logs/users. The logger.warn at :537 preserves the original for debugging, so this is non-fatal — but consider narrowing to the actual network-failure signatures (drop the bare 'fetch'/'connection' substrings, and don't treat all TypeErrors as recoverable).


P2 — minor (informational)

  • Redundant probe: recoverFailedSend fetches /api/session/{id} (sse-chat-transport.ts:431) and then onRecoverMessages fetches /api/session/{id} again (useAgentChat.ts:362) when not processing. Harmless, just an extra round-trip on an already-degraded connection.
  • is_processing reconnect (sse-chat-transport.ts:449-457) attaches to the live stream without confirming the new message was accepted; if the session was already mid-turn from a prior request, the new message could be silently dropped while the old turn streams. Edge case (the UI generally blocks sends mid-turn), noting for completeness.

No frontend test runner exists in this repo (frontend/package.json has no vitest/jest), so I'm not flagging missing tests for the new behavior as blocking.

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

Claude finished @lewtun's task in 1m 15s —— View job


Re-review: Recover chat sends after fetch failures

No blocking issues. Both findings from the prior review were addressed in 6510d69.

  • P0 (resolved) — The unconditional setMessages overwrite is now gated on uiMsgs.length >= currentMessageCount (useAgentChat.ts:409), mirroring the established 3s-poller guard at useAgentChat.ts:792. In the genuine-failure case the optimistic user message bumps currentMessageCount above the backend-only uiMsgs.length, so the overwrite is skipped and the composed text is preserved. The recovery-success decision was correctly decoupled from the overwrite by precomputing backendAdvanced/submittedTurnAccepted before the guard and returning backendAdvanced || submittedTurnAccepted (useAgentChat.ts:396-406, 424), so the success path still hydrates backend state as intended.

  • P1 (resolved)isRecoverableFetchError was narrowed (sse-chat-transport.ts:87-105): it no longer treats every TypeError as recoverable and dropped the bare 'fetch'/'connection' substrings. It now requires name === 'networkerror', or a TypeError whose message matches a concrete network-failure signature (failed to fetch, load failed, networkerror, etc.). This covers the standard Chrome/Firefox/Safari network-failure shapes while letting genuine code-level TypeErrors surface normally.

The P2 notes from the prior review (redundant /api/session probe, mid-turn reconnect from a prior request) remain informational and were not required to change.
• Branch: codex/recover-chat-fetch-failures

@lewtun
Copy link
Copy Markdown
Member Author

lewtun commented Jun 8, 2026

Addressed the review in 6510d69:

  • P0: recovery hydration now only overwrites chat state when backend UI history has at least the current message count, preserving the optimistic user message when the POST was not accepted.
  • P1: recoverable fetch classification is narrowed to known browser/network failure signatures instead of every TypeError or broad fetch/connection substrings.

Checks:

  • npm run build
  • npm run lint (existing warnings only)
  • uv run ruff check .
  • uv run ruff format --check .

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

Claude finished @lewtun's task in 34s —— View job


Re-review: Recover chat sends after fetch failures

No blocking issues. The latest push (20033d2) is a merge from main; it does not touch the recovery logic, and both previously-flagged findings remain correctly fixed.

  • P0 (resolved, still intact) — The setMessages/saveMessages overwrite is gated on uiMsgs.length >= currentMessageCount (useAgentChat.ts:409), so a genuine POST failure (backend has no record → uiMsgs shorter than the optimistic count) skips the overwrite and preserves the user's composed message. The recovery-success return value is decoupled from the overwrite, precomputed as backendAdvanced || submittedTurnAccepted (useAgentChat.ts:396-406, 424), so the success path still hydrates backend state.

  • P1 (resolved, still intact)isRecoverableFetchError is narrowed to name === 'networkerror' or a TypeError matching a concrete network-failure signature (sse-chat-transport.ts:102-105), so genuine code-level TypeErrors surface normally rather than being masked as connection errors.

The P2 notes from the earlier review (redundant /api/session probe, mid-turn reconnect from a prior request) remain informational.
• Branch: codex/recover-chat-fetch-failures

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

Claude finished @lewtun's task in 2m 30s —— View job


Re-review: Recover chat sends after fetch failures

No blocking issues. The latest push (05d4423 — "Avoid redundant recovery session fetch") addresses the P2 redundant-probe note from the first review. The recovery logic remains correct.

P2 (resolved) — The redundant /api/session/{id} round-trip is gone. recoverFailedSend now threads its already-parsed session payload into onRecoverMessages via the new sessionInfo field (sse-chat-transport.ts:477-480), so the handler fetches only /messages instead of re-fetching /session (useAgentChat.ts:368-369). The fall-back double-fetch is preserved for the is_processing-but-stream-failed case, where sessionInfo is intentionally passed as undefined (sse-chat-transport.ts:479) so the handler re-probes status (useAgentChat.ts:370-384).

I traced one behavior change introduced by this commit and confirmed it is correct, not a regression:

  • In the threaded path the dead-session check is now msgsRes.status === 404 alone (useAgentChat.ts:386-388), versus the prior infoRes.status === 404 && msgsRes.status === 404. This is reachable for an ended session, where /session/{id} returns 200 with is_active: false (session_manager.py:1283, :1501) but /messages 404s on the is_active guard (agent.py:1051). Firing onSessionDead there is the right outcome — the send is genuinely unrecoverable — and it mirrors the transport's own onSessionDead on a /session 404 (sse-chat-transport.ts:456-457). The success path is unaffected: when info is OK, backendIsProcessing is false (transport only passes sessionInfo when !is_processing), and the P0/P1 fixes from earlier commits are intact (the setMessages overwrite is still gated on uiMsgs.length >= currentMessageCount at :429, return value still backendAdvanced || submittedTurnAccepted at :444).

The prior P0 (dropped optimistic message) and P1 (isRecoverableFetchError breadth) findings remain correctly fixed. The remaining P2 note (mid-turn reconnect from a prior request) is unchanged and informational.

@lewtun
Copy link
Copy Markdown
Member Author

lewtun commented Jun 8, 2026

Addressed the redundant-fetch P2 in 05d4423:

  • The transport now passes the already-fetched session info into onRecoverMessages when the recovery probe shows the backend is idle.
  • useAgentChat uses that supplied session info and only fetches /messages in that idle recovery path, avoiding the duplicate /api/session/{id} request.
  • If the probe showed is_processing: true but live SSE reconnect fails, the callback still refetches /session because the turn may have completed between the probe and hydration.

Checks repeated:

  • npm run build
  • npm run lint (existing warnings only)
  • uv run ruff check .
  • uv run ruff format --check .

Browser sanity check: local app reloads at http://127.0.0.1:5173/ with title ML Intern and populated root content.

@lewtun lewtun merged commit 65c14d1 into main Jun 8, 2026
3 checks passed
@lewtun lewtun deleted the codex/recover-chat-fetch-failures branch June 8, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant