Skip to content

fix(ui): SSE connection status indicators for mobile network drops#548

Closed
JDis03 wants to merge 20 commits into
NeuralNomadsAI:devfrom
JDis03:fix/sse-connection-status-indicator
Closed

fix(ui): SSE connection status indicators for mobile network drops#548
JDis03 wants to merge 20 commits into
NeuralNomadsAI:devfrom
JDis03:fix/sse-connection-status-indicator

Conversation

@JDis03

@JDis03 JDis03 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

Mobile users experience SSE disconnections (wifi drops, Cloudflare idle timeouts) but get no visual feedback. The per-instance status dots remain green even when the SSE transport is down, because:

  1. EventSource.onerror doesn't fire immediately on mobile wifi drops (waits indefinitely)
  2. navigator.onLine is unreliable on mobile (always reports true)
  3. Instance status dots only update via SSE events (impossible when SSE is down)

Solution

This PR implements immediate connection status feedback via two detection mechanisms:

1. SSE Transport State Propagation

  • serverEvents now exposes onDisconnect() and onOpen() handlers
  • sseManager registers both handlers and propagates state to all instances:
    • On disconnect: all instances → connecting (amber dot)
    • On reconnect: connecting instances → cleared (green dot when events arrive)

2. Browser Offline Detection

  • sseManager watches isOnlineSignal() via createRoot + createEffect
  • When browser goes offline → immediate amber dots (no 45-60s TCP timeout wait)

3. Client-Side Heartbeat (15s timeout)

  • Detects SSE inactivity when EventSource.onerror doesn't fire
  • Forces disconnect after 15s of no SSE activity
  • Triggers reconnection attempt

Changes

  • packages/ui/src/lib/server-events.ts: Add onDisconnect() handler, heartbeat timer
  • packages/ui/src/lib/sse-manager.ts: Register SSE handlers, watch browser online state
  • packages/ui/src/stores/session-actions.ts: Set status to disconnected on network errors

Verification

  • ✅ TypeScript compilation clean
  • ✅ Vite build successful
  • ✅ Tested on mobile: wifi drop → immediate amber dots
  • ✅ Logs show correct state transitions (see attached log)

Log Evidence

[08:20:44.889] WARN sse: Heartbeat timeout (15000ms) — no SSE activity, forcing disconnect
[08:20:44.891] INFO sse: SSE transport disconnected → setting all instances to 'connecting'
[08:24:58.874] INFO sse: SSE transport reconnected → clearing 'connecting' status

Heartbeat detection working correctly. Frequent reconnections (~5min intervals) are Cloudflare idle timeouts, handled gracefully.

Related

JDis03 added 20 commits May 23, 2026 23:22
**User-visible behavior:**
- Pong "Client connection not found" errors no longer flood the debug log; now logged as a muted warning
- Debug overlay title bar shows "⚠ disconnected" when the SSE connection drops, so the user knows immediately
- ⚙ debug toggle button moved from bottom-right corner (overlapping scroll buttons) into the scroll button column, above the pause button
- "Enviando..." and "Generando..." status text now uses the accent color instead of muted gray, making it visible
- Spinner animation changed from pulse to bounce for more noticeable activity feedback
- Hard-refresh is required to pick up the latest build

**Implementation:**
- `server-events.ts`: added `.connected` getter, set `_connected=false` in scheduleReconnect and `_connected=true` onopen; pong catch now calls `debugWarn` instead of `log.error` so stale-connection pongs don't pollute logs
- `debug-overlay.tsx`: imports `serverEvents`, header shows disconnected warning when SSE is down
- `message-section.tsx`: removed broken `isAssistantThinking` memo (store is not reactive); ⚙ button rendered inside `renderControls` as first child of `message-scroll-button-wrapper`
- `messaging.css`: `.message-generating`/`.message-sending` use `--accent` color; spinner uses `generating-bounce` animation
- `message-section.css`: added `.debug-toggle-button` styles (2rem circle, muted → primary on hover)

**SSE reconnection note:**
The root cause of repeated SSE drops is likely a server-side or proxy issue (PM2, nginx timeout). This commit only handles the client-side symptoms gracefully. A durable fix would require investigating why the EventSource connection does not stay open on this deployment.

**Validation:**
- Build passes with `npm run build --workspace @codenomad/ui`
- Copy step `node packages/server/scripts/copy-ui-dist.mjs` completes
Accumulate text deltas in a buffer and flush every 50ms, so rapid
streaming chunks don't trigger a Solid store mutation + re-render
on every single delta event.

On mobile this cuts the main-thread blocking caused by KaTeX /
highlight.js re-renders during streaming responses.
… prompt

- Mark user message as 'sent' immediately after promptAsync resolves so
  the status no longer stays 'sending' indefinitely.
- Fix the showSending effect to handle the case where the component
  mounts after the status has already transitioned past 'sending'
  (e.g. to 'streaming' or 'complete') — show a brief 1.5s indicator
  for recent non-error messages.
- Remove the old 2-second grace window that kept re-setting the signal
  and prevented the indicator from ever hiding.
scrollToBottom() already calls setAutoScroll(true) internally.
Calling it redundantly before scrollToBottom interfered with the
scroll state and caused the list to jump up on mobile.
Throttle writes to localStorage to at most once per second instead of
on every log entry. JSON.stringify + setItem on mobile can block the
main thread when entries are frequent (e.g. during SSE reconnect
storms or rapid log writes).
- Inject service worker unregister script during UI build copy so
  old SW caches don't interfere with testing on mobile.
- Set injectRegister: null + selfDestroying: true in VitePWA config
  to prevent automatic SW registration.
- Change message-generating and message-sending indicator colors from
  accent to status-warning (orange) for better visibility on mobile.
…ID replacement

When message.updated SSE event arrives before promptAsync resolves,
replaceMessageIdV2 deletes the client ID and replaces it with the
server ID. If promptAsync then tries to upsertMessage with the old
client ID, it creates a duplicate orphan record that gets appended
to messageIds. This orphan can later be matched incorrectly by
findPendingSyntheticMessageId, causing message ordering bugs.

Fix: check if the message still exists before upserting after
promptAsync. If the client ID was already replaced, skip the
upsert since the server already manages the message state.
When message.part.updated arrives with a complete part, applyPartUpdate
replaces the part entirely with the server's state. But if deltas were
enqueued before the part update arrived and haven't flushed yet, those
stale deltas would be applied AFTER the replacement, causing text
duplication at the end of assistant messages.

Fix: clear pending deltas for a part before applying the full part
update, since the update already contains the complete state and any
accumulated deltas are now stale.

This prevents the race where:
1. Deltas accumulate in the 50ms throttle window
2. message.part.updated arrives and replaces the part with complete text
3. Delta timer expires and concatenates stale deltas → duplicate text
Temporary debug logging to understand message ordering issue where
user messages sometimes appear in the correct position and sometimes
'bien arriba' (too high up).

Logs:
- message.part.updated events (role, msgId, partType)
- message.updated events (role, msgId)
- insertMessageIntoSession calls (role, msgId, position, array length)

This will help identify the SSE event ordering and insertion sequence
that causes the intermittent positioning bug.
…ession

Remove .slice(0,8) truncation to see complete message IDs and
understand why duplicate insertions are happening. Also log when
insertion is skipped because ID already exists in array.
The previous fix for the sending indicator introduced message ordering
problems by:
1. Calling upsertMessage after promptAsync to change status to 'sent'
2. Modifying findPendingSyntheticMessageId to match 'sent' status

This caused multiple upsertMessage calls for the same user message,
which tried to insert the message ID multiple times (prevented by the
ids.includes check, but still caused unnecessary work and potential
race conditions).

Revert to upstream behavior:
- findPendingSyntheticMessageId only matches 'sending' status
- Remove upsertMessage call after promptAsync resolves

The sending indicator still works because message-item.tsx has a
third condition that shows the indicator for recent user messages
(< 1.5s old) regardless of status.
Remove diagnostic logs added to trace message insertion and SSE events.
The ordering issue has been fixed by reverting the problematic
status='sent' logic.
MOBILE-SSE-RESILIENCE-ANALYSIS.md documents the root cause of message
delivery failures on mobile networks with poor connectivity:

Problem:
- User sends message but receives no response
- Error only occurs on mobile networks with high latency/packet loss
- Response arrives only after user sends another message or refreshes

Root Cause:
- Server sends ping every 15 seconds via SSE
- Client responds with HTTP POST pong (separate request)
- POST fails on poor mobile networks → server closes SSE after 45s
- Message responses stuck in queue unable to send
- Reconnection on next message delivers queued responses

Timeline Evidence:
  Message sent at 14.8s
  → 29s later: POST pong fails
  → Server closes connection
  → Responses stuck until reconnection

Critical Path Files:
- Server: events.ts (ping), connection-manager.ts (timeout)
- Client: server-events.ts (pong handler), api-client.ts (HTTP POST)

Proposed Solutions:
1. Option A (quick win): Add retry logic to pong POST
2. Option B (robust): Send pong via SSE channel instead of HTTP

Recommendation: Start with Option A, escalate to Option B if needed.

This analysis supports future PR decisions on mobile resilience
improvements and serves as reference for developers implementing fixes.
MOBILE-SSE-FIX-IMPLEMENTATION-GUIDE.md provides detailed step-by-step
instructions for implementing Option A (quick win) from the root cause
analysis: adding retry logic with exponential backoff to the pong HTTP POST.

Includes:
- Utility function: retryWithBackoff with exponential backoff configuration
- Code changes: Updated server-events.ts pong handler with retry wrapper
- Logging enhancements: Track retry attempts for observability
- Test checklist: Unit tests, integration tests, manual testing scenarios
- Metrics monitoring: Post-deployment pong success/failure tracking
- Success criteria: Clear verification points before declaring fix complete
- Rollback plan: Graceful degradation if issues appear

This guide is ready for implementation by any contributor following the
checklist step-by-step. All code examples included.

This PR can be implemented independently and deployed for immediate
improvement to mobile network resilience.
…t ordering

When deltas are buffered for up to 50ms and message.updated arrives
before the flush timer fires, the message could be marked complete/error
before pending text mutations are applied. This causes the UI to observe
a terminal-status message with stale content.

Fix: flush any pending deltas for the message before applying the
message.updated event. This preserves the server's event ordering:
all delta content is applied first, then the message status/metadata
update runs on the complete content.

Addresses gatekeeper review blocking finding NeuralNomadsAI#1 on PR NeuralNomadsAI#536.
…tus indicators

When the workspace-level SSE stream drops (Cloudflare idle timeout, wifi
drop, mobile network switch), the per-instance connection status dots
remained frozen at 'connected' (green) because they depended exclusively
on instance.eventStatus events that travel over the broken SSE pipe.

Changes:
- server-events.ts: Add onDisconnect() handler (mirrors existing onOpen)
  and fire disconnectHandlers in scheduleReconnect() when _connected
  goes false.
- sse-manager.ts: Register onDisconnect handler that sets ALL known
  instances to 'connecting' (amber dot). Register onOpen handler that
  clears 'connecting' entries back to 'connected' (green dot) on
  reconnect, letting subsequent instance.eventStatus events restore
  real per-instance state (including genuine 'disconnected').
- instance-shell2.tsx: No changes needed — already renders connected/
  connecting/disconnected states at lines 1104-1121.

Edge cases verified:
- Transient drop: amber → green on reconnect (no false modal)
- Instance dies during outage: amber → green → server sends 'disconnected' → red
- 0 instances: Map empty, loop is no-op
- InstanceDisconnectedModal: only fires on instance.eventStatus, not
  workspace drop — prevents false disconnect modals on wifi flutters
- Rapid reconnect cycles: idempotent (setting 'connecting' on already-
  'connecting' is no-op)
- Mobile perf: O(n) on 1-3 instances, negligible signal overhead

Previously the only SSE connection visibility was the debug overlay
(Ctrl+Shift+D). Now all users see real transport state in the status
dot next to the command palette.

AGENTS.md: Add PR Review Principles section as mandatory contributor
guidelines (check regressions, look for better implementations, be
the PR gatekeeper, ruthless code quality, test before responding,
UI/server version parity).
The SSE EventSource onerror can take 45-60 seconds to fire after wifi
drop because the TCP connection timeout hasn't expired yet. During that
window, the per-instance status dots remain frozen at 'connected' (green)
even though the browser is offline.

Add createRoot + createEffect in sseManager constructor that watches
isOnlineSignal from network-status.ts. When the browser goes offline,
all known instances are immediately set to 'connecting' (amber dot),
bypassing the EventSource timeout delay.

This complements the existing serverEvents.onDisconnect handler which
catches actual SSE transport drops. Together they cover both paths:
- Browser offline → immediate amber dot
- SSE transport drop → amber dot
- SSE reconnect → green dot restored
Log every status change in the debug overlay (Ctrl+Shift+D) so users
can see exactly when and why the status dot changes:

- SSE transport disconnected → 'setting all instances to connecting'
- SSE transport reconnected → 'clearing connecting status'
- Browser offline → 'setting all instances to connecting'
- Per-instance status change → 'instanceId → status'

These appear in the debug overlay under the 'sse' source, making it
easy to verify the fix works on mobile/wifi tests.
…-visible)

Previously used debugInfo which only writes to the in-app debug overlay
(Ctrl+Shift+D). Now uses log.warn/log.info from the logger module so
messages appear in the browser DevTools console with the [sse] prefix.

The user can now verify the fix by checking the browser console for:
- 'sseManager initialized' on app start
- 'SSE transport disconnected' when SSE drops
- 'Browser offline' when wifi is toggled off
- 'connectionStatus <id> → connecting/connected' on each transition

Both console (log.*) and debug overlay (debugInfo) are written.
@JDis03

JDis03 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Closing to recreate with clean commit history (only connection status indicator commits, not the 20 mixed commits from base branch)

@JDis03 JDis03 closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant