
fix: recover from llama.cpp context overflow and reqwest SSE decode failures#3164

Open
petterreinholdtsen wants to merge 1 commit into ultraworkers:main from petterreinholdtsen:llama.cpp-errors


@petterreinholdtsen

Summary

Extend auto-compaction error detection to handle additional error patterns from llama.cpp backends: 'Context size has been exceeded', 'exceed_context_size_error', 'exceeds the available context size'. Also recover from reqwest 'error decoding response body' errors — some llama.cpp instances return a non-SSE plaintext HTTP 500 on context overflow, causing the SSE deserializer to fail.

Add dynamic threshold adaptation: parse server-reported context window size from error messages (e.g., '(81920 tokens)') and set the auto-compaction trigger at 70% of that value. This replaces the need for a hardcoded threshold, adapting automatically to any backend's limits.

This patch was developed with assistance from OpenCode and a local Qwen 3.6 API server.

Anti-slop triage

I do not understand these triage points:

  • Classification:
  • Evidence:
  • Non-destructive review result:

Verification

I do not understand the first two of these check points.

  • Targeted tests/docs checks ran, or the gap is explicitly recorded.
  • git diff --check passes.
  • No live secrets, tokens, private logs, or unrelated generated churn are included.

Resolution gate

I am not aware of any issue related to this fix. I fixed the issues I ran into when trying to use my local llama.cpp server.

  • If this PR resolves an issue, the issue number and fix evidence are linked.
  • If this PR should not merge, the rejection/defer rationale is evidence-backed and does not rely on vibes.
  • I did not merge/close remote PRs or issues from an automation lane without owner approval.

@petterreinholdtsen
Author

I asked the coding agent to maintain a status summary in a markdown file. Here is what it ended up with for this patch. I have removed secrets and private DNS names.

Crash 1: "exceed_context_size_error" (context window overflow) [SOLVED]

  • Error: api returned 400 Bad Request (exceed_context_size_error): request (84717 tokens) exceeds the available context size (81920 tokens)
  • Root cause: A llama.cpp instance has an 81920-token context window. Default auto-compaction threshold (DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000 in conversation.rs:18) is above the backend limit, so compaction fires too late.
  • Key files:
    • rust/crates/runtime/src/conversation.rs:18 — threshold constant (now overridden dynamically at runtime)
    • rust/crates/rusty-claude-cli/src/main.rs:5273-5277 — error detection for context-window overflow

What was done

  • Extended error detection in run_turn() to match "exceed_context_size" and "exceeds the available context size" (not just "context_window") so this llama.cpp-specific error format triggers recovery
  • Added extract_context_window_tokens_from_error() (main.rs:9486) that parses server-reported token limits from error strings (e.g., (81920 tokens)) and computes a 70% threshold
  • Added set_auto_compaction_input_tokens_threshold() setter on ConversationRuntime (conversation.rs:210) so the CLI can update the threshold at runtime after learning the window size
  • Wired into both initial error recovery and retry-loop paths — once we learn the server's window, future auto-compaction triggers are accurate
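
The parsing step can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the function bodies, the `SketchRuntime` struct, and the choice to read the last "(N tokens)" group in the message are all assumptions made for the example. Only the setter name and the 70% factor come from the notes above.

```rust
/// Hypothetical sketch: return the number inside the LAST "(N tokens)" group.
/// In the llama.cpp message quoted above, the last group is the window size.
fn extract_context_window_tokens(error: &str) -> Option<u32> {
    const SUFFIX: &str = " tokens)";
    let end = error.rfind(SUFFIX)?;
    // Walk back to the parenthesis that opens this group.
    let start = error[..end].rfind('(')? + 1;
    error[start..end].parse().ok()
}

/// 70% of the reported window, using integer arithmetic.
fn threshold_from_window(window_tokens: u32) -> u32 {
    (u64::from(window_tokens) * 70 / 100) as u32
}

/// Stand-in for the real runtime type (assumption: the real type holds more state).
struct SketchRuntime {
    auto_compaction_input_tokens_threshold: u32,
}

impl SketchRuntime {
    fn set_auto_compaction_input_tokens_threshold(&mut self, threshold: u32) {
        self.auto_compaction_input_tokens_threshold = threshold;
    }
}

/// Updates the threshold if the error reveals a window size; returns whether it did.
fn adapt_threshold(runtime: &mut SketchRuntime, error: &str) -> bool {
    match extract_context_window_tokens(error) {
        Some(window) => {
            runtime.set_auto_compaction_input_tokens_threshold(threshold_from_window(window));
            true
        }
        None => false,
    }
}
```

With the error from Crash 1, this sketch reads a window of 81920 tokens and sets the threshold to 57344. Errors that carry no token count, such as the one in Crash 4, leave the threshold unchanged.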

Why this is better than a fixed threshold

A hardcoded value (40k, 900k, etc.) is wrong because different models/servers have different windows. Dynamic parsing from server errors adapts automatically. The MAX_REQUEST_BODY_BYTES env-var pre-flight check was reverted since it was redundant: its byte-size estimate produced errors NOT caught by recovery, whereas the dynamic token-based approach lets the server reject oversized requests, parses the real limit, and compacts proactively at 70% of the actual window.


Crash 4: "Context size has been exceeded." (llama.cpp) [SOLVED]

  • Error: api returned 500 Internal Server Error (server_error): Context size has been exceeded.
  • Root cause: Same as Crash 1 — context window overflow. But the error message format differs: llama.cpp backends return "Context size has been exceeded." which didn't match any of our existing patterns (context_window, exceed_context_size, etc.), so it fell through to a hard crash instead of triggering auto-compaction recovery.
  • Key files:
    • rust/crates/rusty-claude-cli/src/main.rs:5273-5278 — initial error detection in run_turn()
    • rust/crates/rusty-claude-cli/src/main.rs:5378-5383 — retry-loop error detection

What was done

  • Added case-insensitive match for "context size has been exceeded" to both the initial and retry-loop error detection paths in run_turn(), so llama.cpp backends trigger auto-compaction + progressive retry just like the other context-window errors. Also updated comment at main.rs:5271–5272 to document these provider-specific formats.
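
A case-insensitive matcher over the patterns named so far could look something like this. The function and constant names are hypothetical, and the real check in run_turn() may be structured differently (for example, as inline comparisons rather than a table).

```rust
/// Substrings that indicate a context-window overflow, all lowercase.
/// The pattern strings come from the notes above; the table itself is an assumption.
const CONTEXT_OVERFLOW_PATTERNS: [&str; 4] = [
    "context_window",
    "exceed_context_size",
    "exceeds the available context size",
    "context size has been exceeded",
];

/// Returns true if the error text contains any known overflow pattern,
/// ignoring case.
fn is_context_overflow_error(message: &str) -> bool {
    let lowered = message.to_lowercase();
    CONTEXT_OVERFLOW_PATTERNS
        .iter()
        .any(|pattern| lowered.contains(*pattern))
}
```

Keeping the patterns in one table means the initial and retry-loop paths can share the same check instead of duplicating the string comparisons.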

Crash 2: "assistant stream produced no content" [SOLVED]

  • Error: error: assistant stream produced no content
  • Root cause: In conversation.rs, build_assistant_message() errors out when the stream produces zero text/tool/thinking events + message stop. Happens when provider returns empty/minimal response or stream aborts early.
  • Key files:
    • rust/crates/runtime/src/conversation.rs:715-764 build_assistant_message() returns error if blocks.is_empty()
    • rust/crates/rusty-claude-cli/src/main.rs:9300,9422,9435 consume_stream() with streaming→non-streaming fallback

What was done

  • Added "no content" error detection to run_turn() (main.rs:5281) so it triggers auto-compaction + retry like context-window errors (smaller context = simpler parse for the backend)
  • Added graceful handling in consume_stream(): if stream events are empty but we saw a stop, treat as end_turn rather than crashing
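
The empty-stream handling might be pictured as below. The event and outcome types are simplified stand-ins invented for this sketch. The real consume_stream() works with the crate's own stream types and also handles tool and thinking events.

```rust
/// Simplified stand-in for the stream events (assumption: the real enum is richer).
#[derive(Debug, PartialEq)]
enum StreamEvent {
    Text(String),
    MessageStop,
}

/// Simplified stand-in for the result of one assistant turn.
#[derive(Debug, PartialEq)]
enum TurnOutcome {
    Content(Vec<String>),
    EndTurn,
}

/// Collects text blocks; an empty stream that still ended with a stop event
/// is treated as a normal end of turn instead of an error.
fn finish_stream(events: &[StreamEvent]) -> Result<TurnOutcome, String> {
    let mut blocks = Vec::new();
    let mut saw_stop = false;
    for event in events {
        match event {
            StreamEvent::Text(text) => blocks.push(text.clone()),
            StreamEvent::MessageStop => saw_stop = true,
        }
    }
    match (blocks.is_empty(), saw_stop) {
        (false, _) => Ok(TurnOutcome::Content(blocks)),
        (true, true) => Ok(TurnOutcome::EndTurn),
        // No content and no stop: the stream aborted early, so report it.
        (true, false) => Err("assistant stream produced no content".to_string()),
    }
}
```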

Crash 3: Server 500 "Failed to parse input at pos N" [SOLVED]

  • Error: api returned 500 Internal Server Error (server_error): Failed to parse input at pos 42: ...
  • Root cause: Provider backend rejects the request body with a parse error. Likely caused by malformed tool call XML or thinking content in the messages array — e.g., unclosed thinking tags before tool calls, or oversized payloads that exceed what the server can parse.
  • Key files:
    • rust/crates/api/src/providers/openai_compat.rs:307-320 send_raw_request() builds and sends the body
    • rust/crates/api/src/providers/openai_compat.rs:1054-1146 build_chat_completion_request_for_base_url() constructs messages

What was done

  • Added "Failed to parse input at pos" error detection to run_turn() (main.rs:5282) so it triggers auto-compaction + retry (smaller context = simpler structure for backend to parse)
  • Extended retry-able status codes to include 500 for OpenAI compat provider (transient backend errors)
  • Thinking tag safety: thinking blocks are properly closed before tool calls via close_thinking() in StreamState
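
As a rough illustration of the first two points, the checks might reduce to something like the following. Both function names are hypothetical, and apart from 500 the status codes listed are assumptions, since the notes do not say which codes were already retry-able.

```rust
/// Hypothetical retry policy: 500 is the code added by this change;
/// 429, 502, 503 and 504 are assumed to have been retry-able already.
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}

/// Detects the backend parse failure from Crash 3, ignoring case.
fn is_parse_failure(message: &str) -> bool {
    message.to_lowercase().contains("failed to parse input at pos")
}
```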

General notes

  • The tested llama.cpp instance uses a model with ~81k context window (openai/qwen3.6) — smaller than typical models
  • Auto-compaction with progressive retries (main.rs:5284–5397+) fires after crash; dynamic threshold makes it proactive going forward
  • OPENAI_BASE_URL routes through OpenAiCompatClient, not native Anthropic client
  • The MAX_REQUEST_BODY_BYTES pre-flight check (commit 5e65b33) was reverted: its byte-size estimate produced errors NOT caught by auto-compaction recovery, while the dynamic token-based approach handles overflow gracefully. Keeping it would add unnecessary double-serialization overhead on every request.

Crash 5: "error decoding response body" (reqwest SSE decode failure) [SOLVED]

  • Error: http error: error decoding response body
  • Root cause: llama.cpp backend returns an HTTP 500 with a non-SSE plaintext body when context overflows, causing reqwest's SSE stream deserializer to fail. This is wrapped as ApiError::Http(reqwest::Error) and displays as "http error: error decoding response body". Because this string didn't match any recovery pattern in run_turn(), it fell through to a hard crash instead of triggering auto-compaction + progressive retry.
  • Key files:
    • rust/crates/rusty-claude-cli/src/main.rs:5286-5287 — initial error detection in run_turn() (added "error decoding response body" to is_no_content)
    • rust/crates/rusty-claude-cli/src/main.rs:5389-5390 — retry-loop error detection (same pattern added)
    • rust/crates/api/src/error.rs:46,300 ApiError::Http(reqwest::Error) displays as "http error: {error}"

What was done

  • Added "error decoding response body" to the is_no_content recovery pattern in both the initial and retry-loop detection paths, so reqwest decode failures trigger auto-compaction + progressive retry just like other context-overflow errors. Comment updated at main.rs:5284–5287 to document that some backends (llama.cpp) return non-SSE bodies on overflow.
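
One way to picture the extended is_no_content check is the sketch below. The helper name and its exact pattern list are assumptions. The notes only confirm that "error decoding response body" was added to the existing is_no_content condition.

```rust
/// Hypothetical helper mirroring the is_no_content condition: true for an
/// empty assistant stream or a reqwest body-decode failure, ignoring case.
fn is_no_content_like(message: &str) -> bool {
    let lowered = message.to_lowercase();
    lowered.contains("assistant stream produced no content")
        || lowered.contains("error decoding response body")
}
```

Matching on the displayed string keeps the change small, but it also means any unrelated decode failure will trigger compaction and retry. That trade-off may be acceptable here because the retry count is bounded.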

Test command

# llama.cpp instance (any model name works, server ignores it)
echo "Your task is in insights.md." | OPENAI_BASE_URL=https://my-api-server.example.com/v1 OPENAI_API_KEY=secret ~/src/claw-code-upstream/rust/target/debug/claw --model local/qwen3.6
