fix(openai): stop max_tokens streaming retry loop on reasoning models (#9716) #10448

Open
Dennisadira wants to merge 1 commit into mudler:master from Dennisadira:fix/max-tokens-streaming-loop
@Dennisadira

What

Fixes the max_tokens streaming loop reported in #9716.

When a thinking model spends its entire max_tokens budget on the reasoning block, the C++ autoparser clears the raw Response and delivers reasoning-only ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry then sees an "empty" generation and regenerates from scratch, up to maxRetries (5) times, with each attempt re-consuming the whole budget. Actual token consumption therefore becomes a multiple of max_tokens (up to 6x when all 5 retries fire), and the stream never cleanly terminates with finish_reason: "length".

Fix

  • Add a small reachedTokenBudget(completion, maxTokens) helper.
  • Suppress both the built-in empty-response retry and the caller-driven (reasoning-only) retry when the completion count has reached the configured max_tokens ceiling — regenerating would just hit the same ceiling again.
  • Report finish_reason: "length" (new FinishReasonLength constant) instead of "stop" in both the streaming final chunk and the non-streaming response when the budget was exhausted. Tool/function finish reasons are left untouched.
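For illustration, the helper and the finish-reason choice described above could look roughly like the sketch below. Only the names reachedTokenBudget and FinishReasonLength come from this PR; FinishReasonStop, the finishReason function, and the treatment of a non-positive max_tokens as "no limit" are assumptions made for the example, and the code in the actual diff may differ.

```go
package main

import "fmt"

const (
	// FinishReasonStop is an assumed name for the existing "stop" reason.
	FinishReasonStop = "stop"
	// FinishReasonLength is the new constant named in this PR.
	FinishReasonLength = "length"
)

// reachedTokenBudget reports whether the completion-token count has hit the
// configured max_tokens ceiling. Assumption: maxTokens <= 0 means "no limit".
func reachedTokenBudget(completion, maxTokens int) bool {
	if maxTokens <= 0 {
		return false
	}
	return completion >= maxTokens
}

// finishReason is a hypothetical helper: it upgrades "stop" to "length" when
// the budget was exhausted and leaves every other reason (for example tool or
// function calls) untouched.
func finishReason(current string, completion, maxTokens int) string {
	if current == FinishReasonStop && reachedTokenBudget(completion, maxTokens) {
		return FinishReasonLength
	}
	return current
}

func main() {
	fmt.Println(finishReason(FinishReasonStop, 256, 256)) // length
	fmt.Println(finishReason(FinishReasonStop, 100, 256)) // stop
	fmt.Println(finishReason("tool_calls", 256, 256))     // tool_calls
}
```

Comparing with >= rather than == keeps the check robust if a backend reports slightly more completion tokens than the ceiling.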

Tests

Adds a deterministic regression test that stubs ModelInferenceFunc and counts backend invocations for the budget-exhausted reasoning case:

  • Before: backend invoked 6 times (1 + 5 retries)
  • After: invoked 1 time

Plus boundary unit tests for reachedTokenBudget (no limit / below / at / above). Full core/http/endpoints/openai and core/backend suites pass; gofmt + go vet clean.

Scope / follow-up

This is a focused Go-side fix: budget exhaustion is derived from the completion-token count vs max_tokens, so it needs no backend/proto changes and resolves the user-facing loop today.

A natural follow-up — matching the "plumb the backend's length stop reason through" suggestion in the issue — would be to surface an authoritative stopped_limit from the llama.cpp gRPC server (and other backends) rather than inferring it. Happy to do that as a separate PR if you'd prefer the backend as the source of truth; I'd just want to confirm the proto shape (bool vs. a stop-reason enum) and default for backends that don't report it before writing it. The retry-suppression logic here is unchanged either way.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

When a thinking model spends its entire max_tokens budget on the reasoning
block, the C++ autoparser clears the raw Response and delivers reasoning-only
ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry
then fires and regenerates from scratch up to maxRetries times, each
re-consuming the whole budget, instead of terminating with finish_reason
"length" (issue mudler#9716).

Add a reachedTokenBudget helper and suppress both the built-in and
caller-driven retries when the completion count has reached the configured
max_tokens ceiling. Report finish_reason "length" instead of "stop" in the
streaming and non-streaming chat paths when the budget was exhausted.

Adds a deterministic regression test that counts backend invocations
(previously 6, now 1) plus boundary tests for the helper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Dennisadira <dennisadira@gmail.com>