fix(openai): stop max_tokens streaming retry loop on reasoning models (#9716) by Dennisadira · Pull Request #10448 · mudler/LocalAI

Dennisadira · 2026-06-22T16:04:48Z

What

Fixes the max_tokens streaming loop reported in #9716.

When a thinking model spends its entire max_tokens budget on the reasoning block, the C++ autoparser clears the raw Response and delivers reasoning-only ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry then sees an "empty" generation and regenerates from scratch — up to maxRetries (5) times — each attempt re-consuming the whole budget. So actual token consumption becomes a multiple of max_tokens, and the stream never cleanly terminates with finish_reason: "length".
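To make the cost concrete, here is a rough sketch of the worst-case arithmetic, assuming every attempt burns the full budget on reasoning. The function name `worstCaseCompletionTokens` is illustrative only and does not exist in the LocalAI code.

```go
package main

// worstCaseCompletionTokens estimates how many completion tokens the old
// retry loop could consume when every attempt spends the whole budget on
// reasoning. Hypothetical helper, for illustration only.
func worstCaseCompletionTokens(maxTokens, maxRetries int) int {
	// One initial attempt plus up to maxRetries regenerations.
	attempts := 1 + maxRetries
	return attempts * maxTokens
}
```

With maxRetries = 5 and, say, max_tokens = 512, that is 6 attempts and up to 3072 completion tokens for a request that asked for at most 512.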

Fix

Add a small reachedTokenBudget(completion, maxTokens) helper.
Suppress both the built-in empty-response retry and the caller-driven (reasoning-only) retry when the completion count has reached the configured max_tokens ceiling — regenerating would just hit the same ceiling again.
Report finish_reason: "length" (new FinishReasonLength constant) instead of "stop" in both the streaming final chunk and the non-streaming response when the budget was exhausted. Tool/function finish reasons are left untouched.
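A minimal sketch of the two pieces described above, not the actual diff. It assumes that a non-positive maxTokens means "no limit configured" and that only a "stop" reason is ever upgraded to "length". `FinishReasonStop` and `finishReason` are hypothetical names used for illustration; only `reachedTokenBudget` and `FinishReasonLength` are named in this PR.

```go
package main

const (
	// FinishReasonStop is a hypothetical constant for the default reason.
	FinishReasonStop = "stop"
	// FinishReasonLength is the OpenAI finish_reason for a hit token limit.
	FinishReasonLength = "length"
)

// reachedTokenBudget reports whether the completion-token count has hit the
// configured max_tokens ceiling.
// Assumption: maxTokens <= 0 is treated as "no limit", so it never triggers.
func reachedTokenBudget(completion, maxTokens int) bool {
	if maxTokens <= 0 {
		return false
	}
	return completion >= maxTokens
}

// finishReason is a hypothetical helper: it upgrades "stop" to "length" when
// the budget was exhausted and leaves every other reason (for example a
// tool-call reason) untouched.
func finishReason(current string, completion, maxTokens int) string {
	if current == FinishReasonStop && reachedTokenBudget(completion, maxTokens) {
		return FinishReasonLength
	}
	return current
}
```

Deriving the flag from the token count keeps the check on the Go side, which is why no backend or proto change is needed for it.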

Tests

Adds a deterministic regression test that stubs ModelInferenceFunc and counts backend invocations for the budget-exhausted reasoning case:

Before: backend invoked 6 times (1 + 5 retries)
After: invoked 1 time

Plus boundary unit tests for reachedTokenBudget (no limit / below / at / above). Full core/http/endpoints/openai and core/backend suites pass; gofmt + go vet clean.
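The shape of the regression test can be sketched as below. This is a simplified model of the retry loop, not the real ComputeChoices: `inferFunc`, `generateWithRetry`, `countCalls`, and the `guard` flag are all hypothetical, and `reachedTokenBudget` is repeated here under the same "non-positive means no limit" assumption so the snippet is self-contained.

```go
package main

// inferFunc is a hypothetical stand-in for a stubbed ModelInferenceFunc: it
// returns the visible content and the number of completion tokens used.
type inferFunc func() (content string, completionTokens int)

// reachedTokenBudget: assumed semantics, maxTokens <= 0 means no limit.
func reachedTokenBudget(completion, maxTokens int) bool {
	return maxTokens > 0 && completion >= maxTokens
}

// generateWithRetry models the empty-response retry. When guard is true, an
// empty response that already hit the budget is not retried.
func generateWithRetry(infer inferFunc, maxTokens, maxRetries int, guard bool) string {
	var content string
	for attempt := 0; attempt <= maxRetries; attempt++ {
		var used int
		content, used = infer()
		if content != "" {
			break // a non-empty response never needs a retry
		}
		if guard && reachedTokenBudget(used, maxTokens) {
			break // regenerating would only hit the same ceiling again
		}
	}
	return content
}

// countCalls runs the loop against a stub that always spends the full budget
// on reasoning (empty content) and returns how often the stub was invoked.
func countCalls(maxTokens, maxRetries int, guard bool) int {
	calls := 0
	stub := func() (string, int) {
		calls++
		return "", maxTokens
	}
	generateWithRetry(stub, maxTokens, maxRetries, guard)
	return calls
}
```

Under these assumptions, countCalls(256, 5, false) gives 6 and countCalls(256, 5, true) gives 1, matching the before and after counts listed above.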

Scope / follow-up

This is a focused Go-side fix: budget exhaustion is derived from the completion-token count vs max_tokens, so it needs no backend/proto changes and resolves the user-facing loop today.

A natural follow-up — matching the "plumb the backend's length stop reason through" suggestion in the issue — would be to surface an authoritative stopped_limit from the llama.cpp gRPC server (and other backends) rather than inferring it. Happy to do that as a separate PR if you'd prefer the backend as the source of truth; I'd just want to confirm the proto shape (bool vs. a stop-reason enum) and default for backends that don't report it before writing it. The retry-suppression logic here is unchanged either way.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

When a thinking model spends its entire max_tokens budget on the reasoning block, the C++ autoparser clears the raw Response and delivers reasoning-only ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry then fires and regenerates from scratch up to maxRetries times, each re-consuming the whole budget, instead of terminating with finish_reason "length" (issue mudler#9716).

Add a reachedTokenBudget helper and suppress both the built-in and caller-driven retries when the completion count has reached the configured max_tokens ceiling. Report finish_reason "length" instead of "stop" in the streaming and non-streaming chat paths when the budget was exhausted.

Adds a deterministic regression test that counts backend invocations (previously 6, now 1) plus boundary tests for the helper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Dennisadira <dennisadira@gmail.com>

mudler approved these changes Jun 22, 2026

Dennisadira wants to merge 1 commit into mudler:master from Dennisadira:fix/max-tokens-streaming-loop
