Skip to content

Add OpenAI-compatible local serving#183

Open
haoruilee wants to merge 1 commit into
RL-Align:mainfrom
haoruilee:codex/openai-compatible-serving
Open

Add OpenAI-compatible local serving#183
haoruilee wants to merge 1 commit into
RL-Align:mainfrom
haoruilee:codex/openai-compatible-serving

Conversation

@haoruilee

@haoruilee haoruilee commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Adds a lightweight OpenAI-compatible local serving entry point for RL-Kernel.

It exposes completions, chat completions, and Responses endpoints, including SSE streaming and in-memory Responses retrieval/input item support. The server can run with a local Transformers model or through RolloutExecutor.

A few choices are intentionally conservative: hosted-only features such as built-in tools, multimodal/audio inputs, background jobs, structured outputs, and real reasoning traces return clear OpenAI-style errors instead of being silently ignored.

Validation:

  • pre-commit run --all-files
  • mypy --ignore-missing-imports rl_engine/
  • python3 -m pytest tests/test_openai_server.py -q
  • python3 -m pytest rl_engine/tests/test_dispatch.py -v
  • mkdocs build --strict -f mkdocs.yaml
  • CUDA smoke with sshleifer/tiny-gpt2
  • OpenAI Python SDK base_url create/stream/retrieve/input_items check

Closes #21

Summary by CodeRabbit

Release Notes

  • New Features

    • Added an OpenAI-compatible local HTTP server exposing Completions, Chat Completions, and Responses, including SSE streaming and tool/function calling.
    • Added optional in-memory Responses storage with create/retrieve/list/delete endpoints.
  • Documentation

    • Added an “OpenAI-Compatible Local Serving” user guide with supported endpoints, request/response shapes, streaming behavior, and notes on unsupported OpenAI-only features.
  • Tests

    • Added end-to-end tests covering schemas, validation/error paths, streaming event ordering, tool calling, and storage lifecycle (plus a gated GPU smoke test).

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds rl_engine/executors/openai_server.py, a new OpenAI-compatible local HTTP server supporting /v1/completions, /v1/chat/completions, and /v1/responses endpoints with SSE streaming, backed by either a Transformers causal-LM or a RolloutExecutor. Includes a full test suite and usage documentation.

Changes

OpenAI-Compatible Server Implementation

Layer / File(s) Summary
Module setup and core data structures
rl_engine/executors/openai_server.py
Adds imports for HTTP/threading/CLI/typing, ML libraries, and logging; defines GeneratedText, StreamChunk, StoredResponse dataclasses; defines CompletionBackend protocol; defines OpenAIServingError exception with OpenAI-compatible JSON serialization.
TransformersCompletionBackend: Hugging Face model loading and generation
rl_engine/executors/openai_server.py
Loads Hugging Face tokenizer/model with device/dtype resolution, applies chat template or fallback formatting, counts tokens, batches generation with stop-sequence truncation, computes finish_reason, and streams via TextIteratorStreamer with background threading.
RolloutCompletionBackend: RolloutExecutor integration with streaming fallback
rl_engine/executors/openai_server.py
Converts OpenAI sampling params (temperature, top_p, max_tokens, stop) into RolloutExecutor.generate_candidates requests, maps rollout outputs to GeneratedText with stop-derived finish_reason, and implements streaming by chunking non-stream output.
OpenAICompletionService: all endpoint implementations with storage
rl_engine/executors/openai_server.py
Implements create_completion, stream_completion, create_chat_completion, stream_chat_completion, create_response, and stream_response endpoints; manages in-memory response storage with retrieval, deletion, and pagination of input items via path routing.
Request parsing and parameter extraction for Completions and Chat Completions
rl_engine/executors/openai_server.py
Implements prompt/message extraction, generation parameter parsing (n/max_tokens/temperature/top_p/stop), unsupported field rejection, content-to-text normalization, and numeric range/streaming shape validation.
Responses request normalization and validation
rl_engine/executors/openai_server.py
Handles Responses input item normalization, content-part conversion, previous response chaining via previous_response_id, unsupported mode gating (background, conversation, prompt templates, audio, logprobs, truncation, parallel_tool_calls), and constraints on include and text.format.
Tool choice forcing and reasoning item construction
rl_engine/executors/openai_server.py
Implements tool_choice selection and forcing supporting none/auto/required modes and explicit function selection; constructs reasoning output items with optional encrypted-content inclusion; implements stop/finish-reason computation with stop vs length distinction.
Response object assembly and streaming utilities
rl_engine/executors/openai_server.py
Implements fallback_chat_prompt for message flattening, _response_object and _response_message_item to assemble OpenAI-Responses dicts, backend streaming dispatch with stop-sequence buffering, output text extraction, usage accumulation, and response pagination routing.
HTTP handler, server startup, and CLI entry point
rl_engine/executors/openai_server.py
Implements make_handler with GET/POST/DELETE routing for health/models and /v1/* endpoints, JSON body parsing, SSE framing, streaming error-to-SSE conversion; create_server for ThreadingHTTPServer, build_backend_from_args for CLI backend selection, and main() entry point.
Test doubles: RecordingBackend and rollout/model/tokenizer mocks
tests/test_openai_server.py
Introduces RecordingBackend with predictable GeneratedText and stream output; FakeRolloutCandidate, RecordingRolloutExecutor (captures sampling_params), FailingGenerateModel, StreamingErrorTokenizer; HTTP helper functions (_get_json, _post_json, _post_raw, _delete_json).
Completion and chat completion contracts and validation
tests/test_openai_server.py
Tests create_completion and create_chat_completion response structure (object/model/choices/usage), merged system/user content, rejection of unsupported multimodal (image_url) and unknown fields, model validation before generation, HTTP endpoint JSON returns.
Responses API contracts and response store lifecycle
tests/test_openai_server.py
Tests create_response output structure (output_text/text/reasoning/usage), stream_response typed SSE event ordering and transcript reconstruction, in-memory storage with create/retrieve/list/delete, pagination of input items, previous_response_id chaining, and response reuse as follow-up input.
Error handling and backend wiring
tests/test_openai_server.py
Tests rejection of malformed text parts, unsupported multimodal (input_image), and structured output (json_schema); RolloutCompletionBackend sampling_params filtering; TransformersCompletionBackend streaming error propagation as OpenAIServingError.
Tool calling and reasoning semantics
tests/test_openai_server.py
Tests forced tool-call output shape, rejection of required-argument inference, rejection of ambiguous required tool choice, reasoning output omission of raw thoughts, function_call_output as follow-up input, and stream_response function-call argument delta/done events.
Streaming behavior for Completions, Chat Completions, and Responses
tests/test_openai_server.py
Tests stream_completion initial chunk and preserved finish_reason, stream_chat_completion data-only SSE chunks ending with [DONE], /v1/responses streaming event markers, HTTP lifecycle for create/retrieve/list/delete.
Usage documentation and navigation updates
docs/.nav.yml, docs/usage/README.md, docs/api/README.md, docs/usage/openai-compatible-serving.md
Adds comprehensive usage guide covering endpoint specifications, Transformers backend setup with curl examples, Rollout backend vLLM configuration, optional GPU smoke test, tool/function-call semantics, reasoning output behavior, response state management, and in-memory storage; updates navigation and API documentation.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Handler as BaseHTTPRequestHandler
  participant Service as OpenAICompletionService
  participant Backend as CompletionBackend

  rect rgba(70, 130, 180, 0.5)
    Note over Client,Backend: Non-streaming POST /v1/responses
    Client->>Handler: POST /v1/responses {input, model, tools}
    Handler->>Service: create_response(body)
    Service->>Service: _reject_unsupported_response_fields()
    Service->>Service: _forced_function_call_item()
    Service->>Backend: generate(prompts, n, max_tokens, ...)
    Backend-->>Service: list[GeneratedText]
    Service->>Service: _response_object(generations)
    Service->>Service: store response in memory
    Service-->>Handler: response dict
    Handler-->>Client: 200 JSON
  end

  rect rgba(60, 179, 113, 0.5)
    Note over Client,Backend: Streaming POST /v1/responses
    Client->>Handler: POST /v1/responses {stream: true}
    Handler->>Service: stream_response(body)
    Service->>Backend: stream_generate(prompt, ...)
    Backend-->>Service: token chunks
    Service-->>Handler: SSE events (created/in_progress/delta/completed)
    Handler-->>Client: text/event-stream
  end

  rect rgba(205, 92, 92, 0.5)
    Note over Client,Service: Response lifecycle
    Client->>Handler: GET /v1/responses/{id}
    Handler->>Service: get_response(id)
    Service-->>Client: stored response JSON
    Client->>Handler: DELETE /v1/responses/{id}
    Handler->>Service: delete_response(id)
    Service-->>Client: deleted confirmation JSON
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A server hops to life today,
With endpoints lined up, come what may.
Completions flow and responses stream,
Tool calls dance in a typed-SSE dream.
The rabbit builds what models need—
Local HTTP at bunny speed! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add OpenAI-compatible local serving' clearly summarizes the main change: adding OpenAI-compatible serving functionality to RL-Kernel.
Linked Issues check ✅ Passed The PR fully implements the objective from #21: OpenAI-compatible HTTP serving with chat completion and generation endpoints supporting Transformers and RolloutExecutor backends.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing OpenAI-compatible serving: new server implementation, documentation, navigation updates, and comprehensive tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from e600a56 to 3c2a082 Compare June 23, 2026 09:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
rl_engine/executors/openai_server.py (1)

259-262: 🚀 Performance & Scalability | 🔵 Trivial | ⚖️ Poor tradeoff

Background generation can outlive the stream on early stop.

When _stream_with_stop returns early (stop sequence hit), the model.generate thread keeps decoding up to max_tokens; thread.join(timeout=5) only waits briefly and the daemon thread continues consuming GPU/CPU in the background. Consider passing a StoppingCriteria (or the resolved stop sequences) into generate so the worker terminates when the consumer stops, rather than relying on the timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rl_engine/executors/openai_server.py` around lines 259 - 262, The issue is
that when _stream_with_stop returns early due to a stop sequence match, the
model.generate thread continues decoding up to max_tokens while only waiting
briefly with thread.join(timeout=5), causing unnecessary background resource
consumption. To fix this, pass a StoppingCriteria object (or the resolved stop
sequences) into the model.generate call so the generation thread terminates
early when the stop condition is met, rather than relying on a brief timeout to
join the daemon thread.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/usage/openai-compatible-serving.md`:
- Around line 21-27: The paragraph incorrectly lists text.format alongside
actual response fields in the Responses object. Since text.format is a
request-time option and not part of the response payload, remove it from the
list of response fields that includes output, output_text, reasoning, and usage
fields. Ensure the documentation clearly distinguishes between request-time
options and the actual fields returned in the Responses object to avoid
misleading readers about the API contract.

In `@rl_engine/executors/openai_server.py`:
- Around line 41-67: The code formatting in the file
rl_engine/executors/openai_server.py does not meet the black formatter's
standards. Run the black formatter locally on this file by executing `black
rl_engine/executors/openai_server.py` in your terminal to automatically reformat
the Protocol method bodies and ellipsis in the methods format_chat_prompt,
generate, and stream_generate, as well as the SSE list construction around line
969-973. Commit the reformatted changes to resolve the linting failure.
- Around line 295-306: The sampling_params dictionary in
RolloutCompletionBackend.generate() is being populated with all keys from the
extra dictionary without filtering, which includes invalid vLLM SamplingParams
keys like model, messages, prompt, stream, n, and stop. This causes a TypeError
when vLLM's SamplingParams constructor rejects unknown keys. Create a whitelist
of allowed sampling parameter keys similar to what TransformersCompletionBackend
does (lines 222-223), then filter the extra dictionary to only include those
allowed keys when building the sampling_params dictionary instead of copying the
entire extra dictionary.

In `@tests/test_openai_server.py`:
- Around line 201-205: The code snippet in the deltas list comprehension (lines
201-205) and the similar code segment mentioned at lines 340-345 do not conform
to Black formatting standards, which is causing the CI linting pipeline to fail.
Run the Black code formatter on the tests/test_openai_server.py file to
automatically format the entire file according to Black standards, then commit
the formatting-only changes to resolve the CI blocker.
- Around line 229-230: The pytest.raises call in the retrieve_response test is
using the overly broad Exception type instead of a specific exception. Replace
Exception with OpenAIServingError in the pytest.raises statement while keeping
the match="not found" pattern to make the test stricter and prevent masking
unrelated failures during the retrieve_response call.

---

Nitpick comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 259-262: The issue is that when _stream_with_stop returns early
due to a stop sequence match, the model.generate thread continues decoding up to
max_tokens while only waiting briefly with thread.join(timeout=5), causing
unnecessary background resource consumption. To fix this, pass a
StoppingCriteria object (or the resolved stop sequences) into the model.generate
call so the generation thread terminates early when the stop condition is met,
rather than relying on a brief timeout to join the daemon thread.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d888eccb-0e57-489b-8bc8-c345ca93c45d

📥 Commits

Reviewing files that changed from the base of the PR and between 51b8b21 and e600a56.

📒 Files selected for processing (7)
  • benchmarks/benchmark_sampling.py
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md
  • rl_engine/executors/openai_server.py
  • tests/test_openai_server.py

Comment thread docs/usage/openai-compatible-serving.md Outdated
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread tests/test_openai_server.py Outdated
Comment thread tests/test_openai_server.py Outdated
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from 3c2a082 to 227cc5c Compare June 23, 2026 09:46

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 1090-1104: The _chat_messages function and _content_to_text
function (referenced at lines 1175-1184) currently silently drop unsupported
multimodal content parts like image_url or audio instead of rejecting them with
an error. Add validation logic to check each message's content and raise an
OpenAIServingError when encountering unsupported content parts (such as mapping
objects without a text field) rather than silently ignoring them. This ensures
consistent error handling aligned with the PR's conservative approach to
unsupported features.
- Around line 934-935: The `log_message` method has a parameter named `format`
which shadows Python's built-in `format` function, triggering Ruff A002. Rename
the `format` parameter to a non-conflicting name like `msg_format` or `fmt`, and
update its usage in the string formatting operation (format % args) to use the
new parameter name.
- Around line 381-388: The delete_response method has a race condition where the
membership check on response_id and the subsequent deletion are separate
operations, allowing two concurrent DELETE requests for the same response to
both pass the check. Fix this by replacing the separate if statement that checks
membership and the del statement with a single atomic dict.pop() call that
removes and returns the response in one operation, then check if the result is
None to raise the not_found_error instead of risking a KeyError on concurrent
deletes.
- Around line 223-259: The stream_generate method runs model.generate in a
daemon thread without exception handling, causing the request to hang
indefinitely if generation fails. Wrap the target function passed to
threading.Thread in a try-except block that catches any exceptions from
model.generate, calls streamer.end() to signal termination when an exception
occurs, and stores the exception in an instance variable or local variable
accessible to the main thread. Then, in the main thread within the try block
after yield from _stream_with_stop, check for any stored exceptions and re-raise
them to propagate the error instead of allowing the stream to block
indefinitely.

In `@tests/test_openai_server.py`:
- Around line 525-567: The helper functions _get_json, _post_json, _post_raw,
and _delete_json accept any URL without validating the scheme, which triggers
Ruff S310 security audit warnings. Add scheme validation at the start of each
function to ensure the URL only uses http or https schemes (e.g., check that the
URL starts with either "http://" or "https://"), and raise an AssertionError if
the scheme is invalid. Then add noqa: S310 comments to each
urllib.request.urlopen call to explicitly suppress the audit warning since the
URLs are now validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88666bb8-29ca-4156-b830-c2006822b8df

📥 Commits

Reviewing files that changed from the base of the PR and between e600a56 and 3c2a082.

📒 Files selected for processing (6)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md
  • rl_engine/executors/openai_server.py
  • tests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
  • docs/api/README.md
  • docs/usage/README.md
  • docs/.nav.yml
  • docs/usage/openai-compatible-serving.md

Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py
Comment thread tests/test_openai_server.py
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from 227cc5c to 0586e5e Compare June 23, 2026 09:55
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from 0586e5e to 5e70594 Compare June 23, 2026 14:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 55-64: The stream_generate() method currently only yields text
content without preserving the backend finish reason, causing all streaming
endpoints to report "stop" regardless of whether generation actually stopped due
to max_tokens or other reasons. Modify the stream_generate() method signature
and implementation to thread the finish reason through the streaming backend
contract alongside the yielded text, then update all references to this method
(including the implementations at the locations noted) to capture and propagate
the actual finish reason from the backend, finally use this finish reason when
constructing terminal completion/chat chunks and the GeneratedText object in the
Response instead of defaulting to "stop".
- Around line 525-565: Add a chat-specific validator function that checks for
unsupported fields in the request body (tools, tool_choice, functions,
function_call, response_format, modalities, audio, logprobs, and top_logprobs)
and raises an OpenAI-style error if any are present. Call this validator early
in both the create_chat_completion and stream_chat_completion methods, before
the prompt formatting step (before calling format_chat_prompt), to ensure
clients receive an appropriate error response for unsupported features rather
than proceeding with normal text generation.
- Around line 1327-1332: The text part validation is inconsistent between
_content_to_input_content() and _responses_item_to_text() functions. The first
function (around lines 1327-1332) coerces missing or non-string text to empty
string, while the second function silently drops malformed parts. Instead, both
functions should validate that the text or input_text field contains a valid
string value, and raise OpenAIServingError if it's missing or not a string.
Update the code in both locations to check if the text/input_text value is a
string, and raise OpenAIServingError with a descriptive message if validation
fails, rather than coercing to empty string or silently dropping the part.
- Around line 351-370: The stream_generate() method in RolloutCompletionBackend
blocks on the full self.generate() call before yielding any chunks, causing the
HTTP response to appear hung while waiting for the first SSE event. Refactor
this method to emit an initial chunk (such as an empty completion chunk) before
calling the blocking generate() method, or alternatively restructure to use a
streaming approach that yields chunks as they become available rather than
waiting for the complete generation to finish. This will allow the HTTP headers
to be sent immediately while generation proceeds in the background.
- Around line 449-453: Move the _response_model() call before backend.generate()
in both the create_completion() and create_chat_completion() methods (including
their non-streaming paths). The model validation should happen first to ensure
the model is valid before invoking the expensive backend.generate() operation,
preventing wasted resources when an invalid model like [] is passed in the
request body.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a320365e-7feb-487f-9828-3bc7d144e99c

📥 Commits

Reviewing files that changed from the base of the PR and between 3c2a082 and 5e70594.

📒 Files selected for processing (6)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md
  • rl_engine/executors/openai_server.py
  • tests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md

Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from 5e70594 to 4a801e0 Compare June 23, 2026 16:27

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
rl_engine/executors/openai_server.py (1)

989-993: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cap JSON request bodies before reading them.

_read_json() trusts Content-Length and reads that many bytes into memory per worker. A large request can exhaust memory or tie up ThreadingHTTPServer threads; reject oversized bodies before rfile.read().

Proposed request-size guard
+_MAX_REQUEST_BODY_BYTES = 4 * 1024 * 1024
+
@@
         def _read_json(self) -> Mapping[str, Any]:
-            length = int(self.headers.get("Content-Length", "0"))
+            try:
+                length = int(self.headers.get("Content-Length", "0"))
+            except ValueError as exc:
+                raise OpenAIServingError("Content-Length must be an integer.") from exc
             if length <= 0:
                 raise OpenAIServingError("Request body must be a JSON object.")
+            if length > _MAX_REQUEST_BODY_BYTES:
+                raise OpenAIServingError(
+                    "Request body is too large.",
+                    status_code=HTTPStatus(413),
+                )
             raw = self.rfile.read(length)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rl_engine/executors/openai_server.py` around lines 989 - 993, The
_read_json() method currently trusts the Content-Length header and reads that
many bytes directly into memory without any size validation. Add a maximum
request body size limit and check if the parsed Content-Length exceeds this
limit before calling rfile.read(length). If the content length exceeds the
configured maximum, raise an OpenAIServingError with an appropriate message
indicating the request body is too large, preventing potential memory exhaustion
or thread starvation in the ThreadingHTTPServer.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 474-478: The create_completion method currently accepts
unsupported Completions options like echo, suffix, logprobs, best_of, and
stream_options in the body parameter but does not implement their semantics,
leading to silent failures. Add a completions-specific validator function
(similar to existing validators for Chat and Responses paths) that checks the
body parameter for these unsupported options and raises OpenAI-style errors
before proceeding. Call this validator at the beginning of create_completion,
before the calls to _completion_prompts, _generation_params, _response_model,
and self.backend.generate, to ensure unsupported capabilities are rejected early
rather than silently ignored.
- Around line 1367-1387: The function _normalize_response_input_item currently
accepts any item type without validation, causing unsupported types like
"input_image" to be silently dropped. Define an allow-list of valid item types
that includes the currently handled types: "message", "function_call_output",
"custom_tool_call_output", "local_shell_call_output", and
"mcp_approval_response". After the existing conditional checks for item_type,
add a final check that raises an OpenAIServingError with an appropriate message
if the item_type is not in the allow-list, ensuring unknown multimodal and tool
inputs are properly rejected rather than silently ignored.
- Around line 1213-1236: The `stop` parameter in the `_generation_params()`
function is not validated before being returned, which can cause failures later
in `_apply_stop()` or `_stop_sequences()` when invalid values are processed
after generation starts. Add validation logic in `_generation_params()` to check
that the `stop` value from the request body is either None, a string, or a list
of strings. If the value is invalid (for example, a scalar non-string type),
raise an OpenAIServingError with an appropriate error message and param name.
This validation should happen before the return statement so that clients
receive a 400 error early, before any backend processing begins.
- Around line 274-303: When a chunk with a finish_reason is detected in the
streaming loop (in the section where chunk.finish_reason is checked), the code
sets the finish_reason variable and continues, but the generation thread
continues running in the background. Call streamer.end() in that same block
(when chunk.finish_reason is true) to signal the generation thread to stop
processing before the response is finalized. This ensures the thread stops
instead of continuing to consume compute resources until max_tokens is reached.

In `@tests/test_openai_server.py`:
- Around line 92-93: Mutable class attributes like the lists prompt_token_ids
and token_ids (at lines 92 and 93) and the additional mutable container at line
97 can be shared across test instances and mutated during test execution,
causing state contamination between tests. Move these mutable containers from
class-level definition to instance-level initialization by defining them within
an __init__ method of the test double class so each test instance gets its own
independent copy.

---

Duplicate comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 989-993: The _read_json() method currently trusts the
Content-Length header and reads that many bytes directly into memory without any
size validation. Add a maximum request body size limit and check if the parsed
Content-Length exceeds this limit before calling rfile.read(length). If the
content length exceeds the configured maximum, raise an OpenAIServingError with
an appropriate message indicating the request body is too large, preventing
potential memory exhaustion or thread starvation in the ThreadingHTTPServer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47d8dbd6-0139-429d-82aa-592ce97bf542

📥 Commits

Reviewing files that changed from the base of the PR and between 5e70594 and 4a801e0.

📒 Files selected for processing (6)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md
  • rl_engine/executors/openai_server.py
  • tests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
  • docs/api/README.md
  • docs/.nav.yml
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md

Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread tests/test_openai_server.py Outdated
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch 2 times, most recently from 04dfbc5 to 07b3991 Compare June 23, 2026 16:58

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 420-423: The _responses dictionary in the __init__ method of the
OpenaiApiServer class grows unbounded with no eviction policy, causing memory
leaks in long-running servers. Modify the initialization of _responses to
include a configurable maximum size limit and time-to-live (TTL) mechanism for
stored entries. Add a threading lock instance variable to ensure thread-safe
access during eviction operations. Then, update the code at lines 886-897 where
responses are stored and retrieved to implement eviction logic that removes
expired or excess entries before adding new ones, using the lock to protect
multi-step mutations. Make the max size and TTL configurable through the class
constructor parameters or environment variables.
- Around line 663-680: The create_response method calls backend.generate()
without first validating the metadata field, causing expensive GPU work to be
performed even when the request has invalid metadata that would later be
rejected by _response_object(). Move the metadata validation to occur earlier in
the create_response method, before the backend.generate() call on line 680.
Either move the existing metadata validation logic from _response_object() to
occur before backend.generate(), or enhance the
_reject_unsupported_response_fields() call at the beginning of create_response()
to also validate that metadata is an object if present, ensuring invalid
requests are rejected before any GPU work is performed.
- Around line 1012-1017: The _read_json method reads the entire request body
into memory based on a client-supplied Content-Length header without any size
validation, which can cause memory exhaustion or thread pinning. Add a maximum
request body size constant (define it at the class or module level), and before
reading from self.rfile, validate that the length does not exceed this maximum
limit. If the length exceeds the maximum, raise an OpenAIServingError with an
appropriate message before attempting to read the data with
self.rfile.read(length).
- Around line 1352-1361: The _positive_int function currently uses int(value)
which silently truncates floating point numbers (e.g., 1.9 becomes 1) instead of
validating that the input is truly an integer. To fix this, add explicit
validation in the _positive_int function to detect fractional values before
converting to int. Check if the value is a float or if converting it to int
would lose precision by comparing the converted integer back to the original
value, and raise an OpenAIServingError if a fractional part is detected. This
validation should occur after the boolean check but before accepting the parsed
integer value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eedf0f3-c0f3-48d8-bd60-57ea0cbc553c

📥 Commits

Reviewing files that changed from the base of the PR and between 4a801e0 and 04dfbc5.

📒 Files selected for processing (6)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md
  • rl_engine/executors/openai_server.py
  • tests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
  • docs/.nav.yml
  • docs/api/README.md
  • docs/usage/README.md
  • docs/usage/openai-compatible-serving.md

Comment thread rl_engine/executors/openai_server.py Outdated
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
Comment thread rl_engine/executors/openai_server.py
@haoruilee haoruilee force-pushed the codex/openai-compatible-serving branch from 07b3991 to 2f6b460 Compare June 23, 2026 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEAT][api]: implement OpenAI-compatible entry point for local serving

2 participants