Add OpenAI-compatible local serving#183
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds ChangesOpenAI-Compatible Server Implementation
Sequence Diagram(s)sequenceDiagram
participant Client
participant Handler as BaseHTTPRequestHandler
participant Service as OpenAICompletionService
participant Backend as CompletionBackend
rect rgba(70, 130, 180, 0.5)
Note over Client,Backend: Non-streaming POST /v1/responses
Client->>Handler: POST /v1/responses {input, model, tools}
Handler->>Service: create_response(body)
Service->>Service: _reject_unsupported_response_fields()
Service->>Service: _forced_function_call_item()
Service->>Backend: generate(prompts, n, max_tokens, ...)
Backend-->>Service: list[GeneratedText]
Service->>Service: _response_object(generations)
Service->>Service: store response in memory
Service-->>Handler: response dict
Handler-->>Client: 200 JSON
end
rect rgba(60, 179, 113, 0.5)
Note over Client,Backend: Streaming POST /v1/responses
Client->>Handler: POST /v1/responses {stream: true}
Handler->>Service: stream_response(body)
Service->>Backend: stream_generate(prompt, ...)
Backend-->>Service: token chunks
Service-->>Handler: SSE events (created/in_progress/delta/completed)
Handler-->>Client: text/event-stream
end
rect rgba(205, 92, 92, 0.5)
Note over Client,Service: Response lifecycle
Client->>Handler: GET /v1/responses/{id}
Handler->>Service: get_response(id)
Service-->>Client: stored response JSON
Client->>Handler: DELETE /v1/responses/{id}
Handler->>Service: delete_response(id)
Service-->>Client: deleted confirmation JSON
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e600a56 to
3c2a082
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
rl_engine/executors/openai_server.py (1)
259-262: 🚀 Performance & Scalability | 🔵 Trivial | ⚖️ Poor tradeoffBackground generation can outlive the stream on early stop.
When
_stream_with_stopreturns early (stop sequence hit), themodel.generatethread keeps decoding up tomax_tokens;thread.join(timeout=5)only waits briefly and the daemon thread continues consuming GPU/CPU in the background. Consider passing aStoppingCriteria(or the resolved stop sequences) intogenerateso the worker terminates when the consumer stops, rather than relying on the timeout.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rl_engine/executors/openai_server.py` around lines 259 - 262, The issue is that when _stream_with_stop returns early due to a stop sequence match, the model.generate thread continues decoding up to max_tokens while only waiting briefly with thread.join(timeout=5), causing unnecessary background resource consumption. To fix this, pass a StoppingCriteria object (or the resolved stop sequences) into the model.generate call so the generation thread terminates early when the stop condition is met, rather than relying on a brief timeout to join the daemon thread.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/usage/openai-compatible-serving.md`:
- Around line 21-27: The paragraph incorrectly lists text.format alongside
actual response fields in the Responses object. Since text.format is a
request-time option and not part of the response payload, remove it from the
list of response fields that includes output, output_text, reasoning, and usage
fields. Ensure the documentation clearly distinguishes between request-time
options and the actual fields returned in the Responses object to avoid
misleading readers about the API contract.
In `@rl_engine/executors/openai_server.py`:
- Around line 41-67: The code formatting in the file
rl_engine/executors/openai_server.py does not meet the black formatter's
standards. Run the black formatter locally on this file by executing `black
rl_engine/executors/openai_server.py` in your terminal to automatically reformat
the Protocol method bodies and ellipsis in the methods format_chat_prompt,
generate, and stream_generate, as well as the SSE list construction around line
969-973. Commit the reformatted changes to resolve the linting failure.
- Around line 295-306: The sampling_params dictionary in
RolloutCompletionBackend.generate() is being populated with all keys from the
extra dictionary without filtering, which includes invalid vLLM SamplingParams
keys like model, messages, prompt, stream, n, and stop. This causes a TypeError
when vLLM's SamplingParams constructor rejects unknown keys. Create a whitelist
of allowed sampling parameter keys similar to what TransformersCompletionBackend
does (lines 222-223), then filter the extra dictionary to only include those
allowed keys when building the sampling_params dictionary instead of copying the
entire extra dictionary.
In `@tests/test_openai_server.py`:
- Around line 201-205: The code snippet in the deltas list comprehension (lines
201-205) and the similar code segment mentioned at lines 340-345 do not conform
to Black formatting standards, which is causing the CI linting pipeline to fail.
Run the Black code formatter on the tests/test_openai_server.py file to
automatically format the entire file according to Black standards, then commit
the formatting-only changes to resolve the CI blocker.
- Around line 229-230: The pytest.raises call in the retrieve_response test is
using the overly broad Exception type instead of a specific exception. Replace
Exception with OpenAIServingError in the pytest.raises statement while keeping
the match="not found" pattern to make the test stricter and prevent masking
unrelated failures during the retrieve_response call.
---
Nitpick comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 259-262: The issue is that when _stream_with_stop returns early
due to a stop sequence match, the model.generate thread continues decoding up to
max_tokens while only waiting briefly with thread.join(timeout=5), causing
unnecessary background resource consumption. To fix this, pass a
StoppingCriteria object (or the resolved stop sequences) into the model.generate
call so the generation thread terminates early when the stop condition is met,
rather than relying on a brief timeout to join the daemon thread.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d888eccb-0e57-489b-8bc8-c345ca93c45d
📒 Files selected for processing (7)
benchmarks/benchmark_sampling.pydocs/.nav.ymldocs/api/README.mddocs/usage/README.mddocs/usage/openai-compatible-serving.mdrl_engine/executors/openai_server.pytests/test_openai_server.py
3c2a082 to
227cc5c
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 1090-1104: The _chat_messages function and _content_to_text
function (referenced at lines 1175-1184) currently silently drop unsupported
multimodal content parts like image_url or audio instead of rejecting them with
an error. Add validation logic to check each message's content and raise an
OpenAIServingError when encountering unsupported content parts (such as mapping
objects without a text field) rather than silently ignoring them. This ensures
consistent error handling aligned with the PR's conservative approach to
unsupported features.
- Around line 934-935: The `log_message` method has a parameter named `format`
which shadows Python's built-in `format` function, triggering Ruff A002. Rename
the `format` parameter to a non-conflicting name like `msg_format` or `fmt`, and
update its usage in the string formatting operation (format % args) to use the
new parameter name.
- Around line 381-388: The delete_response method has a race condition where the
membership check on response_id and the subsequent deletion are separate
operations, allowing two concurrent DELETE requests for the same response to
both pass the check. Fix this by replacing the separate if statement that checks
membership and the del statement with a single atomic dict.pop() call that
removes and returns the response in one operation, then check if the result is
None to raise the not_found_error instead of risking a KeyError on concurrent
deletes.
- Around line 223-259: The stream_generate method runs model.generate in a
daemon thread without exception handling, causing the request to hang
indefinitely if generation fails. Wrap the target function passed to
threading.Thread in a try-except block that catches any exceptions from
model.generate, calls streamer.end() to signal termination when an exception
occurs, and stores the exception in an instance variable or local variable
accessible to the main thread. Then, in the main thread within the try block
after yield from _stream_with_stop, check for any stored exceptions and re-raise
them to propagate the error instead of allowing the stream to block
indefinitely.
In `@tests/test_openai_server.py`:
- Around line 525-567: The helper functions _get_json, _post_json, _post_raw,
and _delete_json accept any URL without validating the scheme, which triggers
Ruff S310 security audit warnings. Add scheme validation at the start of each
function to ensure the URL only uses http or https schemes (e.g., check that the
URL starts with either "http://" or "https://"), and raise an AssertionError if
the scheme is invalid. Then add noqa: S310 comments to each
urllib.request.urlopen call to explicitly suppress the audit warning since the
URLs are now validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88666bb8-29ca-4156-b830-c2006822b8df
📒 Files selected for processing (6)
docs/.nav.ymldocs/api/README.mddocs/usage/README.mddocs/usage/openai-compatible-serving.mdrl_engine/executors/openai_server.pytests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
- docs/api/README.md
- docs/usage/README.md
- docs/.nav.yml
- docs/usage/openai-compatible-serving.md
227cc5c to
0586e5e
Compare
0586e5e to
5e70594
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 55-64: The stream_generate() method currently only yields text
content without preserving the backend finish reason, causing all streaming
endpoints to report "stop" regardless of whether generation actually stopped due
to max_tokens or other reasons. Modify the stream_generate() method signature
and implementation to thread the finish reason through the streaming backend
contract alongside the yielded text, then update all references to this method
(including the implementations at the locations noted) to capture and propagate
the actual finish reason from the backend, finally use this finish reason when
constructing terminal completion/chat chunks and the GeneratedText object in the
Response instead of defaulting to "stop".
- Around line 525-565: Add a chat-specific validator function that checks for
unsupported fields in the request body (tools, tool_choice, functions,
function_call, response_format, modalities, audio, logprobs, and top_logprobs)
and raises an OpenAI-style error if any are present. Call this validator early
in both the create_chat_completion and stream_chat_completion methods, before
the prompt formatting step (before calling format_chat_prompt), to ensure
clients receive an appropriate error response for unsupported features rather
than proceeding with normal text generation.
- Around line 1327-1332: The text part validation is inconsistent between
_content_to_input_content() and _responses_item_to_text() functions. The first
function (around lines 1327-1332) coerces missing or non-string text to empty
string, while the second function silently drops malformed parts. Instead, both
functions should validate that the text or input_text field contains a valid
string value, and raise OpenAIServingError if it's missing or not a string.
Update the code in both locations to check if the text/input_text value is a
string, and raise OpenAIServingError with a descriptive message if validation
fails, rather than coercing to empty string or silently dropping the part.
- Around line 351-370: The stream_generate() method in RolloutCompletionBackend
blocks on the full self.generate() call before yielding any chunks, causing the
HTTP response to appear hung while waiting for the first SSE event. Refactor
this method to emit an initial chunk (such as an empty completion chunk) before
calling the blocking generate() method, or alternatively restructure to use a
streaming approach that yields chunks as they become available rather than
waiting for the complete generation to finish. This will allow the HTTP headers
to be sent immediately while generation proceeds in the background.
- Around line 449-453: Move the _response_model() call before backend.generate()
in both the create_completion() and create_chat_completion() methods (including
their non-streaming paths). The model validation should happen first to ensure
the model is valid before invoking the expensive backend.generate() operation,
preventing wasted resources when an invalid model like [] is passed in the
request body.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a320365e-7feb-487f-9828-3bc7d144e99c
📒 Files selected for processing (6)
docs/.nav.ymldocs/api/README.mddocs/usage/README.mddocs/usage/openai-compatible-serving.mdrl_engine/executors/openai_server.pytests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
- docs/.nav.yml
- docs/api/README.md
- docs/usage/README.md
- docs/usage/openai-compatible-serving.md
5e70594 to
4a801e0
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (1)
rl_engine/executors/openai_server.py (1)
989-993: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winCap JSON request bodies before reading them.
_read_json()trustsContent-Lengthand reads that many bytes into memory per worker. A large request can exhaust memory or tie upThreadingHTTPServerthreads; reject oversized bodies beforerfile.read().Proposed request-size guard
+_MAX_REQUEST_BODY_BYTES = 4 * 1024 * 1024 + @@ def _read_json(self) -> Mapping[str, Any]: - length = int(self.headers.get("Content-Length", "0")) + try: + length = int(self.headers.get("Content-Length", "0")) + except ValueError as exc: + raise OpenAIServingError("Content-Length must be an integer.") from exc if length <= 0: raise OpenAIServingError("Request body must be a JSON object.") + if length > _MAX_REQUEST_BODY_BYTES: + raise OpenAIServingError( + "Request body is too large.", + status_code=HTTPStatus(413), + ) raw = self.rfile.read(length)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rl_engine/executors/openai_server.py` around lines 989 - 993, The _read_json() method currently trusts the Content-Length header and reads that many bytes directly into memory without any size validation. Add a maximum request body size limit and check if the parsed Content-Length exceeds this limit before calling rfile.read(length). If the content length exceeds the configured maximum, raise an OpenAIServingError with an appropriate message indicating the request body is too large, preventing potential memory exhaustion or thread starvation in the ThreadingHTTPServer.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 474-478: The create_completion method currently accepts
unsupported Completions options like echo, suffix, logprobs, best_of, and
stream_options in the body parameter but does not implement their semantics,
leading to silent failures. Add a completions-specific validator function
(similar to existing validators for Chat and Responses paths) that checks the
body parameter for these unsupported options and raises OpenAI-style errors
before proceeding. Call this validator at the beginning of create_completion,
before the calls to _completion_prompts, _generation_params, _response_model,
and self.backend.generate, to ensure unsupported capabilities are rejected early
rather than silently ignored.
- Around line 1367-1387: The function _normalize_response_input_item currently
accepts any item type without validation, causing unsupported types like
"input_image" to be silently dropped. Define an allow-list of valid item types
that includes the currently handled types: "message", "function_call_output",
"custom_tool_call_output", "local_shell_call_output", and
"mcp_approval_response". After the existing conditional checks for item_type,
add a final check that raises an OpenAIServingError with an appropriate message
if the item_type is not in the allow-list, ensuring unknown multimodal and tool
inputs are properly rejected rather than silently ignored.
- Around line 1213-1236: The `stop` parameter in the `_generation_params()`
function is not validated before being returned, which can cause failures later
in `_apply_stop()` or `_stop_sequences()` when invalid values are processed
after generation starts. Add validation logic in `_generation_params()` to check
that the `stop` value from the request body is either None, a string, or a list
of strings. If the value is invalid (for example, a scalar non-string type),
raise an OpenAIServingError with an appropriate error message and param name.
This validation should happen before the return statement so that clients
receive a 400 error early, before any backend processing begins.
- Around line 274-303: When a chunk with a finish_reason is detected in the
streaming loop (in the section where chunk.finish_reason is checked), the code
sets the finish_reason variable and continues, but the generation thread
continues running in the background. Call streamer.end() in that same block
(when chunk.finish_reason is true) to signal the generation thread to stop
processing before the response is finalized. This ensures the thread stops
instead of continuing to consume compute resources until max_tokens is reached.
In `@tests/test_openai_server.py`:
- Around line 92-93: Mutable class attributes like the lists prompt_token_ids
and token_ids (at lines 92 and 93) and the additional mutable container at line
97 can be shared across test instances and mutated during test execution,
causing state contamination between tests. Move these mutable containers from
class-level definition to instance-level initialization by defining them within
an __init__ method of the test double class so each test instance gets its own
independent copy.
---
Duplicate comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 989-993: The _read_json() method currently trusts the
Content-Length header and reads that many bytes directly into memory without any
size validation. Add a maximum request body size limit and check if the parsed
Content-Length exceeds this limit before calling rfile.read(length). If the
content length exceeds the configured maximum, raise an OpenAIServingError with
an appropriate message indicating the request body is too large, preventing
potential memory exhaustion or thread starvation in the ThreadingHTTPServer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47d8dbd6-0139-429d-82aa-592ce97bf542
📒 Files selected for processing (6)
docs/.nav.ymldocs/api/README.mddocs/usage/README.mddocs/usage/openai-compatible-serving.mdrl_engine/executors/openai_server.pytests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
- docs/api/README.md
- docs/.nav.yml
- docs/usage/README.md
- docs/usage/openai-compatible-serving.md
04dfbc5 to
07b3991
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rl_engine/executors/openai_server.py`:
- Around line 420-423: The _responses dictionary in the __init__ method of the
OpenaiApiServer class grows unbounded with no eviction policy, causing memory
leaks in long-running servers. Modify the initialization of _responses to
include a configurable maximum size limit and time-to-live (TTL) mechanism for
stored entries. Add a threading lock instance variable to ensure thread-safe
access during eviction operations. Then, update the code at lines 886-897 where
responses are stored and retrieved to implement eviction logic that removes
expired or excess entries before adding new ones, using the lock to protect
multi-step mutations. Make the max size and TTL configurable through the class
constructor parameters or environment variables.
- Around line 663-680: The create_response method calls backend.generate()
without first validating the metadata field, causing expensive GPU work to be
performed even when the request has invalid metadata that would later be
rejected by _response_object(). Move the metadata validation to occur earlier in
the create_response method, before the backend.generate() call on line 680.
Either move the existing metadata validation logic from _response_object() to
occur before backend.generate(), or enhance the
_reject_unsupported_response_fields() call at the beginning of create_response()
to also validate that metadata is an object if present, ensuring invalid
requests are rejected before any GPU work is performed.
- Around line 1012-1017: The _read_json method reads the entire request body
into memory based on a client-supplied Content-Length header without any size
validation, which can cause memory exhaustion or thread pinning. Add a maximum
request body size constant (define it at the class or module level), and before
reading from self.rfile, validate that the length does not exceed this maximum
limit. If the length exceeds the maximum, raise an OpenAIServingError with an
appropriate message before attempting to read the data with
self.rfile.read(length).
- Around line 1352-1361: The _positive_int function currently uses int(value)
which silently truncates floating point numbers (e.g., 1.9 becomes 1) instead of
validating that the input is truly an integer. To fix this, add explicit
validation in the _positive_int function to detect fractional values before
converting to int. Check if the value is a float or if converting it to int
would lose precision by comparing the converted integer back to the original
value, and raise an OpenAIServingError if a fractional part is detected. This
validation should occur after the boolean check but before accepting the parsed
integer value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3eedf0f3-c0f3-48d8-bd60-57ea0cbc553c
📒 Files selected for processing (6)
docs/.nav.ymldocs/api/README.mddocs/usage/README.mddocs/usage/openai-compatible-serving.mdrl_engine/executors/openai_server.pytests/test_openai_server.py
✅ Files skipped from review due to trivial changes (4)
- docs/.nav.yml
- docs/api/README.md
- docs/usage/README.md
- docs/usage/openai-compatible-serving.md
07b3991 to
2f6b460
Compare
Adds a lightweight OpenAI-compatible local serving entry point for RL-Kernel.
It exposes completions, chat completions, and Responses endpoints, including SSE streaming and in-memory Responses retrieval/input item support. The server can run with a local Transformers model or through
RolloutExecutor.A few choices are intentionally conservative: hosted-only features such as built-in tools, multimodal/audio inputs, background jobs, structured outputs, and real reasoning traces return clear OpenAI-style errors instead of being silently ignored.
Validation:
pre-commit run --all-filesmypy --ignore-missing-imports rl_engine/python3 -m pytest tests/test_openai_server.py -qpython3 -m pytest rl_engine/tests/test_dispatch.py -vmkdocs build --strict -f mkdocs.yamlsshleifer/tiny-gpt2base_urlcreate/stream/retrieve/input_items checkCloses #21
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests