**Current Behavior**
Currently, `ChatOpenAI` uses the OpenAI Responses API by default (as noted in the CLAUDE.md file). Because the Responses API is a proprietary endpoint (`/v1/responses`) specific to OpenAI, pointing `ChatOpenAI` at a third-party inference engine that implements the standard OpenAI Chat Completions API (e.g., vLLM, Ollama, LiteLLM) results in a `404 Not Found` or similar routing error.
While `ChatOpenAICompletions` exists and uses the standard `/v1/chat/completions` endpoint, it is not prominently documented, lacks feature parity (or at least appears to), and this creates a confusing developer experience for users trying to run local models.
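For context, the two APIs differ in both path and payload shape, so a server that only implements Chat Completions cannot answer a Responses-API request. Roughly (illustrative calls via the official `openai` client; the base URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Responses API: POST /v1/responses -- OpenAI-specific request shape.
client.responses.create(model="my-model", input="Hello")

# Chat Completions API: POST /v1/chat/completions -- the widely
# implemented standard (vLLM, Ollama, LiteLLM, ...).
client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
)
```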
**Expected Behavior**
When a user provides a custom `base_url` pointing to a non-OpenAI host (e.g., `http://localhost:8000/v1`), `ChatOpenAI` should ideally fall back to the standard Chat Completions API (`/v1/chat/completions`), which is the de facto standard across the open-source ecosystem.
Alternatively, the documentation and Quick Start guides should prominently feature `ChatOpenAICompletions` as the default choice for local/vLLM deployments, since `ChatOpenAI` is effectively locked to OpenAI's infrastructure.
**Reproduction Steps**
1. Start a local vLLM server:

```bash
vllm serve ./model_repo/Qwen/Qwen3.6-27B --port 8000
```
2. Attempt to connect using `ChatOpenAI` as suggested in the current docs:

```python
from chatlas import ChatOpenAI

chat = ChatOpenAI(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
chat.chat("Hello")
```
The vLLM server log (startup, then the failed requests):

```
(EngineCore pid=4042) WARNING 04-25 13:44:58 [compilation.py:1323] Capping cudagraph capture sizes from max 512 to 336 to fit Mamba cache blocks (338 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:04<00:00, 9.60it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:04<00:00, 8.26it/s]
(EngineCore pid=4042) INFO 04-25 13:45:08 [gpu_model_runner.py:6046] Graph capturing finished in 10 secs, took 0.91 GiB
(EngineCore pid=4042) INFO 04-25 13:45:08 [gpu_worker.py:597] CUDA graph pool memory: 0.91 GiB (actual), 1.13 GiB (estimated), difference: 0.23 GiB (25.2%).
(EngineCore pid=4042) INFO 04-25 13:45:08 [core.py:283] init engine (profile, create kv cache, warmup model) took 36.08 seconds
(EngineCore pid=4042) INFO 04-25 13:45:09 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=4017) INFO 04-25 13:45:09 [api_server.py:592] Supported tasks: ['generate']
(APIServer pid=4017) WARNING 04-25 13:45:09 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=4017) INFO 04-25 13:45:09 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=4017) INFO 04-25 13:45:16 [base.py:231] Multi-modal warmup completed in 6.819s
(APIServer pid=4017) INFO 04-25 13:45:16 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:37] Available routes are:
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=4017) INFO: Started server process [4017]
(APIServer pid=4017) INFO: Waiting for application startup.
(APIServer pid=4017) INFO: Application startup complete.
(APIServer pid=4017) ERROR 04-25 13:45:40 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:50638 - "POST /v1/responses HTTP/1.1" 404 Not Found
(APIServer pid=4017) ERROR 04-25 13:45:47 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:51430 - "POST /v1/responses HTTP/1.1" 404 Not Found
(APIServer pid=4017) ERROR 04-25 13:46:07 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:48322 - "POST /v1/responses HTTP/1.1" 404 Not Found
```
3. The request fails because `chatlas` posts to `http://localhost:8000/v1/responses`, which the vLLM server answers with `404 Not Found`.
**Workaround**
Users must explicitly import and use the under-documented `ChatOpenAICompletions` class:
```python
from chatlas import ChatOpenAICompletions

chat = ChatOpenAICompletions(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
chat.chat("Hello")
**Advice**
I see a few potential ways to improve this, though I'm not sure which aligns best with the project's architecture:
- Smart Routing: If `base_url` is overridden and does not point to `api.openai.com`, automatically default to the Chat Completions API within `ChatOpenAI` (see the sketch after this list).
- Promote `ChatOpenAICompletions`: Since Chat Completions is the open standard, consider making `ChatOpenAICompletions` the primary class for generic OpenAI-compatible backends, and reserve `ChatOpenAI` strictly for the official OpenAI API.
- Documentation Update: At the very least, add a clear "Connecting to vLLM / Local Models" section in the documentation that explicitly instructs users to use `ChatOpenAICompletions` instead of `ChatOpenAI`.
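As a rough illustration of the Smart Routing idea, a hypothetical factory could dispatch on the host (the helper name and the exact host check are mine, not chatlas API):

```python
from urllib.parse import urlparse

from chatlas import ChatOpenAI, ChatOpenAICompletions


def chat_openai_auto(model, base_url=None, api_key=None, **kwargs):
    """Hypothetical helper: use the Responses API only for api.openai.com,
    and fall back to Chat Completions for every other host."""
    host = urlparse(base_url).hostname if base_url else "api.openai.com"
    cls = ChatOpenAI if host == "api.openai.com" else ChatOpenAICompletions
    return cls(model=model, base_url=base_url, api_key=api_key, **kwargs)


# This would transparently pick ChatOpenAICompletions:
chat = chat_openai_auto(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
```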
**Environment**

- chatlas version: 0.16.0
- Python version: Python 3.12.3
- Backend: vLLM (OpenAI compatible server)