**Current Behavior**
Currently, `ChatOpenAI` uses the OpenAI Responses API by default (as noted in the CLAUDE.md file). Because the Responses API is a proprietary endpoint (`/v1/responses`) specific to OpenAI, pointing `ChatOpenAI` at a third-party inference engine that implements the standard OpenAI Chat Completions API (e.g., vLLM, Ollama, LiteLLM) results in a `404 Not Found` or similar routing error.
While `ChatOpenAICompletions` exists and uses the standard `/v1/chat/completions` endpoint, it is not prominently documented, lacks feature parity (or at least appears to), and this creates a confusing developer experience for users trying to run local models.
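For context, the two APIs differ in both path and payload shape, so a server that only implements Chat Completions cannot answer a Responses-API request. Roughly (illustrative calls via the official `openai` client; the base URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Responses API: POST /v1/responses -- OpenAI-specific request shape.
client.responses.create(model="my-model", input="Hello")

# Chat Completions API: POST /v1/chat/completions -- the widely
# implemented standard (vLLM, Ollama, LiteLLM, ...).
client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
)
```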
**Expected Behavior**
When a user provides a custom `base_url` pointing to a non-OpenAI host (e.g., `http://localhost:8000/v1`), `ChatOpenAI` should ideally fall back to the standard Chat Completions API (`/v1/chat/completions`), which is the de facto standard across the open-source ecosystem.
Alternatively, the documentation and Quick Start guides should prominently feature `ChatOpenAICompletions` as the default choice for local/vLLM deployments, since `ChatOpenAI` is effectively locked to OpenAI's infrastructure.
**Reproduction Steps**
1. Start a local vLLM server:

```bash
vllm serve ./model_repo/Qwen/Qwen3.6-27B --port 8000
```
2. Attempt to connect using `ChatOpenAI` as suggested in the current docs:

```python
from chatlas import ChatOpenAI

chat = ChatOpenAI(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
chat.chat("Hello")
```
The vLLM server log (startup, then the failed requests):

```
(EngineCore pid=4042) WARNING 04-25 13:44:58 [compilation.py:1323] Capping cudagraph capture sizes from max 512 to 336 to fit Mamba cache blocks (338 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:04<00:00, 9.60it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:04<00:00, 8.26it/s]
(EngineCore pid=4042) INFO 04-25 13:45:08 [gpu_model_runner.py:6046] Graph capturing finished in 10 secs, took 0.91 GiB
(EngineCore pid=4042) INFO 04-25 13:45:08 [gpu_worker.py:597] CUDA graph pool memory: 0.91 GiB (actual), 1.13 GiB (estimated), difference: 0.23 GiB (25.2%).
(EngineCore pid=4042) INFO 04-25 13:45:08 [core.py:283] init engine (profile, create kv cache, warmup model) took 36.08 seconds
(EngineCore pid=4042) INFO 04-25 13:45:09 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=4017) INFO 04-25 13:45:09 [api_server.py:592] Supported tasks: ['generate']
(APIServer pid=4017) WARNING 04-25 13:45:09 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=4017) INFO 04-25 13:45:09 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=4017) INFO 04-25 13:45:16 [base.py:231] Multi-modal warmup completed in 6.819s
(APIServer pid=4017) INFO 04-25 13:45:16 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:37] Available routes are:
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=4017) INFO 04-25 13:45:16 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=4017) INFO: Started server process [4017]
(APIServer pid=4017) INFO: Waiting for application startup.
(APIServer pid=4017) INFO: Application startup complete.
(APIServer pid=4017) ERROR 04-25 13:45:40 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:50638 - "POST /v1/responses HTTP/1.1" 404 Not Found
(APIServer pid=4017) ERROR 04-25 13:45:47 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:51430 - "POST /v1/responses HTTP/1.1" 404 Not Found
(APIServer pid=4017) ERROR 04-25 13:46:07 [serving.py:337] Error with model error=ErrorInfo(message='The model ./model_repo/Qwen/Qwen3.6-27B/ does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=4017) INFO: 127.0.0.1:48322 - "POST /v1/responses HTTP/1.1" 404 Not Found
```
3. The request fails because `chatlas` posts to `http://localhost:8000/v1/responses`, which the vLLM server answers with `404 Not Found`.
**Workaround**
Users must explicitly import and use the under-documented `ChatOpenAICompletions` class:
```python
from chatlas import ChatOpenAICompletions

chat = ChatOpenAICompletions(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
chat.chat("Hello")
**Advice**
I see a few potential ways to improve this, though I'm not sure which aligns best with the project's architecture:
- Smart Routing: If `base_url` is overridden and does not point to `api.openai.com`, automatically default to the Chat Completions API within `ChatOpenAI` (see the sketch after this list).
- Promote `ChatOpenAICompletions`: Since Chat Completions is the open standard, consider making `ChatOpenAICompletions` the primary class for generic OpenAI-compatible backends, and reserve `ChatOpenAI` strictly for the official OpenAI API.
- Documentation Update: At the very least, add a clear "Connecting to vLLM / Local Models" section in the documentation that explicitly instructs users to use `ChatOpenAICompletions` instead of `ChatOpenAI`.
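As a rough illustration of the Smart Routing idea, a hypothetical factory could dispatch on the host (the helper name and the exact host check are mine, not chatlas API):

```python
from urllib.parse import urlparse

from chatlas import ChatOpenAI, ChatOpenAICompletions


def chat_openai_auto(model, base_url=None, api_key=None, **kwargs):
    """Hypothetical helper: use the Responses API only for api.openai.com,
    and fall back to Chat Completions for every other host."""
    host = urlparse(base_url).hostname if base_url else "api.openai.com"
    cls = ChatOpenAI if host == "api.openai.com" else ChatOpenAICompletions
    return cls(model=model, base_url=base_url, api_key=api_key, **kwargs)


# This would transparently pick ChatOpenAICompletions:
chat = chat_openai_auto(
    model="./model_repo/Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
```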
**Environment**

- chatlas version: 0.16.0
- Python version: Python 3.12.3
- Backend: vLLM (OpenAI compatible server)