The server currently loads exactly one GGUF model at startup, but the model metadata endpoints make it look like both DeepSeek V4 Flash and DeepSeek V4 Pro are available at the same time.
For example, when the server is started with a Flash GGUF:
./ds4-server -m <flash.gguf> --host 127.0.0.1 --port 8000
curl http://127.0.0.1:8000/v1/models
The response advertises both:
deepseek-v4-flash
deepseek-v4-pro
This is misleading because the model field in API requests does not actually switch the loaded GGUF. Inference always uses the single model loaded at server startup.
The same problem affects GET /v1/models/: the server accepts both deepseek-v4-flash and deepseek-v4-pro as valid model ids for metadata lookups, even when only one of those models is actually loaded.
Expected behavior:
- If a Flash GGUF is loaded, /v1/models should expose only deepseek-v4-flash
- If a Pro GGUF is loaded, /v1/models should expose only deepseek-v4-pro
- GET /v1/models/ should return 404 for the non-loaded model id
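
The expected behavior above can be sketched as follows. This is a minimal Python illustration of the lookup logic, not the server's actual code: the function names `list_models` and `get_model`, the `owned_by` value, and the response shape (modeled on the OpenAI list-models format) are assumptions made for the example.

```python
# Hypothetical sketch: names and response fields are illustrative assumptions,
# not taken from the ds4-server source.

def list_models(loaded_id):
    """Body for GET /v1/models: advertise only the model loaded at startup."""
    return {
        "object": "list",
        "data": [{"id": loaded_id, "object": "model", "owned_by": "local"}],
    }


def get_model(loaded_id, requested_id):
    """Return (status, body) for a single-model metadata lookup."""
    if requested_id != loaded_id:
        # The non-loaded variant cannot be served, so report it as not found.
        return 404, {"error": {"message": f"model '{requested_id}' not found"}}
    return 200, {"id": loaded_id, "object": "model", "owned_by": "local"}


if __name__ == "__main__":
    # Server started with a Flash GGUF.
    print(list_models("deepseek-v4-flash"))
    print(get_model("deepseek-v4-flash", "deepseek-v4-pro")[0])
```

The key design point is that both endpoints derive their answer from the same single value (the id of the loaded GGUF), so the list and the per-model lookup cannot disagree.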
This matters for OpenAI-compatible clients that inspect /v1/models to decide which model IDs are available. The current behavior can make clients believe both variants are selectable, while the server can only run the already-loaded GGUF.
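
To illustrate the client-side impact, here is a small Python sketch of how a client might choose a model from a /v1/models response. The helper names `available_ids` and `pick_model` and the preference order are hypothetical; the only assumption about the server is that the response follows the OpenAI list format with a `data` array of objects carrying an `id`.

```python
# Hypothetical client-side helpers, shown only to illustrate the failure mode.

def available_ids(models_response):
    """Extract model ids from a parsed /v1/models JSON response."""
    return [m["id"] for m in models_response.get("data", [])]


def pick_model(models_response, preferred):
    """Return the first preferred id the server advertises, else any advertised id."""
    ids = available_ids(models_response)
    for candidate in preferred:
        if candidate in ids:
            return candidate
    return ids[0] if ids else None


if __name__ == "__main__":
    preferred = ["deepseek-v4-pro", "deepseek-v4-flash"]

    # Current behavior: both ids are advertised although only Flash is loaded.
    misleading = {"object": "list", "data": [
        {"id": "deepseek-v4-flash", "object": "model"},
        {"id": "deepseek-v4-pro", "object": "model"},
    ]}
    # Expected behavior: only the loaded model is advertised.
    accurate = {"object": "list", "data": [
        {"id": "deepseek-v4-flash", "object": "model"},
    ]}

    print(pick_model(misleading, preferred))
    print(pick_model(accurate, preferred))
```

With the misleading response, such a client would select deepseek-v4-pro and send requests under that id while inference still runs on the Flash GGUF. With the accurate response it falls back to deepseek-v4-flash, which matches what the server actually runs.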
Proposed changes
Fix this by making model metadata reflect the single GGUF loaded at startup. Implemented in #287.