12 changes: 7 additions & 5 deletions content/manuals/ai/model-runner/_index.md
@@ -6,7 +6,7 @@ params:
group: AI
weight: 30
description: Learn how to use Docker Model Runner to manage and run AI models.
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, diffusers, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor, image generation, stable diffusion
aliases:
- /desktop/features/model-runner/
- /model-runner/
@@ -34,7 +34,8 @@ with AI models locally.

- [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
- Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
- Support for [llama.cpp, vLLM, and Diffusers inference engines](inference-engines.md) (vLLM and Diffusers on Linux with NVIDIA GPUs)
- [Generate images from text prompts](inference-engines.md#diffusers) using Stable Diffusion models with the Diffusers backend
- Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry
- Run and interact with AI models directly from the command line or from the Docker Desktop GUI
- [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
@@ -89,14 +90,15 @@ access. You can interact with the model using

### Inference engines

Docker Model Runner supports two inference engines:
Docker Model Runner supports three inference engines:

| Engine | Best for | Model format |
|--------|----------|--------------|
| [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
| [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |
| [Diffusers](inference-engines.md#diffusers) | Image generation (Stable Diffusion) | Safetensors |

llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup.
llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. Diffusers enables image generation and requires NVIDIA GPUs on Linux (x86_64 or ARM64). See [Inference engines](inference-engines.md) for detailed comparison and setup.
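
You can check which engines are currently installed and running with `docker model status` (covered in more detail in [Inference engines](inference-engines.md)). The output below is only a sketch; the engines and versions listed depend on your installation:

```console
$ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: 34ce48d
vllm: vLLM binary not found
diffusers: running diffusers version: 0.36.0
```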

### Context size

@@ -159,6 +161,6 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [
- [Get started with DMR](get-started.md) - Enable DMR and run your first model
- [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
- [Configuration options](configuration.md) - Context size and runtime parameters
- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
- [Inference engines](inference-engines.md) - llama.cpp, vLLM, and Diffusers details
- [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
- [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface
60 changes: 59 additions & 1 deletion content/manuals/ai/model-runner/api-reference.md
@@ -68,6 +68,7 @@ Docker Model Runner supports multiple API formats:
| [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools |
| [Anthropic API](#anthropic-compatible-api) | Anthropic-compatible messages endpoint | Tools built for Claude |
| [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama |
| [Image generation API](#image-generation-api-diffusers) | Diffusers-based image generation | Generating images from text prompts |
| [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management |

## OpenAI-compatible API
@@ -223,6 +224,63 @@ curl http://localhost:12434/api/chat \
curl http://localhost:12434/api/tags
```

## Image generation API (Diffusers)

DMR supports image generation through the Diffusers backend, enabling you to generate
images from text prompts using models like Stable Diffusion.

> [!NOTE]
> The Diffusers backend requires an NVIDIA GPU with CUDA support and is only
> available on Linux (x86_64 and ARM64). See [Inference engines](inference-engines.md#diffusers)
> for setup instructions.

### Endpoint

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/engines/diffusers/v1/images/generations` | POST | Generate an image from a text prompt |

### Supported parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
| `prompt` | string | Required. The text description of the image to generate. |
| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |

### Response format

The API returns a JSON response with the generated image encoded in base64:

```json
{
"data": [
{
"b64_json": "<base64-encoded-image-data>"
}
]
}
```

### Example: Generate an image

```bash
curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "stable-diffusion:Q4",
"prompt": "A picture of a nice cat",
"size": "512x512"
}' | jq -r '.data[0].b64_json' | base64 -d > image.png
```

This command:
1. Sends a POST request to the Diffusers image generation endpoint
2. Specifies the model, prompt, and output image size
3. Extracts the base64-encoded image from the response using `jq`
4. Decodes the base64 data and saves it as `image.png`


## DMR native endpoints

These endpoints are specific to Docker Model Runner for model management:
@@ -378,4 +436,4 @@ console.log(response.choices[0].message.content);

- [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools
- [Configuration options](configuration.md) - Adjust context size and runtime parameters
- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options
- [Inference engines](inference-engines.md) - Learn about llama.cpp, vLLM, and Diffusers options
135 changes: 113 additions & 22 deletions content/manuals/ai/model-runner/inference-engines.md
@@ -1,27 +1,28 @@
---
title: Inference engines
description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
description: Learn about the llama.cpp, vLLM, and Diffusers inference engines in Docker Model Runner.
weight: 50
keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
keywords: Docker, ai, model runner, llama.cpp, vllm, diffusers, inference, gguf, safetensors, cuda, gpu, image generation, stable diffusion
---

Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
Docker Model Runner supports three inference engines: **llama.cpp**, **vLLM**, and **Diffusers**.
Each engine has different strengths, supported platforms, and model format
requirements. This guide helps you choose the right engine and configure it for
your use case.

## Engine comparison

| Feature | llama.cpp | vLLM |
|---------|-----------|------|
| **Model formats** | GGUF | Safetensors, HuggingFace |
| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
| **CPU inference** | Yes | No |
| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
| **Memory efficiency** | High (with quantization) | Moderate |
| **Throughput** | Good | High (with batching) |
| **Best for** | Local development, resource-constrained environments | Production, high throughput |
| Feature | llama.cpp | vLLM | Diffusers |
|---------|-----------|------|-------------------------------------|
| **Model formats** | GGUF | Safetensors, HuggingFace | DDUF |
| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
| **CPU inference** | Yes | No | No |
| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
| **Memory efficiency** | High (with quantization) | Moderate | Moderate |
| **Throughput** | Good | High (with batching) | Good |
| **Best for** | Local development, resource-constrained environments | Production, high throughput | Image generation |
| **Use case** | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |

## llama.cpp

@@ -205,9 +206,95 @@ $ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
| Apple Silicon Mac | llama.cpp |
| Production deployment | vLLM (if hardware supports it) |

## Running both engines
## Diffusers

You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
[Diffusers](https://github.com/huggingface/diffusers) is an inference engine
for image generation models, including Stable Diffusion. Unlike llama.cpp and
vLLM, which focus on text generation with LLMs, Diffusers enables you to generate
images from text prompts.

### Platform support

| Platform | GPU | Support status |
|----------|-----|----------------|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Linux ARM64 | NVIDIA CUDA | Supported |
| Windows | - | Not supported |
| macOS | - | Not supported |

> [!IMPORTANT]
> Diffusers requires an NVIDIA GPU with CUDA support. It does not support
> CPU-only inference.
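
Before installing the Diffusers backend, you can confirm that the NVIDIA driver and a CUDA-capable GPU are visible on the host. This is a generic driver check rather than a Docker Model Runner command, and the output shown is illustrative:

```console
$ nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
name, driver_version, memory.total [MiB]
NVIDIA A10G, 550.54.14, 23028 MiB
```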

### Setting up Diffusers

Install Model Runner with the Diffusers backend:

```console
$ docker model reinstall-runner --backend diffusers --gpu cuda
```

Verify the installation:

```console
$ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: vLLM binary not found
diffusers: running diffusers version: 0.36.0
```

### Pulling Diffusers models

Pull a Stable Diffusion model:

```console
$ docker model pull stable-diffusion:Q4
```

### Generating images with Diffusers

Diffusers exposes a dedicated image generation API endpoint. To generate an image:

```console
$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "stable-diffusion:Q4",
"prompt": "A picture of a nice cat",
"size": "512x512"
}' | jq -r '.data[0].b64_json' | base64 -d > image.png
```

This command:
1. Sends a POST request to the Diffusers image generation endpoint
2. Specifies the model, prompt, and output image size
3. Extracts the base64-encoded image from the response
4. Decodes it and saves it as `image.png`

### Diffusers API endpoint

When using Diffusers, specify the engine in the API path:

```text
POST /engines/diffusers/v1/images/generations
```

### Supported parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
| `prompt` | string | Required. The text description of the image to generate. |
| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |

## Running multiple engines

You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes
requests to the appropriate engine based on the model or explicit engine selection.

Check which engines are running:
@@ -217,17 +304,21 @@ $ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: c22473b
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: running vllm version: 0.11.0
diffusers: running diffusers version: 0.36.0
```

### Engine-specific API paths

| Engine | API path |
|--------|----------|
| llama.cpp | `/engines/llama.cpp/v1/...` |
| vLLM | `/engines/vllm/v1/...` |
| Auto-select | `/engines/v1/...` |
| Engine | API path | Use case |
|--------|----------|----------|
| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |
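
As a sketch of how these paths are used, the same OpenAI-compatible chat request can target llama.cpp explicitly or let Docker Model Runner auto-select the engine. The model name `ai/smollm2` is only an example; substitute any model you have pulled:

```console
$ curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'

$ curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```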

## Managing inference engines

@@ -238,7 +329,7 @@ $ docker model install-runner --backend <engine> [--gpu <type>]
```

Options:
- `--backend`: `llama.cpp` or `vllm`
- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
- `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)
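
For example, to install the vLLM backend with CUDA support on a machine with a supported NVIDIA GPU:

```console
$ docker model install-runner --backend vllm --gpu cuda
```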

### Reinstall an engine