diff --git a/content/manuals/ai/model-runner/_index.md b/content/manuals/ai/model-runner/_index.md
index f22aba3c58ae..f65a76613f7f 100644
--- a/content/manuals/ai/model-runner/_index.md
+++ b/content/manuals/ai/model-runner/_index.md
@@ -6,7 +6,7 @@ params:
   group: AI
 weight: 30
 description: Learn how to use Docker Model Runner to manage and run AI models.
-keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor
+keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, diffusers, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor, image generation, stable diffusion
 aliases:
 - /desktop/features/model-runner/
 - /model-runner/
@@ -34,7 +34,8 @@ with AI models locally.

 - [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
 - Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
-- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
+- Support for [llama.cpp, vLLM, and Diffusers inference engines](inference-engines.md) (vLLM on Linux x86_64 and Windows WSL2, Diffusers on Linux; both require NVIDIA GPUs)
+- [Generate images from text prompts](inference-engines.md#diffusers) using Stable Diffusion models with the Diffusers backend
 - Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry
 - Run and interact with AI models directly from the command line or from the Docker Desktop GUI
 - [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
@@ -89,14 +90,15 @@ access. You can interact with the model using

 ### Inference engines

-Docker Model Runner supports two inference engines:
+Docker Model Runner supports three inference engines:

 | Engine | Best for | Model format |
 |--------|----------|--------------|
 | [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
 | [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |
+| [Diffusers](inference-engines.md#diffusers) | Image generation (Stable Diffusion) | Safetensors |

-llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup.
+llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. Diffusers enables image generation and requires NVIDIA GPUs on Linux (x86_64 or ARM64). See [Inference engines](inference-engines.md) for a detailed comparison and setup.

 ### Context size

@@ -159,6 +161,6 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [
 - [Get started with DMR](get-started.md) - Enable DMR and run your first model
 - [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
 - [Configuration options](configuration.md) - Context size and runtime parameters
-- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
+- [Inference engines](inference-engines.md) - llama.cpp, vLLM, and Diffusers details
 - [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
 - [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface
diff --git a/content/manuals/ai/model-runner/api-reference.md b/content/manuals/ai/model-runner/api-reference.md
index 7f61b7f2feab..edc36eec9aa4 100644
--- a/content/manuals/ai/model-runner/api-reference.md
+++ b/content/manuals/ai/model-runner/api-reference.md
@@ -68,6 +68,7 @@ Docker Model Runner supports multiple API formats:
 | [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools |
 | [Anthropic API](#anthropic-compatible-api) | Anthropic-compatible messages endpoint | Tools built for Claude |
 | [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama |
+| [Image generation API](#image-generation-api-diffusers) | Diffusers-based image generation | Generating images from text prompts |
 | [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management |

 ## OpenAI-compatible API
@@ -223,6 +224,63 @@ curl http://localhost:12434/api/chat \
 curl http://localhost:12434/api/tags
 ```

+## Image generation API (Diffusers)
+
+DMR supports image generation through the Diffusers backend, enabling you to generate
+images from text prompts using models like Stable Diffusion.
+
+> [!NOTE]
+> The Diffusers backend requires an NVIDIA GPU with CUDA support and is only
+> available on Linux (x86_64 and ARM64). See [Inference engines](inference-engines.md#diffusers)
+> for setup instructions.
+
+### Endpoint
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/engines/diffusers/v1/images/generations` | POST | Generate an image from a text prompt |
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+### Response format
+
+The API returns a JSON response with the generated image encoded in base64:
+
+```json
+{
+  "data": [
+    {
+      "b64_json": "<base64-encoded image data>"
+    }
+  ]
+}
+```
+
+### Example: Generate an image
+
+```bash
+curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response using `jq`
+4. Decodes the base64 data and saves it as `image.png`
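+
+To call the endpoint from code instead of the command line, you can send the same
+request over plain HTTP. The following Python sketch uses only the standard library
+and assumes the request and response shapes documented above; the model name and
+local DMR address may differ on your setup:
+
+```python
+import base64
+import json
+import urllib.request
+
+# Same request body as the curl example above.
+payload = {
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512",
+}
+
+request = urllib.request.Request(
+    "http://localhost:12434/engines/diffusers/v1/images/generations",
+    data=json.dumps(payload).encode("utf-8"),
+    headers={"Content-Type": "application/json"},
+    method="POST",
+)
+
+# The response carries the generated image as a base64-encoded string.
+with urllib.request.urlopen(request) as response:
+    result = json.load(response)
+
+# Decode the image data and write it to disk.
+with open("image.png", "wb") as f:
+    f.write(base64.b64decode(result["data"][0]["b64_json"]))
+```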
+

 ## DMR native endpoints

 These endpoints are specific to Docker Model Runner for model management:
@@ -378,4 +436,4 @@ console.log(response.choices[0].message.content);

 - [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools
 - [Configuration options](configuration.md) - Adjust context size and runtime parameters
-- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options
+- [Inference engines](inference-engines.md) - Learn about llama.cpp, vLLM, and Diffusers options
diff --git a/content/manuals/ai/model-runner/inference-engines.md b/content/manuals/ai/model-runner/inference-engines.md
index 8ce93ce278e3..79a78e8833c3 100644
--- a/content/manuals/ai/model-runner/inference-engines.md
+++ b/content/manuals/ai/model-runner/inference-engines.md
@@ -1,27 +1,28 @@
 ---
 title: Inference engines
-description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
+description: Learn about the llama.cpp, vLLM, and Diffusers inference engines in Docker Model Runner.
 weight: 50
-keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
+keywords: Docker, ai, model runner, llama.cpp, vllm, diffusers, inference, gguf, safetensors, cuda, gpu, image generation, stable diffusion
 ---

-Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
+Docker Model Runner supports three inference engines: **llama.cpp**, **vLLM**, and **Diffusers**.
 Each engine has different strengths, supported platforms, and model format
 requirements. This guide helps you choose the right engine and configure it for
 your use case.

 ## Engine comparison

-| Feature | llama.cpp | vLLM |
-|---------|-----------|------|
-| **Model formats** | GGUF | Safetensors, HuggingFace |
-| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
-| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
-| **CPU inference** | Yes | No |
-| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
-| **Memory efficiency** | High (with quantization) | Moderate |
-| **Throughput** | Good | High (with batching) |
-| **Best for** | Local development, resource-constrained environments | Production, high throughput |
+| Feature | llama.cpp | vLLM | Diffusers |
+|---------|-----------|------|-----------|
+| **Model formats** | GGUF | Safetensors, HuggingFace | DDUF |
+| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
+| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
+| **CPU inference** | Yes | No | No |
+| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
+| **Memory efficiency** | High (with quantization) | Moderate | Moderate |
+| **Throughput** | Good | High (with batching) | Good |
+| **Best for** | Local development, resource-constrained environments | Production, high throughput | Image generation |
+| **Use case** | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |

 ## llama.cpp
@@ -205,9 +206,95 @@ $ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
 | Apple Silicon Mac | llama.cpp |
 | Production deployment | vLLM (if hardware supports it) |

-## Running both engines
+## Diffusers

-You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
+[Diffusers](https://github.com/huggingface/diffusers) is an inference engine
+for image generation models, including Stable Diffusion. Unlike llama.cpp and
+vLLM, which focus on text generation with LLMs, Diffusers enables you to generate
+images from text prompts.
+
+### Platform support
+
+| Platform | GPU | Support status |
+|----------|-----|----------------|
+| Linux x86_64 | NVIDIA CUDA | Supported |
+| Linux ARM64 | NVIDIA CUDA | Supported |
+| Windows | - | Not supported |
+| macOS | - | Not supported |
+
+> [!IMPORTANT]
+> Diffusers requires an NVIDIA GPU with CUDA support. It does not support
+> CPU-only inference.
+
+### Setting up Diffusers
+
+Install the Model Runner with the Diffusers backend:
+
+```console
+$ docker model reinstall-runner --backend diffusers --gpu cuda
+```
+
+Verify the installation:
+
+```console
+$ docker model status
+Docker Model Runner is running
+
+Status:
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
+vllm: vLLM binary not found
+diffusers: running diffusers version: 0.36.0
+```
+
+### Pulling Diffusers models
+
+Pull a Stable Diffusion model:
+
+```console
+$ docker model pull stable-diffusion:Q4
+```
+
+### Generating images with Diffusers
+
+Diffusers uses a dedicated image generation API endpoint. To generate an image:
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response
+4. Decodes it and saves it as `image.png`
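+
+Because the request and response shapes mirror the OpenAI Images API, an
+OpenAI-compatible client pointed at the Diffusers engine path should also work.
+The following Python sketch assumes the `openai` package is installed and that
+the server accepts a standard `images.generate` request:
+
+```python
+import base64
+
+from openai import OpenAI  # assumes the openai Python package is installed
+
+# Point the client at the Diffusers engine path. Local DMR requests do not need
+# an API key, so a placeholder value is enough.
+client = OpenAI(
+    base_url="http://localhost:12434/engines/diffusers/v1",
+    api_key="not-needed",
+)
+
+# Same model, prompt, and size as the curl example above.
+result = client.images.generate(
+    model="stable-diffusion:Q4",
+    prompt="A picture of a nice cat",
+    size="512x512",
+)
+
+# The response exposes the generated image as a base64-encoded string.
+with open("image.png", "wb") as f:
+    f.write(base64.b64decode(result.data[0].b64_json))
+```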
+
+### Diffusers API endpoint
+
+When using Diffusers, specify the engine in the API path:
+
+```text
+POST /engines/diffusers/v1/images/generations
+```
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+## Running multiple engines
+
+You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes
 requests to the appropriate engine based on the model or explicit engine selection.

 Check which engines are running:

@@ -217,17 +304,21 @@
 $ docker model status
 Docker Model Runner is running

 Status:
-llama.cpp: running llama.cpp version: c22473b
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
 vllm: running vllm version: 0.11.0
+diffusers: running diffusers version: 0.36.0
 ```

 ### Engine-specific API paths

-| Engine | API path |
-|--------|----------|
-| llama.cpp | `/engines/llama.cpp/v1/...` |
-| vLLM | `/engines/vllm/v1/...` |
-| Auto-select | `/engines/v1/...` |
+| Engine | API path | Use case |
+|--------|----------|----------|
+| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
+| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
+| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
+| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |

 ## Managing inference engines

@@ -238,7 +329,7 @@ $ docker model install-runner --backend <backend> [--gpu <gpu>]
 ```

 Options:
-- `--backend`: `llama.cpp` or `vllm`
+- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
 - `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)

 ### Reinstall an engine