Commit d0f6992

feat: debug "network volume model paths" issues (#185)
* Add diagnostic logging for network volume debugging
  - Check if extra_model_paths.yaml exists at /comfyui/
  - Print yaml content to logs
  - Check if /runpod-volume is mounted
  - List contents of /runpod-volume and /runpod-volume/models

  This helps debug why models on network volumes might not be detected.

* Dev workflow: only build base target for faster iteration

* fix(handler): add python diagnostic logging for network volume debugging
  - add logging to check if extra_model_paths.yaml exists and print content
  - add logging to check if /runpod-volume is mounted and list contents
  - replaces bash echo statements that were not captured by runpod serverless

  this diagnostic output will appear in runpod logs to help debug why models on network volumes are not being detected by comfyui.

* fix(handler): move diagnostics into handler function for log capture
  - diagnostics now run on every request instead of module import time
  - use print() instead of logger for better stdout capture
  - this ensures diagnostic output appears in runpod serverless logs

* feat(debug): add opt-in NETWORK_VOLUME_DEBUG for model path troubleshooting
  - Add NETWORK_VOLUME_DEBUG environment variable (default: false)
  - Create comprehensive diagnostic function that shows:
    - Configuration file status
    - Network volume mount status
    - Directory structure validation
    - Model files found (with size and extension validation)
    - Expected structure guidance when issues found
  - Document network volume configuration in docs/configuration.md
    - Expected directory structure
    - Supported file extensions by model type
    - Step-by-step debugging instructions
    - Common issues and solutions

  Resolves user reports of models not being detected on network volumes. Root cause: user configuration issues (wrong directory structure or missing file extensions), not a bug in worker-comfyui.

* style: apply formatting to network volume diagnostics

* docs: move network volume troubleshooting to dedicated guide

* docs(config): reorganize network volume debug and fix path documentation
  - move NETWORK_VOLUME_DEBUG from debugging to logging configuration table
  - remove redundant TIP block about COMFY_LOG_LEVEL
  - fix customization.md tip block syntax to proper github admonition format
  - correct network volume paths in customization guide:
    - clarify /runpod-volume mount point for serverless endpoints
    - update example paths to show /runpod-volume/models/ structure
  - fix note block syntax to proper github admonition format
  - add links to network-volumes.md for detailed structure and debugging
  - mention s3-compatible api as upload option

* refactor: extract network volume diagnostics to separate module
  - move network volume diagnostic functions from handler.py to src/network_volume.py
  - remove obsolete bash diagnostics from src/start.sh (replaced by python diagnostics)
  - update handler.py to import diagnostics from network_volume module
  - update dockerfile to include src/network_volume.py in image build
  - reduces handler.py size by ~150 lines for better maintainability

---------

Co-authored-by: Tim Pietrusky <timpietrusky@gmail.com>
1 parent d29f0bc commit d0f6992

8 files changed: +341 -20 lines changed


.github/workflows/dev.yml

Lines changed: 2 additions & 1 deletion
@@ -44,10 +44,11 @@ jobs:
           echo "HUGGINGFACE_ACCESS_TOKEN=${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}" >> $GITHUB_ENV
           echo "RELEASE_VERSION=${GITHUB_REF##refs/heads/}" | sed 's/\//-/g' >> $GITHUB_ENV

-      - name: Build and push the images to Docker Hub
+      - name: Build and push the base image to Docker Hub
         uses: docker/bake-action@v2
         with:
           push: true
+          targets: base
           set: |
             *.args.DOCKERHUB_REPO=${{ env.DOCKERHUB_REPO }}
             *.args.DOCKERHUB_IMG=${{ env.DOCKERHUB_IMG }}

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ WORKDIR /
 RUN uv pip install runpod requests websocket-client

 # Add application code and scripts
-ADD src/start.sh handler.py test_input.json ./
+ADD src/start.sh src/network_volume.py handler.py test_input.json ./
 RUN chmod +x /start.sh

 # Add script to install custom nodes

docs/configuration.md

Lines changed: 4 additions & 5 deletions
@@ -12,9 +12,10 @@ This document outlines the environment variables available for configuring the `

 ## Logging Configuration

-| Environment Variable | Description | Default |
-| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
-| `COMFY_LOG_LEVEL` | Controls ComfyUI's internal logging verbosity. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Use `DEBUG` for troubleshooting, `INFO` for production. | `DEBUG` |
+| Environment Variable | Description | Default |
+| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
+| `COMFY_LOG_LEVEL` | Controls ComfyUI's internal logging verbosity. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Use `DEBUG` for troubleshooting, `INFO` for production. | `DEBUG` |
+| `NETWORK_VOLUME_DEBUG` | Enable detailed network volume diagnostics in worker logs. Useful for debugging model path issues. See [Network Volumes & Model Paths](network-volumes.md). | `false` |

 ## Debugging Configuration

@@ -24,8 +25,6 @@ This document outlines the environment variables available for configuring the `
 | `WEBSOCKET_RECONNECT_DELAY_S` | Delay in seconds between websocket reconnection attempts. | `3` |
 | `WEBSOCKET_TRACE` | Enable low-level websocket frame tracing for protocol debugging. Set to `true` only when diagnosing connection issues. | `false` |

-> [!TIP] > **For troubleshooting:** Set `COMFY_LOG_LEVEL=DEBUG` to get detailed logs when ComfyUI crashes or behaves unexpectedly. This helps identify the exact point of failure in your workflows.
-
 ## AWS S3 Upload Configuration

 Configure these variables **only** if you want the worker to upload generated images directly to an AWS S3 bucket. If these are not set, images will be returned as base64-encoded strings in the API response.

docs/customization.md

Lines changed: 14 additions & 11 deletions
@@ -2,7 +2,9 @@

 This guide covers methods for adding your own models, custom nodes, and static input files into a custom `worker-comfyui`.

-> [!TIP] > **Looking for the easiest way to deploy custom workflows?**
+> [!TIP]
+>
+> **Looking for the easiest way to deploy custom workflows?**
 >
 > [ComfyUI-to-API](https://comfy.getrunpod.io) automatically generates a custom Dockerfile and GitHub repository from your ComfyUI workflow, eliminating the manual setup described below. See the [ComfyUI-to-API Documentation](https://docs.runpod.io/community-solutions/comfyui-to-api/overview) for details.
 >
@@ -90,20 +92,21 @@ Using a Network Volume is primarily useful if you want to manage **models** sepa
 1. **Create a Network Volume**:
    - Follow the [RunPod Network Volumes guide](https://docs.runpod.io/pods/storage/create-network-volumes) to create a volume in the same region as your endpoint.
 2. **Populate the Volume with Models**:
-   - Use one of the methods described in the RunPod guide (e.g., temporary Pod + `wget`, direct upload) to place your model files into the correct ComfyUI directory structure **within the volume**. The root of the volume corresponds to `/workspace` inside the container.
+   - Use one of the methods described in the RunPod guide (e.g., temporary Pod + `wget`, direct upload, or the S3-compatible API) to place your model files into the correct ComfyUI directory structure **within the volume**.
+   - For **serverless endpoints**, the network volume is mounted at `/runpod-volume`, and ComfyUI expects models under `/runpod-volume/models/...`. See [Network Volumes & Model Paths](network-volumes.md) for the exact structure and debugging tips.
      ```bash
-     # Example structure inside the Network Volume:
-     # /models/checkpoints/your_model.safetensors
-     # /models/loras/your_lora.pt
-     # /models/vae/your_vae.safetensors
+     # Example structure inside the Network Volume (serverless worker view):
+     # /runpod-volume/models/checkpoints/your_model.safetensors
+     # /runpod-volume/models/loras/your_lora.pt
+     # /runpod-volume/models/vae/your_vae.safetensors
      ```
-   - **Important:** Ensure models are placed in the correct subdirectories (e.g., checkpoints in `models/checkpoints`, LoRAs in `models/loras`).
+   - **Important:** Ensure models are placed in the correct subdirectories (e.g., checkpoints in `models/checkpoints`, LoRAs in `models/loras`). If models are not detected, enable `NETWORK_VOLUME_DEBUG` as described in [Network Volumes & Model Paths](network-volumes.md).
 3. **Configure Your Endpoint**:
    - Use the Network Volume in your endpoint configuration:
      - Either create a new endpoint or update an existing one (see [Deployment Guide](deployment.md)).
      - In the endpoint configuration, under `Advanced > Select Network Volume`, select your Network Volume.

-**Note:**
-
-- When a Network Volume is correctly attached, ComfyUI running inside the worker container will automatically detect and load models from the standard directories (`/workspace/models/...`) within that volume.
-- This method is **not suitable for installing custom nodes**; use the Custom Dockerfile method for that.
+> [!NOTE]
+>
+> - When a Network Volume is correctly attached, ComfyUI running inside the worker container will automatically detect and load models from the standard directories (`/runpod-volume/models/...`) within that volume (for serverless workers). For directory mapping details and troubleshooting, see [Network Volumes & Model Paths](network-volumes.md).
+> - This method is **not suitable for installing custom nodes**; use the Custom Dockerfile method for that.

docs/deployment.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ This is the simplest method if the official images meet your needs.
 - Container Registry Credentials: Leave as default (images are public).
 - Container Disk: Adjust based on the chosen image tag, see [GPU Recommendations](#gpu-recommendations).
 - (optional) Environment Variables: Configure S3 or other settings (see [Configuration Guide](configuration.md)).
-- Note: If you don't configure S3, images are returned as base64. For persistent storage across jobs without S3, consider using a [Network Volume](customization.md#method-2-network-volume-alternative-for-models).
+- Note: If you don't configure S3, images are returned as base64. For persistent storage across jobs without S3, consider using a [Network Volume](customization.md#method-2-network-volume-alternative-for-models). If models on your network volume are not being detected, see [Network Volumes & Model Paths](network-volumes.md) for troubleshooting steps.
 - Click on `Save Template`

 ### Create your endpoint
@@ -32,7 +32,7 @@ This is the simplest method if the official images meet your needs.
 - Idle Timeout: `5` (Default is usually fine, adjust if needed).
 - Flash Boot: `enabled` (Recommended for faster worker startup).
 - Select Template: `worker-comfyui` (or the name you gave your template).
-- (optional) Advanced: If you are using a Network Volume, select it under `Select Network Volume`. See the [Customization Guide](customization.md#method-2-network-volume-alternative-for-models).
+- (optional) Advanced: If you are using a Network Volume, select it under `Select Network Volume`. See the [Customization Guide](customization.md#method-2-network-volume-alternative-for-models). For detailed model path layout and debugging tips, see [Network Volumes & Model Paths](network-volumes.md).

 - Click `deploy`
 - Your endpoint will be created. You can click on it to view the dashboard and find its ID.

docs/network-volumes.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+# Network Volumes & Model Paths
+
+This document explains how to use RunPod **Network Volumes** with `worker-comfyui`, how model paths are resolved inside the container, and how to debug cases where models are not detected.
+
+> **Scope**
+>
+> These instructions apply to **serverless endpoints** using this worker. Pods mount network volumes at `/workspace` by default, while serverless workers see them at `/runpod-volume`.
+
+## Directory Mapping
+
+For **serverless endpoints**:
+
+- Network volume root is mounted at: `/runpod-volume`
+- ComfyUI models are expected under: `/runpod-volume/models/...`
+
+For **Pods**:
+
+- Network volume root is mounted at: `/workspace`
+- Equivalent ComfyUI model path: `/workspace/models/...`
+
+If you use the S3-compatible API, the same paths map as:
+
+- Serverless: `/runpod-volume/my-folder/file.txt`
+- Pod: `/workspace/my-folder/file.txt`
+- S3 API: `s3://<NETWORK_VOLUME_ID>/my-folder/file.txt`
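The mapping above is a prefix swap on the same volume-relative path. A minimal sketch of that translation follows; the helper names `volume_relative` and `equivalent_paths` and the volume ID `vol123` are illustrative assumptions, not part of the worker.

```python
from pathlib import PurePosixPath

# Mount points as stated in the guide above.
SERVERLESS_ROOT = PurePosixPath("/runpod-volume")
POD_ROOT = PurePosixPath("/workspace")


def volume_relative(path: str) -> PurePosixPath:
    """Return the part of `path` below the volume root (serverless or Pod)."""
    p = PurePosixPath(path)
    for root in (SERVERLESS_ROOT, POD_ROOT):
        try:
            return p.relative_to(root)
        except ValueError:
            continue  # not under this root, try the next one
    raise ValueError(f"{path!r} is not under a known network volume mount")


def equivalent_paths(path: str, volume_id: str) -> dict:
    """Map one location to its serverless, Pod, and S3 API forms (hypothetical helper)."""
    rel = volume_relative(path)
    return {
        "serverless": str(SERVERLESS_ROOT / rel),
        "pod": str(POD_ROOT / rel),
        "s3": f"s3://{volume_id}/{rel}",
    }


if __name__ == "__main__":
    for view, location in equivalent_paths("/workspace/my-folder/file.txt", "vol123").items():
        print(f"{view}: {location}")
```

This is useful when a model was uploaded from a Pod or through the S3 API and you need the path the serverless worker will see.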
+
+## Expected Directory Structure
+
+Models must be placed in the following structure on your network volume:
+
+```text
+/runpod-volume/
+└── models/
+    ├── checkpoints/      # Stable Diffusion checkpoints (.safetensors, .ckpt)
+    ├── loras/            # LoRA files (.safetensors, .pt)
+    ├── vae/              # VAE models (.safetensors, .pt)
+    ├── clip/             # CLIP models (.safetensors, .pt)
+    ├── clip_vision/      # CLIP Vision models
+    ├── controlnet/       # ControlNet models (.safetensors, .pt)
+    ├── embeddings/       # Textual inversion embeddings (.safetensors, .pt)
+    ├── upscale_models/   # Upscaling models (.safetensors, .pt)
+    ├── unet/             # UNet models
+    └── configs/          # Model configs (.yaml, .json)
+```
+
+> **Note**
+>
+> Only create the subdirectories you actually need; empty or missing folders are fine.
+
+## Supported File Extensions
+
+ComfyUI only recognizes files with specific extensions when scanning model directories.
+
+| Model Type | Supported Extensions |
+| -------------- | ------------------------------------------- |
+| Checkpoints | `.safetensors`, `.ckpt`, `.pt`, `.pth`, `.bin` |
+| LoRAs | `.safetensors`, `.pt` |
+| VAE | `.safetensors`, `.pt`, `.bin` |
+| CLIP | `.safetensors`, `.pt`, `.bin` |
+| ControlNet | `.safetensors`, `.pt`, `.pth`, `.bin` |
+| Embeddings | `.safetensors`, `.pt`, `.bin` |
+| Upscale Models | `.safetensors`, `.pt`, `.pth` |
+
+Files with other extensions (for example `.txt`, `.zip`) are **ignored** by ComfyUI’s model discovery.
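The extension table can be checked mechanically before uploading files. The sketch below copies the table into a lookup keyed by the model subdirectory name; `is_recognized` is a hypothetical helper, and treating extensions case-insensitively is an assumption rather than something this guide states.

```python
from pathlib import PurePosixPath

# Extensions copied from the table above; keys are the model subdirectory names.
SUPPORTED_EXTENSIONS = {
    "checkpoints": {".safetensors", ".ckpt", ".pt", ".pth", ".bin"},
    "loras": {".safetensors", ".pt"},
    "vae": {".safetensors", ".pt", ".bin"},
    "clip": {".safetensors", ".pt", ".bin"},
    "controlnet": {".safetensors", ".pt", ".pth", ".bin"},
    "embeddings": {".safetensors", ".pt", ".bin"},
    "upscale_models": {".safetensors", ".pt", ".pth"},
}


def is_recognized(model_type: str, filename: str) -> bool:
    """Return True if `filename` has an extension listed for `model_type`."""
    # Assumption: the comparison ignores case (".SAFETENSORS" == ".safetensors").
    suffix = PurePosixPath(filename).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS.get(model_type, set())


if __name__ == "__main__":
    for model_type, name in [("checkpoints", "my-model.safetensors"), ("loras", "notes.txt")]:
        status = "ok" if is_recognized(model_type, name) else "ignored"
        print(f"{model_type}/{name}: {status}")
```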
+
+## Common Issues
+
+- **Wrong root directory**
+  - Models placed directly under `/runpod-volume/checkpoints/...` instead of `/runpod-volume/models/checkpoints/...`.
+- **Incorrect extensions**
+  - Files named without one of the supported extensions are skipped.
+- **Empty directories**
+  - No actual model files present in `models/checkpoints` (or other folders).
+- **Volume not attached**
+  - Endpoint created without selecting a network volume under **Advanced → Select Network Volume**.
+
+If any of the above is true, ComfyUI will silently fail to discover models from the network volume.
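The "wrong root directory" case is easy to detect from a Pod or any shell with the volume mounted. A minimal sketch, assuming the subdirectory names from the expected structure; `find_misplaced_dirs` is a hypothetical helper and is not the worker's own diagnostic code.

```python
import os

# Subdirectory names taken from the expected directory structure in this guide.
MODEL_SUBDIRS = (
    "checkpoints", "loras", "vae", "clip", "clip_vision",
    "controlnet", "embeddings", "upscale_models", "unet", "configs",
)


def find_misplaced_dirs(volume_root: str) -> list:
    """Return model folders that sit directly under the volume root
    instead of under <volume_root>/models/."""
    misplaced = []
    for name in MODEL_SUBDIRS:
        if os.path.isdir(os.path.join(volume_root, name)):
            misplaced.append(name)
    return misplaced


if __name__ == "__main__":
    root = "/runpod-volume"  # use /workspace when running inside a Pod
    for name in find_misplaced_dirs(root):
        print(f"{root}/{name} should be {root}/models/{name}")
```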
+
+## Debugging with `NETWORK_VOLUME_DEBUG`
+
+The worker exposes an opt‑in debug mode controlled via the `NETWORK_VOLUME_DEBUG` environment variable.
+
+### When to Use
+
+Enable this when:
+
+- Models on your network volume are not appearing in ComfyUI
+- You suspect the directory structure or file extensions are wrong
+- You want to quickly verify what the worker can actually see on `/runpod-volume`
+
+### How to Enable
+
+1. Go to your serverless **Endpoint → Manage → Edit**.
+2. Under **Environment Variables**, add:
+
+   - `NETWORK_VOLUME_DEBUG=true`
+
+3. Save and wait for workers to restart (or scale to zero and back up).
+4. Send any request to your endpoint (even a minimal one) to trigger the diagnostics.
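An opt-in flag of this kind is typically read from the environment with a default of off. The sketch below illustrates the idea only: `debug_flag_enabled` is a hypothetical stand-in, since the actual `is_network_volume_debug_enabled` in `src/network_volume.py` is not shown in this diff, and accepting `true` case-insensitively is an assumption.

```python
import os


def debug_flag_enabled(environ=None) -> bool:
    """Return True only when NETWORK_VOLUME_DEBUG is explicitly set to 'true'.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    # Default "false" matches the documented default for this variable.
    value = env.get("NETWORK_VOLUME_DEBUG", "false")
    return value.strip().lower() == "true"


if __name__ == "__main__":
    if debug_flag_enabled():
        print("network volume diagnostics would run for this request")
    else:
        print("diagnostics disabled")
```

Keeping the default off means an unset variable never adds log output, which matches the opt-in behavior described above.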
+
+### Reading the Diagnostics
+
+When enabled, each request prints a detailed report to the worker logs, for example:
+
+```text
+======================================================================
+NETWORK VOLUME DIAGNOSTICS (NETWORK_VOLUME_DEBUG=true)
+======================================================================
+
+[1] Checking extra_model_paths.yaml configuration...
+✓ FOUND: /comfyui/extra_model_paths.yaml
+
+[2] Checking network volume mount at /runpod-volume...
+✓ MOUNTED: /runpod-volume
+
+[3] Checking directory structure...
+✓ FOUND: /runpod-volume/models
+
+[4] Scanning model directories...
+
+checkpoints/:
+- my-model.safetensors (6.5 GB)
+
+loras/:
+- style-lora.safetensors (144.2 MB)
+
+[5] Summary
+✓ Models found on network volume!
+======================================================================
+```
+
+If there is a problem, the diagnostics will instead highlight it, for example:
+
+- Missing `models/` directory
+- No valid model files in any subdirectory
+- Files present but ignored due to wrong extensions
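The scanning step of such a report can be approximated with a short directory walk. This is a sketch under stated assumptions, not the code in `src/network_volume.py`: `scan_models`, `format_size`, and `print_report` are hypothetical names, and the 1024-based size units are a guess at how the sizes in the sample output are computed.

```python
import os


def format_size(num_bytes: int) -> str:
    """Render a byte count with one decimal place (1024-based units; an assumption)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


def scan_models(models_root: str) -> dict:
    """Map each subdirectory of `models_root` to a sorted list of (filename, size_in_bytes)."""
    found = {}
    if not os.path.isdir(models_root):
        return found  # corresponds to the "missing models/ directory" case
    for sub in sorted(os.listdir(models_root)):
        sub_path = os.path.join(models_root, sub)
        if not os.path.isdir(sub_path):
            continue
        files = []
        for name in sorted(os.listdir(sub_path)):
            full = os.path.join(sub_path, name)
            if os.path.isfile(full):
                files.append((name, os.path.getsize(full)))
        found[sub] = files
    return found


def print_report(models_root: str) -> None:
    """Print one block per model subdirectory, listing files with their sizes."""
    for sub, files in scan_models(models_root).items():
        print(f"{sub}/:")
        for name, size in files:
            print(f"  - {name} ({format_size(size)})")


if __name__ == "__main__":
    print_report("/runpod-volume/models")
```

An empty result from `scan_models` distinguishes a missing `models/` directory from one that exists but holds no files, which are two separate failure cases in the list above.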
+
+### Disabling Debug Mode
+
+Once you have resolved your issue, disable diagnostics to keep logs clean:
+
+- Remove the `NETWORK_VOLUME_DEBUG` environment variable, **or**
+- Set `NETWORK_VOLUME_DEBUG=false`
+
+This returns the worker to normal behavior without extra log noise.
+
+

handler.py

Lines changed: 18 additions & 0 deletions
@@ -13,6 +13,18 @@
 import tempfile
 import socket
 import traceback
+import logging
+
+from network_volume import (
+    is_network_volume_debug_enabled,
+    run_network_volume_diagnostics,
+)
+
+# ---------------------------------------------------------------------------
+# Logging setup
+# ---------------------------------------------------------------------------
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)

 # Time to wait between API check attempts in milliseconds
 COMFY_API_AVAILABLE_INTERVAL_MS = 50
@@ -502,6 +514,12 @@ def handler(job):
     Returns:
         dict: A dictionary containing either an error message or a success status with generated images.
     """
+    # ---------------------------------------------------------------------------
+    # Network Volume Diagnostics (opt-in via NETWORK_VOLUME_DEBUG=true)
+    # ---------------------------------------------------------------------------
+    if is_network_volume_debug_enabled():
+        run_network_volume_diagnostics()
+
     job_input = job["input"]
     job_id = job["id"]
