feat: debug "network volume model paths" issues (#185)

TimPietruskyRunPod · TimPietrusky · web-flow · commit d0f699290a06 · 2025-12-03T11:57:25.000+01:00
* Add diagnostic logging for network volume debugging

- Check if extra_model_paths.yaml exists at /comfyui/
- Print yaml content to logs
- Check if /runpod-volume is mounted
- List contents of /runpod-volume and /runpod-volume/models

This helps debug why models on network volumes might not be detected.

* Dev workflow: only build base target for faster iteration

* fix(handler): add python diagnostic logging for network volume debugging

- add logging to check if extra_model_paths.yaml exists and print content
- add logging to check if /runpod-volume is mounted and list contents
- replaces bash echo statements that were not captured by runpod serverless

this diagnostic output will appear in runpod logs to help debug why
models on network volumes are not being detected by comfyui.

* fix(handler): move diagnostics into handler function for log capture

- diagnostics now run on every request instead of at module import time
- use print() instead of logger for better stdout capture
- this ensures diagnostic output appears in runpod serverless logs

* feat(debug): add opt-in NETWORK_VOLUME_DEBUG for model path troubleshooting

- Add NETWORK_VOLUME_DEBUG environment variable (default: false)
- Create comprehensive diagnostic function that shows:
  - Configuration file status
  - Network volume mount status
  - Directory structure validation
  - Model files found (with size and extension validation)
  - Expected structure guidance when issues found
- Document network volume configuration in docs/configuration.md
  - Expected directory structure
  - Supported file extensions by model type
  - Step-by-step debugging instructions
  - Common issues and solutions

Resolves user reports of models not being detected on network volumes.
Root cause: user configuration issues (wrong directory structure or
missing file extensions), not a bug in worker-comfyui.
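
For illustration only, a minimal sketch of the opt-in check and model scan described above. It is not the contents of src/network_volume.py: the names `debug_enabled`, `scan_models`, and `SUPPORTED_EXTENSIONS` are hypothetical, the set of strings treated as "true" is an assumption, and the extension lists are copied from the docs added in this commit (two model types only, for brevity).

```python
import os
from pathlib import Path

# Assumption: extensions copied from the "Supported File Extensions" table in
# docs/network-volumes.md; only two model types are listed to keep this short.
SUPPORTED_EXTENSIONS = {
    "checkpoints": {".safetensors", ".ckpt", ".pt", ".pth", ".bin"},
    "loras": {".safetensors", ".pt"},
}


def debug_enabled(env=None):
    """Return True when NETWORK_VOLUME_DEBUG is set to a truthy string."""
    env = os.environ if env is None else env
    # Assumption: which strings count as "true" is a guess, not confirmed here.
    value = env.get("NETWORK_VOLUME_DEBUG", "false")
    return value.strip().lower() in {"true", "1", "yes"}


def scan_models(volume_root="/runpod-volume"):
    """Map each model type to (valid, ignored) file names under <root>/models."""
    models_dir = Path(volume_root) / "models"
    report = {}
    for model_type, extensions in SUPPORTED_EXTENSIONS.items():
        type_dir = models_dir / model_type
        if not type_dir.is_dir():
            continue  # missing subdirectories are fine
        files = sorted(p for p in type_dir.iterdir() if p.is_file())
        valid = [p.name for p in files if p.suffix.lower() in extensions]
        ignored = [p.name for p in files if p.suffix.lower() not in extensions]
        report[model_type] = (valid, ignored)
    return report


if __name__ == "__main__":
    if debug_enabled():
        # print() rather than a logger, following the stdout-capture note above
        for model_type, (valid, ignored) in scan_models().items():
            print(f"{model_type}/: valid={valid} ignored={ignored}")
```

Passing the environment mapping and the volume root as parameters keeps the sketch testable outside a worker; a file such as `notes.txt` under `models/checkpoints/` would land in the ignored list, which is the "missing file extensions" case named as a root cause.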

* style: apply formatting to network volume diagnostics

* docs: move network volume troubleshooting to dedicated guide

* docs(config): reorganize network volume debug and fix path documentation

- move NETWORK_VOLUME_DEBUG from debugging to logging configuration table
- remove redundant TIP block about COMFY_LOG_LEVEL
- fix customization.md tip block syntax to proper github admonition format
- correct network volume paths in customization guide:
  - clarify /runpod-volume mount point for serverless endpoints
  - update example paths to show /runpod-volume/models/ structure
  - fix note block syntax to proper github admonition format
  - add links to network-volumes.md for detailed structure and debugging
  - mention s3-compatible api as upload option

* refactor: extract network volume diagnostics to separate module

- move network volume diagnostic functions from handler.py to src/network_volume.py
- remove obsolete bash diagnostics from src/start.sh (replaced by python diagnostics)
- update handler.py to import diagnostics from network_volume module
- update dockerfile to include src/network_volume.py in image build
- reduces handler.py size by ~150 lines for better maintainability

---------

Co-authored-by: Tim Pietrusky <timpietrusky@gmail.com>
diff --git a/.github/workflows/dev.yml b/.github/workflows/dev.yml
@@ -44,10 +44,11 @@ jobs:
           echo "HUGGINGFACE_ACCESS_TOKEN=${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}" >> $GITHUB_ENV
           echo "RELEASE_VERSION=${GITHUB_REF##refs/heads/}" | sed 's/\//-/g' >> $GITHUB_ENV
 
-      - name: Build and push the images to Docker Hub
+      - name: Build and push the base image to Docker Hub
         uses: docker/bake-action@v2
         with:
           push: true
+          targets: base
           set: |
             *.args.DOCKERHUB_REPO=${{ env.DOCKERHUB_REPO }}
             *.args.DOCKERHUB_IMG=${{ env.DOCKERHUB_IMG }}
diff --git a/Dockerfile b/Dockerfile
@@ -74,7 +74,7 @@ WORKDIR /
 RUN uv pip install runpod requests websocket-client
 
 # Add application code and scripts
-ADD src/start.sh handler.py test_input.json ./
+ADD src/start.sh src/network_volume.py handler.py test_input.json ./
 RUN chmod +x /start.sh
 
 # Add script to install custom nodes
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -12,9 +12,10 @@ This document outlines the environment variables available for configuring the `
 
 ## Logging Configuration
 
-| Environment Variable | Description                                                                                                                                                      | Default |
-| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
-| `COMFY_LOG_LEVEL`    | Controls ComfyUI's internal logging verbosity. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Use `DEBUG` for troubleshooting, `INFO` for production. | `DEBUG` |
+| Environment Variable   | Description                                                                                                                                                      | Default |
+| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
+| `COMFY_LOG_LEVEL`      | Controls ComfyUI's internal logging verbosity. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Use `DEBUG` for troubleshooting, `INFO` for production. | `DEBUG` |
+| `NETWORK_VOLUME_DEBUG` | Enable detailed network volume diagnostics in worker logs. Useful for debugging model path issues. See [Network Volumes & Model Paths](network-volumes.md).      | `false` |
 
 ## Debugging Configuration
 
@@ -24,8 +25,6 @@ This document outlines the environment variables available for configuring the `
 | `WEBSOCKET_RECONNECT_DELAY_S`  | Delay in seconds between websocket reconnection attempts.                                                              | `3`     |
 | `WEBSOCKET_TRACE`              | Enable low-level websocket frame tracing for protocol debugging. Set to `true` only when diagnosing connection issues. | `false` |
 
-> [!TIP] > **For troubleshooting:** Set `COMFY_LOG_LEVEL=DEBUG` to get detailed logs when ComfyUI crashes or behaves unexpectedly. This helps identify the exact point of failure in your workflows.
-
 ## AWS S3 Upload Configuration
 
 Configure these variables **only** if you want the worker to upload generated images directly to an AWS S3 bucket. If these are not set, images will be returned as base64-encoded strings in the API response.
diff --git a/docs/customization.md b/docs/customization.md
@@ -2,7 +2,9 @@
 
 This guide covers methods for adding your own models, custom nodes, and static input files into a custom `worker-comfyui`.
 
-> [!TIP] > **Looking for the easiest way to deploy custom workflows?**
+> [!TIP]
+>
+> **Looking for the easiest way to deploy custom workflows?**
 >
 > [ComfyUI-to-API](https://comfy.getrunpod.io) automatically generates a custom Dockerfile and GitHub repository from your ComfyUI workflow, eliminating the manual setup described below. See the [ComfyUI-to-API Documentation](https://docs.runpod.io/community-solutions/comfyui-to-api/overview) for details.
 >
@@ -90,20 +92,21 @@ Using a Network Volume is primarily useful if you want to manage **models** sepa
 1.  **Create a Network Volume**:
     - Follow the [RunPod Network Volumes guide](https://docs.runpod.io/pods/storage/create-network-volumes) to create a volume in the same region as your endpoint.
 2.  **Populate the Volume with Models**:
-    - Use one of the methods described in the RunPod guide (e.g., temporary Pod + `wget`, direct upload) to place your model files into the correct ComfyUI directory structure **within the volume**. The root of the volume corresponds to `/workspace` inside the container.
+    - Use one of the methods described in the RunPod guide (e.g., temporary Pod + `wget`, direct upload, or the S3-compatible API) to place your model files into the correct ComfyUI directory structure **within the volume**.
+    - For **serverless endpoints**, the network volume is mounted at `/runpod-volume`, and ComfyUI expects models under `/runpod-volume/models/...`. See [Network Volumes & Model Paths](network-volumes.md) for the exact structure and debugging tips.
       ```bash
-      # Example structure inside the Network Volume:
-      # /models/checkpoints/your_model.safetensors
-      # /models/loras/your_lora.pt
-      # /models/vae/your_vae.safetensors
+      # Example structure inside the Network Volume (serverless worker view):
+      # /runpod-volume/models/checkpoints/your_model.safetensors
+      # /runpod-volume/models/loras/your_lora.pt
+      # /runpod-volume/models/vae/your_vae.safetensors
       ```
-    - **Important:** Ensure models are placed in the correct subdirectories (e.g., checkpoints in `models/checkpoints`, LoRAs in `models/loras`).
+    - **Important:** Ensure models are placed in the correct subdirectories (e.g., checkpoints in `models/checkpoints`, LoRAs in `models/loras`). If models are not detected, enable `NETWORK_VOLUME_DEBUG` as described in [Network Volumes & Model Paths](network-volumes.md).
 3.  **Configure Your Endpoint**:
     - Use the Network Volume in your endpoint configuration:
       - Either create a new endpoint or update an existing one (see [Deployment Guide](deployment.md)).
       - In the endpoint configuration, under `Advanced > Select Network Volume`, select your Network Volume.
 
-**Note:**
-
-- When a Network Volume is correctly attached, ComfyUI running inside the worker container will automatically detect and load models from the standard directories (`/workspace/models/...`) within that volume.
-- This method is **not suitable for installing custom nodes**; use the Custom Dockerfile method for that.
+> [!NOTE]
+>
+> - When a Network Volume is correctly attached, ComfyUI running inside the worker container will automatically detect and load models from the standard directories (`/runpod-volume/models/...`) within that volume (for serverless workers). For directory mapping details and troubleshooting, see [Network Volumes & Model Paths](network-volumes.md).
+> - This method is **not suitable for installing custom nodes**; use the Custom Dockerfile method for that.
diff --git a/docs/deployment.md b/docs/deployment.md
@@ -16,7 +16,7 @@ This is the simplest method if the official images meet your needs.
   - Container Registry Credentials: Leave as default (images are public).
   - Container Disk: Adjust based on the chosen image tag, see [GPU Recommendations](#gpu-recommendations).
   - (optional) Environment Variables: Configure S3 or other settings (see [Configuration Guide](configuration.md)).
-    - Note: If you don't configure S3, images are returned as base64. For persistent storage across jobs without S3, consider using a [Network Volume](customization.md#method-2-network-volume-alternative-for-models).
+    - Note: If you don't configure S3, images are returned as base64. For persistent storage across jobs without S3, consider using a [Network Volume](customization.md#method-2-network-volume-alternative-for-models). If models on your network volume are not being detected, see [Network Volumes & Model Paths](network-volumes.md) for troubleshooting steps.
 - Click on `Save Template`
 
 ### Create your endpoint
@@ -32,7 +32,7 @@ This is the simplest method if the official images meet your needs.
   - Idle Timeout: `5` (Default is usually fine, adjust if needed).
   - Flash Boot: `enabled` (Recommended for faster worker startup).
   - Select Template: `worker-comfyui` (or the name you gave your template).
-  - (optional) Advanced: If you are using a Network Volume, select it under `Select Network Volume`. See the [Customization Guide](customization.md#method-2-network-volume-alternative-for-models).
+  - (optional) Advanced: If you are using a Network Volume, select it under `Select Network Volume`. See the [Customization Guide](customization.md#method-2-network-volume-alternative-for-models). For detailed model path layout and debugging tips, see [Network Volumes & Model Paths](network-volumes.md).
 
 - Click `deploy`
 - Your endpoint will be created. You can click on it to view the dashboard and find its ID.
diff --git a/docs/network-volumes.md b/docs/network-volumes.md
@@ -0,0 +1,147 @@
+# Network Volumes & Model Paths
+
+This document explains how to use RunPod **Network Volumes** with `worker-comfyui`, how model paths are resolved inside the container, and how to debug cases where models are not detected.
+
+> **Scope**
+>
+> These instructions apply to **serverless endpoints** using this worker. Pods mount network volumes at `/workspace` by default, while serverless workers see them at `/runpod-volume`.
+
+## Directory Mapping
+
+For **serverless endpoints**:
+
+- Network volume root is mounted at: `/runpod-volume`
+- ComfyUI models are expected under: `/runpod-volume/models/...`
+
+For **Pods**:
+
+- Network volume root is mounted at: `/workspace`
+- Equivalent ComfyUI model path: `/workspace/models/...`
+
+If you use the S3-compatible API, the same paths map as:
+
+- Serverless: `/runpod-volume/my-folder/file.txt`
+- Pod: `/workspace/my-folder/file.txt`
+- S3 API: `s3://<NETWORK_VOLUME_ID>/my-folder/file.txt`
+
+## Expected Directory Structure
+
+Models must be placed in the following structure on your network volume:
+
+```text
+/runpod-volume/
+└── models/
+    ├── checkpoints/      # Stable Diffusion checkpoints (.safetensors, .ckpt)
+    ├── loras/            # LoRA files (.safetensors, .pt)
+    ├── vae/              # VAE models (.safetensors, .pt)
+    ├── clip/             # CLIP models (.safetensors, .pt)
+    ├── clip_vision/      # CLIP Vision models
+    ├── controlnet/       # ControlNet models (.safetensors, .pt)
+    ├── embeddings/       # Textual inversion embeddings (.safetensors, .pt)
+    ├── upscale_models/   # Upscaling models (.safetensors, .pt)
+    ├── unet/             # UNet models
+    └── configs/          # Model configs (.yaml, .json)
+```
+
+> **Note**
+>
+> Only create the subdirectories you actually need; empty or missing folders are fine.
+
+## Supported File Extensions
+
+ComfyUI only recognizes files with specific extensions when scanning model directories.
+
+| Model Type     | Supported Extensions                        |
+| -------------- | ------------------------------------------- |
+| Checkpoints    | `.safetensors`, `.ckpt`, `.pt`, `.pth`, `.bin` |
+| LoRAs          | `.safetensors`, `.pt`                       |
+| VAE            | `.safetensors`, `.pt`, `.bin`               |
+| CLIP           | `.safetensors`, `.pt`, `.bin`               |
+| ControlNet     | `.safetensors`, `.pt`, `.pth`, `.bin`       |
+| Embeddings     | `.safetensors`, `.pt`, `.bin`               |
+| Upscale Models | `.safetensors`, `.pt`, `.pth`               |
+
+Files with other extensions (for example `.txt`, `.zip`) are **ignored** by ComfyUI’s model discovery.
+
+## Common Issues
+
+- **Wrong root directory**
+  - Models placed directly under `/runpod-volume/checkpoints/...` instead of `/runpod-volume/models/checkpoints/...`.
+- **Incorrect extensions**
+  - Files named without one of the supported extensions are skipped.
+- **Empty directories**
+  - No actual model files present in `models/checkpoints` (or other folders).
+- **Volume not attached**
+  - Endpoint created without selecting a network volume under **Advanced → Select Network Volume**.
+
+If any of the above is true, ComfyUI will silently fail to discover models from the network volume.
+
+## Debugging with `NETWORK_VOLUME_DEBUG`
+
+The worker exposes an opt‑in debug mode controlled via the `NETWORK_VOLUME_DEBUG` environment variable.
+
+### When to Use
+
+Enable this when:
+
+- Models on your network volume are not appearing in ComfyUI
+- You suspect the directory structure or file extensions are wrong
+- You want to quickly verify what the worker can actually see on `/runpod-volume`
+
+### How to Enable
+
+1. Go to your serverless **Endpoint → Manage → Edit**.
+2. Under **Environment Variables**, add:
+
+   - `NETWORK_VOLUME_DEBUG=true`
+
+3. Save and wait for workers to restart (or scale to zero and back up).
+4. Send any request to your endpoint (even a minimal one) to trigger the diagnostics.
+
+### Reading the Diagnostics
+
+When enabled, each request prints a detailed report to the worker logs, for example:
+
+```text
+======================================================================
+NETWORK VOLUME DIAGNOSTICS (NETWORK_VOLUME_DEBUG=true)
+======================================================================
+
+[1] Checking extra_model_paths.yaml configuration...
+    ✓ FOUND: /comfyui/extra_model_paths.yaml
+
+[2] Checking network volume mount at /runpod-volume...
+    ✓ MOUNTED: /runpod-volume
+
+[3] Checking directory structure...
+    ✓ FOUND: /runpod-volume/models
+
+[4] Scanning model directories...
+
+    checkpoints/:
+      - my-model.safetensors (6.5 GB)
+
+    loras/:
+      - style-lora.safetensors (144.2 MB)
+
+[5] Summary
+    ✓ Models found on network volume!
+======================================================================
+```
+
+If there is a problem, the diagnostics will instead highlight it, for example:
+
+- Missing `models/` directory
+- No valid model files in any subdirectory
+- Files present but ignored due to wrong extensions
+
+### Disabling Debug Mode
+
+Once you have resolved your issue, disable diagnostics to keep logs clean:
+
+- Remove the `NETWORK_VOLUME_DEBUG` environment variable, **or**
+- Set `NETWORK_VOLUME_DEBUG=false`
+
+This returns the worker to normal behavior without extra log noise.
+
+
diff --git a/handler.py b/handler.py
@@ -13,6 +13,18 @@
 import tempfile
 import socket
 import traceback
+import logging
+
+from network_volume import (
+    is_network_volume_debug_enabled,
+    run_network_volume_diagnostics,
+)
+
+# ---------------------------------------------------------------------------
+# Logging setup
+# ---------------------------------------------------------------------------
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
 
 # Time to wait between API check attempts in milliseconds
 COMFY_API_AVAILABLE_INTERVAL_MS = 50
@@ -502,6 +514,12 @@ def handler(job):
     Returns:
         dict: A dictionary containing either an error message or a success status with generated images.
     """
+    # ---------------------------------------------------------------------------
+    # Network Volume Diagnostics (opt-in via NETWORK_VOLUME_DEBUG=true)
+    # ---------------------------------------------------------------------------
+    if is_network_volume_debug_enabled():
+        run_network_volume_diagnostics()
+
     job_input = job["input"]
     job_id = job["id"]
 
diff --git a/src/network_volume.py b/src/network_volume.py