ComfyUI custom node for TurboDiffusion Image-to-Video generation with dual-expert sampling and SLA attention optimization.
- ✅ Complete I2V Pipeline: Single node handles text encoding, VAE encoding, dual-expert sampling, and decoding
- ✅ SLA Attention: 2-3x faster inference with Sparse Linear Attention optimization
- ✅ Quantized Models: Supports int8 block-wise quantized .pth models
- ✅ Dual-Expert Sampling: Automatic switching between high/low noise models
- ✅ Memory Management: Automatic model loading/offloading for efficient VRAM usage
- ✅ Vendored Code: No external TurboDiffusion installation required
- GPU: NVIDIA RTX 3090/4090 or better (12GB+ VRAM)
- Software: Python >= 3.9, PyTorch >= 2.0, ComfyUI
- Navigate to ComfyUI custom_nodes directory:
```bash
cd ComfyUI/custom_nodes/
```

- Clone this repository:

```bash
git clone https://github.com/anveshane/Comfyui_turbodiffusion.git
```

- Install the required Python dependencies before use:

```bash
pip install einops loguru omegaconf pandas
```

- Restart ComfyUI
Download and place in your ComfyUI models directories:
- `TurboWan2.2-I2V-A14B-high-720P-quant.pth`
- `TurboWan2.2-I2V-A14B-low-720P-quant.pth`
Download from: TurboDiffusion Model
- `wan_2.1_vae.safetensors` (or `.pth`)
- `umt5-xxl_fp8_scaled.safetensors` (or `.pth`)
The workflow uses 8 nodes total:
- TurboWanModelLoader → Load high noise model (.pth with SLA attention)
- TurboWanModelLoader → Load low noise model (.pth with SLA attention)
- CLIPLoader → Load umT5-xxl text encoder
- CLIPTextEncode → Create text prompt
- TurboWanVAELoader → Load Wan2.1 VAE (video VAE with temporal support)
- LoadImage → Load starting image
- TurboDiffusionI2VSampler → Complete inference (samples 77 frames in ~60-90s)
- TurboDiffusionSaveVideo → Save as MP4/GIF/WebM
See turbowan_workflow.json for a complete workflow.
Loads quantized .pth TurboDiffusion models with SLA attention optimization.
Inputs:
- model_name: Model file from diffusion_models/
- attention_type: "sla" (recommended), "sagesla" (requires SpargeAttn), or "original"
- sla_topk: Top-k ratio for sparse attention (0.1 default)
Outputs:
- MODEL: Loaded TurboDiffusion model
Loads Wan2.1 VAE with video encoding/decoding support.
Inputs:
- vae_name: VAE file from the models/vae/ folder
Outputs:
- VAE: `Wan2pt1VAEInterface` object with temporal support
Note: This is NOT the same as ComfyUI's standard VAELoader. The Wan VAE handles video frames (B, C, T, H, W) with temporal compression, while standard VAEs only handle images (B, C, H, W).
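To make the shape difference concrete, here is a minimal sketch (the specific sizes are illustrative, not required values):

```python
import torch

# Illustrative shapes only (B=1, RGB, 77 frames, 480x832 pixels).
video = torch.randn(1, 3, 77, 480, 832)  # (B, C, T, H, W): what the Wan VAE consumes
image = torch.randn(1, 3, 480, 832)      # (B, C, H, W): what a standard image VAE consumes

# The Wan2.1 VAE also compresses along T, so the 77 input frames map to a
# shorter latent sequence (with 16 latent channels, per the technical details below).
```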
Complete I2V inference with dual-expert sampling.
Inputs:
- high_noise_model: High noise expert from TurboWanModelLoader
- low_noise_model: Low noise expert from TurboWanModelLoader
- conditioning: Text conditioning from CLIPTextEncode
- vae: VAE from TurboWanVAELoader
- image: Starting image
- num_frames: Frames to generate (must be 8n+1, e.g., 49, 77, 121; see the check below)
- num_steps: Sampling steps (1-4, recommended: 4)
- resolution: "480", "480p", "512", "720", "720p" (see note below)
- aspect_ratio: 16:9, 9:16, 4:3, 3:4, 1:1
- boundary: Timestep for model switching (0.9 recommended)
- sigma_max: Initial sigma for rCM (200 recommended)
- seed: Random seed
- use_ode: ODE vs. SDE sampling (false = SDE, recommended)
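The 8n+1 constraint on num_frames can be checked up front. A small hypothetical helper; the constraint itself comes from the list above:

```python
def valid_num_frames(n: int) -> bool:
    # Valid counts have the form 8k + 1: 49, 77, 121, ...
    return n >= 1 and (n - 1) % 8 == 0
```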
Outputs:
- frames: Generated video frames (B*T, H, W, C)
Resolution Note:
"480": 480×480 (1:1), 640×480 (4:3), etc. - Lower VRAM"480p": 640×640 (1:1), 832×480 (16:9), etc. - Higher VRAM- For low VRAM (8-12GB): Use
"480"with 49 frames - For medium VRAM (16GB): Use
"480p"with 77 frames or"720p"with 49 frames - For high VRAM (24GB+): Use
"720p"with 77+ frames
How it works:
- Extracts text embedding from conditioning
- Encodes start image with VAE
- Creates conditioning dict with mask and encoded latents
- Initializes noise with seed
- Loads high_noise_model → samples steps 0 to boundary → offloads
- Loads low_noise_model → samples steps boundary to num_steps → offloads
- Decodes final latents with VAE
- Returns frames in ComfyUI IMAGE format
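The staged loading above can be summarized in pseudocode. This is a hedged sketch of the control flow only; `sample()` and `latent_shape()` are hypothetical helpers, not this node's actual API:

```python
import torch

def dual_expert_sample(high_model, low_model, vae, cond, start_image,
                       boundary=0.9, seed=0, device="cuda"):
    """Illustrative control flow only; sample() and latent_shape() are
    hypothetical helpers, not this node's actual API."""
    # Steps 1-3: encode the start image; mask and latents join the conditioning.
    cond = dict(cond, image_latents=vae.encode(start_image))

    # Step 4: seeded initial noise in latent space.
    gen = torch.Generator(device).manual_seed(seed)
    x = torch.randn(latent_shape(cond), generator=gen, device=device)

    # Steps 5-6: one expert on the GPU at a time, offloaded when its stage ends.
    for model, t_start, t_end in ((high_model, 1.0, boundary),
                                  (low_model, boundary, 0.0)):
        model.to(device)
        x = sample(model, x, cond, t_start=t_start, t_end=t_end)
        model.to("cpu")
        torch.cuda.empty_cache()

    # Steps 7-8: decode (B, C, T, H, W) latents, reshape to ComfyUI IMAGE (B*T, H, W, C).
    frames = vae.decode(x)
    return frames.permute(0, 2, 3, 4, 1).flatten(0, 1)
```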
Saves frame sequence as video file.
Inputs:
- frames: Video frames from sampler
- filename_prefix: Output filename prefix
- fps: Frames per second (24 default)
- format: "mp4", "gif", or "webm"
- quality: Compression quality (8 default)
- loop: Whether to loop (for GIF)
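For reference, frames in this layout can be written to disk with a few lines of imageio. This is a minimal stand-in for TurboDiffusionSaveVideo, not the node's implementation, and assumes the imageio-ffmpeg backend is installed:

```python
import numpy as np
import imageio.v2 as imageio

def save_video(frames, path="turbowan_00001.mp4", fps=24):
    """Write frames shaped (T, H, W, C), float in [0, 1], to an MP4 file.

    Assumes imageio with the imageio-ffmpeg backend; convert torch tensors
    with .cpu().numpy() first.
    """
    clip = (np.asarray(frames).clip(0.0, 1.0) * 255).astype(np.uint8)
    imageio.mimsave(path, list(clip), fps=fps)
```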
With SLA attention on RTX 3090:
- 720p, 77 frames, 4 steps: ~60-90 seconds
- 2-3x faster than original attention
- ~12-15GB VRAM usage with automatic offloading
- Models: TurboDiffusion Wan2.2-A14B (i2v, 14B parameters)
- Quantization: Block-wise int8 with automatic dequantization (see the sketch after this list)
- Attention: SLA (Sparse Linear Attention) for 2-3x speedup
- Sampling: rCM (Rectified Consistency Model) with dual-expert switching
- VAE: Wan2.1 VAE (16 channel latents)
- Text Encoder: umT5-xxl
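Block-wise int8 quantization stores one scale per block of weights, and dequantization multiplies each block by its scale. Below is a generic sketch of the idea; the checkpoint's actual block size and tensor layout are not documented here, so both are assumptions:

```python
import torch

def dequantize_blockwise(w_int8: torch.Tensor, scales: torch.Tensor,
                         block_size: int = 128) -> torch.Tensor:
    """Generic block-wise int8 dequantization (layout and block size assumed).

    w_int8: flat int8 weights, length divisible by block_size
    scales: one floating-point scale per block
    """
    blocks = w_int8.view(-1, block_size).to(scales.dtype)
    return (blocks * scales[:, None]).reshape(-1)
```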
- High Noise Model (steps 0 → boundary): Generates coarse motion and structure
- Low Noise Model (steps boundary → num_steps): Refines details and quality
- Boundary (default 0.9): Switches at 90% of sampling (e.g., step 3.6 out of 4)
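In code, the expert choice is just a timestep comparison against boundary. A minimal sketch; the mapping from step index to timestep depends on the rCM schedule, so only the comparison itself is shown:

```python
def pick_expert(t: float, boundary: float = 0.9) -> str:
    # t is the current timestep, running from 1.0 (pure noise) down to 0.0.
    return "high_noise" if t >= boundary else "low_noise"
```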
ComfyUI Integration:
- VAE wrapped with ComfyUI-compatible device management
- Automatic loading/offloading integrated with ComfyUI's model management system
- Calls `comfy.model_management.unload_all_models()` before VAE encoding
- VAE automatically moves to GPU for encoding/decoding, then returns to CPU
Manual Management:
- Diffusion models start on CPU
- Only one diffusion model on GPU at a time during sampling
- Automatic offloading after each sampling stage
- Text embeddings kept on CPU until needed for conditioning
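The one-model-on-GPU-at-a-time pattern looks roughly like this in PyTorch (a generic sketch, not the node's code):

```python
import torch

def run_stage(model: torch.nn.Module, *args, device: str = "cuda"):
    """Move a model to the GPU for one sampling stage, then offload it."""
    model.to(device)
    try:
        with torch.no_grad():
            return model(*args)
    finally:
        model.to("cpu")            # offload so the next expert fits in VRAM
        torch.cuda.empty_cache()   # release cached allocations
```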
"ModuleNotFoundError": Restart ComfyUI after installation
"Model not found": Verify model files are in correct ComfyUI directories
CUDA OOM: Reduce resolution or frame count
Slow performance: Check that attention_type is "sla" (not "original")
"TurboDiffusionI2VSampler" missing: Ensure all vendored files were copied (turbodiffusion_vendor/)
Based on testing on my system (Windows + ComfyUI + NVIDIA GPU), only the following configuration works reliably:
- Acceleration: `sla`
- Execution mode: `layerwise_gpu`
Other acceleration options (`original`, `sagesla`) and other execution modes may fail to load or raise runtime errors on my setup.
This appears to be related to environment-specific factors (CUDA / PyTorch / driver / VRAM behavior).
Some users may be able to use other modes depending on their hardware and software environment, but on my system only `sla` + `layerwise_gpu` is stable and usable.
- TurboDiffusion by THU-ML
- ComfyUI by comfyanonymous
Apache 2.0 (same as TurboDiffusion)