Blaizzy/mlx-video

mlx-video

MLX-Video is a package for inference and fine-tuning of image, video, and audio generation models on your Mac using MLX.

Installation

Option 1: Install with pip (requires git):

pip install git+https://github.com/Blaizzy/mlx-video.git

Option 2: Install with uv (ultra-fast package manager, optional):

uv pip install git+https://github.com/Blaizzy/mlx-video.git

Supported Models

  • LTX-2 — 19B parameter video generation model from Lightricks
  • Wan2.1 — 1.3B / 14B parameter T2V models (single-model pipeline)
  • Wan2.2 — T2V-14B, TI2V-5B, and I2V-14B models (dual-model pipeline)

Features

LTX-2 / LTX-2.3

  • Text-to-Video (T2V), Image-to-Video (I2V), Audio-to-Video (A2V)
  • Audio-Video joint generation
  • Multi-pipeline: distilled, dev, dev-two-stage, dev-two-stage-hq
  • 2x spatial upscaling for images and videos
  • Prompt enhancement via Gemma

Wan2.1 / Wan2.2

  • Text-to-Video (T2V) — 1.3B and 14B models
  • Image-to-Video (I2V) — 14B model
  • Flow-matching diffusion with classifier-free guidance
  • LoRA support (e.g. Wan2.2-Lightning for 4-step generation)

General

  • Optimized for Apple Silicon using MLX

LTX-2

Video Generation

# Text-to-Video (distilled, fastest)
uv run mlx_video.ltx_2.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768

# Image-to-Video
uv run mlx_video.ltx_2.generate --prompt "A person dancing" --image photo.jpg

# Audio-to-Video
uv run mlx_video.ltx_2.generate --audio-file music.wav --prompt "A band playing music"

# Dev pipeline with CFG (higher quality)
uv run mlx_video.ltx_2.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0

# Dev two-stage HQ (highest quality)
uv run mlx_video.ltx_2.generate --pipeline dev-two-stage-hq \
    --prompt "A cinematic scene of ocean waves at golden hour" \
    --model-repo prince-canuma/LTX-2-dev

Poodles demo

Converting weights:

Pre-converted weights are available on Hugging Face (LTX-2-distilled, LTX-2-dev, LTX-2.3-distilled, LTX-2.3-dev). Alternatively, you can convert the original Lightricks checkpoint yourself.

LTX-2 CLI Options

| Option | Default | Description |
| --- | --- | --- |
| --prompt, -p | (required) | Text description of the video |
| --height, -H | 512 | Output height (must be divisible by 64) |
| --width, -W | 512 | Output width (must be divisible by 64) |
| --num-frames, -n | 100 | Number of frames |
| --seed, -s | 42 | Random seed for reproducibility |
| --fps | 24 | Frames per second |
| --output, -o | output.mp4 | Output video path |
| --save-frames | false | Save individual frames as images |
| --model-repo | Lightricks/LTX-2 | Hugging Face model repository |
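The divisible-by-64 constraint on height and width can be checked before starting a long generation. The helper below is a minimal sketch and not part of mlx-video; the names `snap_to_multiple` and `clip_seconds` are made up for illustration. It rounds a requested size to the nearest multiple of 64 and computes the clip length implied by `--num-frames` and `--fps`.

```python
def snap_to_multiple(value: int, multiple: int = 64) -> int:
    """Round `value` to the nearest positive multiple of `multiple` (ties round up)."""
    if value <= 0:
        raise ValueError("value must be positive")
    snapped = (value + multiple // 2) // multiple * multiple
    # Never return 0: the smallest valid size is one full multiple.
    return max(multiple, snapped)


def clip_seconds(num_frames: int, fps: float = 24.0) -> float:
    """Duration of the output clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    for requested in (512, 720, 768, 1080):
        print(requested, "->", snap_to_multiple(requested))
    print("97 frames at 24 fps:", round(clip_seconds(97, 24), 2), "seconds")
```

For instance, a requested height of 720 snaps to 704, and 97 frames at 24 fps is roughly four seconds of video.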

Wan2.1 / Wan2.2

Both Wan2.1 and Wan2.2 are text-to-video diffusion models built on a DiT (Diffusion Transformer) backbone with a T5 text encoder and a 3D VAE.

Step 0: Download and Convert Weights

See the dedicated Wan2.1/Wan2.2 README.md for details.

Step 1: Generate Video

# Wan2.1 — uses defaults from config (50 steps, shift=5.0, guide=5.0)
python -m mlx_video.wan_2.generate \
    --model-dir wan21_mlx \
    --prompt "A cat playing piano in a cozy room"

# Wan2.2 — uses defaults from config (40 steps, shift=12.0, guide=3.0,4.0)
python -m mlx_video.wan_2.generate \
    --model-dir wan22_mlx \
    --prompt "A cat playing piano in a cozy room"

With custom settings:

python -m mlx_video.wan_2.generate \
    --model-dir wan21_mlx \
    --prompt "Ocean waves at sunset, cinematic, 4K" \
    --negative-prompt "blurry, low quality" \
    --width 1280 \
    --height 720 \
    --num-frames 81 \
    --steps 50 \
    --guide-scale 5.0 \
    --shift 5.0 \
    --seed 42 \
    --output-path my_video.mp4

The pipeline auto-detects the model version from config.json and selects the right pipeline mode (single or dual model).
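The `--shift` and `--guide-scale` options in the custom-settings command correspond to two ideas common in flow-matching samplers. The sketch below illustrates the timestep shift used by many flow-matching schedulers, sigma' = shift * sigma / (1 + (shift - 1) * sigma), and the usual classifier-free guidance blend. It is plain Python for illustration only: it assumes mlx-video follows these common formulas, the function names are hypothetical, and the actual implementation may differ.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Noise levels from 1.0 toward 0.0, warped so larger `shift` keeps sigma high longer."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    sigmas = []
    for i in range(num_steps):
        sigma = 1.0 - i / num_steps  # linear schedule in (0, 1]
        sigmas.append(shift * sigma / (1.0 + (shift - 1.0) * sigma))
    return sigmas


def cfg_blend(uncond: list[float], cond: list[float], guide_scale: float) -> list[float]:
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return [u + guide_scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    print([round(s, 3) for s in shifted_sigmas(4, shift=1.0)])
    print([round(s, 3) for s in shifted_sigmas(4, shift=5.0)])
    print(cfg_blend([0.0, 1.0], [1.0, 1.0], guide_scale=5.0))
```

With a guidance scale of 1 the blend reduces to the conditional prediction alone, which is why distilled few-step setups often run with guidance effectively switched off.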

Image-to-Video (I2V-14B)

python -m mlx_video.wan_2.generate \
    --model-dir wan22_i2v_mlx \
    --prompt "The camera slowly zooms in as the subject begins to move" \
    --image start.png \
    --num-frames 81 \
    --output-path my_video.mp4

LoRA Support

LoRAs can be used with the --lora-high and --lora-low command line switches.

For example, using the distilled Wan2.2-Lightning LoRA for 4-step generation:

python -m mlx_video.wan_2.generate \
    --model-dir /Volumes/SSD/Wan-AI/Wan2.2-T2V-A14B-MLX \
    --width 480 \
    --height 704 \
    --num-frames 41 \
    --prompt "Two dogs of the poodle breed sitting on a beach wearing sunglasses, nodding with their heads, close up, cinematic, sunset" \
    --steps 4 \
    --guide-scale 1 \
    --trim-first-frames 1 \
    --seed 2391784614 \
    --lora-high /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/high_noise_model.safetensors 1 \
    --lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1
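As background for the command above, a LoRA stores a low-rank update that is added to a base weight matrix. The trailing `1` after each path looks like a per-LoRA strength, though that reading is an assumption and not something this README states. The sketch below shows the generic merge W' = W + strength * (B @ A) on small nested lists. It is not mlx-video's loader, the function names are hypothetical, and real LoRA files may also apply an alpha/rank scaling that is omitted here.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, strength: float = 1.0) -> Matrix:
    """Return weight + strength * (lora_b @ lora_a).

    Shapes: weight is (out, in), lora_b is (out, rank), lora_a is (rank, in).
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    a = [[0.0, 2.0]]      # rank 1, in = 2
    b = [[1.0], [0.0]]    # out = 2, rank 1
    print(merge_lora(base, a, b, strength=1.0))
    print(merge_lora(base, a, b, strength=0.5))
```

A strength of 0 leaves the base weights unchanged, and values between 0 and 1 apply the update partially.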

Poodles

Wan CLI Options

| Option | Default | Description |
| --- | --- | --- |
| --model-dir | (required) | Path to converted MLX model directory |
| --prompt | (required) | Text description of the video |
| --image | None | Input image path (for I2V models) |
| --negative-prompt | "" | Negative prompt for guidance |
| --width | 1280 | Video width |
| --height | 720 | Video height |
| --num-frames | 81 | Number of frames (must be 4n+1) |
| --steps | from config | Number of diffusion steps |
| --guide-scale | from config | Guidance scale: float or low,high pair |
| --shift | from config | Noise schedule shift |
| --seed | -1 (random) | Random seed for reproducibility |
| --output-path | output.mp4 | Output video path |
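Two of the options above have format constraints that are easy to get wrong: `--num-frames` must have the form 4n+1, and `--guide-scale` accepts either a single float or a `low,high` pair. The helpers below are a minimal sketch for validating values in a wrapper script. They are not part of mlx-video, and the names `snap_num_frames` and `parse_guide_scale` are hypothetical.

```python
def snap_num_frames(requested: int) -> int:
    """Largest value of the form 4n+1 that does not exceed `requested`."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return (requested - 1) // 4 * 4 + 1


def parse_guide_scale(text: str) -> tuple[float, ...]:
    """Parse '5.0' into (5.0,) or '3.0,4.0' into (3.0, 4.0)."""
    parts = [float(piece) for piece in text.split(",")]
    if len(parts) not in (1, 2):
        raise ValueError("expected a single float or a low,high pair")
    return tuple(parts)


if __name__ == "__main__":
    for n in (41, 81, 100):
        print(n, "->", snap_num_frames(n))
    print(parse_guide_scale("5.0"))
    print(parse_guide_scale("3.0,4.0"))
```

For example, a request for 100 frames snaps down to 97, while 41 and 81 are already valid.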

Requirements

  • macOS with Apple Silicon
  • Python >= 3.11
  • MLX >= 0.22.0
  • For weight conversion: PyTorch (pip install torch)
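The requirements above can be checked from Python before installing large model weights. The script below is a sketch using only the standard library. It assumes MLX is installed under the PyPI distribution name `mlx`, and the helper names are made up for illustration.

```python
import platform
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '0.22.0' into (0, 22, 0), stopping at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements() -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.platform != "darwin" or platform.machine() != "arm64":
        problems.append("macOS on Apple Silicon is required")
    if sys.version_info < (3, 11):
        problems.append("Python >= 3.11 is required")
    try:
        installed = metadata.version("mlx")
    except metadata.PackageNotFoundError:
        problems.append("MLX is not installed")
    else:
        if version_tuple(installed) < (0, 22, 0):
            problems.append(f"MLX >= 0.22.0 is required, found {installed}")
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    print("All requirements met" if not issues else "\n".join(issues))
```

Note that a Python interpreter running under Rosetta reports `x86_64` and would be flagged by the first check.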

License

MIT
