Commit 9788e11

Merge pull request #2 from audiohacking/audio-video
Audio-video pipeline: concat/split DiT, WAV output, docs
2 parents 14daa19 + a65a6f9 commit 9788e11

8 files changed

Lines changed: 572 additions & 115 deletions

File tree

CLAUDE.md

Lines changed: 21 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,8 @@
11
# ltx.cpp
22

3-
C++ inference engine for LTX-Video (LTX 2.3) — text-to-video and image-to-video generation using GGML backends (Metal, CUDA, CPU).
3+
C++ inference engine for LTX-Video (LTX 2.3) — text-to-video, image-to-video, and **audio-video (AV)** generation using GGML backends (Metal, CUDA, CPU).
4+
5+
**Branch `audio-video`**: same DiT sees concatenated video+audio latent; one denoise loop; output is video frames + WAV. See `docs/AV_PIPELINE.md` and README “Audio-video (AV)” section.
46

57
## Build
68

@@ -23,15 +25,17 @@ cmake -B build -DLTX_HIP=ON # ROCm/AMD
2325

2426
## Models
2527

26-
Download with `./models.sh` (requires `huggingface-cli`):
28+
Download with `./models.sh` (requires `curl` or `wget`):
2729
```bash
28-
./models.sh # DiT Q4_K_M + T5 Q8_0 + VAE + extras
30+
./models.sh # Dev DiT (default) + T5 + VAE + extras
31+
./models.sh --distilled # Distilled DiT (few-step 4–8, CFG=1) from same repo
2932
./models.sh --minimal # DiT + T5 + VAE only
3033
./models.sh --quant Q8_0 # different DiT quant
3134
```
3235

3336
Models land flat under `models/`. Key files:
34-
- `models/ltx-2.3-22b-dev-Q4_K_M.gguf` — DiT weights
37+
- `models/ltx-2.3-22b-dev-Q4_K_M.gguf` — Dev DiT (default)
38+
- `models/ltx-2.3-22b-distilled-Q4_K_M.gguf` — Distilled DiT (with `--distilled`)
3539
- `models/ltx-2.3-22b-dev_video_vae.safetensors` — VAE
3640
- `models/t5-v1_1-xxl-encoder-Q8_0.gguf` — T5 text encoder
3741

@@ -47,9 +51,14 @@ build/ltx-generate \
4751
--steps 20 --out output/frame
4852
```
4953

54+
**Audio-video (AV):** add `--av` and optionally `--out-wav path.wav` to get video frames + WAV from the same run. Mux with ffmpeg: `ffmpeg -framerate 24 -i out_%04d.ppm -i out.wav -c:v libx264 -c:a aac -shortest out.mp4`.
55+
5056
Useful flags:
5157
- `-v` — verbose per-step logging
5258
- `--perf` — print CPU%/RSS/free-RAM/GPU-MB to stderr every 10 s
59+
- `--av` — enable audio+video path (concat latent → DiT → split → decode both)
60+
- `--audio-vae path` — optional; for full audio VAE decoder when implemented
61+
- `--out-wav path` — WAV output when `--av` (default: `<out prefix>.wav`)
5362
- `--start-frame img.png` — image-to-video (I2V)
5463
- `--end-frame img.png` — keyframe interpolation
5564
- `--seed N`, `--cfg F`, `--shift F`, `--threads N`
@@ -65,14 +74,15 @@ BIN=build_debug/ltx-generate bash scripts/test-gpu-migration.sh
6574

6675
| File | Purpose |
6776
|------|---------|
68-
| `src/ltx-generate.cpp` | Main binary: arg parsing, model loading, denoising loop |
69-
| `src/ltx_dit.hpp` | DiT transformer (forward pass, block loop, Metal/CPU paths) |
77+
| `src/ltx-generate.cpp` | Main binary: arg parsing, model loading, denoising loop; AV path (concat/split, WAV output) |
78+
| `src/ltx_dit.hpp` | DiT transformer (forward pass, block loop); `patchify_audio` / `unpatchify_audio` for AV |
7079
| `src/video_vae.hpp` | VAE encoder/decoder (safetensors) |
7180
| `src/t5_encoder.hpp` | T5-XXL text encoder (GGUF) |
7281
| `src/scheduler.hpp` | RF flow scheduler (timesteps, Euler step, CFG) |
7382
| `src/ltx_perf.hpp` | Background perf monitor thread (CPU/RAM stats) |
7483
| `src/ltx_common.hpp` | Shared macros (`LTX_LOG`, `LTX_ERR`), GGML helpers |
7584
| `src/safetensors_loader.cpp` | safetensors file loader |
85+
| `docs/AV_PIPELINE.md` | AV pipeline design (token concat, shapes, CLI) |
7686

7787
## Architecture notes
7888

@@ -87,3 +97,8 @@ BIN=build_debug/ltx-generate bash scripts/test-gpu-migration.sh
8797
| Variable | Default | Effect |
8898
|----------|---------|--------|
8999
| `LTX_MIGRATE_MAX_TENSOR_MB` | `6144` | Max single-tensor size for GPU migration |
100+
101+
## Branch: audio-video
102+
103+
- **AV path**: with `--av`, video and audio latents are patchified, concatenated (video then audio tokens), passed through one DiT forward, then split; Euler step on both; video decoded with existing VAE, audio turned into WAV via a latent→waveform fallback.
104+
- **Full audio VAE** (safetensors decoder) is not yet implemented; audio quality uses the fallback. See `docs/AV_PIPELINE.md` and `DEV.md` §5.
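The "Euler step on both" part of the AV path can be pictured with the sketch below. It is only an illustration under stated assumptions: the update rule `x <- x + (t_next - t_cur) * v` is the generic explicit Euler form for a velocity-predicting model, and the function names `euler_step_sketch` and `av_euler_step_sketch` are hypothetical. The actual scheduler in `src/scheduler.hpp` may use a different sign or timestep convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical explicit Euler update (assumed convention, not taken from
// scheduler.hpp): x <- x + (t_next - t_cur) * v, applied element-wise.
void euler_step_sketch(std::vector<float>& x, const std::vector<float>& v,
                       float t_cur, float t_next) {
    assert(x.size() == v.size());
    const float dt = t_next - t_cur;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += dt * v[i];
    }
}

// One AV step: the same timestep pair drives both latents, each using the
// velocity that was split out of the combined DiT output for it.
void av_euler_step_sketch(std::vector<float>& video_lat,
                          const std::vector<float>& video_vel,
                          std::vector<float>& audio_lat,
                          const std::vector<float>& audio_vel,
                          float t_cur, float t_next) {
    euler_step_sketch(video_lat, video_vel, t_cur, t_next);
    euler_step_sketch(audio_lat, audio_vel, t_cur, t_next);
}
```

Sharing one timestep pair keeps the two modalities at the same noise level throughout the loop, which matches the "one denoise loop" description above.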

DEV.md

Lines changed: 62 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,8 @@ navigate the known limitations.
1616
- [Build configurations](#build-configurations)
1717
- [Obtaining model files](#obtaining-model-files)
1818
4. [End-to-end data flow](#4-end-to-end-data-flow)
19-
5. [Source file reference](#5-source-file-reference)
19+
5. [Audio-video (AV) pipeline](#5-audio-video-av-pipeline)
20+
6. [Source file reference](#6-source-file-reference)
2021
- [ltx\_common.hpp](#ltx_commonhpp)
2122
- [scheduler.hpp](#schedulerhpp)
2223
- [t5\_encoder.hpp](#t5_encoderhpp)
@@ -25,25 +26,25 @@ navigate the known limitations.
2526
- [ltx-generate.cpp](#ltx-generatecpp)
2627
- [ltx-quantize.cpp](#ltx-quantizecpp)
2728
- [convert.py](#convertpy)
28-
6. [GGUF model format conventions](#6-gguf-model-format-conventions)
29+
7. [GGUF model format conventions](#7-gguf-model-format-conventions)
2930
- [DiT GGUF](#dit-gguf)
3031
- [VAE GGUF](#vae-gguf)
3132
- [T5 GGUF](#t5-gguf)
32-
7. [Image-to-video (I2V) design](#7-image-to-video-i2v-design)
33+
8. [Image-to-video (I2V) design](#8-image-to-video-i2v-design)
3334
- [VaeEncoder](#vaeencoder)
3435
- [Frame-conditioning schedule](#frame-conditioning-schedule)
3536
- [Hard-pinning at t=0](#hard-pinning-at-t0)
36-
8. [Key algorithms and design decisions](#8-key-algorithms-and-design-decisions)
37+
9. [Key algorithms and design decisions](#9-key-algorithms-and-design-decisions)
3738
- [Rectified Flow (RF) scheduling](#rectified-flow-rf-scheduling)
3839
- [Classifier-free guidance](#classifier-free-guidance)
3940
- [Patchify / unpatchify](#patchify--unpatchify)
4041
- [Latent dimension formulas](#latent-dimension-formulas)
4142
- [Tokenizer](#tokenizer)
42-
9. [Adding a new backend (GPU/Metal/Vulkan)](#9-adding-a-new-backend-gpumetalvulkan)
43-
10. [Known limitations and open tasks](#10-known-limitations-and-open-tasks)
44-
11. [Coding conventions](#11-coding-conventions)
45-
12. [Testing](#12-testing)
46-
13. [Contributing](#13-contributing)
43+
10. [Adding a new backend (GPU/Metal/Vulkan)](#10-adding-a-new-backend-gpumetalvulkan)
44+
11. [Known limitations and open tasks](#11-known-limitations-and-open-tasks)
45+
12. [Coding conventions](#12-coding-conventions)
46+
13. [Testing](#13-testing)
47+
14. [Contributing](#14-contributing)
4748

4849
---
4950

@@ -90,6 +91,9 @@ ltx.cpp/
9091
├── checkpoints.sh Download raw HF safetensors checkpoints
9192
├── models.sh Download pre-quantised GGUF models from Unsloth/HF
9293
├── quantize.sh Shell wrapper: run ltx-quantize on all BF16 GGUFs
94+
├── docs/
95+
│ ├── AV_PIPELINE.md Audio-video pipeline design (token concat, shapes, CLI)
96+
│ └── LTX_COMFY_REFERENCE.md ComfyUI workflow reference
9397
9498
└── ggml/ Git submodule — GGML tensor library
9599
```
@@ -246,12 +250,32 @@ CLI args
246250
→ start_lat / end_lat [H_lat × W_lat × 128]
247251
248252
These latents are blended into the live denoising latent after each Euler step
249-
(see §7 for the full schedule).
253+
(see §8 for the full schedule).
250254
```
251255

252256
---
253257

254-
## 5. Source file reference
258+
## 5. Audio-video (AV) pipeline
259+
260+
**Branch: `audio-video`.** The LTX 2.3 GGUF DiT is a full **audio-video** model: it expects a single sequence of **concatenated video + audio** tokens and outputs a combined velocity that is split back into video and audio.
261+
262+
**Data flow when `--av`:**
263+
264+
1. **Latent init**: Video latent `[T_lat, H_lat, W_lat, C]` (as today) plus audio latent `[T_lat, 8, 16]` (C_audio=8, mel_bins=16), both filled with noise.
265+
2. **Per step**:
266+
- `patchify()` → video tokens `[n_video_tok, 128]`; `patchify_audio()` → audio tokens `[T_lat, 128]`.
267+
- Concat → `[n_video_tok + T_lat, 128]` = `[n_tok_total, Pd]`.
268+
- `LtxDiT::forward(combined, n_tok_total, …)` → combined velocity.
269+
- Split: first `n_video_tok` tokens → video velocity; remainder → audio velocity.
270+
- Unpatchify both; Euler step on video latent and on audio latent.
271+
- (Optional) frame conditioning on video only (unchanged).
272+
3. **Decode**: Video VAE decode → PPM frames (unchanged). Audio: denoised audio latent → waveform via a **latent-to-waveform** path (fake mel + overlap-add sinusoids) → 16-bit WAV (16 kHz). A full **audio VAE decoder** (safetensors) can be integrated later for higher-quality audio.
273+
274+
**Code:** `patchify_audio` / `unpatchify_audio` in `ltx_dit.hpp`; combined patch buffer, split, and dual Euler step in `ltx-generate.cpp`; `write_wav()` and `latent_to_waveform()` in `ltx-generate.cpp`. Design details: [docs/AV_PIPELINE.md](docs/AV_PIPELINE.md).
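The per-step token handling described above can be sketched as follows. This is a minimal illustration, assuming row-major `float` buffers and an audio latent `[T, C, F]` in which each time step is already contiguous, so one step maps directly to one token of width `C * F` (128 for C=8, F=16). The helper names `patchify_audio_sketch`, `concat_tokens`, and `split_tokens` are hypothetical and do not reproduce the signatures in `ltx_dit.hpp` or the real token layout.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: flatten an audio latent [T, C, F] (row-major) into
// T tokens of width C*F. Under the assumed layout this is a plain copy.
std::vector<float> patchify_audio_sketch(const std::vector<float>& lat,
                                         std::size_t T, std::size_t C,
                                         std::size_t F) {
    assert(lat.size() == T * C * F);
    return std::vector<float>(lat.begin(), lat.end());
}

// Append audio tokens after video tokens: [n_video_tok + n_audio_tok, dim].
std::vector<float> concat_tokens(const std::vector<float>& video,
                                 const std::vector<float>& audio) {
    std::vector<float> out;
    out.reserve(video.size() + audio.size());
    out.insert(out.end(), video.begin(), video.end());
    out.insert(out.end(), audio.begin(), audio.end());
    return out;
}

// Split a combined buffer back into (video, audio): the first n_video_tok
// tokens belong to video, the remainder to audio.
std::pair<std::vector<float>, std::vector<float>>
split_tokens(const std::vector<float>& combined, std::size_t n_video_tok,
             std::size_t dim) {
    assert(n_video_tok * dim <= combined.size());
    const auto mid =
        combined.begin() + static_cast<std::ptrdiff_t>(n_video_tok * dim);
    return {std::vector<float>(combined.begin(), mid),
            std::vector<float>(mid, combined.end())};
}
```

Because video tokens always come first, the split only needs `n_video_tok` and the token width; no per-token modality tag is required.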
275+
276+
---
277+
278+
## 6. Source file reference
255279

256280
### `ltx_common.hpp`
257281

@@ -332,7 +356,7 @@ Weights layout expected in the GGUF (prefix `vae.decoder.*`):
332356

333357
`decode(latents, T_lat, H_lat, W_lat)` runs a simplified per-frame 2-D decode
334358
with nearest-neighbour temporal upsampling. Full causal 3-D conv decode is a
335-
planned improvement (see §10).
359+
planned improvement (see §11).
336360

337361
#### `VaeEncoder`
338362

@@ -368,10 +392,12 @@ model.diffusion_model.norm_out.linear.{weight,bias}
368392
```
369393
Fallback names with prefix `dit.*` are also tried.
370394

395+
**Audio (AV pipeline)**: `patchify_audio(lat, T, C, F)` and `unpatchify_audio(tok, T, C, F)` in the same header convert audio latent `[T, 8, 16]` ↔ `[T, 128]` tokens for concatenation with video tokens before the single DiT forward.
396+
371397
**Forward pass** (per call to `LtxDiT::forward()`):
372398
1. Sinusoidal timestep embedding → MLP → `hidden_size` vector
373399
2. AdaLN-single linear → `6 × hidden_size` (scale/shift params; currently
374-
stored but not yet fully applied per-block — see §10)
400+
stored but not yet fully applied per-block — see §11)
375401
3. Patchify projection: `[N_tok, patch_dim]` → `[N_tok, hidden_size]`
376402
4. Caption projection: `[S, 4096]` → `[S, hidden_size]`
377403
5. N × transformer blocks:
@@ -411,11 +437,13 @@ Orchestrates the full inference pipeline.
411437
| `start_frame_path` | `--start-frame` | `""` (disabled) |
412438
| `end_frame_path` | `--end-frame` | `""` (disabled) |
413439
| `frame_strength` | `--frame-strength` | 1.0 |
440+
| `av` | `--av` | false (enable audio+video path) |
441+
| `audio_vae_path` | `--audio-vae` | `""` (optional; for full decoder when implemented) |
442+
| `out_wav` | `--out-wav` | `""` (default: `<out prefix>.wav` when `--av`) |
414443
| `threads` | `--threads` | 4 |
415444
| `verbose` | `-v` | false |
416445

417-
**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. The output
418-
directory is created automatically (including intermediate directories).
446+
**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. When `--av`, a WAV file is also written (default `{out_prefix}.wav`). The output directory is created automatically (including intermediate directories).
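For orientation, the WAV side of the output can be sketched as below. This is not the project's `write_wav()`, whose signature is not shown here. It is a hypothetical helper, `build_wav_bytes`, that assumes mono float samples in `[-1, 1]` and emits the canonical 44-byte RIFF header followed by 16-bit little-endian PCM at 16 kHz.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append the low `bytes` bytes of v in little-endian order.
static void put_le(std::vector<std::uint8_t>& b, std::uint32_t v, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        b.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu));
    }
}

// Append a four-character chunk tag such as "RIFF".
static void put_tag(std::vector<std::uint8_t>& b, const std::string& tag) {
    b.insert(b.end(), tag.begin(), tag.end());
}

// Hypothetical helper: encode mono float samples as a 16-bit PCM WAV image.
std::vector<std::uint8_t> build_wav_bytes(const std::vector<float>& samples,
                                          std::uint32_t sample_rate = 16000) {
    assert(samples.size() <= 0x7FFFFFFFu / 2);  // data size must fit in 32 bits
    const std::uint32_t data_bytes =
        static_cast<std::uint32_t>(samples.size() * 2);
    std::vector<std::uint8_t> b;
    b.reserve(44 + data_bytes);
    put_tag(b, "RIFF"); put_le(b, 36 + data_bytes, 4); put_tag(b, "WAVE");
    put_tag(b, "fmt "); put_le(b, 16, 4);
    put_le(b, 1, 2);                // audio format: PCM
    put_le(b, 1, 2);                // channels: mono
    put_le(b, sample_rate, 4);
    put_le(b, sample_rate * 2, 4);  // byte rate = rate * channels * 2 bytes
    put_le(b, 2, 2);                // block align
    put_le(b, 16, 2);               // bits per sample
    put_tag(b, "data"); put_le(b, data_bytes, 4);
    for (float s : samples) {
        const float c = std::max(-1.0f, std::min(1.0f, s));  // clip to range
        const std::int16_t q = static_cast<std::int16_t>(c * 32767.0f);
        put_le(b, static_cast<std::uint16_t>(q), 2);
    }
    return b;
}
```

The returned buffer can be written to `{out_prefix}.wav` in a single binary `fwrite` or `std::ofstream::write` call.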
419447

420448
---
421449

@@ -430,7 +458,7 @@ Rules:
430458
- Everything else → quantised to `target_type`
431459

432460
All GGUF KV metadata is copied verbatim. String arrays (e.g. the tokenizer
433-
vocabulary) are not currently copied — this is a known limitation (see §10).
461+
vocabulary) are not currently copied — this is a known limitation (see §11).
434462

435463
Supported quant types: `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `BF16`, `F32`, `F16`.
436464

@@ -458,7 +486,7 @@ For T5, the HF tokenizer vocabulary can be embedded into the GGUF via
458486

459487
---
460488

461-
## 6. GGUF model format conventions
489+
## 7. GGUF model format conventions
462490

463491
### DiT GGUF
464492

@@ -501,7 +529,7 @@ Architecture string: `"t5"`
501529

502530
---
503531

504-
## 7. Image-to-video (I2V) design
532+
## 8. Image-to-video (I2V) design
505533

506534
The I2V implementation does not modify the DiT architecture. Instead it
507535
works by conditioning the *latent* directly at the boundary frames before and
@@ -553,7 +581,7 @@ appearance, regardless of any residual denoising drift.
553581
554582
---
555583
556-
## 8. Key algorithms and design decisions
584+
## 9. Key algorithms and design decisions
557585
558586
### Rectified Flow (RF) scheduling
559587
@@ -587,17 +615,19 @@ The unconditional embedding is computed by encoding the `--neg` prompt
587615
588616
### Patchify / unpatchify
589617
590-
The DiT operates on *tokens*, not on the raw latent volume. The latent
618+
The DiT operates on *tokens*, not on the raw latent volume. The **video** latent
591619
`[T_lat, H_lat, W_lat, C]` is chunked into non-overlapping patches of size
592620
`(pt=1, ph=2, pw=2)` along the temporal, height, and width dimensions:
593621
594622
```
595-
patch_dim = pt * ph * pw * C = 1 * 2 * 2 * 128 = 512
623+
patch_dim = pt * ph * pw * C = 1 * 2 * 2 * 128 = 512 (or 128 for C=32)
596624
N_tok = (T_lat/pt) * (H_lat/ph) * (W_lat/pw)
597625
```
598626
599-
`patchify()` and `unpatchify()` are helper functions called from
600-
`ltx-generate.cpp`. Both are pure memory rearrangements with no arithmetic.
627+
`patchify()` and `unpatchify()` are helper functions in `ltx_dit.hpp` called from
628+
`ltx-generate.cpp`. For the **audio-video** path, `patchify_audio()` and
629+
`unpatchify_audio()` convert audio latent `[T, 8, 16]` to/from `[T, 128]` tokens;
630+
video and audio token sequences are concatenated before the DiT forward and split after. All are pure memory rearrangements with no arithmetic.
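The shape arithmetic above can be checked with a small sketch. The struct `PatchShape` and the functions `video_patch_shape` and `total_tokens` are hypothetical names introduced for illustration; they assume the latent dimensions divide evenly by the patch sizes and that the AV path adds one audio token per latent frame, as described for `patchify_audio()`.

```cpp
#include <cassert>
#include <cstddef>

// Token count and per-token width for a patchified video latent.
struct PatchShape {
    std::size_t n_tok;
    std::size_t patch_dim;
};

// Video latent [T_lat, H_lat, W_lat, C] with patches (pt, ph, pw):
//   patch_dim = pt * ph * pw * C
//   N_tok     = (T_lat/pt) * (H_lat/ph) * (W_lat/pw)
PatchShape video_patch_shape(std::size_t T_lat, std::size_t H_lat,
                             std::size_t W_lat, std::size_t C,
                             std::size_t pt = 1, std::size_t ph = 2,
                             std::size_t pw = 2) {
    assert(T_lat % pt == 0 && H_lat % ph == 0 && W_lat % pw == 0);
    return {(T_lat / pt) * (H_lat / ph) * (W_lat / pw), pt * ph * pw * C};
}

// AV path: the combined sequence is video tokens followed by T_lat audio tokens.
std::size_t total_tokens(const PatchShape& video, std::size_t T_lat) {
    return video.n_tok + T_lat;
}
```

For example, a latent with `T_lat=4, H_lat=8, W_lat=12` yields 96 video tokens, and the AV path extends the sequence to 100 tokens.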
601631
602632
### Latent dimension formulas
603633
@@ -643,7 +673,7 @@ Scores are written by `convert.py --tokenizer` (via
643673
644674
---
645675
646-
## 9. Backends (GPU: Metal, CUDA, Vulkan, ROCm)
676+
## 10. Backends (GPU: Metal, CUDA, Vulkan, ROCm)
647677
648678
We follow the same pattern as [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp): **the build command determines the backend**. One backend per build; no platform-specific divergence in code.
649679
@@ -658,7 +688,7 @@ The main performance bottleneck is the DiT `forward()` call, which rebuilds a `g
658688
659689
---
660690
661-
## 10. Known limitations and open tasks
691+
## 11. Known limitations and open tasks
662692
663693
These are the main areas where the implementation is deliberately simplified
664694
and where contributions are most welcome.
@@ -677,10 +707,11 @@ and where contributions are most welcome.
677707
| 10 | **Threading** | `--threads` is parsed but not passed to `ggml_graph_compute_with_ctx` | Wire the thread count through to `ggml_graph_compute_with_ctx(ctx, gf, n_threads)` |
678708
| 11 | **Output formats** | Only binary PPM (P6) output | Add JPEG/PNG output via stb_image_write or a similar library |
679709
| 12 | **Windows `_mkdir`** | Only one level of directory is created on Windows | Implement recursive mkdir for Windows |
710+
| 13 | **Audio VAE decoder** | With `--av`, audio is synthesized from the denoised latent via a fallback (fake mel + overlap-add); no full audio VAE decode yet | Load `ltx-2.3-22b-dev_audio_vae.safetensors` and implement 2D conv decoder (see docs/AV_PIPELINE.md) |
680711
681712
---
682713
683-
## 11. Coding conventions
714+
## 12. Coding conventions
684715
685716
- **Language**: C++17 throughout; no exceptions (use return codes).
686717
- **Headers only**: all modules live in `src/*.hpp`. Only the two `main()`
@@ -704,7 +735,7 @@ and where contributions are most welcome.
704735
705736
---
706737
707-
## 12. Testing
738+
## 13. Testing
708739
709740
There is no formal test suite yet. Validation is currently done by:
710741
@@ -726,16 +757,16 @@ There is no formal test suite yet. Validation is currently done by:
726757
727758
---
728759
729-
## 13. Contributing
760+
## 14. Contributing
730761
731762
1. **Fork** the repository and create a branch from `main`.
732-
2. **Read §10** to find where help is most needed.
763+
2. **Read §11** to find where help is most needed.
733764
3. **Keep PRs focused** — one feature or fix per PR.
734-
4. **Match the style** described in §11.
765+
4. **Match the style** described in §12.
735766
5. **Document** any new CLI flag in both `print_usage()` (in
736767
`ltx-generate.cpp`) and `README.md`.
737768
6. **Update this file** (`DEV.md`) if you add a new module, change the GGUF
738-
schema, or significantly alter the data flow.
769+
schema, or significantly alter the data flow (e.g. AV pipeline in §5).
739770
7. **No model weights** should ever be committed to the repo.
740771
741772
For questions, open a GitHub Discussion or issue in the
