CLAUDE.md
Lines changed: 21 additions & 6 deletions
@@ -1,6 +1,8 @@
 # ltx.cpp

-C++ inference engine for LTX-Video (LTX 2.3) — text-to-video and image-to-video generation using GGML backends (Metal, CUDA, CPU).
+C++ inference engine for LTX-Video (LTX 2.3) — text-to-video, image-to-video, and **audio-video (AV)** generation using GGML backends (Metal, CUDA, CPU).
+
+**Branch `audio-video`**: same DiT sees concatenated video+audio latent; one denoise loop; output is video frames + WAV. See `docs/AV_PIPELINE.md` and README “Audio-video (AV)” section.
 - `models/t5-v1_1-xxl-encoder-Q8_0.gguf` — T5 text encoder
@@ -47,9 +51,14 @@ build/ltx-generate \
 --steps 20 --out output/frame
 ```

+**Audio-video (AV):** add `--av` and optionally `--out-wav path.wav` to get video frames + WAV from the same run. Mux with ffmpeg: `ffmpeg -framerate 24 -i out_%04d.ppm -i out.wav -c:v libx264 -c:a aac -shortest out.mp4`.
+
 Useful flags:
 - `-v` — verbose per-step logging
 - `--perf` — print CPU%/RSS/free-RAM/GPU-MB to stderr every 10 s
 | `LTX_MIGRATE_MAX_TENSOR_MB` | `6144` | Max single-tensor size for GPU migration |
+
+## Branch: audio-video
+
+- **AV path**: with `--av`, video and audio latents are patchified, concatenated (video then audio tokens), passed through one DiT forward, then split; Euler step on both; video decoded with existing VAE, audio turned into WAV via a latent→waveform fallback.
+- **Full audio VAE** (safetensors decoder) is not yet implemented; audio quality uses the fallback. See `docs/AV_PIPELINE.md` and `DEV.md` §5.
 These latents are blended into the live denoising latent after each Euler step
-(see §7 for the full schedule).
+(see §8 for the full schedule).
 ```

 ---

-## 5. Source file reference
+## 5. Audio-video (AV) pipeline
+
+**Branch: `audio-video`.** The LTX 2.3 GGUF DiT is a full **audio-video** model: it expects a single sequence of **concatenated video + audio** tokens and outputs a combined velocity that is split back into video and audio.
+
+**Data flow when `--av`:**
+
+1. **Latent init**: Video latent `[T_lat, H_lat, W_lat, C]` (as today) plus audio latent `[T_lat, 8, 16]` (C_audio=8, mel_bins=16), both filled with noise.
+- Split: first `n_video_tok` tokens → video velocity; remainder → audio velocity.
+- Unpatchify both; Euler step on video latent and on audio latent.
+- (Optional) frame conditioning on video only (unchanged).
+3. **Decode**: Video VAE decode → PPM frames (unchanged). Audio: denoised audio latent → waveform via a **latent-to-waveform** path (fake mel + overlap-add sinusoids) → 16-bit WAV (16 kHz). A full **audio VAE decoder** (safetensors) can be integrated later for higher-quality audio.
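The dual Euler step mentioned in the data flow can be sketched as below. This is a minimal illustration, assuming the common flow-matching update `x += (sigma_next - sigma) * v`; the helper names `euler_step` and `euler_step_av` and the sigma convention are hypothetical and are not taken from `ltx-generate.cpp`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place Euler update x += (sigma_next - sigma) * v.
// The update form is an assumption (usual flow-matching convention).
// Returns false on a size mismatch instead of throwing.
inline bool euler_step(std::vector<float>& latent,
                       const std::vector<float>& velocity,
                       float sigma, float sigma_next) {
    if (latent.size() != velocity.size()) return false;
    const float dt = sigma_next - sigma;
    for (std::size_t i = 0; i < latent.size(); ++i) {
        latent[i] += dt * velocity[i];
    }
    return true;
}

// Apply the same step to the video latent and to the audio latent,
// each with its own slice of the split velocity.
inline bool euler_step_av(std::vector<float>& video_lat,
                          const std::vector<float>& video_vel,
                          std::vector<float>& audio_lat,
                          const std::vector<float>& audio_vel,
                          float sigma, float sigma_next) {
    return euler_step(video_lat, video_vel, sigma, sigma_next) &&
           euler_step(audio_lat, audio_vel, sigma, sigma_next);
}
```

Both modalities share one sigma schedule here because the text describes a single denoise loop; whether the real code uses separate schedules is not stated in this diff.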
+
+**Code:** `patchify_audio` / `unpatchify_audio` in `ltx_dit.hpp`; combined patch buffer, split, and dual Euler step in `ltx-generate.cpp`; `write_wav()` and `latent_to_waveform()` in `ltx-generate.cpp`. Design details: [docs/AV_PIPELINE.md](docs/AV_PIPELINE.md).
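For orientation, a 16-bit PCM WAV writer of the kind `write_wav()` implies could look like the sketch below. The function name `write_wav_pcm16`, the mono channel layout, and the clamping of samples to [-1, 1] are assumptions made for illustration; only the 16-bit depth and the 16 kHz rate come from the text above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write the low n_bytes of value in little-endian order.
inline void put_le(std::FILE* f, std::uint32_t value, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        std::fputc(static_cast<int>((value >> (8 * i)) & 0xFFu), f);
    }
}

// Hypothetical sketch: mono 16-bit PCM WAV with a 44-byte RIFF header.
// Returns false on I/O failure (no exceptions).
inline bool write_wav_pcm16(const char* path,
                            const std::vector<float>& samples,
                            std::uint32_t sample_rate = 16000) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const std::uint32_t data_bytes =
        static_cast<std::uint32_t>(samples.size() * 2);
    std::fwrite("RIFF", 1, 4, f);
    put_le(f, 36u + data_bytes, 4);   // RIFF chunk size
    std::fwrite("WAVE", 1, 4, f);
    std::fwrite("fmt ", 1, 4, f);
    put_le(f, 16, 4);                 // fmt chunk size
    put_le(f, 1, 2);                  // format tag: PCM
    put_le(f, 1, 2);                  // channels: mono (assumption)
    put_le(f, sample_rate, 4);
    put_le(f, sample_rate * 2u, 4);   // byte rate = rate * block align
    put_le(f, 2, 2);                  // block align: 2 bytes per frame
    put_le(f, 16, 2);                 // bits per sample
    std::fwrite("data", 1, 4, f);
    put_le(f, data_bytes, 4);
    for (float s : samples) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const auto q = static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
        put_le(f, static_cast<std::uint16_t>(q), 2);
    }
    const bool ok = (std::ferror(f) == 0);
    return (std::fclose(f) == 0) && ok;
}
```

Writing bytes explicitly keeps the output little-endian regardless of host byte order, which the RIFF format requires.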
+
+---
+
+## 6. Source file reference

 ### `ltx_common.hpp`

@@ -332,7 +356,7 @@ Weights layout expected in the GGUF (prefix `vae.decoder.*`):

 `decode(latents, T_lat, H_lat, W_lat)` runs a simplified per-frame 2-D decode
 with nearest-neighbour temporal upsampling. Full causal 3-D conv decode is a
 Fallback names with prefix `dit.*` are also tried.

+**Audio (AV pipeline)**: `patchify_audio(lat, T, C, F)` and `unpatchify_audio(tok, T, C, F)` in the same header convert audio latent `[T, 8, 16]` ↔ `[T, 128]` tokens for concatenation with video tokens before the single DiT forward.
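The `[T, 8, 16]` ↔ `[T, 128]` conversion can be sketched as follows. The `_sketch` function names and the within-token ordering (channel-major, index `c * F + f`) are assumptions; the diff does not show the real bodies in `ltx_dit.hpp`. With a row-major `[T, C, F]` latent and this ordering the mapping reduces to a plain copy, and a different ordering would change only the destination index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: flatten audio latent [T, C, F] into tokens [T, C*F].
// Returns an empty vector if the input size does not match T*C*F.
inline std::vector<float> patchify_audio_sketch(const std::vector<float>& lat,
                                                int T, int C, int F) {
    const std::size_t n = static_cast<std::size_t>(T) * C * F;
    if (lat.size() != n) return {};
    std::vector<float> tok(n);
    for (int t = 0; t < T; ++t) {
        for (int c = 0; c < C; ++c) {
            for (int f = 0; f < F; ++f) {
                const std::size_t src = (static_cast<std::size_t>(t) * C + c) * F + f;
                const std::size_t dst = static_cast<std::size_t>(t) * (C * F) + (c * F + f);
                tok[dst] = lat[src];  // assumed channel-major order inside a token
            }
        }
    }
    return tok;
}

// Inverse of the sketch above: tokens [T, C*F] back to latent [T, C, F].
inline std::vector<float> unpatchify_audio_sketch(const std::vector<float>& tok,
                                                  int T, int C, int F) {
    const std::size_t n = static_cast<std::size_t>(T) * C * F;
    if (tok.size() != n) return {};
    std::vector<float> lat(n);
    for (int t = 0; t < T; ++t) {
        for (int c = 0; c < C; ++c) {
            for (int f = 0; f < F; ++f) {
                const std::size_t src = static_cast<std::size_t>(t) * (C * F) + (c * F + f);
                const std::size_t dst = (static_cast<std::size_t>(t) * C + c) * F + f;
                lat[dst] = tok[src];
            }
        }
    }
    return lat;
}
```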
+
 **Forward pass** (per call to `LtxDiT::forward()`):
+| `audio_vae_path` | `--audio-vae` | `""` (optional; for full decoder when implemented) |
+| `out_wav` | `--out-wav` | `""` (default: `<out prefix>.wav` when `--av`) |
 | `threads` | `--threads` | 4 |
 | `verbose` | `-v` | false |

-**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. The output
-directory is created automatically (including intermediate directories).
+**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. When `--av`, a WAV file is also written (default `{out_prefix}.wav`). The output directory is created automatically (including intermediate directories).
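The naming rule can be illustrated with two small helpers. `frame_path` and `resolve_wav_path` are hypothetical names, and the four-digit zero padding is inferred from the `out_%04d.ppm` pattern in the ffmpeg example rather than from the source code; whether frame indices start at 0 or 1 is not stated.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: "{out_prefix}_{NNNN}.ppm" with 4-digit zero padding.
inline std::string frame_path(const std::string& out_prefix, int index) {
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "_%04d.ppm", index);
    return out_prefix + suffix;
}

// Hypothetical helper: use --out-wav if given, else "{out_prefix}.wav".
inline std::string resolve_wav_path(const std::string& out_prefix,
                                    const std::string& out_wav) {
    return out_wav.empty() ? out_prefix + ".wav" : out_wav;
}
```

For example, under these assumptions `frame_path("output/frame", 7)` yields `output/frame_0007.ppm`, and an empty `--out-wav` resolves to `output/frame.wav`.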

 ---

@@ -430,7 +458,7 @@ Rules:
 - Everything else → quantised to `target_type`

 All GGUF KV metadata is copied verbatim. String arrays (e.g. the tokenizer
-vocabulary) are not currently copied — this is a known limitation (see §10).
+vocabulary) are not currently copied — this is a known limitation (see §11).
video and audio token sequences are concatenated before the DiT forward and split after. All are pure memory rearrangements with no arithmetic.
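A sketch of the concatenate-then-split rearrangement, under the assumption that video and audio tokens share the same per-token width `dim` and are stored flattened as `[n_tok, dim]`. The names `concat_tokens`, `split_tokens`, and `SplitTokens` are hypothetical. As the text says, nothing here does arithmetic on the values; both functions only copy memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical container for the two halves of the DiT output.
struct SplitTokens {
    std::vector<float> video;
    std::vector<float> audio;
};

// Append audio tokens after video tokens (video first, then audio).
inline std::vector<float> concat_tokens(const std::vector<float>& video_tok,
                                        const std::vector<float>& audio_tok) {
    std::vector<float> combined;
    combined.reserve(video_tok.size() + audio_tok.size());
    combined.insert(combined.end(), video_tok.begin(), video_tok.end());
    combined.insert(combined.end(), audio_tok.begin(), audio_tok.end());
    return combined;
}

// First n_video_tok tokens go to video; the remainder goes to audio.
// The cut point is clamped so an oversized n_video_tok cannot read out of range.
inline SplitTokens split_tokens(const std::vector<float>& combined,
                                std::size_t n_video_tok, std::size_t dim) {
    const std::size_t cut = std::min(n_video_tok * dim, combined.size());
    const auto mid = combined.begin() + static_cast<std::ptrdiff_t>(cut);
    return SplitTokens{std::vector<float>(combined.begin(), mid),
                       std::vector<float>(mid, combined.end())};
}
```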

 ### Latent dimension formulas

@@ -643,7 +673,7 @@ Scores are written by `convert.py --tokenizer` (via

 ---

-## 9. Backends (GPU: Metal, CUDA, Vulkan, ROCm)
+## 10. Backends (GPU: Metal, CUDA, Vulkan, ROCm)

 We follow the same pattern as [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp): **the build command determines the backend**. One backend per build; no platform-specific divergence in code.

@@ -658,7 +688,7 @@ The main performance bottleneck is the DiT `forward()` call, which rebuilds a `g

 ---

-## 10. Known limitations and open tasks
+## 11. Known limitations and open tasks

 These are the main areas where the implementation is deliberately simplified
 and where contributions are most welcome.
@@ -677,10 +707,11 @@ and where contributions are most welcome.
 | 10 | **Threading** | `--threads` is parsed but not passed to `ggml_graph_compute_with_ctx` | Wire the thread count through to `ggml_graph_compute_with_ctx(ctx, gf, n_threads)` |
 | 11 | **Output formats** | Only binary PPM (P6) output | Add JPEG/PNG output via stb_image_write or a similar library |
 | 12 | **Windows `_mkdir`** | Only one level of directory is created on Windows | Implement recursive mkdir for Windows |
+| 13 | **Audio VAE decoder** | With `--av`, audio is synthesized from the denoised latent via a fallback (fake mel + overlap-add); no full audio VAE decode yet | Load `ltx-2.3-22b-dev_audio_vae.safetensors` and implement 2D conv decoder (see docs/AV_PIPELINE.md) |
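Row 12 asks for recursive directory creation on Windows. One portable option, offered as a sketch rather than the project's chosen fix, is the C++17 `std::filesystem::create_directories` call with its non-throwing `std::error_code` overload, which fits the return-code convention. The helper name `ensure_parent_dir` is hypothetical.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: create every missing directory above an output prefix
// such as "output/run1/frame". Returns false on failure, never throws.
inline bool ensure_parent_dir(const std::string& out_prefix) {
    const std::filesystem::path parent =
        std::filesystem::path(out_prefix).parent_path();
    if (parent.empty()) return true;  // prefix is in the current directory
    std::error_code ec;
    // Returns false (without setting ec) when the directory already exists,
    // so the error code, not the return value, signals failure.
    std::filesystem::create_directories(parent, ec);
    if (ec) return false;
    return std::filesystem::is_directory(parent, ec);
}
```

Older GCC releases need `-lstdc++fs` to link `std::filesystem`, which may matter for the supported toolchains.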

 ---

-## 11. Coding conventions
+## 12. Coding conventions

 - **Language**: C++17 throughout; no exceptions (use return codes).
 - **Headers only**: all modules live in `src/*.hpp`. Only the two `main()`
@@ -704,7 +735,7 @@ and where contributions are most welcome.

 ---

-## 12. Testing
+## 13. Testing

 There is no formal test suite yet. Validation is currently done by:

@@ -726,16 +757,16 @@ There is no formal test suite yet. Validation is currently done by:

 ---

-## 13. Contributing
+## 14. Contributing

 1. **Fork** the repository and create a branch from `main`.
-2. **Read §10** to find where help is most needed.
+2. **Read §11** to find where help is most needed.
 3. **Keep PRs focused** — one feature or fix per PR.
-4. **Match the style** described in §11.
+4. **Match the style** described in §12.
 5. **Document** any new CLI flag in both `print_usage()` (in
 `ltx-generate.cpp`) and `README.md`.
 6. **Update this file** (`DEV.md`) if you add a new module, change the GGUF
-schema, or significantly alter the data flow.
+schema, or significantly alter the data flow (e.g. AV pipeline in §5).
 7. **No model weights** should ever be committed to the repo.

 For questions, open a GitHub Discussion or issue in the