
feat: load LTX-2.3 connector weights from GGUF on Apple Silicon#431

Open
Samir Hassen (samirhassen) wants to merge 2 commits into Lightricks:master from samirhassen:ltx23-gguf-support
Conversation

@samirhassen

Summary
Enables LTXVGemmaCLIPModelLoader to load LTX-2.3 (22B AV) connector weights directly from a GGUF checkpoint on Apple Silicon, with no separate safetensors extraction step required. Tested on M4 Max (36 GB) with the Q4_K_S GGUF (~16 GB).

Motivation
The existing loader only supports safetensors checkpoints. On Apple Silicon, the primary distribution format for large models is GGUF (via llama.cpp-style quantisation). The connector and projection weights are embedded in the same GGUF file as the diffusion model, so users shouldn't need to unpack or convert anything manually.

Changes
text_embeddings_connectors.py

  • Added _load_gguf_connector_sd(): reads connector tensors directly from GGUF via gguf.GGUFReader, reverses shape order (GGUF stores dimensions innermost-first), and handles F32 / F16 / BF16 natively. Falls back to ComfyUI-GGUF's dequant.py for quantised types.
  • Hardcoded transformer_config for LTX-2.3 (22B AV), which is not stored in the GGUF metadata: 32 heads × 128 head_dim for video connector (inner_dim=4096), 32 heads × 64 head_dim for audio connector (inner_dim=2048).
  • Auto-discovers proj_linear.safetensors from ComfyUI's text_encoders folder paths and merges it into the state dict, so the text_embedding_projection weights are always picked up without manual configuration.
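A minimal sketch of the GGUF-to-state-dict path described above. The helper names (`gguf_shape_to_torch`, `load_gguf_connector_sd`) and the `prefixes` default are illustrative stand-ins, not the actual identifiers in `text_embeddings_connectors.py`; the quantised-type fallback to ComfyUI-GGUF's `dequant.py` is omitted for brevity:

```python
import numpy as np

def gguf_shape_to_torch(gguf_shape):
    """GGUF records dimensions innermost-first; PyTorch expects outermost-first,
    so the shape must be reversed before reshaping the tensor data."""
    return tuple(reversed([int(d) for d in gguf_shape]))

def load_gguf_connector_sd(path, prefixes=("video_connector.", "audio_connector.")):
    """Read connector tensors from a GGUF checkpoint into a torch state dict.

    Handles F32/F16 natively; other tensor types would need a dequantisation
    fallback (e.g. ComfyUI-GGUF's dequant.py), raised as NotImplementedError here.
    """
    import torch
    from gguf import GGUFReader  # pip install gguf

    reader = GGUFReader(path)
    sd = {}
    for tensor in reader.tensors:
        if not tensor.name.startswith(prefixes):
            continue
        shape = gguf_shape_to_torch(tensor.shape)
        data = np.asarray(tensor.data)
        if data.dtype in (np.float32, np.float16):
            sd[tensor.name] = torch.from_numpy(data.copy()).reshape(shape)
        else:
            raise NotImplementedError(f"quantised type for {tensor.name}")
    return sd
```

The shape reversal is the load-bearing step: a GGUF linear weight recorded as `[in_features, out_features]` becomes the `[out_features, in_features]` layout that `nn.Linear` expects.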

embeddings_connector.py

  • load_embeddings_connector: changed strict=True → strict=False in load_state_dict to tolerate minor key mismatches between GGUF-extracted tensors and the module definition.
  • Embeddings1DConnector.forward: auto-selects RoPE frequency spacing based on inner_dim. The existing "exp" spacing uses POS_EMBEDDING_EXP_VALUES, which is sized for inner_dim=3840 (19B model). LTX-2.3's connector has inner_dim=4096, so "exp_2" (standard scaled formula) is used instead, preventing a shape mismatch at inference time.
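The spacing auto-selection can be sketched as follows; `select_rope_spacing` and the constant name are illustrative, but the widths match the models quoted above (3840 for the 19B connector, anything else falls through to the standard formula):

```python
# Width that POS_EMBEDDING_EXP_VALUES was precomputed for (19B model).
EXP_TABLE_INNER_DIM = 3840

def select_rope_spacing(inner_dim: int) -> str:
    """Pick the RoPE frequency-spacing mode for a connector.

    "exp" indexes a precomputed table sized for inner_dim=3840, so any other
    width (e.g. LTX-2.3's 4096) must use "exp_2", the standard scaled formula,
    to avoid a shape mismatch at inference time.
    """
    return "exp" if inner_dim == EXP_TABLE_INNER_DIM else "exp_2"
```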

gemma_encoder.py

  • LTXVGemmaCLIPModelLoader: made ltxv_path optional ([""] + ...). When empty, auto-discovers the GGUF from ComfyUI's unet/ folder so the UI doesn't require a duplicate path entry.
  • GGUF Gemma loading: falls back to AutoModelForCausalLM.from_pretrained(..., gguf_file=...) when no model*.safetensors is found, enabling the text encoder itself to be loaded from GGUF.

Why is_av must remain enabled
preprocess_text_embeds in the LTXAV transformer checks whether the embedding dimension is cross_attention_dim + audio_cross_attention_dim (4096 + 2048 = 6144) to decide whether it has already been processed. If is_av=False, only the video connector runs and the output is 4096-dim — the transformer then double-processes it and produces garbage. The is_av flag is correctly detected from the presence of audio_adaln_single.linear.weight in the state dict.
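An illustrative version of that dimensionality test (the function name is a stand-in; the constants are the LTX-2.3 22B AV values quoted above):

```python
CROSS_ATTENTION_DIM = 4096        # video connector inner_dim
AUDIO_CROSS_ATTENTION_DIM = 2048  # audio connector inner_dim

def already_processed(embed_dim: int) -> bool:
    """True when the embeddings carry both video and audio channels (6144-dim),
    i.e. both connectors already ran and the transformer must not reprocess.
    A 4096-dim input (video-only, is_av=False) fails this check and would be
    run through the connectors a second time."""
    return embed_dim == CROSS_ATTENTION_DIM + AUDIO_CROSS_ATTENTION_DIM
```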

Testing
Ran full text-to-video inference in ComfyUI on Mac Studio M4 Max (36 GB) with ltx-video-2b-v0.9.5-distilled.gguf (Q4_K_S).
Video output via VHS_VideoCombine node confirmed correct (motion, coherence, no artefacts from mis-sized embeddings).

Known limitations
transformer_config for LTX-2.3 is hardcoded rather than read from GGUF metadata (the metadata does not contain it). If Lightricks releases a new architecture variant, this will need updating.
The dequantisation fallback for non-float types depends on ComfyUI-GGUF being present alongside this plugin.

user1000 and others added 2 commits March 10, 2026 23:06
- Load video/audio embeddings connector weights directly from GGUF
- Auto-find proj_linear.safetensors for text_embedding_projection
- Fix RoPE spacing (exp_2) for connectors with inner_dim != 3840
- Set is_av correctly for AV model to output 6144-dim conditioning
- Add audio_connector_attention_head_dim config for proper 2048-dim audio connector
- Replace all print()/DEBUG statements with proper logger calls
- Move stdlib imports (glob, importlib.util, logging, os) to module level
- Move folder_paths import to module level in text_embeddings_connectors
- Add logger = logging.getLogger(__name__) to text_embeddings_connectors
- Fix comment typo in gemma_encoder.py (bytesed → bytes)
- embeddings_connector.py had no debug statements (no changes needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>