
llama: prefix MTP assistant tensors with 'mtp.' on load allowing use of -ot 'mtp..*=CUDA0' flag #7

Open
sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/mtp-assistant-tensor-prefix

Conversation


sujitvasanth commented May 11, 2026

Overview

When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file, its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share identical names with the target model's tensors. This makes it impossible to uniquely target MTP assistant tensors via -ot rules for GPU placement.

Fix: after loading the assistant from file, rename all tensors not already prefixed with 'mtp.' to 'mtp.<original_name>'. This is done purely in-memory on the tensors_by_name vector and the ggml_tensor name field — the GGUF file and published arch names are unchanged.
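
A minimal sketch of that rename pass (the helper name mtp_prefix_tensors and the exact type of tensors_by_name are assumptions for illustration; the fork's real code may differ):

    // Sketch only: rename every assistant tensor that is not already
    // prefixed, updating both the lookup vector and the ggml_tensor name
    // field so that -ot pattern matching sees the new name.
    #include "ggml.h"

    #include <string>
    #include <utility>
    #include <vector>

    static void mtp_prefix_tensors(std::vector<std::pair<std::string, ggml_tensor *>> & tensors_by_name) {
        for (auto & [name, tensor] : tensors_by_name) {
            if (name.rfind("mtp.", 0) == 0) {
                continue; // already prefixed, leave as-is
            }
            name = "mtp." + name;                 // update the lookup key
            ggml_set_name(tensor, name.c_str()); // keep ggml's name field in sync
        }
    }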

After this change, all MTP assistant tensors are addressable as mtp.blk.N.*, mtp.token_embd.weight, mtp.output_norm.weight etc., and can be pinned with:

  -ot 'mtp\..*=CUDA0'

This gives a speedup on multi-GPU systems: splitting the MTP head across devices slows down inference, so on dual-GPU setups it matters that the head stays on a single device.
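
For illustration, a dual-GPU run might look like the line below; -m, -ngl and -ot are standard llama.cpp flags, the model filename is a placeholder, and the fork-specific flag that loads the MTP assistant GGUF is omitted here:

  llama-cli -m gemma-4.gguf -ngl 99 -ot 'mtp\..*=CUDA0'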

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Co-written with Claude; I manually edited and reviewed the code, then compiled and tested it on Ubuntu 20.04, where it gives a further speedup.

sujitvasanth changed the title llama: prefix MTP assistant tensors with 'mtp.' on load → llama: prefix MTP assistant tensors with 'mtp.' on load allowing use of -ot 'mtp..*=CUDA0' flag on May 11, 2026