llama: prefix MTP assistant tensors with 'mtp.' on load allowing use of -ot 'mtp..*=CUDA0' flag #7
Open
sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from
Conversation
Overview
When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file, its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share identical names with the target model's tensors. This makes it impossible to uniquely target MTP assistant tensors via -ot rules for GPU placement.
Fix: after loading the assistant from file, rename all tensors not already prefixed with 'mtp.' to 'mtp.<original_name>'. This is done purely in-memory on the tensors_by_name vector and the ggml_tensor name field — the GGUF file and published arch names are unchanged.
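For reference, a minimal sketch of what such a rename pass could look like, assuming `aux` is the assistant model populated by llama_model_load_mtp_from_file, `tensors_by_name` is its std::vector<std::pair<std::string, ggml_tensor *>> (as in llama.cpp internals), and `mtp_prefix_tensor_names` is a hypothetical helper name; the actual patch may differ:

```cpp
#include <string>
// Sketch only: namespaces the assistant's tensors in memory after load.
// Assumes llama_model exposes tensors_by_name as in llama.cpp internals.
static void mtp_prefix_tensor_names(llama_model & aux) {
    for (auto & entry : aux.tensors_by_name) {
        std::string & name = entry.first;
        if (name.rfind("mtp.", 0) == 0) {
            continue; // already namespaced, leave untouched
        }
        name = "mtp." + name;
        // Keep the ggml_tensor's own name field in sync so that -ot
        // buffer-type override regexes see the prefixed name. Note that
        // ggml truncates tensor names to GGML_MAX_NAME bytes, so very
        // long names could clip after prefixing.
        ggml_set_name(entry.second, name.c_str());
    }
}
```

Because only the in-memory name strings change, no GGUF re-export is needed and the rename is invisible to anything that reads the file on disk.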
After this change, all MTP assistant tensors are addressable as mtp.blk.N.*, mtp.token_embd.weight, mtp.output_norm.weight etc., and can be pinned with:
-ot 'mtp\..*=CUDA0'
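For example (a hypothetical invocation; -ot / --override-tensor is llama.cpp's existing tensor buffer-type override flag, and whatever flag this branch uses to load the assistant GGUF is omitted here):

```
# Pin every MTP assistant tensor to the first CUDA device while the
# target model's tensors keep their default placement/split.
llama-cli -m target-model.gguf -ot 'mtp\..*=CUDA0' -p "Hello"
```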
This leads to speedups on multi-GPU systems: splitting the MTP head across devices slows down inference, so on dual-GPU systems it is important to be able to pin it to a single GPU.
Requirements