fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF#135
Open
Jordan-HS wants to merge 1 commit into TheTom/llama-cpp-turboquant from fix/qwen35-9b-gguf-loading
Conversation
Six fixes to load the Ollama-distributed Qwen3.5:9B Q4_K_M GGUF (sha256:dec52a44...). The model uses a hybrid SSM+attention architecture with per-layer n_head_kv variation, and bundles vision/MTP tensors that are not used for text inference.

Changes:

1. llama-model-loader.cpp — get_key_or_arr: respect required=false. When the array length mismatched n, the function previously always threw. It now returns false when required=false, matching the behavior of every other required-flag check in the function.

2. llama-model-loader.cpp — done_getting_tensors: warn on unclaimed tensors. The Ollama GGUF bundles vision encoder (v.blk.*) and MTP decoder (mtp.*) tensors alongside the text model (blk.*). Loading text-only leaves these unclaimed, so the hard error when n_created < n_tensors is now a warning.

3. llama-model.cpp — QWEN35/QWEN35MOE rope_sections: accept 3 or 4 elements. The original code required exactly 4 rope sections; the Ollama GGUF has 3 (the 4th vision section is absent). Uses the fixed get_key_or_arr(required=false) to try 4 first, falling back to 3.

4. llama-model.cpp — QWEN35/QWEN35MOE load_tensors: per-layer KV dimensions. n_head_kv is a per-layer array (0 for recurrent layers, 4 for attention layers). The outer-scope n_embd_k_gqa used layer 0's value (0), producing wrong tensor shapes. Fixed to use hparams.n_embd_k_gqa(i) and hparams.n_embd_v_gqa(i) per layer.

5. llama-model.cpp — QWEN35/QWEN35MOE ssm_dt tensor name: remove bias suffix. The GGUF key is blk.N.ssm_dt, not blk.N.ssm_dt.bias.

6. models/qwen35.cpp — build_layer_attn: per-layer n_head_kv for reshape. The base-class n_head_kv member reflects hparams.n_head_kv(), which returns layer 0's value (0, recurrent). Fixed to call model.hparams.n_head_kv(il) for the correct per-layer head count before reshaping Kcur and Vcur.

Tested: Qwen3.5:9B Q4_K_M, 64K context, q8_0-K/turbo4-V KV cache, RTX 3080 (sm_86, 10 GB). 91 tok/s decode, 2440 tok/s prompt. VRAM ~9749 MiB at 64K context, ~6227 MiB idle after load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR: fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF
Target: TheTom/llama-cpp-turboquant ← fix/qwen35-9b-gguf-loading

What This PR Addresses
The Ollama-distributed Qwen3.5:9B Q4_K_M GGUF fails to load in the current
TurboQuant fork with multiple errors. This PR fixes all of them so the model
runs end-to-end with KV cache compression at 64K context.
Tested on RTX 3080 (sm_86, 10 GB VRAM) with `-ctk q8_0 -ctv turbo4 -c 65536`.

The Problem
Qwen3.5:9B uses a hybrid SSM+attention architecture (full attention every 4th layer) and ships an Ollama GGUF that bundles text, vision encoder (`v.blk.*`), and MTP decoder (`mtp.*`) tensors in one file. Six separate issues prevented it from loading:
1. `get_key_or_arr` ignores `required=false` on length mismatch — it throws unconditionally instead of returning false, making fallback patterns impossible.
2. `done_getting_tensors` hard-errors on unclaimed tensors — the bundled vision/MTP tensors are not loaded for text-only inference, so `n_created` will always be less than `n_tensors`. Currently a fatal error.
3. `rope_sections` requires exactly 4 elements — the Ollama GGUF has 3 (the 4th vision section is absent). The hard-coded `n=4` throws at load time.
4. `load_tensors` uses outer-scope `n_embd_k_gqa` — this is computed from `n_head_kv` at outer scope, which resolves to layer 0's value (0, a recurrent layer). Attention layers have `n_head_kv=4`, so the tensor shape is computed as `(256, 0)` instead of `(256, 4)`.
5. `ssm_dt` tensor name has the wrong suffix — the code requests `blk.N.ssm_dt.bias` but the GGUF key is `blk.N.ssm_dt`.
6. `build_layer_attn` uses the base-class `n_head_kv` — the same per-layer issue as fix 4, but in `qwen35.cpp`: the reshape uses `n_head_kv` from the base class, which is 0 for this model, causing an assertion failure.
The Solution
src/llama-model-loader.cpp

Fix 1 — `get_key_or_arr`: respect `required=false` for length mismatch
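A minimal sketch of the change, assuming the fork keeps upstream llama.cpp's `get_key_or_arr` structure (surrounding code may differ):

```cpp
// llama-model-loader.cpp (sketch) -- inside get_key_or_arr, at the
// array-length check:
if (n != arr_info.length) {
    if (required) {
        throw std::runtime_error(format("key %s has wrong array length; expected %u, got %u",
                key.c_str(), n, (uint32_t) arr_info.length));
    }
    // required == false: report "not usable" instead of aborting the load,
    // so callers can retry with a different expected length (see Fix 3)
    return false;
}
```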
Fix 2 — `done_getting_tensors`: warn instead of error for unclaimed tensors
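Sketch of the relaxed check, assuming the upstream `done_getting_tensors` layout; `LLAMA_LOG_WARN` and `format` are the upstream logging helpers:

```cpp
// llama-model-loader.cpp (sketch)
void llama_model_loader::done_getting_tensors() const {
    if (n_created < n_tensors) {
        // bundled multimodal tensors (v.blk.*, mtp.*) are legitimately
        // unclaimed in a text-only load -- warn instead of aborting
        LLAMA_LOG_WARN("%s: %d tensors were not claimed by the model (bundled vision/MTP?)\n",
                __func__, n_tensors - n_created);
        return;
    }
    if (n_created != n_tensors) {
        // claiming more tensors than the file contains is still a hard error
        throw std::runtime_error(format("%s: wrong number of tensors; expected %d, got %d",
                __func__, n_tensors, n_created));
    }
}
```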
src/llama-model.cpp

Fix 3 — `rope_sections`: accept 3 or 4 elements (QWEN35 and QWEN35MOE)
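Sketch of the fallback, relying on Fix 1's `required=false` behavior. `LLM_KV_ROPE_DIMENSION_SECTIONS` and `hparams.rope_sections` follow upstream naming; the fork's hparams loading may differ in detail:

```cpp
// llama-model.cpp (sketch) -- in the QWEN35/QWEN35MOE hparams branch
std::fill(hparams.rope_sections.begin(), hparams.rope_sections.end(), 0);
// try the full 4-section form first (text + vision) ...
if (!ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, /*required=*/false)) {
    // ... then fall back to 3 sections; the Ollama GGUF omits the vision section
    ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 3, /*required=*/true);
}
```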
Fix 4 — `load_tensors`: per-layer KV dimensions (QWEN35 and QWEN35MOE)
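Sketch of the per-layer shape computation; `n_embd_k_gqa(i)` and `n_embd_v_gqa(i)` are the upstream per-layer accessors, and the surrounding tensor-creation code is illustrative:

```cpp
// llama-model.cpp (sketch) -- inside the per-layer loop of load_tensors
for (int i = 0; i < n_layer; ++i) {
    auto & layer = layers[i];

    // per-layer values: 0 on recurrent (SSM) layers, 4 on attention layers;
    // the old outer-scope value froze layer 0's 0 into every layer's shape
    const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(i);
    const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa(i);

    if (hparams.n_head_kv(i) > 0) { // attention layer
        layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), {n_embd, n_embd_k_gqa}, 0);
        layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), {n_embd, n_embd_v_gqa}, 0);
    }
    // ... recurrent-layer tensors elided ...
}
```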
Fix 5 — `ssm_dt` tensor name (QWEN35 and QWEN35MOE)
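Sketch of the rename, assuming the fork uses upstream's `tn()` tensor-name helper; dropping the `"bias"` suffix yields the bare `blk.N.ssm_dt` key (the `{d_inner}` shape here is an assumption for illustration):

```cpp
// llama-model.cpp (sketch)
// before: resolved to "blk.%d.ssm_dt.bias", which the Ollama GGUF does not contain
// layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, "bias", i), {d_inner}, 0);

// after: resolves to "blk.%d.ssm_dt", matching the actual GGUF key
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, i), {d_inner}, 0);
```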
src/models/qwen35.cpp

Fix 6 — `build_layer_attn`: per-layer `n_head_kv` for Kcur/Vcur reshape
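Sketch of the reshape fix in the fork's graph builder; `ggml_reshape_3d` and the per-layer `hparams.n_head_kv(il)` accessor are upstream, while the surrounding variable names are assumptions:

```cpp
// models/qwen35.cpp (sketch) -- in build_layer_attn, before the reshapes
// the base-class n_head_kv mirrors hparams.n_head_kv(), i.e. layer 0's
// value (0, a recurrent layer); query the per-layer count instead
const int64_t n_head_kv = model.hparams.n_head_kv(il);

Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head_k, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head_v, n_head_kv, n_tokens);
```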
Validation

Model: Qwen3.5:9B Q4_K_M (sha256:dec52a44...)
GPU: RTX 3080 (sm_86), 10 GB VRAM
Command:
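An invocation consistent with the flags quoted above (the binary name, model path, and `-ngl 99` are assumptions, not taken from the PR):

```
./llama-server -m qwen3.5-9b-q4_k_m.gguf -c 65536 -ctk q8_0 -ctv turbo4 -ngl 99
```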
Server responds {"status":"ok"} and handles OpenAI-compatible chat completions correctly.

Notes
Fixes 1–2 (`get_key_or_arr` + `done_getting_tensors`) are general loader fixes that benefit any multimodal GGUF loaded for text-only inference, not just Qwen3.5.

The remaining fixes are specific to the hybrid SSM/attention architecture, where head counts vary by layer.

`-ctv q8_0` is NOT safe on sm_86 (RTX 30xx) — it crashes at fattn.cu:348 during decode. Use `turbo4` or `turbo3` for the V cache on this GPU family.

Co-authored with Claude Sonnet 4.6