
fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF #135

Open
Jordan-HS wants to merge 1 commit into
TheTom:feature/turboquant-kv-cache from
Jordan-HS:fix/qwen35-9b-gguf-loading

Conversation

@Jordan-HS

PR: fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF

Target: TheTom/llama-cpp-turboquant (feature/turboquant-kv-cache) ← fix/qwen35-9b-gguf-loading


What This PR Addresses

The Ollama-distributed Qwen3.5:9B Q4_K_M GGUF fails to load in the current
TurboQuant fork with multiple errors. This PR fixes all of them so the model
runs end-to-end with KV cache compression at 64K context.

Tested on RTX 3080 (sm_86, 10 GB VRAM) with -ctk q8_0 -ctv turbo4 -c 65536:

  • 91 tok/s decode, 2440 tok/s prompt
  • VRAM: ~9749 MiB peak, ~6227 MiB after initial load

The Problem

Qwen3.5:9B uses a hybrid SSM+attention architecture (full attention every 4th
layer) and ships an Ollama GGUF that bundles text, vision encoder (v.blk.*),
and MTP decoder (mtp.*) tensors in one file. Six separate issues
prevented it from loading:

  1. get_key_or_arr ignores required=false on length mismatch — throws
    unconditionally instead of returning false, making fallback patterns
    impossible.

  2. done_getting_tensors hard-errors on unclaimed tensors — the bundled
    vision/MTP tensors are not loaded for text-only inference, so n_created
    will always be less than n_tensors. Currently a fatal error.

  3. rope_sections requires exactly 4 elements — the Ollama GGUF has 3
    (the 4th vision section is absent). Hard-coded n=4 throws at load time.

  4. load_tensors uses outer-scope n_embd_k_gqa — this is computed from
    n_head_kv at outer scope, which resolves to layer 0's value (0, a
    recurrent layer). Attention layers have n_head_kv=4, so the tensor
    shape is computed as (256, 0) instead of (256, 4).

  5. ssm_dt tensor name has wrong suffix — code requests
    blk.N.ssm_dt.bias but the GGUF key is blk.N.ssm_dt.

  6. build_layer_attn uses base-class n_head_kv — the same per-layer
    issue as fix 4, but in qwen35.cpp: the reshape uses n_head_kv from
    the base class, which is 0 for this model, causing an assertion
    failure (see the sketch after this list).


The Solution

src/llama-model-loader.cpp

Fix 1 — get_key_or_arr: respect required=false for length mismatch

// Before:
if (n != arr_info.length) {
    throw std::runtime_error(format("key %s has wrong array length; ..."));
}

// After:
if (n != arr_info.length) {
    if (required) {
        throw std::runtime_error(format("key %s has wrong array length; ..."));
    }
    return false;
}
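
This matches the behavior of every other required-flag check in the
function, and Fix 3 below relies on the soft failure to implement its
try-4-then-3 fallback.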

Fix 2 — done_getting_tensors: warn instead of error for unclaimed tensors

// Before:
if (n_created != n_tensors) {
    throw std::runtime_error(...);
}

// After:
if (n_created > n_tensors) {
    throw std::runtime_error(...);
}
if (n_created < n_tensors) {
    LLAMA_LOG_WARN("... multimodal vision/MTP tensors skipped for text-only inference\n", ...);
}
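
The n_created > n_tensors case stays fatal: creating more tensors than
the file declares still indicates a real bug (e.g. a duplicated tensor),
whereas a shortfall now only means some tensors were intentionally skipped.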

src/llama-model.cpp

Fix 3 — rope_sections: accept 3 or 4 elements (QWEN35 and QWEN35MOE)

// Before:
ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, true);

// After:
if (!ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, false)) {
    ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 3, true);
}
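
This pattern only works on top of Fix 1: the 4-element attempt now fails
softly (returns false) against the 3-element array instead of throwing,
so the 3-element read gets a chance to run.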

Fix 4 — load_tensors: per-layer KV dimensions (QWEN35 and QWEN35MOE)

// Before:
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, n_embd_k_gqa, n_embd_v_gqa, 0);

// After:
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, hparams.n_embd_k_gqa(i), hparams.n_embd_v_gqa(i), 0);
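
For context, the per-layer accessors resolve through the layer index
roughly like this — a paraphrased sketch of the upstream helpers, not
the verbatim source:

// approximate shape of the hparams accessors used in the fix above
uint32_t llama_hparams::n_embd_k_gqa(uint32_t il) const {
    return n_embd_head_k * n_head_kv(il); // 0 on recurrent layers
}
uint32_t llama_hparams::n_embd_v_gqa(uint32_t il) const {
    return n_embd_head_v * n_head_kv(il);
}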

Fix 5 — ssm_dt tensor name (QWEN35 and QWEN35MOE)

// Before:
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, "bias", i), { hparams.ssm_dt_rank }, 0);

// After:
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, i), { hparams.ssm_dt_rank }, 0);

src/models/qwen35.cpp

Fix 6 — build_layer_attn: per-layer n_head_kv for Kcur/Vcur reshape

// Before:
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
// ...
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

// After:
const int64_t n_head_kv_il = model.hparams.n_head_kv(il);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv_il, n_tokens);
// ...
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv_il, n_tokens);
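
ggml_reshape_3d asserts that ne0*ne1*ne2 equals the source tensor's
element count; with the base-class head count of 0 the product is 0,
which is the assertion failure described in issue 6 above.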

Validation

Model: Qwen3.5:9B Q4_K_M (sha256:dec52a44...)
GPU: RTX 3080 (sm_86), 10 GB VRAM
Command:

llama-server \
  -m <qwen3.5-9b-q4_k_m.gguf> \
  -ngl 99 -fa 1 \
  -ctk q8_0 -ctv turbo4 \
  -c 65536 \
  --port 8080

Metric                   Value
-----------------------  -----------------
Decode speed             91.4 ± 0.6 tok/s
Prompt speed             2,440 ± 509 tok/s
VRAM at 64K context      ~9,749 MiB
VRAM after load (idle)   ~6,227 MiB
Model weights            4,717 MiB

Server responds {"status":"ok"} and handles OpenAI-compatible chat completions correctly.


Notes

  • Fixes 1 and 2 (get_key_or_arr + done_getting_tensors) are general loader
    fixes that benefit any multimodal GGUF loaded for text-only inference, not
    just Qwen3.5.
  • The per-layer n_head_kv issue (fixes 4 and 6) will affect any future hybrid
    SSM/attention architecture where head counts vary by layer.
  • -ctv q8_0 is NOT safe on sm_86 (RTX 30xx) — it crashes at fattn.cu:348
    during decode. Use turbo4 or turbo3 for the V cache on this GPU family.

Co-authored with Claude Sonnet 4.6

Six fixes to load the Ollama-distributed Qwen3.5:9B Q4_K_M GGUF
(sha256:dec52a44...). The model uses a hybrid SSM+attention architecture
with per-layer n_head_kv variation, and bundles vision/MTP tensors that
are not used for text inference.

Changes:

1. llama-model-loader.cpp — get_key_or_arr: respect required=false
   When the array length mismatches n, previously always threw.
   Now returns false when required=false, matching the behavior of
   every other required-flag check in the function.

2. llama-model-loader.cpp — done_getting_tensors: warn on unclaimed tensors
   The Ollama GGUF bundles vision encoder (v.blk.*) and MTP decoder
   (mtp.*) tensors alongside the text model (blk.*). Loading text-only
   leaves these unclaimed. Changed hard error to warning when
   n_created < n_tensors.

3. llama-model.cpp — QWEN35/QWEN35MOE rope_sections: accept 3 or 4 elements
   The original code required exactly 4 rope sections. The Ollama GGUF
   has 3 (the 4th vision section is absent). Uses the fixed
   get_key_or_arr(required=false) to try 4 first, then falls back to 3.

4. llama-model.cpp — QWEN35/QWEN35MOE load_tensors: per-layer KV dimensions
   n_head_kv is a per-layer array (0 for recurrent layers, 4 for
   attention layers). The outer-scope n_embd_k_gqa used layer 0's value
   (0), causing wrong tensor shapes. Fixed to use hparams.n_embd_k_gqa(i)
   and hparams.n_embd_v_gqa(i) per-layer.

5. llama-model.cpp — QWEN35/QWEN35MOE ssm_dt tensor name: remove bias suffix
   The GGUF key is blk.N.ssm_dt, not blk.N.ssm_dt.bias.

6. models/qwen35.cpp — build_layer_attn: per-layer n_head_kv for reshape
   The base-class n_head_kv member reflects hparams.n_head_kv(), which
   returns layer 0's value (0, a recurrent layer). Fixed to call
   model.hparams.n_head_kv(il) for the correct per-layer head count
   before reshaping Kcur and Vcur.

Tested: Qwen3.5:9B Q4_K_M, 64K context, q8_0-K/turbo4-V KV cache,
RTX 3080 (sm_86, 10 GB). 91 tok/s decode, 2440 tok/s prompt. VRAM ~9749
MiB at 64K context, ~6227 MiB at idle-after-load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added the model label May 8, 2026