
fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF #135

Open
Jordan-HS wants to merge 1 commit into
TheTom:feature/turboquant-kv-cache from
Jordan-HS:fix/qwen35-9b-gguf-loading

Conversation

@Jordan-HS

PR: fix(qwen35): support Qwen3.5:9B loading from Ollama GGUF

Target: TheTom/llama-cpp-turboquant (feature/turboquant-kv-cache) ← fix/qwen35-9b-gguf-loading


What This PR Addresses

The Ollama-distributed Qwen3.5:9B Q4_K_M GGUF fails to load in the current
TurboQuant fork with multiple errors. This PR fixes all of them so the model
runs end-to-end with KV cache compression at 64K context.

Tested on RTX 3080 (sm_86, 10 GB VRAM) with -ctk q8_0 -ctv turbo4 -c 65536:

  • 91 tok/s decode, 2440 tok/s prompt
  • VRAM: ~9749 MiB peak, ~6227 MiB after initial load

The Problem

Qwen3.5:9B uses a hybrid SSM+attention architecture (full attention every 4th
layer) and ships an Ollama GGUF that bundles text, vision encoder (v.blk.*),
and MTP decoder (mtp.*) tensors in one file. Six separate issues
prevented it from loading:

  1. get_key_or_arr ignores required=false on length mismatch — throws
    unconditionally instead of returning false, making fallback patterns
    impossible.

  2. done_getting_tensors hard-errors on unclaimed tensors — the bundled
    vision/MTP tensors are not loaded for text-only inference, so n_created
    will always be less than n_tensors. Currently a fatal error.

  3. rope_sections requires exactly 4 elements — the Ollama GGUF has 3
    (the 4th vision section is absent). Hard-coded n=4 throws at load time.

  4. load_tensors uses outer-scope n_embd_k_gqa — this is computed from
    n_head_kv at outer scope, which resolves to layer 0's value (0, a
    recurrent layer). Attention layers have n_head_kv=4, so the tensor
    shape is computed as (256, 0) instead of (256, 4).

  5. ssm_dt tensor name has wrong suffix — code requests
    blk.N.ssm_dt.bias but the GGUF key is blk.N.ssm_dt.

  6. build_layer_attn uses base-class n_head_kv — the same per-layer
    issue as fix 4, but in qwen35.cpp: the reshape uses n_head_kv from
    the base class, which is 0 for this model, causing an assertion
    failure (see the sketch after this list).


The Solution

src/llama-model-loader.cpp

Fix 1 — get_key_or_arr: respect required=false for length mismatch

// Before:
if (n != arr_info.length) {
    throw std::runtime_error(format("key %s has wrong array length; ..."));
}

// After:
if (n != arr_info.length) {
    if (required) {
        throw std::runtime_error(format("key %s has wrong array length; ..."));
    }
    return false;
}
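
This matches the behavior of every other required-flag check in the
function, and Fix 3 below relies on the soft failure to implement its
try-4-then-3 fallback.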

Fix 2 — done_getting_tensors: warn instead of error for unclaimed tensors

// Before:
if (n_created != n_tensors) {
    throw std::runtime_error(...);
}

// After:
if (n_created > n_tensors) {
    throw std::runtime_error(...);
}
if (n_created < n_tensors) {
    LLAMA_LOG_WARN("... multimodal vision/MTP tensors skipped for text-only inference\n", ...);
}
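
The n_created > n_tensors case stays fatal: creating more tensors than
the file declares still indicates a real bug (e.g. a duplicated tensor),
whereas a shortfall now only means some tensors were intentionally skipped.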

src/llama-model.cpp

Fix 3 — rope_sections: accept 3 or 4 elements (QWEN35 and QWEN35MOE)

// Before:
ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, true);

// After:
if (!ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 4, false)) {
    ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_SECTIONS, hparams.rope_sections, 3, true);
}
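
This pattern only works on top of Fix 1: the 4-element attempt now fails
softly (returns false) against the 3-element array instead of throwing,
so the 3-element read gets a chance to run.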

Fix 4 — load_tensors: per-layer KV dimensions (QWEN35 and QWEN35MOE)

// Before:
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, n_embd_k_gqa, n_embd_v_gqa, 0);

// After:
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, hparams.n_embd_k_gqa(i), hparams.n_embd_v_gqa(i), 0);
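
For context, the per-layer accessors resolve through the layer index
roughly like this — a paraphrased sketch of the upstream helpers, not
the verbatim source:

// approximate shape of the hparams accessors used in the fix above
uint32_t llama_hparams::n_embd_k_gqa(uint32_t il) const {
    return n_embd_head_k * n_head_kv(il); // 0 on recurrent layers
}
uint32_t llama_hparams::n_embd_v_gqa(uint32_t il) const {
    return n_embd_head_v * n_head_kv(il);
}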

Fix 5 — ssm_dt tensor name (QWEN35 and QWEN35MOE)

// Before:
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, "bias", i), { hparams.ssm_dt_rank }, 0);

// After:
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, i), { hparams.ssm_dt_rank }, 0);

src/models/qwen35.cpp

Fix 6 — build_layer_attn: per-layer n_head_kv for Kcur/Vcur reshape

// Before:
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
// ...
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

// After:
const int64_t n_head_kv_il = model.hparams.n_head_kv(il);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv_il, n_tokens);
// ...
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv_il, n_tokens);
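
ggml_reshape_3d asserts that ne0*ne1*ne2 equals the source tensor's
element count; with the base-class head count of 0 the product is 0,
which is the assertion failure described in issue 6 above.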

Validation

Model: Qwen3.5:9B Q4_K_M (sha256:dec52a44...)
GPU: RTX 3080 (sm_86), 10 GB VRAM
Command:

llama-server \
  -m <qwen3.5-9b-q4_k_m.gguf> \
  -ngl 99 -fa 1 \
  -ctk q8_0 -ctv turbo4 \
  -c 65536 \
  --port 8080

Metric                   Value
-----------------------  -----------------
Decode speed             91.4 ± 0.6 tok/s
Prompt speed             2,440 ± 509 tok/s
VRAM at 64K context      ~9,749 MiB
VRAM after load (idle)   ~6,227 MiB
Model weights            4,717 MiB

Server responds {"status":"ok"} and handles OpenAI-compatible chat completions correctly.


Notes

  • Fixes 1 and 2 (get_key_or_arr + done_getting_tensors) are general loader
    fixes that benefit any multimodal GGUF loaded for text-only inference, not
    just Qwen3.5.
  • The per-layer n_head_kv issue (fixes 4 and 6) will affect any future hybrid
    SSM/attention architecture where head counts vary by layer.
  • -ctv q8_0 is NOT safe on sm_86 (RTX 30xx) — it crashes at fattn.cu:348
    during decode. Use turbo4 or turbo3 for the V cache on this GPU family.

Co-authored with Claude Sonnet 4.6

Six fixes to load the Ollama-distributed Qwen3.5:9B Q4_K_M GGUF
(sha256:dec52a44...). The model uses a hybrid SSM+attention architecture
with per-layer n_head_kv variation, and bundles vision/MTP tensors that
are not used for text inference.

Changes:

1. llama-model-loader.cpp — get_key_or_arr: respect required=false
   When the array length mismatches n, previously always threw.
   Now returns false when required=false, matching the behavior of
   every other required-flag check in the function.

2. llama-model-loader.cpp — done_getting_tensors: warn on unclaimed tensors
   The Ollama GGUF bundles vision encoder (v.blk.*) and MTP decoder
   (mtp.*) tensors alongside the text model (blk.*). Loading text-only
   leaves these unclaimed. Changed hard error to warning when
   n_created < n_tensors.

3. llama-model.cpp — QWEN35/QWEN35MOE rope_sections: accept 3 or 4 elements
   The original code required exactly 4 rope sections. The Ollama GGUF
   has 3 (the 4th vision section is absent). Uses the fixed
   get_key_or_arr(required=false) to try 4 first, then falls back to 3.

4. llama-model.cpp — QWEN35/QWEN35MOE load_tensors: per-layer KV dimensions
   n_head_kv is a per-layer array (0 for recurrent layers, 4 for
   attention layers). The outer-scope n_embd_k_gqa used layer 0's value
   (0), causing wrong tensor shapes. Fixed to use hparams.n_embd_k_gqa(i)
   and hparams.n_embd_v_gqa(i) per-layer.

5. llama-model.cpp — QWEN35/QWEN35MOE ssm_dt tensor name: remove bias suffix
   The GGUF key is blk.N.ssm_dt, not blk.N.ssm_dt.bias.

6. models/qwen35.cpp — build_layer_attn: per-layer n_head_kv for reshape
   The base-class n_head_kv member reflects hparams.n_head_kv(), which
   returns layer 0's value (0, a recurrent layer). Fixed to call
   model.hparams.n_head_kv(il) for the correct per-layer head count
   before reshaping Kcur and Vcur.

Tested: Qwen3.5:9B Q4_K_M, 64K context, q8_0-K/turbo4-V KV cache,
RTX 3080 (sm_86, 10 GB). 91 tok/s decode, 2440 tok/s prompt. VRAM ~9749
MiB at 64K context, ~6227 MiB at idle-after-load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added the model label May 8, 2026