
Change k_cache and k_raw to use ggml_view_3d to fix np > 1 launch abort#10

Open
kstjohn1 wants to merge 1 commit into antirez:main from kstjohn1:main

Conversation

@kstjohn1

When using np > 1 the app crashes on launch:

/Users/admin/Downloads/llama.cpp-deepseek-v4-flash/ggml/src/ggml.c:3643: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0   libggml-base.0.10.0.dylib           0x000000010509d3d0 ggml_print_backtrace + 276
1   libggml-base.0.10.0.dylib           0x00000001051080bc ggml_abort + 156
2   libggml-base.0.10.0.dylib           0x0000000105108b50 ggml_reshape_4d.cold.1 + 0
3   libggml-base.0.10.0.dylib           0x00000001050a45c4 ggml_reshape_3d + 312
4   libllama.0.0.8927.dylib             0x000000010581d440 _ZN19llm_build_deepseek4C2ERK11llama_modelRK16llm_graph_params + 1748
5   libllama.0.0.8927.dylib             0x00000001057c82d0 _ZNSt3__111make_uniqueB9nqe210106I19llm_build_deepseek4JRK11llama_modelRK16llm_graph_paramsELi0EEENS_10unique_ptrIT_NS_14default_deleteIS9_EEEEDpOT0_ + 52
6   libllama.0.0.8927.dylib             0x00000001057c69ac _ZNK11llama_model11build_graphERK16llm_graph_params + 1816
7   libllama.0.0.8927.dylib             0x000000010570448c _ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPmi + 776
8   libllama.0.0.8927.dylib             0x0000000105702dd4 _ZN13llama_context13sched_reserveEv + 616
9   libllama.0.0.8927.dylib             0x0000000105701ccc _ZN13llama_contextC2ERK11llama_model20llama_context_params + 4196
10  libllama.0.0.8927.dylib             0x000000010570ba48 llama_init_from_model + 600
11  libllama-common.0.0.8927.dylib      0x0000000105320548 _ZL29common_get_device_memory_dataPKcPK18llama_model_paramsPK20llama_context_paramsRNSt3__16vectorIP19ggml_backend_deviceNS7_9allocatorISA_EEEERjSF_SF_14ggml_log_level + 232
12  libllama-common.0.0.8927.dylib      0x000000010531ad88 _ZL22common_params_fit_implPKcP18llama_model_paramsP20llama_context_paramsPfP32llama_model_tensor_buft_overridePmj14ggml_log_level + 180
18  dyld                                0x000000018dcc7da4 start + 6992
Abort trap: 6

Overview

This change updates deepseek4.cpp to use ggml_view_3d rather than ggml_reshape_3d in two places (k_cache and k_raw).

Additional information

What's the difference?

  • ggml_reshape_3d requires the source tensor to contain exactly ne0 * ne1 * ne2 elements, and asserts this.
  • ggml_view_3d does not require the element counts to match. It creates a sub-view into the source tensor starting at a given byte offset, with the specified dimensions and strides; it simply extracts a contiguous slice.

Why this works:

The get_k cache view returns a 4D tensor [n_embd_head_k, 1, n_kv, ns] where ns = n_stream. The view at offset 0 selects the first n_embd_head_k * 1 * n_kv elements (stream 0). Since DeepSeek V4's reservation forces n_seqs=1, only stream 0's data is relevant during reservation — exactly what the view extracts.

Implications:

  • np = 1 (single parallel): the cache has ns = 1, so the 4D tensor [n_embd_head_k, 1, n_kv, 1] has the same memory layout as the 3D tensor. ggml_view_3d at offset 0 covers all of the data, which is identical in behavior to the current ggml_reshape_3d.
  • np = 2 (double parallel): the cache has ns = 2. ggml_view_3d at offset 0 extracts stream 0's n_embd_head_k * 1 * n_kv elements. The assertion no longer fires because a view does not require the element counts to match; it just slices.
  • During inference: the cache's prepare() yields ns = 1 per sequence (single-stream slots), so this behaves identically to the np = 1 case.
  • Memory allocation: like reshape, the view is a no-copy operation. Compute buffer allocation is unaffected; reservation already uses n_seqs = 1, so the allocated buffers match the single-stream view.

Potential concern: the k_cache->nb[1] and k_cache->nb[2] strides must match the 4D view's strides. For DeepSeek V4 with n_head_kv = 1, k_cache->nb[1] = k_cache->nb[0] (stride for the head dim), and k_cache->nb[2] is the stride across n_kv. These are exactly what get_k returns, so the view correctly steps through the first stream's data.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, Deepseekv4-flash was used to identify the needed change. The change was made locally and tested; the model works as expected and no longer aborts when np > 1.

@github-actions github-actions Bot added the model label May 13, 2026
@kstjohn1 kstjohn1 changed the title Change k_cache and k_raw to use ggml_view_3d Change k_cache and k_raw to use ggml_view_3d to fix np > 1 launch abort May 13, 2026
