Crash with llama-cli --context-shift #115

@ravi9

Name and Version

./build/ReleaseOV/bin/llama-cli --version
OpenVINO: using device GPU
version: 8508 (9f102a140)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=CPU ./build/ReleaseOV/bin/llama-cli -c 32 -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --context-shift

GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-cli -c 32 -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --context-shift

GGML_OPENVINO_DEVICE=NPU ./build/ReleaseOV/bin/llama-cli -c 32 -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --context-shift

Problem description & steps to reproduce

Running llama-cli with the OpenVINO backend and --context-shift enabled aborts with a fatal error raised at ggml-backend.cpp:809. The crash reproduces on the CPU, GPU, and NPU devices.

Key observations:

  • The failing tensor is cache_k_l0 (view) (view), i.e. a view (slice) of the layer-0 K tensor in the KV cache
  • The failing operation is ROPE
  • The crash triggers only when --context-shift is enabled, which changes how the KV cache is managed
  • The OpenVINO backend buffer (OPENVINO0) does not appear to declare support for ROPE on viewed tensors during context-shift operations
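
To illustrate the kind of check that the error message points to, here is a minimal, self-contained C++ sketch. It is a toy model and not ggml code: `Backend`, `Tensor`, `Op`, and `assign_backend` are hypothetical names introduced only for this example. It assumes that a tensor already allocated in a backend's buffer cannot be moved elsewhere, so an operation on it can only run if that same backend reports support for the operation. The toy backend simply omits ROPE from its supported set; the actual condition under which the OpenVINO backend rejects the operation is not established by this report.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not ggml types.
enum class Op { MulMat, Rope };

struct Backend {
    std::string  name;
    std::set<Op> supported;  // ops this backend reports as supported (simplified)
    bool supports(Op op) const { return supported.count(op) > 0; }
};

struct Tensor {
    std::string     name;
    Op              op;
    const Backend * buffer_owner;  // non-null when the data is pre-allocated
};

// Returns the name of the backend that would run the op. Returns std::nullopt
// when no backend can run it, including the case modeled on this issue: the
// data is pinned to a buffer whose backend does not accept the op.
std::optional<std::string> assign_backend(const Tensor & t,
                                          const std::vector<Backend> & backends) {
    if (t.buffer_owner != nullptr) {
        // Pre-allocated data stays where it is, so only the owner is a candidate.
        if (t.buffer_owner->supports(t.op)) {
            return t.buffer_owner->name;
        }
        return std::nullopt;
    }
    // Unallocated tensors can go to the first backend that accepts the op.
    for (const Backend & b : backends) {
        if (b.supports(t.op)) {
            return b.name;
        }
    }
    return std::nullopt;
}
```

In this model, a ROPE node whose data lives in a buffer owned by a backend without ROPE support yields no assignment even when another backend (for example a CPU fallback) could run the operation, while the same node without a pre-allocated buffer is assigned to the fallback. That mirrors the shape of the reported failure, where the K-cache view is already placed in OPENVINO0.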

Relevant log output

GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-cli -c 64 -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --context-shift
...
...
llama.cpp/ggml/src/ggml-backend.cpp:809: pre-allocated tensor (cache_k_l0 (view) (view)) in a buffer (OPENVINO0) that cannot run the operation (ROPE)
[New LWP 75786]
[New LWP 75760]
[New LWP 75759]
[New LWP 75757]
[New LWP 75756]
[New LWP 75755]
[New LWP 75754]
[New LWP 75753]
[New LWP 75752]
[New LWP 75751]
[New LWP 75745]
[New LWP 75744]
[New LWP 75743]
[New LWP 75742]
[New LWP 75741]
[New LWP 75740]
[New LWP 75739]
[New LWP 75708]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007a0ed6098d71 in __futex_abstimed_wait_common64 (private=721128320, cancel=true, abstime=0x7ffc2afb8bc0, op=137, expected=0, futex_word=0x6376402ba488) at ./nptl/futex-internal.c:57
warning: 57     ./nptl/futex-internal.c: No such file or directory
#0  0x00007a0ed6098d71 in __futex_abstimed_wait_common64 (private=721128320, cancel=true, abstime=0x7ffc2afb8bc0, op=137, expected=0, futex_word=0x6376402ba488) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
#1  __futex_abstimed_wait_common (cancel=true, private=721128320, abstime=0x7ffc2afb8bc0, clockid=32764, expected=0, futex_word=0x6376402ba488) at ./nptl/futex-internal.c:87
87      in ./nptl/futex-internal.c
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x6376402ba488, expected=expected@entry=0, clockid=clockid@entry=1, abstime=abstime@entry=0x7ffc2afb8bc0, private=private@entry=0) at ./nptl/futex-internal.c:139
139     in ./nptl/futex-internal.c
#3  0x00007a0ed609c116 in __pthread_cond_wait_common (abstime=<optimized out>, clockid=<optimized out>, mutex=0x6376402ba438, cond=0x6376402ba460) at ./nptl/pthread_cond_wait.c:503
warning: 503    ./nptl/pthread_cond_wait.c: No such file or directory
#4  ___pthread_cond_clockwait64 (abstime=<optimized out>, clockid=<optimized out>, mutex=0x6376402ba438, cond=0x6376402ba460) at ./nptl/pthread_cond_wait.c:691
691     in ./nptl/pthread_cond_wait.c
#5  ___pthread_cond_clockwait64 (cond=0x6376402ba460, mutex=0x6376402ba438, clockid=<optimized out>, abstime=<optimized out>) at ./nptl/pthread_cond_wait.c:679
679     in ./nptl/pthread_cond_wait.c
#6  0x000063762573de9e in server_response::recv_with_timeout(std::unordered_set<int, std::hash<int>, std::equal_to<int>, std::allocator<int> > const&, int) ()
#7  0x0000637625743da2 in server_response_reader::next(std::function<bool ()> const&) ()
#8  0x00006376256ecc85 in cli_context::generate_completion[abi:cxx11](result_timings&) ()
#9  0x00006376256d1266 in main ()
[Inferior 1 (process 75692) detached]
Aborted (core dumped)

Status

In progress