Fix CUDA context switching, bind thread on CudaStorage drop #1428

EricLBuehler merged 5 commits into master from
Conversation
""" WalkthroughThe updates revise dependency versions for several candle-related crates in the workspace, adjust device creation logic for CUDA devices by simplifying the constructor used, and expand the conditions under which prefix caching is disabled in the engine module to include cases where the prefix cache size is zero. Changes
Sequence Diagram(s)

    sequenceDiagram
        participant User
        participant Engine
        participant PipelineMetadata
        User->>Engine: new(no_prefix_cache, no_kv_cache, prefix_cache_n, pipeline_metadata)
        Engine->>PipelineMetadata: check no_prefix_cache flag
        Engine->>Engine: Set no_prefix_cache to true if:\n- no_prefix_cache is true\n- OR no_kv_cache is true\n- OR pipeline_metadata.no_prefix_cache is true\n- OR prefix_cache_n == 0
        Engine-->>User: Engine instance created
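The flag-combination logic the diagram describes can be sketched as a plain predicate. This is a minimal illustration: the helper name and flat parameter list are hypothetical, not the engine's actual signature.

```rust
/// Illustrative predicate mirroring the conditions above; the name and
/// signature are hypothetical, not the engine's actual API.
fn prefix_cache_disabled(
    no_prefix_cache: bool,
    no_kv_cache: bool,
    pipeline_no_prefix_cache: bool,
    prefix_cache_n: usize,
) -> bool {
    no_prefix_cache || no_kv_cache || pipeline_no_prefix_cache || prefix_cache_n == 0
}

fn main() {
    // The new case this PR covers: a zero-sized prefix cache disables caching.
    assert!(prefix_cache_disabled(false, false, false, 0));
    // With a nonzero cache size and no disabling flags, caching stays enabled.
    assert!(!prefix_cache_disabled(false, false, false, 16));
}
```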
Code Metrics Report

| Language | Files | Lines | Code | Comments | Blanks |
|---|---|---|---|---|---|
| C Header | 3 | 62 | 53 | 0 | 9 |
| CSS | 1 | 473 | 408 | 14 | 51 |
| Dockerfile | 1 | 42 | 23 | 10 | 9 |
| HTML | 1 | 73 | 61 | 4 | 8 |
| JavaScript | 7 | 1248 | 936 | 174 | 138 |
| JSON | 14 | 123 | 122 | 0 | 1 |
| Makefile | 1 | 6 | 5 | 0 | 1 |
| Python | 87 | 4097 | 3457 | 161 | 479 |
| Shell | 1 | 63 | 26 | 18 | 19 |
| Plain Text | 3 | 3723 | 0 | 2413 | 1310 |
| TOML | 21 | 695 | 634 | 10 | 51 |
| YAML | 2 | 21 | 19 | 2 | 0 |
| Jupyter Notebooks | 3 | 0 | 0 | 0 | 0 |
| &#124;- Markdown | 2 | 77 | 32 | 31 | 14 |
| &#124;- Python | 2 | 205 | 178 | 1 | 26 |
| (Total) | | 282 | 210 | 32 | 40 |
| Markdown | 59 | 5086 | 0 | 3880 | 1206 |
| &#124;- BASH | 10 | 111 | 105 | 2 | 4 |
| &#124;- JSON | 2 | 42 | 42 | 0 | 0 |
| &#124;- Python | 7 | 121 | 109 | 0 | 12 |
| &#124;- Rust | 22 | 757 | 634 | 1 | 122 |
| &#124;- TOML | 2 | 75 | 63 | 0 | 12 |
| (Total) | | 6192 | 953 | 3883 | 1356 |
| Rust | 376 | 132361 | 117795 | 2893 | 11673 |
| &#124;- Markdown | 175 | 3002 | 29 | 2662 | 311 |
| (Total) | | 135363 | 117824 | 5555 | 11984 |
| **Total** | 580 | 148073 | 123539 | 9579 | 14955 |
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- mistralrs-core/src/models/llama.rs (3 hunks)
- mistralrs-core/src/utils/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/models/llama.rs (1)
mistralrs-core/src/utils/mod.rs (2)
- set_cuda_context (18-23)
- set_cuda_context (26-26)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Docs
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
🔇 Additional comments (6)
mistralrs-core/src/utils/mod.rs (2)
12-16: LGTM: Clear documentation explaining the necessity. The documentation clearly explains why explicit CUDA context switching is required for multi-device scenarios.
25-26: LGTM: Appropriate no-op implementation for non-CUDA builds.The conditional compilation pattern correctly provides a no-op version when CUDA is not available.
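As a standalone illustration of that pattern (using a stand-in `Device` type, not candle's actual enum), the feature gate keeps the same public symbol available on every build, so callers compile unchanged whether or not CUDA is present:

```rust
// Stand-in device type for illustration only; candle's real `Device` differs.
enum Device {
    Cpu,
    #[allow(dead_code)]
    Cuda(usize),
}

// Real implementation compiled only when the `cuda` feature is enabled.
#[cfg(feature = "cuda")]
fn set_context(dev: &Device) {
    if let Device::Cuda(ordinal) = dev {
        println!("would bind the CUDA primary context for device {ordinal}");
    }
}

// No-op fallback so callers compile unchanged on non-CUDA builds.
#[cfg(not(feature = "cuda"))]
fn set_context(_dev: &Device) {}

fn main() {
    set_context(&Device::Cpu); // safe on any build
}
```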
mistralrs-core/src/models/llama.rs (4)
29-29: LGTM: Appropriate import of the new utility function. The import correctly brings in the `set_cuda_context` function for use in the forward pass.
524-527: Excellent documentation and correct context switching logic.The comments clearly explain why CUDA context switching is necessary, and the implementation correctly sets the context after tensor mapping to ensure operations execute on the correct device.
540-542: LGTM: Proper context restoration for final computations.The context is correctly switched back to the model's primary device before the final normalization and output head computations.
527-527: Details
❌ Incorrect review comment
Verify error handling alignment with utility function changes.
If the `set_cuda_context` function is modified to return a `Result` (as suggested in the utils review), these call sites will need to be updated to handle potential errors appropriately. Also applies to: 542-542.
🏁 Script executed:

    #!/bin/bash
    # Check if there are other usages of set_cuda_context that might need updating
    rg -A 3 -B 1 "set_cuda_context" --type rust

Length of output: 1842
No updates needed for `set_cuda_context` calls. The current definition of `set_cuda_context` in `mistralrs-core/src/utils/mod.rs` returns `()` and internally unwraps any errors, so call sites like those in `models/llama.rs` do not need to handle a `Result`. No changes to error handling are required here. Likely an incorrect or invalid review comment.
| #[cfg(feature = "cuda")] | ||
| pub fn set_cuda_context(dev: &candle_core::Device) { | ||
| if let candle_core::Device::Cuda(d) = dev { | ||
| unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) } | ||
| .unwrap(); | ||
| } | ||
| } |
🛠️ Refactor suggestion
Consider more robust error handling instead of panicking.
The current implementation uses unwrap() which will panic if setting the CUDA context fails. In a production environment, this could be problematic.
Consider returning a Result to allow callers to handle errors gracefully:
    -pub fn set_cuda_context(dev: &candle_core::Device) {
    +pub fn set_cuda_context(dev: &candle_core::Device) -> Result<()> {
         if let candle_core::Device::Cuda(d) = dev {
    -        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
    -            .unwrap();
    +        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
    +            .map_err(|e| candle_core::Error::Cuda(Box::new(e)))?;
         }
    +    Ok(())
     }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    #[cfg(feature = "cuda")]
    pub fn set_cuda_context(dev: &candle_core::Device) -> Result<()> {
        if let candle_core::Device::Cuda(d) = dev {
            unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
                .map_err(|e| candle_core::Error::Cuda(Box::new(e)))?;
        }
        Ok(())
    }
🤖 Prompt for AI Agents
In mistralrs-core/src/utils/mod.rs around lines 17 to 23, the function
set_cuda_context uses unwrap() which causes a panic if setting the CUDA context
fails. To fix this, change the function to return a Result type instead of
panicking. Propagate the error from set_current by returning it to the caller,
allowing them to handle the failure gracefully rather than crashing the program.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- mistralrs-core/src/pipeline/inputs_processor.rs (2 hunks)
- mistralrs-core/src/utils/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- mistralrs-core/src/utils/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Clippy
- GitHub Check: Docs
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
🔇 Additional comments (3)
mistralrs-core/src/pipeline/inputs_processor.rs (3)
258-259: LGTM: Explicit type annotations improve code clarity. The explicit `HashMap<DeviceLocation, Tensor>` type annotations enhance code readability and make the intent clearer.
429-430: LGTM: Consistent type annotations. The explicit type annotations match the pattern established in `make_prompt_chunk` and improve code consistency.
434-435: Consistent with `make_prompt_chunk` changes. The hash map population is disabled here as well, maintaining consistency with the `make_prompt_chunk` function. The same verification concerns about downstream compatibility apply as mentioned in the previous comment.
@sempervictus this fixed the error behind #1406, #1401, #1399, #1394 for me. Can you please test and confirm it fixed it for you too?
I just ran a build in the Docker container and still get:

2025-06-04T18:57:39.060479Z INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T18:57:39.060503Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T18:57:39.916032Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T18:57:39.937989Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T18:57:39.938233Z INFO mistralrs_core: Beginning dummy run.
2025-06-04T18:57:39.941248Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x556f00db5342 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h3cba09f3134c688d
1: 0x556eff47c883 - core::fmt::write::h23019460b0b70a11
2: 0x556f00db452f - std::io::Write::write_fmt::h73da8773e52bf4ad
3: 0x556f00db51a3 - std::sys::backtrace::BacktraceLock::print::h58c794ef15c6671f
4: 0x556f00db4ad5 - std::panicking::default_hook::h94aabe0891249549
5: 0x556f00db41e7 - std::panicking::rust_panic_with_hook::hb81599440b437817
6: 0x556f00df6618 - std::panicking::begin_panic_handler::{{closure}}::h7a731a74ab3fd8e5
7: 0x556f00df6579 - std::sys::backtrace::__rust_end_short_backtrace::h1f727fbc9961adc0
8: 0x556f00df7bcc - __rustc[a3537046f032bc96]::rust_begin_unwind
9: 0x556eff47a96f - core::panicking::panic_fmt::he78c0e2ddfc3e30a
10: 0x556eff4820c5 - core::result::unwrap_failed::ha9d262dd5091e6ed
11: 0x556eff3647b3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h043e9fec980bebde
12: 0x556eff32ea48 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h7f2784f092b5db70.5510
13: 0x556eff32f577 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::h8febea626b5ef021.5528
14: 0x556eff32ee0b - alloc::sync::Arc<T,A>::drop_slow::hb705f75f20fb60ab
15: 0x556eff32ed10 - alloc::sync::Arc<T,A>::drop_slow::haf5599c18ae07fda
16: 0x556effac6b30 - mistralrs_core::models::qwen2::Model::forward_embed::h019baf397a0c438f
17: 0x556effac7aba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::h1b7e055f3fc8c72d
18: 0x556f00320173 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h2ab2214ebea94332
19: 0x556f003237aa - mistralrs_core::pipeline::Pipeline::step::{{closure}}::he577c64df601a3b1
20: 0x556f002794db - mistralrs_core::engine::Engine::run::{{closure}}::h8b7c1ce232f5bb75.37372
21: 0x556effeb1679 - std::sys::backtrace::__rust_begin_short_backtrace::hff153854ba1955a2
22: 0x556effeb60b3 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h42f26e7c5b4ec229
23: 0x556f00df7f77 - std::sys::pal::unix::thread::Thread::new::thread_start::h4c462331eebbf5ed
24: 0x7f91ef23fac3 - <unknown>
25: 0x7f91ef2d0a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

Is that pulling in the Candle fix and relevant changes here, or do I need to change something in the dockerfile? EDIT: sorry, issue's closed, so pinging @EricLBuehler for vis.
@sempervictus did you run
@EricLBuehler this is being built and run in Docker, so the dockerfile would be doing that. I have one on 12.4.1 as you do and one with:

diff --git a/Dockerfile.cuda-all b/Dockerfile.cuda-all
index 026a0a9e6..5fce212fd 100644
--- a/Dockerfile.cuda-all
+++ b/Dockerfile.cuda-all
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder
+FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 AS builder
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
curl \
@@ -15,17 +15,17 @@ WORKDIR /mistralrs
COPY . .
-ARG CUDA_COMPUTE_CAP=80
+ARG CUDA_COMPUTE_CAP=70
ENV CUDA_COMPUTE_CAP=${CUDA_COMPUTE_CAP}
ARG FEATURES="cuda cudnn"
-ENV RAYON_NUM_THREADS=4
-RUN RUSTFLAGS="-Z threads=4" cargo build --release --workspace --exclude mistralrs-pyo3 --features "${FEATURES}"
+ENV RAYON_NUM_THREADS=32
+RUN RUSTFLAGS="-Z threads=32" cargo build --release --workspace --exclude mistralrs-pyo3 --features "${FEATURES}"
-FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS base
+FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu22.04 AS base
ENV HUGGINGFACE_HUB_CACHE=/data \
PORT=80 \
- RAYON_NUM_THREADS=8 \
+ RAYON_NUM_THREADS=32 \
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Run the script to create symlinks in /usr/local/cuda/lib64
Doing a
@EricLBuehler - unfortunately, no dice even with a

2025-06-04T19:33:01.240433Z INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T19:33:01.240452Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T19:33:02.083380Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T19:33:02.105714Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T19:33:02.105947Z INFO mistralrs_core: Beginning dummy run.
2025-06-04T19:33:02.111040Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55f24e372922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55f24ca39563 - core::fmt::write::h95e30a17c3d7d930
2: 0x55f24e371b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55f24e372783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55f24e3720b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55f24e3717c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55f24e3b3be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55f24e3b3b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55f24e3b519c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55f24ca3764f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55f24ca3eda5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55f24c9974e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55f24dd6bde8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
13: 0x55f24dd74f92 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
14: 0x55f24c90b72f - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
15: 0x55f24c82ff91 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
16: 0x55f24ddce901 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
17: 0x55f24ddfdea0 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
18: 0x55f24dd8b61c - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
19: 0x55f24d082d2c - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
20: 0x55f24d086a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
21: 0x55f24d8de903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
22: 0x55f24d8e1f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
23: 0x55f24d83852b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
24: 0x55f24d46e979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
25: 0x55f24d474f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
26: 0x55f24e3b5547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
27: 0x7fcebcc3fac3 - <unknown>
28: 0x7fcebccd0a04 - clone
29: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55f24e372922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55f24ca39563 - core::fmt::write::h95e30a17c3d7d930
2: 0x55f24e371b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55f24e372783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55f24e3720b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55f24e3717c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55f24e3b3be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55f24e3b3b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55f24e3b519c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55f24ca3764f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55f24ca3eda5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55f24c9974e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55f24c8eb2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x55f24c8eb227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x55f24c8eb16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x55f24c8eb370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x55f24d085a80 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
17: 0x55f24d086a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
18: 0x55f24d8de903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
19: 0x55f24d8e1f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
20: 0x55f24d83852b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
21: 0x55f24d46e979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
22: 0x55f24d474f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
23: 0x55f24e3b5547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
24: 0x7fcebcc3fac3 - <unknown>
25: 0x7fcebccd0a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Interesting: when I quantize the model at load-time, it doesn't immediately crash:

2025-06-04T19:34:30.463380Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-04T19:34:30.463417Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-04T19:34:30.463447Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-04T19:34:30.463489Z INFO hf_hub: Using token file found "/root/.cache/huggingface/token"
2025-06-04T19:34:30.463573Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.463633Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.556549Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-04T19:34:30.587340Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.652534Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.679579Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-04T19:34:30.679591Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-04T19:34:30.843294Z INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-04T19:34:30.843711Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-04T19:34:30.843747Z INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-04T19:34:30.843762Z INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-04T19:34:30.843775Z INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-04T19:34:30.888142Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-04T19:34:30.888153Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-04T19:34:30.952883Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-04T19:34:30.952934Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
2025-06-04T19:34:30.953013Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-04T19:37:00.868151Z INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-06-04T19:37:00.868198Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 449 tensors.
2025-06-04T19:37:00.870213Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-04T19:38:22.217038Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 449 tensors out of 449 total tensors. Took 81.35s
2025-06-04T19:38:22.217371Z INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T19:38:22.217378Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T19:38:23.075077Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T19:38:23.098356Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T19:38:23.098601Z INFO mistralrs_core: Beginning dummy run.
2025-06-04T19:38:23.100785Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-06-04T19:38:40.069655Z INFO mistralrs_core: Dummy run completed in 16.971041424s.
2025-06-04T19:38:40.070156Z INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T19:38:43.101233Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
@EricLBuehler - same effect, reproducible: quantized, the model works for the first request. Once a 2nd request is issued following up on the conversation, it does:

2025-06-04T19:38:40.069655Z INFO mistralrs_core: Dummy run completed in 16.971041424s.
2025-06-04T19:38:40.070156Z INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T19:38:43.101233Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T19:42:18.104787Z INFO mistralrs_core::engine::logger: Throughput (T/s) 38.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:23.104861Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:28.104968Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:33.105030Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:38.105138Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:43.105199Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:48.105304Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:53.105363Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:58.105467Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:03.105614Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:08.105715Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:13.105779Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:18.105880Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:23.106025Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:28.106124Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:33.106189Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:38.106288Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:43.106431Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:48.106530Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:53.106595Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:58.106693Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:03.106835Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:08.106934Z INFO mistralrs_core::engine::logger: Throughput (T/s) 12.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:18.107095Z INFO mistralrs_core::engine::logger: Throughput (T/s) 499.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:28.107328Z INFO mistralrs_core::engine::logger: Throughput (T/s) 492.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:33.107397Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:38.107462Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:43.107529Z INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
2: 0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x55f9476a0227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x55f947da87c9 - <mistralrs_core::device_map::LayerDeviceMapper as mistralrs_core::device_map::DeviceMapper>::map::ha40c495b77d50a86
17: 0x55f947e37998 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
18: 0x55f947e3ba0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
19: 0x55f948693903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
20: 0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
21: 0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
22: 0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
23: 0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
24: 0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
25: 0x7f6070758ac3 - <unknown>
26: 0x7f60707e9a04 - clone
27: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
2: 0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x55f9476a0219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x55f94833edc2 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
17: 0x55f9486943a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
18: 0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
19: 0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
20: 0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
21: 0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
22: 0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
23: 0x7f6070758ac3 - <unknown>
24: 0x7f60707e9a04 - clone
25: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
2: 0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x55f9476a0219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x55f947d8fc52 - <hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop::ha2d468b205f8c06b
17: 0x55f94833eed6 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
18: 0x55f9486943a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
19: 0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
20: 0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
21: 0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
22: 0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
23: 0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
24: 0x7f6070758ac3 - <unknown>
25: 0x7f60707e9a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

... I do find it mildly odd to see FlashParams showing up on a V100's stack trace. The boot process shows FA as disabled due to CC 7.0. |
|
@sempervictus Hmm, interesting. What model is this? |
|
@EricLBuehler - |
|
@EricLBuehler - I can confirm reproducibility on the Qwen-distilled R1 and Llama-distilled R1 models, as well as Llama 3.1. |
|
@EricLBuehler - here's the dmesg output of a single-prompt long run on the SWE agent model - lots of out-of-bounds access, it seems: [4010345.802913] traps: mistralrs-serve[3322477] general protection fault ip:7f7d914ec898 sp:7f7aa3bef420 error:0 in libc.so.6[7f7d914ec000+195000]
[4011764.856195] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Address
[4011764.856222] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504730=0xc07000e 0x504734=0x0 0x504728=0x4c1eb72 0x50472c=0x174
[4011764.856284] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Out Of Range Address
[4011764.856304] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5047b0=0xc00000e 0x5047b4=0x0 0x5047a8=0x4c1eb72 0x5047ac=0x174
[4011764.856372] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Out Of Range Address
[4011764.856392] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504f30=0xc02000e 0x504f34=0x0 0x504f28=0x4c1eb72 0x504f2c=0x174
[4011764.856452] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Out Of Range Address
[4011764.856472] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504fb0=0xc03000e 0x504fb4=0x0 0x504fa8=0x4c1eb72 0x504fac=0x174
[4011764.856539] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Out Of Range Address
[4011764.856558] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505730=0xc02000e 0x505734=0x0 0x505728=0x4c1eb72 0x50572c=0x174
[4011764.856618] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Out Of Range Address
[4011764.856637] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5057b0=0xc02000e 0x5057b4=0x0 0x5057a8=0x4c1eb72 0x5057ac=0x174
[4011764.856704] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Out Of Range Address
[4011764.856723] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505f30=0xc03000e 0x505f34=0x0 0x505f28=0x4c1eb72 0x505f2c=0x174
[4011764.856781] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Out Of Range Address
[4011764.856801] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 1): Multiple Warp Errors
[4011764.856820] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505fb0=0xc00000e 0x505fb4=0x4 0x505fa8=0x4c1eb72 0x505fac=0x174
[4011764.856883] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Out Of Range Address
[4011764.856902] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506730=0xc04000e 0x506734=0x0 0x506728=0x4c1eb72 0x50672c=0x174
[4011764.856955] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Out Of Range Address
[4011764.856975] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5067b0=0xc04000e 0x5067b4=0x20 0x5067a8=0x4c1eb72 0x5067ac=0x174
[4011764.857034] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 5, SM 0): Out Of Range Address
[4011764.857054] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506f30=0xc01000e 0x506f34=0x0 0x506f28=0x4c1eb72 0x506f2c=0x174
[4011764.857106] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 5, SM 1): Out Of Range Address
[4011764.857125] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506fb0=0xc01000e 0x506fb4=0x0 0x506fa8=0x4c1eb72 0x506fac=0x174
[4011764.857185] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 6, SM 0): Out Of Range Address
[4011764.857205] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x507730=0xc02000e 0x507734=0x0 0x507728=0x4c1eb72 0x50772c=0x174
[4011764.857257] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 6, SM 1): Out Of Range Address
[4011764.857277] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5077b0=0xc00000e 0x5077b4=0x0 0x5077a8=0x4c1eb72 0x5077ac=0x174
[4011764.857338] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Out Of Range Address
[4011764.857358] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50c730=0xc03000e 0x50c734=0x0 0x50c728=0x4c1eb72 0x50c72c=0x174
[4011764.857410] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Out Of Range Address
[4011764.857430] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 1): Multiple Warp Errors
[4011764.857449] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50c7b0=0xc07000e 0x50c7b4=0x4 0x50c7a8=0x4c1eb72 0x50c7ac=0x174
[4011764.857508] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Out Of Range Address
[4011764.857528] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 1, TPC 1, SM 0): Multiple Warp Errors
[4011764.857547] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50cf30=0xc06000e 0x50cf34=0x4 0x50cf28=0x4c1eb72 0x50cf2c=0x174
[4011764.857598] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Out Of Range Address
[4011764.857618] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50cfb0=0xc05000e 0x50cfb4=0x20 0x50cfa8=0x4c1eb72 0x50cfac=0x174
[4011764.857677] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 0): Out Of Range Address
[4011764.857696] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50d730=0xc04000e 0x50d734=0x0 0x50d728=0x4c1eb72 0x50d72c=0x174
[4011764.857747] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 1): Out Of Range Address
[4011764.857766] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50d7b0=0xc07000e 0x50d7b4=0x0 0x50d7a8=0x4c1eb72 0x50d7ac=0x174
[4011764.857825] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 0): Out Of Range Address
[4011764.857844] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50df30=0xc04000e 0x50df34=0x20 0x50df28=0x4c1eb72 0x50df2c=0x174
[4011764.857896] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 1): Out Of Range Address
[4011764.857915] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50dfb0=0xc05000e 0x50dfb4=0x0 0x50dfa8=0x4c1eb72 0x50dfac=0x174
[4011764.857973] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 0): Out Of Range Address
[4011764.857993] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50e730=0xc04000e 0x50e734=0x0 0x50e728=0x4c1eb72 0x50e72c=0x174
[4011764.858052] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 1): Out Of Range Address
[4011764.858073] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50e7b0=0xc04000e 0x50e7b4=0x20 0x50e7a8=0x4c1eb72 0x50e7ac=0x174
[4011764.858134] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 0): Out Of Range Address
[4011764.858154] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50ef30=0xc04000e 0x50ef34=0x20 0x50ef28=0x4c1eb72 0x50ef2c=0x174
[4011764.858204] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 1): Out Of Range Address
[4011764.858224] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50efb0=0xc07000e 0x50efb4=0x20 0x50efa8=0x4c1eb72 0x50efac=0x174
[4011764.858278] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 6, SM 0): Out Of Range Address
[4011764.858297] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50f730=0xc07000e 0x50f734=0x20 0x50f728=0x4c1eb72 0x50f72c=0x174
[4011764.858344] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 6, SM 1): Out Of Range Address
[4011764.858365] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50f7b0=0xc07000e 0x50f7b4=0x20 0x50f7a8=0x4c1eb72 0x50f7ac=0x174
[4011764.858420] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[4011764.858439] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514730=0xc03000e 0x514734=0x20 0x514728=0x4c1eb72 0x51472c=0x174
[4011764.858485] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 1): Out Of Range Address
[4011764.858505] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5147b0=0xc05000e 0x5147b4=0x20 0x5147a8=0x4c1eb72 0x5147ac=0x174
[4011764.858559] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 0): Out Of Range Address
[4011764.858579] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514f30=0xc02000e 0x514f34=0x20 0x514f28=0x4c1eb72 0x514f2c=0x174
[4011764.858626] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 1): Out Of Range Address
[4011764.858646] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514fb0=0xc02000e 0x514fb4=0x20 0x514fa8=0x4c1eb72 0x514fac=0x174
[4011764.858700] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 0): Out Of Range Address
[4011764.858720] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515730=0xc01000e 0x515734=0x20 0x515728=0x4c1eb72 0x51572c=0x174
[4011764.858767] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 1): Out Of Range Address
[4011764.858787] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5157b0=0xc01000e 0x5157b4=0x20 0x5157a8=0x4c1eb72 0x5157ac=0x174
[4011764.858840] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 0): Out Of Range Address
[4011764.858860] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 2, TPC 3, SM 0): Multiple Warp Errors
[4011764.858880] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515f30=0xc00000e 0x515f34=0x24 0x515f28=0x4c1eb72 0x515f2c=0x174
[4011764.858926] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Out Of Range Address
[4011764.858947] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515fb0=0xc00000e 0x515fb4=0x20 0x515fa8=0x4c1eb72 0x515fac=0x174
[4011764.859001] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4, SM 0): Out Of Range Address
[4011764.859021] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516730=0xc00000e 0x516734=0x20 0x516728=0x4c1eb72 0x51672c=0x174
[4011764.859068] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4, SM 1): Out Of Range Address
[4011764.859087] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5167b0=0xc02000e 0x5167b4=0x20 0x5167a8=0x4c1eb72 0x5167ac=0x174
[4011764.859142] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 0): Out Of Range Address
[4011764.859162] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516f30=0xc00000e 0x516f34=0x20 0x516f28=0x4c1eb72 0x516f2c=0x174
[4011764.859209] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 1): Out Of Range Address
[4011764.859228] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516fb0=0xc03000e 0x516fb4=0x20 0x516fa8=0x4c1eb72 0x516fac=0x174
[4011764.859282] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 6, SM 0): Out Of Range Address
[4011764.859301] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x517730=0xc07000e 0x517734=0x20 0x517728=0x4c1eb72 0x51772c=0x174
[4011764.859348] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 6, SM 1): Out Of Range Address
[4011764.859369] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5177b0=0xc01000e 0x5177b4=0x20 0x5177a8=0x4c1eb72 0x5177ac=0x174
[4011764.859424] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0, SM 0): Out Of Range Address
[4011764.859443] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51c730=0xc02000e 0x51c734=0x20 0x51c728=0x4c1eb72 0x51c72c=0x174
[4011764.859490] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0, SM 1): Out Of Range Address
[4011764.859510] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51c7b0=0xc05000e 0x51c7b4=0x20 0x51c7a8=0x4c1eb72 0x51c7ac=0x174
[4011764.859564] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Out Of Range Address
[4011764.859584] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51cf30=0xc05000e 0x51cf34=0x20 0x51cf28=0x4c1eb72 0x51cf2c=0x174
[4011764.859631] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 1): Out Of Range Address
[4011764.859650] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51cfb0=0xc05000e 0x51cfb4=0x20 0x51cfa8=0x4c1eb72 0x51cfac=0x174
[4011764.859705] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 0): Out Of Range Address
[4011764.859725] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51d730=0xc06000e 0x51d734=0x20 0x51d728=0x4c1eb72 0x51d72c=0x174
[4011764.859772] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 1): Out Of Range Address
[4011764.859791] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51d7b0=0xc0e000e 0x51d7b4=0x20 0x51d7a8=0x4c1eb72 0x51d7ac=0x174
[4011764.859845] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
[4011764.859866] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 3, TPC 3, SM 0): Multiple Warp Errors
[4011764.859885] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51df30=0xc06000e 0x51df34=0x24 0x51df28=0x4c1eb72 0x51df2c=0x174
[4011764.859932] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 1): Out Of Range Address
[4011764.859952] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51dfb0=0xc08000e 0x51dfb4=0x20 0x51dfa8=0x4c1eb72 0x51dfac=0x174
[4011764.860005] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 0): Out Of Range Address
[4011764.860025] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51e730=0xc07000e 0x51e734=0x20 0x51e728=0x4c1eb72 0x51e72c=0x174
[4011764.860073] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 1): Out Of Range Address
[4011764.860092] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51e7b0=0xc0f000e 0x51e7b4=0x20 0x51e7a8=0x4c1eb72 0x51e7ac=0x174
[4011764.860146] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 0): Out Of Range Address
[4011764.860166] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51ef30=0xc05000e 0x51ef34=0x20 0x51ef28=0x4c1eb72 0x51ef2c=0x174
[4011764.860213] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 1): Out Of Range Address
[4011764.860233] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51efb0=0xc04000e 0x51efb4=0x20 0x51efa8=0x4c1eb72 0x51efac=0x174
[4011764.860288] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 0): Out Of Range Address
[4011764.860307] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524730=0xc03000e 0x524734=0x20 0x524728=0x4c1eb72 0x52472c=0x174
[4011764.860354] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 1): Out Of Range Address
[4011764.860374] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5247b0=0xc07000e 0x5247b4=0x20 0x5247a8=0x4c1eb72 0x5247ac=0x174
[4011764.860429] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 0): Out Of Range Address
[4011764.860448] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524f30=0xc06000e 0x524f34=0x20 0x524f28=0x4c1eb72 0x524f2c=0x174
[4011764.860495] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 1): Out Of Range Address
[4011764.860515] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524fb0=0xc04000e 0x524fb4=0x20 0x524fa8=0x4c1eb72 0x524fac=0x174
[4011764.860567] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 0): Out Of Range Address
[4011764.860588] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525730=0xc0b000e 0x525734=0x20 0x525728=0x4c1eb72 0x52572c=0x174
[4011764.860635] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[4011764.860655] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5257b0=0xc04000e 0x5257b4=0x20 0x5257a8=0x4c1eb72 0x5257ac=0x174
[4011764.860708] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 0): Out Of Range Address
[4011764.860728] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525f30=0xc05000e 0x525f34=0x20 0x525f28=0x4c1eb72 0x525f2c=0x174
[4011764.860775] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 1): Out Of Range Address
[4011764.860794] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525fb0=0xc06000e 0x525fb4=0x20 0x525fa8=0x4c1eb72 0x525fac=0x174
[4011764.860847] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 0): Out Of Range Address
[4011764.860866] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526730=0xc06000e 0x526734=0x20 0x526728=0x4c1eb72 0x52672c=0x174
[4011764.860913] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 1): Out Of Range Address
[4011764.860933] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5267b0=0xc07000e 0x5267b4=0x20 0x5267a8=0x4c1eb72 0x5267ac=0x174
[4011764.860986] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 0): Out Of Range Address
[4011764.861005] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526f30=0xc0b000e 0x526f34=0x20 0x526f28=0x4c1eb72 0x526f2c=0x174
[4011764.861052] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 1): Out Of Range Address
[4011764.861071] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526fb0=0xc07000e 0x526fb4=0x20 0x526fa8=0x4c1eb72 0x526fac=0x174
[4011764.861124] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 6, SM 0): Out Of Range Address
[4011764.861144] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x527730=0xc04000e 0x527734=0x20 0x527728=0x4c1eb72 0x52772c=0x174
[4011764.861191] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 6, SM 1): Out Of Range Address
[4011764.861210] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5277b0=0xc04000e 0x5277b4=0x20 0x5277a8=0x4c1eb72 0x5277ac=0x174
[4011764.861265] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 0, SM 0): Out Of Range Address
[4011764.861284] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52c730=0xc07000e 0x52c734=0x20 0x52c728=0x4c1eb72 0x52c72c=0x174
[4011764.861331] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 0, SM 1): Out Of Range Address
[4011764.861351] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 5, TPC 0, SM 1): Multiple Warp Errors
[4011764.861370] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52c7b0=0xc06000e 0x52c7b4=0x24 0x52c7a8=0x4c1eb72 0x52c7ac=0x174
[4011764.861424] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 1, SM 0): Out Of Range Address
[4011764.861443] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52cf30=0xc05000e 0x52cf34=0x20 0x52cf28=0x4c1eb72 0x52cf2c=0x174
[4011764.861490] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 1, SM 1): Out Of Range Address
[4011764.861509] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52cfb0=0xc06000e 0x52cfb4=0x20 0x52cfa8=0x4c1eb72 0x52cfac=0x174
[4011764.861562] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 0): Out Of Range Address
[4011764.861581] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52d730=0xc07000e 0x52d734=0x20 0x52d728=0x4c1eb72 0x52d72c=0x174
[4011764.861628] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 1): Out Of Range Address
[4011764.861648] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52d7b0=0xc04000e 0x52d7b4=0x20 0x52d7a8=0x4c1eb72 0x52d7ac=0x174
[4011764.861701] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 3, SM 0): Out Of Range Address
[4011764.861721] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52df30=0xc06000e 0x52df34=0x20 0x52df28=0x4c1eb72 0x52df2c=0x174
[4011764.861768] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 3, SM 1): Out Of Range Address
[4011764.861787] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52dfb0=0xc04000e 0x52dfb4=0x20 0x52dfa8=0x4c1eb72 0x52dfac=0x174
[4011764.861840] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 0): Out Of Range Address
[4011764.861859] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52e730=0xc07000e 0x52e734=0x20 0x52e728=0x4c1eb72 0x52e72c=0x174
[4011764.861906] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 1): Out Of Range Address
[4011764.861925] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52e7b0=0xc07000e 0x52e7b4=0x20 0x52e7a8=0x4c1eb72 0x52e7ac=0x174
[4011764.861979] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 0): Out Of Range Address
[4011764.861998] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 5, TPC 5, SM 0): Multiple Warp Errors
[4011764.862021] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52ef30=0xc06000e 0x52ef34=0x24 0x52ef28=0x4c1eb72 0x52ef2c=0x174
[4011764.862069] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): Out Of Range Address
[4011764.862089] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52efb0=0xc04000e 0x52efb4=0x20 0x52efa8=0x4c1eb72 0x52efac=0x174
[4011764.862712] NVRM: Xid (PCI:0000:1a:00): 13, pid=3322703, name=mistralrs-serve, Graphics Exception: ChID 0010, Class 0000c3c0, Offset 00000510, Data 00419e84 |
|
@sempervictus Interesting! Can you share the output log of this run? Also, is it possible to find which kernels are causing these? |
|
@EricLBuehler - here's what the container spat out into the log stream: 2025-06-04T22:04:43.895454Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-04T22:04:43.895503Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-04T22:04:43.895533Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-04T22:04:43.895576Z INFO hf_hub: Using token file found "/root/.cache/huggingface/token"
2025-06-04T22:04:43.895663Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:43.895721Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:43.966507Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-04T22:04:44.011380Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:44.074830Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:44.104793Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-04T22:04:44.104804Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-04T22:04:44.261495Z INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-04T22:04:44.261900Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-04T22:04:44.261935Z INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-04T22:04:44.261951Z INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-04T22:04:44.261963Z INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-04T22:04:44.308015Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-04T22:04:44.308028Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-04T22:04:44.373043Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-04T22:04:44.373097Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
2025-06-04T22:04:44.373164Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-04T22:07:10.498028Z INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-06-04T22:07:10.498073Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 449 tensors.
2025-06-04T22:07:10.500096Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-04T22:08:31.049024Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 449 tensors out of 449 total tensors. Took 80.55s
2025-06-04T22:08:31.049365Z INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T22:08:31.049377Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T22:08:31.896558Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T22:08:31.919202Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T22:08:31.919447Z INFO mistralrs_core: Beginning dummy run.
2025-06-04T22:08:31.921006Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-06-04T22:08:32.238438Z INFO mistralrs_core: Dummy run completed in 0.318980626s.
2025-06-04T22:08:32.238898Z INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T22:08:36.921162Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T22:09:01.921532Z INFO mistralrs_core::engine::logger: Throughput (T/s) 592.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:06.921637Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:11.921729Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:16.921816Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:21.921897Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:26.921973Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:31.922043Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:36.922108Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:41.922203Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:46.922301Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:51.922395Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:56.922483Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:01.922566Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:06.922644Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:11.922716Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:16.922784Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:21.922847Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:26.922948Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:31.923044Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:36.923134Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:41.923221Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:46.923301Z INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:51.923376Z INFO mistralrs_core::engine::logger: Throughput (T/s) 5.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T22:25:46.937102Z INFO mistralrs_core::engine::logger: Throughput (T/s) 1067.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:25:51.937185Z INFO mistralrs_core::engine::logger: Throughput (T/s) 13.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:36.937882Z INFO mistralrs_core::engine::logger: Throughput (T/s) 1956.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:41.937950Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:46.938054Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:51.938152Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:56.938246Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:01.938334Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:06.938417Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:11.938495Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:16.938566Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:21.938634Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:26.938695Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:31.938796Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:36.938892Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:41.938984Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:46.939069Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:51.939149Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:56.939225Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:01.939294Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:06.939358Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:11.939463Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:16.939562Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:21.939656Z INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
2: 0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x5634d097d227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x5634d10857c9 - <mistralrs_core::device_map::LayerDeviceMapper as mistralrs_core::device_map::DeviceMapper>::map::ha40c495b77d50a86
17: 0x5634d1114998 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
18: 0x5634d1118a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
19: 0x5634d1970903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
20: 0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
21: 0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
22: 0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
23: 0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
24: 0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
25: 0x7f2edd158ac3 - <unknown>
26: 0x7f2edd1e9a04 - clone
27: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
2: 0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x5634d097d219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x5634d161bdc2 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
17: 0x5634d19713a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
18: 0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
19: 0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
20: 0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
21: 0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
22: 0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
23: 0x7f2edd158ac3 - <unknown>
24: 0x7f2edd1e9a04 - clone
25: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
2: 0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x5634d097d219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x5634d106cc52 - <hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop::ha2d468b205f8c06b
17: 0x5634d161bed6 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
18: 0x5634d19713a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
19: 0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
20: 0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
21: 0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
22: 0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
23: 0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
24: 0x7f2edd158ac3 - <unknown>
25: 0x7f2edd1e9a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
How do I pull up which kernels it's loading? Separately, any chance this is what we're tripping over on v7 devices? |
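On the "which kernel" question: CUDA reports illegal-address errors asynchronously, so the panic site in these backtraces (the `CudaSlice` drop) is usually not the faulting kernel. A standard way to localize the real fault, assuming the CUDA toolkit is available in the container (the `mistralrs-server` arguments below are placeholders):

```shell
# Serialize kernel launches so the illegal-address error surfaces at the
# actual launch call site instead of a later drop/free:
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"

# Alternatively, compute-sanitizer (ships with the CUDA toolkit) reports the
# exact kernel name and offending address, at a large speed cost:
#   compute-sanitizer --tool memcheck mistralrs-server <args...>
```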
|
@sempervictus thanks for the log.
Was wondering if that was showing up in the log. I've reproduced this:
So it seems that PagedAttention is somehow causing this - can you please try to run without paged attention? |
|
Also pretty sure paged attention is causing this, with two things I haven't tracked down yet being:
Running without it currently on the SWE-agent tester ... it's hanging pretty often (GPUs settle to 25% and |
I fixed that in #1429; this was some metadata
Just referencing 2 equivalent forms to my understanding.
Have you tried to activate |
|
@EricLBuehler - i've run with the NCCL-disable env var and without it. Currently using manual partitioning although i got the sense you fixed allocations previously, the KV caches seems to be biasing GPU0 (and some models cant be split). |
|
@EricLBuehler - Just rebuilt and tested the unquantized-blow-up case (with the FA fix): still blows up :-(
mistralrs-server --token-source env:HF_TOKEN -n "0:20;1:22;2:22" --port 7651 plain -m SWE-bench/SWE-agent-LM-32B --max-seq-len 32768
==========
== CUDA ==
==========
CUDA Version 12.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2025-06-05T01:26:48.882747Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-05T01:26:48.882784Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-05T01:26:48.882815Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-05T01:26:48.882853Z INFO hf_hub: Using token file found "/root/.cache/huggingface/token"
2025-06-05T01:26:48.882946Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:48.883032Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:48.967134Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-05T01:26:48.992605Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:49.045567Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:49.075365Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-05T01:26:49.075383Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-05T01:26:49.239330Z INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-05T01:26:49.239753Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T01:26:49.239793Z INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-05T01:26:49.239808Z INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-05T01:26:49.239821Z INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-05T01:26:49.288215Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T01:26:49.288229Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T01:26:49.353483Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T01:26:49.353534Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
...
2025-06-05T01:27:06.267927Z INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-05T01:27:06.267947Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-05T01:27:07.137262Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-05T01:27:07.160496Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T01:27:07.160723Z INFO mistralrs_core: Beginning dummy run.
2025-06-05T01:27:07.165754Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55ed44aa6cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55ed43174273 - core::fmt::write::h95e30a17c3d7d930
2: 0x55ed44aa5eaf - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55ed44aa6b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55ed44aa6455 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55ed44aa5b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55ed44ae7f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55ed44ae7ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55ed44ae953c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55ed4317235f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55ed43179ab5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55ed430d21a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55ed444a0738 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
13: 0x55ed444a98e2 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
14: 0x55ed430464df - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
15: 0x55ed42f67e01 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
16: 0x55ed445030b1 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
17: 0x55ed44532560 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
18: 0x55ed444bfdcc - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
19: 0x55ed4371fadc - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
20: 0x55ed437237ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
21: 0x55ed44263ee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
22: 0x55ed4426751a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
23: 0x55ed441b9bea - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
24: 0x55ed438dc419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
25: 0x55ed438e2903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
26: 0x55ed44ae98e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
27: 0x7f7d4f158ac3 - <unknown>
28: 0x7f7d4f1e9a04 - clone
29: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x55ed44aa6cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x55ed43174273 - core::fmt::write::h95e30a17c3d7d930
2: 0x55ed44aa5eaf - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x55ed44aa6b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x55ed44aa6455 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x55ed44aa5b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x55ed44ae7f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x55ed44ae7ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x55ed44ae953c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x55ed4317235f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x55ed43179ab5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x55ed430d21a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55ed43026098 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x55ed43025fd7 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x55ed43025f1b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x55ed43026120 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x55ed43722830 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
17: 0x55ed437237ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
18: 0x55ed44263ee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
19: 0x55ed4426751a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
20: 0x55ed441b9bea - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
21: 0x55ed438dc419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
22: 0x55ed438e2903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
23: 0x55ed44ae98e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
24: 0x7f7d4f158ac3 - <unknown>
25: 0x7f7d4f1e9a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Lines 11-17 of the middle one look interesting. Can also confirm that the same unquantized always-crash-reproducer does crash w/
2025-06-05T01:31:26.815015Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-05T01:31:26.837498Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T01:31:26.837710Z INFO mistralrs_core: Beginning dummy run.
2025-06-05T01:31:26.842774Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x56261d161cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x56261b82f273 - core::fmt::write::h95e30a17c3d7d930
2: 0x56261d160eaf - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x56261d161b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x56261d161455 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x56261d160b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x56261d1a2f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x56261d1a2ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x56261d1a453c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x56261b82d35f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x56261b834ab5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x56261b78d1a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x56261cb5b738 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
13: 0x56261cb648e2 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
14: 0x56261b7014df - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
15: 0x56261b622e01 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
16: 0x56261cbbe0b1 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
17: 0x56261cbed560 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
18: 0x56261cb7adcc - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
19: 0x56261bddaadc - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
20: 0x56261bdde7ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
21: 0x56261c91eee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
22: 0x56261c921a38 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
23: 0x56261c871a06 - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
24: 0x56261bf97419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
25: 0x56261bf9d903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
26: 0x56261d1a48e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
27: 0x7f11ce558ac3 - <unknown>
28: 0x7f11ce5e9a04 - clone
29: 0x0 - <unknown>
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x56261d161cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
1: 0x56261b82f273 - core::fmt::write::h95e30a17c3d7d930
2: 0x56261d160eaf - std::io::Write::write_fmt::h2447d4278ce5a227
3: 0x56261d161b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
4: 0x56261d161455 - std::panicking::default_hook::h0a7d57cc63374946
5: 0x56261d160b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
6: 0x56261d1a2f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
7: 0x56261d1a2ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
8: 0x56261d1a453c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
9: 0x56261b82d35f - core::panicking::panic_fmt::ha159237b3cadc48c
10: 0x56261b834ab5 - core::result::unwrap_failed::h879f86fa8962b20a
11: 0x56261b78d1a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x56261b6e1098 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
13: 0x56261b6e0fd7 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
14: 0x56261b6e0f1b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
15: 0x56261b6e1120 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
16: 0x56261bddd830 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
17: 0x56261bdde7ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
18: 0x56261c91eee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
19: 0x56261c921a38 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
20: 0x56261c871a06 - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
21: 0x56261bf97419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
22: 0x56261bf9d903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
23: 0x56261d1a48e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
24: 0x7f11ce558ac3 - <unknown>
25: 0x7f11ce5e9a04 - clone
26: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting. |
Hmm, yeah. On my end I'm trying some things: nccl + no ISQ + paged attention works. Trying to find out which kernel is the problem though
Looks like an issue in the cublaslt code? Checking that... |

|
Re which kernels - eventually all of them as far as I can tell; every model I've gotten to run crashes after some time with paged attention, and the unquantized SWE one seems to be the best reproducer |
|
@EricLBuehler - how's NCCL being built? Are you linking hpc-x and building against the current CUDA version of the container or using a prebuilt? Our shop is some of the magical elves nobody ever sees who build/run the HPC clusters for a bunch of the various clouds and enterprise orgs out there (framing-overhead off line rate sort of stuff where we can, and not even on IB these days), so NCCL, UCX, etc are recurring parts of our collective nightmares. Especially w/ the proprietary/open nonsense (b200's don't run the proprietary drivers and their mezzanine "looks like" 4 permanently IB-mode CX7s, NVL72s are even stranger), NCCL compilation to-target becomes even more relevant re ABI against CUDA, drivers, and OpenMPI (not to mention the toolchain changes currently rippling through Canonical's LTSes). Might be worth considering runtime instrumentation beyond the dynamic-dispatch stack traces, such as codepoint interception and export to opentelem or some form of RPC to produce internal state telemetry for external analysis. More detailed console logging output probably can't hurt either (which kernels, parameters, etc) - maybe with some sort of verbosity flag to ratchet up the noise. |
|
@sempervictus it looks like nccl + NO paged attn + NO cublaslt + ISQ works
Currently delegating to
Absolutely, might try that! |
|
Well, on the cudarc side - EricLBuehler/candle#83 :-) |
|
Also i think there's some memory capacity calculus that goes south when running w/out paged attention. I've had the SWE one running overnight writing code at a somewhat sad rate but the interesting part is that its runtime memory seems to spike past actual capacity: 2025-06-05T07:09:18.355647Z ERROR mistralrs_core::engine: completion step - Model failed with error: WithBacktrace { inner: Cuda(Cuda(DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory"))), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "<candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::alloc_uninit" }, { fn: "candle_core::tensor::Tensor::reshape" }, { fn: "mistralrs_core::attention::repeat_kv" }, { fn: "mistralrs_core::attention::Sdpa::run_attention" }, { fn: "mistralrs_core::models::qwen2::Model::forward_embed" }, { fn: "<mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward" }, { fn: "<mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}.37410" }, { fn: "std::sys::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "clone" }] }
2025-06-05T07:09:23.208596Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.20, Prefix cache hitrate 55.56%, 0 running, 0 waiting |
Seems like a separate issue, will tackle that after we fix this one! As for debugging:
|
|
Agreed, likely its own thing but does raise the question of "should we have a shadow MMU to track it all?" :-) |
|
@sempervictus I think I might have found it! Disabling prefix caching seems to work: Can you try that out? |
|
@sempervictus after |
|
"test is ingoing" - painfully slowly: 2025-06-05T16:29:31.508817Z INFO mistralrs_quant::utils::log: Model has 48 repeating layers.
2025-06-05T16:29:31.509484Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T16:29:31.509531Z INFO mistralrs_quant::utils::log: Layers 0-11: cuda[0] (32 GB)
2025-06-05T16:29:31.509548Z INFO mistralrs_quant::utils::log: Layers 12-23: cuda[1] (32 GB)
2025-06-05T16:29:31.509563Z INFO mistralrs_quant::utils::log: Layers 24-35: cuda[2] (32 GB)
2025-06-05T16:29:31.509576Z INFO mistralrs_quant::utils::log: Layers 36-47: cuda[3] (32 GB)
2025-06-05T16:29:31.560365Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T16:29:31.560378Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T16:29:31.632507Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T16:29:31.632577Z INFO mistralrs_core::pipeline::vision: Model config: Llama4Config { text_config: TextConfig { hidden_act: Silu, hidden_size: 5120, intermediate_size: 8192, vocab_size: 202048, num_hidden_layers: 48, num_attention_heads: 40, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 10485760, rope_scaling: Some(Llama3RopeConfig { factor: 16.0, low_freq_factor: Some(1.0), high_freq_factor: Some(1.0), original_max_position_embeddings: Some(8192), rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: false, floor_scale: Some(8192.0), attn_scale: Some(0.1), attn_temperature_tuning: Some(4.0), use_qk_norm: true, moe_layers: None, interleave_moe_layer_step: 1, intermediate_size_mlp: 16384, num_local_experts: 16, num_experts_per_tok: 1, attention_chunk_size: 8192 }, vision_config: VisionConfig { hidden_size: 1408, hidden_act: Gelu, num_hidden_layers: 34, num_attention_heads: 16, num_channels: 3, intermediate_size: 5632, vision_output_dim: 4096, image_size: 336, patch_size: 14, norm_eps: 1e-5, pixel_shuffle_ratio: 0.5, projector_input_dim: 4096, projector_output_dim: 4096, vision_feature_layer: -1, rope_theta: 10000.0 }, image_token_index: 200092 }
2025-06-05T16:29:31.632801Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
Loading text repeating layers: [01:13:56] [###############################>--------] 38/48 (3h) |
|
@EricLBuehler - interestingly, llama4 now doesn't fit into 128G of memory (4x32) very well... NCCL tries to dump a chunk into host RAM, takes a very long time to load. Forcing to GPU-only w/ |
|
hmm, sorta just hangs there... doesn't talk much, or use any GPU resources after load :-( - 2025-06-05T21:25:47.783403Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-05T21:25:47.783434Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-05T21:25:47.784199Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-05T21:25:47.785282Z INFO hf_hub: Using token file found "/root/.cache/huggingface/token"
2025-06-05T21:25:47.786888Z INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.787163Z INFO mistralrs_core::pipeline::vision: Loading `config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.859665Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00050.safetensors", "model-00002-of-00050.safetensors", "model-00003-of-00050.safetensors", "model-00004-of-00050.safetensors", "model-00005-of-00050.safetensors", "model-00006-of-00050.safetensors", "model-00007-of-00050.safetensors", "model-00008-of-00050.safetensors", "model-00009-of-00050.safetensors", "model-00010-of-00050.safetensors", "model-00011-of-00050.safetensors", "model-00012-of-00050.safetensors", "model-00013-of-00050.safetensors", "model-00014-of-00050.safetensors", "model-00015-of-00050.safetensors", "model-00016-of-00050.safetensors", "model-00017-of-00050.safetensors", "model-00018-of-00050.safetensors", "model-00019-of-00050.safetensors", "model-00020-of-00050.safetensors", "model-00021-of-00050.safetensors", "model-00022-of-00050.safetensors", "model-00023-of-00050.safetensors", "model-00024-of-00050.safetensors", "model-00025-of-00050.safetensors", "model-00026-of-00050.safetensors", "model-00027-of-00050.safetensors", "model-00028-of-00050.safetensors", "model-00029-of-00050.safetensors", "model-00030-of-00050.safetensors", "model-00031-of-00050.safetensors", "model-00032-of-00050.safetensors", "model-00033-of-00050.safetensors", "model-00034-of-00050.safetensors", "model-00035-of-00050.safetensors", "model-00036-of-00050.safetensors", "model-00037-of-00050.safetensors", "model-00038-of-00050.safetensors", "model-00039-of-00050.safetensors", "model-00040-of-00050.safetensors", "model-00041-of-00050.safetensors", "model-00042-of-00050.safetensors", "model-00043-of-00050.safetensors", "model-00044-of-00050.safetensors", "model-00045-of-00050.safetensors", "model-00046-of-00050.safetensors", "model-00047-of-00050.safetensors", "model-00048-of-00050.safetensors", "model-00049-of-00050.safetensors", "model-00050-of-00050.safetensors"]
2025-06-05T21:25:47.891877Z INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.940447Z INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.978101Z INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.978125Z INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:48.010267Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama4`
2025-06-05T21:25:48.320550Z INFO mistralrs_quant::utils::log: Model has 48 repeating layers.
2025-06-05T21:25:48.321754Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T21:25:48.322014Z INFO mistralrs_quant::utils::log: Layers 0-11: cuda[0] (32 GB)
2025-06-05T21:25:48.322032Z INFO mistralrs_quant::utils::log: Layers 12-23: cuda[1] (32 GB)
2025-06-05T21:25:48.322046Z INFO mistralrs_quant::utils::log: Layers 24-35: cuda[2] (32 GB)
2025-06-05T21:25:48.322060Z INFO mistralrs_quant::utils::log: Layers 36-47: cuda[3] (32 GB)
2025-06-05T21:25:48.376168Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T21:25:48.376186Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T21:25:48.450293Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T21:25:48.450616Z INFO mistralrs_core::pipeline::vision: Model config: Llama4Config { text_config: TextConfig { hidden_act: Silu, hidden_size: 5120, intermediate_size: 8192, vocab_size: 202048, num_hidden_layers: 48, num_attention_heads: 40, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 10485760, rope_scaling: Some(Llama3RopeConfig { factor: 16.0, low_freq_factor: Some(1.0), high_freq_factor: Some(1.0), original_max_position_embeddings: Some(8192), rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: false, floor_scale: Some(8192.0), attn_scale: Some(0.1), attn_temperature_tuning: Some(4.0), use_qk_norm: true, moe_layers: None, interleave_moe_layer_step: 1, intermediate_size_mlp: 16384, num_local_experts: 16, num_experts_per_tok: 1, attention_chunk_size: 8192 }, vision_config: VisionConfig { hidden_size: 1408, hidden_act: Gelu, num_hidden_layers: 34, num_attention_heads: 16, num_channels: 3, intermediate_size: 5632, vision_output_dim: 4096, image_size: 336, patch_size: 14, norm_eps: 1e-5, pixel_shuffle_ratio: 0.5, projector_input_dim: 4096, projector_output_dim: 4096, vision_feature_layer: -1, rope_theta: 10000.0 }, image_token_index: 200092 }
2025-06-05T21:25:48.451823Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-05T23:06:10.917599Z INFO mistralrs_core::pipeline::paths: `tokenizer_config.json` does not contain a chat template, attempting to use specified JINJA chat template.
2025-06-05T23:06:10.919238Z INFO mistralrs_core::pipeline::paths: No specified chat template. No chat template will be used. Only prompts will be accepted, not messages.
2025-06-05T23:06:10.920975Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 2895 tensors.
2025-06-05T23:06:10.922891Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-05T23:08:42.075695Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 2895 tensors out of 2895 total tensors. Took 151.15s
2025-06-05T23:08:42.076475Z INFO mistralrs_core::paged_attention: Allocating 3072 MB for PagedAttention KV cache per GPU
2025-06-05T23:08:42.076487Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 512 GPU blocks: available context length is 16384 tokens
2025-06-05T23:08:43.333552Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot|>", "<|end_of_text|>", "<|eom|>", unk_tok = `None`
2025-06-05T23:08:43.385250Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T23:08:43.386047Z INFO mistralrs_core: Beginning dummy run.
2025-06-05T23:08:52.650528Z INFO mistralrs_core: Dummy run completed in 9.264463455s.
2025-06-05T23:08:52.652994Z INFO mistralrs_server: Serving on http://0.0.0.0:7650.
2025-06-05T23:08:53.390453Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
looks like this: |
|
Just to chime in, there's a lot of comments/output above that I'm not going to dig into, but a few questions:
If those assumptions are valid, whatever in the build requires GPU access would need to be pre-built externally probably or opt-out via feature with a documented drawback when using the container if there's no way to add that support via other means. There was mention of PagedAttention kernel and quantization, so I assume the custom kernels are involved/affected during the build in some way? FWIW a compute capability of 70 with the Tesla V100 GPU is a bit low. I have seen a few projects where 75 is the lowest capability but even that involved some differences (I think PagedAttention was not supported?). EDIT: Here's one:
And from that same project, this related section:
So it's possible the issues are related to the older GPU arch (nearing a decade old now?) if you can verify no issues with the image on newer GPUs. Given the age of that hardware, I'd also ensure you're using a relatively modern release of Docker itself (or whichever other container engine you prefer). |
* Fix handling of Metal fused attn head dims (EricLBuehler#1234) * Fix handling of metal attn head dims * Fix handling of gemma3 1b when images * Tweak default for paged attn builder * Support paged attn for vision model rust api (EricLBuehler#1235) * [Breaking] Support setting HF cache path (EricLBuehler#1237) * Add it internally * Add the apis * Support tool calling for DeepSeek models (EricLBuehler#1239) * Support tool calling for deepseek models * Format * Fix deepseek * Server image processing refactor and fixes (EricLBuehler#1244) * Fix strict gemma3 case * Accept multiple images in the content array * Fix multiple images in one array ct * Add it to the python api * Typos * Optimized CUDA RoPE kernels (EricLBuehler#1247) * Add the kernels * It works * Works * Buulds * Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246) * Fix typo * Update mistralrs.pyi * Fixes for UQFF + distributed layers (EricLBuehler#1250) * Fixes for uqff + distributed layers * Typo * Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243) * Add the tool * Actually search * Clippy * Sort of works * Remove some debuggers * tweak * Add some rules * Works great * Tweak 'system' prompt * Update mistralrs-core/src/search/mod.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Typo * Add it to all the apis * Add bert model for similarity reranking * Typos * Early detection of tools * Alias max_tokens -> max_completion_tokens too * Customizable bert model * Flip the enabler around * Add docs * Update readme * Typo --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Format kernels (EricLBuehler#1251) * Update readme * Update readme * Remove test * Add quantize guards for uqff deserialize (EricLBuehler#1252) * Refactor cuBLASlt-related code (EricLBuehler#1253) * Centralize cublaslt into mistralrs-quant * Use cublaslt in unquant layer * Use beautiful trait constants for simpler code * Move tests * Dispatch 
to unquant for cublaslt * Dispatch to unquant for cublaslt * Fix feature * Add convert_to_gptq script * Update deps, bump pyo3 version (EricLBuehler#1259) * Faster cuda FP8 performance (EricLBuehler#1257) * Avoid fp8 sync * Fix dtype * Rust 1.86 clippy (EricLBuehler#1260) * Rust 1.86 clippy * Clippy * Refactor engine arch (EricLBuehler#1262) * Refactor engine add_request * Don't recompile regex * Clippy * Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263) * Play with varbuilder lifetimes * Merge lora weights * Clippy * Lora works * Support multiple loras * Cleanup, remove adapter activation * Complete merge * Fast Metal-specific quantization method: AFQ (EricLBuehler#1264) * Add mlx quantized kernels * Add mlx quantized kernels * Kernel launcher * Add AFQ isq quant and dequant * Some quantmethod things * Begin to implement the qmm caller * Clippy * Much faster * Cache kernels * Docs * Clippy * Add it to uqff * Support prequantized models from MLX (EricLBuehler#1265) * Refactor quantizedconfig * Support AFQ prequantized * Update docs * Update docs * Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266) * Automatic isq * typo * Doc * Improved usage metrics (EricLBuehler#1267) * Fix cuda * Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270) Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2) --- updated-dependencies: - dependency-name: tokio dependency-version: 1.44.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Gather MM ops in mistralrs-quant (EricLBuehler#1272) * Update the caller * Wire things up * Broadcase for afq gathermm * Broadcase for afq gathermm * Clippy * Improve performance of deepseek models * Typo fix * BincountOp not used * Implement Llama 4! 
(EricLBuehler#1268) * Implement Llama 4 * Implement the main changes for the text model * Make chunked mask * Wire things up * Add some EP * Initial sketch of inputs processor * Runs * Progress * all reduce moes * It works! * Some cleanup * Faster moe block * Add device map * Make chunked matrix * Fully working now! * Reactivate cublaslt * Fix shared mlp cublaslt * Refactor to packed experts * Complete merge * It is a normal model now * Fixes * Set device for moe * ISQ fixes * Much faster sort kernel * Faster loading! * Faster loading! * Fp8 cpu copy ops in candle backend * Add the vision model * Add mmproj layer * Actually merge the inputs * Sketch most of the image processor * Add the rest of the image processor * Implement the whole processor * Add the loader * Some fixes * A batch of fixes * Some fixes * tmp * Actually support isq * Ok it works a bit * Fix norm device * It works * A bit cleaner * Support residul tensors * Remove text loader * Implement the device mapping system * Fix auto device map * Add examples * Add model card * Typo * Remove superflous logging * Fixes for Llama 4 UQFF loading (EricLBuehler#1275) * Support sharding for UQFF (EricLBuehler#1276) * Serialize sharded uqff files * Loading * Fix base64 * Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278) * Support the DeepCoder model (EricLBuehler#1279) * Add faq for metal not found * Improved PagedAttn scheduling accuracy (EricLBuehler#1282) * Scheduler ops by reference * Ensure scheduler gets correct prompts * Fix cuda build for copy_blocks * Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283) * update to llguidance 0.7.16 (EricLBuehler#1284) * update llguidance to 0.7.16 from crates.io; use ParserFactory * add lark_llg.py example * use new llguidance::Matcher APIs * rework spec-decoding with llg * more work on spec sampling * check for parser stop * fix clippy * remove unneeded rollback * update build_llg_factory to return Result * Update 
dependencies (EricLBuehler#1286) * Much faster image inputs processing (EricLBuehler#1289) * Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290) * More sdpa head dims, faster vision models * Move nonzero to above for faster metal synch * Doc * Update valid head dims * Show throughput in interactive mode (EricLBuehler#1291) * Update interactive mode throughput stats * Accurate prompt t/s * Accurate prompt t/s for usage * Unify bitwise operations (EricLBuehler#1288) * Unify bitwise ops * Tests pass * Fix cuda build * Clippy * Multimodal prefix caching support! (EricLBuehler#1298) * Initial progress * Support vision prefix caching * Update docs * Add multimodal data abstraction * Interactive mode improvements (EricLBuehler#1299) * More ergonomic image url parsing * Add option to clear * Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285) * Add qwen3 model * Add enable_thinking * Add initial qwen3 moe * Add the moe model * Format * Fix order of norm * Fix expert shapes * Fix reverse * Fix norm device for isq * Fix nonzero when no nonzero * Moe model runs * Working qwen3 moe * Add metal fp8 blockwise dequant * Clean * Typo * Enable tool calling * Streamlined ux * Add some examples * Add docs * Fix dead link * Remove interactive mode max_len * Update QWEN3.md * Hotfix for vision mode clear * Revamped and streaming web search support (EricLBuehler#1301) * Streaming web search * Refactor a bit * More refactoring * Add some logging, parallelize some things * Allow url * Suppress warning, allow multi-turn searching * Batch compute_similarities * Cap content len * Typos * Doc * Handle vision messages or different tool call prefixes (EricLBuehler#1302) * Fix cuda * Tune web search budget * Simplify prefix cacher (EricLBuehler#1305) * Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306) The io::stdin().read_line() cannot handle non-ascii input, which caused crash when use backspace to delete non-ascii characters. 
Introduce rustyline to the interactive mode to solve the problem. Plus it can bring more editing features in the future. Close EricLBuehler#1140 * Add more tools for automatic search (EricLBuehler#1307) * Add interactive mode history * Add a website extraction tool * Pass toks by reference * Optimize prompt chunking * Fix CPU hogging in interactive mode (EricLBuehler#1309) The log enabler should be checked after the sleep instead of a busy loop checking. Since the interactive mode always disables the token speed logger, 100% CPU was taken by this loop always. * Add Metal precompilation support (EricLBuehler#1311) * Add metal precompilation for paged attn * Add for mistralrs-quant * Better constructor * Dont always build * Fix name for paged attn rebuild * Reduce thrashing of Metal autorelease (EricLBuehler#1313) * Reduce calls to autorelease * Optimize clone_in_cache * Refactor float8 * make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314) Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths` can be constructed outside of `mistralrs-core`. * Refactor KV cache manager (EricLBuehler#1315) * Refactor kv cache * Refactor caches * Fix some overflows * Add `Audio` and `Speech` model categories (EricLBuehler#1317) * add `Audio` to `ModelCategory` * add `Speech` to `ModelCategory` * fix to go back to PartialEq having an exhaustiveness check * Remove has_conv2d from vision model API (EricLBuehler#1318) * Unified/automatic flash attention enabler (EricLBuehler#1319) * Remove from sdpa params * Fix errors * No warnings * Log * Clippy * Fix cublaslt 4d mask (EricLBuehler#1320) * Fix cublaslt 4d mask * Clippy * Keep caches on gpu * Qwen VL models fixes (EricLBuehler#1322) * Add some defaults * Fix * Fix one thing * 2.5 vl works * Use caching again * Fix v2 * Move index inside loop * Offset in ropeidx * Default support for vision prefix caching is false * Fixes for all vision models (EricLBuehler#1323) * Fix phi input processor? 
* Fix phi input processor * Handle no_prefix_cache from pipeline * Phi models confirmed 👍 * Fixed for phi inputs processors * Fixed for phi4 * Llama 3 confirmed 😀 * Mistral 3 confirmed 😃 * Idefics 2/3 fixes * Some fixes * Remove unsafety * Improved+faster LRU prefix cacher (EricLBuehler#1321) * Show TTFT * Use LRU prefix cacher * Faster prefix cacher * Inplace ISQ support and default to mmap (EricLBuehler#1277) * Initial impl of immediate isq * Immediate isq -> !loading_isq * Varbuiler utils always using mmap! * Log * Add for packed experts * Afq without copy * Clarify * Clippy * Apple immediate isq * Better logic for loading_isq * Support showing ttft * Rename * Shared quantize guard * Parallel progress bar * Parallel loading for progress bars * Actual ISQ support * Conditional parallelism for NiceProgressBar * Use conditional iterator * Warn once * Predicate for applying immediate isq * Allow parallel * Remove debug print * Remove debug print * Remove debug print * Fix typos (EricLBuehler#1329) * Fix Idefics 3 arch chat templating (EricLBuehler#1330) * Update inputs merger * Fix * Better warning * Better warning * Better warning * Nonzero ahead of time * No f32 * Clippy * Optimize get_logprobs * Fix packed experts * Update masking * Use Sdpa in idefics3 * QuantMethod in idefics3 vision * Remove a .contiguous * Remove two space from PR comment (EricLBuehler#1331) * Add automatic vision loader type (EricLBuehler#1332) * Add automatic vision loader * Remove references to --arch * Update examples * Add the Dia 1.6b TTS model! 
(EricLBuehler#1304) * Add loading * Add rope, mlp, most of attn * Add encoder + encoder layer, decoder layer forwards * Add decoder forwards * Add prepare_audio_prompt * prepare_generation mostly done * Add a proper dia kvcache * Add most of decoder_step * Add the sampler * Add the generation loop * Wire things up * Add speech pipeline * Fixes * Loads * Some fixes * f32 * Some progress * Ok it runs upto dac decoding * Add dac part loading * Loads and runs at least * Remove encodec * Debugging * Debugging * Huh * Complete merge * Interactive * Confirmed dac works at least * Looks like encoder works * Much progress * Hmm * Sampling * Almost there * Sampler * Sampler * Bf16 support * Response * Use it in interactive mode * Fix oneshot * Add openai api * Add openai api * Refactor loading * Use naive sdpa for inplace * Factor out * Clippy * Clippy * Config * Refactor config * Metal clippy * Fix t/s * ISQ support * Some fixes, nits * Fix cuda * Clippy * Inhibit cublaslt for cuda * Add server example * Add python example * Add rust api * Add docs * Update config.toml * Fix .pyi * Update readme * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * config.toml tweak * update `llguidance` to `0.7.20` (EricLBuehler#1334) Update `llguidance` from `0.7.16` to `0.7.20` so that it has guidance-ai/llguidance#172 which is a fix for building on GCC 15. 
* Add model category <> messages check (EricLBuehler#1335) * Verify model category matches the messages * Add vision chat * Fixes * Add element-wise normalization check (EricLBuehler#1340) * Fix streaming example print statement (EricLBuehler#1339) * Fix normalization formula in comment (EricLBuehler#1338) * Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337) * Fix typo in expect messages (EricLBuehler#1342) * Don't use mmap on cuda (EricLBuehler#1336) * No mmap on cuda * Simplify streaming tool call logic * Remove debug * Support AWQ format models (EricLBuehler#1350) * Support AWQ format models * Clippy fix * Fix uqff dummy layer ISQ application (EricLBuehler#1351) * Disable immediate isq if write_uqff (EricLBuehler#1352) * Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354) * Fix logic for uqff on cuda * Updated pack_factor * Refactor Option references for model paths (EricLBuehler#1347) * refactor: use Option refs in model path helpers * Format * Add a script for server benchmarking (EricLBuehler#1355) * Serde alias * Fix * Update for tie_word_embeddings * Print running/waiting * 30 users * Update num_users * Update dummy paged attn * Optimized Metal qmv_fast path (EricLBuehler#1356) * Compile with lto * Tweak profiles * New, fast sampler for Metal! 
(EricLBuehler#1327) * Show TTFT * Use LRU prefix cacher * Faster prefix cacher * A bit of gpu sampling * Minp but cpu for now * Metal fast cumsum impl * Sampling with fast topp kernel * Hmm not perfect * Add metal sort kernels * Tmp * Add single block sort * Add most of multi block sort, just need copy op * Add copy kernels * Expose kernels * Add a test * Ok it works * Structure things * Add caching * Rename * Cpu is default * CUDA case * Topk * Refactor Option references for model paths (EricLBuehler#1347) * refactor: use Option refs in model path helpers * Format * Add a script for server benchmarking (EricLBuehler#1355) * Serde alias * Fix * Update for tie_word_embeddings * Print running/waiting * 30 users * Update num_users * Update dummy paged attn * Optimized Metal qmv_fast path (EricLBuehler#1356) * Compile with lto * Tweak profiles * Fix topk * Penalties * Add logits processor, clippy fixes * Fix chat port * Remove warning * Fix chat port * Fix metal parallel sampling (EricLBuehler#1357) * Cpu if parallel for now * Tweak bench script * Add immediate isq predicates for qwen3 (EricLBuehler#1358) * Add immediate isq predicates for qwen3 * Fix parsing of "parse_isq_value" depedent of device * Typo * Fix gemma3 logging * Regressions fixes (EricLBuehler#1359) * Fix regression for mmap * Revert EricLBuehler#1321 * Refactored matching_cache impl * Clippy * Revamped and smaller readme (EricLBuehler#1360) * Expandable detail sections * Refactor using derivative model * Tweak quick examples * Update llama * Update llama * Supported accelerators is a table * Update installation guides * Tweak apis * Remove --port in quick examples * Add demo gif * Add gif in readme * Update demo gif * Update demo gif * Update demo gif * Add gif in readme * Add gif in readme * Add a web chat app! 
(EricLBuehler#1362) * Initial * Markdown * Copy code * Add model loading sidebar * Support vision models * Tweak isq * Links go to another page * Clear when switch model * Fix html tags * Add image support! * More then one images * Fix * Improved textarea * Tab for switching between vision and text * No paged attn for now * Prettier format * Multiple models at once * Better switching, clearing ability * Mobile support * Inline markdown parser * Update examples * Typos * Support specifying isq * Fix mobile * Fixes * Fix button on mobile * Image height is capped * Thumbnail * Fix rotating kv cache edge case * Add drag and drop for images * Small things * Sidebar is frozen now * Better listner * Add readme * Tweak readme * Add chat history support to web chat app (EricLBuehler#1363) * Add chat history * Support renaming * Start immediately with new chat * Add timestamp * Prettier chat list * Style * Delete chat * Fix copy button * Fix markdown rendering * Store things in cache * Store things in cache * Refactor web chat, fix multichat image restore (EricLBuehler#1364) * Fix multichat image restoration. 
* Clippy * Refactor * Refactor frontent * Fix repeated immediate isq init (EricLBuehler#1365) * Add images_ref * Add debug impl * Fix the bug * Tweak style of buttons * Add a spinner * Move spinner * Tweak emoji * Add gif * Tweak initial gif * Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366) * Fix mistral 3 uqff resitdual tensors for vision * Rolling shard creation for uqff files (EricLBuehler#1367) * Fix occasional unstability during isq of afq (EricLBuehler#1368) * Fix unstability during isq of afq * Clippy * Fix web chat installation * Support web chat file uploading (EricLBuehler#1370) * Web chat fixes * Fix thumbnail in message, reuse blank chat * Add file uploading support * Fix scroll * Allowed extensions * Preserve files as literals * Support multiple clients * Add a stop button * New cache dir * New cache dir * Fix * Refactor * Update readme * Tweak drag-and-drop css * Add speech generation support to the web chat! (EricLBuehler#1373) * Initial speech gen support for web chat * Tweak ui * Update docs * Prefix caching for PagedAttention! (EricLBuehler#1369) * Exposing some things for logical token blocks * Prefix cache manager has the scheduler * Refactor * Get logical and physical blocks into the prefix cacher * Hash and cache * Pass physical block prefill * Allocation of prefilled block tables * Temp * Dont always use 2 * Hmm * Hmm * It mostly works * Increment refcount * Support images! 
* Add to dummy paged attn * Fix some clippy * Clippy * More checks * Include EricLBuehler#1371, closes EricLBuehler#1371 * Typos * Update docs * Metal PagedAttention accuracy improvements (EricLBuehler#1374) * Fix subtle bug * Fix half sum bug * Format metal paged attention * Handle images in paged attn scheduler (EricLBuehler#1375) * Include schemas needed for chatcompletions endpoint (EricLBuehler#1353) * EricLBuehler#1326: WIP include schemas needed for chat completions endpoint Conflicts: Cargo.lock mistralrs-server/src/main.rs * EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it * EricLBuehler#1326: first draft of handling schemas that use Either * EricLBuehler#1326: first draft of handling schema for Grammar * EricLBuehler#1326: Add in other endpoints to API docs. * EricLBuehler#1326: Adjust code comments * EricLBuehler#1326: Implement coderabbitai suggestions - EricLBuehler#1353 (review) - EricLBuehler#1353 (comment) * Fix constraints with metal sampler * Revert EricLBuehler#1375 * Fix case where prefix cacher returns no toks (EricLBuehler#1377) * Fix AFQ UQFF serialization * Faster UQFF serialization (EricLBuehler#1379) * Faster UQFF serialization * Fix uqff gemma3 * Improve gemma3 auto loader names * UQFF creation for AFQ on CPU support (EricLBuehler#1380) * Add afq cpu quantize/dequantize * Clippy * Improved device for afq quantize * Improved dtype handling for cpu afq (de)quantize * Improved generate_uqff_card * Add fused CPU attention kernel! 
(EricLBuehler#1382) * Working * Fix warnings * Allow mask * Support bf16, f16 * Handle striding * Parallelized * Add initial vector flash attn * Avoid repeated allocations * Tiled kv * Apply some clippy * Some small fixes * Chunked vec_dot * Clipy * Use T::zero * Refactor attention backends (EricLBuehler#1384) * Refactor attention code * Refactor attention code * Move into backends * Set macOS thread affinity for CPU attn (EricLBuehler#1385) * Use lazylock * Format * Fix metal warn build * Faster Qwen 3 MoE support on Metal (EricLBuehler#1387) * Fix load * Use afq gather qmm * Well it runs * It works * Polish * Fast and slow options * Remove quantized.rs * Polish some more * Refactor * Add isq * Update load in parallel * Support fp8 * Refactor for FusedExperts * Clippy * Handle pack factor when loading prequantized models * Use f32 only in moe * Avoid using f32 so much * Avoid using f32 so much * Fix PagedAttention block leaks (EricLBuehler#1388) * Warn and ignore if ignored * Fix a block allocation leak * Update bench.py * Fix double free in block engine * Do not apply ISQ if loading a prequantized model * Fix cuda build again (EricLBuehler#1389) * Fix cuda build * Fix * Format * Fixes for cuda docker * Update dockerfiles * Bump version to 0.6.0 (EricLBuehler#1390) * Bump version to 0.6.0 * Remove lower_level api * Make a static dir * Update deps * Fix routing for static handler in web chat * Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391) * Allow speech models to accept batched inputs (EricLBuehler#1393) * Allow speech models to accept batched inputs * Clippy * Ring distributed backend for heterogeneous TP (EricLBuehler#1238) * Begin work on ring distributed backend for Metal * Add the actual ring functionality * It loads and kind of runs * It works * Optimize buffer allocation * Avoid copy * It works * Add allgather * Fix load * Ping-pong * Small things * Add config json * Allow different ip address * Read config once * Read config when appropriate * 
Replicate requests * Small fix * Fix small compat with openai * Clippy * Update docs * Add deepseek tool calling chat template * Add auto loader for vision/text detection! (EricLBuehler#1402) * Add auto loader for vision/text detection * Build fixes * Add model loader * Update docs * Format * Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346) * First draft of exposing mistral server routes as lib * make arg struct fields pub * Take base path so utoipa swagger route can properly redirect * Expose swagger routes and make it configurable * Add base path option for swagger docs * More work on modularizing mistralrs server * Sync fork (+1 squashed commit) Squashed commits: [169ae9e] Sync fork * Adjust fn params to use refs / individual params instead of args * Start breaking down controller actions into smaller pieces * Continue refactoring * Make mods pub so they can be used outside crate * Allow chat completion streamer to take a callback so that you can get the complete response when finished WIP (+3 squashed commits) Squashed commits: [0061d87] WIP [c484d56] WIP [16f8a60] WIP * Sync fork * Adjust callback type * Remove throughput_log arg that was removed in 26afcc3 * Implement defaults for Args (and use for Clap) * Small code formatting tweaks * Rename callback to match SSE event and code clean up * Sync fork * WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?). 
* Clean up (+4 squashed commits) Squashed commits: [e1cff387] Sync fork [d8301025] WIP debugging [1ea9f8c8] Sync fork [4fe28cf5] WIP: debug function * WIP server core builders * Code clean up * Add on_chunk callback * Code clean up * First draft of creating version of mistral-server that uses server-core Code clean up (+1 squashed commit) Squashed commits: [adea1693] * Sync fork * Add helper methods to builder to make optional args more ergonomic (since .build validates params) * Start adding docs * Start cleaning up crates deps * Example commit of mistral-server with implementing server-core * Start addressing CodeRabbit feedback * Fix comment typo * Tweak doc blocks * - Update type alias naming for clarity (MistralRs instead of Mistral) - CodeRabbit, don't use eprintln for lib (use trace) - Allow buffer size to be passed in and default to Constant - Allow router body limit to be passed in and default to Constant - Update doc examples * Typo * Address CoderRabbitAI feedback * Support linear rope for llama3 (EricLBuehler#1408) * Hotfix for loading * Fix vllama4 uqff loading (EricLBuehler#1409) * Fix vllama4 uqff loading * Fix regex * Fix regex * Maybe a fix * Gracefully handle receiver disconnects (EricLBuehler#1410) * Handle receiver disconnects * Format * Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411) * Fix bias * Fix lm_head packing case * Account for gate * Fix head dim * Fix interactive mode URL parsing (EricLBuehler#1412) * fix url regex in vision interactive mode * Fix regex * Clippy * Refactor auto device map (EricLBuehler#1413) * Refactor auto device map * Refactor a bit more * Clippy * Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414) * Document runtime sampling commands * Fix readme * Tweak * Bounds checking * Tweak temp bounds * Send streaming tokens every time * Gumbel sampling for fast sampler (EricLBuehler#1416) * Improved handling for initialize_logging * Improved CPU flash attention accuracy & performance 
(EricLBuehler#1417) * Downcast correctly * Operate internally in f32 * Avoid some casts and striding * Prefetch * Provide chat_templates to container users (EricLBuehler#1419) Models often come without chat templates requiring mapping them from the source repository into a container for access by the mistralrs-server. Copy the templates from the build tree into the root of the image to permit use via `--chat-template /chat_templates/something.json` TODO: With the increase in quantized models and support for other formats, the initial benchmark run during model load can be used to qualify/select existing chat templates embedded into the binary for models which do not come with any (to include output of the functional failures in each test allowing users to modify the ones already provided correctly to suit the model being loaded). Co-authored-by: RageLtMan <rageltman [at] sempervictus> * Faster cpu flash attn (EricLBuehler#1418) * Faster cpu flash attn * Prefetch * Clippy * Add some tests * Add softcap tests * Fix test_parse_image_url test * Update tests * Update tests * Web search improvements (bm25, web chat) (EricLBuehler#1420) * Fix web search blocking case * Web search support in web chat * Tweak ui * Support fallback to bm25 * Clippy * Reinject descriptions * Propely handle consecutive searches (EricLBuehler#1421) * Update extraction tool reinjection * Looped * Update docs (EricLBuehler#1422) - lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be - server_builder: fix typo - READMEs: link to crate docs * Better tool call detection logic (EricLBuehler#1424) * Add web search hook callbacks (EricLBuehler#1426) * feat: add customizable search hook * Move to builder * Update docs * Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428) * Add CUDA context helper and use in Llama forward * No flashparams? 
* working * Tweak * Update to use dep * conditionally build flash attention inputs (EricLBuehler#1429) * Add AGENTS.md (EricLBuehler#1430) * Support Qwen3 GGUF model (EricLBuehler#1432) * Support QWen3 GGUF model * Clippy fix * cargo fmt * Improved paged attn prefix caching (EricLBuehler#1434) * Improved paged attn prefix caching * Disable * Clippy * Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433) * Temporary fix for qwen3 gguf tokenizer * Typo fix * Add tool callback support (EricLBuehler#1427) * Add tool callback support * Fixes * Support named tool callbacks * Update examples * Update docs * Clippy * Centralize crate dependencies (EricLBuehler#1438) * chore: centralize dependencies * Format * Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440) * Fix bug in tokenizer created with gguf metadata * Clippy fix * Update deps (EricLBuehler#1441) * Small things * Update deps * Update deps * Update breaking changes * Doc fixes (EricLBuehler#1442) * Mention uqff_maker * Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444) * Add max_completion_tokens alias for server (EricLBuehler#1451) * Audio input support (Phi 4 multimodal) (EricLBuehler#1448) * Deps * Add conformer * Nemo loading * Position embeds * Load t5 attn bias * Attn and feed forward * Add conv module and glu pointwise * Implement relative attn bias * Add the forward methods * Add encoder embedding * Fix oproj * Some loading * Conformer loads! * Fully loading speech stack * Merger * Dont need that * First pass at audio processing * Read samples * Optional * Small loading fix * Runs but not correct yet * Improved audio processing? * Works with this * Fix t5 attn bias * It works! * Comment * Use some other crates * Clippy * Allow bf16 on metal * Add prefix_audio * Remove unused * Typo * User specified * Add audio url parsing * AudioProjectionMode -> InputMode * Audio prefix caching * Fix bug in audio prefix caching * Support both at the same time! 
* Tweak logging * Support stereo * Add mistralrs-audio * Support batching * Add server and rust api example * Add python api * Fix add_multimodal_message * Fix unfold for conformer * Streaming example * Add web chat support * Add modalities registry * Fix offline cache issue for gguf models (EricLBuehler#1452) * Add MCP server endpoints (EricLBuehler#1453) * feat(server): add MCP server support * Add mcp docs * Add handle_list_tools_request * Better launch, tool handling * Tmp state * Ok works * Handle modalities * Update docs * Add ping * Tweak temperature bounds, args * MCP documentation pass (EricLBuehler#1455) * Fix table * Update mcp docs * Improve readme header * Improve readme header * Integrate an MCP client (EricLBuehler#1456) * Add builtin mcp client * Use async loader * Add headers * Handle sse * More flexible search request * Add tool callbacks with tools, for mcp * Add bearer token support * Add websocket support * Update docs * Add python api * Clippy * Add http api, docs * Tests pass * Make these configs actually work * Add docs * Make mistralrs-mcp * Refactor examples * Update examples * Add defaults * Add defaults * Add defaults * Update docs * Improved docs * Add -y to npx usages * Even better examples * Update generate_wheels * Update generate_wheels * Update generate_wheels * Fix Dockerfile.cuda-all * Improve automatic tool call (EricLBuehler#1460) * Improved auto tool call * Add logging * chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458) * chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459) * Add fallback definition for isnan (EricLBuehler#1463) * chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465) * chore: Dockerfile - Remove rayon threads env * chore: Dockerfile - Improve formatting for `apt-get` * Remove duplicate calls for api_dir_list (EricLBuehler#1474) * Remove duplicate calls for api_dir_list * Support local cache for api_dir_list * Fix home folder for metal * 
Capitalized * Fix transient pyo3 dep (EricLBuehler#1478) Co-authored-by: Eric Buehler <eric@huggingface.co> * Fix objc dep with non macos (EricLBuehler#1480) * Fix phi 3/4 + nccl issue (EricLBuehler#1481) * Fix log * Fix n kv heads * Fix phi3.5 moe (EricLBuehler#1482) * Fix phi3.5 moe accum device * Fix again * Fix again * Support GLM4 model! (EricLBuehler#1437) * Support GLM4 model * Mention GLM4 model in ReadMe * glm4 type hint * Typo fix * Fix unsupported chat_template function * Clippy fix * Refactor distributed backend (EricLBuehler#1484) * Refactor distributed backend, check power of 2 * Fix compilation * Cap metal paged attn kv allocation (EricLBuehler#1485) * Better paged attn metal cap (EricLBuehler#1486) * Better paged attn metal cap * Small fix * Comment * Small fix * Refactor * Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423) * Start working on consolidating completion and chat_completion underlying implementations * Move response channel to util mod for now (since it's used with streaming and non streaming) * More work on consolidating completions and chat completions * More WIP consolidation of server core handlers * More WIP consolidation of server core handlers * More WIP consolidation of server core handlers * Update docs and restrict completion core visibility * CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this * Use consistent var name for completions mod * Make route handler modules public API consistent (same fn names, etc.) 
and provide proxy fn that wrap core fns so core mod doesn't have to be pub Make lib.rs example compile checked and update example * Code formatting * Typo * Sync fork * Sync fork * Docs example fix * Support qwen3 gguf (EricLBuehler#1488) * Add qwen3 gguf * Template fixup * Make bos/eos token IDs optional (EricLBuehler#1493) * Remove python deps from CUDA dockerfiles (EricLBuehler#1487) * Handle noncontiguous v in naive_sdpa (EricLBuehler#1499) Co-authored-by: Eric Buehler <eric@huggingface.co> * Server Core: refactor Paged Attention configuration (EricLBuehler#1500) * Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506) * Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507) * FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400) * Add most of paged attn kv quant * It builds a bit * All the functionality at least * Small fix * Add a scale * Fix bf16 usage * Make k_v_scale optional * Collector * Tweak collection * Refactor * Add to apis * Add cuda impl * Fix compilation * Fixes * Handle ENABLE_FP8 * Format * Tweak * Fix scaled_convert usage * Fix cache_t size * Fixed scale collection * Actual fix * Fix fp8 for CC<8 * Fix the usual String != &str bit (EricLBuehler#1483) Co-authored-by: RageLtMan <rageltman [at] sempervictus>
* Handle USE_FP8 for cuda * Fix cuda warn * Add readme * Saturating sub in sequence state --------- Co-authored-by: Eric Buehler <eric@huggingface.co> Co-authored-by: RageLtMan <sempervictus@users.noreply.github.com> Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com> Co-authored-by: Guoqing Bao <topon@outlook.com> Co-authored-by: Matthew Haynes <70829360+matthewhaynesonline@users.noreply.github.com> * Validate model name in OpenAI API (EricLBuehler#1509) * Validate model name in openai api * Add docs, allow 'ignore' * Updated examples for EricLBuehler#1509 * Fix mcp import in doc string (EricLBuehler#1510) * Add multi-model support! (EricLBuehler#1512) * Refactor MistralRs * Working multi-model! * Add mutli-model docs initially * Update mistralrs-pyo3, mistralrs-bench, mistralrs * Update apis for consistency * API tweaks * Logging tweaks * Add examples, tweak cli * Clearer pipeline id * Fix config key semantics * Format and clippy * Tweak logging, fix example * Clippy refactor * Update examples * Remove unused multi model docs * Replace 'ignore' with 'default' * Update docs * Add stars label to readme (EricLBuehler#1513) * Add CLAUDE.md * Handle base_model.model case in lora (EricLBuehler#1514) * Add thread_local!
for engine-specific const/static (EricLBuehler#1517) * Fix MCP doc test (EricLBuehler#1511) * Allow disabling metal precompilation (EricLBuehler#1518) * Allow disabling metal precompilation * Simple preprocessor * Simple docs --------- Co-authored-by: Eric Buehler <eric@huggingface.co> * Rust 1.88 clippy (EricLBuehler#1522) * Rust 1.88 clippy * Format * Fix cuda warnings (EricLBuehler#1526) * Avoid panic decoding tokens on error (EricLBuehler#1527) * Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525) * Split Marlin and Paged Attention kernels for faster build * Typo fix * chore: update llguidance (EricLBuehler#1535) * chore: update llguidance * chore: remove unused import * Add the SmolLM3 model! (EricLBuehler#1501) * Add model * Update loader * Fix llama config usage * Docs * Fix config no_rope_layers * Fix tie_word_embeddings default * Add chat template * Embed the chat templates * Fix embedding template * enable_thinking default true * Update examples * XML tools for smollm3 * Add smollm3 docs * Fix openai examples * Clippy --------- Co-authored-by: Eric Buehler <eric@huggingface.co> * Add full Gemma 3n support! (EricLBuehler#1519) * Add initial * Loading for text model * Add ple embeddings * Add altup, laurel block * Update rmsnorm * Add mlp * Update attn norm application * Currently no kv shared * Wire it up * It runs * Fix bf16 * Fix scaled embd * Fixes for mean * tmp * Attn confirmed * Fix target_magnitude * Add shared kv * Ok it works * Remove npy * Fix streaming * Remove warnings * Remove paged attn * Refactor rope * Add immediate isq * Add vision & mproj * Update image processor * Vision merge runs, not correct * Remove * Add mobilenet v5 * Add multimodal vision embedding * Fix load * runs * Fix gamma * Works but just not vision tower * It works!! 
* Tweak * Fix warnings * Move vision tower * Fix warn * Update cache manager things * Refactor * Add audio model, it loads * Add audio processing * It runs at least * tmp * A bit better * Audio works!!!! * Fused attn in vision * Clippy * Update audio runner * Optimized audio model * Remove unused things * Fix inputs processor bug * Remove comments * Clippy * Small optimizations * Format * Correctly register modalities * Add docs * Update readme * Runs there * Fixed padding from Blaizzy/mlx-vlm#410 * Add better checks * Fix sdpa n_kv_groups * Vision encoder works! * Rotate image * Clippy * Fix cuda loading * Updated device mapper * Fix overflow * Fix dtype errors * Refactor image/audio embeddings * Fix metal * Fix dtype mismatch * Audio processing fixes * Audio processing fixes * Works * Audio is good * Fix boi/eoi too * Embed the chat templates * Better embedding accuracy in non f32 * More f32 * Support bf16 on metal * Add more ISQ * Fixed device map * Clippy * Gemma3n no paged attn * Fix saturating sub * Faster rmsnorm * Use sdpa for vision model * Fix ple bug * Fix name * Fix multiaudio * Add matformer config loading * Add docs * Add support for matformer in auto device mapper * Update docs * Typos * Tweak * Tweak * Fix multidevice * Fix gemma3n text model auto device map * Fix dims3 * Fix auto devic emap vision * Non-metal keeps PLE on cpu * Complete merge * Vision dtype f16 -> f32 * Fix metal nm device * Fix uqff * Typos * Reference uqff * Fix tests * Fix sequence length check (EricLBuehler#1546) * update candle version (EricLBuehler#1545) Co-authored-by: AlpineVibrations <pro@pro.com> * add ios target to metal deps (EricLBuehler#1548) --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com> Co-authored-by: Eric Buehler <ericlbuehler@gmail.com> Co-authored-by: edwko <187129830+edwko@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> 
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Guoqing Bao <topon@outlook.com> Co-authored-by: Michał Moskal <michal@moskal.me> Co-authored-by: Chen Mulong <chenmulong@gmail.com> Co-authored-by: Steph Wolski <5911086+Slowki@users.noreply.github.com> Co-authored-by: omahs <73983677+omahs@users.noreply.github.com> Co-authored-by: Viktor Szépe <viktor@szepe.net> Co-authored-by: Matthew Haynes <70829360+matthewhaynesonline@users.noreply.github.com> Co-authored-by: RageLtMan <sempervictus@users.noreply.github.com> Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com> Co-authored-by: Eric Buehler <eric@huggingface.co> Co-authored-by: Sbargaoui <bargaoui.sam@gmail.com> Co-authored-by: Gaétan Lepage <33058747+GaetanLepage@users.noreply.github.com> Co-authored-by: Ammar Elsabe <ayasser763@gmail.com> Co-authored-by: luke <10145679+AlpineVibrations@users.noreply.github.com> Co-authored-by: AlpineVibrations <pro@pro.com> Co-authored-by: Michael Tissen <rubiktubik@googlemail.com>
Related: EricLBuehler/candle#82
Fixes #1406, #1401, #1399, #1394
Summary
- Add a `set_cuda_context` helper to utils
- Use it in `Llama::forward_embeds` when switching devices

Testing

- `cargo fmt` (fails: rustfmt component not installed)
- `cargo test --workspace --no-run` (failed: build interrupted due to environment limits)

https://chatgpt.com/codex/tasks/task_e_684063442160832289cdfb7840b2aac5
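The fix this PR describes has two parts: only switch the CUDA context when the target device actually changes, and re-bind the thread to the owning device's context when a `CudaStorage` is dropped, so the driver frees memory against the right context. The pattern can be sketched as a toy model in plain Rust (no CUDA here; `set_context`, `Storage`, and the thread-local tracker are hypothetical names, not mistral.rs code):

```rust
use std::cell::Cell;

thread_local! {
    // Tracks which device context this thread is currently bound to
    // (None = not yet bound). Stands in for the driver's current context.
    static BOUND_DEVICE: Cell<Option<usize>> = Cell::new(None);
}

/// Bind the current thread to `device`'s context, skipping the switch when
/// that context is already current. Returns true if a switch happened.
fn set_context(device: usize) -> bool {
    BOUND_DEVICE.with(|bound| {
        if bound.get() == Some(device) {
            false // already current: avoid a redundant context switch
        } else {
            bound.set(Some(device));
            true
        }
    })
}

/// Storage that remembers its device and re-binds the thread on Drop,
/// mirroring the CudaStorage drop fix.
struct Storage {
    device: usize,
}

impl Drop for Storage {
    fn drop(&mut self) {
        // Without this, a free issued while the thread is bound to another
        // device's context would target the wrong context.
        set_context(self.device);
    }
}

fn main() {
    assert!(set_context(0)); // first bind performs a switch
    assert!(!set_context(0)); // re-binding to the same device is a no-op
    let s = Storage { device: 1 };
    drop(s); // drop re-binds the thread to device 1
    assert!(!set_context(1)); // thread is already on device 1 afterwards
    println!("ok");
}
```

In the real code the bind call goes through the CUDA driver rather than a thread-local, but the invariant is the same: every allocation-freeing path must run with its own device's context current.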
Summary by CodeRabbit
Chores

- Updated versions of several candle-related workspace dependencies.

Bug Fixes

- Simplified CUDA device creation and fixed CUDA context switching.
- Prefix caching is now also disabled when the prefix cache size is zero.
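The expanded prefix-cache gating from the walkthrough can be sketched as a single predicate (a hypothetical free function for illustration; in the engine this logic sits inline in the constructor, and the parameter names are assumptions):

```rust
// Prefix caching is disabled when any of these hold, including the
// newly added prefix_cache_n == 0 case.
fn prefix_cache_disabled(
    no_prefix_cache: bool,
    no_kv_cache: bool,
    pipeline_no_prefix_cache: bool,
    prefix_cache_n: usize,
) -> bool {
    no_prefix_cache || no_kv_cache || pipeline_no_prefix_cache || prefix_cache_n == 0
}

fn main() {
    // A zero-sized prefix cache now disables caching even when no flag is set.
    assert!(prefix_cache_disabled(false, false, false, 0));
    assert!(!prefix_cache_disabled(false, false, false, 16));
    // Disabling the KV cache still implies no prefix caching.
    assert!(prefix_cache_disabled(false, true, false, 16));
    println!("ok");
}
```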