
Fix CUDA context switching, bind thread on CudaStorage drop#1428

Merged
EricLBuehler merged 5 commits into master from codex/add-cuda-context-switching-in-llama-model
Jun 4, 2025

Conversation

Owner

@EricLBuehler EricLBuehler commented Jun 4, 2025

Related: EricLBuehler/candle#82

Fixes #1406, #1401, #1399, #1394

Summary

  • Add a set_cuda_context helper to utils
  • Call the helper in Llama::forward_embeds when switching devices
  • Document why context switching is needed

Testing

  • cargo fmt (failed: rustfmt component not installed)
  • cargo test --workspace --no-run (failed: build interrupted due to environment limits)

https://chatgpt.com/codex/tasks/task_e_684063442160832289cdfb7840b2aac5

Summary by CodeRabbit

  • Chores

    • Updated internal dependencies to newer revisions for improved stability and compatibility.
  • Bug Fixes

    • Improved device mapping logic for CUDA devices, enhancing reliability in device selection.
    • Adjusted prefix cache logic to better handle cases when the prefix cache size is set to zero, ensuring correct caching behavior.


coderabbitai Bot commented Jun 4, 2025

Walkthrough

The updates revise dependency versions for several candle-related crates in the workspace, adjust device creation logic for CUDA devices by simplifying the constructor used, and expand the conditions under which prefix caching is disabled in the engine module to include cases where the prefix cache size is zero.

Changes

File(s) Change Summary
Cargo.toml Updated git revision hashes for candle-core, candle-nn, candle-flash-attn-v3, and candle-flash-attn dependencies.
mistralrs-core/src/device_map.rs Changed CUDA device creation to use a simpler constructor without specifying a stream.
mistralrs-core/src/engine/mod.rs Modified prefix cache disabling logic to also trigger when prefix cache size is zero.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Engine
    participant PipelineMetadata

    User->>Engine: new(no_prefix_cache, no_kv_cache, prefix_cache_n, pipeline_metadata)
    Engine->>PipelineMetadata: check no_prefix_cache flag
    Engine->>Engine: Set no_prefix_cache to true if:\n- no_prefix_cache is true\n- OR no_kv_cache is true\n- OR pipeline_metadata.no_prefix_cache is true\n- OR prefix_cache_n == 0
    Engine-->>User: Engine instance created
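The broadened disabling condition described above can be sketched in plain Rust. This is an illustrative sketch only; the struct and parameter names are hypothetical stand-ins, not mistralrs-core's exact API:

```rust
// Illustrative sketch of the prefix-cache disabling logic described in
// the walkthrough; names are hypothetical, not mistralrs-core's exact API.
struct PipelineMetadata {
    no_prefix_cache: bool,
}

fn should_disable_prefix_cache(
    no_prefix_cache: bool,
    no_kv_cache: bool,
    prefix_cache_n: usize,
    metadata: &PipelineMetadata,
) -> bool {
    // The PR adds the `prefix_cache_n == 0` arm: a zero-sized prefix cache
    // now behaves as if prefix caching were explicitly turned off.
    no_prefix_cache || no_kv_cache || metadata.no_prefix_cache || prefix_cache_n == 0
}

fn main() {
    let meta = PipelineMetadata { no_prefix_cache: false };
    // A zero-sized prefix cache disables caching even with no flags set.
    assert!(should_disable_prefix_cache(false, false, 0, &meta));
    // A non-zero cache with no flags set stays enabled.
    assert!(!should_disable_prefix_cache(false, false, 16, &meta));
    println!("ok");
}
```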

Poem

In the garden of code where dependencies grow,
Candle crates updated, their hashes now glow.
CUDA devices simplified, streams set aside,
Prefix cache logic—now broader in stride.
With each little tweak, our engine runs bright,
A rabbit’s delight in the soft morning light.
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8180a80 and 0f6a138.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • Cargo.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Clippy
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)

@github-actions

github-actions Bot commented Jun 4, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           42           23           10            9
 HTML                    1           73           61            4            8
 JavaScript              7         1248          936          174          138
 JSON                   14          123          122            0            1
 Makefile                1            6            5            0            1
 Python                 87         4097         3457          161          479
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   21          695          634           10           51
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               59         5086            0         3880         1206
 |- BASH                10          111          105            2            4
 |- JSON                 2           42           42            0            0
 |- Python               7          121          109            0           12
 |- Rust                22          757          634            1          122
 |- TOML                 2           75           63            0           12
 (Total)                           6192          953         3883         1356
-------------------------------------------------------------------------------
 Rust                  376       132361       117795         2893        11673
 |- Markdown           175         3002           29         2662          311
 (Total)                         135363       117824         5555        11984
===============================================================================
 Total                 580       148073       123539         9579        14955
===============================================================================


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39673eb and 09aec73.

📒 Files selected for processing (2)
  • mistralrs-core/src/models/llama.rs (3 hunks)
  • mistralrs-core/src/utils/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/models/llama.rs (1)
mistralrs-core/src/utils/mod.rs (2)
  • set_cuda_context (18-23)
  • set_cuda_context (26-26)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Docs
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
🔇 Additional comments (6)
mistralrs-core/src/utils/mod.rs (2)

12-16: LGTM: Clear documentation explaining the necessity.

The documentation clearly explains why explicit CUDA context switching is required for multi-device scenarios.


25-26: LGTM: Appropriate no-op implementation for non-CUDA builds.

The conditional compilation pattern correctly provides a no-op version when CUDA is not available.

mistralrs-core/src/models/llama.rs (4)

29-29: LGTM: Appropriate import of the new utility function.

The import correctly brings in the set_cuda_context function for use in the forward pass.


524-527: Excellent documentation and correct context switching logic.

The comments clearly explain why CUDA context switching is necessary, and the implementation correctly sets the context after tensor mapping to ensure operations execute on the correct device.


540-542: LGTM: Proper context restoration for final computations.

The context is correctly switched back to the model's primary device before the final normalization and output head computations.


527-527:


❌ Incorrect review comment

Verify error handling alignment with utility function changes.

If the set_cuda_context function is modified to return a Result (as suggested in the utils review), these call sites will need to be updated to handle potential errors appropriately.

Also applies to: 542-542


🏁 Script executed:

#!/bin/bash
# Check if there are other usages of set_cuda_context that might need updating
rg -A 3 -B 1 "set_cuda_context" --type rust

Length of output: 1842


No updates needed for set_cuda_context calls

The current definition of set_cuda_context in mistralrs-core/src/utils/mod.rs returns () and internally unwraps any errors, so call sites like those in models/llama.rs do not need to handle a Result. No changes to error handling are required here.

Likely an incorrect or invalid review comment.

Comment thread mistralrs-core/src/utils/mod.rs Outdated
Comment on lines +17 to +23
#[cfg(feature = "cuda")]
pub fn set_cuda_context(dev: &candle_core::Device) {
    if let candle_core::Device::Cuda(d) = dev {
        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
            .unwrap();
    }
}


🛠️ Refactor suggestion

Consider more robust error handling instead of panicking.

The current implementation uses unwrap() which will panic if setting the CUDA context fails. In a production environment, this could be problematic.

Consider returning a Result to allow callers to handle errors gracefully:

-pub fn set_cuda_context(dev: &candle_core::Device) {
+pub fn set_cuda_context(dev: &candle_core::Device) -> Result<()> {
     if let candle_core::Device::Cuda(d) = dev {
-        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
-            .unwrap();
+        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
+            .map_err(|e| candle_core::Error::Cuda(Box::new(e)))?;
     }
+    Ok(())
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[cfg(feature = "cuda")]
pub fn set_cuda_context(dev: &candle_core::Device) {
    if let candle_core::Device::Cuda(d) = dev {
        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
            .unwrap();
    }
}
#[cfg(feature = "cuda")]
pub fn set_cuda_context(dev: &candle_core::Device) -> Result<()> {
    if let candle_core::Device::Cuda(d) = dev {
        unsafe { cudarc::driver::result::ctx::set_current(d.cu_primary_ctx()) }
            .map_err(|e| candle_core::Error::Cuda(Box::new(e)))?;
    }
    Ok(())
}
🤖 Prompt for AI Agents
In mistralrs-core/src/utils/mod.rs around lines 17 to 23, the function
set_cuda_context uses unwrap() which causes a panic if setting the CUDA context
fails. To fix this, change the function to return a Result type instead of
panicking. Propagate the error from set_current by returning it to the caller,
allowing them to handle the failure gracefully rather than crashing the program.
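The pattern the suggestion asks for, propagating the error with `?` instead of calling `unwrap()`, can be shown in std-only Rust. The error type and the fallible call below are stand-ins (in the real code they would be candle_core::Error and cudarc's set_current), labeled hypothetical:

```rust
use std::fmt;

// Stand-in error type; the real suggestion uses candle_core::Error.
#[derive(Debug)]
struct DriverError(String);

impl fmt::Display for DriverError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "driver error: {}", self.0)
    }
}
impl std::error::Error for DriverError {}

// Hypothetical fallible "set context" call, standing in for cudarc's
// `result::ctx::set_current`.
fn set_current(ok: bool) -> Result<(), DriverError> {
    if ok {
        Ok(())
    } else {
        Err(DriverError("CUDA_ERROR_ILLEGAL_ADDRESS".into()))
    }
}

// Instead of `set_current(..).unwrap()`, propagate with `?` so the
// caller decides how to handle a failed context switch.
fn set_context_checked(ok: bool) -> Result<(), DriverError> {
    set_current(ok)?;
    Ok(())
}

fn main() {
    assert!(set_context_checked(true).is_ok());
    assert!(set_context_checked(false).is_err()); // no panic; caller handles it
    println!("ok");
}
```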


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09aec73 and 09f8f4d.

📒 Files selected for processing (2)
  • mistralrs-core/src/pipeline/inputs_processor.rs (2 hunks)
  • mistralrs-core/src/utils/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • mistralrs-core/src/utils/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Clippy
  • GitHub Check: Docs
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
🔇 Additional comments (3)
mistralrs-core/src/pipeline/inputs_processor.rs (3)

258-259: LGTM: Explicit type annotations improve code clarity.

The explicit HashMap<DeviceLocation, Tensor> type annotations enhance code readability and make the intent clearer.


429-430: LGTM: Consistent type annotations.

The explicit type annotations match the pattern established in make_prompt_chunk and improve code consistency.


434-435: Consistent with make_prompt_chunk changes.

The hash map population is disabled here as well, maintaining consistency with the make_prompt_chunk function. The same verification concerns about downstream compatibility apply as mentioned in the previous comment.

Comment thread mistralrs-core/src/pipeline/inputs_processor.rs
@EricLBuehler EricLBuehler changed the title from Fix CUDA context switching in Llama to Fix CUDA context switching Jun 4, 2025
@EricLBuehler EricLBuehler changed the title from Fix CUDA context switching to Fix CUDA context switching, bind thread on CudaStorage drop Jun 4, 2025
@EricLBuehler EricLBuehler merged commit 9989719 into master Jun 4, 2025
13 checks passed
@EricLBuehler EricLBuehler deleted the codex/add-cuda-context-switching-in-llama-model branch June 4, 2025 18:27
@EricLBuehler
Owner Author

@sempervictus this issue fixed the error behind #1406, #1401, #1399, #1394 for me. Can you please test and confirm it fixed it for you too?

@sempervictus
Contributor

sempervictus commented Jun 4, 2025

I just ran a build in the Docker container and still get:

2025-06-04T18:57:39.060479Z  INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T18:57:39.060503Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T18:57:39.916032Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T18:57:39.937989Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T18:57:39.938233Z  INFO mistralrs_core: Beginning dummy run.
2025-06-04T18:57:39.941248Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x556f00db5342 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h3cba09f3134c688d
   1:     0x556eff47c883 - core::fmt::write::h23019460b0b70a11
   2:     0x556f00db452f - std::io::Write::write_fmt::h73da8773e52bf4ad
   3:     0x556f00db51a3 - std::sys::backtrace::BacktraceLock::print::h58c794ef15c6671f
   4:     0x556f00db4ad5 - std::panicking::default_hook::h94aabe0891249549
   5:     0x556f00db41e7 - std::panicking::rust_panic_with_hook::hb81599440b437817
   6:     0x556f00df6618 - std::panicking::begin_panic_handler::{{closure}}::h7a731a74ab3fd8e5
   7:     0x556f00df6579 - std::sys::backtrace::__rust_end_short_backtrace::h1f727fbc9961adc0
   8:     0x556f00df7bcc - __rustc[a3537046f032bc96]::rust_begin_unwind
   9:     0x556eff47a96f - core::panicking::panic_fmt::he78c0e2ddfc3e30a
  10:     0x556eff4820c5 - core::result::unwrap_failed::ha9d262dd5091e6ed
  11:     0x556eff3647b3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h043e9fec980bebde
  12:     0x556eff32ea48 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h7f2784f092b5db70.5510
  13:     0x556eff32f577 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::h8febea626b5ef021.5528
  14:     0x556eff32ee0b - alloc::sync::Arc<T,A>::drop_slow::hb705f75f20fb60ab
  15:     0x556eff32ed10 - alloc::sync::Arc<T,A>::drop_slow::haf5599c18ae07fda
  16:     0x556effac6b30 - mistralrs_core::models::qwen2::Model::forward_embed::h019baf397a0c438f
  17:     0x556effac7aba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::h1b7e055f3fc8c72d
  18:     0x556f00320173 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h2ab2214ebea94332
  19:     0x556f003237aa - mistralrs_core::pipeline::Pipeline::step::{{closure}}::he577c64df601a3b1
  20:     0x556f002794db - mistralrs_core::engine::Engine::run::{{closure}}::h8b7c1ce232f5bb75.37372
  21:     0x556effeb1679 - std::sys::backtrace::__rust_begin_short_backtrace::hff153854ba1955a2
  22:     0x556effeb60b3 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h42f26e7c5b4ec229
  23:     0x556f00df7f77 - std::sys::pal::unix::thread::Thread::new::thread_start::h4c462331eebbf5ed
  24:     0x7f91ef23fac3 - <unknown>
  25:     0x7f91ef2d0a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

is that pulling in the Candle fix and relevant changes here or do i need to change something in the dockerfile?

EDIT: sorry, issue's closed so - ping @EricLBuehler for vis
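For context on the "panic in a destructor during cleanup" abort in the log above: this is general Rust behavior, not specific to mistralrs. A `Drop` impl that panics while the thread is already unwinding from another panic aborts the whole process. A std-only sketch of the defensive pattern (all names here are illustrative, not candle's or cudarc's actual API):

```rust
// Illustrative only: why Drop impls should avoid unwrap().
// A Drop that logs instead of panicking avoids the
// "panic in a destructor during cleanup" abort seen in the log above.
struct Buffer {
    id: u32,
}

// Hypothetical fallible free, standing in for a driver call like cuMemFree.
fn free_device_memory(id: u32) -> Result<(), String> {
    if id == 0 {
        Err("illegal memory access".into())
    } else {
        Ok(())
    }
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Log the error instead of unwrapping: a panic here during
        // unwinding would abort the process.
        if let Err(e) = free_device_memory(self.id) {
            eprintln!("failed to free buffer {}: {e}", self.id);
        }
    }
}

fn main() {
    let _ok = Buffer { id: 1 };
    let _bad = Buffer { id: 0 }; // drops without panicking, only logs
    println!("ok");
}
```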

@EricLBuehler
Owner Author

EricLBuehler commented Jun 4, 2025

@sempervictus did you run cargo update as well as git pull (before rebuilding)?

@sempervictus
Contributor

@EricLBuehler this is being built and run in Docker, so the Dockerfile would be doing that. I have one on 12.4.1 as you do and one with

diff --git a/Dockerfile.cuda-all b/Dockerfile.cuda-all
index 026a0a9e6..5fce212fd 100644
--- a/Dockerfile.cuda-all
+++ b/Dockerfile.cuda-all
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder
+FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 AS builder
 
 RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
     curl \
@@ -15,17 +15,17 @@ WORKDIR /mistralrs
 
 COPY . .
 
-ARG CUDA_COMPUTE_CAP=80
+ARG CUDA_COMPUTE_CAP=70
 ENV CUDA_COMPUTE_CAP=${CUDA_COMPUTE_CAP}
 ARG FEATURES="cuda cudnn"
-ENV RAYON_NUM_THREADS=4
-RUN RUSTFLAGS="-Z threads=4" cargo build --release --workspace --exclude mistralrs-pyo3 --features "${FEATURES}"
+ENV RAYON_NUM_THREADS=32
+RUN RUSTFLAGS="-Z threads=32" cargo build --release --workspace --exclude mistralrs-pyo3 --features "${FEATURES}"
 
-FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS base
+FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu22.04 AS base
 
 ENV HUGGINGFACE_HUB_CACHE=/data \
     PORT=80 \
-    RAYON_NUM_THREADS=8 \ 
+    RAYON_NUM_THREADS=32 \
     LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 
 # Run the script to create symlinks in /usr/local/cuda/lib64

@sempervictus
Contributor

Doing a --no-cache

@sempervictus
Contributor

@EricLBuehler - unfortunately, no dice even with a --no-cache build (from scratch, rustup on out):

2025-06-04T19:33:01.240433Z  INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T19:33:01.240452Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T19:33:02.083380Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T19:33:02.105714Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T19:33:02.105947Z  INFO mistralrs_core: Beginning dummy run.
2025-06-04T19:33:02.111040Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55f24e372922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55f24ca39563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55f24e371b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55f24e372783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55f24e3720b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55f24e3717c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55f24e3b3be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55f24e3b3b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55f24e3b519c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55f24ca3764f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55f24ca3eda5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55f24c9974e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55f24dd6bde8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
  13:     0x55f24dd74f92 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
  14:     0x55f24c90b72f - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
  15:     0x55f24c82ff91 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
  16:     0x55f24ddce901 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
  17:     0x55f24ddfdea0 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
  18:     0x55f24dd8b61c - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
  19:     0x55f24d082d2c - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  20:     0x55f24d086a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  21:     0x55f24d8de903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  22:     0x55f24d8e1f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  23:     0x55f24d83852b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  24:     0x55f24d46e979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  25:     0x55f24d474f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  26:     0x55f24e3b5547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  27:     0x7fcebcc3fac3 - <unknown>
  28:     0x7fcebccd0a04 - clone
  29:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55f24e372922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55f24ca39563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55f24e371b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55f24e372783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55f24e3720b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55f24e3717c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55f24e3b3be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55f24e3b3b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55f24e3b519c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55f24ca3764f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55f24ca3eda5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55f24c9974e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55f24c8eb2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x55f24c8eb227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x55f24c8eb16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x55f24c8eb370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x55f24d085a80 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  17:     0x55f24d086a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  18:     0x55f24d8de903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  19:     0x55f24d8e1f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  20:     0x55f24d83852b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  21:     0x55f24d46e979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  22:     0x55f24d474f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  23:     0x55f24e3b5547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  24:     0x7fcebcc3fac3 - <unknown>
  25:     0x7fcebccd0a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

@sempervictus
Contributor

Interesting: when I quantize the model at load time, it doesn't immediately crash:

2025-06-04T19:34:30.463380Z  INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-04T19:34:30.463417Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-04T19:34:30.463447Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-04T19:34:30.463489Z  INFO hf_hub: Using token file found "/root/.cache/huggingface/token"    
2025-06-04T19:34:30.463573Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.463633Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.556549Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-04T19:34:30.587340Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.652534Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T19:34:30.679579Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-04T19:34:30.679591Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-04T19:34:30.843294Z  INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-04T19:34:30.843711Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-04T19:34:30.843747Z  INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-04T19:34:30.843762Z  INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-04T19:34:30.843775Z  INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-04T19:34:30.888142Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-04T19:34:30.888153Z  INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-04T19:34:30.952883Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-04T19:34:30.952934Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
2025-06-04T19:34:30.953013Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-04T19:37:00.868151Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-06-04T19:37:00.868198Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 449 tensors.
2025-06-04T19:37:00.870213Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-04T19:38:22.217038Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 449 tensors out of 449 total tensors. Took 81.35s
2025-06-04T19:38:22.217371Z  INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T19:38:22.217378Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T19:38:23.075077Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T19:38:23.098356Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T19:38:23.098601Z  INFO mistralrs_core: Beginning dummy run.
2025-06-04T19:38:23.100785Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-06-04T19:38:40.069655Z  INFO mistralrs_core: Dummy run completed in 16.971041424s.
2025-06-04T19:38:40.070156Z  INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T19:38:43.101233Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting

@sempervictus
Contributor

sempervictus commented Jun 4, 2025

@EricLBuehler - same effect, reproducible: quantized, the model serves the first request, but once a second request is issued as a follow-up in the conversation, it crashes:

2025-06-04T19:38:40.069655Z  INFO mistralrs_core: Dummy run completed in 16.971041424s.
2025-06-04T19:38:40.070156Z  INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T19:38:43.101233Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T19:42:18.104787Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 38.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:23.104861Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:28.104968Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:33.105030Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:38.105138Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:43.105199Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:48.105304Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:53.105363Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:42:58.105467Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:03.105614Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:08.105715Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:13.105779Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:18.105880Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:23.106025Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:28.106124Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:33.106189Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:38.106288Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:43.106431Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:48.106530Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:53.106595Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:43:58.106693Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:03.106835Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:08.106934Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 12.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:18.107095Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 499.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:28.107328Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 492.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:33.107397Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:38.107462Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T19:44:43.107529Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 19.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x55f9476a0227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x55f947da87c9 - <mistralrs_core::device_map::LayerDeviceMapper as mistralrs_core::device_map::DeviceMapper>::map::ha40c495b77d50a86
  17:     0x55f947e37998 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  18:     0x55f947e3ba0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  19:     0x55f948693903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  20:     0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  21:     0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  22:     0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  23:     0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  24:     0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  25:     0x7f6070758ac3 - <unknown>
  26:     0x7f60707e9a04 - clone
  27:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x55f9476a0219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x55f94833edc2 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
  17:     0x55f9486943a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  18:     0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  19:     0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  20:     0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  21:     0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  22:     0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  23:     0x7f6070758ac3 - <unknown>
  24:     0x7f60707e9a04 - clone
  25:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55f949127922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55f9477ee563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55f949126b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55f949127783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55f9491270b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55f9491267c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55f949168be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55f949168b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55f94916a19c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55f9477ec64f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55f9477f3da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55f94774c4e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55f9476a02e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x55f9476a0219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x55f9476a016b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x55f9476a0370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x55f947d8fc52 - <hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop::ha2d468b205f8c06b
  17:     0x55f94833eed6 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
  18:     0x55f9486943a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  19:     0x55f948696f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  20:     0x55f9485ed52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  21:     0x55f948223979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  22:     0x55f948229f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  23:     0x55f94916a547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  24:     0x7f6070758ac3 - <unknown>
  25:     0x7f60707e9a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

... I do find it mildly odd to see FlashParams showing up in a V100's stack trace; the boot process shows FlashAttention as disabled due to CC 7.

@EricLBuehler
Owner Author

EricLBuehler commented Jun 4, 2025

@sempervictus Hmm, interesting. What model is this?

@sempervictus
Contributor

@EricLBuehler - SWE-bench/SWE-agent-LM-32B crashes outright when un-quantized, and on the second iteration when quantized to Q4K.

@sempervictus
Contributor

@EricLBuehler - I can confirm reproducibility on qwen-distilled r1 and llama-distilled r1, as well as llama3.1.

@sempervictus
Contributor

@EricLBuehler - here's the dmesg output of a single long-prompt run on the SWE agent model; lots of OOB accesses, it seems:

[4010345.802913] traps: mistralrs-serve[3322477] general protection fault ip:7f7d914ec898 sp:7f7aa3bef420 error:0 in libc.so.6[7f7d914ec000+195000]
[4011764.856195] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Address
[4011764.856222] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504730=0xc07000e 0x504734=0x0 0x504728=0x4c1eb72 0x50472c=0x174
[4011764.856284] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Out Of Range Address
[4011764.856304] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5047b0=0xc00000e 0x5047b4=0x0 0x5047a8=0x4c1eb72 0x5047ac=0x174
[4011764.856372] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Out Of Range Address
[4011764.856392] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504f30=0xc02000e 0x504f34=0x0 0x504f28=0x4c1eb72 0x504f2c=0x174
[4011764.856452] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Out Of Range Address
[4011764.856472] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x504fb0=0xc03000e 0x504fb4=0x0 0x504fa8=0x4c1eb72 0x504fac=0x174
[4011764.856539] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Out Of Range Address
[4011764.856558] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505730=0xc02000e 0x505734=0x0 0x505728=0x4c1eb72 0x50572c=0x174
[4011764.856618] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Out Of Range Address
[4011764.856637] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5057b0=0xc02000e 0x5057b4=0x0 0x5057a8=0x4c1eb72 0x5057ac=0x174
[4011764.856704] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Out Of Range Address
[4011764.856723] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505f30=0xc03000e 0x505f34=0x0 0x505f28=0x4c1eb72 0x505f2c=0x174
[4011764.856781] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Out Of Range Address
[4011764.856801] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 1): Multiple Warp Errors
[4011764.856820] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x505fb0=0xc00000e 0x505fb4=0x4 0x505fa8=0x4c1eb72 0x505fac=0x174
[4011764.856883] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Out Of Range Address
[4011764.856902] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506730=0xc04000e 0x506734=0x0 0x506728=0x4c1eb72 0x50672c=0x174
[4011764.856955] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Out Of Range Address
[4011764.856975] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5067b0=0xc04000e 0x5067b4=0x20 0x5067a8=0x4c1eb72 0x5067ac=0x174
[4011764.857034] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 5, SM 0): Out Of Range Address
[4011764.857054] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506f30=0xc01000e 0x506f34=0x0 0x506f28=0x4c1eb72 0x506f2c=0x174
[4011764.857106] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 5, SM 1): Out Of Range Address
[4011764.857125] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x506fb0=0xc01000e 0x506fb4=0x0 0x506fa8=0x4c1eb72 0x506fac=0x174
[4011764.857185] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 6, SM 0): Out Of Range Address
[4011764.857205] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x507730=0xc02000e 0x507734=0x0 0x507728=0x4c1eb72 0x50772c=0x174
[4011764.857257] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 6, SM 1): Out Of Range Address
[4011764.857277] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5077b0=0xc00000e 0x5077b4=0x0 0x5077a8=0x4c1eb72 0x5077ac=0x174
[4011764.857338] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Out Of Range Address
[4011764.857358] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50c730=0xc03000e 0x50c734=0x0 0x50c728=0x4c1eb72 0x50c72c=0x174
[4011764.857410] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Out Of Range Address
[4011764.857430] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 1): Multiple Warp Errors
[4011764.857449] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50c7b0=0xc07000e 0x50c7b4=0x4 0x50c7a8=0x4c1eb72 0x50c7ac=0x174
[4011764.857508] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Out Of Range Address
[4011764.857528] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 1, TPC 1, SM 0): Multiple Warp Errors
[4011764.857547] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50cf30=0xc06000e 0x50cf34=0x4 0x50cf28=0x4c1eb72 0x50cf2c=0x174
[4011764.857598] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Out Of Range Address
[4011764.857618] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50cfb0=0xc05000e 0x50cfb4=0x20 0x50cfa8=0x4c1eb72 0x50cfac=0x174
[4011764.857677] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 0): Out Of Range Address
[4011764.857696] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50d730=0xc04000e 0x50d734=0x0 0x50d728=0x4c1eb72 0x50d72c=0x174
[4011764.857747] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 1): Out Of Range Address
[4011764.857766] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50d7b0=0xc07000e 0x50d7b4=0x0 0x50d7a8=0x4c1eb72 0x50d7ac=0x174
[4011764.857825] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 0): Out Of Range Address
[4011764.857844] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50df30=0xc04000e 0x50df34=0x20 0x50df28=0x4c1eb72 0x50df2c=0x174
[4011764.857896] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 1): Out Of Range Address
[4011764.857915] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50dfb0=0xc05000e 0x50dfb4=0x0 0x50dfa8=0x4c1eb72 0x50dfac=0x174
[4011764.857973] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 0): Out Of Range Address
[4011764.857993] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50e730=0xc04000e 0x50e734=0x0 0x50e728=0x4c1eb72 0x50e72c=0x174
[4011764.858052] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 1): Out Of Range Address
[4011764.858073] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50e7b0=0xc04000e 0x50e7b4=0x20 0x50e7a8=0x4c1eb72 0x50e7ac=0x174
[4011764.858134] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 0): Out Of Range Address
[4011764.858154] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50ef30=0xc04000e 0x50ef34=0x20 0x50ef28=0x4c1eb72 0x50ef2c=0x174
[4011764.858204] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 1): Out Of Range Address
[4011764.858224] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50efb0=0xc07000e 0x50efb4=0x20 0x50efa8=0x4c1eb72 0x50efac=0x174
[4011764.858278] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 6, SM 0): Out Of Range Address
[4011764.858297] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50f730=0xc07000e 0x50f734=0x20 0x50f728=0x4c1eb72 0x50f72c=0x174
[4011764.858344] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 6, SM 1): Out Of Range Address
[4011764.858365] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x50f7b0=0xc07000e 0x50f7b4=0x20 0x50f7a8=0x4c1eb72 0x50f7ac=0x174
[4011764.858420] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[4011764.858439] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514730=0xc03000e 0x514734=0x20 0x514728=0x4c1eb72 0x51472c=0x174
[4011764.858485] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 1): Out Of Range Address
[4011764.858505] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5147b0=0xc05000e 0x5147b4=0x20 0x5147a8=0x4c1eb72 0x5147ac=0x174
[4011764.858559] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 0): Out Of Range Address
[4011764.858579] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514f30=0xc02000e 0x514f34=0x20 0x514f28=0x4c1eb72 0x514f2c=0x174
[4011764.858626] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 1): Out Of Range Address
[4011764.858646] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x514fb0=0xc02000e 0x514fb4=0x20 0x514fa8=0x4c1eb72 0x514fac=0x174
[4011764.858700] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 0): Out Of Range Address
[4011764.858720] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515730=0xc01000e 0x515734=0x20 0x515728=0x4c1eb72 0x51572c=0x174
[4011764.858767] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 1): Out Of Range Address
[4011764.858787] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5157b0=0xc01000e 0x5157b4=0x20 0x5157a8=0x4c1eb72 0x5157ac=0x174
[4011764.858840] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 0): Out Of Range Address
[4011764.858860] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 2, TPC 3, SM 0): Multiple Warp Errors
[4011764.858880] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515f30=0xc00000e 0x515f34=0x24 0x515f28=0x4c1eb72 0x515f2c=0x174
[4011764.858926] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Out Of Range Address
[4011764.858947] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x515fb0=0xc00000e 0x515fb4=0x20 0x515fa8=0x4c1eb72 0x515fac=0x174
[4011764.859001] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4, SM 0): Out Of Range Address
[4011764.859021] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516730=0xc00000e 0x516734=0x20 0x516728=0x4c1eb72 0x51672c=0x174
[4011764.859068] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4, SM 1): Out Of Range Address
[4011764.859087] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5167b0=0xc02000e 0x5167b4=0x20 0x5167a8=0x4c1eb72 0x5167ac=0x174
[4011764.859142] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 0): Out Of Range Address
[4011764.859162] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516f30=0xc00000e 0x516f34=0x20 0x516f28=0x4c1eb72 0x516f2c=0x174
[4011764.859209] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 1): Out Of Range Address
[4011764.859228] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x516fb0=0xc03000e 0x516fb4=0x20 0x516fa8=0x4c1eb72 0x516fac=0x174
[4011764.859282] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 6, SM 0): Out Of Range Address
[4011764.859301] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x517730=0xc07000e 0x517734=0x20 0x517728=0x4c1eb72 0x51772c=0x174
[4011764.859348] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 6, SM 1): Out Of Range Address
[4011764.859369] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5177b0=0xc01000e 0x5177b4=0x20 0x5177a8=0x4c1eb72 0x5177ac=0x174
[4011764.859424] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0, SM 0): Out Of Range Address
[4011764.859443] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51c730=0xc02000e 0x51c734=0x20 0x51c728=0x4c1eb72 0x51c72c=0x174
[4011764.859490] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0, SM 1): Out Of Range Address
[4011764.859510] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51c7b0=0xc05000e 0x51c7b4=0x20 0x51c7a8=0x4c1eb72 0x51c7ac=0x174
[4011764.859564] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Out Of Range Address
[4011764.859584] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51cf30=0xc05000e 0x51cf34=0x20 0x51cf28=0x4c1eb72 0x51cf2c=0x174
[4011764.859631] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 1): Out Of Range Address
[4011764.859650] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51cfb0=0xc05000e 0x51cfb4=0x20 0x51cfa8=0x4c1eb72 0x51cfac=0x174
[4011764.859705] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 0): Out Of Range Address
[4011764.859725] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51d730=0xc06000e 0x51d734=0x20 0x51d728=0x4c1eb72 0x51d72c=0x174
[4011764.859772] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 1): Out Of Range Address
[4011764.859791] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51d7b0=0xc0e000e 0x51d7b4=0x20 0x51d7a8=0x4c1eb72 0x51d7ac=0x174
[4011764.859845] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
[4011764.859866] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 3, TPC 3, SM 0): Multiple Warp Errors
[4011764.859885] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51df30=0xc06000e 0x51df34=0x24 0x51df28=0x4c1eb72 0x51df2c=0x174
[4011764.859932] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 1): Out Of Range Address
[4011764.859952] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51dfb0=0xc08000e 0x51dfb4=0x20 0x51dfa8=0x4c1eb72 0x51dfac=0x174
[4011764.860005] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 0): Out Of Range Address
[4011764.860025] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51e730=0xc07000e 0x51e734=0x20 0x51e728=0x4c1eb72 0x51e72c=0x174
[4011764.860073] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 1): Out Of Range Address
[4011764.860092] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51e7b0=0xc0f000e 0x51e7b4=0x20 0x51e7a8=0x4c1eb72 0x51e7ac=0x174
[4011764.860146] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 0): Out Of Range Address
[4011764.860166] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51ef30=0xc05000e 0x51ef34=0x20 0x51ef28=0x4c1eb72 0x51ef2c=0x174
[4011764.860213] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 1): Out Of Range Address
[4011764.860233] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x51efb0=0xc04000e 0x51efb4=0x20 0x51efa8=0x4c1eb72 0x51efac=0x174
[4011764.860288] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 0): Out Of Range Address
[4011764.860307] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524730=0xc03000e 0x524734=0x20 0x524728=0x4c1eb72 0x52472c=0x174
[4011764.860354] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 1): Out Of Range Address
[4011764.860374] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5247b0=0xc07000e 0x5247b4=0x20 0x5247a8=0x4c1eb72 0x5247ac=0x174
[4011764.860429] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 0): Out Of Range Address
[4011764.860448] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524f30=0xc06000e 0x524f34=0x20 0x524f28=0x4c1eb72 0x524f2c=0x174
[4011764.860495] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 1): Out Of Range Address
[4011764.860515] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x524fb0=0xc04000e 0x524fb4=0x20 0x524fa8=0x4c1eb72 0x524fac=0x174
[4011764.860567] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 0): Out Of Range Address
[4011764.860588] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525730=0xc0b000e 0x525734=0x20 0x525728=0x4c1eb72 0x52572c=0x174
[4011764.860635] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[4011764.860655] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5257b0=0xc04000e 0x5257b4=0x20 0x5257a8=0x4c1eb72 0x5257ac=0x174
[4011764.860708] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 0): Out Of Range Address
[4011764.860728] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525f30=0xc05000e 0x525f34=0x20 0x525f28=0x4c1eb72 0x525f2c=0x174
[4011764.860775] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 1): Out Of Range Address
[4011764.860794] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x525fb0=0xc06000e 0x525fb4=0x20 0x525fa8=0x4c1eb72 0x525fac=0x174
[4011764.860847] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 0): Out Of Range Address
[4011764.860866] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526730=0xc06000e 0x526734=0x20 0x526728=0x4c1eb72 0x52672c=0x174
[4011764.860913] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 1): Out Of Range Address
[4011764.860933] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5267b0=0xc07000e 0x5267b4=0x20 0x5267a8=0x4c1eb72 0x5267ac=0x174
[4011764.860986] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 0): Out Of Range Address
[4011764.861005] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526f30=0xc0b000e 0x526f34=0x20 0x526f28=0x4c1eb72 0x526f2c=0x174
[4011764.861052] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 1): Out Of Range Address
[4011764.861071] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x526fb0=0xc07000e 0x526fb4=0x20 0x526fa8=0x4c1eb72 0x526fac=0x174
[4011764.861124] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 6, SM 0): Out Of Range Address
[4011764.861144] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x527730=0xc04000e 0x527734=0x20 0x527728=0x4c1eb72 0x52772c=0x174
[4011764.861191] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 6, SM 1): Out Of Range Address
[4011764.861210] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x5277b0=0xc04000e 0x5277b4=0x20 0x5277a8=0x4c1eb72 0x5277ac=0x174
[4011764.861265] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 0, SM 0): Out Of Range Address
[4011764.861284] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52c730=0xc07000e 0x52c734=0x20 0x52c728=0x4c1eb72 0x52c72c=0x174
[4011764.861331] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 0, SM 1): Out Of Range Address
[4011764.861351] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 5, TPC 0, SM 1): Multiple Warp Errors
[4011764.861370] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52c7b0=0xc06000e 0x52c7b4=0x24 0x52c7a8=0x4c1eb72 0x52c7ac=0x174
[4011764.861424] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 1, SM 0): Out Of Range Address
[4011764.861443] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52cf30=0xc05000e 0x52cf34=0x20 0x52cf28=0x4c1eb72 0x52cf2c=0x174
[4011764.861490] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 1, SM 1): Out Of Range Address
[4011764.861509] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52cfb0=0xc06000e 0x52cfb4=0x20 0x52cfa8=0x4c1eb72 0x52cfac=0x174
[4011764.861562] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 0): Out Of Range Address
[4011764.861581] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52d730=0xc07000e 0x52d734=0x20 0x52d728=0x4c1eb72 0x52d72c=0x174
[4011764.861628] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 1): Out Of Range Address
[4011764.861648] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52d7b0=0xc04000e 0x52d7b4=0x20 0x52d7a8=0x4c1eb72 0x52d7ac=0x174
[4011764.861701] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 3, SM 0): Out Of Range Address
[4011764.861721] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52df30=0xc06000e 0x52df34=0x20 0x52df28=0x4c1eb72 0x52df2c=0x174
[4011764.861768] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 3, SM 1): Out Of Range Address
[4011764.861787] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52dfb0=0xc04000e 0x52dfb4=0x20 0x52dfa8=0x4c1eb72 0x52dfac=0x174
[4011764.861840] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 0): Out Of Range Address
[4011764.861859] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52e730=0xc07000e 0x52e734=0x20 0x52e728=0x4c1eb72 0x52e72c=0x174
[4011764.861906] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 1): Out Of Range Address
[4011764.861925] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52e7b0=0xc07000e 0x52e7b4=0x20 0x52e7a8=0x4c1eb72 0x52e7ac=0x174
[4011764.861979] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 0): Out Of Range Address
[4011764.861998] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Global Exception on (GPC 5, TPC 5, SM 0): Multiple Warp Errors
[4011764.862021] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52ef30=0xc06000e 0x52ef34=0x24 0x52ef28=0x4c1eb72 0x52ef2c=0x174
[4011764.862069] NVRM: Xid (PCI:0000:1a:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): Out Of Range Address
[4011764.862089] NVRM: Xid (PCI:0000:1a:00): 13, Graphics Exception: ESR 0x52efb0=0xc04000e 0x52efb4=0x20 0x52efa8=0x4c1eb72 0x52efac=0x174
[4011764.862712] NVRM: Xid (PCI:0000:1a:00): 13, pid=3322703, name=mistralrs-serve, Graphics Exception: ChID 0010, Class 0000c3c0, Offset 00000510, Data 00419e84

@EricLBuehler
Owner Author

@sempervictus interesting! Can you share the output log of this run?

Also, is it possible to find which kernels are causing these?

@sempervictus
Contributor

sempervictus commented Jun 4, 2025

@EricLBuehler - here's what the container spat out into the log stream:

2025-06-04T22:04:43.895454Z  INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-04T22:04:43.895503Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-04T22:04:43.895533Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-04T22:04:43.895576Z  INFO hf_hub: Using token file found "/root/.cache/huggingface/token"    
2025-06-04T22:04:43.895663Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:43.895721Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:43.966507Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-04T22:04:44.011380Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:44.074830Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-04T22:04:44.104793Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-04T22:04:44.104804Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-04T22:04:44.261495Z  INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-04T22:04:44.261900Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-04T22:04:44.261935Z  INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-04T22:04:44.261951Z  INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-04T22:04:44.261963Z  INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-04T22:04:44.308015Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-04T22:04:44.308028Z  INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-04T22:04:44.373043Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-04T22:04:44.373097Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
2025-06-04T22:04:44.373164Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-04T22:07:10.498028Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-06-04T22:07:10.498073Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 449 tensors.
2025-06-04T22:07:10.500096Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-04T22:08:31.049024Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 449 tensors out of 449 total tensors. Took 80.55s
2025-06-04T22:08:31.049365Z  INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-04T22:08:31.049377Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-04T22:08:31.896558Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-04T22:08:31.919202Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-04T22:08:31.919447Z  INFO mistralrs_core: Beginning dummy run.
2025-06-04T22:08:31.921006Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-06-04T22:08:32.238438Z  INFO mistralrs_core: Dummy run completed in 0.318980626s.
2025-06-04T22:08:32.238898Z  INFO mistralrs_server: Serving on http://0.0.0.0:7651.
2025-06-04T22:08:36.921162Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T22:09:01.921532Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 592.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:06.921637Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:11.921729Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:16.921816Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:21.921897Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:26.921973Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.80, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:31.922043Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:36.922108Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:41.922203Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:46.922301Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:51.922395Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:09:56.922483Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:01.922566Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:06.922644Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:11.922716Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:16.922784Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:21.922847Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:26.922948Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:31.923044Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:36.923134Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:41.923221Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:46.923301Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 18.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:10:51.923376Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 5.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-04T22:25:46.937102Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 1067.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:25:51.937185Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 13.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:36.937882Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 1956.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:41.937950Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:46.938054Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.60, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:51.938152Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:26:56.938246Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:01.938334Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.40, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:06.938417Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:11.938495Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:16.938566Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:21.938634Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:26.938695Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:31.938796Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:36.938892Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:41.938984Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:46.939069Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.20, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:51.939149Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:27:56.939225Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:01.939294Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:06.939358Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:11.939463Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:16.939562Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting
2025-06-04T22:28:21.939656Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 17.00, Prefix cache hitrate 0.00%, 1 running, 0 waiting

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x5634d097d227 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x5634d10857c9 - <mistralrs_core::device_map::LayerDeviceMapper as mistralrs_core::device_map::DeviceMapper>::map::ha40c495b77d50a86
  17:     0x5634d1114998 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  18:     0x5634d1118a0a - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  19:     0x5634d1970903 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  20:     0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  21:     0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  22:     0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  23:     0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  24:     0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  25:     0x7f2edd158ac3 - <unknown>
  26:     0x7f2edd1e9a04 - clone
  27:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x5634d097d219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x5634d161bdc2 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
  17:     0x5634d19713a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  18:     0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  19:     0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  20:     0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  21:     0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  22:     0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  23:     0x7f2edd158ac3 - <unknown>
  24:     0x7f2edd1e9a04 - clone
  25:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x5634d2404922 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x5634d0acb563 - core::fmt::write::h95e30a17c3d7d930
   2:     0x5634d2403b0f - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x5634d2404783 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x5634d24040b5 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x5634d24037c7 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x5634d2445be8 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x5634d2445b49 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x5634d244719c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x5634d0ac964f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x5634d0ad0da5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x5634d0a294e3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x5634d097d2e8 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x5634d097d219 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x5634d097d16b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x5634d097d370 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x5634d106cc52 - <hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop::ha2d468b205f8c06b
  17:     0x5634d161bed6 - core::ptr::drop_in_place<mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::FlashParams>::h54cce5dd92919df6
  18:     0x5634d19713a8 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  19:     0x5634d1973f3a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  20:     0x5634d18ca52b - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.37410
  21:     0x5634d1500979 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  22:     0x5634d1506f13 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  23:     0x5634d2447547 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  24:     0x7f2edd158ac3 - <unknown>
  25:     0x7f2edd1e9a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

How do I pull up which kernels it's loading?


Separately, any chance this is what we're tripping over on CC7 devices?

@EricLBuehler
Owner Author

@sempervictus thanks for the log.

how do i pull up which kernels its loading?

Was wondering if that was showing up in the log.

I've reproduced this:

| Crashed? | FlashAttention? | PagedAttention? |
| --- | --- | --- |
| yes | yes | yes |
| yes | no | yes |
| no | no | no |
| no | yes | no |

So it seems that PagedAttention is somehow causing this - can you please try to run without paged attention?

@sempervictus
Contributor

Also pretty sure paged attention is causing this, with two things I haven't tracked down:

  1. Why is FlashAttention even showing up in CC7 stack traces?
  2. that calling convention note here

Running without it currently on the swe-agent tester ... it's hanging pretty often (GPUs settle to 25% and the T/s counter stops) but completing requests when reissued a few times. So "yes" that's where the crash is, but disabling it makes ollama run models faster in comparison 😉

@EricLBuehler
Owner Author

@sempervictus

  1. Why is FlashAttention even showing up in CC7 stack traces?

I fixed that in #1429; this was some metadata

  2. that calling convention note here

Just referencing two equivalent forms, to my understanding.

Running without it currently on the swe-agent tester ... it's hanging pretty often (GPUs settle to 25% and the T/s counter stops) but completing requests when reissued a few times. So "yes" that's where the crash is, but disabling it makes ollama run models faster in comparison 😉

Have you tried to activate nccl?

@sempervictus
Contributor

@EricLBuehler - I've run with the NCCL-disable env var and without it. Currently using manual partitioning; although I got the sense you fixed allocations previously, the KV cache seems to be biased toward GPU0 (and some models can't be split).

@sempervictus
Contributor

sempervictus commented Jun 5, 2025

@EricLBuehler - Just rebuilt and tested the unquantized-blow-up case (with the FA fix): still blows up :-(

mistralrs-server --token-source env:HF_TOKEN -n "0:20;1:22;2:22" --port 7651 plain -m SWE-bench/SWE-agent-LM-32B --max-seq-len 32768 

==========
== CUDA ==
==========

CUDA Version 12.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2025-06-05T01:26:48.882747Z  INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-05T01:26:48.882784Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-05T01:26:48.882815Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-05T01:26:48.882853Z  INFO hf_hub: Using token file found "/root/.cache/huggingface/token"    
2025-06-05T01:26:48.882946Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:48.883032Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:48.967134Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00014.safetensors", "model-00002-of-00014.safetensors", "model-00003-of-00014.safetensors", "model-00004-of-00014.safetensors", "model-00005-of-00014.safetensors", "model-00006-of-00014.safetensors", "model-00007-of-00014.safetensors", "model-00008-of-00014.safetensors", "model-00009-of-00014.safetensors", "model-00010-of-00014.safetensors", "model-00011-of-00014.safetensors", "model-00012-of-00014.safetensors", "model-00013-of-00014.safetensors", "model-00014-of-00014.safetensors"]
2025-06-05T01:26:48.992605Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:49.045567Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `SWE-bench/SWE-agent-LM-32B`
2025-06-05T01:26:49.075365Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `qwen2`
2025-06-05T01:26:49.075383Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-05T01:26:49.239330Z  INFO mistralrs_quant::utils::log: Model has 64 repeating layers.
2025-06-05T01:26:49.239753Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T01:26:49.239793Z  INFO mistralrs_quant::utils::log: Layers 0-19: cuda[0] (32 GB)
2025-06-05T01:26:49.239808Z  INFO mistralrs_quant::utils::log: Layers 20-41: cuda[1] (32 GB)
2025-06-05T01:26:49.239821Z  INFO mistralrs_quant::utils::log: Layers 42-63: cuda[2] (32 GB)
2025-06-05T01:26:49.288215Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T01:26:49.288229Z  INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T01:26:49.353483Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T01:26:49.353534Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 152064, hidden_size: 5120, intermediate_size: 27648, num_hidden_layers: 64, num_attention_heads: 40, num_key_value_heads: 8, max_position_embeddings: 32768, sliding_window: Some(131072), rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, quantization_config: None, tie_word_embeddings: false }
...
2025-06-05T01:27:06.267927Z  INFO mistralrs_core::paged_attention: Allocating 8192 MB for PagedAttention KV cache per GPU
2025-06-05T01:27:06.267947Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 1024 GPU blocks: available context length is 32768 tokens
2025-06-05T01:27:07.137262Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-05T01:27:07.160496Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T01:27:07.160723Z  INFO mistralrs_core: Beginning dummy run.
2025-06-05T01:27:07.165754Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55ed44aa6cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55ed43174273 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55ed44aa5eaf - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55ed44aa6b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55ed44aa6455 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55ed44aa5b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55ed44ae7f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55ed44ae7ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55ed44ae953c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55ed4317235f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55ed43179ab5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55ed430d21a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55ed444a0738 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
  13:     0x55ed444a98e2 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
  14:     0x55ed430464df - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
  15:     0x55ed42f67e01 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
  16:     0x55ed445030b1 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
  17:     0x55ed44532560 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
  18:     0x55ed444bfdcc - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
  19:     0x55ed4371fadc - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  20:     0x55ed437237ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  21:     0x55ed44263ee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  22:     0x55ed4426751a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  23:     0x55ed441b9bea - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
  24:     0x55ed438dc419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  25:     0x55ed438e2903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  26:     0x55ed44ae98e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  27:     0x7f7d4f158ac3 - <unknown>
  28:     0x7f7d4f1e9a04 - clone
  29:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x55ed44aa6cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x55ed43174273 - core::fmt::write::h95e30a17c3d7d930
   2:     0x55ed44aa5eaf - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x55ed44aa6b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x55ed44aa6455 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x55ed44aa5b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x55ed44ae7f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x55ed44ae7ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x55ed44ae953c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x55ed4317235f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x55ed43179ab5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x55ed430d21a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x55ed43026098 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x55ed43025fd7 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x55ed43025f1b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x55ed43026120 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x55ed43722830 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  17:     0x55ed437237ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  18:     0x55ed44263ee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  19:     0x55ed4426751a - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  20:     0x55ed441b9bea - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
  21:     0x55ed438dc419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  22:     0x55ed438e2903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  23:     0x55ed44ae98e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  24:     0x7f7d4f158ac3 - <unknown>
  25:     0x7f7d4f1e9a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

lines 11-17 of the middle one look interesting.

Can also confirm that the same unquantized always-crash reproducer still crashes with --no-paged-attn, so paged attention may just be a catalyst for a problem that is already evident at raw FP16:

2025-06-05T01:31:26.815015Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
2025-06-05T01:31:26.837498Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T01:31:26.837710Z  INFO mistralrs_core: Beginning dummy run.
2025-06-05T01:31:26.842774Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x56261d161cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x56261b82f273 - core::fmt::write::h95e30a17c3d7d930
   2:     0x56261d160eaf - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x56261d161b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x56261d161455 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x56261d160b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x56261d1a2f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x56261d1a2ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x56261d1a453c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x56261b82d35f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x56261b834ab5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x56261b78d1a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x56261cb5b738 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
  13:     0x56261cb648e2 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
  14:     0x56261b7014df - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
  15:     0x56261b622e01 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
  16:     0x56261cbbe0b1 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
  17:     0x56261cbed560 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498
  18:     0x56261cb7adcc - <mistralrs_quant::distributed::layers::ColumnParallelLayer as mistralrs_quant::QuantMethod>::forward::h69b916efba3c9b52
  19:     0x56261bddaadc - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  20:     0x56261bdde7ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  21:     0x56261c91eee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  22:     0x56261c921a38 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  23:     0x56261c871a06 - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
  24:     0x56261bf97419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  25:     0x56261bf9d903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  26:     0x56261d1a48e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  27:     0x7f11ce558ac3 - <unknown>
  28:     0x7f11ce5e9a04 - clone
  29:                0x0 - <unknown>

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.13.9/src/driver/safe/core.rs:257:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
   0:     0x56261d161cc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h02fa31b9d8cef683
   1:     0x56261b82f273 - core::fmt::write::h95e30a17c3d7d930
   2:     0x56261d160eaf - std::io::Write::write_fmt::h2447d4278ce5a227
   3:     0x56261d161b23 - std::sys::backtrace::BacktraceLock::print::headc5841a9aa64f7
   4:     0x56261d161455 - std::panicking::default_hook::h0a7d57cc63374946
   5:     0x56261d160b67 - std::panicking::rust_panic_with_hook::he2fcc0f110c4d509
   6:     0x56261d1a2f88 - std::panicking::begin_panic_handler::{{closure}}::hc2c1290e9d2fc530
   7:     0x56261d1a2ee9 - std::sys::backtrace::__rust_end_short_backtrace::h594e6478825ce120
   8:     0x56261d1a453c - __rustc[ec3606f4b1ae7141]::rust_begin_unwind
   9:     0x56261b82d35f - core::panicking::panic_fmt::ha159237b3cadc48c
  10:     0x56261b834ab5 - core::result::unwrap_failed::h879f86fa8962b20a
  11:     0x56261b78d1a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
  12:     0x56261b6e1098 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h910818eba45f4ad8.5497
  13:     0x56261b6e0fd7 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::hba27da765c7a5ab2.5495
  14:     0x56261b6e0f1b - alloc::sync::Arc<T,A>::drop_slow::hc54eb0850d0765bc
  15:     0x56261b6e1120 - alloc::sync::Arc<T,A>::drop_slow::hf3f48872e4b5c869
  16:     0x56261bddd830 - mistralrs_core::models::qwen2::Model::forward_embed::h76325c7661b6b0c7
  17:     0x56261bdde7ba - <mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward::hbcf387a473650d2c
  18:     0x56261c91eee3 - <mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs::h8edb5b32da8c99b3
  19:     0x56261c921a38 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h9b4dc98405e070f1
  20:     0x56261c871a06 - mistralrs_core::engine::Engine::run::{{closure}}::h1998f4d35d1f2f93.43721
  21:     0x56261bf97419 - std::sys::backtrace::__rust_begin_short_backtrace::hc037ddf44e014f4a
  22:     0x56261bf9d903 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hddb404aa3c36d067
  23:     0x56261d1a48e7 - std::sys::pal::unix::thread::Thread::new::thread_start::h9d9210a77f52da93
  24:     0x7f11ce558ac3 - <unknown>
  25:     0x7f11ce5e9a04 - clone
  26:                0x0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

@EricLBuehler
Owner Author

@EricLBuehler - Just rebuilt and tested the unquantized-blow-up case (with the FA fix): still blows up :-(

Hmm, yeah. On my end I'm trying some things: nccl + no ISQ + paged attention works. Trying to find out which kernel is the problem though

lines 11-17 of the middle one look interesting.

11: 0x55ed430d21a3 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h09dc078c6d45fb4a
12: 0x55ed444a0738 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<u8>>::h69dabdb8397fdeca
13: 0x55ed444a98e2 - <mistralrs_quant::cublaslt::api::CublasLTBatchMatmul as candle_core::custom_op::CustomOp2>::cuda_fwd::h485b6d7e9e3b157b
14: 0x55ed430464df - candle_core::storage::Storage::apply_op2::h6a343fb09e53884b
15: 0x55ed42f67e01 - candle_core::custom_op::<impl candle_core::tensor::Tensor>::apply_op2_arc::h1089692e7e049299
16: 0x55ed445030b1 - mistralrs_quant::cublaslt::CublasLtWrapper::batch_matmul::h8a46e8cceca17c7d
17: 0x55ed44532560 - <mistralrs_quant::unquantized::UnquantLinear as mistralrs_quant::QuantMethod>::forward::h8602c1712d107498

Looks like an issue in the cublaslt code? Checking that...
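The two aborts above share a shape: frame 11 is `CudaSlice`'s `Drop` freeing device memory on a thread whose current CUDA context was last set for a *different* device, which the driver surfaces as CUDA_ERROR_ILLEGAL_ADDRESS. That is the failure mode this PR's "bind thread on CudaStorage drop" targets: rebind the buffer's owning context before freeing. A minimal CPU-only analogy of that rebind (no GPU or cudarc required; all names here are illustrative, not the real API):

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

// Stand-in for the driver's per-thread "current context" (cuCtxSetCurrent).
thread_local! {
    static CURRENT_CTX: Cell<usize> = Cell::new(0);
}

// Models a device buffer that must be freed under its owning context,
// mirroring how cudarc's `CudaSlice` frees memory in `Drop`.
struct Slice {
    ctx: usize,
    // Records (owning ctx, ctx current at free time) for inspection.
    log: Rc<RefCell<Vec<(usize, usize)>>>,
}

impl Drop for Slice {
    fn drop(&mut self) {
        // The fix: make the owning context current before freeing. Without
        // this line, the free runs under whichever context the forward pass
        // last made current on this thread.
        CURRENT_CTX.with(|c| c.set(self.ctx));
        let now_current = CURRENT_CTX.with(|c| c.get());
        self.log.borrow_mut().push((self.ctx, now_current));
    }
}

fn simulate() -> Vec<(usize, usize)> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let a = Slice { ctx: 0, log: log.clone() }; // buffer on cuda[0]
    let b = Slice { ctx: 1, log: log.clone() }; // buffer on cuda[1]
    CURRENT_CTX.with(|c| c.set(2)); // model code switched the thread to cuda[2]
    drop(a);
    drop(b);
    let result = log.borrow().clone();
    result
}

fn main() {
    for (owner, freed_under) in simulate() {
        println!("buffer owned by ctx {owner} freed under ctx {freed_under}");
    }
}
```

The `CURRENT_CTX.with(|c| c.set(self.ctx))` line stands in for a cuCtxSetCurrent-style rebind; delete it and the log shows `(0, 2)` and `(1, 2)`, the owner/current mismatch analogous to the crash above.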

@sempervictus
Contributor

Re which kernels - eventually all of them, as far as I can tell; every model I've gotten to run crashes after some time with paged attention, and the unquantized SWE one seems to be the best reproducer

@sempervictus
Contributor

@EricLBuehler - how's NCCL being built? Are you linking hpc-x and building against the current CUDA version of the container or using a prebuilt? Our shop are some of the magical elves nobody ever sees who build/run the HPC clusters for a bunch of the various clouds and enterprise orgs out there (framing-overhead off line rate sort of stuff where we can, and not even on IB these days) so NCCL, UCX, etc are recurring parts of our collective nightmares. Especially w/ the proprietary/open nonsense (b200's don't run the proprietary drivers and their mezzanine "looks like" 4 permanently IB-mode CX7s, NVL72s are even stranger), NCCL compilation to-target becomes even more relevant re ABI against CUDA, drivers, and OpenMPI (not to mention the toolchain changes currently rippling through canonical's LTS').

Might be worth considering runtime instrumentation beyond the dynamic-dispatch stack traces such as codepoint interception and export to opentelem or some form of RPC to produce internal state telemetry for external analysis. More detailed console logging output probably can't hurt either (which kernels, parameters, etc) - maybe with some sort of verbosity flag to ratchet up the noise.

@EricLBuehler
Owner Author

@sempervictus it looks like nccl + NO paged attn + NO cublaslt + ISQ works

how's NCCL being built? Are you linking hpc-x and building against the current CUDA version of the container or using a prebuilt? Our shop are some of the magical elves nobody ever sees who build/run the HPC clusters for a bunch of the various clouds and enterprise orgs out there (framing-overhead off line rate sort of stuff where we can, and not even on IB these days) so NCCL, UCX, etc are recurring parts of our collective nightmares. Especially w/ the proprietary/open nonsense (b200's don't run the proprietary drivers and their mezzanine "looks like" 4 permanently IB-mode CX7s, NVL72s are even stranger), NCCL compilation to-target becomes even more relevant re ABI against CUDA, drivers, and OpenMPI (not to mention the toolchain changes currently rippling through canonical's LTS').

Currently delegating to cudarc, but it's dynamic linking to my understanding.

Might be worth considering runtime instrumentation beyond the dynamic-dispatch stack traces such as codepoint interception and export to opentelem or some form of RPC to produce internal state telemetry for external analysis. More detailed console logging output probably can't hurt either (which kernels, parameters, etc) - maybe with some sort of verbosity flag to ratchet up the noise.

Absolutely, might try that!

@sempervictus
Contributor

Well, on the cudarc side - EricLBuehler/candle#83 :-)

@sempervictus
Contributor

Also i think there's some memory capacity calculus that goes south when running w/out paged attention. I've had the SWE one running overnight writing code at a somewhat sad rate but the interesting part is that its runtime memory seems to spike past actual capacity:

2025-06-05T07:09:18.355647Z ERROR mistralrs_core::engine: completion step - Model failed with error: WithBacktrace { inner: Cuda(Cuda(DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory"))), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "<candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::alloc_uninit" }, { fn: "candle_core::tensor::Tensor::reshape" }, { fn: "mistralrs_core::attention::repeat_kv" }, { fn: "mistralrs_core::attention::Sdpa::run_attention" }, { fn: "mistralrs_core::models::qwen2::Model::forward_embed" }, { fn: "<mistralrs_core::models::qwen2::Model as mistralrs_core::pipeline::loaders::normal_loaders::NormalModel>::forward" }, { fn: "<mistralrs_core::pipeline::normal::NormalPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}.37410" }, { fn: "std::sys::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "clone" }] }
2025-06-05T07:09:23.208596Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.20, Prefix cache hitrate 55.56%, 0 running, 0 waiting
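The OOM backtrace above points at the reshape inside `repeat_kv` under `Sdpa::run_attention`. With GQA, the 8 KV heads get broadcast up to the 40 query heads when attention isn't paged, which transiently materializes a much larger tensor. A rough back-of-envelope sketch using the Qwen2 config values from the earlier log (hidden_size 5120, 40 attention heads, 8 KV heads, F16); this is an assumption-laden estimate of the expanded K+V allocation per layer, not a measurement, and it ignores the unexpanded cache and other working buffers:

```rust
// Estimate the bytes `repeat_kv` materializes per layer when the 8 KV heads
// are broadcast to the full 40 query heads (values from the logged config).
fn repeat_kv_bytes(batch: usize, seq_len: usize) -> usize {
    let num_attention_heads: usize = 40;
    let head_dim: usize = 5120 / num_attention_heads; // hidden_size / heads = 128
    let dtype_bytes: usize = 2; // F16
    // K and V are each expanded to num_attention_heads heads.
    2 * batch * num_attention_heads * seq_len * head_dim * dtype_bytes
}

fn main() {
    let bytes = repeat_kv_bytes(1, 32768);
    // Hundreds of MiB per layer at long context, freed and reallocated per step.
    println!("~{} MiB per layer at full 32k context", bytes / 1_048_576);
}
```

At 64 layers this kind of transient spike compounds with fragmentation, which would explain runtime memory climbing past the capacity estimated at load time when paged attention is off.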

@EricLBuehler
Owner Author

Also i think there's some memory capacity calculus that goes south when running w/out paged attention. I've had the SWE one running overnight writing code at a somewhat sad rate but the interesting part is that its runtime memory seems to spike past actual capacity:

Seems like a separate issue, will tackle that after we fix this one!

As for debugging:

  • nccl + NO paged attn + NO cublaslt + ISQ works
  • nccl + NO paged attn + cublaslt + ISQ works
  • cublaslt is not the issue

@sempervictus
Contributor

Agreed, likely its own thing but does raise the question of "should we have a shadow MMU to track it all?" :-)

@EricLBuehler
Owner Author

@sempervictus I think I might have found it! Disabling prefix caching seems to work:

cargo run --features cuda -- -i --isq 5 --prefix-cache-n 0 run -m meta-llama/Llama-3.3-70B-Instruct

Can you try that out?
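Per the summary, the engine's prefix cache logic was also adjusted so that a configured size of zero disables caching outright, matching what `--prefix-cache-n 0` asks for. A hedged sketch of that gating (function and parameter names are illustrative, not the actual mistral.rs API):

```rust
// Prefix caching is off when explicitly disabled OR when the configured
// cache size is zero (the `--prefix-cache-n 0` case above).
fn prefix_caching_enabled(no_prefix_cache: bool, prefix_cache_n: usize) -> bool {
    !no_prefix_cache && prefix_cache_n > 0
}

fn main() {
    assert!(prefix_caching_enabled(false, 16)); // default-style config
    assert!(!prefix_caching_enabled(false, 0)); // --prefix-cache-n 0
    assert!(!prefix_caching_enabled(true, 16)); // explicit opt-out
    println!("gating ok");
}
```

Before this adjustment, a zero-sized cache could presumably still take the caching code path, which is why zeroing it out served as a workaround during the bisection above.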

@EricLBuehler
Owner Author

@sempervictus after [77436b8](https://github.com/EricLBuehler/mistral.rs/commit/77436b8d371cd3fe9e31ab4e78a349c1f0239bb2), master is now working nicely!

@sempervictus
Contributor

"test is ongoing" - painfully slowly:

2025-06-05T16:29:31.508817Z  INFO mistralrs_quant::utils::log: Model has 48 repeating layers.
2025-06-05T16:29:31.509484Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T16:29:31.509531Z  INFO mistralrs_quant::utils::log: Layers 0-11: cuda[0] (32 GB)
2025-06-05T16:29:31.509548Z  INFO mistralrs_quant::utils::log: Layers 12-23: cuda[1] (32 GB)
2025-06-05T16:29:31.509563Z  INFO mistralrs_quant::utils::log: Layers 24-35: cuda[2] (32 GB)
2025-06-05T16:29:31.509576Z  INFO mistralrs_quant::utils::log: Layers 36-47: cuda[3] (32 GB)
2025-06-05T16:29:31.560365Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T16:29:31.560378Z  INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T16:29:31.632507Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T16:29:31.632577Z  INFO mistralrs_core::pipeline::vision: Model config: Llama4Config { text_config: TextConfig { hidden_act: Silu, hidden_size: 5120, intermediate_size: 8192, vocab_size: 202048, num_hidden_layers: 48, num_attention_heads: 40, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 10485760, rope_scaling: Some(Llama3RopeConfig { factor: 16.0, low_freq_factor: Some(1.0), high_freq_factor: Some(1.0), original_max_position_embeddings: Some(8192), rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: false, floor_scale: Some(8192.0), attn_scale: Some(0.1), attn_temperature_tuning: Some(4.0), use_qk_norm: true, moe_layers: None, interleave_moe_layer_step: 1, intermediate_size_mlp: 16384, num_local_experts: 16, num_experts_per_tok: 1, attention_chunk_size: 8192 }, vision_config: VisionConfig { hidden_size: 1408, hidden_act: Gelu, num_hidden_layers: 34, num_attention_heads: 16, num_channels: 3, intermediate_size: 5632, vision_output_dim: 4096, image_size: 336, patch_size: 14, norm_eps: 1e-5, pixel_shuffle_ratio: 0.5, projector_input_dim: 4096, projector_output_dim: 4096, vision_feature_layer: -1, rope_theta: 10000.0 }, image_token_index: 200092 }
2025-06-05T16:29:31.632801Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
Loading text repeating layers: [01:13:56] [###############################>--------] 38/48 (3h)                                                                                                                                                               

@sempervictus
Contributor

@EricLBuehler - interestingly, llama4 now doesn't fit into 128G of memory (4x32) very well... NCCL tries to dump a chunk into host RAM, takes a very long time to load. Forcing to GPU-only w/ mistral-rs:cuda128-compute70 mistralrs-server --isq q4k --prefix-cache-n 0 --enable-thinking -n "0:12;1:12;2:12;3:12" --token-source env:HF_TOKEN --port 7650 vision-plain -m meta-llama/Llama-4-Scout-17B-16E-Instruct --max-seq-len 16384 but still "not a quick process" :-)

@sempervictus
Contributor

hmm, sorta just hangs there... doesn't talk much, or use any GPU resources after load :-( - mistralrs-server --isq q4k --prefix-cache-n 0 --enable-thinking -n "0:12;1:12;2:12;3:12" --token-source env:HF_TOKEN --port 7650 vision-plain -m meta-llama/Llama-4-Scout-17B-16E-Instruct --max-seq-len 16384

2025-06-05T21:25:47.783403Z  INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: false, simd128: false, f16c: false
2025-06-05T21:25:47.783434Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-05T21:25:47.784199Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-05T21:25:47.785282Z  INFO hf_hub: Using token file found "/root/.cache/huggingface/token"    
2025-06-05T21:25:47.786888Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.787163Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.859665Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00050.safetensors", "model-00002-of-00050.safetensors", "model-00003-of-00050.safetensors", "model-00004-of-00050.safetensors", "model-00005-of-00050.safetensors", "model-00006-of-00050.safetensors", "model-00007-of-00050.safetensors", "model-00008-of-00050.safetensors", "model-00009-of-00050.safetensors", "model-00010-of-00050.safetensors", "model-00011-of-00050.safetensors", "model-00012-of-00050.safetensors", "model-00013-of-00050.safetensors", "model-00014-of-00050.safetensors", "model-00015-of-00050.safetensors", "model-00016-of-00050.safetensors", "model-00017-of-00050.safetensors", "model-00018-of-00050.safetensors", "model-00019-of-00050.safetensors", "model-00020-of-00050.safetensors", "model-00021-of-00050.safetensors", "model-00022-of-00050.safetensors", "model-00023-of-00050.safetensors", "model-00024-of-00050.safetensors", "model-00025-of-00050.safetensors", "model-00026-of-00050.safetensors", "model-00027-of-00050.safetensors", "model-00028-of-00050.safetensors", "model-00029-of-00050.safetensors", "model-00030-of-00050.safetensors", "model-00031-of-00050.safetensors", "model-00032-of-00050.safetensors", "model-00033-of-00050.safetensors", "model-00034-of-00050.safetensors", "model-00035-of-00050.safetensors", "model-00036-of-00050.safetensors", "model-00037-of-00050.safetensors", "model-00038-of-00050.safetensors", "model-00039-of-00050.safetensors", "model-00040-of-00050.safetensors", "model-00041-of-00050.safetensors", "model-00042-of-00050.safetensors", "model-00043-of-00050.safetensors", "model-00044-of-00050.safetensors", "model-00045-of-00050.safetensors", "model-00046-of-00050.safetensors", "model-00047-of-00050.safetensors", "model-00048-of-00050.safetensors", "model-00049-of-00050.safetensors", "model-00050-of-00050.safetensors"]
2025-06-05T21:25:47.891877Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.940447Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.978101Z  INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:47.978125Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `meta-llama/Llama-4-Scout-17B-16E-Instruct`
2025-06-05T21:25:48.010267Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama4`
2025-06-05T21:25:48.320550Z  INFO mistralrs_quant::utils::log: Model has 48 repeating layers.
2025-06-05T21:25:48.321754Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-05T21:25:48.322014Z  INFO mistralrs_quant::utils::log: Layers 0-11: cuda[0] (32 GB)
2025-06-05T21:25:48.322032Z  INFO mistralrs_quant::utils::log: Layers 12-23: cuda[1] (32 GB)
2025-06-05T21:25:48.322046Z  INFO mistralrs_quant::utils::log: Layers 24-35: cuda[2] (32 GB)
2025-06-05T21:25:48.322060Z  INFO mistralrs_quant::utils::log: Layers 36-47: cuda[3] (32 GB)
2025-06-05T21:25:48.376168Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7
2025-06-05T21:25:48.376186Z  INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-06-05T21:25:48.450293Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-06-05T21:25:48.450616Z  INFO mistralrs_core::pipeline::vision: Model config: Llama4Config { text_config: TextConfig { hidden_act: Silu, hidden_size: 5120, intermediate_size: 8192, vocab_size: 202048, num_hidden_layers: 48, num_attention_heads: 40, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 10485760, rope_scaling: Some(Llama3RopeConfig { factor: 16.0, low_freq_factor: Some(1.0), high_freq_factor: Some(1.0), original_max_position_embeddings: Some(8192), rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: false, floor_scale: Some(8192.0), attn_scale: Some(0.1), attn_temperature_tuning: Some(4.0), use_qk_norm: true, moe_layers: None, interleave_moe_layer_step: 1, intermediate_size_mlp: 16384, num_local_experts: 16, num_experts_per_tok: 1, attention_chunk_size: 8192 }, vision_config: VisionConfig { hidden_size: 1408, hidden_act: Gelu, num_hidden_layers: 34, num_attention_heads: 16, num_channels: 3, intermediate_size: 5632, vision_output_dim: 4096, image_size: 336, patch_size: 14, norm_eps: 1e-5, pixel_shuffle_ratio: 0.5, projector_input_dim: 4096, projector_output_dim: 4096, vision_feature_layer: -1, rope_theta: 10000.0 }, image_token_index: 200092 }
2025-06-05T21:25:48.451823Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-05T23:06:10.917599Z  INFO mistralrs_core::pipeline::paths: `tokenizer_config.json` does not contain a chat template, attempting to use specified JINJA chat template.
2025-06-05T23:06:10.919238Z  INFO mistralrs_core::pipeline::paths: No specified chat template. No chat template will be used. Only prompts will be accepted, not messages.
2025-06-05T23:06:10.920975Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 2895 tensors.
2025-06-05T23:06:10.922891Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 32 threads.
2025-06-05T23:08:42.075695Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 2895 tensors out of 2895 total tensors. Took 151.15s
2025-06-05T23:08:42.076475Z  INFO mistralrs_core::paged_attention: Allocating 3072 MB for PagedAttention KV cache per GPU
2025-06-05T23:08:42.076487Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 512 GPU blocks: available context length is 16384 tokens
2025-06-05T23:08:43.333552Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot|>", "<|end_of_text|>", "<|eom|>", unk_tok = `None`
2025-06-05T23:08:43.385250Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-05T23:08:43.386047Z  INFO mistralrs_core: Beginning dummy run.
2025-06-05T23:08:52.650528Z  INFO mistralrs_core: Dummy run completed in 9.264463455s.
2025-06-05T23:08:52.652994Z  INFO mistralrs_server: Serving on http://0.0.0.0:7650.
2025-06-05T23:08:53.390453Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting

looks like this:

[0] Tesla V100-SXM2-32GB | 40'C,  ?? %,   0 %,   61 / 300 W | 21739 / 32768 MB | root:mistralrs-server/3165477(21462M)
[1] Tesla V100-SXM2-32GB | 38'C,  ?? %,   0 %,   58 / 300 W | 18195 / 32768 MB | root:mistralrs-server/3165477(17918M)
[2] Tesla V100-SXM2-32GB | 37'C,  ?? %,   0 %,   60 / 300 W | 18195 / 32768 MB | root:mistralrs-server/3165477(17918M)
[3] Tesla V100-SXM2-32GB | 40'C,  ?? %,   0 %,   61 / 300 W | 18195 / 32768 MB | root:mistralrs-server/3165477(17918M)

@polarathene
Contributor

polarathene commented Jun 11, 2025

Just to chime in: there are a lot of comments and output above that I'm not going to dig into, but a few questions:

  • Is the issue not reproducible outside of a container build?
  • If that is the case, can you perform the build within a container at runtime with the GPU device added?
  • If that works, then the issue could be the lack of GPU access during the container build itself. This is supported but requires adjusting the Dockerfile; however, requiring an NVIDIA GPU at build time would not be compatible with CI (which does not provide one) and may add friction for users building the image themselves.

If those assumptions are valid, whatever part of the build requires GPU access would probably need to be pre-built externally, or made opt-out via a feature flag with a documented drawback for container users, if there's no way to add that support by other means.
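
If GPU access during the build turns out to be the missing piece, one way to test the theory (assuming Docker with the NVIDIA Container Toolkit installed; image/file names below are illustrative) is to make the `nvidia` runtime the default so build containers can see the GPU, then compare against a build performed inside a running container with the GPU attached:

```shell
# 1. Make the nvidia runtime the default for build containers:
#    /etc/docker/daemon.json -> { "default-runtime": "nvidia", ... }
# 2. Restart the daemon and rebuild the image:
sudo systemctl restart docker
docker build -t mistralrs-cuda -f Dockerfile.cuda-all .
# 3. Alternatively, build at runtime with the GPU explicitly added:
docker run --rm --gpus all -v "$PWD":/src -w /src \
    nvidia/cuda:12.4.1-devel-ubuntu22.04 \
    bash -c "cargo build --release --features cuda"
```

If the runtime build (step 3) succeeds where the image build (step 2) fails, that would point to the build step needing GPU access.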

There was mention of PagedAttention kernel and quantization, so I assume the custom kernels are involved/affected during the build in some way?


FWIW, a compute capability of 7.0 (the Tesla V100) is a bit low. I have seen a few projects where 7.5 is the lowest supported capability, and even that involved some differences (I think PagedAttention was not supported?).

EDIT: Here's one:

GPUs with CUDA compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).

Make sure you have CUDA and the nvidia drivers installed. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher. You also need to add the nvidia binaries to your path

And from that same project, this related section:

Volta NOT SUPPORTED

Warning: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the USE_FLASH_ATTENTION=True environment variable.

So it's possible the issues are related to the older GPU architecture (nearing a decade old now); check whether the image runs without issues on newer GPUs. Given the age of that hardware, I'd also make sure you're using a relatively modern release of Docker itself (or whichever container engine you prefer).
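
To confirm what capability the driver actually reports, `nvidia-smi` can query it directly (output shown is what a V100 should report):

```shell
# Query the compute capability of each visible GPU:
nvidia-smi --query-gpu=name,compute_cap --format=csv
# name, compute_cap
# Tesla V100-SXM2-32GB, 7.0
```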

Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Jul 14, 2025
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Builds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcast for afq gathermm

* Broadcast for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residual tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superflous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Improved PagedAttn scheduling accuracy (EricLBuehler#1282)

* Scheduler ops by reference

* Ensure scheduler gets correct prompts

* Fix cuda build for copy_blocks

* Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283)

* update to llguidance 0.7.16 (EricLBuehler#1284)

* update llguidance to 0.7.16 from crates.io; use ParserFactory

* add lark_llg.py example

* use new llguidance::Matcher APIs

* rework spec-decoding with llg

* more work on spec sampling

* check for parser stop

* fix clippy

* remove unneeded rollback

* update build_llg_factory to return Result

* Update dependencies (EricLBuehler#1286)

* Much faster image inputs processing (EricLBuehler#1289)

* Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290)

* More sdpa head dims, faster vision models

* Move nonzero to above for faster metal synch

* Doc

* Update valid head dims

* Show throughput in interactive mode (EricLBuehler#1291)

* Update interactive mode throughput stats

* Accurate prompt t/s

* Accurate prompt t/s for usage

* Unify bitwise operations (EricLBuehler#1288)

* Unify bitwise ops

* Tests pass

* Fix cuda build

* Clippy

* Multimodal prefix caching support! (EricLBuehler#1298)

* Initial progress

* Support vision prefix caching

* Update docs

* Add multimodal data abstraction

* Interactive mode improvements (EricLBuehler#1299)

* More ergonomic image url parsing

* Add option to clear

* Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285)

* Add qwen3 model

* Add enable_thinking

* Add initial qwen3 moe

* Add the moe model

* Format

* Fix order of norm

* Fix expert shapes

* Fix reverse

* Fix norm device for isq

* Fix nonzero when no nonzero

* Moe model runs

* Working qwen3 moe

* Add metal fp8 blockwise dequant

* Clean

* Typo

* Enable tool calling

* Streamlined ux

* Add some examples

* Add docs

* Fix dead link

* Remove interactive mode max_len

* Update QWEN3.md

* Hotfix for vision mode clear

* Revamped and streaming web search support (EricLBuehler#1301)

* Streaming web search

* Refactor a bit

* More refactoring

* Add some logging, parallelize some things

* Allow url

* Suppress warning, allow multi-turn searching

* Batch compute_similarities

* Cap content len

* Typos

* Doc

* Handle vision messages or different tool call prefixes (EricLBuehler#1302)

* Fix cuda

* Tune web search budget

* Simplify prefix cacher (EricLBuehler#1305)

* Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306)

The io::stdin().read_line() call cannot handle non-ASCII input, which
caused a crash when using backspace to delete non-ASCII characters.

Introduce rustyline in the interactive mode to solve the problem; it can
also bring more editing features in the future.

Close EricLBuehler#1140

* Add more tools for automatic search (EricLBuehler#1307)

* Add interactive mode history

* Add a website extraction tool

* Pass toks by reference

* Optimize prompt chunking

* Fix CPU hogging in interactive mode (EricLBuehler#1309)

The log enabler should be checked after the sleep rather than in a busy
loop.

Since the interactive mode always disables the token speed logger, this
loop constantly consumed 100% CPU.
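
A minimal sketch of the fix described above (names are hypothetical; the real logger lives in `mistralrs_core::engine::logger`): sleeping for the interval first and only then checking the enabler flag means a disabled logger parks instead of spinning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Run the logger loop for `iters` intervals; returns how many log lines
/// would have been emitted. `enabled` stands in for the engine's enabler.
fn logger_ticks(enabled: bool, iters: u32) -> u32 {
    let flag = Arc::new(AtomicBool::new(enabled));
    let f = Arc::clone(&flag);
    let handle = thread::spawn(move || {
        let mut ticks = 0u32;
        for _ in 0..iters {
            // Sleep FIRST, then check: when logging is disabled this thread
            // parks for the whole interval instead of busy-checking the flag.
            thread::sleep(Duration::from_millis(5));
            if f.load(Ordering::Relaxed) {
                ticks += 1; // the real loop prints the T/s line here
            }
        }
        ticks
    });
    handle.join().unwrap()
}

fn main() {
    println!("enabled -> {} ticks", logger_ticks(true, 3));
    println!("disabled -> {} ticks", logger_ticks(false, 3));
}
```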

* Add Metal precompilation support  (EricLBuehler#1311)

* Add metal precompilation for paged attn

* Add for mistralrs-quant

* Better constructor

* Dont always build

* Fix name for paged attn rebuild

* Reduce thrashing of Metal autorelease (EricLBuehler#1313)

* Reduce calls to autorelease

* Optimize clone_in_cache

* Refactor float8

* make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314)

Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths`
can be constructed outside of `mistralrs-core`.

* Refactor KV cache manager (EricLBuehler#1315)

* Refactor kv cache

* Refactor caches

* Fix some overflows

* Add `Audio` and `Speech` model categories (EricLBuehler#1317)

* add `Audio` to `ModelCategory`

* add `Speech` to `ModelCategory`

* fix to go back to PartialEq having an exhaustiveness check

* Remove has_conv2d from vision model API (EricLBuehler#1318)

* Unified/automatic flash attention enabler (EricLBuehler#1319)

* Remove from sdpa params

* Fix errors

* No warnings

* Log

* Clippy

* Fix cublaslt 4d mask (EricLBuehler#1320)

* Fix cublaslt 4d mask

* Clippy

* Keep caches on gpu

* Qwen VL models fixes (EricLBuehler#1322)

* Add some defaults

* Fix

* Fix one thing

* 2.5 vl works

* Use caching again

* Fix v2

* Move index inside loop

* Offset in ropeidx

* Default support for vision prefix caching is false

* Fixes for all vision models (EricLBuehler#1323)

* Fix phi input processor?

* Fix phi input processor

* Handle no_prefix_cache from pipeline

* Phi models confirmed 👍

* Fixed for phi inputs processors

* Fixed for phi4

* Llama 3 confirmed 😀

* Mistral 3 confirmed 😃

* Idefics 2/3 fixes

* Some fixes

* Remove unsafety

* Improved+faster LRU prefix cacher (EricLBuehler#1321)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* Inplace ISQ support and default to mmap (EricLBuehler#1277)

* Initial impl of immediate isq

* Immediate isq -> !loading_isq

* Varbuiler utils always using mmap!

* Log

* Add for packed experts

* Afq without copy

* Clarify

* Clippy

* Apple immediate isq

* Better logic for loading_isq

* Support showing ttft

* Rename

* Shared quantize guard

* Parallel progress bar

* Parallel loading for progress bars

* Actual ISQ support

* Conditional parallelism for NiceProgressBar

* Use conditional iterator

* Warn once

* Predicate for applying immediate isq

* Allow parallel

* Remove debug print

* Remove debug print

* Remove debug print

* Fix typos (EricLBuehler#1329)

* Fix Idefics 3 arch chat templating (EricLBuehler#1330)

* Update inputs merger

* Fix

* Better warning

* Better warning

* Better warning

* Nonzero ahead of time

* No f32

* Clippy

* Optimize get_logprobs

* Fix packed experts

* Update masking

* Use Sdpa in idefics3

* QuantMethod in idefics3 vision

* Remove a .contiguous

* Remove two space from PR comment (EricLBuehler#1331)

* Add automatic vision loader type (EricLBuehler#1332)

* Add automatic vision loader

* Remove references to --arch

* Update examples

* Add the Dia 1.6b TTS model! (EricLBuehler#1304)

* Add loading

* Add rope, mlp, most of attn

* Add encoder + encoder layer, decoder layer forwards

* Add decoder forwards

* Add prepare_audio_prompt

* prepare_generation mostly done

* Add a proper dia kvcache

* Add most of decoder_step

* Add the sampler

* Add the generation loop

* Wire things up

* Add speech pipeline

* Fixes

* Loads

* Some fixes

* f32

* Some progress

* Ok it runs upto dac decoding

* Add dac part loading

* Loads and runs at least

* Remove encodec

* Debugging

* Debugging

* Huh

* Complete merge

* Interactive

* Confirmed dac works at least

* Looks like encoder works

* Much progress

* Hmm

* Sampling

* Almost there

* Sampler

* Sampler

* Bf16 support

* Response

* Use it in interactive mode

* Fix oneshot

* Add openai api

* Add openai api

* Refactor loading

* Use naive sdpa for inplace

* Factor out

* Clippy

* Clippy

* Config

* Refactor config

* Metal clippy

* Fix t/s

* ISQ support

* Some fixes, nits

* Fix cuda

* Clippy

* Inhibit cublaslt for cuda

* Add server example

* Add python example

* Add rust api

* Add docs

* Update config.toml

* Fix .pyi

* Update readme

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* update `llguidance` to `0.7.20` (EricLBuehler#1334)

Update `llguidance` from `0.7.16` to `0.7.20` so that it has guidance-ai/llguidance#172 which is a fix for building on GCC 15.

* Add model category <> messages check (EricLBuehler#1335)

* Verify model category matches the messages

* Add vision chat

* Fixes

* Add element-wise normalization check (EricLBuehler#1340)

* Fix streaming example print statement (EricLBuehler#1339)

* Fix normalization formula in comment (EricLBuehler#1338)

* Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337)

* Fix typo in expect messages (EricLBuehler#1342)

* Don't use mmap on cuda (EricLBuehler#1336)

* No mmap on cuda

* Simplify streaming tool call logic

* Remove debug

* Support AWQ format models (EricLBuehler#1350)

* Support AWQ format models

* Clippy fix

* Fix uqff dummy layer ISQ application (EricLBuehler#1351)

* Disable immediate isq if write_uqff (EricLBuehler#1352)

* Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354)

* Fix logic for uqff on cuda

* Updated pack_factor

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* New, fast sampler for Metal! (EricLBuehler#1327)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* A bit of gpu sampling

* Minp but cpu for now

* Metal fast cumsum impl

* Sampling with fast topp kernel

* Hmm not perfect

* Add metal sort kernels

* Tmp

* Add single block sort

* Add most of multi block sort, just need copy op

* Add copy kernels

* Expose kernels

* Add a test

* Ok it works

* Structure things

* Add caching

* Rename

* Cpu is default

* CUDA case

* Topk

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* Fix topk

* Penalties

* Add logits processor, clippy fixes

* Fix chat port

* Remove warning

* Fix chat port

* Fix metal parallel sampling (EricLBuehler#1357)

* Cpu if parallel for now

* Tweak bench script

* Add immediate isq predicates for qwen3 (EricLBuehler#1358)

* Add immediate isq predicates for qwen3

* Fix parsing of "parse_isq_value" dependent on device

* Typo

* Fix gemma3 logging

* Regressions fixes (EricLBuehler#1359)

* Fix regression for mmap

* Revert EricLBuehler#1321

* Refactored matching_cache impl

* Clippy

* Revamped and smaller readme (EricLBuehler#1360)

* Expandable detail sections

* Refactor using derivative model

* Tweak quick examples

* Update llama

* Update llama

* Supported accelerators is a table

* Update installation guides

* Tweak apis

* Remove --port in quick examples

* Add demo gif

* Add gif in readme

* Update demo gif

* Update demo gif

* Update demo gif

* Add gif in readme

* Add gif in readme

* Add a web chat app! (EricLBuehler#1362)

* Initial

* Markdown

* Copy code

* Add model loading sidebar

* Support vision models

* Tweak isq

* Links go to another page

* Clear when switch model

* Fix html tags

* Add image support!

* More then one images

* Fix

* Improved textarea

* Tab for switching between vision and text

* No paged attn for now

* Prettier format

* Multiple models at once

* Better switching, clearing ability

* Mobile support

* Inline markdown parser

* Update examples

* Typos

* Support specifying isq

* Fix mobile

* Fixes

* Fix button on mobile

* Image height is capped

* Thumbnail

* Fix rotating kv cache edge case

* Add drag and drop for images

* Small things

* Sidebar is frozen now

* Better listener

* Add readme

* Tweak readme

* Add chat history support to web chat app (EricLBuehler#1363)

* Add chat history

* Support renaming

* Start immediately with new chat

* Add timestamp

* Prettier chat list

* Style

* Delete chat

* Fix copy button

* Fix markdown rendering

* Store things in cache

* Store things in cache

* Refactor web chat, fix multichat image restore (EricLBuehler#1364)

* Fix multichat image restoration.

* Clippy

* Refactor

* Refactor frontent

* Fix repeated immediate isq init (EricLBuehler#1365)

* Add images_ref

* Add debug impl

* Fix the bug

* Tweak style of buttons

* Add a spinner

* Move spinner

* Tweak emoji

* Add gif

* Tweak initial gif

* Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366)

* Fix mistral 3 uqff residual tensors for vision

* Rolling shard creation for uqff files (EricLBuehler#1367)

* Fix occasional unstability during isq of afq (EricLBuehler#1368)

* Fix unstability during isq of afq

* Clippy

* Fix web chat installation

* Support web chat file uploading (EricLBuehler#1370)

* Web chat fixes

* Fix thumbnail in message, reuse blank chat

* Add file uploading support

* Fix scroll

* Allowed extensions

* Preserve files as literals

* Support multiple clients

* Add a stop button

* New cache dir

* New cache dir

* Fix

* Refactor

* Update readme

* Tweak drag-and-drop css

* Add speech generation support to the web chat! (EricLBuehler#1373)

* Initial speech gen support for web chat

* Tweak ui

* Update docs

* Prefix caching for PagedAttention! (EricLBuehler#1369)

* Exposing some things for logical token blocks

* Prefix cache manager has the scheduler

* Refactor

* Get logical and physical blocks into the prefix cacher

* Hash and cache

* Pass physical block prefill

* Allocation of prefilled block tables

* Temp

* Dont always use 2

* Hmm

* Hmm

* It mostly works

* Increment refcount

* Support images!

* Add to dummy paged attn

* Fix some clippy

* Clippy

* More checks

* Include EricLBuehler#1371, closes EricLBuehler#1371

* Typos

* Update docs
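
The hash-and-cache steps above can be sketched roughly as follows (a simplification: the real cacher tracks physical blocks, refcounts, and multimodal data; the names and map layout here are hypothetical). Each logical block's hash is chained with its parent's hash, so a cache hit on block k implies the entire prefix up to k matched.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK_SIZE: usize = 32; // matches the block size in the logs above

/// Hash one block of tokens, chained with the parent block's hash so the
/// hash identifies the whole prefix, not just this block's contents.
fn block_hash(prev: u64, tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    tokens.hash(&mut h);
    h.finish()
}

/// Count how many leading full blocks of `prompt` are already cached.
fn cached_prefix_blocks(cache: &HashMap<u64, usize>, prompt: &[u32]) -> usize {
    let mut prev = 0u64;
    let mut hits = 0;
    for chunk in prompt.chunks_exact(BLOCK_SIZE) {
        prev = block_hash(prev, chunk);
        if cache.contains_key(&prev) {
            hits += 1;
        } else {
            break;
        }
    }
    hits
}

fn main() {
    let prompt: Vec<u32> = (0..96).collect();
    let mut cache = HashMap::new();
    let mut prev = 0u64;
    for (block_id, chunk) in prompt.chunks_exact(BLOCK_SIZE).enumerate() {
        prev = block_hash(prev, chunk);
        cache.insert(prev, block_id); // value = physical block id, simplified
    }
    println!("full prompt hits: {}", cached_prefix_blocks(&cache, &prompt));
    let mut other = prompt.clone();
    other[40] = 999; // diverge inside block 1
    println!("diverged prompt hits: {}", cached_prefix_blocks(&cache, &other));
}
```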

* Metal PagedAttention accuracy improvements (EricLBuehler#1374)

* Fix subtle bug

* Fix half sum bug

* Format metal paged attention

* Handle images in paged attn scheduler (EricLBuehler#1375)

* Include schemas needed for chatcompletions endpoint (EricLBuehler#1353)

* EricLBuehler#1326: WIP include schemas needed for chat completions endpoint

 Conflicts:
	Cargo.lock
	mistralrs-server/src/main.rs

* EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it

* EricLBuehler#1326: first draft of handling schemas that use Either

* EricLBuehler#1326: first draft of handling schema for Grammar

* EricLBuehler#1326: Add in other endpoints to API docs.

* EricLBuehler#1326: Adjust code comments

* EricLBuehler#1326: Implement coderabbitai suggestions

- EricLBuehler#1353 (review)
- EricLBuehler#1353 (comment)

* Fix constraints with metal sampler

* Revert EricLBuehler#1375

* Fix case where prefix cacher returns no toks (EricLBuehler#1377)

* Fix AFQ UQFF serialization

* Faster UQFF serialization (EricLBuehler#1379)

* Faster UQFF serialization

* Fix uqff gemma3

* Improve gemma3 auto loader names

* UQFF creation for AFQ on CPU support (EricLBuehler#1380)

* Add afq cpu quantize/dequantize

* Clippy

* Improved device for afq quantize

* Improved dtype handling for cpu afq (de)quantize

* Improved generate_uqff_card

* Add fused CPU attention kernel! (EricLBuehler#1382)

* Working

* Fix warnings

* Allow mask

* Support bf16, f16

* Handle striding

* Parallelized

* Add initial vector flash attn

* Avoid repeated allocations

* Tiled kv

* Apply some clippy

* Some small fixes

* Chunked vec_dot

* Clipy

* Use T::zero

* Refactor attention backends (EricLBuehler#1384)

* Refactor attention code

* Refactor attention code

* Move into backends

* Set macOS thread affinity for CPU attn (EricLBuehler#1385)

* Use lazylock

* Format

* Fix metal warn build

* Faster Qwen 3 MoE support on Metal (EricLBuehler#1387)

* Fix load

* Use afq gather qmm

* Well it runs

* It works

* Polish

* Fast and slow options

* Remove quantized.rs

* Polish some more

* Refactor

* Add isq

* Update load in parallel

* Support fp8

* Refactor for FusedExperts

* Clippy

* Handle pack factor when loading prequantized models

* Use f32 only in moe

* Avoid using f32 so much

* Avoid using f32 so much

* Fix PagedAttention block leaks (EricLBuehler#1388)

* Warn and ignore if ignored

* Fix a block allocation leak

* Update bench.py

* Fix double free in block engine

* Do not apply ISQ if loading a prequantized model

* Fix cuda build again (EricLBuehler#1389)

* Fix cuda build

* Fix

* Format

* Fixes for cuda docker

* Update dockerfiles

* Bump version to 0.6.0 (EricLBuehler#1390)

* Bump version to 0.6.0

* Remove lower_level api

* Make a static dir

* Update deps

* Fix routing for static handler in web chat

* Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391)

* Allow speech models to accept batched inputs (EricLBuehler#1393)

* Allow speech models to accept batched inputs

* Clippy

* Ring distributed backend for heterogeneous TP (EricLBuehler#1238)

* Begin work on ring distributed backend for Metal

* Add the actual ring functionality

* It loads and kind of runs

* It works

* Optimize buffer allocation

* Avoid copy

* It works

* Add allgather

* Fix load

* Ping-pong

* Small things

* Add config json

* Allow different ip address

* Read config once

* Read config when appropriate

* Replicate requests

* Small fix

* Fix small compat with openai

* Clippy

* Update docs
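
The ring topology underlying that backend can be illustrated with a small sketch (threads and channels standing in for the real Metal/TCP transport): each of N ranks repeatedly sends its newest chunk to the next rank in the ring, so after N-1 steps every rank holds all chunks (ring all-gather).

```rust
use std::sync::mpsc;
use std::thread;

/// Ring all-gather: rank i starts with chunk i and ends with every chunk,
/// exchanging data only with its ring neighbors (never all-to-all).
fn ring_all_gather(inputs: Vec<Vec<f32>>) -> Vec<Vec<Vec<f32>>> {
    let n = inputs.len();
    let (mut txs, rxs): (Vec<_>, Vec<_>) =
        (0..n).map(|_| mpsc::channel::<(usize, Vec<f32>)>()).unzip();
    txs.rotate_left(1); // txs[i] now feeds rank (i + 1) % n's receiver
    let mut handles = Vec::new();
    let mut rxs = rxs;
    for (rank, chunk) in inputs.into_iter().enumerate() {
        let tx = txs.remove(0);
        let rx = rxs.remove(0);
        handles.push(thread::spawn(move || {
            let mut gathered = vec![Vec::new(); n];
            gathered[rank] = chunk;
            let mut newest = rank;
            for _ in 0..n - 1 {
                // Send the chunk we most recently obtained, then receive the
                // next one from the previous rank in the ring.
                tx.send((newest, gathered[newest].clone())).unwrap();
                let (idx, data) = rx.recv().unwrap();
                gathered[idx] = data;
                newest = idx;
            }
            gathered
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let out = ring_all_gather(vec![vec![0.0], vec![1.0], vec![2.0]]);
    println!("rank 0 gathered: {:?}", out[0]);
}
```

Because unbounded channels buffer sends, every rank can send before its neighbor receives, which is what makes the ping-pong schedule deadlock-free.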

* Add deepseek tool calling chat template

* Add auto loader for vision/text detection! (EricLBuehler#1402)

* Add auto loader for vision/text detection

* Build fixes

* Add model loader

* Update docs

* Format

* Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346)

* First draft of exposing mistral server routes as lib

* make arg struct fields pub

* Take base path so utoipa swagger route can properly redirect

* Expose swagger routes and make it configurable

* Add base path option for swagger docs

* More work on modularizing mistralrs server

* Sync fork (+1 squashed commit)
Squashed commits:
[169ae9e] Sync fork

* Adjust fn params to use refs / individual params instead of args

* Start breaking down controller actions into smaller pieces

* Continue refactoring

* Make mods pub so they can be used outside crate

* Allow chat completion streamer to take a callback so that you can get the complete response when finished

WIP (+3 squashed commits)
Squashed commits:
[0061d87] WIP
[c484d56] WIP
[16f8a60] WIP

* Sync fork

* Adjust callback type

* Remove throughput_log arg that was removed in 26afcc3

* Implement defaults for Args (and use for Clap)

* Small code formatting tweaks

* Rename callback to match SSE event and code clean up

* Sync fork

* WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?).

* Clean up (+4 squashed commits)
Squashed commits:
[e1cff387] Sync fork
[d8301025] WIP debugging
[1ea9f8c8] Sync fork
[4fe28cf5] WIP: debug function

* WIP server core builders

* Code clean up

* Add on_chunk callback

* Code clean up

* First draft of creating version of mistral-server that uses server-core

Code clean up (+1 squashed commit)
Squashed commits:
[adea1693]

* Sync fork

* Add helper methods to builder to make optional args more ergonomic (since .build validates params)

* Start adding docs

* Start cleaning up crates deps

* Example commit of mistral-server with implementing server-core

* Start addressing CodeRabbit feedback

* Fix comment typo

* Tweak doc blocks

* - Update type alias naming for clarity (MistralRs instead of Mistral)
- CodeRabbit, don't use eprintln for lib (use trace)
- Allow buffer size to be passed in and default to Constant
- Allow router body limit to be passed in and default to Constant
- Update doc examples

* Typo

* Address CoderRabbitAI feedback

* Support linear rope for llama3 (EricLBuehler#1408)

* Hotfix for loading

* Fix vllama4 uqff loading (EricLBuehler#1409)

* Fix vllama4 uqff loading

* Fix regex

* Fix regex

* Maybe a fix

* Gracefully handle receiver disconnects (EricLBuehler#1410)

* Handle receiver disconnects

* Format

* Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411)

* Fix bias

* Fix lm_head packing case

* Account for gate

* Fix head dim

* Fix interactive mode URL parsing (EricLBuehler#1412)

* fix url regex in vision interactive mode

* Fix regex

* Clippy

* Refactor auto device map (EricLBuehler#1413)

* Refactor auto device map

* Refactor a bit more

* Clippy

* Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414)

* Document runtime sampling commands

* Fix readme

* Tweak

* Bounds checking

* Tweak temp bounds

* Send streaming tokens every time

* Gumbel sampling for fast sampler (EricLBuehler#1416)
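
The Gumbel-max trick behind that commit, sketched with a stand-in xorshift RNG (the actual sampler is a GPU kernel): adding Gumbel(0,1) noise to each logit and taking the argmax draws a sample from softmax(logits) without an exp/normalize pass over the whole vocabulary.

```rust
/// Sample a token index from softmax(logits) via the Gumbel-max trick:
/// argmax_i (logit_i + g_i) with g_i ~ Gumbel(0, 1) is distributed
/// exactly as softmax(logits).
fn gumbel_sample(logits: &[f32], seed: &mut u64) -> usize {
    let mut best = 0usize;
    let mut best_val = f64::NEG_INFINITY;
    for (i, &l) in logits.iter().enumerate() {
        // xorshift64 as a stand-in RNG; seed must be nonzero
        *seed ^= *seed << 13;
        *seed ^= *seed >> 7;
        *seed ^= *seed << 17;
        // Uniform in (0, 1), clamped away from the endpoints to keep ln finite
        let u = ((*seed >> 11) as f64 / (1u64 << 53) as f64).clamp(1e-12, 1.0 - 1e-12);
        let g = -(-u.ln()).ln(); // Gumbel(0, 1) noise
        let v = l as f64 + g;
        if v > best_val {
            best_val = v;
            best = i;
        }
    }
    best
}

fn main() {
    let mut seed = 0x2545_F491_4F6C_DD1D_u64;
    let logits = [1.0f32, 2.0, 0.5, 3.0];
    // A histogram over many draws approximates softmax(logits)
    let mut counts = [0usize; 4];
    for _ in 0..1000 {
        counts[gumbel_sample(&logits, &mut seed)] += 1;
    }
    println!("counts: {:?}", counts);
}
```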

* Improved handling for initialize_logging

* Improved CPU flash attention accuracy & performance (EricLBuehler#1417)

* Downcast correctly

* Operate internally in f32

* Avoid some casts and striding

* Prefetch
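
The "operate internally in f32" point can be illustrated with one query row of streaming (flash-style) attention: keys and values are visited once, with the running max and softmax normalizer kept in f32 so lower-precision inputs don't lose accuracy in the accumulation. This is a sketch, not the actual kernel.

```rust
/// One query row of streaming attention with the online-softmax rescaling
/// trick: running max `m` and normalizer `l` stay in f32 throughout.
fn attn_row(q: &[f32], keys: &[Vec<f32>], vals: &[Vec<f32>], scale: f32) -> Vec<f32> {
    let d = vals[0].len();
    let (mut m, mut l) = (f32::NEG_INFINITY, 0.0f32);
    let mut acc = vec![0.0f32; d];
    for (k, v) in keys.iter().zip(vals) {
        let s = scale * q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>();
        let m_new = m.max(s);
        let corr = (m - m_new).exp(); // rescale the previous accumulator
        let p = (s - m_new).exp();
        l = l * corr + p;
        for (a, vi) in acc.iter_mut().zip(v) {
            *a = *a * corr + p * vi;
        }
        m = m_new;
    }
    // Normalize once at the end: equivalent to softmax(scores) @ V
    acc.iter().map(|a| a / l).collect()
}

fn main() {
    let q = vec![1.0f32, 0.0];
    let keys = vec![vec![1.0f32, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let vals = keys.clone();
    println!("{:?}", attn_row(&q, &keys, &vals, 1.0));
}
```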

* Provide chat_templates to container users (EricLBuehler#1419)

Models often come without chat templates, requiring them to be mapped
from the source repository into a container for access by
mistralrs-server.

Copy the templates from the build tree into the root of the image
to permit use via `--chat-template /chat_templates/something.json`

TODO:
  With the increase in quantized models and support for other
formats, the initial benchmark run during model load could be used
to qualify/select chat templates embedded into the binary for models
that do not ship with one (including outputting the functional
failures of each test, so users can adjust the provided templates
to suit the model being loaded).

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
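
The commit above boils down to a single Dockerfile step (the source path is illustrative; the runtime filename is whichever template matches your model, as in the message's `something.json` example):

```dockerfile
# Copy the repo's chat templates into the image root so users can point
# the server at one without mounting anything:
COPY chat_templates /chat_templates
# At runtime:
#   mistralrs-server --chat-template /chat_templates/something.json ...
```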

* Faster cpu flash attn (EricLBuehler#1418)

* Faster cpu flash attn

* Prefetch

* Clippy

* Add some tests

* Add softcap tests

* Fix test_parse_image_url test

* Update tests

* Update tests

* Web search improvements (bm25, web chat) (EricLBuehler#1420)

* Fix web search blocking case

* Web search support in web chat

* Tweak ui

* Support fallback to bm25

* Clippy

* Reinject descriptions
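
For reference, the BM25 fallback scoring (used when ranking search results without an embedding model) can be sketched in a few lines. This is a simplification of Okapi BM25 with the usual k1 = 1.2, b = 0.75 defaults and naive whitespace tokenization, not the project's actual implementation.

```rust
use std::collections::HashMap;

const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Okapi BM25 score of `query` against document `idx` in `docs`.
fn bm25(docs: &[Vec<&str>], idx: usize, query: &[&str]) -> f64 {
    let n = docs.len() as f64;
    let avgdl = docs.iter().map(|d| d.len()).sum::<usize>() as f64 / n;
    let doc = &docs[idx];
    let mut tf: HashMap<&str, f64> = HashMap::new();
    for &t in doc {
        *tf.entry(t).or_insert(0.0) += 1.0;
    }
    query
        .iter()
        .map(|q| {
            let df = docs.iter().filter(|d| d.contains(q)).count() as f64;
            // IDF with the standard +0.5 smoothing
            let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
            let f = tf.get(q).copied().unwrap_or(0.0);
            idf * f * (K1 + 1.0) / (f + K1 * (1.0 - B + B * doc.len() as f64 / avgdl))
        })
        .sum()
}

fn main() {
    let docs: Vec<Vec<&str>> = vec![
        "rust cuda kernels for paged attention".split_whitespace().collect(),
        "web chat app with image support".split_whitespace().collect(),
    ];
    let query: Vec<&str> = "paged attention".split_whitespace().collect();
    println!("doc0: {:.3}, doc1: {:.3}", bm25(&docs, 0, &query), bm25(&docs, 1, &query));
}
```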

* Properly handle consecutive searches (EricLBuehler#1421)

* Update extraction tool reinjection

* Looped

* Update docs (EricLBuehler#1422)

- lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be
- server_builder: fix typo
- READMEs: link to crate docs

* Better tool call detection logic (EricLBuehler#1424)

* Add web search hook callbacks (EricLBuehler#1426)

* feat: add customizable search hook

* Move to builder

* Update docs

* Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428)

* Add CUDA context helper and use in Llama forward

* No flashparams?

* working

* Tweak

* Update to use dep

* conditionally build flash attention inputs (EricLBuehler#1429)

* Add AGENTS.md (EricLBuehler#1430)

* Support Qwen3 GGUF model (EricLBuehler#1432)

* Support QWen3 GGUF model

* Clippy fix

* cargo fmt

* Improved paged attn prefix caching (EricLBuehler#1434)

* Improved paged attn prefix caching

* Disable

* Clippy

* Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433)

* Temporary fix for qwen3 gguf tokenizer

* Typo fix

* Add tool callback support (EricLBuehler#1427)

* Add tool callback support

* Fixes

* Support named tool callbacks

* Update examples

* Update docs

* Clippy

* Centralize crate dependencies (EricLBuehler#1438)

* chore: centralize dependencies

* Format

* Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440)

* Fix bug in tokenizer created with gguf metadata

* Clippy fix

* Update deps (EricLBuehler#1441)

* Small things

* Update deps

* Update deps

* Update breaking changes

* Doc fixes (EricLBuehler#1442)

* Mention uqff_maker

* Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444)

* Add max_completion_tokens alias for server (EricLBuehler#1451)

* Audio input support (Phi 4 multimodal) (EricLBuehler#1448)

* Deps

* Add conformer

* Nemo loading

* Position embeds

* Load t5 attn bias

* Attn and feed forward

* Add conv module and glu pointwise

* Implement relative attn bias

* Add the forward methods

* Add encoder embedding

* Fix oproj

* Some loading

* Conformer loads!

* Fully loading speech stack

* Merger

* Dont need that

* First pass at audio processing

* Read samples

* Optional

* Small loading fix

* Runs but not correct yet

* Improved audio processing?

* Works with this

* Fix t5 attn bias

* It works!

* Comment

* Use some other crates

* Clippy

* Allow bf16 on metal

* Add prefix_audio

* Remove unused

* Typo

* User specified

* Add audio url parsing

* AudioProjectionMode -> InputMode

* Audio prefix caching

* Fix bug in audio prefix caching

* Support both at the same time!

* Tweak logging

* Support stereo

* Add mistralrs-audio

* Support batching

* Add server and rust api example

* Add python api

* Fix add_multimodal_message

* Fix unfold for conformer

* Streaming example

* Add web chat support

* Add modalities registry

* Fix offline cache issue for gguf models (EricLBuehler#1452)

* Add MCP server endpoints (EricLBuehler#1453)

* feat(server): add MCP server support

* Add mcp docs

* Add handle_list_tools_request

* Better launch, tool handling

* Tmp state

* Ok works

* Handle modalities

* Update docs

* Add ping

* Tweak temperature bounds, args

* MCP documentation pass (EricLBuehler#1455)

* Fix table

* Update mcp docs

* Improve readme header

* Improve readme header

* Integrate an MCP client (EricLBuehler#1456)

* Add builtin mcp client

* Use async loader

* Add headers

* Handle sse

* More flexible search request

* Add tool callbacks with tools, for mcp

* Add bearer token support

* Add websocket support

* Update docs

* Add python api

* Clippy

* Add http api, docs

* Tests pass

* Make these configs actually work

* Add docs

* Make mistralrs-mcp

* Refactor examples

* Update examples

* Add defaults

* Add defaults

* Add defaults

* Update docs

* Improved docs

* Add -y to npx usages

* Even better examples

* Update generate_wheels

* Update generate_wheels

* Update generate_wheels

* Fix Dockerfile.cuda-all

* Improve automatic tool call (EricLBuehler#1460)

* Improved auto tool call

* Add logging

* chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458)

* chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459)

* Add fallback definition for isnan (EricLBuehler#1463)

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <eric@huggingface.co>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules' public API consistent (same fn names, etc.) and provide proxy fns that wrap the core fns so the core mod doesn't have to be pub
Make the lib.rs example compile-checked and update the example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle noncontiguous v in naive_sdpa (EricLBuehler#1499)

Co-authored-by: Eric Buehler <eric@huggingface.co>

* Server Core: refactor Paged Attention configuration (EricLBuehler#1500)

* Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506)

* Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507)

* FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400)

* Add most of paged attn kv quant

* It builds a bit

* All the functionality at least

* Small fix

* Add a scale

* Fix bf16 usage

* Make k_v_scale optional

* Collector

* Tweak collection

* Refactor

* Add to apis

* Add cuda impl

* Fix compilation

* Fixes

* Handle ENABLE_FP8

* Format

* Tweak

* Fix scaled_convert usage

* Fix cache_t size

* Fixed scale collection

* Actual fix

* Fix fp8 for CC<8

* Fix the usual String != &str bit (EricLBuehler#1483)

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Handle USE_FP8 for cuda

* Fix cuda warn

* Add readme

* Saturating sub in sequence state

---------

Co-authored-by: Eric Buehler <eric@huggingface.co>
Co-authored-by: RageLtMan <sempervictus@users.noreply.github.com>
Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
Co-authored-by: Guoqing Bao <topon@outlook.com>
Co-authored-by: Matthew Haynes <70829360+matthewhaynesonline@users.noreply.github.com>

* Validate model name in OpenAI API (EricLBuehler#1509)

* Validate model name in openai api

* Add docs, allow 'ignore'

* Updated examples for EricLBuehler#1509

* Fix mcp import in doc string (EricLBuehler#1510)

* Add multi-model support! (EricLBuehler#1512)

* Refactor MistralRs

* Working multi-model!

* Add multi-model docs initially

* Update mistralrs-pyo3, mistralrs-bench, mistralrs

* Update apis for consistency

* API tweaks

* Logging tweaks

* Add examples, tweak cli

* Clearer pipeline id

* Fix config key semantics

* Format and clippy

* Tweak logging, fix example

* Clippy refactor

* Update examples

* Remove unused multi model docs

* Replace 'ignore' with 'default'

* Update docs

* Add stars label to readme (EricLBuehler#1513)

* Add CLAUDE.md

* Handle base_model.model case in lora (EricLBuehler#1514)

* Add thread_local! for engine-specific const/static (EricLBuehler#1517)

* Fix MCP doc test (EricLBuehler#1511)

* Allow disabling metal precompilation (EricLBuehler#1518)

* Allow disabling metal precompilation

* Simple preprocessor

* Simple docs

---------

Co-authored-by: Eric Buehler <eric@huggingface.co>

* Rust 1.88 clippy (EricLBuehler#1522)

* Rust 1.88 clippy

* Format

* Fix cuda warnings (EricLBuehler#1526)

* Avoid panic decoding tokens on error (EricLBuehler#1527)

* Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525)

* Split Marlin and Paged Attention kernels for faster build

* Typo fix

* chore: update llguidance (EricLBuehler#1535)

* chore: update llguidance

* chore: remove unused import

* Add the SmolLM3 model! (EricLBuehler#1501)

* Add model

* Update loader

* Fix llama config usage

* Docs

* Fix config no_rope_layers

* Fix tie_word_embeddings default

* Add chat template

* Embed the chat templates

* Fix embedding template

* enable_thinking default true

* Update examples

* XML tools for smollm3

* Add smollm3 docs

* Fix openai examples

* Clippy

---------

Co-authored-by: Eric Buehler <eric@huggingface.co>

* Add full Gemma 3n support! (EricLBuehler#1519)

* Add initial

* Loading for text model

* Add ple embeddings

* Add altup, laurel block

* Update rmsnorm

* Add mlp

* Update attn norm application

* Currently no kv shared

* Wire it up

* It runs

* Fix bf16

* Fix scaled embd

* Fixes for mean

* tmp

* Attn confirmed

* Fix target_magnitude

* Add shared kv

* Ok it works

* Remove npy

* Fix streaming

* Remove warnings

* Remove paged attn

* Refactor rope

* Add immediate isq

* Add vision & mproj

* Update image processor

* Vision merge runs, not correct

* Remove

* Add mobilenet v5

* Add multimodal vision embedding

* Fix load

* runs

* Fix gamma

* Works but just not vision tower

* It works!!

* Tweak

* Fix warnings

* Move vision tower

* Fix warn

* Update cache manager things

* Refactor

* Add audio model, it loads

* Add audio processing

* It runs at least

* tmp

* A bit better

* Audio works!!!!

* Fused attn in vision

* Clippy

* Update audio runner

* Optimized audio model

* Remove unused things

* Fix inputs processor bug

* Remove comments

* Clippy

* Small optimizations

* Format

* Correctly register modalities

* Add docs

* Update readme

* Runs there

* Fixed padding from Blaizzy/mlx-vlm#410

* Add better checks

* Fix sdpa n_kv_groups

* Vision encoder works!

* Rotate image

* Clippy

* Fix cuda loading

* Updated device mapper

* Fix overflow

* Fix dtype errors

* Refactor image/audio embeddings

* Fix metal

* Fix dtype mismatch

* Audio processing fixes

* Audio processing fixes

* Works

* Audio is good

* Fix boi/eoi too

* Embed the chat templates

* Better embedding accuracy in non f32

* More f32

* Support bf16 on metal

* Add more ISQ

* Fixed device map

* Clippy

* Gemma3n no paged attn

* Fix saturating sub

* Faster rmsnorm

* Use sdpa for vision model

* Fix ple bug

* Fix name

* Fix multiaudio

* Add matformer config loading

* Add docs

* Add support for matformer in auto device mapper

* Update docs

* Typos

* Tweak

* Tweak

* Fix multidevice

* Fix gemma3n text model auto device map

* Fix dims3

* Fix auto device map vision

* Non-metal keeps PLE on cpu

* Complete merge

* Vision dtype f16 -> f32

* Fix metal nm device

* Fix uqff

* Typos

* Reference uqff

* Fix tests

* Fix sequence length check (EricLBuehler#1546)

* update candle version (EricLBuehler#1545)

Co-authored-by: AlpineVibrations <pro@pro.com>

* add ios target to metal deps (EricLBuehler#1548)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
Co-authored-by: edwko <187129830+edwko@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guoqing Bao <topon@outlook.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Co-authored-by: Steph Wolski <5911086+Slowki@users.noreply.github.com>
Co-authored-by: omahs <73983677+omahs@users.noreply.github.com>
Co-authored-by: Viktor Szépe <viktor@szepe.net>
Co-authored-by: Matthew Haynes <70829360+matthewhaynesonline@users.noreply.github.com>
Co-authored-by: RageLtMan <sempervictus@users.noreply.github.com>
Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
Co-authored-by: Eric Buehler <eric@huggingface.co>
Co-authored-by: Sbargaoui <bargaoui.sam@gmail.com>
Co-authored-by: Gaétan Lepage <33058747+GaetanLepage@users.noreply.github.com>
Co-authored-by: Ammar Elsabe <ayasser763@gmail.com>
Co-authored-by: luke <10145679+AlpineVibrations@users.noreply.github.com>
Co-authored-by: AlpineVibrations <pro@pro.com>
Co-authored-by: Michael Tissen <rubiktubik@googlemail.com>

Development

Successfully merging this pull request may close these issues.

NV Cosmos Failing Device Local Storage
