[pull] main from huggingface:main#41
Open
pull[bot] wants to merge 377 commits into EricLBuehler:main from huggingface:main
* Add the const-set op.
* Cuda implementation.
* Bugfix.
* Metal cleanup.
* Add the metal kernels.
* Add some testing.
* Finish the metal implementation.
* Bump the version.
* fixed quantized-gemma example
* lint
* gemma3: changed RotaryEmbedding base freq based on layer and sliding window
* Changed attention mask per layer, either normal or sliding
* made attention mask creation slightly more efficient by only creating them once per model iteration
* changed is_sliding to an Option
* clippy
* changed to stop on both <eos> and <end_of_turn> instead of either one alone
* removed scale factor from computation and made quantized gemma3 work similarly to non-quantized gemma3
* created default consts, replaced is_sliding with Option holding a window_size
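The per-layer mask selection described above can be sketched in plain Rust. The `Option<window_size>` convention follows the commits; the function name, the 2-D `Vec` representation, and the `NEG_INFINITY` encoding are illustrative, not candle's actual API:

```rust
/// Build a causal mask as 0.0 (attend) / f32::NEG_INFINITY (blocked).
/// `window = Some(w)` gives a sliding-window layer: query i may attend
/// keys j with i - w < j <= i. `window = None` gives a full causal mask.
fn causal_mask(seq_len: usize, window: Option<usize>) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| {
                    let causal = j <= i;
                    let in_window = window.map_or(true, |w| i < j + w);
                    if causal && in_window { 0.0 } else { f32::NEG_INFINITY }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let full = causal_mask(4, None);
    let sliding = causal_mask(4, Some(2));
    // Full causal mask: the last query attends to every earlier position.
    assert!(full[3].iter().all(|&v| v == 0.0));
    // Window of 2: query 3 only attends to positions 2 and 3.
    assert_eq!(sliding[3][1], f32::NEG_INFINITY);
    assert_eq!(sliding[3][2], 0.0);
}
```

Building both variants once per model iteration (rather than per layer) matches the efficiency note in the commits: layers only differ in which of the two masks they select.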
* Add the scatter op.
* Backprop support.
* Cuda support.
* Add the scatter_set op.
* Metal op.
* Cuda version.
* Merge the checks.
* Add the actual ops.
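For readers unfamiliar with the op, scatter_set semantics in one dimension can be sketched without any tensor library. This is an illustrative 1-D reduction of the idea, not candle's signature:

```rust
/// 1-D sketch of scatter_set: writes src[i] into dst[index[i]].
/// Later writes win when the same destination index appears twice.
fn scatter_set(dst: &mut [f32], index: &[usize], src: &[f32]) {
    assert_eq!(index.len(), src.len());
    for (i, &idx) in index.iter().enumerate() {
        dst[idx] = src[i];
    }
}

fn main() {
    let mut dst = vec![0.0; 5];
    scatter_set(&mut dst, &[4, 1, 2], &[10.0, 20.0, 30.0]);
    assert_eq!(dst, vec![0.0, 20.0, 30.0, 0.0, 10.0]);
}
```

The real op generalizes this along a chosen dimension, and the backprop support mentioned above routes gradients back through the same index mapping.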
* Support for (un)-batched rope.
* Use 3d rope in the rope/ropei/rope_thd functions.
* Get the CPU versions to work.
* Fix the cuda version.
* Adapt the metal side.
* Fix the metal tests.
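As background for the rope changes above, here is a minimal single-vector sketch of rotary embedding in the non-interleaved layout (first half of the head dimension pairs with the second half). The function name and slice-based signature are illustrative assumptions, not candle's API:

```rust
/// Apply a rotary embedding to one head vector of even length `2 * half`.
/// `cos`/`sin` hold the per-pair rotation angles for this position.
fn rope(x: &[f32], cos: &[f32], sin: &[f32]) -> Vec<f32> {
    let half = x.len() / 2;
    let mut out = vec![0.0; x.len()];
    for i in 0..half {
        // 2-D rotation of the pair (x[i], x[i + half]).
        out[i] = x[i] * cos[i] - x[i + half] * sin[i];
        out[i + half] = x[i + half] * cos[i] + x[i] * sin[i];
    }
    out
}

fn main() {
    // With angle 0 (cos = 1, sin = 0) the rotation is the identity.
    let x = vec![1.0, 2.0, 3.0, 4.0];
    assert_eq!(rope(&x, &[1.0, 1.0], &[0.0, 0.0]), x);
}
```

The batched/3d variants in the commits extend this per-vector rotation across (batch, head, position) axes; the rotation itself is unchanged.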
* Optimize Tensor::new when called on nested Vec<..>.
* Improve performance.
* Similar flattening for the 4d case.
* More tweaks.
* Add some dummy test.
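The flattening idea behind the `Tensor::new` optimization above can be sketched for the 2-D case: copy the nested rows into one contiguous buffer and record the shape, validating that the input is not ragged. Names and the error type are illustrative:

```rust
/// Flatten a Vec<Vec<f32>> into one contiguous buffer plus a (rows, cols)
/// shape, checking that every inner row has the same length.
fn flatten_2d(v: Vec<Vec<f32>>) -> Result<(Vec<f32>, (usize, usize)), String> {
    let rows = v.len();
    let cols = v.first().map_or(0, |r| r.len());
    let mut data = Vec::with_capacity(rows * cols);
    for row in &v {
        if row.len() != cols {
            return Err("ragged nested Vec".to_string());
        }
        data.extend_from_slice(row);
    }
    Ok((data, (rows, cols)))
}

fn main() {
    let (data, shape) = flatten_2d(vec![vec![1.0, 2.0], vec![3.0, 4.0]]).unwrap();
    assert_eq!(shape, (2, 2));
    assert_eq!(data, vec![1.0, 2.0, 3.0, 4.0]);
}
```

A single pre-sized copy like this avoids building intermediate per-row tensors, which is where the performance win presumably comes from; the commits note the same flattening was applied to the 4d case.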
* tracing page
* warned about asynchronous execution
* cleanup
* added Nsight Systems recommendation
* Add a scattered kv cache.
* Update some comments.
* add Qwen3.rs
* fixed compile error
* attempting to get pr 2903 working with qwen weights
* different qwen variants working
* added moe model
* clippy
* added additional eos token
* translated Korean comments to English as well as I can
* removed specialized Qwen3RmsNorm and replaced with generic Candle RmsNorm
* replaced custom repeat_kv implementation with candle's repeat_kv implementation
* replaced linear with linear_b in attention initialization
* replaced custom kv_cache implementation with candle kv_cache
* style
* replaced explicit broadcast add with normal add in decoder layer
* removed keeping the Rotary embedding layer in the model struct
* used tie_word_embeddings bool from config instead of relying on existence of weights for lm head in CausalLM
* removed duplicate code from qwen3_moe
* removed sliding window from qwen3 attention
* removed MoE code
* removed unused option
* fixed typo
* fixed tie word embeddings to use the correct embedding weights instead of the opposite

Co-authored-by: Max <naturale@hufs.ac.kr>
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Indexing with max-value results in zero/no-op.
* Add some testing.
* Also adapt the metal kernels.
* Another test.
* Fix.
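The sentinel behavior described above (the maximum index value acting as a zero-producing no-op slot rather than an out-of-bounds access) can be sketched for a 1-D gather. This is an illustrative reduction; the function name and `usize::MAX` sentinel here are assumptions about the general pattern, not candle's exact implementation:

```rust
/// 1-D gather where the max index value is treated as a "no-op" slot:
/// instead of indexing (and panicking out of bounds), it yields 0.0.
fn gather_with_sentinel(src: &[f32], index: &[usize]) -> Vec<f32> {
    index
        .iter()
        .map(|&i| if i == usize::MAX { 0.0 } else { src[i] })
        .collect()
}

fn main() {
    let src = [10.0, 20.0, 30.0];
    let out = gather_with_sentinel(&src, &[2, usize::MAX, 0]);
    assert_eq!(out, vec![30.0, 0.0, 10.0]);
}
```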
* fixed quantized_phi3 implementation
* quantized_qwen3 implementation
* Update quantized_phi3.rs
* Update quantized_phi3.rs
* add quantized_qwen3 example
* Clippy fixes.
* Cleanup.

Co-authored-by: Laurent <laurent.mazare@gmail.com>
* added resize to candle-onnx, not currently working
* changed unreachable to bail, and bailed when both scales and sizes are set
* cleanup and added other unused options for this op
* cleanup
* fixed image loading to make output work
* cleanup and removed unused variables
* removed path creation code, and changed unwrap to ?
* optimize KV cache to reduce GPU memory usage
* revert to using candle_nn::kv_cache::KvCache with initial capacity of 512
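The point of an initial capacity like 512 is to avoid repeated grow-and-copy cycles as tokens are appended one decoding step at a time. A toy dependency-free sketch of that pattern (this stand-in struct is illustrative, not `candle_nn::kv_cache::KvCache`):

```rust
/// Toy KV cache: a flattened [positions, head_dim] key buffer that
/// pre-reserves room for `positions` entries up front.
struct KvCache {
    k: Vec<f32>,
    head_dim: usize,
}

impl KvCache {
    fn with_capacity(positions: usize, head_dim: usize) -> Self {
        // Reserving once here is what a capacity of e.g. 512 buys:
        // appends below stay within the reserved buffer.
        Self { k: Vec::with_capacity(positions * head_dim), head_dim }
    }

    fn append(&mut self, step_k: &[f32]) {
        assert_eq!(step_k.len() % self.head_dim, 0);
        self.k.extend_from_slice(step_k);
    }

    fn seq_len(&self) -> usize {
        self.k.len() / self.head_dim
    }
}

fn main() {
    let mut cache = KvCache::with_capacity(512, 4);
    cache.append(&[0.0; 4]);
    cache.append(&[1.0; 4]);
    assert_eq!(cache.seq_len(), 2);
}
```

On a GPU the same trade-off applies with more force: growing a device buffer means allocating a new one and copying, so a sensible initial capacity both reduces copies and bounds memory use.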
* OLMo 2 model
* Update olmo-2 to example
* Clippy fix.

Co-authored-by: laurent <laurent.mazare@gmail.com>
Co-authored-by: danielclough <danielclough@users.noreply.github.com>
…3387) Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Remove a small CPU-GPU coherency overhead from intermediate buffers Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
The tokenizers crate depends on onig_sys (a native regex C library) which cannot compile for wasm32 targets. This gates both the Cargo.toml dependency and the module declaration behind cfg(not(target_arch = "wasm32")).

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* feat: add #[non_exhaustive] to DType enum

  Closes #3333

  Adding new variants to a public enum is a breaking change for downstream crates that use exhaustive match statements. Mark DType as non_exhaustive so future variant additions do not require a semver-breaking release.

  The only external-crate match affected within the workspace is in candle-pyo3, which now has a wildcard arm returning an unsupported dtype error.

* fmt
* black fmt

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
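The pattern the commit describes looks like this; the `DType` enum below is a small local stand-in for illustration, not an import of candle's actual enum:

```rust
// With #[non_exhaustive], code in *other* crates cannot match exhaustively:
// it must include a wildcard arm, so adding a variant later is not a
// semver-breaking change for downstream users.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    F16,
    U32,
}

fn dtype_name(dt: DType) -> Result<&'static str, String> {
    match dt {
        DType::F32 => Ok("f32"),
        DType::F16 => Ok("f16"),
        // Wildcard arm, mirroring the candle-pyo3 fix described above:
        // unhandled (including future) variants become a runtime error
        // instead of a downstream compile error.
        _ => Err(format!("unsupported dtype: {dt:?}")),
    }
}

fn main() {
    assert_eq!(dtype_name(DType::F32).unwrap(), "f32");
    assert!(dtype_name(DType::U32).is_err());
}
```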
…oes not support f64 (#3426)
* feat(quantized_llama): rectangular causal mask for prefix KV caching

  Previously `mask()` always created a square (seq_len × seq_len) mask. When a prefix KV cache is pre-populated (index_pos > 0), attention scores have shape (seq_len × (index_pos + seq_len)), so broadcasting the square mask failed with:

      cannot broadcast [seq_len, seq_len] to [batch, heads, seq_len, kv_len]

  Fix: pass `index_pos` into `mask()` and build a (seq_len, kv_len) mask where kv_len = index_pos + seq_len.
  - First `index_pos` columns = 0 → every query attends to all prefix keys
  - Last `seq_len` columns = standard causal triangle

  When index_pos == 0 the mask is still square, so the change is fully backwards compatible. The mask cache key changes from usize to (usize, usize) to accommodate different (seq_len, kv_len) pairs in the same session.

  This enables batched user-turn prefill after KV-cache prefix restoration, making prefix KV caching actually fast (one batched forward instead of feeding tokens one at a time to avoid the mask crash).

* fix(models): rectangular causal mask for prefix KV caching across all affected models

  Extend the quantized_llama rectangular mask fix to all models that share the same square-mask + HashMap<usize> cache pattern:
  - llama.rs
  - llama2_c.rs
  - quantized_llama2_c.rs
  - quantized_phi.rs
  - quantized_phi3.rs
  - quantized_qwen2.rs
  - quantized_lfm2.rs
  - granite.rs
  - granitemoehybrid.rs
  - voxtral/voxtral_llama.rs

  Shared utility: move `build_causal_mask(seq_len, index_pos, device)` into `crate::utils` so all models call a single implementation.

  Also add 5 unit tests for `build_causal_mask` in quantized_llama.rs covering:
  - square shape (index_pos=0)
  - rectangular shape (index_pos>0)
  - correct values for square and rectangular cases
  - single-query with prefix (all-zero row)
  - broadcast compatibility with (batch, heads, seq_len, kv_len) attention shape

  Co-Authored-By: Arthur Zucker <arthur.zucker@gmail.com>

* style: rustfmt + remove unused repeat_n import in granitemoehybrid
* fix(tests): remove unused super::* import in quantized_llama tests
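The rectangular mask described above can be sketched without a tensor library. The shape and zero/one semantics follow the commit message; the 2-D `Vec<u8>` representation (1 = masked out) and the device-free signature are illustrative simplifications of the actual `build_causal_mask(seq_len, index_pos, device)`:

```rust
/// Rectangular causal mask of shape (seq_len, index_pos + seq_len).
/// The first `index_pos` columns are 0 (every query attends to all cached
/// prefix keys); the remaining columns form the standard causal triangle.
/// With index_pos == 0 this degenerates to the usual square mask.
fn build_causal_mask(seq_len: usize, index_pos: usize) -> Vec<Vec<u8>> {
    let kv_len = index_pos + seq_len;
    (0..seq_len)
        .map(|i| (0..kv_len).map(|j| u8::from(j > index_pos + i)).collect())
        .collect()
}

fn main() {
    // index_pos = 0: standard square causal mask.
    assert_eq!(build_causal_mask(2, 0), vec![vec![0, 1], vec![0, 0]]);

    // index_pos = 2: both queries see the 2 prefix keys, then the triangle.
    let m = build_causal_mask(2, 2);
    assert_eq!(m[0], vec![0, 0, 0, 1]);
    assert_eq!(m[1], vec![0, 0, 0, 0]);

    // Single query with a prefix: an all-zero row, as in the unit tests above.
    assert_eq!(build_causal_mask(1, 3), vec![vec![0, 0, 0, 0]]);
}
```

Because the shape now depends on both `seq_len` and `kv_len`, caching these masks needs a `(usize, usize)` key, exactly as the commit message notes.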
* Implement the new Google model * Fix model
🔒 Pin GitHub Actions to commit SHAs
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )