
[pull] main from huggingface:main #41

Open

pull[bot] wants to merge 377 commits into EricLBuehler:main from huggingface:main

Conversation


@pull pull bot commented Nov 19, 2024

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

LaurentMazare and others added 22 commits April 19, 2025 10:07
* Add the const-set op.

* Cuda implementation.

* Bugfix.

* Metal cleanup.

* Add the metal kernels.

* Add some testing.

* Finish the metal implementation.

* Bump the version.
* fixed quantized-gemma example

* lint
* gemma3: changed RotaryEmbedding base freq based on layer and sliding window

* Changed attention mask per layer, either normal or sliding

* made attention mask creation slightly more efficient by only creating them once per model iteration

* changed is_sliding to an Option

* clippy

* changed generation to treat both <eos> and <end_of_turn> as stop tokens, rather than only one of them
* removed scale factor from computation and made quantized gemma3 work similarly to non-quantized gemma3

* created default consts, replaced is_sliding with Option holding a window_size
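The per-layer masking described above (a normal causal mask for some layers, a sliding-window mask for others) can be sketched in plain Rust. This is an illustrative standalone helper, not candle's actual implementation; the `layer_mask` name and the `Option<usize>` window convention mirror the commit's `is_sliding`-to-`Option` change but are otherwise assumptions:

```rust
/// Build a causal attention mask for one layer. `window_size == None`
/// means full causal attention; `Some(w)` restricts each query token to
/// the `w` most recent keys (sliding-window attention).
/// 1 = attend, 0 = masked out.
fn layer_mask(seq_len: usize, window_size: Option<usize>) -> Vec<Vec<u8>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| {
                    // Causal: key j must not be in the future of query i.
                    // The window check only runs when j <= i, so i - j
                    // cannot underflow.
                    let allowed = j <= i
                        && match window_size {
                            Some(w) => i - j < w,
                            None => true,
                        };
                    u8::from(allowed)
                })
                .collect()
        })
        .collect()
}

fn main() {
    // Full causal mask for 4 tokens: lower triangle of ones.
    assert_eq!(
        layer_mask(4, None),
        vec![
            vec![1, 0, 0, 0],
            vec![1, 1, 0, 0],
            vec![1, 1, 1, 0],
            vec![1, 1, 1, 1],
        ]
    );
    // Sliding window of 2: each row keeps at most the 2 most recent keys.
    assert_eq!(
        layer_mask(4, Some(2)),
        vec![
            vec![1, 0, 0, 0],
            vec![1, 1, 0, 0],
            vec![0, 1, 1, 0],
            vec![0, 0, 1, 1],
        ]
    );
    println!("ok");
}
```

Building these once per model iteration, rather than per layer, works because only two distinct masks exist per forward pass: the full-causal one and the sliding-window one.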
* Add the scatter op.

* Backprop support.

* Cuda support.
* Add the scatter_set op.

* Metal op.

* Cuda version.

* Merge the checks.

* Add the actual ops.
* Support for (un)-batched rope.

* Use 3d rope in the rope/ropei/rope_thd functions.

* Get the CPU versions to work.

* Fix the cuda version.

* Adapt the metal side.

* Fix the metal tests.
* Optimize Tensor::new when called on nested Vec<..>.

* Improve performance.

* Similar flattening for the 4d case.

* More tweaks.

* Add some dummy test.
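The idea behind this `Tensor::new` optimization, flattening a nested `Vec` into one contiguous buffer plus a shape instead of copying element by element, can be sketched as follows. This is a hypothetical 2-D helper to show the technique, not candle's actual code:

```rust
/// Flatten a nested Vec into one contiguous buffer plus its shape,
/// validating that every row has the same length (no ragged input).
fn flatten_2d(rows: Vec<Vec<f32>>) -> Result<(Vec<f32>, (usize, usize)), String> {
    let dim0 = rows.len();
    let dim1 = rows.first().map_or(0, |r| r.len());
    let mut data = Vec::with_capacity(dim0 * dim1);
    for row in rows {
        if row.len() != dim1 {
            return Err(format!("ragged row: expected {dim1}, got {}", row.len()));
        }
        data.extend(row); // one bulk append per row, no per-element indexing
    }
    Ok((data, (dim0, dim1)))
}

fn main() {
    let (data, shape) = flatten_2d(vec![vec![1.0, 2.0], vec![3.0, 4.0]]).unwrap();
    assert_eq!(shape, (2, 2));
    assert_eq!(data, vec![1.0, 2.0, 3.0, 4.0]);
    // Ragged input is rejected rather than silently mis-shaped.
    assert!(flatten_2d(vec![vec![1.0], vec![2.0, 3.0]]).is_err());
    println!("ok");
}
```

The same flattening generalizes to 3-D and 4-D nesting (per the "Similar flattening for the 4d case" commit) by recursing one level at a time and appending a dimension to the shape.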
* tracing page

* warned about asynchronous execution

* cleanup

* added Nsight Systems recommendation
* Add a scattered kv cache.

* Update some comments.
* add Qwen3.rs

* fixed compile error

* attempting to get PR 2903 working with qwen weights

* different qwen variants working

* added moe model

* clippy

* added additional eos token

* translated Korean comments to English as well as I can

* removed specialized Qwen3RmsNorm and replaced with generic Candle RmsNorm

* replaced custom repeat_kv implementation with candle's repeat_kv implementation

* replaced linear with linear_b in attention initialization

* replaced custom kv_cache implementation with candle kv_cache

* style

* replaced explicit broadcast add with normal add in decoder layer

* removed keeping the Rotary embedding layer in the model struct

* used the tie_word_embeddings bool from config instead of relying on the existence of lm-head weights in CausalLM

* removed duplicate code from qwen3_moe

* removed sliding window from qwen3 attention

* removed MoE code

* removed unused option

* Fixed Typo

Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* fixed tie word embeddings to use the correct embedding weights instead of the opposite

---------

Co-authored-by: Max <naturale@hufs.ac.kr>
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Indexing with max-value results in zero/no-op.

* Add some testing.

* Also adapt the metal kernels.

* Another test.

* Fix.
* fixed quantized_phi3 implementation

* quantized_qwen3 implementation

* Update quantized_phi3.rs

* Update quantized_phi3.rs

* add quantized_qwen3 example

* Clippy fixes.

* Cleanup.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>

coderabbitai bot commented May 8, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Join our Discord community for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

greenrazer and others added 4 commits May 10, 2025 07:05
* added resize to candle-onnx, not currently working

* changed unreachable to bail, and bailed when both scales and sizes are set

* cleanup and added other unused options for this op

* cleanup

* fixed image loading to make output work

* cleanup and removed unused variables

* removed path creation code, and changed unwrap to ?
* optimize KV cache to reduce GPU memory usage

* revert to using candle_nn::kv_cache::KvCache with initial capacity of 512
* OLMo 2 model

* Update olmo-2 to example

* Clippy fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
olafurjohannsson and others added 30 commits February 18, 2026 20:40
Co-authored-by: danielclough <danielclough@users.noreply.github.com>
…3387)

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Remove a small CPU-GPU coherency overhead from intermediate buffers

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
The tokenizers crate depends on onig_sys (native regex C library) which cannot compile for wasm32 targets.

This gates both the Cargo.toml dependency and the module declaration behind
cfg(not(target_arch = "wasm32"))

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
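The double gating described above can be illustrated with a Cargo target-specific dependency table; the version and feature names below are illustrative, not the exact lines from the PR:

```toml
# Cargo.toml: only pull in tokenizers (and transitively onig_sys, a native
# C regex library that cannot compile for wasm32) on non-wasm targets.
[target.'cfg(not(target_arch = "wasm32"))'.dependencies]
tokenizers = { version = "0.21", default-features = false, features = ["onig"] }
```

The matching module declaration is gated the same way on the Rust side, e.g. `#[cfg(not(target_arch = "wasm32"))] mod some_tokenizer_module;` (module name hypothetical). Both gates are needed: the dependency gate keeps the C library out of the wasm build graph, and the `cfg` attribute keeps the code that uses it from being compiled.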
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* feat: add #[non_exhaustive] to DType enum

Closes #3333

Adding new variants to a public enum is a breaking change for downstream
crates that use exhaustive match statements. Mark DType as non_exhaustive
so future variant additions do not require a semver-breaking release.

The only external-crate match affected within the workspace is in
candle-pyo3, which now has a wildcard arm returning an unsupported dtype
error.

* fmt

* black fmt

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
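The effect of `#[non_exhaustive]` on downstream crates can be shown with a minimal sketch. The enum and function below are illustrative, not candle's actual `DType` definition or the candle-pyo3 code:

```rust
// With #[non_exhaustive], external crates cannot write an exhaustive
// match on this enum: the compiler forces a wildcard arm, so adding a
// new variant later is no longer a semver-breaking change for them.
#[non_exhaustive]
pub enum DType {
    F32,
    F16,
    U8,
}

// What a downstream match looks like after the change: known variants
// are handled, and the required wildcard arm returns an
// "unsupported dtype" error (as the candle-pyo3 fix does).
fn dtype_size(d: &DType) -> Result<usize, String> {
    match d {
        DType::F32 => Ok(4),
        DType::F16 => Ok(2),
        _ => Err("unsupported dtype".to_string()),
    }
}

fn main() {
    assert_eq!(dtype_size(&DType::F32), Ok(4));
    assert_eq!(dtype_size(&DType::F16), Ok(2));
    // U8 falls through to the wildcard arm in this sketch.
    assert!(dtype_size(&DType::U8).is_err());
    println!("ok");
}
```

Note that `#[non_exhaustive]` only constrains *other* crates; within the defining crate, matches may still be exhaustive.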
* feat(quantized_llama): rectangular causal mask for prefix KV caching

Previously `mask()` always created a square (seq_len × seq_len) mask.
When a prefix KV cache is pre-populated (index_pos > 0), attention scores
have shape (seq_len × (index_pos + seq_len)), so broadcasting the square
mask failed with:

  cannot broadcast [seq_len, seq_len] to [batch, heads, seq_len, kv_len]

Fix: pass `index_pos` into `mask()` and build a (seq_len, kv_len) mask
where kv_len = index_pos + seq_len.

- First `index_pos` columns = 0  → every query attends to all prefix keys
- Last `seq_len` columns = standard causal triangle

When index_pos == 0 the mask is still square — fully backwards compatible.

The mask cache key changes from usize to (usize, usize) to accommodate
different (seq_len, kv_len) pairs in the same session.

This enables batched user-turn prefill after KV-cache prefix restoration,
making prefix KV caching actually fast (one batched forward instead of
feeding tokens one at a time to avoid the mask crash).

* fix(models): rectangular causal mask for prefix KV caching across all affected models

Extend the quantized_llama rectangular mask fix to all models that share
the same square-mask + HashMap<usize> cache pattern:

- llama.rs
- llama2_c.rs
- quantized_llama2_c.rs
- quantized_phi.rs
- quantized_phi3.rs
- quantized_qwen2.rs
- quantized_lfm2.rs
- granite.rs
- granitemoehybrid.rs
- voxtral/voxtral_llama.rs

Shared utility: move `build_causal_mask(seq_len, index_pos, device)` into
`crate::utils` so all models call a single implementation.

Also add 5 unit tests for `build_causal_mask` in quantized_llama.rs covering:
- square shape (index_pos=0)
- rectangular shape (index_pos>0)
- correct values for square and rectangular cases
- single-query with prefix (all-zero row)
- broadcast compatibility with (batch, heads, seq_len, kv_len) attention shape

Co-Authored-By: Arthur Zucker <arthur.zucker@gmail.com>

* style: rustfmt + remove unused repeat_n import in granitemoehybrid

* fix(tests): remove unused super::* import in quantized_llama tests
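The rectangular mask described in this commit can be sketched as a standalone function. This follows the commit's description of `build_causal_mask(seq_len, index_pos, device)` but drops the device argument and returns a plain nested `Vec` so it is self-contained; it is a sketch of the logic, not candle's exact code:

```rust
/// Build a (seq_len, kv_len) causal mask with kv_len = index_pos + seq_len.
/// 1 = masked out, 0 = attend. The first `index_pos` columns are all zero
/// (every query attends to the whole cached prefix); the remaining
/// `seq_len` columns form the standard causal triangle. When
/// index_pos == 0 the mask is square, matching the old behavior.
fn build_causal_mask(seq_len: usize, index_pos: usize) -> Vec<Vec<u8>> {
    let kv_len = index_pos + seq_len;
    (0..seq_len)
        .map(|i| {
            // Query i sits at absolute position index_pos + i, so it may
            // attend to any key j with j <= index_pos + i.
            (0..kv_len).map(|j| u8::from(j > index_pos + i)).collect()
        })
        .collect()
}

fn main() {
    // index_pos = 0: the ordinary square causal mask.
    assert_eq!(
        build_causal_mask(3, 0),
        vec![vec![0, 1, 1], vec![0, 0, 1], vec![0, 0, 0]]
    );
    // index_pos = 2: rectangular (2, 4) mask; prefix columns are all zero.
    assert_eq!(
        build_causal_mask(2, 2),
        vec![vec![0, 0, 0, 1], vec![0, 0, 0, 0]]
    );
    // Single query with a prefix: an all-zero row, as the unit tests cover.
    assert_eq!(build_causal_mask(1, 3), vec![vec![0, 0, 0, 0]]);
    println!("ok");
}
```

Caching these by the `(seq_len, kv_len)` pair, rather than by `seq_len` alone, lets square and rectangular masks coexist in one session, which is exactly why the commit changes the cache key.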
* Implement the new Google model

* Fix model

Labels

⤵️ pull merge-conflict Resolve conflicts manually
