
[pull] main from huggingface:main #41

Open

pull[bot] wants to merge 377 commits into EricLBuehler:main from huggingface:main

Conversation


@pull pull bot commented Nov 19, 2024

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

LaurentMazare and others added 22 commits April 19, 2025 10:07
* Add the const-set op.

* Cuda implementation.

* Bugfix.

* Metal cleanup.

* Add the metal kernels.

* Add some testing.

* Finish the metal implementation.

* Bump the version.
* fixed quantized-gemma example

* lint
* gemma3: changed RotaryEmbedding base freq based on layer and sliding window

* Changed attention mask per layer, either normal or sliding

* made attention mask creation slightly more efficient by only creating them once per model iteration

* changed is_sliding to an Option

* clippy

* changed generation to treat both <eos> and <end_of_turn> as stop tokens, rather than only one of them
* removed scale factor from computation and made quantized gemma3 work similarly to non-quantized gemma3

* created default consts, replaced is_sliding with Option holding a window_size
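The per-layer masking described above (a normal causal mask for some layers, a sliding-window mask for others) can be sketched in plain Rust. This is an illustrative standalone helper, not candle's actual implementation; the `layer_mask` name and the `Option<usize>` window convention mirror the commit's `is_sliding`-to-`Option` change but are otherwise assumptions:

```rust
/// Build a causal attention mask for one layer. `window_size == None`
/// means full causal attention; `Some(w)` restricts each query token to
/// the `w` most recent keys (sliding-window attention).
/// 1 = attend, 0 = masked out.
fn layer_mask(seq_len: usize, window_size: Option<usize>) -> Vec<Vec<u8>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| {
                    // Causal: key j must not be in the future of query i.
                    // The window check only runs when j <= i, so i - j
                    // cannot underflow.
                    let allowed = j <= i
                        && match window_size {
                            Some(w) => i - j < w,
                            None => true,
                        };
                    u8::from(allowed)
                })
                .collect()
        })
        .collect()
}

fn main() {
    // Full causal mask for 4 tokens: lower triangle of ones.
    assert_eq!(
        layer_mask(4, None),
        vec![
            vec![1, 0, 0, 0],
            vec![1, 1, 0, 0],
            vec![1, 1, 1, 0],
            vec![1, 1, 1, 1],
        ]
    );
    // Sliding window of 2: each row keeps at most the 2 most recent keys.
    assert_eq!(
        layer_mask(4, Some(2)),
        vec![
            vec![1, 0, 0, 0],
            vec![1, 1, 0, 0],
            vec![0, 1, 1, 0],
            vec![0, 0, 1, 1],
        ]
    );
    println!("ok");
}
```

Building these once per model iteration, rather than per layer, works because only two distinct masks exist per forward pass: the full-causal one and the sliding-window one.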
* Add the scatter op.

* Backprop support.

* Cuda support.
* Add the scatter_set op.

* Metal op.

* Cuda version.

* Merge the checks.

* Add the actual ops.
* Support for (un)-batched rope.

* Use 3d rope in the rope/ropei/rope_thd functions.

* Get the CPU versions to work.

* Fix the cuda version.

* Adapt the metal side.

* Fix the metal tests.
* Optimize Tensor::new when called on nested Vec<..>.

* Improve performance.

* Similar flattening for the 4d case.

* More tweaks.

* Add some dummy test.
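The idea behind this `Tensor::new` optimization, flattening a nested `Vec` into one contiguous buffer plus a shape instead of copying element by element, can be sketched as follows. This is a hypothetical 2-D helper to show the technique, not candle's actual code:

```rust
/// Flatten a nested Vec into one contiguous buffer plus its shape,
/// validating that every row has the same length (no ragged input).
fn flatten_2d(rows: Vec<Vec<f32>>) -> Result<(Vec<f32>, (usize, usize)), String> {
    let dim0 = rows.len();
    let dim1 = rows.first().map_or(0, |r| r.len());
    let mut data = Vec::with_capacity(dim0 * dim1);
    for row in rows {
        if row.len() != dim1 {
            return Err(format!("ragged row: expected {dim1}, got {}", row.len()));
        }
        data.extend(row); // one bulk append per row, no per-element indexing
    }
    Ok((data, (dim0, dim1)))
}

fn main() {
    let (data, shape) = flatten_2d(vec![vec![1.0, 2.0], vec![3.0, 4.0]]).unwrap();
    assert_eq!(shape, (2, 2));
    assert_eq!(data, vec![1.0, 2.0, 3.0, 4.0]);
    // Ragged input is rejected rather than silently mis-shaped.
    assert!(flatten_2d(vec![vec![1.0], vec![2.0, 3.0]]).is_err());
    println!("ok");
}
```

The same flattening generalizes to 3-D and 4-D nesting (per the "Similar flattening for the 4d case" commit) by recursing one level at a time and appending a dimension to the shape.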
* tracing page

* warned about asynchronous execution

* cleanup

* added Nsight Systems recommendation
* Add a scattered kv cache.

* Update some comments.
* add Qwen3.rs

* fixed compile error

* attempting to get PR 2903 working with qwen weights

* different qwen variants working

* added moe model

* clippy

* added additional eos token

* translated Korean comments to English as well as I can

* removed specialized Qwen3RmsNorm and replaced with generic Candle RmsNorm

* replaced custom repeat_kv implementation with candle's repeat_kv implementation

* replaced linear with linear_b in attention initialization

* replaced custom kv_cache implementation with candle kv_cache

* style

* replaced explicit broadcast add with normal add in decoder layer

* removed keeping the Rotary embedding layer in the model struct

* used the tie_word_embeddings bool from config instead of relying on the existence of lm-head weights in CausalLM

* removed duplicate code from qwen3_moe

* removed sliding window from qwen3 attention

* removed MoE code

* removed unused option

* Fixed Typo

Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* fixed tie word embeddings to use the correct embedding weights instead of the opposite

---------

Co-authored-by: Max <naturale@hufs.ac.kr>
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Indexing with max-value results in zero/no-op.

* Add some testing.

* Also adapt the metal kernels.

* Another test.

* Fix.
* fixed quantized_phi3 implementation

* quantized_qwen3 implementation

* Update quantized_phi3.rs

* Update quantized_phi3.rs

* add quantized_qwen3 example

* Clippy fixes.

* Cleanup.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>

coderabbitai bot commented May 8, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Join our Discord community for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

greenrazer and others added 4 commits May 10, 2025 07:05
* added resize to candle-onnx, not currently working

* changed unreachable to bail, and bailed when both scales and sizes are set

* cleanup and added other unused options for this op

* cleanup

* fixed image loading to make output work

* cleanup and removed unused variables

* removed path creation code, and changed unwrap to ?
* optimize KV cache to reduce GPU memory usage

* revert to using candle_nn::kv_cache::KvCache with initial capacity of 512
* OLMo 2 model

* Update olmo-2 to example

* Clippy fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
olafurjohannsson and others added 30 commits February 18, 2026 20:40
Co-authored-by: danielclough <danielclough@users.noreply.github.com>
…3387)

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Remove a small CPU-GPU coherency overhead from intermediate buffers

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
The tokenizers crate depends on onig_sys (native regex C library) which cannot compile for wasm32 targets.

This gates both the Cargo.toml dependency and the module declaration behind
cfg(not(target_arch = "wasm32"))

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
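The double gating described above can be illustrated with a Cargo target-specific dependency table; the version and feature names below are illustrative, not the exact lines from the PR:

```toml
# Cargo.toml: only pull in tokenizers (and transitively onig_sys, a native
# C regex library that cannot compile for wasm32) on non-wasm targets.
[target.'cfg(not(target_arch = "wasm32"))'.dependencies]
tokenizers = { version = "0.21", default-features = false, features = ["onig"] }
```

The matching module declaration is gated the same way on the Rust side, e.g. `#[cfg(not(target_arch = "wasm32"))] mod some_tokenizer_module;` (module name hypothetical). Both gates are needed: the dependency gate keeps the C library out of the wasm build graph, and the `cfg` attribute keeps the code that uses it from being compiled.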
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* feat: add #[non_exhaustive] to DType enum

Closes #3333

Adding new variants to a public enum is a breaking change for downstream
crates that use exhaustive match statements. Mark DType as non_exhaustive
so future variant additions do not require a semver-breaking release.

The only external-crate match affected within the workspace is in
candle-pyo3, which now has a wildcard arm returning an unsupported dtype
error.

* fmt

* black fmt

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
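The effect of `#[non_exhaustive]` on downstream crates can be shown with a minimal sketch. The enum and function below are illustrative, not candle's actual `DType` definition or the candle-pyo3 code:

```rust
// With #[non_exhaustive], external crates cannot write an exhaustive
// match on this enum: the compiler forces a wildcard arm, so adding a
// new variant later is no longer a semver-breaking change for them.
#[non_exhaustive]
pub enum DType {
    F32,
    F16,
    U8,
}

// What a downstream match looks like after the change: known variants
// are handled, and the required wildcard arm returns an
// "unsupported dtype" error (as the candle-pyo3 fix does).
fn dtype_size(d: &DType) -> Result<usize, String> {
    match d {
        DType::F32 => Ok(4),
        DType::F16 => Ok(2),
        _ => Err("unsupported dtype".to_string()),
    }
}

fn main() {
    assert_eq!(dtype_size(&DType::F32), Ok(4));
    assert_eq!(dtype_size(&DType::F16), Ok(2));
    // U8 falls through to the wildcard arm in this sketch.
    assert!(dtype_size(&DType::U8).is_err());
    println!("ok");
}
```

Note that `#[non_exhaustive]` only constrains *other* crates; within the defining crate, matches may still be exhaustive.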
* feat(quantized_llama): rectangular causal mask for prefix KV caching

Previously `mask()` always created a square (seq_len × seq_len) mask.
When a prefix KV cache is pre-populated (index_pos > 0), attention scores
have shape (seq_len × (index_pos + seq_len)), so broadcasting the square
mask failed with:

  cannot broadcast [seq_len, seq_len] to [batch, heads, seq_len, kv_len]

Fix: pass `index_pos` into `mask()` and build a (seq_len, kv_len) mask
where kv_len = index_pos + seq_len.

- First `index_pos` columns = 0  → every query attends to all prefix keys
- Last `seq_len` columns = standard causal triangle

When index_pos == 0 the mask is still square — fully backwards compatible.

The mask cache key changes from usize to (usize, usize) to accommodate
different (seq_len, kv_len) pairs in the same session.

This enables batched user-turn prefill after KV-cache prefix restoration,
making prefix KV caching actually fast (one batched forward instead of
feeding tokens one at a time to avoid the mask crash).

* fix(models): rectangular causal mask for prefix KV caching across all affected models

Extend the quantized_llama rectangular mask fix to all models that share
the same square-mask + HashMap<usize> cache pattern:

- llama.rs
- llama2_c.rs
- quantized_llama2_c.rs
- quantized_phi.rs
- quantized_phi3.rs
- quantized_qwen2.rs
- quantized_lfm2.rs
- granite.rs
- granitemoehybrid.rs
- voxtral/voxtral_llama.rs

Shared utility: move `build_causal_mask(seq_len, index_pos, device)` into
`crate::utils` so all models call a single implementation.

Also add 5 unit tests for `build_causal_mask` in quantized_llama.rs covering:
- square shape (index_pos=0)
- rectangular shape (index_pos>0)
- correct values for square and rectangular cases
- single-query with prefix (all-zero row)
- broadcast compatibility with (batch, heads, seq_len, kv_len) attention shape

Co-Authored-By: Arthur Zucker <arthur.zucker@gmail.com>

* style: rustfmt + remove unused repeat_n import in granitemoehybrid

* fix(tests): remove unused super::* import in quantized_llama tests
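The rectangular mask described in this commit can be sketched as a standalone function. This follows the commit's description of `build_causal_mask(seq_len, index_pos, device)` but drops the device argument and returns a plain nested `Vec` so it is self-contained; it is a sketch of the logic, not candle's exact code:

```rust
/// Build a (seq_len, kv_len) causal mask with kv_len = index_pos + seq_len.
/// 1 = masked out, 0 = attend. The first `index_pos` columns are all zero
/// (every query attends to the whole cached prefix); the remaining
/// `seq_len` columns form the standard causal triangle. When
/// index_pos == 0 the mask is square, matching the old behavior.
fn build_causal_mask(seq_len: usize, index_pos: usize) -> Vec<Vec<u8>> {
    let kv_len = index_pos + seq_len;
    (0..seq_len)
        .map(|i| {
            // Query i sits at absolute position index_pos + i, so it may
            // attend to any key j with j <= index_pos + i.
            (0..kv_len).map(|j| u8::from(j > index_pos + i)).collect()
        })
        .collect()
}

fn main() {
    // index_pos = 0: the ordinary square causal mask.
    assert_eq!(
        build_causal_mask(3, 0),
        vec![vec![0, 1, 1], vec![0, 0, 1], vec![0, 0, 0]]
    );
    // index_pos = 2: rectangular (2, 4) mask; prefix columns are all zero.
    assert_eq!(
        build_causal_mask(2, 2),
        vec![vec![0, 0, 0, 1], vec![0, 0, 0, 0]]
    );
    // Single query with a prefix: an all-zero row, as the unit tests cover.
    assert_eq!(build_causal_mask(1, 3), vec![vec![0, 0, 0, 0]]);
    println!("ok");
}
```

Caching these by the `(seq_len, kv_len)` pair, rather than by `seq_len` alone, lets square and rectangular masks coexist in one session, which is exactly why the commit changes the cache key.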
* Implement the new Google model

* Fix model

Labels

⤵️ pull merge-conflict Resolve conflicts manually
