deep-gemm: add #382

Open
drbh wants to merge 20 commits into main from add-deep-gemm

@drbh
Collaborator

@drbh drbh commented Feb 20, 2026

This PR adds the deep-gemm kernels. It relies on an experimental feature added in huggingface/kernels#298.

The deep-gemm kernels rely heavily on JIT compilation and need access to nvcc, the CUTLASS headers, and the internal deep-gemm headers at runtime. This PR includes the internal headers, along with minor changes to load nvrtc lazily at runtime. The related PR in the kernels builder updates the build process to inject the CUTLASS headers into the build artifacts, so the kernel has all of its required dependencies at runtime.
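
For readers unfamiliar with the pattern, lazy loading means the nvrtc shared library is only opened the first time it is needed, so importing the kernel does not fail on machines where nvrtc is absent. The sketch below illustrates the idea in Python with `ctypes`. It is not the code in this PR: the function names `load_nvrtc` and `nvrtc_version` and the candidate library names are assumptions made for illustration.

```python
import ctypes
import ctypes.util
from functools import lru_cache
from typing import Optional, Tuple

# Hypothetical candidate names; the actual soname depends on the CUDA install.
_CANDIDATES = ("libnvrtc.so", "libnvrtc.so.12", "nvrtc")


@lru_cache(maxsize=1)
def load_nvrtc() -> Optional[ctypes.CDLL]:
    """Open nvrtc on first call and cache the handle; None if not found."""
    names = list(_CANDIDATES)
    found = ctypes.util.find_library("nvrtc")
    if found:
        names.insert(0, found)
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # try the next candidate name
    return None


def nvrtc_version() -> Optional[Tuple[int, int]]:
    """Return (major, minor) reported by nvrtcVersion, or None if unavailable."""
    lib = load_nvrtc()  # the library is only touched here, not at import time
    if lib is None:
        return None
    major, minor = ctypes.c_int(), ctypes.c_int()
    # nvrtcVersion returns NVRTC_SUCCESS (0) on success.
    if lib.nvrtcVersion(ctypes.byref(major), ctypes.byref(minor)) != 0:
        return None
    return (major.value, minor.value)


if __name__ == "__main__":
    print("nvrtc version:", nvrtc_version())
```

On a machine without a CUDA toolkit this prints `None` for the version instead of raising at import time, which is the behavior lazy loading is meant to provide.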

example usage

nvidia-smi -L
# GPU 0: NVIDIA H100 80GB HBM3 

# navigate to example and run
cd kernels-community/deep-gemm
uv run scripts/readme_example.py
[cuBLASLt BF16] shape: 256x1024x512, cosine_sim: 1.000000, max_diff: 0.0000
[FP8 1D2D] shape: 256x1024x512, cosine_sim: 0.999325, max_diff: 3.9062
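
The two metrics in the output above can be read as follows, assuming `cosine_sim` is the cosine similarity of the flattened outputs and `max_diff` is the largest element-wise absolute difference (the script's exact definitions are not shown in this PR description). A minimal pure-Python sketch under those assumptions:

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Illustrative values only, not taken from the kernel output.
    reference = [1000.0, 2000.0, -1500.0]
    approx = [1004.0, 1998.0, -1500.0]
    print(cosine_sim(reference, approx), max_diff(reference, approx))
```

The illustrative vectors show why the two numbers are not contradictory: when the output values are large in magnitude, an absolute difference of a few units can coexist with a cosine similarity very close to 1.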

note

  • If you are on a machine with a CUDA compute capability of >= 9, you'll need CUDA 12.9 or later for the JIT build to succeed, due to inlined asm that is not available in earlier versions.
  • If you are on a machine with more than one CUDA installation, you may have to specify the CUDA home, e.g. CUDA_HOME=/usr/local/cuda-12.9 uv run scripts/readme_example.py
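
Both notes can be checked before running the example. The sketch below is a hypothetical preflight helper, not part of this PR: it looks for nvcc under `CUDA_HOME` first, falls back to `PATH`, and compares the release reported by `nvcc --version` against 12.9. How deep-gemm itself locates nvcc is not described here, so the lookup order is an assumption.

```python
import os
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_CUDA = (12, 9)  # minimum toolkit version stated in the note above


def parse_nvcc_release(text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` output, e.g. 'release 12.9'."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def find_nvcc() -> Optional[str]:
    """Prefer $CUDA_HOME/bin/nvcc, then fall back to whatever is on PATH."""
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    return shutil.which("nvcc")


def toolkit_is_new_enough() -> Optional[bool]:
    """True/False if nvcc was found and parsed, None if it could not be checked."""
    nvcc = find_nvcc()
    if nvcc is None:
        return None
    out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
    release = parse_nvcc_release(out)
    return None if release is None else release >= MIN_CUDA


if __name__ == "__main__":
    print("nvcc:", find_nvcc(), "new enough:", toolkit_is_new_enough())
```

Running it with `CUDA_HOME=/usr/local/cuda-12.9` set shows which nvcc would be picked up when several toolkits are installed side by side.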

@MekkCyber MekkCyber changed the title from "Add deep gemm" to "deep-gemm: add" on Feb 24, 2026
Member

@danieldk danieldk left a comment

Really cool work! Added a comment about the submodules.

run: |
KERNEL="${{ steps.validate.outputs.kernel }}"
- ( cd "$KERNEL" && nix run -L .#build-and-upload )
+ ( cd "$KERNEL" && nix run -L .?submodules=1#build-and-upload )
Member

I think we should put this in the flake itself: https://discourse.nixos.org/t/nix-2-27-0-released/62003

If users build a kernel manually they'll miss the ?submodules=1.

Collaborator Author

oo thanks for the tip, this is much cleaner. I fully agree it's better to declare this in the flake than to depend on users including ?submodules=1 when building. 🙏

3 participants