| layout | default |
|---|---|
| title | LLM-Speed |
| description | High-performance CUDA kernel library for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM acceleration, and seamless PyTorch integration |
| lang | en |
High-performance CUDA kernel library for LLM inference, featuring FlashAttention with O(N) memory complexity and Tensor Core GEMM acceleration.
Optimized CUDA kernels for modern LLM inference with memory-efficient algorithms and hardware acceleration
O(N) memory complexity via the online softmax algorithm (sketched below). Supports causal masking for autoregressive models.
Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.
Compute/memory overlap with pipelined execution. Async copy for Ampere+ architectures.
Carefully designed shared memory layouts with padding to eliminate bank conflicts.
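For intuition on the online softmax mentioned above, here is a minimal PyTorch sketch of blockwise attention that keeps only O(N) running state (a per-row max, denominator, and output accumulator). The block size, function name, and tolerance are illustrative assumptions for a single head, not the library's internal kernel or API.

```python
import torch

def attention_online_softmax(q, k, v, block_size=128):
    """Illustrative blockwise attention with online softmax (single head).

    q, k, v: (seq_len, head_dim) tensors. Only O(seq_len) extra state is kept:
    running row-max m, running denominator l, and the output accumulator o.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    m = torch.full((seq_len, 1), float("-inf"), device=q.device)  # running max per query row
    l = torch.zeros(seq_len, 1, device=q.device)                  # running softmax denominator
    o = torch.zeros_like(q, dtype=torch.float32)                  # unnormalized output accumulator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                                  # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                                   # block probabilities w.r.t. new max
        correction = torch.exp(m - m_new)                          # rescale previously accumulated state
        l = l * correction + p.sum(dim=-1, keepdim=True)
        o = o * correction + p.float() @ v_blk.float()
        m = m_new
    return (o / l).to(q.dtype)

# Cross-check against a dense softmax reference on small inputs
q = torch.randn(512, 64); k = torch.randn(512, 64); v = torch.randn(512, 64)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(attention_online_softmax(q, k, v), ref, atol=1e-4)
```

Because each K/V block is discarded after it is folded into the running state, the full N×N score matrix is never materialized.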
FlashAttention dramatically reduces memory usage compared to standard attention implementations
| Sequence Length | Standard Attention | FlashAttention | Reduction |
|---|---|---|---|
| 1024 | 4 MB | 0.25 MB | 94% |
| 2048 | 16 MB | 0.5 MB | 97% |
| 4096 | 64 MB | 1 MB | 98% |
| 8192 | 256 MB | 2 MB | 99% |
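As a rough sanity check, the figures above are consistent with a single attention head of dimension 64 that materializes FP32 scores: standard attention stores a full N×N score matrix, while FlashAttention keeps only O(N) per-row state (an FP32 output row plus softmax statistics). The exact constants depend on kernel details, so treat this as an illustrative estimate rather than a measurement.

```python
# Illustrative arithmetic behind the table (assumes 1 head, head_dim=64, FP32 scores/accumulators)
HEAD_DIM, FP32 = 64, 4  # bytes per float32

for n in (1024, 2048, 4096, 8192):
    standard = n * n * FP32        # full N x N attention matrix
    flash = n * HEAD_DIM * FP32    # O(N): one FP32 output row per query (+ small stats)
    print(f"{n:5d}  standard={standard / 2**20:6.1f} MB  flash={flash / 2**20:5.2f} MB  "
          f"reduction={100 * (1 - flash / standard):.0f}%")
```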
```python
import torch
from cuda_llm_ops import flash_attention

batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

output = flash_attention(q, k, v, is_causal=True)
```
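To sanity-check the result, one option is to compare against PyTorch's built-in scaled dot-product attention on the same tensors; the tolerances below are an assumption appropriate for FP16 inputs, not a documented guarantee.

```python
import torch.nn.functional as F

# Reference result from PyTorch's fused SDPA (also causal, same FP16 inputs)
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(output, ref, atol=1e-2, rtol=1e-2)
```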
```python
import torch
from cuda_llm_ops import tensor_core_gemm

a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
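A quick way to check the FP32 accumulation is to compare against a plain FP32 matmul of the same upcast operands; the tolerances here are a guess for FP16 inputs rather than a specification.

```python
# FP32 reference: upcast the FP16 operands and multiply in full precision
ref = torch.matmul(a.float(), b.float())
torch.testing.assert_close(c, ref, atol=1e-2, rtol=1e-3)
```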
| Architecture | SM Version | Tensor Core | Status |
|---|---|---|---|
| Volta (V100) | SM 7.0 | FP16 | ✅ Supported |
| Turing (T4, RTX 20) | SM 7.5 | FP16, INT8 | ✅ Supported |
| Ampere (A100, RTX 30) | SM 8.0, 8.6 | FP16, BF16, INT8, TF32 | ✅ Recommended |
| Ada Lovelace (RTX 40) | SM 8.9 | FP16, BF16, INT8, FP8 | ✅ Supported |
| Hopper (H100) | SM 9.0 | FP16, BF16, INT8, FP8 | ✅ Supported |
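To see which row of the table applies on a given machine, you can query the device's compute capability from PyTorch; the comment below simply restates the table and is only for illustration.

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: SM {major}.{minor}")
# SM >= 7.0 is required for the Tensor Core paths; SM 8.0+ (Ampere) additionally enables BF16/TF32
```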
Get up and running in 5 minutes with installation and basic usage examples.
Complete API documentation with parameters, examples, and error handling.
Technical deep dive into CUDA kernels, optimization strategies, and implementation details.
Optimization tips, benchmarking tools, and best practices for maximum performance.
Join the community and start optimizing your CUDA kernels today