---
layout: default
title: LLM-Speed
description: High-performance CUDA kernel library for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM acceleration, and seamless PyTorch integration
lang: en
---
v{{ site.current_version }} — High-performance CUDA kernels for LLM inference

# LLM-Speed

High-performance CUDA kernel library for LLM inference, featuring FlashAttention with O(N) memory complexity and Tensor Core GEMM acceleration.

- **98%** memory reduction (FlashAttention vs. standard attention)
- **2.1×** speedup at 8K sequence length
- **90%+** of cuBLAS performance
- **5** supported GPU architectures
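The figures above are the project's headline numbers; purely for reference, here is a minimal sketch (not the project's benchmark harness) of how a speedup like this can be timed with CUDA events, assuming PyTorch 2.x for the `scaled_dot_product_attention` baseline:

```python
import torch
import torch.nn.functional as F
from cuda_llm_ops import flash_attention

# 8K-sequence inputs, matching the headline speedup figure
q = torch.randn(1, 8, 8192, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def time_ms(fn, iters=20):
    """Average kernel time in milliseconds via CUDA events."""
    fn()  # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("flash_attention:", time_ms(lambda: flash_attention(q, k, v, is_causal=True)))
print("torch SDPA:     ", time_ms(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True)))
```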

## Key Features

Optimized CUDA kernels for modern LLM inference, with memory-efficient algorithms and hardware acceleration.

### FlashAttention

O(N) memory complexity using the online softmax algorithm. Supports causal masking for autoregressive models.
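For intuition, here is a minimal Python sketch of the online-softmax idea (illustrative only, not the kernel code): a running max `m` and normalizer `l` are folded in one score block at a time, so the full N×N score matrix is never materialized.

```python
import torch

def online_softmax_attention_row(score_blocks, v_blocks):
    """softmax(scores) @ V for one query row, one score block at a time."""
    m = torch.tensor(float('-inf'))           # running max
    l = torch.tensor(0.0)                     # running normalizer
    acc = torch.zeros_like(v_blocks[0][0])    # running weighted sum, shape (d,)
    for s, v in zip(score_blocks, v_blocks):  # s: (block,), v: (block, d)
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)          # rescale old state to the new max
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l  # equals softmax(cat(score_blocks)) @ cat(v_blocks)
```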

### 🔢 Tensor Core GEMM

Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.

### 🐍 PyTorch Integration

Seamless integration with PyTorch via pybind11. Native CUDA tensor support.
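Purely as an illustration of the pybind11 extension pattern (not necessarily how this project is built), such a module can also be JIT-compiled with PyTorch's standard `torch.utils.cpp_extension` tooling; the source paths below are hypothetical:

```python
from torch.utils.cpp_extension import load

# Hypothetical source paths, for illustration only
ops = load(
    name="cuda_llm_ops_jit",
    sources=["csrc/bindings.cpp", "csrc/flash_attention.cu"],
    extra_cuda_cflags=["-O3"],
)
# Bound functions then appear as attributes of the loaded module.
```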

### 🔄 Double Buffering

Overlaps compute with memory transfers through pipelined execution, using asynchronous copies on Ampere and newer architectures.

### 🏦 Bank Conflict Free

Carefully designed shared memory layouts with padding to eliminate bank conflicts.
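A quick arithmetic illustration of why padding helps (plain Python modeling the 32 four-byte banks of NVIDIA shared memory, not kernel code): with a 32-float row stride, a column access lands all 32 threads in one bank, while padding the stride to 33 floats spreads them across all banks.

```python
BANKS = 32  # shared memory banks on modern NVIDIA GPUs (4-byte words)

def banks_hit(stride_words, threads=32):
    """Set of banks touched when each thread reads column 0 of a 2D tile."""
    return {(tid * stride_words) % BANKS for tid in range(threads)}

print(len(banks_hit(32)))  # 1  -> 32-way conflict (unpadded 32-float rows)
print(len(banks_hit(33)))  # 32 -> conflict-free (one float of padding)
```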

### 📊 Property Testing

Comprehensive tests with Hypothesis for correctness verification across edge cases.
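The actual suite lives in the repository; as a flavor of what property-based testing looks like here, a hedged sketch (shapes, tolerances, and the reference implementation below are illustrative):

```python
import math
import torch
from hypothesis import given, settings, strategies as st
from cuda_llm_ops import flash_attention

def reference_attention(q, k, v):
    """Plain PyTorch attention in FP32, cast back to FP16 for comparison."""
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.shape[-1])
    return (scores.softmax(dim=-1) @ v.float()).half()

@settings(deadline=None, max_examples=25)
@given(seq_len=st.integers(16, 256), head_dim=st.sampled_from([32, 64]))
def test_matches_reference(seq_len, head_dim):
    q, k, v = (torch.randn(1, 2, seq_len, head_dim, device='cuda',
                           dtype=torch.float16) for _ in range(3))
    out = flash_attention(q, k, v, is_causal=False)
    torch.testing.assert_close(out, reference_attention(q, k, v),
                               atol=2e-2, rtol=2e-2)
```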

## Memory Efficiency

FlashAttention dramatically reduces memory usage compared to standard attention implementations:

| Sequence Length | Standard Attention | FlashAttention | Reduction |
|----------------:|-------------------:|---------------:|----------:|
| 1024            | 4 MB               | 0.25 MB        | 94%       |
| 2048            | 16 MB              | 0.5 MB         | 97%       |
| 4096            | 64 MB              | 1 MB           | 98%       |
| 8192            | 256 MB             | 2 MB           | 99%       |
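The "Standard Attention" column matches the N×N FP32 score matrix a naive implementation materializes per head; FlashAttention streams over that matrix in tiles, so its footprint grows only linearly in N. A back-of-envelope check:

```python
# N*N scores at 4 bytes each reproduces the table's second column
for n in (1024, 2048, 4096, 8192):
    print(f"N={n}: {n * n * 4 / 2**20:.0f} MB")
```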

## Quick Example

Get started with just a few lines of code.

**flash_attention.py**

```python
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
```
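To sanity-check the output on your own machine, you can compare against PyTorch's built-in attention (requires PyTorch 2.x; tolerances here are illustrative):

```python
import torch.nn.functional as F

expected = F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(output, expected, atol=2e-2, rtol=2e-2)
```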

**tensor_core_gemm.py**

```python
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication inputs
a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

# Hardware-accelerated GEMM: FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
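A quick way to see the FP32 accumulation at work is to compare against a full-precision reference (tolerances illustrative):

```python
expected = a.float() @ b.float()  # FP32 reference on the same inputs
torch.testing.assert_close(c, expected, atol=1e-2, rtol=1e-2)
```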

## GPU Architecture Support

Optimized for all modern NVIDIA GPU architectures:

| Architecture           | SM Version  | Tensor Core Types      | Status        |
|------------------------|-------------|------------------------|---------------|
| Volta (V100)           | SM 7.0      | FP16                   | ✅ Supported   |
| Turing (T4, RTX 20)    | SM 7.5      | FP16, INT8             | ✅ Supported   |
| Ampere (A100, RTX 30)  | SM 8.0, 8.6 | FP16, BF16, INT8, TF32 | ✅ Recommended |
| Ada Lovelace (RTX 40)  | SM 8.9      | FP16, BF16, INT8, FP8  | ✅ Supported   |
| Hopper (H100)          | SM 9.0      | FP16, BF16, INT8, FP8  | ✅ Supported   |
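To find which row applies to your GPU, query the compute capability with standard PyTorch:

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"SM {major}.{minor}")  # e.g. "SM 8.0" on an A100
```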

## Documentation

Comprehensive guides in English and Chinese:

- 🚀 Get up and running in 5 minutes with installation and basic usage examples.
- 📚 Complete API documentation with parameters, examples, and error handling.
- 🏗️ Technical deep dive into CUDA kernels, optimization strategies, and implementation details.
- Optimization tips, benchmarking tools, and best practices for maximum performance.

## Ready to accelerate your LLM inference?

Join the community and start optimizing your CUDA kernels today.