---
layout: default
title: LLM-Speed
description: High-performance CUDA kernel library for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM acceleration, and seamless PyTorch integration
lang: en
---
v{{ site.current_version }} — High-performance CUDA kernels for LLM inference

# LLM-Speed

High-performance CUDA kernel library for LLM inference, featuring FlashAttention with O(N) memory complexity and Tensor Core GEMM acceleration.

- **98%** memory reduction (FlashAttention vs. standard attention)
- **2.1×** speedup at 8K sequence length
- **90%+** of cuBLAS performance
- **5** supported GPU architectures
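The figures above are the project's headline numbers; purely for reference, here is a minimal sketch (not the project's benchmark harness) of how a speedup like this can be timed with CUDA events, assuming PyTorch 2.x for the `scaled_dot_product_attention` baseline:

```python
import torch
import torch.nn.functional as F
from cuda_llm_ops import flash_attention

# 8K-sequence inputs, matching the headline speedup figure
q = torch.randn(1, 8, 8192, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def time_ms(fn, iters=20):
    """Average kernel time in milliseconds via CUDA events."""
    fn()  # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("flash_attention:", time_ms(lambda: flash_attention(q, k, v, is_causal=True)))
print("torch SDPA:     ", time_ms(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True)))
```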

## Key Features

Optimized CUDA kernels for modern LLM inference, with memory-efficient algorithms and hardware acceleration.

### FlashAttention

O(N) memory complexity using the online softmax algorithm. Supports causal masking for autoregressive models.
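For intuition, here is a minimal Python sketch of the online-softmax idea (illustrative only, not the kernel code): a running max `m` and normalizer `l` are folded in one score block at a time, so the full N×N score matrix is never materialized.

```python
import torch

def online_softmax_attention_row(score_blocks, v_blocks):
    """softmax(scores) @ V for one query row, one score block at a time."""
    m = torch.tensor(float('-inf'))           # running max
    l = torch.tensor(0.0)                     # running normalizer
    acc = torch.zeros_like(v_blocks[0][0])    # running weighted sum, shape (d,)
    for s, v in zip(score_blocks, v_blocks):  # s: (block,), v: (block, d)
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)          # rescale old state to the new max
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l  # equals softmax(cat(score_blocks)) @ cat(v_blocks)
```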

### 🔢 Tensor Core GEMM

Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.

### 🐍 PyTorch Integration

Seamless integration with PyTorch via pybind11. Native CUDA tensor support.
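Purely as an illustration of the pybind11 extension pattern (not necessarily how this project is built), such a module can also be JIT-compiled with PyTorch's standard `torch.utils.cpp_extension` tooling; the source paths below are hypothetical:

```python
from torch.utils.cpp_extension import load

# Hypothetical source paths, for illustration only
ops = load(
    name="cuda_llm_ops_jit",
    sources=["csrc/bindings.cpp", "csrc/flash_attention.cu"],
    extra_cuda_cflags=["-O3"],
)
# Bound functions then appear as attributes of the loaded module.
```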

### 🔄 Double Buffering

Overlaps compute with memory transfers through pipelined execution, using asynchronous copies on Ampere and newer architectures.

### 🏦 Bank Conflict Free

Carefully designed shared memory layouts with padding to eliminate bank conflicts.
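A quick arithmetic illustration of why padding helps (plain Python modeling the 32 four-byte banks of NVIDIA shared memory, not kernel code): with a 32-float row stride, a column access lands all 32 threads in one bank, while padding the stride to 33 floats spreads them across all banks.

```python
BANKS = 32  # shared memory banks on modern NVIDIA GPUs (4-byte words)

def banks_hit(stride_words, threads=32):
    """Set of banks touched when each thread reads column 0 of a 2D tile."""
    return {(tid * stride_words) % BANKS for tid in range(threads)}

print(len(banks_hit(32)))  # 1  -> 32-way conflict (unpadded 32-float rows)
print(len(banks_hit(33)))  # 32 -> conflict-free (one float of padding)
```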

### 📊 Property Testing

Comprehensive tests with Hypothesis for correctness verification across edge cases.
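The actual suite lives in the repository; as a flavor of what property-based testing looks like here, a hedged sketch (shapes, tolerances, and the reference implementation below are illustrative):

```python
import math
import torch
from hypothesis import given, settings, strategies as st
from cuda_llm_ops import flash_attention

def reference_attention(q, k, v):
    """Plain PyTorch attention in FP32, cast back to FP16 for comparison."""
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.shape[-1])
    return (scores.softmax(dim=-1) @ v.float()).half()

@settings(deadline=None, max_examples=25)
@given(seq_len=st.integers(16, 256), head_dim=st.sampled_from([32, 64]))
def test_matches_reference(seq_len, head_dim):
    q, k, v = (torch.randn(1, 2, seq_len, head_dim, device='cuda',
                           dtype=torch.float16) for _ in range(3))
    out = flash_attention(q, k, v, is_causal=False)
    torch.testing.assert_close(out, reference_attention(q, k, v),
                               atol=2e-2, rtol=2e-2)
```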

## Memory Efficiency

FlashAttention dramatically reduces memory usage compared to standard attention implementations:

| Sequence Length | Standard Attention | FlashAttention | Reduction |
|----------------:|-------------------:|---------------:|----------:|
| 1024            | 4 MB               | 0.25 MB        | 94%       |
| 2048            | 16 MB              | 0.5 MB         | 97%       |
| 4096            | 64 MB              | 1 MB           | 98%       |
| 8192            | 256 MB             | 2 MB           | 99%       |
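The "Standard Attention" column matches the N×N FP32 score matrix a naive implementation materializes per head; FlashAttention streams over that matrix in tiles, so its footprint grows only linearly in N. A back-of-envelope check:

```python
# N*N scores at 4 bytes each reproduces the table's second column
for n in (1024, 2048, 4096, 8192):
    print(f"N={n}: {n * n * 4 / 2**20:.0f} MB")
```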

## Quick Example

Get started with just a few lines of code.

**flash_attention.py**

```python
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
```
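To sanity-check the output on your own machine, you can compare against PyTorch's built-in attention (requires PyTorch 2.x; tolerances here are illustrative):

```python
import torch.nn.functional as F

expected = F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(output, expected, atol=2e-2, rtol=2e-2)
```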

**tensor_core_gemm.py**

```python
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication inputs
a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

# Hardware-accelerated GEMM: FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
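A quick way to see the FP32 accumulation at work is to compare against a full-precision reference (tolerances illustrative):

```python
expected = a.float() @ b.float()  # FP32 reference on the same inputs
torch.testing.assert_close(c, expected, atol=1e-2, rtol=1e-2)
```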

## GPU Architecture Support

Optimized for all modern NVIDIA GPU architectures:

| Architecture           | SM Version  | Tensor Core Types      | Status        |
|------------------------|-------------|------------------------|---------------|
| Volta (V100)           | SM 7.0      | FP16                   | ✅ Supported   |
| Turing (T4, RTX 20)    | SM 7.5      | FP16, INT8             | ✅ Supported   |
| Ampere (A100, RTX 30)  | SM 8.0, 8.6 | FP16, BF16, INT8, TF32 | ✅ Recommended |
| Ada Lovelace (RTX 40)  | SM 8.9      | FP16, BF16, INT8, FP8  | ✅ Supported   |
| Hopper (H100)          | SM 9.0      | FP16, BF16, INT8, FP8  | ✅ Supported   |
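To find which row applies to your GPU, query the compute capability with standard PyTorch:

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"SM {major}.{minor}")  # e.g. "SM 8.0" on an A100
```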

## Documentation

Comprehensive guides in English and Chinese:

- 🚀 Get up and running in 5 minutes with installation and basic usage examples.
- 📚 Complete API documentation with parameters, examples, and error handling.
- 🏗️ Technical deep dive into CUDA kernels, optimization strategies, and implementation details.
- Optimization tips, benchmarking tools, and best practices for maximum performance.

## Ready to accelerate your LLM inference?

Join the community and start optimizing your CUDA kernels today.