English | 简体中文
A CUDA kernel lab for AI workloads, organized as focused modules for elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA paths.
- `src/common/`: shared CUDA utilities such as tensor wrappers, timers, launch helpers, and reduction primitives
- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA features
- `tests/`: GoogleTest + RapidCheck coverage across kernel modules
- `examples/`: shipped CUDA and Python examples
- `python/`: nanobind bindings plus benchmark scripts
- `docs/`: optimization notes and Python binding docs
- Core CUDA kernels under `src/01_elementwise` to `src/06_quantization`
- CMake-based native builds and CTest-based validation
- Thin Python bindings for a subset of elementwise, reduction, and GEMM kernels
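Kernel outputs from these modules are easiest to sanity-check against a host-side reference. As one illustration (plain Python, not part of the repo's API), a pairwise tree sum reproduces the association order a CUDA block-level tree reduction typically uses:

```python
# Pairwise (tree) summation in plain Python. This mirrors the association
# order of a block-level tree reduction on the GPU, which also tends to be
# more accurate in float arithmetic than a left-to-right running sum.
def tree_sum(xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

print(tree_sum([1.0, 2.0, 3.0, 4.0]))  # 10.0
```

A reference like this is handy when a GPU reduction and a naive CPU loop disagree in the last few bits: the tree order is usually the closer match.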
The following modules currently exist as educational or compatibility-oriented paths rather than production-grade implementations:
- `src/04_convolution/conv_winograd.cu`: currently falls back to the validated implicit-GEMM convolution path
- `src/07_cuda13_features/tma.cu`: currently uses a regular kernel copy fallback
- `src/07_cuda13_features/cluster.cu`: currently uses a portable block-reduction fallback
- `src/07_cuda13_features/fp8_gemm.cu`: currently demonstrates scaled float behavior rather than a true Hopper FP8 kernel
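The "scaled float behavior" in the FP8 demo boils down to a per-tensor scale chosen from the maximum magnitude, with values stored coarsely at that scale and rescaled afterwards. A rough host-side sketch of that idea (pure Python; the constants and function name are illustrative, not the actual `fp8_gemm.cu` parameters):

```python
# Sketch of per-tensor scaled quantization: pick a scale from the max
# magnitude, round values at that scale (simulating coarse storage), then
# rescale. FP8_E4M3_MAX is the max finite magnitude of the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0

def quantize_dequantize(values, qmax=FP8_E4M3_MAX):
    amax = max(abs(v) for v in values) or 1.0  # avoid divide-by-zero
    scale = qmax / amax
    # Rounding the scaled value simulates the precision loss of a narrow format.
    return [round(v * scale) / scale for v in values]

vals = [0.1, -3.5, 120.0]
approx = quantize_dequantize(vals)  # close to vals, within 0.5 / scale each
```

Large values survive the round trip almost exactly, while values far below `amax` lose most of their precision, which is why real FP8 GEMMs care so much about scale selection.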
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure
```

The current Python extension is named `hpc_ai_opt` and exposes low-level submodules such as `elementwise`, `reduction`, and `gemm`.
```bash
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"
python examples/python/basic_usage.py
```

The bindings are intentionally thin:
- CUDA tensors are passed in directly
- output tensors are allocated by the caller
- several kernels require explicit shape arguments
- wrappers validate basic tensor sizes/arguments, then launch CUDA work asynchronously
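To make the contract concrete, here is a host-side mock of what such a wrapper does (plain Python, not the real bindings; `relu_into` is a hypothetical name): the caller allocates the output, and the wrapper only validates sizes before doing the work.

```python
# Host-side mock of the binding contract: caller-allocated output,
# basic size validation, then the elementwise work. The real wrappers
# do this validation on CUDA tensors and launch a kernel asynchronously.
def relu_into(x, out):
    if len(x) != len(out):
        raise ValueError("output must match input size")
    for i, v in enumerate(x):
        out[i] = v if v > 0.0 else 0.0  # elementwise max(v, 0)

x = [-1.0, 2.0, -3.0, 4.0]
y = [0.0] * len(x)
relu_into(x, y)  # y becomes [0.0, 2.0, 0.0, 4.0]
```

Because the real launches are asynchronous, remember to synchronize (e.g. `torch.cuda.synchronize()`) before reading results on the host.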
```bash
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build --target relu_example gemm_benchmark
```

```python
import torch
import hpc_ai_opt

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)
hpc_ai_opt.elementwise.relu(x, y)
```

- CUDA Toolkit 12.4+
- CMake 3.24+
- A C++20 compiler
- An NVIDIA GPU with CUDA support
- PyTorch with CUDA support for the Python example path
Notes:
- The Docker development environment is currently based on CUDA 12.4.1.
- Experimental modules under `src/07_cuda13_features/` are not evidence of full Hopper/Blackwell feature coverage.
- `flash_attention` currently supports `float` with `head_dim == 64` in the shipped implementation.
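Whatever the kernel's internal tiling, flash attention computes the same function as standard scaled dot-product attention, so its output can be checked against a straightforward host-side reference. A minimal single-head, unmasked sketch in plain Python (not the repo's API):

```python
import math

def attention_ref(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for one head.
    Q, K, V are lists of row vectors (lists of floats)."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                       # subtract max for stability,
        exps = [math.exp(s - m) for s in scores]  # as flash attention does
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

For the shipped kernel, a real check would use `head_dim == 64` float inputs and compare the GPU output row-by-row against this reference within a float tolerance.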
The default GitHub Actions workflow is intentionally lightweight and currently validates:
- formatting
- repository/documentation consistency checks
- documentation builds
It does not currently provide full native CUDA build-and-test coverage on GitHub-hosted runners. For native verification, run the local CMake + CTest flow shown above on a machine with a working CUDA toolchain and GPU.
- `docs/README.md`
- `docs/python/index.rst`
- `docs/01_gemm_optimization.md`
- `docs/04_flash_attention.md`