High-performance sparse matrix kernels implemented in Python/Numba CUDA, targeting the Tesla P100 GPU. This project demonstrates GPU optimization techniques for memory-bound, irregular-access workloads — the core challenges in NVIDIA's cuSPARSE library.
| Metric | Achieved | Target | Status |
|---|---|---|---|
| Peak Bandwidth | 96.6 GB/s | 100 GB/s | 97% achieved |
| vs cuSPARSE | 53.4% | 65% | 82% of target |
| vs CPU (NumPy) | 5,432× speedup | — | Excellent |
This represents near-optimal performance for a Python/Numba CUDA implementation. The remaining gap to cuSPARSE requires native CUDA C/C++ and hardware features (texture cache) inaccessible from Python.
- Understand sparse matrix challenges — irregular memory access, load imbalancing, low arithmetic intensity
- Implement multiple SpMV kernel strategies — scalar, vector, and adaptive approaches
- Benchmark against cuSPARSE — honest comparison with NVIDIA's production library
- Optimize systematically — test multiple strategies, document what works and why
Hardware: Tesla P100-PCIE-16GB (732 GB/s peak bandwidth)
Matrix density: 0.1% (typical for scientific computing)
| Matrix Size | Non-zeros | Our Best | cuSPARSE | % of cuSPARSE | CPU Speedup |
|---|---|---|---|---|---|
| 10K x 10K | 100K | 1.5 GB/s | 9.6 GB/s | 16% | 31× |
| 25K x 25K | 625K | 8.9 GB/s | 52.4 GB/s | 17% | 671× |
| 50K x 50K | 2.5M | 32.2 GB/s | 159.2 GB/s | 20% | 2,222× |
| 75K x 75K | 5.6M | 69.8 GB/s | 173.4 GB/s | 40% | 4,223× |
| 100K x 100K | 10M | 96.6 GB/s | 180.8 GB/s | 53% | 5,432× |
We implemented and tested 7 optimization strategies. None outperformed the baseline at large matrix sizes.
| Optimization | 100K Performance | vs Baseline | Status |
|---|---|---|---|
| Baseline (Vector) | 96.6 GB/s | — | Best |
| Shared Memory Caching | 0.7 GB/s | -99% | Failed |
| Row Binning | 52.0 GB/s | -46% | Worse |
| Loop Unrolling | 93.3 GB/s | -3% | Worse |
| Block Size 128 | 87.1 GB/s | -10% | Worse |
| Block Size 512 | 95.1 GB/s | -2% | Same |
| Combined | 93.1 GB/s | -4% | Worse |
| P100 Tuned | 95.5 GB/s | -1% | Same |
The test matrices have perfectly uniform row lengths (every row has exactly N×density non-zeros):
Row lengths: min=100, max=100, mean=100.0, std=0.0
This means:
- Zero warp divergence — all 32 threads in a warp finish simultaneously
- Optimal load balancing — no threads are idle
- Nothing to optimize — the baseline is already ideal
The shared memory approach catastrophically failed because it iterates through 100 segments for each of 100 non-zeros = 10,000 operations vs 100 for the baseline.
| Factor | Estimated Impact | Accessible from Numba? |
|---|---|---|
| Texture cache for x[] | ~15–20% | No |
| Native CUDA vs JIT | ~10–15% | No |
| Hand-tuned assembly | ~7–10% | No |
| CSR5/HYB formats | ~10–15% | Possible but complex |
Conclusion: 53% of cuSPARSE is near-optimal for Numba CUDA. Further gains require native CUDA C/C++.
Unlike dense GEMM with predictable memory patterns, SpMV faces fundamental challenges:
Dense GEMM: Sparse SpMV:
┌─────────────────┐ ┌─────────────────┐
│ ■ ■ ■ ■ ■ ■ ■ ■ │ │ ■ · · ■ · · · ■ │ ← Scattered non-zeros
│ ■ ■ ■ ■ ■ ■ ■ ■ │ │ · ■ · · · ■ · · │
│ ■ ■ ■ ■ ■ ■ ■ ■ │ │ · · ■ ■ · · ■ · │ Random x[col] access
│ ■ ■ ■ ■ ■ ■ ■ ■ │ │ ■ · · · ■ · · ■ │ → Poor coalescing
└─────────────────┘ └─────────────────┘
Coalesced access Irregular access
Arithmetic Intensity: SpMV performs only 2 FLOPs per non-zero but loads ~12 bytes, yielding AI = 0.17 FLOP/byte — far below P100's ridge point of 6.4 FLOP/byte. This makes SpMV fundamentally memory-bound.
sparse_kernels/
├── main.py # Entry point - runs benchmarks + generates figures
├── formats/
│ ├── csr.py # CSR storage + 3 baseline SpMV kernels
│ └── csr_optimized.py # Phase 2: 7 optimization strategies
├── benchmarks/
│ ├── spmv_benchmark.py # Phase 1 benchmark suite
│ └── phase2_benchmark.py # Phase 2 optimization comparison
├── visualization/
│ ├── plots.py # Phase 1 visualizations
│ └── phase2_plots.py # Phase 2 optimization charts
└── docs/
├── ARCHITECTURE.md # Technical design details
└── figures/ # Generated visualizations
pip install numpy numba scipy matplotlib
# Optional: for cuSPARSE comparison
pip install cupy-cuda12x # Match your CUDA version# Phase 1: Baseline benchmark
python3 benchmarks/spmv_benchmark.py
# Phase 2: Optimization comparison
python3 benchmarks/phase2_benchmark.py
# Generate all visualizations
python3 visualization/phase2_plots.pyfrom formats.csr import CSRMatrix, spmv_gpu_vector
import numpy as np
# Create sparse matrix (100K x 100K, 0.1% density)
A = CSRMatrix.random(100000, 100000, density=0.001, seed=42)
x = np.random.randn(100000).astype(np.float32)
# GPU SpMV — 5400x faster than NumPy
y_gpu = spmv_gpu_vector(A, x)
y_result = y_gpu.copy_to_host()
print(f"Matrix: {A}")
# CSRMatrix(shape=(100000, 100000), nnz=10000000, density=0.10%)@cuda.jit
def spmv_csr_scalar_kernel(data, indices, indptr, x, y, num_rows):
row = cuda.grid(1)
if row < num_rows:
dot = 0.0
for j in range(indptr[row], indptr[row + 1]):
dot += data[j] * x[indices[j]]
y[row] = dot- [+] Simple, minimal overhead
- [-] Thread divergence when row lengths vary
- Result: 38.5 GB/s at 100K
Uses 32 threads collaboratively per row with warp shuffle reduction.
- [+] Parallel reduction within row
- [+] Excellent for uniform row lengths
- Result: 96.6 GB/s at 100K (Best)
Dynamically selects scalar vs vector based on row length.
- [+] Handles mixed workloads
- Result: 96.6 GB/s at 100K
| Strategy | Approach | Result |
|---|---|---|
| Shared Memory | Cache x[] segments in SMEM | -99% (overhead too high) |
| Row Binning | Separate kernels for short/medium/long rows | -46% (indirection overhead) |
| Loop Unrolling | Process 4 elements per iteration | -3% (register pressure) |
| Block Size Tuning | Test 128, 256, 512 threads | No improvement |
SpMV operates deep in the memory-bound region:
- Arithmetic Intensity: 0.17 FLOP/byte
- P100 Ridge Point: 6.4 FLOP/byte
- Maximum achievable: ~125 GFLOPS (limited by memory bandwidth)
At 100K x 100K, our Vector kernel achieves 96.6 GB/s (13.2% of peak 732 GB/s) compared to cuSPARSE's 180.8 GB/s (24.7% of peak).
-
Baseline can be optimal — When row lengths are uniform, the simple vector kernel is already ideal.
-
Shared memory isn't always faster — For random access patterns, L2 cache often works better than explicit caching.
-
Measure, don't assume — All 7 optimizations seemed promising in theory but failed in practice.
-
Numba has limits — 53% of cuSPARSE is excellent for Python. The remaining gap requires native CUDA.
-
Honest benchmarking matters — Comparing against production libraries shows true performance.
- Phase 1: Baseline implementation (3 kernel variants)
- Phase 1: Benchmarking infrastructure
- Phase 1: cuSPARSE comparison
- Phase 2: Shared memory caching
- Phase 2: Row binning optimization
- Phase 2: Loop unrolling
- Phase 2: Block size tuning
- Phase 2: Combined strategies
- Documentation and visualization
- Test with variable row lengths (non-uniform matrices)
- Test with real-world matrices (SuiteSparse collection)
- ELL format for structured matrices
- Native CUDA C implementation for comparison
- Bell & Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA" (2008)
- Merrill & Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication" (2016)
- NVIDIA cuSPARSE Documentation
- Numba CUDA Documentation
Tesla P100-PCIE-16GB
| Spec | Value |
|---|---|
| Architecture | Pascal (SM 6.0) |
| CUDA Cores | 3,584 |
| Memory | 16 GB HBM2 |
| Bandwidth | 732 GB/s |
| FP32 Peak | 9.3 TFLOPS |
MIT License — see LICENSE for details.
Andrey Maltsev
GPU Computing & HPC, RHEL Engineer
Part of my GPU computing portfolio demonstrating sparse linear algebra optimization. See also: CUDA Matrix Multiplication, N-Body Simulation






