GPU-Accelerated Sparse Matrix Operations

High-performance sparse matrix kernels implemented in Python/Numba CUDA, targeting the Tesla P100 GPU. This project demonstrates GPU optimization techniques for memory-bound, irregular-access workloads — the core challenges in NVIDIA's cuSPARSE library.

Key Results

Metric	Achieved	Target	Status
Peak Bandwidth	96.6 GB/s	100 GB/s	97% achieved
vs cuSPARSE	53.4%	65%	82% of target
vs CPU (NumPy)	5,432× speedup	—	Excellent

This represents near-optimal performance for a Python/Numba CUDA implementation. The remaining gap to cuSPARSE requires native CUDA C/C++ and hardware features (texture cache) inaccessible from Python.

Project Goals

Understand sparse matrix challenges — irregular memory access, load imbalancing, low arithmetic intensity
Implement multiple SpMV kernel strategies — scalar, vector, and adaptive approaches
Benchmark against cuSPARSE — honest comparison with NVIDIA's production library
Optimize systematically — test multiple strategies, document what works and why

Benchmark Results

Hardware: Tesla P100-PCIE-16GB (732 GB/s peak bandwidth)
Matrix density: 0.1% (typical for scientific computing)

Performance Scaling

Matrix Size	Non-zeros	Our Best	cuSPARSE	% of cuSPARSE	CPU Speedup
10K x 10K	100K	1.5 GB/s	9.6 GB/s	16%	31×
25K x 25K	625K	8.9 GB/s	52.4 GB/s	17%	671×
50K x 50K	2.5M	32.2 GB/s	159.2 GB/s	20%	2,222×
75K x 75K	5.6M	69.8 GB/s	173.4 GB/s	40%	4,223×
100K x 100K	10M	96.6 GB/s	180.8 GB/s	53%	5,432×

Phase 2: Optimization Results

We implemented and tested 7 optimization strategies. None outperformed the baseline at large matrix sizes.

Optimization	100K Performance	vs Baseline	Status
Baseline (Vector)	96.6 GB/s	—	Best
Shared Memory Caching	0.7 GB/s	-99%	Failed
Row Binning	52.0 GB/s	-46%	Worse
Loop Unrolling	93.3 GB/s	-3%	Worse
Block Size 128	87.1 GB/s	-10%	Worse
Block Size 512	95.1 GB/s	-2%	Same
Combined	93.1 GB/s	-4%	Worse
P100 Tuned	95.5 GB/s	-1%	Same

Why Optimizations Failed

The test matrices have perfectly uniform row lengths (every row has exactly N×density non-zeros):

Row lengths: min=100, max=100, mean=100.0, std=0.0

This means:

Zero warp divergence — all 32 threads in a warp finish simultaneously
Optimal load balancing — no threads are idle
Nothing to optimize — the baseline is already ideal

The shared memory approach catastrophically failed because it iterates through 100 segments for each of 100 non-zeros = 10,000 operations vs 100 for the baseline.

Understanding the cuSPARSE Gap

Factor	Estimated Impact	Accessible from Numba?
Texture cache for x[]	~15–20%	No
Native CUDA vs JIT	~10–15%	No
Hand-tuned assembly	~7–10%	No
CSR5/HYB formats	~10–15%	Possible but complex

Conclusion: 53% of cuSPARSE is near-optimal for Numba CUDA. Further gains require native CUDA C/C++.

Why Sparse Matrices Are Hard on GPUs

Unlike dense GEMM with predictable memory patterns, SpMV faces fundamental challenges:

Dense GEMM:                     Sparse SpMV:
┌─────────────────┐             ┌─────────────────┐
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ ■ · · ■ · · · ■ │  ← Scattered non-zeros
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ · ■ · · · ■ · · │
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ · · ■ ■ · · ■ · │  Random x[col] access
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ ■ · · · ■ · · ■ │  → Poor coalescing
└─────────────────┘             └─────────────────┘
Coalesced access                Irregular access

Arithmetic Intensity: SpMV performs only 2 FLOPs per non-zero but loads ~12 bytes, yielding AI = 0.17 FLOP/byte — far below P100's ridge point of 6.4 FLOP/byte. This makes SpMV fundamentally memory-bound.

Architecture

sparse_kernels/
├── main.py                         # Entry point - runs benchmarks + generates figures
├── formats/
│   ├── csr.py                      # CSR storage + 3 baseline SpMV kernels
│   └── csr_optimized.py            # Phase 2: 7 optimization strategies
├── benchmarks/
│   ├── spmv_benchmark.py           # Phase 1 benchmark suite
│   └── phase2_benchmark.py         # Phase 2 optimization comparison
├── visualization/
│   ├── plots.py                    # Phase 1 visualizations
│   └── phase2_plots.py             # Phase 2 optimization charts
└── docs/
    ├── ARCHITECTURE.md             # Technical design details
    └── figures/                    # Generated visualizations

Quick Start

Prerequisites

pip install numpy numba scipy matplotlib

# Optional: for cuSPARSE comparison
pip install cupy-cuda12x  # Match your CUDA version

Run Benchmarks

# Phase 1: Baseline benchmark
python3 benchmarks/spmv_benchmark.py

# Phase 2: Optimization comparison
python3 benchmarks/phase2_benchmark.py

# Generate all visualizations
python3 visualization/phase2_plots.py

Example Usage

from formats.csr import CSRMatrix, spmv_gpu_vector
import numpy as np

# Create sparse matrix (100K x 100K, 0.1% density)
A = CSRMatrix.random(100000, 100000, density=0.001, seed=42)
x = np.random.randn(100000).astype(np.float32)

# GPU SpMV — 5400x faster than NumPy
y_gpu = spmv_gpu_vector(A, x)
y_result = y_gpu.copy_to_host()

print(f"Matrix: {A}")
# CSRMatrix(shape=(100000, 100000), nnz=10000000, density=0.10%)

Kernel Implementations

Baseline Kernels (Phase 1)

1. Scalar Kernel — One Thread per Row

@cuda.jit
def spmv_csr_scalar_kernel(data, indices, indptr, x, y, num_rows):
    row = cuda.grid(1)
    if row < num_rows:
        dot = 0.0
        for j in range(indptr[row], indptr[row + 1]):
            dot += data[j] * x[indices[j]]
        y[row] = dot

[+] Simple, minimal overhead
[-] Thread divergence when row lengths vary
Result: 38.5 GB/s at 100K

2. Vector Kernel — One Warp per Row (Best)

Uses 32 threads collaboratively per row with warp shuffle reduction.

[+] Parallel reduction within row
[+] Excellent for uniform row lengths
Result: 96.6 GB/s at 100K (Best)

3. Adaptive Kernel

Dynamically selects scalar vs vector based on row length.

[+] Handles mixed workloads
Result: 96.6 GB/s at 100K

Optimization Attempts (Phase 2)

Strategy	Approach	Result
Shared Memory	Cache x[] segments in SMEM	-99% (overhead too high)
Row Binning	Separate kernels for short/medium/long rows	-46% (indirection overhead)
Loop Unrolling	Process 4 elements per iteration	-3% (register pressure)
Block Size Tuning	Test 128, 256, 512 threads	No improvement

Roofline Analysis

SpMV operates deep in the memory-bound region:

Arithmetic Intensity: 0.17 FLOP/byte
P100 Ridge Point: 6.4 FLOP/byte
Maximum achievable: ~125 GFLOPS (limited by memory bandwidth)

At 100K x 100K, our Vector kernel achieves 96.6 GB/s (13.2% of peak 732 GB/s) compared to cuSPARSE's 180.8 GB/s (24.7% of peak).

Lessons Learned

Baseline can be optimal — When row lengths are uniform, the simple vector kernel is already ideal.
Shared memory isn't always faster — For random access patterns, L2 cache often works better than explicit caching.
Measure, don't assume — All 7 optimizations seemed promising in theory but failed in practice.
Numba has limits — 53% of cuSPARSE is excellent for Python. The remaining gap requires native CUDA.
Honest benchmarking matters — Comparing against production libraries shows true performance.

Roadmap

Completed

Phase 1: Baseline implementation (3 kernel variants)
Phase 1: Benchmarking infrastructure
Phase 1: cuSPARSE comparison
Phase 2: Shared memory caching
Phase 2: Row binning optimization
Phase 2: Loop unrolling
Phase 2: Block size tuning
Phase 2: Combined strategies
Documentation and visualization

Future Work

Test with variable row lengths (non-uniform matrices)
Test with real-world matrices (SuiteSparse collection)
ELL format for structured matrices
Native CUDA C implementation for comparison

References

Bell & Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA" (2008)
Merrill & Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication" (2016)
NVIDIA cuSPARSE Documentation
Numba CUDA Documentation

Hardware

Tesla P100-PCIE-16GB

Spec	Value
Architecture	Pascal (SM 6.0)
CUDA Cores	3,584
Memory	16 GB HBM2
Bandwidth	732 GB/s
FP32 Peak	9.3 TFLOPS

License

MIT License — see LICENSE for details.

Author

Andrey Maltsev
GPU Computing & HPC, RHEL Engineer

Part of my GPU computing portfolio demonstrating sparse linear algebra optimization. See also: CUDA Matrix Multiplication, N-Body Simulation

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.ipynb_checkpoints		.ipynb_checkpoints
benchmarks		benchmarks
docs		docs
formats		formats
visualization		visualization
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
main.py		main.py

Folders and files

Latest commit

History

Repository files navigation

GPU-Accelerated Sparse Matrix Operations

Key Results

Project Goals

Benchmark Results

Performance Scaling

Phase 2: Optimization Results

Why Optimizations Failed

Understanding the cuSPARSE Gap

Why Sparse Matrices Are Hard on GPUs

Architecture

Quick Start

Prerequisites

Run Benchmarks

Example Usage

Kernel Implementations

Baseline Kernels (Phase 1)

1. Scalar Kernel — One Thread per Row

2. Vector Kernel — One Warp per Row (Best)

3. Adaptive Kernel

Optimization Attempts (Phase 2)

Roofline Analysis

Lessons Learned

Roadmap

Completed

Future Work

References

Hardware

License

Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages