GPU-Accelerated Sparse Matrix Operations


High-performance sparse matrix kernels implemented in Python/Numba CUDA, targeting the Tesla P100 GPU. This project demonstrates GPU optimization techniques for memory-bound, irregular-access workloads — the core challenges addressed by NVIDIA's cuSPARSE library.

Performance Summary

Key Results

| Metric | Achieved | Target | Status |
|---|---|---|---|
| Peak Bandwidth | 96.6 GB/s | 100 GB/s | 97% achieved |
| vs cuSPARSE | 53.4% | 65% | 82% of target |
| vs CPU (NumPy) | 5,432× speedup | | Excellent |

This represents near-optimal performance for a Python/Numba CUDA implementation. The remaining gap to cuSPARSE requires native CUDA C/C++ and hardware features (texture cache) inaccessible from Python.

Project Goals

  1. Understand sparse matrix challenges — irregular memory access, load imbalance, low arithmetic intensity
  2. Implement multiple SpMV kernel strategies — scalar, vector, and adaptive approaches
  3. Benchmark against cuSPARSE — honest comparison with NVIDIA's production library
  4. Optimize systematically — test multiple strategies, document what works and why

Benchmark Results

Hardware: Tesla P100-PCIE-16GB (732 GB/s peak bandwidth)
Matrix density: 0.1% (typical for scientific computing)

Performance Scaling

| Matrix Size | Non-zeros | Our Best | cuSPARSE | % of cuSPARSE | CPU Speedup |
|---|---|---|---|---|---|
| 10K x 10K | 100K | 1.5 GB/s | 9.6 GB/s | 16% | 31× |
| 25K x 25K | 625K | 8.9 GB/s | 52.4 GB/s | 17% | 671× |
| 50K x 50K | 2.5M | 32.2 GB/s | 159.2 GB/s | 20% | 2,222× |
| 75K x 75K | 5.6M | 69.8 GB/s | 173.4 GB/s | 40% | 4,223× |
| 100K x 100K | 10M | 96.6 GB/s | 180.8 GB/s | 53% | 5,432× |
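The bandwidth figures above can be reproduced from the non-zero count and kernel time. Here is a minimal sketch of the byte-counting model, assuming float32 values, int32 indices, and one `x[]` load per non-zero (i.e. no cache reuse); the helper names are illustrative, not from the repo:

```python
import numpy as np

def spmv_bytes(num_rows, nnz, val_bytes=4, idx_bytes=4):
    """Approximate bytes moved by one CSR SpMV pass.

    Counts data[] and indices[] (nnz entries each), indptr[] (num_rows + 1),
    one x[] gather per non-zero (worst case), and the y[] stores.
    """
    return (nnz * (val_bytes + idx_bytes)   # data + column indices
            + (num_rows + 1) * idx_bytes    # row pointers
            + nnz * val_bytes               # x[] gathers
            + num_rows * val_bytes)         # y[] stores

def effective_bandwidth_gbs(num_rows, nnz, seconds):
    """Effective bandwidth in GB/s for one timed SpMV."""
    return spmv_bytes(num_rows, nnz) / seconds / 1e9
```

With this model, a 100K × 100K matrix with 10M non-zeros moves roughly 121 MB per SpMV, so a 1.25 ms kernel corresponds to about 96.6 GB/s.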


Phase 2: Optimization Results

We implemented and tested 7 optimization strategies. None outperformed the baseline at large matrix sizes.

Optimization Comparison

| Optimization | 100K Performance | vs Baseline | Status |
|---|---|---|---|
| Baseline (Vector) | 96.6 GB/s | | Best |
| Shared Memory Caching | 0.7 GB/s | -99% | Failed |
| Row Binning | 52.0 GB/s | -46% | Worse |
| Loop Unrolling | 93.3 GB/s | -3% | Worse |
| Block Size 128 | 87.1 GB/s | -10% | Worse |
| Block Size 512 | 95.1 GB/s | -2% | Same |
| Combined | 93.1 GB/s | -4% | Worse |
| P100 Tuned | 95.5 GB/s | -1% | Same |

Why Optimizations Failed


The test matrices have perfectly uniform row lengths (every row has exactly N×density non-zeros):

```
Row lengths: min=100, max=100, mean=100.0, std=0.0
```

This means:

  • Zero warp divergence — all 32 threads in a warp finish simultaneously
  • Optimal load balancing — no threads are idle
  • Nothing to optimize — the baseline is already ideal

The shared memory approach failed catastrophically because each row must iterate through 100 cached x-segments while processing only 100 non-zeros: roughly 10,000 operations per row versus 100 for the baseline.
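The uniformity claim is easy to verify directly from the CSR row pointers. A small NumPy helper (the function name is hypothetical, not part of the repo):

```python
import numpy as np

def row_length_stats(indptr):
    """Row i of a CSR matrix holds indptr[i+1] - indptr[i] non-zeros."""
    lengths = np.diff(indptr)
    return {"min": int(lengths.min()), "max": int(lengths.max()),
            "mean": float(lengths.mean()), "std": float(lengths.std())}

# A perfectly uniform matrix (100 non-zeros per row) shows std = 0,
# so all 32 lanes of every warp finish their rows in lock-step.
uniform_indptr = np.arange(0, 101 * 100, 100)  # 100 rows of length 100
```

On such a matrix `row_length_stats` reports `std = 0.0`, which is exactly the regime where the baseline vector kernel is already optimal.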

Understanding the cuSPARSE Gap

cuSPARSE Gap Analysis

| Factor | Estimated Impact | Accessible from Numba? |
|---|---|---|
| Texture cache for x[] | ~15–20% | No |
| Native CUDA vs JIT | ~10–15% | No |
| Hand-tuned assembly | ~7–10% | No |
| CSR5/HYB formats | ~10–15% | Possible but complex |

Conclusion: 53% of cuSPARSE is near-optimal for Numba CUDA. Further gains require native CUDA C/C++.

Why Sparse Matrices Are Hard on GPUs

Sparsity Pattern

Unlike dense GEMM with predictable memory patterns, SpMV faces fundamental challenges:

```
Dense GEMM:                     Sparse SpMV:
┌─────────────────┐             ┌─────────────────┐
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ ■ · · ■ · · · ■ │  ← Scattered non-zeros
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ · ■ · · · ■ · · │
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ · · ■ ■ · · ■ · │  Random x[col] access
│ ■ ■ ■ ■ ■ ■ ■ ■ │             │ ■ · · · ■ · · ■ │  → Poor coalescing
└─────────────────┘             └─────────────────┘
Coalesced access                Irregular access
```

Arithmetic Intensity: SpMV performs only 2 FLOPs per non-zero but loads ~12 bytes, yielding AI = 0.17 FLOP/byte — far below P100's ridge point of 6.4 FLOP/byte. This makes SpMV fundamentally memory-bound.

Architecture

```
sparse_kernels/
├── main.py                         # Entry point - runs benchmarks + generates figures
├── formats/
│   ├── csr.py                      # CSR storage + 3 baseline SpMV kernels
│   └── csr_optimized.py            # Phase 2: 7 optimization strategies
├── benchmarks/
│   ├── spmv_benchmark.py           # Phase 1 benchmark suite
│   └── phase2_benchmark.py         # Phase 2 optimization comparison
├── visualization/
│   ├── plots.py                    # Phase 1 visualizations
│   └── phase2_plots.py             # Phase 2 optimization charts
└── docs/
    ├── ARCHITECTURE.md             # Technical design details
    └── figures/                    # Generated visualizations
```

Quick Start

Prerequisites

```bash
pip install numpy numba scipy matplotlib

# Optional: for cuSPARSE comparison
pip install cupy-cuda12x  # Match your CUDA version
```

Run Benchmarks

```bash
# Phase 1: Baseline benchmark
python3 benchmarks/spmv_benchmark.py

# Phase 2: Optimization comparison
python3 benchmarks/phase2_benchmark.py

# Generate all visualizations
python3 visualization/phase2_plots.py
```

Example Usage

```python
from formats.csr import CSRMatrix, spmv_gpu_vector
import numpy as np

# Create sparse matrix (100K x 100K, 0.1% density)
A = CSRMatrix.random(100000, 100000, density=0.001, seed=42)
x = np.random.randn(100000).astype(np.float32)

# GPU SpMV — 5,432x faster than NumPy at this size
y_gpu = spmv_gpu_vector(A, x)
y_result = y_gpu.copy_to_host()

print(f"Matrix: {A}")
# CSRMatrix(shape=(100000, 100000), nnz=10000000, density=0.10%)
```
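To sanity-check the GPU result, a plain NumPy CSR reference is handy. A minimal sketch, assuming the `CSRMatrix` object exposes its `data`/`indices`/`indptr` arrays (an assumption about the repo's API; the function name is illustrative):

```python
import numpy as np

def spmv_csr_reference(data, indices, indptr, x):
    """CPU reference SpMV: y[i] = sum over row i of data[j] * x[indices[j]]."""
    num_rows = len(indptr) - 1
    y = np.zeros(num_rows, dtype=x.dtype)
    for i in range(num_rows):
        lo, hi = indptr[i], indptr[i + 1]     # row i's non-zero range
        y[i] = data[lo:hi] @ x[indices[lo:hi]]  # dot with gathered x entries
    return y
```

Comparing `y_result` against this reference with `np.testing.assert_allclose(..., rtol=1e-5)` confirms correctness, with a relaxed tolerance because GPU reduction order differs from the sequential sum.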

Kernel Implementations

Baseline Kernels (Phase 1)

1. Scalar Kernel — One Thread per Row

```python
from numba import cuda

@cuda.jit
def spmv_csr_scalar_kernel(data, indices, indptr, x, y, num_rows):
    row = cuda.grid(1)  # one thread handles one row
    if row < num_rows:
        dot = 0.0
        # Row's non-zeros live in data[indptr[row]:indptr[row + 1]]
        for j in range(indptr[row], indptr[row + 1]):
            dot += data[j] * x[indices[j]]  # gather from x by column index
        y[row] = dot
```

  • [+] Simple, minimal overhead
  • [-] Thread divergence when row lengths vary
  • Result: 38.5 GB/s at 100K

2. Vector Kernel — One Warp per Row (Best)

Uses 32 threads collaboratively per row with warp shuffle reduction.

  • [+] Parallel reduction within row
  • [+] Excellent for uniform row lengths
  • Result: 96.6 GB/s at 100K (Best)
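The vector kernel's source is not listed here; its core is a shuffle-down tree reduction across the 32 lanes of a warp. A NumPy emulation of that reduction pattern (illustrative only; on the device the real kernel would use warp shuffle intrinsics such as Numba's `cuda.shfl_down_sync`):

```python
import numpy as np

def warp_reduce_sum(lane_vals):
    """Emulate a shuffle-down tree reduction over one 32-lane warp.

    At each step, lane i adds the value held by lane i + offset;
    after log2(32) = 5 steps, lane 0 holds the full sum.
    """
    vals = np.asarray(lane_vals, dtype=np.float64).copy()
    assert vals.size == 32
    offset = 16
    while offset > 0:
        shifted = np.roll(vals, -offset)
        shifted[-offset:] = 0.0  # lanes past the warp edge contribute nothing
        vals += shifted          # every lane adds its neighbour's value
        offset //= 2
    return vals[0]
```

Each thread first accumulates a strided partial sum over the row's non-zeros; the five-step reduction then combines the 32 partials in O(log n) steps, which is why this kernel shines on uniform rows.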

3. Adaptive Kernel

Dynamically selects scalar vs vector based on row length.

  • [+] Handles mixed workloads
  • Result: 96.6 GB/s at 100K

Optimization Attempts (Phase 2)

| Strategy | Approach | Result |
|---|---|---|
| Shared Memory | Cache x[] segments in SMEM | -99% (overhead too high) |
| Row Binning | Separate kernels for short/medium/long rows | -46% (indirection overhead) |
| Loop Unrolling | Process 4 elements per iteration | -3% (register pressure) |
| Block Size Tuning | Test 128, 256, 512 threads | No improvement |
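Of these, row binning is the strategy one would expect to pay off on skewed matrices, and the binning step itself is cheap. A sketch of the partitioning on the host side (the thresholds and function name are hypothetical; the repo's actual cut-offs may differ):

```python
import numpy as np

def bin_rows_by_length(indptr, short_max=32, long_min=256):
    """Partition CSR rows into short/medium/long bins by non-zero count,
    so each bin can be dispatched to a kernel sized for it
    (e.g. thread-per-row, warp-per-row, block-per-row)."""
    lengths = np.diff(indptr)
    short = np.flatnonzero(lengths <= short_max)
    long_ = np.flatnonzero(lengths >= long_min)
    medium = np.flatnonzero((lengths > short_max) & (lengths < long_min))
    return short, medium, long_
```

On the uniform test matrices every row lands in the same bin, so the strategy adds only indirection overhead, consistent with the -46% result above.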

Roofline Analysis

Roofline Model

SpMV operates deep in the memory-bound region:

  • Arithmetic Intensity: 0.17 FLOP/byte (2 FLOPs per non-zero / ~12 bytes loaded)
  • P100 Ridge Point: 6.4 FLOP/byte (relative to the 4.7 TFLOPS FP64 peak; ~12.7 FLOP/byte against the 9.3 TFLOPS FP32 peak)
  • Maximum achievable: ~125 GFLOPS (732 GB/s × 0.17 FLOP/byte)

At 100K x 100K, our Vector kernel achieves 96.6 GB/s (13.2% of peak 732 GB/s) compared to cuSPARSE's 180.8 GB/s (24.7% of peak).
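The roofline numbers above reduce to a few lines of arithmetic. A quick check, assuming the ridge point is taken against the P100's 4.7 TFLOPS FP64 peak (the 9.3 TFLOPS FP32 peak would put the ridge near 12.7 FLOP/byte; either way SpMV sits far below it):

```python
# Roofline arithmetic for SpMV on the P100
PEAK_BW_GBS = 732.0        # HBM2 bandwidth
PEAK_FP64_GFLOPS = 4700.0  # FP64 peak, consistent with the 6.4 FLOP/byte ridge

flops_per_nnz = 2.0        # one multiply + one add per non-zero
bytes_per_nnz = 12.0       # ~4 B value + 4 B index + 4 B x[] gather
ai = flops_per_nnz / bytes_per_nnz                    # ~0.17 FLOP/byte

ridge = PEAK_FP64_GFLOPS / PEAK_BW_GBS                # ~6.4 FLOP/byte
attainable = min(PEAK_FP64_GFLOPS, PEAK_BW_GBS * ai)  # ~122 GFLOPS

print(f"AI={ai:.2f} FLOP/byte, ridge={ridge:.1f}, attainable={attainable:.0f} GFLOPS")
```

Since the attainable ceiling is set entirely by `PEAK_BW_GBS * ai`, every optimization in this project amounts to moving fewer bytes or moving them faster, never to adding FLOPs.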

Lessons Learned

  1. Baseline can be optimal — When row lengths are uniform, the simple vector kernel is already ideal.

  2. Shared memory isn't always faster — For random access patterns, L2 cache often works better than explicit caching.

  3. Measure, don't assume — All 7 optimizations seemed promising in theory but failed in practice.

  4. Numba has limits — 53% of cuSPARSE is excellent for Python. The remaining gap requires native CUDA.

  5. Honest benchmarking matters — Comparing against production libraries shows true performance.

Roadmap

Completed

  • Phase 1: Baseline implementation (3 kernel variants)
  • Phase 1: Benchmarking infrastructure
  • Phase 1: cuSPARSE comparison
  • Phase 2: Shared memory caching
  • Phase 2: Row binning optimization
  • Phase 2: Loop unrolling
  • Phase 2: Block size tuning
  • Phase 2: Combined strategies
  • Documentation and visualization

Future Work

  • Test with variable row lengths (non-uniform matrices)
  • Test with real-world matrices (SuiteSparse collection)
  • ELL format for structured matrices
  • Native CUDA C implementation for comparison

References

Hardware

Tesla P100-PCIE-16GB

| Spec | Value |
|---|---|
| Architecture | Pascal (SM 6.0) |
| CUDA Cores | 3,584 |
| Memory | 16 GB HBM2 |
| Bandwidth | 732 GB/s |
| FP32 Peak | 9.3 TFLOPS |

License

MIT License — see LICENSE for details.

Author

Andrey Maltsev
GPU Computing & HPC, RHEL Engineer


Part of my GPU computing portfolio demonstrating sparse linear algebra optimization. See also: CUDA Matrix Multiplication, N-Body Simulation
