This repository contains a series of tutorial notebooks demonstrating the capabilities and features of nvmath-python, NVIDIA's library that bridges Python's scientific computing ecosystem with NVIDIA's CUDA-X math libraries.
nvmath-python is designed to provide high-performance mathematical operations that complement existing GPU libraries like CuPy and PyTorch. Unlike traditional array libraries, nvmath-python focuses on advanced mathematical operations with features like kernel fusion, flexible APIs, and device-level integration.
These notebooks can be run on NVIDIA Brev or Google Colab.
The repository includes Docker images and Docker Compose files for creating Brev Launchables or running locally.
Brev Launchables of this tutorial should use:
- 4x L4, 2x L4, 2x L40S, or 1x L40S instances.
- Crusoe or any other provider with Flexible Ports.
To use these notebooks, you will need:
- A computer equipped with an NVIDIA GPU
- A Python environment with the required libraries installed
Please refer to the nvmath-python documentation for installation instructions.
| Notebook | Tutorial | Solutions |
|---|---|---|
| 01. Kernel Fusion | ||
| 02. Memory and Execution Spaces | ||
| 03. Stateful API and Autotuning | ||
| 04. FFT Callbacks | ||
| 05. Device API | ||
| 06. Direct Sparse Solver | ||
01. Kernel Fusion (01_kernel_fusion.ipynb)
Key Topics:
- Introduction to nvmath-python and its role in the Python scientific computing ecosystem
- Why nvmath-python is NOT an array library but co-exists with NumPy, CuPy, and PyTorch
- Benchmarking GPU codes with `cupyx.profiler.benchmark`
- Kernel fusion for composite operations (GEMM: `D = α(A·B) + βC`)
- Performance comparison: kernel fusion vs. separate operations
- Using NVIDIA Nsight plugin for JupyterLab for performance profiling
Key Insights:
- nvmath-python provides performance benefits over CuPy for GEMM operations through kernel fusion
- Fused kernels eliminate multiple kernel invocation overhead and optimize memory accesses
- Profiling reveals that CuPy requires multiple kernel launches while nvmath-python uses a single fused kernel
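To make the fusion idea concrete, here is a minimal CPU-only sketch in plain Python. It does not call nvmath-python or CuPy, and the function names `gemm_unfused` and `gemm_fused` are hypothetical. It only illustrates the difference in structure: the separate-operations route materializes a full intermediate matrix at each step, while the fused route produces each element of D = α(A·B) + βC in a single pass.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    k = len(b)
    return [[sum(a[i][p] * b[p][j] for p in range(k))
             for j in range(len(b[0]))] for i in range(len(a))]


def gemm_unfused(alpha, a, b, beta, c):
    """Three separate passes, each materializing a full intermediate matrix."""
    ab = matmul(a, b)                                   # pass 1: A·B
    scaled = [[alpha * x for x in row] for row in ab]   # pass 2: α(A·B)
    return [[s + beta * y for s, y in zip(srow, crow)]  # pass 3: + βC
            for srow, crow in zip(scaled, c)]


def gemm_fused(alpha, a, b, beta, c):
    """One pass: scale and add C as each output element is produced."""
    k = len(b)
    return [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(len(b[0]))] for i in range(len(a))]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

D = gemm_fused(2.0, A, B, 0.5, C)
D_unfused = gemm_unfused(2.0, A, B, 0.5, C)
```

Both routes give the same D. On a GPU, the fused form avoids writing the intermediates to memory and reading them back, and avoids the extra kernel launches.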
02. Memory and Execution Spaces (02_mem_exec_spaces.ipynb)
Key Topics:
- Understanding memory and execution spaces in nvmath-python
- Flexibility of working with both CPU and GPU memory
- Difference between specialized APIs and generic APIs
- Performance implications of data transfers between memory spaces
- Using nvmath-python's logging mechanism to understand internal operations
Key Insights:
- Memory space (where data is stored) and execution space (where computation happens) may differ
- Specialized APIs like `matmul` only support GPU execution, triggering expensive automatic data transfers for CPU inputs
- Generic APIs like FFT adapt to input location, executing on CPU for CPU inputs and GPU for GPU inputs
- Logging provides visibility into specification, planning, and execution phases
03. Stateful API and Autotuning (03_stateful_api.ipynb)
Key Topics:
- Stateless (function-form) vs. stateful (class-form) APIs
- Understanding the four phases: specification, planning, execution, and resource management
- Reusing `Matmul` objects for batched operations
- Performance benefits of amortizing specification and planning costs
- Autotuning for optimal kernel selection
Key Insights:
- The stateless API is convenient for single operations but repeats specification and planning on each call
- The stateful API performs specification and planning once, followed by multiple executions
- Autotuning finds optimal kernels when built-in heuristics are suboptimal
- The stateful API matters most in scenarios with repeated operations on similar-shaped data
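The stateless vs. stateful trade-off can be sketched with a toy class. `ToyMatmul` below is a hypothetical stand-in, not nvmath-python's `Matmul`, and it runs on the CPU in plain Python. A counter stands in for the expensive planning step, so the sketch shows how often planning runs in each style, not real timings.

```python
class ToyMatmul:
    """Hypothetical stand-in for a stateful operation object (not nvmath-python's class)."""

    plans_created = 0  # counts how often the (notionally expensive) planning step runs

    def __init__(self, a, b):
        self.a, self.b = a, b            # specification: capture the operands
        self.planned = False

    def plan(self):
        ToyMatmul.plans_created += 1     # stands in for costly algorithm selection
        self.planned = True

    def reset_operands(self, a, b):
        # New operands must match the shapes the plan was made for.
        assert len(a) == len(self.a) and len(b[0]) == len(self.b[0])
        self.a, self.b = a, b

    def execute(self):
        assert self.planned, "call plan() before execute()"
        k = len(self.b)
        return [[sum(self.a[i][p] * self.b[p][j] for p in range(k))
                 for j in range(len(self.b[0]))] for i in range(len(self.a))]


def stateless_matmul(a, b):
    """Function form: specify, plan, and execute on every call."""
    op = ToyMatmul(a, b)
    op.plan()
    return op.execute()


# Five same-shaped problems, as in a batch of repeated operations.
batch = [([[1, 2], [3, 4]], [[i, 0], [0, i]]) for i in range(1, 6)]

ToyMatmul.plans_created = 0
stateless_results = [stateless_matmul(a, b) for a, b in batch]
stateless_plans = ToyMatmul.plans_created   # one plan per call

ToyMatmul.plans_created = 0
op = ToyMatmul(*batch[0])
op.plan()                                   # plan once, then reuse
stateful_results = []
for a, b in batch:
    op.reset_operands(a, b)
    stateful_results.append(op.execute())
stateful_plans = ToyMatmul.plans_created
```

Both styles return identical results. The stateful loop plans once instead of five times, which is the cost the stateful API amortizes.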
04. FFT Callbacks (04_callbacks.ipynb)
Key Topics:
- Custom Python functions as FFT prolog/epilog callbacks
- JIT compilation to intermediate representation (LTO-IR)
- Application example: Gaussian image filtering using FFT
- Comparison of CuPy vs. nvmath-python implementations
- Cost breakdown: compilation, planning, and execution phases
- Amortizing compilation/planning costs across batch processing
Key Insights:
- Callbacks enable custom element-wise operations fused with FFT kernels
- JIT compilation overhead is a one-time cost that can be amortized
- For single images, CuPy may be faster due to nvmath-python's compilation overhead
- For large enough batches, amortization makes the stateful API the preferred choice
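The amortization argument can be worked through with simple arithmetic. The sketch below uses made-up timings in microseconds, not measurements from the notebook, and the function names are hypothetical. It assumes a baseline with no one-time cost and a callback-based path that pays a one-time compilation and planning overhead but is cheaper per image.

```python
def total_cost(n_images, per_image_us, one_time_us=0):
    """Total time for a batch: one-time overhead plus per-image cost."""
    return one_time_us + n_images * per_image_us


def break_even_batch(one_time_us, baseline_per_image_us, fused_per_image_us):
    """Smallest batch size at which the path with one-time overhead is strictly faster."""
    saving = baseline_per_image_us - fused_per_image_us
    if saving <= 0:
        return None  # the overhead is never recovered
    # Need n * saving > one_time_us; integer arithmetic avoids rounding issues.
    return one_time_us // saving + 1


# Illustrative numbers only (assumptions, not benchmark results).
ONE_TIME_US = 500_000   # compilation + planning, paid once
BASELINE_US = 300       # per image, no one-time cost
FUSED_US = 200          # per image, after the one-time cost

n_star = break_even_batch(ONE_TIME_US, BASELINE_US, FUSED_US)
single_image_gap = total_cost(1, FUSED_US, ONE_TIME_US) - total_cost(1, BASELINE_US)
```

With these assumed numbers the baseline wins clearly for a single image, and the path with one-time overhead only pulls ahead once the batch exceeds `ONE_TIME_US / (BASELINE_US - FUSED_US)` images.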
05. Device API (05_device_api.ipynb)
Key Topics:
- Using nvmath-python's device APIs within custom numba-cuda kernels
- Monte Carlo simulation of stock prices using Geometric Brownian Motion (GBM)
- Integration with nvmath-python's random number generation at device level
- Comparison of memory-bound array operations vs. compute-intensive kernels
- Optimizing throughput by consuming multiple random variates per iteration
Key Insights:
- Custom kernels eliminate intermediate array allocations and memory transfers
- nvmath-python device RNG provides performance benefits over CuPy for GBM simulation
- Philox4_32_10 generator returns 4 random variates at once, enabling vectorized consumption
- Device-level APIs enable fine-grained control for compute-intensive workloads
- This control matters most in applications where each thread handles complex calculations
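For readers unfamiliar with the model, here is a CPU-only sketch of the GBM update using Python's standard library. It is not the notebook's numba-cuda kernel and does not use nvmath-python's device RNG. The function name and parameter values are assumptions for illustration. Each step applies S ← S · exp((μ − σ²/2)·dt + σ·√dt·Z) with Z drawn from a standard normal, and the inner loop consumes variates in groups of four to mimic a generator that returns four values per call.

```python
import math
import random


def simulate_gbm_terminal(s0, mu, sigma, t, n_steps, n_paths, seed=0):
    """Return the terminal prices of n_paths GBM paths (stdlib RNG, CPU only)."""
    assert n_steps % 4 == 0, "this sketch consumes variates in groups of four"
    rng = random.Random(seed)
    dt = t / n_steps
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    terminal = []
    for _ in range(n_paths):
        log_s = math.log(s0)             # accumulate in log space; no per-step arrays
        for _ in range(n_steps // 4):
            # Hypothetical stand-in for a generator that yields 4 variates per call.
            z4 = [rng.gauss(0.0, 1.0) for _ in range(4)]
            for z in z4:
                log_s += drift + vol * z
        terminal.append(math.exp(log_s))
    return terminal


prices = simulate_gbm_terminal(s0=100.0, mu=0.05, sigma=0.2, t=1.0,
                               n_steps=8, n_paths=20_000, seed=42)
mean_price = sum(prices) / len(prices)
expected_mean = 100.0 * math.exp(0.05 * 1.0)   # E[S_T] = S0 * exp(mu * T)
```

The sample mean should land close to the analytic expectation, which is a quick sanity check on the update rule. Keeping each path's state in a local variable mirrors how a custom kernel avoids intermediate array allocations.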
06. Direct Sparse Solver (06_sparse_solver.ipynb)
Key Topics:
- Direct sparse solver for large linear systems with sparse matrices
- Solving linear equations of the form A·X = B using nvmath-python
- Working with CSR (Compressed Sparse Row) format matrices
- GPU and hybrid execution modes for the solver
Key Insights:
- nvmath-python provides a high-performance sparse solver backed by the NVIDIA cuDSS library
- Direct methods are suitable for sparse linear systems with specific matrix structures
- CSR format efficiently stores and manipulates sparse matrices on GPU
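As a reminder of what CSR stores, the sketch below builds the three CSR arrays for a small matrix in plain Python and uses them for a matrix-vector product. It illustrates the storage layout only. It is not the solver, and the helper names `dense_to_csr` and `csr_matvec` are hypothetical.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)       # nonzero values, row by row
                indices.append(col)      # column index of each stored value
        indptr.append(len(data))         # where the next row starts in data
    return data, indices, indptr


def csr_matvec(data, indices, indptr, x):
    """Compute y = A·x touching only the stored nonzeros."""
    y = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        y.append(sum(data[k] * x[indices[k]] for k in range(start, end)))
    return y


A = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 0.0, 2.0]]

data, indices, indptr = dense_to_csr(A)
x = [1.0, 2.0, 3.0]
b = csr_matvec(data, indices, indptr, x)   # right-hand side for which x solves A·x = b
```

A direct solver takes such a CSR matrix together with `b` and recovers `x`. Here the product is only used to construct a consistent right-hand side.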
All notebooks use a consistent benchmarking approach with cupyx.profiler.benchmark:
- Proper GPU synchronization using CUDA events
- Warm-up runs to eliminate cold-start effects
- Multiple repetitions for statistical stability
- Reports minimum time from repeated runs
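The same methodology can be sketched on the CPU with the standard library. This is not `cupyx.profiler.benchmark`: it times with `time.perf_counter` rather than CUDA events and performs no GPU synchronization, so it is only suitable for host-side code. The parameter names `n_warmup` and `n_repeat` are chosen to resemble the CuPy utility.

```python
import time


def benchmark(fn, args=(), n_warmup=3, n_repeat=10):
    """CPU-only sketch: warm up, time repeated runs, and report the minimum."""
    for _ in range(n_warmup):
        fn(*args)                        # warm-up runs are executed but not timed
    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return {"min": min(times), "mean": sum(times) / len(times), "runs": len(times)}


result = benchmark(sum, args=(range(100_000),), n_warmup=2, n_repeat=5)
print(f"min over {result['runs']} runs: {result['min'] * 1e3:.3f} ms")
```

Reporting the minimum reduces the influence of one-off disturbances such as scheduling noise. For GPU code, timing must account for asynchronous kernel launches, which is why the notebooks rely on CUDA events.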
For detailed profiling, notebooks demonstrate using NVIDIA Nsight Systems and the JupyterLab plugin for kernel-level analysis.
