nvmath-python Tutorial

This repository contains a series of tutorial notebooks demonstrating the capabilities and features of nvmath-python, NVIDIA's library that bridges Python's scientific computing ecosystem with NVIDIA's CUDA-X math libraries.

Overview

nvmath-python is designed to provide high-performance mathematical operations that complement existing GPU libraries like CuPy and PyTorch. Unlike traditional array libraries, nvmath-python focuses on advanced mathematical operations with features like kernel fusion, flexible APIs, and device-level integration.

These notebooks can be run on NVIDIA Brev or Google Colab.

Docker images and Docker Compose files are provided for creating Brev Launchables or running the notebooks locally.

Brev Launchables of this tutorial should use:

4xL4, 2xL4, 2xL40S, or 1xL40S instances.
Crusoe or any other provider with Flexible Ports.

Prerequisites

To use these notebooks, you will need:

A computer equipped with an NVIDIA GPU
A Python environment with the required libraries installed

Please refer to the nvmath-python documentation for installation instructions.

Notebooks

Notebook	Tutorial	Solutions
01. Kernel Fusion
02. Memory and Execution Spaces
03. Stateful API and Autotuning
04. FFT Callbacks
05. Device API
06. Direct Sparse Solver

01. Kernel Fusion (`01_kernel_fusion.ipynb`)

Key Topics:

Introduction to nvmath-python and its role in the Python scientific computing ecosystem
Why nvmath-python is NOT an array library but co-exists with NumPy, CuPy, and PyTorch
Benchmarking GPU codes with cupyx.profiler.benchmark
Kernel fusion for composite operations (GEMM: D = α(A·B) + βC)
Performance comparison: kernel fusion vs. separate operations
Using the NVIDIA Nsight plugin for JupyterLab for performance profiling

Key Insights:

nvmath-python provides performance benefits over CuPy for GEMM operations through kernel fusion
Fused kernels avoid the overhead of multiple kernel launches and reduce memory traffic for intermediate results
Profiling reveals that CuPy requires multiple kernel launches while nvmath-python uses a single fused kernel
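
To make the fused expression concrete, the following CPU-only sketch evaluates D = α(A·B) + βC two ways on plain Python lists: as three separate passes that each build a full temporary (loosely analogous to separate kernel launches), and as a single pass that scales and accumulates while each output element is being produced. It illustrates the arithmetic only. The function names are hypothetical, and nothing here reflects how nvmath-python implements fusion on the GPU.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def gemm_unfused(alpha, a, b, beta, c):
    """Three passes, each materializing a full temporary (like separate kernels)."""
    ab = matmul(a, b)                                   # pass 1: A @ B
    scaled = [[alpha * x for x in row] for row in ab]   # pass 2: alpha * (A @ B)
    return [[s + beta * y for s, y in zip(rs, rc)]      # pass 3: + beta * C
            for rs, rc in zip(scaled, c)]


def gemm_fused(alpha, a, b, beta, c):
    """One pass: scale and accumulate as each output element is produced."""
    k = len(b)
    return [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

d_unfused = gemm_unfused(2.0, A, B, 0.5, C)
d_fused = gemm_fused(2.0, A, B, 0.5, C)
print(d_fused)  # [[38.5, 44.5], [86.5, 100.5]]
```

On a GPU, the same composite is, as far as the nvmath-python documentation describes it, requested in one call of the form `nvmath.linalg.advanced.matmul(a, b, c=c, alpha=alpha, beta=beta)`; check the documentation for your installed version before relying on the exact signature.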

02. Memory and Execution Spaces (`02_mem_exec_spaces.ipynb`)

Key Topics:

Understanding memory and execution spaces in nvmath-python
Flexibility of working with both CPU and GPU memory
Difference between specialized APIs and generic APIs
Performance implications of data transfers between memory spaces
Using nvmath-python's logging mechanism to understand internal operations

Key Insights:

Memory space (where data is stored) and execution space (where computation happens) may differ
Specialized APIs like matmul only support GPU execution, triggering expensive automatic data transfers for CPU inputs
Generic APIs like FFT adapt to input location, executing on CPU for CPU inputs and GPU for GPU inputs
Logging provides visibility into specification, planning, and execution phases
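
A minimal sketch of turning that logging on, assuming (as the nvmath-python documentation describes) that the library reports through Python's standard `logging` module. The helper `matmul_with_cpu_inputs` is a hypothetical name; it imports nvmath-python lazily so that the logging configuration itself runs on a machine without a GPU.

```python
import logging

# force=True replaces handlers that Jupyter or another library may have installed.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%H:%M:%S",
    force=True,
)


def matmul_with_cpu_inputs(n=256):
    """Run a GPU matmul on NumPy (CPU) operands so the log can show the transfers.

    Requires nvmath-python, NumPy, and an NVIDIA GPU; nothing here runs at import time.
    """
    import numpy as np
    import nvmath

    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    # The operands live in CPU memory, but this API executes on the GPU,
    # so the library has to move the data across memory spaces.
    return nvmath.linalg.advanced.matmul(a, b)
```

Calling `matmul_with_cpu_inputs()` in a GPU-enabled session should then print the library's log messages alongside the result; the exact messages depend on the installed version.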

03. Stateful API and Autotuning (`03_stateful_api.ipynb`)

Key Topics:

Stateless (function-form) vs. stateful (class-form) APIs
Understanding the four phases: specification, planning, execution, and resource management
Reusing Matmul objects for batched operations
Performance benefits of amortizing specification and planning costs
Autotuning for optimal kernel selection

Key Insights:

The stateless API is convenient for single operations but repeats specification and planning on every call
The stateful API performs specification and planning once, then allows multiple executions
Autotuning finds optimal kernels when built-in heuristics are suboptimal
The stateful API matters most for scenarios with repeated operations on similar-shaped data
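
The amortization argument can be sketched without a GPU. The toy `PlannedScale` class below is hypothetical and is not part of nvmath-python; it only mimics the specify/plan/execute split and counts how often the (pretend) expensive planning step runs under each API style.

```python
class PlannedScale:
    """Toy stand-in for a stateful operation: plan once, execute many times."""

    plans_run = 0  # class-wide counter of planning calls, for illustration only

    def __init__(self, factor):
        self.factor = factor          # specification: what to compute
        self._planned = False

    def plan(self):
        PlannedScale.plans_run += 1   # stands in for costly algorithm selection
        self._planned = True

    def execute(self, values):
        if not self._planned:
            raise RuntimeError("call plan() before execute()")
        return [self.factor * v for v in values]

    # Context-manager support models the resource-management phase.
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._planned = False
        return False


def scale_stateless(factor, values):
    """Function-form API: specification and planning are repeated on every call."""
    with PlannedScale(factor) as op:
        op.plan()
        return op.execute(values)


batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

PlannedScale.plans_run = 0
stateless_out = [scale_stateless(2.0, b) for b in batches]
stateless_plans = PlannedScale.plans_run   # one plan per call

PlannedScale.plans_run = 0
with PlannedScale(2.0) as op:
    op.plan()                              # planned once, reused for every batch
    stateful_out = [op.execute(b) for b in batches]
stateful_plans = PlannedScale.plans_run

print(stateless_plans, stateful_plans)  # 3 1
```

In nvmath-python itself the corresponding object is `nvmath.linalg.advanced.Matmul`, which, per its documentation, is used as a context manager with `plan()`, an optional `autotune()`, `execute()`, and `reset_operands()` for swapping in new same-shaped inputs.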

04. FFT Callbacks (`04_callbacks.ipynb`)

Key Topics:

Custom Python functions as FFT prolog/epilog callbacks
JIT compilation to intermediate representation (LTO-IR)
Application example: Gaussian image filtering using FFT
Comparison of CuPy vs. nvmath-python implementations
Cost breakdown: compilation, planning, and execution phases
Amortizing compilation/planning costs across batch processing

Key Insights:

Callbacks enable custom element-wise operations fused with FFT kernels
JIT compilation overhead is a one-time cost that can be amortized
For single images, CuPy may be faster due to nvmath-python's compilation overhead
For large enough batches, amortization makes the stateful API the preferred choice
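
How large is "large enough" follows from simple arithmetic: the one-time compilation and planning cost has to be repaid by the per-image saving. The sketch below works that out; the function name and all timings are hypothetical and chosen only to illustrate the calculation, not measured from the notebook.

```python
import math


def break_even_batch(one_time_cost, per_item_baseline, per_item_fused):
    """Smallest batch size at which the fused path is no slower in total.

    Total baseline time: n * per_item_baseline
    Total fused time:    one_time_cost + n * per_item_fused
    Returns None when the fused path is not faster per item, so it never catches up.
    """
    saving = per_item_baseline - per_item_fused
    if saving <= 0:
        return None
    return max(1, math.ceil(one_time_cost / saving))


# Hypothetical timings in milliseconds.
compile_and_plan_ms = 400.0
baseline_ms_per_image = 3.0
fused_ms_per_image = 1.0

n_star = break_even_batch(compile_and_plan_ms, baseline_ms_per_image, fused_ms_per_image)
print(n_star)  # 200
```

With these made-up numbers, both approaches take 600 ms at 200 images, and the fused path wins for any larger batch.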

05. Device API (`05_device_api.ipynb`)

Key Topics:

Using nvmath-python's device APIs within custom numba-cuda kernels
Monte Carlo simulation of stock prices using Geometric Brownian Motion (GBM)
Integration with nvmath-python's random number generation at device level
Comparison of memory-bound array operations vs. compute-intensive kernels
Optimizing throughput by consuming multiple random variates per iteration

Key Insights:

Custom kernels eliminate intermediate array allocations and memory transfers
nvmath-python device RNG provides performance benefits over CuPy for GBM simulation
The Philox4_32_10 generator returns 4 random variates at once, enabling vectorized consumption
Device-level APIs enable fine-grained control for compute-intensive workloads
This control is critical for applications where each thread handles complex calculations
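
The per-path arithmetic can be sketched on the CPU with the standard library. This is not the notebook's numba-cuda kernel: `random.Random.gauss` stands in for the device generator, the function names are hypothetical, and the loop over paths plays the role of one GPU thread per path. It uses the standard exact GBM update and consumes four normal variates per outer iteration to mirror the idea of using every value a generator call returns.

```python
import math
import random


def simulate_gbm_path(s0, mu, sigma, dt, n_steps, rng, variates_per_draw=4):
    """Advance one GBM path, consuming `variates_per_draw` normals per outer iteration.

    Update rule: S <- S * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z).
    """
    drift = (mu - 0.5 * sigma * sigma) * dt
    vol = sigma * math.sqrt(dt)
    s = s0
    step = 0
    while step < n_steps:
        # Stand-in for a generator call that yields several variates at once.
        draws = [rng.gauss(0.0, 1.0) for _ in range(variates_per_draw)]
        for z in draws[: n_steps - step]:   # do not overshoot the last step
            s *= math.exp(drift + vol * z)
        step += variates_per_draw
    return s


def mean_terminal_price(n_paths, s0, mu, sigma, t, n_steps, seed=0):
    """Monte Carlo estimate of E[S_T]; each iteration plays the role of one thread."""
    rng = random.Random(seed)
    dt = t / n_steps
    total = sum(simulate_gbm_path(s0, mu, sigma, dt, n_steps, rng)
                for _ in range(n_paths))
    return total / n_paths


estimate = mean_terminal_price(n_paths=2000, s0=100.0, mu=0.05, sigma=0.2,
                               t=1.0, n_steps=52)
# Compare the estimate with the analytic expectation E[S_T] = S0 * exp(mu * T).
print(estimate, 100.0 * math.exp(0.05))
```

Because the path value stays in a local variable for all steps, no intermediate arrays are allocated, which is the property the custom-kernel approach exploits on the GPU.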

06. Direct Sparse Solver (`06_sparse_solver.ipynb`)

Key Topics:

Direct sparse solver for large linear systems with sparse matrices
Solving linear equations of the form A·X = B using nvmath-python
Working with CSR (Compressed Sparse Row) format matrices
GPU and hybrid execution modes for the solver

Key Insights:

nvmath-python provides a high-performance sparse solver backed by the NVIDIA cuDSS library
Direct methods are suitable for sparse linear systems with specific matrix structures
CSR format efficiently stores and manipulates sparse matrices on GPU
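
For readers new to CSR, the following pure-Python sketch shows the three arrays the format consists of and how a matrix-vector product walks them. The function names are hypothetical; in practice the notebook's matrices would come from a sparse library rather than hand-written lists.

```python
def dense_to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (values, col_indices, row_offsets)."""
    values, col_indices, row_offsets = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        row_offsets.append(len(values))   # end of this row == start of the next
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_offsets) - 1):
        start, end = row_offsets[i], row_offsets[i + 1]
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y


A = [
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [2.0, 0.0, 5.0],
]
values, col_indices, row_offsets = dense_to_csr(A)
print(values)        # [4.0, 3.0, 1.0, 2.0, 5.0]
print(col_indices)   # [0, 1, 2, 0, 2]
print(row_offsets)   # [0, 1, 3, 5]
print(csr_matvec(values, col_indices, row_offsets, [1.0, 2.0, 3.0]))  # [4.0, 9.0, 17.0]
```

Storage grows with the number of nonzeros plus the number of rows rather than with rows times columns, which is why the format suits large sparse systems.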

General Benchmarking Notes

All notebooks use a consistent benchmarking approach with cupyx.profiler.benchmark:

Proper GPU synchronization using CUDA events
Warm-up runs to eliminate cold-start effects
Multiple repetitions for statistical stability
Reporting of the minimum time over the repeated runs
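
The pattern in the list above can be sketched as follows. `benchmark_cpu` is a hypothetical CPU-only analogue that shows the warm-up, repeat, and take-the-minimum steps with `time.perf_counter`; `min_gpu_time_ms` is a hypothetical wrapper around `cupyx.profiler.benchmark` that imports CuPy lazily, so it only needs a GPU when it is actually called.

```python
import time


def benchmark_cpu(func, args=(), n_warmup=3, n_repeat=10):
    """Time `func(*args)` on the CPU: warm up, repeat, and keep every sample."""
    for _ in range(n_warmup):            # warm-up runs are discarded
        func(*args)
    samples = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return samples


def min_gpu_time_ms(func, args=(), n_warmup=5, n_repeat=20):
    """GPU counterpart using cupyx.profiler.benchmark (requires CuPy and a GPU)."""
    from cupyx.profiler import benchmark   # lazy import so the CPU part runs anywhere

    result = benchmark(func, args, n_repeat=n_repeat, n_warmup=n_warmup)
    return float(result.gpu_times.min()) * 1e3   # gpu_times is reported in seconds


samples = benchmark_cpu(sorted, (list(range(10_000, 0, -1)),))
print(f"min {min(samples) * 1e3:.3f} ms over {len(samples)} runs")
```

The minimum is less sensitive to interference from other processes than the mean, which makes it a reasonable summary for comparing implementations.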

For detailed profiling, notebooks demonstrate using NVIDIA Nsight Systems and the JupyterLab plugin for kernel-level analysis.
