diff --git a/docs/sdk/context-transfer-engine/gpu-inf-mem.md b/docs/sdk/context-transfer-engine/gpu-inf-mem.md new file mode 100644 index 00000000..55b1e314 --- /dev/null +++ b/docs/sdk/context-transfer-engine/gpu-inf-mem.md @@ -0,0 +1,357 @@ +# GPU Infinite Memory (CUDA UVM) + +## Overview + +The GPU Infinite Memory module (`wrp_cte::uvm`) provides software-managed +demand paging for GPU virtual memory. It allows applications to reserve a +virtual address (VA) range far larger than physical GPU memory (up to 512 GB +by default) and back pages with physical memory on demand. + +Key capabilities: + +- **Demand paging** — physical memory is allocated only when a page is + explicitly touched, keeping the GPU memory footprint minimal. +- **Eviction and restoration** — pages can be evicted to pinned host RAM + (or optionally to a CTE blob store) and later restored with their data + intact. +- **Prefetching** — touching a page automatically prefetches a + configurable number of subsequent pages asynchronously. +- **Async overlap** — separate transfer and compute CUDA streams allow + data movement to overlap with kernel execution. +- **Userspace only** — no kernel driver modifications or root privileges + required. Uses the CUDA driver API (`cuMemAddressReserve`, + `cuMemCreate`, `cuMemMap`). + +Typical use cases include LLM weight overcommitment, sparse tensor +computations, and workloads that exceed the physical GPU memory capacity. + +## Installation + +### Prerequisites + +- Linux x86_64 with an NVIDIA GPU (compute capability 7.0+) +- CUDA Toolkit 12.0 or later +- NVIDIA driver 525+ installed on the host + +### Install with Conda (recommended) + +The conda installer automatically pulls in `cuda-toolkit` from `conda-forge` +and `nvidia` channels, so you do **not** need a system-wide CUDA installation. 
+ +```bash +# Clone the repository +git clone --recursive https://github.com/iowarp/clio-core.git +cd clio-core + +# Install with CUDA variant (creates the "iowarp" conda environment) +./install.sh cuda +``` + +This runs the following under the hood: + +1. Creates (or reuses) an `iowarp` conda environment with Miniconda +2. Installs `cuda-toolkit` and `cuda-cudart` from conda-forge/nvidia +3. Runs `rattler-build` with the `cuda.yaml` variant, which selects the + `cuda-release` CMake preset +4. Installs headers and libraries into `$CONDA_PREFIX` + +After installation, verify: + +```bash +conda activate iowarp +ls $CONDA_PREFIX/lib/libwrp_cte_uvm.so # UVM shared library +ls $CONDA_PREFIX/include/wrp_cte/uvm/gpu_vmm.h # Public header +``` + +#### Manual conda build + +If you need more control, you can invoke `rattler-build` directly: + +```bash +conda activate iowarp +conda install -y rattler-build cuda-toolkit -c conda-forge -c nvidia + +rattler-build build \ + --recipe installers/conda/ \ + --variant-config installers/conda/variants/cuda.yaml \ + -c conda-forge -c nvidia + +# Install the built package +conda install -c build/conda-output -c conda-forge -c nvidia iowarp-core +``` + +### Install from Source + +```bash +# Configure with the cuda-release preset +cmake --preset=cuda-release + +# Build +cmake --build build --parallel + +# Install +sudo cmake --install build +``` + +This installs `libwrp_cte_uvm.so` to `$PREFIX/lib/` and headers to +`$PREFIX/include/wrp_cte/uvm/`. + +## Linking + +### CMake + +```cmake +find_package(wrp_cte_uvm REQUIRED) +target_link_libraries(your_app PRIVATE wrp_cte::uvm) +``` + +The CMake package automatically links the CUDA driver and runtime libraries. + +## Quick Start + +```cpp +#include <wrp_cte/uvm/gpu_vmm.h> + +using namespace wrp_cte::uvm; + +int main() { + // 1. 
Configure + GpuVmmConfig config; + config.va_size_bytes = 512ULL * 1024 * 1024 * 1024; // 512 GB VA space + config.page_size = 2ULL * 1024 * 1024; // 2 MB pages + config.fill_value = 0; // Zero-fill new pages + config.prefetch_window = 4; // Prefetch 4 pages ahead + + // 2. Initialize + GpuVirtualMemoryManager vmm; + vmm.init(config); + + // 3. Touch pages on demand + vmm.touchPage(0); // Back page 0 with physical memory + CUdeviceptr ptr = vmm.getPagePtr(0); // Get device pointer + // ... launch kernels using ptr on vmm.getComputeStream() ... + + // 4. Touch a range (covers all pages in the byte range) + vmm.touchRange(0, 64ULL * 1024 * 1024); // Back first 64 MB + + // 5. Evict pages to free physical memory + vmm.evictPage(0); // Data saved to host RAM + + // 6. Re-touch restores the original data (not re-filled) + vmm.touchPage(0); // Data restored from host + + // 7. Cleanup + vmm.destroy(); + return 0; +} +``` + +## Configuration + +All configuration is passed through `GpuVmmConfig`: + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `va_size_bytes` | `size_t` | 512 GB | Total virtual address space to reserve. No physical memory is allocated upfront. | +| `page_size` | `size_t` | 2 MB | Granularity for demand paging. Aligned up to the hardware allocation granularity. | +| `fill_value` | `int` | 5 | Value used to fill freshly allocated pages (as `int` words). | +| `device` | `int` | 0 | CUDA device ordinal. | +| `prefetch_window` | `size_t` | 4 | Number of pages to prefetch ahead asynchronously when `touchPage()` is called. Set to 0 to disable. | +| `use_cte` | `bool` | false | When true, evicted pages are stored in a CTE blob store instead of pinned host RAM. Requires IOWarp CTE to be running. | +| `cte_tag_name` | `string` | `"gpu_vmm_pages"` | Tag name used in the CTE blob store for page data. 
| + +## API Reference + +### Lifecycle + +```cpp +CUresult init(const GpuVmmConfig &config = GpuVmmConfig()); +void destroy(); +``` + +`init()` reserves the VA range, creates CUDA streams, and prepares the page +table. `destroy()` unmaps all pages, frees backing stores, and releases the VA +reservation. Also called by the destructor. + +### Demand Paging + +```cpp +CUresult touchPage(size_t page_index); +CUresult touchPageAsync(size_t page_index); +CUresult touchRange(size_t offset, size_t size); +``` + +- **`touchPage`** — synchronously backs a page. If the page was + previously evicted, its data is restored from host RAM (or CTE). Otherwise + the page is filled with `fill_value`. Automatically calls `prefetchAhead()`. +- **`touchPageAsync`** — queues the page backing on the transfer stream. + The caller must call `syncTransfer()` before accessing the page. Does not + trigger prefetch. +- **`touchRange`** — touches all pages covering the byte range + `[offset, offset + size)`. + +### Eviction + +```cpp +CUresult evictPage(size_t page_index); +CUresult evictPageAsync(size_t page_index); +``` + +- **`evictPage`** — synchronously copies page data to host RAM (or CTE), + then unmaps and releases the physical allocation. The VA slot is preserved. +- **`evictPageAsync`** — the D2H copy runs on the transfer stream, + allowing the compute stream to continue. Synchronizes before unmapping. + +### Prefetching + +```cpp +void prefetchAhead(size_t page_index); +``` + +Asynchronously touches pages `[page_index + 1, page_index + prefetch_window]` +on the transfer stream. Called automatically by `touchPage()`. 
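As a concrete illustration of the window arithmetic above, the set of pages a prefetch would queue can be computed with plain index math. This is a sketch, not the module's implementation; the helper name and the clamping behavior at the end of the VA range are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: the pages a prefetchAhead(page_index)-style call would queue,
// assuming the window is clamped so it never runs past the last page.
std::vector<std::size_t> prefetch_targets(std::size_t page_index,
                                          std::size_t prefetch_window,
                                          std::size_t total_pages) {
  std::vector<std::size_t> targets;
  std::size_t last = std::min(page_index + prefetch_window, total_pages - 1);
  for (std::size_t p = page_index + 1; p <= last; ++p) {
    targets.push_back(p);  // each would be queued on the transfer stream
  }
  return targets;
}
```

With `prefetch_window = 4` and 256 total pages, touching page 10 would queue pages 11 through 14; near the end of the range the window shrinks rather than overrunning the last page.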
+ +### Query + +```cpp +CUdeviceptr getBasePtr() const; +size_t getPageSize() const; +size_t getTotalPages() const; +size_t getMappedPageCount() const; +size_t getEvictedPageCount() const; +bool isMapped(size_t page_index) const; +bool isEvictedToHost(size_t page_index) const; +CUdeviceptr getPagePtr(size_t page_index) const; +``` + +### Stream Management + +```cpp +cudaStream_t getTransferStream() const; +cudaStream_t getComputeStream() const; +void syncTransfer(); +void syncCompute(); +``` + +The VMM maintains two CUDA streams: + +- **Transfer stream** — used for all H2D / D2H copies and async page + operations. +- **Compute stream** — intended for user kernel launches. This separation + allows overlapping data movement with computation. + +## Architecture + +### Memory Layout + +``` +Virtual Address Space (e.g. 512 GB) +┌─────────┬─────────┬─────────┬─────────┬─── ─ ─ ─ ─ ─┬─────────┐ +│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │ Page N │ +│ 2 MB │ 2 MB │ 2 MB │ 2 MB │ │ 2 MB │ +├─────────┼─────────┼─────────┼─────────┤ ├─────────┤ +│ MAPPED │ (empty) │ MAPPED │ (empty) │ │ (empty) │ +└─────────┴─────────┴─────────┴─────────┴─── ─ ─ ─ ─ ─┴─────────┘ + │ │ + ▼ ▼ + Physical Physical + GPU Memory GPU Memory +``` + +The entire VA range is reserved with `cuMemAddressReserve()` at `init()` time. +No physical memory is consumed. Each page is independently backed by physical +memory via `cuMemCreate()` + `cuMemMap()` when touched, and released via +`cuMemUnmap()` + `cuMemRelease()` when evicted. + +### Page Lifecycle + +``` + touchPage() evictPage() + UNMAPPED ─────────────► MAPPED ──────────────► EVICTED (host RAM) + ▲ │ + │ touchPage() │ + └─────────────────────────┘ + (data restored, not re-filled) +``` + +### CTE Integration + +When `use_cte = true`, evicted pages are stored as blobs in the CTE blob store +instead of pinned host RAM. 
This allows page data to persist across process +restarts and to be backed by any CTE storage target (NVMe, distributed +storage, etc.). + +```cpp +GpuVmmConfig config; +config.use_cte = true; +config.cte_tag_name = "my_model_weights"; +vmm.init(config); +``` + +Each page is stored as a blob named `page_<index>` under the configured tag. + +## Example: LLM Weight Overcommitment + +This example demonstrates loading model weights that exceed GPU memory by +paging layers in and out on demand. + +```cpp +#include <wrp_cte/uvm/gpu_vmm.h> + +using namespace wrp_cte::uvm; + +void run_inference(size_t num_layers, size_t layer_size) { + GpuVmmConfig config; + config.va_size_bytes = num_layers * layer_size; + config.page_size = layer_size; // One page per layer + config.fill_value = 0; + config.prefetch_window = 2; // Prefetch next 2 layers + + GpuVirtualMemoryManager vmm; + vmm.init(config); + + for (size_t layer = 0; layer < num_layers; ++layer) { + // Touch this layer (prefetch triggers for layer+1, layer+2) + vmm.touchPage(layer); + CUdeviceptr weights = vmm.getPagePtr(layer); + + // Launch inference kernel on compute stream + // run_layer_kernel<<<..., vmm.getComputeStream()>>>(weights, ...); + vmm.syncCompute(); + + // Evict layers we're done with to free GPU memory + if (layer >= 2) { + vmm.evictPageAsync(layer - 2); + } + } + + vmm.destroy(); +} +``` + +## Running the Tests + +The UVM module includes a comprehensive test suite: + +```bash +# Build with CUDA enabled +cmake --preset=cuda-release +cmake --build build --target test_gpu_vmm + +# Run the tests +./build/bin/test_gpu_vmm +``` + +The tests cover: + +| Test | Description | +|------|-------------| +| Basic Demand Paging | Touch individual pages, verify fill values | +| Page Eviction | Map, evict, re-touch pages | +| Touch Range | Back all pages in a byte range | +| Large VA Reservation | Reserve 512 GB, touch sparse pages | +| Kernel Access Full VA | GPU kernel reads 256 scattered pages across 512 GB | +| Evict and Restore | Write custom 
data, evict, restore, verify preservation | +| Prefetch Window | Verify automatic prefetching of subsequent pages | +| Async Overlap | Concurrent eviction and compute on separate streams | +| Multi-Page Round-Trip | Evict and restore 5 pages with distinct values |
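For reference, the byte-range-to-page mapping exercised by the Touch Range test can be sketched as follows. This is illustrative arithmetic under the assumption that `touchRange(offset, size)` touches every page overlapping `[offset, offset + size)`; the helper name is hypothetical, not part of the API:

```cpp
#include <cstddef>
#include <utility>

// Sketch: inclusive page-index range covering [offset, offset + size),
// assuming size > 0 and that every overlapped page is touched.
std::pair<std::size_t, std::size_t> pages_covering(std::size_t offset,
                                                   std::size_t size,
                                                   std::size_t page_size) {
  std::size_t first = offset / page_size;
  std::size_t last = (offset + size - 1) / page_size;
  return {first, last};
}
```

With the default 2 MB pages, the Quick Start call `touchRange(0, 64ULL * 1024 * 1024)` covers pages 0 through 31, and a range that straddles a page boundary by even a single byte touches both pages.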