From ac3243bc14dad3991df95de7c1500a1ffe1dc46f Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:14:44 -0800 Subject: [PATCH 01/10] added hpc docs page Signed-off-by: Jaya Venkatesh --- source/hpc.md | 318 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 230 insertions(+), 88 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index bbfdae15..4430a4d3 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -1,132 +1,274 @@ --- -review_priority: "index" +review_priority: "p1" --- # HPC -RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware such as InfiniBand. Deploying on HPC often means using queue management systems such as SLURM, LSF, PBS, etc. +RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/). -## SLURM +## SLURM Basics -```{warning} -This is a legacy page and may contain outdated information. We are working hard to update our documentation with the latest and greatest information, thank you for bearing with us. -``` +SLURM is a job scheduler that manages access to compute nodes on HPC clusters. +Instead of logging into a GPU machine directly, you ask SLURM for resources +(CPUs, GPUs, memory, time) and it allocates a node for you when one becomes +available. -If you are unfamiliar with SLURM or need a refresher, we recommend the [quickstart guide](https://slurm.schedmd.com/quickstart.html). -Depending on how your nodes are configured, additional settings may be required such as defining the number of GPUs `(--gpus)` desired or the number of gpus per node `(--gpus-per-node)`. -In the following example, we assume each allocation runs on a DGX1 with access to all eight GPUs. +Nodes are organized into **partitions**, groups of machines with similar +hardware. For example, your cluster might have a `gpu` partition with A100 nodes +and a `cpu` partition with CPU-only nodes. 
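The resource request described above (CPUs, GPUs, memory, time) maps onto scheduler flags. As a minimal sketch, the hypothetical helper `build_srun_command` below assembles an `srun` command line from those values. The helper name and the example values are assumptions made for illustration. The flags `-p`, `--gres`, `--cpus-per-task`, `--mem`, and `--time` are standard SLURM options, but the limits that apply to them depend on your cluster.

```python
import shlex


def build_srun_command(partition, gpus, cpus, mem, walltime, command):
    """Return the argument list for an srun resource request.

    Hypothetical helper for illustration: it only builds the command
    and never contacts the scheduler.
    """
    return [
        "srun",
        "-p", partition,            # partition (group of similar nodes)
        f"--gres=gpu:{gpus}",       # number of GPUs requested
        f"--cpus-per-task={cpus}",  # CPU cores for the task
        f"--mem={mem}",             # memory per node, for example "16G"
        f"--time={walltime}",       # maximum walltime, HH:MM:SS
        *command,                   # program to run on the allocated node
    ]


# Example values are placeholders; replace "gpu" with your partition name.
args = build_srun_command("gpu", 1, 4, "16G", "00:30:00", ["nvidia-smi"])
print(shlex.join(args))
```

Building the request as a list rather than a single string keeps each flag separate, which makes it easier to pass to `subprocess.run` later without shell quoting issues.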
-### Start Scheduler +### Partitions -First, start the scheduler with the following SLURM script. This and the following scripts can deployed with `salloc` for interactive usage or `sbatch` for batched run. +Check which partitions are available and what GPUs they have. The `-o` flag +customizes the output format: `%P` shows the partition name, `%G` the +generic resources (such as GPUs), `%D` the number of nodes, and `%T` the +node state. ```bash -#!/usr/bin/env bash +sinfo -o "%P %G %D %T" +PARTITION GRES NODES STATE +gpu gpu:a100:4 10 idle +gpu-dev gpu:v100:2 4 idle +``` -#SBATCH -J dask-scheduler -#SBATCH -n 1 -#SBATCH -t 00:10:00 +Your cluster admin can tell you which partition to use. Throughout this guide +we use `-p gpu`. Replace this with your partition name. -module load cuda/11.0.3 -CONDA_ROOT=/nfs-mount/user/miniconda3 -source $CONDA_ROOT/etc/profile.d/conda.sh -conda activate rapids +### Interactive Jobs -LOCAL_DIRECTORY=/nfs-mount/dask-local-directory -mkdir $LOCAL_DIRECTORY -CUDA_VISIBLE_DEVICES=0 dask-scheduler \ - --protocol tcp \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" & +An interactive job gives you a shell on a compute node where you can run +commands directly. This is useful for development, debugging, and testing +before submitting longer batch jobs. -dask-cuda-worker \ - --rmm-pool-size 14GB \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" +Use `srun` to request a GPU node. The `--gres=gpu:1` flag requests one GPU, +`--time` sets the maximum walltime, and `--pty bash` gives you a terminal. + +```bash +srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash ``` -Notice that we configure the scheduler to write a `scheduler-file` to a NFS accessible location. This file contains metadata about the scheduler and will -include the IP address and port for the scheduler. The file will serve as input to the workers informing them what address and port to connect. 
+This will queue until a node is available, then drop you into a shell on +the allocated node. -The scheduler doesn't need the whole node to itself so we can also start a worker on this node to fill out the unused resources. +### Batch Jobs -### Start Dask CUDA Workers +For longer-running work, write a script and submit it with `sbatch`. SLURM +runs the script when resources become available and you don't need to stay +connected. -Next start the other [dask-cuda workers](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/). Dask-CUDA extends the traditional Dask `Worker` class with specific options and enhancements for GPU environments. Unlike the scheduler and client, the workers script should be scalable and allow the users to tune how many workers are created. -For example, we can scale the number of nodes to 3: `sbatch/salloc -N3 dask-cuda-worker.script` . In this case, because we have 8 GPUs per node and we have 3 nodes, -our job will have 24 workers. +```bash +sbatch my_job.sh +Submitted batch job 12345 +``` + +Check the status of your jobs with `squeue`. The `-u` flag filters by your +username. ```bash -#!/usr/bin/env bash +squeue -u $USER +``` -#SBATCH -J dask-cuda-workers -#SBATCH -t 00:10:00 +### Keeping Sessions Alive -module load cuda/11.0.3 -CONDA_ROOT=/nfs-mount/miniconda3 -source $CONDA_ROOT/etc/profile.d/conda.sh -conda activate rapids +If your SSH connection drops while in an interactive job, the job is +terminated and you lose your work. To avoid this, start a +[tmux](https://github.com/tmux/tmux) or +[screen](https://www.gnu.org/software/screen/) session on the login node +**before** requesting your interactive job. -LOCAL_DIRECTORY=/nfs-mount/dask-local-directory -mkdir $LOCAL_DIRECTORY -dask-cuda-worker \ - --rmm-pool-size 14GB \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" +```bash +tmux new -s rapids +srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash +# ... work ... 
+# Detach with Ctrl+b, d and your job keeps running +# Reattach later: +tmux attach -t rapids ``` -### cuDF Example Workflow +## Install RAPIDS + +### Environment Modules + +[Environment modules](https://modules.readthedocs.io/) are the standard way +to manage software on HPC clusters. We'll create a +[conda](https://docs.conda.io/) environment containing both CUDA and RAPIDS, +then wrap it in an [Lmod](https://lmod.readthedocs.io/) module file so it can +be loaded with a single command. -Lastly, we can now run a job on the established Dask Cluster. +We use conda here because it handles the CUDA toolkit and RAPIDS dependencies +together, avoiding version conflicts with system libraries. + +```{note} +Conda installs the CUDA **toolkit** (runtime libraries), but +the NVIDIA **kernel driver** must already be installed on the cluster's compute +nodes. This is typically managed by your cluster admin. You can verify the +driver is available by running `nvidia-smi` on a compute node. +``` + +#### Create the environment ```bash -#!/usr/bin/env bash +conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \ + {{ rapids_conda_packages }} +``` -#SBATCH -J dask-client -#SBATCH -n 1 -#SBATCH -t 00:10:00 +#### Create the module file -module load cuda/11.0.3 -CONDA_ROOT=/nfs-mount/miniconda3 -source $CONDA_ROOT/etc/profile.d/conda.sh -conda activate rapids +Place a module file in your cluster's module path so that users can load +the environment. Replace `` with the absolute path to +your conda installation. 
-LOCAL_DIRECTORY=/nfs-mount/dask-local-directory +```bash +mkdir -p /opt/modulefiles/rapids +cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua +help([[RAPIDS {{ rapids_version }} - GPU-accelerated data science libraries.]]) -cat <<EOF >>/tmp/dask-cudf-example.py -import cudf -import dask.dataframe as dd -from dask.distributed import Client +whatis("Name: RAPIDS") +whatis("Version: {{ rapids_version }}") +whatis("Description: GPU-accelerated data science libraries") -client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json") -cdf = cudf.datasets.timeseries() +family("rapids") -ddf = dd.from_pandas(cdf, npartitions=10) -res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute() -print(res) +local conda_root = "" +local env = "rapids-{{ rapids_version }}" +local env_prefix = pathJoin(conda_root, "envs", env) + +prepend_path("PATH", pathJoin(env_prefix, "bin")) +prepend_path("LD_LIBRARY_PATH", pathJoin(env_prefix, "lib")) + +setenv("CONDA_PREFIX", env_prefix) +setenv("CONDA_DEFAULT_ENV", env) EOF +``` + +#### Verify + +```bash +module load rapids/{{ rapids_version }} +python -c "import cudf; print(cudf.__version__)" +``` + +### Containers + +Many HPC clusters support running containers through runtimes such as +[Apptainer](https://apptainer.org/) (formerly Singularity), +[Pyxis](https://github.com/NVIDIA/pyxis) + [Enroot](https://github.com/NVIDIA/enroot), +[Podman](https://podman.io/), or +[Charliecloud](https://hpc.github.io/charliecloud/). This is an alternative +to environment modules, as the RAPIDS container image ships with CUDA and all +RAPIDS libraries pre-installed and does not need any additional configuration. + +Check with your cluster admin which container runtime is available. The +examples below cover Apptainer and Pyxis + Enroot, two of the most common +setups on HPC clusters. + +#### Apptainer + +[Apptainer](https://apptainer.org/) is a container runtime designed for HPC.
+The `--nv` flag exposes the host GPU drivers to the container. + +```bash +apptainer pull rapids.sif docker://{{ rapids_container }} +``` + +#### Pyxis + Enroot + +[Enroot](https://github.com/NVIDIA/enroot) is NVIDIA's lightweight container +runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a SLURM plugin +that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and +`sbatch` so you can launch containerized jobs directly through the scheduler. +Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX +systems. + +Import the RAPIDS container image as a squashfs file. We recommend +pre-importing large images to avoid re-downloading on every job. -python /tmp/dask-cudf-example.py +Note that Enroot uses `#` instead of `/` to separate the registry hostname +from the image path. + +```bash +enroot import --output rapids.sqsh 'docker://{{ rapids_container.replace("/", "#", 1) }}' +``` + +## Run a Single GPU Job + +[cudf.pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) lets you +accelerate existing pandas code on a GPU with no code changes. You run your +script with `python -m cudf.pandas` instead of `python` and pandas operations +are automatically dispatched to the GPU. + +The following example uses pandas to generate and aggregate random data. 
+ +```python +# my_script.py +import pandas as pd + +df = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)}) +df["group"] = df["x"] % 100 +result = df.groupby("group").agg(["mean", "sum", "count"]) +print(result) ``` -### Confirm Output +### Interactive -Putting the above together will result in the following output: +#### With modules ```bash - x y - mean sum count mean sum count -id name -1077 Laura 0.028305 1.868120 66 -0.098905 -6.527731 66 -1026 Frank 0.001536 1.414839 921 -0.017223 -15.862306 921 -1082 Patricia 0.072045 3.602228 50 0.081853 4.092667 50 -1007 Wendy 0.009837 11.676199 1187 0.022978 27.275216 1187 -976 Wendy -0.003663 -3.267674 892 0.008262 7.369577 892 -... ... ... ... ... ... ... -912 Michael 0.012409 0.459119 37 0.002528 0.093520 37 -1103 Ingrid -0.132714 -1.327142 10 0.108364 1.083638 10 -998 Tim 0.000587 0.747745 1273 0.001777 2.262094 1273 -941 Yvonne 0.050258 11.358393 226 0.080584 18.212019 226 -900 Michael -0.134216 -1.073729 8 0.008701 0.069610 8 - -[6449 rows x 6 columns] -``` - -

+srun -p gpu --gres=gpu:1 --pty bash +module load rapids/{{ rapids_version }} +python -m cudf.pandas my_script.py +``` + +#### With containers + +`````{tab-set} + +````{tab-item} Apptainer + +The `--nv` flag exposes the host GPU drivers to the container. + +```bash +srun -p gpu --gres=gpu:1 apptainer exec --nv rapids.sif \ + python -m cudf.pandas my_script.py +``` + +```` + +````{tab-item} Pyxis + Enroot + +The `--container-image` flag is provided by Pyxis. Use `--container-mounts` +to make your data and scripts available inside the container. + +```bash +srun -p gpu --gres=gpu:1 \ + --container-image=./rapids.sqsh \ + --container-mounts=$(pwd):/work --container-workdir=/work \ + python -m cudf.pandas /work/my_script.py +``` + +```` + +````` + +### Batch + +Write a SLURM batch script to run the same workload without an interactive +session. This is the typical workflow for production jobs. + +```bash +#!/usr/bin/env bash +#SBATCH --job-name=rapids-cudf +#SBATCH --gres=gpu:1 +#SBATCH --time=01:00:00 + +module load rapids/{{ rapids_version }} +python -m cudf.pandas my_script.py +``` + +```bash +sbatch rapids_job.sh +``` + +```{relatedexamples} + +``` From aaf45f7074d5c168a9d4223a15b38ddb3079fe72 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:23:55 -0800 Subject: [PATCH 02/10] added links to SLURM docs Signed-off-by: Jaya Venkatesh --- source/hpc.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 4430a4d3..b44be9a1 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -6,7 +6,7 @@ review_priority: "p1" RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/). -## SLURM Basics +## SLURM SLURM is a job scheduler that manages access to compute nodes on HPC clusters. Instead of logging into a GPU machine directly, you ask SLURM for resources @@ -17,6 +17,8 @@ Nodes are organized into **partitions**, groups of machines with similar hardware. 
For example, your cluster might have a `gpu` partition with A100 nodes and a `cpu` partition with CPU-only nodes. +For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html). + ### Partitions Check which partitions are available and what GPUs they have. The `-o` flag @@ -79,9 +81,13 @@ terminated and you lose your work. To avoid this, start a ```bash tmux new -s rapids srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash -# ... work ... -# Detach with Ctrl+b, d and your job keeps running -# Reattach later: +``` + +To detach from the tmux session without ending your job, press `Ctrl+b` +then `d`. Your interactive job continues running in the background. When +you reconnect via SSH, reattach to the session with: + +```bash tmux attach -t rapids ``` From 730d0d24283d3d358e5ee696dcbef3f40f554f53 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:25:22 -0800 Subject: [PATCH 03/10] make review priority index Signed-off-by: Jaya Venkatesh --- source/hpc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/hpc.md b/source/hpc.md index b44be9a1..06b45753 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -1,5 +1,5 @@ --- -review_priority: "p1" +review_priority: "index" --- # HPC From 4f22dcf43510285e1c22f8a25e847adb42fb290c Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Tue, 5 May 2026 19:27:40 -0700 Subject: [PATCH 04/10] added notes to HPC page Signed-off-by: Jaya Venkatesh --- source/hpc.md | 44 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 06b45753..f71d59ad 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -19,6 +19,12 @@ and a `cpu` partition with CPU-only nodes. For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html). +```{note} +Some clusters provide SLURM commands through environment modules. 
If commands +such as `sinfo`, `srun`, or `sbatch` are not found, load your cluster's SLURM +module first, for example `module load slurm`. +``` + ### Partitions Check which partitions are available and what GPUs they have. The `-o` flag @@ -58,6 +64,11 @@ For longer-running work, write a script and submit it with `sbatch`. SLURM runs the script when resources become available and you don't need to stay connected. +Run batch jobs from a filesystem that is shared between the submit host and +compute nodes. This ensures your scripts, input data, and SLURM output files +are visible wherever the job runs. Your cluster admin can tell you which paths +are shared. + ```bash sbatch my_job.sh Submitted batch job 12345 @@ -113,6 +124,17 @@ driver is available by running `nvidia-smi` on a compute node. #### Create the environment +Create the environment in a location that is available on compute nodes. On +many clusters this means installing conda and environments on a shared +filesystem rather than on the login node's local disk. + +```{note} +Recent versions of conda may require accepting Anaconda's Terms of Service for +default channels before non-interactive environment creation. If `conda create` +fails with a Terms of Service error, follow the command that conda prints to +accept the Terms of Service +``` + ```bash conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \ {{ rapids_conda_packages }} @@ -124,6 +146,11 @@ Place a module file in your cluster's module path so that users can load the environment. Replace `` with the absolute path to your conda installation. +The example below is a Lua modulefile and requires +[Lmod](https://lmod.readthedocs.io/). Verify that `module --version` reports +Lmod before using it. If your cluster uses Tcl Environment Modules, ask your +cluster admin for the equivalent Tcl modulefile. 
+ ```bash mkdir -p /opt/modulefiles/rapids cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua @@ -151,9 +178,13 @@ EOF ```bash module load rapids/{{ rapids_version }} -python -c "import cudf; print(cudf.__version__)" +srun -p gpu --gres=gpu:1 python -c "import cudf; print(cudf.__version__)" ``` +Run this verification on a GPU compute node. A login or head node may not have +a GPU or a compatible NVIDIA driver even when the compute nodes are configured +correctly. + ### Containers Many HPC clusters support running containers through runtimes such as @@ -186,6 +217,13 @@ that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX systems. +GPU containers also require NVIDIA container runtime tooling on compute nodes, +including `nvidia-container-cli` from +[`libnvidia-container`](https://github.com/NVIDIA/libnvidia-container). If +Pyxis fails while starting the container and references `nvidia-container-cli`, +ask your cluster admin to install the NVIDIA container runtime packages on the +compute nodes. + Import the RAPIDS container image as a squashfs file. We recommend pre-importing large images to avoid re-downloading on every job. @@ -259,7 +297,9 @@ srun -p gpu --gres=gpu:1 \ ### Batch Write a SLURM batch script to run the same workload without an interactive -session. This is the typical workflow for production jobs. +session. This is the typical workflow for production jobs. Save the script in a +shared filesystem so compute nodes can access it and so the SLURM output file is +written somewhere visible after the job completes. 
```bash #!/usr/bin/env bash From edfecf42499e464118f104d9c765db9cf0c06e68 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Tue, 5 May 2026 19:28:18 -0700 Subject: [PATCH 05/10] precommit run Signed-off-by: Jaya Venkatesh --- source/hpc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/hpc.md b/source/hpc.md index f71d59ad..d0e92e40 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -131,7 +131,7 @@ filesystem rather than on the login node's local disk. ```{note} Recent versions of conda may require accepting Anaconda's Terms of Service for default channels before non-interactive environment creation. If `conda create` -fails with a Terms of Service error, follow the command that conda prints to +fails with a Terms of Service error, follow the command that conda prints to accept the Terms of Service ``` From 64dd2cd54637691b58c3af07f0bec450c7205acb Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Tue, 5 May 2026 19:29:17 -0700 Subject: [PATCH 06/10] shift container tooling Signed-off-by: Jaya Venkatesh --- source/hpc.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index d0e92e40..07349516 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -199,6 +199,13 @@ Check with your cluster admin which container runtime is available. The examples below cover Apptainer and Pyxis + Enroot, two of the most common setups on HPC clusters. +GPU containers also require NVIDIA container runtime tooling on compute nodes, +including `nvidia-container-cli` from +[`libnvidia-container`](https://github.com/NVIDIA/libnvidia-container). If +Pyxis fails while starting the container and references `nvidia-container-cli`, +ask your cluster admin to install the NVIDIA container runtime packages on the +compute nodes. + #### Apptainer [Apptainer](https://apptainer.org/) is a container runtime designed for HPC. 
@@ -217,13 +224,6 @@ that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX systems. -GPU containers also require NVIDIA container runtime tooling on compute nodes, -including `nvidia-container-cli` from -[`libnvidia-container`](https://github.com/NVIDIA/libnvidia-container). If -Pyxis fails while starting the container and references `nvidia-container-cli`, -ask your cluster admin to install the NVIDIA container runtime packages on the -compute nodes. - Import the RAPIDS container image as a squashfs file. We recommend pre-importing large images to avoid re-downloading on every job. From a80cc8fbb040efd68dbb5a05de4bef19f4b36812 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 6 May 2026 10:48:01 -0700 Subject: [PATCH 07/10] Update HPC conda install guidance --- source/hpc.md | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 07349516..d8bab4d1 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -128,15 +128,9 @@ Create the environment in a location that is available on compute nodes. On many clusters this means installing conda and environments on a shared filesystem rather than on the login node's local disk. -```{note} -Recent versions of conda may require accepting Anaconda's Terms of Service for -default channels before non-interactive environment creation. 
If `conda create` -fails with a Terms of Service error, follow the command that conda prints to -accept the Terms of Service -``` - ```bash -conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \ +conda create -n rapids-{{ rapids_version }} --override-channels \ + {{ rapids_conda_channels }} \ {{ rapids_conda_packages }} ``` From 6ab6f96a06a3f1c9c6485e21e208fb6787137d67 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Fri, 8 May 2026 12:55:41 -0700 Subject: [PATCH 08/10] add notes to a few sections Signed-off-by: Jaya Venkatesh --- source/hpc.md | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index d8bab4d1..1f14832a 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -4,7 +4,10 @@ review_priority: "index" # HPC -RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/). +RAPIDS works extremely well in traditional HPC (High Performance Computing) +environments where GPUs are often co-located with accelerated networking +hardware. RAPIDS can be deployed on HPC clusters managed by +[SLURM](https://slurm.schedmd.com/). ## SLURM @@ -112,8 +115,8 @@ to manage software on HPC clusters. We'll create a then wrap it in an [Lmod](https://lmod.readthedocs.io/) module file so it can be loaded with a single command. -We use conda here because it handles the CUDA toolkit and RAPIDS dependencies -together, avoiding version conflicts with system libraries. +Conda installs the full RAPIDS suite alongside the CUDA toolkit in a single +command, which is convenient on shared HPC filesystems. ```{note} Conda installs the CUDA **toolkit** (runtime libraries), but @@ -122,11 +125,24 @@ nodes. This is typically managed by your cluster admin. You can verify the driver is available by running `nvidia-smi` on a compute node. 
``` +#### Install miniforge + +If conda isn't already available on your cluster, install +[miniforge](https://github.com/conda-forge/miniforge), the conda distribution +RAPIDS recommends. Install it to a shared filesystem so compute nodes can +read the environments you create. + +```bash +curl -LO "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" +bash Miniforge3-$(uname)-$(uname -m).sh -b -p /path/to/miniforge3 +source /path/to/miniforge3/etc/profile.d/conda.sh +``` + #### Create the environment Create the environment in a location that is available on compute nodes. On -many clusters this means installing conda and environments on a shared -filesystem rather than on the login node's local disk. +many clusters this means installing environments on a shared filesystem rather +than on the login node's local disk. ```bash conda create -n rapids-{{ rapids_version }} --override-channels \ @@ -137,8 +153,8 @@ conda create -n rapids-{{ rapids_version }} --override-channels \ #### Create the module file Place a module file in your cluster's module path so that users can load -the environment. Replace `` with the absolute path to -your conda installation. +the environment. Replace `` with the absolute path to +your miniforge installation. The example below is a Lua modulefile and requires [Lmod](https://lmod.readthedocs.io/). 
Verify that `module --version` reports @@ -156,7 +172,7 @@ whatis("Description: GPU-accelerated data science libraries") family("rapids") -local conda_root = "" +local conda_root = "" local env = "rapids-{{ rapids_version }}" local env_prefix = pathJoin(conda_root, "envs", env) From e958d280203773f2dc95953695a21b2aa159b2aa Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Fri, 8 May 2026 13:03:05 -0700 Subject: [PATCH 09/10] Address review: restore intro, miniforge, soften conda --- source/hpc.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 1f14832a..1667d4b2 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -128,9 +128,8 @@ driver is available by running `nvidia-smi` on a compute node. #### Install miniforge If conda isn't already available on your cluster, install -[miniforge](https://github.com/conda-forge/miniforge), the conda distribution -RAPIDS recommends. Install it to a shared filesystem so compute nodes can -read the environments you create. +[miniforge](https://github.com/conda-forge/miniforge). Install it to a shared +filesystem so compute nodes can read the environments you create. ```bash curl -LO "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" From d6851dcc2666c25286c1166c3094cdb889902111 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Fri, 8 May 2026 13:37:43 -0700 Subject: [PATCH 10/10] changed SLURM to caps Signed-off-by: Jaya Venkatesh --- source/hpc.md | 50 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 1667d4b2..b918194d 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -7,12 +7,12 @@ review_priority: "index" RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware. 
RAPIDS can be deployed on HPC clusters managed by -[SLURM](https://slurm.schedmd.com/). +[Slurm](https://slurm.schedmd.com/). -## SLURM +## Slurm -SLURM is a job scheduler that manages access to compute nodes on HPC clusters. -Instead of logging into a GPU machine directly, you ask SLURM for resources +Slurm is a job scheduler that manages access to compute nodes on HPC clusters. +Instead of logging into a GPU machine directly, you ask Slurm for resources (CPUs, GPUs, memory, time) and it allocates a node for you when one becomes available. @@ -20,11 +20,11 @@ Nodes are organized into **partitions**, groups of machines with similar hardware. For example, your cluster might have a `gpu` partition with A100 nodes and a `cpu` partition with CPU-only nodes. -For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html). +For a more comprehensive overview, see the [Slurm quickstart guide](https://slurm.schedmd.com/quickstart.html). ```{note} -Some clusters provide SLURM commands through environment modules. If commands -such as `sinfo`, `srun`, or `sbatch` are not found, load your cluster's SLURM +Some clusters provide Slurm commands through environment modules. If commands +such as `sinfo`, `srun`, or `sbatch` are not found, load your cluster's Slurm module first, for example `module load slurm`. ``` @@ -63,12 +63,12 @@ the allocated node. ### Batch Jobs -For longer-running work, write a script and submit it with `sbatch`. SLURM +For longer-running work, write a script and submit it with `sbatch`. Slurm runs the script when resources become available and you don't need to stay connected. Run batch jobs from a filesystem that is shared between the submit host and -compute nodes. This ensures your scripts, input data, and SLURM output files +compute nodes. This ensures your scripts, input data, and Slurm output files are visible wherever the job runs. Your cluster admin can tell you which paths are shared. 
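As a rough sketch of the shared-filesystem advice above, the following renders a batch script whose working directory and output file both sit under one shared path. The helper `render_batch_script`, the path `/shared/project`, and the job name `rapids-job` are hypothetical and stand in for values from your own cluster. `--chdir` and `--output` are standard `sbatch` options, and in the output pattern `%x` expands to the job name and `%j` to the job ID.

```python
def render_batch_script(shared_dir, job_name, command, gpus=1, walltime="01:00:00"):
    """Render a Slurm batch script that runs and logs under shared_dir.

    Hypothetical helper for illustration: it returns the script text
    and does not submit anything to the scheduler.
    """
    shared = shared_dir.rstrip("/")
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --chdir={shared}",              # run from the shared path
        f"#SBATCH --output={shared}/%x-%j.out",   # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# "/shared/project" is a placeholder for a path visible on compute nodes.
script = render_batch_script("/shared/project", "rapids-job", "python my_script.py")
print(script)
```

Pinning both the working directory and the output file to the same shared location means the log is readable from the submit host once the job finishes, regardless of which compute node ran it.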
@@ -125,10 +125,10 @@ nodes. This is typically managed by your cluster admin. You can verify the driver is available by running `nvidia-smi` on a compute node. ``` -#### Install miniforge +#### Install Miniforge If conda isn't already available on your cluster, install -[miniforge](https://github.com/conda-forge/miniforge). Install it to a shared +[Miniforge](https://github.com/conda-forge/miniforge). Install it to a shared filesystem so compute nodes can read the environments you create. ```bash @@ -151,9 +151,10 @@ conda create -n rapids-{{ rapids_version }} --override-channels \ #### Create the module file -Place a module file in your cluster's module path so that users can load -the environment. Replace `` with the absolute path to -your miniforge installation. +Replace `` with the absolute path to your Miniforge +installation. The example below installs the modulefile to `~/modulefiles`, +which works without admin access. Cluster admins can install it to a +shared module path (e.g. `/opt/modulefiles`) instead so all users can load it. The example below is a Lua modulefile and requires [Lmod](https://lmod.readthedocs.io/). Verify that `module --version` reports @@ -161,8 +162,8 @@ Lmod before using it. If your cluster uses Tcl Environment Modules, ask your cluster admin for the equivalent Tcl modulefile. ```bash -mkdir -p /opt/modulefiles/rapids -cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua +mkdir -p ~/modulefiles/rapids +cat << 'EOF' > ~/modulefiles/rapids/{{ rapids_version }}.lua help([[RAPIDS {{ rapids_version }} - GPU-accelerated data science libraries.]]) whatis("Name: RAPIDS") @@ -183,6 +184,15 @@ setenv("CONDA_DEFAULT_ENV", env) EOF ``` +Add the modulefile directory to your module search path: + +```bash +module use ~/modulefiles +``` + +To make this persistent across sessions, add `module use ~/modulefiles` to +your `~/.bashrc`. 
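One way to sanity-check what the modulefile changes is sketched below, assuming it sets `CONDA_PREFIX` and prepends the environment's `bin` directory to `PATH` as in the example above. The hypothetical helper `module_env_active` compares the interpreter that `PATH` resolves to against `CONDA_PREFIX`. It takes the environment mapping and the resolved path as arguments so it can be exercised without a cluster.

```python
import os
import shutil
from pathlib import Path


def module_env_active(environ, python_path):
    """Return True if python_path lives in <CONDA_PREFIX>/bin.

    Hypothetical helper for illustration: environ is a mapping such as
    os.environ, and python_path is the interpreter found on PATH (or None).
    """
    prefix = environ.get("CONDA_PREFIX")
    if not prefix or not python_path:
        return False
    return Path(python_path).parent == Path(prefix) / "bin"


# Check the current shell; run this after `module load` to see the effect.
print(module_env_active(os.environ, shutil.which("python")))
```

A `False` result after loading the module usually points at a wrong `conda_root` value in the modulefile or at another `python` appearing earlier on `PATH`.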
+ #### Verify ```bash @@ -227,8 +237,8 @@ apptainer pull rapids.sif docker://{{ rapids_container }} #### Pyxis + Enroot [Enroot](https://github.com/NVIDIA/enroot) is NVIDIA's lightweight container -runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a SLURM plugin -that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and +runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a Slurm plugin +that integrates Enroot into Slurm, adding `--container-*` flags to `srun` and `sbatch` so you can launch containerized jobs directly through the scheduler. Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX systems. @@ -305,9 +315,9 @@ srun -p gpu --gres=gpu:1 \ ### Batch -Write a SLURM batch script to run the same workload without an interactive +Write a Slurm batch script to run the same workload without an interactive session. This is the typical workflow for production jobs. Save the script in a -shared filesystem so compute nodes can access it and so the SLURM output file is +shared filesystem so compute nodes can access it and so the Slurm output file is written somewhere visible after the job completes. ```bash