
Commit c9f87aa

Add docs for 26.3.0 release
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Co-authored-by: Rajath Agasthya <rajathagasthya@gmail.com>
1 parent a2edf56 commit c9f87aa

14 files changed

Lines changed: 567 additions & 440 deletions

gpu-operator/cdi.rst

Lines changed: 88 additions & 11 deletions
@@ -16,13 +16,15 @@

 .. headings # #, * *, =, -, ^, "

-############################################################
-Container Device Interface (CDI) Support in the GPU Operator
-############################################################
+#################################################################################
+Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support
+#################################################################################

-************************************
-About the Container Device Interface
-************************************
+This page gives an overview of CDI and NRI Plugin support in the GPU Operator.
+
+**************************************
+About Container Device Interface (CDI)
+**************************************

 The `Container Device Interface (CDI) <https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md>`_
 is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means,
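On a node, CDI specifications are regular files that the container runtime reads at container create time. As a hedged sketch only (it assumes the NVIDIA Container Toolkit's ``nvidia-ctk`` CLI is installed on the node; the GPU Operator normally generates these specs for you):

```shell
# Sketch: generate a CDI spec for the GPUs on this node and list the
# resulting device names. Normally the GPU Operator's Container Toolkit
# container performs this step automatically.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list
```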
@@ -31,23 +33,22 @@ ensure that a device is available in a container. CDI simplifies adding support
 the specification is applicable to all container runtimes that support CDI.

 Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes.
-Specifically, CDI support in container runtimes, e.g. containerd and cri-o, is used to inject GPU(s) into workload
+Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload
 containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled ``nvidia`` runtime class.

 Use of CDI is transparent to cluster administrators and application developers.
 The benefits of CDI are largely to reduce development and support for runtime-specific
 plugins.

-********************************
-Enabling CDI During Installation
-********************************
+************
+Enabling CDI
+************

 CDI is enabled by default during installation in GPU Operator v25.10.0 and later.
 Follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page.

 CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later.

-*******************************
 Enabling CDI After Installation
 *******************************
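The "Enabling CDI After Installation" section toggles CDI through the ``ClusterPolicy`` resource; its body is truncated in this diff. A hedged sketch of the shape of that operation (the ``/spec/cdi/enabled`` path is assumed from the ``nvidia.com`` ClusterPolicy CRD):

```shell
# Sketch: enable CDI on a running GPU Operator installation by patching
# the cluster policy; the operator then restarts the affected operands.
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":true}]'
```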

@@ -125,3 +126,79 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f
       nvidia.com/gpu.deploy.operator-validator=true \
       nvidia.com/gpu.present=true \
       --overwrite
+
+
+.. _nri-plugin:
+
+**********************************************
+About the Node Resource Interface (NRI) Plugin
+**********************************************
+
+Node Resource Interface (NRI) is a standardized interface for plugging extensions, called NRI Plugins, into OCI-compatible container runtimes like CRI-O and containerd.
+NRI Plugins serve as hooks that intercept pod and container lifecycle events and perform functions such as injecting devices into a container, applying topology-aware placement strategies, and more.
+For more details on NRI, refer to the `NRI overview <https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background>`_ in the containerd repository.
+
+When enabled in the GPU Operator, the NRI Plugin is managed by the NVIDIA Container Toolkit and provides an alternative to the ``nvidia`` runtime class for provisioning GPU workload pods.
+It allows the GPU Operator to extend the container runtime behavior without modifying the container runtime itself.
+This feature also simplifies deployments on platforms like k3s, k0s, or RKE, because the GPU Operator no longer needs you to set values like ``CONTAINERD_CONFIG``, ``CONTAINERD_SOCKET``, or ``RUNTIME_CONFIG_SOURCE`` for the Container Toolkit.
+
+***********************
+Enabling the NRI Plugin
+***********************
+
+The NRI Plugin requires the following:
+
+- CDI to be enabled in the GPU Operator.
+
+- CRI-O v1.34.0 or later, or containerd v1.7.30, v2.1.x, or v2.2.x.
+  If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator.
+
+To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page and include the ``--set cdi.nriPluginEnabled=true`` argument in your Helm command.
+
+Enabling the NRI Plugin After Installation
+******************************************
+
+#. Enable the NRI Plugin by modifying the cluster policy:
+
+   .. code-block:: console
+
+      $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+          -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]'
+
+   *Example Output*
+
+   .. code-block:: output
+
+      clusterpolicy.nvidia.com/cluster-policy patched
+
+#. (Optional) Confirm that the container toolkit and device plugin pods restart:
+
+   .. code-block:: console
+
+      $ kubectl get pods -n gpu-operator
+
+   *Example Output*
+
+   .. literalinclude:: ./manifests/output/nri-get-pods-restart.txt
+      :language: output
+      :emphasize-lines: 6,9
+
+
+************************
+Disabling the NRI Plugin
+************************
+
+To disable the NRI Plugin and use the ``nvidia`` runtime class instead, modify the cluster policy:
+
+.. code-block:: console
+
+   $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+       -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]'
+
+*Example Output*
+
+.. code-block:: output
+
+   clusterpolicy.nvidia.com/cluster-policy patched
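The ``-p`` argument in the patch commands above is an RFC 6902 JSON Patch document. As a quick sanity check of the payload shape in plain Python (no cluster required; this only mirrors the inline string, it is not part of the documented procedure):

```python
import json

# The JSON Patch payload passed to kubectl above: a single "replace"
# operation that sets spec.cdi.nriPluginEnabled on the ClusterPolicy.
patch = [{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value": False}]

# Serialized, this matches the inline -p='...' string in the command
# (JSON renders Python's False as lowercase false).
print(json.dumps(patch))
# → [{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value": false}]
```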

gpu-operator/conf.py

Lines changed: 0 additions & 226 deletions
This file was deleted.
