5 changes: 3 additions & 2 deletions gpu-operator/life-cycle-policy.rst
@@ -93,9 +93,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
* - v26.3.0

* - NVIDIA GPU Driver |ki|_
- | `590.48.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-590-48-01/index.html>`_
- | `595.58.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-595-58-03/index.html>`_
| `590.48.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-590-48-01/index.html>`_
| `580.126.20 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-126-20/index.html>`_ (**D**, **R**)
| `570.211.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-211-01/index.html>`_
| `570.211.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-211-01/index.html>`_
| `535.288.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-288-01/index.html>`_

* - NVIDIA Driver Manager for Kubernetes
10 changes: 5 additions & 5 deletions gpu-operator/platform-support.rst
@@ -351,7 +351,7 @@ The GPU Operator has been validated in the following scenarios:

* - Red Hat Core OS
-
- | 4.17---4.21
- | 4.18---4.21
-
-
-
@@ -426,7 +426,7 @@ The GPU Operator has been validated in the following scenarios:
.. _rhel-9:

:sup:`3`
Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, and 9.6 versions are available for x86 based platforms only.
Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.6, and 9.7 versions are available for x86 based platforms only.
They are not available for ARM based systems.

.. note::
@@ -525,7 +525,7 @@ Operating System Kubernetes KubeVirt OpenShift Virtualization
Ubuntu 24.04 LTS 1.32---1.35 0.36+
Ubuntu 22.04 LTS 1.32---1.35 0.36+ 0.59.1+
Ubuntu 20.04 LTS 1.32---1.35 0.36+ 0.59.1+
Red Hat Core OS 4.17---4.21 4.17---4.21
Red Hat Core OS 4.18---4.21 4.18---4.21
================ =========== ============= ========= ============= ===========

You can run GPU passthrough and NVIDIA vGPU in the same cluster as long as you use
@@ -573,7 +573,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA.
- Ubuntu 24.04 LTS with Network Operator 25.7.0.
- Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0.
- Red Hat Enterprise Linux 9.2, 9.4, and 9.6 with Network Operator 25.7.0.
- Red Hat OpenShift 4.17 and higher with Network Operator 25.7.0.
- Red Hat OpenShift 4.18 and higher with Network Operator 25.7.0.
- Ubuntu 24.04 LTS with Network Operator 25.10.0

For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`.
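For orientation before following that guide, GPUDirect RDMA is enabled through the driver's RDMA option in ClusterPolicy. A minimal sketch, assuming the operator was installed from the ``nvidia/gpu-operator`` Helm chart into the ``gpu-operator`` namespace (verify the flag name against :doc:`gpu-operator-rdma`):

.. code-block:: console

   $ helm upgrade gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator \
       --reuse-values \
       --set driver.rdma.enabled=true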
@@ -586,7 +586,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage.

- Ubuntu 24.04 LTS Network Operator 25.7.0.
- Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0.
- Red Hat OpenShift Container Platform 4.17 and higher.
- Red Hat OpenShift Container Platform 4.18 and higher.

.. note::

19 changes: 19 additions & 0 deletions gpu-operator/release-notes.rst
@@ -152,6 +152,25 @@ Fixed Issues

* Fixed an issue where the GPU Operator was not adding a namespace to ServiceAccount objects. (`PR #2039 <https://github.com/NVIDIA/gpu-operator/pull/2039>`_)

Known Issues
------------

* When GPUDirect RDMA is enabled, the ``nvidia-peermem`` container may fail to restart after the driver pod restarts without a node reboot and without any driver configuration changes.
In this scenario, the driver uses a fast-path optimization that skips recompilation, but the ``nvidia-peermem`` sidecar does not detect that its module is already loaded and fails to start.
This occurs because the kernel state is not cleared when the driver pod restarts.
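
To confirm a node is in this state, you can check whether the module is already loaded. A diagnostic sketch, run directly on the affected node:

.. code-block:: console

   $ lsmod | grep nvidia_peermem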


To work around this issue, set the ``FORCE_REINSTALL=true`` environment variable in the ClusterPolicy.

.. code-block:: console

$ kubectl patch clusterpolicy cluster-policy --type=json \
  -p='[{"op": "add", "path": "/spec/driver/manager/env/-", "value": {"name": "FORCE_REINSTALL", "value": "true"}}]'

Collaborator (author): is this the best way to fix this?

Member: Yes, this looks good to me!

Setting ``FORCE_REINSTALL=true`` forces full driver recompilation, node drain, and GPU workload disruption on every restart.
Alternatively, rebooting the node clears the kernel state and allows the ``nvidia-peermem`` module to load successfully, though this may disrupt running workloads.
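
Applied to the cluster, the patch above appends the variable to the driver manager's environment list; the resulting ClusterPolicy fragment looks roughly like this (a sketch inferred from the patch path, not a complete spec):

.. code-block:: yaml

   spec:
     driver:
       manager:
         env:
           - name: FORCE_REINSTALL
             value: "true"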
Removals and Deprecations
-------------------------