diff --git a/gpu-operator/life-cycle-policy.rst b/gpu-operator/life-cycle-policy.rst
index 75303436a..dcc09cb7d 100644
--- a/gpu-operator/life-cycle-policy.rst
+++ b/gpu-operator/life-cycle-policy.rst
@@ -93,9 +93,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
    * - v26.3.0
 
    * - NVIDIA GPU Driver |ki|_
-     - | `590.48.01 `_
+     - | `595.58.03 `_
+       | `590.48.01 `_
        | `580.126.20 `_ (**D**, **R**)
-       | `570.211.01 `_
+       | `570.211.01 `_
        | `535.288.01 `_
 
    * - NVIDIA Driver Manager for Kubernetes
diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst
index c77ce054f..b570f2b88 100644
--- a/gpu-operator/platform-support.rst
+++ b/gpu-operator/platform-support.rst
@@ -351,7 +351,7 @@ The GPU Operator has been validated in the following scenarios:
 
    * - Red Hat Core OS
      -
-     - | 4.17---4.21
+     - | 4.18---4.21
      -
      -
      -
@@ -426,7 +426,7 @@ The GPU Operator has been validated in the following scenarios:
 .. _rhel-9:
 
    :sup:`3`
-   Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, and 9.6 versions are available for x86 based platforms only.
+   Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.6, and 9.7 versions are available for x86 based platforms only.
    They are not available for ARM based systems.
 
 .. note::
@@ -525,7 +525,7 @@ Operating System Kubernetes KubeVirt OpenShift Virtual
 Ubuntu 24.04 LTS 1.32---1.35 0.36+
 Ubuntu 22.04 LTS 1.32---1.35 0.36+ 0.59.1+
 Ubuntu 20.04 LTS 1.32---1.35 0.36+ 0.59.1+
-Red Hat Core OS 4.17---4.21 4.17---4.21
+Red Hat Core OS 4.18---4.21 4.18---4.21
 ================ =========== ============= ========= ============= ===========
 
 You can run GPU passthrough and NVIDIA vGPU in the same cluster as long as you use
@@ -573,7 +573,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA.
 - Ubuntu 24.04 LTS with Network Operator 25.7.0.
 - Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0.
 - Red Hat Enterprise Linux 9.2, 9.4, and 9.6 with Network Operator 25.7.0.
-- Red Hat OpenShift 4.17 and higher with Network Operator 25.7.0.
+- Red Hat OpenShift 4.18 and higher with Network Operator 25.7.0.
 - Ubuntu 24.04 LTS with Network Operator 25.10.0
 
 For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`.
@@ -586,7 +586,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage.
 
 - Ubuntu 24.04 LTS Network Operator 25.7.0.
 - Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0.
-- Red Hat OpenShift Container Platform 4.17 and higher.
+- Red Hat OpenShift Container Platform 4.18 and higher.
 
 .. note::
 
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index f1247e2b2..f077c0065 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -152,6 +152,25 @@ Fixed Issues
 * Fixed an issue where the GPU Operator was not adding a namespace to ServiceAccount objects.
   (`PR #2039 `_)
 
+Known Issues
+------------
+
+* When GPUDirect RDMA is enabled, the ``nvidia-peermem`` container may fail to restart after the driver pod restarts without a node reboot and without any driver configuration changes.
+  In this scenario, the driver uses a fast-path optimization that skips recompilation, but the ``nvidia-peermem`` sidecar does not detect that its module is already loaded and fails to start.
+  This occurs because the kernel state is not cleared when the driver pod restarts.
+
+
+  To work around this issue, set the ``FORCE_REINSTALL=true`` environment variable in the ClusterPolicy.
+
+  .. code-block:: console
+
+     $ kubectl patch clusterpolicy cluster-policy --type=json \
+       -p='[{"op": "add", "path": "/spec/driver/manager/env/-", "value": {"name": "FORCE_REINSTALL", "value": "true"}}]'
+
+  Setting ``FORCE_REINSTALL=true`` forces full driver recompilation, node drain, and GPU workload disruption on every restart.
+  Alternatively, rebooting the node clears the kernel state and allows the ``nvidia-peermem`` module to load successfully, though this may disrupt running workloads.
+
+
 Removals and Deprecations
 -------------------------
 