[linux-nvidia-6.18-next] Cherry-pick all detected missing patches from grace by ltrager · Pull Request #445 · NVIDIA/NV-Kernels

ltrager · 2026-05-28T00:44:46Z

Overview

Cherry-pick all Grace upstreamed patches defined in grace-baremetal.txt and grace-iov.txt as detected by the nvidia-kernel-patches-verification tool.

Testing

On Grace hardware, I validated that this branch brings up MPAM (40 PARTIDs / 2 PMGs) with a functional resctrl interface (L3/MB allocation, llc_occupancy monitoring, enforcement, and error handling) and clean CPU hotplug, with no cross-cutting regressions: cache/CPU topology and ACPI tables were byte-identical to a control kernel on the same box. A same-hardware comparison against the 7.0 NVIDIA kernel confirmed identical core MPAM bring-up (40/2) but showed that this backport is a subset of the 7.0 resctrl work: it is missing L3_MAX, per-resource schema_format, and RESCTRL_IOMMU (and reports num_rmids=80 vs 2). This branch focuses on getting in all Grace patches described in the nvidia-kernel-patches-verification tool; I will work with @fyu1 on the remaining patches in a future PR.
Ran @fyu1's https://github.com/fyu1/NV-Kernels.fenghuay.baseos/blob/mpam_tests/resctrl_mpam_smoke.sh
Exercised the resctrl interface end-to-end to confirm it enforces, not just exposes files: created and removed a control group; wrote restricted L3 (CBM) and MB (%) schemata and verified they applied via read-back, bit_usage, and last_cmd_status; assigned CPUs and a task; ran a cache workload and confirmed llc_occupancy monitoring tracked it; verified invalid schemata are rejected; and confirmed CLOSID allocation is bounded and graceful.
Cycled every non-boot CPU offline→online to exercise the per-CPU MPAM re-enable path on each online, the arm_mpam CPU-hotplug domain callbacks, and resctrl domain add/remove. Confirmed no kernel splats (WARNING/BUG/call-trace) during the cycle, that resctrl remained healthy and enforcing afterward, and that all CPUs were restored online.
KVM - In progress
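The CPU-hotplug cycle described in the testing notes could be scripted roughly as follows. This is a minimal sketch, not the smoke test linked above: the function names are hypothetical, the paths assume the standard sysfs CPU layout, treating cpu0 as the boot CPU is an assumption, and the splat patterns are only a rough approximation of what to look for in the kernel log.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical. The sysfs root is a
# parameter so the helpers can be pointed at a scratch directory.

# List CPUs that expose an "online" control file, skipping cpu0.
hotpluggable_cpus() {
    local root="${1:-/sys/devices/system/cpu}" f cpu
    for f in "$root"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue
        cpu="${f%/online}"
        # Assumption: cpu0 is the boot CPU and is left alone.
        [ "$cpu" = "$root/cpu0" ] && continue
        printf '%s\n' "$cpu"
    done
}

# Take one CPU offline, bring it back, and confirm it reports online.
cycle_cpu() {
    local cpu="$1"
    echo 0 > "$cpu/online" || return 1
    echo 1 > "$cpu/online" || return 1
    [ "$(cat "$cpu/online")" = "1" ]
}

# Cycle every listed CPU; report failures and return non-zero on any.
cycle_all_cpus() {
    local root="${1:-/sys/devices/system/cpu}" cpu rc=0
    for cpu in $(hotpluggable_cpus "$root"); do
        cycle_cpu "$cpu" || { echo "failed: $cpu" >&2; rc=1; }
    done
    return "$rc"
}

# Count splat markers in a kernel log read from stdin.
count_splats() {
    grep -ciE 'WARNING:|BUG:|Call trace:' || true
}
```

On real hardware this would be run as root, for example `cycle_all_cpus && dmesg | count_splats`, with a count of 0 expected after a clean cycle.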

nvidia-bfigg · 2026-05-28T00:49:05Z

Are there tests that have been developed to confirm this functionality works? What tests did you run?

nirmoy · 2026-05-28T00:49:47Z

Boro watcher review skipped

The GitHub watcher skips automatic boro reviews for PRs with more than 50 commits. This PR currently has 100 commits.

To run the review anyway, ask BaseOS_Kernel_Bot in #baseos-kernel:

review https://github.com/NVIDIA/NV-Kernels/pull/445

Head: f06bc45adc27

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher sees a newer PR head.

ltrager · 2026-05-28T01:41:47Z

> Are there tests that have been developed to confirm this functionality works? What tests did you run?

I built and booted the kernel on a Vera system. I need to run more extensive tests tomorrow. Just wanted to put this up so @arighi and @sforshee can add branches on top of this if need be.

nirmoy · 2026-05-28T09:05:47Z

@ltrager You can use https://gitlab-master.nvidia.com/dgx/os/dgx-packages/-/tree/baseos7/nvidia-arm-mpam-tools/DEBS/usr/bin?ref_type=heads for the MPAM tests. If you find an issue, please help fix it. I generally direct Codex to use these tools to validate MPAM code, and Codex also fixes any issues it finds.

arighi · 2026-05-28T09:21:44Z

There are build issues with resctrl:

In file included from drivers/resctrl/mpam_devices.c:30:
drivers/resctrl/mpam_internal.h:398:41: error: field ‘resctrl_mon_dom’ has incomplete type
  398 |         struct rdt_l3_mon_domain        resctrl_mon_dom;
      |                                         ^~~~~~~~~~~~~~~

ltrager · 2026-06-02T00:05:43Z

@arighi - That was due to missing patches in nvidia-kernel-patches-verification. I have added them in NVIDIA/nvidia-kernel-patches-verification#5 and updated this branch.

arighi

This is still marked as draft, but I've done a bunch of tests on Vera and Grace (vllm, ollama, stress-ng, sched_ext). Everything looks good to me, so I think we should just merge this.

Acked-by: Andrea Righi <arighi@nvidia.com>

sforshee · 2026-06-03T21:41:26Z

It looks like a lot of the commits were cherry-picked from another one of our branches -- is this correct? In most cases I don't think this is problematic, though I think you should add your S-O-B to the provenance. I'm less sure about the backports though. If they were backported for another kernel version, it's not a given that the backports are correct for 6.18, though if they are clean cherry picks that's some evidence they might be.

There are a number of SAUCE patches applied, then reverted to cherry-pick the commits from upstream. It would be preferable to clean up the history by dropping both the SAUCE patches and the reverts. I think we have other instances of this to clean up though, so this could be a follow-up.

I wrote a script to do some basic checks. One was running git patch-id for any cherry-picks and comparing to the patch-id for the upstream commits, and some came back not matching:

571432d944ff46ae4e861adafcbe0dec5ed3cf00
7f224089f0f8d3af73e7c24152e3fe959dd47a8d
8ebefe1484ad3ca26da8fe6eecfaecf26705244c
479d006cd45d6923451c6c3da4f0aa9ee4faaa99
41dcbeba5e0ef39e5d083d783e43e8f6d2f146d0
a4229563ff26bc9ac9103c9296c3a3e3696dab0c
8b20305d6c2c68c9b16b288f595b9ad97e14ee94
f819f1a31d6eb2b69a7a4c5cc1273433ce1301a7
eed5ea82cb36fe4d92ed4aa1ed38750074bec247
fb1d3bfbdf2b6273614af7726db272bb52395ead
5ba3ade59f4ba38b6dc0da3ff0bf78055e1a436d
f4969bdc007e4227c5b778f3f67a66c4dbb150f7

These may still have been clean cherry picks, but I wanted to confirm. I picked one at random and did see minor context changes, but maybe git was able to resolve those automatically. If any required backporting, they should be annotated (backported from ...) with a note about the changes you made.
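The patch-id comparison described above can be sketched as follows. The helper names are hypothetical and this is not the script used in the review. It assumes each picked commit carries the `(cherry picked from commit ...)` trailer that `git cherry-pick -x` adds, and that both the local and the upstream commit are reachable in the repository where it runs.

```shell
# Sketch only: helper names are hypothetical.

# Print the upstream hash recorded by `git cherry-pick -x` in a commit
# message read from stdin. Prints nothing if the trailer is absent.
upstream_of() {
    sed -n 's/^(cherry picked from commit \([0-9a-f]\{7,40\}\))$/\1/p' | tail -n 1
}

# Print the stable patch-id of a diff read from stdin. Patch-ids ignore
# hunk line numbers and whitespace, but not changed context lines.
patch_id() {
    git patch-id --stable | cut -d' ' -f1
}

# Compare a local commit ($1) with the upstream commit named in its trailer.
check_pick() {
    local commit="$1" upstream
    upstream=$(git log -1 --format=%B "$commit" | upstream_of)
    if [ -z "$upstream" ]; then
        echo "$commit: no cherry-pick trailer"
        return 2
    fi
    if [ "$(git show "$commit" | patch_id)" = "$(git show "$upstream" | patch_id)" ]; then
        echo "$commit: matches $upstream"
    else
        echo "$commit: differs from $upstream"
        return 1
    fi
}
```

A mismatch does not by itself mean the pick needed manual changes. Shifted context lines alter the patch-id even when git applied the change automatically, which is consistent with the minor context changes noted above.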

I also checked for fixes to the backports and identified quite a few which aren't applied. Probably not something that needs to be addressed for merging, but something we might want to follow up on in case there's anything important in this list.

INFO: Found fixes for upstream commit 8c90dc68a5de4349ef9ba51449fb0a29cd690547:
ccad6001be5c38426ccf45790c411467ad3c03c6
4387970bbd84fd14e0c49c3089c5061ccd86b98a
1ef2a89584b7b788b2603590d886db076b2f24cc
b9f5c38e4af1a094384650d2fc79fb992d6d5e64
INFO: Found fixes for upstream commit 3bd04fe7d807bbdcfe75b29ca82fae4e2d7dc524:
6ccbb613b42a1f1ba7bfd547a148f644a902a25c
f1caff3335ea6eab88cdc84ec8f2e3c45ca05486
INFO: Found fixes for upstream commit 09b89d2a72f37b078198cbb09d5b9e13ba9d68b9:
a1cb6577f575ba5ec2583caf4f791a86754dbf69
f91e913355f49c878fc77f995fd71b7800352bd2
INFO: Found fixes for upstream commit 823e7c3712c584641b4ef890a8af34884c677197:
c2803bd580db226008aabf2fb2f0c9a7d3b5d0de
INFO: Found fixes for upstream commit 41e8a14950e1732af51cfec8fa09f8ded02a5ca9:
4ad79c874e53ebb7fe3b8ae7ac6c858a2121f415
INFO: Found fixes for upstream commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77:
c2803bd580db226008aabf2fb2f0c9a7d3b5d0de
INFO: Found fixes for upstream commit 6789fb99282c0a8e8e84701b7edf456f4a9e71e2:
f758340da529ccb12531c3f83d5992e912f6c8d5
INFO: Found fixes for upstream commit 264c285999fce128fc52743bce582468b26e9f65:
67c0a487efa542cca9477ea84915db2e091f98d0
INFO: Found fixes for upstream commit 2a3c79c61539779a09928893518c8286d7774b54:
4d5bbbafc170eb21474a37d844211fce6b0f3c51
INFO: Found fixes for upstream commit 28fa2cce7a8388f09e457f1e24241ca6d5e985d8:
d06b8e7c97c3290e61006e30b32beb9e715fab82
INFO: Found fixes for upstream commit 2ec41967189cd65a8f79c760dd1b50c4f56e8ac6:
e6dbcb7c0e7b508d443a9aa6f77f63a2f83b1ae4
INFO: Found fixes for upstream commit f58ef9d1d1355b15443719df95081f193067ab88:
80d9411c00e805488b631c91034e9b6c14a6dbdc
INFO: Found fixes for upstream commit 3aa31a8bb11e47c0ff2b306988d1756b810c1c3c:
590d745680309f8d956c3f0a97270fe65013b272
INFO: Found fixes for upstream commit 5d74781ebc86c5fa9e9d6934024c505412de9b52:
702809dabdecca807bdd50cfdcc1c980feb2ba62
d97708701434ce72968e771976aaf9d3438fcafd
e98137f0a874ab36d0946de4707aa48cb7137d1c
1a8a5227f22996d3e503c60569b1813a404da033
61ceaf236115f20f4fdd7cf60f883ada1063349a
INFO: Found fixes for upstream commit 44ebaa1744fd79cd86d10f5453c90b8c4e22b7f4:
69dc538a4f5a57dcc5ea4893c769d567f539a1b1
INFO: Found fixes for upstream commit d2041f1f11dd99076010841a86c4d02d04650814:
faa37ff3bf18d5242fe3d54f5462b1c3254c2567

ltrager · 2026-06-04T04:32:23Z

@sforshee - This branch only applies missing commits from grace-baremetal.txt and grace-iov.txt. Every commit listed in those files has already been upstreamed and is listed by its upstream commit hash. I applied all missing hashes with git cherry-pick -x -s.

I just double checked this:

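# Patch lists from the verification tool.
files=(
  ../nvidia-kernel-patches-verification/grace-baremetal.txt
  ../nvidia-kernel-patches-verification/grace-iov.txt
)

# Print the subject of every commit on this branch that appears in neither
# list; no output means every commit is accounted for.
git log --pretty=%s linux-nvidia-6.18-next..HEAD | while IFS= read -r title; do
  grep -Fq -- "$title" "${files[@]}" || printf '%s\n' "$title"
done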

sforshee · 2026-06-04T14:06:02Z

Ah, the reason I was seeing extra commits is that your branch is still based on 6.18.33 and linux-nvidia-6.18-next is now based on 6.18.34. There was a rebase pushed to the branch before the workflow got disabled.

fyu1 · 2026-06-04T14:29:35Z

Do you need to support MPAM on Vera?

ltrager · 2026-06-04T16:16:21Z

@fyu1 - Our overall goal is to get the linux-nvidia-6.18 branch in good shape as our reference kernel. I'm currently working on integrating the patches defined in nvidia-kernel-patches-verification. This branch handles Grace, next will handle Vera, then I will circle back to the remaining MPAM patches.

sforshee

Non-functional, but 2c543aacdc9c ("x86/resctrl: Add SDCIAE feature in the command line options") is completely missing the changes to Documentation/admin-guide/kernel-parameters.txt from the upstream commit.

This is the list of fixes to consider for the PR after the rebase. I should note -- my script only checked for fixes to the commits in the PR and did not recurse into fixes-for-fixes, etc., so this list may be incomplete. It also didn't check whether the commits appeared later in the branch.

INFO: Found fixes for upstream commit 8c90dc68a5de4349ef9ba51449fb0a29cd690547:
ccad6001be5c38426ccf45790c411467ad3c03c6
4387970bbd84fd14e0c49c3089c5061ccd86b98a
1ef2a89584b7b788b2603590d886db076b2f24cc
b9f5c38e4af1a094384650d2fc79fb992d6d5e64
INFO: Found fixes for upstream commit 3bd04fe7d807bbdcfe75b29ca82fae4e2d7dc524:
6ccbb613b42a1f1ba7bfd547a148f644a902a25c
f1caff3335ea6eab88cdc84ec8f2e3c45ca05486
INFO: Found fixes for upstream commit 09b89d2a72f37b078198cbb09d5b9e13ba9d68b9:
a1cb6577f575ba5ec2583caf4f791a86754dbf69
INFO: Found fixes for upstream commit 823e7c3712c584641b4ef890a8af34884c677197:
c2803bd580db226008aabf2fb2f0c9a7d3b5d0de
INFO: Found fixes for upstream commit 41e8a14950e1732af51cfec8fa09f8ded02a5ca9:
4ad79c874e53ebb7fe3b8ae7ac6c858a2121f415
INFO: Found fixes for upstream commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77:
c2803bd580db226008aabf2fb2f0c9a7d3b5d0de
INFO: Found fixes for upstream commit 6789fb99282c0a8e8e84701b7edf456f4a9e71e2:
f758340da529ccb12531c3f83d5992e912f6c8d5
INFO: Found fixes for upstream commit 264c285999fce128fc52743bce582468b26e9f65:
67c0a487efa542cca9477ea84915db2e091f98d0
INFO: Found fixes for upstream commit 2a3c79c61539779a09928893518c8286d7774b54:
4d5bbbafc170eb21474a37d844211fce6b0f3c51
INFO: Found fixes for upstream commit 28fa2cce7a8388f09e457f1e24241ca6d5e985d8:
d06b8e7c97c3290e61006e30b32beb9e715fab82
INFO: Found fixes for upstream commit 2ec41967189cd65a8f79c760dd1b50c4f56e8ac6:
e6dbcb7c0e7b508d443a9aa6f77f63a2f83b1ae4
INFO: Found fixes for upstream commit f58ef9d1d1355b15443719df95081f193067ab88:
80d9411c00e805488b631c91034e9b6c14a6dbdc
INFO: Found fixes for upstream commit 3aa31a8bb11e47c0ff2b306988d1756b810c1c3c:
590d745680309f8d956c3f0a97270fe65013b272
INFO: Found fixes for upstream commit 5d74781ebc86c5fa9e9d6934024c505412de9b52:
702809dabdecca807bdd50cfdcc1c980feb2ba62
d97708701434ce72968e771976aaf9d3438fcafd
e98137f0a874ab36d0946de4707aa48cb7137d1c
1a8a5227f22996d3e503c60569b1813a404da033
61ceaf236115f20f4fdd7cf60f883ada1063349a
INFO: Found fixes for upstream commit 44ebaa1744fd79cd86d10f5453c90b8c4e22b7f4:
69dc538a4f5a57dcc5ea4893c769d567f539a1b1
INFO: Found fixes for upstream commit d2041f1f11dd99076010841a86c4d02d04650814:
faa37ff3bf18d5242fe3d54f5462b1c3254c2567
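Because the report lists fixes per upstream commit, the same fix can appear more than once (c2803bd580db226008aabf2fb2f0c9a7d3b5d0de is listed under two commits above). A small sketch for flattening the report into unique fix hashes, assuming exactly the `INFO: Found fixes for upstream commit <hash>:` layout shown above. The function name is hypothetical and is not part of the review script.

```shell
# Sketch only: reads the report on stdin and prints each fix hash once,
# followed by the upstream commits it is reported to fix.
unique_fixes() {
    awk '
        /^INFO: Found fixes for upstream commit / {
            target = $NF
            sub(/:$/, "", target)   # drop the trailing colon
            next
        }
        /^[0-9a-f]+$/ && length($0) == 40 && target != "" {
            if (!($0 in seen)) order[++n] = $0
            seen[$0] = seen[$0] " " target
        }
        END {
            for (i = 1; i <= n; i++) print order[i] seen[order[i]]
        }
    '
}
```

Saving the report to a file and running `unique_fixes < report.txt` (file name hypothetical) gives one line per candidate fix, which is easier to triage than the per-commit listing.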

sforshee · 2026-06-04T18:44:04Z

 #include <linux/llist.h>
 #include <linux/mutex.h>
-#include <linux/srcu.h>
+#include <linux/resctrl.h>


The upstream commit did not remove the srcu.h include. It's clearly a duplicate and upstream commit b5a69c4 removed it. But this is still marked as a cherry pick and should be marked as a backport if the upstream change was adjusted.

sforshee · 2026-06-04T19:05:38Z

 #include <linux/vfio_pci_core.h>
 #include <linux/delay.h>
 #include <linux/jiffies.h>
+#include <linux/pci-p2pdma.h>


The upstream commit does not add this include. I'm not sure if it's necessary for this commit, but it looks like it's added by a cherry pick that precedes it upstream but comes after it in this tree (5d74781ebc86c5fa9e9d6934024c505412de9b52 is the backport). Ideally we would apply these in the upstream order to make them clean cherry picks, but if not, the patches should be identified as backports instead of cherry picks when adjustments are made.

fyu1 · 2026-06-04T21:14:38Z

Acked-by: Fenghua Yu <fenghuay@nvidia.com>

…tion Enforcement Smart Data Cache Injection (SDCI) is a mechanism that enables direct insertion of data from I/O devices into the L3 cache. By directly caching data from I/O devices rather than first storing the I/O data in DRAM, SDCI reduces demands on DRAM bandwidth and reduces latency to the processor consuming the I/O data. The SDCIAE (SDCI Allocation Enforcement) PQE feature allows system software to control the portion of the L3 cache used for SDCI. When enabled, SDCIAE forces all SDCI lines to be placed into the L3 cache partitions identified by the highest-supported L3_MASK_n register, where n is the maximum supported CLOSID. Add CPUID feature bit that can be used to configure SDCIAE. The SDCIAE feature details are documented in: AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.4.7 L3 Smart Data Cache Injection Allocation Enforcement (SDCIAE). available at https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/83ca10d981c48e86df2c3ad9658bb3ba3544c763.1762995456.git.babu.moger@amd.com (cherry picked from commit 3767def) Signed-off-by: Lee Trager <ltrager@nvidia.com>

There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead. Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 47d13c9) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Make sure that all VFIO PCI devices have peer-to-peer capabilities enables, so we would be able to export their MMIO memory through DMABUF, VFIO has always supported P2P mappings with itself. VFIO type 1 insecurely reads PFNs directly out of a VMA's PTEs and programs them into the IOMMU allowing any two VFIO devices to perform P2P to each other. All existing VMMs use this capability to export P2P into a VM where the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK are also known to make use of this, though less frequently. As a first step to more properly integrating VFIO with the P2P subsystem unconditionally enable P2P support for VFIO PCI devices. The struct p2pdma_provider will act has a handle to the P2P subsystem to do things like DMA mapping. While real PCI devices have to support P2P (they can't even tell if an IOVA is P2P or not) there may be fake PCI devices that may trigger some kind of catastrophic system failure. To date VFIO has never tripped up on such a case, but if one is discovered the plan is to add a PCI quirk and have pcim_p2pdma_init() fail. This will fully block the broken device throughout any users of the P2P subsystem in the kernel. Thus P2P through DMABUF will follow the historical VFIO model and be unconditionally enabled by vfio-pci. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 35c3503) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations. The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace. Currently VFIO can take MMIO regions from the device's BAR and map them into a PFNMAP VMA with special PTEs. This mapping type ensures the memory cannot be used with things like pin_user_pages(), hmm, and so on. In practice only the user process CPU and KVM can safely make use of these VMA. When VFIO shuts down these VMAs are cleaned by unmap_mapping_range() to prevent any UAF of the MMIO beyond driver unbind. However, VFIO type 1 has an insecure behavior where it uses follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back into the IOMMU. This has a long history of enabling P2P DMA inside VMs, but has serious lifetime problems by allowing a UAF of the MMIO after the VFIO driver has been unbound. Introduce DMABUF as a new safe way to export a FD based handle for the MMIO regions. This can be consumed by existing DMABUF importers like RDMA or DRM without opening an UAF. A following series will add an importer to iommufd to obsolete the type 1 code and allow safe UAF-free MMIO P2P in VM cases. DMABUF has a built in synchronous invalidation mechanism called move_notify. VFIO keeps track of all drivers importing its MMIO and can invoke a synchronous invalidation callback to tell the importing drivers to DMA unmap and forget about the MMIO pfns. This process is being called revoke. This synchronous invalidation fully prevents any lifecycle problems. 
VFIO will do this before unbinding its driver ensuring there is no UAF of the MMIO beyond the driver lifecycle. Further, VFIO has additional behavior to block access to the MMIO during things like Function Level Reset. This is because some poor platforms may experience a MCE type crash when touching MMIO of a PCI device that is undergoing a reset. Today this is done by using unmap_mapping_range() on the VMAs. Extend that into the DMABUF world and temporarily revoke the MMIO from the DMABUF importers during FLR as well. This will more robustly prevent an errant P2P from possibly upsetting the platform. A DMABUF FD is a preferred handle for MMIO compared to using something like a pgmap because: - VFIO is supported, including its P2P feature, on archs that don't support pgmap - PCI devices have all sorts of BAR sizes, including ones smaller than a section so a pgmap cannot always be created - It is undesirable to waste a lot of memory for struct pages, especially for a case like a GPU with ~100GB of BAR size - We want a synchronous revoke semantic to support FLR with light hardware requirements Use the P2P subsystem to help generate the DMA mapping. This is a significant upgrade over the abuse of dma_map_resource() that has historically been used by DMABUF exporters. Experience with an OOT version of this patch shows that real systems do need this. This approach deals with all the P2P scenarios: - Non-zero PCI bus_offset - ACS flags routing traffic to the IOMMU - ACS flags that bypass the IOMMU - though vfio noiommu is required to hit this. There will be further work to formalize the revoke semantic in DMABUF. For now this acts like a move_notify dynamic exporter where importer fault handling will get a failure when they attempt to map. This means that only fully restartable fault capable importers can import the VFIO DMABUFs. A future revoke semantic should open this up to more HW as the HW only needs to invalidate, not handle restartable faults. 
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 5d74781) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on the PCI bar. This demonstrates a DMABUF that follows the region info report to only allow mapping parts of the region that are mmapable. Since the BAR is power of two sized and the "CXL" region is just page aligned the there can be a padding region at the end that is not mmaped or passed into the DMABUF. The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI MMIO, they actually run over the CXL-like coherent interconnect and for the purposes of DMA behave identically to DRAM. We don't try to model this distinction between true PCI BAR memory that takes a real PCI path and the "CXL" memory that takes a different path in the p2p framework for now. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 5415d88) Signed-off-by: Lee Trager <ltrager@nvidia.com>

This function is used to establish the "private interconnect" between the VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to be a temporary API until the core DMABUF interface is improved to natively support a private interconnect and revocable negotiation. This function should only be called by iommufd when trying to map a DMABUF. For now iommufd will only support VFIO DMABUFs. The following improvements are needed in the DMABUF API to generically support more exporters with iommufd/kvm type importers that cannot use the DMA API: 1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO during FLR, and so it will use dma_buf_move_notify() to prevent access. iommmufd does not support fault handling so it cannot implement the full move_notify. Instead if revoke is negotiated the exporter promises not to use move_notify() unless the importer can experiance failures. iommufd will unmap the dmabuf from the iommu page tables while it is revoked. 2) Private interconnect negotiation. iommufd will only be able to map a "private interconnect" that provides a phys_addr_t and a struct p2pdma_provider * to describe the memory. It cannot use a DMA mapped scatterlist since it is directly calling iommu_map(). 3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use the DMA API it doesn't have a DMAable struct device to pass here. Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Acked-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 96ce2ae) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Add IOPT_ADDRESS_DMABUF to the iopt_pages and the basic infrastructure to create an iopt_pages from a struct dma_buf *. DMABUF pages are not supported for accesses, and for now can only be used with the VFIO DMABUF exporter. The overall flow will be similar to memfd where the user can pass in a DMABUF file descriptor to IOMMU_IOAS_MAP_FILE and create an area and pages. Like other areas it can be copied and otherwise manipulated, though there is little point in doing so. There is no pinned page accounting done for DMABUF maps. The DMABUF attachment exists so long as the dmabuf is mapped into an IOAS, even if the IOAS is not mapped to any domains. Link: https://patch.msgid.link/r/2-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 71db84a) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Once a DMABUF is revoked the domain will be unmapped under the pages mutex. Double unmapping will trigger a WARN, and mapping while revoked will fail. Check for revoked DMABUFs along all the map and unmap paths to resolve this. Ensure that map/unmap is always done under the pages mutex so it is synchronized with the revoke notifier. If a revoke happens between allocating the iopt_pages and the population to a domain then the population will succeed, and leave things unmapped as though revoke had happened immediately after. Currently there is no way to repopulate the domains. Userspace is expected to know if it is going to do something that would trigger revoke (eg if it is about to do a FLR) then it should go and remove the DMABUF mappings before and put the back after. The revoke is only to protect the kernel from mis-behaving userspace. Link: https://patch.msgid.link/r/3-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 71e2409) Signed-off-by: Lee Trager <ltrager@nvidia.com>

When connected to VFIO, the only DMABUF exporter that is accepted, the move_notify callback will be made when VFIO wants to remove access to the MMIO. This is being called revoke. Wire up revoke to go through all the iommu_domain's that have mapped the DMABUF and unmap them. The locking here is unpleasant, since the existing locking scheme was designed to come from the iopt through the area to the pages we cannot use pages as starting point for the locking. There is no way to obtain the domains_rwsem before obtaining the pages mutex to reliably use the existing domains_itree. Solve this problem by adding a new tracking structure just for DMABUF revoke. Record a linked list of areas and domains inside the pages mutex. Clean the entries on the list during revoke. The map/unmaps are now all done under a pages mutex while updating the tracking linked list so nothing can get out of sync. Only one lock is required for revoke processing. Link: https://patch.msgid.link/r/4-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit fc7063a) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Addresses intended for MMIO should be propagated through to the iommu with the IOMMU_MMIO flag set. Keep track in the batch if all the pfns are cachable or mmio and flush the batch out of it ever needs to be changed. Switch to IOMMU_MMIO if the batch is MMIO when mapping the iommu. Link: https://patch.msgid.link/r/5-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 3114c67) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Make another sub implementation of pfn_reader for DMABUF. This version will fill the batch using the struct phys_vec recorded during the attachment. Link: https://patch.msgid.link/r/6-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 74014a4) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Since dmabuf only has APIs that work on an int fd and not a struct file *, pass the fd deeper into the call chain so we can use the dmabuf APIs as is. Link: https://patch.msgid.link/r/7-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 217725f) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Finally call iopt_alloc_dmabuf_pages() if the user passed in a DMABUF through IOMMU_IOAS_MAP_FILE. This makes the feature visible to userspace. Link: https://patch.msgid.link/r/8-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 44ebaa1) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Basic tests of establishing a dmabuf and revoking it. The selftest kernel side provides a basic small dmabuf for this testing. Link: https://patch.msgid.link/r/9-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit d2041f1) Signed-off-by: Lee Trager <ltrager@nvidia.com>

fill_sg_entry() splits large DMA buffers into multiple scatter-gather entries, each holding up to UINT_MAX bytes. When calculating the DMA address for entries beyond the second one, the expression (i * UINT_MAX) causes integer overflow due to 32-bit arithmetic. This manifests when the input arg length >= 8 GiB results in looping for i >= 2. Fix by casting i to dma_addr_t before multiplication. Fixes: 3aa31a8 ("dma-buf: provide phys_vec to scatter-gather mapping routine") Signed-off-by: Alex Mastro <amastro@fb.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20251125-dma-buf-overflow-v1-1-b70ea1e6c4ba@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 590d745) Signed-off-by: Lee Trager <ltrager@nvidia.com>
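The overflow described in this commit message can be reproduced with plain arithmetic. The sketch below is not the kernel code: it emulates the 32-bit product by masking, which assumes a shell with 64-bit arithmetic (such as bash on a 64-bit host), and the function names are hypothetical.

```shell
# Sketch only: emulates the arithmetic, not the kernel implementation.
UINT_MAX=4294967295   # 2^32 - 1, the per-entry byte limit named above

# Offset of entry i as computed in 32-bit unsigned arithmetic (the bug):
# the product is truncated to its low 32 bits.
offset_32bit() {
    echo $(( ($1 * UINT_MAX) & 4294967295 ))
}

# Offset of entry i once i is widened before multiplying (the fix).
offset_widened() {
    echo $(( $1 * UINT_MAX ))
}

for i in 0 1 2 3; do
    printf 'i=%s 32-bit=%s widened=%s\n' \
        "$i" "$(offset_32bit "$i")" "$(offset_widened "$i")"
done
```

For i = 0 and i = 1 the two results agree. From i = 2 onward the truncated product falls below the true offset (4294967294 instead of 8589934590 for i = 2), which matches the commit's statement that the problem appears for entries beyond the second.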

NVIDIA's Grace based systems have large device memory. The device memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu module could make use of the huge PFNMAP support added in mm [1]. To make use of the huge pfnmap support, a fault/huge_fault ops based mapping mechanism needs to be implemented. Currently the nvgrace-gpu module relies on remap_pfn_range to do the mapping during VM bootup. Replace it to instead rely on fault and use vfio_pci_vmf_insert_pfn to set up the mapping. Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated by adding a huge_fault ops implementation. The implementation establishes the mapping according to the order requested. Note that if the PFN or the VMA address is unaligned to the order, the mapping falls back to the PTE level. Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1] Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Vikram Sethi <vsethi@nvidia.com> Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 9db6548) Signed-off-by: Lee Trager <ltrager@nvidia.com>
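
The alignment rule in the last sentence of the message above can be sketched as follows. This is a simplified userspace model, not the driver's huge_fault handler: it assumes 4 KiB base pages, and `SKETCH_PAGE_SHIFT`, `can_map_at_order()` and `pick_order()` are hypothetical names. The real handler signals fallback to the mm rather than returning an order.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/* True when both the virtual address and the PFN are aligned to a
 * block of 2^order pages, so a single huge mapping could cover it. */
static bool can_map_at_order(uint64_t vaddr, uint64_t pfn, unsigned int order)
{
	uint64_t pages = 1ULL << order;
	uint64_t bytes = pages << SKETCH_PAGE_SHIFT;

	return (vaddr & (bytes - 1)) == 0 && (pfn & (pages - 1)) == 0;
}

/* Use the requested order when alignment allows it, else order 0,
 * which models falling back to a PTE-level mapping. */
static unsigned int pick_order(uint64_t vaddr, uint64_t pfn, unsigned int order)
{
	return can_map_at_order(vaddr, pfn, order) ? order : 0;
}
```

With order 9 (2 MiB under the 4 KiB assumption), an address of 0x200000 and PFN 0x200 keep the requested order, while a misaligned address or PFN drops to order 0.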

Remove code duplication in vfio_pci_core_mmap by calling vfio_pci_core_setup_barmap to perform the bar mapping. No functional change is intended. Cc: Donald Dutile <ddutile@redhat.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Reviewed-by: Zhi Wang <zhiw@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 7f5764e) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Split the function that checks whether the GPU device is ready at probe. Move the code that waits for the GPU to be ready through BAR0 register reads into a separate function so it can be reused. This also fixes a bug where the return status on timeout was overridden by the return value of pci_enable_device. With the fix, a timeout generates an error as initially intended. Fixes: d85f69d ("vfio/nvgrace-gpu: Check the HBM training and C2C link status") Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 7d05507) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Introduce a new flag reset_done to notify that the GPU has just been reset and the mapping to the GPU memory is zapped. Implement the reset_done handler to set this new variable. It will be used later in the patches to wait for the GPU memory to be ready before doing any mapping or access. Cc: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit dfe7654) Signed-off-by: Lee Trager <ltrager@nvidia.com>

Speculative prefetches from CPU to GPU memory before the GPU is ready after reset can cause harmless corrected RAS events to be logged on Grace systems. It is thus preferred that the mapping not be re-established until the GPU is ready post reset. The GPU readiness can be checked through BAR0 registers, similar to the check at the time of device probe. It can take several seconds for the GPU to be ready, so it is desirable that this time overlaps as much of the VM startup as possible to reduce the impact on VM bootup time. The GPU readiness state is thus checked on the first fault/huge_fault request or read/write access, which amortizes the GPU readiness time. The first fault and read/write access check the GPU state when the reset_done flag, which denotes whether the GPU has just been reset, is set. The memory_lock is taken across map/access to avoid races with GPU reset. Also check whether the memory is enabled before waiting for the GPU to be ready. Otherwise the readiness check would block for 30s. Lastly, add PM handling wrapping on read/write access. Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Vikram Sethi <vsethi@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit a23b106) Signed-off-by: Lee Trager <ltrager@nvidia.com>
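
The deferred readiness check described above might look roughly like the sketch below. It is a userspace model under stated assumptions, not the driver code: `struct gpu_sketch`, its fields, the error codes chosen, and `check_ready_on_first_access()` are all hypothetical, the BAR0 polling is reduced to a boolean, and the caller is assumed to already hold the lock that serializes against reset.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device state for illustration only. */
struct gpu_sketch {
	bool reset_done;     /* set by the reset handler, cleared once ready */
	bool memory_enabled; /* stands in for the memory-enable check */
	bool hw_ready;       /* stands in for polling BAR0 registers */
	unsigned int polls;  /* counts how often readiness was polled */
};

/* Called on the first fault or read/write access after a reset.
 * Assumes the caller holds the lock guarding against a concurrent reset. */
static int check_ready_on_first_access(struct gpu_sketch *g)
{
	if (!g->reset_done)
		return 0;          /* no reset pending, nothing to wait for */
	if (!g->memory_enabled)
		return -EIO;       /* skip polling that could only time out */

	g->polls++;
	if (!g->hw_ready)
		return -ETIMEDOUT; /* models the readiness wait expiring */

	g->reset_done = false;     /* later accesses skip the check */
	return 0;
}
```

Checking the memory-enable state before polling is what avoids the long blocking wait the message mentions; clearing the flag only on success means a failed check is retried on the next access.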

The memory failure handling implementation for PFNMAP memory with no struct pages is faulty. The VA of the mapping is determined based on the PFN. It should instead be based on the file mapping offset. When poison occurs, memory_failure_pfn is triggered on the poisoned PFN. Introduce a callback function that allows mm to translate the PFN to the corresponding file page offset. The kernel module using the registration API must implement the callback function and provide the translation. The translated value is then used to determine the VA information and send the SIGBUS to the usermode process mapped to the poisoned PFN. The callback is also useful for the driver to be notified of the poisoned PFN, which it may then track. Link: https://lkml.kernel.org/r/20251211070603.338701-2-ankita@nvidia.com Fixes: 2ec4196 ("mm: handle poisoning of pfn without struct pages") Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Matthew R. Ochs <mochs@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Neo Jia <cjia@nvidia.com> Cc: Vikram Sethi <vsethi@nvidia.com> Cc: Yishai Hadas <yishaih@nvidia.com> Cc: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit e6dbcb7) Signed-off-by: Lee Trager <ltrager@nvidia.com>

The nvgrace-gpu module [1] maps the device memory to the user VA (Qemu) without adding the memory to the kernel. The device memory pages are PFNMAP and not backed by struct page. The module can thus utilize the MM's PFNMAP memory_failure mechanism that handles ECC/poison on regions with no struct pages. The kernel MM code exposes register/unregister APIs allowing modules to register the device memory for memory_failure handling. Make nvgrace-gpu register the GPU memory with the MM on open. The module registers its memory region and the address_space with the kernel MM for ECC handling and implements a callback function to convert the PFN to the file page offset. The callback function checks whether the PFN belongs to the device memory region and is also contained in the VMA range; an error is returned otherwise. Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [1] Suggested-by: Alex Williamson <alex@shazbot.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Jiaqi Yan <jiaqiyan@google.com> Link: https://lore.kernel.org/r/20260115202849.2921-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit e5f19b6) Signed-off-by: Lee Trager <ltrager@nvidia.com>
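
A PFN-to-file-page-offset callback of the kind described above could be modeled as below. This is a deliberately simplified sketch, not the driver's implementation: it assumes file page offset p corresponds to PFN `start_pfn + p`, and `struct pfn_region_sketch` and `pfn_to_pgoff_sketch()` are hypothetical names.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical description of one registered device-memory mapping. */
struct pfn_region_sketch {
	uint64_t start_pfn; /* first PFN of the device memory region */
	uint64_t nr_pages;  /* size of the region in pages */
	uint64_t vma_pgoff; /* file page offset where the VMA begins */
	uint64_t vma_pages; /* length of the VMA in pages */
};

/* Translate a poisoned PFN to a file page offset. Returns -EFAULT when the
 * PFN lies outside the device region or outside the range the VMA maps. */
static int pfn_to_pgoff_sketch(const struct pfn_region_sketch *r,
			       uint64_t pfn, uint64_t *pgoff)
{
	uint64_t off;

	if (pfn < r->start_pfn || pfn - r->start_pfn >= r->nr_pages)
		return -EFAULT;

	off = pfn - r->start_pfn;
	if (off < r->vma_pgoff || off - r->vma_pgoff >= r->vma_pages)
		return -EFAULT;

	*pgoff = off;
	return 0;
}
```

The subtractions are ordered so that each range check cannot underflow; the output is written only on success.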

Add stubs for the case where CONFIG_MEMORY_FAILURE is disabled. Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20260115202849.2921-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org> (cherry picked from commit 205e6d1) Signed-off-by: Lee Trager <ltrager@nvidia.com>

PCIe ATS may be disabled by platform firmware, root complex limitations, or kernel policy even when a device advertises the ATS capability in its PCI configuration space. Add a new IOMMU_CAP_PCI_ATS_SUPPORTED capability to allow IOMMU drivers to report the effective ATS decision for a device. When this capability is true for a device, ATS may be enabled for that device, but it does not imply that ATS is currently enabled. A subsequent patch will extend iommufd to expose the effective ATS status to userspace. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> (cherry picked from commit a82efb8) Signed-off-by: Lee Trager <ltrager@nvidia.com>

If the IOMMU driver reports that ATS is not supported for a device, set the IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED flag in the returned hardware capabilities. This uses a negative flag for UAPI compatibility. Existing userspace assumes ATS is supported if no flag is present. This also ensures that new userspace works correctly on both old and new kernels, where a zero value implies ATS support. When this flag is set, ATS cannot be used for the device. When it is clear, ATS may be enabled when an appropriate HWPT is attached. Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> (cherry picked from commit a11661a) Signed-off-by: Lee Trager <ltrager@nvidia.com>
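
The compatibility argument for the negative flag can be shown with a small sketch. The bit value below is a placeholder, since the text does not give the real UAPI value, and `SKETCH_HW_CAP_PCI_ATS_NOT_SUPPORTED`, `sketch_report_caps()` and `sketch_ats_usable()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration; not the real UAPI value. */
#define SKETCH_HW_CAP_PCI_ATS_NOT_SUPPORTED (1ULL << 0)

/* Kernel side (sketch): set the flag only when the driver says ATS
 * is not supported for the device. */
static uint64_t sketch_report_caps(bool driver_says_ats_supported)
{
	uint64_t caps = 0;

	if (!driver_says_ats_supported)
		caps |= SKETCH_HW_CAP_PCI_ATS_NOT_SUPPORTED;
	return caps;
}

/* Userspace side (sketch): a clear bit means ATS may be used. An older
 * kernel that never sets the bit therefore reads as "supported". */
static bool sketch_ats_usable(uint64_t caps)
{
	return (caps & SKETCH_HW_CAP_PCI_ATS_NOT_SUPPORTED) == 0;
}
```

Because a zero capability word reads as "ATS may be used", the same userspace check gives the legacy answer on kernels that predate the flag.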

ltrager · 2026-06-05T08:15:32Z

@sforshee - Thanks for double checking these commits. The mismatches were an artifact of the git surgery I did when inserting the missing commits. Additionally, some patches were listed out of order. I reapplied them, which should resolve the issue.

93538246e1 "arm_mpam: resctrl: Allow resctrl to allocate monitors" is now listed as a backport, as I had to add an allocation flag which is the default in newer kernels.

sforshee · 2026-06-05T14:10:55Z

@ltrager thanks! Looks good now.

Acked-by: Seth Forshee <sforshee@nvidia.com>

sforshee · 2026-06-05T14:38:54Z

Applied to linux-nvidia-6.18-next, closing PR.

nirmoy added the help wanted and pending_review_comment labels May 28, 2026

ltrager marked this pull request as draft May 28, 2026 01:24

nirmoy removed the help wanted and pending_review_comment labels May 28, 2026

ltrager force-pushed the 6.18-missing-grace branch from 49e3244 to d7cecc9 June 1, 2026 23:42

github-actions Bot force-pushed the linux-nvidia-6.18-next branch from a2af04d to 295622c June 2, 2026 01:06

arighi approved these changes Jun 3, 2026

ltrager marked this pull request as ready for review June 3, 2026 18:59

nirmoy added the help wanted, has_1_ack, and pending_review_comment labels Jun 3, 2026

ltrager force-pushed the 6.18-missing-grace branch from 447974a to b36ba48 June 4, 2026 04:15

sforshee requested changes Jun 4, 2026

nirmoy added the has_2_acks label and removed the help wanted and has_1_ack labels Jun 4, 2026

vivekkreddy and others added 24 commits June 5, 2026 07:51

ltrager force-pushed the 6.18-missing-grace branch from b36ba48 to f06bc45 June 5, 2026 08:03

sforshee approved these changes Jun 5, 2026

sforshee closed this Jun 5, 2026

17 participants