Closed
148 commits
409ebf4
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Alloca…
babumoger Nov 13, 2025
4356905
x86/resctrl: Add SDCIAE feature in the command line options
babumoger Nov 13, 2025
58ceae9
x86,fs/resctrl: Detect io_alloc feature
babumoger Nov 13, 2025
6107c03
x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
babumoger Nov 13, 2025
20bc406
fs/resctrl: Introduce interface to display "io_alloc" support
babumoger Nov 13, 2025
ba0ff94
fs/resctrl: Add user interface to enable/disable io_alloc feature
babumoger Nov 13, 2025
9a3109e
fs/resctrl: Introduce interface to display io_alloc CBMs
babumoger Nov 13, 2025
a03a5f6
fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
babumoger Nov 13, 2025
5577d55
fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
babumoger Nov 13, 2025
4a051f5
fs/resctrl: Update bit_usage to reflect io_alloc
babumoger Nov 13, 2025
fb99d6c
ACPI / PPTT: Add a helper to fill a cpumask from a processor container
Nov 19, 2025
bce526d
ACPI / PPTT: Stop acpi_count_levels() expecting callers to clear levels
Nov 19, 2025
ae02b5e
ACPI / PPTT: Add acpi_pptt_cache_v1_full to use pptt cache as one str…
benhor01 Nov 19, 2025
1fe2ab7
ACPI / PPTT: Find cache level by cache-id
Nov 19, 2025
f826709
ACPI / PPTT: Add a helper to fill a cpumask from a cache_id
Nov 19, 2025
9b390d3
arm64: kconfig: Add Kconfig entry for MPAM
Nov 19, 2025
3c4fd26
platform: Define platform_device_put cleanup handler
benhor01 Nov 19, 2025
3880b07
ACPI: Define acpi_put_table cleanup handler and acpi_get_table_pointe…
benhor01 Nov 19, 2025
11cba5e
ACPI / MPAM: Parse the MPAM table
Nov 19, 2025
873a859
arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plate
Nov 19, 2025
b802063
arm_mpam: Add the class and component structures for firmware describ…
Nov 19, 2025
0615354
arm_mpam: Add MPAM MSC register layout definitions
Nov 19, 2025
cf5a661
arm_mpam: Add cpuhp callbacks to probe MSC hardware
Nov 19, 2025
5685e45
arm_mpam: Probe hardware to find the supported partid/pmg values
Nov 19, 2025
fd6ec4e
arm_mpam: Add helpers for managing the locking around the mon_sel reg…
Nov 19, 2025
4c92218
arm_mpam: Probe the hardware features resctrl supports
Nov 19, 2025
f1450ad
arm_mpam: Merge supported features during mpam_enable() into mpam_class
Nov 19, 2025
650fe68
arm_mpam: Reset MSC controls from cpuhp callbacks
Nov 19, 2025
c637f1c
arm_mpam: Add a helper to touch an MSC from any CPU
Nov 19, 2025
22407c6
arm_mpam: Extend reset logic to allow devices to be reset any time
Nov 19, 2025
ad65a0d
arm_mpam: Register and enable IRQs
Nov 19, 2025
754be83
arm_mpam: Use a static key to indicate when mpam is enabled
Nov 19, 2025
0c7a78e
arm_mpam: Allow configuration to be applied and restored during cpu o…
Nov 19, 2025
87fa980
arm_mpam: Probe and reset the rest of the features
Nov 19, 2025
ca70bb5
arm_mpam: Add helpers to allocate monitors
Nov 19, 2025
c64ad09
arm_mpam: Add mpam_msmon_read() to read monitor value
Nov 19, 2025
098f142
arm_mpam: Track bandwidth counter state for power management
Nov 19, 2025
8073de4
arm_mpam: Consider overflow in bandwidth counter state
benhor01 Nov 19, 2025
52f76fb
arm_mpam: Probe for long/lwd mbwu counters
rohit-arm Nov 19, 2025
31ae450
arm_mpam: Use long MBWU counters if supported
rohit-arm Nov 19, 2025
79afb33
arm_mpam: Add helper to reset saved mbwu state
Nov 19, 2025
454d41e
cpumask: Add initialiser to use cleanup helpers
YuryNorov Nov 20, 2025
274eb52
arm_mpam: Add kunit test for bitmap reset
Nov 19, 2025
12b4b9b
arm_mpam: Add kunit tests for props_mismatch()
Nov 19, 2025
2872b11
MAINTAINERS: new entry for MPAM Driver
benhor01 Nov 19, 2025
f96ab3c
x86,fs/resctrl: Improve domain type checking
aegl Dec 17, 2025
ca054ed
x86/resctrl: Move L3 initialization into new helper function
aegl Dec 17, 2025
76e272e
x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
aegl Dec 17, 2025
b58c422
x86/resctrl: Clean up domain_remove_cpu_ctrl()
aegl Dec 17, 2025
e7bbcc5
x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain…
aegl Dec 17, 2025
51ad6fe
fs/resctrl: Split L3 dependent parts out of __mon_event_count()
aegl Dec 17, 2025
4a813d2
x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
aegl Dec 17, 2025
48db9f0
x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
aegl Dec 17, 2025
1a52a5d
x86,fs/resctrl: Rename some L3 specific functions
aegl Dec 17, 2025
ac4acd0
fs/resctrl: Make event details accessible to functions when reading e…
aegl Dec 17, 2025
063bbdb
x86,fs/resctrl: Handle events that can be read from any CPU
aegl Dec 17, 2025
21caffc
x86,fs/resctrl: Support binary fixed point event counters
aegl Dec 17, 2025
d12da8a
x86,fs/resctrl: Add an architectural hook called for first mount
aegl Jan 8, 2026
d02dce6
x86,fs/resctrl: Add and initialize a resource for package scope monit…
aegl Dec 17, 2025
2e7b051
fs/resctrl: Emphasize that L3 monitoring resource is required for sum…
aegl Dec 17, 2025
acf2ec1
x86/resctrl: Discover hardware telemetry events
aegl Jan 8, 2026
bd841bf
x86,fs/resctrl: Fill in details of events for performance and energy …
aegl Dec 17, 2025
71638e0
x86,fs/resctrl: Add architectural event pointer
aegl Dec 17, 2025
6bcc739
x86/resctrl: Find and enable usable telemetry events
aegl Dec 17, 2025
2d5ff51
x86/resctrl: Read telemetry events
aegl Dec 17, 2025
578afea
fs/resctrl: Refactor mkdir_mondata_subdir()
aegl Dec 17, 2025
ef141e9
fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp()
aegl Dec 17, 2025
6f15e6d
x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF…
aegl Dec 17, 2025
8158488
x86/resctrl: Add energy/perf choices to rdt boot option
aegl Dec 17, 2025
bd26761
x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
aegl Dec 17, 2025
6056151
fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
aegl Dec 17, 2025
661d226
x86,fs/resctrl: Compute number of RMIDs as minimum across resources
aegl Dec 17, 2025
0d7b9ba
fs/resctrl: Move RMID initialization to first mount
aegl Dec 17, 2025
fc2ed46
x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
aegl Jan 8, 2026
abf3736
x86,fs/resctrl: Update documentation for telemetry events
aegl Dec 17, 2025
0310b32
arm_mpam: Ensure in_reset_state is false after applying configuration
henryZe Mar 13, 2026
74cea12
arm_mpam: Reset when feature configuration bit unset
benhor01 Mar 13, 2026
36c6099
arm64/sysreg: Add MPAMSM_EL1 register
benhor01 Mar 13, 2026
39e8012
KVM: arm64: Preserve host MPAM configuration when changing traps
benhor01 Mar 13, 2026
4d49ce2
KVM: arm64: Make MPAMSM_EL1 accesses UNDEF
benhor01 Mar 13, 2026
36efc33
arm64: mpam: Context switch the MPAM registers
Mar 13, 2026
8468a81
arm64: mpam: Re-initialise MPAM regs when CPU comes online
Mar 13, 2026
41abf44
arm64: mpam: Drop the CONFIG_EXPERT restriction
benhor01 Mar 13, 2026
72d6a37
arm64: mpam: Advertise the CPUs MPAM limits to the driver
Mar 13, 2026
1c19e77
arm64: mpam: Add cpu_pm notifier to restore MPAM sysregs
Mar 13, 2026
b7ec00f
arm64: mpam: Initialise and context switch the MPAMSM_EL1 register
benhor01 Mar 13, 2026
546abc4
arm64: mpam: Add helpers to change a task or cpu's MPAM PARTID/PMG va…
Mar 13, 2026
7243584
KVM: arm64: Force guest EL1 to use user-space's partid configuration
Mar 13, 2026
bad9d33
arm_mpam: resctrl: Add boilerplate cpuhp and domain allocation
Mar 13, 2026
3f21141
arm_mpam: resctrl: Pick the caches we will use as resctrl resources
Mar 13, 2026
796d6de
arm_mpam: resctrl: Implement resctrl_arch_reset_all_ctrls()
Mar 13, 2026
8550150
arm_mpam: resctrl: Add resctrl_arch_get_config()
Mar 13, 2026
3827eff
arm_mpam: resctrl: Implement helpers to update configuration
Mar 13, 2026
6f843e0
arm_mpam: resctrl: Add plumbing against arm64 task and cpu hooks
Mar 13, 2026
4f1790e
arm_mpam: resctrl: Add CDP emulation
Mar 13, 2026
b4b5349
arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
benhor01 Mar 13, 2026
cdffd57
arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
Mar 13, 2026
7a01a0f
arm_mpam: resctrl: Add rmid index helpers
benhor01 Mar 13, 2026
0d00efc
arm_mpam: resctrl: Wait for cacheinfo to be ready
benhor01 Mar 13, 2026
9e60206
arm_mpam: resctrl: Add support for 'MB' resource
Mar 13, 2026
9cb24da
arm_mpam: resctrl: Add kunit test for control format conversions
Mar 13, 2026
3e294c2
arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
benhor01 Mar 13, 2026
06e5148
arm_mpam: resctrl: Add support for csu counters
Mar 13, 2026
9353824
arm_mpam: resctrl: Allow resctrl to allocate monitors
Mar 13, 2026
93d4006
arm_mpam: resctrl: Add resctrl_arch_rmid_read()
Mar 13, 2026
e732be4
arm_mpam: resctrl: Update the rmid reallocation limit
Mar 13, 2026
88ac217
arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
Mar 13, 2026
e690a79
arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
Mar 13, 2026
d5c4215
arm_mpam: resctrl: Call resctrl_init() on platforms that can support …
Mar 13, 2026
75f654d
arm_mpam: Add quirk framework
shankerd04 Mar 13, 2026
9a5ebbb
arm_mpam: Add workaround for T241-MPAM-1
shankerd04 Mar 13, 2026
fe49877
arm_mpam: Add workaround for T241-MPAM-4
shankerd04 Mar 13, 2026
74b2bbd
arm_mpam: Add workaround for T241-MPAM-6
shankerd04 Mar 13, 2026
7bc1db7
arm_mpam: Quirk CMN-650's CSU NRDY behaviour
Mar 13, 2026
c461cdd
arm64: mpam: Add initial MPAM documentation
benhor01 Mar 13, 2026
0877da9
mm: change ghes code to allow poison of non-struct pfn
ankita-nv Nov 2, 2025
d4949de
mm: handle poisoning of pfn without struct pages
ankita-nv Nov 2, 2025
07bd53a
PCI/P2PDMA: Separate the mmap() support from the core logic
rleon Nov 20, 2025
3ab8dca
PCI/P2PDMA: Simplify bus address mapping API
rleon Nov 20, 2025
1542f1c
PCI/P2PDMA: Refactor to separate core P2P functionality from memory a…
rleon Nov 20, 2025
234f299
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
rleon Nov 20, 2025
81997f7
PCI/P2PDMA: Document DMABUF model
jgunthorpe Nov 20, 2025
29d8cf2
dma-buf: provide phys_vec to scatter-gather mapping routine
rleon Nov 20, 2025
f801d92
vfio: Export vfio device get and put registration helpers
vivekkreddy Nov 20, 2025
8cffa88
vfio/pci: Share the core device pointer while invoking feature functions
vivekkreddy Nov 20, 2025
cfba50c
vfio/pci: Enable peer-to-peer DMA transactions by default
rleon Nov 20, 2025
a8e6de7
vfio/pci: Add dma-buf export support for MMIO regions
rleon Nov 20, 2025
9565d4b
vfio/nvgrace: Support get_dmabuf_phys
jgunthorpe Nov 20, 2025
be79888
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
jgunthorpe Nov 21, 2025
a7930ea
iommufd: Add DMABUF to iopt_pages
jgunthorpe Nov 21, 2025
8a57344
iommufd: Do not map/unmap revoked DMABUFs
jgunthorpe Nov 21, 2025
a60775a
iommufd: Allow a DMABUF to be revoked
jgunthorpe Nov 21, 2025
2e7d427
iommufd: Allow MMIO pages in a batch
jgunthorpe Nov 21, 2025
1b72de9
iommufd: Have pfn_reader process DMABUF iopt_pages
jgunthorpe Nov 21, 2025
f1ef6c1
iommufd: Have iopt_map_file_pages convert the fd to a file
jgunthorpe Nov 21, 2025
0fd6807
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
jgunthorpe Nov 21, 2025
0499433
iommufd/selftest: Add some tests for the dmabuf flow
jgunthorpe Nov 21, 2025
e740fb2
dma-buf: fix integer overflow in fill_sg_entry() for buffers >= 8GiB
Nov 26, 2025
237fbb7
vfio/nvgrace-gpu: Add support for huge pfnmap
ankita-nv Nov 27, 2025
090a204
vfio: use vfio_pci_core_setup_barmap to map bar in mmap
ankita-nv Nov 27, 2025
21925e8
vfio/nvgrace-gpu: split the code to wait for GPU ready
ankita-nv Nov 27, 2025
a3a8765
vfio/nvgrace-gpu: Inform devmem unmapped after reset
ankita-nv Nov 27, 2025
f89145c
vfio/nvgrace-gpu: wait for the GPU mem to be ready
ankita-nv Nov 27, 2025
f5a870d
mm: fixup pfnmap memory failure handling to use pgoff
ankita-nv Dec 11, 2025
576ad00
vfio/nvgrace-gpu: register device memory for poison handling
ankita-nv Jan 15, 2026
879b837
mm: add stubs for PFNMAP memory failure registration functions
ankita-nv Jan 15, 2026
697207a
iommu: Add device ATS supported capability
Mar 17, 2026
f06bc45
iommufd: Report ATS not supported status via IOMMU_GET_HW_INFO
Mar 17, 2026
7 changes: 6 additions & 1 deletion Documentation/admin-guide/kernel-parameters.txt
@@ -6220,9 +6220,14 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
mba, smba, bmec, abmc.
mba, smba, bmec, abmc, sdciae, energy[:guid],
perf[:guid].
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
To turn off all energy telemetry monitoring and ensure that
perf telemetry monitoring associated with guid 0x12345
is enabled use:
rdt=!energy,perf:0x12345

reboot= [KNL]
Format (x86 or x86_64):
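The rdt= syntax in this hunk (comma-separated features, `!` to turn one off, an optional `:guid` suffix) can be modelled in a few lines. This is a minimal sketch in Python, not the kernel's parser: `parse_rdt_option` is a hypothetical helper, and reading the value after `:` as a hexadecimal guid is an assumption based on the `0x12345` example.

```python
def parse_rdt_option(value):
    """Parse an rdt= value into {feature: (enabled, guid_or_None)}.

    Illustrative only: the kernel's real parser is written in C and may
    validate feature names and guids differently.
    """
    result = {}
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        # A leading '!' turns the feature off.
        enabled = not token.startswith("!")
        if not enabled:
            token = token[1:]
        # Optional ':guid' suffix (assumed hexadecimal, as in the example).
        name, _, guid = token.partition(":")
        result[name] = (enabled, int(guid, 16) if guid else None)
    return result


# The two examples from the documentation text.
basic = parse_rdt_option("cmt,!mba")
telemetry = parse_rdt_option("!energy,perf:0x12345")
```

Here `basic` records cmt as on and mba as off, while `telemetry` records energy as off and perf as on for guid 0x12345.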
1 change: 1 addition & 0 deletions Documentation/arch/arm64/index.rst
@@ -23,6 +23,7 @@ ARM64 Architecture
memory
memory-tagging-extension
mops
mpam
perf
pointer-authentication
ptdump
72 changes: 72 additions & 0 deletions Documentation/arch/arm64/mpam.rst
@@ -0,0 +1,72 @@
.. SPDX-License-Identifier: GPL-2.0

====
MPAM
====

What is MPAM
============
MPAM (Memory Partitioning and Monitoring) is a feature in the CPUs and memory
system components such as the caches or memory controllers that allows memory
traffic to be labelled, partitioned and monitored.

Traffic is labelled by the CPU, based on the control or monitor group the
current task is assigned to using resctrl. Partitioning policy can be set
using the schemata file in resctrl, and monitor values read via resctrl.
See Documentation/filesystems/resctrl.rst for more details.
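As a rough illustration of how partitioning policy is expressed, the sketch below splits one schemata line into its resource name and per-domain values. It assumes the `RESOURCE:domain=value;domain=value` layout described in the resctrl documentation; `parse_schemata_line` is a hypothetical helper and the bitmask and percentage values are made up.

```python
def parse_schemata_line(line):
    """Split a schemata line into (resource, {domain_id: value}).

    Assumes the 'RESOURCE:id=value;id=value' layout; values are kept as
    strings because their meaning depends on the resource.
    """
    resource, _, rest = line.strip().partition(":")
    domains = {}
    for entry in rest.split(";"):
        dom_id, _, value = entry.partition("=")
        domains[int(dom_id)] = value.strip()
    return resource.strip(), domains


# Hypothetical lines: a cache portion bitmap and a bandwidth percentage.
cache_policy = parse_schemata_line("L3:0=ffff;1=00ff")
bandwidth_policy = parse_schemata_line("MB:0=50;1=70")
```

The domain ids in such a line are cache-ids, which is why the topology rules later in this document matter.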

This allows tasks that share memory system resources, such as caches, to be
isolated from each other according to the partitioning policy, limiting the
impact of so-called noisy neighbours.

Supported Platforms
===================
Use of this feature requires CPU support, support in the memory system
components, and a description from firmware of where the MPAM device controls
are in the MMIO address space (e.g. the 'MPAM' ACPI table).

The MMIO device that provides MPAM controls/monitors for a memory system
component is called a memory system component (MSC).

Because the user interface to MPAM is via resctrl, only MPAM features that are
compatible with resctrl can be exposed to user-space.

MSC are considered as a group based on the topology. MSC that correspond with
the L3 cache are considered together; it is not possible to mix MSC between L2
and L3 to 'cover' a resctrl schema.

The supported features are:

* Cache portion bitmap controls (CPOR) on the L2 or L3 caches. To expose
CPOR at L2 or L3, every CPU must have a corresponding CPU cache at this
level that also supports the feature. Mismatched big/little platforms are
not supported as resctrl's controls would then also depend on task
placement.

* Memory bandwidth maximum controls (MBW_MAX) on or after the L3 cache.
resctrl uses the L3 cache-id to identify where the memory bandwidth
control is applied. For this reason the platform must have an L3 cache
with cache-ids supplied by firmware. (It doesn't need to support MPAM.)

To be exported as the 'MB' schema, the topology of the group of MSC chosen
must match the topology of the L3 cache so that the cache-ids can be
repainted. For example: Platforms with Memory bandwidth maximum controls
on CPU-less NUMA nodes cannot expose the 'MB' schema to resctrl as these
nodes do not have a corresponding L3 cache. If the memory bandwidth
control is on the memory rather than the L3 then there must be a single
global L3 as otherwise it is unknown which L3 the traffic came from. There
must be no caches between the L3 and the memory so that the two ends of
the path have equivalent traffic.

When the MPAM driver finds multiple groups of MSC it can use for the 'MB'
schema, it prefers the group closest to the L3 cache.

* Cache Storage Usage (CSU) counters can expose the 'llc_occupancy' provided
there is at least one CSU monitor on each MSC that makes up the L3 group.
Exposing CSU counters from other caches or devices is not supported.
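The CPOR rule in the list above (every CPU needs a cache at the chosen level that also supports the feature) can be sketched as a simple check. This is an illustration under stated assumptions, not the driver's logic: `can_expose_cpor` and the `{cpu: {level: supports_cpor}}` topology model are hypothetical.

```python
def can_expose_cpor(cpu_caches, level):
    """Return True if every CPU has a CPOR-capable cache at `level`.

    cpu_caches maps a CPU number to {cache_level: supports_cpor}.
    A CPU with no entry for `level` has no cache at that level.
    """
    if not cpu_caches:
        return False
    return all(caches.get(level, False) for caches in cpu_caches.values())


# Hypothetical platform where CPU 1 has no L2 cache with CPOR.
mismatched = {
    0: {2: True, 3: True},
    1: {3: True},
}
l2_ok = can_expose_cpor(mismatched, 2)
l3_ok = can_expose_cpor(mismatched, 3)
```

On this made-up platform only the L3 check passes, matching the statement that mismatched big/little platforms are not supported at the mismatched level.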

Reporting Bugs
==============
If you are not seeing the counters or controls you expect, please share the
debug messages produced when enabling dynamic debug and booting with:
dyndbg="file mpam_resctrl.c +pl"
9 changes: 9 additions & 0 deletions Documentation/arch/arm64/silicon-errata.rst
@@ -215,6 +215,9 @@ stable kernels.
| ARM | GIC-700 | #2941627 | ARM64_ERRATUM_2941627 |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| ARM | CMN-650 | #3642720 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_845719 |
+----------------+-----------------+-----------------+-----------------------------+
| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_843419 |
@@ -248,6 +251,12 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-1 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-6 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 |
+----------------+-----------------+-----------------+-----------------------------+
97 changes: 74 additions & 23 deletions Documentation/driver-api/pci/p2pdma.rst
@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.

The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.

However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or routed internally to the SOC.

The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains and the kernel defaults to blocking such routing. An allow
list identifies known-good hardware, in which case P2P between any
two PCIe devices will be permitted.
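The policy described in the preceding paragraphs can be summarised as a small decision function. This is a conceptual sketch only: `P2PRoute` and `p2p_permitted` are hypothetical names, and the real kernel evaluates the PCIe topology and ACS settings rather than taking the route as an input.

```python
from enum import Enum


class P2PRoute(Enum):
    # The transaction stays below a PCIe switch and never reaches a root port.
    WITHIN_HIERARCHY = "within_hierarchy"
    # The transaction reaches the host bridge and must be forwarded by it.
    THROUGH_HOST_BRIDGE = "through_host_bridge"


def p2p_permitted(route, host_bridge_on_allow_list):
    """Model the default policy: well-defined routes are always allowed,
    host-bridge routing only for hardware on the allow list."""
    if route is P2PRoute.WITHIN_HIERARCHY:
        return True
    return host_bridge_on_allow_list


switch_case = p2p_permitted(P2PRoute.WITHIN_HIERARCHY, False)
unknown_bridge = p2p_permitted(P2PRoute.THROUGH_HOST_BRIDGE, False)
known_good_bridge = p2p_permitted(P2PRoute.THROUGH_HOST_BRIDGE, True)
```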

Since P2P inherently involves transactions between two devices, it requires two
drivers to co-operate inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mappings removed, all CPU accesses prevented, and all page
table mappings undone before the providing driver completes remove().

This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.

At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.

Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.


Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------

Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related functions will not return them unless FOLL_PCI_P2PDMA is set.

P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.


Usage With DMABUF
=================

DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.

Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.

In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.

Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.

No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2p_provider.
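The lifecycle rule above can be illustrated with a toy model of the invalidation shutdown. This is a sketch in Python, not kernel code: the `Exporter` and `Importer` classes and their methods are hypothetical stand-ins, and only the name `move_notify` comes from the text.

```python
class Importer:
    """Stand-in for an importing driver holding a DMA mapping."""

    def __init__(self, name):
        self.name = name
        self.mapped = False

    def map(self):
        self.mapped = True

    def move_notify(self):
        # Invalidation shutdown: synchronously drop the DMA mapping.
        self.mapped = False


class Exporter:
    """Stand-in for the exporting driver that owns the MMIO."""

    def __init__(self):
        self.importers = []
        self.revoked = False

    def attach(self, importer):
        if self.revoked:
            raise RuntimeError("buffer has been revoked")
        self.importers.append(importer)
        importer.map()

    def remove(self):
        # Revoke first so no new mappings appear, then notify everyone.
        self.revoked = True
        for importer in self.importers:
            importer.move_notify()
        # No importer may still hold a mapping once remove() returns.
        return not any(i.mapped for i in self.importers)


exporter = Exporter()
nic = Importer("nic")
exporter.attach(nic)
all_unmapped = exporter.remove()
```

Marking the buffer revoked before delivering the notifications is a design choice in this model: it closes the window in which a new importer could map the MMIO while the shutdown is in progress.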


P2P DMA Support Library