perf(mta): GPU-resident metatomic force provider#9

Draft
HaoZeke wants to merge 13 commits into ddScaling from gpuMetatomic

Conversation


@HaoZeke HaoZeke commented Feb 22, 2026

Summary

  • Pure-ML GPU path (MetatomicGpuForceProvider) bypassing IForceProvider for zero-copy coordinate wrapping, GPU NL construction from nbnxm GpuPairlist, and GPU-resident model evaluation
  • PME GPU integration pattern: DeviceBuffer<RVec> + GpuEventSynchronizer registered with GpuForceReduction
  • CUDA kernel converts nbnxm GpuPairlist to metatensor NL format on device; no D2H copies in single-rank hot path
  • DD supported: GPU model eval + CPU sparse MPI force exchange (same pattern as CPU MetatomicForceProvider)
  • New nbnxm accessors gpuGetPairlist() and gpuGetAtomIndex() for GPU pairlist access
  • Activation via metatomic-gpu MDP section or GMX_METATOMIC_GPU env var; CUDA build required (stub for CPU-only)
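
For illustration, MDP activation might look like the following. Only `metatomic-gpu-input-group` (default "System") is spelled out in this PR, so treat this as a hedged sketch rather than the full option set:

```
; enable the GPU-resident metatomic path
; (alternatively: set the GMX_METATOMIC_GPU environment variable;
;  the variable name is from the summary above, its value semantics are not specified)
metatomic-gpu-input-group = System   ; default "System"
```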

Single-rank GPU-resident only; no multi-GPU

The zero-copy force output via GpuForceReduction is gated on !hasDomainDecomposition(). With DD, forces go to CPU for sparse MPI exchange through distributeNonHomeForces(). No multi-GPU support exists: LAMMPS Kokkos handles this via MPI_Comm_split_type round-robin GPU assignment, but the metatomic GPU path just uses whatever device GROMACS assigns through DeviceStreamManager. Multi-rank multi-GPU is untested.

Not upstream-ready

The GPU force provider bypasses IForceProvider and wires directly into do_force() in sim_util.cpp (coordinate sync, atom mapping, NL build, model evaluation, output application) and runner.cpp (provider construction, model path resolution from KVT, topology iteration). It also adds a metatomicGpu member to t_forcerec, a useGpuMetatomic flag to SimulationWork, and nbnxm accessors that expose internal GPU data structures.

Upstreaming would require a proper GPU force provider interface (analogous to PME GPU integration) rather than the current direct wiring. Either IForceProvider needs GPU buffer support, or a separate IGpuForceProvider interface with DeviceBuffer input/output.

Phase 1 audit (latest commit)

  • Device validation: GMX_RELEASE_ASSERT that DeviceContext matches cudaGetDevice()
  • Coordinate lifetime: debug-mode checksum of d_x before model eval, verified unchanged after backward() (fires with GMX_METATOMIC_DEBUG=1)
  • Interaction range: GPU module registers interaction_range + max_nl_cutoff (prerequisite for Newton-mode DD)
  • Stub fix: CPU builds failed to link because the stub was missing the xReadyOnDevice parameter
  • Parity test: TestGPUCPUParity asserts GPU/CPU step-0 energy within 1 kJ/mol

Test plan

  • CPU build (metatomic-cpu env) compiles with stub
  • Existing CPU metatomic tests pass
  • CI benchmark workflow fixed (stub signature mismatch was the CPU link failure)
  • CUDA build (metatomic-cuda env) on remote workstation
  • Single-rank GPU energy matches CPU reference (TestGPUCPUParity)
  • DD GPU energy matches serial GPU reference
  • nsys profile confirms no GPU-CPU copies in single-rank hot path

@HaoZeke HaoZeke changed the base branch from realDomDec to ddScaling February 22, 2026 02:23
@HaoZeke HaoZeke changed the title GPU-resident metatomic force provider perf(mta): GPU-resident metatomic force provider Feb 22, 2026
@HaoZeke HaoZeke marked this pull request as draft February 22, 2026 03:23

HaoZeke commented Feb 22, 2026

Benchmark: 4-Way Metatomic Scaling (CPU + GPU vs LAMMPS + LAMMPS/kk)

Hardware: cosmopc5 -- RTX 4070 Ti SUPER, Threadripper PRO 3955WX
Model: PET-MAD v1.0.2 | Steps: 500 | Threading: -ntmpi 1 -ntomp 4

Results (ms/step, lower is better)

| Atoms | GMX CPU | GMX GPU | LAMMPS | LAMMPS/kk |
|------:|--------:|--------:|-------:|----------:|
|   648 |    68.3 |    69.4 |   86.0 |      80.0 |
|  1536 |   149.1 |   144.7 |  196.0 |     180.0 |
|  3000 |   272.1 |   260.3 |  354.0 |     328.0 |

Speedup

| Atoms | GMX CPU vs LAMMPS | GMX GPU vs LAMMPS/kk | GPU vs CPU |
|------:|-------------------|----------------------|------------|
|   648 | 21% faster        | 13% faster           | 2% slower  |
|  1536 | 24% faster        | 20% faster           | 3% faster  |
|  3000 | 23% faster        | 21% faster           | 4% faster  |

Energy parity (step 0)

| Engine    | 648 atoms (kJ/mol) | 3000 atoms (kJ/mol) |
|-----------|-------------------:|--------------------:|
| GMX CPU   |           -309,456 |          -1,407,420 |
| GMX GPU   |           -309,109 |                  -- |
| LAMMPS    |           -309,022 |          -1,407,437 |
| LAMMPS/kk |           -309,022 |          -1,407,437 |

All within 0.14%. The difference is from coordinate precision: GROMACS uses float32 (-DGMX_DOUBLE=OFF), LAMMPS metal units use float64. The GMX GPU path has an additional source of difference from the CUDA NL kernel using float32 positions (nbnxm d_xq stores xyz as float).

Analysis

GROMACS vs LAMMPS (20-24% faster): Both call the same metatensor C++ API and PyTorch model, so the gap comes from how many atoms the model sees. GROMACS passes only the local atoms with a shift-vector NL (648 atoms), while LAMMPS passes explicit ghost atoms (~6700 atoms for 648 local). With PET-MAD's O(N^2) transformer attention, that ~10x difference in atom count is significant.

GPU-resident vs CPU IForceProvider (1-4% faster): The CPU pairlist mode already reuses the GROMACS NL via MDModulesPairlistConstructedSignal, so the GPU path only saves the D2H coordinate copy and the H2D force copy around model evaluation -- roughly 5-10 ms against a ~260 ms model evaluation at these sizes. The benefit grows with system size: the GPU advantage goes from 2% slower to 4% faster between 648 and 3000 atoms.

Reproduction
```shell
cd mta_bench/water_scaling && python prepare.py

# All 4 engines:
python3 run_benchmarks.py --engines gmx_cpu,gmx_gpu,lmp,lmp_kk --sizes 216,512,1000

# Or individually:
# GMX CPU
$GMX_BIN grompp -f gmx_cpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off

# GMX GPU-resident
$GMX_BIN grompp -f gmx_gpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off -nb gpu -pme cpu
```


HaoZeke commented Feb 26, 2026

Some changes have landed since I ran these, but the single-GPU numbers with PET look good:

(benchmark plot: pr9_gpuMetatomic)

HaoZeke added 13 commits March 15, 2026 03:18

Add a pure-ML GPU path that bypasses IForceProvider for zero-copy
coordinate wrapping, GPU neighbor list construction from nbnxm
GpuPairlist, and GPU-resident model evaluation with force output.

Architecture follows the PME GPU integration pattern:
- DeviceBuffer<RVec> force output registered with GpuForceReduction
- GpuEventSynchronizer for force completion signaling
- Shared CUDA stream with LibTorch (no inter-stream sync)

New files:
- metatomic_gpu_forceprovider.{h,cpp}: Pimpl-based GPU force provider
  with DD support (sparse MPI force exchange for halo atoms)
- metatomic_gpu_mdmodule.{h,cpp}: MDModule for MDP option parsing
  and model cutoff registration
- metatomic_gpu_pairlist_kernel.{cu,cuh}: CUDA kernel converting
  nbnxm GpuPairlist to metatensor NL format on device
- metatomic_gpu_forceprovider_stub.cpp: CPU-only build stub

Domain decomposition support:
- updateAtomMapping() refreshes home/halo mapping on NS steps
- Single-rank: forces stay on GPU via GpuForceReduction
- DD: forces copied to host for sparse MPI exchange (same pattern
  as CPU MetatomicForceProvider::distributeNonHomeForces)

nbnxm API additions:
- gpuGetPairlist(): expose GpuPairlist for given locality
- gpuGetAtomIndex(): expose nbnxm-to-natural atom index mapping

Activation: metatomic-gpu MDP section or GMX_METATOMIC_GPU env var.
CUDA build required; non-CUDA builds use stub implementation.

Register MetatomicGpuModuleInfo in initMdpTransform and buildMdpOutput
loops so grompp recognizes metatomic-gpu-* MDP keys. Fix buildMdpOutput
to use flat addMdpOutputValue helpers instead of nested addObject.

The GPU module's initMdpTransform was empty, causing grompp to reject
metatomic-gpu-* MDP keys as unknown. Add transform rules matching the
CPU module pattern to map flat MDP keys to the option tree.

The CUDA kernel used c_dBoxX/Y/Z without including ishift.h where they
are defined. This didn't matter for CPU-only builds but fails when
GMX_GPU=CUDA.

The neighbor list tensors retain computational graph state from the
backward pass. On the next MD step, register_autograd_neighbors rejects
them because they are already part of a graph. Detach before reuse.

Three fixes for the GPU-resident metatomic force provider:

1. Coordinate synchronization: pass localXReadyOnDevice event into
   calculateForces() and enqueueWaitEvent() on the metatomic stream.
   Without this, torch reads the coordinate buffer before the async
   H2D copy completes, resulting in all-zero positions. Also add
   useGpuMetatomic to the H2D copy gate and consumption count.

2. NL symmetrization: the nbnxm GPU pairlist is a half-list (14K pairs
   for 216 water). ML models require a full list (24K pairs). After the
   CUDA kernel, reverse all pairs (swap i/j, negate cell shifts and
   distances), concatenate, then deduplicate via key-encoding + sort.

3. Unit conversion: add set_quantity("energy") on ModelOutput so the
   metatomic framework applies eV -> kJ/mol conversion.

Verified: GPU energy = CPU energy = -309,108.53 kJ/mol for 216 water
with PET-MAD model (24,266 NL pairs in both paths).

The GPU metatomic MDModule now subscribes to preprocessing notifications
(IndexGroupsAndNames, gmx_mtop_t*, WarningHandler*, MDLogger) to exclude
classical NB interactions for ML atoms via addEmbeddedNBExclusions.
Without this, Coulomb (SR) was -194K kJ/mol instead of ~6.8K kJ/mol.

Also registers MDModulesEnergyOutputToMetatomicPotRequestChecker so the
"Metatomic Potential" term appears in .edr output.

Adds metatomic-gpu-input-group MDP option (default "System") matching
the CPU module pattern.

…gistration

- Fix stub calculateForces missing xReadyOnDevice param (CPU build link error)
- Assert GROMACS DeviceContext matches CUDA current device in constructor
- Debug-mode checksum verifying d_x stability during model eval
- Change GPU pairlist range to interaction_range + max_nl_cutoff for DD Newton
- CI: skip gpu-marked tests on CPU-only builds

The nlMode field was removed from MetatomicParameters on ddScaling
(7e53b93) but the GPU-specific code still referenced it. Drop the
three remaining uses so the build compiles against the current struct.

… range-loop warning

The device assertion needs the full DeviceInformation definition, not
just the forward declaration from device_management.h. The range-loop
over string literals was binding a const ref to a temporary; use
explicit std::string construction instead.
