perf(mta): GPU-resident metatomic force provider #9
Benchmark: 4-Way Metatomic Scaling (CPU + GPU vs LAMMPS + LAMMPS/kk)
Hardware: cosmopc5 -- RTX 4070 Ti SUPER, Threadripper PRO 3955WX

Results (ms/step, lower is better)

Speedup

Energy parity (step 0)

All within 0.14%. The difference comes from coordinate precision: GROMACS uses float32 (…).

Analysis

GROMACS vs LAMMPS (20-24% faster): Both call the same metatensor C++ API and PyTorch model. The gap comes from atom representation -- GROMACS uses a shift-vector NL (648 atoms) vs LAMMPS explicit ghost atoms (~6700 atoms for 648 local). PET-MAD's O(N^2) transformer attention makes this 10x difference in atom count significant.

GPU-resident vs CPU IForceProvider (1-4% faster): The CPU pairlist mode already reuses the GROMACS NL via …

Reproduction

cd mta_bench/water_scaling && python prepare.py
# All 4 engines:
python3 run_benchmarks.py --engines gmx_cpu,gmx_gpu,lmp,lmp_kk --sizes 216,512,1000
# Or individually:
# GMX CPU
$GMX_BIN grompp -f gmx_cpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off
# GMX GPU-resident
$GMX_BIN grompp -f gmx_gpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off -nb gpu -pme cpu
Add a pure-ML GPU path that bypasses IForceProvider for zero-copy
coordinate wrapping, GPU neighbor list construction from nbnxm
GpuPairlist, and GPU-resident model evaluation with force output.
Architecture follows the PME GPU integration pattern:
- DeviceBuffer<RVec> force output registered with GpuForceReduction
- GpuEventSynchronizer for force completion signaling
- Shared CUDA stream with LibTorch (no inter-stream sync)
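A minimal sketch of this wiring, assuming GpuForceReduction exposes registerRvecForce() and addDependency() as in the PME GPU path; the function and member names below are illustrative, not the committed code:

```cpp
// Illustrative only: register the metatomic DeviceBuffer<RVec> output and its
// completion event with the force reduction, following the PME GPU pattern.
void registerMetatomicForceOutput(gmx::GpuForceReduction* forceReduction,
                                  DeviceBuffer<gmx::RVec> d_metatomicForces,
                                  GpuEventSynchronizer*   forcesReadyOnDevice)
{
    forceReduction->registerRvecForce(d_metatomicForces); // model writes forces here on the shared stream
    forceReduction->addDependency(forcesReadyOnDevice);   // reduction waits until backward() has finished
}

// After model evaluation + backward() on the LibTorch/metatomic stream:
//     forcesReadyOnDevice->markEvent(metatomicStream);
```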
New files:
- metatomic_gpu_forceprovider.{h,cpp}: Pimpl-based GPU force provider
with DD support (sparse MPI force exchange for halo atoms)
- metatomic_gpu_mdmodule.{h,cpp}: MDModule for MDP option parsing
and model cutoff registration
- metatomic_gpu_pairlist_kernel.{cu,cuh}: CUDA kernel converting
nbnxm GpuPairlist to metatensor NL format on device
- metatomic_gpu_forceprovider_stub.cpp: CPU-only build stub
Domain decomposition support:
- updateAtomMapping() refreshes home/halo mapping on NS steps
- Single-rank: forces stay on GPU via GpuForceReduction
- DD: forces copied to host for sparse MPI exchange (same pattern
as CPU MetatomicForceProvider::distributeNonHomeForces)
nbnxm API additions:
- gpuGetPairlist(): expose GpuPairlist for given locality
- gpuGetAtomIndex(): expose nbnxm-to-natural atom index mapping
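A sketch of how the provider might consume the two accessors before launching the conversion kernel; the receiver argument, the kernel wrapper name launchConvertPairlistKernel, and the buffer names are assumptions:

```cpp
// Illustrative call site; nbnxmGpu is the nbnxm GPU data object.
const GpuPairlist* pairlist  = gpuGetPairlist(nbnxmGpu, gmx::AtomLocality::Local);
const int*         d_atomMap = gpuGetAtomIndex(nbnxmGpu); // nbnxm order -> natural atom index

// Rewrite the cluster-based GpuPairlist into flat (i, j, cell shift) entries in
// metatensor NL format, entirely on the device and on the shared CUDA stream.
launchConvertPairlistKernel(pairlist, d_atomMap, d_nlPairs, d_nlShifts, d_nlVectors, metatomicStream);
```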
Activation: metatomic-gpu MDP section or GMX_METATOMIC_GPU env var.
CUDA build required; non-CUDA builds use stub implementation.
Register MetatomicGpuModuleInfo in initMdpTransform and buildMdpOutput loops so grompp recognizes metatomic-gpu-* MDP keys. Fix buildMdpOutput to use flat addMdpOutputValue helpers instead of nested addObject.
The GPU module's initMdpTransform was empty, causing grompp to reject metatomic-gpu-* MDP keys as unknown. Add transform rules matching the CPU module pattern to map flat MDP keys to the option tree.
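For reference, a hedged sketch of what those transform rules could look like, following the addRule().from().to() pattern used by other MDModules; the exact key names besides metatomic-gpu-input-group, and the option-tree paths, are assumptions:

```cpp
// Illustrative mapping of flat metatomic-gpu-* MDP keys into the module's
// option subtree so grompp accepts them; keys and paths here are assumptions.
void MetatomicGpuOptions::initMdpTransform(gmx::IKeyValueTreeTransformRules* rules)
{
    const auto stringIdentity = [](std::string s) { return s; };
    rules->addRule()
            .from<std::string>("/metatomic-gpu-active")
            .to<std::string>("/metatomic-gpu/active")
            .transformWith(stringIdentity);
    rules->addRule()
            .from<std::string>("/metatomic-gpu-input-group")
            .to<std::string>("/metatomic-gpu/input-group")
            .transformWith(stringIdentity);
}
```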
The CUDA kernel used c_dBoxX/Y/Z without including ishift.h where they are defined. This didn't matter for CPU-only builds but fails when GMX_GPU=CUDA.
The neighbor list tensors retain computational graph state from the backward pass. On the next MD step, register_autograd_neighbors rejects them because they are already part of a graph. Detach before reuse.
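In libtorch terms the fix is a detach() on the cached tensors before they are handed back to register_autograd_neighbors; a minimal sketch with illustrative member names:

```cpp
// Tensor::detach() returns a view that shares storage but is cut off from the
// autograd graph, so the reused NL tensors are graph-free on the next step.
nlVectors_ = nlVectors_.detach(); // pair vectors carry the graph from backward()
nlSamples_ = nlSamples_.detach(); // integer (i, j, shift) metadata, detached for consistency
```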
Three fixes for the GPU-resident metatomic force provider:
1. Coordinate synchronization: pass localXReadyOnDevice event into
calculateForces() and enqueueWaitEvent() on the metatomic stream.
Without this, torch reads the coordinate buffer before the async
H2D copy completes, resulting in all-zero positions. Also add
useGpuMetatomic to the H2D copy gate and consumption count.
2. NL symmetrization: the nbnxm GPU pairlist is a half-list (14K pairs
for 216 water). ML models require a full list (24K pairs). After the
CUDA kernel, reverse all pairs (swap i/j, negate cell shifts and
distances), concatenate, then deduplicate via key-encoding + sort
(see the sketch after this list).
3. Unit conversion: add set_quantity("energy") on ModelOutput so the
metatomic framework applies eV -> kJ/mol conversion.
Verified: GPU energy = CPU energy = -309,108.53 kJ/mol for 216 water
with PET-MAD model (24,266 NL pairs in both paths).
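A hedged host-side sketch of the symmetrization described in point 2, written with libtorch tensor ops rather than the committed CUDA code; tensor shapes, names, and the specific key encoding are assumptions:

```cpp
#include <torch/torch.h>
#include <tuple>

// Turn a half neighbor list into a full one: append the reversed pairs, then
// drop duplicates by packing (i, j, cell shift) into one sortable integer key.
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>
symmetrizeNeighborList(torch::Tensor pairs,   // [P,2] int64 (i, j)
                       torch::Tensor shifts,  // [P,3] int64 cell shifts
                       torch::Tensor vectors, // [P,3] float pair vectors
                       int64_t       numAtoms)
{
    torch::Tensor p = torch::cat({pairs, pairs.flip(1)}, 0); // swap i and j
    torch::Tensor s = torch::cat({shifts, -shifts}, 0);      // negate cell shifts
    torch::Tensor v = torch::cat({vectors, -vectors}, 0);    // negate distances

    // One key per pair; shifts are small, so offset by 64 and pack base-128.
    torch::Tensor shiftKey = ((s.select(1, 0) + 64) * 128 + (s.select(1, 1) + 64)) * 128
                             + (s.select(1, 2) + 64);
    torch::Tensor key = (p.select(1, 0) * numAtoms + p.select(1, 1)) * (128LL * 128 * 128) + shiftKey;

    // Sort the keys and keep only the first occurrence of each value.
    auto          sorted = key.sort();
    torch::Tensor sKey   = std::get<0>(sorted);
    torch::Tensor order  = std::get<1>(sorted);
    torch::Tensor isNew  = torch::ones_like(sKey, torch::kBool);
    isNew.slice(0, 1).copy_(sKey.slice(0, 1) != sKey.slice(0, 0, sKey.size(0) - 1));
    torch::Tensor keep = order.masked_select(isNew);

    return { p.index_select(0, keep), s.index_select(0, keep), v.index_select(0, keep) };
}
```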
The GPU metatomic MDModule now subscribes to preprocessing notifications (IndexGroupsAndNames, gmx_mtop_t*, WarningHandler*, MDLogger) to exclude classical NB interactions for ML atoms via addEmbeddedNBExclusions. Without this, Coulomb (SR) was -194K kJ/mol instead of ~6.8K kJ/mol. Also registers MDModulesEnergyOutputToMetatomicPotRequestChecker so the "Metatomic Potential" term appears in .edr output. Adds metatomic-gpu-input-group MDP option (default "System") matching the CPU module pattern.
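A hedged sketch of those subscriptions, following the notifier pattern used by other MDModules; the notifier member name, the options_ helpers, and the addEmbeddedNBExclusions call site are assumptions about this PR's code:

```cpp
// Illustrative preprocessing subscriptions; the payload types match the ones
// listed above (IndexGroupsAndNames, gmx_mtop_t*, WarningHandler*, MDLogger).
void MetatomicGpuMDModule::subscribeToPreProcessingNotifications(gmx::MDModulesNotifiers* notifiers)
{
    // Resolve metatomic-gpu-input-group ("System" by default) to atom indices.
    notifiers->preProcessingNotifier_.subscribe(
            [this](const gmx::IndexGroupsAndNames& groups) { options_.setInputGroupIndices(groups); });

    // Strip classical NB interactions between ML atoms from the topology.
    notifiers->preProcessingNotifier_.subscribe(
            [this](gmx_mtop_t* mtop) { addEmbeddedNBExclusions(mtop, options_.inputAtomIndices()); });

    // Keep handles for grompp warnings and logging.
    notifiers->preProcessingNotifier_.subscribe([this](WarningHandler* wi) { warningHandler_ = wi; });
    notifiers->preProcessingNotifier_.subscribe([this](const gmx::MDLogger& logger) { logger_ = &logger; });
}
```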
…gistration
- Fix stub calculateForces missing xReadyOnDevice param (CPU build link error)
- Assert GROMACS DeviceContext matches CUDA current device in constructor
- Debug-mode checksum verifying d_x stability during model eval
- Change GPU pairlist range to interaction_range + max_nl_cutoff for DD Newton
- CI: skip gpu-marked tests on CPU-only builds
The nlMode field was removed from MetatomicParameters on ddScaling (7e53b93) but the GPU-specific code still referenced it. Drop the three remaining uses so the build compiles against the current struct.
… range-loop warning
The device assertion needs the full DeviceInformation definition, not just the forward declaration from device_management.h. The range-loop over string literals was binding a const ref to a temporary; use explicit std::string construction instead.

Summary
- GPU-resident force provider (MetatomicGpuForceProvider) bypassing IForceProvider for zero-copy coordinate wrapping, GPU NL construction from nbnxm GpuPairlist, and GPU-resident model evaluation
- Force output: DeviceBuffer<RVec> + GpuEventSynchronizer registered with GpuForceReduction
- CUDA kernel converting GpuPairlist to metatensor NL format on device; no D2H copies in single-rank hot path
- DD: forces exchanged on the host (same pattern as the CPU MetatomicForceProvider)
- nbnxm API additions gpuGetPairlist() and gpuGetAtomIndex() for GPU pairlist access
- Activation via metatomic-gpu MDP section or GMX_METATOMIC_GPU env var; CUDA build required (stub for CPU-only)
- Single-rank GPU-resident only; no multi-GPU
The zero-copy force output via GpuForceReduction is gated on !hasDomainDecomposition(). With DD, forces go to CPU for sparse MPI exchange through distributeNonHomeForces(). No multi-GPU support exists: LAMMPS Kokkos handles this via MPI_Comm_split_type round-robin GPU assignment, but the metatomic GPU path just uses whatever device GROMACS assigns through DeviceStreamManager. Multi-rank multi-GPU is untested.

Not upstream-ready
The GPU force provider bypasses IForceProvider and wires directly into do_force() in sim_util.cpp (coordinate sync, atom mapping, NL build, model evaluation, output application) and runner.cpp (provider construction, model path resolution from KVT, topology iteration). It also adds a metatomicGpu member to t_forcerec, a useGpuMetatomic flag to SimulationWork, and nbnxm accessors that expose internal GPU data structures.

Upstreaming would require a proper GPU force provider interface (analogous to PME GPU integration) rather than the current direct wiring. Either IForceProvider needs GPU buffer support, or a separate IGpuForceProvider interface with DeviceBuffer input/output.
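A purely illustrative sketch of the separate-interface option; none of these names exist in GROMACS today:

```cpp
// Hypothetical GPU force provider interface with device-resident input/output,
// analogous to how PME GPU is integrated. Illustrative only.
class IGpuForceProvider
{
public:
    virtual ~IGpuForceProvider() = default;

    // Coordinates are already resident on the device; xReadyOnDevice is marked
    // once the H2D copy (or GPU update) has completed. The provider fills
    // d_forceOutput and marks forcesReadyOnDevice when done.
    virtual void calculateForces(DeviceBuffer<gmx::RVec> d_x,
                                 GpuEventSynchronizer*   xReadyOnDevice,
                                 const matrix            box,
                                 DeviceBuffer<gmx::RVec> d_forceOutput,
                                 GpuEventSynchronizer*   forcesReadyOnDevice) = 0;
};
```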
Phase 1 audit (latest commit)
- GMX_RELEASE_ASSERT that DeviceContext matches cudaGetDevice()
- Debug checksum of d_x before model eval, verified unchanged after backward() (fires with GMX_METATOMIC_DEBUG=1)
- GPU pairlist range set to interaction_range + max_nl_cutoff (prerequisite for Newton-mode DD)
- Stub calculateForces() takes the xReadyOnDevice parameter
- TestGPUCPUParity asserts GPU/CPU step-0 energy within 1 kJ/mol
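For the d_x stability check, a hedged sketch of what a debug-mode checksum could look like (a host-side D2H copy plus a bitwise fold; the real implementation may differ, e.g. hashing on the device):

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Debug-only: copy the coordinate buffer to the host and fold its raw bits
// into a checksum, so the value before model evaluation can be compared with
// the value after backward(). Names and the D2H approach are illustrative.
static uint64_t checksumDeviceCoordinates(const float3* d_x, int numAtoms, cudaStream_t stream)
{
    std::vector<float3> h_x(numAtoms);
    cudaMemcpyAsync(h_x.data(), d_x, numAtoms * sizeof(float3), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    uint64_t        sum  = 0;
    const uint32_t* bits = reinterpret_cast<const uint32_t*>(h_x.data());
    for (size_t i = 0; i < h_x.size() * 3; ++i)
    {
        sum = sum * 1099511628211ULL + bits[i]; // FNV-style fold over the raw float bits
    }
    return sum;
}
```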
Test plan
- CPU-only build (metatomic-cpu env) compiles with stub
- CUDA build (metatomic-cuda env) on remote workstation
- GPU/CPU parity test (TestGPUCPUParity)
- nsys profile confirms no GPU-CPU copies in single-rank hot path