perf(mta): GPU-resident metatomic force provider#9

Draft
HaoZeke wants to merge 13 commits into ddScaling from gpuMetatomic

Conversation


@HaoZeke HaoZeke commented Feb 22, 2026

Summary

  • Pure-ML GPU path (MetatomicGpuForceProvider) bypassing IForceProvider for zero-copy coordinate wrapping, GPU NL construction from nbnxm GpuPairlist, and GPU-resident model evaluation
  • PME GPU integration pattern: DeviceBuffer<RVec> + GpuEventSynchronizer registered with GpuForceReduction
  • CUDA kernel converts nbnxm GpuPairlist to metatensor NL format on device; no D2H copies in single-rank hot path
  • DD supported: GPU model eval + CPU sparse MPI force exchange (same pattern as CPU MetatomicForceProvider)
  • New nbnxm accessors gpuGetPairlist() and gpuGetAtomIndex() for GPU pairlist access
  • Activation via metatomic-gpu MDP section or GMX_METATOMIC_GPU env var; CUDA build required (stub for CPU-only)
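
For illustration, MDP activation might look like the following. Only `metatomic-gpu-input-group` (default "System") is spelled out in this PR, so treat this as a hedged sketch rather than the full option set:

```
; enable the GPU-resident metatomic path
; (alternatively: set the GMX_METATOMIC_GPU environment variable;
;  the variable name is from the summary above, its value semantics are not specified)
metatomic-gpu-input-group = System   ; default "System"
```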

Single-rank GPU-resident only; no multi-GPU

The zero-copy force output via GpuForceReduction is gated on !hasDomainDecomposition(). With DD, forces go to CPU for sparse MPI exchange through distributeNonHomeForces(). No multi-GPU support exists: LAMMPS Kokkos handles this via MPI_Comm_split_type round-robin GPU assignment, but the metatomic GPU path just uses whatever device GROMACS assigns through DeviceStreamManager. Multi-rank multi-GPU is untested.

Not upstream-ready

The GPU force provider bypasses IForceProvider and wires directly into do_force() in sim_util.cpp (coordinate sync, atom mapping, NL build, model evaluation, output application) and runner.cpp (provider construction, model path resolution from KVT, topology iteration). It also adds a metatomicGpu member to t_forcerec, a useGpuMetatomic flag to SimulationWork, and nbnxm accessors that expose internal GPU data structures.

Upstreaming would require a proper GPU force provider interface (analogous to PME GPU integration) rather than the current direct wiring. Either IForceProvider needs GPU buffer support, or a separate IGpuForceProvider interface with DeviceBuffer input/output.

Phase 1 audit (latest commit)

  • Device validation: GMX_RELEASE_ASSERT that DeviceContext matches cudaGetDevice()
  • Coordinate lifetime: debug-mode checksum of d_x before model eval, verified unchanged after backward() (fires with GMX_METATOMIC_DEBUG=1)
  • Interaction range: GPU module registers interaction_range + max_nl_cutoff (prerequisite for Newton-mode DD)
  • Stub fix: CPU builds failed to link because the stub was missing the xReadyOnDevice parameter
  • Parity test: TestGPUCPUParity asserts GPU/CPU step-0 energy within 1 kJ/mol

Test plan

  • CPU build (metatomic-cpu env) compiles with stub
  • Existing CPU metatomic tests pass
  • CI benchmark workflow fixed (stub signature mismatch was the CPU link failure)
  • CUDA build (metatomic-cuda env) on remote workstation
  • Single-rank GPU energy matches CPU reference (TestGPUCPUParity)
  • DD GPU energy matches serial GPU reference
  • nsys profile confirms no GPU-CPU copies in single-rank hot path

@HaoZeke HaoZeke changed the base branch from realDomDec to ddScaling February 22, 2026 02:23
@HaoZeke HaoZeke changed the title GPU-resident metatomic force provider perf(mta): GPU-resident metatomic force provider Feb 22, 2026
@HaoZeke HaoZeke marked this pull request as draft February 22, 2026 03:23

HaoZeke commented Feb 22, 2026

Benchmark: 4-Way Metatomic Scaling (CPU + GPU vs LAMMPS + LAMMPS/kk)

Hardware: cosmopc5 -- RTX 4070 Ti SUPER, Threadripper PRO 3955WX
Model: PET-MAD v1.0.2 | Steps: 500 | Threading: -ntmpi 1 -ntomp 4

Results (ms/step, lower is better)

| Atoms | GMX CPU | GMX GPU | LAMMPS | LAMMPS/kk |
|------:|--------:|--------:|-------:|----------:|
|   648 |    68.3 |    69.4 |   86.0 |      80.0 |
|  1536 |   149.1 |   144.7 |  196.0 |     180.0 |
|  3000 |   272.1 |   260.3 |  354.0 |     328.0 |

Speedup

| Atoms | GMX CPU vs LAMMPS | GMX GPU vs LAMMPS/kk | GPU vs CPU |
|------:|-------------------|----------------------|------------|
|   648 | 21% faster        | 13% faster           | 2% slower  |
|  1536 | 24% faster        | 20% faster           | 3% faster  |
|  3000 | 23% faster        | 21% faster           | 4% faster  |

Energy parity (step 0)

| Engine    | 648 atoms (kJ/mol) | 3000 atoms (kJ/mol) |
|-----------|-------------------:|--------------------:|
| GMX CPU   |           -309,456 |          -1,407,420 |
| GMX GPU   |           -309,109 |                  -- |
| LAMMPS    |           -309,022 |          -1,407,437 |
| LAMMPS/kk |           -309,022 |          -1,407,437 |

All within 0.14%. The difference is from coordinate precision: GROMACS uses float32 (-DGMX_DOUBLE=OFF), LAMMPS metal units use float64. The GMX GPU path has an additional source of difference from the CUDA NL kernel using float32 positions (nbnxm d_xq stores xyz as float).

Analysis

GROMACS vs LAMMPS (20-24% faster): Both call the same metatensor C++ API and PyTorch model, so the gap comes from how many atoms the model sees. GROMACS passes only the local atoms with a shift-vector NL (648 atoms), while LAMMPS passes explicit ghost atoms (~6700 atoms for 648 local). With PET-MAD's O(N^2) transformer attention, that ~10x difference in atom count is significant.

GPU-resident vs CPU IForceProvider (1-4% faster): The CPU pairlist mode already reuses the GROMACS NL via MDModulesPairlistConstructedSignal, so the GPU path only saves the D2H coordinate copy and the H2D force copy around model evaluation -- roughly 5-10 ms against a ~260 ms model evaluation at these sizes. The benefit grows with system size: the GPU advantage goes from 2% slower to 4% faster between 648 and 3000 atoms.

Reproduction
```shell
cd mta_bench/water_scaling && python prepare.py

# All 4 engines:
python3 run_benchmarks.py --engines gmx_cpu,gmx_gpu,lmp,lmp_kk --sizes 216,512,1000

# Or individually:
# GMX CPU
$GMX_BIN grompp -f gmx_cpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off

# GMX GPU-resident
$GMX_BIN grompp -f gmx_gpu.mdp -c conf.gro -p topol.top -o topol.tpr -maxwarn 10
$GMX_BIN mdrun -s topol.tpr -ntomp 4 -ntmpi 1 -g md.log -pin off -nb gpu -pme cpu
```


HaoZeke commented Feb 26, 2026

Some changes have landed since I ran these, but the single-GPU numbers with PET look good:

(benchmark plot: pr9_gpuMetatomic)

HaoZeke added 13 commits March 15, 2026 03:18

Add a pure-ML GPU path that bypasses IForceProvider for zero-copy
coordinate wrapping, GPU neighbor list construction from nbnxm
GpuPairlist, and GPU-resident model evaluation with force output.

Architecture follows the PME GPU integration pattern:
- DeviceBuffer<RVec> force output registered with GpuForceReduction
- GpuEventSynchronizer for force completion signaling
- Shared CUDA stream with LibTorch (no inter-stream sync)

New files:
- metatomic_gpu_forceprovider.{h,cpp}: Pimpl-based GPU force provider
  with DD support (sparse MPI force exchange for halo atoms)
- metatomic_gpu_mdmodule.{h,cpp}: MDModule for MDP option parsing
  and model cutoff registration
- metatomic_gpu_pairlist_kernel.{cu,cuh}: CUDA kernel converting
  nbnxm GpuPairlist to metatensor NL format on device
- metatomic_gpu_forceprovider_stub.cpp: CPU-only build stub

Domain decomposition support:
- updateAtomMapping() refreshes home/halo mapping on NS steps
- Single-rank: forces stay on GPU via GpuForceReduction
- DD: forces copied to host for sparse MPI exchange (same pattern
  as CPU MetatomicForceProvider::distributeNonHomeForces)

nbnxm API additions:
- gpuGetPairlist(): expose GpuPairlist for given locality
- gpuGetAtomIndex(): expose nbnxm-to-natural atom index mapping

Activation: metatomic-gpu MDP section or GMX_METATOMIC_GPU env var.
CUDA build required; non-CUDA builds use stub implementation.

Register MetatomicGpuModuleInfo in initMdpTransform and buildMdpOutput
loops so grompp recognizes metatomic-gpu-* MDP keys. Fix buildMdpOutput
to use flat addMdpOutputValue helpers instead of nested addObject.

The GPU module's initMdpTransform was empty, causing grompp to reject
metatomic-gpu-* MDP keys as unknown. Add transform rules matching the
CPU module pattern to map flat MDP keys to the option tree.

The CUDA kernel used c_dBoxX/Y/Z without including ishift.h where they
are defined. This didn't matter for CPU-only builds but fails when
GMX_GPU=CUDA.

The neighbor list tensors retain computational graph state from the
backward pass. On the next MD step, register_autograd_neighbors rejects
them because they are already part of a graph. Detach before reuse.

Three fixes for the GPU-resident metatomic force provider:

1. Coordinate synchronization: pass localXReadyOnDevice event into
   calculateForces() and enqueueWaitEvent() on the metatomic stream.
   Without this, torch reads the coordinate buffer before the async
   H2D copy completes, resulting in all-zero positions. Also add
   useGpuMetatomic to the H2D copy gate and consumption count.

2. NL symmetrization: the nbnxm GPU pairlist is a half-list (14K pairs
   for 216 water). ML models require a full list (24K pairs). After the
   CUDA kernel, reverse all pairs (swap i/j, negate cell shifts and
   distances), concatenate, then deduplicate via key-encoding + sort.

3. Unit conversion: add set_quantity("energy") on ModelOutput so the
   metatomic framework applies eV -> kJ/mol conversion.

Verified: GPU energy = CPU energy = -309,108.53 kJ/mol for 216 water
with PET-MAD model (24,266 NL pairs in both paths).

The GPU metatomic MDModule now subscribes to preprocessing notifications
(IndexGroupsAndNames, gmx_mtop_t*, WarningHandler*, MDLogger) to exclude
classical NB interactions for ML atoms via addEmbeddedNBExclusions.
Without this, Coulomb (SR) was -194K kJ/mol instead of ~6.8K kJ/mol.

Also registers MDModulesEnergyOutputToMetatomicPotRequestChecker so the
"Metatomic Potential" term appears in .edr output.

Adds metatomic-gpu-input-group MDP option (default "System") matching
the CPU module pattern.

…gistration

- Fix stub calculateForces missing xReadyOnDevice param (CPU build link error)
- Assert GROMACS DeviceContext matches CUDA current device in constructor
- Debug-mode checksum verifying d_x stability during model eval
- Change GPU pairlist range to interaction_range + max_nl_cutoff for DD Newton
- CI: skip gpu-marked tests on CPU-only builds

The nlMode field was removed from MetatomicParameters on ddScaling
(7e53b93) but the GPU-specific code still referenced it. Drop the
three remaining uses so the build compiles against the current struct.

… range-loop warning

The device assertion needs the full DeviceInformation definition, not
just the forward declaration from device_management.h. The range-loop
over string literals was binding a const ref to a temporary; use
explicit std::string construction instead.
