`CSCS-Alps/molecular-dynamics-with-i-Pi.md` (new file, 177 additions)
# Running molecular dynamics simulations on Alps with `metatomic` and `i-pi`

Last update 17-02-2026.

CSCS recommends running machine-learning applications in containers; see [this page](https://docs.cscs.ch/software/ml/#optimizing-data-loading-for-machine-learning) for reference. In this tutorial, we use the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) provided by Nvidia to fully exploit the power of the GH200 superchips installed on daint.alps.
> **Reviewer (Member):** For ML applications this is indeed recommended. For mixed applications, other considerations apply. For instance, the Nvidia NGC images ship with OpenMPI 4, which is incompatible with our network stack. Therefore, if MPI is needed, things get more complicated.

## Selecting and Building a Base Container
> **Reviewer (Member):** It would be better not to build and run containers on the login node. You should do this in an allocation. See our documentation on building images with podman.
>
> **Author:** Yeah, I was building it on an interactive compute node; I'll mention it here.


To start, pick a version of the container that you like by clicking the `Get Container` button on [the NGC catalog page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) and copy the image path. Then, create a new folder, say `MD-container`, for the following steps.

```bash
mkdir MD-container
cd MD-container
```

Inside `MD-container`, create a file named `Containerfile`. Later on, we will build a custom container based on the NGC image you chose plus the custom commands written in the `Containerfile`. For now, the `Containerfile` contains only:

```Dockerfile
FROM <image_path:you_chose>
```

Build the container:

```bash
podman build -t <image_name:you_like> .
```
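As the reviewer notes above, it is better to build images inside an allocation on a compute node than on the login node. A minimal sketch of how that could look; the account name, time limit, and node count are examples, and the full podman setup is described in the CSCS documentation on building images:

```bash
# Request an interactive allocation and drop into a shell on a compute node
salloc --nodes=1 --time=01:00:00 --account=<your_account>
srun --pty bash

# Inside the allocation, build as before
cd MD-container
podman build -t <image_name:you_like> .
```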

At this stage, the container is identical to the upstream NGC image. You can enter this container with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```

> **Reviewer (Member):** See comment below about using podman directly for running.

## Making Persistent Modifications

After entering the container, you can start configuring your simulation environment as usual. However, make sure every command you run is recorded: any modification to the running container is temporary and will be lost once you exit. To make the modifications permanent, write the commands down in the `Containerfile` instead.

> **Reviewer (Member):** This is not good advice IMO. You should only modify the Containerfile.

The way to do so is to add them with the `RUN` keyword:

```Dockerfile
FROM <image_path:you_chose>

RUN command_0 \
&& command_1 \
&& ...
```

It is also possible to set environment variables in the `Containerfile`, with:

```Dockerfile
ENV VARIABLE_NAME=value
```

Rebuild the container:

```bash
podman build -t <image_name:you_like> .
```

You can save the modified container:

```bash
podman save -o /path/to/container.tar <image_name:you_like>
```

> **Reviewer (Member):** You should use the Container Engine instead of Podman for running, and therefore import images in the Container Engine instead of using `podman save`.

Next time, you can use the container by first loading it with:

```bash
podman load -i /path/to/container.tar
```

and then enter it with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```
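As pointed out in the review comments, the officially supported way to run containers on Alps is the Container Engine rather than podman. A minimal sketch of that workflow, assuming the EDF (environment definition file) interface described in the CSCS documentation; the `md-container` name, file paths, and mounts below are illustrative, not prescriptions:

```bash
# Export the image built with podman to a squashfs file usable by the Container Engine
enroot import -x mount -o /capstor/scratch/cscs/<your_username>/md-container.sqsh \
    podman://<image_name:you_like>

# Describe the environment in an EDF file, e.g. ~/.edf/md-container.toml
cat > ~/.edf/md-container.toml <<'EOF'
image = "/capstor/scratch/cscs/<your_username>/md-container.sqsh"
mounts = ["/capstor/scratch/cscs/<your_username>:/capstor/scratch/cscs/<your_username>"]
workdir = "/workspace"
EOF

# Start an interactive shell inside the containerized environment
srun --environment=md-container --pty bash
```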

## Example Containerfile for `metatomic + i-Pi + PLUMED`

Below is an example `Containerfile` that

- installs micromamba
- creates a Python environment
- compiles `metatomic[torch]`
- builds PLUMED

```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Install micromamba and create conda environment
RUN curl -Ls https://micro.mamba.pm/api/micromamba/linux-aarch64/latest | tar -xvj bin/micromamba \
&& eval "$(./bin/micromamba shell hook -s posix)" \
&& ./bin/micromamba shell init -s bash -r ~/micromamba \
&& source ~/.bashrc \
&& micromamba activate \
&& micromamba create -n sim python==3.12.* -y \
# Ensure system torch is accessible in the conda env
&& echo "/usr/local/lib/python3.12/dist-packages" > /root/micromamba/envs/sim/lib/python3.12/site-packages/system_torch.pth \
# Build your environment
&& micromamba activate sim \
&& pip install ipython metatensor metatomic ipi ase \
&& pip install metatensor[torch] --no-build-isolation --no-binary=metatensor-torch \
&& pip install metatomic[torch] --no-build-isolation --no-binary=metatomic-torch \
&& pip install vesin[torch] --no-build-isolation --no-binary=vesin-torch \
&& wget https://github.com/plumed/plumed2/archive/refs/tags/v2.10.0.tar.gz -O /tmp/plumed.tar.gz \
&& tar -xvf /tmp/plumed.tar.gz -C /tmp \
&& cd /tmp/plumed2-2.10.0 \
&& CPPFLAGS=$(python src/metatomic/flags-from-python.py --cppflags) \
&& LDFLAGS=$(python src/metatomic/flags-from-python.py --ldflags) \
&& ./configure PYTHON_BIN=python --enable-libtorch --enable-libmetatomic --enable-modules=+metatomic LDFLAGS="$LDFLAGS" CPPFLAGS="$CPPFLAGS" \
&& make -j \
&& make install

ENV PLUMED_KERNEL=/usr/local/lib/libplumedKernel.so
ENV PYTHONPATH="/usr/local/lib/plumed/python:$PYTHONPATH"

WORKDIR /workspace
```

On the choice of base image (`FROM nvcr.io/nvidia/pytorch:26.01-py3`):

> **Reviewer (Member):** Nvidia NGC containers come with OpenMPI 4, which is incompatible with our network stack. They work very well when using NCCL (with the hooks enabled by the Container Engine, see below), but do not work with MPI applications.
>
> Our extended Alps images come with OpenMPI 5 and are compatible with our network stack, so they should be used as base images. However, so far they have been mainly tested for ML applications.
>
> **Author:** Let me test it and see if it makes a difference.
>
> **Reviewer (Member):** In this case it might not, if you are not using MPI at all. But mine was a more general remark (especially given the general title of the PR). This might work for i-PI, but will not work (efficiently) with LAMMPS+metatomic, for example.

On installing micromamba:

> **Reviewer (Member):** What do you install micromamba for? Would this be better using Python from the base container?
>
> **Author:** Hmm, just to show how to use it in a container, since others may want to install something from a conda channel, like lammps-metatomic?
>
> **Reviewer (Member):** Ideally people should not use conda on CSCS for this; both plumed-metatomic and lammps-metatomic are available on Spack and should be built as a uenv.
>
> **Author:** Ah okay, should I try to replace this with a conda-free version?
>
> **Reviewer (@Luthaf, Member, Feb 17, 2026):** It would be nice (unless @RMeli disagrees with me and says this is fine!)
>
> **Reviewer (Member):** Using conda is risky, because you have little control over dependencies. In particular, it is difficult to tap into the network stack, which is essential for using the high-speed network. The same applies to other package managers.
>
> As @Luthaf mentioned, uenv is likely a better option for applications needing MPI (at least for the time being), since it will build from source using the correct dependencies for the network. I don't see i-PI in Spack, so that will need to be added if you want to go down that route (which I understand is annoying).
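Once the image builds, a quick sanity check can catch missing pieces early. A minimal sketch is shown below; the image name is an example, and the Python import names (`metatomic.torch`, `ipi`) are assumptions based on the pip packages installed above.

```bash
# Rebuild and run a throwaway container (image name is illustrative)
podman build -t md-container .
podman run --rm md-container bash -lc '
  # check that the packages installed in the "sim" environment import cleanly
  /workspace/bin/micromamba run -n sim python -c "import metatomic.torch, ipi; print(\"imports ok\")"
  # check that the PLUMED executable and kernel library are in place
  command -v plumed
  ls -l "$PLUMED_KERNEL"
'
```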

## Example SLURM Submission Script

The following is an example job submission script, in which we launch four tasks, each on a unique GPU.

```bash
#!/bin/bash -l
#SBATCH -o job.%j.out
#SBATCH --job-name=metad-run
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=40
#SBATCH --time 1-00:00:00
#SBATCH --gres=gpu:4


set -e

# Load the container image
IMAGE_TAR=/path/for/saving_the_container
IMAGE_NAME=<image_name:you_like>
podman load -i $IMAGE_TAR

run_job () {
local NAME=$1
local GPU=$2
WORKDIR=/your/work_dir/simulation_${NAME}/
IPIFILE=/your/work_dir/parameters/${NAME}_input.xml

cd $WORKDIR

# Map your host directories into the container via the -v flags below
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --ntasks=1 --gpus=1 \
--gpu-bind=single:1 \
--export=ALL,CUDA_VISIBLE_DEVICES=$GPU,NVIDIA_VISIBLE_DEVICES=$GPU \
podman run --rm \
--gpus all \
--env CUDA_VISIBLE_DEVICES=$GPU \
--env NVIDIA_VISIBLE_DEVICES=$GPU \
-v /users/<your_username>:/users/<your_username> \
-v /capstor/scratch/cscs/<your_username>:/capstor/scratch/cscs/<your_username> \
-w "${WORKDIR}" \
"${IMAGE_NAME}" \
bash -lc "
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export TORCH_NUM_THREADS=1
/workspace/bin/micromamba run -n sim i-pi ${IPIFILE}
" &
}

run_job H3PO4 0
run_job NaH2PO4 1
run_job Na2HPO4 2
run_job Na3PO4 3

wait

exit 0
```

On the `podman run` invocation inside the script:

> **Reviewer (Member):** Running directly with podman on Alps is not officially supported. You should use the Container Engine instead. Additionally, this can only work on a single node and can't scale to multiple nodes.
>
> **Author:** Thanks! I'll take a look.

On the thread-count environment variables (`OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `TORCH_NUM_THREADS`):

> **Reviewer (Member):** Why are you asking for 40 CPUs per task and then setting these? MKL is not available for aarch64, so you should probably look into the equivalent variables for OpenBLAS/NVPL if you really want to set this.
>
> **Author:** I found that setting these variables can speed up the i-PI calculation a lot, but I haven't checked why or which one actually matters...
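Following the review comments, the same launch could go through the Container Engine instead of podman. A rough sketch for one of the four replicas, reusing the illustrative `md-container` EDF file from the earlier section and assuming the host directories are listed under `mounts` there (paths mirror the script above):

```bash
# Launch one replica through the Container Engine (sketch, not the script above)
srun --ntasks=1 --gpus-per-task=1 --cpus-per-task=$SLURM_CPUS_PER_TASK \
    --environment=md-container \
    bash -lc "cd /your/work_dir/simulation_H3PO4 && \
              /workspace/bin/micromamba run -n sim i-pi /your/work_dir/parameters/H3PO4_input.xml" &
```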