Add documentation to run MD simulation on CSCS #1
# Running molecular dynamics simulations on Alps with `metatomic` and `i-pi`

Last updated: 17-02-2026.

CSCS recommends running machine-learning-related applications in containers; see [this page](https://docs.cscs.ch/software/ml/#optimizing-data-loading-for-machine-learning) for reference. In this tutorial, we use the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) provided by NVIDIA to fully exploit the power of the GH200 GPUs installed on daint.alps.

> **Member:** For ML applications this is indeed recommended. For mixed applications, other considerations apply. For instance, the NVIDIA NGC images ship with OpenMPI 4, which is incompatible with our network stack. Therefore, if MPI is needed, things get more complicated.

## Selecting and Building a Base Container

> **Member:** It would be better not to build and run containers on the login node. You should do this in an allocation. See our documentation on building images.
>
> **Author:** Yeah, I was building it on an interactive compute node; I'll mention it here.

To start, pick a version of the container that you like by clicking the `Get Container` button on the [NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) and copy the image path. Then create a new folder, say `MD-container`, for the following steps.

```bash
mkdir MD-container
cd MD-container
```

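Before building images on Alps, rootless `podman` also needs a storage configuration so that image layers land on node-local storage rather than on the shared filesystems. The snippet below is a sketch following the CSCS container documentation; treat the exact paths as assumptions and verify them against the current docs:

```bash
# Sketch (verify against the CSCS docs): keep rootless podman storage on
# node-local /dev/shm instead of your home directory.
mkdir -p "$HOME/.config/containers"
cat > "$HOME/.config/containers/storage.conf" <<'EOF'
[storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"
EOF
```
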
Inside `MD-container`, create a file named `Containerfile`. Later on, we will build a custom container based on the NGC image that you chose and the custom commands written in the `Containerfile`. For now, the `Containerfile` looks like:

```Dockerfile
FROM <image_path:you_chose>
```

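As the reviewer points out, building should happen inside a job allocation rather than on the login node. A minimal sketch of requesting an interactive compute node follows; the time limit and account are placeholders, so check the CSCS scheduling documentation for the recommended flags on your vCluster:

```bash
# Hypothetical interactive allocation; adjust the account to your project.
salloc --nodes=1 --time=01:00:00 --account=<your_account>
# Once the allocation is granted, open a shell on the compute node:
srun --pty bash
```
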
Build the container:

```bash
podman build -t <image_name:you_like> .
```

At this stage, the container is identical to the upstream NGC image. You can enter it with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```

> **Member:** See the comment below about using the Container Engine.

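Once inside, it is worth sanity-checking that the GPUs are visible. This relies only on standard NVIDIA and PyTorch tooling already shipped in the NGC image:

```bash
# Inside the container: check GPU visibility and the bundled PyTorch build
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```
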
## Making Persistent Modifications

After entering the container, you can start configuring your simulation environment as usual. However, make sure that every command is recorded: your modifications to the running container are temporary and will be lost after you exit. To make them permanent, write the commands into the `Containerfile` by adding them with the `RUN` keyword:

> **Member:** This is not good advice IMO. You should only modify the `Containerfile`.

```Dockerfile
FROM <image_path:you_chose>

RUN command_0 \
    && command_1 \
    && ...
```

It is also possible to set environment variables in the `Containerfile`, with:

```Dockerfile
ENV VARIABLE_NAME=value
```

Rebuild the container:

```bash
podman build -t <image_name:you_like> .
```

You can save the modified container:

```bash
podman save -o /path/to/container.tar <image_name:you_like>
```

> **Member:** You should use the Container Engine instead of `podman save`.

Next time, you can use the container by first loading it with:

```bash
podman load -i /path/to/container.tar
```

and then entering it with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```

## Example Containerfile for `metatomic + i-PI + PLUMED`

Below is an example `Containerfile` that:

- installs micromamba
- creates a Python environment
- compiles `metatomic[torch]`
- builds PLUMED

```Dockerfile
# The image path of the NVIDIA container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Run the build steps under bash (the `source` below is not available in /bin/sh)
SHELL ["/bin/bash", "-c"]
```

> **Member:** NVIDIA NGC containers come with OpenMPI 4, which is incompatible with our network stack. They work very well when using NCCL (with the hooks enabled by the Container Engine, see below), but do not work with MPI applications. Our extended Alps images come with OpenMPI 5 and are compatible with our network stack, so they should be used as base images. However, so far they have mainly been tested for ML applications.
>
> **Author:** Let me test it and see if it makes a difference.
>
> **Member:** In this case it might not, if you are not using MPI at all. But mine was a more general remark (especially given the general title of the PR). This might work for i-PI, but will not work (efficiently) with LAMMPS+metatomic, for example.

> **Member** (on the micromamba installation below): What do you install micromamba for? Would this be better using Python from the base container?
>
> **Author:** Hmm, just to show how to use it in a container, since others may want to install something from a conda channel.
>
> **Member:** Ideally people should not use conda on CSCS for this; both plumed-metatomic and lammps-metatomic are available on Spack and should be built as uenv.
>
> **Author:** Ah okay, I'll try to replace this with a conda-free version?
>
> **Member:** It would be nice (unless @RMeli disagrees with me and says this is fine!)
>
> **Member:** As @Luthaf mentioned, uenv is likely a better option for applications needing MPI (at least for the time being), since it will build from source using the correct dependencies for the network. I don't see i-PI in Spack, so that will need to be added if you want to go down that route (which I understand is annoying).

```Dockerfile
# Install micromamba and create a conda environment
RUN curl -Ls https://micro.mamba.pm/api/micromamba/linux-aarch64/latest | tar -xvj bin/micromamba \
    && eval "$(./bin/micromamba shell hook -s posix)" \
    && ./bin/micromamba shell init -s bash -r ~/micromamba \
    && source ~/.bashrc \
    && micromamba activate \
    && micromamba create -n sim python==3.12.* -y \
    # Ensure the system torch is accessible in the conda env
    && echo "/usr/local/lib/python3.12/dist-packages" > /root/micromamba/envs/sim/lib/python3.12/site-packages/system_torch.pth \
    # Build your environment
    && micromamba activate sim \
    && pip install ipython metatensor metatomic ipi ase \
    && pip install metatensor[torch] --no-build-isolation --no-binary=metatensor-torch \
    && pip install metatomic[torch] --no-build-isolation --no-binary=metatomic-torch \
    && pip install vesin[torch] --no-build-isolation --no-binary=vesin-torch \
    && wget https://github.com/plumed/plumed2/archive/refs/tags/v2.10.0.tar.gz -O /tmp/plumed.tar.gz \
    && tar -xvf /tmp/plumed.tar.gz -C /tmp \
    && cd /tmp/plumed2-2.10.0 \
    && CPPFLAGS=$(python src/metatomic/flags-from-python.py --cppflags) \
    && LDFLAGS=$(python src/metatomic/flags-from-python.py --ldflags) \
    && ./configure PYTHON_BIN=python --enable-libtorch --enable-libmetatomic --enable-modules=+metatomic LDFLAGS="$LDFLAGS" CPPFLAGS="$CPPFLAGS" \
    && make -j \
    && make install

ENV PLUMED_KERNEL=/usr/local/lib/libplumedKernel.so
ENV PYTHONPATH="/usr/local/lib/plumed/python:$PYTHONPATH"

WORKDIR /workspace
```

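As an aside to the review discussion above: if you follow the reviewers' suggestion and use uenv/Spack instead of a container, the uenv command-line workflow looks roughly like the sketch below. The image name is an assumption; query the registry to see what is actually provided on your vCluster.

```bash
# Hypothetical uenv workflow; replace the image name with one that exists.
uenv image find                      # list environments available in the registry
uenv image pull prgenv-gnu/24.11:v1  # pull one of them (name is a placeholder)
uenv start prgenv-gnu/24.11:v1       # start a shell inside the environment
```
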
## Example SLURM Submission Script

The following is an example job submission script, in which we launch four tasks, each on a unique GPU.

```bash
#!/bin/bash -l
#SBATCH -o job.%j.out
#SBATCH --job-name=metad-run
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=40
#SBATCH --time 1-00:00:00
#SBATCH --gres=gpu:4

set -e

# Load the container image
IMAGE_TAR=/path/for/saving_the_container
IMAGE_NAME=<image_name:you_like>
podman load -i $IMAGE_TAR

run_job () {
    local NAME=$1
    local GPU=$2
    WORKDIR=/your/work_dir/simulation_${NAME}/
    IPIFILE=/your/work_dir/parameters/${NAME}_input.xml

    cd $WORKDIR

    # Map your directories to the same paths inside the container with -v
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK --ntasks=1 --gpus=1 \
        --gpu-bind=single:1 \
        --export=ALL,CUDA_VISIBLE_DEVICES=$GPU,NVIDIA_VISIBLE_DEVICES=$GPU \
        podman run --rm \
        --gpus all \
        --env CUDA_VISIBLE_DEVICES=$GPU \
        --env NVIDIA_VISIBLE_DEVICES=$GPU \
        -v /users/<your_username>:/users/<your_username> \
        -v /capstor/scratch/cscs/<your_username>:/capstor/scratch/cscs/<your_username> \
        -w "${WORKDIR}" \
        "${IMAGE_NAME}" \
        bash -lc "
            export OMP_NUM_THREADS=1
            export MKL_NUM_THREADS=1
            export TORCH_NUM_THREADS=1
            /workspace/bin/micromamba run -n sim i-pi ${IPIFILE}
        " &
}

run_job H3PO4 0
run_job NaH2PO4 1
run_job Na2HPO4 2
run_job Na3PO4 3

wait

exit 0
```

> **Member** (on the `podman run` invocation): Running directly with `podman` is discouraged here; see the earlier comments about using the Container Engine instead.
>
> **Author:** Thanks! I'll take a look.

> **Member** (on the `OMP_NUM_THREADS`/`MKL_NUM_THREADS`/`TORCH_NUM_THREADS` exports): Why are you asking for 40 CPUs per task and then setting these? MKL is not available for aarch64, so you should probably look into similar variables for OpenBLAS/NVPL if you really want to set this.
>
> **Author:** I found that setting these variables can speed up the i-PI calculation a lot, but I haven't checked why and which one really works...

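The reviewers repeatedly recommend the CSCS Container Engine over invoking `podman` inside the job. A rough sketch of that workflow, based on the CSCS documentation, follows; the squashfs path, EDF name, and field values are assumptions to adapt:

```bash
# Sketch only: convert the podman image into a squashfs archive and describe
# it in an EDF (Environment Definition File) for the Container Engine.
enroot import -o /capstor/scratch/cscs/<your_username>/md-container.sqsh \
    podman://<image_name:you_like>

mkdir -p "$HOME/.edf"
cat > "$HOME/.edf/md-container.toml" <<'EOF'
image = "/capstor/scratch/cscs/<your_username>/md-container.sqsh"
mounts = ["/capstor/scratch/cscs/<your_username>:/capstor/scratch/cscs/<your_username>"]
workdir = "/workspace"
EOF
```

A job can then request this environment with `#SBATCH --environment=md-container`, and every `srun` in the script runs inside the container without a `podman run` wrapper.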