
Add documentation to run MD simulation on CSCS #1

Open

GardevoirX wants to merge 2 commits into main from container

Conversation

@GardevoirX

No description provided.

@Luthaf
Member

Luthaf commented Feb 17, 2026

@RMeli could you check this all makes sense to you?

```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Install micromamba and create conda environment
```
Member

what do you install micromamba for? would it be better to use the python from the base container?

Author

hmm, just to show how to use it in a container, since others may want to install something from a conda channel, like lammps-metatomic?

Member

Ideally people should not use conda on CSCS for this; both plumed-metatomic and lammps-metatomic are available in Spack and should be built as a uenv

Author

ah okay, should I try to replace this with a conda-free version?

Member

@Luthaf commented Feb 17, 2026

It would be nice (unless @RMeli disagrees with me and says this is fine!)

Member

Using conda is risky, because you have little control over dependencies. In particular, it is difficult to tap into the network stack, which is essential for using the high-speed network. The same applies to other package managers.

As @Luthaf mentioned, uenv is likely a better option for applications needing MPI (at least for the time being), since it will build from source using the correct dependencies for the network. I don't see i-PI in Spack, so that will need to be added if you want to go down that route (which I understand is annoying).

Member

@RMeli left a comment

Hello. I had a look, and the approach is somewhat distant from what you can find in our documentation. I see two main issues with this approach:

  1. The use of Podman instead of the Container Engine,
  2. The use of Nvidia NGC images, which use OpenMPI 4 and therefore do not work with our network stack.

I have no knowledge of i-PI, and from what I can see it does not need MPI itself. You also don't seem to consider/use MPI in the example here (i.e. PLUMED without MPI, ...), so this might not be an issue for this particular case. However, if you want to use MPI-enabled clients, or use this approach with MPI applications (you mentioned LAMMPS below), then (2) is a non-starter with the bare NGC images. You can try our Alps Extended Images, which come with OpenMPI 5 and are compatible with our network stack. (They are focused on AI/ML applications, and I have yet to test them myself for other applications/MD.)

The other option is to use uenv. This allows you to build the whole software stack from scratch, tapping into the right network stack. However, it requires having all the needed dependencies available in the Spack package manager, which I understand is not great (but I already contributed plumed+metatomic to Spack and I have a working lammps+metatomic Spack recipe ready).
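For illustration, a minimal sketch of that Spack route. The spec names follow this comment (`plumed+metatomic` is stated to be in Spack; the `lammps+metatomic` recipe is described as ready but may not be upstream yet), so check availability in your Spack instance first:

```bash
# Sketch: install the packages named above through Spack.
spack install plumed+metatomic
# lammps+metatomic may need the recipe added to your Spack instance first.
spack install lammps+metatomic
```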


Last update 17-02-2026.

CSCS recommends running machine learning-related applications in containers, see [this page](https://docs.cscs.ch/software/ml/#optimizing-data-loading-for-machine-learning) for reference. In this tutorial, we use the [Pytorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) provided by Nvidia to fully exploit the power of the GH200s installed on daint.alps.
Member

For ML applications this is indeed recommended. For mixed applications, other considerations apply. For instance, the Nvidia NGC images ship with OpenMPI 4, which is incompatible with our network stack. Therefore, if MPI is needed, things get more complicated.


Comment on lines +162 to +164
```bash
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export TORCH_NUM_THREADS=1
```
Member

Why are you asking for 40 CPUs per task and then setting these? MKL is not available for aarch64, so you should probably look into similar variables for OpenBLAS/NVPL if you really want to set this.
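For reference, a hedged sketch of the aarch64-side equivalents. `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` are standard; for NVPL-specific controls, check the NVPL documentation:

```bash
# Sketch: thread controls relevant on aarch64; MKL_NUM_THREADS has no effect there.
export OMP_NUM_THREADS=1          # OpenMP threading; NVPL libraries generally follow this
export OPENBLAS_NUM_THREADS=1     # OpenBLAS-specific thread count
```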

Author

I found that setting these variables can speed up the i-PI calculation a lot, but I haven't checked why or which one really works...

```bash
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --ntasks=1 --gpus=1 \
    --gpu-bind=single:1 \
    --export=ALL,CUDA_VISIBLE_DEVICES=$GPU,NVIDIA_VISIBLE_DEVICES=$GPU \
    podman run --rm \
```
Member

Running directly with podman on Alps is not officially supported. You should use the Container Engine instead. Additionally, this can only work on a single node and can't scale to multiple nodes.
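A sketch of what the Container Engine route could look like, following the CSCS Container Engine docs. The EDF file name, mount paths, and the `run_simulation.py` driver script are placeholders/assumptions:

```bash
# Sketch: define an environment description file (EDF); ~/.edf is the default search path.
mkdir -p ~/.edf
cat > ~/.edf/md.toml <<'EOF'
image = "nvcr.io/nvidia/pytorch:26.01-py3"
mounts = ["/capstor/scratch/cscs/<user>:/scratch"]
workdir = "/scratch"
EOF

# Run through the Container Engine instead of invoking podman directly.
srun --environment=md --ntasks=1 --gpus-per-task=1 python run_simulation.py
```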

Author

Thanks! I'll take a look


```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3
```

Member

Nvidia NGC containers come with OpenMPI 4, which is incompatible with our network stack. They work very well when using NCCL (with the hooks enabled by the Container Engine, see below), but do not work with MPI applications.

Our extended Alps images come with OpenMPI 5 and are compatible with our network stack, so they should be used as base images. However, so far they have been mainly tested for ML applications.
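If that route is taken, the change amounts to swapping the base image. The registry path below is a placeholder; the actual one should be taken from the CSCS documentation on the Alps Extended Images:

```Dockerfile
# Sketch: use a CSCS extended Alps image (OpenMPI 5) as the base instead of the bare NGC image.
# <alps-extended-pytorch-image> is a placeholder; look up the real path in the CSCS docs.
FROM <alps-extended-pytorch-image>
```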

Author

let me test it and see if it makes any difference

Member

In this case it might not, if you are not using MPI at all. But mine was a more general remark (especially given the general title of the PR). This might work for i-PI, but will not work (efficiently) with LAMMPS+metatomic for example.

At this stage, the container is identical to the upstream NGC image. You can enter this container with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```

Member

See comment below about using podman directly for running.



## Selecting and Building a Base Container
Member

It would be better not to build and run containers on the login node. You should do this in an allocation. See our documentation on building images with podman.
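A rough sketch of doing this in an allocation instead. The Slurm options and image name are placeholders, and the CSCS docs on building with podman also describe the required podman storage configuration:

```bash
# Sketch: build on a compute node instead of the login node.
salloc --nodes=1 --time=00:30:00          # adjust account/partition/time for your project
srun --pty podman build -t my-md-image:latest -f Containerfile .
```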

Author

Yeah, I was building it on an interactive compute node; I'll mention that here


## Making Persistent Modifications

After entering the container, you can start configuring your simulation environment as usual. However, make sure that every command is recorded, because modifications to the running container are temporary and will be lost after you exit. To make permanent modifications, write the commands down in the `Containerfile` by adding them with the `RUN` keyword:
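For instance (a sketch; the `pip install ipi` line is an assumption, check the i-PI documentation for the recommended installation method):

```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Persist your setup commands here; each RUN line is replayed on every rebuild.
RUN pip install --no-cache-dir ipi    # assumed PyPI name for i-PI; verify before use
```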
Member

This is not good advice IMO. You should only modify the Containerfile.

You can save the modified container:

```bash
podman save -o /path/to/container.tar <image_name:you_like>
```

Member

You should use the Container Engine instead of Podman for running, and therefore import images in the Container Engine instead of using podman save.
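A sketch of that import step, assuming the enroot-based workflow described in the CSCS Container Engine docs (image name and output path are placeholders):

```bash
# Sketch: convert the locally built podman image into a squashfs file for the Container Engine.
enroot import -x mount -o my-md-image.sqsh podman://my-md-image:latest
```

The EDF `image` field can then point at the resulting `.sqsh` file instead of a registry reference.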
