Conversation
@RMeli could you check this all makes sense to you?
```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Install micromamba and create conda environment
```
What do you install micromamba for? Would this be better using Python from the base container?
Hmm, just to show how to use it in a container, since others may want to install something from a conda channel, like lammps-metatomic?
Ideally people should not use conda on CSCS for this; both plumed-metatomic and lammps-metatomic are available in Spack and should be built as a uenv.
Ah okay, should I try to replace this with a conda-free version?
It would be nice (unless @RMeli disagrees with me and says this is fine!)
Using conda is risky, because you have little control over dependencies. In particular, it is difficult to tap into the network stack, which is essential for using the high-speed network. The same applies to other package managers.
As @Luthaf mentioned, uenv is likely a better option for applications needing MPI (at least for the time being), since it will build from source using the correct dependencies for the network. I don't see i-PI in Spack, so that will need to be added if you want to go down that route (which I understand is annoying).
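For reference, a rough sketch of what checking the Spack side could look like (package and variant names come from the comments above; the i-PI release URL is a placeholder, not a real link):

```bash
# Sketch only: package names come from the discussion above; the tarball URL
# is a placeholder and must be replaced with a real i-PI release archive.
spack info plumed        # inspect the plumed package and its variants (e.g. metatomic support)
spack list ipi           # check whether an i-PI recipe already exists
spack create <url-to-an-i-pi-release-tarball>   # scaffold a new package.py if it does not
```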
RMeli left a comment
Hello. I had a look, and the approach is somewhat distant from what you can find in our documentation. I see two main issues with this approach:
- The use of Podman instead of the Container Engine,
- The use of Nvidia NGC images, which use OpenMPI 4 and therefore do not work with our network stack.
I have no knowledge of i-PI and from what I can see it does not need MPI itself. You also don't seem to consider/use MPI in the example here (i.e. PLUMED without MPI, ...), so this might not be an issue for this particular case. However, if you want to use MPI-enabled clients, or use this approach with MPI applications (you mentioned LAMMPS below), then (2) is a non-starter with the bare NGC images. You can try with our Alps Extended Images, which come with OpenMPI 5 and are compatible with our network stack. (They are focussed on AI/ML applications, and I have yet to test them myself for other applications/MD.)
The other option is to use uenv. This allows you to build the whole software stack from scratch, tapping into the right network stack. However, this requires all the needed dependencies to be available in the Spack package manager, which I understand is not great (but I already contributed plumed+metatomic to Spack and I have a working lammps+metatomic Spack recipe ready).
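As a very rough illustration of what consuming a uenv looks like once such a stack is built and published (image names and versions below are placeholders; the exact CLI may differ per Alps system, so check the CSCS uenv documentation):

```bash
# Sketch only: image name and version are placeholders.
uenv image find                       # list uenv images available on the current system
uenv image pull <stack-name>/<version>
uenv start <stack-name>/<version>     # open a shell with the uenv software stack mounted
```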
Last update 17-02-2026.

CSCS recommends running machine learning-related applications in containers, see [this page](https://docs.cscs.ch/software/ml/#optimizing-data-loading-for-machine-learning) for reference. In this tutorial, we use the [Pytorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) provided by Nvidia to fully exploit the power of the GH200s installed on daint.alps.
For ML applications this is indeed recommended. For mixed applications, other considerations apply. For instance, the Nvidia NGC images ship with OpenMPI 4, which is incompatible with our network stack. Therefore, if MPI is needed, things get more complicated.
```bash
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export TORCH_NUM_THREADS=1
```
Why are you asking for 40 CPUs per task and then setting these? MKL is not available for aarch64, so you should probably look into similar variables for OpenBLAS/NVPL if you really want to set this.
I found that setting these variables can speed up the i-PI calculation a lot, but I haven't checked why or which one really works...
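For what it's worth, a sketch of the equivalent MKL-free settings on the aarch64 GH200 nodes (assuming the BLAS in use is OpenMP-threaded, e.g. NVPL, or OpenBLAS):

```bash
# Sketch: on aarch64 there is no MKL, so MKL_NUM_THREADS is likely a no-op.
# OMP_NUM_THREADS is honoured by OpenMP-threaded BLAS libraries and by PyTorch's
# intra-op thread pool; OPENBLAS_NUM_THREADS covers an OpenBLAS build.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```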
```bash
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --ntasks=1 --gpus=1 \
    --gpu-bind=single:1 \
    --export=ALL,CUDA_VISIBLE_DEVICES=$GPU,NVIDIA_VISIBLE_DEVICES=$GPU \
    podman run --rm \
```
Running directly with podman on Alps is not officially supported. You should use the Container Engine instead. Additionally, this can only work on a single node and can't scale to multiple nodes.
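For comparison, a sketch of what the Container Engine route looks like (the EDF file name, image reference, and mount paths below are illustrative; the exact fields are described in the CSCS Container Engine documentation):

```bash
# Sketch only: define an Environment Definition File (EDF) and launch via Slurm.
mkdir -p ~/.edf
cat > ~/.edf/ipi.toml <<'EOF'
image = "nvcr.io#nvidia/pytorch:26.01-py3"
mounts = ["/capstor/scratch/cscs/<user>:/capstor/scratch/cscs/<user>"]
workdir = "/capstor/scratch/cscs/<user>"
EOF

# The engine starts the container per task, so this also scales to multi-node jobs.
srun --environment=ipi --ntasks=1 --gpus=1 <command-to-run-inside-the-container>
```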
```Dockerfile
# The image path of the nvidia container that you want to use
FROM nvcr.io/nvidia/pytorch:26.01-py3
```
Nvidia NGC containers come with OpenMPI 4, which is incompatible with our network stack. They work very well when using NCCL (with the hooks enabled by the Container Engine, see below), but do not work with MPI applications.
Our extended Alps images come with OpenMPI 5 and are compatible with our network stack, so they should be used as base images. However, so far they have been mainly tested for ML applications.
Let me test it and see if it makes any difference.
In this case it might not, if you are not using MPI at all. But mine was a more general remark (especially given the general title of the PR). This might work for i-PI, but will not work (efficiently) with LAMMPS+metatomic for example.
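If the container route is kept but MPI becomes necessary, the change would be on the base-image line; a sketch (the extended-image reference is a placeholder, not a real registry path):

```bash
# Sketch only: swap the NGC base image for a CSCS Alps extended image (OpenMPI 5).
# The FROM reference below is a placeholder, not an actual registry path.
cat > Containerfile <<'EOF'
FROM <cscs-alps-extended-pytorch-image>
# ... the rest of the Containerfile stays the same ...
EOF
```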
At this stage, the container is identical to the upstream NGC image. You can enter this container with:

```bash
podman run --rm -it --gpus all <image_name:you_like> bash
```
See comment below about using podman directly for running.
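The interactive equivalent with the Container Engine would be something along these lines (assuming an EDF named `ipi`, as sketched earlier):

```bash
# Sketch: interactive shell inside the container via the Container Engine,
# assuming an EDF named "ipi" exists (see the earlier sketch).
srun --environment=ipi --pty bash
```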
CSCS recommends running machine learning-related applications in containers, see [this page](https://docs.cscs.ch/software/ml/#optimizing-data-loading-for-machine-learning) for reference. In this tutorial, we use the [Pytorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) provided by Nvidia to fully exploit the power of the GH200s installed on daint.alps.

## Selecting and Building a Base Container
It would be better not to build and run containers on the login node. You should do this in an allocation. See our documentation on building images with podman.
Yeah, I was building it on an interactive compute node; I'll mention it here.
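A sketch of doing this inside an allocation instead of on the login node (account, partition, and time limit are placeholders):

```bash
# Sketch only: request a node and build the image there, not on the login node.
salloc --nodes=1 --time=00:30:00 --account=<account> --partition=<partition>
srun --pty podman build -t <image_name:you_like> -f Containerfile .
```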
## Making Persistent Modifications

After entering the container, you can start configuring your simulation environment as usual. However, make sure that every command is recorded, because your modifications to the running container are temporary and will be lost after you exit. To make permanent modifications, you should write the commands down in the `Containerfile`. The way to do so is to add them with the `RUN` keyword:
This is not good advice IMO. You should only modify the Containerfile.
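A minimal sketch of keeping everything in the Containerfile (the package names are placeholders):

```bash
# Sketch only: record changes as RUN steps in the Containerfile rather than in a
# running container; the package names are placeholders.
cat >> Containerfile <<'EOF'
RUN pip install --no-cache-dir <your-python-packages>
EOF
podman build -t <image_name:you_like> -f Containerfile .
```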
You can save the modified container:

```bash
podman save -o /path/to/container.tar <image_name:you_like>
```
You should use the Container Engine instead of Podman for running, and therefore import images into the Container Engine instead of using podman save.
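A sketch of that import step (the enroot invocation and paths are from memory and should be checked against the CSCS documentation):

```bash
# Sketch only: export the podman-built image to a squashfs file usable by the
# Container Engine, then point the EDF's image field at that file.
enroot import -o $SCRATCH/ipi.sqsh podman://<image_name:you_like>

cat > ~/.edf/ipi.toml <<'EOF'
image = "/capstor/scratch/cscs/<user>/ipi.sqsh"
EOF
```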