Commit a3168fb

[Minor] Details about distributed slurm and the entrypoint
1 parent b584248 commit a3168fb

1 file changed: +6 -1 lines changed

installation/docker-amd64-cuda/entrypoints/pre-entrypoint.sh

Lines changed: 6 additions & 1 deletion
@@ -6,13 +6,18 @@
 # In the end all variables exported should be present and the command given by the user should run with PID 1.
 
 # In distributed jobs the number of times the entrypoint is run should match the number of containers created.
-# On Slurm, if the entrypoint is called multiple times in the same container we can skip it with the following variables:
+# On Slurm, for example, with Pyxis a single container is created per node,
+# and if the entrypoint is called manually after srun, it will run multiple times in the same container (ntasks-per-node)
+# so we can skip it with the following variables:
+
+# If nodes share the same container:
 if [ -n "${SLURM_ONE_ENTRYPOINT_SCRIPT_PER_JOB}" ] && [ "${SLURM_PROCID}" -gt 0 ]; then
   echo "[TEMPLATE INFO] Running the entrypoint only once for the job."
   echo "[TEMPLATE INFO] Skipping entrypoints on SLURM_PROCID ${SLURM_PROCID}."
   echo "[TEMPLATE INFO] Executing the command" "$@"
   exec "$@"
 fi
+# If tasks on the same node share the same container:
 if [ -n "${SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE}" ] && [ "${SLURM_LOCALID}" -gt 0 ]; then
   echo "[TEMPLATE INFO] Running the entrypoint once per node."
   echo "[TEMPLATE INFO] Skipping entrypoints on SLURM_PROCID ${SLURM_PROCID}."
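
The two guards in this hunk can be read as a single decision: skip the entrypoint setup on every task except global rank 0 (per-job mode) or except local rank 0 on each node (per-node mode). The sketch below isolates that decision so it can be tried without a Slurm allocation. The function name `should_skip_entrypoint` is hypothetical and not part of the template, and the `:-0` defaults for unset IDs are an assumption added so the snippet also runs outside Slurm. The simulated IDs assume a 2-node job with 2 tasks per node and block distribution of ranks.

```shell
#!/bin/sh
# Hypothetical helper (not in the template): returns 0 when the entrypoint
# setup should be skipped for the current Slurm task, 1 otherwise.
should_skip_entrypoint() {
  # Per-job mode: only the task with global rank 0 runs the setup.
  if [ -n "${SLURM_ONE_ENTRYPOINT_SCRIPT_PER_JOB:-}" ] && [ "${SLURM_PROCID:-0}" -gt 0 ]; then
    return 0
  fi
  # Per-node mode: only the task with local rank 0 on each node runs the setup.
  if [ -n "${SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE:-}" ] && [ "${SLURM_LOCALID:-0}" -gt 0 ]; then
    return 0
  fi
  return 1
}

# Simulate the four tasks of a 2-node, 2-tasks-per-node job in per-node mode.
# Each pair is "SLURM_PROCID SLURM_LOCALID" (assumed values, not real Slurm output).
for ids in "0 0" "1 1" "2 0" "3 1"; do
  (
    set -- $ids
    unset SLURM_ONE_ENTRYPOINT_SCRIPT_PER_JOB
    SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE=1
    SLURM_PROCID=$1
    SLURM_LOCALID=$2
    if should_skip_entrypoint; then
      echo "procid=$1 localid=$2: skip entrypoint"
    else
      echo "procid=$1 localid=$2: run entrypoint"
    fi
  )
done
```

In per-node mode the tasks with local ID 0 run the setup and the others skip it, which matches the Pyxis case described in the comment: one container per node, several tasks inside it. In the actual script the skip branch ends with `exec "$@"`, so the user's command still replaces the shell process.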
