Hello,
I'm running the optimisation on my cluster, managed by SLURM, through a batch script. When I use a population size (say 50) greater than the number of tasks per node (32), the first 32 individuals are assigned correctly to workers 0 -> 31, while the remaining ones (32 -> 49) are all assigned to worker_0 instead of being spread across the other workers.
The optimisation starts, but once worker_0 has finished its individuals, it halts. 32 input/output pipes are created, but only a single worker log file, worker_0.wlog.
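To make the pattern concrete, here is a minimal Python sketch of the assignment I observe next to the kind of spread I would have expected. This is not L2L code: both function names are hypothetical, and the round-robin scheme is only my assumption of what a balanced distribution could look like, not a claim about how the runner is meant to schedule.

```python
from collections import Counter


def observed_assignment(n_individuals, n_workers):
    """Pattern seen in my log: one individual per worker, then everything to worker 0."""
    return [idx if idx < n_workers else 0 for idx in range(n_individuals)]


def round_robin_assignment(n_individuals, n_workers):
    """Assumed alternative: wrap around the worker list instead of reusing worker 0."""
    return [idx % n_workers for idx in range(n_individuals)]


# Population of 50 on 32 workers, as in my job.
observed = Counter(observed_assignment(50, 32))
expected = Counter(round_robin_assignment(50, 32))

print("observed load on worker 0:", observed[0])
print("observed load on worker 1:", observed[1])
print("round-robin load on worker 0:", expected[0])
print("round-robin load on worker 18:", expected[18])
```

With the observed pattern worker 0 ends up with 19 individuals (idx 0 plus idx 32 to 49) while every other worker gets exactly one.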
This is the batch script I use to start the job on the cluster:
#!/bin/bash
#SBATCH --job-name=twocomp_optim
#SBATCH --chdir=/its/home/as2913/repositories/TwoComp/L2L_TwoComp_Ariel
#SBATCH -o logs/%x_%j.out
#SBATCH -e logs/%x_%j.err
#SBATCH --verbose
#SBATCH --nodes=1
#SBATCH --threads-per-core=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
##SBATCH --exclusive
#SBATCH --partition=general
##SBATCH --distribution=block:cyclic:fcyclic
#SBATCH --hint=nomultithread
#SBATCH --mem=50G
N_TASKS=${SLURM_NTASKS}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
date +"%Y-%m-%d %T.%N"
set -eo pipefail
set -x
eval "$($HOME/miniconda/bin/conda shell.bash hook)"
conda activate nest
source $HOME/repositories/nest-simulator/nest_install/bin/nest_vars.sh
cd $HOME/repositories/TwoComp/L2L_TwoComp_Ariel || exit 1
# python -c "import l2l"
export PYTHONPATH=$PYTHONPATH:$(pwd)
# python3 bin/build_model.py
# export LD_LIBRARY_PATH=neuron_model/neuron_build:$LD_LIBRARY_PATH
# srun --mpi=pmix python -u bin/TwoCompFA.py
python -u bin/TwoCompFA.py
These are the parameters I use in the optimisation script:
runner_params = {"srun": "srun --exact", "max_workers": 32}
traj, _ = experiment.prepare_experiment(runner_params=runner_params, name=filename, log_stdout=True, debug=True, stop_run=True, overwrite=True, multiprocessing=True)
And this is the .out file, which hangs indefinitely after the last line shown:
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Worker created 31
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : All 32 workers created
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : 50 workers launched
MainProcess utils.environment artemis-a40-12.local 3101720 INFO : Iteration: 1/100
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Running generation: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 0 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 1 to worker 1
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 2 to worker 2
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 3 to worker 3
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 4 to worker 4
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 5 to worker 5
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 6 to worker 6
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 7 to worker 7
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 8 to worker 8
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 9 to worker 9
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 10 to worker 10
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 11 to worker 11
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 12 to worker 12
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 13 to worker 13
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 14 to worker 14
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 15 to worker 15
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 16 to worker 16
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 17 to worker 17
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 18 to worker 18
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 19 to worker 19
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 20 to worker 20
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 21 to worker 21
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 22 to worker 22
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 23 to worker 23
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 24 to worker 24
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 25 to worker 25
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 26 to worker 26
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 27 to worker 27
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 28 to worker 28
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 29 to worker 29
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 30 to worker 30
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 31 to worker 31
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : All workers started running individuals for gen 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Reading output from gen 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 0: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 32 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 32: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 33 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 33: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 34 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 34: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 35 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 35: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 36 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 36: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 37 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 37: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 38 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 38: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 39 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 39: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 40 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 40: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 41 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 41: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 42 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 42: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 43 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 43: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 44 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 44: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 45 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 45: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 46 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 46: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 47 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 47: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 48 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 48: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : --- sent idx 49 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO : Individual finished without error 49: 0
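In case it helps with triage, below is a small Python sketch for summarising a log like the one above: it counts how many individuals were sent to each worker and lists the individuals that were sent but never reported as finished. The helper names are hypothetical, and I am assuming that the first number in "Individual finished without error N: ..." is the individual index. The three sample lines are copied from the log above; in practice the lines would be read from the .out file.

```python
import re
from collections import Counter

# Patterns taken from the log lines above.
SENT_RE = re.compile(r"--- sent idx (\d+) to worker (\d+)")
# Assumption: the first number after "error" is the individual index.
DONE_RE = re.compile(r"Individual finished without error (\d+):")


def tally_dispatches(log_lines):
    """Count how many individuals were sent to each worker."""
    counts = Counter()
    for line in log_lines:
        match = SENT_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def unfinished_individuals(log_lines):
    """Return sorted indices that were sent but never reported as finished."""
    sent, done = set(), set()
    for line in log_lines:
        sent_match = SENT_RE.search(line)
        done_match = DONE_RE.search(line)
        if sent_match:
            sent.add(int(sent_match.group(1)))
        elif done_match:
            done.add(int(done_match.group(1)))
    return sorted(sent - done)


prefix = "MainProcess utils.runner artemis-a40-12.local 3101720 INFO : "
sample = [
    prefix + "--- sent idx 0 to worker 0",
    prefix + "--- sent idx 1 to worker 1",
    prefix + "Individual finished without error 0: 0",
    prefix + "--- sent idx 32 to worker 0",
]

print(dict(tally_dispatches(sample)))
print(unfinished_individuals(sample))
```

On the full log above, the only "finished" messages are for individuals 0 and 32 to 49, so individuals 1 to 31 are sent but never reported back before the run hangs.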
Do you have any idea what is going wrong with this?
Please let me know if you need any other information.
Best,
Ariel