Only one worker is assigned when using multiprocessing #132

@arielshmilli

Hello,

I'm running the optimisation on my cluster, which is managed by SLURM, through a batch script. When I use a population size (say 50) greater than the number of tasks per node, the first 32 individuals are assigned correctly to workers 0 -> 31, while the remaining ones (32 -> 49) are all assigned to worker_0 instead of being spread across the other workers.
The optimisation starts, but once worker_0 has finished its individuals, the run hangs. 32 input/output pipes are created, but only a single worker log, worker_0.wlog.
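
To make the missing-log observation easy to reproduce, here is a minimal sketch that lists which worker logs are absent. It assumes that every worker writes a file named worker_<id>.wlog (extrapolated from worker_0.wlog) and that all of them land in one directory; the function name and the directory argument are hypothetical and not part of L2L.

```python
from pathlib import Path


def missing_worker_logs(log_dir, n_workers):
    """Return the ids of workers whose worker_<id>.wlog is absent from log_dir.

    Assumption: one log file per worker, named after the worker id.
    """
    log_dir = Path(log_dir)
    return [
        worker_id
        for worker_id in range(n_workers)
        if not (log_dir / f"worker_{worker_id}.wlog").exists()
    ]
```

Calling `missing_worker_logs(run_dir, 32)` on the run directory should return `[1, 2, ..., 31]` if only worker_0 ever wrote a log.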

This is the batch script I use to start the job on the cluster:

#!/bin/bash
#SBATCH --job-name=twocomp_optim
#SBATCH --chdir=/its/home/as2913/repositories/TwoComp/L2L_TwoComp_Ariel
#SBATCH -o logs/%x_%j.out
#SBATCH -e logs/%x_%j.err
#SBATCH --verbose
#SBATCH --nodes=1
#SBATCH --threads-per-core=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
##SBATCH --exclusive
#SBATCH --partition=general
##SBATCH --distribution=block:cyclic:fcyclic
#SBATCH --hint=nomultithread
#SBATCH --mem=50G

N_TASKS=${SLURM_NTASKS}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

date +"%Y-%m-%d %T.%N"

set -eo pipefail
set -x

eval "$($HOME/miniconda/bin/conda shell.bash hook)"
conda activate nest
source $HOME/repositories/nest-simulator/nest_install/bin/nest_vars.sh

cd $HOME/repositories/TwoComp/L2L_TwoComp_Ariel || exit 1

# python -c "import l2l"
export PYTHONPATH=$PYTHONPATH:$(pwd)

# python3 bin/build_model.py
# export LD_LIBRARY_PATH=neuron_model/neuron_build:$LD_LIBRARY_PATH
# srun --mpi=pmix python -u bin/TwoCompFA.py
python -u bin/TwoCompFA.py

These are the parameters I use in the optimisation script:

runner_params = {"srun": "srun --exact", "max_workers": 32}
traj, _ = experiment.prepare_experiment(
    runner_params=runner_params,
    name=filename,
    log_stdout=True,
    debug=True,
    stop_run=True,
    overwrite=True,
    multiprocessing=True,
)
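
Since max_workers is hard-coded to the same value as --ntasks-per-node, the two can drift apart when the batch script changes. A minimal sketch of reading the value from the SLURM_NTASKS environment variable instead is shown below; the helper name and the fallback of 1 worker outside SLURM are assumptions, and this is not meant as a fix for the issue itself.

```python
import os


def build_runner_params(default_workers=1):
    """Build runner_params with max_workers taken from SLURM_NTASKS.

    Falls back to default_workers when the variable is not set,
    for example when running outside a SLURM allocation.
    """
    n_tasks = int(os.environ.get("SLURM_NTASKS", default_workers))
    return {"srun": "srun --exact", "max_workers": n_tasks}
```

With `#SBATCH --nodes=1` and `--ntasks-per-node=32`, this should yield the same dictionary as the hard-coded one above.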

And this is the .out file; the run hangs indefinitely at the last line:

MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Worker created 31
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : All 32 workers created
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : 50 workers launched

MainProcess utils.environment artemis-a40-12.local 3101720 INFO    : Iteration: 1/100
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Running generation: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 0 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 1 to worker 1
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 2 to worker 2
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 3 to worker 3
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 4 to worker 4
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 5 to worker 5
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 6 to worker 6
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 7 to worker 7
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 8 to worker 8
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 9 to worker 9
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 10 to worker 10
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 11 to worker 11
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 12 to worker 12
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 13 to worker 13
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 14 to worker 14
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 15 to worker 15
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 16 to worker 16
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 17 to worker 17
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 18 to worker 18
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 19 to worker 19
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 20 to worker 20
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 21 to worker 21
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 22 to worker 22
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 23 to worker 23
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 24 to worker 24
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 25 to worker 25
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 26 to worker 26
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 27 to worker 27
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 28 to worker 28
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 29 to worker 29
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 30 to worker 30
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 31 to worker 31
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : All workers started running individuals for gen 0

MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Reading output from gen 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 0: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 32 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 32: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 33 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 33: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 34 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 34: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 35 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 35: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 36 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 36: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 37 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 37: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 38 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 38: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 39 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 39: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 40 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 40: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 41 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 41: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 42 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 42: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 43 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 43: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 44 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 44: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 45 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 45: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 46 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 46: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 47 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 47: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 48 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 48: 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : --- sent idx 49 to worker 0
MainProcess utils.runner artemis-a40-12.local 3101720 INFO    : Individual finished without error 49: 0
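
To summarise the log above without reading it line by line, here is a minimal sketch that groups the dispatched indices by worker. It relies only on the "--- sent idx N to worker W" message format visible in the log; the function name is hypothetical and the regular expression is an assumption about that format staying stable.

```python
import re
from collections import defaultdict

# Matches the dispatch messages seen in the log, e.g.
# "--- sent idx 32 to worker 0".
SENT_RE = re.compile(r"--- sent idx (\d+) to worker (\d+)")


def assignments_per_worker(log_lines):
    """Map each worker id to the list of individual indices sent to it."""
    per_worker = defaultdict(list)
    for line in log_lines:
        match = SENT_RE.search(line)
        if match:
            idx, worker = int(match.group(1)), int(match.group(2))
            per_worker[worker].append(idx)
    return dict(per_worker)
```

Applied to the log above, worker 0 receives 19 individuals (0 and 32 -> 49), while workers 1 -> 31 receive one each.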

Do you have any idea what is going wrong with this?

Please let me know if you need any other information.

Best,
Ariel
