
printing of worker information not correct #138

@affans

Using Distributed on a Slurm cluster, I am unable to connect to the workers. The reason is that SlurmClusterManager requires the output from start_worker to be in the form julia_worker:PORT#IP.IP.IP.IP. However, when using srun to launch workers across the allocated resources, this is what I get:

┌ Debug: srun command: `srun -D /home/affans /home/affans/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/julia --worker`
└ @ SlurmClusterManager REPL[7]:25


julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9427#172.16.1.26
9431#172.1.1.26
9425#172.1.1.26
9428#172.1.1.26
9430#172.1.1.26
9429#172.1.1.26
9423#172.1.1.26
9432#172.1.1.26
9426#172.1.1.26
9424#172.1.1.26
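
To make the mismatch concrete, here is a minimal sketch of a strict parser for the julia_worker:PORT#IP form described above. The function name parse_announcement and the exact regular expression are assumptions for illustration, not the code SlurmClusterManager or Distributed actually uses.

```julia
# Hypothetical strict parser for a single worker announcement line.
# The pattern encodes the assumed format `julia_worker:PORT#IP.IP.IP.IP`.
function parse_announcement(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(\d+\.\d+\.\d+\.\d+)$", line)
    m === nothing && return nothing
    return (port = parse(Int, m[1]), host = String(m[2]))
end

# A well-formed line parses cleanly.
good = parse_announcement("julia_worker:9427#172.16.1.26")

# Lines shaped like the srun output above do not: one has repeated
# prefixes, the other has no prefix at all.
bad_prefixes = parse_announcement("julia_worker:julia_worker:9427#172.16.1.26")
bad_bare = parse_announcement("9431#172.1.1.26")
```

With a parser of this shape, neither the first line (many prefixes) nor the following lines (no prefix) would be accepted, which would be consistent with the connection failure.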

So it looks like the print statements (to stdout) are racing with each other? I asked for 10 workers here, and all 10 julia_worker: prefixes were printed on the same line, with the PORT#IP parts following on separate lines. Here is another example of the output:

[ Info: Worker 1 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 2 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 3 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 4 output: 9369#172.16.1.41
[ Info: Worker 5 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 6 output: 9361#172.16.1.41
[ Info: Worker 7 output: julia_worker:9360#172.16.1.41
[ Info: Worker 8 output: 9365#172.16.1.41

Any idea what could be causing this?
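
One plausible explanation, stated as an assumption rather than a diagnosis, is that each worker writes the julia_worker: prefix and the PORT#IP part separately, and srun merges the streams of all tasks into one. Under that assumption, the information is still recoverable from the combined output. The sketch below scans the whole buffer instead of parsing line by line; recover_announcements is a hypothetical helper, not part of SlurmClusterManager.

```julia
# Hypothetical helper: recover worker endpoints from interleaved output.
# Assumes every PORT#IP token arrives intact and is followed by a newline.
function recover_announcements(output::AbstractString)
    # Count how many workers announced themselves at all.
    prefixes = count(r"julia_worker:", output)
    # Collect every PORT#IP token, regardless of which line it landed on.
    workers = [(port = parse(Int, m[1]), host = String(m[2]))
               for m in eachmatch(r"(\d+)#(\d+\.\d+\.\d+\.\d+)", output)]
    return prefixes, workers
end

# Shortened sample in the same shape as the output shown above.
sample = """
julia_worker:julia_worker:julia_worker:9427#172.16.1.26
9431#172.1.1.26
9425#172.1.1.26
"""

prefixes, workers = recover_announcements(sample)
```

If the prefix count equals the number of recovered endpoints, every worker reported a port and address even though the lines were interleaved. This does not handle tokens that are split mid-number or run together without a newline.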

Version info:

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/16.05.8/lib64/slurm:/cm/shared/apps/slurm/16.05.8/lib64:/cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
  JULIA_NUM_THREADS = 32
  LD_RUN_PATH = /cm/shared/apps/openmpi/gcc/64/1.10.1/lib64

julia> Distributed.VERSION
v"1.10.3"

Labels: bug
