
Segmentation fault at PMPI_Iallreduce in distributed Galois (missing MPI_Init?) #425

@Barenya255

Hello,
I've been trying to run distributed Galois for quite some time. I've tried running all the provided apps and keep encountering a segmentation fault.

Command used:
./sssp-pull --startNode=0 $graphPath

Error observed:

[0] Master distribution time : 0.239983 seconds to read 168 bytes in 20 seeks (0.00070005 MBPS)
[0] Starting graph reading.
[0] Reading graph complete.
[0] Edge inspection time: 0.246308 seconds to read 148615096 bytes (603.371 MBPS)
Loading edge-data while creating edges
[0] Edge loading time: 0.529808 seconds to read 271105352 bytes (511.705 MBPS)
[0] Graph construction complete.
[0] InitializeGraph::go called
[0] SSSP::go run 0 called

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6dc8604 in PMPI_Iallreduce () from /lfs/sware/openmpi411/lib/libmpi.so.40
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64

Here are some observations from running the binary under gdb.

The segfault occurs in the library libdist/libgalois_dist_async.a, on a PMPI_Iallreduce call.

After observing the segfault, I opened gdb and noticed that a constant address outside the process's memory bounds is being accessed.

This address is loaded into r9 in the preamble to the MPI_Iallreduce call and is then moved into rbp inside PMPI_Iallreduce. It never seems to be accessible.

Per the System V AMD64 ABI, r9 carries the sixth integer argument passed to a function, which for MPI_Iallreduce is the communicator, in our case MPI_COMM_WORLD.
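To make the register reasoning concrete, here is a small sketch that pairs the System V AMD64 integer-argument registers with the parameters of MPI_Iallreduce, using the immediates from the call preamble quoted in this report. The parameter order follows the standard C binding. The helper name `map_args` is made up for illustration, and rdi/rsi are left blank because their values are not visible in the listing.

```python
# System V AMD64 ABI: the first six integer/pointer arguments travel in these
# registers, in this order; any further arguments go on the stack.
SYSV_INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# Parameters of the standard C binding:
#   int MPI_Iallreduce(const void *sendbuf, void *recvbuf, int count,
#                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm,
#                      MPI_Request *request)
MPI_IALLREDUCE_PARAMS = [
    "sendbuf", "recvbuf", "count", "datatype", "op", "comm", "request",
]


def map_args(params):
    """Hypothetical helper: pair each parameter with its register or 'stack'."""
    return {
        name: SYSV_INT_ARG_REGS[i] if i < len(SYSV_INT_ARG_REGS) else "stack"
        for i, name in enumerate(params)
    }


# Immediates taken from the call preamble in this report.
# ecx/r8d/r9d are 32-bit writes, which zero-extend into rcx/r8/r9.
observed = {"rdx": 0x1, "rcx": 0x4C000808, "r8": 0x58000001, "r9": 0x44000000}

placement = map_args(MPI_IALLREDUCE_PARAMS)
for name, where in placement.items():
    value = observed.get(where)
    shown = hex(value) if value is not None else "(not visible in listing)"
    print(f"{name:9s} -> {where:5s} {shown}")
```

Under this mapping the communicator argument is the 0x44000000 placed in r9, and the `push %rax` just before the call is consistent with the seventh argument (the request pointer) being passed on the stack.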

This could happen if MPI_COMM_WORLD was not initialised, which would indicate that the code path never calls MPI_Init().

Also, gdb could only set a pending (future) breakpoint on MPI_Init, and the segfault in MPI_Iallreduce occurred before the MPI_Init breakpoint was ever hit. I don't see any boost_mpi libraries.

This is the preamble to the MPI_Iallreduce call:

   0x478441 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+145>:	mov    $0x44000000,%r9d
   0x478447 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+151>:	movb   $0x0,-0xe8(%rbp)
   0x47844e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+158>:	mov    $0x58000001,%r8d
   0x478454 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+164>:	mov    $0x4c000808,%ecx
   0x478459 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+169>:	mov    $0x1,%edx
   0x47845e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+174>:	mov    %rsi,-0x3a8(%rbp)
   0x478465 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+181>:	mov    %rdi,-0x3a0(%rbp)
   0x47846c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+188>:	mov    %rax,-0x388(%rbp)
   0x478473 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+195>:	vmovdqa %xmm3,-0x100(%rbp)
   0x47847b <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+203>:	push   %rax
   0x47847c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+204>:	callq  0x413230 <MPI_Iallreduce@plt>

This is where the segfault happens, inside PMPI_Iallreduce:

  
   0x00007ffff6dc85c8 <+24>:	mov    %r9,%rbp
   0x00007ffff6dc85cb <+27>:	push   %rbx
   0x00007ffff6dc85cc <+28>:	sub    $0x28,%rsp
   0x00007ffff6dc85d0 <+32>:	mov    0x29e041(%rip),%rax        # 0x7ffff7066618
   0x00007ffff6dc85d7 <+39>:	mov    0x60(%rsp),%rbx
   0x00007ffff6dc85dc <+44>:	cmpb   $0x0,(%rax)
   0x00007ffff6dc85df <+47>:	je     0x7ffff6dc8648 <PMPI_Iallreduce+152>
   0x00007ffff6dc85e1 <+49>:	mov    0x29e8d0(%rip),%rax        # 0x7ffff7066eb8
   0x00007ffff6dc85e8 <+56>:	mov    (%rax),%eax
   0x00007ffff6dc85ea <+58>:	sub    $0x2,%eax
   0x00007ffff6dc85ed <+61>:	cmp    $0x2,%eax
   0x00007ffff6dc85f0 <+64>:	ja     0x7ffff6dc8750 <PMPI_Iallreduce+416>
   0x00007ffff6dc85f6 <+70>:	test   %rbp,%rbp
   0x00007ffff6dc85f9 <+73>:	je     0x7ffff6dc8612 <PMPI_Iallreduce+98>
   0x00007ffff6dc85fb <+75>:	cmp    0x29e1c6(%rip),%rbp        # 0x7ffff70667c8
   0x00007ffff6dc8602 <+82>:	je     0x7ffff6dc8612 <PMPI_Iallreduce+98>
=> 0x00007ffff6dc8604 <+84>:	mov    0xe8(%rbp),%eax

Note the address 0x44000000. It does not seem to be accessible:

(gdb) p/x *0x44000000
Cannot access memory at address 0x44000000
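As a worked check of the faulting instruction: `mov $0x44000000,%r9d` writes a 32-bit register, which zero-extends into r9. PMPI_Iallreduce then copies r9 into rbp, and the instruction at +84 reads from rbp + 0xe8. A minimal sketch of that arithmetic (the variable names are mine, not from any tool):

```python
MASK64 = (1 << 64) - 1

r9 = 0x44000000 & 0xFFFFFFFF   # mov $0x44000000,%r9d (32-bit write, upper half zeroed)
rbp = r9                        # mov %r9,%rbp
displacement = 0xE8             # mov 0xe8(%rbp),%eax

# Effective address of the load that raises SIGSEGV.
fault_address = (rbp + displacement) & MASK64
print(hex(fault_address))       # prints 0x440000e8
```

The load therefore targets 0x440000e8, which lies in the same 4 KiB page as the 0x44000000 that gdb reports as inaccessible.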
