El::Trsm() consumes too much memory when running on many CPUs.

`El::Trsm()` is used e.g. for computing `L^{-1} B` [in SDPB solver](https://github.com/davidsd/sdpb/blob/c39fd50686b1cc02851864cd5799c24daf20ea68/src/sdp_solve/SDP_Solver/run/step/initialize_schur_complement_solver/compute_Q.cxx#L48), and for `L_X^{-1} F_p` [in SDPA solver](https://github.com/davidsd/sdpb/blob/297b2cb0b5f3bc3a18765ca92a66c4150317be6e/src/sdpa_solve/SDP_Solver/run/step/compute_S.cxx#L108).
If a block is assigned to many CPUs, memory consumption can grow dramatically, sometimes to more than ~10x that of the single-CPU case. This is probably the cause of some unexpected OOM crashes, since our memory estimates do not account for this factor.


For wide matrices, the memory overhead is roughly proportional to the MPI grid height.
All CPUs are arranged as a 2D grid, and the grid height is [defined](https://gitlab.com/bootstrapcollaboration/elemental/-/blob/51adc2b5980fa3b7e1a8d63c2b178e6627629014/src/core/Grid.cpp#L67-L73) as `min x | x >= floor(sqrt(num_cpus)) && num_cpus % x == 0`, i.e. the smallest divisor of `num_cpus` that is not less than `floor(sqrt(num_cpus))`.
This means that memory consumption (and also performance) is especially bad when `num_cpus` is a prime number such as 13 or 17: the only such divisor is `num_cpus` itself, so the grid degenerates to a 1D vertical grid (13x1 or 17x1).
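
To make the grid-shape rule concrete, here is a minimal C++ sketch that restates the formula above. It is not Elemental's code; the names `grid_height` and `grid_shape` are hypothetical, and it assumes `num_cpus >= 1`.

```cpp
#include <cmath>
#include <utility>

// Hypothetical helper: smallest divisor of num_cpus that is
// >= floor(sqrt(num_cpus)). Assumes num_cpus >= 1.
int grid_height(int num_cpus)
{
  int height = static_cast<int>(std::sqrt(static_cast<double>(num_cpus)));
  if(height < 1)
    height = 1;
  // Walk upwards until we hit a divisor; for a prime this ends at num_cpus.
  while(num_cpus % height != 0)
    ++height;
  return height;
}

// Hypothetical helper: (height, width) of the resulting 2D grid.
std::pair<int, int> grid_shape(int num_cpus)
{
  const int height = grid_height(num_cpus);
  return {height, num_cpus / height};
}
```

Under this rule, 12 and 16 CPUs give 3x4 and 4x4 grids, while 13 and 17 CPUs give 13x1 and 17x1. Non-prime counts can also produce tall grids, e.g. 10 CPUs give 5x2 and 14 CPUs give 7x2.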

The memory model for Trsm and Trmm (which suffers from similar issues) can be found here:
https://github.com/davidsd/sdpb/blob/464306bd53d9ca4aec97437f84dcde6898aed1cf/src/sdpb_util/memory_estimates.hxx#L54

Memory consumption and speedup plots for `L^{-1} X`, where `L ~ 125 x 125`, `X ~ 125 x 213500`, precision = 448 bit:

<img width="600" src="https://github.com/user-attachments/assets/ebfc1bb2-f6fa-4003-bf7a-c9143d1452d0" />

