[frontier] Add concentrated list of useful cray-mpich environment variables #1002
abbotts wants to merge 2 commits into olcf:master
Conversation
Setting this environment variable to ``1`` will spawn a thread dedicated to making progress on outstanding MPI communication and automatically increase the MPI thread level to MPI_THREAD_MULTIPLE.
Applications that use one-sided MPI (e.g., ``MPI_Put``, ``MPI_Get``) or non-blocking collectives (e.g., ``MPI_Ialltoall``) will likely benefit from enabling this feature.
Interesting. My experiments with MPI_Get and MPI_Ialltoall seemed to work pretty well without the async thread. Maybe because I wasn't trying to overlap with heavy CPU-based computation?
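For what it's worth, here's a minimal sketch (my illustration, untested on Frontier) of why results can look fine without the async thread: if the code polls with ``MPI_Test`` between slices of work, every poll enters the library (and hence libfabric), so the collective advances anyway.

```c
/* Sketch: driving progress of a non-blocking collective by polling.
 * Each MPI_Test call enters the MPI library (and thus libfabric), so
 * the MPI_Ialltoall can advance without an async progress thread. */
#include <mpi.h>
#include <stdlib.h>

static void compute_chunk(double *x, int n)
{
    for (int i = 0; i < n; i++)        /* stand-in for a slice of CPU work */
        x[i] = x[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int per_rank = 4096;               /* elements exchanged with each rank */
    double *sendbuf = calloc((size_t)per_rank * nranks, sizeof *sendbuf);
    double *recvbuf = calloc((size_t)per_rank * nranks, sizeof *recvbuf);
    double *work    = calloc(1 << 16, sizeof *work);

    MPI_Request req;
    MPI_Ialltoall(sendbuf, per_rank, MPI_DOUBLE,
                  recvbuf, per_rank, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);

    int done = 0;
    while (!done) {
        compute_chunk(work, 1 << 16);             /* small slice of work... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* ...then poke progress */
    }

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

The interesting case is presumably a single long compute stretch with no MPI calls in it at all.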
So, @timattox and I had a discussion on this, and both my recommendation (one-sided) and his (non-blocking collectives) are based on guidance we got from Krishna, but neither of us has had a chance to really test it.
I'm not sure how much CPU computation has to do with it. I think this comes down to when progress happens, and Slingshot may change some of that. Without the offloaded rendezvous, progress would only happen in a libfabric call, and that's only going to happen from an MPI call unless you have the progress thread.
The guidance in the MPICH man page is actually broader than what we have here. It basically says "this is good for anything except blocking pt2pt".
My inclination is to leave this in for now but make a point to specifically test over the next six months and update with what we think the right guidance is for different codes.
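To make that test concrete when we get to it, here's a sketch of the overlap pattern in question. I'm assuming the variable under review is ``MPICH_ASYNC_PROGRESS`` (inferred from the behavior described in the diff; the name isn't shown above), set before launch with something like ``export MPICH_ASYNC_PROGRESS=1``.

```c
/* Sketch: overlapping a non-blocking collective with a long, MPI-free
 * stretch of CPU work, relying on a dedicated progress thread. Assumes
 * the variable under review is MPICH_ASYNC_PROGRESS (inferred; set it
 * before launch, e.g. `export MPICH_ASYNC_PROGRESS=1`). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void compute(double *x, int n)
{
    for (int pass = 0; pass < 200; pass++)  /* long, MPI-free CPU stretch */
        for (int i = 0; i < n; i++)
            x[i] = x[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* The diff says the thread level is raised automatically; verify. */
    int provided;
    MPI_Query_thread(&provided);
    if (rank == 0 && provided == MPI_THREAD_MULTIPLE)
        puts("thread level is MPI_THREAD_MULTIPLE (async progress likely on)");

    int per_rank = 4096;
    double *sendbuf = calloc((size_t)per_rank * nranks, sizeof *sendbuf);
    double *recvbuf = calloc((size_t)per_rank * nranks, sizeof *recvbuf);
    double *work    = calloc(1 << 16, sizeof *work);

    MPI_Request req;
    MPI_Ialltoall(sendbuf, per_rank, MPI_DOUBLE,
                  recvbuf, per_rank, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);

    compute(work, 1 << 16);  /* no MPI calls in here: without the progress
                                thread, the exchange would mostly stall */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

The ``MPI_Query_thread`` check follows from the diff text: the provided thread level should come back as ``MPI_THREAD_MULTIPLE`` even though the code called plain ``MPI_Init``.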
Force-pushed from 290e0f2 to 6133621
Rebased to latest master. Time to bring this out of draft and work on getting it merged. @hagertnl, @GeorgiadouAntigoni - I've had this sitting on the back burner, and I'd like to make some progress towards getting this merged and moving on to the other MPI and GPU-aware MPI documentation we've been discussing. I'm open to any content or formatting changes here.
I like the current version. I get the sense that the docs need a larger re-shuffle to trim down content and better organize "tips & tricks"-like sections, but that would best be handled in a separate PR. Do we want to include any of the outdated workarounds that shouldn't be needed anymore, or stage those for a later update?
I'd like to handle the outdated workarounds in a separate PR. The COE needs to scrub the known issues/workarounds/fixed issues list to make sure we capture everything. Unfortunately, the issues with the new signal handler in cray-mpich/9.0.1 mean we can't get rid of nearly as many workarounds as I hoped.
The full list of cray-mpich environment variables can be quite intimidating for most users. This PR is an effort to pull out the ones most users should be aware of and write them in plain text.
I'll open this as a PR because we need to iterate a bit on placement, formatting, and descriptions. There are also a few variables that didn't make this first cut that we might want to add. In particular, a few were on the shortlist but I decided to leave them out; perhaps they should be added back in. I feel like if we want to add these, we need a more dedicated MPI debugging page.