Add HaveIdleWork to avoid excessive CPU cost for spinning. #236
base: development
Conversation
This is related to #235. Main changes are:
- Add CommonBase::HaveIdleWork.
- Modify CommonBase::MPISpinner to check whether there is idle work, and if not back off down to spinning at 10 Hz.
- Override HaveIdleWork in various CommonBase derivatives to indicate if they have idle work.

(cherry picked from commit 7ce6aef)
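For orientation, here is a rough sketch of the shape of the modified spin loop, not the literal diff. The names probe_idle_backoff_us and MAX_PROBE_IDLE_BACKOFF_US come from the change itself; the 100000 us cap (i.e. 10 Hz), the usleep() placement, the stand-in helper functions and the reset when a message arrives are my reading of it rather than a quote.

```cpp
#include <mpi.h>
#include <algorithm>
#include <unistd.h>

// Stand-ins for the CommonBase members this change touches.
bool HaveIdleWork();
void OnIdle();
void HandleIncoming(MPI_Status *);

static const int MAX_PROBE_IDLE_BACKOFF_US = 100000;  // assumed cap: ~10 Hz
static int probe_idle_backoff_us = 0;

void SpinOnce(MPI_Comm comm)
{
    MPI_Status status;
    int flag = 0;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag == 0)                          // Nothing there....
    {
        if (HaveIdleWork())
            OnIdle();                       // Deferred work exists: do some
        else
        {
            probe_idle_backoff_us = std::min<int>(MAX_PROBE_IDLE_BACKOFF_US,
                std::max<int>(10, (probe_idle_backoff_us * 5) / 4));
            usleep(probe_idle_backoff_us);  // Doze instead of burning a core
        }
    }
    else
    {
        probe_idle_backoff_us = 0;          // Traffic: spin flat out again
        HandleIncoming(&status);            // The existing receive/dispatch
    }
}
```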
mvousden left a comment
Generally happy to see this. It will have very little impact on runtime in practice, because the Mothership still spins as fast as it can. Should reduce the rate at which electrons leave smaller computers.
I approve, but would like @heliosfa to take a look before anyone merges it.
    if(HaveIdleWork()){
        OnIdle(); // Guess
    }else{
        probe_idle_backoff_us=std::min<int>(MAX_PROBE_IDLE_BACKOFF_US, std::max<int>(10, (probe_idle_backoff_us*5)/4));
I would prefer to see some documentation about this, explaining how probe_idle_backoff_us decreases and increases incrementally as a function of workload.
Small note on styling - we generally try to adhere to the same style as neighbouring source when introducing modifications. In this case, lines of source in the Orchestrator source shouldn't span more than 79 characters, and we generally avoid underscores in variable names over camel case.
We do have a style guide, at https://github.com/POETSII/Orchestrator/wiki/Code-Style, but I don't want to dissuade you from contributing!
I didn't really think about the schedule; it's just exponential backoff with a slow-ish multiplier and a reset to zero on activity. I'm afraid there is no particular thought that went into it, so feel free to change it to better match workload and so on.
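For anyone who does want to reason about the schedule: assuming the cap is 100000 us (the 10 Hz figure from the description), a quick back-of-envelope loop shows the delay climbs from 10 us to the cap in roughly 44 empty passes, i.e. a bit over half a second of continuous idleness.

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    const int MAX_PROBE_IDLE_BACKOFF_US = 100000;   // Assumed cap (10 Hz)
    int delayUs = 0, passes = 0;
    long long sleptUs = 0;
    while (delayUs < MAX_PROBE_IDLE_BACKOFF_US)
    {
        delayUs = std::min(MAX_PROBE_IDLE_BACKOFF_US,
                           std::max(10, (delayUs * 5) / 4));
        sleptUs += delayUs;
        ++passes;
    }
    // Prints roughly "44 passes, 0.58 s": the process decays from flat-out
    // spinning to 10 Hz after a bit over half a second of idleness.
    std::printf("%d passes, %.2f s\n", passes, sleptUs / 1e6);
    return 0;
}
```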
In terms of style, I generally try to match where possible, but it is a balance between fixing things for my ultimate goals (getting DPD working) versus trying to package up into small patches that can be easily applied. So probably that means more work to integrate the PRs, but I don't really expect them to be directly applicable, as I don't really understand the orchestrator. So feel free to improve however you want, and I'll just merge 1.0.0-alpha back into my copy.
        return false;
    }
    return true;
}
return !pCmCall->IsEmpty()? ;p
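Spelled out, and assuming the surrounding lines really are just testing pCmCall->IsEmpty() (only the tail of the hunk is visible here, and the enclosing class is a guess), the override would collapse to something like:

```cpp
bool HaveIdleWork()
{
    // Idle work exists exactly when the pending-call queue is non-empty.
    return !pCmCall->IsEmpty();
}
```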
Some care needs to be taken in implementing this and we may even need to look at making it switchable. @DrongoTheDog will need to confirm, but I believe the design intent here is that the MPI spinners spin as fast as possible to drain the MPI network as quickly as possible given that there could be large bursts of data that overwhelm the statically sized MPI buffers. On a beecake of a system, this is not a huge issue, but I can definitely see issues on laptops/desktops with fewer cores or battery.
On 14/06/2021 14:23, Graeme Bragg wrote:
> Some care needs to be taken in implementing this and we may even need to
> look at making it switchable. @DrongoTheDog will need to confirm, but I
> believe the design intent here is that the MPI spinners spin as fast as
> possible to drain the MPI network as quickly as possible given that there
> could be large bursts of data that overwhelm the statically sized MPI
> buffers.
The inter-process comms inside the Orchestrator is based on non-blocking
(immediate) reads, so there is a message buffer. One can (and one did)
chuck away the default buffer and attach one's own; this is done in the
CommonBase constructor and it's currently set to 1GByte. (Notice the use
of malloc/free; the standard says not to use new/delete or it'll all
fall over. So I tried, obviously, and it all fell over so I went back to
doing what I was told.)
When I chucked the source over the fence to MLV and GMB, there existed a
vestigial (and subtly documented) POETS> test /flood=*,* command, which
(amongst other things) induced a message storm *in the MPI fabric*, with
a view to seeing how far I could reasonably abuse MPI and get away with
it. The answer turned out to be quite a lot; you can quite comfortably
(as far as MPI is concerned) exfiltrate many Mbytes of data in SGLs
(sodding great lumps).
> On a beecake of a system, this is not a huge issue, but I can definitely
> see issues on laptops/desktops with fewer cores or battery.
The only parameter that matters is memory footprint.
--
Andrew Brown
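For context, the attach-your-own-buffer mechanism described above is, as far as I can tell, MPI's buffered-send machinery (MPI_Buffer_attach / MPI_Buffer_detach). A minimal sketch, with the 1 GByte size taken from the comment above and the function names purely illustrative (the real code lives in the CommonBase constructor):

```cpp
#include <mpi.h>
#include <cstdlib>                   // malloc/free, as used in CommonBase

static const int BIG_BUFFER_BYTES = 1 << 30;    // 1 GByte, per the comment

void AttachBigBuffer(void **pBuf)
{
    *pBuf = std::malloc(BIG_BUFFER_BYTES);      // NOT new[]: see remark above
    MPI_Buffer_attach(*pBuf, BIG_BUFFER_BYTES); // Buffered sends now use this
}

void DetachBigBuffer()
{
    void *buf;
    int   size;
    MPI_Buffer_detach(&buf, &size);             // Must detach before freeing
    std::free(buf);
}
```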
Looks like a good change set overall. A couple of queries (marked with a 🍠 as per MLV's convention).
I have also had a bash at updating the Orchestrator Docs [Docx|PDF] with this info (changes on pages 10 and 15). I would appreciate it if you could have a quick read and confirm that it describes the behaviour.
    void Dump(unsigned = 0,FILE * = stdout);
    unsigned OnExit(PMsg_p *);
    virtual void OnIdle();
    virtual bool HaveIdleWork(); // Return false if there is something we want to do in OnIdle
Should this comment be // Return true if there is something we want to do in OnIdle? 🍠
Yeah, comment is flipped. I think originally I had it as NoIdleWork, then changed the sense.
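So the corrected declaration would presumably read:

```cpp
// Return true if there is something we want to do in OnIdle.
virtual bool HaveIdleWork();
```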
    if (flag==0) { // Nothing there....
-       OnIdle(); // Guess
+       if(HaveIdleWork()){
+           OnIdle(); // Guess
Just having a think about this - shouldn't the sleep period be reset if there is idle work and OnIdle() is called? 🍠
Currently, the period is incremented any time there is no idle work, but it is not reset if there is.
My mental model was that OnIdle was mostly draining work that arrived in messages, so it was something like:
- Message arrives, causing work to queue up, resetting delay to zero.
- HaveIdleWork returns true until all the work is finished, with delay staying at zero.
- HaveIdleWork returns false (and presumably keeps returning false), so delay slowly increases.

But if idle work can arrive from elsewhere (i.e. not from a message), then I guess that you'd want any true return from HaveIdleWork to reset the counter too.
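Concretely, the variant being discussed is just a one-line addition to the hunk quoted above (sketch only, following the same names as the diff):

```cpp
if (flag == 0)                     // Nothing there....
{
    if (HaveIdleWork())
    {
        probe_idle_backoff_us = 0; // <-- proposed addition: idle work also
        OnIdle();                  //     resets the backoff, not just traffic
    }
    else
    {
        probe_idle_backoff_us = std::min<int>(MAX_PROBE_IDLE_BACKOFF_US,
            std::max<int>(10, (probe_idle_backoff_us * 5) / 4));
    }
}
```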
If the worry about a flood of messages arriving is a concern (which I could see), then presumably the correct approach is to move to a blocking MPI receive once HaveIdleWork returns false. Maybe the documentation for HaveIdleWork could be changed to:

    /*! Return true if there is something we want to do in OnIdle. If
        HaveIdleWork returns false it is guaranteed it will never return true
        again unless a message is received. */
    virtual bool HaveIdleWork();

That would allow the implementation to switch from MPI_Iprobe to MPI_Probe, which would get the absolute minimum latency of receiving messages to avoid flooding, while also maximising the efficiency of the loop.

(Not saying this should be done, probably not worth the effort - just reset the delay to 0 if HaveIdleWork returns true, and then you get almost all the CPU reduction benefit.)
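A sketch of what that switch might look like (MPI_COMM_WORLD stands in for whatever communicator the spinner actually probes, the CommonBase virtuals are shown as free-function stand-ins, and the dispatch path is elided; as noted above, probably not worth the effort):

```cpp
#include <mpi.h>

bool HaveIdleWork();
void OnIdle();

void ProbeOnce()
{
    MPI_Status status;
    int flag = 0;
    if (HaveIdleWork())
    {
        // Local work to interleave: stay non-blocking.
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        if (!flag) OnIdle();
    }
    else
    {
        // Per the proposed contract, nothing can make HaveIdleWork() true
        // again except a message, so block: zero CPU cost, and the message
        // is picked up the instant it arrives.
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        flag = 1;
    }
    if (flag) { /* ...MPI_Recv / unpack / dispatch as before... */ }
}
```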
Hang on, chaps. Let's get this into perspective.
The design intent is that MPISpinner brokers comms within the MPI
universe, which is expected to be rare: user input, logserver traffic,
commands from user to motherships and high-level responses back. (Recall
that the console read sits in its own thread and injects messages into
the spinner on an equal footing to everything else.) Root overloads
CommonBase::OnIdle for the batch system to work, and so does LogServer,
although I can't remember why.
Using blocking IO will almost certainly bite us in the bum at some
point, and there will be a gnailing and a washing of teeth.
To shift huge slabs of data into (and then presumably out of) the MPI
universe, MPIspinner is not the mechanism of choice: if you want to
write to a datafile, let the motherships do it directly. Instrumentation
packets need to go to a Monitor process, to be sure, but we haven't got
one yet so it can't possibly be broken. GMB wants to talk to Externals
directly via sockets (which I think is daft, but hey....)
It seems to me that the only driver for this is excessive CPU load and
GMB's feet getting hot. OK, poke a 100mS delay in the loop and let's get
on with our lives. Dynamic delay adjustment here is not worth worrying
about.
ADB
I agree, I don't really think it is worth shifting to blocking or doing much else
I don't think this is a trivial issue that only applies to people […]. Some of those threads will probably switch to doing real things during […]. So yes - I think that the hacky delay fix is fine, but it's not just a case of avoiding […]
This is related to #235.
Main changes are:
- Add CommonBase::HaveIdleWork.
- Modify CommonBase::MPISpinner to check whether there is idle work, and if not back off down to spinning at 10 Hz.
- Override HaveIdleWork in various CommonBase derivatives to indicate if they have idle work.

Tested locally, though not tested while running hardware. Maximum backoff threshold is 10 Hz, though probably this could go down to 100 Hz or even 1 kHz and still get benefit. Backoff schedule could also be adjusted.

(cherry picked from commit 7ce6aef)