Add HaveIdleWork to avoid excessive CPU cost for spinning. #236
base: development
Conversation
This is related to #235. Main changes are:
- Add CommonBase::HaveIdleWork.
- Modify CommonBase::MPISpinner to check whether there is idle work, and if not back off down to spinning at 10 Hz.
- Override HaveIdleWork in various CommonBase derivatives to indicate if they have idle work.

(cherry picked from commit 7ce6aef)
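For orientation, here is a rough sketch of the shape of the modified spin loop, not the literal diff. The names probe_idle_backoff_us and MAX_PROBE_IDLE_BACKOFF_US come from the change itself; the 100000 us cap (i.e. 10 Hz), the usleep() placement, the stand-in helper functions and the reset when a message arrives are my reading of it rather than a quote.

```cpp
#include <mpi.h>
#include <algorithm>
#include <unistd.h>

// Stand-ins for the CommonBase members this change touches.
bool HaveIdleWork();
void OnIdle();
void HandleIncoming(MPI_Status *);

static const int MAX_PROBE_IDLE_BACKOFF_US = 100000;  // assumed cap: ~10 Hz
static int probe_idle_backoff_us = 0;

void SpinOnce(MPI_Comm comm)
{
    MPI_Status status;
    int flag = 0;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag == 0)                          // Nothing there....
    {
        if (HaveIdleWork())
            OnIdle();                       // Deferred work exists: do some
        else
        {
            probe_idle_backoff_us = std::min<int>(MAX_PROBE_IDLE_BACKOFF_US,
                std::max<int>(10, (probe_idle_backoff_us * 5) / 4));
            usleep(probe_idle_backoff_us);  // Doze instead of burning a core
        }
    }
    else
    {
        probe_idle_backoff_us = 0;          // Traffic: spin flat out again
        HandleIncoming(&status);            // The existing receive/dispatch
    }
}
```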
mvousden left a comment
Generally happy to see this. It will have very little impact on runtime in practice, because the Mothership still spins as fast as it can. Should reduce the rate at which electrons leave smaller computers.
I approve, but would like @heliosfa to take a look before anyone merges it.
    if(HaveIdleWork()){
        OnIdle(); // Guess
    }else{
        probe_idle_backoff_us=std::min<int>(MAX_PROBE_IDLE_BACKOFF_US, std::max<int>(10, (probe_idle_backoff_us*5)/4));
I would prefer to see some documentation about this, explaining how probe_idle_backoff_us decreases and increases incrementally as a function of workload.
Small note on styling - we generally try to adhere to the same style as neighbouring source when introducing modifications. In this case, lines of source in the Orchestrator source shouldn't span more than 79 characters, and we generally avoid underscores in variable names over camel case.
We do have a style guide, at https://github.com/POETSII/Orchestrator/wiki/Code-Style, but I don't want to dissuade you from contributing!
I didn't really think about the schedule; it's just exponential backoff with a slow-ish multiplier and a reset to zero on activity. I'm afraid there is no particular thought that went into it, so feel free to change it to better match workload and so on.
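For anyone who does want to reason about the schedule: assuming the cap is 100000 us (the 10 Hz figure from the description), a quick back-of-envelope loop shows the delay climbs from 10 us to the cap in roughly 44 empty passes, i.e. a bit over half a second of continuous idleness.

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    const int MAX_PROBE_IDLE_BACKOFF_US = 100000;   // Assumed cap (10 Hz)
    int delayUs = 0, passes = 0;
    long long sleptUs = 0;
    while (delayUs < MAX_PROBE_IDLE_BACKOFF_US)
    {
        delayUs = std::min(MAX_PROBE_IDLE_BACKOFF_US,
                           std::max(10, (delayUs * 5) / 4));
        sleptUs += delayUs;
        ++passes;
    }
    // Prints roughly "44 passes, 0.58 s": the process decays from flat-out
    // spinning to 10 Hz after a bit over half a second of idleness.
    std::printf("%d passes, %.2f s\n", passes, sleptUs / 1e6);
    return 0;
}
```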
In terms of style, I generally try to match where possible, but it is a balance between fixing things for my ultimate goals (getting DPD working) versus trying to package up into small patches that can be easily applied. So probably that means more work to integrate the PRs, but I don't really expect them to be directly applicable, as I don't really understand the orchestrator. So feel free to improve however you want, and I'll just merge 1.0.0-alpha back into my copy.
        return false;
    }
    return true;
}
return !pCmCall->IsEmpty()? ;p
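Spelled out, and assuming the surrounding lines really are just testing pCmCall->IsEmpty() (only the tail of the hunk is visible here, and the enclosing class is a guess), the override would collapse to something like:

```cpp
bool HaveIdleWork()
{
    // Idle work exists exactly when the pending-call queue is non-empty.
    return !pCmCall->IsEmpty();
}
```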
Some care needs to be taken in implementing this and we may even need to look at making it switchable. @DrongoTheDog will need to confirm, but I believe the design intent here is that the MPI spinners spin as fast as possible to drain the MPI network as quickly as possible given that there could be large bursts of data that overwhelm the statically sized MPI buffers. On a beecake of a system, this is not a huge issue, but I can definitely see issues on laptops/desktops with fewer cores or battery.
On 14/06/2021 14:23, Graeme Bragg wrote:
> Some care needs to be taken in implementing this and we may even need to
> look at making it switchable. @DrongoTheDog will need to confirm, but I
> believe the design intent here is that the MPI spinners spin as fast as
> possible to drain the MPI network as quickly as possible given that there
> could be large bursts of data that overwhelm the statically sized MPI
> buffers.
The inter-process comms inside the Orchestrator is based on non-blocking
(immediate) reads, so there is a message buffer. One can (and one did)
chuck away the default buffer and attach one's own; this is done in the
CommonBase constructor and it's currently set to 1GByte. (Notice the use
of malloc/free; the standard says not to use new/delete or it'll all
fall over. So I tried, obviously, and it all fell over so I went back to
doing what I was told.)
When I chucked the source over the fence to MLV and GMB, there existed a
vestigial (and subtly documented) POETS> test /flood=*,* command, which
(amongst other things) induced a message storm *in the MPI fabric*, with
a view to seeing how far I could reasonably abuse MPI and get away with
it. The answer turned out to be quite a lot; you can quite comfortably
(as far as MPI is concerned) exfiltrate many Mbytes of data in SGLs
(sodding great lumps).
> On a beecake of a system, this is not a huge issue, but I can definitely
> see issues on laptops/desktops with fewer cores or battery.
The only parameter that matters is memory footprint.
--
Andrew Brown
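For context, the attach-your-own-buffer mechanism described above is, as far as I can tell, MPI's buffered-send machinery (MPI_Buffer_attach / MPI_Buffer_detach). A minimal sketch, with the 1 GByte size taken from the comment above and the function names purely illustrative (the real code lives in the CommonBase constructor):

```cpp
#include <mpi.h>
#include <cstdlib>                   // malloc/free, as used in CommonBase

static const int BIG_BUFFER_BYTES = 1 << 30;    // 1 GByte, per the comment

void AttachBigBuffer(void **pBuf)
{
    *pBuf = std::malloc(BIG_BUFFER_BYTES);      // NOT new[]: see remark above
    MPI_Buffer_attach(*pBuf, BIG_BUFFER_BYTES); // Buffered sends now use this
}

void DetachBigBuffer()
{
    void *buf;
    int   size;
    MPI_Buffer_detach(&buf, &size);             // Must detach before freeing
    std::free(buf);
}
```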
Looks like a good change set overall. A couple of queries (marked with a 🍠 as per MLV's convention).
I have also had a bash at updating the Orchestrator Docs [Docx|PDF] with this info (changes on pages 10 and 15). I would appreciate it if you could have a quick read and confirm that it describes the behaviour.
    void Dump(unsigned = 0,FILE * = stdout);
    unsigned OnExit(PMsg_p *);
    virtual void OnIdle();
    virtual bool HaveIdleWork(); // Return false if there is something we want to do in OnIdle
Should this comment be // Return true if there is something we want to do in OnIdle? 🍠
Yeah, comment is flipped. I think originally I had it as NoIdleWork, then changed the sense.
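So the corrected declaration would presumably read:

```cpp
// Return true if there is something we want to do in OnIdle.
virtual bool HaveIdleWork();
```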
    if (flag==0) { // Nothing there....
-       OnIdle(); // Guess
+       if(HaveIdleWork()){
+           OnIdle(); // Guess
Just having a think about this - shouldn't the sleep period be reset if there is idle work and OnIdle() is called? 🍠
Currently, the period is incremented any time there is no idle work, but it is not reset if there is.
My mental model was that OnIdle was mostly draining work that arrived in messages, so it was something like:
- Message arrives, causing work to queue up, resetting delay to zero.
- HaveIdleWork returns true until all the work is finished, with delay staying at zero.
- HaveIdleWork returns false (and presumably keeps returning false), so delay slowly increases.

But if idle work can arrive from elsewhere (i.e. not from a message), then I guess that you'd want any true return from HaveIdleWork to reset the counter too.
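Concretely, the variant being discussed is just a one-line addition to the hunk quoted above (sketch only, following the same names as the diff):

```cpp
if (flag == 0)                     // Nothing there....
{
    if (HaveIdleWork())
    {
        probe_idle_backoff_us = 0; // <-- proposed addition: idle work also
        OnIdle();                  //     resets the backoff, not just traffic
    }
    else
    {
        probe_idle_backoff_us = std::min<int>(MAX_PROBE_IDLE_BACKOFF_US,
            std::max<int>(10, (probe_idle_backoff_us * 5) / 4));
    }
}
```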
If the worry about a flood of messages arriving is a concern (which I could see), then presumably the correct approach is to move to a blocking MPI receive once HaveIdleWork returns false. Maybe the documentation for HaveIdleWork could be changed to:

    /*! Return true if there is something we want to do in OnIdle. If
        HaveIdleWork returns false it is guaranteed it will never return true
        again unless a message is received. */
    virtual bool HaveIdleWork();

That would allow the implementation to switch from MPI_Iprobe to MPI_Probe, which would get the absolute minimum latency of receiving messages to avoid flooding, while also maximising the efficiency of the loop.

(Not saying this should be done, probably not worth the effort - just reset the delay to 0 if HaveIdleWork returns true, and then you get almost all the CPU reduction benefit.)
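A sketch of what that switch might look like (MPI_COMM_WORLD stands in for whatever communicator the spinner actually probes, the CommonBase virtuals are shown as free-function stand-ins, and the dispatch path is elided; as noted above, probably not worth the effort):

```cpp
#include <mpi.h>

bool HaveIdleWork();
void OnIdle();

void ProbeOnce()
{
    MPI_Status status;
    int flag = 0;
    if (HaveIdleWork())
    {
        // Local work to interleave: stay non-blocking.
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        if (!flag) OnIdle();
    }
    else
    {
        // Per the proposed contract, nothing can make HaveIdleWork() true
        // again except a message, so block: zero CPU cost, and the message
        // is picked up the instant it arrives.
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        flag = 1;
    }
    if (flag) { /* ...MPI_Recv / unpack / dispatch as before... */ }
}
```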
Hang on, chaps. Let's get this into perspective.
The design intent is that MPISpinner brokers comms within the MPI
universe, which is expected to be rare: user input, logserver traffic,
commands from user to motherships and high-level responses back. (Recall
that the console read sits in its own thread and injects messages into
the spinner on an equal footing to everything else.) Root overloads
CommonBase::OnIdle for the batch system to work, and so does LogServer,
although I can't remember why.
Using blocking IO will almost certainly bite us in the bum at some
point, and there will be a gnailing and a washing of teeth.
To shift huge slabs of data into (and then presumably out of) the MPI
universe, MPIspinner is not the mechanism of choice: if you want to
write to a datafile, let the motherships do it directly. Instrumentation
packets need to go to a Monitor process, to be sure, but we haven't got
one yet so it can't possibly be broken. GMB wants to talk to Externals
directly via sockets (which I think is daft, but hey....)
It seems to me that the only driver for this is excessive CPU load and
GMB's feet getting hot. OK, poke a 100mS delay in the loop and let's get
on with our lives. Dynamic delay adjustment here is not worth worrying
about.
ADB
I agree, I don't really think it is worth shifting to blocking or doing much else
I don't think this is a trivial issue that only applies to people […]. Some of those threads will probably switch to doing real things during […]. So yes - I think that the hacky delay fix is fine, but it's not just a case of avoiding […]
This is related to #235.
Main changes are:
- Add CommonBase::HaveIdleWork.
- Modify CommonBase::MPISpinner to check whether there is idle work, and if not back off down to spinning at 10 Hz.
- Override HaveIdleWork in various CommonBase derivatives to indicate if they have idle work.

Tested locally, though not tested while running hardware. Maximum backoff threshold is 10 Hz, though probably this could go down to 100 Hz or even 1 kHz and still get benefit. Backoff schedule could also be adjusted.

(cherry picked from commit 7ce6aef)