LoLa benchmark implementation #63
Conversation
Force-pushed from f249a59 to 6a1b0c8.
Test 1
Two threads are both producers and subscribers, sequentially exchanging data in turns.
Preparations
The sample slot count can be adjusted in the etc/mw_com_config.json file by changing the numberOfSampleSlots parameter of the SkeletonA and SkeletonB services.
Execution
Results
Test 2
In addition to the previous test, multiple additional threads are spawned that only receive data. All tests use 2 additional subscribers for each producer.
Preparations
The sample slot count can be adjusted in the etc/mw_com_config.json file by changing the numberOfSampleSlots parameter of the SkeletonA and SkeletonB services.
Execution
Results
All tests were conducted with 8192 B samples.
crimson11 left a comment:
First: I really appreciate that you are coming up with this effort to develop a basic benchmark test! We internally already went through some iterations of benchmarking - but it is often hard to decide WHAT exactly to measure!
We typically differentiate between MACRO and MICRO benchmarks.
I guess what you are trying to do here is a mix of both, or rather a MACRO benchmark.
Micro benchmark
So in the case of MICRO benchmarks, we typically try to measure the runtime of a specific API. Sometimes it isn't even a public API, but some internal API/logic/algorithm. So we execute a rather small amount of code frequently and then average the runtime.
Macro benchmark
In a MACRO benchmark, we have a more complex application setup, which does a lot of mw::com communication for either a given amount of time or a given number of communication calls and then we look at throughput and CPU usage at the end of the test.
Latency notion
So - your benchmark calculates a "Latency" according to the output.
From the code it is apparently the measurement between benchmark_ab_start_point and benchmark_ab_finish_point,
so it is the loop over the "inner sequence":
- proxy checks for a new sample
- skeleton allocates a new sample
- skeleton sends the new sample
So from this I guess you mean by "latency" the time in score::mw::com from a communication partner allocating/sending a sample until it is "accessible" by a subscriber?
Imho it is very hard to argue about how to measure "correctly". Let me explain:
Technically, the sample is "visible" to the consumer, after the provider has done the following steps:
- Allocating memory for the sample
- Writing the sample data into the allocated memory
- Having successfully called Send()
After the last step, the sample data is residing in shared memory (thus generally "visible" to the subscriber) AND the "meta-info" in shared memory is updated, so that the subscriber is able to find the new sample within the shared memory.
So at least the time those 3 steps take is part of the "latency".
Btw.: you aren't measuring step (2) at all in your test, because you're not feeding data into the sample, which means it is either default-constructed or contains the data of an older sample that occupied the same memory location before!
But imho it is crucial to really "fill" the sample with valid data, because these are memory writes, which take time and MUST also be measured! If you compared the score::mw::com implementation (with shared-mem/zero-copy) against some other binding/middleware which isn't zero-copy, you would see BIG differences in latency when the sample data is large! So it only makes sense to measure/compare latency in conjunction with a specific sample data type/size, and only if you really write this data!
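To make the point about step (2) concrete, a provider iteration could look roughly like the sketch below. It reuses the names from this PR (skeleton.dummy_benchmark_data_, Allocate(), Send()), but the payload member and the pointer-style access to the allocated sample are assumptions that need to be adapted to the actual DummyBenchmarkData type.
// Sketch only: the member name `payload` and the pointer-style access to the
// allocated sample are assumptions, not the verified score::mw::com API surface.
auto sample_result = skeleton.dummy_benchmark_data_.Allocate();
if (!sample_result.has_value())
{
    std::cerr << "Allocate() failed - aborting benchmark" << std::endl;
    return;
}
auto sample = std::move(sample_result).value();
// Step (2): actually write the 8192 B payload so that the memory writes become
// part of the measured latency (filled here with the cycle counter as dummy content).
std::fill(sample->payload.begin(), sample->payload.end(), static_cast<std::uint8_t>(cycle));
skeleton.dummy_benchmark_data_.Send(std::move(sample));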
Since the consumer-side call to GetNewSamples() is surely part of the "latency" (without this call "succeeding", no access to the sample is possible), I'm fine with also adding its runtime to the latency measurement. If I wanted to be nitpicky, the measurement/timer would need to be stopped already in the callback/lambda you hand over to GetNewSamples(), because this is the exact point in time where the consumer gets access to the sample ;)
Now the most interesting part of the latency question: when does the consumer side (calling GetNewSamples()) know that there is a new sample and the call makes sense?
Answer: it only knows this in the so-called "event-driven" reception mode, where the consumer has registered a so-called ReceiveHandler at the event instance beforehand, which will fire as soon as the provider side has called Send().
So in "real-world" scenarios, where provider/skeleton and consumer/proxy reside in different processes, you can measure "latency" in a meaningful way only in this "event-driven" mode, and then the whole runtime of this notification path (which in score::mw::com goes through a message-passing side channel) becomes part of the latency! And here, for sure, context switches are involved as well!
So your decision to have the consumer/proxy-side GetNewSamples() call directly next to the Send() call of the provider/skeleton makes perfect sense in that you don't need the additional EventReceiveHandler path ... which is good for estimating the "latency" for users who don't use the "event-driven" mode with registered EventReceiveHandlers but just "poll" for new samples (which is a very common use case!).
Further points to consider
- if you look at your "inner loop" in line 129ff, there is a risk that you measure unintended retry loops in failure cases. I would get rid of this! An error in any of these 3 steps is unexpected in your benchmark setup and should lead to termination of the test.
- having provider and consumer in the same process, while having some benefits for latency measurement (see above), might on the other hand be "unrealistic". In a real-world setup with provider and consumer in different processes, there is the risk that after the context switch back from the provider the caches are invalidated, so that accessing the sample on the consumer side takes longer (higher latency), because the sample data in shared memory must first be loaded into the cache again ...
- also on the consumer side I would then add "code" which really accesses/reads the sample, to take such effects into account.
- for really tight/good measurements without OS overhead (using CPU-specific performance counters/timers), a framework like google-benchmark is really helpful (see the sketch after this list).
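To make the google-benchmark suggestion concrete, here is a minimal, self-contained sketch of how such a micro-benchmark could be wired up. The 8192 B payload write is only a stand-in for the allocate-write-send path; the actual score::mw::com Allocate()/Send() calls would have to be added to the loop body (with the skeleton set up outside of the measured loop).
#include <benchmark/benchmark.h>
#include <array>
#include <cstdint>

// Stand-in for the provider-side "allocate + write + send" path: only the
// 8192 B payload write is measured here; the real Allocate()/Send() calls
// would replace or extend the loop body.
static void BM_WriteSamplePayload(benchmark::State& state)
{
    std::array<std::uint8_t, 8192U> payload{};
    std::uint8_t value = 0U;
    for (auto _ : state)
    {
        payload.fill(value++);                     // the memory writes we want to account for
        benchmark::DoNotOptimize(payload.data());  // keep the writes from being optimized away
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_WriteSamplePayload);

BENCHMARK_MAIN();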
Summary
Again - kudos to you for getting in touch with score::mw::com not only superficially ... but going deeper!
Georg Dadunashvili on our side is our benchmarking dude. I will get in touch with him after his vacation and inform him that there is apparently interest in the community in benchmarks/performance data! Then we will see how we can contribute the stuff we already have.
score/mw/com/benchmark/benchmark.cpp
Outdated
std::mutex cout_mutex{};
score::cpp::latch benchmark_ab_start_point{3}, benchmark_ab_finish_point{3}, init_ab_sync_point{3}, deinit_ab_sync_point{2};
score::cpp::latch benchmark_multi_start_point{1 + kThreadsMultiTotal}, benchmark_multi_finish_point{1 + kThreadsMultiTotal}, init_multi_sync_point{kThreadsMultiTotal}, deinit_multi_sync_point{kThreadsMultiTotal};
const auto instance_specifier_skeleton_a_optional = InstanceSpecifier::Create("benchmark/SkeletonA");
InstanceSpecifier::Create returns a score::Result ... so shouldn't these variables be renamed - xxx_result instead of xxx_optional?
... and why make these variables "global"? There is no need imho. Just create them within main ...
Further: an InstanceSpecifier is something which "identifies/specifies" an instance of a service type, and it is therefore completely independent of the skeleton or proxy side ... so finally I would rename it to:
instance_specifier_instance_a_result
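For illustration, the renamed variables created within main could look roughly like this (a sketch; error handling only indicated, and <cstdlib> assumed for EXIT_FAILURE):
// Inside main(), as suggested above (sketch):
const auto instance_specifier_instance_a_result = InstanceSpecifier::Create("benchmark/InstanceA");
const auto instance_specifier_instance_b_result = InstanceSpecifier::Create("benchmark/InstanceB");
if (!instance_specifier_instance_a_result.has_value() || !instance_specifier_instance_b_result.has_value())
{
    return EXIT_FAILURE;  // creating the instance specifiers is expected to succeed in this setup
}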
Renamed, moved
],
"serviceInstances": [
{
"instanceSpecifier": "benchmark/SkeletonA",
better: "benchmark/InstanceA"
Renamed
]
},
{
"instanceSpecifier": "benchmark/SkeletonB",
better: "benchmark/InstanceB"
Renamed
score/mw/com/benchmark/benchmark.cpp
Outdated
const auto instance_specifier_skeleton_a_optional = InstanceSpecifier::Create("benchmark/SkeletonA");
const auto instance_specifier_skeleton_b_optional = InstanceSpecifier::Create("benchmark/SkeletonB");

score::cpp::optional<std::reference_wrapper<impl::ProxyEvent<DummyBenchmarkData>>> GetBenchmarkDataProxyEvent(
Why this "helper" function? It doesn't add anything "special" and is just bloating the code?
Why not directly accessing proxy.dummy_benchmark_data_ at the various locations instead of calling this func?
Some leftover, removed
}
} while (handle.size() == 0);

auto proxy_result = BenchmarkProxy::Create(std::move(handle.front()));
This I do not get. Why are you creating a proxy/consumer within the "Transmitter" function? Subscribing to/consuming data provided by the skeleton/provider should be the job of the "subscriber" thread, which calls the Subscriber function ...
And for naming:
I would prefer to have either Provider or Sender instead of Transmitter! This is more natural, as the Skeleton is the "Service Provider" and the Proxy is the "Service Consumer".
Or you name it "Sender", because you are actually calling the Send API on the skeleton side ...
Actually, the function was supposed to be named transceiver, which would shed some light on the overall functionality.
There are transceivers and subscribers.
Renamed
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
Why this "starter" logic? So only one provider/skeleton does initially provide/send an event before it gets read/consumed in line 131? Why not in the case of the other provider?
Explained in other comment
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
}
for (std::size_t cycle = 0U; cycle < kIterations; cycle++)
The "testing-sequence" in the loop looks a bit odd?
So basically it is:
- Consumer side: Try to acquire a new sample (new since the last call to GetNewSamples)
- Provider side: Allocate memory for a new sample to be sent.
- Provider side: Send the allocated sample.
The code is imho very confusing, as in some steps (but not all!) you tried to add "error" handling in the sense that you simply repeat the API call in case of an error.
But why aren't you doing this for the last step (Send())? It can also return an error ...
And why do you use a while loop with an empty body for the 1st step, but a do-while loop for the second step?
The while loop with an empty body was an incorrect use of the API - a wrong assumption that receiving 0 samples returns an error.
The workflow is further explained in the other comment.
score/mw/com/benchmark/benchmark.cpp
Outdated
}
#endif
init_ab_sync_point.arrive_and_wait();
const auto benchmark_ab_start_time = std::chrono::steady_clock::now();
Shouldn't benchmark_ab_start_time be taken AFTER benchmark_ab_start_point.arrive_and_wait() ?
Otherwise you might measure the time it takes the other two (?) threads to finally reach benchmark_ab_start_point ... but the measurement should only start after all of them have reached that point?
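A rough sketch of the suggested ordering, using the names from this PR:
// Start the clock only once all participants have reached the start latch:
benchmark_ab_start_point.arrive_and_wait();
const auto benchmark_ab_start_time = std::chrono::steady_clock::now();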
That's right, fixed
score/mw/com/benchmark/benchmark.cpp
Outdated
dummy_data_event.Subscribe(1);
init_multi_sync_point.arrive_and_wait();
benchmark_multi_start_point.arrive_and_wait();
This isn't used at all right now? I.e. no measurements/test output are derived from it?
Some leftover, removed
@crimson11 Thanks for the comprehensive explanation on this topic!
Workflow inside benchmark
For this measurement there is a test that spawns two bidirectional threads.
Latency
The thing called "latency" is assumed to be the time that passes between Send and a successful receive. At the time of sketching some initial code and learning the API, the first approach was to use the "event-driven" method for data reception, but at that time I got (much) worse results than with the current fully synchronous approach. From those runs I found here and there some other points to improve the overall outcomes, so maybe I need to revisit the "event-driven" reception mode one more time with a slightly better understanding of using LoLa.
Further points to consider
Addressing your points:
Summary
Thanks again for your help, I hope this will help legitimize the results of this benchmark in some way. It would be great to add some of your existing benchmarks to this repository. I'm very curious to see what (and even more interesting - how) you are measuring internally and what the results are. Can you propose some simple scenario for another benchmark where LoLa should shine among others? I would be happy to consider a proposal from you to proceed with some additional measurements.
Some replies to your answer:
> As the aim is to measure only zero-copy providers, I intentionally omitted the step of filling real data into the sample. I wanted to completely remove any business logic from the benchmark and measure the raw performance of the libraries.
My point: even IF you compare only zero-copy implementations, this IS an issue! Because the performance greatly depends on WHERE the sample data (in shared memory) is located! And this is decided/controlled by the implementation! If one implementation puts the data samples "densely" in memory while another implementation has an arbitrary layout for where it puts/allocates the samples, this might have a HUGE impact on data locality and cache hits.
So I insist that doing a real data write on the provider side and a data read on the consumer side is essential to rule such things out.
> That's fully intended, to measure retry time in failure cases. My assumption is that I gave the IPC provider enough resources to execute the full test. If any thread is too fast/too slow to allocate, send or receive data then, well, that's part of the latency.
See my comments: in some API calls the returned errors will most likely never be healed ... so I think tracing/logging these errors makes sense ... otherwise someone running this test in some different environment will never detect that he measures "bollocks" :)
General Feedback
Now that I have "understood" your Transceiver logic, I have the following to say:
You are massively including scheduling behaviour into your latency measurement.
I.e.
Each of your Transceiver threads does high-frequency polling on whether the other Transceiver thread has sent a new sample. If one of the (producing) Transceivers gets de-scheduled, the other Transceiver thread will be stuck in its GetNewSamples() loop for a very long time, until the producer gets re-scheduled and is able to send a new sample.
So this will add massively to the latency ... in this case almost all of the latency is scheduling-inflicted ...
I guess, you are aware of this?
So to rule such things out, some combined "micro-benchmarking" would eventually be more "spot-on": micro-bench the allocate-write-send path separately, and then the GetNewSamples-access-sample path separately, calling it at a point in time where you know that a new accessible sample is there. Then add up those two "micro-benched" paths and you have a raw latency figure ... (a rough sketch of this idea follows below)
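The following sketch illustrates that split with plain std::chrono. The two callables are placeholders: the first would wrap the actual allocate + write + Send() calls, the second would wrap GetNewSamples() plus reading the sample; the function name and structure are illustrative only.
#include <chrono>
#include <cstddef>
#include <iostream>

// Measures both paths in an interleaved fashion, so that the consumer path is
// always invoked when a freshly sent sample is known to be available.
template <typename ProviderPath, typename ConsumerPath>
void MeasureRawLatency(ProviderPath&& provider_path, ConsumerPath&& consumer_path, const std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    clock::duration provider_total{0};
    clock::duration consumer_total{0};

    for (std::size_t i = 0U; i < iterations; ++i)
    {
        const auto t0 = clock::now();
        provider_path();   // e.g. Allocate() + write payload + Send()
        const auto t1 = clock::now();
        consumer_path();   // e.g. GetNewSamples() + read the sample data
        const auto t2 = clock::now();
        provider_total += t1 - t0;
        consumer_total += t2 - t1;
    }

    const auto to_ns = [iterations](const clock::duration d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count() / iterations;
    };
    std::cout << "provider path: " << to_ns(provider_total) << " ns/iteration, consumer path: "
              << to_ns(consumer_total) << " ns/iteration, raw latency estimate: "
              << (to_ns(provider_total) + to_ns(consumer_total)) << " ns/iteration" << std::endl;
}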
auto id = std::this_thread::get_id();
auto native_handle = *reinterpret_cast<std::thread::native_handle_type*>(&id);

int max_priority = sched_get_priority_max(SCHED_RR);
I guess this (POSIX) func needs a specific include -> <sched.h> ?
int max_priority = sched_get_priority_max(SCHED_RR);
struct sched_param params;
params.sched_priority = max_priority;
if (pthread_setschedparam(native_handle, SCHED_RR, &params))
Don't we need include <pthread.h> for this?
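Something along these lines near the top of benchmark.cpp should cover both calls (on a Linux/glibc build; note that the CPU_* macros and pthread_setaffinity_np additionally require _GNU_SOURCE to be defined):
#include <pthread.h>  // pthread_setschedparam, pthread_setaffinity_np
#include <sched.h>    // SCHED_RR, sched_get_priority_max, sched_param, cpu_set_t, CPU_ZERO, CPU_SET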
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
if (pthread_setaffinity_np(native_handle, sizeof(cpu_set_t), &cpuset))
I guess this is Linux-specific (hence the "non-portable" / _np suffix). Then you should adjust the bazel BUILD accordingly, annotating that the binary is "Linux"-only (since a major/typical goal in score is imho to achieve QNX compatibility).
return;
}

ServiceHandleContainer<impl::HandleType> handle{};
As it is a container (it could even contain multiple handles), I would prefer to rename:
handle -> handle_container
or at least:
handle -> handles
while (true) {
    auto result = dummy_data_event.GetNewSamples((
        [](SamplePtr<DummyBenchmarkData> sample) noexcept {
            std::ignore = sample;
        }),1);
    if (result.has_value())
    {
        if (result.value() == 0)
        {
            continue;
        } else {
            break;
        }
    }
};
Hm. This is still "odd" in case the call to GetNewSamples() gives you an error. IF it gives you an error, it will most likely indicate an issue which won't self-heal! But then you have an endless loop here.
So my proposal:
while (true) {
    auto result = dummy_data_event.GetNewSamples((
        [](SamplePtr<DummyBenchmarkData> sample) noexcept {
            std::ignore = sample;
        }),1);
    if (!result.has_value())
    {
        std::cerr << "Transceiver: GetNewSamples() failed: " << result.error() << "!" << std::endl;
        break;
    }
    if (result.value() == 0)
    {
        continue;
    } else
    {
        break;
    }
};
}
};
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
Here you have the "same" problem as above. When you have a problem doing the allocation ... there is a likelihood that this problem does not self-heal and you end up in an endless loop.
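In the same spirit as the GetNewSamples() proposal above, the allocation could bail out instead of retrying forever. A sketch (assuming the Allocate() result exposes error() like the other score::Result values in this file):
auto sample_result = skeleton.dummy_benchmark_data_.Allocate();
if (!sample_result.has_value())
{
    std::cerr << "Transceiver: Allocate() failed: " << sample_result.error() << "!" << std::endl;
    return;  // an allocation error is unexpected in this setup and will most likely not self-heal
}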
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
Send may fail ... and you don't have any error handling ... if you don't want to do error handling ... then you should at least be explicit about it by ignoring the return value of Send() ...
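A sketch of making that explicit (assuming Send() returns a result type, as implied above):
// Explicitly ignore the result of Send() to document that errors are intentionally not handled here:
std::ignore = skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());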
std::thread transceiverA(Transceiver, cpu++, true, std::ref(instance_specifier_instance_a), std::ref(instance_specifier_instance_b));
std::thread transceiverB(Transceiver, cpu++, false, std::ref(instance_specifier_instance_b), std::ref(instance_specifier_instance_a));
#if kSubscribers > 0
Why do you still have the additional subscriber threads in here? They are not explicitly taken into account by interacting with specific timing checkpoints?
So they run "in the background", creating CPU load, and because they also access the provided service instances, they potentially also affect the Transceivers (working on the same control data structures).
So is this exactly the idea? To check how an additional number of consumers might affect the "latency" of a given communication? If so -> that should be explicitly documented.
Implementation of a publish-subscribe benchmark for LoLa.