LoLa benchmark implementation #63
Conversation
Force-pushed from f249a59 to 6a1b0c8.
Test 1
Two threads are both producers and subscribers, sequentially exchanging data in turns.
Preparations
The sample slot count can be adjusted in the etc/mw_com_config.json file by changing the numberOfSampleSlots parameter of the SkeletonA and SkeletonB services.
Execution
Results
Test 2
In addition to the previous test, multiple additional threads are spawned that only receive data. All tests use 2 additional subscribers for each producer.
Preparations
The sample slot count can be adjusted in the etc/mw_com_config.json file by changing the numberOfSampleSlots parameter of the SkeletonA and SkeletonB services.
Execution
Results
All tests were conducted with 8192 B samples.
crimson11 left a comment:
First: I really appreciate that you are coming up with this effort to develop a basic benchmark test! We internally already went through some iterations of benchmarking - but it is often hard to decide WHAT exactly to measure!
We typically differentiate between MACRO and MICRO benchmarks.
I guess what you are trying to do here is a mix of both, or rather a MACRO benchmark.
Micro benchmark
So in the case of MICRO benchmarks, we typically try to measure the runtime of a specific API. Sometimes it isn't even a public API, but some internal API/logic/algorithm. So we execute a rather small amount of code frequently and then average the runtime.
Macro benchmark
In a MACRO benchmark, we have a more complex application setup, which does a lot of mw::com communication for either a given amount of time or a given number of communication calls and then we look at throughput and CPU usage at the end of the test.
Latency notion
So - your benchmark calculates a "Latency" according to the output.
From the code it is apparently the measurement between benchmark_ab_start_point and benchmark_ab_finish_point,
so it is the loop over the "inner sequence":
- proxy checks for a new sample
- skeleton allocates a new sample
- skeleton sends the new sample
So from this I guess you mean by "latency" the time in score::mw::com from a communication partner allocating/sending a sample until it is "accessible" by a subscriber?
Imho it is very hard to argue about how to measure "correctly". Let me explain:
Technically, the sample is "visible" to the consumer, after the provider has done the following steps:
- Allocating memory for the sample
- Writing the sample data into the allocated memory
- Having successfully called Send()
After the last step, the sample data is residing in shared memory (thus generally "visible" to the subscriber) AND the "meta-info" in shared memory is updated, so that the subscriber is able to find the new sample within the shared memory.
So at least the time those 3 steps take is part of the "latency".
Btw.: you aren't measuring step (2) at all in your test, because you're not feeding data into the sample, which means it is either default-constructed or contains the data of an older sample that occupied the same memory location before!
But imho it is crucial to really "fill" the sample with valid data, because these are memory writes, which take time and MUST also be measured! If you compared the score::mw::com implementation (with shared-mem/zero-copy) against some other binding/middleware which isn't zero-copy, you would see BIG differences in latency when the sample data is large! So it only makes sense to measure/compare latency in conjunction with a specific sample data type/size, and only if you really write this data!
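To make the point about step (2) concrete, a provider iteration could look roughly like the sketch below. It reuses the names from this PR (skeleton.dummy_benchmark_data_, Allocate(), Send()), but the payload member and the pointer-style access to the allocated sample are assumptions that need to be adapted to the actual DummyBenchmarkData type.
// Sketch only: the member name `payload` and the pointer-style access to the
// allocated sample are assumptions, not the verified score::mw::com API surface.
auto sample_result = skeleton.dummy_benchmark_data_.Allocate();
if (!sample_result.has_value())
{
    std::cerr << "Allocate() failed - aborting benchmark" << std::endl;
    return;
}
auto sample = std::move(sample_result).value();
// Step (2): actually write the 8192 B payload so that the memory writes become
// part of the measured latency (filled here with the cycle counter as dummy content).
std::fill(sample->payload.begin(), sample->payload.end(), static_cast<std::uint8_t>(cycle));
skeleton.dummy_benchmark_data_.Send(std::move(sample));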
Since the consumer-side call to GetNewSamples() is surely part of the "latency" (without this call "succeeding", no access to the sample is possible), I'm fine with also adding its runtime to the latency measurement. If I wanted to be nitpicky, the measurement/timer would need to be stopped already in the callback/lambda you hand over to GetNewSamples(), because this is the exact point in time where the consumer gets access to the sample ;)
Now the most interesting part of the latency question: when does the consumer side (calling GetNewSamples()) know that there is a new sample and the call makes sense?
Answer: it only knows this in the so-called "event-driven" reception mode, where the consumer has registered a so-called ReceiveHandler at the event instance beforehand, which will fire as soon as the provider side has called Send().
So in "real-world" scenarios, where provider/skeleton and consumer/proxy reside in different processes, you can measure "latency" in a meaningful way only in this "event-driven" mode, and then the whole runtime of this notification path (which in score::mw::com goes through a message-passing side channel) becomes part of the latency! And here, for sure, context switches are involved as well!
So your decision to have the consumer/proxy-side GetNewSamples() call directly next to the Send() call of the provider/skeleton makes perfect sense in that you don't need the additional EventReceiveHandler path ... which is good for estimating the "latency" for users who don't use the "event-driven" mode with registered EventReceiveHandlers but just "poll" for new samples (which is a very common use case!).
Further points to consider
- if you look at your "inner loop" in line 129ff, there is a risk that you measure unintended retry loops in failure cases. I would get rid of this! An error in any of these 3 steps is unexpected in your benchmark setup and should lead to termination of the test.
- having provider and consumer in the same process, while having some benefits for latency measurement (see above), might on the other hand be "unrealistic". In a real-world setup with provider and consumer in different processes, there is the risk that after the context switch back from the provider the caches are invalidated, so that accessing the sample on the consumer side takes longer (higher latency), because the sample data in shared memory must first be loaded into the cache again ...
- also on the consumer side I would then add "code" which really accesses/reads the sample, to take such effects into account.
- for really tight/good measurements without OS overhead (using CPU-specific performance counters/timers), a framework like google-benchmark is really helpful (see the sketch after this list).
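To make the google-benchmark suggestion concrete, here is a minimal, self-contained sketch of how such a micro-benchmark could be wired up. The 8192 B payload write is only a stand-in for the allocate-write-send path; the actual score::mw::com Allocate()/Send() calls would have to be added to the loop body (with the skeleton set up outside of the measured loop).
#include <benchmark/benchmark.h>
#include <array>
#include <cstdint>

// Stand-in for the provider-side "allocate + write + send" path: only the
// 8192 B payload write is measured here; the real Allocate()/Send() calls
// would replace or extend the loop body.
static void BM_WriteSamplePayload(benchmark::State& state)
{
    std::array<std::uint8_t, 8192U> payload{};
    std::uint8_t value = 0U;
    for (auto _ : state)
    {
        payload.fill(value++);                     // the memory writes we want to account for
        benchmark::DoNotOptimize(payload.data());  // keep the writes from being optimized away
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_WriteSamplePayload);

BENCHMARK_MAIN();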
Summary
Again - kudos to you for getting in touch with score::mw::com not only superficially ... but going deeper!
Georg Dadunashvili on our side is our benchmarking dude. I will get in touch with him after his vacation and inform him that there is apparently interest in the community in benchmarks/performance data! Then we will see how we can contribute the stuff we already have.
score/mw/com/benchmark/benchmark.cpp
Outdated
std::mutex cout_mutex{};
score::cpp::latch benchmark_ab_start_point{3}, benchmark_ab_finish_point{3}, init_ab_sync_point{3}, deinit_ab_sync_point{2};
score::cpp::latch benchmark_multi_start_point{1 + kThreadsMultiTotal}, benchmark_multi_finish_point{1 + kThreadsMultiTotal}, init_multi_sync_point{kThreadsMultiTotal}, deinit_multi_sync_point{kThreadsMultiTotal};
const auto instance_specifier_skeleton_a_optional = InstanceSpecifier::Create("benchmark/SkeletonA");
InstanceSpecifier::Create returns a score::Result ... so shouldn't these variables be renamed - xxx_result instead of xxx_optional?
... and why make these variables "global"? There is no need imho. Just create them within main ...
Further: an InstanceSpecifier is something which "identifies/specifies" an instance of a service type, and it is therefore completely independent of the skeleton or proxy side ... so finally I would rename it to:
instance_specifier_instance_a_result
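For illustration, the renamed variables created within main could look roughly like this (a sketch; error handling only indicated, and <cstdlib> assumed for EXIT_FAILURE):
// Inside main(), as suggested above (sketch):
const auto instance_specifier_instance_a_result = InstanceSpecifier::Create("benchmark/InstanceA");
const auto instance_specifier_instance_b_result = InstanceSpecifier::Create("benchmark/InstanceB");
if (!instance_specifier_instance_a_result.has_value() || !instance_specifier_instance_b_result.has_value())
{
    return EXIT_FAILURE;  // creating the instance specifiers is expected to succeed in this setup
}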
Renamed, moved
],
"serviceInstances": [
{
"instanceSpecifier": "benchmark/SkeletonA",
better: "benchmark/InstanceA"
Renamed
]
},
{
"instanceSpecifier": "benchmark/SkeletonB",
better: "benchmark/InstanceB"
Renamed
score/mw/com/benchmark/benchmark.cpp
Outdated
const auto instance_specifier_skeleton_a_optional = InstanceSpecifier::Create("benchmark/SkeletonA");
const auto instance_specifier_skeleton_b_optional = InstanceSpecifier::Create("benchmark/SkeletonB");

score::cpp::optional<std::reference_wrapper<impl::ProxyEvent<DummyBenchmarkData>>> GetBenchmarkDataProxyEvent(
Why this "helper" function? It doesn't add anything "special" and is just bloating the code?
Why not directly accessing proxy.dummy_benchmark_data_ at the various locations instead of calling this func?
Some leftover, removed
}
} while (handle.size() == 0);

auto proxy_result = BenchmarkProxy::Create(std::move(handle.front()));
This I do not get. Why are you creating a proxy/consumer within the "Transmitter" function? Subscribing to/consuming data provided by the skeleton/provider should be the job of the "subscriber" thread, which calls the Subscriber function ...
And for naming:
I would prefer to have either Provider or Sender instead of Transmitter! This is more natural, as the Skeleton is the "Service Provider" and the Proxy is the "Service Consumer".
Or you name it "Sender", because you are actually calling the Send API on the skeleton side ...
Actually, the function was supposed to be named transceiver, which would shed some light on the overall functionality.
There are transceivers and subscribers.
Renamed
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
Why this "starter" logic? So only one provider/skeleton does initially provide/send an event before it gets read/consumed in line 131? Why not in the case of the other provider?
Explained in other comment
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
}
for (std::size_t cycle = 0U; cycle < kIterations; cycle++)
The "testing-sequence" in the loop looks a bit odd?
So basically it is:
- Consumer side: Try to acquire a new sample (new since the last call to GetNewSamples)
- Provider side: Allocate memory for a new sample to be sent.
- Provider side: Send the allocated sample.
The code is imho very confusing, as in some steps (but not all!) you tried to add "error" handling in the sense that you simply repeat the API call in case of an error.
But why aren't you doing this for the last step (Send())? It can also return an error ...
And why do you use a while loop with an empty body for the 1st step, but a do-while loop for the second step?
The while loop with an empty body was an incorrect use of the API - a wrong assumption that receiving 0 samples returns an error.
The workflow is further explained in the other comment.
score/mw/com/benchmark/benchmark.cpp
Outdated
}
#endif
init_ab_sync_point.arrive_and_wait();
const auto benchmark_ab_start_time = std::chrono::steady_clock::now();
Shouldn't benchmark_ab_start_time be taken AFTER benchmark_ab_start_point.arrive_and_wait() ?
Otherwise you might measure the time it takes the other two (?) threads to finally reach benchmark_ab_start_point ... but the measurement should only start after all of them have reached that point?
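A rough sketch of the suggested ordering, using the names from this PR:
// Start the clock only once all participants have reached the start latch:
benchmark_ab_start_point.arrive_and_wait();
const auto benchmark_ab_start_time = std::chrono::steady_clock::now();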
That's right, fixed
score/mw/com/benchmark/benchmark.cpp
Outdated
dummy_data_event.Subscribe(1);
init_multi_sync_point.arrive_and_wait();
benchmark_multi_start_point.arrive_and_wait();
This isn't used at all right now? I.e. no measurements/test output are derived from it?
Some leftover, removed
@crimson11 Thanks for the comprehensive explanation on this topic!
Workflow inside benchmark
For this measurement there is a test that spawns two bidirectional threads.
Latency
The thing called "latency" is assumed to be the time that passes between Send and a successful receive. At the time of sketching some initial code and learning the API, the first approach was to use the "event-driven" method for data reception, but at that time I got (much) worse results than with the current fully synchronous approach. From those runs I found here and there some other points to improve the overall outcomes, so maybe I need to revisit the "event-driven" reception mode one more time with a slightly better understanding of using LoLa.
Further points to consider
Addressing your points:
Summary
Thanks again for your help, I hope this will help legitimize the results of this benchmark in some way. It would be great to add some of your existing benchmarks to this repository. I'm very curious to see what (and even more interesting - how) you are measuring internally and what the results are. Can you propose some simple scenario for another benchmark where LoLa should shine among others? I would be happy to consider a proposal from you to proceed with some additional measurements.
Some replies to your answer:
> As the aim is to measure only zero-copy providers, I intentionally omitted the step of filling real data into the sample. I wanted to completely remove any business logic from the benchmark and measure the raw performance of the libraries.
My point: even IF you compare only zero-copy implementations, this IS an issue! Because the performance greatly depends on WHERE the sample data (in shared memory) is located! And this is decided/controlled by the implementation! If one implementation puts the data samples "densely" in memory while another implementation has an arbitrary layout for where it puts/allocates the samples, this might have a HUGE impact on data locality and cache hits.
So I insist that doing a real data write on the provider side and a data read on the consumer side is essential to rule such things out.
> That's fully intended, to measure retry time in failure cases. My assumption is that I gave the IPC provider enough resources to execute the full test. If any thread is too fast/too slow to allocate, send or receive data then, well, that's part of the latency.
See my comments: in some API calls the returned errors will most likely never be healed ... so I think tracing/logging these errors makes sense ... otherwise someone running this test in some different environment will never detect that he measures "bollocks" :)
General Feedback
Now that I have "understood" your Transceiver logic, I have the following to say:
You are massively including scheduling behaviour into your latency measurement.
I.e.
Each of your Transceiver threads does high-frequency polling on whether the other Transceiver thread has sent a new sample. If one of the (producing) Transceivers gets de-scheduled, the other Transceiver thread will be stuck in its GetNewSamples() loop for a very long time, until the producer gets re-scheduled and is able to send a new sample.
So this will add massively to the latency ... in this case almost all of the latency is scheduling-inflicted ...
I guess, you are aware of this?
So to rule such things out, some combined "micro-benchmarking" would eventually be more "spot-on": micro-bench the allocate-write-send path separately, and then the GetNewSamples-access-sample path separately, calling it at a point in time where you know that a new accessible sample is there. Then add up those two "micro-benched" paths and you have a raw latency figure ... (a rough sketch of this idea follows below)
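The following sketch illustrates that split with plain std::chrono. The two callables are placeholders: the first would wrap the actual allocate + write + Send() calls, the second would wrap GetNewSamples() plus reading the sample; the function name and structure are illustrative only.
#include <chrono>
#include <cstddef>
#include <iostream>

// Measures both paths in an interleaved fashion, so that the consumer path is
// always invoked when a freshly sent sample is known to be available.
template <typename ProviderPath, typename ConsumerPath>
void MeasureRawLatency(ProviderPath&& provider_path, ConsumerPath&& consumer_path, const std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    clock::duration provider_total{0};
    clock::duration consumer_total{0};

    for (std::size_t i = 0U; i < iterations; ++i)
    {
        const auto t0 = clock::now();
        provider_path();   // e.g. Allocate() + write payload + Send()
        const auto t1 = clock::now();
        consumer_path();   // e.g. GetNewSamples() + read the sample data
        const auto t2 = clock::now();
        provider_total += t1 - t0;
        consumer_total += t2 - t1;
    }

    const auto to_ns = [iterations](const clock::duration d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count() / iterations;
    };
    std::cout << "provider path: " << to_ns(provider_total) << " ns/iteration, consumer path: "
              << to_ns(consumer_total) << " ns/iteration, raw latency estimate: "
              << (to_ns(provider_total) + to_ns(consumer_total)) << " ns/iteration" << std::endl;
}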
auto id = std::this_thread::get_id();
auto native_handle = *reinterpret_cast<std::thread::native_handle_type*>(&id);

int max_priority = sched_get_priority_max(SCHED_RR);
I guess this (POSIX) func needs a specific include -> <sched.h> ?
int max_priority = sched_get_priority_max(SCHED_RR);
struct sched_param params;
params.sched_priority = max_priority;
if (pthread_setschedparam(native_handle, SCHED_RR, &params))
Don't we need include <pthread.h> for this?
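Something along these lines near the top of benchmark.cpp should cover both calls (on a Linux/glibc build; note that the CPU_* macros and pthread_setaffinity_np additionally require _GNU_SOURCE to be defined):
#include <pthread.h>  // pthread_setschedparam, pthread_setaffinity_np
#include <sched.h>    // SCHED_RR, sched_get_priority_max, sched_param, cpu_set_t, CPU_ZERO, CPU_SET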
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
if (pthread_setaffinity_np(native_handle, sizeof(cpu_set_t), &cpuset))
I guess this is Linux-specific (hence the "non-portable" / _np suffix). Then you should adjust the bazel BUILD accordingly, annotating that the binary is "Linux"-only (since a major/typical goal in score is imho to achieve QNX compatibility).
return;
}

ServiceHandleContainer<impl::HandleType> handle{};
As it is a container (it could even contain multiple handles), I would prefer to rename:
handle -> handle_container
or at least:
handle -> handles
while (true) {
    auto result = dummy_data_event.GetNewSamples((
        [](SamplePtr<DummyBenchmarkData> sample) noexcept {
            std::ignore = sample;
        }),1);
    if (result.has_value())
    {
        if (result.value() == 0)
        {
            continue;
        } else {
            break;
        }
    }
};
Hm. This is still "odd" in case the call to GetNewSamples() gives you an error. IF it gives you an error, it will most likely indicate an issue which won't self-heal! But then you have an endless loop here.
So my proposal:
while (true) {
    auto result = dummy_data_event.GetNewSamples((
        [](SamplePtr<DummyBenchmarkData> sample) noexcept {
            std::ignore = sample;
        }),1);
    if (!result.has_value())
    {
        std::cerr << "Transceiver: GetNewSamples() failed: " << result.error() << "!" << std::endl;
        break;
    }
    if (result.value() == 0)
    {
        continue;
    } else
    {
        break;
    }
};
}
};
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
Here you have the "same" problem as above. When you have a problem doing the allocation ... there is a likelihood that this problem does not self-heal and you end up in an endless loop.
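In the same spirit as the GetNewSamples() proposal above, the allocation could bail out instead of retrying forever. A sketch (assuming the Allocate() result exposes error() like the other score::Result values in this file):
auto sample_result = skeleton.dummy_benchmark_data_.Allocate();
if (!sample_result.has_value())
{
    std::cerr << "Transceiver: Allocate() failed: " << sample_result.error() << "!" << std::endl;
    return;  // an allocation error is unexpected in this setup and will most likely not self-heal
}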
do {
    sample_result = skeleton.dummy_benchmark_data_.Allocate();
} while (!sample_result.has_value());
skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());
Send may fail ... and you don't have any error handling ... if you don't want to do error handling ... then you should at least be explicit about it by ignoring the return value of Send() ...
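A sketch of making that explicit (assuming Send() returns a result type, as implied above):
// Explicitly ignore the result of Send() to document that errors are intentionally not handled here:
std::ignore = skeleton.dummy_benchmark_data_.Send(std::move(sample_result).value());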
std::thread transceiverA(Transceiver, cpu++, true, std::ref(instance_specifier_instance_a), std::ref(instance_specifier_instance_b));
std::thread transceiverB(Transceiver, cpu++, false, std::ref(instance_specifier_instance_b), std::ref(instance_specifier_instance_a));
#if kSubscribers > 0
Why do you still have the additional subscriber threads in here? They are not explicitly taken into account by interacting with specific timing checkpoints?
So they run "in the background", creating CPU load, and because they also access the provided service instances, they potentially also affect the Transceivers (working on the same control data structures).
So is this exactly the idea? To check how an additional number of consumers might affect the "latency" of a given communication? If so -> that should be explicitly documented.
Implementation of a publish-subscribe benchmark for LoLa.