Specialise for u64 and particular digit counts of u128 #26
Conversation
Hey, thanks for looking into this and for your work! I also looked at the base PR but didn't have much time to review. I did run the benchmarks on my machine (AMD Ryzen 9 5950X) and I'm getting no improvements at all. May I know the hardware on which you benched this? I suspect this may be something LLVM is already optimizing well for x86 but not as well for ARM. Here are my results:

I'll try to figure out why this is happening when I get more time.
Ok, I just looked a bit more and it looks like

This is all guesstimated and I've yet to actually measure anything.
divq on Intel is definitely worth using if we know at compile time that the result fits in a u64; I think we only know that once digits <= 20. In master, we can use divq for the second u128 division, and presumably the udivmodti4 impl is clever enough to do so, but we can't use it for the first u128 division. This PR is largely focussed on separating out the digit counts so we can use different optimisations, and also on having fewer sequential divisions so that the CPU can pipeline better.

I made a new branch on restatedev

I've been running on an M2 Mac, but I've just tried it on an Intel AWS node (

It's weird that I get 31ns on that machine, faster than your 37ns for standard_buf_random, despite your significantly better CPU! AMD vs Intel?

For completeness, see here the comparison on my Mac of

So it's 2x faster on ARM than on x86. Maybe we can get the x86 even better.
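To make the digit-count idea concrete, here is a minimal sketch based on my reading of the comment above (not the crate's actual code, and `format_u128` is a hypothetical name): a value that fits in a u64 never touches u128 arithmetic, and a larger value peels off its low 19 digits with a single u128 division whose remainder is guaranteed to fit in a u64.

```rust
// A minimal sketch of the specialisation idea, not the crate's actual code.
fn format_u128(n: u128) -> String {
    // Fast path: values that fit in u64 need no u128 arithmetic at all.
    if let Ok(small) = u64::try_from(n) {
        return small.to_string();
    }
    // 10^19 is the largest power of ten that fits in a u64, so the
    // remainder of this single u128 division always fits in a u64.
    const TEN_POW_19: u128 = 10_000_000_000_000_000_000;
    let hi = n / TEN_POW_19;
    let lo = (n % TEN_POW_19) as u64;
    // Pad the low chunk to exactly 19 digits so no leading zeros are lost.
    format!("{}{:019}", hi, lo)
}
```

Whether LLVM lowers that division by a constant to a divq, a reciprocal multiply, or a call to the u128 division intrinsic is exactly the kind of thing the benchmarks here are probing.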
My bad, I figured out my issue. I realized I had

Thanks!
I'm relieved to hear that!
Merging this as-is. Thanks again!
published in |
Currently based on top of #25
Particular sizes of input need different numbers of expensive u128 operations, so by specialising to those sizes we can speed them up. Most notably, u64 inputs need no u128 operations at all! We can also do the u64 operations in parallel (4 u32 at a time) when dealing with >= 20-digit outputs (see the sketch after the benchmark note below).
These benchmark results are compared against #25, so they compound on the deltas there.
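As a rough sketch of the "4 u32 at a time" idea (my own reading of the description above, not the crate's code; `split_into_u32_chunks` is a hypothetical name): a u64, which has at most 20 decimal digits, can be split into four 5-digit u32 chunks with a short division tree rather than a long sequential chain.

```rust
// A rough sketch of splitting a u64 (at most 20 decimal digits) into four
// u32 chunks of 5 digits each; not the crate's actual code.
fn split_into_u32_chunks(n: u64) -> [u32; 4] {
    const P10: u64 = 10_000_000_000; // 10^10
    const P5: u64 = 100_000;         // 10^5

    // Level 1: a single division splits the value into two 10-digit halves.
    let hi = n / P10;
    let lo = n % P10;

    // Level 2: these two divisions operate on independent values, so they do
    // not form a sequential dependency chain and the CPU can overlap them.
    [
        (hi / P5) as u32,
        (hi % P5) as u32,
        (lo / P5) as u32,
        (lo % P5) as u32,
    ]
}
```

Each resulting chunk is below 10^5, so the remaining per-chunk digit extraction can run on cheap u32 arithmetic, and the two halves of the tree are free to execute in parallel.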