Convert bcd -> ascii after shuffling. Gain 3% #112
Conversation
|
wow, seems like our testing is too limited: I'm surprised no entry in @Antares0982's data sample triggered that :O
|
I fixed the correctness issue and added a fairly low-power test. It now takes 8.08ns in the benchmark, which is still a gain, and maybe there's an easy follow-up that recovers the remainder, but I can't see it right now. It seems like one can do magic to the mask during the length evaluation, but I don't see the correct and fast combination just yet ...
|
The reason I use ctz32 is that starting the computation from unshuffled_bcd reduces certain data dependencies, which I expect will increase instruction-level parallelism. If the computation instead starts from bcd, then the operations along the path _mm_cmpgt_epi8, _mm_movemask_epi8, and clz must wait for _mm_shuffle_epi8 to complete.
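To make the dependency argument concrete, here is a minimal sketch of the two orderings; names like `unshuffled_bcd` and the exact length arithmetic are illustrative, not zmij's actual code:

```cpp
#include <immintrin.h>

// Variant A: the length mask is derived from the unshuffled BCD vector, so the
// cmpgt / movemask / ctz chain can issue in parallel with _mm_shuffle_epi8.
inline int length_from_unshuffled(__m128i unshuffled_bcd, __m128i shuffle_mask,
                                  __m128i* out_digits) {
  *out_digits = _mm_shuffle_epi8(unshuffled_bcd, shuffle_mask);
  int mask = _mm_movemask_epi8(
      _mm_cmpgt_epi8(unshuffled_bcd, _mm_setzero_si128()));
  return 16 - __builtin_ctz(mask | 0x10000);  // illustrative length arithmetic
}

// Variant B: the mask is derived from the shuffled vector, so cmpgt, movemask
// and clz all have to wait for the shuffle to complete first.
inline int length_from_shuffled(__m128i unshuffled_bcd, __m128i shuffle_mask,
                                __m128i* out_digits) {
  __m128i bcd = _mm_shuffle_epi8(unshuffled_bcd, shuffle_mask);
  *out_digits = bcd;
  int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(bcd, _mm_setzero_si128()));
  return 32 - __builtin_clz(mask | 1);  // illustrative length arithmetic
}
```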
|
Should we concatenate the conversion_table and the shuffle_table into a single table, and then read 16*2 bytes directly starting at offset
|
As for ctz vs clz, the origin of this patch is that I thought I could get rid of or'ing something into the mask this way, reducing the number of μops. The dependency chain doesn't seem all that important given that the buffer needs to be committed to memory, and no exponent needs to be inserted either in the fixed case. Anyway, I'm happy to go whichever way turns out to be faster. I thought about concatenating the lists, but didn't try it, both because I wanted to be able to flip back easily between the version with the 32-byte array that I outlined above and this one, and because the consecutive read shouldn't be faster if both are in L1 cache (even with -march=x86-64-v4 it doesn't appear to be optimized to a single load, but the compiler may consider other things); instead you get two address calculations instead of one. But yeah, let me try to benchmark that tomorrow. I've been surprised too often not to believe in experiments.
|
I pushed to zmij-playground; you can try the script. On my machine this PR is not always faster (even with ctz). I tried 3 times; the ctz version is always slightly faster.
|
I ran five more times, including the concatenate variant in the tests. The ctz variant is not necessarily better than the current PR. Overall, it seems unable to surpass the current branch point (I'm using
|
I also added a convenience Python script to quickly replace the zmij sources inside zmij-playground, so you can iteratively edit and test which approach yields better performance. The script is located at The script automatically inserts
|
@Antares0982 nice, I've been messing around with it some more, and I cannot find a short incantation that uses In my experiments, the concatenated version tended to come out slower. I tried an intermingled array and a two-component struct, which yield different code (the former performing minimally better). In both cases neither clang nor gcc made an effort to keep the two loads close to each other, which is consistent with what we see for the
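Roughly, the two layouts being compared look like this (names and element counts invented for illustration, contents omitted):

```cpp
// (a) "intermingled" array: each case's shuffle mask and conversion constant
//     alternate, so entry 2*i and entry 2*i+1 sit in adjacent 16-byte slots.
alignas(32) const unsigned char intermingled_tables[17 * 2][16] = {};

// (b) two-component struct: the same adjacency, expressed as one struct per case.
struct table_pair {
  unsigned char shuffle[16];
  unsigned char convert[16];
};
alignas(32) const table_pair paired_tables[17] = {};
```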
|
BTW, talking about benchmarking: according to Intel's intrinsics website
sse4-improve-ctz-concatenate |
vitaut left a comment:
Thanks for the PR! Overall looks good but please rebase your changes.
Thanks, I hope to get to it during the course of the week, but it might take until next week. I think going through
38cd652 to 6b8a4da
|
I'm back on this, now that the NEON changes and the ensuing restructuring are done. I finally added clang to my setup, and surprisingly on @Antares0982's benchmark I find gcc faster -- and less improved with my updated version of the patch. With clang the patch amounts to a 3% improvement, making it almost competitive with gcc. The fastest form appears to be one which interleaves the (previously introduced) shuffle tables and the tables with the point insertions. Conveniently, this also seems to be advantageous in non-synthetic benchmarks with memory pressure, as the two constants can always be made to load from the same cache line, i.e. in a single read from the non-L1 world. Before I resubmit, I still want to see if I can expand this to NEON: the table is shared with the NEON code, so it means fewer ifdefs if it works there as well.
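Concretely, the interleaving I have in mind looks roughly like this (invented names; the real tables and index math live in the patch):

```cpp
#include <immintrin.h>

// Each case's shuffle control and its point-insertion constant sit back to back
// in one 32-byte, 32-byte-aligned slot, so the pair never straddles a 64-byte
// cache line: one index computation, one cache line, two 16-byte loads.
struct alignas(32) interleaved_entry {
  unsigned char shuffle[16];       // shuffle control bytes (as before)
  unsigned char point_insert[16];  // '0' everywhere, '.' at the point position
};
extern const interleaved_entry interleaved_tables[];

inline void load_pair(int n, __m128i* shuf, __m128i* dots) {
  *shuf = _mm_load_si128(
      reinterpret_cast<const __m128i*>(interleaved_tables[n].shuffle));
  *dots = _mm_load_si128(
      reinterpret_cast<const __m128i*>(interleaved_tables[n].point_insert));
}
```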
c98938e to b38b1a0
Let me know when it's ready for review. Note that the shuffle table is now generated at compile time so you'll need to do the same for the new table.
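One possible shape for such a compile-time table, purely as a sketch (names, dimensions, and contents here are illustrative, not the actual generator in zmij):

```cpp
#include <array>

// Build the per-case point-insertion vectors at compile time, the same way the
// shuffle table is generated: case p is '0' everywhere except '.' in byte p.
constexpr std::array<std::array<char, 16>, 17> make_point_insert_tables() {
  std::array<std::array<char, 16>, 17> t{};
  for (int p = 0; p < 17; ++p)
    for (int i = 0; i < 16; ++i)
      t[p][i] = (i == p) ? '.' : '0';  // case 16: no point within the 16 bytes
  return t;
}
inline constexpr auto point_insert_tables = make_point_insert_tables();
```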
3244f65 to f032301
51b16f0 to 868df15
VTune says the problem is back-end bound and not related to memory. I'm using Intel
|
By adding each flag that x86-64-v3 has and avx2 does not, I found the problem comes from BMI2. Adding -mbmi2 causes the performance collapse.
|
I didn't know "back-end bound" excludes _mm_storeu here, but it wasn't hard to do the experiment. We don't use any of the BMI2 instructions explicitly, and also not implicitly AFAICT. It's almost by definition a compiler bug, but it seems completely wild that something like this could affect AMD and Intel in the same way, especially as the assembly is almost the same except for the placement of a few scalar instructions. The only thing I can see that seems vaguely plausible as a reason is that the bad case has three address calculations in a row, two as part of multiplications. I know nothing about CPU ports and such, but there being a limitation seems consistent with this microarchitecture picture that I found on Wikipedia. According to this, port 0 will be quite busy.
|
After enabling BMI2, clang emitted about 30
|
BTW I tried Lemire's AVX512 code from https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/ and it is indeed faster -- but again clang is significantly slower than gcc. The code gains almost 0.2ns with gcc, pushing it under 8ns, but with clang it takes ~11ns (instead of 14ns with our code). I don't think the code has much practical use, it requires
As another aside, I tried dozens of variants, and within SSE4.2 (i.e.
Anyway, will try to update the patch tonight!
Seems xjb has already implemented the avx512vbmi optimization: https://github.com/xjb714/xjb/blob/afe97d6552a5028ff4512e9678c8606a45c7a9f5/src/ftoa.cpp#L1031
BTW @xjb714 has listed a table of AVX512 support
|
Right, that is Lemire's code as well. The table is nice. It seems like I'm in agreement with xjb about the performance characteristics. If only there were a way to evaluate the length / count the zeros faster. pcmpistri applied after the shuffle seems to be the biggest improvement I can create, but only on Zen 4.
3b60f87 to bc893e0
|
I updated the patch.
|
Mostly looks good. Just two minor comments and please rebase your changes.
- early return needs no else
- incomprehensible comment improved
85dc397 to 8a4dc23
|
I rebased and updated per your comments. I also finally had time to try doing the same on NEON. It is, according to my experiments, not an improvement there.
|
PS: I just did some experiments after reading a bit more about unaligned reads -- the internet says that unaligned reads have become penalty-free with AVX2 microarchitectures. This didn't match my experience when preparing this patch, where the simple and memory-efficient way to obtain the point-and-zeros mask by doing
This was a bit of a surprise, but then I read the above sentence again and thought "please, please, don't tell me that one needs to use VEX-encoding to get an unpenalized unaligned load", and well -- with -mavx2 the performance difference seems to be gone and this memory-efficient version even appears a bit faster. The wonders of x86 never cease. Given the performance degradations with clang, I'm not sure adding an x86-64-v3 path is worth pursuing for some tiny gain. (I also had some fun ideas for AVX2 paths, but while intellectually rewarding, it is not emotionally rewarding: none of them is faster than the SSE4.1 path, so I'm not tempted to add more preprocessor variables.)
|
Merged, thank you!

This takes care of the point insertion automatically, and we can also include the point in the length estimation automatically. This is a follow-up to #110.
The observation behind this patch is that the shuffle leaves NULs where the point goes, and we know where it went. If we convert to digits after the shuffle, we can, on the one hand, insert the point while converting to string. On the other hand, since we want to include the point in the size evaluation and want to treat it exactly like a zero (any trailing sequence of zeros and the decimal point should not be included in the result), we can just include it in the length evaluation with no further logic.
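As a rough sketch in code (invented names; the length arithmetic is only illustrative and the real code differs in the details):

```cpp
#include <immintrin.h>

// digits: shuffled digit values 0..9, with a NUL (0) byte in the slot where the
// decimal point belongs and NULs after the last significant digit.
// point_insert: '0' in every byte except '.' in the point slot (from a table).
inline int convert_and_measure(__m128i digits, __m128i point_insert, char* out) {
  // Converting after the shuffle: one add turns digit values into ASCII and at
  // the same time drops '.' into its NUL slot (0 + '.' == '.').
  __m128i ascii = _mm_add_epi8(digits, point_insert);
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), ascii);

  // Length evaluation: the point slot holds 0, exactly like a trailing zero
  // digit, so one cmpgt/movemask pass drops trailing zeros and a trailing '.'.
  int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(digits, _mm_setzero_si128()));
  return 32 - __builtin_clz(mask | 1);  // illustrative length arithmetic
}
```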
With @Antares0982's benchmark on my AMD Ryzen (Zen 4 core):
This is with clang. With gcc it's performance neutral on the benchmark in zmij, and I didn't dare run it with Antares' benchmark. I think it's worse because it adds a jump for the final ternary, but I was not able to prove that hypothesis with the aid of select_less.
For those that still remember that I wanted to reduce the number of tables:
I tried const char conversion_table[32] = { <16 '0's>, '.', <15 '0's> } and then used an unaligned load from that array to get a sequence of zeros with the point inserted in the right place. This worked fine, but I had to move it far up in the function and add a memory clobber for gcc to load it early enough to not see a performance penalty (clang does the right thing by itself), and since we're not optimizing for size the current approach seemed simpler.
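Spelled out, that variant looked roughly like this (a reconstructed sketch, not the exact code that was benchmarked):

```cpp
#include <immintrin.h>

// 16 '0's, then '.', then 15 '0's. An unaligned load taken at the right offset
// yields a vector of '0's with the '.' already sitting in the point slot.
// alignas(64) keeps the whole 32-byte table inside a single cache line.
alignas(64) const char conversion_table[32] = {
    '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
    '.', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'};

// point_pos: byte index (0..15) where the decimal point should land;
// lane point_pos of the result is '.', every other lane is '0'.
inline __m128i point_and_zeros(int point_pos) {
  return _mm_loadu_si128(
      reinterpret_cast<const __m128i*>(conversion_table + 16 - point_pos));
}
```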