
Convert bcd -> ascii after shuffling. Gain 3% #112

Merged

vitaut merged 10 commits into vitaut:main from TobiSchluter:sse4_even_faster on Apr 4, 2026

Conversation

@TobiSchluter (Contributor) commented Mar 5, 2026

This takes care of the point insertion automatically, and the point can be included in the length estimation with no extra work. This is a follow-up to #110.

The observation behind this patch is that the shuffle leaves NULs where the point goes, and we know where it went. If we convert to digits after the shuffle, we can insert the point while converting to string; and since we want the point included in the size evaluation and treated exactly like a zero (any trailing sequence of zeros and decimal point should not be included in the result), it simply falls into the length evaluation with no further logic.
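A minimal sketch of the mechanism, with assumed names and layout (an illustration of the idea, not zmij's exact code):

#include <immintrin.h>

// bcd: one digit value (0..9) per byte; shuffle routes the digits into
// output order and leaves a 0x00 byte in the decimal-point slot.
// ascii_or_point: 0x30 ('0') in every byte except 0x2e ('.') at the point.
inline int write_digits(__m128i bcd, __m128i shuffle, __m128i ascii_or_point,
                        char* out) {
  __m128i shuffled = _mm_shuffle_epi8(bcd, shuffle);
  // A single OR both converts to ASCII (d | 0x30 == '0' + d for d in 0..9)
  // and inserts the point (0x00 | 0x2e == '.').
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                   _mm_or_si128(shuffled, ascii_or_point));
  // The point slot is 0x00, so it fails the > 0 test exactly like a zero
  // digit: trailing zeros and a trailing point fall to the same mask.
  int nonzero = _mm_movemask_epi8(_mm_cmpgt_epi8(shuffled, _mm_setzero_si128()));
  // Index of the last significant byte + 1. (As discovered further down in
  // this thread, fixed mode also needs a lower bound so that zeros before
  // the point survive.)
  return 32 - __builtin_clz(nonzero | 1);
}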

With @Antares0982's benchmark on my AMD Ryzen (Zen 4 core):

before:
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  C              459660000 calls       5518.78 ms total     12.01 ± 0.42 ns/call  (sink=7896585000)
  C++            459660000 calls       3802.74 ms total      8.27 ± 0.37 ns/call  (sink=7896585000)
  Rust           459660000 calls       4191.17 ms total      9.12 ± 0.36 ns/call  (sink=7896789000)

after:
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  C              459660000 calls       5580.04 ms total     12.14 ± 0.43 ns/call  (sink=7896585000)
  C++            459660000 calls       3507.83 ms total      7.63 ± 0.24 ns/call  (sink=7896569700)
  Rust           459660000 calls       4220.10 ms total      9.18 ± 0.37 ns/call  (sink=7896789000)

This is with clang. With gcc it's performance-neutral on the benchmark in zmij, and I didn't dare run it with Antares's benchmark. I suspect it's worse because it adds a jump for the final ternary, but I was not able to prove that hypothesis with the aid of select_less.

For those who still remember that I wanted to reduce the number of tables:

  1. We're explicitly not optimizing for size here, and
  2. more interestingly, I tried using an array of the form const char conversion_table[32] = { <16 '0's>, '.', <15 '0's> } and an unaligned load from it to get a sequence of zeros with the point inserted in the right place (see the sketch below). This worked fine, but I had to move the load far up in the function and add a memory clobber for gcc to issue it early enough to avoid a performance penalty (clang does the right thing by itself); since we're not optimizing for size, the current approach seemed simpler.
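A sketch of that rejected variant, with assumed names (index_of_point being the point's position in the output; not the exact code that was tried):

static const char conversion_table[32] = {
    '0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0',
    '.','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0'};
// The unaligned load places '.' at byte index_of_point and '0' everywhere
// else; OR-ing this into the shuffled digits converts and inserts the point.
__m128i zeros_with_point = _mm_loadu_si128(
    reinterpret_cast<const __m128i*>(conversion_table + 16 - index_of_point));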

@TobiSchluter changed the title from "Convert bcd -> ascii after shuffling. Gain 10%" to "Convert bcd -> ascii after shuffling. Gain 8%" on Mar 5, 2026
@TobiSchluter (Contributor, Author) commented Mar 5, 2026

Wow, seems like our testing is too limited:
auto end = zmij::detail::write(2410., buffer);
prints 241: https://godbolt.org/z/7G6q4MTWo

I'm surprised no entry in @Antares0982's data sample triggered that :O
Good thing I thought a bit more about what I wrote: in fixed mode, zeros before the decimal point should remain.
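A minimal regression check mirroring the godbolt snippet (the expected output "2410" is an assumption based on the description above; the buggy build produced "241"):

#include <cassert>
#include <string>

char buffer[32];
auto end = zmij::detail::write(2410., buffer);
assert(std::string(buffer, end) == "2410");  // buggy build: "241"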

@TobiSchluter (Contributor, Author) commented Mar 5, 2026

I fixed the correctness issue and added a fairly low-power test. It now takes 8.08 ns in the benchmark, which is still a gain, and maybe there's an easy follow-up that recovers the remainder, but I can't see it right now. It seems like one could do magic to the mask during the length evaluation, but I don't see the correct and fast combination just yet ...

@Antares0982 (Contributor)

The reason I use ctz32 is that starting the computation from unshuffled_bcd reduces certain data dependencies, which I expect to increase instruction-level parallelism. If the computation instead starts from bcd, then the operations along the path _mm_cmpgt_epi8, _mm_movemask_epi8, and clz must wait for _mm_shuffle_epi8 to complete.
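Roughly, the two dependency chains look like this (a sketch with assumed names; ctz32 is the playground's trailing-zero helper, shown here as __builtin_ctz, with the or 0x10000 cap visible in the generated assembly below):

__m128i bcd = _mm_shuffle_epi8(unshuffled_bcd, shuffle);   // chain A
int mask = _mm_movemask_epi8(                              // chain B:
    _mm_cmpgt_epi8(unshuffled_bcd, _mm_setzero_si128()));  // independent of A
int tz  = __builtin_ctz(mask | 0x10000);  // trailing zeros, capped at 16
int len = 16 - tz;                        // significant digits; never waits on A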

@Antares0982 (Contributor) commented Mar 5, 2026

Should we concatenate the conversion_table and the shuffle_table into a single table, and then read 2×16 bytes directly starting at offset idx_point*2 (counting in 16-byte entries)? I believe this would be more memory-efficient.
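A possible layout for this (an assumption sketched from the description, not the playground code): one 32-byte entry per point position, so both 16-byte constants sit on the same cache line.

struct alignas(32) table_entry {
  unsigned char shuffle[16];  // pshufb control for this point position
  unsigned char ascii[16];    // 0x30 everywhere, 0x2e at the point slot
};
extern const table_entry combined_table[17];  // hypothetical, one per position

const table_entry& e = combined_table[idx_point];
__m128i ctrl  = _mm_load_si128(reinterpret_cast<const __m128i*>(e.shuffle));
__m128i ascii = _mm_load_si128(reinterpret_cast<const __m128i*>(e.ascii));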

@TobiSchluter (Contributor, Author)

As for ctz vs. clz: the origin of this patch is that I thought I could get rid of OR'ing something into the mask this way, reducing the number of μops. The dependency chain doesn't seem all that important given that the buffer needs to be committed to memory, and no exponent needs to be inserted in the fixed case either. Anyway, I'm happy to go whichever way turns out to be faster.

I thought about concatenating the tables but didn't try it: both because I wanted to be able to flip back easily between this version and the 32-byte-array version outlined above, and because the consecutive read shouldn't be faster if both tables are in L1 cache (even with -march=x86-64-v4 it doesn't appear to be optimized to a single load, though the compiler may consider other things); the difference is mainly two address calculations versus one. But yeah, let me try to benchmark it tomorrow. I've been surprised too often not to believe in experiment.

@Antares0982 (Contributor)

I pushed to zmij-playground, sse4 branch: https://github.com/Antares0982/zmij-playground/tree/sse4. A ctz version is also added.

You can try the ./compare.sh script, which uses the any_dtoa_benchmark:

#!/usr/bin/env bash
./tools/compile_cpp_sse4.sh ./zmij-cpp/zmij.cc zmijcpp
./tools/compile_cpp_sse4.sh ./sse4-improve/zmij.cc sse4_even_faster
./tools/compile_cpp_sse4.sh ./sse4-improve-ctz/zmij.cc sse4_even_faster_ctz
./build/any_dtoa_benchmark ./build/libs/libzmijcpp.so:zmijcpp_detail_write_double:skip ./build/libs/libsse4_even_faster.so:zmijcpp_detail_write_double:skip ./build/libs/libsse4_even_faster_ctz.so:zmijcpp_detail_write_double:skip

On my machine this PR is not always faster (even with ctz). I tried three times:

=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4441.01 ms total      9.66 ± 0.29 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4470.56 ms total      9.73 ± 0.45 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4454.41 ms total      9.69 ± 0.08 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4413.96 ms total      9.60 ± 0.13 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4452.24 ms total      9.69 ± 0.10 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4440.39 ms total      9.66 ± 0.08 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4434.02 ms total      9.65 ± 0.36 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4445.42 ms total      9.67 ± 0.24 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4390.22 ms total      9.55 ± 0.08 ns/call  (sink=7896585000)

The ctz version is always slightly faster.

@Antares0982 (Contributor)

I ran five more times, including the concatenate variant in the tests. The ctz variant is not necessarily better than the current PR, and overall neither seems able to surpass the branch point (I'm using 83ef806018de89a4cfe83bb80361c986310a8725).

=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4421.54 ms total      9.62 ± 0.05 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4465.33 ms total      9.71 ± 0.17 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4472.03 ms total      9.73 ± 0.23 ns/call  (sink=7896585000)
  sse4_even_faster_ctz-concatenate   459660000 calls       4463.80 ms total      9.71 ± 0.16 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4419.27 ms total      9.61 ± 0.07 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4465.22 ms total      9.71 ± 0.21 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4454.68 ms total      9.69 ± 0.16 ns/call  (sink=7896585000)
  sse4_even_faster_ctz-concatenate   459660000 calls       4450.45 ms total      9.68 ± 0.08 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4404.35 ms total      9.58 ± 0.16 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4455.63 ms total      9.69 ± 0.10 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4463.05 ms total      9.71 ± 0.18 ns/call  (sink=7896585000)
  sse4_even_faster_ctz-concatenate   459660000 calls       4461.70 ms total      9.71 ± 0.07 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4426.47 ms total      9.63 ± 0.15 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4455.63 ms total      9.69 ± 0.13 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4449.94 ms total      9.68 ± 0.06 ns/call  (sink=7896585000)
  sse4_even_faster_ctz-concatenate   459660000 calls       4470.17 ms total      9.72 ± 0.19 ns/call  (sink=7896585000)
=== double benchmark (5000 rounds × 91932 values, 100 warmup) ===
  zmijcpp                    459660000 calls       4437.20 ms total      9.65 ± 0.32 ns/call  (sink=7896585000)
  sse4_even_faster           459660000 calls       4476.82 ms total      9.74 ± 0.33 ns/call  (sink=7896585000)
  sse4_even_faster_ctz       459660000 calls       4453.59 ms total      9.69 ± 0.16 ns/call  (sink=7896585000)
  sse4_even_faster_ctz-concatenate   459660000 calls       4430.14 ms total      9.64 ± 0.11 ns/call  (sink=7896585000)

@Antares0982 (Contributor)

I also added a convenience Python script to quickly transplant the zmij source into zmij-playground, so you can iteratively edit and test which approach yields better performance. The script is located at ./tools/replace_zmij.py and is used like this:

./tools/replace_zmij.py ../zmij/zmij.cc -o ./zmij-cpp/zmij.cc

The script automatically inserts ZMIJ_INLINE at the appropriate locations and comments out explicit instantiations to avoid an extra call or a tail-call jmp at the FFI boundary.

@TobiSchluter (Contributor, Author)

@Antares0982 Nice, I've been messing around with it some more, and I cannot find a short incantation that uses clz, so ctz + logic for "point within zeros" indeed seems more attractive. At least if one cares about correctness :D

In my experiments, the concatenated version tended to come out slower. I tried an interleaved array and a two-component struct, which yield different code (the former performing minimally better). In both cases neither clang nor gcc made an effort to keep the two loads close to each other, which is consistent with what we see for the sse_consts struct.

@TobiSchluter (Contributor, Author) commented Mar 6, 2026

BTW, talking about benchmarking: according to Intel's intrinsics guide, _mm_cmpgt_epi8 should have a half-cycle reciprocal throughput. I tried replacing
__m128i mask128 = _mm_cmpgt_epi8(bcd, _mm_setzero_si128());
with the (here) equivalent
__m128i mask128 = _mm_add_epi8(bcd, _mm_set1_epi32(0x7f7f7f7f));
which has only a 0.33-cycle reciprocal throughput, provided the constant is loaded early enough. I wasn't able to confirm a performance advantage though. Maybe your high-tech benchmarking can see it? There are of course several reasons why the documented advantage may not materialize (the special handling of the pxor xmm0, xmm0 zeroing idiom being the main one).
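For reference, the replacement is only valid because nothing but the sign bits feed pmovmskb and BCD bytes are confined to 0..9 (a sketch, assuming the variable names above):

// 0 + 0x7f = 0x7f (sign clear); d + 0x7f >= 0x80 for d in 1..9 (sign set),
// so the movemask results agree bit-for-bit with the pcmpgtb version.
__m128i via_cmp = _mm_cmpgt_epi8(bcd, _mm_setzero_si128());
__m128i via_add = _mm_add_epi8(bcd, _mm_set1_epi32(0x7f7f7f7f));
// _mm_movemask_epi8(via_cmp) == _mm_movemask_epi8(via_add) for BCD input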

@Antares0982 (Contributor)

  1. _mm_add_epi8 does not help on my machine.
  2. I tried many combinations, but it is still slower than the branch point 83ef80. I compared the two assemblies; sse4-improve-ctz-concatenate only saves one instruction. I have no idea how to improve it now.

83ef80

movdqa	112(%rax), %xmm1
por	%xmm0, %xmm1
leal	(%rcx,%rdi), %eax
shlq	$4, %rax
leaq	_ZN12_GLOBAL__N_125double_sse4_shuffle_tableE(%rip), %rdx
movdqa	%xmm1, %xmm2
pshufb	(%rax,%rdx), %xmm2
leaq	(%rsi,%r9), %rax
pxor	%xmm3, %xmm3
pcmpgtb	%xmm3, %xmm0
pmovmskb	%xmm0, %edx
orl	$65536, %edx                    # imm = 0x10000
rep bsfl	%edx, %edx
movl	$16, %r8d
subl	%edx, %r8d
movdqu	%xmm2, (%rsi,%r9)
movd	%xmm1, 16(%rsi,%r9)
movl	%ecx, %ecx
addq	%rax, %rcx
leaq	(%rcx,%rdi), %rdx
movb	$46, (%rdi,%rcx)
leaq	(%rax,%r8), %rcx
cmpq	%rdx, %rcx
leaq	1(%r8,%rax), %rsi
cmovbeq	%rdx, %rsi

sse4-improve-ctz-concatenate

addl	%r9d, %ecx
movq	%rcx, %rax
shlq	$5, %rax
leaq	_ZN12_GLOBAL__N_125double_sse4_shuffle_tableE(%rip), %rdx
movdqa	%xmm0, %xmm1
pshufb	(%rax,%rdx), %xmm1
leaq	(%rsi,%rdi), %r8
por	16(%rax,%rdx), %xmm1
movd	%xmm0, %eax
addl	$48, %eax
pxor	%xmm2, %xmm2
pcmpgtb	%xmm2, %xmm0
pmovmskb	%xmm0, %edx
orl	$65536, %edx                    # imm = 0x10000
rep bsfl	%edx, %edx
movl	$16, %r9d
subl	%edx, %r9d
movdqu	%xmm1, (%rsi,%rdi)
movl	%eax, 16(%rsi,%rdi)
movl	$17, %esi
subl	%edx, %esi
cmpl	%r9d, %ecx
cmovaeq	%rcx, %rsi
addq	%r8, %rsi

Review threads on zmij.cc (outdated)

@vitaut (Owner) left a review:

Thanks for the PR! Overall looks good but please rebase your changes.

@TobiSchluter (Contributor, Author) commented Mar 9, 2026

> Thanks for the PR! Overall looks good but please rebase your changes.

Thanks, I hope to get to it during the course of the week, but it might take until next week. I think going through constexpr functions is a good idea, also because it allows switching between alternatives more easily. E.g. with AVX2 it is tempting to use broadcast loads for the splatted constants, at least when using -Os.
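For instance (a sketch of the thought, not a measured win; the constant is an assumption):

__m128i ascii_zeros = _mm_set1_epi8('0');
// SSE: typically a full 16-byte load from .rodata.
// With -mavx2 (and especially -Os), compilers can instead emit a 1-byte
// broadcast load: vpbroadcastb xmm0, byte ptr [rip + .Lconst]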

@vitaut force-pushed the main branch 2 times, most recently from 38cd652 to 6b8a4da on March 9, 2026 02:11
@TobiSchluter changed the title from "Convert bcd -> ascii after shuffling. Gain 8%" to "Convert bcd -> ascii after shuffling. Gain 3%" on Mar 20, 2026
@TobiSchluter (Contributor, Author)

I'm back on this, now that the NEON changes and the ensuing restructuring are done. I finally added clang to my setup, and surprisingly, on @Antares0982's benchmark I find gcc faster -- and less improved by my updated version of the patch. With clang the patch amounts to a 3% improvement, making it almost competitive with gcc.

The fastest form appears to be one that interleaves the (previously introduced) shuffle tables with the point-insertion tables. Conveniently, this also seems advantageous in non-synthetic benchmarks with memory pressure, as the two constants can always be made to land on the same cache line, i.e. they cost a single read from the non-L1 world.

Before I resubmit, I still want to see if I can expand this to NEON: the table is shared with the NEON code, so it means fewer #ifdefs if it works there as well.

@vitaut force-pushed the main branch 6 times, most recently from c98938e to b38b1a0 on March 21, 2026 19:54
@vitaut (Owner) commented Mar 21, 2026

> I still want to see if I can expand this to NEON

Let me know when it's ready for review. Note that the shuffle table is now generated at compile time, so you'll need to do the same for the new table.
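Something along these lines should work for the point-insertion constants (a C++17 sketch with assumed shapes, not the actual zmij generator):

#include <array>

constexpr auto make_point_table() {
  std::array<std::array<unsigned char, 16>, 17> t{};
  for (int point = 0; point < 17; ++point)  // entry 16: no point in range
    for (int i = 0; i < 16; ++i)
      t[point][i] = (i == point) ? '.' : '0';
  return t;
}
constexpr auto point_table = make_point_table();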

@vitaut force-pushed the main branch 2 times, most recently from 3244f65 to f032301 on March 22, 2026 02:09
@Antares0982 (Contributor)

> A disconcerting observation which I made, and which seems unrelated to this patch: with clang 18.1.3 (the one I'm using), performance collapses if one uses -march=x86-64-v3 or -v4 instead of -v2 (runtime per call in @Antares0982's benchmark increases from 8 ns to 14 ns). I had a look at the assembly but couldn't figure out why that happens.

> Do you mean the regression occurs both before and after this PR?

> Yes, with and without the patch.

> @Antares0982 I suspected that the output buffer in the benchmark becomes misaligned and we end up observing a traffic jam due to crossing cache lines, but explicitly aligning it to 64 bytes doesn't fix it. I'm observing this on Zen 4, and since you're using vperf I assume you're on a different microarchitecture, so it appears to be something general.

VTune says the problem is back-end bound and not related to memory. I'm using Intel.

@Antares0982 (Contributor)

By adding, one at a time, each flag that x86-64-v3 has and avx2 does not, I found that the problem comes from BMI2: adding -mbmi2 alone causes the performance collapse.

@TobiSchluter (Contributor, Author) commented Mar 30, 2026

I didn't know "back-end bound" excludes _mm_storeu here, but it wasn't hard to do the experiment.

We don't use any of the BMI2 instructions explicitly, and AFAICT not implicitly either (screenshot of the disassembly omitted). TZCNT would be enabled by -mbmi, and the instruction is also present in your -mavx2 excerpt. This is not what I'd expect, but it seems to exclude it as the culprit.

It's almost by definition a compiler bug, but it seems completely wild that something like this could affect AMD and Intel in the same way, especially as the assembly is almost the same except for the placement of a few scalar instructions. The only thing I can see that looks vaguely plausible as a reason is that the bad case has three address calculations in a row, two as part of multiplications. I know nothing about CPU ports and such, but such a limitation seems consistent with the microarchitecture diagram I found on Wikipedia; according to it, port 0 will be quite busy.

@Antares0982 (Contributor)

After enabling BMI2, clang emitted about 30 mulx instructions, which actually correspond to u128 multiplications (yesterday I only looked at the hot spot, which in fact was not the direct cause). The compiler attempts to optimize the 128-bit multiplications using mulx, but the result is a severe performance regression.
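For illustration, the kind of operation involved (not zmij's exact code): with -mbmi2, clang lowers 64x64->128-bit multiplies like the one below through MULX instead of MUL.

#include <cstdint>

inline uint64_t umul128_hi(uint64_t a, uint64_t b) {
  // clang with -mbmi2 emits MULX for this widening multiply.
  return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}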

@TobiSchluter (Contributor, Author) commented Mar 31, 2026

BTW, I tried Lemire's AVX512 code from https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/ and it is indeed faster -- but again clang is significantly slower than gcc. The code gains almost 0.2 ns with gcc, pushing it under 8 ns, but with clang it takes ~11 ns (instead of 14 ns with our code and v4), so clang's problems with newer architecture variants go beyond just this one piece of code.

I don't think the code has much practical use: it requires -march=x86-64-v4 -mavx512ifma -mavx512vbmi (which I'm actually surprised is available on Zen 4), so in practice it will only be available and beneficial to people who rebuild with -march=native while avoiding clang.

As another aside, I tried dozens of variants, and within SSE4.2 (i.e. x86-64-v2) nothing seems to speed up the code. The variations I found that are minimally faster on Tiger Lake are minimally slower on Zen 4, or vice versa (evaluating the length via _mm_cmpistri, combining arithmetic with _mm_madd_epi16, using different packing strategies, and inserting zeros at different places along the instruction chain, to name some of the things I tried). Lemire's code is probably a demonstration of how little is left to be gained in the pure digit conversion. If one goes the full AVX512 route there is probably a smarter way to evaluate the string length, but I didn't explore the hyperdimensional space of AVX512 bit instructions.
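For the madd idea, this is the standard pair-folding trick (a sketch with an assumed digit layout, not the exact variant that was tried):

// digits16: one decimal digit per 16-bit lane, tens digit in the even lane.
// madd folds each lane pair into d_hi * 10 + d_lo in one instruction.
inline __m128i combine_digit_pairs(__m128i digits16) {
  return _mm_madd_epi16(digits16, _mm_set1_epi32(0x0001000A));
}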

Anyway, will try to update the patch tonight!

@Antares0982 (Contributor)

> BTW, I tried Lemire's AVX512 code from https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/ and it is indeed faster -- but again clang is significantly slower than gcc. The code gains almost 0.2 ns with gcc, pushing it under 8 ns, but with clang it takes ~11 ns (instead of 14 ns with our code and v4), so clang's problems with newer architecture variants go beyond just this one piece of code.

Seems xjb has already implemented the avx512vbmi optimization: https://github.com/xjb714/xjb/blob/afe97d6552a5028ff4512e9678c8606a45c7a9f5/src/ftoa.cpp#L1031

BTW @xjb714 has listed a table of AVX512 support

Antares0982/ftoa-benchmark#2 (comment)

@TobiSchluter (Contributor, Author)

Right, that is Lemire's code as well. The table is nice. It seems like I'm in agreement with xjb about the performance characteristics.

If only there were a way to evaluate the length / count the zeros faster. pcmpistri applied after the shuffle seems to be the biggest improvement I can create, but only on Zen 4.

@TobiSchluter force-pushed the sse4_even_faster branch 2 times, most recently from 3b60f87 to bc893e0 on March 31, 2026 13:55
@TobiSchluter TobiSchluter requested a review from vitaut March 31, 2026 13:58
@TobiSchluter (Contributor, Author)

I updated the patch.

  1. It now uses if where possible instead of #ifdef.
  2. The TZCNT changes are removed.
  3. I removed the clang-formatting changes.

Review threads on zmij.cc (outdated)
@vitaut (Owner) commented Apr 2, 2026

Mostly looks good. Just two minor comments and please rebase your changes.

@TobiSchluter (Contributor, Author)

I rebased and updated per your comments.

I also finally had time to try doing the same on NEON; according to my experiments, it is not an improvement there.

@TobiSchluter (Contributor, Author) commented Apr 4, 2026

PS: I just did some experiments after reading a bit more about unaligned reads -- the internet says that unaligned reads have become penalty-free on AVX2-era microarchitectures. This didn't match my experience when preparing this patch, where the simple and memory-efficient way to obtain the point-and-zeros mask, _mm_loadu_si128(m128ptr("0000000000000000.00000000000000" + (16 - index))); was measurably more expensive than using the much bigger table that is in the patch.

This was a bit of a surprise, but then I read the above sentence again and thought "please, please, don't tell me one needs VEX encoding to get an unpenalized unaligned load" -- and indeed, with -mavx2 the performance difference seems to be gone, and the memory-efficient version even appears a bit faster. The wonders of x86 never cease.

Given the performance degradations with clang, I'm not sure adding an x86-64-v3 path is worth pursuing for such a tiny gain. (I also had some fun ideas for AVX2 paths, but while intellectually rewarding, it is not emotionally rewarding: none of them is faster than the SSE4.1 path, so I'm not tempted to add more preprocessor variables.)

@TobiSchluter TobiSchluter requested a review from vitaut April 4, 2026 14:34
@vitaut merged commit 7afa3cc into vitaut:main on Apr 4, 2026
3 checks passed
@vitaut (Owner) commented Apr 4, 2026

Merged, thank you!
