
Conversation

@jackkleeman
Contributor

@jackkleeman jackkleeman commented Aug 23, 2025

Currently based on top of #25

Particular sizes of input need different numbers of expensive u128 operations, so by specialising on those sizes we can speed them up. In particular, u64 inputs need no u128 operations at all! We can also do the u64 operations in parallel (4 u32 values at a time) when dealing with >= 20-digit outputs.
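To illustrate the idea, here is a minimal sketch in the spirit of the PR (not its actual code; the function names and the 62^10 chunk constant are illustrative): a u64 input is encoded with u64 arithmetic only, while a u128 input peels off fixed 10-digit chunks with at most two u128 divisions and then stays in u64 land.

```rust
const BASE: u64 = 62;
const ALPHABET: &[u8; 62] =
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Fast path: a u64 never needs u128 arithmetic.
fn encode_u64(mut n: u64, out: &mut Vec<u8>) {
    let start = out.len();
    loop {
        out.push(ALPHABET[(n % BASE) as usize]);
        n /= BASE;
        if n == 0 {
            break;
        }
    }
    out[start..].reverse();
}

// Emit exactly 10 digits, zero-padded, for a value below 62^10.
fn push_fixed10(mut n: u64, out: &mut Vec<u8>) {
    let mut buf = [0u8; 10];
    for slot in buf.iter_mut().rev() {
        *slot = ALPHABET[(n % BASE) as usize];
        n /= BASE;
    }
    out.extend_from_slice(&buf);
}

// Slow path: peel off 10-digit chunks (62^10 fits in a u64), so at most
// two u128 divisions happen; everything else is u64 work.
fn encode_u128(n: u128, out: &mut Vec<u8>) {
    const CHUNK: u128 = 62u128.pow(10);
    if let Ok(small) = u64::try_from(n) {
        return encode_u64(small, out);
    }
    let (hi, lo) = (n / CHUNK, (n % CHUNK) as u64);
    match u64::try_from(hi) {
        Ok(hi64) => encode_u64(hi64, out),
        Err(_) => {
            encode_u64((hi / CHUNK) as u64, out); // top part always fits
            push_fixed10((hi % CHUNK) as u64, out);
        }
    }
    push_fixed10(lo, out);
}

fn main() {
    let mut buf = Vec::new();
    encode_u64(1337, &mut buf);
    println!("{}", String::from_utf8(buf).unwrap()); // prints "LZ"
}
```

The digit-count dispatch is what lets each path use the cheapest arithmetic available for its size.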

These benchmark results are measured against #25, so they compound the deltas there.

encode/standard_new_fixed
                        time:   [25.572 ns 25.663 ns 25.757 ns]
                        change: [-21.341% -21.066% -20.808%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
encode/standard_new_random
                        time:   [24.551 ns 25.111 ns 26.021 ns]
                        change: [-22.428% -19.368% -14.989%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
encode/standard_new_random_u64
                        time:   [14.897 ns 15.027 ns 15.166 ns]
                        change: [-28.542% -27.831% -27.124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
encode/standard_bytes_fixed
                        time:   [14.276 ns 14.301 ns 14.330 ns]
                        change: [-32.710% -32.506% -32.316%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/standard_bytes_random
                        time:   [16.773 ns 16.904 ns 17.047 ns]
                        change: [-29.651% -28.887% -28.261%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
encode/standard_bytes_random_u64
                        time:   [9.5272 ns 9.6357 ns 9.7416 ns]
                        change: [-38.272% -37.483% -36.710%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
encode/standard_buf_fixed
                        time:   [29.307 ns 29.383 ns 29.467 ns]
                        change: [-19.478% -19.077% -18.714%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
encode/standard_buf_random
                        time:   [16.005 ns 16.034 ns 16.066 ns]
                        change: [-30.308% -30.094% -29.877%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
encode/standard_buf_random_u64
                        time:   [8.8229 ns 8.8414 ns 8.8666 ns]
                        change: [-41.447% -40.877% -40.431%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
encode/alternative_new_fixed
                        time:   [25.537 ns 25.593 ns 25.658 ns]
                        change: [-22.100% -21.892% -21.658%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
encode/alternative_new_random
                        time:   [23.408 ns 23.712 ns 23.987 ns]
                        change: [-26.354% -25.470% -24.584%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
encode/alternative_new_random_u64
                        time:   [15.397 ns 15.462 ns 15.524 ns]
                        change: [-29.714% -28.879% -28.060%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
encode/alternative_bytes_fixed
                        time:   [14.321 ns 14.347 ns 14.379 ns]
                        change: [-33.374% -33.204% -33.036%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/alternative_bytes_random
                        time:   [16.828 ns 16.914 ns 16.997 ns]
                        change: [-30.550% -29.503% -28.705%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
encode/alternative_bytes_random_u64
                        time:   [9.1515 ns 9.2020 ns 9.2555 ns]
                        change: [-41.914% -40.981% -39.673%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
encode/alternative_buf_fixed
                        time:   [29.719 ns 29.796 ns 29.867 ns]
                        change: [-18.757% -18.536% -18.321%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  6 (6.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
encode/alternative_buf_random
                        time:   [16.348 ns 16.365 ns 16.381 ns]
                        change: [-30.424% -30.246% -30.058%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
encode/alternative_buf_random_u64
                        time:   [8.6293 ns 8.6441 ns 8.6616 ns]
                        change: [-41.293% -41.123% -40.929%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

@jackkleeman jackkleeman force-pushed the vectorise branch 3 times, most recently from 160b54a to f2c0b90 on August 25, 2025 09:22
@fbernier
Owner

fbernier commented Aug 25, 2025

Hey, thanks for looking into this and for your work! I also looked at the base PR but didn't have much time to review it. I did run the benchmarks on my machine (AMD Ryzen 9 5950X), though, and I'm getting no improvements at all. May I know the hardware you benched this on? I suspect this may be something LLVM optimizes well for x86 but not as well for ARM.

Here are my results:

~/code/base62 vectorise *17 ❯ cargo bench --all-features -- --baseline master encode
   Compiling base62 v2.2.1 (/home/fbernier/code/base62)
    Finished `bench` profile [optimized] target(s) in 2.91s
     Running benches/base62.rs (target/release/deps/base62-8b29c01a834e4375)
encode/standard_new_fixed
                        time:   [35.584 ns 35.618 ns 35.653 ns]
                        change: [+0.2082% +0.3559% +0.4935%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
encode/standard_new_random
                        time:   [44.316 ns 44.342 ns 44.373 ns]
                        change: [−0.0382% +0.0824% +0.2004%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
encode/standard_bytes_fixed
                        time:   [31.143 ns 31.167 ns 31.194 ns]
                        change: [−0.6344% −0.5199% −0.4069%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  6 (6.00%) high mild
  12 (12.00%) high severe
encode/standard_bytes_random
                        time:   [32.439 ns 32.460 ns 32.484 ns]
                        change: [+0.0443% +0.1357% +0.2256%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
encode/standard_buf_fixed
                        time:   [40.475 ns 40.530 ns 40.586 ns]
                        change: [+0.0495% +0.2463% +0.4537%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
encode/standard_buf_random
                        time:   [37.294 ns 37.315 ns 37.338 ns]
                        change: [+0.4044% +0.4970% +0.5911%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
encode/alternative_new_fixed
                        time:   [36.456 ns 36.506 ns 36.554 ns]
                        change: [+0.8347% +1.0947% +1.3571%] (p = 0.00 < 0.05)
                        Change within noise threshold.
encode/alternative_new_random
                        time:   [43.665 ns 43.712 ns 43.757 ns]
                        change: [+0.0196% +0.1920% +0.3595%] (p = 0.03 < 0.05)
                        Change within noise threshold.
encode/alternative_bytes_fixed
                        time:   [31.437 ns 31.561 ns 31.708 ns]
                        change: [−0.8544% −0.3901% +0.1137%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  14 (14.00%) high severe
encode/alternative_bytes_random
                        time:   [31.870 ns 31.885 ns 31.903 ns]
                        change: [−0.5065% −0.4255% −0.3443%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  11 (11.00%) high mild
  3 (3.00%) high severe
encode/alternative_buf_fixed
                        time:   [39.248 ns 39.333 ns 39.430 ns]
                        change: [−0.3193% −0.0209% +0.3072%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
encode/alternative_buf_random
                        time:   [37.151 ns 37.170 ns 37.191 ns]
                        change: [+0.0341% +0.1236% +0.2108%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  11 (11.00%) high severe

I'll try to figure out why this is happening when I get more time.

@fbernier
Owner

Ok, I just looked a bit more: it looks like udivti3 calls into udivmodti4, which optimizes into a divq (~4 cycles) on x86_64, whereas ARM branches into a software udiv128by64to64default (probably 50+ cycles). Your multiply-shift changes should be ~6 cycles.

This is all guesstimated and I've yet to actually measure anything.
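The multiply-shift trick being referred to can be sketched like this (illustrative constants, not the PR's actual code). For d = 62 the round-up magic ceil(2^70 / 62) needs 65 bits, so it is kept in a u128 here and the inputs are restricted to chunk-sized values below 62^10 to keep the product inside u128; the compiled result is a widening multiply plus a shift, on the order of the ~6 cycles mentioned above rather than a software 128-bit divide.

```rust
const SHIFT: u32 = 70;
const MAGIC: u128 = (1u128 << SHIFT) / 62 + 1; // ceil(2^70 / 62)
const CHUNK: u64 = 839_299_365_868_340_224; // 62^10, fits in a u64

/// n / 62 without a divide instruction, for chunk-sized inputs (n < 62^10).
/// Exact because (MAGIC * 62 - 2^70) * n < 2^70 for all such n.
fn div62(n: u64) -> u64 {
    debug_assert!(n < CHUNK); // keeps n * MAGIC from overflowing u128
    ((n as u128 * MAGIC) >> SHIFT) as u64
}

fn main() {
    // Spot-check the identity against ordinary division.
    for n in [0u64, 1, 61, 62, 63, 1337, CHUNK - 1] {
        assert_eq!(div62(n), n / 62);
    }
    println!("multiply-shift agrees with n / 62");
}
```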

@jackkleeman
Copy link
Contributor Author

divq on Intel is definitely worth using if we know at compile time that the result fits in a u64. I think we only know that once digits <= 20. In master, we can use divq for the second u128 division, and presumably the udivmodti4 impl is clever enough to do so, but we can't use it for the first u128 division. This PR is largely focused on separating out the digit counts so we can apply different optimisations, and on having fewer sequential divisions so that the CPU can pipeline better.
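The "result fits in a u64" point can be sketched with a hypothetical helper (not the PR's actual code): once a value is known to fit in 64 bits, casting down lets the compiler emit a native divide, and only the genuinely 128-bit branch pays for the software __udivti3 path.

```rust
const CHUNK: u64 = 839_299_365_868_340_224; // 62^10, fits in a u64

/// Split off the low 10 base62 digits of `n`. Hypothetical helper for
/// illustration: the Ok branch compiles to a plain hardware div/mod on
/// u64, while only the Err branch takes the expensive u128 division.
fn split_low_chunk(n: u128) -> (u128, u64) {
    match u64::try_from(n) {
        Ok(small) => ((small / CHUNK) as u128, small % CHUNK),
        Err(_) => (n / CHUNK as u128, (n % CHUNK as u128) as u64),
    }
}

fn main() {
    let n = (u64::MAX as u128) + 1;
    let (hi, lo) = split_low_chunk(n);
    // The split is exact: hi * 62^10 + lo reconstructs n.
    assert_eq!(hi * CHUNK as u128 + lo as u128, n);
    println!("hi = {hi}, lo = {lo}");
}
```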

I made a new branch on restatedev, u64-bench, which has no optimisations, just the u64 benches, so we can compare.

I've been running on an M2 Mac. But I've just tried it on an Intel AWS node (m6i.2xlarge, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz), and I still see a decent improvement comparing vectorise to u64-bench:

encode/standard_buf_random
                        time:   [31.630 ns 31.695 ns 31.780 ns]
                        change: [-28.598% -28.502% -28.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
encode/standard_buf_random_u64
                        time:   [16.177 ns 16.211 ns 16.251 ns]
                        change: [-43.676% -43.563% -43.441%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

It's weird that I get 31 ns on that machine, faster than your 37 ns for standard_buf_random, despite your significantly better CPU! AMD vs Intel?

For completeness, see here the comparison on my mac of vectorise to u64-bench:

encode/standard_buf_random
                        time:   [15.604 ns 15.640 ns 15.681 ns]
                        change: [-65.232% -65.040% -64.843%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
encode/standard_buf_random_u64
                        time:   [8.7813 ns 8.8055 ns 8.8304 ns]
                        change: [-41.650% -41.329% -41.032%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

So it's 2x faster on ARM than on x86. Maybe we can get the x86 side even better.

@fbernier
Owner

fbernier commented Aug 25, 2025

My bad, I figured out my issue. I had buildcache configured on this machine, which I hadn't used in a while, and it caused the code not to get recompiled when running the benchmarks (likely a bug). I've now re-run them and see improvements across the board. I will be reviewing this soon.

Thanks!

~/c/base62 vectorise *18 !1 ?3 ❯ cargo bench --all-features -- --baseline master encode
   Compiling base62 v2.2.1 (/home/fbernier/code/base62)
    Finished `bench` profile [optimized] target(s) in 2.93s
     Running benches/base62.rs (target/release/deps/base62-a6a16069e85ad429)
encode/standard_new_fixed
                        time:   [28.954 ns 29.011 ns 29.073 ns]
                        change: [−24.150% −24.005% −23.849%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
encode/standard_new_random
                        time:   [38.305 ns 38.484 ns 38.663 ns]
                        change: [−19.843% −19.564% −19.287%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/standard_new_random_u64
                        time:   [20.675 ns 20.689 ns 20.705 ns]
                        change: [−19.232% −19.100% −18.956%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) low severe
  6 (6.00%) low mild
  3 (3.00%) high mild
encode/standard_bytes_fixed
                        time:   [21.566 ns 21.676 ns 21.788 ns]
                        change: [−34.881% −34.655% −34.387%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/standard_bytes_random
                        time:   [25.311 ns 25.334 ns 25.357 ns]
                        change: [−20.709% −20.623% −20.539%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
encode/standard_bytes_random_u64
                        time:   [9.1959 ns 9.2012 ns 9.2076 ns]
                        change: [−48.032% −47.984% −47.932%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe
encode/standard_buf_fixed
                        time:   [33.930 ns 33.963 ns 34.001 ns]
                        change: [−18.898% −18.760% −18.624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
encode/standard_buf_random
                        time:   [24.271 ns 24.332 ns 24.393 ns]
                        change: [−34.461% −34.291% −34.101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
encode/standard_buf_random_u64
                        time:   [10.356 ns 10.366 ns 10.377 ns]
                        change: [−47.743% −47.683% −47.625%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
encode/alternative_new_fixed
                        time:   [29.804 ns 29.828 ns 29.853 ns]
                        change: [−20.989% −20.862% −20.738%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
encode/alternative_new_random
                        time:   [36.759 ns 36.856 ns 36.966 ns]
                        change: [−20.111% −19.899% −19.695%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
encode/alternative_new_random_u64
                        time:   [20.464 ns 20.494 ns 20.522 ns]
                        change: [−19.160% −18.963% −18.767%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/alternative_bytes_fixed
                        time:   [21.410 ns 21.510 ns 21.612 ns]
                        change: [−35.112% −34.845% −34.550%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
encode/alternative_bytes_random
                        time:   [25.355 ns 25.384 ns 25.415 ns]
                        change: [−20.518% −20.408% −20.298%] (p = 0.00 < 0.05)
                        Performance has improved.
encode/alternative_bytes_random_u64
                        time:   [9.4581 ns 9.4718 ns 9.4863 ns]
                        change: [−46.139% −46.046% −45.946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
encode/alternative_buf_fixed
                        time:   [33.849 ns 33.897 ns 33.949 ns]
                        change: [−18.455% −18.335% −18.213%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
encode/alternative_buf_random_u64
                        time:   [9.8654 ns 9.8713 ns 9.8782 ns]
                        change: [−50.188% −50.129% −50.069%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

@jackkleeman
Contributor Author

I'm relieved to hear that!

@fbernier
Owner

Merging this as-is. Thanks again!

@fbernier fbernier merged commit f0f2a0e into fbernier:master Aug 26, 2025
8 checks passed
@jackkleeman jackkleeman deleted the vectorise branch August 26, 2025 15:17
@fbernier
Owner

published in 2.2.2
