@dqhl76
Collaborator

@dqhl76 dqhl76 commented Dec 5, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

refactor: try reduce aggregate hash index cost on hot path

While profiling TPC-H 1000 Q18, I found `find_or_insert` on the hot path; this does not show up on smaller-scale datasets.

With pure linear probing (+1 each time), occupied slots tend to cluster. Once you hit such a cluster, the probe has to walk over a long run of consecutive occupied entries. If, instead, the next probe position is derived from the hash (i.e. more “random”), you break up these clusters. That may well hurt potential SIMD or prefetch optimisations, but it also shortens long probe chains on average.
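To make the difference concrete, here is a minimal sketch of the two probing strategies on a toy open-addressing table. It is not the Databend implementation: `Table`, `Probe`, `mix`, and `total_probes` are hypothetical names, the SplitMix64 finalizer stands in for the real hash function, and the table has no resizing or deletion.

```rust
/// How to pick the next slot after a collision.
#[derive(Clone, Copy)]
enum Probe {
    /// Always advance by one slot.
    Linear,
    /// Advance by an odd step taken from the upper bits of the hash.
    HashStep,
}

/// SplitMix64 finalizer, used only as a stand-in hash function.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

struct Table {
    slots: Vec<Option<u64>>,
    mask: usize,
    probe: Probe,
}

impl Table {
    fn new(capacity: usize, probe: Probe) -> Self {
        assert!(capacity.is_power_of_two() && capacity >= 2);
        Table { slots: vec![None; capacity], mask: capacity - 1, probe }
    }

    fn step(&self, hash: u64) -> usize {
        match self.probe {
            Probe::Linear => 1,
            // Setting the low bit keeps the step odd, hence coprime with the
            // power-of-two capacity, so the sequence visits every slot.
            Probe::HashStep => (((hash >> 32) as usize) | 1) & self.mask,
        }
    }

    /// Returns (key was already present, number of slots examined).
    fn find_or_insert(&mut self, key: u64) -> (bool, usize) {
        let hash = mix(key);
        let step = self.step(hash);
        let mut pos = (hash as usize) & self.mask;
        for probes in 1..=self.slots.len() {
            let slot = self.slots[pos];
            match slot {
                Some(existing) if existing == key => return (true, probes),
                Some(_) => pos = (pos + step) & self.mask,
                None => {
                    self.slots[pos] = Some(key);
                    return (false, probes);
                }
            }
        }
        panic!("table is full");
    }
}

/// Inserts the keys 0..n and returns the total number of slots examined.
fn total_probes(probe: Probe, capacity: usize, n: u64) -> usize {
    let mut table = Table::new(capacity, probe);
    (0..n).map(|key| table.find_or_insert(key).1).sum()
}
```

Comparing `total_probes(Probe::Linear, 1 << 16, 50_000)` with `total_probes(Probe::HashStep, 1 << 16, 50_000)` gives a rough feel for the probe-length difference at about 76% load. The numbers depend on the hash function and key distribution, and the toy ignores cache effects, so it does not replace the TPC-H measurement.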

This idea is inspired by an optimization PR in DuckDB. Thanks!

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Dec 5, 2025
@dqhl76 dqhl76 marked this pull request as draft December 5, 2025 08:50
@dqhl76 dqhl76 added the ci-cloud Build docker image for cloud test label Dec 5, 2025
@github-actions
Contributor

github-actions bot commented Dec 5, 2025

Docker Image for PR

  • tag: pr-19072-e3377f5-1764931157

note: this image tag is only available for internal use.

@dqhl76 dqhl76 added ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits and removed ci-cloud Build docker image for cloud test labels Dec 5, 2025
@github-actions
Contributor

github-actions bot commented Dec 5, 2025

Docker Image for PR

  • tag: pr-19072-e3377f5-1764935655

note: this image tag is only available for internal use.

@forsaken628
Collaborator

Sequential probing has a greater likelihood of being optimized for SIMD instructions. Or maybe the compiler isn't that smart yet?

@dqhl76
Collaborator Author

dqhl76 commented Dec 6, 2025

> Sequential probing has a greater likelihood of being optimized for SIMD instructions. Or maybe the compiler isn't that smart yet?

I have seen an improvement on Q18 in TPC-H 1000 (120.03s -> 111.57s). I will capture a flame graph with perf later to confirm where it comes from.

Here is my guess:

With pure linear probing (+1 each time), occupied slots tend to cluster. Once you hit such a cluster, the probe has to walk over a long run of consecutive occupied entries. If, instead, the next probe position is derived from the hash (i.e. more “random”), you break up these clusters. That may well hurt potential SIMD or prefetch optimisations, but it also shortens long probe chains on average.

(BTW, this idea is inspired by an optimisation PR in DuckDB. A related approach from SwissTable is to keep probing sequentially but increase the step size as you keep encountering occupied slots. I haven’t tested that variant here yet.)
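The growing-step variant can be sketched as a triangular probe sequence, where the stride increases by one on every miss. This is an illustration only: `TriangularProbe` is a hypothetical name, and the sketch steps over single slots, whereas SwissTable-style tables such as hashbrown advance over whole groups of slots at a time.

```rust
/// Probe sequence whose stride grows by one on every step, so the offsets
/// from the starting slot are the triangular numbers 0, 1, 3, 6, 10, ...
struct TriangularProbe {
    pos: usize,
    stride: usize,
    mask: usize,
}

impl TriangularProbe {
    fn new(hash: u64, capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        let mask = capacity - 1;
        TriangularProbe { pos: (hash as usize) & mask, stride: 0, mask }
    }
}

impl Iterator for TriangularProbe {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // After `capacity` steps every slot has been produced once.
        if self.stride > self.mask {
            return None;
        }
        let current = self.pos;
        self.stride += 1;
        self.pos = (self.pos + self.stride) & self.mask;
        Some(current)
    }
}
```

With a power-of-two capacity the triangular offsets cover every slot exactly once, so the sequence still guarantees that an empty slot is found if one exists. The first few probes stay close to the home slot, which keeps some locality, while later probes jump further and escape long clusters.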

@forsaken628
Collaborator

https://github.com/rust-lang/hashbrown/blob/master/src/raw/mod.rs#L2068

@dqhl76
Collaborator Author

dqhl76 commented Dec 6, 2025

@forsaken628
Collaborator

Later, we can replace the current Entry with the Group of SwissTable

@dqhl76 dqhl76 marked this pull request as ready for review December 7, 2025 06:36

Labels

ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits pr-refactor this PR changes the code base without new features or bugfix

2 participants