Use ExplicitThreadLocal for scratch int[] in PanamaVectorUtilSupport #17

Merged
eolivelli merged 1 commit into main from fast-threadlocal-scratch-int
May 23, 2026
Conversation

@eolivelli
Owner

Summary

  • Wall-clock profile of a HerdDB indexing workload showed PanamaVectorUtilSupport.assembleAndSum256 spending non-trivial time in java.lang.ThreadLocal.get / ThreadLocalMap.getEntry. The host process carries many ThreadLocal instances (Netty FastThreadLocalRunnable, gRPC, metrics, …), so the JDK ThreadLocalMap's open-addressing hash is under enough pressure for probing to be visible on this hot path.
  • Switch the two static scratchInt256 / scratchInt512 fields to ExplicitThreadLocal, jvector's existing ConcurrentHashMap-backed (CHM) thread-local primitive. It is documented as a drop-in replacement for ThreadLocal and is already used by GraphIndexBuilder, FusedPQ, RandomAccessVectorValues. Its lookup cost is independent of how many other thread-locals live on the host thread.
  • No call-site changes (the six .get() consumers in assembleAndSum*, calculatePartialSums*, squareDistance* are unchanged). Static lifetime matches what the ThreadLocal had, so no close() plumbing is required.
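
For readers unfamiliar with the approach, here is a minimal sketch of a CHM-backed thread-local and a `.get()` call site that does not change when the field type is swapped. It is an illustration only, not jvector's ExplicitThreadLocal source: the names `ScratchSketch`, `ChmThreadLocal`, `SCRATCH`, and `sumOfSquares`, and the scratch length of 8, are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch; names and sizes are assumptions, not jvector code.
class ScratchSketch {

    /** Per-thread value holder backed by a ConcurrentHashMap keyed on thread id. */
    public static final class ChmThreadLocal<T> {
        private final ConcurrentHashMap<Long, T> values = new ConcurrentHashMap<>();
        private final Supplier<T> initial;

        private ChmThreadLocal(Supplier<T> initial) {
            this.initial = initial;
        }

        public static <T> ChmThreadLocal<T> withInitial(Supplier<T> initial) {
            return new ChmThreadLocal<>(initial);
        }

        /** Looks up this map only, so other thread-locals on the thread do not affect it. */
        public T get() {
            // getId() is deprecated on newer JDKs in favor of threadId(); used here for portability.
            return values.computeIfAbsent(Thread.currentThread().getId(), id -> initial.get());
        }
    }

    // Static lifetime, like the ThreadLocal it stands in for. Length 8 is illustrative.
    static final ChmThreadLocal<int[]> SCRATCH = ChmThreadLocal.withInitial(() -> new int[8]);

    /** Call site: identical whether SCRATCH is a ThreadLocal or the sketch above. */
    static int sumOfSquares(int[] v) {
        int[] scratch = SCRATCH.get();
        int n = Math.min(v.length, scratch.length);
        int sum = 0;
        for (int i = 0; i < n; i++) {
            scratch[i] = v[i] * v[i];
            sum += scratch[i];
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] mine = SCRATCH.get();
        AtomicReference<int[]> theirs = new AtomicReference<>();
        Thread t = new Thread(() -> theirs.set(SCRATCH.get()));
        t.start();
        t.join();
        System.out.println("same array on same thread: " + (mine == SCRATCH.get()));
        System.out.println("distinct array on other thread: " + (mine != theirs.get()));
        System.out.println("sumOfSquares: " + sumOfSquares(new int[] {1, 2, 3}));
    }
}
```

One trade-off of this shape is that entries are keyed by thread id and stay in the map until it is cleared, which is harmless for a static field that lives as long as the class.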

Test plan

  • ./mvnw -pl jvector-twenty -am -DskipTests package — builds clean
  • ./mvnw -pl jvector-tests test -Dtest=TestVectorizationProvider (Tests run: 3, Failures: 0, Errors: 0)
  • Re-run the HerdDB indexing-service wall profile and confirm ThreadLocalMap.getEntry disappears from under assembleAndSum256

🤖 Generated with Claude Code

Wall-clock profiling of a HerdDB indexing workload showed
PanamaVectorUtilSupport.assembleAndSum256 spending non-trivial time in
java.lang.ThreadLocal.get / ThreadLocalMap.getEntry. The host process
carries many ThreadLocal instances (Netty FastThreadLocalRunnable, gRPC,
metrics, ...), which puts the JDK ThreadLocalMap under enough hash
pressure that lookups become noticeable on this very hot path.

Switch the two static scratchInt256/scratchInt512 fields to
ExplicitThreadLocal, jvector's existing CHM-backed thread-local
primitive (documented as a drop-in replacement for ThreadLocal and
already used by GraphIndexBuilder, FusedPQ, RandomAccessVectorValues).
Its lookup cost is independent of how many other thread-locals live on
the host thread, sidestepping the ThreadLocalMap pathology.

No call-site changes are required; .get() semantics are identical. The
static lifetime matches what the ThreadLocal had, so no close() is
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli eolivelli merged commit 3603ac4 into main May 23, 2026
3 of 10 checks passed
@eolivelli eolivelli deleted the fast-threadlocal-scratch-int branch May 23, 2026 04:42
