Reduce per-call allocations in GraphSearcher.searchOneLayer and FusedPQDecoder#15
Merged
…PQDecoder

Allocation profiling of OnDiskGraphIndexCompactor workloads showed two hotspots accounting for ~87% of all allocations on the compaction ForkJoin workers.

GraphSearcher.searchOneLayer (61.9% of allocs): The NeighborProcessor lambda and the visited::add method reference were being allocated on every iteration of the search loop via DirectMethodHandle.allocateInstance. Hoist both into reusable instance fields and promote scoreTracker to a field so the processor lambda captures only `this`, allowing it to be created once per searcher.

FusedPQDecoder.newDecoder (24.8% of allocs): The base/Cosine decoder constructors called VectorUtil.sub(query, center), which allocates a new ArrayVectorFloat for the centered query on every decoder construction (one per GraphSearcher.search call). Add a thread-local centered-query scratch buffer in FusedPQ (mirroring the existing reusableResults / reusableNeighborCodes pattern) and a VectorUtil.subInto(dest, lhs, rhs) helper that writes the difference into a caller-provided destination without allocating.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
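The searchOneLayer change described in the commit message can be illustrated with a minimal sketch. Everything here is hypothetical: `HoistedCallbackSketch`, `onNeighbor`, and `visitedCount` are stand-ins invented for illustration, and `IntConsumer` plus `BitSet` replace jvector's `NeighborProcessor` and visited set. The point is only the pattern: state the callback needs lives in fields, so the lambda captures nothing but `this` and can be built once in the constructor instead of once per loop iteration.

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

// Hypothetical stand-in for a searcher; this is NOT the jvector GraphSearcher API.
final class HoistedCallbackSketch {
    private final BitSet visited = new BitSet();
    // Promoted from a local variable to a field so the lambda captures only `this`.
    private int visitedCount;
    // Created once in the constructor and reused for every search call.
    private final IntConsumer onNeighbor;

    HoistedCallbackSketch() {
        this.onNeighbor = node -> {
            if (!visited.get(node)) {
                visited.set(node);
                visitedCount++;
            }
        };
    }

    // Visits every neighbor list and returns the number of distinct nodes seen.
    int visitAll(int[][] adjacency) {
        visited.clear();
        visitedCount = 0;
        for (int[] neighbors : adjacency) {
            for (int n : neighbors) {
                onNeighbor.accept(n); // no lambda or method-reference allocation here
            }
        }
        return visitedCount;
    }
}
```

Because the callback now reads mutable fields rather than captured locals, an instance of such a class is only safe to use from one thread at a time.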
Summary
Allocation profiling of `OnDiskGraphIndexCompactor` workloads (async-profiler alloc mode, 30 s) showed two hotspots accounting for ~87% of all allocations on the compaction ForkJoin workers. This PR removes both.

| Hotspot | Share of allocations | Cause |
|---|---|---|
| `GraphSearcher.searchOneLayer` (lines 483 + 488) | 61.9% | `NeighborProcessor` lambda + `visited::add` method reference, allocated per loop iteration via `DirectMethodHandle.allocateInstance` |
| `FusedPQDecoder.newDecoder` → `<init>` → `VectorUtil.sub` | 24.8% | `new ArrayVectorFloat(new float[dim])` for the centered query on every decoder construction (one per search) |

`GraphSearcher.searchOneLayer`
- Promote `scoreTracker` to an instance field so the `NeighborProcessor` lambda needs to capture only `this`.
- Make the `NeighborProcessor` a permanent instance field, allocated once in the constructor.
- Make the `IntMarker` for `visited` a `final` field bound once at construction (`this.visitedAdder = visited::add`).

`FusedPQDecoder` / `FusedPQ`
- Add a `ThreadLocal<VectorFloat<?>> reusableCenteredQuery` in `FusedPQ`, mirroring the existing `reusableResults` / `reusableNeighborCodes` / `pqCodeScratch` pattern.
- Pass it through `FusedPQ.approximateScoreFunctionFor` → `FusedPQDecoder.newDecoder` and the three subclass constructors.
- Replace `VectorUtil.sub(query, center)` (allocates) with a new `VectorUtil.subInto(dest, lhs, rhs)` helper that writes the difference into the thread-local scratch.

The centered-query buffer is read-only after construction (it just feeds `VectorUtil.calculatePartialSums`), so a thread-local lifetime is safe, and `FusedPQDecoder` is already constructed concurrently from multiple ForkJoin workers.

Test plan
- `./mvnw -pl jvector-base,jvector-twenty,jvector-native -am compile`: all modules compile.
- `./mvnw -pl jvector-tests -am test`: 275 tests pass, 0 failures, 0 errors, 2 skipped. Includes `TestOnDiskGraphIndexCompactor`, `TestFusedGraphIndex`, `TestProductQuantization`, `TestCompressedVectors`, `TestReconstructionError`, `TestOnDiskGraphIndex`, `Test2DThreshold`, `TestConcurrentReadWriteDeletes`, etc.
- Run `asprof -e alloc` and confirm the two allocation sites drop to near zero.

🤖 Generated with Claude Code
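For readers unfamiliar with the thread-local scratch pattern used in the `FusedPQ` change, here is a rough sketch under simplifying assumptions. It uses plain `float[]` rather than jvector's `VectorFloat<?>`, and the class `CenteredQueryScratchSketch` and its `centered` method are hypothetical. The `subInto` here only mirrors the shape of the `VectorUtil.subInto(dest, lhs, rhs)` helper named in this PR; the real signature and implementation may differ.

```java
// Hypothetical sketch using float[]; not the actual FusedPQ / VectorUtil code.
final class CenteredQueryScratchSketch {
    private final float[] center;
    // One scratch buffer per thread, allocated lazily and reused across calls.
    private final ThreadLocal<float[]> scratch;

    CenteredQueryScratchSketch(float[] center) {
        this.center = center.clone();
        final int dim = center.length;
        this.scratch = ThreadLocal.withInitial(() -> new float[dim]);
    }

    // Writes lhs - rhs element-wise into dest without allocating a result vector.
    static void subInto(float[] dest, float[] lhs, float[] rhs) {
        if (dest.length != lhs.length || lhs.length != rhs.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int i = 0; i < dest.length; i++) {
            dest[i] = lhs[i] - rhs[i];
        }
    }

    // Returns the calling thread's scratch buffer, filled with query - center.
    // The contents are overwritten by the next call on the same thread.
    float[] centered(float[] query) {
        float[] dest = scratch.get();
        subInto(dest, query, center);
        return dest;
    }
}
```

The returned buffer is only valid until the same thread calls `centered` again, so this pattern fits cases like the one described above, where the centered query is consumed right away and not retained.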