
Conversation

@abernardi597 commented Dec 4, 2025

Description

I took a stab at bringing the OpenSearch JVector codec into Lucene as a sandbox codec (see issue #14681), to see how a DiskANN-inspired index might compare to the current generation of HNSW.
I made quite a few changes along the way and wanted to cut this PR to share some of those changes/results and maybe solicit some feedback from interested parties. Most notably, I did remove the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge-time for JVector indices. I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.

I hooked it up to lucene-util (PR incoming) for comparison, trying to play to the strengths of each codec while maintaining similar levels of parallelism. I ran HNSW with 32 indexing threads and a force-merge into 1 segment, and JVector with a single indexing thread backed by a 32-thread ForkJoinPool for its SIMD operations and ForkJoinPool.commonPool() for its other parallel operations. I also fixed oversample=1 for both and used neighborOverflow=2 and alpha=2 for JVector.
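For concreteness, the JVector-side pool setup amounts to something like the following (a minimal sketch; the class name is hypothetical, and how the pools get handed to JVector's graph builder is not shown):

```java
import java.util.concurrent.ForkJoinPool;

// Illustrative sketch of the concurrency setup described above (names are hypothetical).
public final class JVectorPoolSetup {
  // 32-way pool used for JVector's SIMD-heavy similarity computations.
  static final ForkJoinPool SIMD_POOL = new ForkJoinPool(32);
  // JVector's other parallel operations run on the shared common pool.
  static final ForkJoinPool PARALLEL_POOL = ForkJoinPool.commonPool();
}
```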

These results are from the 768-dim Cohere dataset, using PQ for quantization in JVector and OSQ in Lucene, on an m7g.16xlarge EC2 instance.

| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType | metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.965 | 1.408 | 1.399 | 0.994 | 100000 | 100 | 50 | 64 | 250 | no | 4968 | 5.99 | 16700.07 | 10.10 | 1 | 298.17 | 292.969 | 292.969 | HNSW | COSINE |
| 0.939 | 2.186 | 2.155 | 0.986 | 100000 | 100 | 50 | 64 | 250 | no | 3485 | 19.58 | 5107.77 | 0.01 | 1 | 318.80 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.963 | 1.409 | 1.401 | 0.994 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 5028 | 8.75 | 11431.18 | 12.95 | 1 | 372.84 | 367.737 | 74.768 | HNSW | COSINE |
| 0.939 | 9.524 | 9.516 | 0.999 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3525 | 886.28 | 112.83 | 0.01 | 1 | 392.79 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.899 | 0.967 | 0.959 | 0.992 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 5076 | 8.84 | 11314.78 | 9.07 | 1 | 335.80 | 331.116 | 38.147 | HNSW | COSINE |
| 0.937 | 3.469 | 3.457 | 0.997 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3437 | 148.70 | 672.51 | 0.01 | 1 | 356.17 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.669 | 0.681 | 0.673 | 0.988 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 5895 | 8.04 | 12439.36 | 8.84 | 1 | 308.42 | 303.459 | 10.490 | HNSW | COSINE |
| 0.730 | 1.056 | 1.044 | 0.989 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2672 | 51.39 | 1945.90 | 0.01 | 1 | 328.70 | 303.459 | 10.490 | JVECTOR | COSINE |

This PR is not really intended to be merged, in light of some of the feedback on the previous PR (#14892) that suggests Lucene should try to incorporate some of the learnings rather than add yet another KNN engine.

@mikemccand (Member)

> I did remove the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge-time for JVector indices.

Lucene's HNSW merging has exactly this optimization (reusing the incoming HNSW graph from the largest of the segments being merged, as long as there are no (or not many, now?) deletions, as a starting point for the merged HNSW graph), I think? So preserving this from jVector would make the comparison more fair ...

@mikemccand (Member) commented Dec 11, 2025

> I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.

Nice!

But, sigh, I see your PR there is blocked on the usual GitHub penalize-new-contributors "feature"/tax of insisting that a maintainer approve the GH automation actions that would smoke-test the PR and maybe give you some feedback on simple things to fix.

@mikemccand (Member)

@abernardi597 there was also a previous PR #14892 which implemented a Lucene Codec wrapping jVector, also inspired by OpenSearch's integration, but a while ago (early summer 2025). I suspect OpenSearch's jVector integration has made many improvements since then.

Anyways, how does your PR here compare to that original PR? Did you start from that one, or intentionally not start from it to do everything fresh, or something in between?

@mikemccand (Member)

> I also fixed oversample=1 for both and used neighborOverflow=2 and alpha=2 for JVector.

Does Lucene's HNSW have an analogue for neighborOverflow=2 and alpha=2 that you are trying to match to make the comparison as apples/apples as possible?

@mikemccand (Member)

> I hooked it up to lucene-util (PR incoming) for comparison

+1 thank you -- making it easy-ish for anyone to benchmark jVector against Faiss wrapped Codec and Lucene's HNSW implementation would be awesome.

knnPerfTest.py got a number of improvements recently (autologger, factoring away non-differentiating columns, preserving index-time and force-merge-time across invocations, etc.).

Plus we now have Cohere v3 vectors, 1024 dims instead of 768 from Cohere v2. And they are unit-sphere normalized, unlike Cohere v2.

@abernardi597 (Author) commented Dec 12, 2025

> Lucene's HNSW merging has exactly this optimization

I've been working on some modifications to further align the two implementations.
For example, I have added changes to do single-threaded graph construction on the indexing thread (instead of buffering all the docs and building the graph in parallel at flush time).
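The shape of that change is roughly the following (an illustrative sketch; the writer and builder types here are hypothetical stand-ins, not the actual classes in this PR):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the per-field writer inserts each vector into the graph as it arrives,
// on the indexing thread, instead of buffering everything and building the graph at flush.
final class IncrementalFieldWriterSketch {
  interface GraphBuilder {
    void addGraphNode(int node, float[] vector) throws IOException; // stand-in for the real builder API
  }

  private final List<float[]> vectors = new ArrayList<>();
  private final GraphBuilder builder;

  IncrementalFieldWriterSketch(GraphBuilder builder) {
    this.builder = builder;
  }

  void addValue(int docId, float[] vector) throws IOException {
    // docId-to-ordinal bookkeeping is omitted here.
    int node = vectors.size();          // ordinal of the new graph node
    vectors.add(vector);                // keep the raw vector for flush-time serialization
    builder.addGraphNode(node, vector); // grow the graph now rather than at flush
  }
}
```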

I am working on the graph-reuse bit, though it looks like Lucene also does a smart merge where it inserts key nodes from the smaller graph so that it can re-use the smaller graph's adjacency information to seed the graph search when inserting the remaining nodes. JVector does not do this at the moment, but would likely benefit from such a change (possibly as an upstream contribution).

> how does your PR here compare to that original PR?

I looked at the original PR as a starting point, but found that there were several key changes in the upstream OpenSearch implementation that could be brought in. Merging those commits seemed unwieldy, so I opted to start from scratch by checking out the codec into the sandbox. Then I fixed the build and style issues before making some changes to how the codec actually works to get more functional parity with Lucene's HNSW codecs. Specifically, I tried to get the extra KNN tests passing and to move towards the single-indexing-thread model I mentioned above.

> Does Lucene's HNSW have an analogue for neighborOverflow=2 and alpha=2

We have found that alpha=2 is actually partially responsible for the increase in index/search time. alpha > 1 is a hyper-parameter that relaxes the diversity check by a multiplicative factor, with alpha=1 being the same diversity check as HNSW. We found that alpha=2 resulted in graphs with every node saturated with edges (maxConn edges), which was really slowing down the construction and search.
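To make the alpha effect concrete, here is a small sketch of that relaxed diversity check (an illustration of the idea only, not JVector's or Lucene's actual pruning code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Alpha-relaxed neighbor selection: alpha = 1 matches HNSW's diversity heuristic;
// larger alpha prunes fewer candidates, so nodes tend to saturate at maxConn edges.
final class DiversityPruneSketch {
  interface Distance {
    float between(int a, int b);
  }

  static List<Integer> select(int base, List<Integer> candidates, int maxConn, float alpha, Distance dist) {
    List<Integer> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingDouble(c -> dist.between(base, c))); // closest candidates first
    List<Integer> selected = new ArrayList<>();
    for (int cand : sorted) {
      if (selected.size() >= maxConn) {
        break;
      }
      boolean diverse = true;
      for (int prior : selected) {
        // Drop cand if some already-selected neighbor is (after scaling by alpha) closer to it than base is.
        if (alpha * dist.between(prior, cand) <= dist.between(base, cand)) {
          diverse = false;
          break;
        }
      }
      if (diverse) {
        selected.add(cand);
      }
    }
    return selected;
  }
}
```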

There is also a hierarchyEnabled flag that adds layers to the graph in much the same fashion as the H in HNSW.
Enabling the hierarchy with alpha=1 and also allowing 2*maxConn for level=0 gives somewhat more promising results:

| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_doc/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType | metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.926 | 2.262 | 2.2 | 0.973 | 100000 | 100 | 50 | 64 | 250 | no | 3277 | 12.07 | 8282.26 | 0.01 | 1 | 319.21 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.926 | 10.106 | 9.95 | 0.985 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3238 | 196.74 | 508.27 | 0.01 | 1 | 393.46 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.926 | 3.444 | 3.386 | 0.983 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3189 | 75.32 | 1327.6 | 0.01 | 1 | 356.71 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.739 | 1.15 | 1.122 | 0.976 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2581 | 22.24 | 4496.4 | 0.01 | 1 | 329.15 | 303.459 | 10.49 | JVECTOR | COSINE |

Combining this with the single-threaded indexing mentioned above lets me run a more apples-to-apples test, with 32 indexing threads and 32 merge threads plus a final force-merge for both codecs:

| recall | latency(ms) | netCPU | avgCpuCount | numDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_doc/s | force_merge(s) | total_index(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.960 | 1.796 | 1.764 | 0.982 | 200000 | 100 | 50 | 64 | 250 | no | 5596 | 12.81 | 15612.80 | 24.58 | 37.39 | 1 | 596.91 | 585.938 | 585.938 | HNSW |
| 0.904 | 2.416 | 2.371 | 0.981 | 200000 | 100 | 50 | 64 | 250 | no | 3321 | 15.42 | 12972.69 | 73.05 | 88.47 | 1 | 686.89 | 585.938 | 585.938 | JVECTOR |
| 0.894 | 1.391 | 1.363 | 0.980 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 5661 | 18.55 | 10784.00 | 21.93 | 40.48 | 1 | 672.01 | 662.231 | 76.294 | HNSW |
| 0.903 | 3.923 | 3.862 | 0.984 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 3274 | 15.56 | 12850.99 | 107.83 | 123.39 | 1 | 760.87 | 662.231 | 76.294 | JVECTOR |
| 0.661 | 0.887 | 0.867 | 0.977 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 6552 | 17.25 | 11594.20 | 19.71 | 36.96 | 1 | 617.32 | 606.918 | 20.981 | HNSW |
| 0.724 | 1.252 | 1.229 | 0.982 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 2704 | 15.52 | 12888.26 | 35.89 | 51.41 | 1 | 705.95 | 606.918 | 20.981 | JVECTOR |

I'm nearly at a point where I can re-use the largest graph at merge-time, but I'm working through an elusive duplicate neighbor bug.

> making it easy-ish for anyone to benchmark jVector against Faiss wrapped Codec and Lucene's HNSW implementation would be awesome

Apologies for the delay here; I am working on re-applying my changes on top of these awesome improvements!

```gradle
moduleApi project(':lucene:facet')
moduleTestImplementation project(':lucene:test-framework')

moduleImplementation('io.github.jbellis:jvector:4.0.0-rc.5') {
```
@uschindler (Contributor)

Before merging, this has to be cleaned up. Lucene does not want declarations of external dependencies with version numbers here. This needs to move to the version.toml file.

@mikemccand (Member)

> Before merging, this has to be cleaned up. Lucene does not want declarations of external dependencies with version numbers here. This needs to move to the version.toml file.

+1 -- this PR is still very much a draft, I think. Thanks @uschindler.

@mikemccand (Member)

Thanks @abernardi597 -- this sounds like awesome progress.

It's curious how JVector's recall is not as good with no quantization, but as we quantize it fares better (vs Lucene), and then exceeds Lucene at extreme 1-bit quantization.

Do you have 8 bit quantization results for that last table? We could see if it participates in that trend too...

I see the JVector indices are somewhat larger (~15%) than the Lucene indices in your last table?

How did you make that table, with the alternating HNSW / jVector rows? Is this part of your pending PR changes for luceneutil? That's a nice capability! We could use it for comparing Faiss as well. Maybe open a separate luceneutil spinoff for that?

Does this mean the HNSW graph is still bushier in JVector? Or that maybe JVector is less efficient in how it stores its graph (which would make sense -- it's optimizing for fewer disk seeks, not necessarily total storage?). Actually knnPerfTest.py in luceneutil prints stats about the graphs, at least for Lucene's HNSW KnnVectorsFormat, like this:

```
Leaf 0 has 3 layers
Leaf 0 has 103976 documents
Graph level=2 size=2, Fanout min=1, mean=1.00, max=1, meandelta=0.00
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1   1
Graph level=1 size=1549, Fanout min=1, mean=14.82, max=64, meandelta=5443.59
%   0  10  20  30  40  50  60  70  80  90 100
    0   4   6   8  10  13  15  18  21  28  64
Graph level=0 size=103976, Fanout min=1, mean=18.57, max=128, meandelta=4003.06
%   0  10  20  30  40  50  60  70  80  90 100
    0   5   7   9  11  14  17  21  27  38 128
```

This is telling us the percentiles of node neighbor counts, I think. So at graph level=0 (the bottom-most one, with a 1:1 vector-to-node mapping), P40 nodes have 11 neighbors.

If we implement that for JVector then we could draw a more direct comparison of the two graphs? Maybe open a spinoff issue in luceneutil? Does JVector even expose enough transparency/APIs to compute these stats? We could at least compare the mean neighbor count of both as a coarse comparison.
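On the Lucene side those stats come from walking the graph level by level; the mean-neighbor-count piece looks roughly like this against Lucene's HnswGraph API (a sketch only; a JVector equivalent would need comparable introspection into its graph, which I'm assuming rather than showing):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.hnsw.HnswGraph;

// Sketch: mean neighbor count (fanout) per level of a Lucene HNSW graph.
final class MeanFanoutSketch {
  static void print(HnswGraph graph) throws IOException {
    for (int level = graph.numLevels() - 1; level >= 0; level--) {
      HnswGraph.NodesIterator nodes = graph.getNodesOnLevel(level);
      long nodeCount = nodes.size();
      long edgeCount = 0;
      while (nodes.hasNext()) {
        graph.seek(level, nodes.nextInt()); // position on this node's neighbor list
        while (graph.nextNeighbor() != DocIdSetIterator.NO_MORE_DOCS) {
          edgeCount++;
        }
      }
      double meanFanout = nodeCount == 0 ? 0 : (double) edgeCount / nodeCount;
      System.out.printf("level=%d size=%d mean fanout=%.2f%n", level, nodeCount, meanFanout);
    }
  }
}
```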

@abernardi597 (Author) commented Dec 15, 2025

> Do you have the 8 bit quantization results for the last table?

I didn't run that benchmark with 8-bit quantization, since empirically it seems to substantially increase indexing time and query latency without much benefit compared to 4-bit.
The way I am drawing comparisons from scalar to product quantization is by compression level. For example, 8-bit quantization represents 4x compression, which for PQ means using sub-vectors of dimension 1, where each sub-vector is encoded as a one-byte centroid index (so each 4-byte float compresses to one byte). Similarly, 1-bit quantization represents 32x compression, where PQ uses sub-vectors of dimension 8, so eight 4-byte floats compress into a one-byte centroid index. Even higher compression rates are theoretically possible with PQ than with scalar quantization, but I have not explored that at all.
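As a quick worked example of those compression levels for 768-dim float32 vectors (just arithmetic, not code from this PR):

```java
// Worked example: how the scalar-quantization bit widths line up with PQ sub-vector sizes at 768 dims.
public final class CompressionArithmetic {
  public static void main(String[] args) {
    int dims = 768;
    int rawBytes = dims * Float.BYTES; // 3072 bytes per raw vector
    int sq8Bytes = dims;               // 8-bit SQ: 1 byte per dim            -> 768 bytes (4x)
    int pq4xBytes = dims / 1;          // PQ: 1-dim sub-vectors, 1-byte codes -> 768 bytes (4x)
    int sq1Bytes = dims / 8;           // 1-bit SQ: 1 bit per dim             -> 96 bytes (32x)
    int pq32xBytes = dims / 8;         // PQ: 8-dim sub-vectors, 1-byte codes -> 96 bytes (32x)
    System.out.printf("raw=%d, 4x: sq8=%d pq=%d, 32x: sq1=%d pq=%d%n",
        rawBytes, sq8Bytes, pq4xBytes, sq1Bytes, pq32xBytes);
  }
}
```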

The second-to-last table does include 8-bit results for JVector, which show nearly 5x slower query latency and 16x slower indexing than raw vectors. It's also nearly 3x slower to query than 4-bit and takes more than 2x longer to index.

> How did you make that table, with the alternating HNSW/jVector rows?

knnPerfTest.py seems to support this already by specifying a tuple for indexType! Once I made the two codecs comparable with the same merge policy and concurrency parameters, it spit out the table for me.

> Does this mean the HNSW graph is still bushier in JVector?

I did wire this up while investigating some of the recall disparity, using settings intended to make the graphs comparable (e.g. alpha=1, useHierarchy=true), to validate that the two graphs aren't totally dissimilar.

For example, compare the results for HNSW (top) and JVector (bottom) on 500K docs without quantization:

```
Leaf 0 has 3 layers
Leaf 0 has 500000 documents
Graph level=2 size=62, Fanout min=1, mean=7.35, max=19, meandelta=29631.33
%   0  10  20  30  40  50  60  70  80  90 100
    0   2   4   4   5   6   7   9  11  12  19
Graph level=1 size=7113, Fanout min=1, mean=19.75, max=64, meandelta=20463.54
%   0  10  20  30  40  50  60  70  80  90 100
    0   6   8  11  13  16  20  24  30  40  64
Graph level=0 size=500000, Fanout min=1, mean=23.12, max=128, meandelta=18093.16
%   0  10  20  30  40  50  60  70  80  90 100
    0   7   9  12  14  17  21  26  33  47 128
Graph level=2 size=62, connectedness=1.00
Graph level=1 size=7113, connectedness=1.00
Graph level=0 size=500000, connectedness=1.00

Leaf 0 has 3 layers
Leaf 0 has 500000 documents
Graph level=2 size=21, Fanout min=1, mean=5.43, max=12, meandelta=-22546.25
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   2   3   4   5   7   7   7   8  12
Graph level=1 size=4001, Fanout min=1, mean=18.79, max=64, meandelta=-3287.41
%   0  10  20  30  40  50  60  70  80  90 100
    0   5   8  10  13  16  19  23  29  37  64
Graph level=0 size=500000, Fanout min=1, mean=23.11, max=128, meandelta=-1101.85
%   0  10  20  30  40  50  60  70  80  90 100
    0   7   9  12  14  17  21  26  33  47 128
Graph level=2 size=21, connectedness=1.00
Graph level=1 size=4001, connectedness=1.00
Graph level=0 size=500000, connectedness=1.00
```

Interestingly, the base layers show the same distribution of degrees (and nearly identical mean fanout), while the upper layers start to diverge, most notably in size.

@RKSPD commented Dec 16, 2025

Hi @abernardi597 thanks for working on the JVector Lucene integration project! Please feel free to discuss anything with me about running performance tests with knnPerfTest and integration with luceneutil. I'll be happy to help you in any way I can!

@mikemccand (Member)

> Hi @abernardi597 thanks for working on the JVector Lucene integration project! Please feel free to discuss anything with me about running performance tests with knnPerfTest and integration with luceneutil. I'll be happy to help you in any way I can!

+1, thank you (and hello again!) @RKSPD! Benchmarking is hard ... we are discussing the challenges specifically with jVector's benchmarking on this issue.

@mikemccand (Member)

The JVector-specific KNN query seems to have some interesting query-time hyper-parameters:

```java
    private final int overQueryFactor;
    private final float threshold;
    private final float rerankFloor;
    private final boolean usePruning;
```

Does Lucene's KNN query have corollaries for these?

@mikemccand (Member)

> > how does your PR here compare to that original PR?
>
> I looked at the original PR as a starting point, but found that there were several key changes in the upstream OpenSearch implementation that could be brought in. Merging those commits seemed unwieldy, so I opted to start from scratch by checking out the codec into the sandbox. Then I fixed the build and style issues before making some changes to how the codec actually works to get more functional parity with Lucene's HNSW codecs. Specifically, I tried to get the extra KNN tests passing and to move towards the single-indexing-thread model I mentioned above.

Got it -- the upstream (OpenSearch jvector plugin Codec) changed a lot since @RKSPD's first PR.

But then I wonder if we are missing anything that @RKSPD did the first time around? Or is this PR entirely a superset of how that first PR was using jVector?

@RKSPD commented Dec 18, 2025

> The JVector-specific KNN query seems to have some interesting query-time hyper-parameters:
>
> ```java
>     private final int overQueryFactor;
>     private final float threshold;
>     private final float rerankFloor;
>     private final boolean usePruning;
> ```
>
> Does Lucene's KNN query have corollaries for these?

In my experience with benchmarking, overQueryFactor, threshold, rerankFloor = 0 kept the performance metrics similar to Lucene HNSW for small index speed testing (Cohere 768, 200k docs on luceneutil). Using the knnPerfTest run parameters, we can use the fanout/overSample levers to test apples/apples performance vs HNSW. Also, @abernardi597: for testing multi-threaded performance, maybe check whether knnPerfTest's numIndexThreads = 1 leads to better benchmarks?
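For reference, on the stock-Lucene side the fanout/oversample levers boil down to collecting more than topK approximate candidates and keeping only the best topK; a minimal sketch of that shape (the mapping to JVector's overQueryFactor is an assumption here, and threshold/rerankFloor/usePruning have no direct knob in this query):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;

// Sketch: emulate an over-query factor by asking the approximate search for more candidates than topK.
final class OverQuerySketch {
  static TopDocs search(IndexSearcher searcher, String field, float[] target, int topK, int overQueryFactor)
      throws IOException {
    KnnFloatVectorQuery query = new KnnFloatVectorQuery(field, target, topK * overQueryFactor);
    return searcher.search(query, topK); // keep only the best topK of the oversampled candidates
  }
}
```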
