cuLSH is a GPU-accelerated locality-sensitive hashing library.
It provides a scikit-learn-like fit/query API for index building and querying.
| Algorithm | Class | Distance Metric | Data Type |
|---|---|---|---|
| Random Projection LSH | RPLSH | Cosine | Dense |
| MinHash LSH | MinHashLSH | Jaccard | Sparse |
| p-Stable LSH | PStableLSH | Euclidean | Dense |
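For intuition, random projection LSH (the cosine variant above) hashes a vector to the sign pattern of a few random hyperplane projections; vectors close in cosine distance tend to share sign patterns and therefore land in the same bucket. A toy NumPy sketch of the idea — not cuLSH's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_hashes = 128, 8
# One random hyperplane per hash bit
planes = rng.standard_normal((n_hashes, d)).astype(np.float32)

def rp_hash(x):
    """Hash a vector to an 8-bit bucket key from hyperplane signs."""
    bits = (planes @ x) > 0
    return int(np.packbits(bits)[0])

x = rng.standard_normal(d).astype(np.float32)
x_near = x + 0.01 * rng.standard_normal(d).astype(np.float32)
# Nearby vectors (small angle) usually produce the same bucket key;
# cuLSH repeats this over n_hash_tables independent sets of hyperplanes.
```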
Prerequisites: CUDA Toolkit (11.8+), CMake (3.24+), C++17 compiler.
We recommend installing in a conda environment:
```shell
git clone https://github.com/rishic3/cuLSH.git
cd cuLSH/python
conda create -n culsh python=3.12 -y
conda activate culsh
pip install .
```

For development:

```shell
pip install -e .
pip install -r requirements_dev.txt

# Rebuild .so after C++ changes
make clean && make release
```

Run cuLSH in a Docker container:

```shell
docker build -t culsh .
docker run --gpus all -it culsh
```

Note: Requires NVIDIA Container Toolkit for GPU access.
```python
import numpy as np
from culsh import PStableLSH

# NumPy inputs
X = np.random.randn(10000, 128).astype(np.float32)
Q = np.random.randn(100, 128).astype(np.float32)

# Fit (returns PStableLSHModel)
model = PStableLSH(n_hash_tables=16, n_hashes=8, seed=42).fit(X)

# Query (returns candidate neighbors)
candidates = model.query(Q)

# Access results
indices = candidates.get_indices()  # Candidate neighbor indices
offsets = candidates.get_offsets()  # Start offset for each query in indices
counts = candidates.get_counts()    # Number of candidates per query
```

```python
import cupy as cp
from culsh import PStableLSH

# CuPy inputs
X = cp.random.randn(10000, 128, dtype=cp.float32)
Q = cp.random.randn(100, 128, dtype=cp.float32)

model = PStableLSH(n_hash_tables=16, n_hashes=8).fit(X)
candidates = model.query(Q)

# Access results as CuPy arrays
indices = candidates.get_indices(as_cupy=True)
offsets = candidates.get_offsets(as_cupy=True)
```

```python
import scipy.sparse
# import cupyx.scipy.sparse
from culsh import MinHashLSH

X = scipy.sparse.random(10000, 5000, density=0.01, format='csr')
Q = scipy.sparse.random(100, 5000, density=0.01, format='csr')
# Or use CuPy:
# X = cupyx.scipy.sparse.random(10000, 5000, density=0.01, format='csr')
# Q = cupyx.scipy.sparse.random(100, 5000, density=0.01, format='csr')

model = MinHashLSH(n_hash_tables=32, n_hashes=4).fit(X)
candidates = model.query(Q)
```

Query the same data used for fitting:

```python
candidates = PStableLSH(n_hash_tables=16, n_hashes=8).fit_query(X)
```

For large query sets, use `batch_size` to reduce peak GPU memory:

```python
candidates = model.query(Q, batch_size=1000)
```

```python
# Save fitted model
model.save("model.npz")

# Load model
from culsh import PStableLSHModel
model = PStableLSHModel.load("model.npz")
```

Benchmark results (log-scale):
Dataset: SIFT1M (1M training vectors, 10k query vectors, 128 dimensions).
Algorithms: RPLSH (cuLSH, GPU) vs. FAISS IndexLSH (CPU), each with 32 hash tables × 16 hashes.
Hardware: NVIDIA RTX A6000 (GPU), Intel Core i9-14900K with AVX2 (CPU).
Note: cuLSH uses bucket-based retrieval (returns all bucket matches), while FAISS IndexLSH uses hash-then-scan (computes Hamming distance to all vectors and returns the top k). The top-k value for FAISS was set to match the mean number of candidates returned by cuLSH, so the two are compared at equal output sizes.
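The `indices`/`offsets`/`counts` layout returned by `query` is CSR-like: the candidates for query `i` live at `indices[offsets[i] : offsets[i] + counts[i]]`. To get FAISS-style fixed-size output, bucket matches can be re-ranked by exact distance. A minimal sketch in NumPy, using synthetic arrays in place of a real cuLSH result (`rerank_topk` is a hypothetical helper, not part of the cuLSH API):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16)).astype(np.float32)  # fitted data
Q = rng.standard_normal((3, 16)).astype(np.float32)     # queries

# Stand-in for a cuLSH query result: ragged candidate lists in CSR-like form
cand_lists = [rng.choice(1000, size=int(rng.integers(5, 50)), replace=False)
              for _ in range(len(Q))]
indices = np.concatenate(cand_lists)
counts = np.array([len(c) for c in cand_lists])
offsets = np.concatenate([[0], np.cumsum(counts)[:-1]])

def rerank_topk(q, cand, k=5):
    """Re-rank one query's candidates by exact L2 distance, keeping the top k."""
    d = np.linalg.norm(X[cand] - q, axis=1)
    order = np.argsort(d)[:k]
    return cand[order], d[order]

for i in range(len(Q)):
    # Slice this query's candidates out of the flat indices array
    cand = indices[offsets[i] : offsets[i] + counts[i]]
    topk_idx, topk_dist = rerank_topk(Q[i], cand)
```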
