Changes from all commits (116 commits)
0138a99  Create Nystrom.py (huddyyeo, Feb 24, 2021)
d73a55b  adding code and unit test for nystrom (hl-anna, Feb 24, 2021)
ef95f4c  added ivf_np tests (Gantrithor-AI, Feb 24, 2021)
4c4cf47  added tests for ivf (Gantrithor-AI, Feb 24, 2021)
dad6759  added ivf numpy tests (Gantrithor-AI, Feb 24, 2021)
de88326  added tests for ivf_pytorch (Gantrithor-AI, Feb 24, 2021)
21207c5  final edit (huddyyeo, Feb 24, 2021)
9f3b308  add empty init files (huddyyeo, Feb 24, 2021)
bbf9876  make lint happy (huddyyeo, Feb 24, 2021)
b685143  changed default use_gpu setting to false (huddyyeo, Feb 24, 2021)
9da09b9  added unit tests for nystrom (hl-anna, Mar 4, 2021)
f0da4b1  linter (huddyyeo, Mar 4, 2021)
8c6124b  removing sklearn function (hl-anna, Mar 4, 2021)
e7e980c  applied black linting (hl-anna, Mar 4, 2021)
e4be739  applied black linting (hl-anna, Mar 4, 2021)
73a58c3  minor changes and black linting (hl-anna, Mar 4, 2021)
70c6718  changing maximum -> max for older torch (hl-anna, Mar 4, 2021)
fab4ae3  updated exp kernel (hl-anna, Mar 26, 2021)
50aa46d  updated exp kernel (hl-anna, Mar 26, 2021)
17cd242  add IVF superclass (huddyyeo, Apr 1, 2021)
d541aeb  typo correction (huddyyeo, Apr 1, 2021)
e311bab  Revert "typo correction" (huddyyeo, Apr 1, 2021)
fc60335  changing tests (huddyyeo, Apr 1, 2021)
3686d61  import utils correctly (huddyyeo, Apr 1, 2021)
63d1782  add lazytensor import to base ivf class (huddyyeo, Apr 1, 2021)
86c2380  black (huddyyeo, Apr 1, 2021)
8765257  add clustering functions as input (huddyyeo, Apr 1, 2021)
838f68a  added unused device to np utils zeros (huddyyeo, Apr 1, 2021)
96536b3  updated utils (hl-anna, Apr 5, 2021)
f627cf8  added utils (hl-anna, Apr 5, 2021)
14774a1  added numpy utils (hl-anna, Apr 5, 2021)
5e8b39e  testing rearranging np utils (huddyyeo, Apr 5, 2021)
6e80436  added numpy utils (hl-anna, Apr 5, 2021)
32a3fe7  remove 1 space (huddyyeo, Apr 5, 2021)
e83034e  Merge branch 'knn_and_convos' of https://github.com/huddyyeo/keops in… (hl-anna, Apr 6, 2021)
83d2e64  removed LazyTensor from utils (hl-anna, Apr 6, 2021)
5c00d65  update to add kmeans optimisation approximation (huddyyeo, Apr 6, 2021)
73a6b5a  Merge branch 'knn_and_convos' of https://github.com/huddyyeo/keops in… (huddyyeo, Apr 6, 2021)
46ba1fc  changing kmeans inputs (huddyyeo, Apr 6, 2021)
9ea0af6  typo (huddyyeo, Apr 6, 2021)
0337e7f  edit spacing to match (huddyyeo, Apr 6, 2021)
8fcdade  change tab to space (huddyyeo, Apr 6, 2021)
b6a4c68  add dummy inputs to np kmeans (huddyyeo, Apr 6, 2021)
ad342f4  remove normalising in kmeans (huddyyeo, Apr 6, 2021)
8d23e6c  update var name (huddyyeo, Apr 6, 2021)
3fa2782  correction (huddyyeo, Apr 6, 2021)
9429712  change angular to negative dot product (huddyyeo, Apr 9, 2021)
085f071  add import ivf to init files (huddyyeo, Apr 9, 2021)
785b038  trying to resolve merge conflict (huddyyeo, Apr 9, 2021)
cc5c84d  moving around code (huddyyeo, Apr 9, 2021)
17435c8  rearrange torch init (huddyyeo, Apr 9, 2021)
765279b  removing space (huddyyeo, Apr 9, 2021)
a4e6c9b  Revert "removing space" (huddyyeo, Apr 9, 2021)
4bea5b2  add space (huddyyeo, Apr 9, 2021)
41ddcba  moving code around (huddyyeo, Apr 9, 2021)
c1a79f0  running black (huddyyeo, Apr 9, 2021)
03952a7  test (huddyyeo, Apr 9, 2021)
9de2b85  changed import structure (huddyyeo, Apr 9, 2021)
ac33af7  changed import structure again (huddyyeo, Apr 9, 2021)
8e1b404  adding angular full metric (huddyyeo, Apr 9, 2021)
020d468  added angular, manhattan metrics to numpy test (Gantrithor-AI, Apr 9, 2021)
2a929a5  added metrics to torch unit test (ivf) (Gantrithor-AI, Apr 9, 2021)
d407a62  calc angular distances without torch.linalg - test (Gantrithor-AI, Apr 9, 2021)
fb6b5fb  delete normalise (huddyyeo, Apr 9, 2021)
f3f57a5  Merge branch 'knn_and_convos' of https://github.com/huddyyeo/keops in… (huddyyeo, Apr 9, 2021)
7a6dcc7  black (huddyyeo, Apr 9, 2021)
d8deff2  add docstrings + NND (huddyyeo, Apr 11, 2021)
73ff718  black (huddyyeo, Apr 11, 2021)
d4b2ca3  add imports for NND (huddyyeo, Apr 11, 2021)
403be68  fixed euclidean typo (Gantrithor-AI, Apr 12, 2021)
6b7e77a  typo + changed leaf_multiplier default (Gantrithor-AI, Apr 12, 2021)
330645a  Add files via upload (Gantrithor-AI, Apr 12, 2021)
66e500c  Add files via upload (Gantrithor-AI, Apr 12, 2021)
df92907  add ivf torch tut (huddyyeo, Apr 13, 2021)
a1a3b0f  rearranging code to avoid conflict (huddyyeo, Apr 13, 2021)
9ccc927  add np tutorial for ivf (huddyyeo, Apr 13, 2021)
6df8737  Merge branch 'master' into knn_and_convos (huddyyeo, Apr 13, 2021)
59a7deb  add spaces (huddyyeo, Apr 13, 2021)
d3cf556  adding back new code for knn benchmark (huddyyeo, Apr 13, 2021)
47da991  NNDescent version with clusters (Gantrithor-AI, Apr 15, 2021)
58a2098  Merge branch 'master' into knn_and_convos (jeanfeydy, Apr 15, 2021)
a495a26  requested edits 1 (huddyyeo, Apr 18, 2021)
af1d29b  edit tests to reflect correct import structure (huddyyeo, Apr 18, 2021)
3c9d184  full stops on generic ivf class (huddyyeo, Apr 18, 2021)
dd0d702  change doc strings for parent classes (huddyyeo, Apr 18, 2021)
6982e8a  change numpy ivf approximation error message (huddyyeo, Apr 18, 2021)
8fc2640  update utils to add comments (huddyyeo, Apr 18, 2021)
301e807  nn descent code update (huddyyeo, Apr 18, 2021)
1ec3c02  updated as per jean's comments (Gantrithor-AI, Apr 18, 2021)
95378e9  black (huddyyeo, Apr 18, 2021)
8307208  updated tutorials (huddyyeo, Apr 19, 2021)
4a73bfd  added nystrom scripts and unit tests (hl-anna, Apr 20, 2021)
13d52f7  updated imports in unit tests (hl-anna, Apr 20, 2021)
c2b12d6  updated imports and added note to kmeans (hl-anna, Apr 20, 2021)
3d204ba  changed torch init (huddyyeo, Apr 20, 2021)
5d7f1bb  changed capitalisation (huddyyeo, Apr 20, 2021)
4621fd0  updated nystrom (huddyyeo, Apr 20, 2021)
b6962bc  testing updated import structure (huddyyeo, Apr 22, 2021)
d30e00f  moved imports around again (huddyyeo, Apr 22, 2021)
129d216  Update plot_nnd_torch.py (Gantrithor-AI, Apr 23, 2021)
ddbdefe  Update utils.py (Gantrithor-AI, Apr 23, 2021)
02f0fee  Create utils.py (Gantrithor-AI, Apr 23, 2021)
7c793e8  Update plot_nnd_torch.py (Gantrithor-AI, Apr 23, 2021)
48959da  Update plot_ivf_torch.py (Gantrithor-AI, Apr 23, 2021)
b1c3095  Update plot_nnd_torch.py (Gantrithor-AI, Apr 23, 2021)
f3ccd59  reorganised accuracy computations (huddyyeo, Apr 23, 2021)
5390fcc  added in updates for Nystroem (hl-anna, Apr 25, 2021)
1e23100  Delete nystrom_generic.py (hl-anna, Apr 25, 2021)
ce55836  Delete nystrom.py (hl-anna, Apr 25, 2021)
e3b2f07  Delete Nystrom.py (hl-anna, Apr 25, 2021)
0d2eba7  Rename nystroem.py to nystrom.py (hl-anna, Apr 25, 2021)
cd3fb2d  Rename nystroem.py to nystrom.py (hl-anna, Apr 25, 2021)
8c4e079  Rename nystroem_generic.py to nystrom_generic.py (hl-anna, Apr 25, 2021)
bd4bf43  rename (huddyyeo, Apr 27, 2021)
8778cbe  shifting import (huddyyeo, Apr 28, 2021)
ce36828  add packages (huddyyeo, Apr 28, 2021)
71 changes: 68 additions & 3 deletions pykeops/benchmarks/plot_benchmark_KNN.py
@@ -62,8 +62,6 @@

use_cuda = torch.cuda.is_available()


##############################################
# We then specify the values of K that we will inspect:

Ks = [1, 10, 50, 100] # Numbers of neighbors to find
@@ -386,6 +384,69 @@ def f(x_test):
return fit


############################################################################
# KeOps IVF-Flat implementation
# --------------------------------------
#
# KeOps IVF-Flat is an approximate nearest-neighbour method that leverages the
# KeOps engine. The IVF-Flat algorithm comprises 4 steps: (1) split the
# training data into clusters using K-means, (2) find the 'a' nearest clusters
# to each cluster, (3) assign each query point to its nearest cluster, and (4)
# perform the nearest-neighbour search only within the 'a' clusters closest to
# each query point's assigned cluster. Steps (1) and (2) are performed during
# fitting, while steps (3) and (4) are performed at query time, where they
# save time by reducing the number of pairwise distance computations.

from pykeops.torch.knn import IVF


def KNN_KeOps_ivf_flat(K, metric="euclidean", clusters=100, a=10, **kwargs):

# Setup the K-NN estimator:
if metric == "angular":
metric = "angular_full" # alternative metric for non-normalised data
KNN = IVF(k=K, metric=metric)

def fit(x_train):
x_train = tensor(x_train)
start = timer()
KNN.fit(x_train, clusters=clusters, a=a)
elapsed = timer() - start

def f(x_test):
x_test = tensor(x_test)
start = timer()
indices = KNN.kneighbors(x_test)
elapsed = timer() - start
indices = indices.cpu().numpy()

return indices, elapsed

return f, elapsed

return fit


##################################################################
# The time savings and accuracies achieved depend on the underlying data
# structure, the number of clusters and the 'a' parameter. The algorithm
# slows down noticeably for more than 200 clusters. Reducing the proportion of
# clusters searched over (i.e. the ratio a/clusters) speeds up the search but
# lowers its accuracy. For structured data (e.g. MNIST), high accuracies of
# over 90% can be reached by searching over just 10% of the clusters. For
# uniformly distributed random data, however, over 80% of the clusters need to
# be searched to attain the same >90% accuracy.

# Here, we propose 2 sets of parameters that work well on real data (e.g.
# MNIST, GloVe); a minimal direct-usage sketch follows these presets:

KNN_KeOps_gpu_IVFFlat_fast = partial(KNN_KeOps_ivf_flat, clusters=10, a=1)
KNN_KeOps_gpu_IVFFlat_slow = partial(KNN_KeOps_ivf_flat, clusters=200, a=40)
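
##############################################
# For reference, here is a minimal sketch of how the IVF class can be used
# directly, outside of the benchmark wrapper above; the variable names, array
# sizes and parameter values below are purely illustrative:

demo_device = "cuda" if use_cuda else "cpu"
x_train_demo = torch.randn(10000, 3, device=demo_device)  # database points
x_test_demo = torch.randn(1000, 3, device=demo_device)  # query points

ivf_demo = IVF(k=10, metric="euclidean")  # build the estimator
ivf_demo.fit(x_train_demo, clusters=50, a=5)  # steps (1)-(2): cluster the data
demo_indices = ivf_demo.kneighbors(x_test_demo)  # steps (3)-(4): search the nearby clusters only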

##############################################


################################################################################
# SciKit-Learn tree-based and bruteforce methods
# -----------------------------------------------------
@@ -631,7 +692,9 @@ def run_KNN_benchmark(name, loops=[1]):
routines = [(KNN_JAX_batch_loop, "JAX (small batches, GPU)", {})]
else:
routines = [
(KNN_KeOps, "KeOps (GPU)", {}),
(KNN_KeOps, "KeOps-Flat (GPU)", {}),
(KNN_KeOps_gpu_IVFFlat_fast, "KeOps-IVF-Flat (GPU, nprobe=1)", {}),
(KNN_KeOps_gpu_IVFFlat_slow, "KeOps-IVF-Flat (GPU, nprobe=40)", {}),
(KNN_faiss_gpu_Flat, "FAISS-Flat (GPU)", {}),
(KNN_faiss_gpu_IVFFlat_fast, "FAISS-IVF-Flat (GPU, nprobe=1)", {}),
(KNN_faiss_gpu_IVFFlat_slow, "FAISS-IVF-Flat (GPU, nprobe=40)", {}),
@@ -660,6 +723,8 @@ def run_KNN_benchmark(name, loops=[1]):
legend_location="upper right",
linestyles=[
"o-",
"+-.",
"x-.",
"s-",
"^:",
"<:",
228 changes: 228 additions & 0 deletions pykeops/common/ivf.py
@@ -0,0 +1,228 @@
class GenericIVF:
"""Abstract class to compute IVF functions.

End-users should use 'pykeops.numpy.ivf' or 'pykeops.torch.ivf'.

"""

def __init__(self, k, metric, normalise, lazytensor):

self.__k = k
self.__normalise = normalise
self.__update_metric(metric)
self.__LazyTensor = lazytensor
self.__c = None

def __update_metric(self, metric):
"""
Update the metric used in the class.
"""
if isinstance(metric, str):
self.__distance = self.tools.distance_function(metric)
self.__metric = metric
elif callable(metric):
self.__distance = metric
self.__metric = "custom"
else:
raise ValueError(
f"The 'metric' argument has type {type(metric)}, but only strings and functions are supported."
)
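# Note: the supported metric strings are expected to include "euclidean",
# "manhattan", "angular" and "angular_full" (see the KNN benchmark), while a
# callable metric should map two LazyTensor operands to a symbolic distance.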

@property
def metric(self):
"""Returns the metric used in the search."""
return self.__metric

@property
def clusters(self):
"""Returns the clusters obtained through K-Means."""
if self.__c is not None:
return self.__c
else:
raise NotImplementedError("Run .fit() first!")

def __get_tools(self):
pass

def __k_argmin(self, x, y, k=1):
"""
Compute, for each point in x, the indices of its k nearest points in y (k may be 1 or larger).
"""
x_i = self.__LazyTensor(
self.tools.to(self.tools.unsqueeze(x, 1), self.__device)
)
y_j = self.__LazyTensor(
self.tools.to(self.tools.unsqueeze(y, 0), self.__device)
)

D_ij = self.__distance(x_i, y_j)
if not self.tools.is_tensor(x):
if self.__backend:
D_ij.backend = self.__backend

if k == 1:
return self.tools.view(self.tools.long(D_ij.argmin(dim=1)), -1)
else:
return self.tools.long(D_ij.argKmin(K=k, dim=1))
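# For x of shape (N, D) and y of shape (M, D), this returns an (N,) vector of
# indices into y when k == 1, and an (N, k) array of indices otherwise.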

def __sort_clusters(self, x, lab, store_x=True):
"""
Sorts a dataset according to its cluster labels.

Args:
x ((N, D) array): Input dataset of N points in dimension D.
lab ((N) array): Labels for each point in x.
store_x (bool): If True, store the sort permutation for the fitted dataset x; otherwise store it for the query dataset y.
"""
lab, perm = self.tools.sort(self.tools.view(lab, -1))
if store_x:
self.__x_perm = perm
else:
self.__y_perm = perm
return x[perm], lab

def __unsort(self, indices):
"""
Given an array of neighbour indices computed on the sorted datasets, undoes any prior sorting operations.
First, __x_perm[indices] maps each neighbour index back to its position in the original (unsorted) x.
Then, index_select with __y_perm.argsort() restores the original ordering of the query points.
"""
return self.tools.index_select(
self.__x_perm[indices], 0, self.__y_perm.argsort()
)
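# Example: if __x_perm = [2, 0, 1], __y_perm = [1, 0] and indices = [[0], [1]]
# (indices into the *sorted* x, one row per *sorted* y point), then
# __x_perm[indices] = [[2], [0]] recovers the original x indices, and
# re-ordering the rows with __y_perm.argsort() = [1, 0] gives [[0], [2]],
# i.e. the neighbours of the query points in their original order.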

def _fit(
self,
x,
clusters=50,
a=5,
Niter=15,
device=None,
backend=None,
approx=False,
n=50,
):
"""
Fits the input dataset: clusters it with K-means and precomputes the cluster ranges and nearest-cluster table used at query time.
"""

# basic checks that the hyperparameters are as expected
if type(clusters) != int:
raise ValueError("Clusters must be an integer")
if clusters >= len(x):
raise ValueError("Number of clusters must be less than length of dataset")
if type(a) != int:
raise ValueError("Number of clusters to search over must be an integer")
if a > clusters:
raise ValueError(
"Number of clusters to search over must be less than total number of clusters"
)
if len(x.shape) != 2:
raise ValueError("Input must be a 2D array")
# normalise the input if selected
if self.__normalise:
x = x / self.tools.repeat(self.tools.norm(x, 2, -1), x.shape[1]).reshape(
-1, x.shape[1]
)

# if we want to use the approximation in Kmeans, and our metric is angular, switch to full angular metric
if approx and self.__metric == "angular":
self.__update_metric("angular_full")

x = self.tools.contiguous(x)
self.__device = device
self.__backend = backend

# perform K-Means
cl, c = self.tools.kmeans(
x,
self.__distance,
clusters,
Niter=Niter,
device=self.__device,
approx=approx,
n=n,
)

self.__c = c
# perform one final cluster assignment, since K-means ends on the cluster-update step
cl = self.__assign(x)

# obtain the nearest clusters to each cluster
ncl = self.__k_argmin(c, c, k=a)
self.__x_ranges, _, _ = self.tools.cluster_ranges_centroids(x, cl)

x, x_labels = self.__sort_clusters(x, cl, store_x=True)
self.__x = x
r = self.tools.repeat(self.tools.arange(clusters, device=self.__device), a)
# create a [clusters, clusters] sized boolean matrix
self.__keep = self.tools.to(
self.tools.zeros([clusters, clusters], dtype=bool), self.__device
)
# set the indices of the nearest clusters to each cluster to True
self.__keep[r, ncl.flatten()] = True

return self

def __assign(self, x, c=None):
"""
Assigns each point in a dataset to its nearest cluster.
If no clusters are given, uses the clusters found through K-Means.

Args:
x ((N, D) array): Input dataset of N points in dimension D.
c ((M, D) array): Cluster locations of M points in dimension D.
"""
if c is None:
c = self.__c
return self.__k_argmin(x, c)

def _kneighbors(self, y):
"""
Obtain the k nearest neighbors of the query dataset y.
"""
if self.__x is None:
raise ValueError("Input dataset not fitted yet! Call .fit() first!")
if self.__device and self.tools.device(y) != self.__device:
raise ValueError("Input dataset and query dataset must be on same device")
if len(y.shape) != 2:
raise ValueError("Query dataset must be a 2D tensor")
if self.__x.shape[-1] != y.shape[-1]:
raise ValueError("Query and dataset must have same dimensions")
if self.__normalise:
y = y / self.tools.repeat(self.tools.norm(y, 2, -1), y.shape[1]).reshape(
-1, y.shape[1]
)
y = self.tools.contiguous(y)
# assign y to the previously found clusters and get labels
y_labels = self.__assign(y)

# obtain y_ranges
y_ranges, _, _ = self.tools.cluster_ranges_centroids(y, y_labels)
self.__y_ranges = y_ranges

# sort y contiguous
y, y_labels = self.__sort_clusters(y, y_labels, store_x=False)

# perform actual knn computation
x_i = self.__LazyTensor(self.tools.unsqueeze(self.__x, 0))
y_j = self.__LazyTensor(self.tools.unsqueeze(y, 1))
D_ij = self.__distance(y_j, x_i)
ranges_ij = self.tools.from_matrix(y_ranges, self.__x_ranges, self.__keep)
D_ij.ranges = ranges_ij
indices = D_ij.argKmin(K=self.__k, axis=1)
return self.__unsort(indices)

def brute_force(self, x, y, k=5):
"""Performs a brute force search with KeOps

Args:
x ((N, D) array): Input dataset of N points in dimension D.
y ((M, D) array): Query dataset of M points in dimension D.
k (int): Number of nearest neighbors to obtain.

"""
x_LT = self.__LazyTensor(self.tools.unsqueeze(x, 0))
y_LT = self.__LazyTensor(self.tools.unsqueeze(y, 1))
D_ij = self.__distance(y_LT, x_LT)
return D_ij.argKmin(K=k, axis=1)
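
The concrete PyTorch front-end used in the benchmark above (pykeops.torch.knn.IVF) is expected to build on this class, so the brute_force helper gives a convenient exact reference against which the approximate search can be checked. The following sketch is illustrative only and assumes the front-end inherits GenericIVF's public methods unchanged; tensor sizes and parameter values are arbitrary:

import torch
from pykeops.torch.knn import IVF

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10000, 3, device=device)  # database points
y = torch.randn(100, 3, device=device)  # query points

ivf = IVF(k=5, metric="euclidean")
ivf.fit(x, clusters=50, a=5)
approx_nn = ivf.kneighbors(y)  # approximate IVF-Flat search
exact_nn = ivf.brute_force(x, y, k=5)  # exact KeOps search, for reference

# rough accuracy proxy: fraction of positions where both results agree
recall = (approx_nn == exact_nn).float().mean().item()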