-
Notifications
You must be signed in to change notification settings - Fork 9
Description
Thanks for sharing this code, it's been helpful!
I am seeing a serious bug, though. The results of running this code show that some documents with zero Jaccard similarity (no shingles in common) have a MinHash signature similarity greater than 0.5!
I dug into it and the reason for this is in the definition of the random hash function. Currently, the random hash performs a modulo of the maximum shingle ID. The proper implementation is to use a prime number slightly larger than the maximum shingle ID. Source: http://stackoverflow.com/a/24678616
What's happening currently is that the random hash functions are producing a lot of collisions, so the minhash signatures aren't very good.
For example, try the following Matlab code:
% Random coefficients.
a = 7112;
b = 13067;
% Maximum shingle ID
N = 24000;
% Evaluate hash over all possible inputs
x = 1:24000;
codes = mod((a*x + b), N);
% Count the number of unique results
length(unique(codes))
ans =
3000
Only 1/8th of the outputs are unique!
As a quick-and-dirty way of getting the next prime, you can look here:
https://primes.utm.edu/lists/small/100000.txt
The next prime number after 24,000 is 24,001. Using this value for the modulo makes the code work properly.