Skip to content

Random Hashes #2

@chrisjmccormick

Description

@chrisjmccormick

Thanks for sharing this code, it's been helpful!

I am seeing a serious bug, though. The results of running this code show that some documents with zero Jaccard similarity (no shingles in common) have a MinHash signature similarity greater than 0.5!

I dug into it and the reason for this is in the definition of the random hash function. Currently, the random hash performs a modulo of the maximum shingle ID. The proper implementation is to use a prime number slightly larger than the maximum shingle ID. Source: http://stackoverflow.com/a/24678616

What's happening currently is that the random hash functions are producing a lot of collisions, so the minhash signatures aren't very good.

For example, try the following Matlab code:

% Random coefficients.
a = 7112;
b = 13067;

% Maximum shingle ID
N = 24000;

% Evaluate hash over all possible inputs
x = 1:24000;
codes = mod((a*x + b), N);

% Count the number of unique results
length(unique(codes))

ans =

       3000

Only 1/8th of the outputs are unique!

As a quick-and-dirty way of getting the next prime, you can look here:
https://primes.utm.edu/lists/small/100000.txt

The next prime number after 24,000 is 24,001. Using this value for the modulo makes the code work properly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions