Computing Jaccard similarity across all entries is a hercules task. Its in the order of . However there are many proposed optimization techniques that can significantly reduce the number of pairs that one needs to consider. I spent almost a week studying the various techniques and googling useful resources related to this topic. Below is a summary of the different approaches I found and links to few useful resources about these approaches. In particular I found two resources to be very useful:
Approaches for Optimizing Jaccard Similarity Computation
1. Token Based Filtering:
- Idea: Partition the data by tokens and consider only those pairs where at least one token matches.
- Use this approach only when you want to compute jaccard similarity for all pairs, except where it is zero. Pairs that don’t share any tokens will be dropped from comparison.
- If there are duplicate tokens, then you will need to preprocess and convert the duplicate tokens into set.
- Reference: Chapter 6 of Similarity Joins in Relational Databases
2. Prefix Filtering
- Idea: Instead of using all tokens, use only top n tokens. n is dependent on the minimum jaccard similarity value .
- This approach will consider only those pairs that have jaccard similarity greater than .
- This approach has no false negative i.e. pairs that have jaccard similarity greater than the minimum jaccard similarity will always be considered. But there can be false positive. Hence it is important to calculate jaccard similarity by considering actual tokens.
- Needs an extra pre-computation step where tokens needs to be ordered in some manner. In general ordering the tokens by frequency in ascending order improves performance of this approach
- There are other variations of this approach such as positional filtering.
- Reference: Chapter 6 of Similarity Joins in Relational Databases , Chapter 3 of Mining Massive Datasets.
3. Locality Sensitive Hashing (using MinHash)
- Idea: Instead of consider tokens directly, hash the token using k different hash functions and identify minimum hash value associated with each hash function. In order to compute jaccard similarity, simply count number of times min hash value matches and divide it by k. In order to reduce the number of pairs that we consider, further hash min hash signature by considering r columns at a time.
- Similar to prefix filtering, this approach only considers those pairs that have a high probability that the jaccard similarity is greater than some predetermined minimum jaccard similarity
- This approach can lead to both false negative and false positive i.e. it is possible that pairs with jaccard similarity are dropped from being considered.
- There are several parameters that one has to consider, namely: (1) family of hash function, (2) number of hash functions to consider, (3) number of rows and bands while partitioning minhash signatures during LSH.
- Chapter 6 of Similarity Joins in Relational Databases: Use this chapter to get a general idea about minHash and LSH
- CSE 344 (Washing University), Section 10: This is a problem set (with solution). The problems in this resource significantly helped reinforce my understanding about minHash
- Columbia University Lecture 1 Dealing with Massive Dataset (Duplicate Detection): Use this resource to understand relationship between number of hash functions and error associated with computing jaccard similarity.