Optimizing Jaccard Similarity Computation for Big Data

Computing Jaccard similarity across all pairs of entries is a herculean task: it is on the order of O(N^2). However, there are many proposed optimization techniques that can significantly reduce the number of pairs one needs to consider; a naive baseline is sketched right after the resource list below. I spent almost a week studying the various techniques and searching for useful resources on this topic. Below is a summary of the different approaches I found, with links to a few useful resources about each. In particular, I found two resources to be very useful:

  1. Chapter 6 of Similarity Joins in Relational Databases and
  2. Chapter 3 of Mining Massive Datasets.
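
To make the quadratic cost concrete, here is a minimal Python sketch of the naive all-pairs computation that all of the techniques below try to avoid. The `jaccard` helper and the toy `records` data are my own illustrations, not code from the resources above.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Toy data: each record is a set of tokens.
records = [
    {"big", "data", "mining"},
    {"data", "mining", "massive"},
    {"machine", "learning"},
]

# Naive approach: compare every pair, i.e. O(N^2) comparisons for N records.
for (i, a), (j, b) in combinations(enumerate(records), 2):
    print(i, j, jaccard(a, b))
```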

Approaches for Optimizing Jaccard Similarity Computation

1. Token-Based Filtering

  • Idea: Partition the data by tokens and consider only those pairs where at least one token matches (see the sketch after this list).
  • Note:
    • Use this approach only when you want to compute Jaccard similarity for all pairs except those where it is zero; pairs that don’t share any tokens are dropped from comparison.
    • If a record contains duplicate tokens, you will need a preprocessing step that deduplicates them into a set.
  • Reference: Chapter 6 of Similarity Joins in Relational Databases
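
Here is a minimal sketch of this idea, assuming each record has already been converted to a deduplicated token set. It builds an inverted index from tokens to record ids and compares only records that share at least one bucket; the names `token_filter_pairs` and `index` are mine, not from the book.

```python
from collections import defaultdict
from itertools import combinations

def token_filter_pairs(records):
    """Return id pairs of records that share at least one token.
    records: list of token sets."""
    index = defaultdict(set)  # inverted index: token -> record ids
    for i, tokens in enumerate(records):
        for t in tokens:
            index[t].add(i)
    candidates = set()
    for ids in index.values():
        # Only records in the same bucket share this token.
        candidates.update(combinations(sorted(ids), 2))
    return candidates

records = [{"big", "data"}, {"data", "mining"}, {"machine", "learning"}]
for i, j in sorted(token_filter_pairs(records)):
    print(i, j, len(records[i] & records[j]) / len(records[i] | records[j]))
```

Note that the pair (0, 2) is never compared because those records share no tokens, which is exactly the zero-similarity pruning described above.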

2. Prefix Filtering

  • Idea: Instead of using all tokens, use only the first n tokens of each record under a global token ordering. n depends on the minimum Jaccard similarity \tau (see the sketch after this list).
  • Note:
    • This approach is designed to consider only those pairs that can have Jaccard similarity greater than \tau .
    • This approach has no false negatives, i.e. pairs whose Jaccard similarity exceeds the minimum threshold are always considered. But there can be false positives, hence it is important to verify each candidate pair by computing the exact Jaccard similarity over the full token sets.
    • It needs an extra pre-computation step in which the tokens are put into some global order. In general, ordering the tokens by frequency in ascending order (rarest tokens first) improves the performance of this approach, since rare tokens appear in few records and hence generate few candidate pairs.
  • There are other variations of this approach, such as positional filtering.
  • References: Chapter 6 of Similarity Joins in Relational Databases; Chapter 3 of Mining Massive Datasets.
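
Below is a minimal sketch of prefix filtering under the following assumptions: records are token sets, the global ordering is ascending token frequency, and for a Jaccard threshold \tau the prefix of a record x has length |x| - \lceil \tau |x| \rceil + 1 (the standard choice for Jaccard). All names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict

def prefix_filter_pairs(records, tau):
    """Find all id pairs with Jaccard similarity >= tau via prefix filtering.
    records: list of token sets, tau: minimum Jaccard similarity in (0, 1]."""
    freq = Counter(t for r in records for t in r)
    # Pre-computation step: order each record's tokens by ascending
    # global frequency (rarest tokens first).
    ordered = [sorted(r, key=lambda t: (freq[t], t)) for r in records]
    index = defaultdict(set)  # token -> ids whose prefix contains it
    candidates = set()
    for i, tokens in enumerate(ordered):
        # Any pair with Jaccard >= tau must share a token in this prefix.
        p = len(tokens) - math.ceil(tau * len(tokens)) + 1
        for t in tokens[:p]:
            candidates.update((j, i) for j in index[t])
            index[t].add(i)
    # No false negatives, but false positives are possible: verify each
    # candidate with the exact Jaccard similarity over the full token sets.
    return [(i, j) for i, j in candidates
            if len(records[i] & records[j]) / len(records[i] | records[j]) >= tau]

records = [{"big", "data", "mining"}, {"data", "mining", "massive"}, {"ml"}]
print(prefix_filter_pairs(records, tau=0.4))  # -> [(0, 1)]
```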

3. Locality-Sensitive Hashing (using MinHash)

  • Idea: Instead of considering tokens directly, hash each record’s tokens with k different hash functions and record, for each hash function, the minimum hash value; these k minima form the record’s MinHash signature. To estimate Jaccard similarity, simply count the number of signature positions where the min hash values match and divide by k. To reduce the number of pairs considered, further hash the signatures by partitioning them into bands of r rows each and bucketing records that agree on an entire band (see the sketch after this list).
  • Note:
    • Similar to prefix filtering, this approach considers only those pairs that have a high probability of Jaccard similarity greater than some predetermined minimum.
    • Unlike prefix filtering, this approach can lead to both false negatives and false positives, i.e. it is possible that pairs whose Jaccard similarity exceeds the threshold are dropped from consideration, and that low-similarity pairs survive as candidates.
    • There are several parameters one has to choose, namely: (1) the family of hash functions, (2) the number of hash functions k, and (3) the number of rows r and bands b used when partitioning the MinHash signatures during LSH.
  • Reference: Chapter 3 of Mining Massive Datasets.
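Below is a minimal sketch of MinHash signatures plus LSH banding. The XOR-mask hash family, k = 12 hash functions, and b = 4 bands of r = 3 rows are arbitrary illustrative choices (and Python’s built-in string hash is randomized per process), so treat this as a toy, not a tuned implementation.

```python
import random
from collections import defaultdict

def minhash_signature(tokens, hash_funcs):
    """MinHash signature: the minimum hash value under each hash function."""
    return tuple(min(h(t) for t in tokens) for h in hash_funcs)

def lsh_candidate_pairs(records, k=12, bands=4, seed=1):
    """Bucket records by bands of their MinHash signatures; records that
    agree on all rows of at least one band become candidate pairs."""
    rng = random.Random(seed)
    # Toy hash family: XOR the built-in hash with k random 64-bit masks.
    masks = [rng.getrandbits(64) for _ in range(k)]
    hash_funcs = [lambda t, m=m: hash(t) ^ m for m in masks]
    sigs = [minhash_signature(r, hash_funcs) for r in records]
    rows = k // bands  # r rows per band
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for i, sig in enumerate(sigs):
            buckets[sig[b * rows:(b + 1) * rows]].append(i)
        for ids in buckets.values():
            candidates.update((ids[x], ids[y])
                              for x in range(len(ids))
                              for y in range(x + 1, len(ids)))
    return sigs, candidates

records = [{"big", "data", "mining"}, {"data", "mining", "massive"}, {"ml"}]
sigs, cands = lsh_candidate_pairs(records)
for i, j in sorted(cands):
    # Estimated Jaccard = fraction of matching signature positions.
    print(i, j, sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i]))
```

With k fixed, increasing r makes each band stricter (fewer false positives, more false negatives), while increasing b does the opposite; this is exactly the trade-off behind parameter (3) in the list above.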

