Efficient Textual Similarity Across Millions of Web Queries

Computing textual similarity (such as Jaccard similarity coefficient) between millions of search queries can be an arduous task. The main challenge is the number of pairs that one needs to consider; a relatively small dataset containing ten thousands queries leads to more than 49 million possible query pairs ({{10000}\choose{2}} = 49,995,000).

Based on Vernica, et.al. paper, I show a simple and efficient technique to filter out non-similar query pairs and thereby significantly reduce the number of query pairs that one needs to evaluate. I first explain the idea and then discuss how it can be implemented within the MapReduce paradigm. Lastly I work through a toy dataset containing about 4000 queries and demonstrate significant time improvement. Note I assume that you have basic understanding of the MapReduce paradigm. Even if you don’t, it should be pretty simple to understand the idea.

The Idea:
The general idea is to active filter query pairs that don’t share any common terms and thereby only consider those query pairs that at-least share one common term. The trick is in partitioning the data based on prefix-filtering approach. It will be easier to understand if we can consider an example. Consider three queries as shown below. Across the three queries,  we have five unique terms (or tokens): irs, forms, tax, internal and revenue.

  1. irs forms
  2. tax forms
  3. internal revenue
 Now let’s assume, related to each of the five unique terms, we have five buckets that are filled with queries containing the given term atleast once (Note: as shown in the table we  duplicate queries across multiple buckets if they have more than one terms).  
Bucket 1:
  • irs forms
Bucket 2:
  • irs forms
  • tax forms
Bucket 3:
  • tax forms
Bucket 4:
  • internal revenue
Bucket 5:
  • internal revenue

Now we only need to compute textual similarity between queries that appear in a single bucket. In other words, we compare a given query to only those other queries that appear together with a given query. For instance, consider the first query, “irs forms”. It appears in two buckets: 1 and 2. There are no other queries in bucket 1 and hence there is nothing to do over there. There is one another query “tax forms” in bucket 2 and thus we need to compute textual similarity between “irs forms” and “tax forms”.  Further notice that we never compute textual similarity between “irs forms” and “internal revenue” as the two never appear together in any given bucket.

Even the above simple example shows the benefit of prefix filtering. Using $latext N^2$ approach (hereafter I call it as Naive approach), we will need to evaluate three query pairs:

  • “irs forms”, “tax forms”
  • “irs forms”, “internal revenue”
  • “tax forms”, “internal revenue”

However, using prefix filtering we only evaluate one query pair: “irs forms” and “tax forms”. Thus prefix filtering helps avoid computing textual similarity between queries that don’t share any term. Now let’s focus on how to implement it efficiently within the MapReduce paradigm.

Its pretty simple to implement the above idea using single MapReduce. For each input query, the mapper does two things:

  1. Tokenize the query
  2. Output <token, query> pair for each identified token.

For instance if the query is “irs forms”, the mapper will output two records corresponding to this query: <irs, irs forms> and <forms, irs forms>. This correlates to our bucket idea as shown in the table above.

If we are using hadoop, the output from the mapper will be sorted based on key (token) and all records having same key will be sent to a single reducer. The reducer‘s job is to collect all the queries related to a single token and compute textual similarity between these subset of queries. Here we can use our naive approach to do the computation.

There is however one more thing can be done to conserve bandwidth between mapper and reducer. Consider the third query, “internal revenue”, in our running example. For this query, the mapper will generate two outputs as shown below

  • <internal, internal revenue>
  • <revenue, internal revenue>

However, both “internal” and “revenue” appear only once in our query dataset, i.e. there is only one query related to both the terms. Hence we know that buckets corresponding to these two terms will contain only one query. Thus, it doesn’t make sense to generate buckets that occurs only once in the dataset. We can implement this logic by using two MapReduce phases. The first MapReduce phase (mapper and reducer) is concerened with building a dictionary of terms that occur more than once. In the second MapReduce phase, the mapper uses these dictionary to generate <token, query> pairs for only those tokens (or terms) that are present in the dictionary.  The reducer remains the same as earlier. Here I am assume that the dictionary is small enough to fit in the memory. To an extent this is not a big assumption as we know from language theory generally number of tokens are much smaller.

Note: I have uploaded all the ruby code on github and can be found over here. There is a run.sh file that shows how to run the code. You don’t need hadoop in order to test the code as you can simulate the hadoop environment through sequence of shell commands.

In order to test and evaluate the benefit of prefix filtering, I downloaded a toy dataset containing all the web searches performed on AOL and that lead to click on irs.gov website. You can read more about the dataset and download it from here.

There are 4943 queries in the dataset composed of 2325 unique terms. Out of 2325 terms, only 1086 terms occur more than once.

Approach Number of Pairs
Average Running time
Naive Approach: (N^2)  12,214,153 2m41.412s
Single MapReduce Phase    2,675,684 45.199s*
Dictionary Based MapReduce
(Two MapReduce Phase)
   2,675,684 44.340s*

*I simulated hadoop environment through shell as shown below. Hence the time doesn’t include startup time for hadoop.

#One MapReduce Approach
time cat queries.txt | ruby SimplePairGeneratorMapper.rb | sort -t$'\t' -k1,1 | ruby PairGeneratorReducer.rb > prefix_output.txt

#Two MapReduce Approach
time cat queries.txt | ruby FrequencyCountMapper.rb | sort -t$'\t' -k1,1 | ruby FrequencyCountReducer.rb > dictionary
time cat queries.txt | ruby DictionaryPairGeneratorMapper.rb | sort -t$'\t' -k1,1 | ruby PairGeneratorReducer.rb > prefix_output.txt

As you can see from the above table, using prefix filtering has significant advantage over the naive approach as it significantly cut downs number of pairs that are evaluated. The number of pairs evaluated by the naive approach is almost 6 times more than that are evaluated by prefix filtering. As a result, the processing time is significantly lower for the prefix filtering based approach. Its almost 3.5 times faster as compared to the naive approach.

However I didn’t see significant time difference between the simple and dictionary based prefix filtering approaches. One reason might be that the dataset is not big enough to see any substantial difference. But on the other hand, in a larger dataset you are less likely to find terms that appears only once (assuming you have already corrected any mistakes) and hence there won’t be much difference between the two approaches.

A Word Of Caution
Prefix filtering is not panacea for textual similarity problem. There is one big problem. Some query pairs are evaluated multiple times because they appear together in multiple buckets. For instance consider these two queries: “irs forms” and “irs forms w4”. The two queries will appear together inside two buckets corresponding to the following two terms: “irs” and “forms”. As a result, the two queries will be compared twice. For web queries this is not big problem as web queries are generally two to three words long. However, it will be a challenge when we are dealing with larger textual documents (such as sentences and paragraphs). Let’s say two textual documents are exactly same and contain 1000 words. Then the two documents will be compared 1000 times. One solution might be to use an additional MapReduce phase. The first phase reducer simply generates document pairs that needs to be evaluated and sort them somehow so that ordering between the two documents is maintained. The next reducer removes any duplicate pairs. You might also want to replicate duplicate removal process in combiner so as to conserve bandwidth between mapper and reducer.