Apache Pig: Macro for Splitting Data Into Training and Testing dataset

Introduction: Apache Pig (> 0.7.0) comes with a handy operator, Split, to separate a relation into two or more relations. For instance let’s say we have a website “users” data and depending on the age of a user we want to create two different datasets: kids, adults, seniors. This can be easily achieved with a single command and in a single map/reduce job using the Split operator as show below.

split users into kids if age < 18,
                 adults if age >= 18 and age < 65,
                 seniors otherwise;

Problem: However, if you are trying to randomly split the data into training and testing dataset, you can’t directly use the split operator as it cannot handle non deterministic functions (such as RANDOM). Thus the below command won’t work and will raise an error:

split data into testing if RANDOM() <= 0.10,
                training otherwise;

Solution: Below is a simple macro that uses the split operator but removes the non deterministic function issue by first assigning random values to each tuple and then filtering on those values. In order to make sure that the returned dataset has exactly the same schema as the input dataset, I am using a small trick over here. I assign random values as the first column so that in the foreach operator I can easily skip it by using numeric column reference.

--
-- Macro: split_into_training_testing
-- @param inputData(relation): Input dataset that needs to be
--                                  split into training and testing
-- @param split_percentage(double): Indicates the size of the testing
--                                  dataset in relation to original dataset.
--                                  split_percentage should be within 0 and 1
-- Returns two relations. The first relation contains (1-split_percentage) samples.
-- The second dataset contains split_percentage samples.
--
DEFINE split_into_training_testing(inputData, split_percentage)
RETURNS training, testing
{
    data = foreach $inputData generate RANDOM() as random_assignment, *;
    split data into testing_data if random_assignment <= $split_percentage, training_data otherwise;
    $training = foreach training_data generate $1..;
    $testing = foreach testing_data generate $1..;
};

-- Sample Usage
inData = load 'some_files.txt'...;
training, testing = split_into_training_testing(inData, 0.1);
Posted in Data Mining, Hadoop | Tagged | 1 Comment

On Writing Python UDF for Pig: A perspective

When writing python UDF for Pig, one is faced with multiple options. In this post, I show three different approaches to writing python UDF for Pig. To keep things in perspective, lets take an example of student’s dataset containing following fields: name, GPA score and residential zipcode. Let’s assume our task is to rank students by there GPA score within each zipcode. Pig 0.11 comes with handy rank operator but doesn’t work inside the foreach operator (see this blog post for details). So let’s write an enumerate_bag operator as shown here but using different approaches. 

Approach 1: Using @outputSchema decorator
This is the most commonly used approach while writing python UDF. The idea is to simply bind a fixed schema to the returned value of the UDF so that Apache Pig know’s how to handle it. In this approach we use the “@outputSchema” decorator and define the output schema just above the python udf (as shown in the code below).

#myudf.py
@outputSchema("record: {(rank:int, name:chararray, gpa:double, zipcode:chararray)}")
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output
register 'myudf.py' using jython as myudf;
students = load 'students.dat' using PigStorage(',') as (name:chararray, gpa:double, zipcode:chararray);
students_by_zipcode = group students by zipcode;
result = foreach students_by_zipcode {
           sorted = order students by gpa desc;
           ranked = myudf.enumerate_bag(sorted);
           generate flatten(ranked);
        };
dump result;

The problem with the above approach however is that it makes the UDF specific to a particular dataset. In the example above, our enumerate_bag UDF is tied to the student’s dataset and can’t be used anywhere else. This is kind of restricting especially for UDF that are generic in nature and have usage beyond a particular situation. This takes to our second approach.

Approach 2: Binding schema within Pig
Using the @outputSchema decorator ties the UDF to a particular dataset. In order to overcome this issue, the second approach is to bind the schema to the returned value within the Pig script, where we have much more information about the dataset and its schema. Without the @outputSchema decorator, Pig interpret’s the UDF output to be of type bytearray. Within the Pig script, we cast this bytearray to some other form as shown below.

#myudf.py
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output
register 'myudf.py' using jython as myudf;
students = load 'students.dat' using PigStorage(',') as (name:chararray, gpa:double, zipcode:chararray);
students_by_zipcode = group students by zipcode;
result1 = foreach students_by_zipcode {
           sorted = order students by gpa desc;
           ranked = myudf.enumerate_bag(sorted);
           generate ranked as record: {(rank:int, name:chararray, gpa:double, zipcode:chararray)};
        };
-- note we need to this additional line to explode bag.
result2 = foreach result1 generate flatten(record);
dump result2;

Although the above approach makes our enumerate_bag UDF generic enough to be used outside of the given task, I don’t like two things about this approach. First, we need an extra statement to flatten out the results (notice result1 and result2). Second, this kind of approach can’t be used within a macro as we won’t know the schema before hand. This leads us to our third approach of using the @outputSchemaFunction decorator.

Approach 3: Using @outputSchmaFunction:
This is by far the most complex solution but also the one that makes our UDF truly generalized and offers the greatest flexibility. The idea is to use a python function that returns the schema for the value retuned by the UDF in runtime.

As shown in the code, we have a enumerate_bag UDF that returns rank position along with other fields in the dataset and enumerateBagSchema function that returns the output schema. The input to the enumerateBagSchema is a Schema object, which in turn contains one or more Field Schemas. We inform Pig about the enumerateBagSchema function by using the @outputSchemaFunction decorator (placed just above the UDF). In addition, we also have to use the @schemaFunction decorator (just above the enumerateBagSchema function).

#myudf.py
import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outuptSchemaFunction("enumerateBagSchema")
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output

# define the schema for enumerate_bag UDF
@schemaFunction("enumerateBagSchema")
def enumerateBagSchema(input):
    # create rank position field
    fields = ['rank_position']
    dt = [DataType.INTEGER]

    #The input schema is a bag of tuples. Get the tuple from bag
    for i in input.getField(0).schema.getFields():
        # iterate over all the fields within the tuple
        for j in i.schema.getFields():
            fields.append(j.alias)
            dt.append(j.type)

    # return new schema
    return SchemaUtil.newBagSchema(fields, dt);
register 'myudf.py' using jython as myudf;
students = load 'students.dat' using PigStorage(',') as (name:chararray, gpa:double, zipcode:chararray);
students_by_zipcode = group students by zipcode;
result = foreach students_by_zipcode {
           sorted = order students by gpa desc;
           ranked = myudf.enumerate_bag(sorted);
           generate flatten(ranked);
        };
dump result;

Conclusion
We have several different choices for writing python UDF. If you don’t plan to use the UDF within a Pig Macro, then the second approach is worth considering as it makes the UDF generic but also much faster to code. However the challenge is that if one changes the UDF then one has to make sure to change the schema definition in all the Pig scripts that’s using the UDF. For instance, if we change the position of the newly added rank column from first to last position in the above UDF, then we have to remember to do the same in all the Pig script that is using the UDF. Using the third approach makes this redundant as the information related to the UDF is in one place. Thus,

  1. Use the first approach if your UDF is very specific to a particular dataset.
  2. Use the second approach if you plan to use the UDF in just few Pig scripts and it will be only you who will end up using the UDF.
  3. Use the third approach if you want the UDF to be generic that would be used across many projects and by many people.
Posted in General | Tagged , | 1 Comment

Rank Operator in Pig 0.11

In one of my earlier posts, I talked about implementing the RANK operator for extracting top N records. Since then lot of things have changed and I thought it will be worth while to catchup. Hive 0.7 or greater now has a “row_sequence” operator that assigns rank position to all the input records in the order they arrive.

Similarly, in the upcoming release of Apache Pig (0.11), there is a new handy “RANK” operator. The basic syntax is

ranked = RANK input [BY [COL [ASC|DESC]]] [DENSE];

However, below are some things to keep in mind about the “RANK” operator:
1. Ties are assigned same rank position: Below is an example of how ties are handled. Table 1 lists a sample student’s dataset. In Table 2, the dataset is sorted by GPA score. Note that student B and C have same score and therefore both are assigned the same rank position of 2. Also note that the student D is assigned rank position 4 instead of 3.

Table 1: Student Dataset
StudentID GPA
A 4.0
B 3.5
C 3.5
D 3.0
Table 2:
ranked = rank student by GPA desc
rank_student StudentID GPA
1 A 4.0
2 B 3.5
2 C 3.5
4 D 3.0

2. Use the keyword “DENSE” to avoid gaps in rank positions: As show in Table 2, by default the “RANK” operator inserts gaps in between rank positions if ties are observed. However, this behavior can be modified by using the “DENSE” keyword. Below is a sample output if using the DENSE keyword.

ranked = rank student by GPA desc DENSE
rank_student StudentID GPA
1 A 4.0
2 B 3.5
2 C 3.5
3 D 3.0

3. It didn’t work inside the FOREACH operator: Let’s assume we also have residential address for each of the above student and we want to  want to rank students within different zipcode of their home address. A natural (as shown below) instinct would be to group the dataset by zipcode and then apply rank operator within the “foreach” operator. 
students_by_zipcode = group students by zipcode;
ranked = foreach students_by_zipcode{
ranked = rank students by GPA desc;
generate flatten(ranked);
}

However, the above code threw an error indicating that the rank operator can’t be nested within the foreach operator.  Since the official docs for Pig 0.11 is not yet available, i am not sure whether this is a bug or an intended limitation of RANK operator and will keep an eye on it. But for now, I hope this is helpful.

 

Posted in Data Mining | Tagged , | 5 Comments

Comparing Ranked List

Recently working on a recommendation engine, I stumbled across an interesting problem. I wanted to put some automated checks in place so that if the newly generated list of recommendations significantly differ from the previous one then the system raises some kind of flags (such as an email alert). The challenge was however how to quantify similarity or differences between the new and the old recommendation list. The output of the recommendation engine is a sorted list of items. Thus, the problem was to quantify similarity between two ranked lists.

As it turns out, the problem of comparing two ranked lists is already known and quite popular. A very interesting paper on the topic is “A Similarity Measure for Indefinite Rankings“.  Not only the literature review is very informative, but even the suggested approach seems to be comprehensive. Below is a brief summary of the paper. Also, I have put together an implementation of rank biased overlap (RBO) that is mentioned in the above linked paper.

There are broadly two different approaches to compare ranked lists: (1) Rank Correlation and (2) Set Based Measure.

Ranked Correlation:
Rank correlation based approaches such as Kendall Tau essentially measure the probability of two items being in the same order in the two ranked lists. For instance if item A appears before the item B in list 1, then what’s the probability that the item A also appears before the item B in list 2. Mathematically, Kendall Tau is computed as

\tau = P(C) - P(D) = \frac{C}{N}- \frac{D}{N}= \frac{C-D}{N}
where:

  • N = Total number of pairs = \frac{1}{2}n(n-1), where n is the number of items in a list.
  • C = Number of concordant pairs i.e number of pairs for which relative ordering (as explained above) is preserved in the two lists. Thus, probability of concordant pairs P(C) = \frac{C}{N}.
  • D = Number of discordant pairs i.e. number of pairs for which relative ordering is reversed. Thus, probability of discordant pairs P(D) = \frac{D}{N}.

There are however certain challenges with all the rank correlation based measures. Often referred as dis-jointness problem, one of the challenges is what happens if an item is present only in one of the list. In this case there is no way to calculate number of concordant or discordant pairs unless you make certain assumptions about the missing item. There are number of extensions of Kendall Tau that make certain assumptions to deal with the above dis-jointness problem. For instance, a common approach is to assume that the missing item is at the bottom of the list.

Another problem, known as top-weighteness, with the Kendall Tau is that the rank position of an item has no effect on the final similarity score. For instance, consider three lists:
A: [a, b, c, d, e]
B: [b, a, c, d, e]
C: [a, b, c, e, d]
As compared to list A, both list B and C switch one item pair (items a and b are switched in B and items e and d are switched in list C), but at different rank positions. However, both have get the same Kendall Tau score of 0.8. However, in some cases (such as search engine), the top ranking items are more important as compared to lower ranking items. Hence, it is expected to give lower score to a list that switches item position at a higher rank position (List B) as compared to a list that switches items at a lower ranked position (List C).

Set Based Measure:
The general idea of set intersection is well known. By itself, set intersection has no concept of rank. It just considers two bag of items and returns the list of items that are common in the two bags. However, researchers and mathematicians have been able to able to use the idea of set intersection in some innovative ways to quantify similarity between two ranked list. In general, the idea is to determine the fraction of content overlapping at different depths. For instance, we compare List A and B in the above example, the length of set intersection at various depth is shown below

Depth Items in List A @ k Items in List B @ k Set intersection Fraction
1 a b {} 0/2 = 0
2 a,b b,a {a,b} 2/2 = 1
3 a,b,c b,a,c {a,b,c} 3/3 = 1
4 a,b,c,d b,a,c,d {a,b,c,d} 4/4 = 1
5 a,b,c,d,e b,a,c,e {a,b,c,d,e} 5/5 = 1

Once we have fraction of content overlapping at various depth, one can either plot the distribution to study how similar two lists are or, as in the case of Average overlap score, return the average of the last column. Since observing a common item at higher rank contributes to all the lower ranked intersections, this kind of approach is naturally top-weighted i.e. it gives more importance to items at the higher rank. For instance, if we compare List A and B in the above example, the average overlap score at depth 5 is 4/5=0.8. However, if comparing List A and C, the average overlap score is \frac{(1+1+1+0.6+1)}{5} =0.92 .

Rank Biased Overlap (RBO) extends the above idea of average overlap in two ways. First, the above approach is not bounded. That is the average overlap value can range anywhere from zero to infinity and therefore is not useful. In order to address this issue, RBO uses geometric series. One nice property of geometric series that is used in RBO is that it is a type of convergent series. That is, it can be easily shown that the indefinite sum of geometric series is finite and is given as \sum_{d=1}^{\infty} p^{d+1} = \frac{1}{1-p} . Thus, using geometric series, RBO is able to bound the unbounded average overlap score for indefinite rank position. However, another advantage of using geometric series in RBO. Since the values in geometric series decreases with the increasing depth, it allows RBO to explicitly model user’s behavior i.e. the likelihood of going from a given rank position i to i+1. This is the crux of RBO but if you read the above paper then you will many more things such as how to extend the idea to lists that are of different lengths, etc.

Quick Links
1. RBO Implementation

Posted in Data Mining, Machine Learning | Leave a comment

TextMate + MultiMarkDown + MathJax: The Perfect Combo for Leanpub Authros

Leanpub is a great publishing platform for independent authors. However first time authors often struggle deciding on the right editor, especially because leanpub uses markdown encoding. After hours of googling, I failed to find any perfect editor. In particular I was looking for a simple editor but had two major requirements:

1. I haven’t used markdown syntax much before and wanted an editor that has live preview option.
2. I often deal with equations and therefore needed an editor that can support latex based equation and show them in live preview without I personally moving the files around.

After googling for almost a day, I failed to find the perfect editor. But I managed to build one by using  the combination of TextMate, Multimarkdown and MathJax. You might want to see this video to get excited about my solution.

Below are the steps you need to carry out in order to get the TextMate based solution working.

Step 1: Install Multimarkdown: From your terminal, run the following command. I am assuming that you are familiar with homebrew. If not, read more about it over here
brew install multimarkdown

Step 2: Install Markdown TextMate bundle
a. download markdown textmate bundle from here
b. Unzip and rename the top level folder to “markdown.tmbundle”
c. Double click on it and it should be installed in TextMate.

Step 3: Update TextMate Path Variable:
a. Start TextMate and go to TextMate > Preferences > Advanced Tab
b. Edit Path variable and add /usr/local/bin: as the first thing (Make sure that you don’t delete other paths)

Step 4. Install MathJax
a. Download MathJax from here
b. Open Mac System Preferences. Click on “Sharing” icon and switch on “Web Sharing”.
c. On the right side, you will see an option to “Create Personal Website Folder”. Click that button. This will create a folder named “Sites” in your home directory.
d. Unzip and move MathJax folder to the above created (~/Sites) folder.
e. Test whether MathJax is properly installed by going to the following address: “http://127.0.0.1/~CHANGE_TO_USERNAME/Mathjax/test/index.html&#8221;

Step 5. Test Live Preview of Markdown and Equation Support
a. Download the sample file and open it using TextMake. Make sure that the langugage setting is set to “Multimarkdown”. If not, change it to “Multimarkdown”.
b. From TextMate Menu bar, select Windows > Show Web Preview. You should see nicely formatted output along with the equation.
c. Now try modify to the modify content of the downloaded sample file and you will notice that the web preview automatically refreshes.

Voila!!!

Questions:
1. What is there Xhtml header and what’s its doing ?
The Xhtml header at the top of the sample file calls the MathJax library. Also it modifies the equation tag from $$ to {$$}. I did that mainly because leanpub requires that equations are enclosed by {$$} and {/$$}.

2. I am getting markdown command not found.
Make sure you have appended path variable in TextMate. See step 3b

3. Equations are not being rendered correctly
Make sure that Mathjax is correctly installed by going to “http://127.0.0.1/~USERNAME/Mathjax/test/index.html&#8221;. If this is working, then make sure that you have correct path in the script tag of sample file.

4. My computer is on fire!! and is starting to run very slow 
Change the refresh rate in preview. If you set it 0.0 seconds that TextMate web preview will call MathJax after you add even a single character. I generally set it to 500 seconds when I am focusing on writing and change it to 0.5 seconds if I am working on formatting issues.

 

Posted in General, Tools, Writing | Leave a comment

TerminalOMatic: Customizing terminal to be even faster

For most MacBook Pro users, Terminal is an indispensable application. But we can make even better. Below is a collection of some of the tricks that I found to be are some of the tricks that I found very useful. I generally put these commands in my ~/.bashrc file. You might however want to read this blog before you decide where to put these custom commands.

Trick 1: Intelligent History Scrolling
Scrolling through history although useful can be a pain. Often you are looking for a particular command. Here is a solution that might be worth exploring. Simply bind your UP and DOWN arrow key so that it searches history looking for command that starts with what you have typed so far. For instance, searching for commands that starts with “javac”, simply type “javac” on the command line and then press UP arrow key. Now history will show only those commands that start with “javac”.

Trick 2: Case Insensitive Autcomplete
Its frustrating to realize that you were looking for a folder named “Pictures” by typing “p” instead of “P” on the command line and the autocompletion is not able to find it. Again the solution is very simple. Simply put this one line of code in your ~/.bashrc so that auto-completion is case insensitive.
bind 'set completion-ignore-case on'

Trick 3: Application shortcut
Often I like opening json files in chrome or firefox. But it’s pain to open chrome/firefox and then go to the file menu and search for the appropriate file. Why not just open it through command line. This is possible through the following command

$>open -a /Application/Firefox.app/ ~/Documents/sample.json

But you can be even more faster. As shown below, just write a function that makes the above call.

ff(){
open -a /Applications/Firefox.app/ "$1"
}
chrome(){
open -a /Applications/Chrome.app/ "$1"
}

Now to open a file with Firefox, simple enter the following command through the command line
$>ff ~/Documents/sample.json

Trick 4: Folder shortcut
There are often two or three projects that I am work on simultaneously and often need to switch from one project to another. Even though autocomplete save tons of time navigating all the complex file structure, there is even a shorter way to save time and fingers.  Just define variables in my ~/.bashrc and use them in your command as shown below.

#Folder variables
blog='~/Documents/personal/writings/blog'

Now, let’s say I want to open the “blog” folder in textmate, I can simply reference the above variable in the command line as follows
$>mate $blog

Trick 5: Defining commonly used commands
If there are commands that you often run, consider using alias to define. For instance, if you often open a particular folder in textmate using the following command $>mate ~/Documents/personal/writings/blog, then consider creating an alias as shown belowblog.mate=’mate ~/Documents/personal/writings/blog’

Now, from the command line simply run
$>blog.mate

Trick 6: Better Theme
Terminal ships with lot of themes but none work for me. I find this theme particularly good. Try it.

Posted in General, Programming, Tips | Tagged , , | 1 Comment

Efficient Textual Similarity Across Millions of Web Queries

Computing textual similarity (such as Jaccard similarity coefficient) between millions of search queries can be an arduous task. The main challenge is the number of pairs that one needs to consider; a relatively small dataset containing ten thousands queries leads to more than 49 million possible query pairs ({{10000}\choose{2}} = 49,995,000).

Based on Vernica, et.al. paper, I show a simple and efficient technique to filter out non-similar query pairs and thereby significantly reduce the number of query pairs that one needs to evaluate. I first explain the idea and then discuss how it can be implemented within the MapReduce paradigm. Lastly I work through a toy dataset containing about 4000 queries and demonstrate significant time improvement. Note I assume that you have basic understanding of the MapReduce paradigm. Even if you don’t, it should be pretty simple to understand the idea.

The Idea:
The general idea is to active filter query pairs that don’t share any common terms and thereby only consider those query pairs that at-least share one common term. The trick is in partitioning the data based on prefix-filtering approach. It will be easier to understand if we can consider an example. Consider three queries as shown below. Across the three queries,  we have five unique terms (or tokens): irs, forms, tax, internal and revenue.

  1. irs forms
  2. tax forms
  3. internal revenue
 Now let’s assume, related to each of the five unique terms, we have five buckets that are filled with queries containing the given term atleast once (Note: as shown in the table we  duplicate queries across multiple buckets if they have more than one terms).  
Bucket 1:
irs
  • irs forms
Bucket 2:
forms
  • irs forms
  • tax forms
Bucket 3:
tax
  • tax forms
Bucket 4:
internal
  • internal revenue
Bucket 5:
revenue
  • internal revenue

Now we only need to compute textual similarity between queries that appear in a single bucket. In other words, we compare a given query to only those other queries that appear together with a given query. For instance, consider the first query, “irs forms”. It appears in two buckets: 1 and 2. There are no other queries in bucket 1 and hence there is nothing to do over there. There is one another query “tax forms” in bucket 2 and thus we need to compute textual similarity between “irs forms” and “tax forms”.  Further notice that we never compute textual similarity between “irs forms” and “internal revenue” as the two never appear together in any given bucket.

Even the above simple example shows the benefit of prefix filtering. Using $latext N^2$ approach (hereafter I call it as Naive approach), we will need to evaluate three query pairs:

  • “irs forms”, “tax forms”
  • “irs forms”, “internal revenue”
  • “tax forms”, “internal revenue”

However, using prefix filtering we only evaluate one query pair: “irs forms” and “tax forms”. Thus prefix filtering helps avoid computing textual similarity between queries that don’t share any term. Now let’s focus on how to implement it efficiently within the MapReduce paradigm.

Implementation
Its pretty simple to implement the above idea using single MapReduce. For each input query, the mapper does two things:

  1. Tokenize the query
  2. Output <token, query> pair for each identified token.

For instance if the query is “irs forms”, the mapper will output two records corresponding to this query: <irs, irs forms> and <forms, irs forms>. This correlates to our bucket idea as shown in the table above.

If we are using hadoop, the output from the mapper will be sorted based on key (token) and all records having same key will be sent to a single reducer. The reducer‘s job is to collect all the queries related to a single token and compute textual similarity between these subset of queries. Here we can use our naive approach to do the computation.

There is however one more thing can be done to conserve bandwidth between mapper and reducer. Consider the third query, “internal revenue”, in our running example. For this query, the mapper will generate two outputs as shown below

  • <internal, internal revenue>
  • <revenue, internal revenue>

However, both “internal” and “revenue” appear only once in our query dataset, i.e. there is only one query related to both the terms. Hence we know that buckets corresponding to these two terms will contain only one query. Thus, it doesn’t make sense to generate buckets that occurs only once in the dataset. We can implement this logic by using two MapReduce phases. The first MapReduce phase (mapper and reducer) is concerened with building a dictionary of terms that occur more than once. In the second MapReduce phase, the mapper uses these dictionary to generate <token, query> pairs for only those tokens (or terms) that are present in the dictionary.  The reducer remains the same as earlier. Here I am assume that the dictionary is small enough to fit in the memory. To an extent this is not a big assumption as we know from language theory generally number of tokens are much smaller.

Note: I have uploaded all the ruby code on github and can be found over here. There is a run.sh file that shows how to run the code. You don’t need hadoop in order to test the code as you can simulate the hadoop environment through sequence of shell commands.

Test:
In order to test and evaluate the benefit of prefix filtering, I downloaded a toy dataset containing all the web searches performed on AOL and that lead to click on irs.gov website. You can read more about the dataset and download it from here.

There are 4943 queries in the dataset composed of 2325 unique terms. Out of 2325 terms, only 1086 terms occur more than once.

Approach Number of Pairs
Evaluated
Average Running time
Naive Approach: (N^2)  12,214,153 2m41.412s
Single MapReduce Phase    2,675,684 45.199s*
Dictionary Based MapReduce
(Two MapReduce Phase)
   2,675,684 44.340s*

*I simulated hadoop environment through shell as shown below. Hence the time doesn’t include startup time for hadoop.

#One MapReduce Approach
time cat queries.txt | ruby SimplePairGeneratorMapper.rb | sort -t$'\t' -k1,1 | ruby PairGeneratorReducer.rb > prefix_output.txt

#Two MapReduce Approach
time cat queries.txt | ruby FrequencyCountMapper.rb | sort -t$'\t' -k1,1 | ruby FrequencyCountReducer.rb > dictionary
time cat queries.txt | ruby DictionaryPairGeneratorMapper.rb | sort -t$'\t' -k1,1 | ruby PairGeneratorReducer.rb > prefix_output.txt

Conclusion
As you can see from the above table, using prefix filtering has significant advantage over the naive approach as it significantly cut downs number of pairs that are evaluated. The number of pairs evaluated by the naive approach is almost 6 times more than that are evaluated by prefix filtering. As a result, the processing time is significantly lower for the prefix filtering based approach. Its almost 3.5 times faster as compared to the naive approach.

However I didn’t see significant time difference between the simple and dictionary based prefix filtering approaches. One reason might be that the dataset is not big enough to see any substantial difference. But on the other hand, in a larger dataset you are less likely to find terms that appears only once (assuming you have already corrected any mistakes) and hence there won’t be much difference between the two approaches.

A Word Of Caution
Prefix filtering is not panacea for textual similarity problem. There is one big problem. Some query pairs are evaluated multiple times because they appear together in multiple buckets. For instance consider these two queries: “irs forms” and “irs forms w4″. The two queries will appear together inside two buckets corresponding to the following two terms: “irs” and “forms”. As a result, the two queries will be compared twice. For web queries this is not big problem as web queries are generally two to three words long. However, it will be a challenge when we are dealing with larger textual documents (such as sentences and paragraphs). Let’s say two textual documents are exactly same and contain 1000 words. Then the two documents will be compared 1000 times. One solution might be to use an additional MapReduce phase. The first phase reducer simply generates document pairs that needs to be evaluated and sort them somehow so that ordering between the two documents is maintained. The next reducer removes any duplicate pairs. You might also want to replicate duplicate removal process in combiner so as to conserve bandwidth between mapper and reducer.

 

Posted in Data Mining, Hadoop, Programming, Text Mining | Tagged , , , | Leave a comment