Easy-Hadoop, Rapid Hadoop Programming

In today's digital world, one of the biggest challenges is dealing with large volumes of data. Apache Hadoop provides one solution to this problem. It is open source software for distributed, scalable computing, and it is often deployed on clusters with hundreds of nodes.

I have been using Hadoop for quite some time, and it has been a pleasant experience. Over time, however, I realized that I was writing nearly the same code for different projects and spending more time writing code than analyzing data. This led me to think of easy-hadoop, a rapid Hadoop programming library. The idea behind easy-hadoop is to develop small, customizable modules. For instance, one common task is to extract columns from the input dataset and emit them as key and value pairs. As shown below, a few lines of Ruby are enough to achieve this:

#Assuming input data is tab delimited and read line by line from standard input
STDIN.each_line do |line|
  tokens = line.to_s.strip.split("\t")

  #Let's assume we want the key to be a combination of
  #columns 3 and 4 (zero-based indexes), with a pipe character as the delimiter
  key = tokens[3..4].join('|')

  #Let's assume that we want columns 1, 2 and 5 as values
  values = []
  [1,2,5].each{|i| values << tokens[i] }
  value = values.join("\t")

  #Write key and value pair
  puts "#{key}\t#{value}"
end

While the above code works, it is specific to one dataset and one project because it assumes a lot of things (the input delimiter, the key delimiter, and so on). Instead, we can turn each assumption made in the above code into a parameter and make the same code much more useful. Below is another version of the same code:

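#Parameters (sample values matching the first version of the code)
@options = {
  :input_delimiter => "\t",
  :keys            => [3, 4],
  :values          => [1, 2, 5],
  :key_delimiter   => "|",
  :value_delimiter => "\t"
}

#Assuming input is read line by line from standard input
STDIN.each_line do |line|
  #Tokenize input data
  tokens = line.to_s.strip.split(@options[:input_delimiter])

  #Generate key
  keys = []
  @options[:keys].each{|k| keys << tokens[k] }
  keys = keys.join(@options[:key_delimiter])

  #Generate values
  values = []
  @options[:values].each{|v| values << tokens[v] }
  values = values.join(@options[:value_delimiter])

  #Write output
  puts "#{keys}\t#{values}"
end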
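To see what the parameters buy us, here is a minimal sketch that wraps the same steps in a method and runs it with two different option sets. The method name extract_columns and the sample records are made up for illustration; they are not taken from the Easy-Hadoop project.

```ruby
# Hypothetical helper: the tokenize / key / value steps wrapped in one method.
def extract_columns(line, options)
  tokens = line.to_s.strip.split(options[:input_delimiter])
  key    = options[:keys].map { |k| tokens[k] }.join(options[:key_delimiter])
  value  = options[:values].map { |v| tokens[v] }.join(options[:value_delimiter])
  "#{key}\t#{value}"
end

# Made-up sample records, used only for illustration.
tab_line = %w[a b c d e f].join("\t")
csv_line = "a,b,c,d,e,f"

# Same code, two datasets: only the options differ.
tab_options = { input_delimiter: "\t", keys: [3, 4], values: [1, 2, 5],
                key_delimiter: "|", value_delimiter: "\t" }
csv_options = { input_delimiter: ",", keys: [0], values: [4, 5],
                key_delimiter: "|", value_delimiter: "," }

tab_result = extract_columns(tab_line, tab_options)  # "d|e\tb\tc\tf"
csv_result = extract_columns(csv_line, csv_options)  # "a\te,f"
```

Switching from a tab-delimited to a comma-delimited dataset only requires a different hash; the method body stays untouched.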

By converting most of the assumptions into parameters, we have made our simple function useful in many other projects. Now we just need to modify @options, and the code remains the same. However, it is still not elegant, because one has to modify the source code each time one of the parameters changes. This can be solved by setting the parameters through command line arguments (as shown below) and using Ruby's optparse library to parse them into the parameter hash (@options). If you are interested, check out ColExtractor.rb in the Easy-Hadoop project.

ruby ColExtractor.rb --input_delimiter="\t" --key=3,4 --value=1,2,5 --key_delimiter="|" --value_delimiter="\t"
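As a rough sketch of how such flags could be turned into a hash with optparse (this is not the actual ColExtractor.rb source; the method name parse_options, the default values, and the handling of "\t" are assumptions made for illustration):

```ruby
require 'optparse'

# Hypothetical helper: turns command-line flags into an options hash.
def parse_options(argv)
  # Assumed defaults; the real script may use different ones.
  options = { input_delimiter: "\t", keys: [], values: [],
              key_delimiter: "|", value_delimiter: "\t" }

  # A shell passes "\t" as a backslash followed by "t",
  # so convert that two-character sequence into a real tab.
  unescape = ->(s) { s.gsub('\t', "\t") }

  OptionParser.new do |opts|
    opts.on('--input_delimiter=DELIM') { |d| options[:input_delimiter] = unescape.call(d) }
    # The Array acceptor splits a comma-separated argument into strings.
    opts.on('--key=COLS', Array)   { |cols| options[:keys]   = cols.map(&:to_i) }
    opts.on('--value=COLS', Array) { |cols| options[:values] = cols.map(&:to_i) }
    opts.on('--key_delimiter=DELIM')   { |d| options[:key_delimiter]   = unescape.call(d) }
    opts.on('--value_delimiter=DELIM') { |d| options[:value_delimiter] = unescape.call(d) }
  end.parse!(argv)

  options
end
```

In a script, the result of parse_options(ARGV) would take the place of the hard-coded @options, so changing a delimiter or a column list no longer means editing the source.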

About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large data analysis, data mining, machine learning and data visualization.

2 Responses to Easy-Hadoop, Rapid Hadoop Programming

  1. Very cool stuff, Ritesh! I haven’t played with Ruby yet, but this might be a good reason to. One question: would this play well with hosting on EC2?

    • Ritesh Agrawal says:

      Hi Frank,

      It should. I know a few people who have used it with EC2. Let me know if you need any help.
