Using Ruby gems along with Hadoop streaming

I just discovered a simple way to use Ruby gems (or Ruby libraries) in your mapper or reducer script even if you don't have administrative rights. Below is a short and quick explanation of how to do this. One of the parameters in Hadoop streaming is "-cacheArchive". It allows you to specify the path of an archive on HDFS; Hadoop distributes the archive to each task node, unzips it, and creates a symbolic link to it in the task's working directory. You can read more about it in the Hadoop streaming documentation. In order to use Ruby gems, we need to do four simple steps.
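For reference, the value passed to -cacheArchive is an HDFS URI followed by a "#" and the name you want the symbolic link to have (the host, port and paths below are just placeholders):

-cacheArchive hdfs://<namenode>:<port>/path/to/archive.zip#linkname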

Step 1: Zip the gem source code
Download the source code of the gem and zip it. Let's assume you want to use the awesome geokit gem. At the top level of the geokit gem there is one file (geokit.rb) and a folder (geokit). Use the following command on Mac OS X to create a zip file:

$> zip -r geokit.zip geokit.rb geokit

Note: the -r parameter recursively includes subfolders.
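Before uploading, you can optionally list the archive's contents to make sure both geokit.rb and the geokit folder made it in:

$> unzip -l geokit.zip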

Step 2: Upload the zip file to Hadoop's distributed file system (HDFS)

$> hadoop dfs -copyFromLocal geokit.zip lib/
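If the lib directory does not exist on HDFS yet, create it before copying; afterwards you can list it to confirm the upload (optional sanity check):

$> hadoop dfs -mkdir lib
$> hadoop dfs -ls lib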

Step 3: Tell Hadoop about the zip file
In your hadoop streaming command, use the -cacheArchive option to specify the location of the gem archive and the name of its symbolic link. Below is just an example of a hadoop streaming command. Note that Hadoop will unzip the archive before running the mapper script, so the files inside the zip will be available to our Ruby mapper.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/.../hadoop-0.20.1-dev-streaming.jar \
-input <input_file> \
-output <output_dir> \
-mapper "ruby Mapper.rb" \
-file code/Mapper.rb \
-cacheArchive hdfs://machine-name:port-number/user/user_name/lib/geokit.zip#geokitgem
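The name after the "#" becomes the symbolic link in each task's working directory. With the archive from Step 1 linked as "geokitgem", the mapper should see roughly this layout (illustrative):

geokitgem/geokit.rb
geokitgem/geokit/   (rest of the gem source)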

Step 4: Tell the Ruby mapper/reducer about the gem

Now modify the Ruby library path in Mapper.rb as follows:
#file: Mapper.rb
$: << 'geokitgem/'   # symbolic link created by -cacheArchive in Step 3
require 'rubygems'
require 'geokit'
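For completeness, below is a minimal sketch of what a full Mapper.rb could look like. The input format (a tab-separated id followed by two coordinate pairs) and the distance calculation are purely illustrative assumptions, not part of the original setup:

#!/usr/bin/env ruby
# Illustrative mapper: assumes each input line is "id<TAB>lat1,lng1,lat2,lng2"
$: << 'geokitgem/'   # symbolic link created by -cacheArchive in Step 3
require 'rubygems'
require 'geokit'

STDIN.each_line do |line|
  id, coords = line.chomp.split("\t")
  next if coords.nil?
  lat1, lng1, lat2, lng2 = coords.split(',').map { |c| c.to_f }
  a = Geokit::LatLng.new(lat1, lng1)
  b = Geokit::LatLng.new(lat2, lng2)
  # Hadoop streaming expects "key<TAB>value" lines on STDOUT
  puts "#{id}\t#{a.distance_to(b)}"
end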

That’s it.

About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large-scale data analysis, data mining, machine learning and data visualization.
This entry was posted in Hadoop, Programming, Ruby.

4 Responses to Using Ruby gems along with Hadoop streaming

  1. Prasanna says:

    I want to find the latitude and longitude for all U.S. zip codes. Given the size of the file, I want to run it in a Rails application. Can I use Hadoop for it? Do you have any more links on Apache Hadoop programming?

    Thanks a lot !!!

  2. sam says:

    It's actually a nice and helpful piece of information. I'm glad that you shared this helpful information with us. Please keep us informed like this. Thanks for sharing.

  3. Pingback: Apache Pig and Distributed Cache | Memento
