I just discovered a simple way to use Ruby gems (or Ruby libraries) in your mapper or reducer script even if you don’t have administrative rights on the cluster. Below is a short and quick explanation of how to do this. One of the parameters in Hadoop streaming is “-cacheArchive”. It lets you specify the path of an archive on the distributed file system; Hadoop unpacks the archive into each task’s working directory and creates a symbolic link to it. You can read more about it in the Hadoop streaming documentation. To use Ruby gems, we need four simple steps.
Step 1: Zip the gem source code
Download the source code of the gem and zip it. Let’s assume you want to use the awesome geokit gem. At the top level of the geokit gem there is one file (geokit.rb) and one folder (geokit). Use the following command on Mac OS X (or Linux) to create a zip file:
$> zip -r geokit.zip geokit.rb geokit
Note: the -r flag recursively includes subfolders
Step 2: Upload the zip file to Hadoop’s distributed file system (HDFS)
$> hadoop dfs -copyFromLocal geokit.zip lib/
Step 3: Tell Hadoop about the zip file
In your Hadoop streaming invocation, use the -cacheArchive option to specify the location of the zipped gem and the name of its symbolic link, separated by a #. Hadoop unzips the archive before running the mapper script, so the files inside the zip are available to our Ruby mapper under that link. Below is just an example of a Hadoop streaming invocation (the jar path is abbreviated as in the original, the -input and -output values are placeholders, and depending on your setup -cacheArchive may need a full hdfs:// URI instead of a path relative to your HDFS home):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/.../hadoop-0.20.1-dev-streaming.jar \
    -input <input path> \
    -output <output path> \
    -mapper "ruby Mapper.rb" \
    -file Mapper.rb \
    -cacheArchive "lib/geokit.zip#geokitgem"
Step 4: Tell the Ruby mapper/reducer about the gem
Now add the symbolic-link directory (the name after the # in -cacheArchive) to the Ruby load path at the top of Mapper.rb, and then require the gem:
$: << 'geokitgem/'
require 'geokit'
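Putting the pieces together, a minimal Mapper.rb might look like the sketch below. The `map_line` helper and its key format are hypothetical (not from the original post), and the `require` is wrapped in a rescue so the sketch also runs on a machine where the `geokitgem/` symlink does not exist:

```ruby
#!/usr/bin/env ruby
# Mapper.rb -- hypothetical sketch of a streaming mapper that uses geokit.
# "geokitgem/" is the symbolic link Hadoop creates from the -cacheArchive
# option, so the unzipped gem sources are on the load path at task time.
$: << 'geokitgem/'

begin
  require 'geokit'          # resolves to geokitgem/geokit.rb on the cluster
rescue LoadError
  # Outside Hadoop the symlink does not exist; the sketch still runs.
end

# Hypothetical helper: turn a "lat,lng" input line into a tab-separated
# key/value pair, keyed by the rounded latitude (streaming's default
# key/value separator is a tab).
def map_line(line)
  lat, lng = line.strip.split(',').map(&:to_f)
  "#{lat.round}\t#{lat},#{lng}"
end

# Streaming mappers read records from STDIN and write pairs to STDOUT.
STDIN.each_line { |line| puts map_line(line) } if $PROGRAM_NAME == __FILE__
```

The same pattern works for a reducer script: read the sorted key/value lines from STDIN, with the gem available through the symlinked directory.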