INSTALLING HADOOP ON MAC OSX LION

Although you are likely to run Hadoop on a big cluster of computers, it is useful to have it installed locally for debugging and testing purposes. Here are some quick notes on how to set up Hadoop on Mac OS X Lion. Please refer to the references below for details.

Quick Summary

  1. Install Hadoop
  2. Edit Configuration
    1. Edit hadoop-env.sh to handle SCDynamicStore related errors
    2. Edit core-site.xml
    3. Edit mapred-site.xml
    4. Edit hdfs-site.xml
  3. Enable ssh to localhost
  4. Start and Test Hadoop

Detailed Instructions

Step 1: Installing Hadoop
If you haven’t heard about homebrew, then you should definitely give it a try. It really makes installing and uninstalling software effortless and keeps your machine clean of unused files. Below I am using homebrew to install hadoop.

brew install hadoop
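
To confirm the install worked, you can check the version that Homebrew picked up. This guide assumes version 1.0.1; adjust the paths that follow if your version differs.

brew info hadoop    # shows the installed version and Cellar path
hadoop version      # should print the Hadoop version, e.g. 1.0.1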

Step 2: Edit Configurations

Step 2.1: Add the following line to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh. It is required to work around errors related to “SCDynamicStore”, especially “Unable to load realm info from SCDynamicStore”.

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
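
If you prefer to do this from the terminal, a one-line sketch (assuming the Homebrew path above) is:

echo 'export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"' >> /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh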

Step 2.2: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/core-site.xml. One key property is hadoop.tmp.dir. Note that we are placing the HDFS data under the current user’s home folder and naming it hadoop-store. You don’t need to create this folder; it will be created automatically in a later step.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/Users/${user.name}/hadoop-store</value>
		<description>A base for other temporary directories.</description>
	</property>
	<property>
		<name>fs.default.name</name>
		<value>hdfs://localhost:8020</value>
	</property>
</configuration>

Step 2.3: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
	  <name>mapred.job.tracker</name>
	  <value>localhost:9001</value>
	</property>

	<property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
    </property>

    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
    </property>
</configuration>

Step 2.4: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
	  <name>dfs.replication</name>
	  <value>1</value>
	</property>
</configuration>
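
After editing the three files, a quick sanity check is to make sure they are still well-formed XML. xmllint ships with OS X; the following prints nothing if everything is fine:

cd /usr/local/Cellar/hadoop/1.0.1/libexec/conf
xmllint --noout core-site.xml mapred-site.xml hdfs-site.xml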

Step 3: Enable SSH to localhost

Make sure that you already have SSH private (~/.ssh/id_rsa) and public (~/.ssh/id_rsa.pub) keys set up. If you are missing these two files, run the following command (thanks to Ryan Rosario for pointing this out). Instead of an RSA key, you can also use DSA (replace rsa with dsa in the command below); however, the instructions below assume an RSA key.

ssh-keygen -t rsa
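
Hadoop’s start scripts need to ssh to localhost without prompting, so it is easiest to use a key with an empty passphrase (just press Enter at the prompts, or pass it explicitly as in the sketch below).

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # only if the key does not already exist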

Step 3.1: Make sure that “Remote Login” is enabled in your System Preferences. Go to
“System Preferences” -> “Sharing” and check “Remote Login”.

Step 3.2: From the terminal, run the following command. Make sure that authorized_keys has 0600 permission (see Raj Bandyopadhay’s comment).

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
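
To set the 0600 permission mentioned above:

chmod 600 ~/.ssh/authorized_keys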

Step 3.3: Try logging in to localhost. If you get an error, remove (or rename) ~/.ssh/known_hosts and retry connecting to localhost.

ssh localhost
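
If the failure is a host-key mismatch, you can remove just the localhost entry instead of deleting the whole known_hosts file:

ssh-keygen -R localhost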

Step 4: Start and Test Hadoop

hadoop namenode -format
/usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
hadoop jar /usr/local/Cellar/hadoop/1.0.1/libexec/hadoop-examples-1.0.1.jar pi 10 100
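
Besides the pi example, you can run a quick HDFS smoke test; the file and directory names below are just examples.

echo "hello hadoop" > /tmp/hello.txt
hadoop fs -mkdir /user/$USER/test
hadoop fs -put /tmp/hello.txt /user/$USER/test/
hadoop fs -ls /user/$USER/test
hadoop fs -cat /user/$USER/test/hello.txt   # should print "hello hadoop"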

To make sure that all Hadoop processes have started, use the following command:

ps ax | grep hadoop | wc -l
# expected output is 6

There are 5 processes related to Hadoop; the sixth line in the count comes from the grep command itself. If you see fewer than 6, check the log files located at /usr/local/Cellar/hadoop/1.0.1/libexec/logs/*.log
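
Another way to check is jps (ships with the JDK), which lists running Java processes by name. In this pseudo-distributed setup you would expect the five Hadoop daemons:

jps
# expected (PIDs will differ):
# NameNode
# DataNode
# SecondaryNameNode
# JobTracker
# TaskTracker
# Jps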

Additional Notes

  • Namenode info: http://localhost:50070/dfshealth.jsp
  • Jobtracker: http://localhost:50030
  • Starting hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
  • Stop hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/stop-all.sh
  • Verify hadoop started properly: Use ps ax | grep hadoop | wc -l and make sure you see 6 as output. There are 5 processes associated with Hadoop and one from the grep in the command itself

Common Issues

  • Unable to load realm info from SCDynamicStore: Refer step 2.1
  • could only be replicated to 0 nodes, instead of 1: Refer to Step 3. Most likely this problem occurs because SSH to localhost is not available
  • Jobtracker not starting: I stumbled across this problem and found that there was a spelling mistake in the file name: I had misspelled mapred-site.xml as mapread-site.xml (an extra a). Also see the additional notes above to make sure that all 5 Hadoop processes are running.

References: