INSTALLING HADOOP ON MAC OSX LION

Although you are likely to run Hadoop on a big cluster of computers, it is useful to have it installed locally for debugging and testing purposes. Here are some quick notes on how to set up Hadoop on Mac OSX Lion.

Quick Summary

  1. Install Hadoop
  2. Edit Configuration
    1. Edit hadoop-env.sh to handle SCDynamicStore-related errors
    2. Edit core-site.xml
    3. Edit mapred-site.xml
    4. Edit hdfs-site.xml
  3. Enable ssh to localhost
  4. Start and Test Hadoop

Detailed Instructions

Step 1: Installing Hadoop
If you haven’t heard about Homebrew, you should definitely give it a try. It makes installing and uninstalling software effortless and keeps your machine clean of unused files. Below I am using Homebrew to install Hadoop.

brew install hadoop
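
Homebrew installs everything under /usr/local/Cellar/hadoop/<version>. The paths in the steps below assume version 1.0.1; if brew pulled a newer release (see comment 10 below), substitute your version throughout. A quick way to confirm what you have:

brew list hadoop | head -n 3
hadoop version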

Step 2: Edit Configurations

Step 2.1: Add the following line to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh. This line is required to overcome errors related to “SCDynamicStore”, specifically “Unable to load realm info from SCDynamicStore”.

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
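
If you prefer the terminal to an editor, appending the line with echo also works; the single quotes keep the inner double quotes intact (adjust the version in the path to match your install):

echo 'export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"' >> /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh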

Step 2.2: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/core-site.xml. One key property is hadoop.tmp.dir. Note that we are placing the HDFS data in the current user’s home folder and naming it hadoop-store. You don’t need to create this folder; it will be created automatically in a later step.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/Users/${user.name}/hadoop-store</value>
		<description>A base for other temporary directories.</description>
	</property>
	<property>
		<name>fs.default.name</name>
		<value>hdfs://localhost:8020</value>
	</property>
</configuration>
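
Once the format and start commands in Step 4 have run, you can optionally confirm that both properties took effect:

ls ~/hadoop-store
# created automatically by 'hadoop namenode -format'
hadoop fs -ls /
# goes through fs.default.name, i.e. hdfs://localhost:8020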

Step 2.3: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/mapred-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>mapred.job.tracker</name>
		<value>localhost:9001</value>
	</property>

	<property>
		<name>mapred.tasktracker.map.tasks.maximum</name>
		<value>2</value>
	</property>

	<property>
		<name>mapred.tasktracker.reduce.tasks.maximum</name>
		<value>2</value>
	</property>
</configuration>

Step 2.4: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hdfs-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
	  <name>dfs.replication</name>
	  <value>1</value>
	</property>
</configuration>
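
A replication factor of 1 is the only value a single-node setup can satisfy, since there is just one datanode to hold each block. If you want to verify the setting once Hadoop is running (Step 4), fsck reports the default replication factor in its summary:

hadoop fsck /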

Step 3: Enable SSH to localhost

Make sure that you have SSH private (~/.ssh/id_rsa) and public (~/.ssh/id_rsa.pub) keys already set up. If you are missing these two files, run the following command (thanks to Ryan Rosario for pointing this out). Instead of an RSA key, you can also use DSA (replace rsa with dsa in the command below); however, the instructions below assume an RSA key.

ssh-keygen -t rsa

Step 3.1: Make sure that “Remote Login” is enabled in your system preferences. Go to “System Preferences” -> “Sharing” and check “Remote Login”.

Step 3.2: From the terminal, run the following command. Make sure that authorized_keys has 0600 permissions (see Raj Bandyopadhay’s comment below).

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
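
As noted above, sshd ignores authorized_keys unless its permissions are restrictive, so set them explicitly:

chmod 0600 ~/.ssh/authorized_keys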

Step 3.3: Try logging in to localhost. If you get an error, remove (or rename) ~/.ssh/known_hosts and retry connecting to localhost.

ssh localhost
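
If the failure is a changed-host-key warning, you don’t have to delete the whole known_hosts file; ssh-keygen can remove just the stale localhost entry:

ssh-keygen -R localhost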

Step 4: Start and Test Hadoop

hadoop namenode -format
/usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
hadoop jar /usr/local/Cellar/hadoop/1.0.1/libexec/hadoop-examples-1.0.1.jar pi 10 100
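
If the pi job cannot connect, a bare HDFS round trip helps isolate whether the filesystem is answering at all (the directory name here is arbitrary):

hadoop fs -mkdir /smoke-test
hadoop fs -ls /
hadoop fs -rmr /smoke-test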

To make sure that all Hadoop processes have started, use the following command:

ps ax | grep hadoop | wc -l
# expected output is 6

There are five processes related to Hadoop; the sixth line comes from the grep in the pipeline itself. If you see fewer than six processes, check the log files, located at /usr/local/Cellar/hadoop/1.0.1/libexec/logs/*.log
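
An alternative to counting ps output is jps, which ships with the JDK and lists running Java processes by main class. On a healthy single-node Hadoop 1.x install you should see all five daemons:

jps
# expect NameNode, DataNode, SecondaryNameNode, JobTracker and
# TaskTracker, plus the Jps process itself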

Additional Notes

  • Namenode info: http://localhost:50070/dfshealth.jsp
  • Jobtracker: http://localhost:50030
  • Start hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
  • Stop hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/stop-all.sh
  • Verify hadoop started properly: use ps ax | grep hadoop | wc -l and make sure you see 6 as output; there are five processes associated with Hadoop and one pertaining to the grep command itself

Common Issues

  • Unable to load realm info from SCDynamicStore: refer to Step 2.1
  • could only be replicated to 0 nodes, instead of 1: refer to Step 3. Most likely this problem occurs because SSH to localhost is not available
  • Jobtracker not starting: I stumbled across this problem and found a spelling mistake in my configuration: I had named the file mapread-site.xml (an extra a) instead of mapred-site.xml. Also see the additional notes above to make sure that all five Hadoop processes are running.

65 thoughts on “INSTALLING HADOOP ON MAC OSX LION”

  1. Thank you for the writeup. Very helpful!

    > hadoop nodename -format
    Exception in thread “main” java.lang.NoClassDefFoundError: nodename
    Caused by: java.lang.ClassNotFoundException: nodename
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

    Perhaps you meant:
    > hadoop namenode -format

  2. Great tutorial, very useful. One small caveat is that the .ssh/authorized_keys file must have permission bits set to 0600. Use 'chmod 0600 .ssh/authorized_keys' after creating that file.

  3. Thank you for the excellent tutorial. This is my first time installing on Mac — I usually use Hadoop on Ubuntu.

    One thing to note. An SSH keypair must already exist in order to do step 3.1:
    ssh-keygen -t dsa

  4. Great instructions. I went through them successfully, but I still have trouble running Hadoop directly on Java files and cannot set HADOOP_CLASSPATH properly as I work through the Hadoop book. Any ideas?

    Thanks!

    1. @kavic,
      I didn’t explicitly set HADOOP_CLASSPATH. Did you use brew to install Hadoop? Also make sure that your JAVA_HOME is properly set. In my case JAVA_HOME points to /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home.
      Also make sure you are able to run the hadoop test
      hadoop jar /usr/local/Cellar/hadoop/1.0.1/libexec/hadoop-examples-1.0.1.jar pi 10 100
      Let me know if you are able to solve this problem. I would like to update the blog based on your solution.

      1. Thanks for following up.

        I finally figured out what was wrong, since the tests were running fine and I could even run Python scripts via streaming. The problem was that HADOOP_CLASSPATH is apparently set relative to the home directory /user/hduser, and once I copied the classes over to a new directory there, things were fixed… I am still wondering exactly what went wrong though!

  5. I have followed this exactly but unfortunately receive an error when trying to run the example. It says something about a protocol mismatch: ClientProtocol version mismatch (client = 61, server = 63). Any ideas?

        1. New error occurring now

          Number of Maps = 10
          Samples per Map = 100
          12/07/18 21:10:38 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s).
          12/07/18 21:10:39 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 1 time(s).
          12/07/18 21:10:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 2 time(s).
          12/07/18 21:10:41 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 3 time(s).
          12/07/18 21:10:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 4 time(s).
          12/07/18 21:10:43 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 5 time(s).
          12/07/18 21:10:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 6 time(s).
          12/07/18 21:10:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 7 time(s).
          12/07/18 21:10:46 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 8 time(s).
          12/07/18 21:10:47 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s).
          java.lang.RuntimeException: java.net.ConnectException: Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused
          at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:546)
          at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
          at org.apache.hadoop.examples.PiEstimator.estimate(PiEstimator.java:265)
          at org.apache.hadoop.examples.PiEstimator.run(PiEstimator.java:342)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:351)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
          Caused by: java.net.ConnectException: Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
          at org.apache.hadoop.ipc.Client.call(Client.java:1075)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
          at $Proxy1.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
          at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
          at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:238)
          at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:203)
          at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
          at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
          at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
          at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
          at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:542)
          … 17 more
          Caused by: java.net.ConnectException: Connection refused
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
          at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
          at org.apache.hadoop.ipc.Client.call(Client.java:1050)
          … 31 more

      1. Ritesh – I have the same problem too. When I give the command ‘hadoop dfs’ I do get the help options but when I give any ‘hadoop dfs’ commands (-ls, -mkdir, copyFromLocal, etc) it creates everything on UFS where I am. (I am using MacOS). I also get the following warning: “WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable”

  6. For RSA key generation, we can make sure that it is stored under the proper file name (note the capital -P, which supplies an empty passphrase):
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

  7. This worked flawlessly on OS X 10.8 Mountain Lion. Could you please update this page to indicate that?

    Thanks,

    Parag

  8. Hello..
    I am using Mac OS 10.8…. Could you tell me what is going wrong….

    When trying out the tutorial the map seems to work, but it cannot compute the reduce.
    12/08/13 08:58:12 INFO mapred.JobClient: Running job: job_201208130857_0001
    12/08/13 08:58:13 INFO mapred.JobClient: map 0% reduce 0%
    12/08/13 08:58:27 INFO mapred.JobClient: map 20% reduce 0%
    12/08/13 08:58:33 INFO mapred.JobClient: map 30% reduce 0%
    12/08/13 08:58:36 INFO mapred.JobClient: map 40% reduce 0%
    12/08/13 08:58:39 INFO mapred.JobClient: map 50% reduce 0%
    12/08/13 08:58:42 INFO mapred.JobClient: map 60% reduce 0%
    12/08/13 08:58:45 INFO mapred.JobClient: map 70% reduce 0%
    12/08/13 08:58:48 INFO mapred.JobClient: map 80% reduce 0%
    12/08/13 08:58:51 INFO mapred.JobClient: map 90% reduce 0%
    12/08/13 08:58:54 INFO mapred.JobClient: map 100% reduce 0%
    12/08/13 08:59:14 INFO mapred.JobClient: Task Id : attempt_201208130857_0001_m_000000_0, Status : FAILED
    Too many fetch-failures
    12/08/13 08:59:14 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_000000_0&filter=stdout
    12/08/13 08:59:14 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_000000_0&filter=stderr
    12/08/13 08:59:18 INFO mapred.JobClient: map 89% reduce 0%
    12/08/13 08:59:21 INFO mapred.JobClient: map 100% reduce 0%
    12/08/13 09:00:14 INFO mapred.JobClient: Task Id : attempt_201208130857_0001_m_000001_0, Status : FAILED
    Too many fetch-failures

    Here is what I get when I try to see the tasklog using the links given in the output
    http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_000000_0&filter=stderr —>
    2012-08-13 08:58:39.189 java[74092:1203] Unable to load realm info from SCDynamicStore

    http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_000000_0&filter=stdout —>

    Also this error of Unable to load realm info from SCDynamicStore does not show up when I do ‘hadoop namenode -format’ or ‘start-all.sh’

  9. In the event that a non-admin will be running hadoop, you’ll also need to adjust permissions on the hadoop log directory. For a typical developer workstation, something like this will usually be fine:

    chmod -R a+w libexec/logs

    (from the hadoop directory).

  10. Cool Ritesh. It was a piece of cake.
    However I have a couple of observations:
    1. "/usr/local/Cellar/hadoop/1.0.1/libexec/conf/" is incorrect. The correct one should be "/usr/local/Cellar/hadoop/1.0.4/libexec/conf/"

  11. I get this error message after entering the line below into the terminal. I’m not doing something right.

    -MacBook-Pro:hadoop jonathanschaller$ /usr/local/Cellar/hadoop/1.0.4/libexec/conf/hadoop-env.sh
    -bash: /usr/local/Cellar/hadoop/1.0.4/libexec/conf/hadoop-env.sh: Permission denied

  12. Great job! I’m using it on my MacBook Air (OS X Mountain Lion) with version 1.1.0.

    Works like a charm :)

  13. Hello again Ritesh Agrawal!

    I would like to know how Hadoop works with the other examples in hadoop-examples.jar.

    There are several examples to use as a test.

    I found the WordCount example, but I would like to know how to execute it with the right syntax.

    First I need a .txt file with some words repeated twice, three times, etc.

    Is there any command to let Hadoop know, or do I just need to do something like this:

    hadoop jar /usr/local/~/hadoop-example-1.1.0.jar worldcount? 10 100 /usr/~/toto.txt

    Is there any documentation where I can find some help with the examples? Or a wiki?

    Thanks for your help,

    Kevin

    1. Let’s say you want to count the words of the file ~/Downloads/ulysse.txt.
      First copy it to HDFS:
      hadoop fs -put ~/Downloads/ulysse.txt /user/yourname/wordcount-ex
      To run the example:
      hadoop jar hadoop-examples*.jar wordcount /user/yourname/wordcount-ex/ulysse.txt /user/yourname/wordcount-ex/output
      It will write the result to /user/yourname/wordcount-ex/output.
      To see the result:
      hadoop fs -cat /user/yourname/wordcount-ex/output/part-r-00000

      Hope this helps!

  14. I followed your tutorial to install the current hadoop-1.1.2 on OS X 10.8.3 with Java 1.6.0_43.
    It seems to work pretty well; at least the pi example works fine.
    But when I run the word count example (as I explained above) it works, but I have two warnings bothering me:
    WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    WARN snappy.LoadSnappy: Snappy native library not loaded.

    Do you know how I can solve these?

  15. Thanks Ritesh, the instructions are really good and worked without issue. I am new to Hadoop. Would you recommend any links for further examples that will help me write my own jobs?

  16. Dear Ritesh,

    I have done step 2.1; however, I still get the error. Do you know why? Thank you.

    starting namenode, logging to /usr/local/Cellar/hadoop/1.1.2/libexec/bin/../logs/hadoop-chaochen-namenode-Chaos-iMac.local.out
    2013-05-18 23:13:39.369 java[4390:1b03] Unable to load realm info from SCDynamicStore

    1. @Chao,
      Make sure that you copied the whole statement in 2.1: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

      Apart from that, I am not sure. I haven’t tried installing hadoop 1.1.2. Let me know if that doesn’t work and I can try installing hadoop 1.1.2 tonight.

      Ritesh

  17. Thanks, very useful, but I am trying to install hadoop 1.1.2 and I am getting the following error.

    The following two commands seem to work ok:
    ~ $ hadoop namenode -format
    ~ $ /usr/local/Cellar/hadoop/1.1.2/bin/start-all.sh

    As you can see other commands seem to work ok
    ~ $ ps ax | grep hadoop | wc -l
    5

    However the example fails miserably… ideas?
    ~ $ hadoop jar /usr/local/Cellar/hadoop/1.1.2/libexec/hadoop-examples-1.1.2.jar pi 10 100
    Number of Maps = 10
    Samples per Map = 100
    2013-05-27 15:05:30.221 java[30151:1703] Unable to load realm info from SCDynamicStore
    13/05/27 15:05:30 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/sergep/PiEstimator_TMP_3_141592654/in/part0 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1639)

  18. I am running 1.1.2 and undid the steps in 2.*, and I got the “Unable to load realm info from SCDynamicStore” error while executing hadoop namenode -format, but the command still formatted. “/usr/local/Cellar/hadoop/1.1.2/bin/start-all.sh” starts hadoop, and “hadoop jar /usr/local/Cellar/hadoop/1.1.2/libexec/hadoop-examples-1.1.2.jar pi 10 10” throws the “Unable to load realm info from SCDynamicStore” error but completes. However, I think the results are incorrect. The last 2 lines of the output are:

    Job Finished in 2.517 seconds
    Estimated value of Pi is 3.20000000000000000000

    The pi value seems way off.

    1. I just played around a bit more with the sample and it appears that the pi estimation is correct. You just have to modify the final term, samples per map, to get more decimal places. Here are more detailed results.

      hadoop jar /usr/local/Cellar/hadoop/1.1.2/libexec/hadoop-examples-1.1.2.jar pi 10 1000000000
      Number of Maps = 10
      Samples per Map = 1000000000
      ……..
      Job Finished in 321.893 seconds
      Estimated value of Pi is 3.14159266440000000000

  19. Great post! You should start a series on installing Hadoop-related components on the Mac, such as Hive, Accumulo, etc.

  20. I am unable to ssh to localhost on my Mac (OS 10.8.x). I looked around on the web but couldn’t solve it. Would greatly appreciate any help. Here is what I did:
    * tested with rsa keys, but when that didn’t work I created a new .ssh directory and created dsa files; didn’t work
    * followed the instructions and enabled ‘Remote Login’
    * disabled the firewall too
    * Following is what I get:
    ssh localhost
    Connection closed by ::1

    1. In /var/log I see the following message every time I try to test ‘ssh localhost’:
      sshd[1787]: fatal: Access denied for user by PAM account configuration [preauth]

      1. Well, I did the PAM file manipulation and edited the sshd config file. Still didn’t work. This just did it. Thanks a lot. Luvyaaa..

  21. Hi, when I run start-all.sh, I see some processes launch and their icons appear on my screen. While I am working in some window, if I launch a map-reduce job my work gets interrupted by those processes launching and coming to the foreground/focus. How can I avoid that on the Mac?

    VJ

  22. Hi, I am trying to create a smoketest user in my HDFS. When I give the command hadoop fs -chmod 757 /mapred it shows the following error:

    chmod: Call From diliprnair-VAIO/192.168.1.136 to diliprnair-VAIO:8020 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

    Could you help?

