Tuesday, November 30, 2010

For beginners, quickly install Hadoop 0.20 on Linux in cluster mode

It's very easy to set up a small Hadoop cluster on Linux for testing and development purposes. In this blog post I will show how to set up a small Hadoop 0.20 cluster in 10 easy steps. You should be able to have it running in less than an hour.

In the cluster we have 5 machines, as follows (the short aliases in parentheses are used throughout this post):
  1. jt.mydomain.com (jt): JobTracker
  2. nn.mydomain.com (nn): Namenode
  3. sn.mydomain.com (sn): Secondary Namenode
  4. tt1.mydomain.com (tt1): Tasktracker and Datanode
  5. tt2.mydomain.com (tt2): Tasktracker and Datanode


1. Download Hadoop and Java to all machines 
See the INSTALL section on http://hadoop-blog.blogspot.com/2010/11/how-to-install-standalone-hadoop-for.html for more details. For the purposes of this document we will assume that Hadoop is installed in /home/${USER}/hadoop-0.20.2 and Java in /home/${USER}/jdk1.6.0_22 on every machine.
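The per-machine install can be scripted. Here is a minimal sketch (not from the original post): it assumes you run it from jt.mydomain.com, that the two tarballs sit in your home directory there, and that the archive names below match what you downloaded. Until step 2 is done, scp and ssh will prompt for a password on every node.

```shell
# Hypothetical deploy loop; hostnames are the ones from the machine list above.
# Archive names are assumptions -- adjust them to your actual downloads.
TARBALLS="hadoop-0.20.2.tar.gz jdk1.6.0_22.tar.gz"
for host in nn.mydomain.com sn.mydomain.com tt1.mydomain.com tt2.mydomain.com; do
  scp $TARBALLS ${USER}@${host}:/home/${USER}/
  ssh ${USER}@${host} "cd /home/${USER} && tar xzf hadoop-0.20.2.tar.gz && tar xzf jdk1.6.0_22.tar.gz"
done
```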

2. Ensure that machines in the cluster can see each other
 Set up passwordless ssh between the following machine pairs:
  1. jt to nn 
  2. jt to sn
  3. jt to tt1
  4. jt to tt2
  5. nn to jt
  6. nn to sn
  7. nn to tt1
  8. nn to tt2
  9. sn to nn
  10. sn to jt
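One common way to set this up (a sketch, not prescribed by the original post) is to generate an RSA key pair and push the public key to each peer with ssh-copy-id. Run this on jt and on nn against all four peers; on sn, only nn.mydomain.com and jt.mydomain.com are needed.

```shell
# Generate a key pair once (skipped if ~/.ssh/id_rsa already exists),
# then install the public key on each peer. Trim the host list per node:
# on sn, for example, only nn.mydomain.com and jt.mydomain.com are needed.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in nn.mydomain.com sn.mydomain.com tt1.mydomain.com tt2.mydomain.com; do
  ssh-copy-id ${USER}@${host}
done
```

Afterwards, running `ssh nn.mydomain.com` from jt should log you in without a password prompt.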

3. Set up the Namenode
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/core-site.xml with following lines (fs.default.name points HDFS at the Namenode; port 9000 is a common choice, any free port works)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn.mydomain.com:9000</value>
  </property>
</configuration>

4. Set up path to Java, Master and Slave directories/files
In Hadoop, the JobTracker and Namenode are called masters, and the Tasktrackers and Datanodes are called slaves. In this cluster, every slave machine runs both a Datanode and a Tasktracker.
On jt.mydomain.com, add the following 3 lines to file /home/${USER}/hadoop-0.20.2/conf/hadoop-env.sh
export JAVA_HOME=/home/${USER}/jdk1.6.0_22
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
export HADOOP_MASTER=jt.mydomain.com:/home/${USER}/hadoop-0.20.2

5. Set up Jobtracker
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/mapred-site.xml with following lines (mapred.job.tracker tells Tasktrackers where the JobTracker runs; port 9001 is a common choice, any free port works)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jt.mydomain.com:9001</value>
  </property>
</configuration>

6. List Masters
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/masters with following line (despite the name, in Hadoop 0.20 this file lists the hosts where start-dfs.sh will start a Secondary Namenode)
sn.mydomain.com

7. List Slaves
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/slaves with following lines (one slave host per line)
tt1.mydomain.com
tt2.mydomain.com

8. Format Namenode
Format the Namenode by running the following command on nn.mydomain.com
/home/${USER}/hadoop-0.20.2/bin/hadoop namenode -format

9. Start DFS
On the nn.mydomain.com command prompt, run the following command. It starts the HDFS daemons on the Namenode and Datanodes, and also starts the Secondary Namenode.
sh /home/${USER}/hadoop-0.20.2/bin/start-dfs.sh

10. Start MapReduce
On the jt.mydomain.com command prompt, run the following command to start the MapReduce daemons on the JobTracker and Tasktrackers.
sh /home/${USER}/hadoop-0.20.2/bin/start-mapred.sh
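To confirm the daemons actually came up, you can run jps (it ships with the JDK) on each node. This check is my addition, not part of the original steps; which processes appear depends on the node (NameNode/SecondaryNameNode/DataNode after step 9, JobTracker/TaskTracker after step 10).

```shell
# Hypothetical check: list the running Java processes on every node.
# The jps path assumes the Java install location used throughout this post.
for host in nn.mydomain.com sn.mydomain.com jt.mydomain.com tt1.mydomain.com tt2.mydomain.com; do
  echo "== ${host} =="
  ssh ${USER}@${host} "/home/${USER}/jdk1.6.0_22/bin/jps"
done
```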

That's it! The Hadoop cluster is up and running now.
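As an optional end-to-end smoke test (not part of the original 10 steps), you can run the wordcount example that ships with Hadoop 0.20.2 from jt.mydomain.com; `input` and `output` below are arbitrary HDFS paths I chose for illustration.

```shell
cd /home/${USER}/hadoop-0.20.2
# Copy some text files into HDFS, run the bundled wordcount job,
# and print the results from the job's output directory.
bin/hadoop fs -mkdir input
bin/hadoop fs -put conf/*.xml input
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
bin/hadoop fs -cat 'output/part-*'
```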


• The cluster is defined in the files slaves and masters.
• You can use IP addresses instead of host names, e.g. an IP address in place of jt.mydomain.com.
• After you execute steps #9 and #10, you will notice that all files updated in steps #3-7 on jt.mydomain.com are also updated on the nn, sn, tt1 and tt2 machines. This happens because we set HADOOP_MASTER to jt.mydomain.com in step #4: the startup scripts treat the config files on jt.mydomain.com as the master copy and rsync them to every node in the cluster.
• Jobtracker WebUI should be up and running at http://jt.mydomain.com:50030/jobtracker.jsp
• HDFS WebUI should be up and running at http://nn.mydomain.com:50070/dfshealth.jsp

I also found another excellent tutorial, better than my blog post for sure.
