Hadoop Blog: For beginners, quickly install Hadoop 0.20 on Linux in cluster mode

It's very easy to set up a small Hadoop cluster on Linux for testing and development purposes. In this blog post I will demonstrate how to set up a small Hadoop 0.20 cluster in 10 easy steps. You will be able to set it up in less than an hour.

SCENARIO

The cluster has 5 machines, as follows:

#	SERVERNAME	HADOOP COMPONENT	SHORT NAME
1	jt.mydomain.com	Jobtracker	jt
2	nn.mydomain.com	Namenode	nn
3	sn.mydomain.com	Secondary Namenode	sn
4	tt1.mydomain.com	Tasktracker1	tt1
5	tt2.mydomain.com	Tasktracker2	tt2

STEPS

1. Download Hadoop and Java to all machines
See the INSTALL section on http://hadoop-blog.blogspot.com/2010/11/how-to-install-standalone-hadoop-for.html for more details. For the purpose of this document we will assume that Hadoop is installed in directory /home/${USER}/hadoop-0.20.2 and Java is installed in directory /home/${USER}/jdk1.6.0_22

2. Ensure that machines in the cluster can see each other
Set up passwordless SSH between the following machines:

jt to nn
jt to sn
jt to tt1
jt to tt2
nn to jt
nn to sn
nn to tt1
nn to tt2
sn to nn
sn to jt
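
One way to set this up, as a minimal sketch: generate a key pair once on each source machine and copy the public key to each target. The helper names ssh_targets and print_ssh_setup are my own for illustration; ssh-keygen and ssh-copy-id are the standard OpenSSH tools, and the sketch assumes the same user account exists on every machine. It only prints the commands, so you can read them before running anything.

```shell
#!/bin/sh
# Print the short names a given source host must reach, per the list above.
ssh_targets() {
  case "$1" in
    jt) echo "nn sn tt1 tt2" ;;
    nn) echo "jt sn tt1 tt2" ;;
    sn) echo "nn jt" ;;
    *)  echo "" ;;
  esac
}

# Print the commands to run on one source host; nothing is executed here.
print_ssh_setup() {
  src="$1"
  echo "# run on ${src}.mydomain.com"
  # Create a key with an empty passphrase only if none exists yet.
  echo "[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa"
  for t in $(ssh_targets "$src"); do
    echo "ssh-copy-id ${t}.mydomain.com"
  done
}

print_ssh_setup sn
```

Once the printed commands look right, you could run them on the source machine with something like print_ssh_setup jt | sh. ssh-copy-id asks for the target's password once per host; after that, ssh to the target should not prompt.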

3. Set up the Namenode
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/core-site.xml with the following lines
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
   <name>fs.default.name</name>
   <value>hdfs://nn.mydomain.com/</value>
  </property>
</configuration>

4. Set up path to Java, Master and Slave directories/files
In Hadoop, the JobTracker and Namenode are called masters, and the Tasktrackers and Datanodes are called slaves. Every slave runs both a Datanode and a Tasktracker.
On jt.mydomain.com, add the following 3 lines to file /home/${USER}/hadoop-0.20.2/conf/hadoop-env.sh

export JAVA_HOME=/home/${USER}/jdk1.6.0_22

export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

export HADOOP_MASTER=jt.mydomain.com:/home/${USER}/hadoop-0.20.2
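
A wrong JAVA_HOME is an easy mistake to make here, so it can be worth checking the path before going further. A minimal sketch; check_java_home is a hypothetical helper, and the only thing it tests is that an executable bin/java exists under the given directory.

```shell
#!/bin/sh
# Succeed if DIR looks like a usable Java install (has an executable bin/java).
check_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Example: report on the install path assumed in this post.
if check_java_home "/home/${USER:-$(id -un)}/jdk1.6.0_22"; then
  echo "JAVA_HOME looks usable"
else
  echo "no executable bin/java under that path"
fi
```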

5. Set up Jobtracker
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/mapred-site.xml with the following lines
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
   <name>mapred.job.tracker</name>
   <value>jt.mydomain.com:9001</value>
  </property>
</configuration>

6. List Masters
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/masters with the following line. Despite its name, this file lists the host that runs the Secondary Namenode.

sn.mydomain.com

7. List Slaves
On jt.mydomain.com, overwrite file /home/${USER}/hadoop-0.20.2/conf/slaves with the following lines

tt1.mydomain.com

tt2.mydomain.com
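
Steps 6 and 7 can also be scripted. A minimal sketch, assuming the host names from the scenario table; the function name write_cluster_files is hypothetical, and the conf directory is passed as an argument so you can try it on a scratch directory first.

```shell
#!/bin/sh
# Write conf/masters (secondary namenode host) and conf/slaves (one worker per line).
# Usage: write_cluster_files CONF_DIR SECONDARY_HOST WORKER_HOST...
write_cluster_files() {
  conf_dir="$1"
  secondary="$2"
  shift 2
  mkdir -p "$conf_dir" || return 1
  printf '%s\n' "$secondary" > "$conf_dir/masters"
  # printf repeats its format for each remaining argument: one host per line.
  printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Try it on a scratch directory before touching the real conf directory.
scratch=$(mktemp -d)
write_cluster_files "$scratch" sn.mydomain.com tt1.mydomain.com tt2.mydomain.com
cat "$scratch/slaves"
```

To write the real files, pass /home/${USER}/hadoop-0.20.2/conf as the first argument instead of the scratch directory.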

8. Format Namenode
Format the namenode by running the following command on nn.mydomain.com
/home/${USER}/hadoop-0.20.2/bin/hadoop namenode -format

9. Start DFS
At the command prompt on nn.mydomain.com, run the following command. It starts the HDFS daemons on the Namenode and the Datanodes, and also starts the Secondary Namenode.
sh /home/${USER}/hadoop-0.20.2/bin/start-dfs.sh

10. Start MapReduce
At the command prompt on jt.mydomain.com, run the following command to start the MapReduce daemons on the Jobtracker and the Tasktrackers.

sh /home/${USER}/hadoop-0.20.2/bin/start-mapred.sh

That's it! The Hadoop cluster is up and running now.
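
To confirm that the daemons really started, you can run the JDK's jps tool on each machine and look for the expected process names (NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker in Hadoop 0.20). A minimal sketch; missing_daemons is a hypothetical helper that reads jps output on stdin and prints every expected name it does not find. The canned input below only imitates the "pid name" shape of jps output.

```shell
#!/bin/sh
# Read `jps` output on stdin; print each expected daemon name that is absent.
# Usage: jps | missing_daemons NAME...
missing_daemons() {
  listing=$(cat)
  for name in "$@"; do
    # jps prints "<pid> <name>"; compare the second field exactly so that
    # SecondaryNameNode is not mistaken for NameNode.
    printf '%s\n' "$listing" \
      | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
      || echo "$name"
  done
}

# Example with canned input shaped like jps output on a slave node.
printf '2101 DataNode\n2290 Jps\n' | missing_daemons DataNode TaskTracker
```

On the real cluster you would run, for example, jps | missing_daemons NameNode on nn, jps | missing_daemons JobTracker on jt, jps | missing_daemons SecondaryNameNode on sn, and jps | missing_daemons DataNode TaskTracker on tt1 and tt2. No output means nothing is missing.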

NOTES

The cluster is defined in the files slaves and masters
You can use IP addresses instead of host names, i.e. 10.2.3.4 instead of jt.mydomain.com
After you execute #8 and #9, you will notice that all files that were updated in steps #3-7 on jt.mydomain.com are also updated on the nn, sn, tt1 and tt2 machines. This happens because we set the property HADOOP_MASTER to jt.mydomain.com in step #4. Setting this property tells Hadoop to treat the config files on jt.mydomain.com as the master copies and sync them to every node in the cluster.
Jobtracker WebUI should be up and running on http://jt.mydomain.com:50030/jobtracker.jsp
HDFS WebUI should be up and running on http://nn.mydomain.com:50070/dfshealth.jsp
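
If you would rather push the config files by hand than rely on the sync described in the notes above, rsync over ssh does the job. A minimal sketch that only prints the commands; print_conf_sync is a hypothetical helper, and it assumes rsync is installed on every machine and that Hadoop lives at the same path everywhere.

```shell
#!/bin/sh
# Print one rsync command per target host that copies CONF_DIR to the same path there.
# Usage: print_conf_sync CONF_DIR HOST...
print_conf_sync() {
  conf_dir="$1"
  shift
  for host in "$@"; do
    echo "rsync -a -e ssh ${conf_dir}/ ${host}:${conf_dir}/"
  done
}

# Example: commands to run on jt to push its conf directory to the other nodes.
print_conf_sync "$HOME/hadoop-0.20.2/conf" \
  nn.mydomain.com sn.mydomain.com tt1.mydomain.com tt2.mydomain.com
```

Review the printed lines, then pipe them to sh on jt.mydomain.com if they look right.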

I also found another excellent tutorial, better than my blog post for sure.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids

Tuesday, November 30, 2010
