Tuesday, November 16, 2010

How to read Hadoop HDFS Sequence file using Hadoop streaming

It's very common to use the Unix "cat" command as the mapper with Hadoop streaming, but it does not work if the underlying file is a sequence file, because the default text input format reads the binary file as if it were plain lines. To get around this, run Hadoop streaming with the option

-inputformat SequenceFileAsTextInputFormat



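The following command counts the records in a file that is stored as a sequence file in HDFS. Each record is handed to the mapper as one line of text, so "wc -l" in the reducer gives the total.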

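HADOOP=$HADOOP_HOME/bin/hadoop
# INPUT and OUTPUT are placeholders: set INPUT to the HDFS path of the
# sequence file and OUTPUT to an HDFS directory that does not exist yet.
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-dev-streaming.jar \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat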
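
With SequenceFileAsTextInputFormat, each record should reach the mapper as a single line holding the key, a tab, and then the value, both converted to text. If only the values are of interest, a small mapper can drop the key before passing the line on. The sketch below is an illustration, not part of the job above: values_only is a hypothetical helper name, and it assumes that keys contain no tab characters.

```shell
#!/bin/sh
# Hypothetical mapper helper (not from the original post).
# Assumes every input line looks like "key<TAB>value".
# cut splits on tabs by default; -f2- keeps everything after the first tab,
# so tabs inside the value are preserved.
values_only() {
    cut -f2-
}

# Local demonstration with two made-up records.
printf 'key1\tfirst value\nkey2\tsecond value\n' | values_only
```

Run locally, the demonstration prints the two values without their keys. To try it in a job, one option would be to save a script that simply calls values_only on standard input, ship it with the streaming -file option, and name it in -mapper in place of /bin/cat.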
