It's very common to use the Unix "cat" command with Hadoop streaming, but it does not work if the underlying file is a sequence file. To get around this, run Hadoop streaming with the option
-inputformat SequenceFileAsTextInputFormat
The following command counts the total number of lines in a file that is stored as a sequence file in HDFS:
HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-dev-streaming.jar \
-input
-output
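The input and output paths are left blank in the command above. As a rough sketch of how the pieces might fit together, the function below wraps the whole invocation. Everything beyond the -inputformat flag is an assumption for illustration: the function name count_seqfile_records, the STREAMING_JAR variable, the /bin/cat mapper, the /usr/bin/wc -l reducer, and the single reducer are not taken from the original command.

```shell
#!/bin/sh
# Sketch only: count the lines of a sequence file through Hadoop streaming.
# STREAMING_JAR must point at your streaming jar, and both paths are supplied
# by the caller. The mapper, reducer and reducer count are assumptions.

count_seqfile_records() {
    input_path=$1     # HDFS path of the sequence file (caller-supplied)
    output_path=$2    # HDFS output directory; it must not exist yet

    # Use $HADOOP if it is set, otherwise fall back to $HADOOP_HOME/bin/hadoop.
    "${HADOOP:-$HADOOP_HOME/bin/hadoop}" jar "$STREAMING_JAR" \
        -input "$input_path" \
        -output "$output_path" \
        -inputformat SequenceFileAsTextInputFormat \
        -mapper /bin/cat \
        -reducer "/usr/bin/wc -l" \
        -numReduceTasks 1
}
```

You would call it as count_seqfile_records followed by your own input and output paths. With a single reducer the job writes one total; with more reducers each one would write a partial count that you would have to add up yourself. The count matches the number of records only if the keys and values, once converted to text, contain no newlines.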