Tuesday, November 16, 2010

Top 7 ways to optimize your MapReduce jobs in Hadoop

The following are best practices for optimizing your MapReduce code in Hadoop:

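1.     For many small files, use CombineFileInputFormat, which packs several files into each input split, instead of a format that creates at least one split per file (such as the default TextInputFormat).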

2.     To get the name of the file you are processing, read the map.input.file property from the job configuration.

3.     To give every line a unique ID, combine the line's byte offset with the name of its file; the pair is unique within the whole file system.

4.     Use KeyValueTextInputFormat to read a file of key/value pairs separated by a tab. To use a different separator, set the key.value.separator.in.input.line property.

5.     To read SequenceFiles as input, use SequenceFileInputFormat.

6.     To change the delimiter written between key and value in the output, set the mapred.textoutputformat.separator property.

7.     Use MultipleTextOutputFormat to write output to multiple files, with the file name chosen from each record's key and value.
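
Tips 3 and 4 can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class LineIds and its methods are hypothetical names made up for this example. It assumes UTF-8 text with "\n" line endings, builds an ID from a file path and a byte offset (the kind of offset TextInputFormat hands to the mapper as its key), and splits a line at the first separator, roughly the way KeyValueTextInputFormat is described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only; it is not part of Hadoop.
final class LineIds {

    // Tip 3: combine the file path and the line's byte offset into one ID.
    static String uniqueLineId(String filePath, long byteOffset) {
        return filePath + ":" + byteOffset;
    }

    // Returns the starting byte offset of every line in the content,
    // assuming UTF-8 encoding and "\n" line endings.
    static List<Long> lineOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length > 0) {
            offsets.add(0L); // the first line starts at offset 0
        }
        // A new line starts right after each '\n' that is not the last byte.
        for (int i = 0; i < bytes.length - 1; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    // Tip 4: split a line into key and value at the first separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static String[] splitKeyValue(String line, String separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] {
            line.substring(0, pos),
            line.substring(pos + separator.length())
        };
    }

    public static void main(String[] args) {
        String path = "/data/input/part-00000"; // assumed example path
        String content = "alpha\tone\nbeta\ttwo\ngamma\tthree\n";
        String[] lines = content.split("\n");
        List<Long> offsets = lineOffsets(content);
        for (int i = 0; i < lines.length; i++) {
            String[] kv = splitKeyValue(lines[i], "\t");
            System.out.println(uniqueLineId(path, offsets.get(i))
                    + " key=" + kv[0] + " value=" + kv[1]);
        }
    }
}
```

In a real job the offset would come from the mapper's input key and the path from map.input.file, so neither would need to be computed by hand.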

