Tuesday, November 16, 2010

How to write output to multiple named files in Hadoop using MultipleTextOutputFormat

Sometimes we want a MapReduce job to write its output to named files.

For example, suppose you have an input file that contains the following data:

This can be done in Hadoop by using the MultipleTextOutputFormat class. The following is a simple example that extends MultipleTextOutputFormat; it reads the file above and creates two output files, Name and Age.
The code where the action happens is highlighted in red.

The code is at https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat

The output would be two files, Name and Age.

File Name contains:
Name Nish
Name Dash

File Age contains:
Age 27
Age 29

Class MultiFileOutput extends MultipleTextOutputFormat. This means that when the reducer is ready to emit a key/value pair, the pair is first passed to the method generateFileNameForKeyValue before it is written to a file. The logic for naming the output file is embedded in this method (in this case, the logic is to create one file per key). The String returned by generateFileNameForKeyValue determines the name of the file to which this key/value pair is written.
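
To make the routing step concrete, here is a minimal plain-Java sketch of the idea. It is not the code from the link above and it does not use Hadoop at all. The class name MultiFileOutputSketch, the helper route, the in-memory map standing in for output files, and the order of the input pairs are all assumptions made for illustration. The method name and its three parameters mirror generateFileNameForKeyValue(key, value, name) from Hadoop's old mapred API, where name is the default leaf file name (something like part-00000). In a real job you would override that method in a subclass of MultipleTextOutputFormat instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the post's MultiFileOutput class; no Hadoop dependency.
class MultiFileOutputSketch {

    // Mirrors the shape of generateFileNameForKeyValue: the returned String
    // is the name of the file this key/value pair goes to.
    // Here the rule is "one file per key", so the default name is ignored.
    static String generateFileNameForKeyValue(String key, String value, String name) {
        return key;
    }

    // Simulates what happens for each emitted pair: ask for a file name,
    // then append the record to that file (an in-memory list in this sketch).
    static Map<String, List<String>> route(String[][] pairs) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            String fileName = generateFileNameForKeyValue(kv[0], kv[1], "part-00000");
            // Key and value joined by a tab, the default separator of text output.
            files.computeIfAbsent(fileName, f -> new ArrayList<>())
                 .add(kv[0] + "\t" + kv[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        // Assumed input order; the values are taken from the output shown above.
        String[][] pairs = {
            {"Name", "Nish"}, {"Age", "27"}, {"Name", "Dash"}, {"Age", "29"}
        };
        route(pairs).forEach((file, lines) ->
            System.out.println("File " + file + " contains " + lines));
    }
}
```

Changing the return value of generateFileNameForKeyValue is all it takes to change the naming rule. For example, returning key + "/" + name instead would, in this sketch, group records under a per-key prefix while keeping the default leaf name.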
