Wednesday, December 8, 2010

Extended FileUtil class for Hadoop

While writing production jobs in Hadoop I identified following tasks that were required for some MapReduce jobs but were not readily available in Hadoop 0.20 API

  1. Get size of a file or directory in HDFS
    • We require this to dynamically change the number of reducers used for a job by looking at the amount of input data that the job will process
  2. Recursively remove all zero byte files from a directory in HDFS. 
    • This happens a lot when you use MultipleOutput class in reducer (impact is less when used in Mapper). A lot of times the reducer does not gets any record for which a MutipleOutput file needs to be created hence it creates a 0 byte files. These files have no use, its best to remove them after the job is finished. 
  3. Recursively get all subdirectories of a directories 
  4. Recursively get all files within a directory and its sub directories
    • By default, as of now, when Hadoop job is run, it only processes the immediate files under the input directory, any files in the subdirectories of the input path are not processed hence if you want your job to process all files under the subdirectories also then its better to create a comma delimited list of all files within the input path and submit it to the job.  

All the above tasks were implemented in the ExtendedFileUtil class. Source code can be found at

The wrapper class on link contains an example of how to use ExtendedFileUtil class

1 comment:

  1. You should try pushing vital utility stuff into Hadoop's trunk itself. It would help better than to have it externally IMO. No harm in attempting to contribute back.

    Thanks for the utils :-)