Wednesday, December 8, 2010

How to combine small files in Hadoop

Hadoop is currently not built to work with a large number of small files. The following architectural limitations of Hadoop cause this problem:

  • In HDFS, all file metadata is stored in the memory of the NameNode (which is most often a single big, powerful machine). This means "more files = more memory". There is a limit to how much memory you can add to one machine, and that in turn limits the number of files that can be stored in HDFS. 
  • The NameNode is used heavily by every job that runs on Hadoop. More metadata in memory can slow the NameNode down, which may in turn slow down job execution (though the effect may be insignificant for long-running jobs).
  • Hadoop needs setup time to launch each mapper. By default, Hadoop starts at least one mapper for every file in the input directory. Up to Hadoop 0.20, Hadoop does not let you choose the number of mappers to run, so if a file is small, say 100 KB, more time is wasted on Hadoop setup than on actually processing the data.
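The NameNode memory limit in the first bullet is easy to put numbers on. The sketch below uses the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file entry or block entry); that figure is an approximation, not an exact measurement:

```python
# Back-of-the-envelope NameNode heap estimate.
# ~150 bytes per namespace object (file or block) is a rough rule of thumb.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Approximate heap needed to hold metadata for num_files files."""
    objects = num_files + num_files * blocks_per_file  # file entries + block entries
    return objects * BYTES_PER_OBJECT

# 10 million small single-block files:
small = namenode_heap_bytes(10_000_000)
# The same data packed 100-to-1 into 100,000 files (assuming each
# combined file still fits in one block):
packed = namenode_heap_bytes(100_000, blocks_per_file=1)

print(small)   # → 3000000000 (about 2.8 GB of heap)
print(packed)  # → 30000000 (about 29 MB of heap)
```

The point is that metadata cost scales with the number of files and blocks, not with the amount of data, so packing small files together cuts NameNode memory use by orders of magnitude.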
There are a couple of different solutions to this problem:
  • Keep an eye on all data that enters HDFS from other data sources. Try to optimize the processes that push data to HDFS so that they create files close to 128 MB (the HDFS block size). 
  • If you have a MapReduce pipeline where the output of one job becomes the input of the next, use your reducers wisely. Suppose a job uses 100 reducers and outputs files of 10 MB each; if the reducer computations are not CPU bound, try running the same job with fewer reducers (7-10). Remember: Hadoop creates one output file for every reducer, even if a reducer did not emit any data. 
  • If all else fails, try to combine the small files into bigger files. Media6degrees has come up with a fairly good solution for combining small files in Hadoop, and you can use their jar straight out of the box. See here for more details.
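The core of the combining step is a packing problem: group small files into chunks whose total size stays near the block size, then write each chunk out as one file. Here is a minimal pure-Python sketch of that grouping; `pack_files` is a hypothetical helper for illustration, not the Media6degrees code:

```python
# Hypothetical helper: greedily packs file sizes into chunks of at most
# target_size bytes, the grouping you would do before concatenating
# small files into ~block-sized files for upload to HDFS.
def pack_files(sizes, target_size=128 * 1024 * 1024):
    """Return lists of indices into `sizes`, each list one output chunk."""
    chunks, current, current_size = [], [], 0
    for i, size in enumerate(sizes):
        # Start a new chunk when the next file would overflow the target.
        if current and current_size + size > target_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Ten 50 MB files against a 128 MB target pack two to a chunk:
chunks = pack_files([50 * 1024 * 1024] * 10)
print(len(chunks))  # → 5
```

A real implementation would then stream each chunk's files into a single HDFS file (or a SequenceFile keyed by the original filenames, so the originals remain recoverable).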
