I found that some of my production jobs slowed down after I refactored my code to use the MultipleOutputs class. I ran a benchmark to confirm that the slowdown came from MultipleOutputs and not from the cluster.
I set up a small cluster of just 6 machines with some test data:
- 1 machine running the JobTracker
- 1 machine running the NameNode
- 4 machines running DataNodes and TaskTrackers
- Input data: 8 GB
All machines were the same size, and nothing else was running on them during the benchmark.
Test 1: Mapper without MultipleOutputs
I created a mapper that:
- Reads a file line by line
- Creates an output file name on the fly by taking the first 3 characters of the hash of the input line. The name is computed but not used to write output (because we are not using MultipleOutputs yet).
- Writes the input line as the output key and NullWritable as the output value
I ran it 5 times and the median runtime was 4m 40s.
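The file-name step can be sketched in plain Java. The post does not say which hash function or encoding was used, so MD5 with hex encoding is purely an assumption here, and the class and method names (OutputNameSketch, outputNameFor) are hypothetical. The number of distinct names you get in practice depends on the hash and encoding you pick.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class OutputNameSketch {

    // Hypothetical helper: returns the first 3 hex characters of the
    // MD5 digest of the input line. MD5 and hex encoding are assumptions;
    // the original mapper may have used a different hash.
    static String outputNameFor(String line) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(line.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            // Two bytes give four hex characters, enough to take the first three.
            for (int i = 0; i < 2; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.substring(0, 3);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(outputNameFor("some input line"));
    }
}
```

The same input line always maps to the same name, so lines with the same hash prefix would end up in the same named output once the name is actually used for writing.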
Test 2: MultipleOutputs Mapper
Then I modified the above mapper to use the output file name and write the data out using MultipleOutputs. I ran this 5 times and the median runtime was 5m 48s.
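Based on this benchmark, MultipleOutputs added 68 seconds to the median runtime (4m 40s to 5m 48s). That is roughly 24% longer than the run without it, or almost 20% of the total MultipleOutputs runtime.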
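To make the arithmetic explicit, here is a small sketch that computes the slowdown from the two median runtimes above. The class and method names are made up for illustration. Note that the percentage depends on the baseline: measured against the Test 1 runtime the increase is about 24%, while the extra time as a share of the Test 2 runtime is just under 20%.

```java
class SlowdownSketch {

    // Extra time as a percentage of the baseline (Test 1) runtime.
    static double percentIncrease(int baselineSeconds, int newSeconds) {
        return 100.0 * (newSeconds - baselineSeconds) / baselineSeconds;
    }

    // Extra time as a percentage of the new (Test 2) runtime.
    static double shareOfNewRuntime(int baselineSeconds, int newSeconds) {
        return 100.0 * (newSeconds - baselineSeconds) / newSeconds;
    }

    public static void main(String[] args) {
        int test1 = 4 * 60 + 40; // 4m 40s = 280 seconds
        int test2 = 5 * 60 + 48; // 5m 48s = 348 seconds
        System.out.printf("Increase over Test 1: %.1f%%%n", percentIncrease(test1, test2));
        System.out.printf("Share of Test 2 runtime: %.1f%%%n", shareOfNewRuntime(test1, test2));
    }
}
```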
This happens because many more small files are created when you use the MultipleOutputs class.
Say you have 50 mappers. Assuming your data is not skewed, Test 1 will always generate exactly 50 files, but Test 2 will generate somewhere between 50 and 1000 files (50 mappers x 20 possible partitions), and this causes an I/O performance hit. In my benchmark, Test 1 generated 199 output files and Test 2 generated 4569.
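The file-count bounds can be checked with a few lines of Java. This is only a sketch with hypothetical names. It assumes each mapper writes at most one file per distinct output name, and that each Test 1 output file corresponds to one mapper.

```java
class FileCountSketch {

    // Upper bound: every mapper sees every distinct output name.
    static long maxOutputFiles(int mappers, int distinctNames) {
        return (long) mappers * distinctNames;
    }

    // Lower bound: every mapper sees only a single output name.
    static long minOutputFiles(int mappers) {
        return mappers;
    }

    // Average number of named files written per mapper.
    static double filesPerMapper(int totalFiles, int mappers) {
        return (double) totalFiles / mappers;
    }

    public static void main(String[] args) {
        // Hypothetical example from the text: 50 mappers, 20 possible names.
        System.out.println(minOutputFiles(50) + " to " + maxOutputFiles(50, 20) + " files");
        // Measured benchmark: 199 files without MultipleOutputs, 4569 with it.
        System.out.printf("%.2f files per mapper%n", filesPerMapper(4569, 199));
    }
}
```

With the benchmark numbers, that works out to roughly 23 files per mapper on average, which is the extra file-creation work behind the slowdown.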