I was testing different ways to improve the performance of Hadoop jobs, and one of the things I looked at was how much compression helps.
There are two places where you can configure a Hadoop job to use compression:
1. Compress the intermediate output of the mapper
To do this for all jobs, add the following properties to mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
I am compressing with GzipCodec, but you can use any of the following codecs:
- GzipCodec
- DeflateCodec
- BZip2Codec
- SnappyCodec
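Each of these has its strengths and weaknesses, so choose what you can live with. For example, BZip2Codec generally compresses better but runs slower, while SnappyCodec favors speed over compression ratio. Also note that, due to licensing differences, LZO does not ship with Hadoop. You can install it separately and use it if you'd like.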
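To get a feel for the speed versus size trade-off before picking a codec, here is a minimal sketch that uses only the JDK's java.util.zip classes, not Hadoop's codec classes. It compresses some made-up, log-like sample data with DEFLATE (the algorithm behind gzip) at the fastest and the strongest level and prints the sizes and timings. The class name, the sample data, and the chosen levels are all assumptions for illustration, and the numbers you see will depend on your own data and machine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper class, not part of Hadoop.
class CodecTradeoffSketch {

    // Compress the input with DEFLATE at the given level (1 = fastest, 9 = smallest).
    public static byte[] deflate(byte[] input, int level) throws IOException {
        Deflater deflater = new Deflater(level);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer, deflater)) {
            out.write(input);
        } finally {
            deflater.end(); // release the native zlib memory held by the Deflater
        }
        return buffer.toByteArray();
    }

    // Build repetitive, log-like text as made-up sample data.
    public static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("user").append(i % 100).append("\tGET\t/index.html\t200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sampleData(100_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] compressed = deflate(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d in=%d out=%d time=%dus%n",
                    level, data.length, compressed.length, micros);
        }
    }
}
```

This only approximates what happens inside a job, so treat it as a rough guide and measure on the cluster with real data before settling on a codec.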
To enable it for just your job, set it in the Configuration object:
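// Needs: import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();
// Compress the intermediate map output for this job only
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");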
2. Compress the final output of the job
To save the final output in gzip format, set up your M/R job with the following code:
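// Needs: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat and org.apache.hadoop.io.compress.GzipCodec
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);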
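With these settings the job writes standard gzip part files, so a quick way to check the result is to read one back. The following is a minimal sketch using only the JDK, assuming the part file has already been copied to the local filesystem (for example with hadoop fs -get). The class name GzipOutputReader and the default file name part-r-00000.gz are placeholders for illustration, not something Hadoop provides or guarantees for your job.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, not part of Hadoop.
class GzipOutputReader {

    // Read every line of a gzip-compressed text file, such as a job output part file.
    public static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default path; pass the real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "part-r-00000.gz");
        for (String line : readLines(file)) {
            System.out.println(line);
        }
    }
}
```

If the lines print as readable text, the job output was compressed and can be consumed by any gzip-aware tool.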