Tuesday, November 16, 2010

How to create output in gzip files in Hive

Sometimes it's necessary to write Hive results to gzip files to reduce their size, so that the files can be transferred over the network more quickly.

To do this, set the following options in Hive before running the query:
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Any Hive query you run after this will store its output in gzip files. For example:

INSERT OVERWRITE DIRECTORY 'hive_out' SELECT * FROM table_name w;
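
With the options above, the files written under hive_out should be ordinary gzip streams, so any gzip-aware reader can consume them. As a rough sketch (not part of the original recipe), the plain-Java class below reads such a file line by line once it has been copied to local disk. The class name GzipLines, the helper writeLines, and the sample rows are made up for illustration; only java.util.zip from the JDK is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLines {
    /** Reads every line of a gzip-compressed text file. */
    public static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream in = Files.newInputStream(gzFile);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Writes lines to a gzip file; used here only to produce sample input. */
    public static void writeLines(Path gzFile, List<String> lines) throws IOException {
        try (OutputStream out = Files.newOutputStream(gzFile);
             Writer writer = new OutputStreamWriter(
                 new GZIPOutputStream(out), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample rows standing in for real query output
        Path sample = Files.createTempFile("hive_out_sample", ".gz");
        writeLines(sample, Arrays.asList("1\talice", "2\tbob"));
        for (String line : readLines(sample)) {
            System.out.println(line);
        }
        Files.delete(sample);
    }
}
```

To read a real output file, pass its local path to readLines instead of the temporary sample.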

How to copy a table in Hive to another table or create temp table

To store the results of a query in a Hive table:

1. Create the schema of the temp table using the command
CREATE TABLE ..

2. Execute the following command
INSERT OVERWRITE TABLE temp_tablename SELECT * FROM table_name;

How to check health of Hadoop cluster

Hadoop comes bundled with two very useful JSP pages that let you monitor the overall health and job activity on the cluster.

Use the following URLs to check the performance stats of Hadoop:

Hadoop job tracker:
http://[PUT HOSTNAME OF JOBTRACKER MACHINE]:50030/jobtracker.jsp

Hadoop DFS health page:
http://[PUT HOSTNAME OF NAMENODE MACHINE]:50070/dfshealth.jsp

How to change replication factor of existing files in HDFS

By default, HDFS keeps 3 replicas of each file. You can change the replication factor if desired. To set the replication factor of an individual file to 4, run the following command (the -w flag waits until the replication has completed):
hadoop dfs -setrep -w 4 /path/to/file

To change the replication factor of every existing file in HDFS to 1:
hadoop dfs -setrep -R -w 1 /
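
The replication factor directly multiplies the raw disk space a file occupies across the cluster. As a back-of-the-envelope sketch (ignoring metadata and block-level overhead), the hypothetical helper below computes that footprint; the class and method names are invented for this example.

```java
public class ReplicationStorage {
    /** Raw bytes stored across the cluster for a file of the given logical size. */
    public static long rawBytes(long fileBytes, int replicationFactor) {
        if (fileBytes < 0 || replicationFactor < 1) {
            throw new IllegalArgumentException("size must be >= 0 and replication >= 1");
        }
        // multiplyExact throws on overflow instead of silently wrapping around
        return Math.multiplyExact(fileBytes, (long) replicationFactor);
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024L * 1024L;
        System.out.println(rawBytes(oneGiB, 3)); // default replication of 3
        System.out.println(rawBytes(oneGiB, 4)); // after -setrep 4
        System.out.println(rawBytes(oneGiB, 1)); // after -setrep 1
    }
}
```

Dropping from 3 replicas to 1 cuts the raw footprint to a third, at the cost of losing the file if the single node holding a block fails.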

How to add a new column to a table in Hive

Use the following command in the Hive shell to add a new column to a table:
ALTER TABLE pokes ADD COLUMNS (new_col INT);

How to put a file in HDFS using HDFS client

To add a file to the Hadoop file system, run the following command:
hadoop fs -put SOURCE DESTINATION

How to read a file from HDFS using Hadoop API classes in Java

The following Java program is an example of how you can programmatically read a file in HDFS using the HDFS APIs bundled with Hadoop.

1. Open file Cat.java and paste the following code

package org.myorg;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Cat {
    public static void main(String[] args) throws Exception {
        Path pt = new Path("hdfs://jp.seka.com:9000/user/john/abc.txt");
        // Resolve the file system from the path's own URI so the hdfs:// address is honored
        FileSystem fs = pt.getFileSystem(new Configuration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
        try {
            // Print the file line by line
            String line = br.readLine();
            while (line != null) {
                System.out.println(line);
                line = br.readLine();
            }
        } finally {
            br.close(); // always release the HDFS stream
        }
    }
}


2. Compile the code

javac -classpath hadoop-0.20.1-dev-core.jar -d Cat/ Cat.java


3. Create jar

jar -cvf Cat.jar -C Cat/ .


4. Run

hadoop jar Cat.jar org.myorg.Cat

How to get distinct values/lines (dedupe) for a file using the Hadoop Map Reduce framework

The following program takes in a text file with a single column and writes the distinct list of lines in the file to the output directory. Each output line also carries the number of times that line occurred in the input.

1. Create file CalculateDistinct.java and paste the following code

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class CalculateDistinct {

    // Emits (line, 1) for every input line, so identical lines share one key
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text("");

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            word.set(value.toString());
            output.collect(word, one);
        }
    }

    // Called once per distinct line; writes the line and how often it occurred
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += 1;
                values.next();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CalculateDistinct.class);
        conf.setJobName("Calculate Distinct");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0] is the input path, args[1] the output directory (must not exist yet)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

2. Compile, create Jar and Run

javac -classpath hadoop-0.20.1-dev-core.jar -d CalculateDistinct/ CalculateDistinct.java
jar -cvf CalculateDistinct.jar -C CalculateDistinct/ .
hadoop jar CalculateDistinct.jar org.myorg.CalculateDistinct in/abc.txt out
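
To see what the job computes without a cluster, the same logic can be mimicked in memory. This is only an illustrative sketch, assuming the whole input fits in a list; LocalDistinct and distinctCounts are hypothetical names and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalDistinct {
    /** Returns each distinct line with the number of times it occurred, sorted by line. */
    public static Map<String, Integer> distinctCounts(List<String> lines) {
        // A sorted map loosely mirrors the sorted key order a single reducer sees
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // The mapper emits (line, 1); merging plays the role of shuffle plus reduce
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b", "c", "a", "b");
        for (Map.Entry<String, Integer> e : distinctCounts(input).entrySet()) {
            // Same key<TAB>value layout that TextOutputFormat writes
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The keys of the returned map are the deduplicated lines; the values correspond to the counts the reducer writes next to each line.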

How to count the number of times a word appears in a file using the Hadoop Map Reduce framework

The following program counts the number of times each word appears in the input file/directory and places the results in the output directory.

1. Create file WordCount.java and paste the following code in it

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Splits each line on whitespace and emits (word, 1) for every token
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Sums the counts for one word; also used as the combiner
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0] is the input path, args[1] the output directory (must not exist yet)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}


2. Compile, create jar and run

javac -classpath hadoop-0.20.1-dev-core.jar -d WordCount/ WordCount.java
jar -cvf WordCount.jar -C WordCount/ .
hadoop jar WordCount.jar org.myorg.WordCount in out
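
The job above registers the reducer as its combiner too, which is safe because adding partial sums gives the same total as adding the individual ones. The in-memory sketch below tries to illustrate that idea without Hadoop: each "split" is counted on its own, then the partial counts are merged. LocalWordCount, countSplit, and mergeCounts are hypothetical names, and the two input lists merely stand in for input splits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    /** Counts whitespace-separated tokens in one input split (map plus combine). */
    public static Map<String, Integer> countSplit(List<String> lines) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                partial.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return partial;
    }

    /** Adds partial counts together, as the reducer does with combiner output. */
    public static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList("to be or not", "to be");
        List<String> split2 = Arrays.asList("be quick");
        Map<String, Integer> total =
            mergeCounts(Arrays.asList(countSplit(split1), countSplit(split2)));
        System.out.println(total);
    }
}
```

Note that the reducer from the distinct-lines job could not be reused as a combiner this way, since it counts the values it receives instead of summing them.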