Sunday, January 17, 2016

How to speed up count distinct queries in Hive

Have you ever wondered why a count distinct query on a high cardinality column in Hive is slow.

Lets say we have a table that contains list of all products

Table: Products
Columns:
  1. product_id (this column and product_id together are unique)
  2. product_name
  3. price
  4.  available_in_country (this column and product_id together are unique)

Let's say this table contains 1 billion rows and around 200 million unique product_id's.

You need to find distinct number of products in the table.
Most people will write query like
select count(distinct product_id) from products

This query will start a map reduce job with just 1 reducer and the reducer will take a long time to finish. This is because all 1 billion rows will go to just one reducer and the reducer will have go through 1 billion rows and maintain distinct id's in memory.

Instead you should write query like
select count(*) from (select distinct product_id from products)T

This query will very run very fast because the sub query will use multiple reducers. The outer query will then just have to count the number of row and that is normally done in map with reducer just summing up the values of the map and hence it will be really fast. 

Thursday, January 14, 2016

How to do UNION ALL in hive

Suppose you have two table 
Boys
- Name
- Age

Girls
- Name
- Age

Now lets say we want to a query that returns all Boys and Girls. Most of us who come from Oracle/SqlServer/MySql etc world will write following query 

select name, age from Boys 
union all 
select name, age from Girls

This query will throw an error in Hive and it will look like UNION ALL does not works in Hive. UNION ALL does works in Hive, the only trick is to enclose the UNION ALL tables in a subquery, i.e. write the above as 

select name, age from (
     select name, age from Boys 
     union all 
     select name, age from Girls
) BoysAndGirls