Sunday, January 17, 2016

How to speed up count distinct queries in Hive

Have you ever wondered why a count distinct query on a high cardinality column in Hive is slow.

Lets say we have a table that contains list of all products

Table: Products
Columns:
  1. product_id (this column and product_id together are unique)
  2. product_name
  3. price
  4.  available_in_country (this column and product_id together are unique)

Let's say this table contains 1 billion rows and around 200 million unique product_id's.

You need to find distinct number of products in the table.
Most people will write query like
select count(distinct product_id) from products

This query will start a map reduce job with just 1 reducer and the reducer will take a long time to finish. This is because all 1 billion rows will go to just one reducer and the reducer will have go through 1 billion rows and maintain distinct id's in memory.

Instead you should write query like
select count(*) from (select distinct product_id from products)T

This query will very run very fast because the sub query will use multiple reducers. The outer query will then just have to count the number of row and that is normally done in map with reducer just summing up the values of the map and hence it will be really fast. 

No comments:

Post a Comment