In the last blogs, I discussed Common Join and Map Join. In this blog, I am going to discuss Skewed Join. Remember the blog of Common Join, I mentioned one of major issues in Common Join is the join performs poorly when data is skewed. The query is waiting for the longest running reducers on the skewed keys while majority of reducers complete the join operation.
Skewed Join is exactly targeting this problem. At runtime, it scans the data and detects the keys with a large skew, which is controlled by parameter hive.skewjoin.key. Instead of processing those keys, it stores them temporarily in an HDFS directory. Then in a map-reduce job, process those skewed keys. The same key need not be skewed for all the tables, and so the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a map-join.
For example, let’s say we have a join with Table A and B. Both Table A and B has skewed data “mytest” in the joining column. Assuming Table B has fewer rows with skewed data in Table A. The first step is to scan B and save all rows with the key “mytest” in an in-memory hash table. Then run a set of mappers to read Table A to perform the followings:
- If it has skewed key “mytest”, then it will use hashed version of B for the join.
- For all other keys, send the rows to a reducer that performs the join. The same reducer will get rows from the mappers that scanning Table B.
We can see that Table B is scanned twice during Skewed Join. The skewed keys in Table A are read and processed by the mapper, and perform map-side join. The rows with skewed keys in Table A has never sent to the reducer. For the rest of keys in Table A, they uses the regular common join approach.
To use Skewed Join, you need to understand your data and query. Set parameter hive.optimize.skewjoin to true. Parameter hive.skewjoin.key is optional and it is 100000 by default.
How to Identify the Join
When using EXPLAIN command, you will see handleSkewJoin: true below Join Operator and Reduce Operator Tree.
set hive.optimize.skewjoin = true; set hive.skewjoin.key=500000; set hive.skewjoin.mapjoin.map.tasks=10000; set hive.skewjoin.mapjoin.min.split=33554432;