During the performance tuning for Hive query, one area needs to be paid attention is the join type during the execution. Just like join types in Oracle, different types can have significantly different execution time. In the next few blogs, I am going to discuss the join types in Hive. The first join type is Common Join.
Common Join is the default join type in Hive, also called Shuffle Join, or Distributed Join or Sort Merged Join. During the join, all rows from the two tables are distributed to all nodes based on the join keys. In this way, the values from the same join keys end up on the same node. This join has a complete cycle of Map/Reduce.
How it works
1. In the map stage, mappers reads the tables and output the join-column value as the key. The key-value pairs are written into an intermediate file.
2. In the shuffle stage, these pairs are sorts and merged. All rows from the same key will be sent to the same reducer instance.
3. In the reduce stage, reducer gets the sorted data and performs the join.
The advantage of Common Join is that it works in any table size. But as shuffle is an expensive operation, it is quite resource intensive. If one or more join keys has significantly large proportion of the data, the corresponding reducers will be overloaded. The symptom of the issue is that majority of reducers have completed the join operation while a few reducers are still running. The total run time of the query is determined by the longest running reducer. Obviously this is a typical skewed data issue. I am going to discuss a special join just for this kind of skewed issue in a later blog.
How to Identify the Join
When using EXPLAIN command, you will see Join Operator just below Reduce Operator Tree.