In the last three blogs, I discussed Common Join, Map Join and Skewed Join. Common Join is the default join type. Map Join is best used when one of join tables is small and can be fit into memory. Skewed Join improves the query performance for data is skewed in the join keys. What happened when both join tables are very large and none of the above three joins can work? This is where Bucket Join fits in.
Bucket Join is also called Collocated Joins. It is used when all join tables are large and table data has been distributed by the join key. In this case, data copy is unnecessary. It’s map side join and join can happen in the local node. Another condition for Bucket Join is that the number of buckets in one table must be equal or multiple of the number of buckets in another table of the join.
So at the time when creating the table, make sure the buckets are created using the join columns, and BEFORE the data is inserted into the table. Also set both parameter hive.optimize.bucketmapjoin and hive.enforce.bucketing to true before inserting data. One example of creating a bucketed table is shown below:
CREATE TABLE mytable ( name string, city string, employee_id int ) PARTITIONED BY (year STRING, month STRING, day STRING) CLUSTERED BY (employee_id) INTO 256 BUCKETS ;
The above join is also called Bucket Map Join. If the join tables have the same number of buckets and data are also sorted using the join columns, Sort Merge Bucket Map Join is used.
How to Identify the Join
When using EXPLAIN command, you will see Sorted Merge Bucket Map Join Operator below Map Operator Tree.
set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;