In the last blog, I discussed the default join type in Hive: Common Join. In this blog, I am going to discuss Map Join, also called Auto Map Join, or Map Side Join, or Broadcast Join.
One major issue from the Common Join or Sort Merged Join is too much activity spending on shuffling data around. To speed up the Hive queries, Map Join can be used. If one of the tables in the join is a small table and can be loaded into memory, then Map Join can be used.
The first step of the Map Join is to create a Map Reduce local task before the original Map Reduce task. This map/reduce task read data of the small table from HDFS and save it into an in-memory hash table, then into a hash table file. Next, when the original join Map Reduce task starts, it moves the hash table file to the Hadoop Distributed Cache, which will populate the file to each mapper’s local disk. So all the mapper can load this hash table file into the memory and then do the join in Map stage. For example, for a join with big table A and small table B, for every mapper for table A, Table B is read completely. As the smaller table is loaded into memory and then join is performed in the map phrase of the MapReduce job, no reducer is needed and reduce phase is skipped. The map join performs faster than the regular default join.
- The most important parameter for Map Join is hive.auto.convert.join. It must be set to true.
- Dur the join, the determination of small table is controlled by parameter hive.mapjoin.smalltable.filesize. By default, it is 25MB.
- When three or more tables involved in the join, Hive generates three or more map-side joins with the all assumption that all tables are of smaller size. To speed up the join further, you can combine three or more map-side joins into one single map-side join if size of n-1 table is less than 10MB, which is the default value. To achieve this, you need to set hive.auto.convert.join.noconditionaltask parameter to true and specify parameter hive.auto.convert.join.noconditionaltask.size.
- Full outer joins are never converted to Map Join.
- Left outer join can be converted to Map Join only if the right table is less than 25 MB in size. Right-outer join doesn’t work.
How to Identify the Join
When using EXPLAIN command, you will see Map Join Operator just below Map Operator Tree.
You can use hint to specify the query using Map Join. The example below shows that smaller table is the one put in the hint, and force to cache table B manually.
Select /*+ MAPJOIN(b) */ a.key, a.value from a join b on a.key = b.key
hive> set hive.auto.convert.join=true; hive> set hive.auto.convert.join.noconditionaltask=true; hive> set hive.auto.convert.join.noconditionaltask.size=20971520 hive> set hive.auto.convert.join.use.nonstaged=true; hive> set hive.mapjoin.smalltable.filesize = 30000000;