Join Type in Hive: Map Join

In the last post, I discussed the default join type in Hive: Common Join. In this post, I am going to discuss Map Join, also called Auto Map Join, Map Side Join, or Broadcast Join.

One major issue with the Common Join, or Sort Merge Join, is that too much time is spent shuffling data around. Map Join can speed up such queries: if one of the tables in the join is small enough to be loaded into memory, the join can be done on the map side instead.

The first step of a Map Join is to run a MapReduce local task before the original MapReduce task. This local task reads the small table from HDFS, loads it into an in-memory hash table, and then writes that hash table to a file. Next, when the original join MapReduce task starts, it moves the hash table file to the Hadoop Distributed Cache, which copies the file to each mapper’s local disk. Every mapper can then load the hash table file into memory and do the join in the map stage. For example, in a join of big table A with small table B, every mapper that processes part of table A reads table B completely. Because the smaller table is held in memory and the join is performed in the map phase of the MapReduce job, no reducer is needed and the reduce phase is skipped. As a result, a Map Join usually runs faster than the regular default join.
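
As a minimal sketch of the kind of query this applies to, assume a big table a and a small table b that share a key column. The table and column names are placeholders for illustration, not a real schema. With automatic conversion enabled, Hive can pick the Map Join path by itself:

```sql
-- Let Hive convert eligible joins to Map Join automatically.
SET hive.auto.convert.join=true;

-- a is the big table that the mappers stream through.
-- b is the small table that every mapper loads into an in-memory hash table.
-- Table and column names are placeholders.
SELECT a.key, a.value, b.value AS b_value
FROM a
JOIN b ON a.key = b.key;
```

Whether the conversion actually happens depends on the size of b relative to the small-table threshold covered under Parameters.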

Parameters

The most important parameter for Map Join is hive.auto.convert.join. It must be set to true.
During the join, which table counts as small is controlled by the parameter hive.mapjoin.smalltable.filesize. By default, it is 25MB (25000000 bytes).
When three or more tables are involved in the join, Hive generates a separate map-side join for each join step (two or more in total), on the assumption that all of the tables except one are small. To speed up the join further, you can combine these map-side joins into a single map-side join if the combined size of n-1 of the tables is below a threshold, which is 10MB by default. To achieve this, set hive.auto.convert.join.noconditionaltask to true and specify the threshold in hive.auto.convert.join.noconditionaltask.size.
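
A sketch of the three-table case, assuming a big table a and two small tables b and c. Table c and all column names are hypothetical. The threshold is written out explicitly as 10000000 bytes, roughly the 10MB mentioned above:

```sql
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- Threshold in bytes for the combined size of the n-1 small tables.
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- If b and c together fit under the threshold, both joins can be
-- handled in a single map-side join while a is streamed.
SELECT a.key, b.value AS b_value, c.value AS c_value
FROM a
JOIN b ON a.key = b.key
JOIN c ON a.key = c.key;
```

If b and c together exceed the threshold, Hive falls back to deciding each join separately.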

Restriction

Full outer joins are never converted to Map Join.
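A left outer join can be converted to Map Join only if the right table is less than 25 MB in size. A right outer join with the small table on the right is not converted, because the right table is the side whose rows must all be preserved and streamed.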
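
To make the restriction concrete, here is a sketch with placeholder tables a (big) and b (small). The table and column names are assumptions for illustration:

```sql
SET hive.auto.convert.join=true;

-- Eligible: a is streamed, and the small right-hand table b can be
-- loaded into memory.
SELECT a.key, b.value
FROM a
LEFT OUTER JOIN b ON a.key = b.key;

-- Never converted: a full outer join must preserve every row
-- from both sides.
SELECT a.key, b.value
FROM a
FULL OUTER JOIN b ON a.key = b.key;
```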

How to Identify the Join
When you run the EXPLAIN command on a query that uses Map Join, the plan shows a Map Join Operator just below Map Operator Tree.
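
For instance, the plan for a two-table join can be inspected as follows. Tables a and b are placeholders, and the exact plan text varies by Hive version and execution engine, so no output is reproduced here:

```sql
SET hive.auto.convert.join=true;

-- EXPLAIN prints the query plan instead of running the query.
-- Look for "Map Join Operator" under "Map Operator Tree".
EXPLAIN
SELECT a.key, a.value
FROM a
JOIN b ON a.key = b.key;
```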

Other
You can also use a hint to request a Map Join. In the example below, the smaller table is the one named in the hint, which forces table b to be cached manually.

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
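
One caveat worth checking on your cluster: to my knowledge, since Hive 0.11 there is a setting, hive.ignore.mapjoin.hint, that defaults to true and makes Hive ignore the MAPJOIN hint. A sketch of running the hinted query with the hint honored, using the same tables a and b:

```sql
-- Tell Hive to honor MAPJOIN hints instead of ignoring them.
SET hive.ignore.mapjoin.hint=false;

-- b is named in the hint, so b is the table cached in memory.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a
JOIN b ON a.key = b.key;
```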

Example

hive> set hive.auto.convert.join=true;
hive> set hive.auto.convert.join.noconditionaltask=true;
hive> set hive.auto.convert.join.noconditionaltask.size=20971520;
hive> set hive.auto.convert.join.use.nonstaged=true;
hive> set hive.mapjoin.smalltable.filesize=30000000;
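
To check what a session is actually using, the Hive CLI prints the current value of a property when SET is given the property name without a value. A short sketch using the parameters from this example:

```sql
-- Each statement prints the property as name=value.
SET hive.auto.convert.join;
SET hive.auto.convert.join.noconditionaltask;
SET hive.auto.convert.join.noconditionaltask.size;
SET hive.mapjoin.smalltable.filesize;
```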

4 thoughts on “Join Type in Hive: Map Join”

Efrat Amir-Caspi

October 16, 2017 at 9:06 am

Thank you, what a great and useful post!
How does Hive identify the “Small” table?

- Weidong Zhou
  
  October 16, 2017 at 9:20 am
  
  In Hive, a table counts as small when its size is below hive.mapjoin.smalltable.filesize. By default, this value is set to 25MB. You can change this parameter to something a little bigger, like 200MB, which lets more joins qualify for Map Join.
