Load Data to Hive Table

In the last few posts, I discussed:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
This post will discuss a basic scenario in Hive: dump some data from an Oracle database, load it to HDFS, and query the data using Hive.
Now let's create our first Hive table. I am going to dump data from the ALL_OBJECTS view in the dbm database on our X3 Exadata in the lab, transfer the file to my Hadoop cluster, and load it into HDFS. Then I will create a Hive table and load the data into it.

1. Create the flat file from ALL_OBJECTS on an Oracle database.
Run the following script to generate a text file from the all_objects view.
$ vi download_objects_data.sql

set echo off
set feedback off
set colsep '|'
set pagesize 0   
set trimspool on 
set headsep off  
set head off
--set sqlprompt ''
set linesize 150
set numw 10

col object_type for a20

spool all_objects_data.txt
select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date
from all_objects;
spool off

set echo on
set feedback on

2. Run the above script in the dbm database, then copy the all_objects_data.txt file to my Hadoop cluster, as sketched below.
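A minimal sketch of how running the script and copying the file could look; the connect string and target host below are just placeholders for my environment, so adjust them to yours.

$ sqlplus wzhou@dbm @download_objects_data.sql
$ scp all_objects_data.txt wzhou@vmhost1:/home/wzhou/test/data/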

3. Upload the file to the /user/wzhou/test2 directory on HDFS.
[wzhou@vmhost1 data]$ pwd
/home/wzhou/test/data
[wzhou@vmhost1 data]$ cd ~/test/data
[wzhou@vmhost1 data]$ ls -l all_objects_data.txt
-rw-r--r-- 1 wzhou wzhou 2050380 Sep 21 11:24 all_objects_data.txt
[wzhou@vmhost1 data]$ hdfs dfs -mkdir /user/wzhou/test2
[wzhou@vmhost1 data]$ hdfs dfs -copyFromLocal all_objects_data.txt /user/wzhou/test2
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/wzhou/test2
Found 1 items
-rw-r--r-- 2 wzhou bigdata 2050380 2015-09-21 11:28 /user/wzhou/test2/all_objects_data.txt
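To sanity-check the upload, you can peek at the first few lines of the file straight from HDFS before creating any table:

[wzhou@vmhost1 data]$ hdfs dfs -cat /user/wzhou/test2/all_objects_data.txt | head -5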

4. Create a Hive table.
hive
CREATE DATABASE test_oracle;
SHOW DATABASES;

USE test_oracle;
SHOW TABLES;

CREATE TABLE my_all_objects (
owner string,
object_name string,
object_id int,
object_type string,
create_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

SHOW TABLES;
DESC my_all_objects;

Here is the output from the execution.

[wzhou@vmhost1 data]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.3.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive> CREATE DATABASE test_oracle;
OK
Time taken: 0.65 seconds

hive> SHOW DATABASES;
OK
default
test1
test_oracle
Time taken: 0.217 seconds, Fetched: 3 row(s)

hive> USE test_oracle;
OK
Time taken: 0.03 seconds

hive> SHOW TABLES;
OK
Time taken: 0.046 seconds

hive> CREATE TABLE my_all_objects (
    > owner string,
    > object_name string,
    > object_id int,
    > object_type string,
    > create_date string 
    > ) 
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
OK
Time taken: 0.228 seconds

hive> SHOW TABLES;
OK
my_all_objects
Time taken: 0.018 seconds, Fetched: 1 row(s)

hive> DESC my_all_objects;
OK
owner               	string              	                    
object_name         	string              	                    
object_id           	int                 	                    
object_type         	string              	                    
create_date         	string              	                    
Time taken: 0.128 seconds, Fetched: 5 row(s)
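As the warning at startup notes, the Hive CLI is deprecated in favor of Beeline. The same statements can also be run through Beeline against HiveServer2; a sketch of the connection, assuming HiveServer2 is running on vmhost1.local on the default port 10000:

$ beeline -u jdbc:hive2://vmhost1.local:10000 -n wzhou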

If you want to see the script used to create the table, use the SHOW CREATE TABLE command to generate the table's DDL.

hive> SHOW CREATE TABLE my_all_objects;
OK
CREATE TABLE `my_all_objects`(
  `owner` string, 
  `object_name` string, 
  `object_id` int, 
  `object_type` string, 
  `create_date` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '|' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://vmhost1.local:8020/user/hive/warehouse/test_oracle.db/my_all_objects'
TBLPROPERTIES (
  'transient_lastDdlTime'='1442853571')
Time taken: 0.209 seconds, Fetched: 17 row(s)
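Another option is DESCRIBE FORMATTED, which lists the table location, owner, table type, and storage details in a more readable layout:

hive> DESCRIBE FORMATTED my_all_objects;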

5. Load data into the table from all_objects_data.txt.
LOAD DATA INPATH '/user/wzhou/test2/all_objects_data.txt' OVERWRITE INTO TABLE my_all_objects;

6. Verify the result.
select * from my_all_objects limit 10;
select count(*) from my_all_objects;

Here is the output:

hive> LOAD DATA INPATH '/user/wzhou/test2/all_objects_data.txt' OVERWRITE INTO TABLE my_all_objects;
Loading data to table test_oracle.my_all_objects
chgrp: changing ownership of 'hdfs://vmhost1.local:8020/user/hive/warehouse/test_oracle.db/my_all_objects/all_objects_data.txt': User does not belong to hive
Table test_oracle.my_all_objects stats: [numFiles=1, numRows=0, totalSize=2050380, rawDataSize=0]
OK
Time taken: 0.71 seconds
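As a side note, if the flat file were still only on the local filesystem, the separate hdfs dfs -copyFromLocal step could be skipped: adding the LOCAL keyword tells Hive to copy the file from a local path into the table directory instead. A sketch, reusing my local path from step 3:

hive> LOAD DATA LOCAL INPATH '/home/wzhou/test/data/all_objects_data.txt' OVERWRITE INTO TABLE my_all_objects;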

Let's fetch a few rows from the Hive table.

hive> select * from my_all_objects limit 10;
OK
SYS            	ICOL$                         	NULL	TABLE               	2013-03-12
SYS            	I_USER1                       	NULL	INDEX               	2013-03-12
SYS            	CON$                          	NULL	TABLE               	2013-03-12
SYS            	UNDO$                         	NULL	TABLE               	2013-03-12
SYS            	C_COBJ#                       	NULL	CLUSTER             	2013-03-12
SYS            	I_OBJ#                        	NULL	INDEX               	2013-03-12
SYS            	PROXY_ROLE_DATA$              	NULL	TABLE               	2013-03-12
SYS            	I_IND1                        	NULL	INDEX               	2013-03-12
SYS            	I_CDEF2                       	NULL	INDEX               	2013-03-12
SYS            	I_OBJ5                        	NULL	INDEX               	2013-03-12
Time taken: 0.339 seconds, Fetched: 10 row(s)

Getting the total row count actually runs a MapReduce job.

hive> select count(*) from my_all_objects;
Query ID = wzhou_20150921115656_f9fe088d-149b-4436-9974-0265d8fb676a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442853303881_0001, Tracking URL = http://vmhost1.local:8088/proxy/application_1442853303881_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442853303881_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-21 11:57:00,160 Stage-1 map = 0%,  reduce = 0%
2015-09-21 11:57:12,692 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.63 sec
2015-09-21 11:57:23,135 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.21 sec
MapReduce Total cumulative CPU time: 4 seconds 210 msec
Ended Job = job_1442853303881_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.21 sec   HDFS Read: 2057365 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 210 msec
OK
22782
Time taken: 40.221 seconds, Fetched: 1 row(s)

You might notice that I now have two copies of the data:
1. One under the Hive warehouse directory, /user/hive/warehouse/test_oracle.db/my_all_objects.
2. Another one, my original data file, at /user/wzhou/test2/all_objects_data.txt.

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive
Found 1 items
drwxrwxrwt   - hive hive          0 2015-09-21 11:38 /user/hive/warehouse
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse
Found 3 items
drwxrwxrwt   - wzhou hive          0 2015-09-20 19:05 /user/hive/warehouse/my_test1
drwxrwxrwt   - wzhou hive          0 2015-09-20 18:56 /user/hive/warehouse/test1.db
drwxrwxrwt   - wzhou hive          0 2015-09-21 11:39 /user/hive/warehouse/test_oracle.db
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db
Found 1 items
drwxrwxrwt   - wzhou hive          0 2015-09-21 11:55 /user/hive/warehouse/test_oracle.db/my_all_objects
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects
Found 1 items
-rwxrwxrwt   2 wzhou bigdata    2050380 2015-09-21 11:28 /user/hive/warehouse/test_oracle.db/my_all_objects/all_objects_data.txt

For this test, it's fine to have two copies of the same data. But for a big dataset, it does not make sense to keep two identical copies on HDFS. If I want to keep just one copy, this is where a Hive external table comes in. Let me assume /user/wzhou/test2 is the folder where I keep all of the data for Hive operations. The table creation is a little different.

CREATE EXTERNAL TABLE my_all_objects_ext (
owner string,
object_name string,
object_id string,
object_type string,
create_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
LOCATION '/user/wzhou/test2';

Note: LOCATION specifies the directory name, not the filename of a data file. You could have thousands of data files under the same directory, as the sketch below illustrates.
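For example, if I later produced a second extract, simply dropping it into the same directory would make its rows visible to the external table without any DDL change (the second filename here is hypothetical):

[wzhou@vmhost1 data]$ hdfs dfs -copyFromLocal all_objects_data_2.txt /user/wzhou/test2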

After running the above script to create the new Hive external table, verify it.

hive> select * from my_all_objects_ext limit 10;
OK
SYS            	ICOL$               20	TABLE               	2013-03-12
SYS            	I_USER1             46	INDEX               	2013-03-12
SYS            	CON$                28	TABLE               	2013-03-12
SYS            	UNDO$               15	TABLE               	2013-03-12
SYS            	C_COBJ#             29	CLUSTER             	2013-03-12
SYS            	I_OBJ#              3	INDEX               	2013-03-12
SYS            	PROXY_ROLE_DATA$    25	TABLE               	2013-03-12
SYS            	I_IND1              41	INDEX               	2013-03-12
SYS            	I_CDEF2             54	INDEX               	2013-03-12
SYS            	I_OBJ5              40	INDEX               	2013-03-12
Time taken: 0.094 seconds, Fetched: 10 row(s)

hive> select count(*) from my_all_objects_ext;
Query ID = wzhou_20150921125858_48330cd5-220e-46ad-960d-67ca1315bc6c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442853303881_0002, Tracking URL = http://vmhost1.local:8088/proxy/application_1442853303881_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442853303881_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-21 12:58:17,560 Stage-1 map = 0%,  reduce = 0%
2015-09-21 12:58:28,124 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
2015-09-21 12:58:38,570 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.31 sec
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1442853303881_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.31 sec   HDFS Read: 2057301 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
22782
Time taken: 36.329 seconds, Fetched: 1 row(s)
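One more point worth mentioning, though not shown above: dropping an external table only removes the table definition from the metastore, and the data files under /user/wzhou/test2 stay on HDFS untouched. That is exactly why this layout lets me keep a single copy of the data safely. A quick sketch of how to confirm it (the table can be recreated afterwards with the same DDL):

hive> DROP TABLE my_all_objects_ext;

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/wzhou/test2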

Ok, we now have a basic understanding of Hive tables, and it's quite simple to create one on HDFS. This approach works well when you don't care about the source of the data, whether it comes from an RDBMS, web logs, a mainframe, or something else. As long as you can put the file on HDFS, you can use Hive to query it. You might, however, have a scenario where you want to pull the data directly from an RDBMS such as Oracle. In that case, one popular tool is Sqoop. In the next post, I am going to discuss how to use Sqoop to import data from an Oracle database into HDFS and export the content of a Hive table back to Oracle.
