Import Data to Hive from Oracle Database

In the last three posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
This post will discuss a basic scenario in Hive: dump some data from an Oracle database, load it into HDFS, and query the data using Hive. Sqoop is a tool designed for transferring bulk data between Hadoop and an RDBMS such as Oracle, and it can be used to populate data into Hive or HBase on Hadoop. The import can be an entire table, the result of a query, or an incremental load. OK, here is the example to show how to use sqoop to load data from Oracle.
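
Before diving into the example, here is a quick orientation: the three import modes look roughly like the sketch below. The host, credentials, table, and column names in this sketch are placeholders, not anything from my environment.

# Whole-table import (splits on the table's primary key by default)
sqoop import \
--connect jdbc:oracle:thin:@dbhost:1521:orcl \
--username scott -P \
--table MY_TABLE \
--hive-import --hive-table mydb.my_table

# Free-form query import (the WHERE clause must contain $CONDITIONS)
sqoop import \
--connect jdbc:oracle:thin:@dbhost:1521:orcl \
--username scott -P \
--query 'select * from my_table where $CONDITIONS' \
--split-by id -m 4 \
--target-dir /tmp/sqoopimport

# Incremental append (only rows with id greater than the last imported value)
sqoop import \
--connect jdbc:oracle:thin:@dbhost:1521:orcl \
--username scott -P \
--table MY_TABLE \
--incremental append --check-column id --last-value 1000 \
--target-dir /tmp/sqoopimport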

1. Add the user to the hive group on all hosts in the cluster. Otherwise, you could see the following error:
chgrp: changing ownership of 'hdfs://vmhost1.local:8020/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00004_copy_1': User does not belong to hive

[root@vmhost1 mnt]# usermod -a -G hive wzhou
[root@vmhost1 mnt]# id wzhou
uid=502(wzhou) gid=502(wzhou) groups=502(wzhou),481(hive),501(bigdata)

2. Create the target directory, /tmp/sqoopimport, for the sqoop job.
[wzhou@vmhost1 data]$ hdfs dfs -mkdir /tmp/sqoopimport

3. Install the Oracle JDBC driver. Check whether the Oracle JDBC driver already exists; if not, install it as follows. The Oracle JDBC driver file can be downloaded from the Oracle web site at
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
The file name is ojdbc6.jar. Copy this file to /usr/lib/sqoop/lib/ if using packages, or /var/lib/sqoop/ if using parcels. Make sure to change the file permission to 755 afterwards.

[root@vmhost1 mnt]# cp ojdbc6.jar /usr/lib/sqoop/lib/
[root@vmhost1 mnt]# ls -l /usr/lib/sqoop/lib/oj*.jar
-rw-r----- 1 root root 2739670 Sep 21 15:24 /usr/lib/sqoop/lib/ojdbc6.jar
[root@vmhost1 mnt]# chmod 755 /usr/lib/sqoop/lib/ojdbc6.jar
[root@vmhost1 mnt]# ls -l /usr/lib/sqoop/lib/oj*.jar
-rwxr-xr-x 1 root root 2739670 Sep 21 15:24 /usr/lib/sqoop/lib/ojdbc6.jar

4. Verify that the JDBC connection works by running sqoop with the list-databases argument. It connects to the Oracle database and lists the schemas in the dbm database. If you don't want to show the password on the command line, use the -P parameter to enter the database password interactively.

sqoop list-databases \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
-P

Here is the execution result:
[wzhou@vmhost1 data]$ sqoop list-databases \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
> --username wzhou \
> -P

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/21 18:55:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Enter password:
15/09/21 18:56:02 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/21 18:56:02 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/21 18:56:03 INFO manager.OracleManager: Time zone has been set to GMT
ORACLE
JBARBA
TANEL
WZHOU
MBH
MPAGANO
GSHEPPARD
ENKITEC
ACOLVIN
KARLARAO
KSO
MARTIN
KOSBORNE
ODI
RJOHNSON
ORACLE_OCM
. . . .
SYSMAN
DBSNMP
WMSYS
XDB
MGMT_VIEW
SYS
SYSTEM
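
As a side note, if you would rather not type the password interactively at all, sqoop can also read it from a file via the --password-file option. Here is a minimal sketch, assuming /user/wzhou is the HDFS home directory; the password file path is just an example:

echo -n "mypassword" > .oracle_pwd
hdfs dfs -put .oracle_pwd /user/wzhou/.oracle_pwd
hdfs dfs -chmod 400 /user/wzhou/.oracle_pwd

sqoop list-databases \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password-file /user/wzhou/.oracle_pwd

The echo -n matters: a trailing newline in the file would become part of the password.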

5. Create the Hive table. The new Hive table is called my_all_objects_sqoop in the test_oracle database.

First, check that the my_all_objects_sqoop table does not yet exist on HDFS:
hdfs dfs -ls /user/hive/warehouse/test_oracle.db

Run the following commands to create the Hive table in the test_oracle database.
hive
USE test_oracle;

CREATE TABLE my_all_objects_sqoop (
owner string,
object_name string,
object_id int,
object_type string,
create_date string
)
STORED AS TEXTFILE;
SHOW TABLES;

Check out the folders on HDFS.
hdfs dfs -ls /user/hive/warehouse/test_oracle.db
hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop

Execution result.
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db
Found 1 items
drwxrwxrwt – wzhou hive 0 2015-09-21 11:55 /user/hive/warehouse/test_oracle.db/my_all_objects

hive> USE test_oracle;
OK
Time taken: 0.221 seconds

hive> CREATE TABLE my_all_objects_sqoop (
> owner string,
> object_name string,
> object_id int,
> object_type string,
> create_date string
> )
> STORED AS TEXTFILE;

OK
Time taken: 0.247 seconds

hive> SHOW TABLES;
OK
my_all_objects
my_all_objects_sqoop
Time taken: 0.154 seconds, Fetched: 2 row(s)

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db
Found 2 items
drwxrwxrwt – wzhou hive 0 2015-09-21 11:55 /user/hive/warehouse/test_oracle.db/my_all_objects
drwxrwxrwt – wzhou hive 0 2015-09-21 17:15 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
[wzhou@vmhost1 data]$

6. Run sqoop to import the data directly into the Hive table. Please note that sqoop has a quirky requirement for free-form queries: you must include $CONDITIONS in the WHERE clause, even though I am retrieving all rows from all_objects. The -m (or --num-mappers) argument enables parallel import: it specifies the number of map tasks (parallel processes) used to perform the import. When running in parallel, sqoop needs a way to split the workload, and that is what the --split-by argument is for: it specifies the column to use as the split column. By default, the primary key of the table is used as the split column, but you can choose a different column with --split-by. A variation that splits on a numeric column is sketched right after the command below.

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password wzhou \
--query "select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where \$CONDITIONS" \
-m 4 \
--split-by object_type \
--hive-import \
--target-dir '/tmp/sqoopimport' \
--hive-table test_oracle.my_all_objects_sqoop
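
As the execution log below warns, splitting on a textual column such as object_type can lead to a partial import or duplicate records if the database sorts case-insensitively. A minimal variation of the same command that avoids this is to split on the numeric object_id column instead:

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
-P \
--query "select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where \$CONDITIONS" \
--split-by object_id \
-m 4 \
--hive-import \
--target-dir '/tmp/sqoopimport' \
--hive-table test_oracle.my_all_objects_sqoop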

select * from my_all_objects_sqoop limit 3;
select count(*) from my_all_objects_sqoop;

hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects
hdfs dfs -cat /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000 | head -n 10

Execution result.

[wzhou@vmhost1 data]$ sqoop import \
>     --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
>     --username wzhou \
>     --password wzhou \
>     --query "select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where \$CONDITIONS" \
>     -m 4 \
>     --split-by object_type \
>     --hive-import \
>     --target-dir '/tmp/sqoopimport' \
>     --hive-table test_oracle.my_all_objects_sqoop
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/21 17:53:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/21 17:53:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/21 17:53:15 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/09/21 17:53:15 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/09/21 17:53:15 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/21 17:53:15 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/21 17:53:15 INFO tool.CodeGenTool: Beginning code generation
15/09/21 17:53:18 INFO manager.OracleManager: Time zone has been set to GMT
15/09/21 17:53:18 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:53:19 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:53:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/d96bcc9fb5354b998f0db547e0ed17bc/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/21 17:53:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/d96bcc9fb5354b998f0db547e0ed17bc/QueryResult.jar
15/09/21 17:53:21 INFO mapreduce.ImportJobBase: Beginning query import.
15/09/21 17:53:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/21 17:53:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/21 17:53:22 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/21 17:53:26 INFO db.DBInputFormat: Using read commited transaction isolation
15/09/21 17:53:26 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(object_type), MAX(object_type) FROM (select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 1) ) t1
15/09/21 17:53:26 WARN db.TextSplitter: Generating splits for a textual index column.
15/09/21 17:53:26 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
15/09/21 17:53:26 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
15/09/21 17:53:26 INFO mapreduce.JobSubmitter: number of splits:5
15/09/21 17:53:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442870712675_0012
15/09/21 17:53:27 INFO impl.YarnClientImpl: Submitted application application_1442870712675_0012
15/09/21 17:53:27 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442870712675_0012/
15/09/21 17:53:27 INFO mapreduce.Job: Running job: job_1442870712675_0012
15/09/21 17:53:40 INFO mapreduce.Job: Job job_1442870712675_0012 running in uber mode : false
15/09/21 17:53:40 INFO mapreduce.Job:  map 0% reduce 0%
15/09/21 17:53:53 INFO mapreduce.Job:  map 20% reduce 0%
15/09/21 17:54:25 INFO mapreduce.Job:  map 40% reduce 0%
15/09/21 17:55:19 INFO mapreduce.Job:  map 60% reduce 0%
15/09/21 17:55:49 INFO mapreduce.Job:  map 80% reduce 0%
15/09/21 17:56:29 INFO mapreduce.Job:  map 100% reduce 0%
15/09/21 17:56:29 INFO mapreduce.Job: Job job_1442870712675_0012 completed successfully
15/09/21 17:56:29 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=680385
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=728
		HDFS: Number of bytes written=1089568
		HDFS: Number of read operations=20
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=10
	Job Counters 
		Launched map tasks=5
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=160538
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=160538
		Total vcore-seconds taken by all map tasks=160538
		Total megabyte-seconds taken by all map tasks=164390912
	Map-Reduce Framework
		Map input records=22519
		Map output records=22519
		Input split bytes=728
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=641
		CPU time spent (ms)=10540
		Physical memory (bytes) snapshot=614903808
		Virtual memory (bytes) snapshot=8024203264
		Total committed heap usage (bytes)=159252480
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=1089568
15/09/21 17:56:29 INFO mapreduce.ImportJobBase: Transferred 1.0391 MB in 187.358 seconds (5.6791 KB/sec)
15/09/21 17:56:29 INFO mapreduce.ImportJobBase: Retrieved 22519 records.
15/09/21 17:56:29 INFO manager.OracleManager: Time zone has been set to GMT
15/09/21 17:56:29 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:56:30 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:56:30 WARN hive.TableDefWriter: Column OBJECT_ID had to be cast to a less precise type in Hive
15/09/21 17:56:30 INFO hive.HiveImport: Loading uploaded data into Hive

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.3.jar!/hive-log4j.properties
OK
Time taken: 1.98 seconds
Loading data to table test_oracle.my_all_objects_sqoop
Table test_oracle.my_all_objects_sqoop stats: [numFiles=5, totalSize=1089568]
OK
Time taken: 0.886 seconds

Verify the result from Hive. You might need to exit and re-enter the Hive shell to see the data in the newly populated table; otherwise, you could see 0 rows there.
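
A quick way to double-check from a fresh session is to run the queries non-interactively with hive -e, for example:

hive -e "use test_oracle; select count(*) from my_all_objects_sqoop;"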

hive> select * from my_all_objects_sqoop limit 3;
OK
SYS	C_COBJ#	29	CLUSTER	2013-03-12
SYS	C_TS#	6	CLUSTER	2013-03-12
SYS	C_FILE#_BLOCK#	8	CLUSTER	2013-03-12
Time taken: 0.684 seconds, Fetched: 3 row(s)

hive> select count(*) from my_all_objects_sqoop;
Query ID = wzhou_20150921175858_c67bac9f-9cf0-437a-a90c-d3a891455f33
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442870712675_0013, Tracking URL = http://vmhost1.local:8088/proxy/application_1442870712675_0013/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442870712675_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-21 17:58:57,446 Stage-1 map = 0%,  reduce = 0%
2015-09-21 17:59:09,067 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.68 sec
2015-09-21 17:59:19,506 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.41 sec
MapReduce Total cumulative CPU time: 4 seconds 410 msec
Ended Job = job_1442870712675_0013
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.41 sec   HDFS Read: 1096958 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 410 msec
OK
22519
Time taken: 37.941 seconds, Fetched: 1 row(s)

Check the Hive files on HDFS.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
Found 5 items
-rwxrwxrwt   2 wzhou hive       9927 2015-09-21 17:53 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000
-rwxrwxrwt   2 wzhou hive     248584 2015-09-21 17:54 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00001
-rwxrwxrwt   2 wzhou hive      93915 2015-09-21 17:55 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00002
-rwxrwxrwt   2 wzhou hive     735608 2015-09-21 17:55 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00003
-rwxrwxrwt   2 wzhou hive       1534 2015-09-21 17:56 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00004
[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects
Found 1 items
-rwxrwxrwt   2 wzhou bigdata    2050380 2015-09-21 11:28 /user/hive/warehouse/test_oracle.db/my_all_objects/all_objects_data.txt

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000 | head -n 10
SYSC_COBJ#29CLUSTER2013-03-12
SYSC_TS#6CLUSTER2013-03-12
SYSC_FILE#_BLOCK#8CLUSTER2013-03-12
SYSC_USER#10CLUSTER2013-03-12
SYSC_OBJ#2CLUSTER2013-03-12
SYSORA$BASE100EDITION2013-03-12
SYSSMON_SCN_TO_TIME_AUX268CLUSTER2013-03-12
SYSC_OBJ#_INTCOL#421CLUSTER2013-03-12
SYSC_TOID_VERSION#495CLUSTER2013-03-12
SYSC_MLOG#625CLUSTER2013-03-12
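
The fields above look run together only because sqoop's Hive import writes Hive's default field delimiter, Ctrl-A (\001), between the columns, and a terminal does not display it (this is the "Hive-specific delimiters" message in the sqoop log above). To confirm the delimiter is really there, translate it into something visible, for example:

hdfs dfs -cat /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000 | head -n 3 | tr '\001' '|'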

OK, I am done with the import to Hive here. In the next post, I will discuss exporting from Hive directly to an Oracle database.


16 thoughts on "Import Data to Hive from Oracle Database"


  1. I am facing this error and have been asking around, but it seems no one can shed any light on it. Do you have any idea?

    org.apache.sqoop.manager.OracleManager.listDatabases(OracleManager.java:695)

    Thanks!
    Yong

      • Sorry WeiDong, I have tried to post this reply a few times, but it never shows up on your blog, so I am not sure whether you received any of my replies; if I annoyed you, I am sorry. Here is my last try on this issue.

        Thanks for your reply. Regarding the command, I was using list-databases to check connectivity before importing from the Oracle DB.

        Command I used:
        sqoop list-databases --connect jdbc:oracle:thin:@//ip address:1521/ --username user --password password

        output:
        Warning: /opt/cloudera/parcels/CDH-5.4.9-1.cdh5.4.9.p0.19/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
        Please set $ACCUMULO_HOME to the root of your Accumulo installation.
        16/01/11 16:32:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.9
        16/01/11 16:32:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
        16/01/11 16:32:11 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
        16/01/11 16:32:11 INFO manager.SqlManager: Using default fetchSize of 1000
        16/01/11 16:32:12 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
        java.lang.NullPointerException
        at org.apache.sqoop.manager.OracleManager.listDatabases(OracleManager.java:695)
        at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:49)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

        I have checked connectivity using telnet and the port is open, and I have granted select rights on the user role in SQL Developer. Not sure what else I missed; I hope you can shed some light.

        Thank you very much,
        Yong

  2. Hi, Yong,
    Your reply has to get my approval before it shows up on the site. That is why you couldn't see your previous replies.

    I am almost sure the issue is in your JDBC connection string. It looks like you are missing the DB instance name in the connection string. Try the following:

    sqoop list-databases --connect jdbc:oracle:thin:@ip address:1521:YourDBInstanceName --username user --password password

    In my case, here is the command I used in my environment. Please note that dbm1 is my DB instance name here, which is exactly what is missing from your connection string.
    sqoop list-databases \
    --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
    --username wzhou \
    --password password

    Hope it can help you.

    Weidong

      • Another possibility is that you have an issue connecting to the Oracle database from your Hadoop host. Check your environment variables and try to connect to your DB instance from the Hadoop host using sqlplus, or try tnsping. If your JDBC connection string is correct, you should be able to connect with sqlplus or tnsping. This test should tell you whether you have an environment configuration issue or a grant/password issue with your Oracle user.

  3. Managed to solve the list-databases issue by changing the driver and permissions on Oracle; now I have another error while importing:

    ERROR tool.BaseSqoopTool: Error parsing arguments for import:
    ERROR tool.BaseSqoopTool: Unrecognized argument: schema.tablename

  4. ERROR [main] tool.ImportTool: Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
    I am getting the above error while using the command below.
    sqoop import --connect jdbc:oracle:thin:@ausagliplm1db03.us.dell.com:1521:agbidit --username AGILEPLM_DM -P --table QUERY_SMALLDATA --as-textfile --target-dir /suri/test -m -1
    Can you please help? My email id is suri2005.allala@gmail.com

