Use incremental import in sqoop to load data from Oracle (Part I)

This is a two-part series discussing incremental import jobs in Sqoop. This post is the first part of the series.

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database
5. Export data from Hive table to Oracle Database.
6. Use Impala to query a Hive table
When using Sqoop to load data from an Oracle table into a Hive table, you don't always load the full table in one shot, just like building the house below takes many days of work. In other words, it is common to load partial data from an Oracle table into an existing Hive table. This is where a Sqoop incremental job comes in.

sqoop_incremental_1

First, I create a simple table to illustrate the incremental import process.
1. Create the source table.
Run the following statements to create a new table in the Oracle database.

create table wzhou.student
(
student_id number(8) not null,
student_name varchar2(20) not null,
major varchar2(20),
CONSTRAINT student_pk PRIMARY KEY (student_id)
);
insert into wzhou.student values ( 1, 'student1', 'math' );
insert into wzhou.student values ( 2, 'student2', 'computer' );
insert into wzhou.student values ( 3, 'student3', 'math' );
insert into wzhou.student values ( 4, 'student4', 'accounting' );

commit;
select * from wzhou.student;

2. Create the import command.

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major

The --check-column argument specifies which column is checked during the import. The column cannot be a *CHAR type, such as VARCHAR2 or CHAR.

The --incremental argument has two modes: append and lastmodified. The lastmodified mode is usually used with a last-modified column defined as a timestamp.
The --last-value argument specifies a value; only rows whose check column is greater than this value will be imported.
Alternatively, as one internet article suggests, the value can be derived dynamically:
--last-value $($HIVE_HOME/bin/hive -e "select max(idcolumn) from tablename")
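
The lastmodified mode is not exercised in this post; for reference, here is a minimal sketch of what it might look like, assuming a hypothetical ORDERS table with a LAST_UPDATE_TS timestamp column (the table, column, and target directory are made up for illustration):

# Sketch only: ORDERS, LAST_UPDATE_TS and the target directory are hypothetical.
# Rows whose LAST_UPDATE_TS is newer than --last-value are imported.
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
-P \
--table ORDERS \
--incremental lastmodified \
--check-column LAST_UPDATE_TS \
--last-value "2015-09-25 00:00:00" \
--target-dir /user/wzhou/ORDERS \
-m 1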

Note 1:
Make sure the table name specified in the --table argument is in UPPERCASE. If it is lowercase, you will see an error like the one below.
ERROR tool.ImportTool: Imported Failed: There is no column found in the target table all_objects_inc_test. Please ensure that your table name is correct.

Note 2:
If you use the --hive-import argument, you will see the following error. Append mode is not supported for Hive imports yet, so remove the argument and build a Hive external table after the data has been imported to HDFS (see steps 3 and 4 below).
ERROR Append mode for hive imports is not yet supported. Please remove the parameter --append-mode.

3. Execute the table import to HDFS.
[wzhou@vmhost1 ~]$ sqoop import \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
> --username wzhou \
> --password wzhou \
> --table STUDENT \
> --incremental append \
> --check-column student_id \
> -m 4 \
> --split-by major

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 05:14:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/25 05:14:22 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
. . . .
15/09/25 05:17:33 INFO mapreduce.ImportJobBase: Transferred 74 bytes in 184.558 seconds (0.401 bytes/sec)
15/09/25 05:17:33 INFO mapreduce.ImportJobBase: Retrieved 4 records.
15/09/25 05:17:33 INFO util.AppendUtils: Creating missing output directory - STUDENT
15/09/25 05:17:33 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/09/25 05:17:33 INFO tool.ImportTool: --incremental append
15/09/25 05:17:33 INFO tool.ImportTool: --check-column student_id
15/09/25 05:17:33 INFO tool.ImportTool: --last-value 4
15/09/25 05:17:33 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Notice the line --last-value 4 at the end of the execution. This is correct, as I imported only 4 rows. The result files are under /user/wzhou/STUDENT. Let's verify.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 5 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*
2,student2,computer
4,student4,accounting
1,student1,math
3,student3,math
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*1
2,student2,computer
4,student4,accounting
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*4
1,student1,math
3,student3,math

Result looks good so far.

4. Create the Hive external table. The new Hive external table on Hadoop is still in the test_oracle database.
hive
USE test_oracle;
CREATE EXTERNAL TABLE student_ext (
student_id string,
student_name string,
major string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/wzhou/STUDENT';

select * from student_ext;

Here is the result.
[wzhou@vmhost1 ~]$ hive
hive> USE test_oracle;
OK
Time taken: 1.547 seconds
hive> CREATE EXTERNAL TABLE student_ext (
> student_id string,
> student_name string,
> major string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION '/user/wzhou/STUDENT';

OK
Time taken: 1.266 seconds
hive> select * from student_ext;
OK
2 student2 computer
4 student4 accounting
1 student1 math
3 student3 math
Time taken: 0.679 seconds, Fetched: 4 row(s)

5. Second round of inserts.
Let me test a few more inserts to see how incremental import works. On the Oracle database, insert a few more rows.

insert into wzhou.student values ( 5, 'student3', 'computer' );
insert into wzhou.student values ( 6, 'student4', 'math' );
insert into wzhou.student values ( 7, 'student5', 'computer' );
commit;

6. Rerun the incremental import command.
[wzhou@vmhost1 ~]$ sqoop import \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
> --username wzhou \
> --password wzhou \
> --table STUDENT \
> --incremental append \
> --check-column student_id \
> -m 4 \
> --split-by major

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 05:39:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/25 05:39:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/25 05:39:51 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
. . . .
15/09/25 05:42:54 INFO mapreduce.ImportJobBase: Transferred 130 bytes in 178.3882 seconds (0.7287 bytes/sec)
15/09/25 05:42:54 INFO mapreduce.ImportJobBase: Retrieved 7 records.
15/09/25 05:42:54 INFO util.AppendUtils: Appending to directory STUDENT
15/09/25 05:42:54 INFO util.AppendUtils: Using found partition 5
15/09/25 05:42:54 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/09/25 05:42:54 INFO tool.ImportTool: --incremental append
15/09/25 05:42:54 INFO tool.ImportTool: --check-column student_id
15/09/25 05:42:54 INFO tool.ImportTool: --last-value 7
15/09/25 05:42:54 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Interesting, the result is not what I expected. I expected only three new rows to be imported, but it seems all 7 rows were imported. Let me verify the result.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 10 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:40 /user/wzhou/STUDENT/part-m-00005
-rw-r--r--   2 wzhou bigdata         82 2015-09-25 05:41 /user/wzhou/STUDENT/part-m-00006
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:41 /user/wzhou/STUDENT/part-m-00007
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:42 /user/wzhou/STUDENT/part-m-00008
-rw-r--r--   2 wzhou bigdata         48 2015-09-25 05:42 /user/wzhou/STUDENT/part-m-00009

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*6
2,student2,computer
4,student4,accounting
5,student3,computer
7,student5,computer

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*9
1,student1,math
3,student3,math
6,student4,math

Hive result. 
hive> select * from student_ext;
OK
2	student2	computer
4	student4	accounting
1	student1	math
3	student3	math
2	student2	computer
4	student4	accounting
5	student3	computer
7	student5	computer
1	student1	math
3	student3	math
6	student4	math
Time taken: 0.085 seconds, Fetched: 11 row(s)

Ok, let me look at the result from Impala.

[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 4        |
+----------+
Fetched 1 row(s) in 0.85s

Interesting. Impala shows 4 rows instead of 11. The reason is that Impala does not automatically refresh its metadata, so you need to run INVALIDATE METADATA to pick up the new files and get the latest row count. Hive does not have this issue.

[vmhost3:21000] > invalidate metadata;
Query: invalidate metadata

Fetched 0 row(s) in 4.48s
[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 11       |
+----------+
Fetched 1 row(s) in 0.94s
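
Since this case only appends new data files to a table Impala already knows about, a table-level REFRESH should also do the job and is lighter weight than invalidating all metadata. A sketch, assuming student_ext is already in Impala's catalog:

-- Lighter-weight alternative when only new files were added to an existing table
refresh student_ext;
select count(*) from student_ext;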

7. Solution.
The fix is to add one more argument, --last-value, to specify where the previous load stopped. I redid steps 1 to 5 and then ran the import again with --last-value set to 4.
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
--last-value 4 \
-m 4 \
--split-by major

Here are the results.

HDFS
[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 9 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
-rw-r--r--   2 wzhou bigdata         40 2015-09-25 06:02 /user/wzhou/STUDENT/part-m-00005
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 06:02 /user/wzhou/STUDENT/part-m-00006
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 06:03 /user/wzhou/STUDENT/part-m-00007
-rw-r--r--   2 wzhou bigdata         16 2015-09-25 06:04 /user/wzhou/STUDENT/part-m-00008
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*5
5,student3,computer
7,student5,computer
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*8
6,student4,math

Hive
hive> select * from student_ext;
OK
2	student2	computer
4	student4	accounting
1	student1	math
3	student3	math
5	student3	computer
7	student5	computer
6	student4	math
Time taken: 0.095 seconds, Fetched: 7 row(s)

Impala
[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 7        |
+----------+
Fetched 1 row(s) in 0.82s

Ok, everything looks good right now.
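
One more note: as the log output above suggested, the incremental import can be saved as a Sqoop job. A saved job records the last value in Sqoop's metastore after each run, so you do not have to track --last-value by hand. A sketch (the job name student_inc is mine):

# Create the saved job once; note the space between "--" and "import"
sqoop job --create student_inc -- import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
-P \
--table STUDENT \
--incremental append \
--check-column student_id \
--last-value 0 \
-m 4 \
--split-by major

# Each execution imports only the new rows and updates the stored last value
sqoop job --exec student_inc
# Show the stored job definition, including the current incremental.last.value
sqoop job --show student_inc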


Use Impala to query a Hive table

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database
5. Export data from Hive table to Oracle Database.
Although Hive is popular in the Hadoop world, it has its drawbacks, such as excessive MapReduce operations for certain queries and JVM overhead during MapReduce. Impala is designed to improve query performance when accessing data on HDFS. Hive is SQL on Hadoop, while Impala is SQL on HDFS. Hive uses MapReduce jobs to produce query results, while Impala uses its daemons running on the data nodes to access the files on HDFS directly and does not use MapReduce at all.
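
One quick way to see this difference is to compare the explain plans: Hive's plan is a set of MapReduce stages, while Impala's plan is a tree of scan and aggregation operators executed by its daemons. A sketch, using the hive and impala-shell command-line tools covered below and the my_all_objects table from the earlier posts:

# Hive: the plan is expressed as MapReduce stages
hive -e "explain select count(*) from test_oracle.my_all_objects;"

# Impala: the plan shows SCAN HDFS and AGGREGATE operators, no MapReduce
impala-shell -i vmhost2 -q "explain select count(*) from test_oracle.my_all_objects;"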

There are two ways to use Impala to query tables in Hive. One way is to use command line, Impala Shell. Another one is to use Hue GUI. I am going to show both methods one by one.

Use Impala Shell
Impala Shell is a nice tool, similar to SQL*Plus, for setting up databases and tables and issuing queries. Ad hoc queries run much faster than in Hive, especially queries requiring a fast response time. Here are the steps for using the Impala shell.

1. Log on as the wzhou user and start the Impala shell.
[wzhou@vmhost1 ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Error connecting: TTransportException, Could not connect to vmhost1.local:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.2.0-cdh5.4.3 (517bb0f) built on Wed Jun 24 19:17:40 PDT 2015)
[Not connected] >

Note: The prompt shows Not connected. I need to connect the Impala shell to a Data Node running the impalad daemon. My cluster uses vmhost2 and vmhost3 as Data Nodes, so I pick one of them, vmhost2, for this test.
[Not connected] > connect vmhost2;
Connected to vmhost2:21000
Server version: impalad version 2.2.0-cdh5.4.3 RELEASE (build 517bb0f71cd604a00369254ac6d88394df83e0f6)
[vmhost2:21000] >

2. Run some queries. Impala sees the same list of databases and tables as Hive does.

[vmhost2:21000] > show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| test1            |
| test_oracle      |
+------------------+
Fetched 4 row(s) in 0.01s 

[vmhost2:21000] > use test_oracle;
Query: use test_oracle

[vmhost2:21000] > show tables;
Query: show tables
+----------------------+
| name                 |
+----------------------+
| my_all_objects       |
| my_all_objects_sqoop |
+----------------------+
Fetched 2 row(s) in 0.01s

[vmhost2:21000] > select * from my_all_objects_sqoop limit 3;
Query: select * from my_all_objects_sqoop limit 3
+-------+-------------+-----------+-------------+-------------+
| owner | object_name | object_id | object_type | create_date |
+-------+-------------+-----------+-------------+-------------+
| SYS   | I_USER1     | 46        | INDEX       | 2013-03-12  |
| SYS   | I_OBJ#      | 3         | INDEX       | 2013-03-12  |
| SYS   | I_IND1      | 41        | INDEX       | 2013-03-12  |
+-------+-------------+-----------+-------------+-------------+
Fetched 3 row(s) in 0.04s

[vmhost2:21000] > select count(*) from my_all_objects_sqoop;
Query: select count(*) from my_all_objects_sqoop
+----------+
| count(*) |
+----------+
| 22519    |
+----------+
Fetched 1 row(s) in 1.00s

Use Hue Web UI
Another way to use Impala is to run queries from the Hue UI.

1. From Cloudera Manager screen, click Hue. After Hue screen shows up, click Hue Web UI.
impala_hue_1

2. On Hue home screen, click Query Editors, then choose Impala.
impala_hue_2

3. After Impala Query Editor screen shows up, select test_oracle under DATABASE. Input the following query, then click Execute.
select * from my_all_objects_sqoop limit 3;
impala_hue_3

4. Run another query and check out Explain plan.
impala_hue_4

Run Time Comparison between Hive and Impala
Hive
hive> use test_oracle;
OK
Time taken: 0.534 seconds
hive> show tables;
OK
my_all_objects
my_all_objects_sqoop
Time taken: 0.192 seconds, Fetched: 2 row(s)

hive> select count(*) from my_all_objects;
Query ID = wzhou_20150922112626_efde0c06-2f04-44b2-9bf1-47b31e45de03
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442935988315_0002, Tracking URL = http://vmhost1.local:8088/proxy/application_1442935988315_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1442935988315_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-22 11:26:33,759 Stage-1 map = 0%, reduce = 0%
2015-09-22 11:26:46,563 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.25 sec
2015-09-22 11:26:58,035 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.97 sec
MapReduce Total cumulative CPU time: 3 seconds 970 msec
Ended Job = job_1442935988315_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.97 sec HDFS Read: 2057365 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 970 msec
OK
22782
Time taken: 40.881 seconds, Fetched: 1 row(s)

Impala
Wow, 41 seconds to get a row count of 22,782 using Hive. That seems excessive on a cluster with no other jobs running. Ok, let's look at Impala's result.
Note: Impala does not poll frequently for metadata changes, so if you don't see the table name after the import, just run the following:
invalidate metadata;
show tables;

[wzhou@vmhost1 ~]$ impala-shell
[Not connected] > connect vmhost2;
Connected to vmhost2:21000
Server version: impalad version 2.2.0-cdh5.4.3 RELEASE (build 517bb0f71cd604a00369254ac6d88394df83e0f6)

[vmhost2:21000] > use test_oracle;
Query: use test_oracle

[vmhost2:21000] > show tables;
Query: show tables
+----------------------+
| name                 |
+----------------------+
| my_all_objects       |
| my_all_objects_sqoop |
+----------------------+
Fetched 2 row(s) in 0.01s

[vmhost2:21000] > select count(*) from my_all_objects;
Query: select count(*) from my_all_objects
+----------+
| count(*) |
+----------+
| 22782    |
+----------+
Fetched 1 row(s) in 0.12s

The above results show Hive took 41 seconds to get the row count of a table with 22,782 rows, while Impala was significantly faster at 0.12 seconds. I know my cluster is small and not powerful, and Hive is using MapReduce, but about 41 seconds to get a total row count of 22,000 rows seems too much. On the other hand, Impala's timing looks much more reasonable to me. Obviously MapReduce is not my option if I want to run queries that expect a fast response time. But for a long-running job against a huge dataset, the MapReduce option might still be on the table when considering job fault tolerance.

Use SQL Developer to Access Hive Table on Hadoop

There are many UI and command-line tools for accessing Hive data on Hadoop, and I am not going to list them one by one. I use SQL Developer a lot for accessing Oracle databases and like this powerful tool. That made me wonder whether I could use SQL Developer to access Hive tables on HDFS. After some research, I did find a way to configure SQL Developer to access Hive tables. Here are the steps:

1. Download the JDBC Driver.
Go to the Cloudera website and download the Hive JDBC Driver from http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/jdbc/hive-jdbc-v2-5-4.html.
As of now, the latest version of the Hive JDBC Connector is v2.5.4. Unzip the zip file to your target directory.

2. Configure SQL Developer
Open SQL Developer and click Preferences -> Database -> Third Party JDBC Driver. Click Add Entry to add all of the jar files you just unzipped. Close SQL Developer and reopen it.
1_sql_developer_hive_jdbc_driver

3. Create a Hive Connection
Create a new connection. The port used is the HiveServer2 port, i.e. the value of hive.server2.thrift.port; in my cluster it is 10000. I still use the test_oracle Hive database created in my earlier posts. Click Test, and it should show Success.
2_sql_developer_hive_connection
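
Before wiring up SQL Developer, it can help to confirm that HiveServer2 is actually listening on that port. A sketch using Beeline, which ships with CDH (the host name here is an assumption; use whichever host runs HiveServer2 in your cluster):

# Connect through HiveServer2 on port 10000, the same endpoint SQL Developer will use
beeline -u jdbc:hive2://vmhost1.local:10000/test_oracle -n wzhou
# At the beeline prompt, run: show tables;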

4. Access Hive Tables
After clicking Connect, I can connect to the Hive databases on my Hadoop cluster.
3_sql_developer_hive_table
You can see the look and feel of accessing Hive is very similar to the way you access a regular Oracle table. Awesome! Here are a few more screens.
Properties Tab
4_sql_developer_hive_table_properties
DDL Tab
The DDL tab is a nice one. The drawback is that everything is on one line; it would be nicer if the DDL string were formatted in a more readable way. Anyway, it is not a big deal.
5_sql_developer_hive_table_ddl
Detail Tab
6_sql_developer_hive_table_detail
Model Tab
7_sql_developer_hive_table_model

Let me run a query. Here is the result, with exactly the look and feel of working in Oracle.
8_sql_developer_hive_table_query

Export data from Hive table to Oracle Database

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database

This post discusses a different operation: exporting data from a Hive table to an Oracle database, just like the goldfish jumping from the fish bowl to the laptop below. Again, we use Sqoop to perform this operation.
fish_export

The following example exports the data from the my_all_objects_sqoop table in Hive to the WZHOU.TEST_IMPORT_FROM_SCOOP table in the DBM database on X3.

1. Create WZHOU.TEST_IMPORT_FROM_SCOOP table in Oracle.

CREATE TABLE WZHOU.TEST_IMPORT_FROM_SCOOP
(
OWNER VARCHAR2(30),
OBJECT_NAME VARCHAR2(30),
OBJECT_ID NUMBER(10),
OBJECT_TYPE VARCHAR2(19),
CREATED_DATE VARCHAR2(30)
);

select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

2. Run the sqoop export job.

sqoop export \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password wzhou \
--direct \
--export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
--table WZHOU.TEST_IMPORT_FROM_SCOOP \
--fields-terminated-by '\001'

Note: The --export-dir argument specifies the location of the Hive source directory, not an individual filename. --fields-terminated-by '\001' must be included in the command because the Hive table uses the default Ctrl-A ('\001') field delimiter. Otherwise, you will see the following error:
Caused by: java.lang.RuntimeException: Can't parse input data: 'SYSTEMLOGMNR_REFCON$7604TABLE2013-03-12'
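
If you are not sure which delimiter a Hive table actually uses, check its DDL before choosing the --fields-terminated-by value. A quick sketch:

# The generated DDL shows the field delimiter; Hive's default is '\001' (Ctrl-A)
hive -e "show create table test_oracle.my_all_objects_sqoop;"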

Result:

[wzhou@vmhost1 ~]$ sqoop export \
>     --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
>     --username wzhou \
>     --password wzhou \
>     --direct \
>     --export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
>     --table WZHOU.TEST_IMPORT_FROM_SCOOP  \
>     --fields-terminated-by '\001' 
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/22 07:02:46 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/22 07:02:46 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/22 07:02:46 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/22 07:02:47 INFO oracle.OraOopOracleQueries: Current schema is: WZHOU
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: 
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: This Oracle database is a RAC.
15/09/22 07:02:49 INFO oracle.OraOopManagerFactory: 	Database time on dbm1 is 2015-09-22 07:02:49.060215
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: 	Database time on dbm2 is 2015-09-22 07:02:50.03952
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will load-balance sessions across the following instances ofthe Oracle RAC:
	Instance: dbm1 	 URL: jdbc:oracle:thin:@enkx3db01.enkitec.com:1521:dbm1
	Instance: dbm2 	 URL: jdbc:oracle:thin:@enkx3db02.enkitec.com:1521:dbm2

15/09/22 07:02:50 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
15/09/22 07:02:50 INFO tool.CodeGenTool: Beginning code generation
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT "OWNER","OBJECT_NAME","OBJECT_ID","OBJECT_TYPE","CREATED_DATE" FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 0=1
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU_TEST_IMPORT_FROM_SCOOP.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/22 07:02:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU.TEST_IMPORT_FROM_SCOOP.jar
15/09/22 07:02:52 INFO mapreduce.ExportJobBase: Beginning export of WZHOU.TEST_IMPORT_FROM_SCOOP
15/09/22 07:02:52 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/22 07:02:53 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/22 07:02:53 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/22 07:02:55 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: number of splits:4
15/09/22 07:02:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442921609180_0004
15/09/22 07:02:56 INFO impl.YarnClientImpl: Submitted application application_1442921609180_0004
15/09/22 07:02:56 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442921609180_0004/
15/09/22 07:02:56 INFO mapreduce.Job: Running job: job_1442921609180_0004
15/09/22 07:03:05 INFO mapreduce.Job: Job job_1442921609180_0004 running in uber mode : false
15/09/22 07:03:05 INFO mapreduce.Job:  map 0% reduce 0%
15/09/22 07:03:14 INFO mapreduce.Job:  map 25% reduce 0%
15/09/22 07:03:38 INFO mapreduce.Job:  map 45% reduce 0%
15/09/22 07:03:39 INFO mapreduce.Job:  map 50% reduce 0%
15/09/22 07:04:24 INFO mapreduce.Job:  map 73% reduce 0%
15/09/22 07:04:25 INFO mapreduce.Job:  map 75% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job:  map 100% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job: Job job_1442921609180_0004 completed successfully
15/09/22 07:04:54 INFO mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=550000
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1176438
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=25
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=4
		Data-local map tasks=1
		Rack-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=102232
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=102232
		Total vcore-seconds taken by all map tasks=102232
		Total megabyte-seconds taken by all map tasks=104685568
	Map-Reduce Framework
		Map input records=22519
		Map output records=22519
		Input split bytes=1033
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=421
		CPU time spent (ms)=6480
		Physical memory (bytes) snapshot=481697792
		Virtual memory (bytes) snapshot=6458949632
		Total committed heap usage (bytes)=127401984
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
15/09/22 07:04:54 INFO mapreduce.ExportJobBase: Transferred 1.1219 MB in 121.1924 seconds (9.4797 KB/sec)
15/09/22 07:04:55 INFO mapreduce.ExportJobBase: Exported 22519 records.

3. Verify the result on Exadata. The row count and everything looks good.

SYS:dbm1> select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

   COUNT(*)
-----------
      22519

SYS:dbm1> select * from WZHOU.TEST_IMPORT_FROM_SCOOP where OWNER='WZHOU' and rownum < 4;

OWNER           OBJECT_NAME      OBJECT_ID   OBJECT_TYPE   CREATED_DATE
--------------- ---------------- ----------- ------------- -----------------
WZHOU           T1               29436       TABLE         2013-07-30
WZHOU           MOVETAB_BASIC    28832       TABLE         2013-07-23
WZHOU           MOVETAB_OLTP     28834       TABLE         2013-07-23

In the next post, I am going to discuss how to use Impala to access the same Hive data and compare performance between Hive and Impala.

Import Data to Hive from Oracle Database

In the last three posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
import_export
This post discusses a basic scenario in Hive: dump some data from an Oracle database, load it to HDFS, and query the data using Hive. Sqoop is a tool designed for transferring bulk data between Hadoop and an RDBMS such as Oracle, and it can be used to populate data into Hive or HBase on Hadoop. The import can be an entire table, the result of a query, or an incremental load. Ok, here is the example showing how to use Sqoop to load data from Oracle.
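
Before walking through the full example, here is a minimal sketch of the simplest form of a Sqoop import, pulling an entire table into an HDFS directory (the table and directory names are placeholders, and the Oracle JDBC driver installed in step 3 below is assumed):

# Simplest form: import a whole Oracle table into an HDFS directory
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
-P \
--table SOME_TABLE \
--target-dir /tmp/sqoopimport/some_table \
-m 1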

1. Add the user to the hive group on all hosts in the cluster. Otherwise, you could see the following error:
chgrp: changing ownership of 'hdfs://vmhost1.local:8020/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00004_copy_1': User does not belong to hive

[root@vmhost1 mnt]# usermod -a -G hive wzhou
[root@vmhost1 mnt]# id wzhou
uid=502(wzhou) gid=502(wzhou) groups=502(wzhou),481(hive),501(bigdata)

2. Create the target directory, /tmp/sqoopimport, for the sqoop job.
[wzhou@vmhost1 data]$ hdfs dfs -mkdir /tmp/sqoopimport

3. Install the Oracle JDBC driver. Check whether the Oracle JDBC driver exists; if not, do the following to install it. The driver file can be downloaded from the Oracle website at
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
The file name is ojdbc6.jar. Copy this file to /usr/lib/sqoop/lib/ if using packages, or /var/lib/sqoop/ if using parcels. Make sure to change the file permissions to 755 afterwards.

[root@vmhost1 mnt]# cp ojdbc6.jar /usr/lib/sqoop/lib/
[root@vmhost1 mnt]# ls -l /usr/lib/sqoop/lib/oj*.jar
-rw-r----- 1 root root 2739670 Sep 21 15:24 /usr/lib/sqoop/lib/ojdbc6.jar
[root@vmhost1 mnt]# chmod 755 /usr/lib/sqoop/lib/ojdbc6.jar
[root@vmhost1 mnt]# ls -l /usr/lib/sqoop/lib/oj*.jar
-rwxr-xr-x 1 root root 2739670 Sep 21 15:24 /usr/lib/sqoop/lib/ojdbc6.jar

4. Verify whether the JDBC connection works by running sqoop with the list-databases argument. It will connect to the Oracle database and list the schemas in the dbm database. If you don't want to show the password on the command line, use the -P parameter to enter the DB password interactively.

sqoop list-databases \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
-P

Here is the execution result.
[wzhou@vmhost1 data]$ sqoop list-databases \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
> --username wzhou \
> -P

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/21 18:55:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Enter password:
15/09/21 18:56:02 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/21 18:56:02 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/21 18:56:03 INFO manager.OracleManager: Time zone has been set to GMT
ORACLE
JBARBA
TANEL
WZHOU
MBH
MPAGANO
GSHEPPARD
ENKITEC
ACOLVIN
KARLARAO
KSO
MARTIN
KOSBORNE
ODI
RJOHNSON
ORACLE_OCM
. . . .
SYSMAN
DBSNMP
WMSYS
XDB
MGMT_VIEW
SYS
SYSTEM

5. Create the Hive table. The new Hive table is called my_all_objects_sqoop and lives in the test_oracle database.

First, check that no my_all_objects_sqoop table exists on HDFS:
hdfs dfs -ls /user/hive/warehouse/test_oracle.db

Run the following commands to create the Hive table in the test_oracle database.
hive
USE test_oracle;

CREATE TABLE my_all_objects_sqoop (
owner string,
object_name string,
object_id int,
object_type string,
create_date string
)
STORED AS TEXTFILE;
SHOW TABLES;

Check out the folders on HDFS.
hdfs dfs -ls /user/hive/warehouse/test_oracle.db
hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop

Execution result.
[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db
Found 1 items
drwxrwxrwt - wzhou hive 0 2015-09-21 11:55 /user/hive/warehouse/test_oracle.db/my_all_objects

hive> USE test_oracle;
OK
Time taken: 0.221 seconds

hive> CREATE TABLE my_all_objects_sqoop (
> owner string,
> object_name string,
> object_id int,
> object_type string,
> create_date string
> )
> STORED AS TEXTFILE;

OK
Time taken: 0.247 seconds

hive> SHOW TABLES;
OK
my_all_objects
my_all_objects_sqoop
Time taken: 0.154 seconds, Fetched: 2 row(s)

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db
Found 2 items
drwxrwxrwt - wzhou hive 0 2015-09-21 11:55 /user/hive/warehouse/test_oracle.db/my_all_objects
drwxrwxrwt - wzhou hive 0 2015-09-21 17:15 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop

[wzhou@vmhost1 data]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
[wzhou@vmhost1 data]$

6. Run sqoop to import the data directly into the Hive table. Please note, Sqoop requires that $CONDITIONS appear in the WHERE clause of a free-form query, even though I am going to retrieve all rows from all_objects. The -m or --num-mappers argument enables parallel import; it specifies the number of map tasks (parallel processes) used to perform the import. When running in parallel, Sqoop needs a way to split the workload, and this is what the --split-by argument is for: it specifies the column to use as the split column. By default, the primary key of the table is used as the split column, but you can specify a different column with --split-by.
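
To make this concrete, Sqoop first runs a bounding query on the split column (you can see the BoundingValsQuery in the execution log further below) and then replaces $CONDITIONS in each mapper's copy of the query with a range predicate on that column. A hypothetical example of what one of the four mappers might end up running (the bounds are made up):

-- Hypothetical per-mapper query: Sqoop substitutes $CONDITIONS with a range
-- predicate on the --split-by column (object_type); these bounds are illustrative only.
select owner, object_name, object_id, object_type,
       to_char(created, 'yyyy-mm-dd') created_date
from all_objects
where ( object_type >= 'CLUSTER' AND object_type < 'INDEX' );

The actual import command I use is below.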

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password wzhou \
--query "select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where \$CONDITIONS" \
-m 4 \
--split-by object_type \
--hive-import \
--target-dir '/tmp/sqoopimport' \
--hive-table test_oracle.my_all_objects_sqoop

select * from my_all_objects_sqoop limit 3;
select count(*) from my_all_objects_sqoop;

hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects
hdfs dfs -cat /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000 | head -n 10

Execution result.

[wzhou@vmhost1 data]$ sqoop import \
>     --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
>     --username wzhou \
>     --password wzhou \
>     --query "select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where \$CONDITIONS" \
>     -m 4 \
>     --split-by object_type \
>     --hive-import \
>     --target-dir '/tmp/sqoopimport' \
>     --hive-table test_oracle.my_all_objects_sqoop
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/21 17:53:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/21 17:53:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/21 17:53:15 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/09/21 17:53:15 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/09/21 17:53:15 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/21 17:53:15 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/21 17:53:15 INFO tool.CodeGenTool: Beginning code generation
15/09/21 17:53:18 INFO manager.OracleManager: Time zone has been set to GMT
15/09/21 17:53:18 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:53:19 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:53:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/d96bcc9fb5354b998f0db547e0ed17bc/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/21 17:53:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/d96bcc9fb5354b998f0db547e0ed17bc/QueryResult.jar
15/09/21 17:53:21 INFO mapreduce.ImportJobBase: Beginning query import.
15/09/21 17:53:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/21 17:53:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/21 17:53:22 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/21 17:53:26 INFO db.DBInputFormat: Using read commited transaction isolation
15/09/21 17:53:26 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(object_type), MAX(object_type) FROM (select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 1) ) t1
15/09/21 17:53:26 WARN db.TextSplitter: Generating splits for a textual index column.
15/09/21 17:53:26 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
15/09/21 17:53:26 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
15/09/21 17:53:26 INFO mapreduce.JobSubmitter: number of splits:5
15/09/21 17:53:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442870712675_0012
15/09/21 17:53:27 INFO impl.YarnClientImpl: Submitted application application_1442870712675_0012
15/09/21 17:53:27 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442870712675_0012/
15/09/21 17:53:27 INFO mapreduce.Job: Running job: job_1442870712675_0012
15/09/21 17:53:40 INFO mapreduce.Job: Job job_1442870712675_0012 running in uber mode : false
15/09/21 17:53:40 INFO mapreduce.Job:  map 0% reduce 0%
15/09/21 17:53:53 INFO mapreduce.Job:  map 20% reduce 0%
15/09/21 17:54:25 INFO mapreduce.Job:  map 40% reduce 0%
15/09/21 17:55:19 INFO mapreduce.Job:  map 60% reduce 0%
15/09/21 17:55:49 INFO mapreduce.Job:  map 80% reduce 0%
15/09/21 17:56:29 INFO mapreduce.Job:  map 100% reduce 0%
15/09/21 17:56:29 INFO mapreduce.Job: Job job_1442870712675_0012 completed successfully
15/09/21 17:56:29 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=680385
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=728
		HDFS: Number of bytes written=1089568
		HDFS: Number of read operations=20
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=10
	Job Counters 
		Launched map tasks=5
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=160538
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=160538
		Total vcore-seconds taken by all map tasks=160538
		Total megabyte-seconds taken by all map tasks=164390912
	Map-Reduce Framework
		Map input records=22519
		Map output records=22519
		Input split bytes=728
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=641
		CPU time spent (ms)=10540
		Physical memory (bytes) snapshot=614903808
		Virtual memory (bytes) snapshot=8024203264
		Total committed heap usage (bytes)=159252480
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=1089568
15/09/21 17:56:29 INFO mapreduce.ImportJobBase: Transferred 1.0391 MB in 187.358 seconds (5.6791 KB/sec)
15/09/21 17:56:29 INFO mapreduce.ImportJobBase: Retrieved 22519 records.
15/09/21 17:56:29 INFO manager.OracleManager: Time zone has been set to GMT
15/09/21 17:56:29 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:56:30 INFO manager.SqlManager: Executing SQL statement: select owner, object_name, object_id, object_type, to_char(created, 'yyyy-mm-dd') created_date from all_objects where  (1 = 0) 
15/09/21 17:56:30 WARN hive.TableDefWriter: Column OBJECT_ID had to be cast to a less precise type in Hive
15/09/21 17:56:30 INFO hive.HiveImport: Loading uploaded data into Hive

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.3.jar!/hive-log4j.properties
OK
Time taken: 1.98 seconds
Loading data to table test_oracle.my_all_objects_sqoop
Table test_oracle.my_all_objects_sqoop stats: [numFiles=5, totalSize=1089568]
OK
Time taken: 0.886 seconds

Verify the result from Hive. You might need to exit Hive and re-enter it to see the results from the newly populated table; otherwise, you could see 0 rows.

hive> select * from my_all_objects_sqoop limit 3;
OK
SYS	C_COBJ#	29	CLUSTER	2013-03-12
SYS	C_TS#	6	CLUSTER	2013-03-12
SYS	C_FILE#_BLOCK#	8	CLUSTER	2013-03-12
Time taken: 0.684 seconds, Fetched: 3 row(s)

hive> select count(*) from my_all_objects_sqoop;
Query ID = wzhou_20150921175858_c67bac9f-9cf0-437a-a90c-d3a891455f33
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1442870712675_0013, Tracking URL = http://vmhost1.local:8088/proxy/application_1442870712675_0013/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442870712675_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-21 17:58:57,446 Stage-1 map = 0%,  reduce = 0%
2015-09-21 17:59:09,067 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.68 sec
2015-09-21 17:59:19,506 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.41 sec
MapReduce Total cumulative CPU time: 4 seconds 410 msec
Ended Job = job_1442870712675_0013
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.41 sec   HDFS Read: 1096958 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 410 msec
OK
22519
Time taken: 37.941 seconds, Fetched: 1 row(s)

Check the Hive files on HDFS.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop
Found 5 items
-rwxrwxrwt   2 wzhou hive       9927 2015-09-21 17:53 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000
-rwxrwxrwt   2 wzhou hive     248584 2015-09-21 17:54 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00001
-rwxrwxrwt   2 wzhou hive      93915 2015-09-21 17:55 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00002
-rwxrwxrwt   2 wzhou hive     735608 2015-09-21 17:55 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00003
-rwxrwxrwt   2 wzhou hive       1534 2015-09-21 17:56 /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00004
[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/hive/warehouse/test_oracle.db/my_all_objects
Found 1 items
-rwxrwxrwt   2 wzhou bigdata    2050380 2015-09-21 11:28 /user/hive/warehouse/test_oracle.db/my_all_objects/all_objects_data.txt

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/hive/warehouse/test_oracle.db/my_all_objects_sqoop/part-m-00000 | head -n 10
SYSC_COBJ#29CLUSTER2013-03-12
SYSC_TS#6CLUSTER2013-03-12
SYSC_FILE#_BLOCK#8CLUSTER2013-03-12
SYSC_USER#10CLUSTER2013-03-12
SYSC_OBJ#2CLUSTER2013-03-12
SYSORA$BASE100EDITION2013-03-12
SYSSMON_SCN_TO_TIME_AUX268CLUSTER2013-03-12
SYSC_OBJ#_INTCOL#421CLUSTER2013-03-12
SYSC_TOID_VERSION#495CLUSTER2013-03-12
SYSC_MLOG#625CLUSTER2013-03-12

Ok, I am done with the import to Hive here. In the next post, I am going to discuss exporting from Hive directly to an Oracle database.