Missing Classpath Issue during Sqoop Import

Recently I ran into an issue when using Sqoop import, with the following error messages:

17/02/22 10:44:17 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1487780389745_0001/
17/02/22 10:44:17 INFO mapreduce.Job: Running job: job_1487780389745_0001
17/02/22 10:44:27 INFO mapreduce.Job: Job job_1487780389745_0001 running in uber mode : false
17/02/22 10:44:27 INFO mapreduce.Job:  map 0% reduce 0%
17/02/22 10:44:38 INFO mapreduce.Job: Task Id : attempt_1487780389745_0001_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.StringUtils
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at org.apache.sqoop.manager.oracle.OracleUtils.escapeIdentifier(OracleUtils.java:36)
	at org.apache.sqoop.manager.oracle.OraOopOracleQueries.getTableColumns(OraOopOracleQueries.java:683)
	at org.apache.sqoop.manager.oracle.OraOopOracleQueries.getTableColumns(OraOopOracleQueries.java:767)
	at org.apache.sqoop.manager.oracle.OraOopDBRecordReader.getSelectQuery(OraOopDBRecordReader.java:195)
	at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
	at org.apache.sqoop.manager.oracle.OraOopDBRecordReader.nextKeyValue(OraOopDBRecordReader.java:356)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I have used Sqoop import many times and have never run into this issue, which made me wonder what was going on.
I checked the CDH version. It is 5.8.0, while I had mostly been working with CDH 5.7 before.

[root@quickstart lib]# hadoop version
Hadoop 2.6.0-cdh5.8.0
Subversion http://github.com/cloudera/hadoop -r 57e7b8556919574d517e874abfb7ebe31a366c2b
Compiled by jenkins on 2016-06-16T19:38Z
Compiled with protoc 2.5.0
From source with checksum 9e99ecd28376acfd5f78c325dd939fed
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.8.0.jar

Checking the Sqoop version shows it is 1.4.6:

[root@quickstart lib]# sqoop version
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/02/22 13:37:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
Sqoop 1.4.6-cdh5.8.0
git commit id 
Compiled by jenkins on Thu Jun 16 12:25:21 PDT 2016

The Cloudera Quickstart VM uses CDH 5.8.0, and this is the first time I am touching CDH 5.8, so the problem might be related to this release. I did some research and found someone who had hit exactly the same issue. It is a Sqoop bug: SQOOP-2999, Sqoop ClassNotFoundException (org.apache.commons.lang3.StringUtils) is thrown when executing Oracle direct import map task.

It looks like the newer Sqoop in CDH 5.8.0 uses a new class, OracleUtils, which depends on org.apache.commons.lang3.StringUtils. The jar containing that class is not on the classpath that Sqoop passes to the mappers, so the exception is thrown at runtime. Unfortunately the fix only arrives in CDH 5.8.2, as indicated in Cloudera's release notes, and the first Sqoop bug listed there is this one: SQOOP-2999.

At the moment I don't want to go through an upgrade or patching of CDH. I tried to find a workaround on the internet but could not find anything related to this one, so I will fix the issue my own way.

First I need to find out which jar file contains this class. There is one post discussing this class and the related jar file: commons-lang3-3.1.jar. Luckily I do have a commons-lang3-3.1.jar file on the system.

[root@quickstart lib]# locate commons-lang3-3.1.jar
/usr/lib/mahout/lib/commons-lang3-3.1.jar
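To double-check that a candidate jar really contains the missing class before wiring it into any classpath, a quick check like the one below can be used. This is only an illustrative sketch; the search path is an assumption and may need adjusting on other systems.

# List commons-lang3 jars under /usr/lib and print the ones that actually
# contain org/apache/commons/lang3/StringUtils.class.
for j in $(find /usr/lib -name 'commons-lang3*.jar' 2>/dev/null); do
  unzip -l "$j" | grep -q 'org/apache/commons/lang3/StringUtils.class' && echo "$j"
done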

Next I need to figure out how to add this jar to Sqoop's classpath. The majority of Hadoop environment variable settings can be found in the /etc/default directory, and I did find a file called sqoop2-server.

[root@quickstart default]# cat sqoop2-server
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

CATALINA_BASE=/var/lib/sqoop2/tomcat-deployment
SQOOP_USER=sqoop2
SQOOP_CONFIG=/etc/sqoop2/conf
SQOOP_LOG=/var/log/sqoop2
SQOOP_TEMP=/var/run/sqoop2
SQOOP_PID=/var/run/sqoop2/sqoop-server-sqoop2.pid
CATALINA_BIN=/usr/lib/bigtop-tomcat/bin
CATALINA_TMPDIR=/var/tmp/sqoop2
CATALINA_OPTS=-Xmx1024m
CATALINA_OUT=/var/log/sqoop2/sqoop-tomcat.log
#AUX_CLASSPATH=

Uncomment the AUX_CLASSPATH= line and change it to the following:
AUX_CLASSPATH=/usr/lib/mahout/lib/commons-lang3-3.1.jar
After restarting the Sqoop service, no luck. Still not working.

I could change the /etc/sqoop2/conf/setenv.sh file and add this jar to CLASSPATH, but there is no CLASSPATH used in that script, and I don't want to change something that I will have completely forgotten about later on. So I tried the option of adding the classpath in Cloudera Manager. The tricky question is which parameter to add it to.

As the Sqoop map tasks are managed by YARN, I added the jar to the YARN Application Classpath (yarn.application.classpath).
The result is shown below.
(Screenshot: sqoop_classpath_missing_issue)
After the change, redeploy the client configuration and bounce the CDH cluster. Rerunning the Sqoop job, it completed without error. Please note: this is just a temporary workaround. The right approach is still to upgrade to a newer version of CDH.
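A per-job alternative I did not end up using would be to ship the jar with the job itself instead of editing the cluster-wide YARN classpath. The sketch below is untested here, and the connection string and table name are placeholders; it relies on the generic Hadoop -libjars option, which Sqoop accepts right after the tool name, to put the jar on the mapper classpath, while HADOOP_CLASSPATH covers the client JVM.

# Make the jar visible to the local Sqoop/Hadoop client JVM.
export HADOOP_CLASSPATH=/usr/lib/mahout/lib/commons-lang3-3.1.jar
# Ship the same jar to the map tasks via the distributed cache.
sqoop import \
  -libjars /usr/lib/mahout/lib/commons-lang3-3.1.jar \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username someuser -P \
  --table SOMETABLE \
  --direct \
  -m 4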


Use incremental import in sqoop to load data from Oracle (Part II)

The last post discussed the first part of creating an incremental import job in Sqoop. This post continues with the second part of the series.

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database
5. Export data from Hive table to Oracle Database.
6. Use Impala to query a Hive table
7. Use incremental import in sqoop to load data from Oracle (Part I)

Although we can pass many arguments every time we run the import, just as I did in the last post, a better way to manage this kind of recurring work is a saved sqoop job, which does similar work, as illustrated in the picture below.
(Image: sqoop_incremental_2)

First, let me create a sqoop job using the following command:
sqoop job \
--create student_job \
-- import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
--last-value 7 \
-m 4 \
--split-by major

Once the sqoop job is created, I can use the following commands to show, execute or delete a sqoop job.
sqoop job --list
sqoop job --show student_job
sqoop job --exec student_job
sqoop job --delete student_job
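A saved job cannot easily be edited in place, so when the definition changes (for example, a new --last-value) it is simplest to drop and recreate it. Below is a small hypothetical wrapper around the commands above; it reuses the connection details from this post and switches to -P so the password is prompted for instead of being placed on the command line.

#!/bin/bash
# Recreate the saved sqoop job idempotently: delete it if it already exists,
# then create it again with the current definition.
JOB=student_job
sqoop job --list 2>/dev/null | grep -qw "$JOB" && sqoop job --delete "$JOB"
sqoop job --create "$JOB" \
  -- import \
  --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
  --username wzhou -P \
  --table STUDENT \
  --incremental append \
  --check-column student_id \
  --last-value 7 \
  -m 4 \
  --split-by major

By default the Sqoop metastore does not record passwords, which is why the executions below prompt with "Enter password:". The sqoop-site.xml property sqoop.metastore.client.record.password can reportedly be set to true to change that, though I have not tried it here.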

Here are the execution results:
[wzhou@vmhost1 ~]$ sqoop job \
> --create student_job \
> -- import \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
> --username wzhou \
> --password wzhou \
> --table STUDENT \
> --incremental append \
> --check-column student_id \
> --last-value 7 \
> -m 4 \
> --split-by major

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 13:57:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/25 13:57:35 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

[wzhou@vmhost1 ~]$ sqoop job --list
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 13:57:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Available jobs:
student_job

[wzhou@vmhost1 ~]$ sqoop job --show student_job
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 13:58:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Enter password:
Job: student_job
Tool: import
Options:
----------------------------
verbose = false
incremental.last.value = 7
db.connect.string = jdbc:oracle:thin:@enkx3-scan:1521:dbm2
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = true
hdfs.append.dir = true
db.table = STUDENT
codegen.input.delimiters.escape = 0
import.fetch.size = null
accumulo.create.table = false
codegen.input.delimiters.enclose.required = false
db.username = wzhou
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
incremental.col = student_id
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
relaxed.isolation = false
mapreduce.num.mappers = 4
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = AppendRows
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-wzhou/compile/5c52e95144738b92047bd2e1e37d1b1f
direct.import = false
db.split.column = major
hive.fail.table.exists = false
db.batch = false

[wzhou@vmhost1 ~]$ sqoop job --exec student_job
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 14:05:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Enter password:
15/09/25 14:05:22 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/25 14:05:22 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/25 14:05:22 INFO tool.CodeGenTool: Beginning code generation
15/09/25 14:05:23 INFO manager.OracleManager: Time zone has been set to GMT
15/09/25 14:05:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM STUDENT t WHERE 1=0
15/09/25 14:05:23 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/67a12da72d209152325de07df2841b68/STUDENT.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/25 14:05:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/67a12da72d209152325de07df2841b68/STUDENT.jar
15/09/25 14:05:26 INFO manager.OracleManager: Time zone has been set to GMT
15/09/25 14:05:26 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(student_id) FROM STUDENT
15/09/25 14:05:26 INFO tool.ImportTool: Incremental import based on column student_id
15/09/25 14:05:26 INFO tool.ImportTool: No new rows detected since last import.

As expected, no new rows were imported. So let me insert a few more rows and rerun the job.
Run the following SQL statements on the Oracle database to insert the rows:
insert into wzhou.student values ( 8, 'student8', 'history');
insert into wzhou.student values ( 9, 'student9', 'math');
insert into wzhou.student values ( 10, 'student10', 'computer' );
commit;

Rerun the job; this time it picks up the 3 new rows.
[wzhou@vmhost1 ~]$ sqoop job --exec student_job
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 14:08:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
Enter password:
15/09/25 14:08:53 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/09/25 14:08:53 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/25 14:08:53 INFO tool.CodeGenTool: Beginning code generation
15/09/25 14:08:53 INFO manager.OracleManager: Time zone has been set to GMT
15/09/25 14:08:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM STUDENT t WHERE 1=0
15/09/25 14:08:53 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/b4ec871b912ecfebe6b03a510f000581/STUDENT.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/25 14:08:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/b4ec871b912ecfebe6b03a510f000581/STUDENT.jar
15/09/25 14:08:56 INFO manager.OracleManager: Time zone has been set to GMT
15/09/25 14:08:56 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(student_id) FROM STUDENT
15/09/25 14:08:56 INFO tool.ImportTool: Incremental import based on column student_id
15/09/25 14:08:56 INFO tool.ImportTool: Lower bound value: 7
15/09/25 14:08:56 INFO tool.ImportTool: Upper bound value: 10
15/09/25 14:08:56 INFO mapreduce.ImportJobBase: Beginning import of STUDENT
15/09/25 14:08:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/25 14:08:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/25 14:08:56 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/25 14:08:59 INFO db.DBInputFormat: Using read commited transaction isolation
15/09/25 14:08:59 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(major), MAX(major) FROM STUDENT WHERE ( student_id > 7 AND student_id <= 10 )
. . . .

Verify the result in Hive:

hive> use test_oracle;
OK
Time taken: 1.548 seconds

hive> select * from student_ext;
OK
2 student2 computer
4 student4 accounting
1 student1 math
3 student3 math
5 student3 computer
7 student5 computer
6 student4 math
10 student10 computer
8 student8 history
9 student9 math

Time taken: 0.694 seconds, Fetched: 10 row(s)

We can see the new rows with student8, 9 and 10.

Use incremental import in sqoop to load data from Oracle (Part I)

This is a two parts series in discussing incremental import job in sqoop. This post is the first part of the series.

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database
5. Export data from Hive table to Oracle Database.
6. Use Impala to query a Hive table
When using Sqoop to load data into a Hive table from an Oracle table, we don't always load the full table in one shot, just as it takes many days' work to build the house below. In other words, it is common to load partial data from an Oracle table into an existing Hive table. This is where we need a Sqoop incremental job to do the work.

(Image: sqoop_incremental_1)

First, I create a simple table to illustrate the process for incremental import.
1. Create the source table.
Run the following statements on the Oracle database to create and populate a new table.

create table wzhou.student
(
student_id number(8) not null,
student_name varchar2(20) not null,
major varchar2(20),
CONSTRAINT student_pk PRIMARY KEY (student_id)
);
insert into wzhou.student values ( 1, 'student1', 'math' );
insert into wzhou.student values ( 2, 'student2', 'computer' );
insert into wzhou.student values ( 3, 'student3', 'math' );
insert into wzhou.student values ( 4, 'student4', 'accounting' );

commit;
select * from wzhou.student;

2. Create the import command.

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major

The --check-column argument specifies which column is checked during the import operation. The column cannot be a *CHAR type, such as VARCHAR2 or CHAR.

The --incremental argument has two modes: append and lastmodified. The lastmodified mode is usually used with a last-modified column defined as a timestamp.
The --last-value argument specifies a value; only new rows whose check column is greater than this value will be imported.
Alternatively, as one internet article suggests, the value can be derived on the fly (a fuller sketch follows below):
--last-value $($HIVE_HOME/bin/hive -e "select max(idcolumn) from tablename")
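Here is one way that suggestion could be wrapped into a complete command. This is only a sketch; the Hive table and column names (test_oracle.student_ext, student_id) are the ones used later in this post, and the connection details match the examples here.

# Derive the last value from what is already in Hive, then pass it to sqoop.
LAST_ID=$(hive -S -e "select max(student_id) from test_oracle.student_ext")
sqoop import \
  --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
  --username wzhou -P \
  --table STUDENT \
  --incremental append \
  --check-column student_id \
  --last-value "$LAST_ID" \
  -m 4 \
  --split-by major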

Note 1:
If you see the following error, make sure the table name specified in the --table argument is in UPPERCASE. If it is lowercase, you can get the error below.
ERROR tool.ImportTool: Imported Failed: There is no column found in the target table all_objects_inc_test. Please ensure that your table name is correct.

Note 2:
If you use the --hive-import argument, you can see the following error. Append mode is not supported for Hive imports yet, so remove the argument and build a Hive external table after the data has been imported into HDFS.
ERROR Append mode for hive imports is not yet supported. Please remove the parameter --append-mode.

3. Execute the Table Import to Hive.
[wzhou@vmhost1 ~]$ sqoop import \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
> --username wzhou \
> --password wzhou \
> --table STUDENT \
> --incremental append \
> --check-column student_id \
> -m 4 \
> --split-by major

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 05:14:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/25 05:14:22 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
. . . .
15/09/25 05:17:33 INFO mapreduce.ImportJobBase: Transferred 74 bytes in 184.558 seconds (0.401 bytes/sec)
15/09/25 05:17:33 INFO mapreduce.ImportJobBase: Retrieved 4 records.
15/09/25 05:17:33 INFO util.AppendUtils: Creating missing output directory - STUDENT
15/09/25 05:17:33 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/09/25 05:17:33 INFO tool.ImportTool: --incremental append
15/09/25 05:17:33 INFO tool.ImportTool: --check-column student_id
15/09/25 05:17:33 INFO tool.ImportTool: --last-value 4
15/09/25 05:17:33 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Notice the line --last-value 4 at the end of the execution. This is correct, as I imported only 4 rows. The result files are under /user/wzhou/STUDENT. Let's verify them.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 5 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*
2,student2,computer
4,student4,accounting
1,student1,math
3,student3,math
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*1
2,student2,computer
4,student4,accounting
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*4
1,student1,math
3,student3,math

Result looks good so far.

4. Create the Hive external table. The new Hive external table on Hadoop is still under the test_oracle database.
hive
USE test_oracle;
CREATE EXTERNAL TABLE student_ext (
student_id string,
student_name string,
major string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/wzhou/STUDENT';

select * from student_ext;

Here are the results.
[wzhou@vmhost1 ~]$ hive
hive> USE test_oracle;
OK
Time taken: 1.547 seconds
hive> CREATE EXTERNAL TABLE student_ext (
> student_id string,
> student_name string,
> major string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION '/user/wzhou/STUDENT';

OK
Time taken: 1.266 seconds
hive> select * from student_ext;
OK
2 student2 computer
4 student4 accounting
1 student1 math
3 student3 math
Time taken: 0.679 seconds, Fetched: 4 row(s)

5. 2nd Round of Inserts.
Let me test a few more inserts to see how incremental import works. On the Oracle database, create a few more rows.

insert into wzhou.student values ( 5, 'student3', 'computer');
insert into wzhou.student values ( 6, 'student4', 'math');
insert into wzhou.student values ( 7, 'student5', 'computer' );
commit;

6. Rerun the incremental import command.
[wzhou@vmhost1 ~]$ sqoop import \
> --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
> --username wzhou \
> --password wzhou \
> --table STUDENT \
> --incremental append \
> --check-column student_id \
> -m 4 \
> --split-by major

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/25 05:39:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/25 05:39:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/25 05:39:51 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
. . . .
15/09/25 05:42:54 INFO mapreduce.ImportJobBase: Transferred 130 bytes in 178.3882 seconds (0.7287 bytes/sec)
15/09/25 05:42:54 INFO mapreduce.ImportJobBase: Retrieved 7 records.
15/09/25 05:42:54 INFO util.AppendUtils: Appending to directory STUDENT
15/09/25 05:42:54 INFO util.AppendUtils: Using found partition 5
15/09/25 05:42:54 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/09/25 05:42:54 INFO tool.ImportTool: --incremental append
15/09/25 05:42:54 INFO tool.ImportTool: --check-column student_id
15/09/25 05:42:54 INFO tool.ImportTool: --last-value 7
15/09/25 05:42:54 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Interesting, the result is not what I expected. I expected only three rows to be imported, but it seems all 7 rows were imported. Let me verify the result.

[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 10 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:40 /user/wzhou/STUDENT/part-m-00005
-rw-r--r--   2 wzhou bigdata         82 2015-09-25 05:41 /user/wzhou/STUDENT/part-m-00006
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:41 /user/wzhou/STUDENT/part-m-00007
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:42 /user/wzhou/STUDENT/part-m-00008
-rw-r--r--   2 wzhou bigdata         48 2015-09-25 05:42 /user/wzhou/STUDENT/part-m-00009

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*6
2,student2,computer
4,student4,accounting
5,student3,computer
7,student5,computer

[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*9
1,student1,math
3,student3,math
6,student4,math

Hive result. 
hive> select * from student_ext;
OK
2	student2	computer
4	student4	accounting
1	student1	math
3	student3	math
2	student2	computer
4	student4	accounting
5	student3	computer
7	student5	computer
1	student1	math
3	student3	math
6	student4	math
Time taken: 0.085 seconds, Fetched: 11 row(s)

Ok, let me look at the result from Impala.

[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 4        |
+----------+
Fetched 1 row(s) in 0.85s

Interesting. Impala shows 4 rows instead of 11. The reason is that Impala caches table metadata and does not automatically notice files added outside of Impala, so we need to invalidate the metadata to get the latest row count (a lighter-weight alternative is shown after the output below). Hive does not have this issue.

[vmhost3:21000] > invalidate metadata;
Query: invalidate metadata

Fetched 0 row(s) in 4.48s
[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 11       |
+----------+
Fetched 1 row(s) in 0.94s
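When only new data files have been added to a table Impala already knows about, a table-level REFRESH is generally a cheaper alternative to invalidating the whole catalog. A small sketch, assuming the test_oracle database and the impalad used above:

# Reload only this table's file metadata, then re-check the row count.
impala-shell -i vmhost3:21000 -d test_oracle \
  -q "REFRESH student_ext; SELECT count(*) FROM student_ext;"

INVALIDATE METADATA still has its place, for example after new tables are created outside of Impala.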

7. Solution to the issue.
The solution is to add one more argument, --last-value, to specify where the last load stopped. I redid steps 1 to 5, first cleaning out the duplicated part files (see the sketch below) and then rerunning the import with --last-value added.
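The HDFS listing further down suggests that only the part files written by the duplicated append needed to go; a sketch of that cleanup, assuming the file numbering from the earlier listing (part-m-00005 through part-m-00009):

# Remove only the part files created by the duplicated append before rerunning.
hdfs dfs -rm -skipTrash /user/wzhou/STUDENT/part-m-0000[5-9]

With the directory cleaned up, the corrected import command is: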
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
--last-value 4 \
-m 4 \
--split-by major

Here are the results.

HDFS
[wzhou@vmhost1 ~]$ hdfs dfs -ls /user/wzhou/STUDENT
Found 9 items
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:14 /user/wzhou/STUDENT/part-m-00000
-rw-r--r--   2 wzhou bigdata         42 2015-09-25 05:15 /user/wzhou/STUDENT/part-m-00001
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:16 /user/wzhou/STUDENT/part-m-00002
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00003
-rw-r--r--   2 wzhou bigdata         32 2015-09-25 05:17 /user/wzhou/STUDENT/part-m-00004
-rw-r--r--   2 wzhou bigdata         40 2015-09-25 06:02 /user/wzhou/STUDENT/part-m-00005
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 06:02 /user/wzhou/STUDENT/part-m-00006
-rw-r--r--   2 wzhou bigdata          0 2015-09-25 06:03 /user/wzhou/STUDENT/part-m-00007
-rw-r--r--   2 wzhou bigdata         16 2015-09-25 06:04 /user/wzhou/STUDENT/part-m-00008
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*5
5,student3,computer
7,student5,computer
[wzhou@vmhost1 ~]$ hdfs dfs -cat /user/wzhou/STUDENT/part*8
6,student4,math

Hive
hive> select * from student_ext;
OK
2	student2	computer
4	student4	accounting
1	student1	math
3	student3	math
5	student3	computer
7	student5	computer
6	student4	math
Time taken: 0.095 seconds, Fetched: 7 row(s)

Impala
[vmhost3:21000] > select count(*) from student_ext;
Query: select count(*) from student_ext
+----------+
| count(*) |
+----------+
| 7        |
+----------+
Fetched 1 row(s) in 0.82s

Ok, everything looks good right now.

Export data from Hive table to Oracle Database

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table.
4. Import Data to Hive from Oracle Database

This post discusses a different operation, exporting data from a Hive table to an Oracle database, just like the goldfish jumping from the fish bowl to the laptop below. Again, we use Sqoop to perform this operation.
(Image: fish_export)

The following example exports the data from the my_all_objects_sqoop table in Hive to the WZHOU.TEST_IMPORT_FROM_SCOOP table in the DBM database on X3.

1. Create WZHOU.TEST_IMPORT_FROM_SCOOP table in Oracle.

CREATE TABLE WZHOU.TEST_IMPORT_FROM_SCOOP
(
OWNER VARCHAR2(30),
OBJECT_NAME VARCHAR2(30),
OBJECT_ID NUMBER(10),
OBJECT_TYPE VARCHAR2(19),
CREATED_DATE VARCHAR2(30)
);

select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

2. Run the sqoop export job.

sqoop export \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password wzhou \
--direct \
--export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
--table WZHOU.TEST_IMPORT_FROM_SCOOP \
--fields-terminated-by '\001'

Note: the --export-dir argument specifies the location of the Hive source directory, not an individual file name. --fields-terminated-by '\001' must be included in the command; otherwise, you will see the following error:
Caused by: java.lang.RuntimeException: Can't parse input data: 'SYSTEMLOGMNR_REFCON$7604TABLE2013-03-12'
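If you are not sure which delimiter a Hive table was stored with, its storage properties should show it. A quick check, assuming the table name used in this example; with Sqoop-created tables the field.delim entry typically reads \u0001, which is what --fields-terminated-by '\001' matches.

# The "Storage Desc Params" section of the output lists field.delim.
hive -e "DESCRIBE FORMATTED test_oracle.my_all_objects_sqoop"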

Result:

[wzhou@vmhost1 ~]$ sqoop export \
>     --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
>     --username wzhou \
>     --password wzhou \
>     --direct \
>     --export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
>     --table WZHOU.TEST_IMPORT_FROM_SCOOP  \
>     --fields-terminated-by '\001' 
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/22 07:02:46 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/22 07:02:46 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/22 07:02:46 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/22 07:02:47 INFO oracle.OraOopOracleQueries: Current schema is: WZHOU
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: 
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: This Oracle database is a RAC.
15/09/22 07:02:49 INFO oracle.OraOopManagerFactory: 	Database time on dbm1 is 2015-09-22 07:02:49.060215
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: 	Database time on dbm2 is 2015-09-22 07:02:50.03952
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will load-balance sessions across the following instances ofthe Oracle RAC:
	Instance: dbm1 	 URL: jdbc:oracle:thin:@enkx3db01.enkitec.com:1521:dbm1
	Instance: dbm2 	 URL: jdbc:oracle:thin:@enkx3db02.enkitec.com:1521:dbm2

15/09/22 07:02:50 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
15/09/22 07:02:50 INFO tool.CodeGenTool: Beginning code generation
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT "OWNER","OBJECT_NAME","OBJECT_ID","OBJECT_TYPE","CREATED_DATE" FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 0=1
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU_TEST_IMPORT_FROM_SCOOP.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/22 07:02:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU.TEST_IMPORT_FROM_SCOOP.jar
15/09/22 07:02:52 INFO mapreduce.ExportJobBase: Beginning export of WZHOU.TEST_IMPORT_FROM_SCOOP
15/09/22 07:02:52 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/22 07:02:53 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/22 07:02:53 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/22 07:02:55 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: number of splits:4
15/09/22 07:02:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442921609180_0004
15/09/22 07:02:56 INFO impl.YarnClientImpl: Submitted application application_1442921609180_0004
15/09/22 07:02:56 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442921609180_0004/
15/09/22 07:02:56 INFO mapreduce.Job: Running job: job_1442921609180_0004
15/09/22 07:03:05 INFO mapreduce.Job: Job job_1442921609180_0004 running in uber mode : false
15/09/22 07:03:05 INFO mapreduce.Job:  map 0% reduce 0%
15/09/22 07:03:14 INFO mapreduce.Job:  map 25% reduce 0%
15/09/22 07:03:38 INFO mapreduce.Job:  map 45% reduce 0%
15/09/22 07:03:39 INFO mapreduce.Job:  map 50% reduce 0%
15/09/22 07:04:24 INFO mapreduce.Job:  map 73% reduce 0%
15/09/22 07:04:25 INFO mapreduce.Job:  map 75% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job:  map 100% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job: Job job_1442921609180_0004 completed successfully
15/09/22 07:04:54 INFO mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=550000
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1176438
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=25
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=4
		Data-local map tasks=1
		Rack-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=102232
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=102232
		Total vcore-seconds taken by all map tasks=102232
		Total megabyte-seconds taken by all map tasks=104685568
	Map-Reduce Framework
		Map input records=22519
		Map output records=22519
		Input split bytes=1033
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=421
		CPU time spent (ms)=6480
		Physical memory (bytes) snapshot=481697792
		Virtual memory (bytes) snapshot=6458949632
		Total committed heap usage (bytes)=127401984
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
15/09/22 07:04:54 INFO mapreduce.ExportJobBase: Transferred 1.1219 MB in 121.1924 seconds (9.4797 KB/sec)
15/09/22 07:04:55 INFO mapreduce.ExportJobBase: Exported 22519 records.

3. Verify the result on Exadata. The row count and everything else look good.

SYS:dbm1> select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

   COUNT(*)
-----------
      22519

SYS:dbm1> select * from WZHOU.TEST_IMPORT_FROM_SCOOP where OWNER='WZHOU' and rownum < 4;

OWNER           OBJECT_NAME      OBJECT_ID   OBJECT_TYPE   CREATED_DATE
--------------- ---------------- ----------- ------------- -----------------
WZHOU           T1               29436       TABLE         2013-07-30
WZHOU           MOVETAB_BASIC    28832       TABLE         2013-07-23
WZHOU           MOVETAB_OLTP     28834       TABLE         2013-07-23

In the next post, I am going to discuss how to use Impala to access the same Hive data and compare performance between Hive and Impala.