Use Impala to query a Hive table

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table
4. Import Data to Hive from Oracle Database
5. Export data from Hive table to Oracle Database
Although Hive is popular in the Hadoop world, it has its drawbacks, such as excessive Map/Reduce operations for certain queries and JVM overhead during Map/Reduce. Impala is designed to improve query performance when accessing data on HDFS. Hive is SQL on Hadoop, while Impala is SQL on HDFS. Hive uses MapReduce jobs to get query results, while Impala uses its daemons running on the data nodes to access the files on HDFS directly and does not use Map/Reduce at all.

There are two ways to use Impala to query tables in Hive. One is the command line, Impala Shell; the other is the Hue GUI. I am going to show both methods one by one.

Use Impala Shell
Impala Shell is a nice tool, similar to SQL*Plus, for setting up databases and tables and issuing queries. Ad hoc queries run much faster than in Hive, especially queries requiring fast response time. Here are the steps for using the Impala shell.

1. Log on as the wzhou user and start the Impala shell.
[wzhou@vmhost1 ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Error connecting: TTransportException, Could not connect to vmhost1.local:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.2.0-cdh5.4.3 (517bb0f) built on Wed Jun 24 19:17:40 PDT 2015)
[Not connected] >

Note: The prompt shows Not connected. I need to connect the Impala shell to a Data Node running the impalad daemon. My cluster uses vmhost2 and vmhost3 as Data Nodes, so I pick one of them, vmhost2, for this test.
[Not connected] > connect vmhost2;
Connected to vmhost2:21000
Server version: impalad version 2.2.0-cdh5.4.3 RELEASE (build 517bb0f71cd604a00369254ac6d88394df83e0f6)
[vmhost2:21000] >
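
Note: Instead of connecting after startup, the shell can also be pointed at an impalad at launch time with the -i option; a minimal sketch, assuming vmhost2 runs impalad on the default port 21000:

[wzhou@vmhost1 ~]$ impala-shell -i vmhost2:21000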

2. Run some queries. Impala sees the same list of databases and tables as Hive does.

[vmhost2:21000] > show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| test1            |
| test_oracle      |
+------------------+
Fetched 4 row(s) in 0.01s 

[vmhost2:21000] > use test_oracle;
Query: use test_oracle

[vmhost2:21000] > show tables;
Query: show tables
+----------------------+
| name                 |
+----------------------+
| my_all_objects       |
| my_all_objects_sqoop |
+----------------------+
Fetched 2 row(s) in 0.01s

[vmhost2:21000] > select * from my_all_objects_sqoop limit 3;
Query: select * from my_all_objects_sqoop limit 3
+-------+-------------+-----------+-------------+-------------+
| owner | object_name | object_id | object_type | create_date |
+-------+-------------+-----------+-------------+-------------+
| SYS   | I_USER1     | 46        | INDEX       | 2013-03-12  |
| SYS   | I_OBJ#      | 3         | INDEX       | 2013-03-12  |
| SYS   | I_IND1      | 41        | INDEX       | 2013-03-12  |
+-------+-------------+-----------+-------------+-------------+
Fetched 3 row(s) in 0.04s

[vmhost2:21000] > select count(*) from my_all_objects_sqoop;
Query: select count(*) from my_all_objects_sqoop
+----------+
| count(*) |
+----------+
| 22519    |
+----------+
Fetched 1 row(s) in 1.00s

Use Hue Web UI
Another way to use Impala is to run queries from the Hue UI.

1. From the Cloudera Manager screen, click Hue. After the Hue screen shows up, click Hue Web UI.
impala_hue_1

2. On the Hue home screen, click Query Editors, then choose Impala.
impala_hue_2

3. After the Impala Query Editor screen shows up, select test_oracle under DATABASE. Enter the following query, then click Execute.
select * from my_all_objects_sqoop limit 3;
impala_hue_3

4. Run another query and check out the Explain plan.
impala_hue_4

Run Time Comparison between Hive and Impala
Hive
hive> use test_oracle;
OK
Time taken: 0.534 seconds
hive> show tables;
OK
my_all_objects
my_all_objects_sqoop
Time taken: 0.192 seconds, Fetched: 2 row(s)

hive> select count(*) from my_all_objects;
Query ID = wzhou_20150922112626_efde0c06-2f04-44b2-9bf1-47b31e45de03
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1442935988315_0002, Tracking URL = http://vmhost1.local:8088/proxy/application_1442935988315_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1442935988315_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-22 11:26:33,759 Stage-1 map = 0%, reduce = 0%
2015-09-22 11:26:46,563 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.25 sec
2015-09-22 11:26:58,035 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.97 sec
MapReduce Total cumulative CPU time: 3 seconds 970 msec
Ended Job = job_1442935988315_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.97 sec HDFS Read: 2057365 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 970 msec
OK
22782
Time taken: 40.881 seconds, Fetched: 1 row(s)

Impala
Wow, 41 seconds to get a row count of 22,782 using Hive. That seems excessive on a cluster with no other jobs running. OK, let's look at Impala's result.
Note: Impala does not poll for metadata changes frequently, so if you don't see the table name after the import, just run the following:
invalidate metadata;
show tables;
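
If only a single table's data files changed, for example right after a Sqoop import, a lighter-weight refresh should work instead of invalidating all metadata; a sketch using the table from this post:

refresh my_all_objects_sqoop;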

[wzhou@vmhost1 ~]$ impala-shell
[Not connected] > connect vmhost2;
Connected to vmhost2:21000
Server version: impalad version 2.2.0-cdh5.4.3 RELEASE (build 517bb0f71cd604a00369254ac6d88394df83e0f6)

[vmhost2:21000] > use test_oracle;
Query: use test_oracle

[vmhost2:21000] > show tables;
Query: show tables
+----------------------+
| name                 |
+----------------------+
| my_all_objects       |
| my_all_objects_sqoop |
+----------------------+
Fetched 2 row(s) in 0.01s

[vmhost2:21000] > select count(*) from my_all_objects;
Query: select count(*) from my_all_objects
+----------+
| count(*) |
+----------+
| 22782    |
+----------+
Fetched 1 row(s) in 0.12s

The above result shows Hive took 41 seconds to get the row count of a table with 22,782 rows, while Impala was significantly faster at 0.12 seconds. I know my cluster is small and not powerful, and Hive is using Map/Reduce, but 41 seconds to get a total row count of about 22,000 rows seems too much. On the other hand, Impala's timing looks much more reasonable to me. Obviously Map/Reduce is not my option if I want to run queries that expect fast response time. But for a long-running job against a huge dataset, the Map/Reduce option might still be on the table, considering its job fault tolerance.
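
By the way, if you want to see where Impala spends its time, the shell can print execution details for the most recently run query; a sketch using the shell commands available in this version, as far as I know:

[vmhost2:21000] > select count(*) from my_all_objects;
[vmhost2:21000] > summary;
[vmhost2:21000] > profile;

summary prints per-operator timings, while profile dumps the full runtime profile of the last query.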

Use SQL Developer to Access Hive Table on Hadoop

There are many UI and command-line tools to access Hive data on Hadoop, and I am not going to list them one by one. I use SQL Developer a lot to access Oracle databases and like this powerful tool. That made me wonder whether I could use SQL Developer to access Hive tables on HDFS. After some research, I did find a way to configure SQL Developer to access Hive tables. Here are the steps:

1. Download the JDBC Driver.
Go to the Cloudera website and download the Hive JDBC Driver at http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/jdbc/hive-jdbc-v2-5-4.html.
As of now, the latest version of the Hive JDBC Connector is v2.5.4. Unzip the file to your target directory.

2. Configure SQL Developer
Open SQL Developer and click Preferences -> Database -> Third Party JDBC Drivers. Click Add Entry to add all of the jar files you just unzipped. Close SQL Developer and reopen it.
1_sql_developer_hive_jdbc_driver

3. Create a Hive Connection
Create a new connection. The port is the HiveServer2 port, the value of hive.server2.thrift.port; in my cluster it is 10000. I still use the test_oracle Hive database created in my previous posts. Click Test; it should show Success.
2_sql_developer_hive_connection
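
For reference, the connection SQL Developer builds should correspond to a HiveServer2 JDBC URL of roughly the following form; the hostname here is an assumption based on my cluster, so adjust it for yours:

jdbc:hive2://vmhost1:10000/test_oracle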

4. Access Hive Tables
After clicking Connect, I can access the Hive databases on my Hadoop cluster.
3_sql_developer_hive_table
You can see the look and feel of accessing Hive is very similar to the way of accessing a regular Oracle table. Awesome! Here are a few more screens.
Properties Tab
4_sql_developer_hive_table_properties
DDL Tab
The DDL tab is a nice one. The drawback is that everything is on one line. It would be nice to format the DDL string in a more readable way. Anyway, it is not a big deal.
5_sql_developer_hive_table_ddl
Detail Tab
6_sql_developer_hive_table_detail
Model Tab
7_sql_developer_hive_table_model

Let me run a query; here is the result, with exactly the look and feel of using Oracle.
8_sql_developer_hive_table_query

DBMCLI Utility

dbmcli-cellcli
I like good tools. CELLCLI is a useful tool for performing administration work on Exadata Storage Servers. Starting with Exadata Storage Server Release 12.1.2.1.0, there is another utility, called DBMCLI, for configuring and monitoring Exadata Database Servers. DBMCLI replaces the /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl Perl script, and its usage is similar to that of CELLCLI.

With DBMCLI, we can start/stop services, list alert history, configure SMTP, and monitor hardware components. Here are a few examples executed on our X3 box in the lab. I removed some repetitive content from the output to save space here.

[root@enkx3db01 ~]# imageinfo
Kernel version: 2.6.39-400.248.3.el6uek.x86_64 #1 SMP Wed Mar 11 18:04:34 PDT 2015 x86_64
Image version: 12.1.2.1.1.150316.2
Image activated: 2015-05-13 07:35:16 -0500
Image status: success
System partition on device: /dev/mapper/VGExaDb-LVDbSys2


[root@enkx3db01 ~]# dbmcli
DBMCLI: Release  - Production on Sat Oct 03 07:55:10 CDT 2015

Copyright (c) 2007, 2014, Oracle.  All rights reserved.

DBMCLI> list dbserver
	 enkx3db01	 online

DBMCLI> list dbserver detail
	 name:              	 enkx3db01
	 bbuStatus:         	 normal
	 coreCount:         	 16
	 cpuCount:          	 32
	 diagHistoryDays:   	 7
	 fanCount:          	 16/16
	 fanStatus:         	 normal
	 id:                	 1302FML051
	 interconnectCount: 	 2
	 ipaddress1:        	 192.168.12.1/24
	 kernelVersion:     	 2.6.39-400.248.3.el6uek.x86_64
	 locatorLEDStatus:  	 off
	 makeModel:         	 Oracle Corporation SUN FIRE X4170 M3
	 metricHistoryDays: 	 7
	 msVersion:         	 OSS_12.1.2.1.1_LINUX.X64_150316.2
	 powerCount:        	 2/2
	 powerStatus:       	 normal
	 releaseImageStatus:	 success
	 releaseVersion:    	 12.1.2.1.1.150316.2
	 releaseTrackingBug:	 20240049
	 status:            	 online
	 temperatureReading:	 18.0
	 temperatureStatus: 	 normal
	 upTime:            	 108 days, 16:29
	 msStatus:          	 running
	 rsStatus:          	 running

DBMCLI> list physicaldisk
	 252:0	 NLV9ZD	 normal
	 252:1	 NGXRZF	 normal
	 252:2	 NLV1JD	 normal
	 252:3	 NH06DD	 normal

DBMCLI> list physicaldisk detail
	 name:              	 252:0
	 deviceId:          	 11
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 252
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 makeModel:         	 "HITACHI H106030SDSUN300G"
	 physicalFirmware:  	 A3D0
	 physicalInsertTime:	 2015-05-13T07:32:53-05:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 NLV9ZD
	 physicalSize:      	 279.39677238464355G
	 slotNumber:        	 0
	 status:            	 normal

	 name:              	 252:1
	 deviceId:          	 10
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 252
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 makeModel:         	 "HITACHI H106030SDSUN300G"
	 physicalFirmware:  	 A3D0
	 physicalInsertTime:	 2015-05-13T07:32:53-05:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 NGXRZF
	 physicalSize:      	 279.39677238464355G
	 slotNumber:        	 1
	 status:            	 normal

	 name:              	 252:2
	 deviceId:          	 9
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 252
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 makeModel:         	 "HITACHI H106030SDSUN300G"
	 physicalFirmware:  	 A3D0
	 physicalInsertTime:	 2015-05-13T07:32:53-05:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 NLV1JD
	 physicalSize:      	 279.39677238464355G
	 slotNumber:        	 2
	 status:            	 normal

	 name:              	 252:3
	 deviceId:          	 8
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 252
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 makeModel:         	 "HITACHI H106030SDSUN300G"
	 physicalFirmware:  	 A3D0
	 physicalInsertTime:	 2015-05-13T07:32:53-05:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 NH06DD
	 physicalSize:      	 279.39677238464355G
	 slotNumber:        	 3
	 status:            	 normal
	
DBMCLI> list lun 
	 0_0	 0_0	 normal
	 
DBMCLI> list lun detail
	 name:              	 0_0
	 diskType:          	 HardDisk
	 id:                	 0_0
	 lunSize:           	 835.3940000003204G
	 lunUID:            	 0_0
	 raidLevel:         	 5
	 lunWriteCacheMode: 	 "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
	 status:            	 normal
	 	 
DBMCLI> list ibport
	 HCA-1:1	 Active
	 HCA-1:2	 Active


DBMCLI> list ibport detail
	 name:              	 HCA-1:1
	 activeSlave:       	 TRUE
	 dataRate:          	 "40 Gbps"
	 hcaFWVersion:      	 2.11.2012
	 id:                	 0x0010e0000128ce65
	 lid:               	 45
	 linkDowned:        	 0
	 linkIntegrityErrs: 	 0
	 linkRecovers:      	 0
	 physLinkState:     	 LinkUp
	 portNumber:        	 1
	 rcvConstraintErrs: 	 0
	 rcvData:           	 1448988236094
	 rcvErrs:           	 0
	 rcvRemotePhysErrs: 	 0
	 status:            	 Active
	 symbolErrs:        	 0
	 vl15Dropped:       	 0
	 xmtConstraintErrs: 	 0
	 xmtData:           	 237626900021
	 xmtDiscards:       	 0

	 name:              	 HCA-1:2
	 activeSlave:       	 FALSE
	 dataRate:          	 "40 Gbps"
	 hcaFWVersion:      	 2.11.2012
	 id:                	 0x0010e0000128ce66
	 lid:               	 46
	 linkDowned:        	 0
	 linkIntegrityErrs: 	 0
	 linkRecovers:      	 0
	 physLinkState:     	 LinkUp
	 portNumber:        	 2
	 rcvConstraintErrs: 	 0
	 rcvData:           	 22172909573667
	 rcvErrs:           	 0
	 rcvRemotePhysErrs: 	 0
	 status:            	 Active
	 symbolErrs:        	 0
	 vl15Dropped:       	 0
	 xmtConstraintErrs: 	 0
	 xmtData:           	 18855706123959
	 xmtDiscards:       	 0

Here is the result from list alerthistory.

DBMCLI> list alerthistory
	 5_1	 2015-06-16T12:39:32-05:00	 critical	 "A power supply component is suspected of causing a fault with a 100 certainty.  Component Name : /SYS/PS1  Fault class    : fault.chassis.power.ext-fail  Fault message  : http://www.sun.com/msg/SPX86-8003-73"
	 5_2	 2015-07-15T15:59:20-05:00	 clear   	 "A power supply component fault has been cleared.  Component Name       : /SYS/PS1  Trap Additional Info : fault.chassis.power.ext-fail"
	 6_1	 2015-06-25T19:36:44-05:00	 critical	 "File system "/" is 80% full, which is above the 80% threshold. Accelerated space reclamation has started.  This alert will be cleared when file system "/" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /opt        : 7.79G /home        : 7.38G /usr        : 3.06G"
	 7_1	 2015-07-06T14:33:51-05:00	 info    	 "The HDD disk controller battery is performing an unscheduled learn cycle. All disk drives have been placed in WriteThrough caching mode. The flash drives are not affected. Battery Serial Number : 10494  Battery Type          : ibbu08  Battery Temperature   : 39 C  Full Charge Capacity  : 1368 mAh  Relative Charge       : 98%  Ambient Temperature   : 19 C"
	 7_2	 2015-07-06T15:45:31-05:00	 clear   	 "All disk drives are in WriteBack caching mode.  Battery Serial Number : 10494  Battery Type          : ibbu08  Battery Temperature   : 39 C  Full Charge Capacity  : 1368 mAh  Relative Charge       : 72%  Ambient Temperature   : 17 C"
	 8_1	 2015-07-15T15:59:35-05:00	 critical	 "A power supply component is suspected of causing a fault with a 100 certainty.  Component Name : /SYS/PS1  Fault class    : fault.chassis.power.ext-fail  Fault message  : http://www.sun.com/msg/SPX86-8003-73"
	 8_2	 2015-07-15T16:01:37-05:00	 clear   	 "A power supply component fault has been cleared.  Component Name       : /SYS/PS1  Trap Additional Info : fault.chassis.power.ext-fail"
	 9_1	 2015-08-04T22:04:16-05:00	 critical	 "File system "/u01" is 80% full, which is above the 80% threshold.  This alert will be cleared when file system "/u01" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /u01/app        : 147.43G /u01/lost+found        : 16K /u01/stage        : 4K"

If you want to check out the details of the alert history, run the following.

DBMCLI> list alerthistory detail
	 name:              	 5_1
	 alertDescription:  	 "A power supply component suspected of causing a fault"
	 alertMessage:      	 "A power supply component is suspected of causing a fault with a 100 certainty.  Component Name : /SYS/PS1  Fault class    : fault.chassis.power.ext-fail  Fault message  : http://www.sun.com/msg/SPX86-8003-73"
	 alertSequenceID:   	 5
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2015-06-16T12:39:32-05:00
	 endTime:           	 2015-07-15T15:59:20-05:00
	 examinedBy:        	 
	 metricObjectName:  	 /SYS/PS1_FAULT
	 notificationState: 	 0
	 sequenceBeginTime: 	 2015-06-16T12:39:32-05:00
	 severity:          	 critical
	 alertAction:       	 "For additional information, please refer to http://www.sun.com/msg/SPX86-8003-73"

	 name:              	 5_2
	 alertDescription:  	 "A power supply component fault cleared"
	 alertMessage:      	 "A power supply component fault has been cleared.  Component Name       : /SYS/PS1  Trap Additional Info : fault.chassis.power.ext-fail"
	 alertSequenceID:   	 5
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2015-07-15T15:59:20-05:00
	 endTime:           	 2015-07-15T15:59:20-05:00
	 examinedBy:        	 
	 metricObjectName:  	 /SYS/PS1_FAULT
	 notificationState: 	 0
	 sequenceBeginTime: 	 2015-06-16T12:39:32-05:00
	 severity:          	 clear
	 alertAction:       	 Informational.

	 name:              	 6_1
	 alertDescription:  	 "File system "/" is 80% full"
	 alertMessage:      	 "File system "/" is 80% full, which is above the 80% threshold. Accelerated space reclamation has started.  This alert will be cleared when file system "/" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /opt        : 7.79G /home        : 7.38G /usr        : 3.06G"
	 alertSequenceID:   	 6
	 alertShortName:    	 Software
	 alertType:         	 Stateful
	 beginTime:         	 2015-06-25T19:36:44-05:00
	 examinedBy:        	 
	 metricObjectName:  	 /
	 notificationState: 	 0
	 sequenceBeginTime: 	 2015-06-25T19:36:44-05:00
	 severity:          	 critical
	 alertAction:       	 "MS includes a file deletion policy that is triggered when file system utilitization is high. Deletion of files is triggered when file utilization reaches 80%. For the / file system, 1) files in metric history directory will be deleted using a policy based on the file modification time stamp. Files older than the number of days set by the metricHistoryDays attribute value will be deleted first, then successive deletions will occur for earlier files, down to files with modification time stamps older than or equal to 10 minutes, or until file system utilization is less than 75%. 2) files in the ADR base directory and LOG_HOME directory will be deleted using a policy based on the file modification time stamp. Files older than the number of days set by the diagHistoryDays attribute value will be deleted first, then successive deletions will occur for earlier files, down to files with modification time stamps older than or equal to 10 minutes, or until file system utilization is less than 75%. The renamed alert.log files and ms-odl generation files that are over 5 MB, and older than the successively-shorter age intervals are also deleted. Crash files that are over 5 MB and older than one day will be deleted.Try to delete more recent files, or files not being automatically purged, to free up space if needed."
	 
. . . .

If you just want to check out the critical alerts, run the following. The describe command below lists the attributes available for filtering.
DBMCLI> describe alerthistory
name
alertDescription
alertMessage
alertSequenceID
alertShortName
alertType
beginTime
endTime
examinedBy modifiable
failedMail
failedSNMP
metricObjectName
metricValue
notificationState
sequenceBeginTime
severity
alertAction

DBMCLI> list alerthistory where alertType='critical';

DBMCLI> list alerthistory where severity='critical'
5_1 2015-06-16T12:39:32-05:00 critical "A power supply component is suspected of causing a fault with a 100 certainty. Component Name : /SYS/PS1 Fault class : fault.chassis.power.ext-fail Fault message : http://www.sun.com/msg/SPX86-8003-73"
6_1 2015-06-25T19:36:44-05:00 critical "File system "/" is 80% full, which is above the 80% threshold. Accelerated space reclamation has started. This alert will be cleared when file system "/" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /opt : 7.79G /home : 7.38G /usr : 3.06G"
8_1 2015-07-15T15:59:35-05:00 critical "A power supply component is suspected of causing a fault with a 100 certainty. Component Name : /SYS/PS1 Fault class : fault.chassis.power.ext-fail Fault message : http://www.sun.com/msg/SPX86-8003-73"
9_1 2015-08-04T22:04:16-05:00 critical "File system "/u01" is 80% full, which is above the 80% threshold. This alert will be cleared when file system "/u01" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /u01/app : 147.43G /u01/lost+found : 16K /u01/stage : 4K"

DBMCLI> list alerthistory where severity='critical' and beginTime > '2015-08-01T01:00:01-07:00'
9_1 2015-08-04T22:04:16-05:00 critical "File system "/u01" is 80% full, which is above the 80% threshold. This alert will be cleared when file system "/u01" becomes less than 75% full. Top three directories ordered by total space usage are as follows: /u01/app : 147.43G /u01/lost+found : 16K /u01/stage : 4K"

We can also check out the metric history information.

DBMCLI> list metrichistory
	 DS_TEMP           	 enkx3db01   	 18.0 C      	 2015-10-03T07:00:39-05:00
	 DS_FANS           	 enkx3db01   	 16          	 2015-10-03T07:00:40-05:00
	 DS_BBU_TEMP       	 enkx3db01   	 35.0 C      	 2015-10-03T07:00:41-05:00
	 DS_CPUT           	 enkx3db01   	 2.6 %       	 2015-10-03T07:00:41-05:00
	 DS_CPUT_MS        	 enkx3db01   	 0.0 %       	 2015-10-03T07:00:41-05:00
	 DS_FSUT           	 /           	 87 %        	 2015-10-03T07:00:41-05:00
	 DS_FSUT           	 /boot       	 7 %         	 2015-10-03T07:00:41-05:00
	 DS_FSUT           	 /mnt/oldroot	 80 %        	 2015-10-03T07:00:41-05:00
	 DS_FSUT           	 /u01        	 87 %        	 2015-10-03T07:00:41-05:00
	 DS_MEMUT          	 enkx3db01   	 82 %        	 2015-10-03T07:00:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T07:00:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T07:00:41-05:00
	 DS_RUNQ           	 enkx3db01   	 1.6         	 2015-10-03T07:00:41-05:00
	 DS_SWAP_IN_BY_SEC 	 enkx3db01   	 4.7 KB/sec  	 2015-10-03T07:00:41-05:00
	 DS_SWAP_OUT_BY_SEC	 enkx3db01   	 0.0 KB/sec  	 2015-10-03T07:00:41-05:00
	 DS_SWAP_USAGE     	 enkx3db01   	 2 %         	 2015-10-03T07:00:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T07:00:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T07:00:41-05:00
	 N_HCA_MB_RCV_SEC  	 enkx3db01   	 1.818 MB/sec	 2015-10-03T07:00:42-05:00
	 N_HCA_MB_TRANS_SEC	 enkx3db01   	 2.075 MB/sec	 2015-10-03T07:00:42-05:00
	 N_IB_MB_RCV_SEC   	 HCA-1:1     	 0.233 MB/sec	 2015-10-03T07:00:42-05:00
	 N_IB_MB_RCV_SEC   	 HCA-1:2     	 1.585 MB/sec	 2015-10-03T07:00:42-05:00
	 N_IB_MB_TRANS_SEC 	 HCA-1:1     	 0.216 MB/sec	 2015-10-03T07:00:42-05:00
	 N_IB_MB_TRANS_SEC 	 HCA-1:2     	 1.859 MB/sec	 2015-10-03T07:00:42-05:00
	 N_IB_UTIL_RCV     	 HCA-1:1     	 0.0 %       	 2015-10-03T07:00:42-05:00
	 N_IB_UTIL_RCV     	 HCA-1:2     	 0.0 %       	 2015-10-03T07:00:42-05:00
	 N_IB_UTIL_TRANS   	 HCA-1:1     	 0.0 %       	 2015-10-03T07:00:42-05:00
	 N_IB_UTIL_TRANS   	 HCA-1:2     	 0.1 %       	 2015-10-03T07:00:42-05:00
	 N_NIC_KB_RCV_SEC  	 enkx3db01   	 0.5 KB/sec  	 2015-10-03T07:00:42-05:00
	 N_NIC_KB_TRANS_SEC	 enkx3db01   	 0.4 KB/sec  	 2015-10-03T07:00:42-05:00
	 DS_TEMP           	 enkx3db01   	 18.0 C      	 2015-10-03T07:01:39-05:00
	 DS_FANS           	 enkx3db01   	 16          	 2015-10-03T07:01:40-05:00
	 DS_BBU_TEMP       	 enkx3db01   	 35.0 C      	 2015-10-03T07:01:41-05:00
	 DS_CPUT           	 enkx3db01   	 3.5 %       	 2015-10-03T07:01:41-05:00
	 DS_CPUT_MS        	 enkx3db01   	 0.0 %       	 2015-10-03T07:01:41-05:00
	 DS_FSUT           	 /           	 87 %        	 2015-10-03T07:01:41-05:00
	 DS_FSUT           	 /boot       	 7 %         	 2015-10-03T07:01:41-05:00
	 DS_FSUT           	 /mnt/oldroot	 80 %        	 2015-10-03T07:01:41-05:00
	 DS_FSUT           	 /u01        	 87 %        	 2015-10-03T07:01:41-05:00
	 DS_MEMUT          	 enkx3db01   	 82 %        	 2015-10-03T07:01:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T07:01:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T07:01:41-05:00
	 DS_RUNQ           	 enkx3db01   	 1.3         	 2015-10-03T07:01:41-05:00
	 DS_SWAP_IN_BY_SEC 	 enkx3db01   	 0.0 KB/sec  	 2015-10-03T07:01:41-05:00
	 DS_SWAP_OUT_BY_SEC	 enkx3db01   	 0.0 KB/sec  	 2015-10-03T07:01:41-05:00
	 DS_SWAP_USAGE     	 enkx3db01   	 2 %         	 2015-10-03T07:01:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T07:01:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T07:01:41-05:00
. . . .
	 DS_TEMP           	 enkx3db01   	 18.0 C      	 2015-10-03T08:01:39-05:00
	 DS_FANS           	 enkx3db01   	 16          	 2015-10-03T08:01:40-05:00
	 DS_BBU_TEMP       	 enkx3db01   	 35.0 C      	 2015-10-03T08:01:41-05:00
	 DS_CPUT           	 enkx3db01   	 3.8 %       	 2015-10-03T08:01:41-05:00
	 DS_CPUT_MS        	 enkx3db01   	 0.0 %       	 2015-10-03T08:01:41-05:00
	 DS_FSUT           	 /           	 87 %        	 2015-10-03T08:01:41-05:00
	 DS_FSUT           	 /boot       	 7 %         	 2015-10-03T08:01:41-05:00
	 DS_FSUT           	 /mnt/oldroot	 80 %        	 2015-10-03T08:01:41-05:00
	 DS_FSUT           	 /u01        	 87 %        	 2015-10-03T08:01:41-05:00
	 DS_MEMUT          	 enkx3db01   	 82 %        	 2015-10-03T08:01:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T08:01:41-05:00
	 DS_MEMUT_MS       	 enkx3db01   	 0.4 %       	 2015-10-03T08:01:41-05:00
	 DS_RUNQ           	 enkx3db01   	 1.5         	 2015-10-03T08:01:41-05:00
	 DS_SWAP_IN_BY_SEC 	 enkx3db01   	 0.0 KB/sec  	 2015-10-03T08:01:41-05:00
	 DS_SWAP_OUT_BY_SEC	 enkx3db01   	 0.0 KB/sec  	 2015-10-03T08:01:41-05:00
	 DS_SWAP_USAGE     	 enkx3db01   	 2 %         	 2015-10-03T08:01:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T08:01:41-05:00
	 DS_VIRTMEM_MS     	 enkx3db01   	 4,708 MB    	 2015-10-03T08:01:41-05:00
	 N_HCA_MB_RCV_SEC  	 enkx3db01   	 0.833 MB/sec	 2015-10-03T08:01:43-05:00
	 N_HCA_MB_TRANS_SEC	 enkx3db01   	 0.708 MB/sec	 2015-10-03T08:01:43-05:00
	 N_IB_MB_RCV_SEC   	 HCA-1:1     	 0.095 MB/sec	 2015-10-03T08:01:43-05:00
	 N_IB_MB_RCV_SEC   	 HCA-1:2     	 0.737 MB/sec	 2015-10-03T08:01:43-05:00
	 N_IB_MB_TRANS_SEC 	 HCA-1:1     	 0.081 MB/sec	 2015-10-03T08:01:43-05:00
	 N_IB_MB_TRANS_SEC 	 HCA-1:2     	 0.627 MB/sec	 2015-10-03T08:01:43-05:00
	 N_IB_UTIL_RCV     	 HCA-1:1     	 0.0 %       	 2015-10-03T08:01:43-05:00
	 N_IB_UTIL_RCV     	 HCA-1:2     	 0.0 %       	 2015-10-03T08:01:43-05:00
	 N_IB_UTIL_TRANS   	 HCA-1:1     	 0.0 %       	 2015-10-03T08:01:43-05:00
	 N_IB_UTIL_TRANS   	 HCA-1:2     	 0.0 %       	 2015-10-03T08:01:43-05:00
	 N_NIC_KB_RCV_SEC  	 enkx3db01   	 0.6 KB/sec  	 2015-10-03T08:01:43-05:00
	 N_NIC_KB_TRANS_SEC	 enkx3db01   	 0.4 KB/sec  	 2015-10-03T08:01:43-05:00

If you just want to know the average number of processes in the run queue, run the following.

DBMCLI> list metrichistory DS_RUNQ
	 DS_RUNQ	 enkx3db01	 1.5	 2015-10-04T08:00:41-05:00
	 DS_RUNQ	 enkx3db01	 1.3	 2015-10-04T08:01:42-05:00
	 DS_RUNQ	 enkx3db01	 1.3	 2015-10-04T08:02:41-05:00
	 DS_RUNQ	 enkx3db01	 1.6	 2015-10-04T08:03:41-05:00
	 DS_RUNQ	 enkx3db01	 1.6	 2015-10-04T08:04:41-05:00
	 DS_RUNQ	 enkx3db01	 1.7	 2015-10-04T08:05:41-05:00
	 DS_RUNQ	 enkx3db01	 1.8	 2015-10-04T08:06:41-05:00
	 DS_RUNQ	 enkx3db01	 1.4	 2015-10-04T08:07:41-05:00
	 DS_RUNQ	 enkx3db01	 1.3	 2015-10-04T08:08:41-05:00
	 DS_RUNQ	 enkx3db01	 1.7	 2015-10-04T08:09:41-05:00
	 DS_RUNQ	 enkx3db01	 1.6	 2015-10-04T08:10:41-05:00
	 DS_RUNQ	 enkx3db01	 1.6	 2015-10-04T08:11:41-05:00
	 DS_RUNQ	 enkx3db01	 1.7	 2015-10-04T08:12:41-05:00
	 DS_RUNQ	 enkx3db01	 2.2	 2015-10-04T08:13:41-05:00
	 DS_RUNQ	 enkx3db01	 1.8	 2015-10-04T08:14:42-05:00
	 DS_RUNQ	 enkx3db01	 1.6	 2015-10-04T08:15:42-05:00
	 DS_RUNQ	 enkx3db01	 1.4	 2015-10-04T08:16:41-05:00

Similarly, we can check out the history of other metrics, such as:
DS_MEMUT: The percentage of total physical memory used on the server.
DS_SWAP_IN_BY_SEC: The number of swap pages read in KB per second.
DS_SWAP_OUT_BY_SEC: The number of swap pages written in KB per second.
DS_SWAP_USAGE: The percentage of swap space used.
N_HCA_MB_RCV_SEC: The number of MB received by the InfiniBand interfaces per second.
N_HCA_MB_TRANS_SEC: The number of MB transmitted by the InfiniBand interfaces per second.
N_NIC_KB_RCV_SEC: The number of KB received by the Ethernet interfaces per second.
N_NIC_KB_TRANS_SEC: The number of KB transmitted by the Ethernet interfaces per second.
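
To spot-check the latest value of one of these metrics rather than scanning its history, the metriccurrent object should do it; a quick sketch:

DBMCLI> list metriccurrent DS_MEMUT
DBMCLI> list metriccurrent DS_RUNQ detail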

Another useful feature is that DBMCLI can configure email notifications for the database server. I did not perform the following steps myself; the examples below are from the Oracle documentation.

Configure SMTP on the database server
DBMCLI> ALTER DBSERVER smtpServer='my_mail.example.com', -
smtpFromAddr='john.doe@example.com', -
smtpFrom='John Doe', -
smtpToAddr='jane.smith@example.com', -
snmpSubscriber=((host=host1),(host=host2)), -
notificationPolicy='clear', -
notificationMethod='mail,snmp'

Validate e-mail on a database server
DBMCLI> ALTER DBSERVER VALIDATE MAIL

Change the email format
DBMCLI> ALTER DBSERVER emailFormat='text'
DBMCLI> ALTER DBSERVER emailFormat='html'
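
DBMCLI can also restart its own management services, mirroring the CELLCLI syntax; a sketch I have not run on this box:

Restart the management services
DBMCLI> ALTER DBSERVER RESTART SERVICES ALL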

To get further details about DBMCLI, check out the documentation at http://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/app_dbmcli.htm#DBMMN22053.

Export data from Hive table to Oracle Database

In the last few posts, I discussed the following:
1. Install Cloudera Hadoop Cluster using Cloudera Manager
2. Configurations after CDH Installation
3. Load Data to Hive Table
4. Import Data to Hive from Oracle Database

This post will discuss a different operation: exporting data from a Hive table to an Oracle database, just like a goldfish jumping from the fish bowl onto the laptop. Again, we use Sqoop to perform this operation.
fish_export

The following example exports the data from the my_all_objects_sqoop table in Hive to the WZHOU.TEST_IMPORT_FROM_SCOOP table in the DBM database on X3.

1. Create WZHOU.TEST_IMPORT_FROM_SCOOP table in Oracle.

CREATE TABLE WZHOU.TEST_IMPORT_FROM_SCOOP
(
OWNER VARCHAR2(30),
OBJECT_NAME VARCHAR2(30),
OBJECT_ID NUMBER(10),
OBJECT_TYPE VARCHAR2(19),
CREATED_DATE VARCHAR2(30)
);

select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

2. Run the sqoop export job.

sqoop export \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
--username wzhou \
--password wzhou \
--direct \
--export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
--table WZHOU.TEST_IMPORT_FROM_SCOOP \
--fields-terminated-by '\001'

Note: The --export-dir argument specifies the location of the Hive source directory, not an individual filename. --fields-terminated-by '\001' must be included in the command; otherwise, you will see the following error:
Caused by: java.lang.RuntimeException: Can't parse input data: 'SYSTEMLOGMNR_REFCON$7604TABLE2013-03-12'
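
If you are not sure which delimiter a Hive table actually uses, you can check it from the Hive side before running the export; a quick sketch:

hive> describe formatted my_all_objects_sqoop;

Look for a field.delim entry under Storage Desc Params; if none is shown, the table uses Hive's default delimiter \001 (Ctrl-A), which is why the command above passes '\001'.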

Result:

[wzhou@vmhost1 ~]$ sqoop export \
>     --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
>     --username wzhou \
>     --password wzhou \
>     --direct \
>     --export-dir '/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop' \
>     --table WZHOU.TEST_IMPORT_FROM_SCOOP  \
>     --fields-terminated-by '\001' 
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/22 07:02:46 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.3
15/09/22 07:02:46 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/22 07:02:46 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/22 07:02:47 INFO oracle.OraOopOracleQueries: Current schema is: WZHOU
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: 
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
15/09/22 07:02:48 INFO oracle.OraOopManagerFactory: This Oracle database is a RAC.
15/09/22 07:02:49 INFO oracle.OraOopManagerFactory: 	Database time on dbm1 is 2015-09-22 07:02:49.060215
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: 	Database time on dbm2 is 2015-09-22 07:02:50.03952
15/09/22 07:02:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will load-balance sessions across the following instances ofthe Oracle RAC:
	Instance: dbm1 	 URL: jdbc:oracle:thin:@enkx3db01.enkitec.com:1521:dbm1
	Instance: dbm2 	 URL: jdbc:oracle:thin:@enkx3db02.enkitec.com:1521:dbm2

15/09/22 07:02:50 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
15/09/22 07:02:50 INFO tool.CodeGenTool: Beginning code generation
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT "OWNER","OBJECT_NAME","OBJECT_ID","OBJECT_TYPE","CREATED_DATE" FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 0=1
15/09/22 07:02:50 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU_TEST_IMPORT_FROM_SCOOP.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/09/22 07:02:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wzhou/compile/f33da649a2f40c4d01ab1701ed0d21a5/WZHOU.TEST_IMPORT_FROM_SCOOP.jar
15/09/22 07:02:52 INFO mapreduce.ExportJobBase: Beginning export of WZHOU.TEST_IMPORT_FROM_SCOOP
15/09/22 07:02:52 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/09/22 07:02:53 INFO manager.SqlManager: Executing SQL statement: SELECT OWNER,OBJECT_NAME,OBJECT_ID,OBJECT_TYPE,CREATED_DATE FROM WZHOU.TEST_IMPORT_FROM_SCOOP WHERE 1=0
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:53 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/09/22 07:02:53 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/22 07:02:55 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO input.FileInputFormat: Total input paths to process : 5
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: number of splits:4
15/09/22 07:02:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/09/22 07:02:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442921609180_0004
15/09/22 07:02:56 INFO impl.YarnClientImpl: Submitted application application_1442921609180_0004
15/09/22 07:02:56 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442921609180_0004/
15/09/22 07:02:56 INFO mapreduce.Job: Running job: job_1442921609180_0004
15/09/22 07:03:05 INFO mapreduce.Job: Job job_1442921609180_0004 running in uber mode : false
15/09/22 07:03:05 INFO mapreduce.Job:  map 0% reduce 0%
15/09/22 07:03:14 INFO mapreduce.Job:  map 25% reduce 0%
15/09/22 07:03:38 INFO mapreduce.Job:  map 45% reduce 0%
15/09/22 07:03:39 INFO mapreduce.Job:  map 50% reduce 0%
15/09/22 07:04:24 INFO mapreduce.Job:  map 73% reduce 0%
15/09/22 07:04:25 INFO mapreduce.Job:  map 75% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job:  map 100% reduce 0%
15/09/22 07:04:54 INFO mapreduce.Job: Job job_1442921609180_0004 completed successfully
15/09/22 07:04:54 INFO mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=550000
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1176438
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=25
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=4
		Data-local map tasks=1
		Rack-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=102232
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=102232
		Total vcore-seconds taken by all map tasks=102232
		Total megabyte-seconds taken by all map tasks=104685568
	Map-Reduce Framework
		Map input records=22519
		Map output records=22519
		Input split bytes=1033
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=421
		CPU time spent (ms)=6480
		Physical memory (bytes) snapshot=481697792
		Virtual memory (bytes) snapshot=6458949632
		Total committed heap usage (bytes)=127401984
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
15/09/22 07:04:54 INFO mapreduce.ExportJobBase: Transferred 1.1219 MB in 121.1924 seconds (9.4797 KB/sec)
15/09/22 07:04:55 INFO mapreduce.ExportJobBase: Exported 22519 records.

3. Verify the result on Exadata. The row count and everything looks good.

SYS:dbm1> select count(*) from WZHOU.TEST_IMPORT_FROM_SCOOP;

   COUNT(*)
-----------
      22519

SYS:dbm1> select * from WZHOU.TEST_IMPORT_FROM_SCOOP where OWNER='WZHOU' and rownum < 4;

OWNER           OBJECT_NAME      OBJECT_ID   OBJECT_TYPE   CREATED_DATE
--------------- ---------------- ----------- ------------- -----------------
WZHOU           T1               29436       TABLE         2013-07-30
WZHOU           MOVETAB_BASIC    28832       TABLE         2013-07-23
WZHOU           MOVETAB_OLTP     28834       TABLE         2013-07-23
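
As a cross-check, the same count can be taken on the Hive side; the two numbers should match. A quick sketch:

hive> select count(*) from test_oracle.my_all_objects_sqoop;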

In the next post, I am going to discuss how to use Impala to access the same Hive data and compare the performance of Hive and Impala.