Use OEM 13c R2 to Discover Oracle BDA

OEM 13c Cloud Control is a powerful monitoring tool, not only for Exadata and Oracle Database, but also for Oracle Big Data Appliance (BDA). There are many articles and blogs about Exadata discovery using OEM 12c or 13c, but few discuss OEM BDA discovery, especially with the new version, OEM 13c Cloud Control. In this blog, I am going to walk through the steps to discover a BDA using OEM 13c R2.

First, do not use OEM 13c R1 for BDA discovery. It is very time consuming and very likely will not work. OEM 13c R2 is much better: I have been able to complete the BDA discovery successfully on every BDA I have worked on.

Secondly, unlike OEM Exadata discovery, BDA discovery usually requires one extra step before the manual OEM BDA discovery: running the bdacli enable em command first. In theory, if that command succeeds, there is nothing left to do in the manual discovery process. Unfortunately, I have never run into this perfect situation in any BDA environment and always get some kind of error at the end.

Preparation
There are a few useful My Oracle Support notes about OEM BDA discovery:
1) Instructions to Install 12.1.0.4 BDA Plug-in on Oracle Big Data Appliance (BDA) V2.*/V3.0.*/V3.1/V4.* (Doc ID 1682558.1)
2) BDA Credentials for Enterprise Manager 13.x Plugin (Doc ID 2206111.1)
3) Instructions to Enable / Disable the 13.x BDA Enterprise Manager Plug-in on Oracle Big Data Appliance (BDA) V4.5-V4.7 (Doc ID 2206207.1)

Execute bdacli command
Run bdacli enable em. For BDA versions below 4.5, run bdacli enable em --force. I can almost guarantee you will not see a successful completion message from this command. For example, I got the following error at the end.

INFO: Running: /opt/oracle/emcli_home/emcli discover_bda_cluster -hostname=enkx4bda1node01.enkitec.local -cloudera_credential=BDA_ENKX4BDA_CM_CRED -host_credential=BDA_ENKX4BDA_HOSTS_CRED -cisco_credential=BDA_ENKX4BDA_CISCO_CRED -ilom_credential=BDA_ENKX4BDA_ILOM_CRED -infiniband_credential=BDA_ENKX4BDA_IB_CRED -pdu_credential=BDA_ENKX4BDA_PDU_CRED -cisco_snmp_string="snmp_v3;;SNMPV3Creds;authUser:none;authPwd:none;authProtocol:none;privPwd:none" -pdu_snmp_string="snmp_v1v2_v3;;SNMPV1Creds;COMMUNITY:none" -switch_snmp_string="snmp_v1v2_v3;;SNMPV3Creds;authUser:none;authPwd:none;authProtocol:none;privPwd:none"
ERROR: Syntax Error: Unrecognized argument -cisco_snmp_string #Step Syntax Error: Unrecognized argument -pdu_snmp_string#
Are you sure you want to completely cleanup em and lose all related state ?

When you see the above message, always type N so the changes are not rolled back. At this point the OEM agents are deployed; you just need to figure out which node to use as the starting point for the manual OEM BDA discovery.

On each node, run the following command:

[root@enkx4bda1node06 ~]# java -classpath /opt/oracle/EMAgent/agent_13.2.0.0.0/jlib/*:/opt/oracle/EMAgent/agent_13.2.0.0.0/plugins/oracle.sysman.bda.discovery.plugin_13.2.2.0.0/archives/* oracle.sysman.bda.discovery.pojo.GetHadoopClusters http://enkx4bda1node03.enkitec.local:7180/api/v1/clusters admin admin_password
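If you would rather not log on to every node by hand, the same check can be driven from one node in a loop. This is only a sketch: it assumes passwordless SSH as root, a hypothetical nodes.txt file listing the cluster hostnames, and the same agent path, plugin path, Cloudera Manager URL, and credentials as in the command above.

# Hypothetical nodes.txt: one cluster hostname per line
CM_URL=http://enkx4bda1node03.enkitec.local:7180/api/v1/clusters
AGENT=/opt/oracle/EMAgent/agent_13.2.0.0.0
PLUGIN=$AGENT/plugins/oracle.sysman.bda.discovery.plugin_13.2.2.0.0/archives

while read -r host; do
  echo "### $host"
  # Quote the classpath so the remote shell does not expand the wildcards; java handles them itself.
  ssh root@"$host" "java -classpath '$AGENT/jlib/*:$PLUGIN/*' oracle.sysman.bda.discovery.pojo.GetHadoopClusters $CM_URL admin admin_password"
done < nodes.txt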

On most nodes, the execution returns the error below.

Apr 10, 2017 10:14:44 AM com.sun.jersey.api.client.ClientResponse getEntity
SEVERE: A message body reader for Java class [Loracle.sysman.bda.discovery.pojo.Items;, and Java type class [Loracle.sysman.bda.discovery.pojo.Items;, and MIME media type text/html was not found
Apr 10, 2017 10:14:44 AM com.sun.jersey.api.client.ClientResponse getEntity
SEVERE: The registered message body readers compatible with the MIME media type are:
*/* ->
  com.sun.jersey.core.impl.provider.entity.FormProvider
  com.sun.jersey.core.impl.provider.entity.MimeMultipartProvider
  com.sun.jersey.core.impl.provider.entity.StringProvider
  com.sun.jersey.core.impl.provider.entity.ByteArrayProvider
  com.sun.jersey.core.impl.provider.entity.FileProvider
  com.sun.jersey.core.impl.provider.entity.InputStreamProvider
  com.sun.jersey.core.impl.provider.entity.DataSourceProvider
  com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General
  com.sun.jersey.core.impl.provider.entity.ReaderProvider
  com.sun.jersey.core.impl.provider.entity.DocumentProvider
  com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader
  com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader
  com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader
  com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General
  com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General
  com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General
  com.sun.jersey.core.impl.provider.entity.EntityHolderReader

On one node, however, you should see a successful response like the one below.

enkx4bda;;

In my case, it is node 2. So I will use Node 2 for my manual BDA Discovery in the following steps.

Manual OEM BDA Discovery
Log on to OEM as the sysman user. Select Add Target -> Add Target Manually.

Select Add Targets Using Guided Process

Select Oracle Big Data Appliance

The Add Targets Manually page shows up. Select node 2 from the list and click Next.

After it completes, it will show the following hardware information. Click Next.

The Hardware Credentials screen shows up. If all Host credentials show a green check, you don't need to do anything for the hosts. Move on to the next component, for example, the IB switches. Select Set Credentials -> All Infiniband Switches, then set the SNMP credential type and community string. The majority of the time, the community string is public. Then click OK.
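Before typing the community string into OEM, it may save a failed attempt to verify it from a BDA node first. A minimal sketch, assuming snmpwalk is available on the node, the switch hostname is hypothetical, and public is only the common default:

# Walk the standard system subtree (OID 1.3.6.1.2.1.1) on the IB switch with SNMP v2c.
# A response with sysDescr/sysName output means the community string is accepted.
snmpwalk -v 2c -c public enkx4bda1sw-ib1.enkitec.local 1.3.6.1.2.1.1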

If successful, it shows the green check.

Follow a similar procedure for the other hardware components: ILOM, PDU, and the Cisco switch. At the end, you should see the following screen.

One interesting note about the PDUs: they always behave strangely during discovery. In this case, they show successful with a green check, but after the discovery OEM reports the PDUs as DOWN. In other discoveries I have done on different BDA environments, the green check never showed up on this page, yet the PDUs showed UP status after the discovery. The result is simply inconsistent.

Click Next. The Cloudera Manager screen shows up. Click Edit and verify the credential for the Cloudera Manager admin user. Then click Next.

The Software page shows up. Click Next.

The review page shows up. Click Submit.

If successful, you will see the message below. Click OK.

The BDA discovery is complete.
You might notice the new BDA cluster is called BDA Network1. This is not a good way to name a cluster, especially when you have multiple BDAs managed by the same OEM. I don't understand why it doesn't use the BDA's cluster name or Cloudera Manager's cluster name; either would be much better. Even worse, while you can rename many target types in OEM, you can't rename this one. I have another blog (Change BDA Cluster Name in OEM Cloud Control 13c) discussing a partial workaround for this issue.

To view the details of a host target, you can drill down as shown below:

The presentation looks better than OEM 12c. In general, OEM 13c for BDA is good, but pay attention to the following points, otherwise you will spend a lot of additional time.
1) Before performing the OEM BDA discovery, make sure you have changed all of the default passwords on the BDA. It is easier to use default passwords during the discovery, but it becomes a huge pain when you later change the passwords of the accounts used in the discovery. Updating the Named Credentials is not enough; you have to delete the whole BDA target in OEM and redo the discovery.

2) Similarly, if you configure TLS with Cloudera Manager after the BDA discovery, you will have to remove the BDA target and redo the discovery. This is clearly a bug in OEM, or at least it was not fixed at the time I wrote this blog.

3) Sometimes you might see tons of alerts from almost every port on the Cisco switch. If they came from a few ports, I might believe them, but there is no way an alert from almost every port is legitimate. As a matter of fact, Oracle Support confirmed they appear to be false alerts. When I had to redo the BDA discovery after configuring TLS with Cloudera Manager, I noticed that all of the Cisco port alerts were gone after the rediscovery.

4) Both the Oracle documentation and Oracle Support say OEM 13c R2 supports BDA v4.5 and above, and that anything below that is not supported. It is true that lower BDA versions run into additional issues, but I managed to find workarounds and make it work for BDA v4.3.

Finding out Keystore and Truststore Passwords on BDA

I am working on a project that involves configuring SSL with Cloudera Manager on BDA. There are a couple of ways to do it: Oracle's bdacli approach or Cloudera's approach. For BDA-related work, I usually prefer Oracle's approach because it writes some information to BDA's configuration files, which are usually outside the control of Cloudera Manager. Cloudera's approach definitely works as well, but during a BDA upgrade or patching, if mammoth cannot find the correct values in BDA's configuration files, it can cause unnecessary trouble. For example, if mammoth thinks certain features are not enabled, it could skip the steps that disable those features before the upgrade. Anyway, that is a separate topic.

Enabling TLS for Cloudera Manager is pretty easy on BDA; instead of performing the many steps described in the Cloudera Manager documentation, just run the following command:
bdacli enable https_cm_hue_oozie

The command automatically enables TLS for the major web consoles on CDH: Cloudera Manager, Hue, and Oozie. Please note that TLS for the Cloudera Manager agents is enabled automatically during BDA installation. Running this command is usually enough for many clients, since they just need the communication with Cloudera Manager to be encrypted. There is a downside to this approach: BDA uses self-signed certificates generated during the execution of bdacli enable https_cm_hue_oozie. Self-signed certificates are fine for encryption, but they can be annoying because of browser warnings. Therefore some users prefer to use their own CA-signed SSL certificates.
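To see which certificate Cloudera Manager actually presents after the command runs, and to confirm it is self-signed, a quick check with openssl helps. A sketch, assuming the CM host used earlier and the default Cloudera Manager TLS port 7183:

# Print subject, issuer, and validity of the certificate served by Cloudera Manager.
# For a self-signed certificate, subject and issuer are identical.
echo | openssl s_client -connect enkx4bda1node03.enkitec.local:7183 2>/dev/null | openssl x509 -noout -subject -issuer -dates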

After I worked with Eric from Oracle Support, he recommended an approach that is actually pretty well documented in Doc ID 2187903.1: How to Use Certificates Signed by a User's Certificate Authority for Web Consoles and Hadoop Network Encryption Use on the BDA. The key to this approach is getting the keystore's and truststore's paths and passwords, creating a new keystore and truststore, and then importing the customer's certificates. Unfortunately, this approach only works for BDA version 4.5 and above, so it is not going to work in my current client environment, which is on BDA v4.3. One major issue is that BDA v4.5 and above have the following bdacli commands while BDA v4.3 does not:
bdacli getinfo cluster_https_keystore_password
bdacli getinfo cluster_https_truststore_password

Eric then recommended a potential workaround: query the MySQL database directly using the commands below.

use scm;
select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 
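To run the same query non-interactively, a one-liner works as well. A sketch, assuming the Cloudera Manager repository is the local MySQL scm database (as it is on these BDAs) and you know the MySQL root password:

# Prompts for the MySQL root password and prints only the attribute/value pairs.
mysql -u root -p scm -e "SELECT ATTR, VALUE FROM CONFIGS WHERE ATTR IN ('keystore_password','truststore_password');"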

I then used two BDAs in our lab for the verification.
First, I tested on our X4 Starter rack.

[root@enkx4bda1node01 ~]# bdacli getinfo cluster_https_keystore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
******

[root@enkx4bda1node01 ~]# bdacli getinfo cluster_https_truststore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 

Interestingly, the keystore password is still showing ****** while the truststore password is empty. I can understand the empty truststore password, since nothing is configured for the truststore, but the keystore password should not show up as the masked value ******.

Query MySQL db on the same rack.

[root@enkx4bda1node03 ~]# mysql -u root -p
Enter password: 
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| activity_monitor   |
| hive               |
| host_monitor       |
| hue                |
| mysql              |
| navigator          |
| navigator_metadata |
| oozie              |
| performance_schema |
| reports_manager    |
| resource_manager   |
| scm                |
| sentry_db          |
| service_monitor    |
| studio             |
+--------------------+
16 rows in set (0.00 sec)

mysql> use scm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

mysql> select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
| CONFIG_ID | ROLE_ID | ATTR              | VALUE  | SERVICE_ID | HOST_ID | CONFIG_CONTAINER_ID | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | CONTEXT |
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
|         8 |    NULL | keystore_password | ****** |       NULL |    NULL |                   2 |                       2 |                 NULL | NONE    |
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
1 row in set (0.00 sec)

The MySQL database also stores the password as ******. I remember my colleague mentioned this BDA has some issues; this could be one of them.

OK, this rack doesn't really tell me anything, so I moved on to the second BDA, a full rack, and performed the same commands there.

[root@enkbda1node03 ~]# bdacli getinfo cluster_https_keystore_password 
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
KUSld8yni8PMQcJbltvCnZEr2XG4BgKohAfnW6O02jB3tCP8v1DYlbMO5PqhJCVR

[root@enkbda1node03 ~]# bdacli getinfo cluster_https_truststore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 


[root@enkbda1node03 ~]# mysql -u root -p
Enter password: 
mysql> use scm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
| CONFIG_ID | ROLE_ID | ATTR                | VALUE                                                            | SERVICE_ID | HOST_ID | CONFIG_CONTAINER_ID | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | CONTEXT |
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
|         7 |    NULL | keystore_password   | KUSld8yni8PMQcJbltvCnZEr2XG4BgKohAfnW6O02jB3tCP8v1DYlbMO5PqhJCVR |       NULL |    NULL |                   2 |                       0 |                 NULL | NULL    |
|       991 |    NULL | truststore_password | NULL                                                             |       NULL |    NULL |                   2 |                       1 |                 NULL | NONE    |
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
2 rows in set (0.00 sec)

The MySQL database shows the same value as the output of the command bdacli getinfo cluster_https_keystore_password. This is exactly what I wanted to confirm: it looks like I can use a MySQL query to get the passwords I need for my work.
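As a quick confirmation that the value pulled from scm really is the keystore password, it can be tested against the keystore itself. A sketch, assuming the keystore path is the one returned by bdacli getinfo cluster_https_keystore_path (shown in the side note below); note that passing -storepass on the command line leaves the password in the shell history:

# List the keystore using the password retrieved from the scm CONFIGS table.
KS_PATH=/opt/cloudera/security/jks/node.jks      # from bdacli getinfo cluster_https_keystore_path
KS_PASS='KUSld8yni8PMQcJbltvCnZEr2XG4BgKohAfnW6O02jB3tCP8v1DYlbMO5PqhJCVR'
keytool -list -keystore "$KS_PATH" -storepass "$KS_PASS"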

One side note: in case you want to check out the self-signed certificates on the BDA, run the following commands. When prompted for the keystore password, just press ENTER.

[root@enkx4bda1node03 ~]# bdacli getinfo cluster_https_keystore_path
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
/opt/cloudera/security/jks/node.jks

[root@enkx4bda1node03 ~]# keytool -list -v -keystore /opt/cloudera/security/jks/node.jks
Enter keystore password:  

*****************  WARNING WARNING WARNING  *****************
* The integrity of the information stored in your keystore  *
* has NOT been verified!  In order to verify its integrity, *
* you must provide your keystore password.                  *
*****************  WARNING WARNING WARNING  *****************

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

Alias name: enkx4bda1node03.enkitec.local
Creation date: Mar 5, 2016
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=enkx4bda1node03.enkitec.local, OU=, O=, L=, ST=, C=
Issuer: CN=enkx4bda1node03.enkitec.local, OU=, O=, L=, ST=, C=
Serial number: 427dc79f
Valid from: Sat Mar 05 02:17:45 CST 2016 until: Fri Feb 23 02:17:45 CST 2018
Certificate fingerprints:
	 MD5:  A1:F9:78:EE:D4:C7:C0:D0:65:25:4C:30:09:D8:18:6E
	 SHA1: 8B:E3:7B:5F:76:B1:81:33:35:03:B9:00:97:D0:F7:F9:03:F9:74:C2
	 SHA256: EC:B5:F3:EB:E5:DC:D9:19:DB:2A:D6:3E:71:9C:62:55:10:0A:59:59:E6:98:2C:AD:23:AC:24:48:E4:68:6A:AF
	 Signature algorithm name: SHA256withRSA
	 Version: 3

Extensions: 

#1: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 36 D2 3D 49 AF E2 C6 7A   3C C6 14 D5 4D 64 81 F2  6.=I...z<...Md..
0010: 6E F2 2C B6                                        n.,.
]
]

*******************************************
*******************************************

If you don't like this kind of default password, you can use the command keytool -storepasswd -keystore /opt/cloudera/security/jks/node.jks to change it.
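A minimal sketch of the password change; keytool prompts for the current and new passwords when they are not supplied on the command line, and the alias below is just the one shown in the listing above:

# Change the keystore (store) password; prompts for the existing and new store passwords.
keytool -storepasswd -keystore /opt/cloudera/security/jks/node.jks

# If the private key entry carries its own password, change it as well.
keytool -keypasswd -alias enkx4bda1node03.enkitec.local -keystore /opt/cloudera/security/jks/node.jks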

Data Query between BDA and Exadata (Part 2): Query Oracle Table from Hive on BDA

In the last post (Data Query between BDA and Exadata (Part 1): Query Hive Table from Oracle on Exadata), I showed how to use Oracle Big Data SQL to access a Hive table on BDA from an Oracle table. In this post, I am going to show the data flow in the opposite direction: querying data in an Oracle table on Exadata from Hive on BDA.

Step 1: Load some data to an Oracle table on Exadata
On one of the DB nodes on Exadata, run the following in SQL*Plus.

sqlplus wzhou/wzhou
CREATE TABLE ORA_STOCK_PRICE1
(    
    STOCK_DATE  VARCHAR2(20),
    CLOSE_PRICE NUMBER(8,3),
    VOLUME NUMBER(8),
    OPEN_PRICE NUMBER(8,3),
    HIGH_PRICE NUMBER(8,3),
    LOW_PRICE NUMBER(8,3)               
);
INSERT INTO ORA_STOCK_PRICE1 values ( '6/18/16', 23.6,320000,23.62,23.74,23.5 );
INSERT INTO ORA_STOCK_PRICE1 values ( '6/19/16', 23.72,350000,23.73,23.8,23.38 );
COMMIT;

Let’s check out the content in the table.

WZHOU:SQL> set lines 120
WZHOU:SQL> select * from ORA_STOCK_PRICE1; 

STOCK_DATE           CLOSE_PRICE      VOLUME  OPEN_PRICE  HIGH_PRICE   LOW_PRICE
-------------------- ----------- ----------- ----------- ----------- -----------
6/18/16                     23.6      320000       23.62       23.74        23.5
6/19/16                    23.72      350000       23.73        23.8       23.38

Step 2: Create the script for the external table in Hive.
One way to access data in an Oracle table is Oracle Table Access for Hadoop and Spark (OTA4H). OTA4H is an Oracle BDA feature that exposes Oracle tables as Hadoop data sources, allowing direct access to master data in the Oracle database using Hive SQL.

On BDA node, create a Hive external table.

vi hive_create_ext_ora_to_stock_price_test1.hql
DROP TABLE EXT_ORA_STOCK_PRICE_TEST1;
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_ORA_STOCK_PRICE_TEST1 (
    stock_date string,
    close_price float,
    volume int,
    open_price float,
    high_price float,
    low_price float
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
    'oracle.hcat.osh.columns.mapping' = 'stock_date,close_price,volume,open_price,high_price,low_price')
TBLPROPERTIES(
    'mapreduce.jdbc.url'='jdbc:oracle:thin:@enkx3-scan:1521/bigdatadb',
    'mapreduce.jdbc.username'='wzhou',
    'mapreduce.jdbc.password'='wzhou',
    'mapreduce.jdbc.input.table.name'='ORA_STOCK_PRICE1'
);

Step 3: Create the Hive table by running the script

[oracle@enkbda1node01 test]$ export HIVE_AUX_JARS_PATH=/opt/oracle/ota4h/jlib/osh.jar:/opt/oracle/ota4h/jlib/ucp.jar:/opt/oracle/ota4h/jlib/ojdbc7.jar
[oracle@enkbda1node01 test]$ hive -f hive_create_ext_ora_to_stock_price_test1.hql
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
16/11/02 14:29:29 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  Continuing without it.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/jars/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
OK
Time taken: 0.642 seconds
OK
Time taken: 0.381 seconds

Step 4: Verify the result
We should see two rows in the result.

[oracle@enkbda1node01 test]$ hive
hive> select * from EXT_ORA_STOCK_PRICE_TEST1;
OK
6/18/16	23.6	320000	23.62	23.74	23.5
6/19/16	23.72	350000	23.73	23.8	23.38
Time taken: 2.093 seconds, Fetched: 2 row(s)

OK, the result is exactly what I expected.
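As a follow-up usage example, the same external table can be queried non-interactively from the shell with a filter. A sketch; the HIVE_AUX_JARS_PATH export from Step 3 must still be in effect in the session, and the volume threshold is arbitrary:

export HIVE_AUX_JARS_PATH=/opt/oracle/ota4h/jlib/osh.jar:/opt/oracle/ota4h/jlib/ucp.jar:/opt/oracle/ota4h/jlib/ojdbc7.jar
# Ad-hoc query against the OTA4H-backed external table; the WHERE clause value is arbitrary.
hive -e "SELECT stock_date, close_price FROM EXT_ORA_STOCK_PRICE_TEST1 WHERE volume > 330000;"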

Data Query between BDA and Exadata (Part 1): Query Hive Table from Oracle on Exadata

Last year I wrote several posts, listed below, demonstrating how to load data between Hadoop Hive tables and Oracle tables using Sqoop.
Import Data to Hive from Oracle Database
Export data from Hive table to Oracle Database
Use incremental import in sqoop to load data from Oracle (Part I)
Use incremental import in sqoop to load data from Oracle (Part II)

To access data between Big Data Appliance (BDA) and Exadata, there are a few other approaches that can be used in addition to Sqoop. Over the next few blogs, I am going to discuss ways to query a Hive table from Oracle, query an Oracle table from Hive on BDA, and load data from BDA to Exadata.
In the last post, I showed how to install Big Data SQL on Exadata. In this post, I am going to show how to use Big Data SQL to query a Hive table on BDA from Exadata.

Step 1: Load test dataset to HDFS on BDA
I use a small CSV file, stock_price.csv, which contains the price history for one stock. The columns in the file are date, close, volume, open, high, and low. For simplicity, I define the date column as STRING (Hive)/VARCHAR2 (Oracle) instead of a DATE type. Part of the file content is shown below:

[oracle@enkbda1node01 test]$ cat stock_price.csv
date,close,volume,open,high,low
9/23/16,24.05,56837,24.13,24.22,23.88
9/22/16,24.1,56675,23.49,24.18,23.49
9/21/16,23.38,70925,23.21,23.58,23.025
9/20/16,23.07,35429,23.17,23.264,22.98
9/19/16,23.12,34257,23.22,23.27,22.96
9/16/16,23.16,83309,22.96,23.21,22.96
9/15/16,23.01,43258,22.7,23.25,22.53
9/14/16,22.69,33891,22.81,22.88,22.66
9/13/16,22.81,59871,22.75,22.89,22.53
9/12/16,22.85,109145,22.9,22.95,22.74
9/9/16,23.03,115901,23.53,23.53,23.02
9/8/16,23.6,32717,23.8,23.83,23.55
9/7/16,23.85,143635,23.69,23.89,23.69
9/6/16,23.68,43577,23.78,23.79,23.43
....
[oracle@enkbda1node01 test]$ vi stock_price.csv
[oracle@enkbda1node01 test]$ hdfs dfs -mkdir /user/oracle/mytest
[oracle@enkbda1node01 test]$ hdfs dfs -put stock_price.csv /user/oracle/mytest
[oracle@enkbda1node01 test]$ hdfs dfs -ls /user/oracle/mytest
Found 1 items
-rw-r--r--   3 oracle hadoop       2495 2016-10-30 15:36 /user/oracle/mytest/stock_price.csv

Step 2: Create a Hive table using the data file
vi hive_create_stock_price1.hql
Put the following content in the hive_create_stock_price1.hql file.

CREATE EXTERNAL TABLE hive_stock_price1 (
stock_date string,
close_price float,
volume int,
open_price float,
high_price float,
low_price float
)
ROW FORMAT delimited fields terminated by ','
LOCATION '/user/oracle/mytest'
tblproperties ("skip.header.line.count"="1");

Create the table and verify the result.

[oracle@enkbda1node01 test]$ hive -f hive_create_stock_price1.hql
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
16/11/01 15:47:47 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  Continuing without it.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/jars/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
OK
Time taken: 0.839 seconds

[oracle@enkbda1node01 test]$ hive
hive> select * from hive_stock_price1 limit 5;
OK
9/23/16	24.05	56837	24.13	24.22	23.88
9/22/16	24.1	56675	23.49	24.18	23.49
9/21/16	23.38	70925	23.21	23.58	23.025
9/20/16	23.07	35429	23.17	23.264	22.98
9/19/16	23.12	34257	23.22	23.27	22.96
Time taken: 1.93 seconds, Fetched: 5 row(s)

Step 3: Create an Oracle external table on Exadata
Go to another session and log on to one of the DB nodes on Exadata. Create an external table using the type ORACLE_HIVE. The parameter com.oracle.bigdata.cluster refers to the Hadoop cluster name.

sqlplus wzhou/wzhou
CREATE TABLE EXT_HDFS_STOCK_PRICE1
(    
    STOCK_DATE  VARCHAR2(20),
    CLOSE_PRICE NUMBER(8,3),
    VOLUME NUMBER(8),
    OPEN_PRICE NUMBER(8,3),
    HIGH_PRICE NUMBER(8,3),
    LOW_PRICE NUMBER(8,3)               
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
      DEFAULT DIRECTORY "DEFAULT_DIR"
      ACCESS PARAMETERS
      ( com.oracle.bigdata.cluster=enkbda
        com.oracle.bigdata.tablename: hive_stock_price1
      )
)
REJECT LIMIT UNLIMITED ;

Step 4: Query the result

WZHOU:SQL> select * from EXT_HDFS_STOCK_PRICE1 where rownum < 6;
STOCK_DATE           CLOSE_PRICE      VOLUME  OPEN_PRICE  HIGH_PRICE   LOW_PRICE
-------------------- ----------- ----------- ----------- ----------- -----------
date
9/23/16                    24.05       56837       24.13       24.22       23.88
9/22/16                     24.1       56675       23.49       24.18       23.49
9/21/16                    23.38       70925       23.21       23.58      23.025
9/20/16                    23.07       35429       23.17      23.264       22.98

You might notice the first row is actually the header of the CSV file on HDFS. When I created the Hive table, I used the property tblproperties ("skip.header.line.count"="1") to skip the header row when querying the Hive table. Unfortunately, I have not found a way to skip the header row from Big Data SQL. Anyway, it is not a big deal for me; I just need to remember to remove the header row when loading the data to HDFS.
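A simple way to handle that is to strip the header before the file ever lands in HDFS. A minimal sketch using the same file and HDFS directory as above; stock_price_nohdr.csv is just a hypothetical intermediate file name:

# Drop the first line (the CSV header) and overwrite the copy in HDFS with the cleaned file.
tail -n +2 stock_price.csv > stock_price_nohdr.csv
hdfs dfs -put -f stock_price_nohdr.csv /user/oracle/mytest/stock_price.csv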

Configurations after CDH Installation

In the last post, I discussed the steps to install a 3-node Hadoop cluster using Cloudera Manager. In the next few posts, I am going to discuss some technologies that are frequently used, such as Hive, Sqoop, Impala, and Spark.

There are a few things that need to be configured after the CDH Installation.

1. Configure NTPD. Start the ntpd process on every host. Otherwise, Cloudera Manager could display a health check failure: The host's NTP service did not respond to a request for the clock offset.
# service ntpd status
# service ntpd start
# chkconfig ntpd on
# chkconfig --list ntpd
# ntpdc -np

2. Configure the replication factor. As my little cluster has only 2 data nodes, I need to reduce the replication factor from the default of 3 to 2 to avoid the annoying "blocks under-replicated" type of error. First, run the following command to change the replication factor of the existing files to 2.

hadoop fs -setrep -R 2 /

Then go to the HDFS configuration and change Replication Factor to 2 so that new files pick up the new default.
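To confirm nothing is still flagged after the change, the HDFS file system check can be grepped for the under-replicated counter. A sketch, run as the hdfs superuser:

# The summary section of fsck includes an "Under-replicated blocks" counter; it should be 0.
sudo -u hdfs hdfs fsck / | grep -i 'under.replicated'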

3. Change the message logging level from INFO to WARN. I cannot believe how many INFO messages are logged; there is no way I can see a message for more than a few seconds before it is refreshed away by a flood of INFO messages. In my opinion, the majority of the INFO messages are useless and should not be logged in the first place; they read more like DEBUG messages to me. So before my little cluster goes crazy logging tons of useless messages, I need to quickly change the logging level from INFO to WARN. Another painful thing is that there are many log files from the various Hadoop components, located in many different places. I feel like I am sitting in a space shuttle cockpit and need to flip many switches that are not in one central location.
I could find the logging configuration files and fix the parameters one by one, but that would take time and be painful. The easiest way I found is to use Cloudera Manager to make the change. Basically, type "logging level" as the search term; it brings up a long list of components with their logging levels, and you can change them one by one. You will not believe how many logging level parameters are in the system. After the change, it is recommended to restart the cluster because certain parameters are stale.

4. Configure Hue's superuser and password. From the Cloudera Manager screen, click Hue to open the Hue UI. The odd part about Hue is that there is no pre-set superuser for administration: whoever logs on to Hue first becomes the superuser. I don't understand why Hue doesn't just take whatever user and password Cloudera Manager uses. Anyway, to make my life easier, I just use the same login user and password as Cloudera Manager, admin.

5. Add a new user.
By default the hdfs user, not root, is the superuser for HDFS. So before doing any work on Hadoop, it is a good idea to create a separate OS user instead of using the hdfs user to execute Hadoop commands. Run the following commands on EVERY host in the cluster.
a. Logon as root user.
b. Create bigdata group.
# groupadd bigdata
# grep bigdata /etc/group

c. Add the new user, wzhou.
# useradd -G bigdata -m wzhou

If the user existed before the bigdata group was created, do the following:
# usermod -a -G bigdata wzhou

d. Change password
# passwd wzhou

e. Verify the user.
# id wzhou

f. Create the user home directory on HDFS.
# sudo -u hdfs hdfs dfs -mkdir /user/wzhou
# sudo -u hdfs hdfs dfs -ls /user

[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -ls /user
Found 8 items
drwxrwxrwx   - mapred hadoop              0 2015-09-15 05:40 /user/history
drwxrwxr-t   - hive   hive                0 2015-09-15 05:44 /user/hive
drwxrwxr-x   - hue    hue                 0 2015-09-15 10:12 /user/hue
drwxrwxr-x   - impala impala              0 2015-09-15 05:46 /user/impala
drwxrwxr-x   - oozie  oozie               0 2015-09-15 05:47 /user/oozie
drwxr-x--x   - spark  spark               0 2015-09-15 05:41 /user/spark
drwxrwxr-x   - sqoop2 sqoop               0 2015-09-15 05:42 /user/sqoop2
drwxr-xr-x   - hdfs   supergroup          0 2015-09-20 11:23 /user/wzhou

g. Change the ownership of the directory.
# sudo -u hdfs hdfs dfs -chown wzhou:bigdata /user/wzhou
# hdfs dfs -ls /user

[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -chown wzhou:bigdata /user/wzhou
[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -ls /user
Found 8 items
drwxrwxrwx   - mapred hadoop           0 2015-09-15 05:40 /user/history
drwxrwxr-t   - hive   hive             0 2015-09-15 05:44 /user/hive
drwxrwxr-x   - hue    hue              0 2015-09-15 10:12 /user/hue
drwxrwxr-x   - impala impala           0 2015-09-15 05:46 /user/impala
drwxrwxr-x   - oozie  oozie            0 2015-09-15 05:47 /user/oozie
drwxr-x--x   - spark  spark            0 2015-09-15 05:41 /user/spark
drwxrwxr-x   - sqoop2 sqoop            0 2015-09-15 05:42 /user/sqoop2
drwxr-xr-x   - wzhou  bigdata          0 2015-09-20 11:23 /user/wzhou

h. Run a sample test.
Log on as the wzhou user and verify that the user can run a sample MapReduce job from hadoop-mapreduce-examples.jar.

[wzhou@vmhost1 hadoop-mapreduce]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000000
Number of Maps  = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/09/20 11:32:28 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/20 11:32:29 INFO input.FileInputFormat: Total input paths to process : 10
15/09/20 11:32:29 INFO mapreduce.JobSubmitter: number of splits:10
15/09/20 11:32:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442764085933_0001
15/09/20 11:32:30 INFO impl.YarnClientImpl: Submitted application application_1442764085933_0001
15/09/20 11:32:30 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442764085933_0001/
15/09/20 11:32:30 INFO mapreduce.Job: Running job: job_1442764085933_0001
15/09/20 11:32:44 INFO mapreduce.Job: Job job_1442764085933_0001 running in uber mode : false
15/09/20 11:32:44 INFO mapreduce.Job:  map 0% reduce 0%
15/09/20 11:32:55 INFO mapreduce.Job:  map 10% reduce 0%
15/09/20 11:33:03 INFO mapreduce.Job:  map 20% reduce 0%
15/09/20 11:33:11 INFO mapreduce.Job:  map 30% reduce 0%
15/09/20 11:33:18 INFO mapreduce.Job:  map 40% reduce 0%
15/09/20 11:33:26 INFO mapreduce.Job:  map 50% reduce 0%
15/09/20 11:33:34 INFO mapreduce.Job:  map 60% reduce 0%
15/09/20 11:33:42 INFO mapreduce.Job:  map 70% reduce 0%
15/09/20 11:33:50 INFO mapreduce.Job:  map 80% reduce 0%
15/09/20 11:33:58 INFO mapreduce.Job:  map 90% reduce 0%
15/09/20 11:34:06 INFO mapreduce.Job:  map 100% reduce 0%
15/09/20 11:34:14 INFO mapreduce.Job:  map 100% reduce 100%
15/09/20 11:34:14 INFO mapreduce.Job: Job job_1442764085933_0001 completed successfully
15/09/20 11:34:15 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=124
		FILE: Number of bytes written=1258521
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2680
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=43
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=10
		Launched reduce tasks=1
		Data-local map tasks=10
		Total time spent by all maps in occupied slots (ms)=65668
		Total time spent by all reduces in occupied slots (ms)=6387
		Total time spent by all map tasks (ms)=65668
		Total time spent by all reduce tasks (ms)=6387
		Total vcore-seconds taken by all map tasks=65668
		Total vcore-seconds taken by all reduce tasks=6387
		Total megabyte-seconds taken by all map tasks=67244032
		Total megabyte-seconds taken by all reduce tasks=6540288
	Map-Reduce Framework
		Map input records=10
		Map output records=20
		Map output bytes=180
		Map output materialized bytes=360
		Input split bytes=1500
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=360
		Reduce input records=20
		Reduce output records=0
		Spilled Records=40
		Shuffled Maps =10
		Failed Shuffles=0
		Merged Map outputs=10
		GC time elapsed (ms)=1026
		CPU time spent (ms)=8090
		Physical memory (bytes) snapshot=3877482496
		Virtual memory (bytes) snapshot=17644212224
		Total committed heap usage (bytes)=3034685440
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1180
	File Output Format Counters 
		Bytes Written=97
Job Finished in 106.368 seconds
Estimated value of Pi is 3.14158440000000000000

To restart all services in the cluster, you can just click the Restart action on the cluster from the Cloudera Manager screen. However, if you want to start or stop a particular service, you need to know the dependencies between the services. Here is the start/stop order for all services on CDH 5; a sketch of restarting a single service through the Cloudera Manager API follows the two lists.

Startup Sequence
1. Cloudera Management service
2. ZooKeeper
3. HDFS
4. Solr
5. Flume
6. Hbase
7. Key-Value Store Indexer
8. MapReduce or YARN
9. Hive
10. Impala
11. Oozie
12. Sqoop
13. Hue

Stop Sequence
1. Hue
2. Sqoop
3. Oozie
4. Impala
5. Hive
6. MapReduce or YARN
7. Key-Value Store Indexer
8. Hbase
9. Flume
10. Solr
11. HDFS
12. ZooKeeper
13. Cloudera Management Service
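For completeness, a single service can also be restarted from the command line through the Cloudera Manager REST API instead of the UI. This is only a sketch; the host, cluster name, service name, admin credentials, and API version are assumptions you would adapt to your own environment:

# Restart the Hive service via the Cloudera Manager REST API (hypothetical host/cluster/service names).
curl -u admin:admin -X POST "http://vmhost1.local:7180/api/v10/clusters/Cluster%201/services/hive/commands/restart"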

OK, we are good here. In the next post, I am going to discuss loading data into Hive.