Parquet File Cannot Be Read in Sparkling Water H2O

Over the past few months, I have written several blog posts on H2O topics:
Use Python for H2O
H2O vs Sparkling Water
Sparkling Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

Sparkling Water and H2O perform very well for data science projects. But when something works beautifully in one environment and fails in another, it can be one of the most annoying problems to chase down. That is exactly what happened at one of my clients.

They had been running Sparkling Water and H2O in an Oracle BDA environment and it worked great. However, even a full-rack BDA (18 nodes) was not enough to run their biggest datasets on H2O. Recently they moved to a much bigger CDH cluster (a non-BDA environment) running CDH 5.13. Sparkling Water still worked, but there was one major issue: Parquet files could not be read correctly. The same Parquet files read fine from spark-shell and pyspark. This was a real problem, because Parquet is one of the data formats the client uses heavily. After many failed attempts and a lot of investigation, I finally figured out the issue and implemented a workaround. This post discusses the Parquet reading issue in Sparkling Water and the workaround.

Create Test Data Set
I did the test in my own CDH cluster (CDH 5.13). I first created a small test data set, stock.csv, and uploaded it to the /user/root directory on HDFS.

date,close,volume,open,high,low
9/23/16,24.05,56837,24.13,24.22,23.88
9/22/16,24.1,56675,23.49,24.18,23.49
9/21/16,23.38,70925,23.21,23.58,23.025
9/20/16,23.07,35429,23.17,23.264,22.98
9/19/16,23.12,34257,23.22,23.27,22.96
9/16/16,23.16,83309,22.96,23.21,22.96
9/15/16,23.01,43258,22.7,23.25,22.53
9/14/16,22.69,33891,22.81,22.88,22.66
9/13/16,22.81,59871,22.75,22.89,22.53
9/12/16,22.85,109145,22.9,22.95,22.74
9/9/16,23.03,115901,23.53,23.53,23.02
9/8/16,23.6,32717,23.8,23.83,23.55
9/7/16,23.85,143635,23.69,23.89,23.69
9/6/16,23.68,43577,23.78,23.79,23.43
9/2/16,23.84,31333,23.45,23.93,23.41
9/1/16,23.42,49547,23.45,23.48,23.26

Create a Parquet File
Run the following in spark2-shell to create a Parquet file and verify that it can be read back.

scala> val myTest=spark.read.format("csv").load("/user/root/stock.csv")
myTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> myTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows
 
scala> myTest.write.format("parquet").save("/user/root/mytest.parquet")
 
scala> val readTest = spark.read.format("parquet").load("/user/root/mytest.parquet")
readTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> readTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows

Start a Sparkling Water H2O Cluster
I started a Sparkling Water cluster with two nodes.

[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# . /etc/spark2/conf.cloudera.spark2_on_yarn/spark-env.sh
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.instances=2 \
> --conf spark.executor.memory=1g \
> --conf spark.driver.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
  H2O build version         : 3.16.0.4 (wheeler)
  Spark build version       : 2.2.1
  Scala version             : 2.11
----

/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.128.0.5:4040
Spark context available as 'sc' (master = yarn, app id = application_1518097883047_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark) 
h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * H2O name: sparkling-water-root_application_1518097883047_0001
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,a84-worker--6e693a0a-d92c-4172-81b7-f2f07b6d5d7c.c.cdh-director-194318.internal,54321)
  (2,a84-worker--c0ecccae-cead-44f2-9f75-39aadb1d024a.c.cdh-director-194318.internal,54321)
  ------------------------

  Open H2O Flow in browser: http://10.128.0.5:54321 (CMD + click in Mac OSX)

scala> import h2oContext._
import h2oContext._
scala> 

Read Parquet File from H2O Flow UI
Open the H2O Flow UI and import the same Parquet file.

After clicking Parse these files, the preview showed a corrupted file.

Obviously, the Parquet file was not read correctly. At this point there were no error messages in the H2O console. Continuing to parse the file caused the H2O Flow UI to throw an error, and the H2O console showed the following stack trace:

scala> 18/02/08 09:23:59 WARN servlet.ServletHandler: Error for /3/Parse
java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/api/ReadSupport
	at water.parser.parquet.ParquetParser.correctTypeConversions(ParquetParser.java:104)
	at water.parser.parquet.ParquetParserProvider.createParserSetup(ParquetParserProvider.java:48)
	at water.parser.ParseSetup.getFinalSetup(ParseSetup.java:213)
	at water.parser.ParseDataset.forkParseDataset(ParseDataset.java:119)
	at water.parser.ParseDataset.parse(ParseDataset.java:43)
	at water.api.ParseHandler.parse(ParseHandler.java:36)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at water.api.Handler.handle(Handler.java:63)
	at water.api.RequestServer.serve(RequestServer.java:451)
	at water.api.RequestServer.doGeneric(RequestServer.java:296)
	at water.api.RequestServer.doPost(RequestServer.java:222)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:192)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.hadoop.api.ReadSupport
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 39 more
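
As a side note, the Flow UI is not required to reproduce this failure. Importing the Parquet file directly from the Sparkling Water shell should hit the same missing class. Here is a minimal sketch, assuming the H2OFrame constructor that takes a URI (as I recall it from this Sparkling Water release) and the test file created earlier:

scala> import java.net.URI
scala> // ask H2O (not Spark) to import and parse the Parquet file
scala> val stockHex = new H2OFrame(new URI("hdfs:///user/root/mytest.parquet"))
// expected to fail in this environment with
// java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/api/ReadSupport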

Solution
Since the same Sparkling Water H2O deployment had no issue on BDA, which runs CDH 5.10, I initially focused on the CDH version difference. I built three CDH clusters using three different CDH versions: 5.13, 5.12 and 5.10. All of them showed exactly the same error. That ruled out a CDH version difference, so I shifted focus to environment differences, especially the classpath and jar files. I tried setting JAVA_HOME, SPARK_HOME and SPARK_DIST_CLASSPATH, and unfortunately none of them worked.

I also noticed that /etc/spark2/conf.cloudera.spark2_on_yarn/classpath.txt seemed to have far fewer entries than the classpath.txt under Spark 1.6, so I tried adding back the missing entries. Still no luck.
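
A quick way to confirm the suspicion is to ask the JVM directly whether the missing classes are visible at all. This is only a sketch, and it only probes the driver JVM, not the executors where the H2O nodes actually run, but it is a cheap first check from the Sparkling Water shell:

scala> import scala.util.Try
scala> // false if the Parquet reader/writer classes are not on the driver classpath
scala> Try(Class.forName("org.apache.parquet.hadoop.api.ReadSupport")).isSuccess
scala> Try(Class.forName("org.apache.parquet.hadoop.ParquetFileWriter")).isSuccess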

I then added two more parameters to get more detail in the H2O logs.

--conf spark.ext.h2o.node.log.level=INFO \
--conf spark.ext.h2o.client.log.level=INFO \

This gave a little more useful information: the log complained that class ParquetFileWriter could not be found.

$ cat h2o_10.54.225.9_54000-5-error.log
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR: DistributedException from /192.168.10.54:54005: 'org/apache/parquet/hadoop/ParquetFileWriter', caused by java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetFileWriter
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:478)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:486)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.doAll(MRTask.java:402)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.parser.ParseSetup.guessSetup(ParseSetup.java:283)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:40)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

My client found a temporary solution using h2odriver.jar, following the instructions in Using H2O on Hadoop. The command used is shown below:

cd /opt/h2o-3
hadoop jar h2odriver.jar -nodes 70 -mapperXmx 40g -output hdfs://PROD01ns/user/mytestuser/log24

Although this solution provides functionality similar to Sparkling Water, it has some critical performance issues:
1. The above command creates a 70-node H2O cluster. With Sparkling Water, the work would be distributed evenly across all available nodes, but the h2odriver.jar approach heavily loads just a few Hadoop nodes. For a big dataset, the majority of the activity happened on only 3 to 4 nodes, driving their CPU utilization close to 100%. For one large test dataset, parsing the file never completed; it failed after running for 20 minutes.
2. Unlike Sparkling Water, it actually reads the files during the parsing phase, not during the import phase.
3. The performance is quite poor compared with Sparkling Water. My guess is that Sparkling Water uses the underlying Spark engine to distribute the load evenly.

In any case, the hadoop jar h2odriver.jar approach is not an ideal workaround for this issue.

Then I happened to read this blog post: Incorrect results with corrupt binary statistics with the new Parquet reader. The article has nothing to do with my issue, but it mentions Parquet v1.8. I remembered seeing a note from a Sparkling Water developer discussing integrating with Parquet v1.8 in the future to address a certain Parquet issue in H2O; unfortunately I could not find the link to that discussion anymore. But it made me think that the issue might be that Sparkling Water depends on a particular Parquet library that the current environment does not have. The standard CDH distribution and Spark2 seem to use Parquet v1.5. Oracle BDA has a lot more software installed and may simply happen to have the right library somewhere. Since the H2O jar file appears to bundle this library, what would happen if I included the H2O jar in Sparkling Water?

With this idea in mind, I downloaded H2O from http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/4/index.html. After unzipping the file, h2o.jar is the file I need. I then modified sparkling-shell and changed the last line of the script as follows, adding the h2o.jar file to the --jars parameter.

#spark-shell --jars "$FAT_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"
H2O_JAR_FILE="/home/mytestuser/install/h2o-3.16.0.4/h2o.jar"
spark-shell --jars "$FAT_JAR_FILE,$H2O_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"

I restarted my H2O cluster, and it worked!
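
As a quick sanity check (a sketch reusing the small test file from earlier and the same hypothetical H2OFrame import as above), the load that used to fail now goes through:

scala> // after re-importing org.apache.spark.h2o._ and running H2OContext.getOrCreate(spark) as before
scala> import java.net.URI
scala> val stockHex = new H2OFrame(new URI("hdfs:///user/root/mytest.parquet"))
scala> stockHex.numRows()   // should match the row count of the small stock data set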

Finally, after many days of work, Sparkling Water works again in the new cluster. Reloading the big test dataset took less than one minute with only 24 H2O nodes, and the load was distributed evenly across the cluster. Problem solved!

Create Cloudera Hadoop Cluster Using Cloudera Director on Google Cloud

Several years ago I wrote a blog post about installing a Cloudera Hadoop cluster; it took at least half a day to complete the installation in my VM cluster. In my last post, I discussed deploying a Hadoop cluster using Dataproc on Google Cloud Platform, which literally took less than two minutes. Although it is nice to have a cluster launched that quickly, Dataproc does not provide a UI like Cloudera Manager, because its Hadoop distribution is not CDH. I could repeat my earlier posts and build a Hadoop cluster on VM instances on Google Cloud Platform, but that takes time and involves a lot of work. There is another way to create a Hadoop cluster in the cloud: Cloudera has a product called Cloudera Director. It currently supports not only Google Cloud but AWS and Azure as well. It is designed to deploy CDH clusters faster and to make scaling the cluster in the cloud easier. Another important feature is that Cloudera Director lets you move your deployment scripts or steps easily from one cloud provider to another, so you are not locked into a single vendor. In this post, I will show how to create a CDH cluster using Cloudera Director.

The first step is to start my Cloudera Director instance. In my case, I had already installed Cloudera Director following Cloudera's instructions; it is a pretty straightforward process and I am not going to repeat it here. The Cloudera Director instance is where you launch your CDH cluster deployment.

Both the Cloudera Director and Cloudera Manager UIs are browser-based, so you have to set up a secure connection between your local machine and the VM instances in the cloud. To do this, configure a SOCKS proxy on your local machine for the connection to the Cloudera Director VM. It provides a secure way to connect to your VM in the cloud and lets you use the VM's internal IP and hostname in the web browser. Google has a nice note about the steps, Securely Connecting to VM Instances, which will walk you through setting up the SOCKS proxy.

OK, here are the steps.
Log on to Cloudera Director
Open a terminal session locally and run the following command:

gcloud compute ssh cdh-director-1 \
    --project cdh-director-173715 \
    --zone us-central1-c \
    --ssh-flag="-D" \
    --ssh-flag="1080" \
    --ssh-flag="-N"    

cdh-director-1 is the name of my Cloudera Director instance on Google Cloud and cdh-director-173715 is my Google Cloud project ID. After executing the above command, it appears to hang and never completes. This is CORRECT behavior; do not kill or exit this session. Open a browser and enter the internal IP of the Cloudera Director instance with port number 7189. For my cdh-director-1 instance, the internal IP is 10.128.0.2.

After entering the URL http://10.128.0.2:7189 for Cloudera Director, the login screen shows up. Log in as the admin user.

Deployment
After logging in, the initial setup wizard shows up. Click Let's get started.

In the Add Environment screen, input the information as follows. The Client ID JSON Key is the file you created during the initial setup of your Google Cloud project, along with the SSH keys.

In the next Add Cloudera Manager screen, I usually create the instance templates first. Click the Instance Template drop-down, then select Create a new instance template. I need at least three templates: one for Cloudera Manager, one for master nodes, and one for worker nodes. To save resources in my Google Cloud environment, I did not create a template for edge nodes. Here are the configurations for all three templates.

Cloudera Manager Template

Master Node Template

Worker Node Template

Input the following for Cloudera Manager. For my test, I used the embedded database. For production, you need to set up an external database first and register it here.

After clicking Continue, the Add Cluster screen shows up. There is a gateway instance group, which I removed by clicking Delete Group because I have no edge nodes here. Input the corresponding template and number of instances for masters and workers.

After clicking Continue, the deployment starts.

After about 20 minutes, it completes. Click Continue.

Review Cluster
The nice Cloudera Director dashboard shows up.

You can also log in to Cloudera Manager from the link in Cloudera Director.

Nice and easy, and an excellent product from Cloudera. For more information about deploying a CDH cluster on Google Cloud, you can also check out Cloudera's documentation, Getting Started on Google Cloud Platform.

Use OEM 13c R2 to Discover Oracle BDA

OEM 13c Cloud Control is a powerful monitoring tool, not only for Exadata and Oracle Database but also for Oracle Big Data Appliance (BDA). There are many articles and blog posts about Exadata discovery using OEM 12c or 13c, but few discuss OEM BDA discovery, especially with the new version of OEM, 13c Cloud Control. In this post, I am going to discuss the steps to discover a BDA using OEM 13c R2.

First, do not use OEM 13c R1 for BDA discovery. It is very time-consuming and very likely will not work. OEM 13c R2 is much better; at least I was able to complete the BDA discovery successfully on all of the BDAs I have worked on.

Secondly, unlike OEM Exadata discovery, BDA discovery usually requires one extra step before the manual OEM BDA discovery: running the bdacli enable em command first. In theory, if that command works, nothing needs to be done in the manual BDA discovery process. Unfortunately, I have never run into this perfect situation in any BDA environment and always get some kind of error at the end.

Preparation
There are a few useful notes about OEM BDA Discovery.
1) Instructions to Install 12.1.0.4 BDA Plug-in on Oracle Big Data Appliance (BDA) V2.*/V3.0.*/V3.1/V4.* (Doc ID 1682558.1)
2) BDA Credentials for Enterprise Manager 13.x Plugin (Doc ID 2206111.1)
3) Instructions to Enable / Disable the 13.x BDA Enterprise Manager Plug-in on Oracle Big Data Appliance (BDA) V4.5-V4.7 (Doc ID 2206207.1)

Execute bdacli command
Run bdacli enable em. For BDA versions below 4.5, run bdacli enable em --force. I can almost guarantee that you will not see a successful completion message from this command. For example, I got the following error at the end.

INFO: Running: /opt/oracle/emcli_home/emcli discover_bda_cluster -hostname=enkx4bda1node01.enkitec.local -cloudera_credential=BDA_ENKX4BDA_CM_CRED -host_credential=BDA_ENKX4BDA_HOSTS_CRED -cisco_credential=BDA_ENKX4BDA_CISCO_CRED -ilom_credential=BDA_ENKX4BDA_ILOM_CRED -infiniband_credential=BDA_ENKX4BDA_IB_CRED -pdu_credential=BDA_ENKX4BDA_PDU_CRED -cisco_snmp_string="snmp_v3;;SNMPV3Creds;authUser:none;authPwd:none;authProtocol:none;privPwd:none" -pdu_snmp_string="snmp_v1v2_v3;;SNMPV1Creds;COMMUNITY:none" -switch_snmp_string="snmp_v1v2_v3;;SNMPV3Creds;authUser:none;authPwd:none;authProtocol:none;privPwd:none"
ERROR: Syntax Error: Unrecognized argument -cisco_snmp_string #Step Syntax Error: Unrecognized argument -pdu_snmp_string#
Are you sure you want to completely cleanup em and lose all related state ?

When you see the above message, always type N and do not roll back the changes. At this point the OEM agent is deployed; you just need to figure out which node to use as the starting point for the manual OEM BDA discovery.

On each node, run the following command:

[root@enkx4bda1node06 ~]# java -classpath /opt/oracle/EMAgent/agent_13.2.0.0.0/jlib/*:/opt/oracle/EMAgent/agent_13.2.0.0.0/plugins/oracle.sysman.bda.discovery.plugin_13.2.2.0.0/archives/* oracle.sysman.bda.discovery.pojo.GetHadoopClusters http://enkx4bda1node03.enkitec.local:7180/api/v1/clusters admin admin_password

On many of the nodes, you will see the error below.

Apr 10, 2017 10:14:44 AM com.sun.jersey.api.client.ClientResponse getEntity
SEVERE: A message body reader for Java class [Loracle.sysman.bda.discovery.pojo.Items;, and Java type class [Loracle.sysman.bda.discovery.pojo.Items;, and MIME media type text/html was not found
Apr 10, 2017 10:14:44 AM com.sun.jersey.api.client.ClientResponse getEntity
SEVERE: The registered message body readers compatible with the MIME media type are:
*/* ->
  com.sun.jersey.core.impl.provider.entity.FormProvider
  com.sun.jersey.core.impl.provider.entity.MimeMultipartProvider
  com.sun.jersey.core.impl.provider.entity.StringProvider
  com.sun.jersey.core.impl.provider.entity.ByteArrayProvider
  com.sun.jersey.core.impl.provider.entity.FileProvider
  com.sun.jersey.core.impl.provider.entity.InputStreamProvider
  com.sun.jersey.core.impl.provider.entity.DataSourceProvider
  com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General
  com.sun.jersey.core.impl.provider.entity.ReaderProvider
  com.sun.jersey.core.impl.provider.entity.DocumentProvider
  com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader
  com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader
  com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader
  com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General
  com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General
  com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General
  com.sun.jersey.core.impl.provider.entity.EntityHolderReader

On a certain node, though, you will see a successful message like the one below.

enkx4bda;;

In my case, it is node 2, so I will use node 2 for the manual BDA discovery in the following steps.

Manual OEM BDA Discovery
Log on to OEM as the sysman user. Select Add Target -> Add Target Manually.

Select Add Targets Using Guided Process

Select Oracle Big Data Appliance

The Add Targets Manually page shows up. Select node 2 from the list. Click Next.

After it completes, it will show the following hardware information. Click Next.

The Hardware Credentials screen shows up. If all host credentials show a green check, you don't need to do anything for the hosts. Go to the next component, for example the IB switch. Select Set Credentials -> All Infiniband Switches, then set the SNMP credentials type and community string. Most of the time, public is the community string. Then click OK.

If successful, it shows the green check.

Follow a similar procedure for all other hardware components, such as ILOM, PDU and the Cisco switch. At the end, you should see the following screen.
One interesting note about the PDUs: the PDU component always behaves oddly during discovery. In this case, it showed the successful green check, but after the discovery OEM reported the PDUs as DOWN. In other discoveries I have done for different BDA environments, the green check never appeared on this page, yet the PDUs showed an UP status after the discovery. The result is simply inconsistent.

Click Next. The Cloudera Manager screen shows up. Click Edit and verify the credentials for the Cloudera Manager admin user, then click Next.

The Software page shows up. Click Next.

The review page shows up. Click Submit.

If successful, you will see the message below. Click OK.

The BDA discovery is now complete.

You might notice the new BDA cluster is called BDA Network1. This is not a good way to name a cluster, especially when you have multiple BDAs managed by the same OEM. I don't understand why the BDA's cluster name or Cloudera Manager's cluster name is not used; either would be much better. Even worse, while you can change many target names in OEM, you cannot change this one. I have another blog post (Change BDA Cluster Name in OEM Cloud Control 13c) discussing a partial workaround for this issue.

Viewing the details of a host target looks like the following:

The presentation looks better than OEM 12c. In general, OEM 13c for BDA is good, but pay attention to the following points; otherwise you will spend a lot of additional time.
1) Before performing OEM BDA discovery, make sure you have changed all of the default passwords on the BDA. It is easier to use the default passwords during discovery, but it becomes a huge pain if you later change the passwords for accounts used in the discovery. Updating the Named Credentials is not enough: you have to delete the whole BDA target in OEM and redo the discovery.

2) Similarly, if you configure TLS with Cloudera Manager after the BDA discovery, you will have to remove the BDA target and redo the discovery. This is clearly a bug in OEM, at least not fixed at the time I am writing this post.

3) Sometimes you might see tons of alerts from almost every port on the Cisco switch. If it were just a few ports, I might believe it, but with almost every port alerting there is no way these alerts are real. As a matter of fact, Oracle Support confirmed they appear to be false alerts. When I had to redo the BDA discovery after configuring TLS with Cloudera Manager, I noticed that all of the Cisco port alerts were gone after the rediscovery.

4) Both Oracle documentation and Oracle Support say OEM 13c R2 supports BDA v4.5 and above, and that earlier versions are not supported. It is true that lower BDA versions run into additional issues, but I managed to find workarounds and make it work for BDA v4.3.

Finding out Keystore and Truststore Passwords on BDA

I am working on a project that involves configuring SSL with Cloudera Manager on BDA. There are several ways to do it: Oracle's bdacli approach or Cloudera's approach. For BDA-related work, I usually prefer Oracle's approach, because it writes some information to Oracle BDA's configuration files, which are usually outside the control of Cloudera Manager. Cloudera's approach definitely works as well, but during a BDA upgrade or patching, if mammoth cannot find the correct values in BDA's configuration files, it might cause unnecessary trouble. For example, if mammoth thinks certain features are not enabled, it could skip the steps that disable those features before the upgrade. Anyway, that is another topic.

Enabling TLS for Cloudera Manager on BDA is pretty easy, compared with the many steps described in Cloudera Manager's documentation. On BDA, just run the following command:
bdacli enable https_cm_hue_oozie

The command automatically enables TLS for all the major services on CDH, such as Cloudera Manager, Hue and Oozie. Please note: TLS on the Cloudera Manager agents is automatically enabled during BDA installation. Running this command is usually enough for many clients, since they just need the communication with Cloudera Manager to be encrypted. There is a downside to this approach: BDA uses self-signed certificates during the execution of bdacli enable https_cm_hue_oozie. Self-signed certificates are fine for security, but the browser warnings can be annoying, so some users prefer to use their own signed SSL certificates.

After I worked with Eric from Oracle Support, he recommended an approach that is actually well documented in Doc ID 2187903.1: How to Use Certificates Signed by a User's Certificate Authority for Web Consoles and Hadoop Network Encryption Use on the BDA. The key to this approach is getting the keystore and truststore paths and passwords, creating a new keystore and truststore, and then importing the customer's certificates. Unfortunately, this approach only works for BDA version 4.5 and above, so it is not going to work in my current client environment, which uses BDA v4.3. One major issue is that BDA v4.5 and above have the following bdacli commands, while BDA v4.3 does not:
bdacli getinfo cluster_https_keystore_password
bdacli getinfo cluster_https_truststore_password

Eric then recommended a potential workaround: query the MySQL database directly using the statements below:

use scm;
select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 

I then used two BDAs in our lab for the verification.
First, I tested on our X4 Starter rack.

[root@enkx4bda1node01 ~]# bdacli getinfo cluster_https_keystore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
******

[root@enkx4bda1node01 ~]# bdacli getinfo cluster_https_truststore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 

Interestingly, the keystore password shows as ****** while the truststore password is empty. I can understand the empty truststore password, since nothing is configured for the truststore, but the keystore password should not show up as the masked value ******.

Next, query the MySQL database on the same rack.

[root@enkx4bda1node03 ~]# mysql -u root -p
Enter password: 
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| activity_monitor   |
| hive               |
| host_monitor       |
| hue                |
| mysql              |
| navigator          |
| navigator_metadata |
| oozie              |
| performance_schema |
| reports_manager    |
| resource_manager   |
| scm                |
| sentry_db          |
| service_monitor    |
| studio             |
+--------------------+
16 rows in set (0.00 sec)

mysql> use scm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

mysql> select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
| CONFIG_ID | ROLE_ID | ATTR              | VALUE  | SERVICE_ID | HOST_ID | CONFIG_CONTAINER_ID | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | CONTEXT |
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
|         8 |    NULL | keystore_password | ****** |       NULL |    NULL |                   2 |                       2 |                 NULL | NONE    |
+-----------+---------+-------------------+--------+------------+---------+---------------------+-------------------------+----------------------+---------+
1 row in set (0.00 sec)

The MySQL database also stores the password as ******. I remember a colleague mentioning that this BDA has some issues; this could be one of them.

OK, this rack does not really tell me anything, so I moved on to the second BDA, a full rack, and ran the same commands there.

[root@enkbda1node03 ~]# bdacli getinfo cluster_https_keystore_password 
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
KUSld8yni8PMQcJbltvCnZEr2XG4BgKohAfnW6O02jB3tCP8v1DYlbMO5PqhJCVR

[root@enkbda1node03 ~]# bdacli getinfo cluster_https_truststore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 


[root@enkbda1node03 ~]# mysql -u root -p
Enter password: 
mysql> use scm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from CONFIGS where ATTR = 'truststore_password' or ATTR = 'keystore_password'; 
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
| CONFIG_ID | ROLE_ID | ATTR                | VALUE                                                            | SERVICE_ID | HOST_ID | CONFIG_CONTAINER_ID | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | CONTEXT |
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
|         7 |    NULL | keystore_password   | KUSld8yni8PMQcJbltvCnZEr2XG4BgKohAfnW6O02jB3tCP8v1DYlbMO5PqhJCVR |       NULL |    NULL |                   2 |                       0 |                 NULL | NULL    |
|       991 |    NULL | truststore_password | NULL                                                             |       NULL |    NULL |                   2 |                       1 |                 NULL | NONE    |
+-----------+---------+---------------------+------------------------------------------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+
2 rows in set (0.00 sec)

The MySQL database shows the same value as the result of the bdacli getinfo cluster_https_keystore_password command. This is exactly what I wanted to know: it looks like I can use a MySQL query to get the passwords I need for my work.

One side note: in case you want to check out those self-signed certificates on BDA, run the following commands. When prompted for a password, just press ENTER.

[root@enkx4bda1node03 ~]# bdacli getinfo cluster_https_keystore_path
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
/opt/cloudera/security/jks/node.jks

[root@enkx4bda1node03 ~]# keytool -list -v -keystore /opt/cloudera/security/jks/node.jks
Enter keystore password:  

*****************  WARNING WARNING WARNING  *****************
* The integrity of the information stored in your keystore  *
* has NOT been verified!  In order to verify its integrity, *
* you must provide your keystore password.                  *
*****************  WARNING WARNING WARNING  *****************

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

Alias name: enkx4bda1node03.enkitec.local
Creation date: Mar 5, 2016
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=enkx4bda1node03.enkitec.local, OU=, O=, L=, ST=, C=
Issuer: CN=enkx4bda1node03.enkitec.local, OU=, O=, L=, ST=, C=
Serial number: 427dc79f
Valid from: Sat Mar 05 02:17:45 CST 2016 until: Fri Feb 23 02:17:45 CST 2018
Certificate fingerprints:
	 MD5:  A1:F9:78:EE:D4:C7:C0:D0:65:25:4C:30:09:D8:18:6E
	 SHA1: 8B:E3:7B:5F:76:B1:81:33:35:03:B9:00:97:D0:F7:F9:03:F9:74:C2
	 SHA256: EC:B5:F3:EB:E5:DC:D9:19:DB:2A:D6:3E:71:9C:62:55:10:0A:59:59:E6:98:2C:AD:23:AC:24:48:E4:68:6A:AF
	 Signature algorithm name: SHA256withRSA
	 Version: 3

Extensions: 

#1: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 36 D2 3D 49 AF E2 C6 7A   3C C6 14 D5 4D 64 81 F2  6.=I...z<...Md..
0010: 6E F2 2C B6                                        n.,.
]
]

*******************************************
*******************************************

If you don't like this kind of default password, you can run keytool -storepasswd -keystore /opt/cloudera/security/jks/node.jks to change it.

Data Query between BDA and Exadata (Part 2): Query Oracle Table from Hive on BDA

In the last post (Data Query between BDA and Exadata (Part 1): Query Hive Table from Oracle on Exadata), I showed how to use Oracle Big Data SQL to access a Hive table on BDA from an Oracle table. In this post, I am going to show the data flow in the opposite direction: querying data in an Oracle table on Exadata from Hive on BDA.

Step 1: Load some data to an Oracle table on Exadata
On one of the DB nodes on Exadata, run the following in SQL*Plus.

sqlplus wzhou/wzhou
CREATE TABLE ORA_STOCK_PRICE1
(    
    STOCK_DATE  VARCHAR2(20),
    CLOSE_PRICE NUMBER(8,3),
    VOLUME NUMBER(8),
    OPEN_PRICE NUMBER(8,3),
    HIGH_PRICE NUMBER(8,3),
    LOW_PRICE NUMBER(8,3)               
);
INSERT INTO ORA_STOCK_PRICE1 values ( '6/18/16', 23.6,320000,23.62,23.74,23.5 );
INSERT INTO ORA_STOCK_PRICE1 values ( '6/19/16', 23.72,350000,23.73,23.8,23.38 );
COMMIT;

Let's check the contents of the table.

WZHOU:SQL> set lines 120
WZHOU:SQL> select * from ORA_STOCK_PRICE1; 

STOCK_DATE           CLOSE_PRICE      VOLUME  OPEN_PRICE  HIGH_PRICE   LOW_PRICE
-------------------- ----------- ----------- ----------- ----------- -----------
6/18/16                     23.6      320000       23.62       23.74        23.5
6/19/16                    23.72      350000       23.73        23.8       23.38

Step 2: Create the script for the external table in Hive.
To access data in an Oracle table from Hive, one option is Oracle Table Access for Hadoop and Spark (OTA4H). OTA4H is an Oracle BDA feature that exposes Oracle tables as Hadoop data sources, allowing direct access to master data in the Oracle database using Hive SQL.

On a BDA node, create a Hive external table.

vi hive_create_ext_ora_to_stock_price_test1.hql
DROP TABLE EXT_ORA_STOCK_PRICE_TEST1;
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_ORA_STOCK_PRICE_TEST1 (
    stock_date string,
    close_price float,
    volume int,
    open_price float,
    high_price float,
    low_price float
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
    'oracle.hcat.osh.columns.mapping' = 'stock_date,close_price,volume,open_price,high_price,low_price')
TBLPROPERTIES(
    'mapreduce.jdbc.url'='jdbc:oracle:thin:@enkx3-scan:1521/bigdatadb',
    'mapreduce.jdbc.username'='wzhou',
    'mapreduce.jdbc.password'='wzhou',
    'mapreduce.jdbc.input.table.name'='ORA_STOCK_PRICE1'
);

Step 3: Create the Hive table

[oracle@enkbda1node01 test]$ export HIVE_AUX_JARS_PATH=/opt/oracle/ota4h/jlib/osh.jar:/opt/oracle/ota4h/jlib/ucp.jar:/opt/oracle/ota4h/jlib/ojdbc7.jar
[oracle@enkbda1node01 test]$ hive -f hive_create_ext_ora_to_stock_price_test1.hql
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
16/11/02 14:29:29 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  Continuing without it.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/jars/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
OK
Time taken: 0.642 seconds
OK
Time taken: 0.381 seconds

Step 4: Verify the result
We should see two rows in the result.

[oracle@enkbda1node01 test]$ hive
hive> select * from EXT_ORA_STOCK_PRICE_TEST1;
OK
6/18/16	23.6	320000	23.62	23.74	23.5
6/19/16	23.72	350000	23.73	23.8	23.38
Time taken: 2.093 seconds, Fetched: 2 row(s)

OK, the result is exactly what I expected.