Data Query between BDA and Exadata (Part 3): Load data from BDA to Oracle Table on Exadata Using OLH

In the last two blogs, I show how to query data between the two system: Hive on BDA Oracle on Exadata.
Data Query between BDA and Exadata (Part 1): Query Hive Table from Oracle on Exadata
Data Query between BDA and Exadata (Part 2): Query Oracle Table from Hive on BDA

We know we can dump massive raw data to Hadoop, do some analysis or ETL transformation, load the result data to Oracle database. There are many ways to load the data into Oracle database:

  • Copy the result files to a NFS mount that is visible on both BDA and Exadata. Then use sqlldr or Oracle external table to load the data into Oracle database. There are many articles/blogs about sqlldr and Oracle external tables. I am not going to repeat here.
  • Use Sqoop to load the data from HDFS to Oracle. I actually had a blog last year to cover this topic. Export data from Hive table to Oracle Database
  • Oracle Loader for Hadoop (OLH). This is the topic I am going to cover in this blog.
  • OLH is part of Oracle Big Data Connectors. It is a MapReduce utility to load data from Hadoop into Oracle Database. It first sorts, partitions, and converts data into Oracle Database formats on the Hadoop cluster. Then it loads the converted data into the database. One of main advantage of OLH is to preprocess the data on the Hadoop side and reduce the CPU and IO utilization on Oracle database.

    In my test, I am going to reuse the stock_price.csv file from my first blog to load into an Oracle table on Exadata.

    Step 1: Create target Oracle table on Exadata
    This time the table is created in a normal way instead of external table.

    sqlplus wzhou/wzhou
        STOCK_DATE  VARCHAR2(20),
        CLOSE_PRICE NUMBER(8,3),
        VOLUME NUMBER(8),
        OPEN_PRICE NUMBER(8,3),
        HIGH_PRICE NUMBER(8,3),
        LOW_PRICE NUMBER(8,3)               

    Step 2: Clean up OLH output folder
    OLH requires a temporary output folder on HDFS. This directory needs to be removed before the data loading.
    In my example, I am using /user/oracle/oraloadout directory on HDFS.
    [oracle@enkbda1node01 ~]$ hdfs dfs -rm -R -skipTrash /user/oracle/oraloadout
    Deleted /user/oracle/oraloadout

    Step 3: Create a job configuration file

    vi MyOLHJobConf.xml
    <?xml version="1.0" encoding="UTF-8" ?>
    <!--                          Input settings                             -->
     <!--\u002C is ; -->
    <!--                          Output settings                    -->         
    <!--                          Table information     -->                       
    <!--                          Connection information        -->              

    Step 4: Execute the job to load data from HDFS to Oracle table on Exadata

    [oracle@enkbda1node01 test]$ export OLH_HOME=/opt/oracle/oraloader-3.6.0-h2
    [oracle@enkbda1node01 test]$ export HADOOP_CLASSPATH=$OLH_HOME/jlib/*:$HADOOP_CLASSPATH
    [oracle@enkbda1node01 test]$ 
    [oracle@enkbda1node01 test]$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    > -conf MyOLHJobConf.xml -libjars $OLH_HOME/jlib/oraloader.jar
    Oracle Loader for Hadoop Release 3.6.0 - Production
    Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
    16/11/03 05:04:41 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.6.0 - Production
    Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
    16/11/03 05:04:41 INFO loader.OraLoader: Built-Against: hadoop-2.2.0 hive-0.13.0 avro-1.7.6 jackson-1.8.8
    16/11/03 05:04:42 INFO loader.OraLoader: oracle.hadoop.loader.loadByPartition is disabled because table: ORA_TARGET_STOCK_PRICE1 is not partitioned
    16/11/03 05:04:42 INFO loader.OraLoader: oracle.hadoop.loader.enableSorting disabled, no sorting key provided
    16/11/03 05:04:42 INFO loader.OraLoader: Reduce tasks set to 0 because of no partitioning or sorting. Loading will be done in the map phase.
    16/11/03 05:04:42 INFO output.DBOutputFormat: Setting map tasks speculative execution to false for : oracle.hadoop.loader.lib.output.JDBCOutputFormat
    16/11/03 05:04:43 WARN loader.OraLoader: Sampler is disabled because the number of reduce tasks is less than two. Job will continue without sampled information.
    16/11/03 05:04:43 INFO loader.OraLoader: Submitting OraLoader job OraLoader
    16/11/03 05:04:43 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm28
    16/11/03 05:04:43 INFO input.FileInputFormat: Total input paths to process : 1
    16/11/03 05:04:43 INFO mapreduce.JobSubmitter: number of splits:1
    16/11/03 05:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1476296428851_0053
    16/11/03 05:04:43 INFO impl.YarnClientImpl: Submitted application application_1476296428851_0053
    16/11/03 05:04:43 INFO mapreduce.Job: The url to track the job:
    16/11/03 05:04:48 INFO loader.OraLoader: map 0% reduce 0%
    16/11/03 05:04:51 INFO loader.OraLoader: map 100% reduce 0%
    16/11/03 05:04:51 INFO loader.OraLoader: Job complete: OraLoader (job_1476296428851_0053)
    16/11/03 05:04:51 INFO loader.OraLoader: Counters: 34
    	File System Counters
    		FILE: Number of bytes read=0
    		FILE: Number of bytes written=0
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    		HDFS: Number of bytes read=2704
    		HDFS: Number of bytes written=185843
    		HDFS: Number of read operations=69
    		HDFS: Number of large read operations=0
    		HDFS: Number of write operations=8
    	Job Counters 
    		Launched map tasks=1
    		Other local map tasks=1
    		Total time spent by all maps in occupied slots (ms)=2089
    		Total time spent by all reduces in occupied slots (ms)=0
    		Total time spent by all map tasks (ms)=2089
    		Total vcore-seconds taken by all map tasks=2089
    		Total megabyte-seconds taken by all map tasks=2139136
    	Map-Reduce Framework
    		Map input records=66
    		Map output records=65
    		Input split bytes=116
    		Spilled Records=0
    		Failed Shuffles=0
    		Merged Map outputs=0
    		GC time elapsed (ms)=40
    		CPU time spent (ms)=3150
    		Physical memory (bytes) snapshot=474824704
    		Virtual memory (bytes) snapshot=2996711424
    		Total committed heap usage (bytes)=632291328
    	Rows skipped by input error
    		Parse Error=1
    		Total rows skipped by input error=1
    	File Input Format Counters 
    		Bytes Read=2495
    	File Output Format Counters 
    		Bytes Written=39900

    Step 5: Verify the result on Exadata

    sqlplus wzhou/wzhou
    WZHOU:SQL > select count(*) from ORA_TARGET_STOCK_PRICE1;