Data Query between BDA and Exadata (Part 4): Query Oracle Dump File on BDA Using Copy2BDA

ship_ship
In the last three blogs below, I mainly discussed how to query data between BDA and Exadata and a way to load data from BDA to Exadata using OLH.
Data Query between BDA and Exadata (Part 1): Query Hive Table from Oracle on Exadata
Data Query between BDA and Exadata (Part 2): Query Oracle Table from Hive on BDA
Data Query between BDA and Exadata (Part 3): Load data from BDA to Oracle Table on Exadata Using OLH

When moving data from Oracle Database to Hadoop, I discussed the approach using Sqoop in this blog. There is actually another better approach on BDA by using Copy to BDA. Copy to BDA allows you to copy tables from an Oracle database into Hadoop in the format of regular Oracle Data Pump format, dmp file. After generating Data Pump format files from the tables and copying the files to HDFS, you can use Hive to query the data locally without accessing remote Oracle database on Exadata.

Ok, here is the overview of my plan to do the work. I am going to use the same Oracle table, ORA_TARGET_STOCK_PRICE1, used in my last blog on Exadata to generate an Oracle Data Pump dmp file. Then copy the Data Pump format file to the BDA and load it into HDFS. After the dmp file is available on HDFS, link the dmp file to a Hive external table. Then you can query the Oracle Data Pump file directly by using Hive on BDA.

Here are the detail steps:
On Exadata:
Step 1: Create db directory and generate Oracle Data Pump format file

mkdir -p /home/oracle/wzhou/mytest/expdir
sqlplus / as sysdba
CREATE DIRECTORY exp_2hdfs_dir AS '/home/oracle/wzhou/mytest/expdir';
GRANT READ, WRITE ON DIRECTORY exp_2hdfs_dir TO WZHOU;

Step 2: Create data pump format files for the data

sqlplus wzhou/wzhou
WZHOU:SQL> SELECT COUNT(*) FROM ORA_TARGET_STOCK_PRICE1;

   COUNT(*)
-----------
         65
         
WZHOU:SQL> CREATE TABLE EXT_DMP_ORA_TEST2
ORGANIZATION EXTERNAL
(                    
   TYPE oracle_datapump
   DEFAULT DIRECTORY exp_2hdfs_dir
   LOCATION('exa_ora_dmp_test2.dmp')
)
AS SELECT * FROM ORA_TARGET_STOCK_PRICE1;
2    3    4    5    6    7    8  
Table created.

Verify the result

WZHOU:SQL> desc EXT_DMP_ORA_TEST2;
 Name               Null?    Type
 ------------------ -------- -----------------
 STOCK_DATE                  VARCHAR2(20)
 CLOSE_PRICE                 NUMBER(8,3)
 VOLUME                      NUMBER(8)
 OPEN_PRICE                  NUMBER(8,3)
 HIGH_PRICE                  NUMBER(8,3)
 LOW_PRICE                   NUMBER(8,3)

WZHOU:SQL> select count(*) from EXT_DMP_ORA_TEST2;
   COUNT(*)
-----------
         65

WZHOU:SQL> !ls -l /home/oracle/wzhou/mytest/expdir
total 44
-rw-r----- 1 oracle dba 16384 Nov  5 19:12 exa_ora_dmp_test2.dmp
-rw-r--r-- 1 oracle dba   258 Nov  5 19:15 EXT_DMP_ORA_TEST2_180093.log

Step 3: Copy the dump file to Hadoop HDFS

. $ORACLE_HOME/bigdatasql/hadoop_enkbda.env
cd /home/oracle/wzhou/mytest/expdir
ls *.dmp
hadoop fs -mkdir /user/oracle/mytest/dmp
hadoop fs -put *.dmp /user/oracle/mytest/dmp
hadoop fs -ls /user/oracle/mytest/dmp

Ok, I am done with the work on Exadata and move my work to BDA side
On BDA
Step 4: Create the mapping external table in Hive
[root@enkbda1node01 ~]# su – oracle
[oracle@enkbda1node01 ~]$ hive
hive> use default;
OK
Time taken: 0.486 seconds
hive>
CREATE EXTERNAL TABLE ext_ora_dump_test2
ROW FORMAT SERDE ‘oracle.hadoop.hive.datapump.DPSerDe’
STORED AS
INPUTFORMAT ‘oracle.hadoop.hive.datapump.DPInputFormat’
OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION ‘/user/oracle/mytest/dmp’;

OK
Time taken: 0.48 seconds

Step 5: Querying the Data in Hive

hive> DESCRIBE ext_ora_dump_test2;
OK
stock_date          	varchar(20)         	from deserializer   
close_price         	decimal(8,3)        	from deserializer   
volume              	int                 	from deserializer   
open_price          	decimal(8,3)        	from deserializer   
high_price          	decimal(8,3)        	from deserializer   
low_price           	decimal(8,3)        	from deserializer   
Time taken: 0.188 seconds, Fetched: 6 row(s)

hive> show create table  ext_ora_dump_test2;
OK
CREATE EXTERNAL TABLE `ext_ora_dump_test2`(
  `stock_date` varchar(20) COMMENT 'from deserializer', 
  `close_price` decimal(8,3) COMMENT 'from deserializer', 
  `volume` int COMMENT 'from deserializer', 
  `open_price` decimal(8,3) COMMENT 'from deserializer', 
  `high_price` decimal(8,3) COMMENT 'from deserializer', 
  `low_price` decimal(8,3) COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'oracle.hadoop.hive.datapump.DPSerDe' 
STORED AS INPUTFORMAT 
  'oracle.hadoop.hive.datapump.DPInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://enkbda-ns/user/oracle/mytest/dmp'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'numFiles'='0', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1478391818')
Time taken: 0.092 seconds, Fetched: 22 row(s)

hive> select * from ext_ora_dump_test2 limit 3;
OK
9/23/16	24.05	56837	24.13	24.22	23.88
9/22/16	24.1	56675	23.49	24.18	23.49
9/21/16	23.38	70925	23.21	23.58	23.025
Time taken: 0.237 seconds, Fetched: 3 row(s)

Ok, I can get the result from ora dump file in Hive. Nice feature.

Additional Note:

  • In my test, I use only one dmp file. You actually can generate multiple dmp files to speed up the export of dmp file as well as faster data access on Hadoop. The syntax to use multiple dmp files is like LOCATION(‘file1.dmp, file2.dmp’).
  • This feature is quite useful when you want to reuse the same Oracle dmp file on both sides. For example, you can use the dmp file on HDFS to query data directly on BDA without connecting to Oracle database on Exadata. At the same time, if you want to reuse the same data on Exadata, you could easily reimport the same dmp file to Oracle database on Exadata.
  • At the time when the source Oracle table changes, you can regenerate the dmp file on Exadata and then refresh the copy in Hadoop. Copy to BDA is primarily useful for Oracle tables that are relatively static and do not require frequent refreshes.
  • Copy to BDA is licensed under Oracle Big Data SQL and you have to have an Oracle Big Data SQL license before using Copy to BDA feature.