Cold Failover Single Instance DB to Different Node on ODA


Oracle Database Appliance (ODA) is the entry-level Oracle engineered system. It is like Exadata, but without cell nodes and InfiniBand; it has two db nodes and shared storage with 24 disks. Both ODA and Exadata use the same Grid Infrastructure to manage ASM across multiple database nodes. Since the ASM db files can be accessed by all db nodes, companies usually implement the RAC or RAC One Node options to handle failover in case of hardware failure. As both ODA and Exadata can also support single-instance databases, it made me wonder how failover works in that case and how I can bring up the database on a different node after a hardware or similar failure.

As many people know, failover with RAC and RAC One Node is super easy: you don't need to do anything, and if a db node goes down the db can continue to run on the other db node. The difference is that RAC performs a hot failover while RAC One Node performs a cold failover. I wouldn't expect a single-instance (non-RAC) db to do a hot failover, but some kind of cold failover is fine with me. It's better than nothing.
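The difference between the options shows up in the database's Grid Infrastructure configuration. As a hedged sketch (the helper function name is mine, and the canned sample mimics the srvctl config database output shown later in this post), you can classify a database's failover behavior from its Type line:

```shell
# Sketch: classify failover behavior from "srvctl config database" output.
# "Type: RAC" = hot failover, "Type: RACOneNode" = managed cold failover,
# "Type: SINGLE" = no failover unless you relocate the db yourself.
db_type() {
  # reads srvctl config output on stdin, prints the value of the Type: line
  grep '^Type:' | awk '{print $2}'
}

# On a real cluster you would pipe the live output, e.g.:
#   srvctl config database -d mytestprd | db_type
# Here we feed a canned sample matching the output format shown later:
printf 'Database name: mytestdb\nType: SINGLE\n' | db_type   # prints SINGLE
```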

At first, I thought just shutting down the db, creating some necessary directories, and adding an entry to the oratab file on the other node should work. I remember reading an article about this procedure a long time ago; it changed certain db parameters in the pfile and brought up the db using the modified pfile. I searched the internet and could not find it anymore. The closest article I could find is How To Setup ASM (11.2) On An Active/Passive Cluster (Non-RAC) (Doc ID 1296124.1). The document discusses a lot of steps to install and configure GI on the 2nd db node, which I could basically skip. When I got to the most important step, srvctl add database, I hit the following errors:

oracle:testdb@odap2 ~ > /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/srvctl add database -d testdb -n odap2 -o /u01/app/oracle/product/11.2.0.3/dbhome_1 -p +data/TESTDB/spfiletestdb.ora -s OPEN -y AUTOMATIC -a DATA,RECO,REDO -t IMMEDIATE
PRCS-1007 : Server pool testdb already exists
PRCR-1086 : server pool ora.testdb is already registered

It looks weird that this db already existed on db node 2 before I created it there. Later I realized the srvctl command can see resources across all db nodes, so this document was obviously not going to work for me. Let me try another way: what if I shut down the db on node 1 and bring it up on node 2 directly? Wrong! After I did that, I got many errors during the mount stage:

SQL> startup nomount
ORACLE instance started.

Total System Global Area 4275781632 bytes
Fixed Size 2235208 bytes
Variable Size 838862008 bytes
Database Buffers 3388997632 bytes
Redo Buffers 45686784 bytes
SQL> alter database mount
2 ;
alter database mount
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 2838
Session ID: 584 Serial number: 1

The alert log also shows the following errors:
Thu Sep 05 19:56:38 2013
alter database mount
Thu Sep 05 19:56:38 2013
NOTE: Loaded library: System
ORA-15025: could not open disk "/dev/mapper/SSD_E0_S20_805849611p1"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied
Additional information: 9
ORA-15025: could not open disk "/dev/mapper/SSD_E0_S21_805849344p1"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied
Additional information: 9
ORA-15025: could not open disk "/dev/mapper/SSD_E0_S22_805849349p1"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied
Additional information: 9
ORA-15025: could not open disk "/dev/mapper/SSD_E0_S23_805849238p1"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied
Additional information: 9
Thu Sep 05 19:56:38 2013
SUCCESS: diskgroup REDO was mounted
Thu Sep 05 19:56:38 2013
NOTE: dependency between database testdb and diskgroup resource ora.REDO.dg is established
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ckpt_2813.trc (incident=24145):
ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/testdb/testdb/incident/incdir_24145/testdb_ckpt_2813_i24145.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ckpt_2813.trc (incident=24146):
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/testdb/testdb/incident/incdir_24146/testdb_ckpt_2813_i24146.trc

Out of ideas, I posted my question to Enkitec's internal technical distribution list. My colleague, Andy Colvin, came back with an interesting idea. He recommended using the srvctl modify database command to tell the OCR to use the other node, then simply running srvctl start database to start the database. It's an interesting idea and seemed to make sense to me. However, when I checked the Oracle documentation for srvctl modify database, there is no parameter to specify the hostname. srvctl uses -n to specify the hostname in many other commands, but for this command -n means the db name. Here is the syntax of the command from the documentation:

srvctl modify database -d db_unique_name [-o oracle_home] [-u oracle_user]
[-m domain_name] [-n db_name] [-p spfile] [-s start_options]
[-t stop_options] [-r {PRIMARY | PHYSICAL_STANDBY | LOGICAL_STANDBY |
SNAPSHOT_STANDBY}] [-y {automatic | manual}] [-a disk_group_list] [-z]

It looked weird to me, and I was pretty sure there must be some way to specify the hostname when creating the db. Luckily, whenever I create a db using dbca, I always save a copy of the creation scripts. So I ran the following search against them:

oracle:testdb@odap1 scripts> cd /u01/app/oracle/admin/testdb/scripts
oracle:testdb@odap1 scripts> grep odap1 *
testdb.sql:host /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/srvctl add database -d testdb -o /u01/app/oracle/product/11.2.0.3/dbhome_1 -p +DATA/testdb/spfiletestdb.ora -n testdb -x odap1 -a "DATA,RECO"

Ah, here is the answer: there is an undocumented parameter -x that specifies the hostname. So I ran the following:
/u01/app/oracle/product/11.2.0.3/dbhome_1/bin/srvctl modify database -d testdb -o /u01/app/oracle/product/11.2.0.3/dbhome_1 -p +DATA/testdb/spfiletestdb.ora -n testdb -x odap2 -a "DATA,RECO"

Then the database could be brought up with sqlplus on the 2nd db node.

Sounds like the end of the story. Later on, another colleague, Frits Hoogland, brought up an even nicer and cleaner solution than mine, and mentioned the approach can also be used on Exadata. Basically, on the 2nd db node I don't need to change the oratab file or create directories, but I must use the srvctl command instead of sqlplus to do the work. Another interesting idea, so I reproduced his recommendation and it worked easily without any issue. To make the example usable for different scenarios, I created a db on node 1 with the following information:

db name : mytestdb
db unique name : mytestprd
service : mytest

Here are the steps:
Step 1. Verify some information about the db and directories
on node 1

oracle:mytestdb@odap1 ~> srvctl config database -d mytestprd
Database unique name: mytestprd
Database name: mytestdb
Oracle home: /u01/app/oracle/product/11.2.0.3/dbhome_1
Oracle user: oracle
Spfile: +DATA/mytestprd/spfilemytestdb.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: mytestprd
Database instance: mytestdb
Disk Groups: DATA,RECO,REDO
Mount point paths:
Services: mytest
Type: SINGLE
Database is administrator managed

SYS@mytestdb> show parameter name

NAME                        TYPE        VALUE
--------------------------- ----------- ------------------------------
db_file_name_convert        string
db_name                     string      mytestdb
db_unique_name              string      mytestprd
global_names                boolean     TRUE
instance_name               string      mytestdb
lock_name_space             string
log_file_name_convert       string
processor_group_name        string
service_names               string      mytest

On db node 2, check whether the oratab entry and db directories exist. Of course, they should not exist yet.

oracle:testdb@odap2 ~> grep mytest /etc/oratab

oracle:testdb@odap2 ~> ls -l /u01/app/oracle/diag/rdbms/mytestprd/mytestdb/trace
ls: /u01/app/oracle/diag/rdbms/mytestprd/mytestdb/trace: No such file or directory

oracle:testdb@odap2 ~> ls -l /u01/app/oracle/admin/mytestprd/adump
ls: /u01/app/oracle/admin/mytestprd/adump: No such file or directory
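This pre-flight check can also be scripted. Below is a hedged sketch; the directory layout follows the paths above, but the function name and arguments are my own invention and would need adapting:

```shell
# Sketch: report which per-database directories are missing on this node.
# The Oracle agent creates these on first startup, so "missing" output is
# expected on the target node before the first relocation.
check_failover_dirs() {
  base=$1 db_name=$2 db_unique=$3
  for d in "$base/diag/rdbms/$db_unique/$db_name/trace" \
           "$base/admin/$db_unique/adump"; do
    [ -d "$d" ] || echo "missing: $d"
  done
}

# usage on node 2 before the move:
check_failover_dirs /u01/app/oracle mytestdb mytestprd
```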

Step 2. Configure the db to move from db node 1 to node 2

oracle:testdb@odap2 ~> srvctl status database -d mytestprd
Instance mytestdb is running on node odap1

oracle:testdb@odap2 ~> srvctl status service -d mytestprd -s mytest
Service mytest is running on instance(s) mytestdb

oracle:testdb@odap2 ~> srvctl stop database -d mytestprd
oracle:testdb@odap2 ~> srvctl modify database -d mytestprd -x odap2
oracle:testdb@odap2 ~> srvctl start database -d mytestprd

oracle:testdb@odap2 ~> srvctl status database -d mytestprd
Instance mytestdb is running on node odap2
oracle:testdb@odap2 ~> srvctl status service -d mytestprd -s mytest
Service mytest is running on instance(s) mytestdb

oracle:testdb@odap2 ~> srvctl config database -d mytestprd
Database unique name: mytestprd
Database name: mytestdb
Oracle home: /u01/app/oracle/product/11.2.0.3/dbhome_1
Oracle user: oracle
Spfile: +DATA/mytestprd/spfilemytestdb.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: mytestprd
Database instance: mytestdb
Disk Groups: DATA,RECO,REDO
Mount point paths:
Services: mytest
Type: SINGLE
Database is administrator managed
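The three srvctl commands in step 2 are the whole relocation. As a hedged sketch (the function name is mine, and it assumes the srvctl binary for the database's Oracle home is on PATH), they can be wrapped into one helper:

```shell
# Sketch: cold-relocate a single-instance db to another cluster node.
# Mirrors the srvctl stop / modify -x / start sequence from step 2;
# -x is the undocumented host parameter discussed earlier.
relocate_single_instance() {
  db_unique=$1 target_node=$2
  srvctl stop database -d "$db_unique" || return 1
  srvctl modify database -d "$db_unique" -x "$target_node" || return 1
  srvctl start database -d "$db_unique" || return 1
  srvctl status database -d "$db_unique"
}

# usage: relocate_single_instance mytestprd odap2
```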

Now we can verify that the oratab entry and directories exist.

Verify the entry in the oratab file:

oracle:testdb@odap2 ~> grep mytest /etc/oratab
mytestdb:/u01/app/oracle/product/11.2.0.3/dbhome_1:N    # line added by Agent

oracle:testdb@odap2 ~> ls -l /u01/app/oracle/diag/rdbms/mytestprd/mytestdb/trace
total 160
-rw-r----- 1 oracle asmadmin 10037 Sep  7 10:20 alert_mytestdb.log
-rw-r----- 1 oracle asmadmin   893 Sep  7 10:20 mytestdb_asmb_31403.trc
-rw-r----- 1 oracle asmadmin    64 Sep  7 10:20 mytestdb_asmb_31403.trm
-rw-r----- 1 oracle asmadmin  1062 Sep  7 10:20 mytestdb_cjq0_31545.trc
....

oracle:testdb@odap2 ~> ls -l /u01/app/oracle/admin/mytestprd/adump
total 32
-rw-r----- 1 oracle asmadmin  796 Sep  7 10:20 mytestdb_ora_31064_2.aud
-rw-r----- 1 oracle asmadmin 2600 Sep  7 10:20 mytestdb_ora_31421_1.aud
....

SQL*Plus also shows that the db has relocated to the 2nd db node:

SQL> col host_name for a20
SQL> select instance_name, host_name from v$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
mytestdb         odap2

Let's do some checks under the $ORACLE_HOME/dbs directory on node 2.

oracle:mytestdb@odap2 dbs> ls -l *mytestdb*
-rw-rw---- 1 oracle asmadmin 1544 Sep  7 10:20 hc_mytestdb.dat
-rw-r--r-- 1 oracle oinstall   67 Sep  7 10:19 initmytestdb.ora

oracle:mytestdb@odap2 dbs> cat initmytestdb.ora
SPFILE='+DATA/mytestprd/spfilemytestdb.ora'		# line added by Agent

Check the same $ORACLE_HOME/dbs directory on node 1:

oracle:mytestdb@odap1 dbs> ls -l *mytest*
-rw-rw---- 1 oracle asmadmin 1544 Sep  7 10:19 hc_mytestdb.dat
-rw-r----- 1 oracle oinstall   44 Sep  6 09:12 initmytestdb.ora
-rw-r----- 1 oracle oinstall 1536 Sep  6 09:05 orapwmytestdb

Notice that the password file is missing on node 2, so copy it over from db node 1.

oracle:mytestdb@odap2 dbs> cd $ORACLE_HOME/dbs
oracle:mytestdb@odap2 dbs> scp -p oracle@odap1:$PWD/orapwmytestdb .
orapwmytestdb                                                                     100% 1536     1.5KB/s   00:00
oracle:mytestdb@odap2 dbs> ls -lart *mytest*
-rw-r----- 1 oracle oinstall 1536 Sep  6 09:05 orapwmytestdb
-rw-r--r-- 1 oracle oinstall   67 Sep  7 10:19 initmytestdb.ora
-rw-rw---- 1 oracle asmadmin 1544 Sep  7 10:20 hc_mytestdb.dat
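Since only the password file needs copying by hand, that step can be scripted too. A hedged sketch (the function name is mine, and it assumes passwordless ssh between the nodes, as the scp above implies):

```shell
# Sketch: fetch the password file from the source node only if this
# node does not already have it under $ORACLE_HOME/dbs.
sync_password_file() {
  sid=$1 source_node=$2
  pwfile="$ORACLE_HOME/dbs/orapw$sid"
  if [ ! -f "$pwfile" ]; then
    scp -p "oracle@$source_node:$pwfile" "$pwfile"
  else
    echo "orapw$sid already present"
  fi
}

# usage on node 2: sync_password_file mytestdb odap1
```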

Running the crsstat command also shows the right result.

oracle:mytestdb@odap2 dbs> crsstat
Resource Name            Resource Type    Target     State        Node            FC RC State Details
------------------------ ---------------- ---------- ------------ --------------- -- -- ------
ora.DATA.dg               diskgroup      C ONLINE     ONLINE       odap1       0  0
ora.DATA.dg               diskgroup      C ONLINE     ONLINE       odap2       0  0
ora.LISTENER.lsnr         Listener       C ONLINE     ONLINE       odap1       0  0
ora.LISTENER.lsnr         Listener       C ONLINE     ONLINE       odap2       0  0
ora.LISTENER_SCAN1.lsnr   SCAN Listener  L ONLINE     ONLINE       odap1       0  0
ora.LISTENER_SCAN2.lsnr   SCAN Listener  L ONLINE     ONLINE       odap2       0  0
ora.RECO.dg               diskgroup      C ONLINE     ONLINE       odap1       0  0
ora.RECO.dg               diskgroup      C ONLINE     ONLINE       odap2       0  0
ora.REDO.dg               diskgroup      C ONLINE     ONLINE       odap1       0  0
ora.REDO.dg               diskgroup      C ONLINE     ONLINE       odap2       0  0
ora.asm                   ASM            C ONLINE     ONLINE       odap1       0  0 Started
ora.asm                   ASM            C ONLINE     ONLINE       odap2       0  0 Started
ora.cvu                   cvu            L ONLINE     ONLINE       odap2       0  0
ora.dbm.db                database       L ONLINE     ONLINE       odap1       0  0 Open
ora.odap1.vip             Cluster VIP    L ONLINE     ONLINE       odap1       0  0
ora.odap2.vip             Cluster VIP    L ONLINE     ONLINE       odap2       0  0
ora.gsd                   Gbl Svc Daemon C OFFLINE    OFFLINE      (odap1)     0  0
ora.gsd                   Gbl Svc Daemon C OFFLINE    OFFLINE      (odap2)     0  0
ora.mytestprd.db          database       L ONLINE     ONLINE       odap2       0  0 Open
ora.mytestprd.mytest.svc  service        L ONLINE     ONLINE       odap2       0  0
ora.net1.network          Network (VIP)  C ONLINE     ONLINE       odap1       0  0
ora.net1.network          Network (VIP)  C ONLINE     ONLINE       odap2       0  0
ora.oc4j                  OC4J           L ONLINE     ONLINE       odap2       0  0
ora.ons                   Ora Notif Svc  C ONLINE     ONLINE       odap1       0  0
ora.ons                   Ora Notif Svc  C ONLINE     ONLINE       odap2       0  0
ora.reco.acfsvol.acfs     acfs           C ONLINE     ONLINE       odap1       0  0 mounted on /cloudfs
ora.reco.acfsvol.acfs     acfs           C ONLINE     ONLINE       odap2       0  0 mounted on /cloudfs
ora.registry.acfs         registry       C ONLINE     ONLINE       odap1       0  0
ora.registry.acfs         registry       C ONLINE     ONLINE       odap2       0  0
ora.scan1.vip             SCAN VIP       L ONLINE     ONLINE       odap1       0  0
ora.scan2.vip             SCAN VIP       L ONLINE     ONLINE       odap2       0  0
ora.testdb.db             database       L ONLINE     ONLINE       odap2       0  0 Open
ora.testwz.db             database       L ONLINE     ONLINE       odap1       0  0 Open

After the db was relocated to node 2, I wanted to run some tests to try out a few interesting scenarios.

Scenario 1. While the db is running on node 2, what happens if I try to bring it up on node 1?
Of course, everyone knows it should not work for a single-instance db, but I just wanted to see the result.

SYS@mytestdb> startup nomount
ORA-00304: requested INSTANCE_NUMBER is busy

As I expected, the db cannot start. Otherwise, who would buy an Oracle RAC license!

Scenario 2. What happens if I stop the db from node 2 and start it on node 1 using sqlplus?

On node 1
oracle:mytestdb@odap1 dbs> srvctl stop database -d mytestprd
oracle:mytestdb@odap1 dbs> sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Sat Sep 7 10:44:56 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to an idle instance.
SYS@mytestdb> startup nomount
ORACLE instance started.

Total System Global Area 4275781632 bytes
Fixed Size 2235208 bytes
Variable Size 822084792 bytes
Database Buffers 3422552064 bytes
Redo Buffers 28909568 bytes
SYS@mytestdb> alter database mount;
Database altered.

SYS@mytestdb> alter database open;
Database altered.

SYS@mytestdb> col host_name for a20
SYS@mytestdb> select instance_name, host_name from v$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
mytestdb         odap1

This is an interesting result, as I expected it would not start on node 1. So what happens if I shut the db down and start it using srvctl on db node 1?

SYS@mytestdb> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SYS@mytestdb> quit

oracle:mytestdb@odap1 dbs> srvctl start database -d mytestprd
oracle:mytestdb@odap1 dbs> srvctl status database -d mytestprd
Instance mytestdb is running on node odap2
oracle:mytestdb@odap1 dbs> srvctl status service -d mytestprd -s mytest
Service mytest is running on instance(s) mytestdb

No surprise: the db and service were started on db node 2 as expected.

Scenario 3. Based on scenario 2, with the db started on node 1, can I start the service on node 1 as well?

oracle:mytestdb@odap1 dbs> sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Sat Sep 7 10:59:20 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to an idle instance.
SYS@mytestdb> startup
ORACLE instance started.

Total System Global Area 4275781632 bytes
Fixed Size 2235208 bytes
Variable Size 822084792 bytes
Database Buffers 3422552064 bytes
Redo Buffers 28909568 bytes
Database mounted.
Database opened.
SYS@mytestdb> quit

oracle:mytestdb@odap1 dbs> srvctl status database -d mytestprd
Instance mytestdb is not running on node odap2
oracle:mytestdb@odap1 dbs> srvctl status service -d mytestprd -s mytest
Service mytest is not running.
oracle:mytestdb@odap1 dbs> srvctl start service -d mytestprd -s mytest
PRCD-1084 : Failed to start service mytest
PRCR-1079 : Failed to start resource ora.mytestprd.mytest.svc
CRS-5017: The resource action "ora.mytestprd.db start" encountered the following error:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.3/grid/log/odap2/agent/crsd/oraagent_oracle/oraagent_oracle.log".

CRS-2674: Start of 'ora.mytestprd.db' on 'odap2' failed
CRS-2632: There are no more servers to try to place resource 'ora.mytestprd.mytest.svc' on that would satisfy its placement policy

oracle:mytestdb@odap1 dbs> ls -l /u01/app/11.2.0.3/grid/log/odap2/agent/crsd/oraagent_oracle/oraagent_oracle.log
ls: /u01/app/11.2.0.3/grid/log/odap2/agent/crsd/oraagent_oracle/oraagent_oracle.log: No such file or directory

There is no odap2 directory under /u01/app/11.2.0.3/grid/log on node 1; it exists on db node 2.
oracle:mytestdb@odap1 dbs> ls -l /u01/app/11.2.0.3/grid/log
total 12
drwxr-xr-x 2 grid oinstall 4096 Apr 5 22:29 crs
drwxr-xr-t 25 root oinstall 4096 Jul 1 12:10 odap1
drwxrwxr-x 5 grid asmadmin 4096 Jul 1 12:44 diag

oracle:mytestdb@odap1 dbs> crsstat | egrep 'Target|mytest'
Resource Name             Resource Type    Target State    Node            FC RC State Details
ora.mytestprd.db          database       L ONLINE OFFLINE  (odap2)     0  0 Instance Shutdown
ora.mytestprd.mytest.svc  service        L ONLINE OFFLINE  (odap2)     0  0

Therefore, from the above test cases: once the db is configured to start on a different node, you can bring it up from either database node using sqlplus. However, if you are using a service name, you have to start the db on the node you associated it with via the "srvctl modify database" command. In general, this trick can be used in non-RAC environments on ODA, Exadata, and any cluster that uses the same GI across multiple nodes.
