AWR Is Not Enough to Track Down an IO Problem on Exadata

Recently we ran into an interesting performance issue at one of our clients. They had been reporting significant slowdowns in the system during the daytime for some time. Their X2 Exadata does not host many databases: three production databases and a few test/QA databases. Of the three production databases, the most important and critical one, which I will call CRDB (a fake name), is mainly OLTP with some reporting activity. The other two production databases are tiny, less important databases without much activity. In other words, the majority of database activity happens in the CRDB database.

The slowdowns were mysterious, happened randomly during the day, and did not seem to follow a pattern. When a slowdown happened, the number of active sessions in CRDB shot up from an average of 2~4 to at least 30, sometimes reaching 100. At the same time, there were massive slowdowns in all the other databases on the same Exadata. To track down the issue, we requested a few AWR reports from the client for the CRDB database.

In the AWR report, the top event is cell smart table scan. For example, on db node 1 alone, from 10:30am to 2:30pm on August 6, 2013, the total waits for cell smart table scan were 574,218 with an average wait time of 106 ms, equal to 60,980 seconds of DB time, or in other words 78.20% of DB time. Other than that, the AWR report did not tell us anything useful that could explain the performance issue. Therefore, we mainly focused on IO and smart scan related operations.
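
To see how this single wait event trends across AWR snapshots without pulling a full report each time, a query along these lines against DBA_HIST_SYSTEM_EVENT can help. This is only a minimal sketch (it assumes Diagnostics Pack licensing); the columns are cumulative, so the deltas are taken with LAG:

-- per-snapshot waits and time waited for cell smart table scan on instance 1
select sn.begin_interval_time,
       e.total_waits
         - lag(e.total_waits) over (partition by e.instance_number order by e.snap_id)  as waits,
       round((e.time_waited_micro
         - lag(e.time_waited_micro) over (partition by e.instance_number order by e.snap_id)) / 1e6) as seconds_waited
from   dba_hist_system_event e
       join dba_hist_snapshot sn
         on  sn.snap_id         = e.snap_id
         and sn.dbid            = e.dbid
         and sn.instance_number = e.instance_number
where  e.event_name      = 'cell smart table scan'
and    e.instance_number = 1
order  by e.snap_id;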

On Exadata, cell smart table scan is the wait for scans offloaded to the cell storage for large IOs, and many offloading activities involve it. As the client said there had been no application code change recently and the data volume remained similar, we assumed the total logical and physical IO for the CRDB database should be at a similar level as before. They also mentioned that the performance issue began a few days after an Exadata patch.

Their resource plan gives a 50% allocation to CRDB and the other 50% to the rest of the databases, with an objective of LOW_LATENCY. This resource plan has been in production for over a year and has managed IO quite well.

Luckily, during the time we were tracking down the issue, I was attending our company's 2013 Exadata conference (E4). It's an excellent Exadata and Big Data specific conference: not only does it have many great Exadata/Big Data speakers, but also many Exadata experts and users from around the world. I happened to listen to a session about IORM presented by Oracle's Sue Lee. She is a Director of Development in the Oracle RDBMS division and responsible for the Oracle Resource Manager. Her presentation about IORM was excellent. She will also give a similar session at OpenWorld. If you are working on Exadata and attending OpenWorld this year, I highly recommend attending her session, Using Resource Manager for Database Consolidation with Oracle Database 12c (Session ID: CON8884); you will gain in-depth knowledge about IORM and Exadata internals.

During her presentation, she mentioned a tool called the iorm_metrics script (Doc ID 1337265.1) and said this tool is frequently used in her group to track down IORM performance problems. It sounded interesting, so I immediately downloaded the script and asked our client to run the following on all of the cell nodes.

./metric_iorm.pl "where collectionTime > '2013-08-06T11:30:00-06:00' and collectionTime < '2013-08-06T13:20:00-06:00'" > metric_iorm_`hostname`.log
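
Since the script has to run on every cell, a dcli wrapper can collect everything in one pass. This is only a sketch, assuming metric_iorm.pl has already been copied to /tmp on each cell and that ~/cell_group lists the cell hosts; dcli prefixes each output line with the cell name, so a single local file covers all cells:

# run the script on every cell for the same time window and collect the output locally
dcli -g ~/cell_group -l root "cd /tmp && ./metric_iorm.pl \"where collectionTime > '2013-08-06T11:30:00-06:00' and collectionTime < '2013-08-06T13:20:00-06:00'\"" > metric_iorm_all_cells.log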

The result was quite interesting; I used one snapshot from cell node 1 as an example:

 
Time: 2013-08-06T12:33:03-05:00
Database: CRDB
Utilization:     Small=4%    Large=7%
Flash Cache:     IOPS=226
Disk Throughput: MBPS=120
Small I/O's:     IOPS=258    Avg qtime=1.1ms
Large I/O's:     IOPS=114    Avg qtime=3026ms
	Consumer Group: _ORACLE_MEDPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=0.1
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=0.3    Avg qtime=0.0ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: OTHER_GROUPS
	Utilization:     Small=0%    Large=7%
	Flash Cache:     IOPS=150
	Disk Throughput: MBPS=116
	Small I/O's:     IOPS=1.8    Avg qtime=0.8ms
	Large I/O's:     IOPS=112    Avg qtime=3077ms
	Consumer Group: _ORACLE_LOWPRIBG_GROUP_
	Utilization:     Small=1%    Large=0%
	Flash Cache:     IOPS=23.8
	Disk Throughput: MBPS=1
	Small I/O's:     IOPS=82.1    Avg qtime=2.7ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: _ORACLE_BACKGROUND_GROUP_
	Utilization:     Small=3%    Large=0%
	Flash Cache:     IOPS=51.6
	Disk Throughput: MBPS=2
	Small I/O's:     IOPS=174    Avg qtime=0.3ms
	Large I/O's:     IOPS=1.9    Avg qtime=66.1ms

Database: _OTHER_DATABASE_
Utilization:     Small=10%    Large=9%
Flash Cache:     IOPS=89.7
Disk Throughput: MBPS=142
Small I/O's:     IOPS=504    Avg qtime=0.5ms
Large I/O's:     IOPS=134    Avg qtime=4137ms
	Consumer Group: _ORACLE_MEDPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=0.1
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=0.4    Avg qtime=0.2ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: OTHER_GROUPS
	Utilization:     Small=0%    Large=9%
	Flash Cache:     IOPS=42.8
	Disk Throughput: MBPS=139
	Small I/O's:     IOPS=0.8    Avg qtime=0.6ms
	Large I/O's:     IOPS=134    Avg qtime=4138ms
	Consumer Group: _ORACLE_LOWPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=16.9
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=56.2    Avg qtime=3.5ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: _ORACLE_BACKGROUND_GROUP_
	Utilization:     Small=9%    Large=0%
	Flash Cache:     IOPS=29.8
	Disk Throughput: MBPS=2
	Small I/O's:     IOPS=447    Avg qtime=0.2ms
	Large I/O's:     IOPS=0.0    Avg qtime=53.0ms

CELL METRICS SUMMARY

Cell Total Utilization:     Small=14%    Large=16%
Cell Total Flash Cache:     IOPS=315.7
Cell Total Disk Throughput: MBPS=249.454
Cell Total Small I/O's:     IOPS=762.2
Cell Total Large I/O's:     IOPS=245.6
Cell Avg small read latency:  11.24 ms
Cell Avg small write latency: 2.64 ms
Cell Avg large read latency:  13.35 ms
Cell Avg large write latency: 4.47 ms

From the above result, we can see the average queue time for each large IO (cell smart scan) was over 3,000 ms for the CRDB resource group and over 4,000 ms for OTHER_DATABASE. The normal range should be < 30 ms. The throughput for OTHER_DATABASE was 142 MB/second while CRDB was 120 MB/second. This indicates saturated disk IO.

It's possible that CRDB's large amount of IO caused OTHER_DATABASE to slow down, but in that case the disk throughput from OTHER_DATABASE should be small and only its wait time should be long. On the contrary, the high IO throughput from OTHER_DATABASE indicated something was not right. So I zoomed into the other databases and compared the IO throughput between databases.
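
The per-database IORM metrics can also be pulled directly from the cells while waiting for the script output. A minimal sketch (the objectType filter is documented for recent cell software, so treat its availability on your version as an assumption):

# list the current per-database IORM metrics (DB_IO_* wait and utilization figures) on every cell
dcli -g ~/cell_group -l root "cellcli -e \"list metriccurrent where objectType = 'IORM_DATABASE'\""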

The following chart shows that the throughput for the CRDB database was around 300MB/second before 3:30pm.
iorm1308_io_CRDB
At the same time, for another much smaller and less used database, the throughput was much higher than for the CRDB database. It was mostly between 500MB and 600MB/second, with peaks of over 1200MB/second at a few times.

iorm1308_io_otherDB

It is normal to see CRDB doing a lot of large IOs during the day, as it is the major database and there are many activities against it. However, it is unusual to see a tiny database take a significant amount of large IOs. This inspired me to investigate this small database further.

From the 12c OEM SQL Monitor screen, we can see many executions of the query with SQL ID d0af9yxrrrvy5, each doing a lot of IO, over 80GB per execution, and running for a long time. There were multiple instances of the same query executing during the same timeframe the slowdown happened.

iorm1308_slow_query
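
For readers without OEM handy, a rough way to spot the top physical-IO statements still in the shared pool is a query along these lines against GV$SQL (a sketch only; cursors that have already aged out will not show up here):

-- top 10 statements by physical read volume across all instances
select *
from  (select sql_id, inst_id, executions,
              round(physical_read_bytes / 1024 / 1024 / 1024, 1) as gb_read,
              round(elapsed_time / 1e6)                          as elapsed_sec
       from   gv$sql
       order  by physical_read_bytes desc)
where rownum <= 10;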

The 12c Cloud Control OEM also shows that the Average Throttle Time for Disk I/Os for both CRDB and OTHER_DATABASE shot up to 1,500~2,000 milliseconds on the afternoons of August 6 and 7. This was exactly the time the query was executing in one of the other databases.

iorm1308_disk_throttle_time

After this query was shut down, the system returned to normal. So the slowdown in the CRDB database was not the cause of the problem, but the victim of IO throttling caused by another database.

That seemed to be the end of the story. Actually, not yet.

Although the system looked normal, there were two queries running about 2~3 times slower than in the past. Our Enkitec awr_plan_change.sql script shows almost identical LIO and PIO for each execution, but the timing changed significantly after the patch date of July 15. It's another interesting issue.

SYS@CRDB1> @awr_plan_change
Enter value for sql_id: 4zbfxnv733dzb
Enter value for instance_number:
   SNAP_ID   NODE BEGIN_INTERVAL_TIME              SQL_ID         EXECS  AVG_ETIME        AVG_LIO        AVG_PIO
---------- ------ -------------------------------- ------------- ------ ---------- -------------- --------------
9700	1 28-MAY-13 07.00.37.592 AM	 4zbfxnv733dzb   1	 288.559   55,876,213.0   55,875,801.0
9748	2 29-MAY-13 07.00.16.396 AM	 4zbfxnv733dzb   1	 334.543   55,876,213.0   55,875,801.0
9796	3 30-MAY-13 07.00.01.330 AM	 4zbfxnv733dzb	 1	 315.035   55,876,333.0   55,875,801.0
....
....
11956	3 14-JUL-13 07.00.09.104 AM	 4zbfxnv733dzb	 1	 258.629   55,876,269.0   55,875,804.0
12000	2 15-JUL-13 07.30.03.260 AM	 4zbfxnv733dzb	 1 1,549.712   43,115,159.0   43,107,149.0
12001	2 15-JUL-13 08.00.11.035 AM	 4zbfxnv733dzb	 0	 993.135   12,778,387.0   12,768,812.0
12047	1 16-JUL-13 07.00.11.740 AM	 4zbfxnv733dzb	 1	 565.923   55,876,638.0   55,875,801.0
12096	1 17-JUL-13 07.30.33.069 AM	 4zbfxnv733dzb	 1 1,148.289   55,878,883.0   55,875,923.0
12143	3 18-JUL-13 07.00.10.648 AM	 4zbfxnv733dzb	 1	 567.586   55,876,013.0   55,875,803.0
....
....
13057	1 06-AUG-13 07.00.12.679 AM	 4zbfxnv733dzb	 1	 645.235   55,876,538.0   55,875,821.0
13105	2 07-AUG-13 07.00.03.051 AM	 4zbfxnv733dzb	 1	 986.482   55,877,223.0   55,875,823.0
13153	3 08-AUG-13 07.00.23.646 AM	 4zbfxnv733dzb	 1	 587.454   55,875,957.0   55,875,801.0
13201	1 09-AUG-13 07.00.27.502 AM	 4zbfxnv733dzb	 1	 594.734   55,876,423.0   55,875,801.0
13249	3 10-AUG-13 07.00.48.524 AM	 4zbfxnv733dzb	 1	 515.732   55,877,880.0   55,875,801.0
13297	1 11-AUG-13 07.00.09.256 AM	 4zbfxnv733dzb	 1	 477.941   55,875,965.0   55,875,802.0
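
For reference, a rough equivalent of this history can be pulled straight from DBA_HIST_SQLSTAT if you do not have the script handy. This is a minimal sketch (Diagnostics Pack assumed; the real awr_plan_change.sql does more, such as tracking plan changes):

-- per-snapshot averages for one SQL_ID, similar to the awr_plan_change output above
select s.snap_id, s.instance_number as node, sn.begin_interval_time, s.plan_hash_value,
       s.executions_delta as execs,
       round(s.elapsed_time_delta / 1e6 / nullif(s.executions_delta, 0), 3) as avg_etime,
       round(s.buffer_gets_delta  / nullif(s.executions_delta, 0), 1)       as avg_lio,
       round(s.disk_reads_delta   / nullif(s.executions_delta, 0), 1)       as avg_pio
from   dba_hist_sqlstat s
       join dba_hist_snapshot sn
         on  sn.snap_id         = s.snap_id
         and sn.dbid            = s.dbid
         and sn.instance_number = s.instance_number
where  s.sql_id = '4zbfxnv733dzb'
order  by s.snap_id;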

So I compared two AWR reports, one from before the patch and one from after, covering the same period (7am~8:30am) when both queries run. Although there are some differences here and there, I focused on the differences in IO waits, especially large scan waits.

iorm1308_awr_compare_wait_events

From the above chart, we can see on the 2nd line, cell smart table scan, that the average wait time jumped from 7ms to 38ms and the total wait time jumped from 2,083 seconds to 3,000 seconds. At the same time, the average wait time dropped 24% for cell single block physical read, which is small IO. At this moment, I realized the IORM internal handling logic seemed to have changed after the patch.

Checking the Oracle Support site, I found something interesting related to IORM.

For cell versions 11.2.2.2 – 11.2.3.1, IORM is disabled by default. To disable IORM, run the "alter iormplan objective=off;" command. 11.2.3.1 is the cell version our client used before the patch.

For cell version 11.2.3.2 and above, IORM is enabled by default and uses the basic objective. To disable IORM, run "alter iormplan objective=basic;". 11.2.3.2 is our client's current version.

The above indicates there were some changes involving IORM. Their current objective is low_latency. Low_latency is good for fast response to OLTP requests, with lower wait time for small IOs; at the same time, large IOs are sacrificed and have a longer wait time per request. Another possibility is that IORM changed the way it throttles IO. These two queries generate over 110 million physical IOs within 10 minutes, and I saw that the throttle wait time for CRDB was between 100~200 ms when these two queries ran in the morning.
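
Before changing anything, it is worth confirming which objective each cell is actually running with. A quick check along these lines (a sketch, run as root from a database node):

# show the full IORM plan, including the objective, on every cell
dcli -g ~/cell_group -l root "cellcli -e list iormplan detail"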

So LOW_LATENCY might not be a good objective for this client. I remember Sue Lee recommended using the AUTO objective as a best practice, and trying the others only if that does not work. Obviously, this is something we wanted to follow. So the client made just a one-line change, as follows, to switch from the LOW_LATENCY to the AUTO objective.

Here is the command to make the change to AUTO.
# dcli -g ~/cell_group -l root 'cellcli -e alter iormplan objective = auto'
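
To confirm the change took effect on every cell, the objective can be listed again (a sketch):

# every cell should now report objective: auto
dcli -g ~/cell_group -l root "cellcli -e list iormplan attributes objective"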

Here is the result after the objective change to AUTO.

   SNAP_ID   NODE BEGIN_INTERVAL_TIME              SQL_ID         EXECS  AVG_ETIME        AVG_LIO        AVG_PIO
---------- ------ -------------------------------- ------------- ------ ---------- -------------- --------------
13633	3 18-AUG-13 07.00.21.207 AM	 4zbfxnv733dzb	 1	 471.955   55,875,957.0   55,875,801.0
13681	3 19-AUG-13 07.00.35.982 AM	 4zbfxnv733dzb	 1	 556.433   55,877,846.0   55,875,863.0
13729	3 20-AUG-13 07.00.23.422 AM	 4zbfxnv733dzb	 1	 537.509   55,877,821.0   55,875,861.0
13777	2 21-AUG-13 07.00.43.473 AM	 4zbfxnv733dzb   1	 166.424   55,876,181.0   55,875,806.0
13825	2 22-AUG-13 07.00.06.274 AM	 4zbfxnv733dzb	 1	 175.517   55,875,957.0   55,875,801.0

The result was amazing. Not only was there no negative impact on the system, the two queries' running times immediately went back to their normal range, even better than before. The query above dropped from 500+ seconds to less than 200 seconds.

During a few emails back and forth with Sue Lee about this topic, she added a few good points as follows.

When you look at the disk utilization charts, you should keep in mind that IORM is only actively regulating when you see IOs being throttled. So you should only see that the allocations are “kicking in” when the utilization is pretty high.

If you look at slide 59, you’ll see that with LOW LATENCY objective, we don’t allow the disks to run at their full utilization. This is why you got better results with AUTO objective for throughput-intensive applications like these reports. This is basically what you already said in the attached email.

When I checked her presentation slides again, of course, they explain everything. For Low Latency, the peak disk utilization for scans is the lowest, only 40%; the next is Balanced with 90%, then High Throughput with 100%. No wonder the disk throttle time was that high.

My colleague Carlos Sierra, the author of the famous SQLT tool, also did an excellent analysis of the same issue from a SQLT perspective, and here is the link to his analysis.

I must say I am very happy that I attended the E4 conference and could apply the knowledge immediately to a real production issue. In case you didn't attend, here is one photo I took at the conference. This year we had a larger room than at last year's E4 conference, with more people coming. I am sure next year it will have an even larger room for a bigger audience.
e4_2013_photo1


Compression Methods on Exadata: Use parallel to speed up compression – Part 3 of 6

parallalData

In part 1 and part 2 of this blog series, I showed examples of creating compressed tables using CREATE TABLE and ALTER TABLE MOVE. For the small tables in those examples, it might be OK not to use parallelism. But for big tables, parallelism is highly recommended, just like the photo above: you need many lanes on a highway to allow high throughput. I wish adding a new lane to a highway could be as easy as changing the DOP in SQL. This post discusses parallelism for the compression methods.

As the command is similar for the different compression methods, I pick one HCC method and test the timing with different DOP values. QUERY LOW gives a lower compression ratio and I am very unlikely to use it in production, so QUERY LOW is out. As for ARCHIVE HIGH, although it can give a much better compression ratio, the processing time is usually several times longer than the other HCC methods. Given that it could also take longer to uncompress the data when we need to move it to a non-Exadata environment, I am very unlikely to use this method in production unless I really need to save a significant amount of space. That leaves QUERY HIGH and ARCHIVE LOW. Both use the ZLIB (gzip) compression algorithm, and in my experience their timings and compression ratios are quite close. So either one is fine for my test, and I choose the QUERY HIGH compression method as the base for the test.

I also increased the size of the test table to 80 million rows.

Build Test Data


WZHOU@dbm1> create table paratab_no_parallel as select * from testcp;

Table created.

Elapsed: 00:00:09.17

WZHOU@dbm1> insert /*+ APPEND */ into paratab_no_parallel select * from paratab_no_parallel;

10000000 rows created.

Elapsed: 00:00:17.32

WZHOU@dbm1> insert /*+ APPEND */ into paratab_no_parallel select * from paratab_no_parallel;
insert /*+ APPEND */ into paratab_no_parallel select * from paratab_no_parallel
                          *
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel

Elapsed: 00:00:00.00

It seems I need to commit first before I can do a further direct-path insert operation.

WZHOU@dbm1> commit;

Commit complete.

Elapsed: 00:00:00.00

WZHOU@dbm1> insert /*+ APPEND */ into paratab_no_parallel select * from paratab_no_parallel;

20000000 rows created.

Elapsed: 00:00:24.68

WZHOU@dbm1> commit;

Commit complete.

Elapsed: 00:00:00.01

WZHOU@dbm1> insert /*+ APPEND */ into paratab_no_parallel select * from paratab_no_parallel;

40000000 rows created.

Elapsed: 00:00:44.70

WZHOU@dbm1> select count(*) from paratab_no_parallel
                     *
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel

Elapsed: 00:00:00.01

WZHOU@dbm1> commit;

Commit complete.

Elapsed: 00:00:00.02

WZHOU@dbm1> select count(*) from paratab_no_parallel;

  COUNT(*)
----------
  80000000

Elapsed: 00:00:34.27

Let's find out the size of this table.

WZHOU@dbm1> @table_size
Enter value for owner: WZHOU
Enter value for table_name: PARATAB%
Enter value for type:

OWNER  SEGMENT_NAME         TYPE    TOTALSIZE_MEGS TABLESPACE_NAME
------ -------------------- ------- -------------- ---------------
WZHOU  PARATAB_NO_PARALLEL  TABLE          8,824.4 USERS
                                       --------------
sum                                        8,824.4

Elapsed: 00:00:00.09

Then create three more similar tables.

WZHOU@dbm1> create table paratab_dop_2 as select * from paratab_no_parallel;

Table created.

Elapsed: 00:01:09.56

WZHOU@dbm1> create table paratab_dop_4 as select * from paratab_no_parallel;

Table created.

Elapsed: 00:01:08.55

WZHOU@dbm1> create table paratab_dop_8 as select * from paratab_no_parallel;

Table created.

Elapsed: 00:01:08.86

Parallel Tests

WZHOU@dbm1> ALTER TABLE paratab_no_parallel MOVE COMPRESS for query high;

Table altered.

Elapsed: 00:03:32.30

WZHOU@dbm1> ALTER TABLE paratab_dop_2 MOVE COMPRESS for query high parallel 2;

Table altered.

Elapsed: 00:01:48.43

WZHOU@dbm1> ALTER TABLE paratab_dop_4 MOVE COMPRESS for query high parallel 4;

Table altered.

Elapsed: 00:00:57.07

WZHOU@dbm1> ALTER TABLE paratab_dop_8 MOVE COMPRESS for query high parallel 8;

Table altered.

Elapsed: 00:00:29.63

WZHOU@dbm1> @table_size
Enter value for owner: WZHOU
Enter value for table_name: PARATAB%
Enter value for type:

OWNER  SEGMENT_NAME         TYPE    TOTALSIZE_MEGS TABLESPACE_NAME
------ -------------------- ------- -------------- ---------------
WZHOU  PARATAB_DOP_2        TABLE           453.3  USERS
WZHOU  PARATAB_DOP_4        TABLE           454.2  USERS
WZHOU  PARATAB_DOP_8        TABLE           456.7  USERS
WZHOU  PARATAB_NO_PARALLEL  TABLE           464.0  USERS
                                    --------------
sum                                       1,828.2

As the DOP increases, the processing time decreases in a roughly linear fashion. The compression time for a ~9GB table drops from three and a half minutes to less than half a minute with a DOP of 8. Another interesting point is that the final compressed size is slightly different with different DOP values.
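
To double-check that each copy ended up with the intended compression setting after the move, the data dictionary can be queried directly; a minimal sketch against USER_TABLES:

-- COMPRESS_FOR should show QUERY HIGH for all four tables
select table_name, compression, compress_for
from   user_tables
where  table_name like 'PARATAB%'
order  by table_name;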

Command-line Utilities for the InfiniBand Network on Exadata

highspeed_data

Many times I need to check the network traffic on Exadata. 12c OEM Cloud Control is a good way to monitor system performance on Exadata. However, sometimes I need something quick and want to see the network traffic from the command line. Here are the commands I usually use to check network traffic for the Ethernet and InfiniBand networks.

The first command is dstat.
dstat -dnyc -N eth0,bondeth0,bondib0 -C total -f

Here is the result from running it on our X3 box.

Network_dstat_2
The options I used are:
-c enable cpu stats
-d enable disk read/write stats
-n enable network stats (receive, send)
-y enable system stats (interrupts, context switches)

If you add the three options l, m, and s, it will also show load, memory usage, and swap usage.
dstat -dnyclms -N eth0,bondeth0,bondib0 -C total -f

Network_dstat_1

Another command is sar:
sar -n DEV 3 100 | egrep 'bondib0|bondeth0|eth0'

network_sar_1

The above command does not show the heading of the sar output. Here is one with the heading:

Network_sar_2
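
One simple way to keep the sar header while still filtering is to include the IFACE heading line in the egrep pattern, for example:

sar -n DEV 3 100 | egrep 'IFACE|bondib0|bondeth0|eth0'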

In the previous post, iDB vs RDS vs SDP on Exadata, I gave a high-level overview of these three concepts used in Oracle Exadata and related Oracle Engineered Systems. Here I will show a few more useful commands to illustrate these protocols.

The first command I would like to talk about is ibhosts. This InfiniBand command discovers the InfiniBand fabric topology, or uses an existing topology file, to extract the channel adapter nodes. The following is the output from our X3 1/8 rack Exadata.

[root@enkx3db01 ~]# ibhosts
Ca	: 0x0010e00b4e20c000 ports 2 "SUN IB QDR GW switch enkbda1sw-ib2 192.168.8.150 Bridge 0"
Ca	: 0x0010e00b4e20c040 ports 2 "SUN IB QDR GW switch enkbda1sw-ib2 192.168.8.150 Bridge 1"
Ca	: 0x0021280001efdf70 ports 2 "enkbda1node10 BDA 192.168.12.110 HCA-1"
Ca	: 0x0021280001efd5ee ports 2 "enkbda1node09 BDA 192.168.12.109 HCA-1"
Ca	: 0x0021280001efd4ea ports 2 "enkbda1node12 BDA 192.168.12.112 HCA-1"
Ca	: 0x0021280001efd4d6 ports 2 "enkbda1node11 BDA 192.168.12.111 HCA-1"
Ca	: 0x0021280001efd5f6 ports 2 "enkbda1node14 BDA 192.168.12.114 HCA-1"
Ca	: 0x0021280001efd4e6 ports 2 "enkbda1node13 BDA 192.168.12.113 HCA-1"
Ca	: 0x0021280001ceda62 ports 2 "enkbda1node16 BDA 192.168.12.116 HCA-1"
Ca	: 0x0021280001cf5abe ports 2 "enkbda1node15 BDA 192.168.12.115 HCA-1"
Ca	: 0x0021280001efac6a ports 2 "enkbda1node18 BDA 192.168.12.118 HCA-1"
Ca	: 0x0021280001efd4fa ports 2 "enkbda1node17 BDA 192.168.12.117 HCA-1"
Ca	: 0x0021280001efdf68 ports 2 "enkbda1node08 BDA 192.168.12.108 HCA-1"
Ca	: 0x0021280001efd5e6 ports 2 "enkbda1node07 BDA 192.168.12.107 HCA-1"
Ca	: 0x0021280001efd606 ports 2 "enkbda1node05 BDA 192.168.12.105 HCA-1"
Ca	: 0x0021280001efd4ee ports 2 "enkbda1node06 BDA 192.168.12.106 HCA-1"
Ca	: 0x0021280001efd616 ports 2 "enkbda1node03 BDA 192.168.12.103 HCA-1"
Ca	: 0x0021280001efdf98 ports 2 "enkbda1node04 BDA 192.168.12.104 HCA-1"
Ca	: 0x0021280001efd84e ports 2 "enkbda1node01 BDA 192.168.12.101 HCA-1"
Ca	: 0x0021280001efdf6c ports 2 "enkbda1node02 BDA 192.168.12.102 HCA-1"
Ca	: 0x0010e00b88c0c000 ports 2 "SUN IB QDR GW switch enkbda1sw-ib3 192.168.8.151 Bridge 0"
Ca	: 0x0010e00b88c0c040 ports 2 "SUN IB QDR GW switch enkbda1sw-ib3 192.168.8.151 Bridge 1"
Ca	: 0x0021280001fcb9ec ports 2 "enkalytics EL-C 192.168.12.131 HCA-1"
Ca	: 0x0021280001fc4a1e ports 2 "enkx3db02 S 192.168.12.2 HCA-1"
Ca	: 0x0021280001fcbf5c ports 2 "enkx3cel03 C 192.168.12.5 HCA-1"
Ca	: 0x0021280001fbe18e ports 2 "enkx3cel01 C 192.168.12.3 HCA-1"
Ca	: 0x0021280001fc80c6 ports 2 "enkx3cel02 C 192.168.12.4 HCA-1"
Ca	: 0x0010e0000128ce64 ports 2 "enkx3db01 S 192.168.12.1 HCA-1"

A 1/8 rack has 2 database nodes, 3 cell nodes, and 2 IB switches. You might notice we have many more nodes than there are supposed to be on the InfiniBand fabric. From the naming, you can figure out that we have our X3 Exadata, an Oracle Big Data Appliance, and an Oracle Exalytics machine all connected together within the same InfiniBand network.

If you want to list the IB switches only, use the ibswitches command:

[root@enkx3db01 ~]# ibswitches
Switch	: 0x002128f57326a0a0 ports 36 "SUN DCS 36P QDR enkbda1sw-ib1 192.168.8.149" enhanced port 0 lid 59 lmc 0
Switch	: 0x0010e00b88c0c0a0 ports 36 "SUN IB QDR GW switch enkbda1sw-ib3 192.168.8.151" enhanced port 0 lid 61 lmc 0
Switch	: 0x0010e00b4e20c0a0 ports 36 "SUN IB QDR GW switch enkbda1sw-ib2 192.168.8.150" enhanced port 0 lid 60 lmc 0
Switch	: 0x002128f575bba0a0 ports 36 "SUN DCS 36P QDR enkx3sw-ib3.enkitec.com" enhanced port 0 lid 1 lmc 0
Switch	: 0x002128f57469a0a0 ports 36 "SUN DCS 36P QDR enkx3sw-ib2.enkitec.com" enhanced port 0 lid 2 lmc 0

In a TCP/IP network, we use the ping command to verify whether a host can be reached. Similarly, in an InfiniBand network, we use the rds-ping command to ping another IB node on the network. The following example shows an rds-ping from the Exalytics node to the ibvip on the first database node.

[root@enkalytics ~]# rds-ping -c 5 enkx3db01-ibvip.enkitec.com
   1: 240 usec
   2: 214 usec
   3: 201 usec
   4: 199 usec
   5: 269 usec

usec is microseconds.

Another useful RDS-related command is rds-info. To save space, I removed some similar messages.

[root@enkx3db01 ~]# rds-info

RDS IB Connections:
      LocalAddr      RemoteAddr                         LocalDev                        RemoteDev
  192.168.12.31  192.168.12.131           fe80::10:e000:128:ce66           fe80::21:2800:1fc:b9ee
   192.168.12.1    192.168.12.3           fe80::10:e000:128:ce66           fe80::21:2800:1fb:e18f
   192.168.12.1    192.168.12.1           fe80::10:e000:128:ce66           fe80::10:e000:128:ce66
  192.168.12.31   192.168.12.31           fe80::10:e000:128:ce66           fe80::10:e000:128:ce66
   192.168.12.1  192.168.12.101                               ::                               ::
 169.254.87.194  169.254.87.194           fe80::10:e000:128:ce66           fe80::10:e000:128:ce66
   192.168.12.1  192.168.12.118                               ::                               ::
   192.168.12.1    192.168.12.5           fe80::10:e000:128:ce66           fe80::21:2800:1fc:bf5e
   192.168.12.1    192.168.12.4           fe80::10:e000:128:ce66           fe80::21:2800:1fc:80c8
   192.168.12.1    192.168.12.2           fe80::10:e000:128:ce66           fe80::21:2800:1fc:4a20
  192.168.12.31    192.168.12.2           fe80::10:e000:128:ce66           fe80::21:2800:1fc:4a20
 169.254.87.194  169.254.97.245           fe80::10:e000:128:ce66           fe80::21:2800:1fc:4a20
rds-info: Unable get statistics: Protocol not available

Counters:
              CounterName            Value
               conn_reset          2879033
   recv_drop_bad_checksum                0
        recv_drop_old_seq               17
        recv_drop_no_sock             2985
      recv_drop_dead_sock                0
       recv_deliver_raced                0
           recv_delivered        222260977
              recv_queued        130604931
     recv_immediate_retry                0
       recv_delayed_retry                0
        recv_ack_required         14752884
          recv_rdma_bytes     136276672512
                recv_ping           288786
         send_queue_empty         85915668
          send_queue_full               15
     send_lock_contention           764917
    send_lock_queue_raced            16222
     send_immediate_retry                0
       send_delayed_retry             1222
          send_drop_acked                0
        send_ack_required         12818936
              send_queued        115370469
                send_rdma           261202
          send_rdma_bytes     136280842240
                send_pong           288786
       page_remainder_hit        102755106
      page_remainder_miss         11378031
             copy_to_user     165149391125
           copy_from_user     126738139340
       cong_update_queued                0
     cong_update_received               49
          cong_send_error                0
        cong_send_blocked                0
         ib_connect_raced               24
   ib_listen_closed_stale                0
      ib_evt_handler_call        278198961
          ib_tasklet_call        278198961
           ib_tx_cq_event        138358343
          ib_tx_ring_full             1319
           ib_tx_throttle                0
 ib_tx_sg_mapping_failure                0
            ib_tx_stalled              259
     ib_tx_credit_updates                0
           ib_rx_cq_event        172332234
         ib_rx_ring_empty               83
     ib_rx_refill_from_cq                0
 ib_rx_refill_from_thread                0
        ib_rx_alloc_limit                0
     ib_rx_credit_updates                0
              ib_ack_sent         14648252
      ib_ack_send_failure                0
      ib_ack_send_delayed           125532
  ib_ack_send_piggybacked            73902
          ib_ack_received         12960925
         ib_rdma_mr_alloc             6355
          ib_rdma_mr_free             5488
          ib_rdma_mr_used         48938489
    ib_rdma_mr_pool_flush          6438472
     ib_rdma_mr_pool_wait                0
 ib_rdma_mr_pool_depleted                0
           ib_atomic_cswp                0
           ib_atomic_fadd                0
         iw_connect_raced                0
   iw_listen_closed_stale                0
            iw_tx_cq_call                0
           iw_tx_cq_event                0
          iw_tx_ring_full                0
           iw_tx_throttle                0
 iw_tx_sg_mapping_failure                0
            iw_tx_stalled                0
     iw_tx_credit_updates                0
            iw_rx_cq_call                0
           iw_rx_cq_event                0
         iw_rx_ring_empty                0
     iw_rx_refill_from_cq                0
 iw_rx_refill_from_thread                0
        iw_rx_alloc_limit                0
     iw_rx_credit_updates                0
              iw_ack_sent                0
      iw_ack_send_failure                0
      iw_ack_send_delayed                0
  iw_ack_send_piggybacked                0
          iw_ack_received                0
         iw_rdma_mr_alloc                0
          iw_rdma_mr_free                0
          iw_rdma_mr_used                0
    iw_rdma_mr_pool_flush                0
     iw_rdma_mr_pool_wait                0
 iw_rdma_mr_pool_depleted                0

RDS Sockets:
      BoundAddr BPort        ConnAddr CPort     SndBuf     RcvBuf    Inode
   192.168.12.1  7978         0.0.0.0     0     262144    2097152 1422468668
   192.168.12.1 31215         0.0.0.0     0     262144    2097152 1422492506
   192.168.12.1  7588         0.0.0.0     0     262144    2097152 1422492510

....

 169.254.87.194 61081         0.0.0.0     0     131072    2097152 1531223507
   192.168.12.1 16962         0.0.0.0     0     262144    2097152 1534922520
   192.168.12.1   442         0.0.0.0     0     131072    2097152 1534922522
   192.168.12.1 49167         0.0.0.0     0     262144    2097152 1539515072
   192.168.12.1 48917         0.0.0.0     0     131072    2097152 1539515074
   192.168.12.1 14675         0.0.0.0     0     262144    2097152 1539517764
   192.168.12.1 12371         0.0.0.0     0     131072    2097152 1539517766
        0.0.0.0     0         0.0.0.0     0     131072    2097152 1539617228

RDS Connections:
      LocalAddr      RemoteAddr           NextTX           NextRX Flg
  192.168.12.31  192.168.12.131               13               13 --C
   192.168.12.1    192.168.12.3         22473392          1304865 --C
   192.168.12.1    192.168.12.1          2385972           139482 --C
  192.168.12.31   192.168.12.31                9                9 --C
   192.168.12.1  192.168.12.101                4                0 ---
 169.254.87.194  169.254.87.194           492250                0 --C
   192.168.12.1  192.168.12.118              119                0 ---
   192.168.12.1    192.168.12.5         36615339        150138911 --C
   192.168.12.1    192.168.12.4         17227536         60920516 --C
   192.168.12.1    192.168.12.2         28714935          5551186 --C
  192.168.12.31    192.168.12.2              287              287 --C
      127.0.0.1       127.0.0.1            18895            18895 --C
 169.254.87.194  169.254.97.245          7302242          1582910 --C

Receive Message Queue:
      LocalAddr LPort      RemoteAddr RPort              Seq      Bytes
   192.168.12.1 22526    192.168.12.2 20819          3971282        168
   192.168.12.1 22526    192.168.12.2 20819          4210130        168
   192.168.12.1 22526    192.168.12.2 20819          5177334        168
   192.168.12.1 22526    192.168.12.2 20819          5288457        168
   192.168.12.1 44950    192.168.12.2 27716          4485037        168
   192.168.12.1 44950    192.168.12.2 27716          4603330        168
   192.168.12.1 44950    192.168.12.2 27716          4717860        168

....

 169.254.87.194 61209  169.254.97.245  2286          1322929        168
 169.254.87.194 61209  169.254.97.245 32997          1322939        168
   192.168.12.1 62729    192.168.12.2 62458          5513836        168
   192.168.12.1 62729    192.168.12.2 33848          5513844        168
   192.168.12.1 62729    192.168.12.2 37522          5513850        168

Send Message Queue:
      LocalAddr LPort      RemoteAddr RPort              Seq      Bytes

Retransmit Message Queue:
      LocalAddr LPort      RemoteAddr RPort              Seq      Bytes
 169.254.87.194 31175  169.254.87.194 42828           492248        156
 169.254.87.194 31175  169.254.87.194 42828           492249        156
 169.254.87.194   104  169.254.97.245 27039          7302241        252

If you want to test the throughput between two IB nodes, you could use rds-stress.
First, start rds-stress on the target node.

[root@enkx3db01 ~]# rds-stress
waiting for incoming connection on 0.0.0.0:4000

Then, run the following command on the source IB node.
rds-stress -s enkx3db01-ibvip.enkitec.com -p 4000 -t 1 -D 600000
-s specifies the hostname
-p specifies the port number
-t specifies the number of tasks
-D specifies the size, in bytes, of the RDMA messages
After it starts on the source node, the target node will also show the progress.

Source IB node

[root@enkalytics ~]#  rds-stress -s enkx3db01-ibvip.enkitec.com -p 4000 -t 1 -D 600000
connecting to 192.168.12.31:4000
negotiated options, tasks will start in 2 seconds
Starting up....
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1   1668   1668    3531.12  977247.98  977247.98   31.57   558.62 -1.00
   1   1652   1653    3497.93  968354.24  967768.42   34.82   565.02 -1.00
   1   1673   1673    3541.62  980153.86  980153.86   35.05   556.53 -1.00
   1   1682   1682    3560.71  985437.49  985437.49   28.11   555.55 -1.00
   1   1673   1673    3541.71  980179.34  980179.34   29.50   558.08 -1.00
   1   1663   1663    3520.50  974308.84  974308.84   34.43   560.88 -1.00
   1   1692   1692    3581.88  991294.23  991294.23   30.13   552.21 -1.00
   1   1681   1681    3558.60  984852.60  984852.60   29.30   555.98 -1.00
   1   1666   1666    3526.84  976063.53  976063.53   34.09   560.13 -1.00
   1   1678   1678    3552.24  983093.02  983093.02   31.21   556.31 -1.00
   1   1726   1726    3653.88 1011220.94 1011220.94   31.72   538.88 -1.00
   1   1678   1678    3552.26  983097.93  983097.93   29.29   557.05 -1.00
^C

Target IB node

[root@enkx3db01 ~]# rds-stress
waiting for incoming connection on 0.0.0.0:4000
accepted connection from 192.168.12.131:19942 on 192.168.12.31:4000
negotiated options, tasks will start in 2 seconds
Starting up....
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   1   1670   1670    3531.99  977489.26  977489.26   15.12   573.84 -1.00
   1   1654   1653    3496.99  967509.78  968095.08   15.71   578.94 -1.00
   1   1675   1675    3542.75  981052.16  979881.45   14.75   570.90 -1.00
   1   1683   1683    3559.70  984572.15  985742.86   15.76   568.26 -1.00
   1   1674   1674    3540.45  979829.57  979829.57   15.93   570.79 -1.00
   1   1666   1666    3523.46  975127.51  975127.51   15.21   575.50 -1.00
   1   1692   1692    3581.02  991057.40  991057.40   16.02   566.28 -1.00
   1   1681   1682    3557.17  984749.23  984163.76   15.50   569.27 -1.00
   1   1667   1667    3526.36  975931.20  975931.20   15.12   574.15 -1.00
   1   1681   1680    3554.11  983903.24  983317.93   15.85   569.99 -1.00
   1   1728   1728    3654.66 1011436.98 1011436.98   16.37   556.93 -1.00
   1   1678   1678    3551.18  982799.19  982799.19   15.53   571.33 -1.00
---------------------------------------------
   1   1677   1677    3551.65  982954.38  982905.59   15.59   570.97 -1.00  (average)

After pressing CTRL-C on the source node, the average result is printed out on the target node.

Finally, let's talk about SDP. My colleague, Andy Colvin, has set up an SDP listener on our X3 Exadata. Here are the commands showing that Oracle listens on the SDP port.

[enkx3db01:oracle:dbm1] /home/oracle
> srvctl status vip -i enkx3db01-ibvip
VIP enkx3db01-ibvip is enabled
VIP enkx3db01-ibvip is running on node: enkx3db01

[enkx3db01:oracle:dbm1] /home/oracle
> srvctl config listener -l LISTENER_IB
Name: LISTENER_IB
Network: 2, Owner: oracle
Home:
End points: TCP:1522/SDP:1522

[enkx3db01:oracle:+ASM1] /home/oracle
> lsnrctl status LISTENER_IB

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 09-AUG-2013 21:52:19

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_IB)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_IB
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                24-JUL-2013 11:36:40
Uptime                    16 days 10 hr. 15 min. 38 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0.3/grid/log/diag/tnslsnr/enkx3db01/listener_ib/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_IB)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=sdp)(HOST=192.168.12.31)(PORT=1522)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.12.31)(PORT=1522)))
Services Summary...
Service "DBM_ETL" has 1 instance(s).
  Instance "dbm1", status READY, has 1 handler(s) for this service...
Service "DBM_REPORTING" has 1 instance(s).
  Instance "dbm1", status READY, has 1 handler(s) for this service...
Service "dbm" has 1 instance(s).
  Instance "dbm1", status READY, has 1 handler(s) for this service...
The command completed successfully
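
With the listener accepting SDP on port 1522, a client on the InfiniBand fabric can be pointed at it through a tnsnames.ora alias along these lines. This is just a sketch: the alias name is made up, while the host and service name are taken from the listener output above.

DBM_ETL_SDP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = SDP)(HOST = enkx3db01-ibvip.enkitec.com)(PORT = 1522))
    (CONNECT_DATA = (SERVICE_NAME = DBM_ETL))
  )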

Like the netstat command for checking TCP connections, there is a corresponding command, sdpnetstat, for SDP connections. Unfortunately, by default this command does not currently exist on Exadata, but it does exist on the Oracle Big Data Appliance and Oracle Exalytics. Here is one example of the output from Exalytics.

[root@enkalytics ~]# sdpnetstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 localhost.localdom:6700 localhost.localdo:35791 ESTABLISHED 
tcp        0      0 enkalytics.Enkitec:9710 enkalytics.Enkite:42991 ESTABLISHED 
tcp        0      0 enkalytics.Enkitec:9706 enkalytics.Enkite:11093 ESTABLISHED 
tcp        0      0 enkalytics.Enkitec:9701 enkalytics.Enkite:16776 ESTABLISHED 
tcp        0      0 localhost.localdom:6700 localhost.localdo:35793 ESTABLISHED 
tcp        0      0 enkalytics.Enkitec:9710 enkalytics.Enkite:10002 ESTABLISHED 
tcp        0      0 localhost.localdom:6700 localhost.localdo:35790 ESTABLISHED 
....
....
tcp        0      0 localhost.localdo:35791 localhost.localdom:6700 ESTABLISHED 
tcp        0      0 localhost.localdom:6700 localhost.localdo:35798 ESTABLISHED 
tcp        0      0 enkalytics.Enkite:11093 enkalytics.Enkitec:9706 ESTABLISHED 
tcp        0      0 localhost.localdom:6700 localhost.localdo:35797 ESTABLISHED 
tcp        0      0 localhost.localdo:35793 localhost.localdom:6700 ESTABLISHED 
tcp        0      0 enkalytics.Enkite:60565 enkalytics.Enkitec:9710 TIME_WAIT   
tcp        0      0 enkalytics.enkite:36136 enk03-vip.enki:ncube-lm ESTABLISHED 
tcp        0      0 enkalytics.enkite:21666 enkalytic:afs3-callback ESTABLISHED 
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:52854 TIME_WAIT   
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:52849 TIME_WAIT   
tcp        0      0 enkalytics.enkite:25478 enk04-vip.enki:ncube-lm ESTABLISHED 
tcp        0      0 enkalytic:afs3-callback enkalytics.enkite:11226 ESTABLISHED 
tcp        0      0 enkalytics.enkite:44861 enkalytics.enkitec:9704 ESTABLISHED 
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:52850 TIME_WAIT   
tcp        0      0 enkalytics.enkite:44867 enkalytics.enkitec:9704 ESTABLISHED 
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:52846 TIME_WAIT   
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:44867 ESTABLISHED 
tcp        0      0 localhost.localdo:35797 localhost.localdom:6700 ESTABLISHED 
tcp        0      0 enkalytics.enkite:11226 enkalytic:afs3-callback ESTABLISHED 
tcp        0      0 enkalytics.enkitec:9704 enkalytics.enkite:44861 ESTABLISHED 
tcp        0      0 enkalytic:afs3-callback enkalytics.enkite:11210 ESTABLISHED 
sdp        0      0 192.168.12.131:43307    enkx3db01-ib:ricardo-lm ESTABLISHED 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path
unix  20     [ ]         DGRAM                    24350  /dev/log
unix  2      [ ]         DGRAM                    7979   @/org/kernel/udev/udevd
unix  2      [ ]         DGRAM                    27732  @/org/freedesktop/hal/udev_event
unix  2      [ ]         DGRAM                    4592850 
unix  2      [ ]         STREAM     CONNECTED     1138317 
unix  2      [ ]         STREAM     CONNECTED     1104247 
unix  2      [ ]         STREAM     CONNECTED     1099975 
unix  2      [ ]         STREAM     CONNECTED     1099714 
unix  3      [ ]         STREAM     CONNECTED     971435 /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     971434 
unix  2      [ ]         DGRAM                    478947 
unix  3      [ ]         STREAM     CONNECTED     29367  @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     29366  
unix  3      [ ]         STREAM     CONNECTED     29353  /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     29352  
unix  3      [ ]         STREAM     CONNECTED     29170  /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     29169  
unix  3      [ ]         STREAM     CONNECTED     29164  
unix  3      [ ]         STREAM     CONNECTED     29163  
unix  2      [ ]         DGRAM                    29161  
unix  2      [ ]         DGRAM                    28889  
....
....
unix  2      [ ]         DGRAM                    27890  
unix  3      [ ]         STREAM     CONNECTED     27865  /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     27864  
unix  3      [ ]         STREAM     CONNECTED     27813  @/var/run/hald/dbus-aeLDAYiwqS
unix  3      [ ]         STREAM     CONNECTED     27812  
unix  3      [ ]         STREAM     CONNECTED     27798  @/var/run/hald/dbus-aeLDAYiwqS
unix  3      [ ]         STREAM     CONNECTED     27797  
unix  3      [ ]         STREAM     CONNECTED     27783  @/var/run/hald/dbus-aeLDAYiwqS
unix  3      [ ]         STREAM     CONNECTED     27782  
unix  3      [ ]         STREAM     CONNECTED     27766  /var/run/acpid.socket
unix  3      [ ]         STREAM     CONNECTED     27765  
unix  3      [ ]         STREAM     CONNECTED     27760  @/var/run/hald/dbus-aeLDAYiwqS
unix  3      [ ]         STREAM     CONNECTED     27759  
unix  3      [ ]         STREAM     CONNECTED     27727  @/var/run/hald/dbus-0e5V2Tfgxi
unix  3      [ ]         STREAM     CONNECTED     27726  
unix  2      [ ]         DGRAM                    27562  
unix  3      [ ]         STREAM     CONNECTED     27445  /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     27444  
unix  2      [ ]         DGRAM                    27433  
unix  2      [ ]         DGRAM                    27422  
unix  3      [ ]         STREAM     CONNECTED     27381  
unix  3      [ ]         STREAM     CONNECTED     27380  
unix  3      [ ]         STREAM     CONNECTED     27339  
unix  3      [ ]         STREAM     CONNECTED     27338  
unix  2      [ ]         DGRAM                    26918  
unix  2      [ ]         DGRAM                    24358  
unix  3      [ ]         STREAM     CONNECTED     24299  
unix  3      [ ]         STREAM     CONNECTED     24298  

[root@enkalytics ~]# ifconfig 
bond0     Link encap:InfiniBand  HWaddr 80:00:00:4B:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.12.131  Bcast:192.168.12.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:1fc:b9ed/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
          RX packets:102393 errors:0 dropped:0 overruns:0 frame:0
          TX packets:133607 errors:0 dropped:16 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:6058691 (5.7 MiB)  TX bytes:4830507 (4.6 MiB)
....

iDB vs RDS vs SDP on Exadata

If you have done work on Exadata, you have probably heard many buzzwords, like Storage Index, Smart Scan, Offloading, etc. Many of these features are based on the InfiniBand architecture, a high-speed interconnect architecture with high throughput and low latency. Talking about InfiniBand, many of us know about iDB and RDS on Exadata, but not many people know about SDP. In this post, I discuss iDB, RDS, and SDP in more detail.

Oracle Exadata uses the Intelligent Database protocol (iDB) to transfer data between the database nodes and the storage cell nodes. It is implemented in the database kernel and works as a function shipping architecture that transparently maps database operations to Exadata operations. iDB is used to send SQL operations from the database node to the cell node, and to get query results or full data blocks back from the cell node.

iDB is built on the Reliable Datagram Sockets (RDS v3) protocol and runs over InfiniBand ZDP (Zero-loss Zero-copy Datagram Protocol). The objective of ZDP is to eliminate unnecessary copying of blocks. RDS is based on the socket API, with low overhead, low latency, and high bandwidth. An Exadata cell node can send and receive large transfers using Remote Direct Memory Access (RDMA).

RDMA_copy

RDMA is direct memory access from the memory of one computer into that of another computer without involving either operating system. The transfers require no work by CPUs, caches, or context switches, and they continue in parallel with other system operations. It is quite useful in a massively parallel processing environment.

RDS is heavily used on Oracle Exadata. RDS delivers highly available, low-overhead datagrams; it is like UDP but more reliable and zero copy. It accesses InfiniBand via the socket API. RDS v3 supports both RDMA read and write and allows large data transfers of up to 8MB. It also supports control messages for asynchronous operation, for submit and completion notifications.

rds_sdp_stack

If you want to optimize communications between Oracle Engineered Systems, like Exadata, Big Data Appliance, and Exalytics, you can use the Sockets Direct Protocol (SDP) networking protocol. SDP only deals with stream sockets.

SDP allows high-performance zero-copy data transfers via RDMA network fabrics and uses a standard wire protocol over an RDMA fabric to support stream sockets (SOCK_STREAM). The goal of SDP is to provide an RDMA-accelerated alternative to the TCP protocol over IP, while remaining transparent to the application.

It bypasses the OS resident TCP stack for stream connections between any endpoints on the RDMA fabric. All other socket types (such as datagram, raw, packet, etc.) are supported by the IP stack and operate over standard IP interfaces (i.e., IPoIB on InfiniBand fabrics). The IP stack has no dependency on the SDP stack; however, the SDP stack depends on IP drivers for local IP assignments and for IP address resolution for endpoint identifications.

sdp_stack

In a future post, I will discuss some useful commands to check out InfiniBand traffic, RDS, and SDP.

Default Port Numbers Used on Exadata: Port Numbers for OEM Part 3 of 3

oem_12c

In Part 1 of this series, I showed the default port numbers for general use.

Part 2 showed the port numbers related to ILOM. This post shows the port numbers for 12c Oracle Enterprise Manager (OEM) Cloud Control.


Port Name                Normal Ranges    Exadata Default Value
EM Upload HTTP Port      4889-4898        4889
EM Upload HTTPS Port     1159,4899-4908   1159

Node Mgr HTTPS Port      7401-7500        Check
Managed Server HTTP Port 7201-7300        Check
EM Console HTTP Port     7788-7798        7788
EM Console HTTPS Port    7799-7809        7799

Management Agent Port    3872,1830-1849   Check
Admin Server HTTP Port   7001             Check
Admin Server HTTPS Port  7101-7200        Check
Managed Server HTTPS Port 7301-7400       Check

"Check" means check the port number configured during the installation.

There are a few ports that do not necessarily need to go through the firewall between Exadata and the OEM OMS.

EM Repository DB Port                 1521

There are also some optional OEM-related ports, which are only needed when using the corresponding components.

Port Name                             Port Number
JVM Diagnostics Managed Server	      3800
JVM Diagnostics Managed Server (SSL)  3801

ADP RMI Registry Port	              51099
ADP Java Provider Port	              55003
ADP Remote Service Controller Port	  55000
ADP Listen                            4210
ADP Listen Port (SSL)                 4211
BI Publisher HTTP                     9701
BI Publisher HTTPS                    9702
Secure web connection to oracle.com   443

Note: Port 443 is HTTPS to updates.oracle.com, support.oracle.com, ccr.oracle.com, login.oracle.com, and aru-akam.oracle.com.
It is outgoing from the OMS and is used for communication with Oracle for OCM, MOS, patching, Self-Updates, and ASR.
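
When working through the firewall rules, it helps to be able to test a single port quickly. One simple check from a database node (the hostname and port below are only examples, and nc may need to be installed):

# test whether the EM upload HTTP port on the OMS host is reachable
nc -zv gc12c.enkitec.com 4889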

To verify details about the ports used by the OMS, you can run the following:

[oracle@gc12c bin]$ emctl status oms -details
Oracle Enterprise Manager Cloud Control 12c Release 2
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.
Enter Enterprise Manager Root (SYSMAN) Password :
Console Server Host : gc12c.enkitec.com
HTTP Console Port : 7789
HTTPS Console Port : 7801
HTTP Upload Port : 4890
HTTPS Upload Port : 4901
EM Instance Home : /u01/app/oracle/oms12c/gc_inst/em/EMGC_OMS1
OMS Log Directory Location : /u01/app/oracle/oms12c/gc_inst/em/EMGC_OMS1/sysman/log
OMS is not configured with SLB or virtual hostname
Agent Upload is locked.
OMS Console is locked.
Active CA ID: 1
Console URL: https://gc12c.enkitec.com:7801/em
Upload URL: https://gc12c.enkitec.com:4901/empbs/upload

WLS Domain Information
Domain Name : GCDomain
Admin Server Host: gc12c.enkitec.com

Managed Server Information
Managed Server Instance Name: EMGC_OMS1
Managed Server Instance Host: gc12c.enkitec.com
WebTier is Up
Oracle Management Server is Up

I could not find a command to show which port number is used for things like the Node Manager and Managed Server. But I did find a way to show this kind of information from a temp file created during the initial installation. The file is MIDDLEWARE_HOME/.gcinstall_temp/staticports.ini on the OMS host.

[oracle@gc12c oracle]$ cat /u01/app/oracle/oms12c/.gcinstall_temp/staticports.ini
Enterprise Manager Upload Http Port=4890
Enterprise Manager Upload Http SSL Port=4901
Enterprise Manager Central Console Http SSL Port=7801
Node Manager Http SSL Port=7405
Managed Server Http Port=7203
Enterprise Manager Central Console Http Port=7789
Oracle Management Agent Port=3872
Admin Server Http SSL Port=7102
Managed Server Http SSL Port=7302

The following chart shows firewall configurations for OEM components.

OEM_firewall_port

Related Posts:

Default Port Numbers Used on Exadata: Port Numbers for General Use Part 1 of 3

Default Port Numbers Used on Exadata: Port Numbers for ILOM Part 2 of 3