AWR Is Not Enough to Track Down an IO Problem on Exadata

Recently we ran into an interesting performance issue at one of our clients. For some time they had been reporting significant slowdowns in the system during the daytime. Their X2 Exadata does not host many databases: three production databases and a few test/QA databases. Of the three production databases, the most important and critical one, which I will call CRDB (a fake name), is mainly OLTP with some reporting activity. The other two production databases are tiny and far less important, with little activity. In other words, the majority of database activity happens on CRDB.

The slowdown was mysterious: it occurred randomly during the day and did not seem to follow a pattern. When it happened, the active sessions on CRDB shot up from an average of 2~4 to at least 30, sometimes reaching 100. At the same time, all other databases on the same Exadata slowed down massively. To track down the issue, we requested a few AWR reports from the client for the CRDB database.

In the AWR reports, the top event was cell smart table scan. For example, on db node 1 alone, from 10:30am to 2:30pm on August 6, 2013, there were 574,218 waits for cell smart table scan with an average wait time of 106 ms, which adds up to 60,980 seconds of DB time, or 78.20% of total DB time. Other than that, the AWR report did not reveal anything that should cause a performance issue, so we focused mainly on IO and smart scan related operations.
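
These figures can also be pulled directly from the AWR repository. The query below is only a minimal sketch of what I would use for that, with placeholder bind variables for the snapshot range; it assumes no instance restart within the range, since the underlying counters are cumulative.

-- Waits and wait time for one event across an AWR snapshot range
-- (:begin_snap and :end_snap are placeholders for the window of interest)
select e.instance_number,
       e.event_name,
       max(e.total_waits) - min(e.total_waits)                            as waits,
       round((max(e.time_waited_micro) - min(e.time_waited_micro)) / 1e6) as time_waited_sec
from   dba_hist_system_event e
where  e.event_name = 'cell smart table scan'
and    e.snap_id between :begin_snap and :end_snap
group  by e.instance_number, e.event_name;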

On Exadata, cell smart table scan is the wait event for offloaded full scans against cell storage, i.e. large IOs. Many offloading activities involve cell smart table scan. Since the client said there had been no recent application code change and the data volume remained similar, we assumed the total logical and physical IO for the CRDB database should be at a similar level as before. They also mentioned that the performance issue began a few days after an Exadata patch.
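
As a quick sanity check on how much IO is actually eligible for offload, the smart scan statistics in V$SYSSTAT can be compared with total physical reads. A minimal sketch (the values are cumulative since instance startup, so deltas between two samples are what really matter):

-- Smart scan related statistics, in GB, cumulative since instance startup
select name, round(value / 1024 / 1024 / 1024, 1) as gb
from   v$sysstat
where  name in ('physical read total bytes',
                'cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan');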

Their IORM plan gives a 50% allocation to CRDB and the other 50% to the rest of the databases, with an objective of LOW_LATENCY. This plan had been in production for over a year and had managed IO quite well.
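
For reference, the active IORM plan and its objective on a given cell can be displayed with CellCLI on the cell itself:

cellcli -e list iormplan detail
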
Luckily, while we were tracking down the issue, I was attending our company's 2013 Exadata conference (E4). It is an excellent Exadata and Big Data specific conference; not only does it have many great Exadata/Big Data speakers, it also draws Exadata experts and users from around the world. I happened to listen to a session about IORM presented by Oracle's Sue Lee. She is a Director of Development in the Oracle RDBMS division, responsible for Oracle Resource Manager, and her presentation about IORM was excellent. She will also give a similar session at OpenWorld. If you work on Exadata and attend OpenWorld this year, I highly recommend her session, Using Resource Manager for Database Consolidation with Oracle Database 12c (Session ID: CON8884); you will gain in-depth knowledge about IORM and Exadata internals.

During her presentation, she mentioned a tool called the iorm_metrics script (metric_iorm.pl, Doc ID 1337265.1) and said her group frequently uses it to track down IORM performance problems. It sounded interesting, so I immediately downloaded the script and asked our client to run the following on all of the cell nodes.

./metric_iorm.pl "where collectionTime > '2013-08-06T11:30:00-06:00' and collectionTime < '2013-08-06T13:20:00-06:00'" > metric_iorm_`hostname`.log
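
Since the script has to run on every cell, one convenient way (just a sketch, assuming passwordless SSH and a cell_group file listing the cells) is to push it out and run it with dcli from a compute node:

# copy the script to every cell, then run it there with the same collectionTime filter
dcli -g ~/cell_group -l celladmin -f metric_iorm.pl -d /tmp
dcli -g ~/cell_group -l celladmin "perl /tmp/metric_iorm.pl \"where collectionTime > '2013-08-06T11:30:00-06:00' and collectionTime < '2013-08-06T13:20:00-06:00'\""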

The result was quite interesting; here is one snapshot from cell node 1 as an example:

Time: 2013-08-06T12:33:03-05:00
Database: CRDB
Utilization:     Small=4%    Large=7%
Flash Cache:     IOPS=226
Disk Throughput: MBPS=120
Small I/O's:     IOPS=258    Avg qtime=1.1ms
Large I/O's:     IOPS=114    Avg qtime=3026ms
	Consumer Group: _ORACLE_MEDPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=0.1
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=0.3    Avg qtime=0.0ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: OTHER_GROUPS
	Utilization:     Small=0%    Large=7%
	Flash Cache:     IOPS=150
	Disk Throughput: MBPS=116
	Small I/O's:     IOPS=1.8    Avg qtime=0.8ms
	Large I/O's:     IOPS=112    Avg qtime=3077ms
	Consumer Group: _ORACLE_LOWPRIBG_GROUP_
	Utilization:     Small=1%    Large=0%
	Flash Cache:     IOPS=23.8
	Disk Throughput: MBPS=1
	Small I/O's:     IOPS=82.1    Avg qtime=2.7ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: _ORACLE_BACKGROUND_GROUP_
	Utilization:     Small=3%    Large=0%
	Flash Cache:     IOPS=51.6
	Disk Throughput: MBPS=2
	Small I/O's:     IOPS=174    Avg qtime=0.3ms
	Large I/O's:     IOPS=1.9    Avg qtime=66.1ms

Database: _OTHER_DATABASE_
Utilization:     Small=10%    Large=9%
Flash Cache:     IOPS=89.7
Disk Throughput: MBPS=142
Small I/O's:     IOPS=504    Avg qtime=0.5ms
Large I/O's:     IOPS=134    Avg qtime=4137ms
	Consumer Group: _ORACLE_MEDPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=0.1
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=0.4    Avg qtime=0.2ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: OTHER_GROUPS
	Utilization:     Small=0%    Large=9%
	Flash Cache:     IOPS=42.8
	Disk Throughput: MBPS=139
	Small I/O's:     IOPS=0.8    Avg qtime=0.6ms
	Large I/O's:     IOPS=134    Avg qtime=4138ms
	Consumer Group: _ORACLE_LOWPRIBG_GROUP_
	Utilization:     Small=0%    Large=0%
	Flash Cache:     IOPS=16.9
	Disk Throughput: MBPS=0
	Small I/O's:     IOPS=56.2    Avg qtime=3.5ms
	Large I/O's:     IOPS=0.0    Avg qtime=0.0ms
	Consumer Group: _ORACLE_BACKGROUND_GROUP_
	Utilization:     Small=9%    Large=0%
	Flash Cache:     IOPS=29.8
	Disk Throughput: MBPS=2
	Small I/O's:     IOPS=447    Avg qtime=0.2ms
	Large I/O's:     IOPS=0.0    Avg qtime=53.0ms

CELL METRICS SUMMARY

Cell Total Utilization:     Small=14%    Large=16%
Cell Total Flash Cache:     IOPS=315.7
Cell Total Disk Throughput: MBPS=249.454
Cell Total Small I/O's:     IOPS=762.2
Cell Total Large I/O's:     IOPS=245.6
Cell Avg small read latency:  11.24 ms
Cell Avg small write latency: 2.64 ms
Cell Avg large read latency:  13.35 ms
Cell Avg large write latency: 4.47 ms

From the above result, we can see that the average queue time for each large IO (cell smart table scan) was over 3,000 ms for the CRDB resource group and over 4,000 ms for _OTHER_DATABASE_; the normal range should be under 30 ms. The disk throughput for _OTHER_DATABASE_ was 142 MB/second while CRDB was at 120 MB/second. This points to saturated disk IO.

It was possible that a large amount of IO from CRDB caused _OTHER_DATABASE_ to slow down, but in that case the disk throughput from _OTHER_DATABASE_ should have been small, with longer wait times. On the contrary, the high IO throughput from _OTHER_DATABASE_ indicated that something was not right. So I zoomed into the other databases and compared the IO throughput between them.
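
The same story can be read straight from the cell metrics. As I understand them, DB_IO_WT_SM_RQ and DB_IO_WT_LG_RQ are the average IORM wait per small and large IO request for each database, so a quick check on a cell would look something like:

cellcli -e "list metriccurrent where name like 'DB_IO_WT_.*_RQ' attributes name, metricObjectName, metricValue"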

The following chart shows that the throughput for the CRDB database was around 300 MB/second before 3:30pm.
iorm1308_io_CRDB
At the same time, on another much smaller and less used database, the throughput was much higher than on CRDB: mostly between 500 and 600 MB/second, with a few peaks of over 1200 MB/second.

iorm1308_io_otherDB

It is normal to see a lot of large IOs on CRDB during the day, as it is the major database with most of the activity. However, it is unusual to see a tiny database drive a significant amount of large IO. That prompted me to investigate this small database further.

From the 12c OEM SQL Monitor screen, we could see many executions of SQL ID d0af9yxrrrvy5, each doing a large amount of IO, over 80 GB per execution, with long run times. Multiple executions of this same query were running during the timeframe in which the slowdown happened.

iorm1308_slow_query
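
Outside of OEM, the heavy IO statements can usually be spotted with a query along these lines against GV$SQL_MONITOR (the 10 GB threshold is just an arbitrary cut-off):

-- Monitored statements that have read a large amount of data
select sql_id,
       status,
       sql_exec_start,
       round(physical_read_bytes / 1024 / 1024 / 1024, 1) as phys_read_gb,
       round(elapsed_time / 1e6)                          as elapsed_sec
from   gv$sql_monitor
where  physical_read_bytes > 10 * 1024 * 1024 * 1024
order  by physical_read_bytes desc;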

12c Cloud Control also showed that the Average Throttle Time for Disk I/Os for both CRDB and _OTHER_DATABASE_ shot up to 1,500~2,000 milliseconds on the afternoons of August 6 and 7. This was exactly when the query was executing in one of the _OTHER_DATABASE_ databases.

iorm1308_disk_throttle_time

After this query was shut down, the system returned to normal. So the CRDB database was not the cause of the problem but the victim of IO throttling triggered by another database.

That might seem like the end of the story. Actually, not yet.

Although the system looked normal, two queries were still running about 2~3 times slower than in the past. Enkitec's awr_plan_change.sql script showed almost identical LIO and PIO for each execution, but the timing changed significantly starting from the patch date of July 15. That was another interesting issue.

SYS@CRDB1> @awr_plan_change
Enter value for sql_id: 4zbfxnv733dzb
Enter value for instance_number:
SNAP_ID  NODE BEGIN_INTERVAL_TIME            SQL_ID          EXECS  AVG_ETIME      AVG_LIO        AVG_PIO
-------- ---- ------------------------------ --------------- ------ ------------- -------------- --------------
9700	1 28-MAY-13 07.00.37.592 AM	 4zbfxnv733dzb   1	 288.559   55,876,213.0   55,875,801.0
9748	2 29-MAY-13 07.00.16.396 AM	 4zbfxnv733dzb   1	 334.543   55,876,213.0   55,875,801.0
9796	3 30-MAY-13 07.00.01.330 AM	 4zbfxnv733dzb	 1	 315.035   55,876,333.0   55,875,801.0
....
....
11956	3 14-JUL-13 07.00.09.104 AM	 4zbfxnv733dzb	 1	 258.629   55,876,269.0   55,875,804.0
12000	2 15-JUL-13 07.30.03.260 AM	 4zbfxnv733dzb	 1 1,549.712   43,115,159.0   43,107,149.0
12001	2 15-JUL-13 08.00.11.035 AM	 4zbfxnv733dzb	 0	 993.135   12,778,387.0   12,768,812.0
12047	1 16-JUL-13 07.00.11.740 AM	 4zbfxnv733dzb	 1	 565.923   55,876,638.0   55,875,801.0
12096	1 17-JUL-13 07.30.33.069 AM	 4zbfxnv733dzb	 1 1,148.289   55,878,883.0   55,875,923.0
12143	3 18-JUL-13 07.00.10.648 AM	 4zbfxnv733dzb	 1	 567.586   55,876,013.0   55,875,803.0
....
....
13057	1 06-AUG-13 07.00.12.679 AM	 4zbfxnv733dzb	 1	 645.235   55,876,538.0   55,875,821.0
13105	2 07-AUG-13 07.00.03.051 AM	 4zbfxnv733dzb	 1	 986.482   55,877,223.0   55,875,823.0
13153	3 08-AUG-13 07.00.23.646 AM	 4zbfxnv733dzb	 1	 587.454   55,875,957.0   55,875,801.0
13201	1 09-AUG-13 07.00.27.502 AM	 4zbfxnv733dzb	 1	 594.734   55,876,423.0   55,875,801.0
13249	3 10-AUG-13 07.00.48.524 AM	 4zbfxnv733dzb	 1	 515.732   55,877,880.0   55,875,801.0
13297	1 11-AUG-13 07.00.09.256 AM	 4zbfxnv733dzb	 1	 477.941   55,875,965.0   55,875,802.0
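
awr_plan_change.sql itself is part of Enkitec's script collection; a rough, simplified equivalent of the output above (not the actual script) can be pulled from the AWR views like this:

-- Per-snapshot elapsed time, LIO and PIO for one SQL_ID (simplified sketch)
select s.snap_id,
       s.instance_number                                                    as node,
       sn.begin_interval_time,
       s.sql_id,
       s.executions_delta                                                   as execs,
       round(s.elapsed_time_delta / 1e6 / nullif(s.executions_delta, 0), 3) as avg_etime,
       round(s.buffer_gets_delta / nullif(s.executions_delta, 0), 1)        as avg_lio,
       round(s.disk_reads_delta / nullif(s.executions_delta, 0), 1)         as avg_pio
from   dba_hist_sqlstat s
join   dba_hist_snapshot sn
  on   sn.dbid = s.dbid
 and   sn.snap_id = s.snap_id
 and   sn.instance_number = s.instance_number
where  s.sql_id = '4zbfxnv733dzb'
order  by s.snap_id;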

So I compared two AWR reports, one from before the patch and one from after, covering the same period (7am~8:30am) when both queries run. Although there were some differences here and there, I focused mostly on the differences in IO waits, especially large scan waits.
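
For the before/after comparison itself, the AWR compare-period (diff) report that ships with the database is handy; from SQL*Plus it is simply:

SQL> @?/rdbms/admin/awrddrpt.sql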

iorm1308_awr_compare_wait_events

From the above chart, we can see on the second line that the average wait time for cell smart table scan jumped from 7 ms to 38 ms, and the total wait time jumped from 2,083 seconds to 3,000 seconds. At the same time, the average wait time dropped 24% for cell single block physical read, which is small IO. At this point I began to suspect that IORM's internal handling had changed after the patch.

Checking the Oracle Support site, I found something interesting related to IORM.
For cell versions 11.2.2.2 – 11.2.3.1, IORM is disabled by default, and to disable it you run the "alter iormplan objective=off;" command. 11.2.3.1 is the cell version our client used before the patch.

For cell versions 11.2.3.2 and above, IORM is enabled by default and uses the basic objective; to disable IORM, you run "alter iormplan objective=basic;". 11.2.3.2 is our client's current version.
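
To confirm which cell software version is actually running, something along these lines can be run from a compute node (again assuming the usual cell_group file):

dcli -g ~/cell_group -l celladmin "cellcli -e 'list cell attributes name, releaseVersion'"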

The above indicates that something changed around IORM with the patch. The client's current objective is low_latency, which is good for fast response to OLTP requests, with lower wait times for small IO. At the same time, large IO is sacrificed with a longer wait time per request. Another possibility is that IORM changed the way it throttles IO. These two queries generate over 110 million physical IOs within 10 minutes, and I saw throttle wait times for CRDB of between 100~200 ms while they ran in the morning.

So LOW_LATENCY might not be a good objective for this client. I remembered that Sue Lee recommended the AUTO objective as a best practice, trying other objectives only if that does not work. Obviously, this was something we wanted to follow, so the client made just a one-line change to switch the objective from LOW_LATENCY to AUTO.

Here is the command to make the change to AUTO.
# dcli -g ~/cell_group -l root 'cellcli -e alter iormplan objective = auto'
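
To confirm the new objective took effect on every cell, a quick check along these lines can be run afterwards:

# dcli -g ~/cell_group -l root 'cellcli -e list iormplan attributes objective'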

Here is the result after the objective change to AUTO.

SNAP_ID  NODE BEGIN_INTERVAL_TIME            SQL_ID          EXECS  AVG_ETIME      AVG_LIO        AVG_PIO
-------- ---- ------------------------------ --------------- ------ ------------- -------------- --------------
13633	3 18-AUG-13 07.00.21.207 AM	 4zbfxnv733dzb	 1	 471.955   55,875,957.0   55,875,801.0
13681	3 19-AUG-13 07.00.35.982 AM	 4zbfxnv733dzb	 1	 556.433   55,877,846.0   55,875,863.0
13729	3 20-AUG-13 07.00.23.422 AM	 4zbfxnv733dzb	 1	 537.509   55,877,821.0   55,875,861.0
13777	2 21-AUG-13 07.00.43.473 AM	 4zbfxnv733dzb   1	 166.424   55,876,181.0   55,875,806.0
13825	2 22-AUG-13 07.00.06.274 AM	 4zbfxnv733dzb	 1	 175.517   55,875,957.0   55,875,801.0

The result was amazing. Not only was there no negative impact on the system, but the two queries' run times immediately went back to their normal range, even better than before. The query above dropped from 500+ seconds to less than 200 seconds.

In a few emails back and forth with Sue Lee about this topic, she added a few good points, as follows.

When you look at the disk utilization charts, you should keep in mind that IORM is only actively regulating when you see IOs being throttled. So you should only see that the allocations are “kicking in” when the utilization is pretty high.

If you look at slide 59, you’ll see that with LOW LATENCY objective, we don’t allow the disks to run at their full utilization. This is why you got better results with AUTO objective for throughput-intensive applications like these reports. This is basically what you already said in the attached email.

When I checked her presentation slides again, of course, they explained everything. With low_latency, the peak disk utilization allowed for scans is the lowest, only 40%; balanced allows 90% and high_throughput allows 100%. No wonder the disk throttle time was that high.

My colleague Carlos Sierra, the author of the famous SQLT tool, also did an excellent analysis of the same issue from the SQLT perspective; here is the link to his analysis.

I must say I am very happy that I attended the E4 conference and could apply the knowledge immediately to a real production issue. In case you didn't attend, here is a photo I took at the conference. This year we had a larger room than at last year's E4, with more people coming. I am sure next year it will need an even larger room for a bigger audience.
e4_2013_photo1