Recently we run into an interesting performance issue at one of our clients. They reported significant slow down in the system during the day time for some time. Their X2 Exadata does not host many databases, with three production databases and a few test/QA databases. Out of the three production databases, let me give a fake name (CRDB), is the most important and critical one, which is mainly OLTP and have some reporting activities. The other two production databases are tiny and less important databases with not much activities. In other words, the majority of db activities happens at CRDB database.
The slow down is mysterious and randomly during the day and did not seem to follow a pattern. When the slow down happens, the active sessions at CRDB shot up from average 2~4 to at least 30, sometimes reach to 100. At the same time, there were massive slow down in all other databases on the same Exadata. To track down the issue, we requested a few AWR reports from the client for the CRDB database.
In the AWR report, the top event is cell smart table scan. For example, on db node 1 alone, from 10:30am to 2:30pm on August 6, 2013, the total waits for cell smart table scan was 574, 218 with average wait time of 106 ms, equal to 60,980 seconds of DB time, in other words, 78.20% of DB time. Other than that, AWR report did not tell anything useful that should cause performance issue. Therefore, we mainly focused on IO and smart scan related operation.
On Exadata, cell smart table scan is the scan on cell storage for large IOs. Many offloading activities involve cell smart table scan. As client said there was no application code change recently and data volume remains similar in the past, we assume the total logical and physical IO for CRDB database should be at the similar level as before. They also mentioned that the performance issue began a few days after Exadata patch.
Their resource plan gives 50% allocation to CRDB, and another 50% to the rest of databases with objective of LOW_LATENCY. This resource plan has been in production for over a year and managed quite well in IO.
Luckily during the time we were tracking down the issue, I was attending our company‘s 2013 Exadata conference (E4) . It’s an excellent Exadata and Bigdata specific conference, not only it has many great Exadata/Big Data speakers, but also many Exadata experts/users around the world. I happened to listen one session about IORM presented by Oracle’s Sue Lee. She is a Director of Development in the Oracle RDBMS division and responsible for the Oracle Resource Manager. Her presentation about IORM was an excellent one. She will also give a similar session at OpenWorld. If you are working on Exadata and attend OpenWorld this year, I highly recommended to attend her session, Using Resource Manager for Database Consolidation with Oracle Database 12c (Session ID: CON8884) and you will gain in-depth knowledge about IORM and Exadata internals.
During her presentation, she mentioned a tool, called iorm_metrics script (Doc ID 1337265.1), and said this tools is frequently used in her group to track down IORM performance problems. It sounds interesting, so I immediately downloaded the script and asked our client run the followings on all of cell nodes.
./metric_iorm.pl “where collectionTime > ‘2013-08-06T11:30:00-06:00’ and collectionTime < ‘2013-08-06T13:20:00-06:00′” > metric_iorm_`hostname`.log
The result was quite interesting and I used one snapshot from cell node 1 as example:
Time: 2013-08-06T12:33:03-05:00 Database: CRDB Utilization: Small=4% Large=7% Flash Cache: IOPS=226 Disk Throughput: MBPS=120 Small I/O's: IOPS=258 Avg qtime=1.1ms Large I/O's: IOPS=114 Avg qtime=3026ms Consumer Group: _ORACLE_MEDPRIBG_GROUP_ Utilization: Small=0% Large=0% Flash Cache: IOPS=0.1 Disk Throughput: MBPS=0 Small I/O's: IOPS=0.3 Avg qtime=0.0ms Large I/O's: IOPS=0.0 Avg qtime=0.0ms Consumer Group: OTHER_GROUPS Utilization: Small=0% Large=7% Flash Cache: IOPS=150 Disk Throughput: MBPS=116 Small I/O's: IOPS=1.8 Avg qtime=0.8ms Large I/O's: IOPS=112 Avg qtime=3077ms Consumer Group: _ORACLE_LOWPRIBG_GROUP_ Utilization: Small=1% Large=0% Flash Cache: IOPS=23.8 Disk Throughput: MBPS=1 Small I/O's: IOPS=82.1 Avg qtime=2.7ms Large I/O's: IOPS=0.0 Avg qtime=0.0ms Consumer Group: _ORACLE_BACKGROUND_GROUP_ Utilization: Small=3% Large=0% Flash Cache: IOPS=51.6 Disk Throughput: MBPS=2 Small I/O's: IOPS=174 Avg qtime=0.3ms Large I/O's: IOPS=1.9 Avg qtime=66.1ms Database: _OTHER_DATABASE_ Utilization: Small=10% Large=9% Flash Cache: IOPS=89.7 Disk Throughput: MBPS=142 Small I/O's: IOPS=504 Avg qtime=0.5ms Large I/O's: IOPS=134 Avg qtime=4137ms Consumer Group: _ORACLE_MEDPRIBG_GROUP_ Utilization: Small=0% Large=0% Flash Cache: IOPS=0.1 Disk Throughput: MBPS=0 Small I/O's: IOPS=0.4 Avg qtime=0.2ms Large I/O's: IOPS=0.0 Avg qtime=0.0ms Consumer Group: OTHER_GROUPS Utilization: Small=0% Large=9% Flash Cache: IOPS=42.8 Disk Throughput: MBPS=139 Small I/O's: IOPS=0.8 Avg qtime=0.6ms Large I/O's: IOPS=134 Avg qtime=4138ms Consumer Group: _ORACLE_LOWPRIBG_GROUP_ Utilization: Small=0% Large=0% Flash Cache: IOPS=16.9 Disk Throughput: MBPS=0 Small I/O's: IOPS=56.2 Avg qtime=3.5ms Large I/O's: IOPS=0.0 Avg qtime=0.0ms Consumer Group: _ORACLE_BACKGROUND_GROUP_ Utilization: Small=9% Large=0% Flash Cache: IOPS=29.8 Disk Throughput: MBPS=2 Small I/O's: IOPS=447 Avg qtime=0.2ms Large I/O's: IOPS=0.0 Avg qtime=53.0ms CELL METRICS SUMMARY Cell Total Utilization: Small=14% Large=16% Cell Total Flash Cache: IOPS=315.7 Cell Total Disk Throughput: MBPS=249.454 Cell Total Small I/O's: IOPS=762.2 Cell Total Large I/O's: IOPS=245.6 Cell Avg small read latency: 11.24 ms Cell Avg small write latency: 2.64 ms Cell Avg large read latency: 13.35 ms Cell Avg large write latency: 4.47 ms
From the above result, we can see the average queue time for every large IO (cell smart scan) was over 3, 000 ms for CRDB resource group and over 4, 000 ms for OTHER_DATABASE. The normal range should be < 30ms. The throughput for OTHER_DATABASE was 142 MB/second while CRDB was 120 MB/second. This indicates the saturated disk I/O.
It’s possible CRDB’s large amount IO caused OTHER_DATABASE to slow down, but Disk Throughpt from OTHER_DATABASE should be small and wait time could be longer. On the contrary, the high IO Throughput from OTHER_DATABASE indicates something not right. So I zoomed into other databases, and compare the IO throughput between databases.
The following charts shows the throughput for CRDB database was around 300MB/second before 3:30pm.
At the same time, another much smaller and less used database, the throughput was much higher than CRDB database. The throughput was mostly between 500MB to 600MB/second, with some peak to over 1200MB/second at a few times.
It is normal to see CRDB has a lot of large IOs during the day as it is the major database and there are many activities against this database. However it is unusual to see a tiny small database took significant amount of large IOs. This inspired me to investigate more on this small database.
From 12c OEM’s SQL Monitor screen, we can see a lot of queries with SQL ID d0af9yxrrrvy5 with many IOs, over 80GB for each execution and long running time. There were multiple instances for the same query executed during the same timeframe the slow down happened.
The 12c Cloud Control OEM also shows the Average Throttle Time for Disk I/Os for both CRDB and OTHER_DATABASE shot up to 1,500~2,000 milliseconds on August 6 and 7 afternoon. This was exact the time the query was executing in one of OTHER_DATABASE.
After this query was shutdown, system returned to normal. So the slow down in CRDB database was not the cause of the problem, but the victim of IO throttle caused by other database.
It seems the end of story for the problem. Actually not yet.
Although the system looks normal, there were two queries running about 2~3 times slower than in the past. Our Enkitec‘s awr_plan_change.sql shows almost identical LIO and PIO for each execution, but the timing changes significantly since the patch date of July 15. It’s another interesting issue.
SYS@CRDB1> @awr_plan_change Enter value for sql_id: 4zbfxnv733dzb Enter value for instance_number: SNAP_ID NODE BEGIN_INTERVAL_TIME SQL_ID EXECS AVG_ETIME AVG_LIO AVG_PIO ---------- ------ -------------------------------- ----- -------- ------------ ------------ -------------- -------------- 9700 1 28-MAY-13 07.00.37.592 AM 4zbfxnv733dzb 1 288.559 55,876,213.0 55,875,801.0 9748 2 29-MAY-13 07.00.16.396 AM 4zbfxnv733dzb 1 334.543 55,876,213.0 55,875,801.0 9796 3 30-MAY-13 07.00.01.330 AM 4zbfxnv733dzb 1 315.035 55,876,333.0 55,875,801.0 .... .... 11956 3 14-JUL-13 07.00.09.104 AM 4zbfxnv733dzb 1 258.629 55,876,269.0 55,875,804.0 12000 2 15-JUL-13 07.30.03.260 AM 4zbfxnv733dzb 1 1,549.712 43,115,159.0 43,107,149.0 12001 2 15-JUL-13 08.00.11.035 AM 4zbfxnv733dzb 0 993.135 12,778,387.0 12,768,812.0 12047 1 16-JUL-13 07.00.11.740 AM 4zbfxnv733dzb 1 565.923 55,876,638.0 55,875,801.0 12096 1 17-JUL-13 07.30.33.069 AM 4zbfxnv733dzb 1 1,148.289 55,878,883.0 55,875,923.0 12143 3 18-JUL-13 07.00.10.648 AM 4zbfxnv733dzb 1 567.586 55,876,013.0 55,875,803.0 .... .... 13057 1 06-AUG-13 07.00.12.679 AM 4zbfxnv733dzb 1 645.235 55,876,538.0 55,875,821.0 13105 2 07-AUG-13 07.00.03.051 AM 4zbfxnv733dzb 1 986.482 55,877,223.0 55,875,823.0 13153 3 08-AUG-13 07.00.23.646 AM 4zbfxnv733dzb 1 587.454 55,875,957.0 55,875,801.0 13201 1 09-AUG-13 07.00.27.502 AM 4zbfxnv733dzb 1 594.734 55,876,423.0 55,875,801.0 13249 3 10-AUG-13 07.00.48.524 AM 4zbfxnv733dzb 1 515.732 55,877,880.0 55,875,801.0 13297 1 11-AUG-13 07.00.09.256 AM 4zbfxnv733dzb 1 477.941 55,875,965.0 55,875,802.0
So I compare AWR reports one before the patch and another after the patch during the same period (7am~8:30am) when both queries run. Although there are some difference here and there, I focus on more about the difference on IO wait, especially large scan wait.
From the above chart, we can see the 2nd line, cell smart table scan, avg wait time jump from 7ms to 38ms and total wait time jump from 2,083 seconds to 3,000 seconds. At the same time, the average Wait Time drop 24% for cell single block physical read, which is small IO. At this moment, it seems I realized IORM internal handling logic seem change after the patch.
Checking out Oracle Support site, found something interesting related to IORM.
For cell version 126.96.36.199 – 188.8.131.52, IORM is disabled by default. To disable IORM, run “alter iormplan objective=off;” command. 184.108.40.206 is cell version our client used before the patch.
For cell version 220.127.116.11 and above, IORM is enabled by default and use basic objective. To disable IORM, run “alter iormplan objective=basic;”. 18.104.22.168 is our client’s current version.
The above indicates there are some changes involving IORM. Their current objective is low-latency. Low-latency will be good for fast response to OLTP requests, with lower wait time for small IO. At the same time, the large IO is scarified to have longer wait time per request. Another possibility is IORM changes the way to throttle IO. These two queries generate over 110 million physical IO within 10 minutes. I saw throttle wait time for CRDB was between 100~200 ms when these two queries run in the morning.
So the LOW-LATENCY might not be a good objective for the client. I remember Sue Lee recommended to use AUTO objective as best practice. If it doesn’t work, try other next. Obviously, this is something we would like to follow. So the client made just one line change as follows to switch LOW-LATENCY to AUTO objective.
Here is the command to make the change to AUTO.
# dcli -g ~/cell_group -l root ‘cellcli -e alter iormplan objective = auto’
Here is the result after the objective change to AUTO.
SNAP_ID NODE BEGIN_INTERVAL_TIME SQL_ID EXECS AVG_ETIME AVG_LIO AVG_PIO ---------- ------ -------------------------------- ----- -------- ------------ ------------ -------------- -------------- 13633 3 18-AUG-13 07.00.21.207 AM 4zbfxnv733dzb 1 471.955 55,875,957.0 55,875,801.0 13681 3 19-AUG-13 07.00.35.982 AM 4zbfxnv733dzb 1 556.433 55,877,846.0 55,875,863.0 13729 3 20-AUG-13 07.00.23.422 AM 4zbfxnv733dzb 1 537.509 55,877,821.0 55,875,861.0 13777 2 21-AUG-13 07.00.43.473 AM 4zbfxnv733dzb 1 166.424 55,876,181.0 55,875,806.0 13825 2 22-AUG-13 07.00.06.274 AM 4zbfxnv733dzb 1 175.517 55,875,957.0 55,875,801.0
The result was amazing. Not only there is no negative impact on the system, the two queries running time immediately back to their normal range, even better than before. The query above drops from 500+ seconds to less than 200 seconds.
During a few emails back and forth with Sue Lee about this topic, she added a few good points as follows.
When you look at the disk utilization charts, you should keep in mind that IORM is only actively regulating when you see IOs being throttled. So you should only see that the allocations are “kicking in” when the utilization is pretty high.
If you look at slide 59, you’ll see that with LOW LATENCY objective, we don’t allow the disks to run at their full utilization. This is why you got better results with AUTO objective for throughput-intensive applications like these reports. This is basically what you already said in the attached email.
When I checked out her presentation slide again, of course, it explains everything. For Low-Latency, the Peak Disk Utilization for scans is the lowest, only 40% and next one is Balanced with 90%, High Throughput with 100%. No wonder the disk throttle time is that high.
I must say I am very happy to attend E4 conference and uses the knowledge immediately to the real time production issue. In case you didn’t attend, here is one photo I took from the conference. This year we had a larger room than last year’s E4 conference with more people coming. I am sure next year it will have even a larger room for bigger audience.