Before Oracle 12c, PGA_AGGREGATE_TARGET is the most useful parameter to control PGA memory allocated to work areas. However PGA_AGGREGATE_TARGET is a soft target and not a hard limit. The actual PGA usage can be as high as three times of the value of PGA_AGGREGATE_TARGET. Excessive PGA usage can lead to memory swapping issue and have impact on the performance. There are many articles and blogs discussing this parameter and tuning technique and I will not repeat more here in this blog. Instead I am going to discuss more about a new PGA related parameter PGA_AGGREGATE_LIMIT.
PGA_AGGREGATE_LIMIT is a new parameter introduced in 12c. It put a hard limit instead of a soft limit on the PGA memory usage. It just like we need to put a limit to the height of skyscraper and some of them are away too high. If this limit is reached or exceeded, Oracle will terminates the sessions that are consuming the most untunable PGA memory. I have more detail discussion about tunable and untunable PGA memory at the end of this blog.
Recently I was requested to help my client to investigate a PGA memory related production issue and it was all related to PGA_AGGREGATE_TARGET, PGA_AGGREGATE_LIMIT and user application.
The database is a small 12c database with not much activities. The PGA_AGGREGATE_TARGET is set to 2GB and it is usually sufficient as PGA Cache Hit Ratio hits almost 100% based on the result from v$pga_target_advice.
When the production issue happened, the user job failed in completion and Oracle alert log file has many PGA related errors:
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_13351.trc (incident=68575):
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_28611.trc (incident=68279):
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu call ,pmuccst: adt/record)
Incident details in: /db/oracle/diag/rdbms/dp783/DP783/incident/incdir_68279/DP783_ora_28611_i68279.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_28611.trc (incident=68280):
ORA-04030: out of process memory when trying to allocate 169040 bytes (pga heap,kgh stack)
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu call ,pmuccst: adt/record)
PGA_AGGREGATE_LIMIT parameter was not set at that time. So it took a default value that was dynamically set by the system. By default, Oracle uses the following 3 ways to find out greater of three values and use it as the limit.
1. 2 GB
2. Two times of PGA_AGGREGATE_TARGET parameter , 2 GB x 2 = 4 GB in our case.
3. The value of PROCESS parameter * 3MB. The PROCESS is 300, so the value is 3MB x 300 = 900 MB.
Therefore the actual limit was 4 GB for PGA_AGGREGATE_LIMIT parameter by default. Less than one week before the incident, I requested the execution of edb360 and execution result was indeed confirmed the PGA_AGGREGATE_LIMIT parameter was 4GB.
Obviously, we have a situation with PGA_AGGREGATE_LIMIT reached. As always, client’s next question would be: Does this issue happen before? Which application or process triggered this issue?
Does this issue happen before?
To answer the question whether this issue happened before, I used one query from Carlos Sierra‘s edb360 and modified the code a little bit to allow me to capture the PGA allocation and usage for the past 60 days.
WITH pgastat_denorm_1 AS ( SELECT /*+ MATERIALIZE NO_MERGE */ snap_id, dbid, instance_number, SUM(CASE name WHEN 'PGA memory freed back to OS' THEN value ELSE 0 END) pga_mem_freed_to_os, SUM(CASE name WHEN 'aggregate PGA auto target' THEN value ELSE 0 END) aggr_pga_auto_target, SUM(CASE name WHEN 'aggregate PGA target parameter' THEN value ELSE 0 END) aggr_pga_target_param, SUM(CASE name WHEN 'bytes processed' THEN value ELSE 0 END) bytes_processed, SUM(CASE name WHEN 'extra bytes read/written' THEN value ELSE 0 END) extra_bytes_rw, SUM(CASE name WHEN 'global memory bound' THEN value ELSE 0 END) global_memory_bound, SUM(CASE name WHEN 'maximum PGA allocated' THEN value ELSE 0 END) max_pga_allocated, SUM(CASE name WHEN 'maximum PGA used for auto workareas' THEN value ELSE 0 END) max_pga_used_aut_wa, SUM(CASE name WHEN 'maximum PGA used for manual workareas' THEN value ELSE 0 END) max_pga_used_man_wa, SUM(CASE name WHEN 'total PGA allocated' THEN value ELSE 0 END) tot_pga_allocated, SUM(CASE name WHEN 'total PGA inuse' THEN value ELSE 0 END) tot_pga_inuse, SUM(CASE name WHEN 'total PGA used for auto workareas' THEN value ELSE 0 END) tot_pga_used_aut_wa, SUM(CASE name WHEN 'total PGA used for manual workareas' THEN value ELSE 0 END) tot_pga_used_man_wa, SUM(CASE name WHEN 'total freeable PGA memory' THEN value ELSE 0 END) tot_freeable_pga_mem FROM dba_hist_pgastat WHERE name IN ('PGA memory freed back to OS' ,'aggregate PGA auto target' ,'aggregate PGA target parameter' ,'bytes processed' ,'extra bytes read/written' ,'global memory bound' ,'maximum PGA allocated' ,'maximum PGA used for auto workareas' ,'maximum PGA used for manual workareas' ,'total PGA allocated' ,'total PGA inuse' ,'total PGA used for auto workareas' ,'total PGA used for manual workareas' ,'total freeable PGA memory' ) AND snap_id in (select snap_id from dba_hist_snapshot where begin_interval_time > sysdate -60) GROUP BY snap_id, dbid, instance_number ), pgastat_denorm_2 AS ( SELECT /*+ MATERIALIZE NO_MERGE */ h.dbid, h.instance_number, s.startup_time, MIN(h.pga_mem_freed_to_os) pga_mem_freed_to_os, MIN(h.bytes_processed) bytes_processed, MIN(h.extra_bytes_rw) extra_bytes_rw FROM pgastat_denorm_1 h, dba_hist_snapshot s WHERE s.snap_id = h.snap_id AND s.dbid = h.dbid AND s.instance_number = h.instance_number GROUP BY h.dbid, h.instance_number, s.startup_time ), pgastat_delta AS ( SELECT /*+ MATERIALIZE NO_MERGE */ h1.snap_id, h1.dbid, h1.instance_number, s1.begin_interval_time, s1.end_interval_time, ROUND((CAST(s1.end_interval_time AS DATE) - CAST(s1.begin_interval_time AS DATE)) * 24 * 60 * 60) interval_secs, (h1.pga_mem_freed_to_os - h0.pga_mem_freed_to_os) pga_mem_freed_to_os, h1.aggr_pga_auto_target, h1.aggr_pga_target_param, (h1.bytes_processed - h0.bytes_processed) bytes_processed, (h1.extra_bytes_rw - h0.extra_bytes_rw) extra_bytes_rw, h1.global_memory_bound, h1.max_pga_allocated, h1.max_pga_used_aut_wa, h1.max_pga_used_man_wa, h1.tot_pga_allocated, h1.tot_pga_inuse, h1.tot_pga_used_aut_wa, h1.tot_pga_used_man_wa, h1.tot_freeable_pga_mem FROM pgastat_denorm_1 h0, pgastat_denorm_1 h1, dba_hist_snapshot s0, dba_hist_snapshot s1, pgastat_denorm_2 min /* to see cumulative use (replace h0 with min on select list above) */ WHERE h1.snap_id = h0.snap_id + 1 AND h1.dbid = h0.dbid AND h1.instance_number = h0.instance_number AND s0.snap_id = h0.snap_id AND s0.dbid = h0.dbid AND s0.instance_number = h0.instance_number AND s1.snap_id = h1.snap_id AND s1.dbid = h1.dbid AND s1.instance_number = h1.instance_number AND s1.snap_id = s0.snap_id + 1 AND s1.startup_time = s0.startup_time AND s1.begin_interval_time > (s0.begin_interval_time + (1 / (24 * 60))) /* filter out snaps apart < 1 min */ AND min.dbid = s1.dbid AND min.instance_number = s1.instance_number AND min.startup_time = s1.startup_time ) SELECT snap_id, TO_CHAR(MIN(begin_interval_time), 'YYYY-MM-DD HH24:MI') begin_time, TO_CHAR(MIN(end_interval_time), 'YYYY-MM-DD HH24:MI') end_time, ROUND(SUM(pga_mem_freed_to_os) / POWER(2, 30), 3) pga_mem_freed_to_os, ROUND(SUM(aggr_pga_auto_target) / POWER(2, 30), 3) aggr_pga_auto_target, ROUND(SUM(aggr_pga_target_param) / POWER(2, 30), 3) aggr_pga_target_param, ROUND(SUM(bytes_processed) / POWER(2, 30), 3) bytes_processed, ROUND(SUM(extra_bytes_rw) / POWER(2, 30), 3) extra_bytes_rw, ROUND(SUM(global_memory_bound) / POWER(2, 30), 3) global_memory_bound, ROUND(SUM(max_pga_allocated) / POWER(2, 30), 3) max_pga_allocated, ROUND(SUM(max_pga_used_aut_wa) / POWER(2, 30), 3) max_pga_used_aut_wa, ROUND(SUM(max_pga_used_man_wa) / POWER(2, 30), 3) max_pga_used_man_wa, ROUND(SUM(tot_pga_allocated) / POWER(2, 30), 3) tot_pga_allocated, ROUND(SUM(tot_pga_inuse) / POWER(2, 30), 3) tot_pga_inuse, ROUND(SUM(tot_pga_used_aut_wa) / POWER(2, 30), 3) tot_pga_used_aut_wa, ROUND(SUM(tot_pga_used_man_wa) / POWER(2, 30), 3) tot_pga_used_man_wa, ROUND(SUM(tot_freeable_pga_mem) / POWER(2, 30), 3) tot_freeable_pga_mem, 0 dummy_15 FROM pgastat_delta GROUP BY snap_id ORDER BY snap_id;
The 60 day chart shows there is no event of PGA limit reached until the date of the incident and the majority of the PGA allocation/usage is between 1 GB and 1.5 GB. At the date of the incident, it had the spike to 3.5 GB. As the data is pulled from the snapshot and has average down effect, so the value is 3.5 GB instead of 4 GB. But that’s good enough. 7 Days chart shows the same issue happened twice at the same date of the incident.
Which application or process triggered this issue?
Ok, let me find out who is the guilty one. I pulled PGA usage information from historical active session history (dba_hist_active_sess_history) and sorted by PGA usage in descending order. Not surprisingly SQL ID cmdzs9ud02bdb is the only SQL that has PGA usage > 1G and there are 605 occurrences. We can also see the top occurrences were around 4GB, reaching to the limit of PGA_AGGREGATE_LIMIT parameter. Pay more attention to XMLTABLE EVALUATION. If not careful, anything related to XML can cause a lot of performance headache.
Is this SQL a new SQL or an old SQL?
For a query with performance issue, The first thing I usually check is to use Kerry‘s awr_plan_change script to find out the SQL history and performance.
set lines 155 col execs for 999,999,999 col avg_etime for 999,999.999 col avg_lio for 9,999,999,999.9 col avg_pio for 9,999,999,999.9 col begin_interval_time for a30 col node for 99999 break on plan_hash_value on startup_time skip 1 select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value, nvl(executions_delta,0) execs, (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime, (buffer_gets_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_lio, (disk_reads_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_pio, rows_processed_delta total_rows from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS where sql_id = nvl('&sql_id','4dqs2k5tynk61') and ss.snap_id = S.snap_id and ss.instance_number = S.instance_number and s.instance_number like nvl('&instance_number',s.instance_number) order by 1, 2, 3 /
The result from awr_plan_change shows this is an old SQL that had executions as early as 6 months ago. The query cmdzs9ud02bdb performed good until Feb. 16 and Feb. 17 was date that had this performance issue. Looking closely, there is no plan change and the key difference between these good runs and bad runs is the the total number of rows returned. Good runs usually return a few thousand rows at most while the bad run on Feb. 17 returned 11 million rows before blowing out the PGA memory. PL/SQL and XML are untunable PGA memory and no wonder Oracle decided to kill this SQL first. I have no insight information about this query from application side. But purely from database perspective, it doesn’t seem to be a good practice for this query to return 11 million rows of result to application layer. Interestingly, the next day the query performed normally and returned just a few thousand rows. Coincidentally, there was a production application upgrade just the night before the incident and I do not know whether it was related or not. Anyway, I believe this is an application issue instead of database issue.
A few other side notes:
1. To monitor the limit is reached or not, Oracle uses CKPT process to checks the limit every 3 seconds. If the limit is reached or exceeded, then CKPT terminates calls for the sessions using the most untunable PGA memory.
Tunable PGA memory refers to the part of PGA memory that is allocated for SQL activities like sorts, hash-join, group-by, bitmap merge and bitmap index create. This part of memory can be shrinked or expanded based on the system load. Untunable PGA memory refers to other part of the PGA memory that can not be shrinked, such as PL/SQL, Java, context information for each session or for each open/active cursor. The size of tunable memory can be found from “aggregate PGA auto target”. Make sure aggregate PGA auto target should not be too small compared with the value of PGA_AGGREGATE_TARGET because we want to have enough PGA Memory left for workareas. The ‘maximum PGA used for auto workareas‘ gives the maximum reached by previous statistic since instance startup.
col NAME for a40 SELECT NAME, VALUE, UNIT FROM V$PGASTAT where NAME in ( 'aggregate PGA target parameter', 'aggregate PGA auto target', 'total PGA inuse ', 'total PGA allocated', 'maximum PGA used for auto workareas', 'cache hit percentage', 'over allocation count'); NAME VALUE UNIT ---------------------------------------- ---------- ------------ aggregate PGA target parameter 2147483648 bytes aggregate PGA auto target 1428175872 bytes total PGA allocated 816254976 bytes maximum PGA used for auto workareas 2305454080 bytes over allocation count 1924 cache hit percentage 97.36 percent
If untunable PGA memory + minimum tunable memory required to execute the work area workload > PGA_AGGREGATE_TARGET, over-allocating PGA memory can happen and extra PGA memory allocation beyond PGA_AGGREGATE_TARGET occurs. We can see this number from over allocation count and you want this number as small as possible.
To find out current active workarea, we can use the following query:
col sid for 99999 col operation for a20 col start_time for a20 col expected_size_MB for 99,999.9 col Act_Used_MB for 99,999.9 col Max_Used_MB for 99,999.9 col TEMP_Seg_MB for 999,999 select to_char(sql_exec_start, 'yyyy-mm-dd hh24:mi:ss') start_time, sql_id, to_number(decode(sid, 65535, NULL, sid)) sid, operation_type operation, trunc(expected_size/1024/1024) expected_size_MB, trunc(actual_mem_used/1024/1024) Act_Used_MB, trunc(max_mem_used/1024/1204) Max_Used_MB, number_passes pass, trunc(tempseg_size/1024/1024) TEMP_Seg_MB FROM v$sql_workarea_active ORDER BY 1,2; START_TIME SQL_ID SID OPERATION EXPECTED_SIZE_MB ACT_USED_MB MAX_USED_MB PASS TEMP_SEG_MB -------------------- ------------- ------ -------------------- ---------------- ----------- ----------- ---------- ----------- 2016-02-23 02:45:53 ff0vxztrvduhw 169 SORT (v2) 3.0 3.0 3.0 0 2016-02-23 09:37:19 77zanb2q2bcxc 435 GROUP BY (SORT) .0 .0 .0 0 2016-02-23 16:43:48 2x8yy828hjmn5 451 HASH-JOIN 5.0 .0 .0 0 2016-02-23 16:43:48 2x8yy828hjmn5 340 HASH-JOIN 5.0 .0 .0 0 2016-02-23 16:43:48 4a1x3rvduurnr 91 HASH-JOIN 4.0 .0 .0 0
As CKPT process checks the limit in 3 second interval, it is possible for PGA memory usage to briefly exceed PGA_AGGREGATE_LIMIT and return back to the level below limit.
2. According to Oracle Doc Limiting process size with database parameter PGA_AGGREGATE_LIMIT (1520324.1), there is no difference in behavior between PGA_AGGREGATE_LIMIT being explicitly set or being set to the default. Because the parameter responds to changing memory conditions, you do not need to set the parameter value explicitly. The document also states: If a value of 0 is specified, it means there is no limit to the aggregate PGA memory consumed by the instance. I believe this statement doesn’t mean we could use unlimited memory in PGA, but just consider this PGA_AGGREGATE_LIMIT not exists anymore and 3xPGA_AGGREGATE_TARGET is the limit for PGA usage.