PGA_AGGREGATE_TARGET vs PGA_AGGREGATE_LIMIT

Before Oracle 12c, PGA_AGGREGATE_TARGET is the most useful parameter to control PGA memory allocated to work areas. However PGA_AGGREGATE_TARGET is a soft target and not a hard limit. The actual PGA usage can be as high as three times of the value of PGA_AGGREGATE_TARGET. Excessive PGA usage can lead to memory swapping issue and have impact on the performance. There are many articles and blogs discussing this parameter and tuning technique and I will not repeat more here in this blog. Instead I am going to discuss more about a new PGA related parameter PGA_AGGREGATE_LIMIT.

PGA_AGGREGATE_LIMIT is a new parameter introduced in 12c. It put a hard limit instead of a soft limit on the PGA memory usage. It just like we need to put a limit to the height of skyscraper and some of them are away too high. If this limit is reached or exceeded, Oracle will terminates the sessions that are consuming the most untunable PGA memory. I have more detail discussion about tunable and untunable PGA memory at the end of this blog.

skyscraper_over_cloud

Recently I was requested to help my client to investigate a PGA memory related production issue and it was all related to PGA_AGGREGATE_TARGET, PGA_AGGREGATE_LIMIT and user application.

Background
The database is a small 12c database with not much activities. The PGA_AGGREGATE_TARGET is set to 2GB and it is usually sufficient as PGA Cache Hit Ratio hits almost 100% based on the result from v$pga_target_advice.
pga_issue_image_1_advice

When the production issue happened, the user job failed in completion and Oracle alert log file has many PGA related errors:
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_13351.trc (incident=68575):
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
….
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_28611.trc (incident=68279):
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu call ,pmuccst: adt/record)
Incident details in: /db/oracle/diag/rdbms/dp783/DP783/incident/incdir_68279/DP783_ora_28611_i68279.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /db/oracle/diag/rdbms/dp783/DP783/trace/DP783_ora_28611.trc (incident=68280):
ORA-04030: out of process memory when trying to allocate 169040 bytes (pga heap,kgh stack)
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu call ,pmuccst: adt/record)

PGA_AGGREGATE_LIMIT parameter was not set at that time. So it took a default value that was dynamically set by the system. By default, Oracle uses the following 3 ways to find out greater of three values and use it as the limit.
1. 2 GB
2. Two times of PGA_AGGREGATE_TARGET parameter , 2 GB x 2 = 4 GB in our case.
3. The value of PROCESS parameter * 3MB. The PROCESS is 300, so the value is 3MB x 300 = 900 MB.
Therefore the actual limit was 4 GB for PGA_AGGREGATE_LIMIT parameter by default. Less than one week before the incident, I requested the execution of edb360 and execution result was indeed confirmed the PGA_AGGREGATE_LIMIT parameter was 4GB.

Obviously, we have a situation with PGA_AGGREGATE_LIMIT reached. As always, client’s next question would be: Does this issue happen before? Which application or process triggered this issue?

Does this issue happen before?
To answer the question whether this issue happened before, I used one query from Carlos Sierra‘s edb360 and modified the code a little bit to allow me to capture the PGA allocation and usage for the past 60 days.

WITH
pgastat_denorm_1 AS (
SELECT /*+ MATERIALIZE NO_MERGE */
       snap_id,
       dbid,
       instance_number,
       SUM(CASE name WHEN 'PGA memory freed back to OS'           THEN value ELSE 0 END) pga_mem_freed_to_os,
       SUM(CASE name WHEN 'aggregate PGA auto target'             THEN value ELSE 0 END) aggr_pga_auto_target,
       SUM(CASE name WHEN 'aggregate PGA target parameter'        THEN value ELSE 0 END) aggr_pga_target_param,
       SUM(CASE name WHEN 'bytes processed'                       THEN value ELSE 0 END) bytes_processed,
       SUM(CASE name WHEN 'extra bytes read/written'              THEN value ELSE 0 END) extra_bytes_rw,
       SUM(CASE name WHEN 'global memory bound'                   THEN value ELSE 0 END) global_memory_bound,
       SUM(CASE name WHEN 'maximum PGA allocated'                 THEN value ELSE 0 END) max_pga_allocated,
       SUM(CASE name WHEN 'maximum PGA used for auto workareas'   THEN value ELSE 0 END) max_pga_used_aut_wa,
       SUM(CASE name WHEN 'maximum PGA used for manual workareas' THEN value ELSE 0 END) max_pga_used_man_wa,
       SUM(CASE name WHEN 'total PGA allocated'                   THEN value ELSE 0 END) tot_pga_allocated,
       SUM(CASE name WHEN 'total PGA inuse'                       THEN value ELSE 0 END) tot_pga_inuse,
       SUM(CASE name WHEN 'total PGA used for auto workareas'     THEN value ELSE 0 END) tot_pga_used_aut_wa,
       SUM(CASE name WHEN 'total PGA used for manual workareas'   THEN value ELSE 0 END) tot_pga_used_man_wa,
       SUM(CASE name WHEN 'total freeable PGA memory'             THEN value ELSE 0 END) tot_freeable_pga_mem
  FROM dba_hist_pgastat
 WHERE name IN
('PGA memory freed back to OS'
,'aggregate PGA auto target'
,'aggregate PGA target parameter'
,'bytes processed'
,'extra bytes read/written'
,'global memory bound'
,'maximum PGA allocated'
,'maximum PGA used for auto workareas'
,'maximum PGA used for manual workareas'
,'total PGA allocated'
,'total PGA inuse'
,'total PGA used for auto workareas'
,'total PGA used for manual workareas'
,'total freeable PGA memory'
)
   AND snap_id in (select snap_id from dba_hist_snapshot where begin_interval_time > sysdate -60)
 GROUP BY
       snap_id,
       dbid,
       instance_number
),
pgastat_denorm_2 AS (
SELECT /*+ MATERIALIZE NO_MERGE */
       h.dbid,
       h.instance_number,
       s.startup_time,
       MIN(h.pga_mem_freed_to_os) pga_mem_freed_to_os,
       MIN(h.bytes_processed) bytes_processed,
       MIN(h.extra_bytes_rw) extra_bytes_rw
  FROM pgastat_denorm_1 h,
       dba_hist_snapshot s
 WHERE s.snap_id = h.snap_id
   AND s.dbid = h.dbid
   AND s.instance_number = h.instance_number
 GROUP BY
       h.dbid,
       h.instance_number,
       s.startup_time
),
pgastat_delta AS (
SELECT /*+ MATERIALIZE NO_MERGE */
       h1.snap_id,
       h1.dbid,
       h1.instance_number,
       s1.begin_interval_time,
       s1.end_interval_time,
       ROUND((CAST(s1.end_interval_time AS DATE) - CAST(s1.begin_interval_time AS DATE)) * 24 * 60 * 60) interval_secs,
       (h1.pga_mem_freed_to_os - h0.pga_mem_freed_to_os) pga_mem_freed_to_os,
       h1.aggr_pga_auto_target,
       h1.aggr_pga_target_param,
       (h1.bytes_processed - h0.bytes_processed) bytes_processed,
       (h1.extra_bytes_rw - h0.extra_bytes_rw) extra_bytes_rw,
       h1.global_memory_bound,
       h1.max_pga_allocated,
       h1.max_pga_used_aut_wa,
       h1.max_pga_used_man_wa,
       h1.tot_pga_allocated,
       h1.tot_pga_inuse,
       h1.tot_pga_used_aut_wa,
       h1.tot_pga_used_man_wa,
       h1.tot_freeable_pga_mem
  FROM pgastat_denorm_1 h0,
       pgastat_denorm_1 h1,
       dba_hist_snapshot s0,
       dba_hist_snapshot s1,
       pgastat_denorm_2 min /* to see cumulative use (replace h0 with min on select list above) */
 WHERE h1.snap_id = h0.snap_id + 1
   AND h1.dbid = h0.dbid
   AND h1.instance_number = h0.instance_number
   AND s0.snap_id = h0.snap_id
   AND s0.dbid = h0.dbid
   AND s0.instance_number = h0.instance_number
   AND s1.snap_id = h1.snap_id
   AND s1.dbid = h1.dbid
   AND s1.instance_number = h1.instance_number
   AND s1.snap_id = s0.snap_id + 1
   AND s1.startup_time = s0.startup_time
   AND s1.begin_interval_time > (s0.begin_interval_time + (1 / (24 * 60))) /* filter out snaps apart < 1 min */
   AND min.dbid = s1.dbid
   AND min.instance_number = s1.instance_number
   AND min.startup_time = s1.startup_time
)
SELECT snap_id,
       TO_CHAR(MIN(begin_interval_time), 'YYYY-MM-DD HH24:MI') begin_time,
       TO_CHAR(MIN(end_interval_time), 'YYYY-MM-DD HH24:MI') end_time,
       ROUND(SUM(pga_mem_freed_to_os) / POWER(2, 30), 3) pga_mem_freed_to_os,
       ROUND(SUM(aggr_pga_auto_target) / POWER(2, 30), 3) aggr_pga_auto_target,
       ROUND(SUM(aggr_pga_target_param) / POWER(2, 30), 3) aggr_pga_target_param,
       ROUND(SUM(bytes_processed) / POWER(2, 30), 3) bytes_processed,
       ROUND(SUM(extra_bytes_rw) / POWER(2, 30), 3) extra_bytes_rw,
       ROUND(SUM(global_memory_bound) / POWER(2, 30), 3) global_memory_bound,
       ROUND(SUM(max_pga_allocated) / POWER(2, 30), 3) max_pga_allocated,
       ROUND(SUM(max_pga_used_aut_wa) / POWER(2, 30), 3) max_pga_used_aut_wa,
       ROUND(SUM(max_pga_used_man_wa) / POWER(2, 30), 3) max_pga_used_man_wa,
       ROUND(SUM(tot_pga_allocated) / POWER(2, 30), 3) tot_pga_allocated,
       ROUND(SUM(tot_pga_inuse) / POWER(2, 30), 3) tot_pga_inuse,
       ROUND(SUM(tot_pga_used_aut_wa) / POWER(2, 30), 3) tot_pga_used_aut_wa,
       ROUND(SUM(tot_pga_used_man_wa) / POWER(2, 30), 3) tot_pga_used_man_wa,
       ROUND(SUM(tot_freeable_pga_mem) / POWER(2, 30), 3) tot_freeable_pga_mem,
       0 dummy_15
  FROM pgastat_delta
 GROUP BY
       snap_id
 ORDER BY
       snap_id;

Captured and imported the result into Excel spreadsheet. Then I created the following two charts showing PGA allocation and usage for the past 60 days and 7 days.
pga_issue_image_2_60dayUsage
pga_issue_image_3_7dayUsage

The 60 day chart shows there is no event of PGA limit reached until the date of the incident and the majority of the PGA allocation/usage is between 1 GB and 1.5 GB. At the date of the incident, it had the spike to 3.5 GB. As the data is pulled from the snapshot and has average down effect, so the value is 3.5 GB instead of 4 GB. But that’s good enough. 7 Days chart shows the same issue happened twice at the same date of the incident.

Which application or process triggered this issue?
Ok, let me find out who is the guilty one. I pulled PGA usage information from historical active session history (dba_hist_active_sess_history) and sorted by PGA usage in descending order. Not surprisingly SQL ID cmdzs9ud02bdb is the only SQL that has PGA usage > 1G and there are 605 occurrences. We can also see the top occurrences were around 4GB, reaching to the limit of PGA_AGGREGATE_LIMIT parameter. Pay more attention to XMLTABLE EVALUATION. If not careful, anything related to XML can cause a lot of performance headache.

pga_issue_image_4_topQuery
pga_issue_image_5_topQueryTotal
Is this SQL a new SQL or an old SQL?
For a query with performance issue, The first thing I usually check is to use Kerry‘s awr_plan_change script to find out the SQL history and performance.

set lines 155
col execs for 999,999,999
col avg_etime for 999,999.999
col avg_lio for 9,999,999,999.9
col avg_pio for 9,999,999,999.9
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
nvl(executions_delta,0) execs,
(elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
(buffer_gets_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_lio,
(disk_reads_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_pio,
rows_processed_delta total_rows
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where sql_id = nvl('&sql_id','4dqs2k5tynk61')
and ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and s.instance_number like nvl('&instance_number',s.instance_number)
order by 1, 2, 3
/

pga_issue_image_6_totalRow

The result from awr_plan_change shows this is an old SQL that had executions as early as 6 months ago. The query cmdzs9ud02bdb performed good until Feb. 16 and Feb. 17 was date that had this performance issue. Looking closely, there is no plan change and the key difference between these good runs and bad runs is the the total number of rows returned. Good runs usually return a few thousand rows at most while the bad run on Feb. 17 returned 11 million rows before blowing out the PGA memory. PL/SQL and XML are untunable PGA memory and no wonder Oracle decided to kill this SQL first. I have no insight information about this query from application side. But purely from database perspective, it doesn’t seem to be a good practice for this query to return 11 million rows of result to application layer. Interestingly, the next day the query performed normally and returned just a few thousand rows. Coincidentally, there was a production application upgrade just the night before the incident and I do not know whether it was related or not. Anyway, I believe this is an application issue instead of database issue.

A few other side notes:
1. To monitor the limit is reached or not, Oracle uses CKPT process to checks the limit every 3 seconds. If the limit is reached or exceeded, then CKPT terminates calls for the sessions using the most untunable PGA memory.

Tunable PGA memory refers to the part of PGA memory that is allocated for SQL activities like sorts, hash-join, group-by, bitmap merge and bitmap index create. This part of memory can be shrinked or expanded based on the system load. Untunable PGA memory refers to other part of the PGA memory that can not be shrinked, such as PL/SQL, Java, context information for each session or for each open/active cursor. The size of tunable memory can be found from “aggregate PGA auto target”. Make sure aggregate PGA auto target should not be too small compared with the value of PGA_AGGREGATE_TARGET because we want to have enough PGA Memory left for workareas. The ‘maximum PGA used for auto workareas‘ gives the maximum reached by previous statistic since instance startup.

col NAME for a40
SELECT NAME, VALUE, UNIT FROM V$PGASTAT 
where
NAME in ( 'aggregate PGA target parameter', 'aggregate PGA auto target', 'total PGA inuse ', 
'total PGA allocated', 'maximum PGA used for auto workareas', 'cache hit percentage', 'over allocation count');

NAME                                          VALUE UNIT
---------------------------------------- ---------- ------------
aggregate PGA target parameter           2147483648 bytes
aggregate PGA auto target                1428175872 bytes
total PGA allocated                       816254976 bytes
maximum PGA used for auto workareas      2305454080 bytes
over allocation count                          1924
cache hit percentage                          97.36 percent

If untunable PGA memory + minimum tunable memory required to execute the work area workload > PGA_AGGREGATE_TARGET, over-allocating PGA memory can happen and extra PGA memory allocation beyond PGA_AGGREGATE_TARGET occurs. We can see this number from over allocation count and you want this number as small as possible.

To find out current active workarea, we can use the following query:

col sid for 99999 
col operation for a20
col start_time for a20
col expected_size_MB for 99,999.9 
col Act_Used_MB for 99,999.9 
col Max_Used_MB for 99,999.9
col TEMP_Seg_MB for 999,999
select to_char(sql_exec_start, 'yyyy-mm-dd hh24:mi:ss') start_time, sql_id,
       to_number(decode(sid, 65535, NULL, sid)) sid,
       operation_type operation, trunc(expected_size/1024/1024) expected_size_MB,
       trunc(actual_mem_used/1024/1024) Act_Used_MB, trunc(max_mem_used/1024/1204) Max_Used_MB,
       number_passes pass, trunc(tempseg_size/1024/1024) TEMP_Seg_MB
FROM v$sql_workarea_active
ORDER BY 1,2;

START_TIME           SQL_ID           SID OPERATION            EXPECTED_SIZE_MB ACT_USED_MB MAX_USED_MB       PASS TEMP_SEG_MB
-------------------- ------------- ------ -------------------- ---------------- ----------- ----------- ---------- -----------
2016-02-23 02:45:53  ff0vxztrvduhw    169 SORT (v2)                         3.0         3.0         3.0          0
2016-02-23 09:37:19  77zanb2q2bcxc    435 GROUP BY (SORT)                    .0          .0          .0          0
2016-02-23 16:43:48  2x8yy828hjmn5    451 HASH-JOIN                         5.0          .0          .0          0
2016-02-23 16:43:48  2x8yy828hjmn5    340 HASH-JOIN                         5.0          .0          .0          0
2016-02-23 16:43:48  4a1x3rvduurnr     91 HASH-JOIN                         4.0          .0          .0          0

As CKPT process checks the limit in 3 second interval, it is possible for PGA memory usage to briefly exceed PGA_AGGREGATE_LIMIT and return back to the level below limit.

2. According to Oracle Doc Limiting process size with database parameter PGA_AGGREGATE_LIMIT (1520324.1), there is no difference in behavior between PGA_AGGREGATE_LIMIT being explicitly set or being set to the default. Because the parameter responds to changing memory conditions, you do not need to set the parameter value explicitly. The document also states: If a value of 0 is specified, it means there is no limit to the aggregate PGA memory consumed by the instance. I believe this statement doesn’t mean we could use unlimited memory in PGA, but just consider this PGA_AGGREGATE_LIMIT not exists anymore and 3xPGA_AGGREGATE_TARGET is the limit for PGA usage.

Advertisements

2 thoughts on “PGA_AGGREGATE_TARGET vs PGA_AGGREGATE_LIMIT

  1. Hi, you say that “The actual PGA usage can be as high as three times of the value of PGA_AGGREGATE_TARGET”, but i’ve searched in documentation e found no evidence of this, how are you arrived at this conclusion?

  2. This is the tip used, not only by me, but also by some other people. I did see this 3x value at several different clients. I don’t think this is an official guideline by Oracle Document, but more from experience. You can also check out the link below. It shows the example of actual PGA usage is 3x. https://www.scribd.com/doc/6469805/How-to-Determine-if-PGA-Size-is-Set-Properly
    Other blog (http://www.dbaglobe.com/2013/12/difference-between-pgaaggregatetarget.html) for the same topic has the same observation.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s