Capture the Cause of Excessive Logon Activities

I have been to many clients over the past few years, and one of the common issues I see is excessive logon activity. From what I observe, excessive logon activities mainly fall into two categories.

1) The first one is the logon storm. This behavior usually happens in a database that is shared by multiple applications. Since poorly performing queries from one application can easily impact the performance of other applications, an application team might feel it does not have enough database resources dedicated to its application and use its own way to grab the resources it needs: firing up hundreds of threads within a tiny time window, logging on to the database, doing some insert and update work, and logging off. This one can easily be spotted in OEM as a huge spike in Active Sessions. I am not going to discuss this one in this blog, but will instead talk about the second category.
2) The second category is almost unnoticeable: continuous logon activity that creeps in like a memory leak. The database performs OK, but over time the audit table AUD$ grows at a much faster rate than it should. By the time the AUD$ table has gotten that big, any insert or query activity on AUD$ can easily show up among the top 10 queries in OEM. It is just like a house surrounded by water: by the time you want to get out, it is too late. This issue is usually caused by a design or configuration problem on the application side. This blog discusses how to identify the issue from the database side and provide useful information to help the application side track down the cause.

[Image: house_surround_by_water — a house surrounded by water]
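
Before going further, here is a quick way to check for the AUD$ symptom described in the second category. This is a minimal sketch, run as a DBA with traditional auditing in place; it shows the current size of AUD$ and the number of audit records written per day:

select owner, segment_name, round(bytes/1024/1024) size_mb
from dba_segments
where owner = 'SYS' and segment_name = 'AUD$';

select trunc(timestamp) day, count(*) records
from dba_audit_trail
group by trunc(timestamp)
order by day;

A steadily climbing daily record count against a flat user population is the "memory leak" pattern described above.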

How to identify whether we have excessive logon activities?
First, go to OEM’s Performance Home page and pay attention to the Logon activities chart at the bottom of the page. If you see a constant line with logons > 0 all the time, it means something is not right with the logons, especially if the database is small and not many users are using it. The following screenshot shows a logon rate of 10 all the time. We could see a logon rate of 10 for certain periods, with a pattern of ups and downs, but sustaining this number all the time usually indicates an issue with logon activity.
[Screenshot: 1_OEM_Logon_Rate — OEM logon rate holding steady at 10]
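
If OEM is not handy, the same logon rate can be pulled from the system metrics. This is a sketch assuming access to gv$sysmetric; 'Logons Per Sec' is the standard metric name:

col metric_name for a20
select inst_id, to_char(begin_time, 'hh24:mi:ss') begin_time,
to_char(end_time, 'hh24:mi:ss') end_time, metric_name, round(value, 2) value
from gv$sysmetric
where metric_name = 'Logons Per Sec'
order by inst_id, begin_time;

A value that never drops back near zero matches the flat line seen in the OEM chart.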

Next, I usually run the following script to capture who the top contributors to this excessive logon activity are. It shows the users with a logon frequency > 100 for the past 15 minutes.

col os_username for a15
col username for a18
col userhost for a15
col avg_sess_cpu for 99,999
col total for 999,999

prompt show all users' login info with logon frequency > 100 for the past 15 minutes
select os_username, username, userhost, action_name action, count(*)
from dba_audit_trail
where timestamp > sysdate - 1/24/4
group by os_username, username, userhost, action_name
having count(*) > 100
order by os_username, username, userhost;

Here is the sample output from the query:

OS_USER USERNAME      USERHOST    ACTION            COUNT(*)
------- ------------- ----------- ----------------- --------
. . . .
x1w01  APP_SUPWOK     kprodw327   SET ROLE               210
x1w01  APP_SUPWOK     kprodw328   SET ROLE               105
x1w01  APP_RTL01      kprodw333   LOGOFF                 118
x1w01  APP_RTL01      kprodw333   LOGON                  118
x5w01  APP_RTL01_ORD  kprodw447   LOGOFF BY CLEANUP     9520
x5w01  APP_RTL01_ORD  kprodw447   LOGON                 9522
x5w01  APP_RTL01_ORD  kprodw448   LOGOFF BY CLEANUP     2664
x5w01  APP_RTL01_ORD  kprodw448   LOGON                 2661

The above result shows that user APP_RTL01_ORD has the largest number of logon/logoff activities, with about 12,000+ logons from two hosts, kprodw447 and kprodw448.
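
One prerequisite worth noting: DBA_AUDIT_TRAIL only contains these LOGON/LOGOFF rows when session auditing is enabled and the audit trail is written to the database. A quick sanity check, assuming traditional (non-unified) auditing as used throughout this post:

select value from v$parameter where name = 'audit_trail';  -- expect DB or DB,EXTENDED
audit session;  -- enables auditing of logon/logoff if not already on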

What do the logon activities look like?
This kind of excessive logon session usually has a very short life: logon, execute one or a few queries, and then gone. So I had to run the following query multiple times within a few seconds to create my own snapshots. For this test, I am checking the APP_RTL01_ORD user and the kprodw447 host.

col right_now for a20
col logon_time for a20
col inst for 9999
col username for a18
col osuser for a8
col sid for 99999999
col machine for a10
col proc for a5
col port for 99999
col program for a20
col event for a15 trunc
select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now, to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time, 
inst_id inst, sid, serial#, audsid, username, status, osuser, machine, process proc, port, program, 
sql_id, sql_hash_value hash_val, prev_sql_id, prev_hash_value prev_hash, event
from gv$session 
where 
username='APP_RTL01_ORD' and machine='kprodw447' 
order by logon_time;

The following is the execution result. We can see that only the first two sessions (SID 1329 and 2330) appear in all three snapshots, while the rest of the sessions are gone. If those sessions are gone, it means the old sessions were closed and new sessions were opened. So I picked the 2nd snapshot in the middle and used the sessions whose audsid column is in red for further analysis.
[Screenshot: 2_session_snapshot — three gv$session snapshots with short-lived sessions highlighted in red]
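
Catching these short-lived sessions by hand takes some luck, so one way to automate the repeated snapshots is to dump gv$session into a scratch table in a loop. This is only a sketch: MY_SESS_SNAP is a hypothetical table name, and DBMS_LOCK.SLEEP requires an explicit grant on DBMS_LOCK.

-- hypothetical scratch table holding timestamped copies of gv$session rows
create table my_sess_snap as
select sysdate snap_time, s.* from gv$session s where 1 = 0;

begin
  for i in 1 .. 3 loop
    insert into my_sess_snap
    select sysdate, s.*
    from gv$session s
    where s.username = 'APP_RTL01_ORD' and s.machine = 'kprodw447';
    commit;
    dbms_lock.sleep(5);  -- wait 5 seconds between snapshots
  end loop;
end;
/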

You might notice the audsid column in the query. The reason I include it is that there are no SID and SERIAL# columns in the SYS.AUD$ table. AUDSID is the Auditing Session ID, a unique identifier for the user session. Confusingly, this column is called AUDSID in V$SESSION, but SESSIONID in SYS.AUD$. When a new user session is created, the SYS.AUDSES$ sequence is incremented using NEXTVAL and the new value is assigned to the session as its AUDSID. For sessions connecting as ‘internal’, AUDSID is set to 0 and the SYS.AUDSES$ sequence is not incremented. The commonly used query select userenv(‘SESSIONID’) from dual actually returns this AUDSID.

For more information about AUDSID, please refer to the Oracle note: How Sessions get Their AUDSID Identifier (Doc ID 122230.1).
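
To make the AUDSID mapping concrete, here is a small sketch, run as a DBA. The first two statements are equivalent ways to get the current session’s AUDSID; the join shows how a live session in gv$session lines up with its raw audit rows in SYS.AUD$:

select userenv('SESSIONID') audsid from dual;
select sys_context('USERENV', 'SESSIONID') audsid from dual;

-- map live sessions to their raw audit records
select s.inst_id, s.sid, s.serial#, s.audsid, a.action#
from gv$session s, sys.aud$ a
where s.audsid = a.sessionid
and s.username = 'APP_RTL01_ORD';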

How long does each logon session last?
Although I feel the sessions have short lives, I don’t know their exact durations. So as the next step, I need to pull the logon and logoff records from dba_audit_trail. I used the audsid information captured in the last step and ran the following query.

col logoff_time for a20
col priv_used for a15
col comment_text for a50 wrap
select sessionid audsid, os_username osuser, username, userhost machine, os_process proc, 
to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss') time, to_char(logoff_time, 'yyyy-mm-dd hh24:mi:ss') logoff_time, 
action_name, priv_used, session_cpu, comment_text 
from
dba_audit_trail 
where
sessionid in (120893447,120893450,120893448,120893445,120893451,120893453,120893452) 
order by sessionid, timestamp;

The result proves my assumption right. All of these sessions show the same pattern: logon and logoff within a second.
[Screenshot: 3_session_logon_logoff_history — logon/logoff history for the captured audsids]
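
Rather than eyeballing the timestamps, the session duration can be computed directly, since the audit row for an audited session carries both the logon time (TIMESTAMP) and the LOGOFF_TIME. A sketch, using a few of the audsids captured earlier:

select sessionid audsid,
       to_char(timestamp, 'hh24:mi:ss') logon,
       to_char(logoff_time, 'hh24:mi:ss') logoff,
       round((logoff_time - timestamp) * 86400) duration_sec
from dba_audit_trail
where sessionid in (120893447, 120893450, 120893448)
and logoff_time is not null
order by timestamp;

The difference between two DATE values is in days, hence the multiplication by 86400 to get seconds; for these sessions it comes out to 0 or 1 second.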

What kind of queries are used in these short-lived sessions?
In the 2nd step, I marked those queries in red; now let me check out the content and execution history of those queries. As usual, I used Kerry’s dplan.sql and awr_plan_change.sql to get the plan content and execution information. (For those without these scripts, a plain-AWR alternative is sketched after the execution-history screenshot below.)

1. SQL_ID 0dqgq52qf68ag
SELECT ATTR_NO, ATTR_NAME, ATTR_TYPE_NAME, ATTR_TYPE_OWNER FROM
USER_TYPE_ATTRS WHERE TYPE_NAME = :1 ORDER BY ATTR_NO

2. SQL_ID g6wmnb1uq3hnh
begin :1 := dbms_pickler.get_type_shape(:2,:3,:4,:5,:6,:7); end;

3. SQL_ID bunvx480ynf57
SELECT 1 FROM DUAL

4. SQL_ID fcfpwturkq7dm
SELECT ELEM_TYPE_NAME, ELEM_TYPE_OWNER FROM USER_COLL_TYPES WHERE
TYPE_NAME = :1

[Screenshot: 4_sql_exec_history — AWR execution history for the four SQL_IDs]
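
The per-snapshot execution counts can also be pulled straight from AWR. A minimal sketch (assuming an AWR/Diagnostics Pack license), using the keep-alive SQL_ID as an example:

select s.snap_id,
       to_char(s.end_interval_time, 'yyyy-mm-dd hh24:mi') snap_time,
       q.executions_delta execs
from dba_hist_snapshot s, dba_hist_sqlstat q
where s.snap_id = q.snap_id
and s.dbid = q.dbid
and s.instance_number = q.instance_number
and q.sql_id = 'bunvx480ynf57'
order by s.snap_id;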

Among these queries, pay particular attention to SQL bunvx480ynf57, shown in red above. I remember that when I ran my logon-activity check script at around 2:30pm on March 1, there were 15,000 logons in the past 15 minutes. The above screenshot of SQL execution history shows around 28,000 executions of SQL bunvx480ynf57 in a 30-minute snapshot window covering the same period. So there seems to be a strong correlation between this query and the frequent logon activity. To verify my assumption, I rechecked the total logon/logoff activities between 2pm and 3pm today, March 2.

col os_username for a15
col username for a18
col userhost for a15
col avg_sess_cpu for 99,999
col total for 999,999

select os_username, username, userhost, action_name action, count(*)
from dba_audit_trail
where timestamp between to_timestamp( '2016-03-02 14:00', 'yyyy-mm-dd hh24:mi') and to_timestamp( '2016-03-02 15:00', 'yyyy-mm-dd hh24:mi') 
group by os_username, username, userhost, action_name
having count(*) > 100
order by os_username, username, userhost;

Here is the result:
[Screenshot: 5_session_snapshot_for_1hr — logon/logoff counts between 2pm and 3pm on Mar 2]

Yes, my assumption is right. There are about 43,600 (34,000 + 9,600) logons/logoffs between 2pm and 3pm on Mar 2, while the total number of executions of SQL bunvx480ynf57 during the same timeframe is 19,000 + 26,000 = 45,000. These two numbers are very close.

Based on this finding, I would say this frequent session logon/logoff behavior is mainly related to SQL bunvx480ynf57, or SELECT 1 FROM DUAL. The SELECT 1 FROM DUAL query is usually used at the application layer to keep a session alive when it has been idle for too long and to avoid session timeouts. In addition, the logoff action recorded here is LOGOFF BY CLEANUP, which usually means the connection vanished on the client side without a clean disconnect: the client process dies abruptly, and the database PMON process cleans up the abandoned connections. Therefore, it seems to me that this keep-alive logic does not work as expected at the application layer. In any case, this is purely an application design or configuration issue.
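
To quantify how many of these disconnects are abrupt, the logoff actions can be broken down by hour. A sketch for the suspect user:

select to_char(timestamp, 'yyyy-mm-dd hh24') hr, action_name, count(*) cnt
from dba_audit_trail
where username = 'APP_RTL01_ORD'
and action_name in ('LOGOFF', 'LOGOFF BY CLEANUP')
group by to_char(timestamp, 'yyyy-mm-dd hh24'), action_name
order by hr, action_name;

A high proportion of LOGOFF BY CLEANUP rows, as seen here, points to clients abandoning connections rather than disconnecting cleanly.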

Additional Note:
My client’s DBA, Rick, also brought up another nice point about a possible source of this issue at the application layer: another common use for ‘SELECT 1 FROM DUAL’ is to ‘test’ connections, though this model is falling out of favor. The combination of the excessively high login/logout rate and the most frequent SQL listed could be caused by an improper connection pooling configuration OR usage. An example of bad usage would be issuing an explicit CLOSE() in the code instead of a RELEASE_CONNECTION() or something better.

Here is the Oracle document about this topic:
Use Test Connections on Reserve with Care

When Test Connections on Reserve is enabled, the server instance checks a database connection prior to returning the connection to a client. This helps reduce the risk of passing invalid connections to clients.

However, it is a fairly expensive operation. Typically, a server instance performs the test by executing a full-fledged SQL query with each connection prior to returning it. If the SQL query fails, the connection is destroyed and a new one is created in its place. A new and optional performance tunable has been provided in WebLogic Server 9.x within this “test connection on reserve” feature. The new optional performance tunable in 9.x allows WebLogic Server to skip this SQL-query test within a configured time window of a prior successful client use (default is 10 seconds). When a connection is returned to the pool by a client, the connection is timestamped by WebLogic Server. WebLogic Server will then skip the SQL-query test if this particular connection is returned to a client within the time window. Once the time window expires, WebLogic Server will execute the SQL-query test. This feature can provide significant performance boosts for busy systems using “test connection on reserve”.

Thanks Rick for sharing his knowledge of WebLogic Server.
