I have worked with many clients over the past few years, and one of the common issues I see is excessive logon activity. From what I have observed, excessive logon activity falls mainly into two categories.
1) The first category is the logon storm. This behavior usually happens in a database that is shared by multiple applications. Because poorly performing queries from one application can easily impact the performance of other applications, another application team may feel it is not getting enough database resources dedicated to its application and grab what it needs its own way: firing up hundreds of threads within a tiny time window, each of which logs on to the database, does some insert and update work, and logs off. This one can be easily spotted in OEM as a huge spike in Active Sessions. I am not going to discuss it in this blog; instead I will talk about the second category.
2) The second category is almost unnoticeable: continuous logon activity that creeps up like a memory leak. The database performs fine, but over time the audit table AUD$ grows at a much faster rate than it should. By the time the AUD$ table is so big that any insert or query activity against it shows up among the top 10 queries in OEM, it is like a house surrounded by rising water: when you want to get out, it is too late. This issue is usually caused by a design or configuration problem on the application side. This blog discusses how to identify the issue from the database side and provide useful information to help the application side track down the cause.
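A quick way to check whether AUD$ has been growing out of proportion is to look at its segment size and the daily rate of audit records. This is just a sketch to run as a DBA user; the thresholds that count as "too big" depend on your audit settings and purge schedule:

```sql
-- current size of the audit trail segment
select round(bytes/1024/1024) size_mb
  from dba_segments
 where owner = 'SYS'
   and segment_name = 'AUD$';

-- audit records written per day over the last week
select trunc(timestamp) day, count(*) recs
  from dba_audit_trail
 where timestamp > sysdate - 7
 group by trunc(timestamp)
 order by 1;
```

If the daily record count dwarfs the number of real user sessions you expect, that is a hint the logon churn described below is at work.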
How to identify whether we have excessive logon activities?
First, go to OEM’s Performance Home page and pay attention to the Logon activity at the bottom of the page. If you see a constant line with logons > 0 all the time, something is not right with the logons, especially if the database is small and not many users are using it. The following screenshot shows a logon rate of 10 at all times. A logon rate of 10 during certain periods, with an up-and-down pattern, can be normal, but sustaining this number all the time usually indicates a problem with logon activity.
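If you prefer numbers over the OEM chart, the same logon rate can be derived from AWR. This is a sketch against dba_hist_sysstat; 'logons cumulative' is the underlying statistic, and the instance_number filter is an assumption for a single-instance database:

```sql
select to_char(sn.begin_interval_time, 'yyyy-mm-dd hh24:mi') snap_start,
       st.value - lag(st.value) over (order by sn.snap_id) logons_in_snap
  from dba_hist_sysstat st
  join dba_hist_snapshot sn
    on st.snap_id = sn.snap_id
   and st.dbid = sn.dbid
   and st.instance_number = sn.instance_number
 where st.stat_name = 'logons cumulative'
   and st.instance_number = 1
 order by sn.snap_id;
```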
Next, I usually run the following script to capture the top contributors to this excessive logon activity. It shows the users with a logon frequency > 100 over the past 15 minutes.
col os_username for a15
col username for a18
col userhost for a15
col avg_sess_cpu for 99,999
col total for 999,999

prompt show all users login info with logon frequency > 100 for the past 15 minutes

select os_username, username, userhost, action_name action, count(*)
  from dba_audit_trail
 where timestamp > sysdate - 1/24/4
 group by os_username, username, userhost, action_name
having count(*) > 100
 order by os_username, username, userhost;
Here is the sample output from the query:
OS_USER USERNAME      USERHOST    ACTION            COUNT(*)
------- ------------- ----------- ----------------- --------
. . . .
x1w01   APP_SUPWOK    kprodw327   SET ROLE               210
x1w01   APP_SUPWOK    kprodw328   SET ROLE               105
x1w01   APP_RTL01     kprodw333   LOGOFF                 118
x1w01   APP_RTL01     kprodw333   LOGON                  118
x5w01   APP_RTL01_ORD kprodw447   LOGOFF BY CLEANUP     9520
x5w01   APP_RTL01_ORD kprodw447   LOGON                 9522
x5w01   APP_RTL01_ORD kprodw448   LOGOFF BY CLEANUP     2664
x5w01   APP_RTL01_ORD kprodw448   LOGON                 2661
The above result shows that the APP_RTL01_ORD user has the largest number of logon/logoff activities, with about 12,000+ logons from two hosts, kprodw447 and kprodw448.
What do the logon activities look like?
These excessive logon sessions usually have a very short life: they log on, execute one or a few queries, and then they are gone. So I had to run the following query multiple times within a few seconds to create my own snapshots. For this test, I checked the APP_RTL01_ORD user and the kprodw447 host.
col right_now for a20
col logon_time for a20
col inst for 9999
col username for a18
col osuser for a8
col sid for 99999999
col machine for a10
col proc for a5
col port for 99999
col program for a20
col event for a15 trunc

select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now,
       to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time,
       inst_id inst, sid, serial#, audsid, username, status,
       osuser, machine, process proc, port, program,
       sql_id, sql_hash_value hash_val,
       prev_sql_id, prev_hash_value prev_hash, event
  from gv$session
 where username = 'APP_RTL01_ORD'
   and machine = 'kprodw447'
 order by logon_time;
The following is the execution result. We can see that only the first two sessions (SID 1329 and 2330) appear in all three snapshots, while the rest of the sessions are gone. When a session disappears, it means the old session was closed and a new one was opened. So I picked the 2nd snapshot, in the middle, and used the sessions whose audsid column is highlighted in red for further analysis.
You might notice the audsid column in the query. The reason I include it is that there are no SID and SERIAL# columns in the SYS.AUD$ table. AUDSID is the Auditing Session ID, a unique identifier for the user session. Confusingly, this column is called AUDSID in V$SESSION, but SESSIONID in SYS.AUD$. When a new user session is created, the SYS.AUDSES$ sequence is incremented using NEXTVAL and the new value is assigned to the session as its AUDSID. For sessions connecting as ‘internal’, AUDSID is set to 0 and the SYS.AUDSES$ sequence is not incremented. The commonly used query select userenv(‘SESSIONID’) from dual actually returns this AUDSID.
For more information about AUDSID, please refer to the Oracle Doc: How Sessions get Their AUDSID Identifier (Doc ID 122230.1).
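As a quick illustration, you can see your own session's AUDSID from both angles; this is a sketch, and on a normal user session the two values should match:

```sql
-- AUDSID as seen by the session itself
select userenv('SESSIONID') audsid from dual;

-- the same value as seen in v$session for the current session
select audsid
  from v$session
 where sid = sys_context('USERENV', 'SID');
```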
How long does each logon session last?
Although I could tell the sessions were short-lived, I did not know their exact duration. So as the next step, I needed to pull the logon and logoff records from dba_audit_trail. I used the audsid values captured in the last step and ran the following query.
col logoff_time for a20
col priv_used for a15
col comment_text for a50 wrap

select sessionid audsid, os_username osuser, username, userhost machine,
       os_process proc,
       to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss') time,
       to_char(logoff_time, 'yyyy-mm-dd hh24:mi:ss') logoff_time,
       action_name, priv_used, session_cpu, comment_text
  from dba_audit_trail
 where sessionid in (120893447, 120893450, 120893448,
                     120893445, 120893451, 120893453, 120893452)
 order by sessionid, timestamp;
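A variation that computes each session's lifetime directly can also be handy. This is a sketch that assumes, as with standard auditing, that the LOGOFF record carries the logon time in timestamp and the logoff time in logoff_time; the difference is in days, so multiply by 86400 for seconds:

```sql
select sessionid audsid,
       round((logoff_time - timestamp) * 86400) duration_sec
  from dba_audit_trail
 where sessionid in (120893447, 120893450, 120893448,
                     120893445, 120893451, 120893453, 120893452)
   and action_name like 'LOGOFF%'
 order by sessionid;
```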
What kind of queries are used in these short-lived sessions?
In the 2nd step, I marked those queries in red; now let me check the content and execution history of each. As usual, I used Kerry's dplan.sql and awr_plan_change.sql to get the plan content and execution information.
1. SQL_ID 0dqgq52qf68ag
SELECT ATTR_NO, ATTR_NAME, ATTR_TYPE_NAME, ATTR_TYPE_OWNER FROM
USER_TYPE_ATTRS WHERE TYPE_NAME = :1 ORDER BY ATTR_NO
2. SQL_ID g6wmnb1uq3hnh
begin :1 := dbms_pickler.get_type_shape(:2,:3,:4,:5,:6,:7); end;
3. SQL_ID bunvx480ynf57
SELECT 1 FROM DUAL
4. SQL_ID fcfpwturkq7dm
SELECT ELEM_TYPE_NAME, ELEM_TYPE_OWNER FROM USER_COLL_TYPES WHERE
TYPE_NAME = :1
Among these queries, pay particular attention to SQL bunvx480ynf57, shown in red above. I remember that when I ran my logon-activity check script at around 2:30pm on March 1, there were 15,000 logons in the previous 15 minutes. The above screenshot of the SQL execution history shows about 28,000 executions of SQL bunvx480ynf57 in a 30-minute snapshot window covering the same period. So there seems to be a strong correlation between this query and the frequent logon activity. To verify this assumption, I rechecked the total logon/logoff activity between 2pm and 3pm today, March 2.
col os_username for a15
col username for a18
col userhost for a15
col avg_sess_cpu for 99,999
col total for 999,999

select os_username, username, userhost, action_name action, count(*)
  from dba_audit_trail
 where timestamp between to_timestamp('2016-03-02 14:00', 'yyyy-mm-dd hh24:mi')
                     and to_timestamp('2016-03-02 15:00', 'yyyy-mm-dd hh24:mi')
 group by os_username, username, userhost, action_name
having count(*) > 100
 order by os_username, username, userhost;
Yes, my assumption was right. There were 43,600 (34,000 + 9,600) logons/logoffs between 2pm and 3pm on Mar 2, while the total number of executions of SQL bunvx480ynf57 during the same timeframe was 19,000 + 26,000 = 45,000. These two numbers are very close.
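The per-snapshot execution counts above came from awr_plan_change.sql; the same numbers can be pulled directly from AWR with something like the following sketch, assuming AWR retention still covers the window of interest:

```sql
select to_char(sn.begin_interval_time, 'yyyy-mm-dd hh24:mi') snap_start,
       sum(st.executions_delta) executions
  from dba_hist_sqlstat st
  join dba_hist_snapshot sn
    on st.snap_id = sn.snap_id
   and st.dbid = sn.dbid
   and st.instance_number = sn.instance_number
 where st.sql_id = 'bunvx480ynf57'
 group by sn.begin_interval_time
 order by sn.begin_interval_time;
```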
Based on this finding, I would say the main driver of this frequent session logon/logoff behavior is related to SQL bunvx480ynf57, or SELECT 1 FROM DUAL. The SELECT 1 FROM DUAL query is usually issued at the application layer to keep a session alive when it has been idle for too long, to prevent it from timing out. In addition, the logoff action recorded here is LOGOFF BY CLEANUP, which usually means the client-side connection vanished without a clean disconnect: the client process died abruptly, and the database PMON process then cleaned up the abandoned connection. Therefore, it seems to me that this keep-alive logic does not work as expected at the application layer. In any case, this is purely an application design or configuration issue.
My client’s DBA, Rick, also brought up another good point about a possible cause at the application layer: another common use of ‘SELECT 1 FROM DUAL;’ is to ‘test’ connections, though this model is falling out of favor. The combination of the excessively high login/logout rate and this most frequently executed SQL could be caused by an improper connection pooling configuration or usage. An example of bad usage would be issuing an explicit CLOSE() in the code instead of a RELEASE_CONNECTION() or something better.
Here is the Oracle document about this topic:
Use Test Connections on Reserve with Care
When Test Connections on Reserve is enabled, the server instance checks a database connection prior to returning the connection to a client. This helps reduce the risk of passing invalid connections to clients.
However, it is a fairly expensive operation. Typically, a server instance performs the test by executing a full-fledged SQL query with each connection prior to returning it. If the SQL query fails, the connection is destroyed and a new one is created in its place. A new and optional performance tunable has been provided in WebLogic Server 9.x within this “test connection on reserve” feature. The new optional performance tunable in 9.x allows WebLogic Server to skip this SQL-query test within a configured time window of a prior successful client use (default is 10 seconds). When a connection is returned to the pool by a client, the connection is timestamped by WebLogic Server. WebLogic Server will then skip the SQL-query test if this particular connection is returned to a client within the time window. Once the time window expires, WebLogic Server will execute the SQL-query test. This feature can provide significant performance boosts for busy systems using “test connection on reserve”.
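For reference, the relevant knobs live in the WebLogic JDBC data source descriptor. The fragment below is a sketch, not this client's actual configuration; the element values are illustrative:

```xml
<jdbc-connection-pool-params>
  <!-- validate a connection before handing it to a client -->
  <test-connections-on-reserve>true</test-connections-on-reserve>
  <!-- the validation statement; the "SQL " prefix means run it as-is -->
  <test-table-name>SQL SELECT 1 FROM DUAL</test-table-name>
  <!-- skip the test if the connection was successfully used
       within the last 10 seconds -->
  <seconds-to-trust-an-idle-pool-connection>10</seconds-to-trust-an-idle-pool-connection>
</jdbc-connection-pool-params>
```

Raising the trust window reduces how often the test query fires, at the cost of a slightly higher chance of handing out a stale connection.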
Thanks to Rick for sharing his knowledge of WebLogic Server.