Saturday, February 04, 2012

session_cached_cursors again

Not so long ago I had to troubleshoot a code performance regression between 9iR2 and 11gR2.

The issue was related to a business-critical stored procedure that, for some reason, was performing almost 50% slower in 11gR2 than in 9iR2. When I took a look at the procedure I discovered that there was, strictly speaking, nothing that could regress, at least not in the common sense when we're talking about upgrades. All the procedure was doing was a lookup using a statement containing UNION ALL.

Below is the artificially created test case which demonstrates the issue.
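The driver itself is not reproduced here; a minimal sketch of how such a workload could be generated is shown below (the 99,999-iteration loop and the exact statement text are assumptions derived from the TKPROF figures that follow; with cursor_sharing=force the literals get rewritten into the :"SYS_B_n" binds you see in the traced statement):

begin
 execute immediate 'alter session set cursor_sharing=force';
 for i in 1 .. 99999
 loop
  -- select executed without a fetch, hence Fetch count 0 in the TKPROF output
  execute immediate
   'select * from dual where '||to_char(i)||'='||to_char(i)||
   ' union all select * from dual where '||to_char(i)||'='||to_char(i);
 end loop;
end;
/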

9iR2 results:

select *
from
 dual where :"SYS_B_0"=:"SYS_B_1" union all select * from dual where
  :"SYS_B_2"=:"SYS_B_3"


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    99999      3.88       4.41          0          0          0           0
Execute  99999      1.23       1.31          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   199998      5.11       5.73          0          0          0           0

11gR2 results:

select *
from
 dual where :"SYS_B_0"=:"SYS_B_1" union all select * from dual where
  :"SYS_B_2"=:"SYS_B_3"


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    99999      6.14       7.21          0          0          0           0
Execute  99999      1.04       1.38          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   199998      7.19       8.59          0          0          0           0

As you can see from the above figures, 11gR2 spent significantly more time parsing. Given that both sessions had a 100% session cursor cache hit ratio (and yes, the procedure didn't make use of bind variables either) and were otherwise run in pretty much identical environments (apart from the Oracle version), such a difference in parse time is certainly unexpected.
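For reference, the session cursor cache figures can be confirmed with a query along the following lines, run from the session being traced (a minimal sketch; the statistic names are those found in v$statname):

select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where sn.statistic# = ms.statistic#
   and sn.name in ('parse count (total)',
                   'session cursor cache hits',
                   'session cursor cache count');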

It was time to gear up for a reproducible test case once again. Perhaps I'm just being lucky, but it boiled down to the session_cached_cursors setting again. To better demonstrate the point I'm going to run the following anonymous PL/SQL block in my 11gR2 database:

SQL> begin
  2   execute immediate 'alter session set cursor_sharing=force';
  3   for i in 1 .. 10
  4   loop
  5    execute immediate 'alter session set session_cached_cursors='||to_char(i*100);
  6    for j in 1 .. 10000
  7    loop
  8     execute immediate
  9      'select /* test_'||to_char(i*100)||' */ * from dual where '||to_char(j)||'='||to_char(j)||
 10      ' union all select * from dual where '||to_char(j)||'='||to_char(j);
 11    end loop;
 12   end loop;
 13  end;
 14  /
 
PL/SQL procedure successfully completed

What it does is set cursor_sharing=force and then execute the same statement with different settings for session_cached_cursors -- from 100 to 1000, in increments of 100. What do you think the results would be?

SQL> select to_number(substr(sql_text, instr(sql_text, 'test')+5, 4)) cursor_cache, executions, cpu_time
  2   from v$sql
  3   where sql_text like '%test%'
  4    and executions=10000
  5   order by 1;
 
CURSOR_CACHE EXECUTIONS   CPU_TIME
------------ ---------- ----------
         100      10000     571916
         200      10000     517928
         300      10000     559908
         400      10000     572909
         500      10000     561912
         600      10000     616915
         700      10000     652899
         800      10000     670886
         900      10000     671892
        1000      10000     702893
 
10 rows selected

Surprisingly enough, raising session_cached_cursors from 100 to 1000 increases the CPU time required to execute the statements from 571,916 to 702,893 microseconds! You can also see how the CPU time tends to climb together with the session_cached_cursors setting.

What happened is that 9iR2 was running with session_cached_cursors=1000 and, apparently, the setting never had any negative effect... until it was transferred to 11gR2. It's somewhat hard to speculate about what exactly is happening, but based on my limited test cases it appears that you need (a) a UNION ALL in the statement, (b) to run with cursor_sharing=force and (c) to have literals in your query. Under these circumstances the strange performance degradation shows up in 11gR2. I still need to do more investigation on that one. For now all I can say is that, after setting session_cached_cursors=100, 11gR2 exhibited the same results as 9iR2.
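If you want to check or apply the same workaround, here is a minimal sketch. Note that session_cached_cursors cannot be changed for a running instance as a whole, so the alter system variant goes to the spfile and only takes effect after a restart, while the session-level change is immediate:

-- current value
select value from v$parameter where name = 'session_cached_cursors';

-- per session (takes effect immediately)
alter session set session_cached_cursors = 100;

-- instance-wide (static parameter: requires an spfile and a restart)
alter system set session_cached_cursors = 100 scope=spfile;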

***** UPDATE 07-Feb-2012 *****

It appears that only 11.2.0.1 and 11.2.0.2 are affected by this issue (see comments for more details).

***** UPDATE 07-Feb-2012 #2 *****

I've found the bug# for this one: "Bug 12345980 high parse time with cursor_sharing=force when session_cached_cursors set".

Wednesday, February 01, 2012

Latch: row cache objects, session_cached_cursors and a database link

What could these things possibly have in common?

Some time ago I was involved in tuning a large-scale production system which experienced quite a lot of latch contention. In fact, the contention was bad enough to render the entire system unusable during peak hours.

A typical AWR report during times with moderate workload looked like this:

Event                      Waits           Time(s)  Avg wait (ms)  % DB time  Wait Class
DB CPU                                      21,805                 36.95
latch: row cache objects   14,524,462      11,552   1              19.57      Concurrency
db file sequential read    2,697,778        8,988   3              15.23      User I/O
gc current block 3-way     4,202,356        3,599   1               6.10      Cluster
gc current block 2-way     3,670,293        2,330   1               3.95      Cluster

(In reality it was a 7-node RAC cluster, but that is not really relevant to our subject.)

It is not hard to spot latch: row cache objects in second place, consuming almost 20% of DB time. This event escalated rapidly whenever the load increased, quickly rendering the entire cluster unusable as far as the front-end applications were concerned. The total number of gets during the hour covered by this particular AWR report was 1,281,167,103, which averages out to a whopping 355,879 gets per second!
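As a side note, a per-second figure like that can also be derived outside the report itself. A sketch against dba_hist_latch, per instance (the :begin_snap and :end_snap binds and the 3,600-second snapshot interval are assumptions used here purely for illustration):

select e.gets - b.gets                 total_gets,
       round((e.gets - b.gets) / 3600) gets_per_second
  from dba_hist_latch b, dba_hist_latch e
 where b.latch_name      = 'row cache objects'
   and e.latch_name      = b.latch_name
   and b.dbid            = e.dbid
   and b.instance_number = e.instance_number
   and b.snap_id         = :begin_snap
   and e.snap_id         = :end_snap;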

The first step was to take a look at the Dictionary Cache statistics and see if there were any obvious deviations:
Cache                   Get Requests    Pct Miss
dc_awr_control          72              9.72
dc_database_links       124,452         0.05
dc_files                46,900          17.14
dc_global_oids          555,599         0.07
dc_histogram_data       8,962,576       1.01
dc_histogram_defs       412,885,094     0.02
dc_object_grants        52,475          1.13
dc_objects              2,860,222       0.64
dc_profiles             61,189          0.01
dc_rollback_segments    392,885         0.00
dc_segments             657,653         7.72
dc_sequences            1,324           67.67
dc_table_scns           3,206           3.31
dc_tablespaces          1,253,322       0.06
dc_users                5,475,824       0.02
global database name    65,092          0.01
kqlsubheap_object       10,910          0.46
outstanding_alerts      721             95.98

dc_histogram_defs really stands out but so far provides no direct clues, though the entire dictionary cache seems to be quite busy. Let's take a look at the instance activity statistics:
                  Per Second   Per Transaction  Per Exec        Per Call
DB Time(s):       16.4         0.5              0.00            0.00
DB CPU(s):        6.1          0.2              0.00            0.00
Redo size:        87,543.5     2,398.6
Logical reads:    336,555.7    9,221.2
Block changes:    449.3        12.3
Physical reads:   1,018.3      27.9
Physical writes:  36.8         1.0
User calls:       41,250.4     1,130.2
Parses:           6,264.4      171.6
Hard parses:      27.1         0.7
W/A MB processed: 33,769,202.4 925,233.0
Logons:           18.3         0.5
Executes:         14,230.5     389.9
Rollbacks:        3.4          0.1
Transactions:     36.5

A couple of things to notice:
  1. The number of executions is not that high (at least for an IBM box with 56 CPU cores, that is).
  2. The Execute to Parse ratio is neither particularly good nor bad (6,264.4 parses per second compared to 14,230.5 executions per second).
  3. None of the above seems to be enough to justify 355,879 latch: row cache objects gets per second, even when you take the 27.1 hard parses per second into account.

There had to be more going on. Looking at the SQL ordered by Parse Calls section of the report revealed that all the top parsing queries were coming in over a database link, which prompted me to ask some additional questions...
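The same information can, of course, be pulled straight from the shared pool without waiting for an AWR report. A minimal sketch, with the column choice being purely illustrative:

select *
  from (select parse_calls, executions, sql_id,
               substr(sql_text, 1, 60) sql_text
          from v$sql
         order by parse_calls desc)
 where rownum <= 10;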

It turned out that the system had a somewhat weird architecture. A 7-node RAC cluster served as a back end to numerous front-end Oracle databases which were essentially acting as PL/SQL engines, accessing all the data through database links while doing all the processing. The architectural feasibility of such a solution is debatable (to say the least), but let's not steer away from the main topic of our discussion.

I set out to find out whether there was some special case which could cause latch: row cache objects to shoot through the roof when a lot of queries are executed through a database link. After a number of experiments I eventually found the cause, which explains the title of this blog post.

Take a look at the following example:

SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  5  
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 10  
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
 
79
 
PL/SQL procedure successfully completed

Accessing a remote table 1,000 times through a database link caused only 79 row cache objects gets on the destination database. However, exactly the same example can produce quite a different result:

SQL> alter session set session_cached_cursors=0;
 
Session altered
 
SQL> 
SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  5  
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 10  
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
 
3082
 
PL/SQL procedure successfully completed

Now we jumped from 79 gets to 3,082, which is a 39x increase! What's really astonishing is that it's a cursor cache miss on the source which wreaks havoc on the destination. Evidently, the solution to the problem had nothing to do with the database where the problem was observed: the "PL/SQL databases" had an insufficient session_cached_cursors setting, and once that was corrected the issue disappeared.
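A quick way to gauge whether the setting is undersized on the source databases is to compare how many cursors the sessions actually keep cached against the parameter value; if the counts are pegged at the parameter, the cache is full and quite possibly too small. A minimal sketch, to be run on the "PL/SQL databases":

select max(ss.value) max_cached_cursors,
       (select value
          from v$parameter
         where name = 'session_cached_cursors') parameter_value
  from v$sesstat ss, v$statname sn
 where sn.statistic# = ss.statistic#
   and sn.name = 'session cursor cache count';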

It is somewhat ironic that the problem would never have appeared had the data and the PL/SQL been allowed to live in the same database (where they belong).