Tuesday, June 08, 2010

11GR2 Result Cache Scalability

Result Cache in 11GR1

Two years ago I wrote a series of posts explaining some of the dynamics around the Result Cache latch. To recap, the result cache memory in 11GR1 is protected by a single RC latch. That in itself wouldn't have been so much of an issue (at least compared to what we got in reality) had the latch allowed shared mode gets when all you need to do is read from the result cache memory.

Alas, the latch turned out to allow exclusive gets only. It almost goes without saying that, as concurrency levels increased, that single latch behaved more and more like a hand brake (link to a test I did back then on an 8-way Itanium 2).

Back to the future

When 11GR2 was released I knew that at some point I would need to go back and revisit this subject. I did a couple of quick and dirty runs which confirmed the same single latch and no shared mode gets, so it didn't look like anything had really changed. At that point I decided to revisit it a bit later. That "a bit later" happened just recently.
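As a side note, the single-latch part of that check is easy to repeat from the latch views. Below is a minimal sketch, assuming the latch name still starts with "Result Cache" (the absence of shared mode gets cannot be seen this way and needs tracing of latch activity instead):

select l.name,
       (select count(*)
          from v$latch_children c
         where c.name = l.name) child_latches   -- 0 means a solitary latch with no children
  from v$latch l
 where lower(l.name) like 'result cache%';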

How bad can it get?

What I wanted to do is get an UltraSPARC T2 and pit it against a Core i7 980X at different concurrency levels in order to see how bad it can get. The T2 requires quite a lot of parallelism just to keep up with a single i7 core. But since all we've got is a single RC latch, I expected the T2 to choke on it quite fast: not only would there be a lot of processes competing for the same latch, but the slow single-threaded performance would cause the latch to be held for much longer periods of time. Performance degradation would be dire.

Result Cache in 11GR2

I used the same test described here, as it is targeted at exploiting the RC latch weakness and gives me the ability to compare with the old results. I used 250K lookup iterations. Performance was measured as the total number of lookups performed per second, and RC latch statistics were captured for analysis.
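For context, the lookup driver is essentially a tight PL/SQL loop running a keyed, result-cached query over and over. The sketch below conveys the idea; the table and column names are illustrative rather than the exact ones from the original test:

declare
  l_val number;
begin
  for i in 1 .. 250000 loop
    -- repeatedly fetch a result through the result cache,
    -- cycling over a small set of keys so the results stay cached
    select /*+ result_cache */ count(*)
      into l_val
      from lookup_table
     where id = mod(i, 1000);
  end loop;
end;
/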

Since the 980X has 6 cores and 12 threads, the tests were done with 1 to 12 processes running at the same time, which also gave an opportunity to see how well HT scales. Note that I plan to do some further testing on the T2 with up to 64 threads, but for now I've tested up to 12 threads only as I couldn't get a big enough test window.

UltraSPARC T2 Results

# of processes   Buffer Cache   % linear   Result Cache   % linear
             1           4426     100.00           4555     100.00
             2           8930     100.88           9124     100.15
             3          13465     101.41          13731     100.48
             4          17886     101.03          18179      99.77
             5          22290     100.72          22715      99.74
             6          26615     100.22          27012      98.84
             7          30659      98.96          30804      96.61
             8          34347      97.00          34910      95.80
             9          38389      96.37          39029      95.20
            10          42772      96.64          43126      94.68
            11          46840      96.21          46936      93.68
            12          50667      95.40          50590      92.55

When I saw these numbers for the first time I was quite surprised at just how good they are! The UltraSPARC T2 ended up being far from choking and, as a matter of fact, the only position where the Result Cache had to give up is the last one. If you reference the results I obtained on the 8-way Itanium 2 in 11GR1, you'll see that the Result Cache gave up much earlier and scaled a lot worse.

This certainly looks promising, so let's take a look at the RC latch statistics:

# of processes      Gets    Misses   Sleeps   Wait Time
             1    500001         0        0           0
             2   1000002     40253        1           0
             3   1500003     50404        0           0
             4   2000004    165116        9         464
             5   2500005    211559        5         182
             6   3000006    437898        8        6877
             7   3500007    805752       52       16556
             8   4000008   1214762       20        2980
             9   4500009   1775372      188        3140
            10   5000010   2244964      491       29568
            11   5500011   2552323      664       28011
            12   6000012   3019903     1226       60005


There is one astonishing fact about the above numbers. Let's get some efficiency metrics in place to compare them with the ones I got in 11GR1. I'll use the data point with eight parallel processes as it's the highest reference point I can get across both data sets.

First of all, the number of gets per execution remained the same at two gets per execution. If we calculate % miss per get we get 28.62% in 11GR1 and 50.33% in 11GR2. In other words, roughly every second get request resulted in a miss in 11GR2 and every third in 11GR1. It may appear as if this got worse, but it's really a consequence of something else.

If we calculate % sleep per miss we get 31.36% in 11GR1 but only 0.04% in 11GR2! In other words, the number of times a process had to go to sleep has decreased drastically. In almost all cases the process was able to acquire the latch while spinning, without going to sleep. This also explains why % miss per get in 11GR2 went up, and it shows that a drop in efficiency for a single metric does not necessarily indicate a problem; it might happen because some other correlated metric has in fact improved.
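If you'd rather pull these ratios straight from the instance instead of calculating them by hand, something along these lines does the job; the only assumption is the latch name pattern, which may differ slightly between versions:

select name,
       gets,
       misses,
       sleeps,
       round(misses / nullif(gets, 0) * 100, 2)   pct_miss_per_get,
       round(sleeps / nullif(misses, 0) * 100, 2) pct_sleep_per_miss
  from v$latch
 where lower(name) like 'result cache%';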

This is certainly a sign of a great improvement, but what is it? Most likely the improvement relates to how long the latch needs to be held. The hold time became so small that, in most cases, a process is able to acquire the latch while spinning, before it has to go to sleep (i.e. within fewer than _spin_count iterations).
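For reference, the spin limit itself can be checked with the usual hidden parameter query (run as SYS); this is generic and not specific to the result cache:

select i.ksppinm  parameter,
       v.ksppstvl value
  from x$ksppi  i,
       x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_spin_count';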

Core i7 980X Results

# of processes   Buffer Cache   % linear   Result Cache   % linear
             1          40064     100.00          43554     100.00
             2          78989      98.58          84602      97.12
             3         121753     101.30         127768      97.79
             4         159490      99.52         166667      95.67
             5         194704      97.20         204583      93.94
             6         229709      95.56         240770      92.13
             7         231788      82.65         244755      80.28
             8         233918      72.98         246305      70.69
             9         250836      69.57         260718      66.51
            10         267094      66.67         275330      63.22
            11         280326      63.61         290084      60.55
            12         290416      60.41         293830      56.22


Here the Result Cache won across all positions. It takes about 10 processes running on the UltraSPARC T2 to catch up with a single process running on the i7 980X. Performance gains declined rapidly once we got past six concurrent processes, but we were still able to realize some additional throughput, with 12 threads being about 22% faster than 6 threads.

Latch statistics:

# of processes      Gets    Misses   Sleeps   Wait Time
             1    500001         0        0           0
             2   1000002     40456        0           0
             3   1500003    117893        5          71
             4   2000004    209399        0           0
             5   2500005    381160        0           0
             6   3000006    517745       11         179
             7   3500007    913125       20         555
             8   4000008   1355226       26       11914
             9   4500009   1834112       13        1017
            10   5000010   2602801       42        1607
            11   5500011   3196415      145        3451
            12   6000012   3730467      184      123954


Essentially we're looking at the same phenomenon, with the number of sleeps being significantly lower compared to what we observed in 11GR1. With six concurrent processes, % miss per get is 17.26% and % sleep per miss is 0.002%! This allowed the Result Cache to stay ahead with up to (and including) 12 concurrent processes running.

UltraSPARC T2 vs i7 980X

We'll wrap up with a nice graph showing result cache performance on both UltraSPARC T2 and Core i7 980X:



The i7 980X starts almost where 12 UltraSPARC T2 processes end. Would the T2 be able to narrow the gap with more parallel threads? I'll certainly find out.

Conclusion

There is an enormous improvement in Result Cache scalability in 11GR2. It's still slower than it would be with shared mode gets (or multiple child latches or, even better, both), but it gets very, very close.

Monday, June 07, 2010

Row cache objects latch contention

A data loading process was running on an UltraSPARC T2 CPU. To take advantage of the platform architecture (or, I'd rather say, to avoid its limitations), the loading process had been designed to run with a massive number of parallel query slaves in order to extract the maximum throughput from the CMT architecture.

Every time this data loading process executed, it experienced strange slowdowns at seemingly random points in time. The performance drops were quite substantial, which prompted additional investigation. Upon a closer examination of ASH data, it turned out that all the slowdowns were due to latch: row cache objects contention.
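A quick way to see that in ASH is a simple breakdown of wait events over the problem window; the one-hour window below is illustrative:

select event, count(*) samples
  from v$active_session_history
 where sample_time > sysdate - 1/24   -- last hour, adjust to the slowdown window
   and session_state = 'WAITING'
 group by event
 order by samples desc;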

Row cache objects latch protects the dictionary cache. The first thing was to figure out whether most of the contention was contributed by a particular row cache objects child latch:
SQL> select latch#, child#, sleeps
2 from v$latch_children
3 where name='row cache objects'
4 and sleeps > 0
5 order by sleeps desc;

LATCH# CHILD# SLEEPS
---------- ---------- ----------
270 1 24241645
270 5 523
270 4 52
The first child certainly doesn't look good given the number of sleeps it has experienced compared to all the other child latches. Once we have the troublesome child latch identified, we can move on and see which type of dictionary cache it protects:
SQL> select distinct s.kqrstcln latch#, r.cache#, r.parameter name, r.type, r.subordinate#
2 from v$rowcache r, x$kqrst s
3 where r.cache# = s.kqrstcid
4 order by 1, 4, 5;

LATCH# CACHE# NAME TYPE SUBORDINATE#
------- ------ --------------------------------- ----------- ------------
1 3 dc_rollback_segments PARENT
2 1 dc_free_extents PARENT
3 4 dc_used_extents PARENT
4 2 dc_segments PARENT
5 0 dc_tablespaces PARENT
6 5 dc_tablespace_quotas PARENT
7 6 dc_files PARENT
8 10 dc_users PARENT
8 7 dc_users SUBORDINATE 0
8 7 dc_users SUBORDINATE 1
8 7 dc_users SUBORDINATE 2
9 8 dc_objects PARENT
9 8 dc_object_grants SUBORDINATE 0
10 17 dc_global_oids PARENT
11 12 dc_constraints PARENT
12 13 dc_sequences PARENT
13 16 dc_histogram_defs PARENT
13 16 dc_histogram_data SUBORDINATE 0
13 16 dc_histogram_data SUBORDINATE 1
14 32 kqlsubheap_object PARENT
15 19 dc_table_scns PARENT
15 19 dc_partition_scns SUBORDINATE 0
16 18 dc_outlines PARENT
17 14 dc_profiles PARENT
18 47 realm cache PARENT
18 47 realm auth SUBORDINATE 0
19 48 Command rule cache PARENT
20 49 Realm Object cache PARENT
20 49 Realm Subordinate Cache SUBORDINATE 0
21 46 Rule Set Cache PARENT
22 34 extensible security user and rol PARENT
23 35 extensible security principal pa PARENT
24 37 extensible security UID to princ PARENT
25 36 extensible security principal na PARENT
26 33 extensible security principal ne PARENT
27 38 XS security class privilege PARENT
28 39 extensible security midtier cach PARENT
29 44 event map PARENT
30 45 format PARENT
31 43 audit collector PARENT
32 15 global database name PARENT
33 20 rule_info PARENT
34 21 rule_or_piece PARENT
34 21 rule_fast_operators SUBORDINATE 0
35 23 dc_qmc_ldap_cache_entries PARENT
36 52 qmc_app_cache_entries PARENT
37 53 qmc_app_cache_entries PARENT
38 27 qmtmrcin_cache_entries PARENT
39 28 qmtmrctn_cache_entries PARENT
40 29 qmtmrcip_cache_entries PARENT
41 30 qmtmrctp_cache_entries PARENT
42 31 qmtmrciq_cache_entries PARENT
43 26 qmtmrctq_cache_entries PARENT
44 9 qmrc_cache_entries PARENT
45 50 qmemod_cache_entries PARENT
46 24 outstanding_alerts PARENT
47 22 dc_awr_control PARENT
48 25 SMO rowcache PARENT
49 40 sch_lj_objs PARENT
50 41 sch_lj_oids PARENT

60 rows selected.
The first child protects dc_rollback_segments. We can confirm it by referencing data in v$rowcache:
SQL> select parameter, gets
2 from v$rowcache
3 order by gets desc;

PARAMETER GETS
-------------------------------- ----------
dc_rollback_segments 310995555
dc_tablespaces 76251831
dc_segments 3912096
dc_users 2307601
dc_objects 1460725
dc_users 608659
dc_histogram_defs 250666
global database name 67475
dc_histogram_data 43098
dc_histogram_data 14364
dc_global_oids 14320
outstanding_alerts 2956
dc_profiles 2555
dc_awr_control 1925
dc_object_grants 745
dc_files 532
dc_constraints 201
sch_lj_oids 158
dc_sequences 156
dc_table_scns 20
sch_lj_objs 18
dc_qmc_ldap_cache_entries 0
qmc_app_cache_entries 0
qmc_app_cache_entries 0
qmtmrcin_cache_entries 0
qmtmrctn_cache_entries 0
qmtmrcip_cache_entries 0
qmtmrctp_cache_entries 0
qmtmrciq_cache_entries 0
qmtmrctq_cache_entries 0
qmrc_cache_entries 0
qmemod_cache_entries 0
SMO rowcache 0
dc_users 0
dc_partition_scns 0
dc_users 0
realm auth 0
Realm Subordinate Cache 0
rule_or_piece 0
rule_info 0
audit collector 0
format 0
event map 0
extensible security midtier cach 0
XS security class privilege 0
extensible security principal ne 0
extensible security principal na 0
extensible security UID to princ 0
extensible security principal pa 0
extensible security user and rol 0
Rule Set Cache 0
Realm Object cache 0
Command rule cache 0
realm cache 0
dc_outlines 0
kqlsubheap_object 0
dc_tablespace_quotas 0
dc_used_extents 0
rule_fast_operators 0
dc_free_extents 0

60 rows selected
The next step is to see whether the latch miss sources can give us some more hints regarding the issue:
SQL> select "WHERE", sleep_count, location
2 from v$latch_misses
3 where parent_name='row cache objects'
4 and sleep_count > 0;

WHERE SLEEP_COUNT LOCATION
------------------- ----------- ------------------------------
kqrpre: find obj 20612167 kqrpre: find obj
kqrpup 7 kqrpup
kqrcmt: while loop 1 kqrcmt: while loop
kqrcmt: clear flag 1 kqrcmt: clear flag
kqreqd 1026837 kqreqd
kqreqd: reget 2602576 kqreqd: reget

6 rows selected
Now if you take kqrpre: find obj and plug it into a search on My Oracle Support, you'll quickly land on Bug 5749075 "High Requests on dc_rollback_segments". Among other things, this note points at an unusually high number of undo segments being created because cleanup was not able to work properly...
SQL> select count(*) from dba_rollback_segs;

COUNT(*)
----------
14838
...and this seems to be the case. The only difference is that the issue was observed on 11GR2 while the bug had been filed against older versions, though it was still worth checking in case we were seeing a regression. Indeed, after getting rid of that many undo segments by simply recreating the undo tablespace, the issue, though not completely gone, manifested itself a lot less, making its impact on the process throughput relatively insignificant.
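For completeness, recreating the undo tablespace boils down to something like the following; the tablespace names, file path and sizes are illustrative, and the old tablespace can only be dropped once none of its undo segments are still in use:

create undo tablespace undotbs2
  datafile '/u01/oradata/db/undotbs2_01.dbf' size 2g autoextend on;

alter system set undo_tablespace = undotbs2 scope = both;

-- once the old undo segments are offline and no longer needed:
drop tablespace undotbs1 including contents and datafiles;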