Monday, September 08, 2014

Zone Maps On Commit Refresh Oddities

One of the ways Zone Maps can be refreshed when the underlying table data changes is fast refresh on commit. This is similar to how materialized views can be refreshed, with the exception that a Zone Map does not need a materialized view log to do so.

It can also lead to some peculiar side effects.

Test setup

Let's begin by creating a test table with an on commit refresh materialized zone map:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
 
SQL> create materialized zonemap zm$t refresh fast on commit on t (n);
 
Done

SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Zone Staleness (first session)

When a session updates a row in the table it will also mark the corresponding zone in the zone map as stale by setting ZONE_STATE$=1:
SQL> update t set n=0 where n=1;
 
1 row updated
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           1        748 <=== Zone marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
A stale zone is always scanned during Zone Map pruning, regardless of the predicates involved. This makes sense: otherwise we would not be able to select the updated row back while Zone Map pruning is in effect, because the new value is now outside the zone's recorded min/max range. With the zone marked as stale we have no such problem:
SQL> select n from t where n=0;
 
         N
----------
         0
A zone is marked as fresh again (ZONE_STATE$=0) when the session commits. Note that our session has not committed yet.

Second session

So what happens if some other session updates a row belonging to the stale zone and then commits? Does the commit mark the zone as fresh again even though our first transaction is still open? Let's find out:
SQL> update t set n=2 where n=2;
 
1 row updated
 
SQL> commit;
 
Commit complete
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748 <=== Zone is no longer marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Indeed the zone is no longer marked as stale! Now let's go back to our first session...

First session

So what happens if we try to select the same updated row in the first session?
SQL> select n from t where n=0;

no rows selected
We've lost the ability to see the updated row thanks to the second session's commit marking the zone as fresh! Worse yet, we still won't be able to see the row even after this session commits, meaning any query in the database that relies on this Zone Map for pruning will be unable to see the data either:
SQL> commit;

Commit complete.

SQL> select n from t where n=0;

no rows selected

SQL> select /*+ no_zonemap(t scan) */ n from t where n=0;

         N
----------
         0
The only way to select the row is to disable Zone Map pruning.

The end result is that, as far as Oracle is concerned, we have a perfectly "valid" Zone Map which is out of sync with the actual table data:
SQL> select zonemap_name, invalid, stale, unusable, compile_state from dba_zonemaps;
 
ZONEMAP_NAME INVALID STALE   UNUSABLE COMPILE_STATE
------------ ------- ------- -------- -------------------
ZM$T         NO      NO      NO       VALID 

Given that the Oracle documentation claims on commit refresh zone maps stay transactionally fresh, this behavior is clearly a bug resulting from incorrect handling of concurrent transactions that modify data belonging to the same zone.
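If you run into this, rebuilding the zone map should bring it back in sync with the table data so that the pruned select can see the row again (a hedged suggestion -- this step was not part of the original test):

alter materialized zonemap zm$t rebuild;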

Tuesday, August 19, 2014

Zone Map Zone ID's

Just a quick follow-up to my previous post, showing how Zone IDs are calculated.

Let's take the following example:
SQL> select rid, sys_op_zone_id(rid) zone_id
  2   from
  3   (
  4    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  5     from dual
  6   );
 
RID                     ZONE_ID
------------------ ------------
AAAS5KAAHAAABYDAAA 324580438021

Recall that an extended ROWID has the following format (there is a nice picture in the Oracle documentation). In binary it corresponds to:
  • Data Object Number -- 32 bits
  • Relative File Number -- 10 bits
  • Block Number -- 22 bits
  • Row Number -- 16 bits
We know that the Row Number is irrelevant for Zone Maps because they deal with block ranges. With that in mind, a simple conversion to a base 10 number would be:

(Block Number) + (Relative File Number) * 2 ^ 22 + (Data Object Number) * 2 ^ 32

Applying the above formula to our ROWID:
SQL> select
  2    dbms_rowid.rowid_block_number(rid) +
  3    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  4    dbms_rowid.rowid_object(rid) * power(2,32) base_10
  5   from
  6   (
  7    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  8     from dual
  9   );
 
        BASE_10
---------------
332370368534019

What's left after that is to slice the number into Zone Map chunks (2^10 blocks by default -- thanks to Timur for pointing that out in the comments section of the previous post):
SQL> select
  2    trunc((
  3    dbms_rowid.rowid_block_number(rid) +
  4    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  5    dbms_rowid.rowid_object(rid) * power(2,32)
  6    )/power(2,10)) zone_id,
  7    sys_op_zone_id(rid) sys_zone_id
  8   from
  9   (
 10    select chartorowid('AAAS5KAAHAAABYDAAA') rid
 11     from dual
 12   );
 
     ZONE_ID  SYS_ZONE_ID
------------ ------------
324580438021 324580438021
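Going the other way works too: multiplying the Zone ID by 2^10 gives the base 10 address of the first block in the zone, which can then be split back into its components using the same bit layout (a quick sketch based on the formula above):

select
  trunc(zone_start / power(2,32))                   data_object_id,
  trunc(mod(zone_start, power(2,32)) / power(2,22)) relative_fno,
  mod(zone_start, power(2,22))                      block_number
 from
 (
  select 324580438021 * power(2,10) zone_start from dual
 );

This should decompose back into the same data object and relative file number as the original ROWID, with the block number pointing at the start of its 1024-block zone.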

That's all there is to it!

Monday, August 18, 2014

Zone Maps

Zone Maps are a new feature officially introduced in 12.1.0.2.0, so I've decided to take a closer look.

From the Oracle Documentation:
For full table access, zone maps allow I/O pruning of data based on the physical location of the data on disk, acting like an anti-index.

Let's start by creating a test table:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
I've used a high pctfree setting to make sure the table gets spread out on disk -- each row will occupy its own block:
SQL> select count(*) num_rows,
  2    count(distinct dbms_rowid.rowid_block_number(rowid)) num_blocks
  3   from t;
 
  NUM_ROWS NUM_BLOCKS
---------- ----------
      5000       5000
Zone Maps do not require attribute clustering; however, I did use clustering in this particular case to make sure that the value ranges for column N do not overlap when physically stored on disk, which makes it easier to see what's happening when a Zone Map is created.

Now let's create a Zone Map:
SQL> create materialized zonemap zm$t on t (n);
 
Done
The first thing to notice is that we now have an additional table in our schema with the same name as the Zone Map:
SQL> select * from zm$t order by zone_id$;
 
    ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
------------ ---------- ---------- ----------- ----------- ----------
324421046272          1        748           0           0        748
324421046273        749       1756           0           0       1008
324421046274       1757       2764           0           0       1008
324421046275       2765       3772           0           0       1008
324421046276       3773       4780           0           0       1008
324421046277       4781       5000           0           0        220
 
6 rows selected
As you can see we've got six zones defined, most of them covering a range of about a thousand rows, with the exception of the first and the last ones. I can now map each ZONE_ID$ to its respective block range on disk:
SQL> select zone_id$, min_block_id, max_block_id, zone_rows$
  2  from (
  3  select zm$t.zone_id$,
  4    min(dbms_rowid.rowid_block_number(t.rowid)) min_block_id,
  5    max(dbms_rowid.rowid_block_number(t.rowid)) max_block_id,
  6    max(zone_rows$) zone_rows$
  7   from t, zm$t
  8   where t.n between zm$t.min_1_n and zm$t.max_1_n
  9   group by zm$t.zone_id$
 10  ) order by zone_id$;
 
  ZONE_ID$   MIN_BLOCK_ID MAX_BLOCK_ID ZONE_ROWS$
------------ ------------ ------------ ----------
324421046272          179         1023        748
324421046273         1026         2047       1008
324421046274         2050         3071       1008
324421046275         3074         4095       1008
324421046276         4098         5119       1008
324421046277         5122         5343        220
 
6 rows selected
Based on a couple more tests I've done, the algorithm appears to work itself out until the first block number in the segment that divides evenly by 1024, after which all subsequent blocks get mapped to 8MB regions. The last zone has 221 blocks because that's where the end of the last table block happens to be.

Zone Maps seem to be completely independent of the actual extents on disk. In my tests I was able to get multiple zones pointing at the same extent, and in the case above we have multiple extents covered by the same zone. In all cases zones were 8MB in size (more on that a little bit later). By the way, if you're wondering why we have 1008 rows per zone and slightly odd starting block ids (i.e. 1026 instead of 1024 and so on), remember that the first couple of blocks are used for first level bmb blocks and do not store actual table data.

So how is this all pieced together?

SYS_OP_ZONE_ID

The SYS_OP_ZONE_ID function computes ZONE_ID$ for a given ROWID value. The calculation appears to be straightforward math based on the ROWID value and assumes 8MB zone chunks -- this explains why zones are created adjacent to 8MB boundaries, since the ZONE_ID$ values in the Zone Map are computed using the same function:
SQL> select rowid, SYS_OP_ZONE_ID(rowid)
  2   from t
  3   where rownum=1;
 
ROWID              SYS_OP_ZONE_ID(ROWID)
------------------ ---------------------
AAAS4kAAFAAAACzAAA          324421046272

Essentially, if we imagine the entire possible database space divided into 8MB regions, this function returns the region to which a particular ROWID value belongs.
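With the default 8KB block size an 8MB zone corresponds to 2^10 blocks, which is exactly the divisor used in the Zone ID arithmetic shown in the follow-up post above; a trivial sanity check:

select 8*1024*1024 / 8192 blocks_per_zone, power(2,10) zone_id_divisor from dual;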

SYS_ZMAP_FILTER

Let's look at the explain plan for the following query:
SQL> select * from t where n=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 2931408918

--------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%C
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     1 |   205 |  1380
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| T    |     1 |   205 |  1380
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("N"=1)
       filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
              BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_N" > :1 OR zm."MAX_1_N"
              < :2) THEN 3 ELSE 2 END END FROM "ROOT"."ZM$T" zm WHERE zm."ZONE_LEVEL$"=0 ORDER BY
              zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),1,1)<3 AND "N"=1)
We can see that SYS_ZMAP_FILTER appears to be the function involved in figuring out which zones need to be accessed in order to execute the query. The condition inside the query (zm."MIN_1_N" > :1 OR zm."MAX_1_N" < :2) is used to eliminate zones which cannot contain the value we're looking for and is constructed dynamically based on the predicate(s) we have in the query. From here ZONE_ID$ can be mapped back to ROWID ranges (the reverse of SYS_OP_ZONE_ID) in order to scan only the required portions of the data on disk.
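We can also run the pruning query from the predicate section by hand, binding both :1 and :2 to our predicate value of 1. Judging by the SYS_ZMAP_FILTER(...)<3 condition, a result of 3 means the zone can be skipped, while 1 (stale zone) and 2 (the min/max range covers the value) mean the zone has to be scanned:

select zone_id$,
       case when bitand(zone_state$, 1) = 1 then 1
            else case when (min_1_n > 1 or max_1_n < 1) then 3 else 2 end
       end zone_check
  from zm$t
 where zone_level$ = 0
 order by zone_id$;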

This looks to be a very exciting feature, and I can't help thinking it's a little bit disappointing that it's limited to Exadata storage only, as it could have been very useful on other systems due to their lack of storage index support.

Thursday, February 20, 2014

AWR Top 5 Timed Foreground Events

From time to time I notice people posting SQL that reports the AWR Top 5 Timed Foreground Events over a range of snapshots. Since this is something I've done for years, here is the version of the SQL I use in case somebody finds it useful:
select case wait_rank when 1 then inst_id end "Inst Num",
 case wait_rank when 1 then snap_id end "Snap Id",
 case wait_rank when 1 then begin_snap end "Begin Snap",
 case wait_rank when 1 then end_snap end "End Snap",
 event_name "Event",
 total_waits "Waits",
 time_waited "Time(s)",
 round((time_waited/total_waits)*1000) "Avg wait(ms)",
 round((time_waited/db_time)*100, 2) "% DB time",
 substr(wait_class, 1, 15) "Wait Class"
from (
select
  inst_id,
  snap_id, to_char(begin_snap, 'DD-MM-YY hh24:mi:ss') begin_snap,
  to_char(end_snap, 'hh24:mi:ss') end_snap,
  event_name,
  wait_class,
  total_waits,
  time_waited,
  dense_rank() over (partition by inst_id, snap_id order by time_waited desc)-1 wait_rank,
  max(time_waited) over (partition by inst_id, snap_id) db_time
from (
select
  s.instance_number inst_id,
  s.snap_id,
  s.begin_interval_time begin_snap,
  s.end_interval_time end_snap,
  event_name,
  wait_class,
  total_waits-lag(total_waits, 1, total_waits) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) total_waits,
  time_waited-lag(time_waited, 1, time_waited) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) time_waited,
  min(s.snap_id) over (partition by s.startup_time, s.instance_number, stats.event_name) min_snap_id
from (
 select dbid, instance_number, snap_id, event_name, wait_class, total_waits_fg total_waits, round(time_waited_micro_fg/1000000, 2) time_waited
  from dba_hist_system_event
  where wait_class not in ('Idle', 'System I/O')
 union all
 select dbid, instance_number, snap_id, stat_name event_name, null wait_class, null total_waits, round(value/1000000, 2) time_waited
  from dba_hist_sys_time_model
  where stat_name in ('DB CPU', 'DB time')
) stats, dba_hist_snapshot s
 where stats.instance_number=s.instance_number
  and stats.snap_id=s.snap_id
  and stats.dbid=s.dbid
  -- the dbid, instance number and snapshot range below are specific to this
  -- example; adjust them for your environment
  and s.dbid=3870213301
  and s.instance_number=1
  and stats.snap_id between 190 and 195
) where snap_id > min_snap_id and nvl(total_waits,1) > 0
) where event_name!='DB time' and wait_rank <= 5
order by inst_id, snap_id;

Inst Snap  Begin Snap        End Snap Event                            Waits    Time(s) Avg wait(ms)  % DB time Wait Class
---- ----- ----------------- -------- --------------------------- ---------- ---------- ------------ ---------- ---------------
   1   191 20-02-14 14:10:10 14:20:10 cell smart table scan           631829    9569.43           15      79.03 User I/O
                                      DB CPU                                    1202.09                    9.93 
                                      direct path read temp            66074    1006.82           15       8.32 User I/O
                                      PX Deq: Slave Session Stats      11730     429.91           37       3.55 Other
                                      latch: shared pool               28134     162.47            6       1.34 Concurrency
   1   192 20-02-14 14:20:10 14:30:11 cell smart table scan          1391832    4620.11            3      67.39 User I/O
                                      DB CPU                                    1017.78                   14.85 
                                      direct path read temp            76329     977.95           13      14.26 User I/O
                                      PX Deq: Slave Session Stats      25043     401.53           16       5.86 Other
                                      latch free                       38836      214.1            6       3.12 Other
   1   193 20-02-14 14:30:11 14:40:14 cell smart table scan          2448539   11075.29            5       79.3 User I/O
                                      DB CPU                                    1529.93                   10.95 
                                      PX Deq: Slave Session Stats      44242    1520.01           34      10.88 Other
                                      direct path read temp            77583     985.65           13       7.06 User I/O
                                      latch free                       67518     376.52            6        2.7 Other
   1   194 20-02-14 14:40:14 14:50:15 direct path read temp            99224      857.3            9      71.63 User I/O
                                      DB CPU                                     328.78                   27.47 
                                      name-service call wait              91        5.4           59       0.45 Other
                                      PX Deq: Slave Session Stats         83       0.17            2       0.01 Other
                                      direct path write                  194       0.12            1       0.01 User I/O
   1   195 20-02-14 14:50:15 15:00:18 DB CPU                                    1188.84                   98.15 
                                      log switch/archive                   1      10.01        10010       0.83 Other
                                      direct path read temp              775       3.96            5       0.33 User I/O
                                      cell smart table scan             1393        1.1            1       0.09 User I/O
                                      cell single block physical         148        0.9            6       0.07 User I/O
                                      read                                                                      
 
 
25 rows selected
 

Friday, February 14, 2014

'active txn count during cleanout', part II

In part I, I've shown some interesting side effects that happen when you're trying to select from a table block which has an outstanding active transaction in it. In this post we're going to make things a little bit more interesting by introducing indexes into the picture.

Test Setup

I'll create a table with two rows and an index:
SQL> create table test as
  2   select level n
  3    from dual
  4    connect by level <= 2;
 
Table created
 
SQL> create index i_test on test (n);
 
Index created
Session 1

I'll update one row in my first session and leave the transaction open:
SQL> update test set n=3 where n=1;
 
1 row updated
Here is the xid for this transaction:
SQL> select '0x'||to_char(XIDUSN, 'fm000x')||'.'||
  2    to_char(XIDSLOT, 'fm00x')||'.'||
  3    to_char(XIDSQN, 'fm0000000x') xid
  4   from v$transaction
  5   where addr=(
  6    select taddr
  7     from v$session
  8     where sid=sys_context('userenv','sid')
  9    );
 
XID
----------------------
0x0004.01c.00001fd5
Index Block Dump 1

Since I only have two rows in the table I will end up with a special case where my index root block is able to hold all the data, essentially playing the role of both the root block and a leaf block at the same time. This makes it easier for me to dump the relevant index block because I know there is only one index block to dump:
Block header dump:  0x0100008b
 Object id on Block? Y
 seg/obj: 0x12f46  csc: 0x00.1efcb3c  itc: 2  flg: E  typ: 2 - INDEX
     brn: 0  bdba: 0x1000088 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0004.01c.00001fd5  0x00c011ac.09a5.0b  ----    2  fsc 0x000e.00000000
We have two index entries locked in the block because the entry with value 1 got deleted and an entry with value 3 got inserted, as per our update. Note the block cleanout scn (csc) value: 0x00.1efcb3c
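For reference, a block dump like the one above can be produced with the block dump command below; the file and block numbers are derived from the rdba 0x0100008b in the header (file 4, block 139), and the dump ends up in the session's trace file:

alter system dump datafile 4 block 139;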

Session 2

I'll update another row in the second session leaving the transaction open as well:
SQL> update test set n=4 where n=2;
 
1 row updated

XID
----------------------
0x0003.01f.00001eab
Index Block Dump 2

Here is how index block dump looks right now:
Block header dump:  0x0100008b
 Object id on Block? Y
 seg/obj: 0x12f46  csc: 0x00.1efcd0c  itc: 3  flg: E  typ: 2 - INDEX
     brn: 0  bdba: 0x1000088 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0004.01c.00001fd5  0x00c011ac.09a5.0b  ----    2  fsc 0x000e.00000000
0x03   0x0003.01f.00001eab  0x00c00e73.0982.31  ----    2  fsc 0x000e.00000000
Notice that the csc value has changed from 0x00.1efcb3c to 0x00.1efcd0c.

What happened is just another variation of the theme we saw in part I -- when our second session updates the index block it notices an active transaction in the ITL list and tries to perform a cleanout. It will do the same for the table block as well but since I've shown all the relevant mechanics in the previous post I'll leave it at that.

Undo Segment Header Checks

The important consequence of all the above is that a session which tries to perform a cleanout has to look into the other transaction's undo segment header block in order to find out whether that transaction has committed or not:
SQL> select
  2    trn.xidusn,
  3    rbs.file_id,
  4    rbs.block_id header_block,
  5    trn.ubablk undo_block,
  6    '0x'||to_char(trn.XIDUSN, 'fm000x')||'.'||
  7    to_char(trn.XIDSLOT, 'fm00x')||'.'||
  8    to_char(trn.XIDSQN, 'fm0000000x') xid
  9   from v$transaction trn, dba_rollback_segs rbs
 10   where trn.XIDUSN=rbs.segment_id
 11   order by 1;
 
    XIDUSN    FILE_ID HEADER_BLOCK UNDO_BLOCK XID
---------- ---------- ------------ ---------- ----------------------
         3          3          160       3699 0x0003.01f.00001eab
         4          3          176       4524 0x0004.01c.00001fd5
Our first session's xid was 0x0004.01c.00001fd5, so when our second session performed the update it had to read block 176 (the undo segment header block) to check the other transaction's status, and block 4524 (the undo block) in order to roll back the other session's changes for its write consistency checks:
WAIT #140055864053216: nam='db file sequential read' ela= 341 file#=3 block#=176 blocks=1 obj#=0 tim=1392400391392719
WAIT #140055864053216: nam='db file sequential read' ela= 675 file#=3 block#=4524 blocks=1 obj#=0 tim=1392400391393679
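The wait lines above come from an extended SQL trace of the second session; enabling event 10046 at level 8 (which includes wait events) before running the update is one way to capture them:

alter session set events '10046 trace name context forever, level 8';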
I'll continue setting the stage for a perfect disaster with delayed block cleanout and parallel DML in the upcoming posts in this series.

Tuesday, January 28, 2014

'active txn count during cleanout', part I

I was going to write a blog post about some peculiar side effects you can run into with delayed block cleanout when running parallel DML, but the entry soon became so big that I've decided to split it up into a series of more manageable posts.

For a good background on various themes of block cleanout check out Clean it up by Jonathan Lewis.

Active transactions, consistent reads and table scans

First I'm going to show you some interesting observations about what happens when you're trying to select from the block which has an outstanding open transaction in it. Let's create a test table with one row and update it leaving the transaction open. I'm using 11.2.0.4 here:
SQL> create table test (n number, m number);
 
Table created
 
SQL> insert into test values (1, 1);
 
1 row inserted
 
SQL> commit;
 
Commit complete

SQL> select dbms_rowid.rowid_relative_fno(rowid) file#,
  2    dbms_rowid.rowid_block_number(rowid) block#
  3    from test;
 
     FILE#     BLOCK#
---------- ----------
         4        134
 
SQL> update test set n=n;
 
1 row updated
If we were to look into the buffer headers view (x$bh) we would find the following:
     FILE#     DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
         4        134 xcur           0          0
         4        134 cur       595913          0
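A query along the following lines can be used to produce such a listing (a sketch -- the exact state decode used for the listings in this post isn't shown, so the one below is an assumption):

select file#, dbablk,
       decode(state, 1, 'xcur', 2, 'scur', 3, 'cr', to_char(state)) state,
       cr_scn_bas, cr_scn_wrp
  from x$bh
 where file# = 4
   and dbablk = 134;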
Now let's select from this table in a different session while checking a couple of statistics at the same time:
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                0
active txn count during cleanout                                          0
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                1
active txn count during cleanout                                          1
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                2
active txn count during cleanout                                          2
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                3
active txn count during cleanout                                          3
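The statistic snapshots interleaved with the selects above can be captured from the selecting session with something like this (a sketch):

select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where ms.statistic# = sn.statistic#
   and sn.name in ('immediate (CR) block cleanout applications',
                   'active txn count during cleanout');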
The first thing worth mentioning is that both the immediate (CR) block cleanout applications and active txn count during cleanout statistics increment every time we execute the select. Immediate (CR) block cleanout applications indicates that the session is performing delayed block cleanout while doing a consistent read (CR). Active txn count during cleanout indicates how many currently active transactions the cleanout process encountered in each block. From one perspective this makes sense -- when our select reads the block and discovers that there are open transactions, it may not know whether those transactions are currently active or whether the block indeed requires a cleanout. However, something interesting shows up when we look into x$bh again:
     FILE#     DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
         4        134 xcur           0          0
         4        134 cur       595913          0
         4        134 cur       595922          0
         4        134 cur       595926          0
         4        134 cur       595940          0
Clearly each subsequent select generated a new consistent read version of the block at a different SCN. Indeed, if we were to dump the table's block before and after the last select here is what we would find:

Before:
buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917d6 seq: 0x01 flg: 0x04 tail: 0x17d60601
frmt: 0x02 chkval: 0xf347 type: 0x06=trans data
...
Block header dump:  0x01000086
 Object id on Block? Y
 seg/obj: 0x113da  csc: 0x00.917d6  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1000080 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0002.01f.00000242  0x00c0c19c.0035.19  C---    0  scn 0x0000.00090e7e
0x02   0x0008.00b.00000241  0x00c00989.0039.38  ----    1  fsc 0x0000.00000000
After:
buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917e4 seq: 0x01 flg: 0x04 tail: 0x17e40601
frmt: 0x02 chkval: 0xf375 type: 0x06=trans data
...
Block header dump:  0x01000086
 Object id on Block? Y
 seg/obj: 0x113da  csc: 0x00.917e4  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1000080 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0002.01f.00000242  0x00c0c19c.0035.19  C---    0  scn 0x0000.00090e7e
0x02   0x0008.00b.00000241  0x00c00989.0039.38  ----    1  fsc 0x0000.00000000
Notice that the cleanout scn (csc) is different and the block scn got updated to the same value as well. As you have probably guessed by now, each select generates a block cleanout redo record too:
REDO RECORD - Thread:1 RBA: 0x00000c.00035faf.0010 LEN: 0x006c VLD: 0x05
SCN: 0x0000.000917e4 SUBSCN:  1 01/28/2014 00:16:44
(LWN RBA: 0x00000c.00035faf.0010 LEN: 0001 NST: 0001 SCN: 0x0000.000917e4)
CHANGE #1 TYP:0 CLS:1 AFN:4 DBA:0x01000086 OBJ:70618 SCN:0x0000.000917d6 SEQ:1 OP:4.1 ENC:0 RBL:0
Block cleanout record, scn:  0x0000.000917e4 ver: 0x01 opt: 0x01, entries follow...
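The redo record above was taken from a redo log file dump; dumping a log file looks something like the following (the file name here is just a placeholder for an actual online redo log member):

-- the path below is a placeholder; point it at a real redo log member
alter system dump logfile '/path/to/redo01.log';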
From another perspective, why even bother doing any of this? It almost looks like the cleanout code is moving along tidying up the block, updates the scn plus csc, and only then discovers that there is really nothing to clean up because the other transaction is still active. But by that point it has already changed the current version of the block, which then results in a proliferation of CR copies from the selects and increased redo generation from the (essentially) shell block cleanout records. There are going to be more interesting details worth mentioning in the next post.

Saturday, January 25, 2014

crsd.bin core dumps

Core dump issues can sometimes be notoriously difficult to troubleshoot. I got a call this morning from one of my customers saying that after a power outage Grid Infrastructure was not able to fully come up on some nodes of their Exadata cluster. After further examining the situation it turned out that the crsd.bin binary was simply core dumping upon startup.

Troubleshooting Grid Infrastructure startup issues when nothing is core dumping can sometimes be a chore, so what could be more fun than it not being able to fully start because a major daemon is core dumping?

One of the useful things to do when a binary core dumps is to get a stack trace to see which function raised the exception (you can examine the core file with gdb, for example, in order to do that). Let's see what the stack trace holds for us:
Core was generated by `/u01/app/11.2.0.3/grid/bin/crsd.bin reboot'.
Program terminated with signal 6, Aborted.
#0  0x0000003ea3e30285 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003ea3e30285 in raise () from /lib64/libc.so.6
#1  0x0000003ea3e31d30 in abort () from /lib64/libc.so.6
#2  0x0000003ea56bed94 in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib64/libstdc++.so.6
#3  0x0000003ea56bce46 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003ea56bce73 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003ea56bcef9 in __cxa_rethrow () from /usr/lib64/libstdc++.so.6
#6  0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
    aclString=..., useOcr=true, $U7=,
    $U8=, $U9=,
    $V0=, $V1=) at acl.cpp:120
#7  0x0000000000df879c in Acl::_ZN3CAA3AclC1ERKSsS2_S2_b (this=0x4556d440,
    $U7=, $U8=,
    $U9=, $V0=,
    $V1=)
#8  0x0000000000a4d81e in SrvResource::initUserId (this=0x7f15803d7550,
    $1=) at clsAgfwSrvResource.cpp:204
We can see that the source of the exception is in Acl::Acl, from where it is then propagated through the standard libraries. Moreover, the function SrvResource::initUserId appears in the stack trace as well, which makes you wonder whether there is some issue with one of the resources' Access Control Lists, in particular with its user id setting.

Armed with that knowledge you can now sift through the Grid Infrastructure logs in a much more effective way, because these logs are notoriously big and "chatty" (I think my worst nightmare is the database alert log becoming like the GI alert log, thereby making it much less useful). And there we have it:
Exception: ACL entry creation failed for: owner:ggate:rwx
It turned out the nodes that were core dumping had recently been added to the cluster, and the user ggate, which is the owner of the GoldenGate resource, simply did not exist on these nodes. Apparently that was enough to cause crsd.bin to core dump. Yikes!