Tuesday, August 19, 2014

Zone Map Zone ID's

Just a quick follow up to my previous post on how Zone ID's are calculated.

Let's take the following example:
SQL> select rid, sys_op_zone_id(rid) zone_id
  2   from
  3   (
  4    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  5     from dual
  6   );
 
RID                     ZONE_ID
------------------ ------------
AAAS5KAAHAAABYDAAA 324580438021

Recalling that extended ROWID has the following format (a nice picture from Oracle Documentation):



In the binary format that would correspond to:
  • Data Object Number -- 32 bits
  • Relative File Number -- 10 bits
  • Block Number -- 22 bits
  • Row Number -- 16 bits
We know that Row Number is irrelevant for the Zone Maps because they deal with block ranges. With that in mind a simple conversion to a (base 10) number would be:

(Block Number) + (Relative File Number) * 2 ^ 22 + (Data Object Number) * 2 ^ 32

Applying the above formula to our ROWID:
SQL> select
  2    dbms_rowid.rowid_block_number(rid) +
  3    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  4    dbms_rowid.rowid_object(rid) * power(2,32) base_10
  5   from
  6   (
  7    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  8     from dual
  9   );
 
        BASE_10
---------------
332370368534019

What's left after that is to slice the number into Zone Map chunk size (2^10 by default, thanks to Timur for pointing that out in the comments sections for the previous post):
SQL> select
  2    trunc((
  3    dbms_rowid.rowid_block_number(rid) +
  4    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  5    dbms_rowid.rowid_object(rid) * power(2,32)
  6    )/power(2,10)) zone_id,
  7    sys_op_zone_id(rid) sys_zone_id
  8   from
  9   (
 10    select chartorowid('AAAS5KAAHAAABYDAAA') rid
 11     from dual
 12   );
 
     ZONE_ID  SYS_ZONE_ID
------------ ------------
324580438021 324580438021

That's all there is to it!

Monday, August 18, 2014

Zone Maps

Zone Maps is a new feature that got officially introduced in 12.1.0.2.0 so I've decided to take a closer look.

From the Oracle Documentation:
For full table access, zone maps allow I/O pruning of data based on the physical location of the data on disk, acting like an anti-index.

Let's start by creating a test table:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
I've used a high pctfree setting to make sure the table gets spread out on disk -- each row will occupy it's own block:
SQL> select count(*) num_rows,
  2    count(distinct dbms_rowid.rowid_block_number(rowid)) num_blocks
  3   from t;
 
  NUM_ROWS NUM_BLOCKS
---------- ----------
      5000       5000
Zone Maps do not require attribute clustering, however, I did use clustering in this particular case to make sure that value ranges for column N do not overlap when physically stored on disk to make it easier to see what's happening when a Zone Map is created.

Now let's create a Zone Map:
SQL> create materialized zonemap zm$t on t (n);
 
Done
The first thing to notice is we now got additional table in our schema with the same name as a Zone Map:
SQL> select * from zm$t order by zone_id$;
 
    ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
------------ ---------- ---------- ----------- ----------- ----------
324421046272          1        748           0           0        748
324421046273        749       1756           0           0       1008
324421046274       1757       2764           0           0       1008
324421046275       2765       3772           0           0       1008
324421046276       3773       4780           0           0       1008
324421046277       4781       5000           0           0        220
 
6 rows selected
As you can see we've got six zones defined with most of them covering a range of about thousand rows with the exception of the first and the last ones. I can now map each ZONE_ID$ to it's respective block range on disk:
SQL> select zone_id$, min_block_id, max_block_id, zone_rows$
  2  from (
  3  select zm$t.zone_id$,
  4    min(dbms_rowid.rowid_block_number(t.rowid)) min_block_id,
  5    max(dbms_rowid.rowid_block_number(t.rowid)) max_block_id,
  6    max(zone_rows$) zone_rows$
  7   from t, zm$t
  8   where t.n between zm$t.min_1_n and zm$t.max_1_n
  9   group by zm$t.zone_id$
 10  ) order by zone_id$;
 
  ZONE_ID$   MIN_BLOCK_ID MAX_BLOCK_ID ZONE_ROWS$
------------ ------------ ------------ ----------
324421046272          179         1023        748
324421046273         1026         2047       1008
324421046274         2050         3071       1008
324421046275         3074         4095       1008
324421046276         4098         5119       1008
324421046277         5122         5343        220
 
6 rows selected
Based on a couple more tests I've done the algorithm appears to work itself out until the fist block number in a segment which divides evenly by 1024 after which all subsequent blocks get mapped to 8MB regions. The last map has 221 blocks because that's where the end of the last table block happens to be.

Zone Maps seems to be completely independent from the actual extents on disk. In my tests I was able to get multiple Zone Maps pointing at the same extent and in the case above we have multiple extents covered by the same Zone Map. In all cases zones were 8MB in size (more on that a little bit later). By the way if you're wondering why do we have 1008 rows and a little bit strange starting block_ids (i.e. 1026 instead of 1024 and so on) remember that the fist couple blocks are used for first level bmb and do not store actual table data.

So how is this all pieced together?

SYS_OP_ZONE_ID

SYS_OP_ZONE_ID function computes ZONE_ID$ for a given ROWID value. The calculation appears to be a straightforward math based on the ROWID value and assumes 8MB Zone Map chunks -- this explains why Zone Maps are created adjacent to 8MB boundaries since ZONE_ID$ values in the Zone Map are computed using the same function:
SQL> select rowid, SYS_OP_ZONE_ID(rowid)
  2   from t
  3   where rownum=1;
 
ROWID              SYS_OP_ZONE_ID(ROWID)
------------------ ---------------------
AAAS4kAAFAAAACzAAA          324421046272

Essentially if we imagine the entire possible database space divided into 8MB regions this function would return into which region a particular ROWID value would belong.

SYS_ZMAP_FILTER

Let's look at the explain plan for the following query:
SQL> select * from t where n=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 2931408918

--------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%C
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     1 |   205 |  1380
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| T    |     1 |   205 |  1380
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("N"=1)
       filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
              BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_N" > :1 OR zm."MAX_1_N"
              < :2) THEN 3 ELSE 2 END END FROM "ROOT"."ZM$T" zm WHERE zm."ZONE_LEVEL$"=0 ORDER BY
              zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),1,1)<3 AND "N"=1)
We can see that SYS_ZMAP_FILTER appears to be the function involved in figuring out which Zone Maps needs to be accessed in order to execute the query. The condition inside the query (zm."MIN_1_N" > :1 OR zm."MAX_1_N" < :2) will be used to eliminate Zone Maps which do not have a value we're looking for and is dynamically constructed based on the predicate(-s) we have in the query. From here ZONE_ID$ can be mapped back to ROWID ranges (a reverse operation relative to SYS_OP_ZONE_ID) in order to scan only required portions of the data on disk.

This looks to be a very exciting feature and I can't help but think that it's a little bit disappointing that it's limited to Exadata storage only as it could have been very useful on other systems due to lack of storage indexes support.