8i | 9i | 10g | 11g | 12c | 13c | 18c | 19c | 21c | 23ai | Misc | PL/SQL | SQL | RAC | WebLogic | Linux

Attribute Clustering in Oracle Database 12c Release 1

Attribute clustering groups specific attributes together so they are in close physical proximity, making the processing of related attributes more efficient.

Setup
The Problem
The Solution: Attribute Clustering
Attribute Clustering Basic Syntax
Views
Considerations

Setup

We create a customers, products and sales table.

-- drop table sales purge;
-- drop table products purge;
-- drop table customers purge;

create table customers (
  customer_id  number,
  name         varchar2(50),
  constraint customers_pk PRIMARY KEY (customer_id)
);


create table products (
  product_id  number,
  name        varchar2(50),
  constraint products_pk PRIMARY KEY (product_id)
);


create table sales (
  id              number generated always as identity,
  customer_id     number,
  product_id      number,
  quantity        number,
  created_date    date,
  padding_string  varchar2(4000),
  constraint sales_pk primary key (id),
  constraint sales_customers_fk foreign key (customer_id) references customers(customer_id),
  constraint sales_products_fk foreign key (product_id) references products(product_id)
);

We create 100 customers and 10 products.

insert into customers
select level,
       'Customer ' || level
from   dual
connect by level <= 100;

commit;

insert into products
select level,
       'Product ' || level
from   dual
connect by level <= 10;

commit;

We populate the SALES table with 100,000 rows, each of which has a random CUSTOMER_ID value between 1 and 100, a random PRODUCT_ID value between 1 and 10 and a random QUANTITY between 1 and 50. The padding is used for this test to increase the size of the rows, and therefore the number of blocks required to store them.

insert into sales (customer_id, product_id, quantity, created_date, padding_string)
select trunc(dbms_random.value(1, 101)),
       trunc(dbms_random.value(1, 11)),
       trunc(dbms_random.value(1, 51)),
       sysdate,
       dbms_random.string('x',100)
from   dual
connect by level <= 100000;

commit;

Looking at a sample of 20 rows we see the CUSTOMER_ID and PRODUCT_ID values are randomly distributed throughout the blocks.

set autotrace off

select customer_id, product_id
from   sales
where  rownum <= 20;

CUSTOMER_ID PRODUCT_ID
----------- ----------
         62          4
         65          2
         56          1
         65          1
         96          3
         99          2
         37          3
         27          1
         50          5
         75          4
         90          2

CUSTOMER_ID PRODUCT_ID
----------- ----------
         64          4
         15          4
         89          5
         72          7
         93         10
         54          6
         74          6
          9          7
         76          4

20 rows selected.

SQL>

We'll keep the SALES table as our starting point, and use it as the data source for our tests.

The Problem

We create a table called SALES2 as a copy of the SALES table.

-- drop table sales2 purge;

create table sales2 as
select * from sales;

If we pull back all the records for a single customer, we can see it's taking 1864 consistent gets to return the data.

set autotrace trace stat

select /* attempt 1 */ * from sales2 where customer_id = 1;

998 rows selected.


Statistics
----------------------------------------------------------
         17  recursive calls
         31  db block gets
       1864  consistent gets
       1786  physical reads
       6212  redo size
     130591  bytes sent via SQL*Net to client
        778  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        998  rows processed

SQL>

We add an index on the CUSTOMER_ID column and gather fresh statistics. We also display the clustering factor of the new index.

create index sales2_customer_id_idx on sales2 (customer_id);
exec dbms_stats.gather_table_stats(null, 'sales2');

set autotrace off

select clustering_factor
from   user_indexes
where  index_name = 'SALES2_CUSTOMER_ID_IDX';

CLUSTERING_FACTOR
-----------------
            76808

SQL>

We run the query again, but we don't see a dramatic improvement in the number of consistent gets, because we have not used the index.

set autotrace trace stat

select /* attempt 2 */ * from sales2 where customer_id = 1;

998 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1857  consistent gets
          0  physical reads
          0  redo size
     130591  bytes sent via SQL*Net to client
        778  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        998  rows processed

SQL>

We force the query to use the new index, which drops our consistent gets down to 860.

set autotrace trace stat

select /*+ index(sales2) */ * from sales2 where customer_id = 1;

998 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        860  consistent gets
          0  physical reads
          0  redo size
     142053  bytes sent via SQL*Net to client
        778  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        998  rows processed

SQL>

So why was the index not used until we forced it?

The data for customer "1" is spread throughout the table, and the database determined it was still going to have to hit a lot of blocks to get it all, whether using the index or not, so it chose not to use it. This is a classic case of the index having a high cluster factor, so the optimizer has chosen not to use it in this scenario.

The Solution: Attribute Clustering

We add linear clustering to the table, based on the CUSTOMER_ID column, and rebuild the table using an online operation.

alter table sales2 add clustering by linear order (customer_id);
alter table sales2 move online;

The data has now been rearranged into ascending order based on the CUSTOMER_ID column. We can see this by checking a sample of the data.

set autotrace off

select customer_id, product_id
from   sales2
where  rownum <= 20;

CUSTOMER_ID PRODUCT_ID
----------- ----------
          1          9
          1          8
          1          6
          1          1
          1          3
          1          4
          1          8
          1          7
          1          5
          1         10
          1          2

CUSTOMER_ID PRODUCT_ID
----------- ----------
          1          8
          1          5
          1          9
          1          5
          1         10
          1          6
          1          3
          1          4
          1          3

20 rows selected.

SQL>

The online move of the table maintains the indexes. When we check the clustering factor of the index, we see it has reduced drastically.

set autotrace off

select clustering_factor
from   user_indexes
where  index_name = 'SALES2_CUSTOMER_ID_IDX';

CLUSTERING_FACTOR
-----------------
             1786

SQL>

We try the original query again, without the index hint.

set autotrace trace stat

select /* attempt 3 */ * from sales2 where customer_id = 1;

998 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        154  consistent gets
          4  physical reads
          0  redo size
     142053  bytes sent via SQL*Net to client
        778  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        998  rows processed

SQL>

There are two things of note here.

The optimizer has chosen to use the index without being forced to.
When using the index, the consistent gets have dropped from 860 to 154.

Now the data in the table is ordered in the same order as the index leaf blocks, the index clustering factor is much lower, making it a more interesting index to the optimizer. This is why the optimizer chose to use the index without being forced. What's more, the fact rows for the same customer are clustered together in the same blocks mean less total blocks have to be retrieved to get the data. This is why the total number of consistent gets dropped drastically.

Attribute Clustering Basic Syntax

Attribute clustering can be defined during table creation, including during a CTAS operation.

drop table sales3 purge;

create table sales3 (
  id              number,
  customer_id     number,
  product_id      number,
  quantity        number,
  created_date    date,
  padding_string  varchar2(4000),
  constraint sales3_pk primary key (id),
  constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id),
  constraint sales3_products_fk foreign key (product_id) references products(product_id)
)
clustering by linear order (customer_id);


drop table sales3 purge;

create table sales3
clustering by linear order (customer_id) as
select * from sales;

By default, attribute clustering is enabled for direct-path inserts (YES ON LOAD) and data movement operations (YES ON DATA MOVEMENT). The data movement operations include MOVE, MERGE, SPLIT, COALESCE and online table redefinitions. The attribute clustering defaults can be modified at creation time.

drop table sales3 purge;

-- Limit attribute clustering to data movement operations.
create table sales3 (
  id              number,
  customer_id     number,
  product_id      number,
  quantity        number,
  created_date    date,
  padding_string  varchar2(4000),
  constraint sales3_pk primary key (id),
  constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id),
  constraint sales3_products_fk foreign key (product_id) references products(product_id)
)
clustering by linear order (customer_id)
no on load yes on data movement;


drop table sales3 purge;

-- Specify the attribute clustering defaults explicitly.
create table sales3 (
  id              number,
  customer_id     number,
  product_id      number,
  quantity        number,
  created_date    date,
  padding_string  varchar2(4000),
  constraint sales3_pk primary key (id),
  constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id),
  constraint sales3_products_fk foreign key (product_id) references products(product_id)
)
clustering by linear order (customer_id)
yes on load yes on data movement;

The attribute clustering settings of an existing table can be modified using the ALTER TABLE command. The existing data is untouched.

-- Drop clustering.
alter table sales3 drop clustering;

-- Add clustering.
alter table sales3 add clustering by linear order (customer_id);

-- Modify ON LOAD and ON DATA MOVEMENT settings.
alter table sales3 modify clustering yes on load no on data movement;

alter table sales3 modify clustering yes on load yes on data movement;

The attribute clustering settings of load operations can be overridden using the CLUSTERING and NO_CLUSTERING hints.

insert /*+ append no_clustering */ into sales3 select * from sales;
commit;

It's possible to do join attribute clustering. Data from the joined tables is not stored in the clustered table, but the cluster order is determined by the join columns data in joined tables. This is meant for star schemas and fact/dimension tables.

drop table sales4 purge;

create table sales4 (
  id              number,
  customer_id     number,
  product_id      number,
  quantity        number,
  created_date    date,
  padding_string  varchar2(4000),
  constraint sales4_pk primary key (id),
  constraint sales4_customers_fk foreign key (customer_id) references customers(customer_id),
  constraint sales4_products_fk foreign key (product_id) references products(product_id)
)
clustering
  sales4 join products on (sales4.product_id = products.product_id)
  by linear order (customer_id, products.name);

Views

The CLUSTERING column of the {CDB|DBA|ALL|USER}_TABLES views shows us the clustering status of a table.

set autotrace off
column table_name format a30
column clustering format a10

select table_name,
       clustering
from   user_tables
order by table_name;

TABLE_NAME                     CLUSTERING
------------------------------ ----------
CUSTOMERS                      NO
PRODUCTS                       NO
SALES                          NO
SALES2                         YES
SALES3                         YES
SALES4                         YES

6 rows selected.

SQL>

The following views provide more information about attribute clustering.

{CDB|DBA|ALL|USER}_CLUSTERING_TABLES : Information about the attribute clustering settings for the table.
{CDB|DBA|ALL|USER}_CLUSTERING_KEYS : Information about the columns used for attribute clustering.
{CDB|DBA|ALL|USER}_CLUSTERING_DIMENTIONS : Information about the dimension tables by which a fact table is clustered.
{CDB|DBA|ALL|USER}_CLUSTERING_JOINS : Information about the joins of the fact table and dimension tables.

Here are some example queries.

set linesize 200
column owner format a10
column table_name format a10
column clustering_type format a25
column on_load format a7
column on_datamovement format a15
column valid format a5
column with_zonemap format a12
column last_load_clst format a30
column last_datamove_clst format a30

select owner,
       table_name,
       clustering_type,
       on_load,
       on_datamovement,
       valid,
       with_zonemap,
       last_load_clst,
       last_datamove_clst
from   user_clustering_tables
order by owner, table_name;

OWNER      TABLE_NAME CLUSTERING_TYPE           ON_LOAD ON_DATAMOVEMENT VALID WITH_ZONEMAP LAST_LOAD_CLST                 LAST_DATAMOVE_CLST
---------- ---------- ------------------------- ------- --------------- ----- ------------ ------------------------------ ------------------------------
TESTUSER1  SALES2     LINEAR                    YES     YES             YES   NO                                          24-DEC-20 08.37.43.591982 PM
TESTUSER1  SALES3     LINEAR                    YES     YES             YES   NO
TESTUSER1  SALES4     LINEAR                    YES     YES             YES   NO

SQL>


column owner format a10
column table_name format a10
column detail_owner format a13
column detail_name format a11
column detail_column format a14

select owner,
       table_name,
       detail_owner,
       detail_name,
       detail_column,
       position,
       groupid
from   user_clustering_keys
order by owner, table_name;

OWNER      TABLE_NAME DETAIL_OWNER  DETAIL_NAME DETAIL_COLUMN    POSITION    GROUPID
---------- ---------- ------------- ----------- -------------- ---------- ----------
TESTUSER1  SALES2     TESTUSER1     SALES2      CUSTOMER_ID             1          0
TESTUSER1  SALES3     TESTUSER1     SALES3      CUSTOMER_ID             1          0
TESTUSER1  SALES4     TESTUSER1     SALES4      CUSTOMER_ID             1          0
TESTUSER1  SALES4     TESTUSER1     PRODUCTS    NAME                    2          0

SQL>


column table_name form a10
column dimension_owner form a15
column dimension_name form a14

select table_name,
       dimension_owner,
       dimension_name
from   user_clustering_dimensions
order by table_name;

TABLE_NAME DIMENSION_OWNER DIMENSION_NAME
---------- --------------- --------------
SALES4     TESTUSER1       PRODUCTS

SQL>


column table_name form a10
column tab1_owner form a10
column tab1_name form a10
column tab1_column form a11
column tab2_owner form a10
column tab2_name form a10
column tab2_column form a11

select table_name,
       tab1_owner,
       tab1_name,
       tab1_column,
       tab2_owner,
       tab2_name,
       tab2_column
from   user_clustering_joins
order by table_name;

TABLE_NAME TAB1_OWNER TAB1_NAME  TAB1_COLUMN TAB2_OWNER TAB2_NAME  TAB2_COLUMN
---------- ---------- ---------- ----------- ---------- ---------- -----------
SALES4     TESTUSER1  SALES4     PRODUCT_ID  TESTUSER1  PRODUCTS   PRODUCT_ID

SQL>

Considerations

There are some things to consider when using attribute clustering.

Attribute clustering is an Enterprise Edition feature.
Zone maps can only be used as part of attribute clustering if you are on the Exadata platform.
Attribute clustering works on direct path loads and data movement operations, so it is a more natural fit for data warehouses.
It's possible to do join attribute clustering. Data from the joined tables is not stored in the clustered table, but the cluster order is determined by the join columns in data in joined tables. This is meant for star schemas and fact/dimension tables.
This article ignores CLUSTERING ... BY INTERLEAVED ORDER as it is for multidimensional clustering on multiple columns, and I don't really understand it.

For more information see:

Hope this helps. Regards Tim...

Back to the Top.