8i | 9i | 10g | 11g | 12c | 13c | 18c | 19c | 21c | 23c | Misc | PL/SQL | SQL | RAC | WebLogic | Linux
Attribute Clustering in Oracle Database 12c Release 1
Attribute clustering groups specific attributes together so they are in close physical proximity, making the processing of related attributes more efficient.
- Setup
- The Problem
- The Solution: Attribute Clustering
- Attribute Clustering Basic Syntax
- Views
- Considerations
Related articles.
Setup
We create a customers, products and sales table.
-- drop table sales purge; -- drop table products purge; -- drop table customers purge; create table customers ( customer_id number, name varchar2(50), constraint customers_pk PRIMARY KEY (customer_id) ); create table products ( product_id number, name varchar2(50), constraint products_pk PRIMARY KEY (product_id) ); create table sales ( id number generated always as identity, customer_id number, product_id number, quantity number, created_date date, padding_string varchar2(4000), constraint sales_pk primary key (id), constraint sales_customers_fk foreign key (customer_id) references customers(customer_id), constraint sales_products_fk foreign key (product_id) references products(product_id) );
We create 100 customers and 10 products.
insert into customers select level, 'Customer ' || level from dual connect by level <= 100; commit; insert into products select level, 'Product ' || level from dual connect by level <= 10; commit;
We populate the SALES
table with 100,000 rows, each of which has a random CUSTOMER_ID
value between 1 and 100, a random PRODUCT_ID
value between 1 and 10 and a random QUANTITY
between 1 and 50. The padding is used for this test to increase the size of the rows, and therefore the number of blocks required to store them.
insert into sales (customer_id, product_id, quantity, created_date, padding_string) select trunc(dbms_random.value(1, 101)), trunc(dbms_random.value(1, 11)), trunc(dbms_random.value(1, 51)), sysdate, dbms_random.string('x',100) from dual connect by level <= 100000; commit;
Looking at a sample of 20 rows we see the CUSTOMER_ID
and PRODUCT_ID
values are randomly distributed throughout the blocks.
set autotrace off select customer_id, product_id from sales where rownum <= 20; CUSTOMER_ID PRODUCT_ID ----------- ---------- 62 4 65 2 56 1 65 1 96 3 99 2 37 3 27 1 50 5 75 4 90 2 CUSTOMER_ID PRODUCT_ID ----------- ---------- 64 4 15 4 89 5 72 7 93 10 54 6 74 6 9 7 76 4 20 rows selected. SQL>
We'll keep the SALES
table as our starting point, and use it as the data source for our tests.
The Problem
We create a table called SALES2
as a copy of the SALES
table.
-- drop table sales2 purge; create table sales2 as select * from sales;
If we pull back all the records for a single customer, we can see it's taking 1864 consistent gets to return the data.
set autotrace trace stat select /* attempt 1 */ * from sales2 where customer_id = 1; 998 rows selected. Statistics ---------------------------------------------------------- 17 recursive calls 31 db block gets 1864 consistent gets 1786 physical reads 6212 redo size 130591 bytes sent via SQL*Net to client 778 bytes received via SQL*Net from client 68 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 998 rows processed SQL>
We add an index on the CUSTOMER_ID
column and gather fresh statistics. We also display the clustering factor of the new index.
create index sales2_customer_id_idx on sales2 (customer_id); exec dbms_stats.gather_table_stats(null, 'sales2'); set autotrace off select clustering_factor from user_indexes where index_name = 'SALES2_CUSTOMER_ID_IDX'; CLUSTERING_FACTOR ----------------- 76808 SQL>
We run the query again, but we don't see a dramatic improvement in the number of consistent gets, because we have not used the index.
set autotrace trace stat select /* attempt 2 */ * from sales2 where customer_id = 1; 998 rows selected. Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 1857 consistent gets 0 physical reads 0 redo size 130591 bytes sent via SQL*Net to client 778 bytes received via SQL*Net from client 68 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 998 rows processed SQL>
We force the query to use the new index, which drops our consistent gets down to 860.
set autotrace trace stat select /*+ index(sales2) */ * from sales2 where customer_id = 1; 998 rows selected. Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 860 consistent gets 0 physical reads 0 redo size 142053 bytes sent via SQL*Net to client 778 bytes received via SQL*Net from client 68 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 998 rows processed SQL>
So why was the index not used until we forced it?
The data for customer "1" is spread throughout the table, and the database determined it was still going to have to hit a lot of blocks to get it all, whether using the index or not, so it chose not to use it. This is a classic case of the index having a high cluster factor, so the optimizer has chosen not to use it in this scenario.
The Solution: Attribute Clustering
We add linear clustering to the table, based on the CUSTOMER_ID
column, and rebuild the table using an online operation.
alter table sales2 add clustering by linear order (customer_id); alter table sales2 move online;
The data has now been rearranged into ascending order based on the CUSTOMER_ID
column. We can see this by checking a sample of the data.
set autotrace off select customer_id, product_id from sales2 where rownum <= 20; CUSTOMER_ID PRODUCT_ID ----------- ---------- 1 9 1 8 1 6 1 1 1 3 1 4 1 8 1 7 1 5 1 10 1 2 CUSTOMER_ID PRODUCT_ID ----------- ---------- 1 8 1 5 1 9 1 5 1 10 1 6 1 3 1 4 1 3 20 rows selected. SQL>
The online move of the table maintains the indexes. When we check the clustering factor of the index, we see it has reduced drastically.
set autotrace off select clustering_factor from user_indexes where index_name = 'SALES2_CUSTOMER_ID_IDX'; CLUSTERING_FACTOR ----------------- 1786 SQL>
We try the original query again, without the index hint.
set autotrace trace stat select /* attempt 3 */ * from sales2 where customer_id = 1; 998 rows selected. Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 154 consistent gets 4 physical reads 0 redo size 142053 bytes sent via SQL*Net to client 778 bytes received via SQL*Net from client 68 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 998 rows processed SQL>
There are two things of note here.
- The optimizer has chosen to use the index without being forced to.
- When using the index, the consistent gets have dropped from 860 to 154.
Now the data in the table is ordered in the same order as the index leaf blocks, the index clustering factor is much lower, making it a more interesting index to the optimizer. This is why the optimizer chose to use the index without being forced. What's more, the fact rows for the same customer are clustered together in the same blocks mean less total blocks have to be retrieved to get the data. This is why the total number of consistent gets dropped drastically.
Attribute Clustering Basic Syntax
Attribute clustering can be defined during table creation, including during a CTAS operation.
drop table sales3 purge; create table sales3 ( id number, customer_id number, product_id number, quantity number, created_date date, padding_string varchar2(4000), constraint sales3_pk primary key (id), constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id), constraint sales3_products_fk foreign key (product_id) references products(product_id) ) clustering by linear order (customer_id); drop table sales3 purge; create table sales3 clustering by linear order (customer_id) as select * from sales;
By default, attribute clustering is enabled for direct-path inserts (YES ON LOAD
) and data movement operations (YES ON DATA MOVEMENT
). The data movement operations include MOVE
, MERGE
, SPLIT
, COALESCE
and online table redefinitions. The attribute clustering defaults can be modified at creation time.
drop table sales3 purge; -- Limit attribute clustering to data movement operations. create table sales3 ( id number, customer_id number, product_id number, quantity number, created_date date, padding_string varchar2(4000), constraint sales3_pk primary key (id), constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id), constraint sales3_products_fk foreign key (product_id) references products(product_id) ) clustering by linear order (customer_id) no on load yes on data movement; drop table sales3 purge; -- Specify the attribute clustering defaults explicitly. create table sales3 ( id number, customer_id number, product_id number, quantity number, created_date date, padding_string varchar2(4000), constraint sales3_pk primary key (id), constraint sales3_customers_fk foreign key (customer_id) references customers(customer_id), constraint sales3_products_fk foreign key (product_id) references products(product_id) ) clustering by linear order (customer_id) yes on load yes on data movement;
The attribute clustering settings of an existing table can be modified using the ALTER TABLE
command. The existing data is untouched.
-- Drop clustering. alter table sales3 drop clustering; -- Add clustering. alter table sales3 add clustering by linear order (customer_id); -- Modify ON LOAD and ON DATA MOVEMENT settings. alter table sales3 modify clustering yes on load no on data movement; alter table sales3 modify clustering yes on load yes on data movement;
The attribute clustering settings of load operations can be overridden using the CLUSTERING
and NO_CLUSTERING
hints.
insert /*+ append no_clustering */ into sales3 select * from sales; commit;
It's possible to do join attribute clustering. Data from the joined tables is not stored in the clustered table, but the cluster order is determined by the join columns data in joined tables. This is meant for star schemas and fact/dimension tables.
drop table sales4 purge; create table sales4 ( id number, customer_id number, product_id number, quantity number, created_date date, padding_string varchar2(4000), constraint sales4_pk primary key (id), constraint sales4_customers_fk foreign key (customer_id) references customers(customer_id), constraint sales4_products_fk foreign key (product_id) references products(product_id) ) clustering sales4 join products on (sales4.product_id = products.product_id) by linear order (customer_id, products.name);
Views
The CLUSTERING
column of the {CDB|DBA|ALL|USER}_TABLES
views shows us the clustering status of a table.
set autotrace off column table_name format a30 column clustering format a10 select table_name, clustering from user_tables order by table_name; TABLE_NAME CLUSTERING ------------------------------ ---------- CUSTOMERS NO PRODUCTS NO SALES NO SALES2 YES SALES3 YES SALES4 YES 6 rows selected. SQL>
The following views provide more information about attribute clustering.
{CDB|DBA|ALL|USER}_CLUSTERING_TABLES
: Information about the attribute clustering settings for the table.{CDB|DBA|ALL|USER}_CLUSTERING_KEYS
: Information about the columns used for attribute clustering.{CDB|DBA|ALL|USER}_CLUSTERING_DIMENTIONS
: Information about the dimension tables by which a fact table is clustered.{CDB|DBA|ALL|USER}_CLUSTERING_JOINS
: Information about the joins of the fact table and dimension tables.
Here are some example queries.
set linesize 200 column owner format a10 column table_name format a10 column clustering_type format a25 column on_load format a7 column on_datamovement format a15 column valid format a5 column with_zonemap format a12 column last_load_clst format a30 column last_datamove_clst format a30 select owner, table_name, clustering_type, on_load, on_datamovement, valid, with_zonemap, last_load_clst, last_datamove_clst from user_clustering_tables order by owner, table_name; OWNER TABLE_NAME CLUSTERING_TYPE ON_LOAD ON_DATAMOVEMENT VALID WITH_ZONEMAP LAST_LOAD_CLST LAST_DATAMOVE_CLST ---------- ---------- ------------------------- ------- --------------- ----- ------------ ------------------------------ ------------------------------ TESTUSER1 SALES2 LINEAR YES YES YES NO 24-DEC-20 08.37.43.591982 PM TESTUSER1 SALES3 LINEAR YES YES YES NO TESTUSER1 SALES4 LINEAR YES YES YES NO SQL> column owner format a10 column table_name format a10 column detail_owner format a13 column detail_name format a11 column detail_column format a14 select owner, table_name, detail_owner, detail_name, detail_column, position, groupid from user_clustering_keys order by owner, table_name; OWNER TABLE_NAME DETAIL_OWNER DETAIL_NAME DETAIL_COLUMN POSITION GROUPID ---------- ---------- ------------- ----------- -------------- ---------- ---------- TESTUSER1 SALES2 TESTUSER1 SALES2 CUSTOMER_ID 1 0 TESTUSER1 SALES3 TESTUSER1 SALES3 CUSTOMER_ID 1 0 TESTUSER1 SALES4 TESTUSER1 SALES4 CUSTOMER_ID 1 0 TESTUSER1 SALES4 TESTUSER1 PRODUCTS NAME 2 0 SQL> column table_name form a10 column dimension_owner form a15 column dimension_name form a14 select table_name, dimension_owner, dimension_name from user_clustering_dimensions order by table_name; TABLE_NAME DIMENSION_OWNER DIMENSION_NAME ---------- --------------- -------------- SALES4 TESTUSER1 PRODUCTS SQL> column table_name form a10 column tab1_owner form a10 column tab1_name form a10 column tab1_column form a11 column tab2_owner form a10 column tab2_name form a10 column tab2_column form a11 select table_name, tab1_owner, tab1_name, tab1_column, tab2_owner, tab2_name, tab2_column from user_clustering_joins order by table_name; TABLE_NAME TAB1_OWNER TAB1_NAME TAB1_COLUMN TAB2_OWNER TAB2_NAME TAB2_COLUMN ---------- ---------- ---------- ----------- ---------- ---------- ----------- SALES4 TESTUSER1 SALES4 PRODUCT_ID TESTUSER1 PRODUCTS PRODUCT_ID SQL>
Considerations
There are some things to consider when using attribute clustering.
- Attribute clustering is an Enterprise Edition feature.
- Zone maps can only be used as part of attribute clustering if you are on the Exadata platform.
- Attribute clustering works on direct path loads and data movement operations, so it is a more natural fit for data warehouses.
- It's possible to do join attribute clustering. Data from the joined tables is not stored in the clustered table, but the cluster order is determined by the join columns in data in joined tables. This is meant for star schemas and fact/dimension tables.
- This article ignores
CLUSTERING ... BY INTERLEAVED ORDER
as it is for multidimensional clustering on multiple columns, and I don't really understand it.
For more information see:
Hope this helps. Regards Tim...