APPROX_COUNT_DISTINCT : Quick Distinct Count in Oracle Database 12cR1 (22.214.171.124)
APPROX_COUNT_DISTINCT function was added, but not documented, in Oracle 11g to improve the speed of calculating the number of distinct values (NDV) when gathering statistics using the
DBMS_STATS package. Oracle database 12c (126.96.36.199) now includes the function in the documentation, so we are free to use it in our applications as a supported SQL function.
In previous database versions, if we wanted to perform a count of distinct values, we would probably have done something like the following.
SELECT COUNT(DISTINCT object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 47171 1 row selected. SQL>
This query gives us the exact count of the distinct values based on Oracle's read-consistency model. As such, we see all the committed data, along with any uncommitted changes made by the current session.
In contrast, the new
APPROX_COUNT_DISTINCT function does not provide exact results, but should give "negligible deviation from the exact result" according to the documentation.
SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 46788 1 row selected. SQL>
The function can also be used as part of a group query.
SELECT tablespace_name, APPROX_COUNT_DISTINCT(table_name) AS tab_count FROM all_tables GROUP BY tablespace_name ORDER BY tablespace_name; TABLESPACE_NAME TAB_COUNT ------------------------------ ---------- SYSAUX 73 SYSTEM 36 USERS 7 44 4 rows selected. SQL>
Even in the following simple example we can see a repeatable difference in the speed of the two approaches, but the performance difference does not seem too dramatic.
SET TIMING ON SELECT COUNT(DISTINCT object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 47171 1 row selected. Elapsed: 00:00:02.39 SQL> SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 46788 1 row selected. Elapsed: 00:00:02.00 SQL>
In reality, the
APPROX_COUNT_DISTINCT function is designed to process much bigger workloads than this, so we can create a bigger table to test.
DROP TABLE t1 PURGE; CREATE TABLE t1 AS SELECT level AS data FROM dual CONNECT BY level <= 10000; INSERT /*+ APPEND */ INTO t1 SELECT a.data FROM t1 a CROSS JOIN t1 b; COMMIT; EXEC DBMS_STATS.gather_table_stats(USER,'T1');
Now we have a table with a little over 100 million rows and 10,000 distinct values. Now we can see the difference in performance of the two methods is more significant.
SET TIMING ON SELECT COUNT(DISTINCT data) AS data_count FROM t1; DATA_COUNT ---------- 10000 1 row selected. Elapsed: 00:00:19.66 SQL> SELECT APPROX_COUNT_DISTINCT(data) AS data_count FROM t1; DATA_COUNT ---------- 10050 1 row selected. Elapsed: 00:00:10.46 SQL>
Christian Antognini has an interesting blog post about the performance benefits of the
APPROX_COUNT_DISTINCT function here, where he shows the elapsed time and memory usage of the
APPROX_COUNT_DISTINCT function remains consistent regardless of the number of distinct values in the data set. In contrast, as the number of distinct values increase, the elapsed time and memory usage of the "
COUNT(DISTINCT expr)" method increase drastically.
For more information see:
- The APPROX_COUNT_DISTINCT Function – A Test Case
- APPROX_COUNT_DISTINCT Function
- Approximate Query Processing in Oracle Database 12c Release 2 (12.2)
Hope this helps. Regards Tim...