8i | 9i | 10g | 11g | 12c | 13c | 18c | 19c | 21c | 23c | Misc | PL/SQL | SQL | RAC | WebLogic | Linux

Home » Articles » 23c » Here

FUZZY_MATCH and PHONIC_ENCODE Data Quality Operators in Oracle Database 23c

Oracle database 23c introduced the FUZZY_MATCH and PHONIC_ENCODE data quality operators to perform fuzzy string matching.

The UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented (and therefore supported) in Oracle 11g Release 2. It contains a variety of functions that are helpful for testing the level of similarity/difference between strings. The FUZZY_MATCH and PHONIC_ENCODE operators extend the fuzzy string matching functionality of the database.

Related articles.

Setup

The examples in this article require the following test table.

drop table if exists match_tab;

create table match_tab (
  id    number,
  col1  varchar2(15),
  col2  varchar2(15),
  constraint match_tab_pk primary  key (id)
);

insert into match_tab values
  (1, 'Peter Parker', 'Pete Parker'),
  (2, 'Peter Parker', 'peter parker'),
  (3, 'Clark Kent', 'Claire Kent'),
  (4, 'Wonder Woman', 'Ponder Woman'),
  (5, 'Superman', 'Superman'),
  (6, 'The Hulk', 'Iron Man');
commit;

Notice the COL1 and COL2 values have varying degrees of similarity including exact matches, near matches and completely different values.

FUZZY_MATCH

The FUZZY_MATCH operator is language-neutral. It determines the similarity between two strings and supports several algorithms listed here.

set linesize 100
column col1 format a12
column col2 format a12

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2) as jaro_winkler,
       fuzzy_match(bigram, col1, col2) as bigram,
       fuzzy_match(trigram, col1, col2) as trigram,
       fuzzy_match(whole_word_match, col1, col2) as wwm,
       fuzzy_match(longest_common_substring, col1, col2) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker           92           92         90         70         50         58
Peter Parker peter parker          84           88         72         60          0         41
Clark Kent   Claire Kent           82           90         60         44         50         45
Wonder Woman Ponder Woman          92           94        100         90         50         91
Superman     Superman             100          100        100        100        100        100
The Hulk     Iron Man               0           41          0          0          0         12

6 rows selected.

SQL>

By default the output is a percentage similarity, but the UNSCALED keyword can be added to return the raw value.

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2, unscaled) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2, unscaled) as jaro_winkler,
       fuzzy_match(bigram, col1, col2, unscaled) as bigram,
       fuzzy_match(trigram, col1, col2, unscaled) as trigram,
       fuzzy_match(whole_word_match, col1, col2, unscaled) as wwm,
       fuzzy_match(longest_common_substring, col1, col2, unscaled) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker            1          .92         10          7          1          7
Peter Parker peter parker           2          .88          8          6          0          5
Clark Kent   Claire Kent            2           .9          6          4          1          5
Wonder Woman Ponder Woman           1          .94         11          9          1         11
Superman     Superman               0            1          7          6          1          8
The Hulk     Iron Man               8          .41          0          0          0          1

6 rows selected.

SQL>

If you have used the UTL_MATCH package, the scaled verses unscaled values represent the following calls.

Algorithm Scaled Unscaled
LEVENSHTEIN UTL_MATCH.EDIT_DISTANCE_SIMILARITY UTL_MATCH.EDIT_DISTANCE
JARO_WINKLER UTL_MATCH.JARO_WINKLER_SIMILARITY UTL_MATCH.JARO_WINKLER

By default the output is scaled by the length of the longer input string. The RELATE_TO_SHORTER keyword alters this to scale to the shorter input string.

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2, relate_to_shorter) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2, relate_to_shorter) as jaro_winkler,
       fuzzy_match(bigram, col1, col2, relate_to_shorter) as bigram,
       fuzzy_match(trigram, col1, col2, relate_to_shorter) as trigram,
       fuzzy_match(whole_word_match, col1, col2, relate_to_shorter) as wwm,
       fuzzy_match(longest_common_substring, col1, col2, relate_to_shorter) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker           91           92        100         77         50         63
Peter Parker peter parker          84           88         72         60          0         41
Clark Kent   Claire Kent           80           90         66         50         50         50
Wonder Woman Ponder Woman          92           94        100         90         50         91
Superman     Superman             100          100        100        100        100        100
The Hulk     Iron Man               0           41          0          0          0         12

6 rows selected.

SQL>

The EDIT_TOLERANCE keyword can be used with the WHOLE_WORD_MATCH algorithm. The tolerance is a percentage of characters in a word that can be different, while still considering it to be the same word.

select col1,
       col2,
       fuzzy_match(whole_word_match, col1, col2) as wwm,
       fuzzy_match(whole_word_match, col1, col2, edit_tolerance 20) as wwm20,
       fuzzy_match(whole_word_match, col1, col2, edit_tolerance 82) as wwm82
from   match_tab;

COL1         COL2                WWM      WWM20      WWM82
------------ ------------ ---------- ---------- ----------
Peter Parker Pete Parker          50        100         50
Peter Parker peter parker          0        100         50
Clark Kent   Claire Kent          50        100         50
Wonder Woman Ponder Woman         50        100        100
Superman     Superman            100        100        100
The Hulk     Iron Man              0          0          0

6 rows selected.

SQL>

PHONIC_ENCODE

The PHONIC_ENCODE operator converts text into language-specific codes based on the pronunciation of the text. It implements the Double Metaphone algorithm and an alternative algorithm.

set linesize 100
column col1 format a12
column col2 format a12
column col1_dm format a8
column col2_dm format a8
column col1_dma format a8
column col2_dma format a8

select col1,
       col2,
       phonic_encode(double_metaphone, col1) as col1_dm,
       phonic_encode(double_metaphone, col2) as col2_dm,
       phonic_encode(double_metaphone_alt, col1) as col1_dma,
       phonic_encode(double_metaphone_alt, col2) as col2_dma
from   match_tab;

COL1         COL2         COL1_DM  COL2_DM  COL1_DMA COL2_DMA
------------ ------------ -------- -------- -------- --------
Peter Parker Pete Parker  PTRP     PTPR     PTRP     PTPR
Peter Parker peter parker PTRP     PTRP     PTRP     PTRP
Clark Kent   Claire Kent  KLRK     KLRK     KLRK     KLRK
Wonder Woman Ponder Woman ANTR     PNTR     FNTR     PNTR
Superman     Superman     SPRM     SPRM     SPRM     SPRM
The Hulk     Iron Man     0LK      ARNM     TLK      ARNM

6 rows selected.

SQL>

When using DOUBLE_METAPHONE_ALT, if there is no alternative code the primary code is returned.

The maximum length of the code is controlled by an optional third parameter, which accepts integer values from 1 to 12.

column col1_dm1 format a9
column col2_dm1 format a9
column col1_dm6 format a9
column col2_dm6 format a9
column col1_dm12 format a9
column col2_dm12 format a9

select col1,
       col2,
       phonic_encode(double_metaphone, col1, 1) as col1_dm1,
       phonic_encode(double_metaphone, col2, 1) as col2_dm1,
       phonic_encode(double_metaphone, col1, 6) as col1_dm6,
       phonic_encode(double_metaphone, col2, 6) as col2_dm6,
       phonic_encode(double_metaphone, col1, 12) as col1_dm12,
       phonic_encode(double_metaphone, col2, 12) as col2_dm12
from   match_tab;

COL1         COL2         COL1_DM1  COL2_DM1  COL1_DM6  COL2_DM6  COL1_DM12 COL2_DM12
------------ ------------ --------- --------- --------- --------- --------- ---------
Peter Parker Pete Parker  P         P         PTRPRK    PTPRKR    PTRPRKR   PTPRKR
Peter Parker peter parker P         P         PTRPRK    PTRPRK    PTRPRKR   PTRPRKR
Clark Kent   Claire Kent  K         K         KLRKKN    KLRKNT    KLRKKNT   KLRKNT
Wonder Woman Ponder Woman A         P         ANTRMN    PNTRMN    ANTRMN    PNTRMN
Superman     Superman     S         S         SPRMN     SPRMN     SPRMN     SPRMN
The Hulk     Iron Man     0         A         0LK       ARNMN     0LK       ARNMN

6 rows selected.

SQL>

PL/SQL Support

In this release there is no direct support for the FUZZY_MATCH or PHONIC_ENCODE operators in PL/SQL, so a direct assignment is not possible.

declare
  l_output  number;
begin
  l_output := fuzzy_match(levenshtein, 'Peter Parker', 'peter parker');
end;
/
              *
ERROR at line 4:
ORA-06550: line 4, column 15:
PLS-00201: identifier 'FUZZY_MATCH' must be declared
ORA-06550: line 4, column 3:
PL/SQL: Statement ignored

SQL>


declare
  l_output  varchar2(10);
begin
  l_output := phonic_encode(double_metaphone, 'Peter Parker');
end;
/
              *
ERROR at line 4:
ORA-06550: line 4, column 15:
PLS-00201: identifier 'PHONIC_ENCODE' must be declared
ORA-06550: line 4, column 3:
PL/SQL: Statement ignored

SQL>

We can use a SELECT ... INTO to make the assignment.

declare
  l_output  number;
begin
  select fuzzy_match(levenshtein, 'Peter Parker', 'peter parker')
  into   l_output;
end;
/

PL/SQL procedure successfully completed.

SQL>


declare
  l_output  varchar2(10);
begin
  select phonic_encode(double_metaphone, 'Peter Parker')
  into   l_output;
end;
/
PL/SQL procedure successfully completed.

SQL>

For more information see:

Hope this helps. Regards Tim...

Back to the Top.