Phonetic, text normalization and address matching functions for record linkage.
Maintainer(s):
RobinL
Installing and Loading
INSTALL splink_udfs FROM community;
LOAD splink_udfs;
Example
LOAD splink_udfs;
SELECT soundex(unaccent('Jürgen')); -- returns 'J625'
About splink_udfs
The splink_udfs extension provides functions for data cleaning and phonetic matching.
Includes soundex(str), strip_diacritics(str), unaccent(str), ngrams(list,n), double_metaphone(str) and faster versions of levenshtein and damerau_levenshtein.
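As a rough illustration of how the cleaning and phonetic functions can be combined for record linkage, the sketch below normalizes a few spelling variants of one name and computes a phonetic code for each. The inline `people` data is hypothetical. The only functions used are `unaccent` and `soundex`, called the same way as in the example above.

```sql
LOAD splink_udfs;

-- Hypothetical inline data: spelling variants of the same name.
WITH people(name) AS (
    VALUES ('Jürgen'), ('Jurgen'), ('Jorgen')
)
SELECT
    name,
    unaccent(name)          AS name_ascii,   -- diacritics removed before encoding
    soundex(unaccent(name)) AS name_soundex  -- phonetic code used as a match key
FROM people;
```

Under standard Soundex rules all three variants should share the code 'J625', so the phonetic column can serve as a blocking or comparison key even when the raw strings differ.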
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
build_suffix_trie | aggregate | NULL | NULL | |
double_metaphone | scalar | NULL | NULL | |
find_address | scalar | NULL | NULL | |
ngrams | scalar | NULL | NULL | |
soundex | scalar | NULL | NULL | |
strip_diacritics | scalar | NULL | NULL | |
unaccent | scalar | NULL | NULL | |
Overloaded Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
damerau_levenshtein | scalar | Extension of Levenshtein distance to also include transposition of adjacent characters as an allowed edit operation. In other words, the minimum number of edit operations (insertions, deletions, substitutions or transpositions) required to change one string to another. Characters of different cases (e.g., a and A) are considered different. | NULL | [damerau_levenshtein('duckdb', 'udckbd')] |
levenshtein | scalar | The minimum number of single-character edits (insertions, deletions or substitutions) required to change one string to the other. Characters of different cases (e.g., a and A) are considered different. | NULL | [levenshtein('duck', 'db')] |
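
The two table examples can be run directly to see how the distances differ. This is a small sketch that assumes the overloaded versions return the same values as the definitions above describe. The third query is an added illustration of the case-sensitivity note.

```sql
LOAD splink_udfs;

-- 3 edits: substitute u with b, delete c, delete k
SELECT levenshtein('duck', 'db');

-- 2 edits: transpose 'du' to 'ud', then transpose 'db' to 'bd'
SELECT damerau_levenshtein('duckdb', 'udckbd');

-- 1 edit: 'a' and 'A' are treated as different characters
SELECT levenshtein('a', 'A');
```

Because plain Levenshtein has no transposition operation, a swapped pair of adjacent characters costs it more than one edit. That makes damerau_levenshtein the more forgiving choice for typing errors.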