Text similarity measures in PostgreSQL
pg_similarity
extension provides a collection of similarity measures for comparing text strings in PostgreSQL databases.
It includes a comprehensive collection of search algorithms, which you can find listed toward the end of this document.
Your Nile database arrives with this extension already enabled.
pg_similarity
extension to find similar products in our database.
threshold
: The threshold for the similarity measure.is_normalized
: Whether the similarity measure is normalized (between 0 and 1) or not.tokenizer
: The tokenizer to use for the similarity measure ( Default is alnum, and other options are gram, word, and camelcase). Note that not every algorithm supports the tokenizer option.SET pg_similarity.<algorithm_name>_<config_name> = value
.
For example, to use the cosine
similarity measure with a threshold of 0.3, you can use:
Algorithm | Function | Operator | ||
---|---|---|---|---|
Block | block(text, text) | ~++ | ||
Cosine | cosine(text, text) | ~## | ||
Dice | dice(text, text) | ~-~ | ||
Euclidean | euclidean(text, text) | ~!! | ||
Hamming | hamming(text, text) | |||
Jaccard | jaccard(text, text) | ~?? | ||
Jaro | jaro(text, text) | ~%% | ||
Jaro-Winkler | jarowinkler(text, text) | ~@@ | ||
Levenshtein | lev(text, text) | ~== | ||
Matching | matching(text, text) | ~^^ | ||
Monge-Elkan | mongeelkan(text, text) | ~ | ||
Needleman-Wunsch | needlemanwunch(text, text) | |||
Overlap | overlap(text, text) | ~** | ||
Q-Gram | qgram(text, text) | ~~~ | ||
Smith-Waterman | smithwaterman(text, text) | |||
Smith-Waterman-Gotoh | smithwatermangotoh(text, text) | |||
Soundex | soundex(text, text) | ~*~ |
jaro
is better for name matching, while levenshtein
is better for typo detection.