Fuzzy matching (also known as approximate string matching) is a technique for finding correspondences between text segments and database entries when the match may be less than 100% exact. For example, fuzzy matching lets us discover that VANILLA, *** VANILLA ***, vanilla, vanila, and Vanilla all refer to the same database entry.
We use fuzzy matching to gather and merge data (e.g., find all the information about a customer in a database) and improve data accuracy and quality.
Fuzzy matching is not new. It became a prominent feature of translation software in the 1990s. Nowadays, we use fuzzy matching in many other cases, for example:
- Improving web search engine results. This use case is crucial when looking for names, as there can be many legitimate spelling variations for a personal name.
- Spell checkers.
- Fraud detection.
- Genome data classification.
Fuzzy matching is usually achieved with the following types of algorithms:
- String similarity algorithms, which count the number of edits (insertions, deletions, substitutions and, in some variants, transpositions) required to transform a source string into the target one, and accept a match when that count stays within a predefined limit. E.g., “hello” and “helo” match with a single edit. There are several algorithms of this kind; the most popular one is the Levenshtein distance.
- Phonetic encoding algorithms, which preprocess data to encode names based on how they sound. Searching and matching are done by converting each name to a phonetic code and comparing the codes. E.g., “* hello” and “hello” match because the two strings sound the same: the only difference is the asterisk, which doesn’t have a sound. Double Metaphone is one of the phonetic encoding algorithms.
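To make the string similarity approach concrete, here is a minimal Python sketch of the Levenshtein distance, combined with a simple cleanup step. The function names (`normalize`, `levenshtein`, `is_fuzzy_match`) and the `max_edits` threshold are assumptions made for this illustration, not part of any particular library, and a production system would likely need more careful normalization.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters, digits and spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    # Collapse runs of whitespace and trim both ends.
    return " ".join(cleaned.split())


def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn source into target."""
    # previous[j] is the distance between the source prefix processed
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Hypothetical helper: treat two strings as equivalent when, after
    normalization, they are at most max_edits edits apart."""
    return levenshtein(normalize(a), normalize(b)) <= max_edits
```

With these definitions, `is_fuzzy_match("*** VANILLA ***", "vanila")` is true, because normalization removes the asterisks and the case difference and one insertion covers the missing letter. Note that this plain Levenshtein distance counts a transposition of two adjacent characters as two edits; variants such as Damerau-Levenshtein count it as one.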
If we want to apply fuzzy matching to our data, we have two options. The first is to reuse existing implementations: the algorithms are already written in several languages and are easy to find on the Internet. We then need to integrate them into our application while avoiding pitfalls such as losing data relevance. The second option is to use one of the solutions that are already on the market. There are several of them, and we must choose the one that best fits our use case. My next post will show an example of cleaning inconsistent company names with the Trifacta data preparation solution.
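As a small illustration of reusing an existing implementation rather than writing one, Python's standard library ships `difflib.get_close_matches`. The sketch below is only one possible way to use it: the helper name `find_candidates`, the sample data and the default `cutoff` of 0.8 are assumptions chosen for this example. Also note that `difflib` scores similarity with its own sequence-matching ratio, not with the Levenshtein distance.

```python
import difflib


def find_candidates(query, known_names, cutoff=0.8, limit=3):
    """Hypothetical helper: return the known names most similar to query.

    cutoff is a similarity ratio between 0 and 1; candidates scoring
    below it are discarded.
    """
    # Compare in lowercase, but return the names in their original spelling.
    lookup = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(
        query.lower(), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[hit] for hit in hits]


flavors = ["Vanilla", "Chocolate", "Strawberry"]
print(find_candidates("vanila", flavors))
```

The `cutoff` value is where the relevance pitfall shows up in practice: set it too low and unrelated entries start to match, set it too high and genuine variants are missed, so it is worth tuning against a sample of real data.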