Fuzzy Search

Fuzzy Search [ˈfʌzi sɜːtʃ]: An algorithm that finds strings matching a pattern approximately rather than exactly.

Fuzzy search algorithms are useful for cleaning data because they can group strings that are spelled similarly. You can constrain what counts as a match with a pattern: for example, if you want to extract e-mail addresses, the string has to contain an @-symbol and a dot. Usually, you provide an input vector of reference strings, say the names of countries. The algorithm then goes through a given data frame row by row and looks for strings that are spelled similarly to one of the names in the input vector. In this way, the fuzzy search can map “Kolumbia”, “Kolumbien” and “Coholumbien” to “Colombia”. Of course, the output isn’t always perfect, especially when two entries in the input vector look similar to each other, because a misspelled string may then be almost equally close to both.
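
The country-name matching described above can be sketched in Python. This is a minimal illustration, not a definitive implementation. It assumes Levenshtein (edit) distance as the similarity measure, which is one common choice and not necessarily what a given fuzzy search tool uses. The helper names `levenshtein` and `closest_match`, the `max_distance` parameter, and the short country list are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_match(name, candidates, max_distance=None):
    """Return the candidate with the smallest edit distance to name
    (ignoring case), or None if even the best one exceeds max_distance."""
    best = min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
    if max_distance is not None:
        if levenshtein(name.lower(), best.lower()) > max_distance:
            return None
    return best


# Hypothetical input vector of reference names and some messy entries.
countries = ["Colombia", "Germany", "France"]
raw = ["Kolumbia", "Kolumbien", "Coholumbien"]

cleaned = [closest_match(entry, countries) for entry in raw]
print(cleaned)  # ['Colombia', 'Colombia', 'Colombia']
```

If the input vector contained two similar names, several candidates could end up at almost the same distance from a misspelled entry, and `min` would simply return the first of the closest ones. Setting `max_distance` is one way to reject entries that are not close to any reference name instead of forcing a match.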