I’d like to begin a multi-part series of postings where I detail the various matching algorithms available in Informatica Data Quality (IDQ) Workbench. In this post I’ll start by giving a quick overview of the algorithms available and some typical uses for each. In subsequent postings I’ll get more detailed and outline the math behind each algorithm. Finally, I’d like to finish up with some baseline comparisons using a single set of data.
IDQ Workbench enables the data quality professional to select from several algorithms in order to perform matching analysis. Each of these serves a different purpose or is tailored toward a specific type of matching. These algorithms include the following:
- Hamming Distance
- Jaro-Winkler
- Edit Distance
- Bigram or Bigram frequency
Let’s look at the differences and main purpose for each of these algorithms.
The Hamming Distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates and postal codes. The algorithm compares two strings position by position and measures the minimum number of substitutions required to change one string into the other, or, put another way, the number of errors that transformed one string into the other.
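To make the position-by-position idea concrete, here is a minimal Python sketch. It is not Informatica's implementation: the function names are hypothetical, the equal-length requirement follows the classical definition of Hamming distance, and the normalized 0-to-1 score is only one plausible way to turn the distance into a match score.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Classical Hamming distance is only defined for equal lengths.
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 if every position differs."""
    if not a and not b:
        return 1.0
    return 1.0 - hamming_distance(a, b) / len(a)


if __name__ == "__main__":
    # Two phone numbers whose last two digits are swapped differ in two positions.
    print(hamming_distance("5551234", "5551243"))
    print(round(hamming_similarity("5551234", "5551243"), 3))
```

Because the comparison is strictly positional, a single inserted or dropped character shifts everything after it, which is why this measure suits fixed-format fields rather than free text.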
The Jaro-Winkler algorithm is well suited for matching strings where the prefix of the string is of particular importance. Examples include strings like company names (xyz associates vs. abc associates). The Jaro-Winkler algorithm measures how similar two strings are by counting the number of matching characters and the number of transpositions among them, and then raises the score for strings that share a common prefix.
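The following Python sketch shows the textbook form of Jaro-Winkler, as an illustration rather than a description of what IDQ does internally. The function names are hypothetical, and the prefix scale of 0.1 and maximum prefix length of 4 are the commonly cited defaults, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is what separates Jaro-Winkler from plain Jaro: two strings that agree on their first few characters score higher than two strings with the same number of matches further in.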
The Edit Distance algorithm is an implementation of the Levenshtein distance algorithm where matches are calculated based on the minimum number of operations needed to transform one string into the other. These operations can include an insertion, deletion, or substitution of a single character. This algorithm is well suited for matching fields containing a short text string such as a name or short address field.
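As a rough illustration of the insertion, deletion and substitution counting described above, here is a minimal Python sketch of the standard Levenshtein dynamic-programming approach. The function name is hypothetical, and nothing here is taken from Informatica's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Smith", "Smyth"))     # one substitution
    print(levenshtein("kitten", "sitting"))  # two substitutions plus one insertion
```

Unlike a purely positional comparison, this measure tolerates a dropped or added character, which fits short free-text fields such as names.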
The Bigram algorithm is one of my favorites due to its thorough decomposition of a string. The bigram algorithm matches data based on the occurrence of consecutive characters in both data strings in a matching pair, looking for pairs of consecutive characters that are common to both strings. The greater the number of common identical pairs between the strings, the higher the match score. This algorithm is useful in the comparison of long text strings, such as free format address lines.
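Here is a minimal Python sketch of bigram matching. The source does not give IDQ's exact scoring formula, so this sketch assumes the Dice coefficient (twice the number of shared pairs divided by the total number of pairs in both strings), which is a common choice. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count every pair of consecutive characters in the string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Assumed Dice-style score: 2 * shared pairs / total pairs, from 0.0 to 1.0."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Strings shorter than two characters have no pairs to compare.
        return 1.0 if a == b else 0.0
    # Counter intersection keeps the smaller count of each shared pair.
    shared = sum((pairs_a & pairs_b).values())
    return 2 * shared / total


if __name__ == "__main__":
    print(bigram_similarity("night", "nacht"))
    print(round(bigram_similarity("12 Main Street Apt 4", "12 Main St Apt 4"), 3))
```

Because the score depends on shared pairs rather than on position, reordered or abbreviated words in a long address line still contribute to the match.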
Informatica provides several options for matching data out of the box with Data Quality (IDQ) Workbench. Although some will argue that other algorithms detect matches more effectively, Informatica has provided some very robust methods to match various types of strings. This flexibility enables the data quality professional to handle various types of data elements in their match routines. As with any tool, it is not a replacement for the research required to use the right method in the right way. This is one of the aspects I’ll cover in the subsequent postings, where we take each algorithm and get more detailed.
Drop by next month for more about the Hamming Distance algorithm and some real-world examples of how it can be implemented!