The Data Quality Chronicle

An event-based log about a service offering


Hamming Distance Matching Algorithm


Introduction

Last month we examined, at a high level, each of the matching algorithms available in Informatica’s data quality tool, Data Quality Workbench or IDQ.  In this month’s edition we’ll dive deeper into one of those options, the Hamming distance algorithm.


As we touched on last month, the Hamming distance measures the minimum number of substitutions required to change one string into the other.  Often this algorithm is used when analyzing strings that represent numeric values, such as telephone numbers or postal codes.

A Little Background

The Hamming distance is named after Richard Hamming, an American mathematician whose accomplishments include many advancements in information science, not the least of which was his work on the Manhattan Project in 1945.  One notable contribution is his philosophy on scientific computing, which appeared as the preface to his 1962 book on numerical methods, Numerical Methods for Scientists and Engineers:

The purpose of computing is insight, not numbers

A Little about the algorithm

Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers.  However, the algorithm is applicable to various types of strings and is not limited to numeric data.

There is one condition worth noting when using this algorithm: the strings analyzed need to be of the same length.  Since the Hamming distance is based on the “cost” of substituting characters to turn one string into another, strings of unequal length incur heavy penalties.  If you are using the Hamming algorithm to analyze telephone numbers, it is critically important to cleanse the data before analyzing it.  For instance, if not all telephone numbers include an area code, then the area codes that are in the data need to be parsed out before analysis, as in the sketch below.
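
As an illustration, a cleansing pass like the following reduces every value to the same length before matching.  This is a Python sketch of my own; the ten-digit North American format it assumes is just an example, so adjust the rules to your data.

  import re

  def normalize_phone(raw):
      # Illustrative cleansing only; adapt the rules to your own data
      digits = re.sub(r"\D", "", raw)      # strip punctuation and spaces
      if len(digits) == 11 and digits.startswith("1"):
          digits = digits[1:]              # drop a leading country code
      if len(digits) == 10:
          digits = digits[3:]              # parse out the area code
      return digits                        # well-formed values are now seven digits

  print(normalize_phone("(716) 555-1234"))   # 5551234
  print(normalize_phone("555-1234"))         # 5551234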

Hamming Distance based checks

Hamming distance based checks determine the number of errors between two strings.  In coding theory terms, if we want to be able to detect up to x errors, we can map every string y into a longer codeword such that the minimum Hamming distance between any two valid codewords is x + 1; that way, x or fewer errors can never turn one valid codeword into another.
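
To make the mechanics concrete, here is a minimal Python sketch of the distance itself and of a normalized match score (my own illustration for this post, not Informatica’s code):

  def hamming_distance(s1, s2):
      # The classic Hamming distance is only defined for equal-length strings
      if len(s1) != len(s2):
          raise ValueError("strings must be of equal length")
      # Count the positions at which the two strings differ
      return sum(c1 != c2 for c1, c2 in zip(s1, s2))

  def hamming_match_score(s1, s2):
      # Normalize the distance to a 0-1 score, where 1 is a perfect match
      return 1 - hamming_distance(s1, s2) / len(s1)

  print(hamming_distance("7165551234", "7165551243"))     # 2
  print(hamming_match_score("7165551234", "7165551243"))  # 0.8

Note that the two transposed final digits cost two substitutions; the Hamming distance has no concept of swapping characters.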

Hamming Distance in Informatica Data Quality Workbench

In Data Quality Workbench, implementing the Hamming distance algorithm is very simple: add the Hamming component from the Component Palette.  Refer to the figure below for a look at the Hamming icon.

Informatica Data Quality Workbench Hamming Distance icon

Once the Hamming component has been added to your IDQ plan, it is time to configure it.  Configuration options include the specification of Inputs, Parameters, and Outputs.  Refer to the figure below for a look at the configuration options.

Configuration Options of the Hamming Distance Component

As the name indicates, the Inputs tab is where you specify which data elements will be matched in the component.  It is important to note that when configuring the inputs, two data elements must be selected for each match desired.  Without going into extensive detail, when the data is grouped, Informatica IDQ formats it so that each element has two instances: one labeled “x_1” and the other labeled “x_2”, where x is the data element name.  Both the x_1 and x_2 data elements need to be selected.

The Parameters tab is where you configure some important options.  The Reverse Hamming option is a check box that configures the component to read the strings from right to left instead of the default left to right.  The remaining two options dictate the match score when null values exist in the pair of strings: the Single Null Match Value setting indicates the match score when one field in the matched pair is null, while the Both Null Match Value setting indicates the match score when both fields are null.
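
To make these settings concrete, here is a rough Python sketch of how they would apply.  This is my own approximation of the behavior, not Informatica’s implementation, and the default values shown are arbitrary:

  def hamming_score(s1, s2, reverse=False,
                    single_null_match=0.0, both_null_match=1.0):
      # Both Null Match Value: the score when both fields are null
      if s1 is None and s2 is None:
          return both_null_match
      # Single Null Match Value: the score when exactly one field is null
      if s1 is None or s2 is None:
          return single_null_match
      if max(len(s1), len(s2)) == 0:
          return 1.0                       # two empty strings are identical
      # Reverse Hamming: read the strings from right to left, which only
      # changes the result when the strings differ in length
      if reverse:
          s1, s2 = s1[::-1], s2[::-1]
      # Assumption: any unmatched overhang counts fully against the score
      mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))
      mismatches += abs(len(s1) - len(s2))
      return 1 - mismatches / max(len(s1), len(s2))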

The Outputs tab is where you define the name of the output value.  Pretty straightforward there.

Now for the good part, the results!  Refer to the figure below for a look at the results produced by the Hamming distance component.

Results of the Hamming distance match analysis in Informatica Data Quality workbench

I’ve masked the results to ensure data privacy; however, the sample is still useful in describing the results.  As you can see above, row two resulted in a Hamming match score of 1 because both values are identical.  The data in row one is not identical and consequently resulted in a score of approximately 0.93.  This is due to the substitution “cost” of turning “ZBL ZSSXCNZTES” into “ZBT ZSSXCNZTES”.  The penalty of roughly 0.07 was due entirely to “ZBL” versus “ZBT”.
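
The arithmetic behind those numbers is straightforward: the strings are 14 characters long and differ in exactly one position, so the score is 1 - 1/14 ≈ 0.93 and the penalty is 1/14 ≈ 0.07.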

Summary

Among Richard Hamming’s many accomplishments is the development of an algorithm to compare various types of strings of the same length to determine how different they are.  Due to the requirement of equal length, the algorithm is primarily used to detect differences in numeric strings but can be used with textual data as well.

Informatica has incorporated the Hamming algorithm into the Data Quality Workbench tool in order to produce a match score.  The Hamming component requires the selection of at least two inputs; it can be configured to handle data with nulls and will output a match score.  In IDQ, a Hamming match score of one (1) indicates a perfect match, while a score of zero (0) indicates no correlation between the two values being analyzed.

I’ve used the Hamming component in IDQ to analyze match possibilities in telephone numbers and postal codes.  I’ve found it to be reliable in detecting true positive matches and sensitive enough to detect even slight differences (as indicated in the sample data above).  I hope this review will help those of you interested in using the Hamming component in IDQ or those just interested in developing knowledge of the algorithm.

Thank you for reading this month’s edition of The Data Quality Chronicle.  Stop by again next month when I detail the Jaro-Winkler algorithm and its implementation in Informatica IDQ.


Informatica Data Quality Workbench Matching Algorithms


I’d like to begin a multi-part series of postings where I detail the various algorithms available in Informatica Data Quality (IDQ) Workbench.  In this post I’ll start by giving a quick overview of the algorithms available and some typical uses for each.  In subsequent postings I’ll get more detailed and outline the math behind each algorithm.  Finally, I’d like to finish up with some baseline comparisons using a single set of data.


IDQ Workbench enables the data quality professional to select from several algorithms in order to perform matching analysis.  Each of these serves a different purpose or is tailored toward a specific type of matching.  These algorithms include the following:

  1. Hamming Distance
  2. Jaro-Winkler
  3. Edit Distance
  4. Bigram or Bigram frequency

 Let’s look at the differences and main purpose for each of these algorithms.

The Hamming distance algorithm, for instance, is particularly useful when the position of the characters in the string is important.  Examples of such strings are telephone numbers, dates and postal codes.  The Hamming distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.

The Jaro-Winkler algorithm is well suited for matching strings where the prefix of the string is of particular importance.  Examples include strings like company names (xyz associates vs. abc associates).   The Jaro-Winkler algorithm is a measure of how similar two strings are by calculating the number of matching characters and number of transpositions required.
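
For the curious, here is a compact Python sketch of the calculation, using the standard textbook formulation of Jaro-Winkler rather than Informatica’s internal code:

  def jaro(s1, s2):
      # Similarity based on matching characters and transpositions
      if s1 == s2:
          return 1.0
      len1, len2 = len(s1), len(s2)
      if len1 == 0 or len2 == 0:
          return 0.0
      # Characters "match" when identical and within this window of each other
      window = max(len1, len2) // 2 - 1
      matched1, matched2 = [False] * len1, [False] * len2
      matches = 0
      for i in range(len1):
          for j in range(max(0, i - window), min(i + window + 1, len2)):
              if not matched2[j] and s1[i] == s2[j]:
                  matched1[i] = matched2[j] = True
                  matches += 1
                  break
      if matches == 0:
          return 0.0
      # Transpositions: matched characters that appear in a different order
      transpositions, k = 0, 0
      for i in range(len1):
          if matched1[i]:
              while not matched2[k]:
                  k += 1
              if s1[i] != s2[k]:
                  transpositions += 1
              k += 1
      t = transpositions / 2
      return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

  def jaro_winkler(s1, s2, scaling=0.1):
      # Winkler's refinement boosts pairs that share a prefix of up to 4 chars
      j = jaro(s1, s2)
      prefix = 0
      for c1, c2 in zip(s1, s2):
          if c1 != c2 or prefix == 4:
              break
          prefix += 1
      return j + prefix * scaling * (1 - j)

  print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961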

The Edit Distance algorithm is an implementation of the Levenshtein distance algorithm where matches are calculated based on  the minimum number of operations needed to transform one string into the other.  These operations can include an insertion, deletion, or substitution of a single character.  This algorithm is well suited for matching fields containing a short text string such as a name or short address field.
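
A minimal Python sketch of the underlying Levenshtein calculation (again, an illustration rather than Informatica’s implementation):

  def edit_distance(s1, s2):
      # Classic dynamic program: prev[j] holds the distance between the
      # current prefix of s1 and the first j characters of s2
      prev = list(range(len(s2) + 1))
      for i, c1 in enumerate(s1, 1):
          curr = [i]
          for j, c2 in enumerate(s2, 1):
              curr.append(min(prev[j] + 1,                 # deletion
                              curr[j - 1] + 1,             # insertion
                              prev[j - 1] + (c1 != c2)))   # substitution
          prev = curr
      return prev[-1]

  print(edit_distance("kitten", "sitting"))  # 3 edits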

The Bigram algorithm is one of my favorites due to its thorough decomposition of a string.  The bigram algorithm matches data based on pairs of consecutive characters (bigrams) that are common to both strings in a matching pair.  The greater the number of identical bigrams between the strings, the higher the match score.  This algorithm is useful in the comparison of long text strings, such as free format address lines.
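
Here is a rough Python sketch of a bigram match score using the Dice coefficient, one common formulation; Informatica’s exact scoring may differ:

  def bigram_score(s1, s2):
      # Decompose each string into its consecutive character pairs
      bigrams1 = [s1[i:i + 2] for i in range(len(s1) - 1)]
      bigrams2 = [s2[i:i + 2] for i in range(len(s2) - 1)]
      if not bigrams1 or not bigrams2:
          return 0.0
      # Count the pairs common to both strings, consuming each pair once
      remaining = list(bigrams2)
      common = 0
      for bg in bigrams1:
          if bg in remaining:
              remaining.remove(bg)
              common += 1
      # Dice coefficient: the share of all bigrams the strings have in common
      return 2 * common / (len(bigrams1) + len(bigrams2))

  print(round(bigram_score("123 main street", "123 main st"), 2))  # 0.83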

Informatica provides several options for matching data out of the box with Data Quality (IDQ) Workbench.  Although some will argue that other algorithms detect matches with greater strength, Informatica has provided some very robust methods to match various types of strings.  With this flexibility the data quality professional is able to handle various types of data elements in their match routines.  As with any tool, it is not a replacement for the research required to use the right method in the right way.  This is one of the aspects I’ll cover in subsequent postings, where we take each algorithm and examine it in more detail.

Drop by next month for more about the Hamming distance algorithm and some real world examples of how it can be implemented!

GUI or command line? Where to run an IDQ plan.


Recently on a data quality project I stumbled across an anomaly that I thought I’d share with the data quality / Informatica community.  It involves the use of Informatica Data Quality (IDQ) and certain types of queries.  I discovered this anomaly when I received the following error message:
Failure to initialise

Image 1: Failure to initialise plan

One of the root causes of this error is a “behind-the-scenes” process involved in importing data into an IDQ plan.  When importing data via SQL inside the source component, IDQ wraps a select count() around the original query.  At times this causes IDQ to fail while initializing and return the error message above.  Most commonly this error occurs when using a query that goes beyond a standard select statement; I was using a union in my query when I got this error.

If you are using Oracle, you have the option of editing your SQL or running the plan from the command line.  However, if you are using Microsoft SQL Server, you’ll need to go to the command line.

The SQL edit in Oracle is simple and straightforward, so let’s address that first.  If you get an error where the IDQ plan fails to initialise, try wrapping the SQL as follows:

  • Select * from (<your query here>)
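
For example, a union query would be wrapped like this (table and column names here are made up for illustration):

  • select * from (select phone from customers union select phone from prospects)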

As for Microsoft SQL Server, the command line is the only option, but it is very straightforward.  You could even argue that the command line is more robust for the enterprise, since jobs can then be scheduled and thus integrated into existing data services more easily.

Regardless of your perspective, here is how you deploy a plan in real time using the command line.  An IDQ plan can be executed from one of two locations: the IDQ repository or the file location of the XML version of the plan.  On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin folder of the Data Quality Server installation.  It is also a requirement to specify the location of the plan in the command line utility.

To deploy a plan from an XML file the -f switch is used.  So the command to deploy an XML version of a plan, call it myplan, in the C:/idqplans directory would be as follows:

  • athanor-rt -f c:/idqplans/myplan.xml

To deploy a plan from the repository the -p switch is used.  Consequently the command to deploy the myplan dq plan from the repository would be as follows:

  • athanor-rt -p c:/idqplans/myplan

There are other useful switches to utilize while deploying your IDQ plans at the command line. 

For instance to run an IDQ plan along with a parameter file, which allows you to override plan variables with those in the parameter file, use the -c switch.  The command for executing a plan along with a parameter file would be as follows:

  • athanor-rt -c <parameter file name> c:/idqplans/myplan

Another helpful switch, -i, allows you to receive feedback on the status of the plan execution.

The -i switch provides feedback in the command line window at the interval specified in the command.  For instance, -i 200 will provide feedback every 200 records for each field in the process.  The following is an example of the feedback provided with the -i switch:

Image 2: Example of the command line feedback
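
Putting it together, assuming the switches combine as they do in most command line tools, a repository plan run with feedback every 200 records might look like this:

  • athanor-rt -p c:/idqplans/myplan -i 200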

With these basic switches you can deploy any IDQ plan regardless of the query required to source the data.  I hope this post helps someone avoid hours of debugging!

Check back later for more details about scheduling IDQ plans using the command line.