The Data Quality Chronicle

An event-based log about a service offering


Data Quality Basic Training


Recently a reader asked me if I had any posts on “data quality basics”.  Turns out, I didn’t.  So I’ve decided to put together a series of posts that covers what I feel are the essentials of a data quality program.


The many sides of data quality

It is generally accepted in the data quality world that there are seven categories by which data quality can be analyzed.  These include the following:

  1. Conformity – Analyzing data for conformity measures adherence to data definition standards.  This can include determining if data is of the correct type and length
  2. Consistency – Analyzing data for consistency measures whether data is uniformly represented across systems.  This can involve comparing the same attribute across the various systems in the enterprise
  3. Completeness – Analyzing data for completeness measures whether or not required data is populated.  This can involve one or more elements and is usually tightly coupled with required field validation rules
  4. Accuracy – Analyzing data for accuracy measures whether data is nonstandard or out-of-date.  This can involve comparing data against standards like USPS deliverability and ASCII code references
  5. Integrity – Analyzing data for integrity measures the data references that link information, like customers and their addresses.  Using our example, this analysis would determine which addresses are not associated with customers
  6. Duplication – Analyzing data for duplication measures the pervasiveness of redundant records.  This involves determining which pieces of information uniquely define a record and identifying the extent to which duplicates are present
  7. Timeliness – Analyzing data for timeliness measures the availability of data.  This involves analyzing the creation, update or deletion of data and its dependent business processes
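As a minimal illustration of how a couple of these categories translate into concrete checks, here is a sketch that measures completeness and conformity on a handful of records.  The field names, sample values, and rules are invented for the example; they are not from any particular system.

```python
# Illustrative sample records; field names and values are invented
records = [
    {"id": 1, "name": "Slapping Zombie & Fish Co.", "zip": "19103"},
    {"id": 2, "name": "Slapping Zombie and Fish Co.", "zip": "19103"},
    {"id": 3, "name": "", "zip": "ABCDE"},
]

# Completeness: share of records whose required "name" field is populated
complete = sum(1 for r in records if r["name"].strip()) / len(records)

# Conformity: share of "zip" values matching the assumed 5-digit format
conforms = sum(
    1 for r in records if r["zip"].isdigit() and len(r["zip"]) == 5
) / len(records)

print(f"completeness: {complete:.0%}")  # completeness: 67%
print(f"conformity: {conforms:.0%}")    # conformity: 67%
```

The same pattern extends naturally to the other categories; duplication, for instance, is just a count of records sharing the attributes that should uniquely define them.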


Data Quality Tips & Tricks: Using delimiters to your advantage


Introduction

While I am doing research for my next matching algorithm post, on the Jaro-Winkler algorithm, I have decided to throw together some of my favorite “lessons learned” that I have discovered during my practice with Informatica Data Quality (IDQ) Workbench.  This eclectic bunch of tricks has helped me carry out various tasks such as more comprehensive data profiling and more accurate matching.  Some of the tips are generic and apply to any data quality software; however, some require detailed knowledge of IDQ.
If you’d like more details about the tips below, please feel free to leave a comment and I’ll follow up with you directly.

Data profiling and the use of delimiters

One of the most valuable profiling and analysis components Informatica supplies is the Token Labeler.  This component categorizes the data format of the values in a given field.  It does so through the use of tokens and delimiters. These tokens are essentially labels assigned to each string in a field, where the strings are separated by a delimiter.

The significance of the delimiter is that it is often an indication of the “noise” characters that need to be removed from or converted in the data. For instance, on a recent engagement I was tasked with performing duplication consolidation on a company’s accounts data. During analysis I was able to identify that there were some inconsistencies with how the word and was represented in the data. At times an account could be entered as Slapping Zombie & Fish Co., and at times it was entered as Slapping Zombie and Fish Co. In order to identify the extent of the variability, I wanted to profile the pervasiveness of the ampersand (&) in the data.
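That kind of pervasiveness check can be sketched outside of any tool as a simple count of how often each variant appears.  The sample account names below are invented for illustration:

```python
# Invented sample of account names with inconsistent "and" representation
accounts = [
    "Slapping Zombie & Fish Co.",
    "Slapping Zombie and Fish Co.",
    "Acme Widgets",
    "Smith & Sons LLC",
]

# Count values containing the ampersand vs. the word "and"
with_amp = sum(1 for a in accounts if "&" in a)
with_and = sum(1 for a in accounts if " and " in a.lower())

print(f"ampersand: {with_amp}/{len(accounts)}")  # ampersand: 2/4
print(f"'and':     {with_and}/{len(accounts)}")  # 'and':     1/4
```

Knowing those proportions tells you how much standardization work the field needs before matching.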

There was only one small issue. The ampersand character is not one of the twelve standard delimiters that are built into IDQ. In fact, with 32 possible special characters, odds are you may need to know how to add a delimiter while using IDQ to profile your data.

The first thing you need is access to the Informatica Data Quality repository, which is a relational database named idq_repo. Once you have access, all you need to do is write some SQL to add the delimiter to the t_delimiters table in the repository. In figure 1 below I have illustrated what this looks like when using Microsoft SQL Server for the IDQ repository.


Figure 1 Snapshot of the idq_repo repository & the t_delimiters table

The required SQL is very straightforward once you know which table the delimiters are stored in and the structure of that table. In figure 2 below I illustrate the SQL code to add the ampersand to the list of available delimiters.
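Because the actual column layout of t_delimiters depends on your installation, the sketch below only models the shape of the statement, using an assumed single-column version of the table in an in-memory SQLite database.  Check the real table's structure in your idq_repo before writing the insert.

```python
import sqlite3

# Stand-in for idq_repo; the single "delimiter" column is an assumption,
# not the real t_delimiters schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_delimiters (delimiter TEXT)")

# Seed a few of the built-in delimiters, then add the ampersand
conn.executemany(
    "INSERT INTO t_delimiters (delimiter) VALUES (?)",
    [(" ",), (",",), (";",)],
)
conn.execute("INSERT INTO t_delimiters (delimiter) VALUES ('&')")

delims = [row[0] for row in conn.execute("SELECT delimiter FROM t_delimiters")]
print(delims)  # [' ', ',', ';', '&']
```

Against the actual repository you would run the equivalent INSERT through SQL Server Management Studio (or your database client of choice) rather than SQLite.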


Figure 2 SQL Code to add a delimiter to IDQ

Once you run this statement, you’ll be able to add the ampersand to your IDQ plan immediately without the need to close the application. As you can see in figure 3 below, I have added this along with several other delimiters to assist in my analysis, standardization and match data quality routines.


Figure 3 an example of the addition of delimiters in the IDQ token labeler

I hope this helps those of you who use IDQ in your data quality routines and who are looking for a way to analyze and standardize special characters not already available in the list provided by Informatica.

Data Parsing in IDQ

This next tip also involves the use of delimiters; however, this time we use a built-in delimiter in the Token Parser component to help us parse data combined in one data element. In my experience it is not uncommon to see cases where customer first and last name are collected and stored in a single data element. For various reasons this poses a challenge when standardizing and matching the data. As a result, you need to parse this data into two separate data elements, which can be done using the space delimiter. In the sample data in figure 4 below you can see customer name data stored in one data element.


Figure 4 Name data stored in one data element

Using a simple standardization plan in IDQ we can transform this data into two elements so that further standardization and more accurate matching can occur. In figure 5 the token parser is used to parse the name components into two data elements.


Figure 5 IDQ plan using the Token Parser

Let’s take a closer look at how the Token Parser needs to be configured to achieve this result. First an input value needs to be selected from the Inputs tab by selecting a checkbox next to the data label as shown in figure 6 below.


In the Components pane, double-clicking on the generic Token Parser # will provide the opportunity to change the name of the instance of the token parser. This helps organize each instance of parsing by data element and is a beneficial practice.


In the Parameters pane, you can select the delimiter that will be used to parse the input data. You can select more than one delimiter from the list, and the data will be parsed if any of the delimiters are encountered.  As we mentioned in the previous tip, the delimiters are stored in the t_delimiters table of the idq_repo and can be edited with a simple SQL statement.


Lastly, outputs are added by right-clicking on the appropriate output type pane on the Output tab. In our sample case, we are looking for a text-based output.  As a result, we right-click in that pane and select the add option to include an output of the given type. Notice there is a predefined selection for an output overflow value. In our example, this value will hold any data encountered after the space delimiter. This is a particularly useful placeholder, and it is a good practice to rename this value to something meaningful so that you can easily track what data it contains.
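As a rough model of that behavior (an assumption about what the Token Parser does, not IDQ's actual implementation), the first token becomes one output and everything after the first delimiter lands in the overflow:

```python
def parse_with_overflow(value: str, delimiter: str = " "):
    """Split into (first_token, overflow); overflow is '' if no delimiter."""
    # partition() splits on the first occurrence only, so multi-word
    # remainders stay together in the overflow value
    first, _, overflow = value.partition(delimiter)
    return first, overflow

print(parse_with_overflow("John Smith"))      # ('John', 'Smith')
print(parse_with_overflow("Mary Ann Jones"))  # ('Mary', 'Ann Jones')
```

The second example shows why renaming the overflow output matters: for a name like Mary Ann Jones, the overflow carries everything beyond the first token, and a meaningful name makes that easy to track downstream.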


Once the Token Parser is configured and the IDQ plan has been executed, you get an output that looks similar to the output file depicted in figure 9 below.


Figure 9 Output of IDQ plan which parses out First and Last Name

While this example focuses on parsing customer first and last name, another useful application of this method is parsing the street number and street name in order to enable grouping of customers by street name during matching analysis. Other similar parsing implementations use the dash (-) delimiter to parse the Postal Code data element into Zip Code and Zip plus 4 components. These implementations follow the same general process documented above.
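The postal code variant is the same pattern with a different delimiter; the function and field names below are illustrative, not from IDQ:

```python
def parse_postal_code(postal: str):
    """Return (zip5, plus4); plus4 is '' when no dash is present."""
    # partition on the dash delimiter, mirroring the Token Parser setup
    zip5, _, plus4 = postal.partition("-")
    return zip5, plus4

print(parse_postal_code("19103-1234"))  # ('19103', '1234')
print(parse_postal_code("19103"))       # ('19103', '')
```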

Summary

These are two simple methods that help you leverage delimiters in your data profiling and cleansing routines.  By expanding the set of special characters you can profile, you gain greater insight into what types of cleansing you need to perform.  By using delimiters in parsing routines, you increase data conformity, consistency, integrity and accuracy.  You can also enrich and expand the data by aligning it more properly.  The most advantageous benefits of these methods are increased data matching volumes and accuracy, which ultimately lead to reduced operational costs.

I hope these little tidbits were valuable and help someone out there struggling to find a way to better cleanse their data.  I’ll be posting more of these in the future, so keep your eyes peeled!