The Data Quality Chronicle

An event based log about a service offering

The DQ Two Step!

Howdy, folks! I’d like to sing you a tune about matching customer data, if you don’t mind. It’s called the “data quality two step.”

Heck! You can even grab your partner if the mood strikes you.

Feels good!

I’ve recently cleansed some customer data for a great bunch of folks that I love calling my client. 

We had a consolidation ratio of 1:6, or around 17%, which equated to roughly a million duplicates removed. That’s a lot of savings on postage stamps for the marketing department, so they were psyched! We validated over 90% of the addresses and built reports to identify those that did not meet the requirements for a valid address. Not too shabby, if I do say so myself!

Now that we’ve deployed the data into User Acceptance Testing (UAT), I find myself in a familiar place: the business logic.

What's this?

You missed something

You can spend all the time you need, or even care to, on rules for consolidation, but it is usually not until the data hits the screen that the average business user can easily understand the ramifications.

Case in point: I recently received an email from a stakeholder asking me to look over some data with him. I was curious what I’d find when I reached his office, as I had analyzed this data and the processing code more than a few times by then. On my walk over I went through many possible scenarios in my head.

Was it something I missed? Surely not. I’d performed several test runs in order to validate the business logic. With my curiosity piqued, I rounded the corner and into his office I went.

"Good point!"

After a little chit-chat (like I said, I love this client), we got down to business. He proceeded to type a few parameters into the search utility and I waited with anticipation.

However, after a second, maybe less, my anticipation was replaced with relief and more than a little disbelief. I’d been over this a time or two, which is why I was in a state of shock. Not to mention that my client was not someone who needed “Data for Dummies”.

With identities masked to protect the innocent, below is a sample of the records he was concerned about and wanted me to see.



So if you’ve been wondering about this two step thing, here it comes.
In order to positively identify a non-unique individual, you need to pair their name with an additional piece of identifying information, usually an address.
In other words, it is a two part match on name and address that can, with a relatively high confidence level, identify a true duplicate.
If we only used a match on name to identify duplicates, we’d consolidate all the John Smiths in the dataset into one customer. Talk about lost opportunity! This approach could turn millions of customers into thousands in an instant.
One brief glance in the local phone directory will be enough to demonstrate how non-unique names really are.  
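
For the code-minded, here is a minimal sketch of the two step in Python. It is not the processing code from this project. The nickname and street-suffix tables, the field names, and the sample records are all made up for illustration, and the address scrub is far cruder than real address validation.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables for illustration; real ones would be much larger.
NICKNAMES = {"bill": "william", "will": "william", "jim": "james", "bob": "robert"}
SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}


def normalize_name(first, last):
    """Step one: lowercase the name and swap a nickname for its proper equal."""
    first = first.strip().lower()
    return NICKNAMES.get(first, first), last.strip().lower()


def normalize_address(address):
    """Step two: a crude scrub (lowercase, drop punctuation, abbreviate suffixes)."""
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(SUFFIXES.get(token, token) for token in tokens)


def match_key(record):
    """Combine both steps so a match needs the name AND the address."""
    first, last = normalize_name(record["first"], record["last"])
    return first, last, normalize_address(record["address"])


def find_candidate_duplicates(records):
    """Group records by match key and keep groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Made-up sample records, not the client's data.
records = [
    {"first": "Bill", "last": "Jones", "address": "12 Oak Street"},
    {"first": "William", "last": "Jones", "address": "12 Oak St."},
    {"first": "John", "last": "Smith", "address": "5 Elm Avenue"},
    {"first": "John", "last": "Smith", "address": "99 Pine Road"},
]
candidates = find_candidate_duplicates(records)
```

With these records, the two Jones rows land in one group because both the name and the address agree after scrubbing, while the two John Smiths stay separate customers. Treat each group as a candidate for consolidation rather than proof of a duplicate.
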
They've been through this before!

Go a step further and ask your local DBA to run some counts on first-last name combinations and you’ll be surprised at the results.
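
If your DBA is busy, a rough version of that count is easy to sketch yourself. The snippet below assumes the customers are already loaded as a list of dictionaries with hypothetical `first` and `last` fields; the names in it are invented.

```python
from collections import Counter


def name_counts(customers):
    """Count how many customer records share each first-last name combination."""
    return Counter(
        (c["first"].strip().lower(), c["last"].strip().lower()) for c in customers
    )


# Invented sample data for illustration only.
customers = [
    {"first": "John", "last": "Smith"},
    {"first": "john", "last": "Smith "},
    {"first": "John", "last": "Smith"},
    {"first": "Mary", "last": "Jones"},
]
counts = name_counts(customers)

# Name combinations that appear on more than one record.
shared_names = {name: n for name, n in counts.items() if n > 1}
```

Every entry in `shared_names` is a name that a name-only match would collapse into a single customer.
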
Just in case this little story wasn’t enough to remind you, here is that tune I promised you:
The two step matching ditty goes a little like this …
Grab your partner’s name and twirl it around
Make sure the nickname’s proper equal is found
Then grab you their address and scrub with the care
Make sure their mail can be delivered there
Don’t get rid of your partner until you are sure
That you’ve got a match on more
Than the name or the door
lyrics by Data Pickins
music by YouToo?

4 responses to “The DQ Two Step!”

  1. Henrik Liliendahl Sørensen August 5, 2009 at 4:15 am

    Awesome post.

    About James Bond. I have a way of categorising party data

    I think the 2 rows are not 2 ‘C’ duplicates but 2 separate instances of type ‘I’.

    My Data Quality 2.0 system worked that out by mimicking a human.

    The name ‘James Bond’ was found in the table with ‘Comic names’ with the possibility weight 33% (‘Donald Duck’ is 100%).

    Neither ‘Secret Place’ nor ‘Hidden Drive’ (with stated numbers) in ‘Brooklyn’ or ‘New York’ was found in the table ‘US Thoroughfares’ or ‘All addresses of the World’.

    Some 3.0 day the system will also recognize a misplaced connection between the name ‘James Bond’ and the number ‘007’ and street element ‘Secret’.

  2. Daragh O Brien August 5, 2009 at 11:57 am

    Great post, William. It clearly illustrates the fun we can have trying to match entities. I tend to look for at least two other “facts” in a match (same telephone number, same date of birth) to build as robust a picture as possible of why two things are the same thing.

    Acceptable error rates in matching can vary between industries. At the first Information Quality conference we held in Dublin back in 2005, a presenter was talking about error rates in telco and financial services matching. A delegate from Healthcare stood up and challenged the “acceptableness” of the thresholds being talked about. His point: at those error rates his team would kill 300 people a year.

  3. Jim Harris August 5, 2009 at 1:14 pm

    Excellent post William,

    I always enjoy a data quality song!

    I do have one minor criticism, however.

    I have an issue with a two part match being referred to as a true duplicate, especially when the two parts are name and postal address.

    In my Data Quality Pro article Identifying Duplicate Customers (Part 3): False Positives, I show a few examples of common scenarios specific to personal name matching (see Keys 431-433 and Keys 441-443) where nearly identical names at the exact same address are not duplicates.

    Furthermore, as someone who has been the victim of identity theft – where using only my name and my postal address incorrectly matched me to thousands of dollars of fraudulent medical expenses by someone who has the same name as me and looked up my postal address in the phone book – I know firsthand that even a two part exact match on name and postal address is not necessarily a true duplicate.

    Even when more parts are available (telephone number, date of birth, tax identifiers), the possibility of a false positive still exists. In fact, in my identity theft, some of the fraudulent medical records had all of my information because a hospital worker performed duplicate consolidation using only personal name and postal address between my actual medical records and my identity thief.

    Best Regards…


    • dqchronicle author August 5, 2009 at 11:05 pm

      Excellent point, Jim. I should have noted that it needs to be at least two pieces of identifying information and that the possibility of false positives always exists.
