Howdy Folks! I’d like to sing you a tune about matching customer data if you don’t mind? It’s called the “data quality two step”.
Heck! You can even grab your partner if the mood strikes you.
I’ve recently cleansed some customer data for a great bunch of folks that I love calling my client.
We had a consolidation ratio of 1:6, or around 17%, which equated to roughly a million duplicates removed. That’s a lot of savings on postage stamps for the marketing department, so they were psyched! We validated over 90% of the addresses and built reports to identify those that did not meet the requirements for a valid address. Not too shabby, if I do say so myself!
Now that we’ve deployed the data into User Acceptance Testing (UAT), I find myself in a familiar place: the business logic.
You missed something
You can spend all the time you need, or even care to, on rules for consolidation, but it is usually not until the data hits the screen that the average business user easily understands the ramifications.
Case in point, I recently received an email from a stakeholder asking me to look over some data with him. I was curious what I’d find when I reached his office, as I had analyzed this data and the processing code more than a few times by then. On my walk over I went through many possible scenarios in my head.
Was it something I missed? Surely not. I’d performed several test runs in order to validate the business logic. With my curiosity piqued, I rounded the corner and into his office I went.
After a little chit-chat, like I said I love this client, we got down to business. He proceeded to type a few parameters in the search utility and I waited with anticipation.
However, after a second, maybe less, my anticipation was replaced with relief and more than a little disbelief. I’d been over this a time or two, which is why I was in a state of shock. Not to mention my client was not someone who needed “Data for Dummies”.
With identities masked to protect the innocent, below is a sample of the records he was concerned about and wanted me to see.
So if you’ve been wondering about this two step thing, here it comes.
In order to positively identify a non-unique individual you need to pair their name with an additional piece of identifying information, usually an address.
In other words, it is a two-part match on name and address that can, with a relatively high confidence level, identify a true duplicate.
If we only used a match on name to identify duplicates, we’d consolidate all the John Smiths in the dataset to one customer. Talk about lost opportunity! This approach could turn millions of customers into thousands in an instant.
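For the code-minded folks, here is a minimal Python sketch of the difference between matching on name alone and matching on name plus address. The sample records, the `normalize` helper, and the `group_ids` function are all made up for illustration; a real cleansing job would use proper name and address standardization rather than simple lowercasing.

```python
from collections import defaultdict

# Hypothetical sample records; names and addresses are invented for illustration.
customers = [
    {"id": 1, "name": "John Smith", "address": "12 Oak St, Austin TX"},
    {"id": 2, "name": "John Smith", "address": "12 Oak St, Austin TX"},
    {"id": 3, "name": "John Smith", "address": "400 Pine Ave, Denver CO"},
    {"id": 4, "name": "Mary Jones", "address": "12 Oak St, Austin TX"},
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(text.lower().split())


def group_ids(records, key_fields):
    """Group record ids by the normalized values of the given key fields."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(normalize(rec[field]) for field in key_fields)
        groups[key].append(rec["id"])
    return list(groups.values())


# Step one only: every John Smith collapses into a single customer.
name_only = group_ids(customers, ["name"])

# Both steps: a group is a duplicate only when name AND address agree.
two_step = group_ids(customers, ["name", "address"])

print(name_only)  # [[1, 2, 3], [4]]
print(two_step)   # [[1, 2], [3], [4]]
```

With the name-only key, the John Smith in Denver gets merged into the Austin household. The two-part key keeps him as his own customer while still consolidating the true duplicate (ids 1 and 2).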
One brief glance in the local phone directory will be enough to demonstrate how non-unique names really are.
They've been through this before!
Go a step further and ask your local DBA to run some counts on first-last name combinations and you’ll be surprised at the results.
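If your DBA is busy, a rough version of that count can be sketched in a few lines of Python. The name list below is invented, and `name_counts` is a hypothetical helper; in practice the pairs would come from your own customer table.

```python
from collections import Counter

# Hypothetical (first, last) name pairs; in practice these come from the customer table.
names = [
    ("John", "Smith"),
    ("john", "smith "),
    ("Mary", "Jones"),
    ("John", "Smith"),
    ("Ana", "Lopez"),
]


def name_counts(pairs):
    """Count occurrences of each first-last combination, ignoring case and stray spaces."""
    return Counter(
        (first.strip().lower(), last.strip().lower()) for first, last in pairs
    )


counts = name_counts(names)

# Keep only the combinations shared by more than one record.
shared = {name: n for name, n in counts.items() if n > 1}

print(shared)  # {('john', 'smith'): 3}
```

Any combination that shows up more than once is a name that cannot identify a customer on its own.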
Just in case this little story wasn’t enough to remind you, here is that tune I promised you:
The two step matching ditty goes a little like this …
Grab your partner’s name and twirl it around
Make sure the nickname’s proper equal is found
Then grab you their address and scrub with the care
Make sure their mail can be delivered there
Don’t get rid of your partner until you are sure
That you’ve got a match on more
Than the name or the door
lyrics by Data Pickins
music by YouToo?