Recently I had coffee with Dr. John Talburt of the University of Arkansas at Little Rock’s Information Quality program. During the conversation we exchanged experiences we have each had implementing data quality solutions, especially with regard to matching. One of the prevailing topics was the promotion of a master record from a set of duplicates. That’s when Dr. Talburt shared a perspective I had not considered.
He asserted that most often the focus of duplicate validation is placed on confirming true positive results. In other words, data quality analysts most often focus on confirming that the duplicates they identify really are duplicates. That sounds straightforward enough, and I confess I am guilty of this approach. However, Dr. Talburt proposed that at least one more validation process be added to the results analysis: the validation of true negative results, or confirming that the records not identified as duplicates are truly not duplicates.
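To make the distinction concrete, here is a minimal sketch in Python of the four outcomes a matching run can produce when compared against ground truth. The function and its arguments are my own illustration, not from any particular matching tool.

```python
# A minimal sketch of the four possible outcomes of a matching run,
# assuming we have ground-truth labels for a sample of record pairs.
# The function name and structure are illustrative only.

def classify_outcome(flagged_duplicate: bool, truly_duplicate: bool) -> str:
    """Classify one record pair against ground truth."""
    if flagged_duplicate and truly_duplicate:
        return "true positive"    # correctly identified duplicate
    if flagged_duplicate and not truly_duplicate:
        return "false positive"   # flagged, but not really a duplicate
    if not flagged_duplicate and not truly_duplicate:
        return "true negative"    # correctly left alone
    return "false negative"       # a duplicate the run missed
```

Most of us spend our review time on the first two outcomes; Dr. Talburt’s point is that the last two deserve a validation pass as well.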
Granted, this approach is more difficult and certainly more time-consuming, particularly for large datasets. However, it can also be a more valuable exercise for an organization. After all, leaving duplicates unidentified in an enterprise dataset compounds the cost of the data by effectively under-utilizing the investment in the de-duplication project.
With this in mind, I have begun compiling methodologies to assure that my efforts to eliminate duplicates have not left the proverbial stone unturned. Here is a list of what I have been able to identify as effective and worth the extra time and resources.
Post-Matching Group Analysis
Just as I stated earlier, after my primary matching runs I analyze the results to be sure I have positively identified duplicates. As if wearing blinders, I have not traditionally analyzed the results to determine whether possible duplicates were left behind. After all, I went through extensive efforts to cleanse and standardize the data in order to increase matching accuracy. Why, then, would I analyze the records that were not identified as potential duplicates? The answer lies, as I stated earlier, in capitalizing on the investment in the data quality initiative. It is time to take the blinders off and look at the transactions not identified as duplicates.
On CDI (customer data integration) projects, this process involves reviewing non-duplicate transactions that share, at a minimum, a similar last name or similar address values. On PIM (product information management) projects, it involves reviewing non-duplicate transactions that share similar product descriptions. A rough sketch of this kind of review follows.
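As an illustration of what that review might look like on a CDI dataset, here is a hypothetical Python sketch that pairs up non-duplicate records with similar last names or addresses for an analyst to inspect. The record layout, field names, and similarity floors are all assumptions made for the example, not part of any specific tool.

```python
# Hypothetical sketch: surface non-duplicate record pairs that still look
# suspiciously alike, for manual true-negative review. The field names
# (last_name, address) and the 0.8 similarity floors are assumptions.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def review_candidates(non_duplicates, name_floor=0.8, addr_floor=0.8):
    """Yield pairs the matcher left alone but that share a similar
    last name or address, so an analyst can confirm the true negatives."""
    for rec_a, rec_b in combinations(non_duplicates, 2):
        if (similarity(rec_a["last_name"], rec_b["last_name"]) >= name_floor
                or similarity(rec_a["address"], rec_b["address"]) >= addr_floor):
            yield rec_a, rec_b

records = [
    {"id": 1, "last_name": "Smith", "address": "123 Main St"},
    {"id": 2, "last_name": "Smyth", "address": "123 Main Street"},
    {"id": 3, "last_name": "Jones", "address": "9 Elm Ave"},
]
for pair in review_candidates(records):
    print(pair)  # only the Smith/Smyth pair surfaces for review
```

In a production setting the pairing would be done against blocked or indexed subsets rather than every combination, but the idea is the same: the records your matcher cleared still deserve a second look.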
Multiple Matching Runs
Multiple matching runs with varied match thresholds are another way to help assure that all duplicate transactions are identified. A match threshold determines the level of similarity at which a comparison is considered a positive match, and it is usually expressed as a percentage. For instance, I typically set an initial threshold of 0.9, or 90%. In light of my conversation with Dr. Talburt, I will now run at least one more match pass with a lower threshold of 0.75-0.80 to catch duplicates missed in the initial run; a sketch of this two-pass approach follows.
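Here is a minimal Python sketch of comparing a strict run against a looser one. The 0.90 and 0.75 thresholds mirror the ones above; the toy scoring function and record layout are illustrative assumptions.

```python
# Minimal sketch: run the same comparison at two thresholds and isolate
# the pairs only the looser run catches. The scoring function and field
# names are stand-ins for whatever your matching tool actually uses.
from difflib import SequenceMatcher
from itertools import combinations

def match_score(a: dict, b: dict) -> float:
    """Toy match score: average similarity of last name and address."""
    sim = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return (sim(a["last_name"], b["last_name"])
            + sim(a["address"], b["address"])) / 2

def run_matching(records, threshold):
    """Return record-id pairs whose match score meets the threshold."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if match_score(a, b) >= threshold
    }

records = [
    {"id": 1, "last_name": "Smith", "address": "123 Main St"},
    {"id": 2, "last_name": "Smyth", "address": "123 Main Street"},
    {"id": 3, "last_name": "Jones", "address": "9 Elm Ave"},
]

strict = run_matching(records, threshold=0.90)  # initial run
loose = run_matching(records, threshold=0.75)   # second, looser run

# Pairs the looser run catches but the strict run missed are the
# candidates for false-negative review.
print(loose - strict)  # {(1, 2)}
```

The pairs that appear only in the looser run are not automatically merged; they become the review queue for the true-negative analysis described above.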
It is always good to gain new perspectives, especially on something you do frequently. My eyes have definitely been opened to another potential outcome of my matching runs. I feel strongly about the need for data quality, and as a result I feel just as strongly about delivering a return on the investment in data quality initiatives. To do that, I now firmly believe I need to include a process to analyze and, if necessary, remediate false negatives in my future matching routines.