An event based log about a service offering
The Data Quality Chronicle Editorial
Data Quality Assessments can function as an affirmation too
Not all data quality assessments need to report glaring issues with massive implications. Sometimes assessments can function as a way to validate that controls and processes are in place and effective. Reporting this to clients can sometimes be more valuable than reporting only the issues. In fact, even if you do find issues, I’d argue you need to report on where you did not find issues. It’s kind of like a performance review: here’s what you are doing well and here’s what you need to improve on.
Data Quality Resource
Recently a reader, Richard Ordowich, posted this resource in a comment so I thought I’d pass it along.
The most comprehensive list I have seen is in the book Managing Information Quality by Martin Eppler, in which he lists 70 typical information quality criteria compiled from various other sources (and referenced).
Data Profiling & Scorecarding with Informatica Data Quality
In my opinion, profiling and scoring data is a fundamental part of a sound data quality assessment. I routinely use these processes to build my “current state” report for clients. I recently used Informatica’s Data Quality developer and analyst tools to put together such a package.
I am of the opinion that these tools represent the “best in breed” available to do so. The learning curve is not steep, the functionality is easy to implement and, perhaps most of all, the solution is comprehensive. In a matter of hours you go from raw data to a management reporting dashboard.
If you’ve used Informatica or another tool, let me hear your thoughts … (leave a comment)
I guess the word is out?
I recently did some analysis on my visitation statistics. I was almost in disbelief at the numbers. I knew that I had experienced substantial growth this past year but wasn’t aware that it was this drastic.
It is an affirmation of all the hard work that goes into this blog that the popularity has increased so much. I’m honored and humbled that things have progressed so rapidly. It is motivation to continue my efforts in 2011.
It is also a reminder that I need to keep working hard and provide more content that people are interested in. It seems this is underway in January which has set new highs in monthly visits and average hits per day.
I’d like to extend my gratitude to all of those who have helped spread the word about the blog. I draw inspiration from you all on a daily basis.
Data Discovery vs. Data Quality
I’ve been engaged recently on a data discovery project, which is a divergence from my typical tactical role in data quality projects. One thing I have observed and wanted to share is that data discovery efforts, and the tools that support them, are best suited to organizations whose data has grown to the point where nobody knows what data lives where. In short, data discovery is for those with “no clue” of what data is located where. If you know where your important data is, you are more likely to benefit from data profiling than data discovery.
Keep this in mind when selecting tools and organizing an initiative!
Data Federation and Data Quality?
Data Federation will only increase the need for data quality initiatives. In addition it will drive up the complexity of master record determination and master data management.
As I ramp up on technologies like GreenPlum, I cannot help but see a storm of DQ issues coming. Chief among these issues is what version of data is the master. And what if there is no one master copy that works for everyone?
How do you feel about data federation and data quality? Leave a comment and weigh in!
Data Quality ROI = Address Validation and Duplication Consolidation
I have had conversations recently with fellow data quality gurus which centered around DQ ROI. We all know how important it is to tie a DQ initiative to a return on the investment. This is even more true of an initiative with long-term implementation objectives. During the course of the conversation I pointed out that I believe DQ ROI is all about validating addresses and consolidating duplicates and there seemed to be a cathartic agreement that made us all feel like we weren’t crazy (even if it was only a brief feeling of sanity).
Address validation provides a return by increasing revenue assurance and target marketing delivery. In short, mailing to a valid and deliverable address shortens the bill-to-cash cycle. In addition, it avoids return mail charges and provides assurance on bulk mail delivery status. Address validation also increases the potential for, and accuracy of, house-holding efforts, which can significantly reduce marketing costs.
Duplicate consolidation has a similar effect on cost, which in turn provides a return on investment. Consolidating duplicates reduces billing errors incurred due to discrepancies between customer data (duplicate records do not always contain exactly the same data). It also reduces the number of marketing pieces sent to the same customer, an obvious cost avoidance.
A rough ROI calculation can be made by totaling measures like cost per marketing piece, cost of returned marketing pieces, lost marketing opportunity, lost revenue due to bill return, cost of billing remediation and lost revenue due to the loss of ability to bill, and then multiplying the total by the number of invalid addresses and the number of duplicate customers. The exact formula is more formal than this, of course, but you get the idea about how much cost can be avoided by implementing a DQ initiative.
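To make the arithmetic concrete, here is a minimal Python sketch of that rough calculation. It is only one reading of the description above: it assumes every per-record cost applies equally to invalid addresses and duplicate customers, and it simply adds the two counts together. The class name, the function name and all of the dollar figures are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass
class PerRecordCosts:
    """Hypothetical per-record cost measures, in dollars."""
    marketing_piece: float
    returned_marketing_piece: float
    lost_marketing_opportunity: float
    lost_revenue_bill_return: float
    billing_remediation: float
    lost_revenue_unable_to_bill: float

    def total(self) -> float:
        # Sum every cost measure for a single bad record.
        return sum(astuple(self))


def rough_avoidable_cost(costs: PerRecordCosts,
                         invalid_addresses: int,
                         duplicate_customers: int) -> float:
    # Assumption: each bad record (invalid address or duplicate)
    # incurs the full per-record total.
    return costs.total() * (invalid_addresses + duplicate_customers)


# Illustrative numbers only, not taken from any real engagement.
costs = PerRecordCosts(0.5, 0.25, 2.0, 5.0, 3.0, 10.0)
print(rough_avoidable_cost(costs, invalid_addresses=1000, duplicate_customers=500))
# prints 31125.0
```

A more formal model would likely apply different cost measures to invalid addresses than to duplicates, but even this crude version shows how quickly small per-record costs add up.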
Soundex for String Matching
Soundex is a useful function for performing data matching
While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it. Here’s why …
- The algorithm keeps the first letter and encodes the remaining consonants as digits
- Vowels (and H, W and Y) are not encoded unless they are the first letter
- Adjacent consonants that share a digit are coded only once
- Similar sounding consonants share the same digit
- C, G, J, K, Q, S, X and Z are all encoded with the same digit
To illustrate the impact of this type of encoding let’s look at an example of soundex codes for deviations of my first name, William.
As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated in the comparison of the typical nicknames for William. Only Billy and Bill are coded the same, while Will is not coded the same as either Bill or William.
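For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch of the standard American Soundex rules. It is not necessarily the same function or variant used to produce the codes discussed in this post; database and tool implementations can differ in small ways. The function name `soundex` is just an illustrative choice.

```python
# Map each encoded consonant to its Soundex digit.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character standard American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = CODES.get(ch, "")       # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                    # a vowel resets the previous code
    return (result + "000")[:4]        # pad with zeros or truncate to 4


for name in ["William", "Will", "Bill", "Billy"]:
    print(name, soundex(name))
# William W450
# Will W400
# Bill B400
# Billy B400
```

Because the first letter is carried into the code unchanged, no nickname that starts with a different letter can ever match, which is exactly why Bill and William end up with different codes.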
I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.
Data Cleansing every quarter?
This is a clip from a recent tweet from Julian Schwarzenbach of Data and Process Advantage Limited (DPA).
My response to his tweet was “I can see validity [of quarterly cleansing] esp. if the data is from external sources like customers”.
I can see where Julian and others might see quarterly cleansing as a lack of attention to the main issue. His assertion is that if you need to cleanse your data every three months, maybe you have other issues you could address “up-stream” that would alleviate the need to perform cleansing so often. I want to say that I completely agree with this, especially when the data is created, maintained and distributed within an organization.
However, there are quite a few occasions when data is not created or even maintained “in-house” and in this situation it is a good practice to cleanse this data at practical intervals.
An example of data created from outside the organization is customer data. Frequently this data is entered directly by the customer into a database from web-enabled order entry and customer service forms.
Julian would interject to say: increase data quality validations on the web forms and alleviate the need for quarterly cleansing. Agreed! However, this doesn’t assure high quality data, and there is a prevailing thought in the industry that user interface validations degrade the “customer experience”, so there is a reluctance to implement these types of validations.
When considering the statement above in this context, I can see where quarterly data cleansing efforts can be a feasible, practical and even wise practice.
Data Quality & Cloud-based services
Software as a Service (SaaS) will help proliferate data quality solutions
I agree with this assertion for a few reasons, not the least of which is the ease with which “front-end” data quality solutions will be included in the suite of services in a Service Oriented Architecture (SOA).
In my opinion, data quality’s true promise lies in a DQ service that can be integrated into any SOA.
Data Quality: where does it belong?
Data Quality is not a technology issue, it’s a business issue
Here is my opinion on why people think it is about technology. Business initiatives like MDM/BI/DQ and the like are being presented, sold on, and driven by technology experts. Information technology has carried business forward to the point where we are the chauffeurs for change and progress. Without the ability to integrate new technologies into a business, the business fails. In this way, I believe that these disciplines are about technology.
To me, business issues are sales, budgeting, customer service and marketing. Everything else, like operations and reporting, is so fused with technology now that it is, in a sense, a technology issue.
Let me take this theory of mine for a spin in that context. Let’s say we are driving, or “chauffeuring”, a business executive towards a list of master product entries. Does he/she know these entries off the top of their head? Most likely not. How would you “steer” them in the right direction? Probably by querying the data and beginning with a list of values. The decision of which ones are selected might even require a count of popular values. In this way, technology is the vehicle to the master data destination and the business folks are in the back seat selecting routes from the suggestions provided by their driver.
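As a rough illustration of that “list of values with counts” step, here is a minimal Python sketch. The function name, the sample product names, and the choice to trim and upper-case values before counting are all assumptions made for the example; in practice this would more likely be a GROUP BY query against the source system.

```python
from collections import Counter


def candidate_master_values(values, top_n=5):
    """Return the most common values, most frequent first."""
    # Assumption: trim whitespace and ignore case so that obvious
    # variants of the same entry are counted together.
    normalized = (v.strip().upper() for v in values if v and v.strip())
    return Counter(normalized).most_common(top_n)


# Hypothetical product names as they might appear in raw data.
raw = ["Widget", "widget ", "WIDGET", "Gadget", "gadget", "Gizmo", ""]
for value, count in candidate_master_values(raw):
    print(value, count)
# WIDGET 3
# GADGET 2
# GIZMO 1
```

The output is the kind of short, ranked list a driver can hand to the back seat: the business still decides which entries become the master values.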
This content originally appeared as a comment(s) in reference to a blog post by Henrik Liliendahl Sørensen here.
Data Quality: to whom does it belong?
How should data ownership be addressed?
In my opinion, a governance committee is the best option. There should be at least one, probably two, representatives each from the business, from technology and from budgeting. I’d suggest budgeting be the head of the committee so that solid cost-based decisions can be made. Business and technology can present their case for why money should or should not be spent on a data management issue.
This content originally appeared as a comment(s) in reference to a blog post by Charles Blyth here.