The Data Quality Chronicle

An event based log about a service offering

The Data Quality Chronicle Editorial

Data Quality Assessments can function as an affirmation too

Not all data quality assessments need to report glaring issues with massive implications.  Sometimes assessments can function as a way to validate that controls and processes are in place and effective.  Reporting this to clients can sometimes be more valuable than reporting only the issues.  In fact, even if you do find issues, I’d argue you need to report on where you did not find issues.  It’s kind of like a performance review.  Here’s what you are doing well and here’s what you need to improve on.

Your thoughts?

Data Quality Resource

Recently a reader, Richard Ordowich, posted this resource in a comment so I thought I’d pass it along.

The most comprehensive list I have seen is in the book Managing Information Quality by Martin Eppler, in which he lists 70 typical information quality criteria that were compiled from various other sources (and referenced).

Data Profiling & Scorecarding with Informatica Data Quality

In my opinion, profiling and scoring data is a fundamental part of a sound data quality assessment.  I routinely use these processes to build my “current state” report for clients.  I recently used Informatica’s Data Quality developer and analyst tools to put together such a package. 

I am of the opinion that these tools represent the “best in breed” available to do so.  The learning curve is not steep, the functionality is easy to implement and, perhaps most of all, the solution is comprehensive.  In a matter of hours you go from raw data to a management reporting dashboard.

If you’ve used Informatica or another tool, let me hear your thoughts … (leave a comment)

I guess the word is out?

The Data Quality Chronicle Year-on-Year Analysis

I recently did some analysis on my visitation statistics.  I was almost in disbelief at the numbers.  I knew that I had experienced substantial growth this past year but wasn’t aware that it was this drastic. 

It is an affirmation of all the hard work that goes into this blog that the popularity has increased so much.  I’m honored and humbled that things have progressed so rapidly.  It is motivation to continue my efforts in 2011.

It is also a reminder that I need to keep working hard and provide more content that people are interested in.  It seems this is underway in January which has set new highs in monthly visits and average hits per day.

I’d like to extend my gratitude to all of those who have helped spread the word about the blog.  I draw inspiration from you all on a daily basis.

Data Discovery vs. Data Quality

I’ve been engaged recently on a data discovery project, which is a divergence from my typical tactical role in data quality projects.  One thing I have observed that I wanted to share is that data discovery efforts, and the tools that support them, are best suited for organizations whose data has grown or exploded to the point where nobody knows what data is located where.  In short, data discovery is for those with “no clue” of what data is located where.  If you know where your important data is, you are more likely to benefit from data profiling than data discovery. 

Keep this in mind when selecting tools and organizing an initiative!

Data Federation and Data Quality?

Data Federation will only increase the need for data quality initiatives.  In addition it will drive up the complexity of master record determination and master data management.

As I ramp up on technologies like Greenplum, I cannot help but see a storm of DQ issues coming.  Chief among these issues is which version of the data is the master.  And what if there is no one master copy that works for everyone?

How do you feel about data federation and data quality?  Leave a comment and weigh in!

Data Quality ROI = Address Validation and Duplication Consolidation

I have had conversations recently with fellow data quality gurus which centered around DQ ROI.  We all know how important it is to tie a DQ initiative to a return on the investment.  This is even more true of an initiative with long-term implementation objectives.  During the course of the conversation I pointed out that I believe DQ ROI is all about validating addresses and consolidating duplicates and there seemed to be a cathartic agreement that made us all feel like we weren’t crazy (even if it was only a brief feeling of sanity).

Address validation provides a return by increasing revenue assurance and target marketing delivery.  In short, mailing to a valid and deliverable address shortens the bill to cash cycle.  In addition, it provides a cost avoidance on return mail charges and provides assurance on bulk mail delivery status.  Address validation also increases the potential for and accuracy of house-holding efforts, which can significantly reduce the cost of marketing initiatives.

Duplicate consolidation has a similar effect on cost which in turn provides a return on investment.  Consolidating duplicates reduces billing errors incurred due to discrepancies between customer data (duplicate records do not always contain exactly the same data).  It also reduces the number of marketing pieces sent to the same customer, an obvious cost avoidance.

A rough ROI calculation can be determined by totaling measures like cost per marketing piece, cost of return marketing pieces, lost marketing opportunity, lost revenue due to bill return, cost of billing remediation and lost revenue due to the loss of ability to bill, then multiplying these by the number of invalid addresses and the number of duplicate customers.  The exact formula is more formal than this, of course, but you get the idea about how much cost can be avoided by implementing a DQ initiative.
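
To make the arithmetic concrete, here is a minimal sketch in Python.  It is only one way to read the rough calculation above, not the formal formula.  The function name, the cost categories, every dollar figure and the `initiative_cost` parameter are hypothetical, and it assumes the address-related costs scale with invalid addresses while the duplicate-related costs scale with duplicate customers.

```python
def rough_dq_roi(invalid_addresses, duplicate_customers,
                 cost_per_invalid_address, cost_per_duplicate,
                 initiative_cost):
    """Return (avoidable_cost, roi_ratio) for a rough DQ business case."""
    # Total the per-record cost measures, then multiply by the record counts.
    address_cost = invalid_addresses * sum(cost_per_invalid_address.values())
    duplicate_cost = duplicate_customers * sum(cost_per_duplicate.values())
    avoidable_cost = address_cost + duplicate_cost
    # Conventional ROI ratio: (benefit - cost) / cost.
    roi_ratio = (avoidable_cost - initiative_cost) / initiative_cost
    return avoidable_cost, roi_ratio


# Hypothetical per-record costs; replace them with figures from your own
# billing and marketing systems.
per_invalid_address = {
    "returned_marketing_piece": 1.0,
    "lost_marketing_opportunity": 4.0,
    "billing_remediation": 5.0,
}
per_duplicate = {
    "extra_marketing_piece": 1.0,
    "billing_error_remediation": 3.0,
}

avoidable, roi = rough_dq_roi(
    invalid_addresses=20_000,
    duplicate_customers=5_000,
    cost_per_invalid_address=per_invalid_address,
    cost_per_duplicate=per_duplicate,
    initiative_cost=110_000,
)
print(f"Avoidable cost: {avoidable:,.0f}  ROI: {roi:.0%}")
```

Keeping the cost measures in dictionaries makes it easy to add or drop a measure without touching the calculation itself.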

Soundex for String Matching

Soundex is a useful function for performing data matching

While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it.  Here’s why …

  • The algorithm encodes consonants
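  • A vowel is not encoded unless it is the first letter, which is kept as a letter
  • A consonant that repeats the previous consonant’s digit is not coded again when the two are adjacent or separated only by H or W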
  • Similar sounding consonants share the same digit
  • C,G,J,K,Q,S,X,Z are all encoded with the same digit

To illustrate the impact of this type of encoding let’s look at an example of soundex codes for deviations of my first name, William.
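
For reference, here is a minimal sketch of the standard American Soundex rules in Python.  It is an illustration only, not the implementation behind any particular database’s SOUNDEX function, and the `soundex` helper and the list of names are hypothetical choices for this example.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first                  # the first letter is kept as a letter
    prev = CODES.get(first, "")   # digit of the previously seen letter
    for ch in letters[1:]:
        if ch in "HW":
            continue              # H and W are skipped and do not split repeats
        digit = CODES.get(ch, "")  # vowels and Y have no digit
        if digit and digit != prev:
            code += digit
        prev = digit              # a vowel resets prev, so a repeat after it is coded
    return (code + "000")[:4]     # pad with zeros or truncate to four characters


for name in ("William", "Will", "Bill", "Billy"):
    print(name, soundex(name))
```

Run as written, this prints W450 for William, W400 for Will, and B400 for both Bill and Billy.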

As you can see from the brief example above, Soundex codes fall short of matching like strings.  One of my biggest issues with Soundex can be illustrated in the comparison of the typical nicknames for William.  Only Billy and Bill are similarly coded, while Will is not coded similarly to Bill or William.

I plan to dig deeper into Soundex functions and their applicability in a future blog post.  In the meantime, I wanted to get this observation of mine out there for public consumption.

Data Cleansing every quarter?

@jschwa1 Data cleansing every 3 months? – Someones not addressing the right problem!

This is a clip from a recent tweet from Julian Schwarzenbach of Data and Process Advantage Limited (DPA)

My response to his tweet was “I can see validity [of quarterly cleansing] esp. if the data is from external sources like customers”. 

I can see where Julian and others might see quarterly cleansing as a lack of attention to the main issue.  His assertion is that if you need to cleanse your data every three months, maybe you have other issues you could address “up-stream” that would alleviate the need to perform cleansing so often.  I want to say that I completely agree with this, especially when the data is created, maintained and distributed within an organization. 

However, there are quite a few occasions when data is not created or even maintained “in-house” and in this situation it is a good practice to cleanse this data at practical intervals.

An example of data created from outside the organization is customer data.  Frequently this data is entered directly by the customer into a database from web-enabled order entry and customer service forms. 

Julian would interject to say increase data quality validations on the web forms and alleviate the need for quarterly cleansing.  Agreed!  However, this doesn’t assure high quality data, and there is a prevailing thought in the industry that user interface validations decrease the “customer experience”, so there is a reluctance to implement these types of validations.

When considering the statement above in this context, I can see where quarterly data cleansing efforts can be a feasible, practical and even wise practice.

Data Quality & Cloud-based services

Software as a Service (SaaS) will help proliferate data quality solutions

I agree with this assertion for a few reasons, not the least of which is the ease with which “front-end” data quality solutions will be included in the suite of services in a Service Oriented Architecture (SOA).

In my opinion, data quality’s true promise lies in a DQ service that can be integrated into any SOA.

Data Quality: where does it belong?

Data Quality is not a technology issue, it’s a business issue

Here is my opinion on why people think it is about technology. Business initiatives like MDM/BI/DQ and the like are being presented, sold on, and driven by technology experts. Information technology has carried business forward to the point where we are the chauffeurs for change and progress. Without the ability to integrate new technologies into a business, the business fails.  In this way, I believe that these disciplines are about technology.

To me business issues are sales, budgeting, customer service and marketing. Everything else like operations and reporting are so fused with technology now that they are, in a sense, a technology issue.

Let me take this theory of mine for a spin in that context. Let’s say we are driving, or “chauffeuring”, a business executive towards a list of master product entries. Does he/she know these entries off the top of their head? Most likely not. How would you “steer” them in the right direction? Probably by querying the data and beginning with a list of values? The decision of which ones are selected might even require a count of popular values? In this way, technology is the vehicle to the master data destination and the business folks are in the back seat selecting routes from the suggestions provided by their driver.
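
As a rough illustration of that “list of values with counts” step, here is a minimal Python sketch.  The `candidate_master_values` helper, the normalization rule (collapse whitespace, ignore case) and the sample product names are all assumptions made for this example; in practice the same idea is usually a GROUP BY query against the source system.

```python
from collections import Counter


def candidate_master_values(values, top=5):
    """Count normalized values and return the most common candidates."""
    # Collapse repeated whitespace and ignore case so near-identical
    # spellings are counted together; blank entries are skipped.
    normalized = (" ".join(v.split()).upper() for v in values if v and v.strip())
    return Counter(normalized).most_common(top)


# Hypothetical product-name column with inconsistent entries.
products = ["Widget A", "widget a", "Widget  A", "Gadget B", "gadget b", "Gizmo C", ""]

for value, count in candidate_master_values(products):
    print(f"{value}: {count}")
```

The counts are the suggestions; the business representative still chooses which value becomes the master entry.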

This content originally appeared as a comment(s) in reference to a blog post by Henrik Liliendahl Sørensen here.

Data Quality: to whom does it belong?

How should data ownership be addressed?

In my opinion a governance committee is the best option.  There should be at least one, probably two, representatives each from the business, from technology and from budgeting.  I’d suggest that budgeting head the committee so that solid cost-based decisions can be made.  Business and technology can present their case for why money should/should not get spent on a data management issue.

This content originally appeared as a comment(s) in reference to a blog post by Charles Blyth here.


6 responses to “The Data Quality Chronicle Editorial”

  1. Jackie Roberts March 12, 2010 at 4:32 pm

    William, excellent twitter snippets for discussion!!!

    Data cleansing every quarter – in my world of data cleansing we classify, profile, verify, enrich and translate before the data is exported to set up a material master which naturally feeds downward system streams. We also have maintenance processes to re-verify and audit that product information is current. After a while, the relationship is developed with the team of analysts and the manufacturers / suppliers to provide feedback of manufacturer obsolescence or product updates.

    Data Quality & Cloud-base Services – It is imperative that data cleansing is a critical step at set up. I am very interested in on-going data quality maintenance tools and data reporting. From what I can see, there isn’t much thought of data matching and ease of inconsistent data structuring and reconciling or reporting being addressed in the “Cloud based Services”.

    Data Quality: where does it belong & to whom does it belong? Data Quality and Governance needs to be an enterprise solution with a steering committee of the cross functional core disciplines represented. A budget is a must to ensure that data governance and data cleansing is a standard business process as data is the foundation of information quality. The enterprise will have cost saving opportunities that will arise out of a cleansed data quality environment that will also require funding to implement streamlined business processes, such as a virtual inventory sharing program, improved processes of data extracts to improve data processing cost or throughput, etc.

  2. William Sharp March 12, 2010 at 7:40 pm

    Thanks Jackie! Comments are blogging’s sweet reward! I was thinking of you and our recent discussion when I was writing about quarterly data cleansing. I think DQ service advocacy, no matter what periodicity, is good. Like I said, I see Julian’s point of there being a root-cause that is potentially being ignored, however, there are scenarios where the root is outside the organization. Most often it is not possible to require DQ services in these scenarios.
    As for DQ services in the cloud, Informatica has made strides there. You should check out @infacloud on twitter for more info.
    Thanks again, Jackie. I look forward to more discussions with you about the nitty-gritty of DQ, cloud based or otherwise.

    • Julian Schwarzenbach March 25, 2010 at 7:59 am


      Unfortunately, the 140 character limit in Twitter means that messages are sometimes truncated or don’t cover every angle. I accept that where data is coming into an organisation from external sources, then you can be less rigorous about validation. I also recognise that where customers enter data that validation may detract from the ‘customer experience’ however, that should not prevent a reasonable level of drop-down lists, check boxes etc. being used for data entry. Many web sites are still over reliant on free text entry, even for standard items such as country codes.

      Even then, I am still not sure that full data cleansing every quarter is the correct answer – for example, if a customer database contains 10 million entries and is growing at a rate of 100,000 entries per month, then once you minimised the likely causes of error, cleansing should only be required on these 100,000 new records. As any other data changes through internal processes should be appropriately controlled and validated, the vast majority of these 10 million records should not need cleansing. Surely, running a full quarterly cleanse will be a waste of business resources? What about BI generated immediately prior to a cleanse cycle, surely this will not be giving the correct answer?

      I appreciate we all have different backgrounds and perspectives, so there may be other things I have missed. However, clients should still make sure they understand why a vendor is suggesting a quarterly data cleanse (because that is in the vendor’s interest) and check that validation processes are suitable to reduce cleansing to the optimum level.


      • William Sharp March 25, 2010 at 2:27 pm

        So glad you elaborated on this! And although we’ve privately discussed this, let me state that this editorial quip was not intended to slight you, your firm, or your years of domain expertise. In fact, I often learn and gain new perspectives as I read your writings.
        I am also glad you highlighted something I failed to address; incremental cleansing. I agree that cleansing should only be required incrementally. There is one exception and that is when a new cleaning requirement is developed/discovered.
        Now traditionally cleansing does not imply duplicate consolidation. However, it is worth noting that duplicate consolidation would need to be performed on the entire recordset each quarter. I do feel as though this would be a proper best practice to recommend as well.
        Thanks for the comment, Julian. That is exactly what I am aiming for with this page! I find that healthy, respectful debate is often the path to insight.
        Thanks again!

  3. Ivan Chong May 16, 2010 at 10:15 pm

    Well written post -thanks for doing a great job of educating. Informatica has customers that derive address cleansing ROI in the way you mention. They measure revenue assurance via DSO and can directly tie reduced DSO to better billing address quality. One customer remarked that customers who never receive invoices tend not to pay their bills.

    Other customers measure duplicates and easily relate those DQ issues to process inefficiency. My favorite example is where a customer measured duplicate inbound invoices for AP. Not surprisingly, vendors do not complain when they receive multiple payments on the same invoice. Our customer saves millions per quarter just by de-duplicating their AP records.

  4. Alastair McKeating August 31, 2010 at 9:56 pm

    I agree that Greenplum will raise a storm of DQ issues and that’s a good thing. Having spent many years as a data architect I think the bane of my existence was debating the one true master record. A federated view is a more accurate reflection of reality where quality can be enhanced in the context of a specific use while a central governance process aggregates the individual “correct in their context” views into a master view (deliberately using the concept of a view rather than the more restrictive concept of a single physical record).

    Also, my understanding of Greenplum is that it emphasizes the value of collaborative technology which may be the better, less formal, more collective way to advise/warn/adapt at risk inconsistencies into any decision made on the basis of said aggregation as a complement to the formal record structure.
