The Data Quality Chronicle

An event-based log about a service offering

Data Discovery: What it is and what it isn’t

What Data Discovery is …

On a recent engagement I was tasked with performing extensive data discovery on a large amount of data in various systems.  While the normal practice of a data quality initiative is to work toward an established business goal, in this case we were not immediately sure what that goal would include.  The impact of that condition is that we were effectively “fishing” to determine where the data stood.  In essence, we were building a “current state” definition of the data which could then be used to determine what types of data quality goals needed to be established.

Data Discovery is useful in determining a current state of the data environment

As a result of the discovery process, we were able to build profiles of tables that included various aspects of the attributes like data types, field lengths and patterns, and uniqueness.  With standard patterns, lengths and types established, outlier reports were created identifying which attributes required data cleansing and more stringent data governance.  From this analysis, the framework of a data quality program began taking shape.
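To make the idea of a profile and an outlier report concrete, here is a minimal sketch in Python. It is not the tool used on the engagement; the function names (`pattern_of`, `profile_column`), the "A for letter, 9 for digit" pattern convention, and the sample zip code values are all assumptions made for illustration.

```python
from collections import Counter


def pattern_of(value):
    """Reduce a value to a pattern: letters become 'A', digits become '9'.

    Any other character (dash, space, '@', ...) is kept as-is.
    """
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Build a simple profile of one attribute (hypothetical helper)."""
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    lengths = [len(v) for v in non_null]
    patterns = Counter(pattern_of(v) for v in non_null)
    # The most frequent pattern is taken as the "standard" one.
    dominant = patterns.most_common(1)[0][0] if patterns else None
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "distinct": len(set(non_null)),
        "is_unique": len(set(non_null)) == len(non_null),
        "patterns": patterns,
        "dominant_pattern": dominant,
        # Outlier report: values that do not match the standard pattern.
        "outliers": [v for v in non_null if pattern_of(v) != dominant],
    }


# Made-up sample data for illustration only.
zip_codes = ["30301", "30302", "3030", "30301", "ABCDE", None]
profile = profile_column(zip_codes)
print(profile["dominant_pattern"], profile["outliers"])
```

Note that the sketch only reports which values deviate from the most common pattern. Whether the most common pattern is actually the correct one is a question for the business, not for the profile.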

Data Discovery is useful in developing a data quality framework

What Data Discovery is not …

Even though data discovery played an essential role in developing the current state assessment and the framework of the data quality program, it’s important to realize that it did not provide everything required to develop them.  While data discovery can describe numerous aspects of the current state of the data, it cannot determine what the optimal state should be.  Discovery does not have a vision of what should be; it can only describe what is.  Data discovery should not be viewed as a way to replace engaging the business about how they want the data to look.  It is a tool that can help data quality practitioners get up to speed quickly on the current state of the data landscape.  At best, it helps the data quality practitioner make suggestions about what types of data cleansing might be required.

Data Discovery is not able to determine the optimal state of the data environment

Data Discovery is not a replacement for business knowledge and vision


While this post is not particularly detailed, I feel it is an important topic to cover and discuss.  If you listen to the sales hype, it is easy to get the impression that data discovery is a substitute for meeting with the business to build data quality goals.  This is such a dangerous prospect that I have begun to say so at the beginning of all my data discovery conversations.  Don’t get me wrong, I value data discovery tools.  I recognize their importance.  However, I’m a data quality guy and not a business owner.

Does a business owner care about data patterns and lengths?  I doubt it.  It’s critical to present this type of information in a business context that means something to a business owner or business user.  Ultimately this conversation starts with some type of business goal.  For instance, when performing data discovery on email addresses, it would be more effective to explain that direct marketing will be affected because 10% of the data cannot be used to send electronic marketing materials than to report that 10,000 values do not contain an “@” symbol.
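The email example can be sketched in a few lines of Python. This is only an illustration: `summarize_email_quality` is a hypothetical helper, the sample addresses are made up, and the “@” test is a deliberately crude check taken from the example above, not real email validation.

```python
def summarize_email_quality(emails):
    """Report the same finding in technical and in business terms."""
    total = len(emails)
    # Crude check from the example: a missing value or a value
    # without an '@' cannot be used to send electronic marketing.
    unusable = [e for e in emails if e is None or "@" not in e]
    pct = 100.0 * len(unusable) / total if total else 0.0
    return {
        "total": total,
        "unusable_count": len(unusable),
        "unusable_pct": pct,
        # What the profiling output says.
        "technical": f"{len(unusable)} values do not contain an '@' symbol",
        # What the business owner needs to hear.
        "business": f"{pct:.0f}% of records cannot receive electronic marketing materials",
    }


# Made-up sample: nine usable addresses and one that is missing the '@'.
sample = [f"customer{i}@example.com" for i in range(9)] + ["customer9.example.com"]
summary = summarize_email_quality(sample)
print(summary["technical"])
print(summary["business"])
```

Both messages come from the same count. The only difference is that the second one ties the number to a business activity, which is the part discovery alone cannot supply.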

In summary, data discovery is useful for gaining insight into the current state of the data landscape.  However, data discovery is not a “silver bullet solution” to data quality, master data management or data governance initiatives.



2 responses to “Data Discovery: What it is and what it isn’t”

  1. Derek Munro November 16, 2010 at 2:12 pm

    While agreeing with the points you make, I feel that the range of information you describe is not “data discovery” but instead is “data profiling”.
    I wouldn’t say this is “fishing”, as all of it can be supplied “out of the box” by a good profiling tool.
    Data Discovery is more about ad-hoc investigation to understand the relationships that exist in the data, based on equal values,
    similarly formatted values, embedded values, and so on. All this is necessary to develop a DQ framework.
    By associating the rule which uncovers the error with a “measure” it is also possible to provide “objective” business context.
    In your example the invalid email could be associated with the “average sales per customer” to calculate the potential lost opportunity.

    • William Sharp November 16, 2010 at 2:54 pm

      Thanks for stopping by and commenting on the post! I agree 100% with your assertion that profiling is required for a solid DQ framework.
      I think there is a gray area in the industry as to what exactly is the difference between “discovery” & “profiling”. To me, they are one and the same. If you are profiling data for specifics, you will discover, hopefully, some new insight.
      The tone of the post, and specifically the “fishing” comment, is heavily influenced by the common use of discovery and profiling as a substitution for business insight into the data.
      That’s my focal point that I want to communicate. No amount of discovery and profiling will be adequate alone to form a DQ framework. You still need business insight into what you’ve discovered.
      Thanks again,
