What Data Discovery is …
On a recent engagement I was tasked with performing extensive data discovery on a large amount of data in various systems. While the normal practice of a data quality initiative is to work toward an established business goal, in this case we were not immediately sure what that goal would include. As a result, we were effectively “fishing” to determine where the data stood. In essence, we were building a “current state” definition of the data which could then be used to determine what types of data quality goals needed to be established.
Data Discovery is useful in determining a current state of the data environment
As a result of the discovery process, we were able to build profiles of tables that included various aspects of the attributes like data types, field lengths and patterns, and uniqueness. With standard patterns, lengths and types established, outlier reports were created identifying which attributes required data cleansing and more stringent data governance. From this analysis, the framework of a data quality program began taking shape.
Data Discovery is useful in developing a data quality framework
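To make the profiling step concrete, here is a minimal sketch of a column profile and a pattern-based outlier report. It is not the tooling used on the engagement. The function names (`pattern_of`, `profile_column`, `pattern_outliers`), the letter-to-"A" and digit-to-"9" pattern encoding, and the sample ZIP code values are all assumptions made for illustration.

```python
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    shape = []
    for ch in value:
        if ch.isalpha():
            shape.append("A")
        elif ch.isdigit():
            shape.append("9")
        else:
            shape.append(ch)  # keep punctuation and spaces as-is
    return "".join(shape)


def profile_column(values):
    """Summarize length, uniqueness and pattern statistics for one column."""
    populated = [v for v in values if v is not None and v != ""]
    lengths = [len(v) for v in populated]
    distinct = set(populated)
    return {
        "count": len(values),
        "nulls": len(values) - len(populated),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "distinct": len(distinct),
        "is_unique": len(distinct) == len(populated),
        "patterns": Counter(pattern_of(v) for v in populated),
    }


def pattern_outliers(values):
    """Return populated values whose shape differs from the most common shape."""
    populated = [v for v in values if v is not None and v != ""]
    patterns = Counter(pattern_of(v) for v in populated)
    if not patterns:
        return []
    dominant, _ = patterns.most_common(1)[0]
    return [v for v in populated if pattern_of(v) != dominant]


# Hypothetical sample column, invented for this example.
zip_codes = ["30301", "30302", "3030", "30301", None, "ABCDE"]
profile = profile_column(zip_codes)
outliers = pattern_outliers(zip_codes)
print(profile)
print(outliers)
```

Note that the outlier report only says which values deviate from the most common shape. Whether the most common shape is actually the right one is still a question for the business.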
What Data Discovery is not …
Even though data discovery played an essential role in developing the current state assessment and the framework of the data quality program, it’s important to realize that it did not provide everything required to develop them. While data discovery can describe numerous aspects of the current state of the data, it cannot determine what the optimal state should be. Discovery does not have a vision of what should be; it can only describe what is. Data discovery should not be viewed as a way to replace engaging the business about how they want the data to look. It is a data tool that can help data quality practitioners get up to speed quickly on the current state of the data landscape. At best, it helps the data quality practitioner make suggestions about what types of data cleansing might be required.
Data Discovery is not able to determine the optimal state of the data environment
Data Discovery is not a replacement for business knowledge and vision
While this post is not particularly detailed, I feel the topic is an important one to cover and discuss. If you listen to the sales hype, it is easy to get the impression that data discovery is a substitute for meeting with the business to build data quality goals. That impression is so dangerous that I have begun to address it at the beginning of all my data discovery conversations. Don’t get me wrong, I value data discovery tools. I recognize their importance. However, I’m a data quality guy and not a business owner.
Does a business owner care about data patterns and lengths? I doubt it. It’s critical to be able to present this type of information in a business context that means something to a business owner or business user. Ultimately this conversation starts with some type of business goal. For instance, when performing data discovery on email addresses, it would be more effective to explain that direct marketing will be affected because 10% of the data cannot be used to send electronic marketing materials than to report that 10,000 values do not contain an “@” symbol.
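The email example can be sketched in a few lines. This is only an illustration of turning a raw discovery count into a statement a business owner can act on. The function names, the wording of the summary sentence and the sample data are assumptions, and the check only looks for a missing “@” as in the example above; it is not full email validation.

```python
def count_unusable_emails(emails):
    """Count values that are missing or do not contain an '@' symbol."""
    return sum(1 for e in emails if e is None or "@" not in e)


def business_summary(emails):
    """Phrase the discovery result in terms of marketing reach."""
    total = len(emails)
    unusable = count_unusable_emails(emails)
    share = unusable / total if total else 0.0
    return (
        f"{share:.0%} of customer records ({unusable:,} of {total:,}) "
        "cannot receive email marketing"
    )


# Hypothetical data set: 100,000 records, 10,000 of them without an '@'.
emails = ["customer@example.com"] * 90_000 + ["customer.example.com"] * 10_000
print(business_summary(emails))
```

The technical finding is the same either way. What changes is that the sentence handed to the business owner leads with the impact rather than the pattern.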
In summary, data discovery is useful for gaining insight into the current state of the data landscape. However, it is not a “silver bullet” solution for data quality, master data management or data governance initiatives.