An event based log about a service offering
A Data Quality Checklist
As a follow-up to my previous post, I have compiled a checklist that can be used to manage the effort of preparing for a data quality engagement.
- Form a data quality mission statement
- Define the systems and data in-scope
- Define the data quality environment
- Extract the data and land it to the data quality environment
Form a data quality mission statement
- This statement should include the nature of the data issues encountered and their business impacts. It can be developed from the tales of data issues recounted by business stakeholders, from logs or remediation requests submitted to the data management team, or from the requirements of upcoming data migration projects.
In an effort to reduce marketing costs, a data quality effort will be initiated to reduce customer record duplication by forming a master customer record, to identify and correct invalid customer addresses, and to form customer households.
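A mission statement like the one above is easier to track when the issue it names has a baseline number attached. The following is a minimal sketch of one way to estimate a customer duplication rate, assuming records are plain dictionaries. The field names and the crude matching rule (normalized name plus postal code) are illustrative assumptions, not a substitute for the matching a data quality toolset would perform.

```python
from collections import Counter


def match_key(record):
    """Build a crude match key: lower-cased, whitespace-collapsed name parts plus postal code."""
    # Hypothetical field names; real CRM attributes will differ.
    parts = (record["first_name"], record["last_name"], record["postal_code"])
    return tuple(" ".join(p.lower().split()) for p in parts)


def duplicate_rate(records):
    """Return the fraction of records that are redundant copies under match_key."""
    if not records:
        return 0.0
    counts = Counter(match_key(r) for r in records)
    # Each group of n matching records contains n - 1 redundant copies.
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(records)
```

Running this against an extract before and after remediation gives a rough before-and-after figure to report against the mission statement.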
Define the systems and data in-scope
- Once the data issues have been identified, the most significant data elements can be defined; these constitute the elements in scope. Significance can be determined by identifying the data most closely tied to the business impacts named in the mission statement.
The data that supports forming customer master records, correcting invalid addresses and forming households includes customer name and address, both sourced from the CRM application. Customer name consists of the following attributes: name prefix, first name, middle initial, last name and name suffix. Customer address consists of the following attributes: street number, street name, suite or apartment number, city, state, postal code, and country code/name.
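Writing the scope down in a machine-readable form makes it possible to check an extract against it later. Here is a minimal sketch, assuming the in-scope attributes listed above are mapped to column identifiers. The identifiers themselves are hypothetical and would need to be replaced with the actual CRM column names.

```python
# Hypothetical column identifiers for the in-scope attributes;
# the real CRM column names will differ.
IN_SCOPE = {
    "customer_name": [
        "name_prefix", "first_name", "middle_initial", "last_name", "name_suffix",
    ],
    "customer_address": [
        "street_number", "street_name", "suite_or_apartment_number",
        "city", "state", "postal_code", "country",
    ],
}


def missing_attributes(extract_columns):
    """Return in-scope attributes absent from an extract's columns, grouped by entity."""
    present = set(extract_columns)
    gaps = {}
    for entity, attrs in IN_SCOPE.items():
        absent = [a for a in attrs if a not in present]
        if absent:
            gaps[entity] = absent
    return gaps
```

An empty result means every in-scope attribute is present in the extract; anything else is a gap to raise with the team supplying the data.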
Define the data quality environment
- There are some options worth investigating when considering a data quality environment. The main criteria for selecting a data quality environment are cost, available resources and implementation time. Quite often data quality initiatives have a shorter timeframe than most information technology projects. They also typically have a smaller budget allocation. However, they most definitely require additional resources. This mix makes it necessary to plan the deployment of the data quality environment. One possible strategy is to procure the required hardware and software, dedicate a team of trained data quality practitioners and begin the work of remediating data. Another possible strategy is to leverage lease-based solutions, where data is hosted on a secured cloud instance, lease terms are defined for software and experienced consultants are engaged to conduct the analysis. Often these two strategies are combined in phases, where the initial phase uses the latter and the subsequent phase uses the former.
Hardware requirements include a dedicated server with an 8-core CPU, 8 GB of RAM and at least 200-400 GB of fast disk storage.
Software requirements include a database management server and a data quality toolset that includes assessment, remediation and reporting functions.
Human resource requirements include a dedicated team that has knowledge of typical data quality issues, data management techniques and business intelligence reporting.
Extract the data and land it to the data quality environment
- Now that we know why we are performing data quality analysis, what data we are performing it on and where we are performing the analysis, it is time to get the data staged into the data quality environment. This step is often underestimated and requires a level of planning that includes resources to extract the data and scheduling to coordinate appropriate times for data extraction and transfer.
The CRM DBA team develops and initiates the required queries to extract the defined data during production off-hours and transfers the data to the data quality environment no later than 24 hours after extraction completes.
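The extraction step can be sketched in a few lines. The example below is a minimal illustration, not the CRM team's actual process: it uses Python's built-in `sqlite3` module as a stand-in for the CRM database, and the table and column names are hypothetical. It writes the selected columns to CSV and returns the row count so the figure can be reconciled after the file lands in the data quality environment.

```python
import csv
import sqlite3

# Hypothetical subset of in-scope columns; real names come from the CRM schema.
EXTRACT_COLUMNS = ["first_name", "last_name", "city", "postal_code"]


def extract_to_csv(conn, table, columns, out_file):
    """Write the selected columns of a table to out_file as CSV and return the row count."""
    # Table and column identifiers come from a fixed in-code list, never from user input.
    query = "SELECT {} FROM {}".format(", ".join(columns), table)
    writer = csv.writer(out_file)
    writer.writerow(columns)  # header row
    count = 0
    for row in conn.execute(query):
        writer.writerow(row)
        count += 1
    return count
```

A usage example against an in-memory database (sample data invented for illustration):

```python
import io

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer "
    "(first_name TEXT, last_name TEXT, city TEXT, postal_code TEXT, notes TEXT)"
)
conn.execute("INSERT INTO customer VALUES ('Ann', 'Lee', 'Boston', '02108', 'n/a')")
buffer = io.StringIO()
rows_written = extract_to_csv(conn, "customer", EXTRACT_COLUMNS, buffer)
```

Comparing the returned count with the number of rows loaded on the receiving side is a cheap check that nothing was lost in transfer.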
It should be evident from this checklist that quite a few time-intensive tasks need to take place before data quality analysis can begin. I’ve listed them in the order in which I have most often found them to occur. Following this checklist is a way to establish momentum for a data quality project from the outset while reducing resource churn. I formed it after encountering a lack of momentum and experiencing extensive resource churn over the years. Hopefully it can spare others the same experience by setting clear expectations of what it takes to be prepared!