Data Quality Basic Training
February 2, 2011
Recently a reader asked me if I had any posts on “data quality basics”. It turns out I didn’t, so I’ve decided to put together a series of posts covering what I feel are the basic essentials of a data quality program.
The many sides of data quality
It is generally accepted in the data quality world that there are seven categories by which data quality can be analyzed. These include the following:
- Conformity – Analyzing data for conformity measures adherence to data definition standards. This can include determining whether data is of the correct type and length (a small sketch of this kind of check appears after this list)
- Consistency – Analyzing data for consistency measures whether data is uniformly represented across systems. This can involve comparing the same attribute across the various systems in the enterprise
- Completeness – Analyzing data for completeness measures whether required data is populated. This can involve one or more elements and is usually tightly coupled with required-field validation rules
- Accuracy – Analyzing data for accuracy measures whether data is nonstandard or out of date. This can involve comparing data against standards like USPS deliverability and ASCII code references
- Integrity – Analyzing data for integrity measures the references that link information, such as customers and their addresses. In that example, the analysis would determine which addresses are not associated with customers
- Duplication – Analyzing data for duplication measures the pervasiveness of redundant records. This involves determining which pieces of information uniquely define a record and identifying the extent to which duplicates are present
- Timeliness – Analyzing data for timeliness measures the availability of data. This involves analyzing the creation, update, or deletion of data and the business processes that depend on it
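To make a couple of these concrete, here is a minimal sketch of conformity and completeness checks. The customer records, field names, and rules are hypothetical assumptions for illustration only, not a standard.

```python
import re

# Hypothetical customer records used only to illustrate the checks.
records = [
    {"customer_id": "C001", "state": "TX", "zip": "75001"},
    {"customer_id": "C002", "state": "Texas", "zip": ""},    # conformity and completeness issues
    {"customer_id": "C003", "state": "CA", "zip": "9410"},   # conformity issue (length)
]

STATE_PATTERN = re.compile(r"^[A-Z]{2}$")   # expected: two-letter state code
ZIP_PATTERN = re.compile(r"^\d{5}$")        # expected: five-digit ZIP
REQUIRED = ["customer_id", "state", "zip"]

for rec in records:
    # Completeness: required fields must be populated.
    missing = [f for f in REQUIRED if not rec.get(f)]
    # Conformity: populated values must match the defined type/length patterns.
    nonconforming = [f for f, pattern in (("state", STATE_PATTERN), ("zip", ZIP_PATTERN))
                     if rec.get(f) and not pattern.match(rec[f])]
    if missing or nonconforming:
        print(rec["customer_id"], "missing:", missing, "nonconforming:", nonconforming)
```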
Data Quality In Motion
There is a core set of activities that defines data quality work. In my opinion, data quality is about change, and as a result many data quality activities have to do with changes to data. The following is my list of the basic data quality activities:
- Data Profiling
- Data Cleansing
- Data Standardization
- Data Deduplication
- Data Consolidation
- Data Validation
- Data Profiling – during data profiling the main goal is to define the nature and magnitude of the data quality issues
- Data Cleansing – during data cleansing the main goal is to transform data from an original value that contains inappropriate values or errors to one that is free of error
- Data Standardization – during data standardization the main goal is to transform data from the original value to a defined enterprise standard
- Data Deduplication – during data deduplication the main goal is to identify the redundant data present
- Data Consolidation – during data consolidation the main goal is to reduce the redundancy present through survivorship rules
- Data Validation – during data validation the main goal is to verify that the cleansing, standardization, duplicate identification, and consolidation have been conducted in a manner that complies with business requirements
Not all data quality efforts require all six steps; however, there are two activities mentioned above that, in my opinion, should be included in every effort: data profiling and data validation.
Without a data profiling step, the effort is reduced to the proverbial dog chasing its tail. Profiling gives the effort direction and an accurate estimate of when the destination can be reached.
Without data validation, you cannot be confident that the direction gleaned from the profiling is the right one. The list may be somewhat misleading in that data validation really happens after each one of those activities and not at the end of the effort.
There are some routine processes within each data quality activity. These processes seem to cut across data quality efforts in various industries and systems.
Data profiling often involves investigating data types and lengths, as well as the pervasiveness of non-alphanumeric characters in the data. These types of errors often cause issues with data migration efforts. From a data quality perspective, these errors fall into the data conformity domain. Moreover, they can also affect the data timeliness domain as a result of data migration failures. As you can see, data quality domains and their issues are closely related and often have a ripple effect.
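As an illustration, here is a minimal profiling sketch over two hypothetical columns (the column names and values are made up): for each column it tallies value lengths, how many values are purely numeric, and how often non-alphanumeric characters appear.

```python
from collections import Counter

# Hypothetical rows standing in for a source table being profiled.
rows = [
    {"name": "ACME Corp.", "founded": "1999"},
    {"name": "Globex*", "founded": "2OO1"},   # letter O typed instead of zero
    {"name": "Initech", "founded": "2004"},
]

for column in ("name", "founded"):
    values = [r[column] for r in rows]
    lengths = Counter(len(v) for v in values)                       # length distribution
    numeric = sum(v.isdigit() for v in values)                      # purely numeric values
    non_alnum = sum(any(not (c.isalnum() or c.isspace()) for c in v)
                    for v in values)                                 # special-character pervasiveness
    print(f"{column}: lengths={dict(lengths)}, "
          f"numeric={numeric}/{len(values)}, non-alphanumeric={non_alnum}/{len(values)}")
```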
Data cleansing typically focuses on the removal of things like special characters from data but can also include conversion from one data type to another. If you think about dirty data as data not looking like it should, then data cleansing makes a little more sense.
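A small cleansing sketch along those lines, assuming a hypothetical record with a name field containing special characters and a numeric field stored as text:

```python
import re

raw = {"name": "ACME\tCorp.##", "employee_count": " 1,250 "}

name = re.sub(r"\s+", " ", raw["name"])             # collapse tabs and repeated whitespace
name = re.sub(r"[^A-Za-z0-9. ]", "", name).strip()  # drop remaining special characters

cleansed = {
    "name": name,
    # drop the thousands separator and convert the text value to an integer
    "employee_count": int(raw["employee_count"].replace(",", "").strip()),
}
print(cleansed)   # {'name': 'ACME Corp.', 'employee_count': 1250}
```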
Data standardization is the process of converting data to a commonly accepted norm, or standard. Typically this is done with various reference files or tables. For instance, there are commonly accepted abbreviations for things like state and country codes. Data standardization is the process by which conversion from the original source value to the defined standard occurs.
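A minimal sketch of that conversion, using a tiny in-memory lookup in place of a real reference file or table; the mappings shown are just examples, and a real effort would load the agreed enterprise reference instead.

```python
# Hypothetical reference table mapping raw state values to the two-letter standard.
STATE_CODES = {"texas": "TX", "tx": "TX", "california": "CA", "calif.": "CA"}

def standardize_state(value: str) -> str:
    """Map a raw state value to the standard code, or flag it for review."""
    return STATE_CODES.get(value.strip().lower(), f"UNMAPPED:{value.strip()}")

print(standardize_state(" Texas "))     # TX
print(standardize_state("Kalifornia"))  # UNMAPPED:Kalifornia
```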
Data deduplication typically focuses on either the customer or product domains. Often organizations that grow through merger and acquisition end up with multiple records for the same customers and/or products. Being able to gauge the magnitude of this issue and remediate it is one of the most compelling offerings in the data quality space.
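Here is a minimal deduplication sketch over hypothetical customer records: it groups records on a simple match key (normalized name plus postal code) and reports groups with more than one record. Real matching usually adds fuzzy comparison; this only shows the shape of the exercise.

```python
from collections import defaultdict

customers = [
    {"id": 1, "name": "Acme Corp",  "zip": "75001"},
    {"id": 2, "name": "ACME Corp.", "zip": "75001"},
    {"id": 3, "name": "Globex",     "zip": "94105"},
]

groups = defaultdict(list)
for c in customers:
    # Match key: lowercase alphanumeric name + postal code.
    key = ("".join(ch for ch in c["name"].lower() if ch.isalnum()), c["zip"])
    groups[key].append(c["id"])

duplicates = {k: ids for k, ids in groups.items() if len(ids) > 1}
print(duplicates)   # {('acmecorp', '75001'): [1, 2]}
```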
Data consolidation is performed in conjunction with duplicate identification and is the process by which the duplicates are reduced. Often this process is dependent on “survivorship” rules that define which record is kept and which is archived. Sometimes there are rules which condense the two records into a master record. Simple examples of survivorship rules include most recent, most complete, or source system based rules. It is crucial to note that the non-surviving record is archived and not discarded.
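A sketch of one such survivorship rule, applied to a hypothetical duplicate group: most recent record wins, with completeness as the tie-breaker, and the non-surviving record is kept rather than discarded.

```python
from datetime import date

duplicate_group = [
    {"id": 1, "name": "Acme Corp",  "phone": None,       "updated": date(2009, 5, 1)},
    {"id": 2, "name": "ACME Corp.", "phone": "555-0100", "updated": date(2010, 11, 3)},
]

def completeness(rec):
    """Count populated fields, used as the tie-breaker."""
    return sum(v is not None for v in rec.values())

# Survivorship rule: most recent wins; if tied, the more complete record wins.
survivor = max(duplicate_group, key=lambda r: (r["updated"], completeness(r)))
archived = [r["id"] for r in duplicate_group if r["id"] != survivor["id"]]
print("survivor:", survivor["id"], "archived:", archived)
```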
Data validation is the process by which the data remediation steps taken and their results are reviewed with the data owner. Typically it is not possible to go through all the records involved, so various use cases are often selected prior to remediation and reviewed once the work is completed. Use cases involve data or records that are known to be an issue. This process is effectively the before-and-after snapshot and provides the data owner with a comfort level that the changes implemented are valid for all the data or records in question.
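A minimal sketch of that before-and-after review, assuming the hypothetical use-case records from earlier: it compares each selected record before and after remediation and reports what changed, so the data owner can sign off.

```python
# Before-and-after snapshots of the selected use-case records.
before = {"C002": {"state": "Texas", "zip": ""},
          "C003": {"state": "CA",    "zip": "9410"}}
after  = {"C002": {"state": "TX",    "zip": "75001"},
          "C003": {"state": "CA",    "zip": "94105"}}

for rec_id in before:
    changes = {field: (before[rec_id][field], after[rec_id][field])
               for field in before[rec_id]
               if before[rec_id][field] != after[rec_id][field]}
    print(rec_id, changes)
```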
Off to battle!
I think that covers a great deal of what data quality efforts involve. I try to revisit these basics from time to time and drill into more detail when I come across examples. Hopefully this post contains enough information for those of you that are interested to go out and win some data quality battles.