Definition of data quality
Data quality is a measure of how well data is suited to serve a specified purpose in a particular context. The same data can therefore be high quality for one use and low quality for another; data quality is considered high when the data is fit for its intended purpose.
Common factors (also referred to as data quality dimensions) used to determine data quality are accuracy, completeness, consistency, timeliness, uniqueness, and validity. Among the issues that data quality dimensions take into consideration are the presence and frequency of duplicated data, incomplete data, inconsistent data, incorrect data, ill-defined data, poorly organized data, and data that lacks sufficient security controls.
Organizations prioritize data quality in proportion to the value of their data. Data has always been a valuable resource, and its role in all aspects of an organization’s operations continues to grow.
Data quality is critical for powering analytics used across organizations to inform decisions related to everything from finance and compliance work to sales and marketing.
Data quality is a key driver for data management and data governance programs. These programs strive to optimize and protect data quality, helping to identify opportunities to make it better and risks that jeopardize it. This is important as poor data quality can lead to inaccurate analysis, result in negative outcomes for an organization, and create compliance risks.
Data quality vs data integrity vs data profiling
Data Quality | Data Integrity | Data Profiling |
---|---|---|
Data quality measures how well the data serves its intended purpose. Data quality efforts identify and correct errors within an organization’s data sets. | Data integrity measures the accuracy and consistency of data over its lifecycle to track data quality. Data integrity focuses on preserving and protecting data’s original state as it is stored, retrieved, and processed. | Data profiling refers to the process of examining, analyzing, reviewing, and summarizing data to assess data quality. Data profiling also includes reviewing source data to understand its structure, content, and interrelationships. |
Data quality dimensions
The quality of data dictates its value to an organization. A number of metrics are used to evaluate data quality, establish that value, and identify areas for improvement. The following are six of the most widely used data quality dimensions.
Accuracy
Often cited as the most important measure of data quality, accuracy refers to the degree to which information correctly represents the real-world thing it describes (e.g., an event or object). Accuracy matters because functions that rely on the information can only act appropriately and deliver the expected results if the information is correct. For instance, whether an employee’s start date or position in an organization is recorded correctly can affect the benefits that they are entitled to receive.
Ways in which the accuracy data quality dimension can be measured are:
- How current is the data and are there concerns with stale data?
- How do data values compare to standard or reference values provided by a reliable source?
- How do the data values compare to a physical measurement or physical observations?
- How well does a piece of information reflect reality?
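One of the measures above, comparing data values with reference values from a reliable source, can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name `accuracy_rate`, the record layout, and the sample HR reference data are all hypothetical.

```python
def accuracy_rate(records, reference, key, field):
    """Return the fraction of records whose `field` matches the reference value.

    Records with no entry in the reference source are skipped, because there
    is nothing trusted to compare them against.
    """
    checked = 0
    matches = 0
    for record in records:
        ref = reference.get(record[key])
        if ref is None:
            continue
        checked += 1
        if record[field] == ref[field]:
            matches += 1
    return matches / checked if checked else None


# Hypothetical employee records and a trusted HR reference keyed by employee id.
employees = [
    {"id": 1, "start_date": "2020-03-01"},
    {"id": 2, "start_date": "2021-07-15"},
    {"id": 3, "start_date": "2019-11-30"},
]
hr_reference = {
    1: {"start_date": "2020-03-01"},
    2: {"start_date": "2021-07-16"},  # disagrees with the record above
    3: {"start_date": "2019-11-30"},
}

rate = accuracy_rate(employees, hr_reference, "id", "start_date")
print(f"accuracy: {rate:.0%}")
```

A score like this is only as trustworthy as the reference source itself, so the choice of reference is the real design decision.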
Completeness
The completeness data quality dimension measures the percentage of data populated with regard to requirements for high data quality; 100% fulfillment is the ideal. At this point, data quality meets all expectations for comprehensiveness needed to achieve stated objectives.
“100%” is slightly deceptive as it could mean different things for different use cases. For instance, in some cases, a first and last name and phone number are all that is required (e.g., dropping off a product for service), whereas in others, a complete record of a person’s contact information is required to complete a particular function (e.g., shipping a product).
Ways in which the completeness data quality dimension can be measured are:
- Are known records missing?
- Does the data fulfill users’ expectations and requirements for what is comprehensive?
- Is data truncated?
- What percentage of necessary values are missing from a dataset?
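The percentage-based measure above can be sketched as follows. The example assumes records are plain dictionaries and treats `None` and blank strings as missing; the field names and the two requirement lists are hypothetical, chosen to mirror the drop-off and shipping cases described earlier.

```python
def is_populated(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is not None and str(value).strip() != ""


def completeness(records, required_fields):
    """Percentage of required values that are populated across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 100.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if is_populated(record.get(field))
    )
    return 100.0 * filled / total


contacts = [
    {"first": "Ana", "last": "Lopez", "phone": "800-222-3333", "address": "1 Main St"},
    {"first": "Ben", "last": "Chu", "phone": "", "address": None},
]

# The same data scores differently depending on what the use case requires.
drop_off = completeness(contacts, ["first", "last", "phone"])
shipping = completeness(contacts, ["first", "last", "phone", "address"])
print(round(drop_off, 1), round(shipping, 1))
```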
Consistency
The consistency data quality dimension measures how closely data follows the same format across data sets. Examples of this are:
- Dates written as numbers or words, such as January 1, 1999 vs 01/01/1999 or 01/01/99
- Phone number formatting, such as using dashes or periods (e.g., 800-222-3333 or 800.222.3333)
- Use of capitalization, such as sentence case vs title case
Ways in which the consistency data quality dimension can be measured are:
- Do all records in a data set use the same format for information?
- Does information stored in one place match comparable data stored elsewhere?
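The first question above, whether all records use the same format, can be approximated by classifying each value against a set of known patterns and counting how many distinct formats appear. The patterns below are illustrative assumptions covering only the date examples given earlier; a real data set would need its own list.

```python
import re
from collections import Counter

# Hypothetical patterns for the date formats mentioned in the examples above.
DATE_PATTERNS = {
    "MM/DD/YYYY": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "MM/DD/YY": re.compile(r"\d{2}/\d{2}/\d{2}"),
    "Month D, YYYY": re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}"),
}


def classify_format(value):
    """Return the name of the first pattern that matches the whole value."""
    for name, pattern in DATE_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "unrecognized"


def format_counts(values):
    """Count how many values fall into each format."""
    return Counter(classify_format(v) for v in values)


dates = ["January 1, 1999", "01/01/1999", "01/01/99", "01/02/1999"]
counts = format_counts(dates)
is_consistent = len(counts) == 1  # consistent only if a single format is in use
print(dict(counts), is_consistent)
```

This checks shape only; it does not confirm that the values are real dates or that they agree with comparable data stored elsewhere.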
Timeliness
Timeliness, as a data quality dimension, is the delay between the time an event actually occurs and the time information about it is captured in a system and made available to users. Users’ expectations and requirements determine how much delay is acceptable.
Ways in which the timeliness data quality dimension can be measured are:
- Does information availability cause delays in processes?
- Is information available when users need it?
- What is the lag in data capture and information availability?
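The lag between capture and availability can be computed directly when both timestamps are recorded. The sketch below assumes each record carries an `event_time` and an `available_time` (hypothetical field names) and that the acceptable lag is set by the users of the data.

```python
from datetime import datetime, timedelta


def availability_lag(event_time, available_time):
    """Time between the event occurring and the data becoming available."""
    return available_time - event_time


def late_records(records, max_lag):
    """Return records whose lag exceeds what users have said they can accept."""
    return [
        r
        for r in records
        if availability_lag(r["event_time"], r["available_time"]) > max_lag
    ]


records = [
    {
        "id": "a",
        "event_time": datetime(2024, 1, 1, 9, 0),
        "available_time": datetime(2024, 1, 1, 9, 5),
    },
    {
        "id": "b",
        "event_time": datetime(2024, 1, 1, 9, 0),
        "available_time": datetime(2024, 1, 1, 11, 30),
    },
]

late = late_records(records, max_lag=timedelta(hours=1))
print([r["id"] for r in late])
```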
Uniqueness
Data duplication negatively impacts data quality. To maintain high data quality, only one instance of information should appear in a database.
In effect, uniqueness measures duplicate records, which could include the same record repeated with slight variations, such as Jonathan Smith being repeated as Jon Smith. Uniqueness should be measured within a data set and across all other datasets, for instance, in accounting and sales systems.
Ways in which the uniqueness data quality dimension can be measured include:
- Are data elements duplicated across multiple fields?
- Is one entity represented multiple times with the same identity?
- Do two identities represent one entity?
- Is this the only instance in which this information appears in a database?
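A simple way to approximate these checks is to normalize a key field and count how often each key occurs. In the sketch below, the choice of `email` as the identifying key, the normalization rules, and the sample customers are assumptions for illustration.

```python
from collections import Counter


def normalize(value):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(value.lower().split())


def record_key(record, key_fields):
    return tuple(normalize(record[f]) for f in key_fields)


def duplicate_keys(records, key_fields):
    """Return each key that appears more than once, with its count."""
    counts = Counter(record_key(r, key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


def uniqueness_pct(records, key_fields):
    """Percentage of records that are distinct on the chosen key."""
    if not records:
        return 100.0
    distinct = len({record_key(r, key_fields) for r in records})
    return 100.0 * distinct / len(records)


customers = [
    {"name": "Jonathan Smith", "email": "jsmith@example.com"},
    {"name": "Jon Smith", "email": "JSmith@example.com "},
    {"name": "Maria Garcia", "email": "mgarcia@example.com"},
]

dupes = duplicate_keys(customers, ["email"])
print(dupes, round(uniqueness_pct(customers, ["email"]), 1))
```

Here the two Smith records are caught only because they share an email address. Variations in the name itself, such as Jonathan versus Jon, need approximate matching rather than exact keys.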
Validity
Validity is the data quality dimension that describes how closely information conforms to the values, formats, and ranges it is allowed to take. Information that fails validity requirements may be rejected by the system or, if it is accepted, lower overall data quality. To be valid, information needs to align with predetermined values or business rules.
Ways in which the validity data quality dimension can be measured are:
- Does information conform to the format, range of values (e.g., numeric or date), or sequence of events specified by business rules?
- Is information in a format the receiving system can use (e.g., a date supplied as MM/DD/YYYY is unusable in a system that only accepts MM/DD/YY)?
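Validity rules of this kind can be expressed as small predicate functions, one per field. The sketch below is one possible arrangement, not a standard API: the `RULES` table, the field names, and the quantity range of 1 to 1000 are hypothetical, and the date rule mirrors the MM/DD/YY case above.

```python
from datetime import datetime


def valid_date(value, fmt):
    """True if `value` is a string that parses exactly under `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False


def in_range(value, low, high):
    return low <= value <= high


# Hypothetical business rules: field name -> predicate.
RULES = {
    "order_date": lambda v: valid_date(v, "%m/%d/%y"),  # system accepts MM/DD/YY only
    "quantity": lambda v: isinstance(v, int) and in_range(v, 1, 1000),
}


def violations(record, rules):
    """Return the names of fields that fail their rule."""
    return [field for field, rule in rules.items() if not rule(record.get(field))]


good = {"order_date": "01/15/24", "quantity": 3}
bad = {"order_date": "01/15/2024", "quantity": 0}
print(violations(good, RULES), violations(bad, RULES))
```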
Ensuring adherence to data quality dimensions
Steps that can be taken to assure high ratings in these data quality dimensions are:
- Assess whether the information represents the reality of the situation.
- Consider how data measures up to the data quality dimensions across all of an organization’s resources (e.g., a data format used in different systems, such as finance, sales, and customer support).
- Identify and update incorrect data.
- Leverage data management and data governance systems and best practices.
- Use testing to assure the accuracy of data.
Why data quality is important
Poor data quality causes a range of issues, including:
- Expenses related to correcting data errors
- Fines for improper financial or regulatory compliance reporting
- Inaccurate analytics that negatively impact decision-making
- Increased data processing costs
- Loss of brand value
- Lost sales opportunities
Maintaining high data quality delivers many benefits, including:
- Avoiding operational errors and process breakdowns that can increase operating expenses and reduce revenue
- Engaging with customers more effectively
- Enhancing operational efficiency and productivity
- Extracting greater value from data sets
- Freeing up data management teams to focus on more productive tasks
- Gaining a competitive edge
- Improving internal processes
- Increasing the accuracy of analytics to improve decision-making
- Informing decisions across the organization (e.g., marketing, product development, sales, and finance)
- Reducing risks and costs
- Reducing the cost of identifying and fixing bad data in systems
What is data quality assurance?
Data quality assurance is a collection of processes that are used to improve data quality. To establish and maintain a high standard for data quality, data sets are cleaned and reviewed to ensure that there are no anomalies, inconsistencies, or obsolete information.
Data quality assurance uses data profiling and cleansing to ensure data quality throughout the lifecycle.
This work should be conducted prior to and while acquiring data and should be an ongoing process to identify and eliminate distortions caused by people or external factors.
The overall process for data quality assurance follows six key steps.
Step one—define metrics for data quality assurance
Define the standards for data quality to provide metrics for data quality assurance work. Commonly used data quality standards include:
- Accuracy
- Completeness
- Comprehensibility
- Precision
- Relevancy
- Timeliness
- Trustworthiness
- Validity
Examples of specific data quality checks include:
- Applying formatting checks.
- Checking for mandatory fields, null values, and missing values.
- Checking how recent the data is or when it was last updated.
- Identifying duplicates or overlaps.
- Applying business rules that define valid ranges of values and default values.
- Running row, column, conformity, and value checks.
Step two—conduct data profiling for data quality assurance
Perform data profiling for data quality assurance to review, cleanse, and monitor data. The objective is to understand how data is structured, its content, and relationships to maintain data quality standards.
- Structure discovery
  The structure discovery part of data profiling validates that data is consistent and formatted according to data quality standards.
- Content discovery
  In data profiling, content discovery closely examines each element of a data set to check data quality.
- Relationship discovery
  To ensure that data quality is maintained across data sets, relationship discovery identifies connections between the data sets and confirms alignment.
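The three kinds of discovery described above can be illustrated with a small profiling sketch. It assumes rows are dictionaries; `profile_column` summarizes one column (value types for structure; nulls, distinct values, and range for content), and `orphaned_keys` checks one relationship between two data sets. Both function names and the sample data are hypothetical.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize the structure and content of one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    types = Counter(type(v).__name__ for v in present)
    comparable = len(types) == 1  # min/max only make sense for a single type
    return {
        "types": types,
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if comparable else None,
        "max": max(present) if comparable else None,
    }


def orphaned_keys(child_rows, child_key, parent_rows, parent_key):
    """Return child key values that have no matching parent record."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return sorted({row[child_key] for row in child_rows} - parent_ids)


customers = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 4}]

age_profile = profile_column(customers, "age")
orphans = orphaned_keys(orders, "customer_id", customers, "id")
print(age_profile, orphans)
```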
Step three—establish standards for data quality assurance
Data standardization is a critical part of data quality assurance. During this step, policies are developed to enforce internal and external data quality standards.
- External standards for data quality assurance
  Commonly used data types often rely on external standards, such as ISO 8601, a globally accepted standard for representing dates and times.
- Internal standards for data quality assurance
  Organizations need to create internal standards for information that is unique to their organization, such as job titles or billing codes.
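As an illustration of applying an external standard, the sketch below converts dates written in several formats into ISO 8601 (`YYYY-MM-DD`). The list of accepted input formats is an assumption based on the date examples used earlier. It also assumes English month names, and it relies on Python’s rule for two-digit years (69 to 99 map to 1969 to 1999), which may not suit every data set.

```python
from datetime import datetime

# Hypothetical list of input formats, tried in order.
KNOWN_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]


def to_iso_date(value):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ["January 1, 1999", "01/01/1999", "01/01/99", "not a date"]:
    print(raw, "->", to_iso_date(raw))
```

Returning `None` for unparseable input, rather than guessing, keeps bad values visible so they can be reviewed.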
Step four—matching and linking records for data quality assurance
This data quality assurance step matches and links data sets across systems to determine which one has the best data quality and to use it as the master. During this step, duplicates and errors are identified, such as Sam Smith and Sma Smith, two records that have all other information in common aside from the misspelled first name. This step can also be used to merge multiple partial records to create a super record with all information consolidated.
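A minimal sketch of this step, using approximate string matching from Python’s standard library, is shown below. The matching rule (same email plus a name similarity of at least 0.8) and the merge policy (the master record wins unless its value is missing) are assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_match(rec_a, rec_b, name_threshold=0.8):
    """Treat two records as the same entity if emails agree and names are close."""
    email = rec_a.get("email")
    same_email = email is not None and email == rec_b.get("email")
    return same_email and similarity(rec_a["name"], rec_b["name"]) >= name_threshold


def merge(master, other):
    """Start from `other`, then overwrite with every populated value in `master`."""
    merged = dict(other)
    for field, value in master.items():
        if value not in (None, ""):
            merged[field] = value
    return merged


master = {"name": "Sam Smith", "email": "sam@example.com", "phone": None}
partial = {"name": "Sma Smith", "email": "sam@example.com", "phone": "800-222-3333"}

if is_probable_match(master, partial):
    print(merge(master, partial))
```

Thresholds like 0.8 trade missed duplicates against false merges, so borderline pairs are usually better routed to a person for review than merged automatically.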
Step five—monitoring data quality
Continuous monitoring is required to maintain data quality. It helps keep data quality high and minimizes duplicates, errors, and anomalies that can cause problems for applications that rely on this information.
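One way to picture continuous monitoring is as a recurring comparison of measured metrics against agreed thresholds. The sketch below assumes metrics are already computed for each run; the metric names, threshold values, and run identifiers are hypothetical.

```python
# Hypothetical minimum acceptable values for each metric.
THRESHOLDS = {"completeness_pct": 98.0, "uniqueness_pct": 99.0}


def breaches(metrics, thresholds):
    """Return the metrics that fall below their threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    }


def monitor(runs, thresholds):
    """Return (run_id, failed_metrics) for every run with at least one breach."""
    alerts = []
    for run in runs:
        failed = breaches(run["metrics"], thresholds)
        if failed:
            alerts.append((run["run_id"], failed))
    return alerts


runs = [
    {"run_id": "2024-01-01", "metrics": {"completeness_pct": 99.2, "uniqueness_pct": 99.8}},
    {"run_id": "2024-01-02", "metrics": {"completeness_pct": 96.5, "uniqueness_pct": 99.7}},
]

alerts = monitor(runs, THRESHOLDS)
print(alerts)
```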
Step six—maintain data quality
To maintain data quality, after performing the steps for data quality assurance, organizations must put processes and procedures in place to ensure that data remains clean.
What is data quality control?
Data quality control is a step in data quality enforcement that is taken before and after data quality assurance. It is used to restrict inputs until data quality assurance criteria have been met as measured by data quality dimensions.
Information gathered as part of data quality assurance processes is used to direct data quality controls. Data quality controls must be cleared before users can access data.
Implementing data quality control is critical for organizations to effectively maintain data according to the standards required for various use cases. Data quality control processes help organizations:
- Detect and remove duplicates.
- Flag missing information that is mandatory.
- Identify errors made during the input, transfer, or storage of information.
Commonly used methods for data quality control include:
- Anomaly detection
  Anomaly detection brings advanced analytics and machine learning to bear to help organizations identify data quality issues that are hard to detect. It uses structured and unstructured data to identify outliers and anomalies. For instance, anomaly detection uses means to identify potential errors (e.g., a person’s age is 102 when the mean is 35).
- Data inspection
  Data inspection supports data quality control by inspecting information at the data or row level to identify and filter problematic information, such as duplicates or invalid data, and flag it for further inspection or additional processing. Data inspection systems use data quality criteria to judge data quality and filter any data that does not fit to prevent it from negatively impacting downstream processes and use in applications.
- Data monitoring
  Data monitoring uses pre-established rules to continuously assess data quality to ensure data validity or flag data that does not conform to standards or has missing attributes.
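The mean-based check mentioned under anomaly detection can be sketched as a simple z-score test, which flags values that lie far from the mean relative to the standard deviation. This is a basic statistical stand-in, not the machine learning approach the text refers to, and the threshold of 2.0 and the sample ages are assumptions.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


ages = [30, 35, 38, 33, 36, 40, 31, 37, 102]
flagged = zscore_outliers(ages)
print(flagged)
```

A flagged value is a candidate for review rather than a confirmed error: an age of 102 is unusual but possible. Extreme values also inflate the mean and standard deviation themselves, so median-based measures are often preferred on small samples.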
Data quality impacts all areas of the enterprise’s operations
Every part of the enterprise generates, works with, and depends on data, which makes ensuring data quality imperative. Any organization can achieve high data quality by taking advantage of available tools and by establishing and enforcing guidelines and protocols. The effort and expenditure required to enable high-quality data can deliver a strong return on investment.