Businesses aiming to stay ahead of the competition keep close tabs on data quality, because better data translates directly into better performance, fewer mistakes, lower costs, and better decisions. Given how important data quality is, it is high time companies paid it the attention it deserves.
Caution: bad data is cruel. Every day, businesses send packages to customers, managers decide which candidate to hire, and executives make long-term plans based on the data available. When data lets them down, the consequences are immediate: angry customers, wasted time, and added difficulty in execution. You know the sound bites: "decisions are no better than the data on which they're based" and "garbage in, garbage out." But do you know the price tag for your organization?
Based on recent research by Experian plc, as well as by consultants James Price of Experience Matters and Martin Spratt of Clear Strategic IT Partners Pty. Ltd., we estimate the cost of bad data to be 15% to 25% of revenue for most companies (more on this research later). These costs accrue as people accommodate bad data by correcting errors, seeking confirmation in other sources, and dealing with the inevitable mistakes that follow.
Fewer errors imply lower cost, and the key to fewer errors lies in finding and eliminating their root cause. Fortunately, this is not too difficult in most cases. All told, we estimate that two-thirds of these costs can be identified and eliminated — once and for all.
There are 5 main criteria used to measure data quality:
- Accuracy: the data should correctly describe whatever it represents.
- Relevancy: the data should meet the requirements for the intended use.
- Completeness: the data should not have missing values or absent data records.
- Timeliness: the data should be up to date.
- Consistency: the data should follow the expected format and be cross-referenceable with the same results.
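The 5 criteria above can be expressed as simple checks on a single record. The sketch below is illustrative only: the field names, valid-value set, and freshness threshold are assumptions for the example, not a standard.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical customer record; field names are illustrative only.
record = {
    "customer_id": "C-1001",
    "email": "jane@example.com",
    "country": "US",
    "updated_at": datetime.now(timezone.utc) - timedelta(days=2),
}

REQUIRED_FIELDS = {"customer_id", "email", "country", "updated_at"}
VALID_COUNTRIES = {"US", "CA", "GB"}   # accuracy: value drawn from a known domain
MAX_AGE = timedelta(days=30)           # timeliness: refreshed within 30 days

def quality_issues(rec):
    """Return a list of data quality issues found in a single record."""
    issues = []
    # Completeness: no missing fields or empty values.
    for field in REQUIRED_FIELDS:
        if field not in rec or rec[field] in (None, ""):
            issues.append(f"incomplete: {field}")
    # Accuracy: value must come from the expected domain.
    if rec.get("country") not in VALID_COUNTRIES:
        issues.append("inaccurate: country")
    # Consistency: value must follow the expected format (crude email check).
    if "@" not in str(rec.get("email", "")):
        issues.append("inconsistent: email format")
    # Timeliness: record must have been updated recently enough.
    updated = rec.get("updated_at")
    if updated and datetime.now(timezone.utc) - updated > MAX_AGE:
        issues.append("stale: updated_at")
    return issues

print(quality_issues(record))  # → [] when the record passes all checks
```

Relevancy is the one criterion that resists automation; it depends on the intended use and is usually judged during requirements gathering rather than by a check like this.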
MORAL OF THE STORY: If a company effectively manages the quality of each dataset at the time it is received or created, data quality is essentially assured. Here are 7 essential steps to make it happen:
- Rigorous Data Profiling and Control of Incoming Data
Most bad data enters an organization at the point of intake. Data usually comes from sources outside the organization's control: it may be sent by another organization or, in many cases, collected by third-party software. Its quality therefore cannot be guaranteed up front, and rigorous quality control of incoming data is the most important of all data quality control tasks. This is where a good data profiling tool earns its keep; such a tool should be adept at examining the following aspects of the data:
- Data format and data patterns
- Data consistency on each record
- Data value distributions and abnormalities
- Completeness of the data
It is very important to automate data profiling and data quality alerts so that the quality of incoming data is consistently controlled and managed in a timely and appropriate manner. Lastly, each piece of incoming data should be managed using best practices, with a centralized catalogue and KPI dashboard established to accurately record and monitor the data available.
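The profiling aspects listed above can be sketched with nothing but the standard library. The rows, column names, and id pattern below are hypothetical, and a real profiling tool would cover far more, but the shape of the report is the point:

```python
import re
from collections import Counter

# Hypothetical incoming rows; column names are assumptions for illustration.
rows = [
    {"order_id": "A-001", "amount": "19.99", "status": "shipped"},
    {"order_id": "A-002", "amount": "5.00",  "status": "shipped"},
    {"order_id": "003",   "amount": "",      "status": "SHIPPED"},
]

def profile(rows, id_pattern=r"A-\d{3}"):
    """Summarize format violations, missing values, and value distributions."""
    report = {"bad_id_format": 0, "missing_amount": 0,
              "status_distribution": Counter()}
    for row in rows:
        # Data format / patterns: ids must match the agreed pattern.
        if not re.fullmatch(id_pattern, row["order_id"]):
            report["bad_id_format"] += 1
        # Completeness: flag empty required values.
        if not row["amount"]:
            report["missing_amount"] += 1
        # Value distributions / abnormalities: e.g. casing drift in a category.
        report["status_distribution"][row["status"]] += 1
    return report

report = profile(rows)
# An automated alert could fire when any violation count crosses a threshold.
print(report)
```

Feeding such a report into a centralized dashboard is what turns one-off profiling into the continuous control the step describes.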
- Careful Data Pipeline Design to Avoid Duplicate Data at All Costs
Duplicate data arises when all or part of a dataset is copied from the same source more than once. Duplicated data easily falls out of sync and leads to hazardous results. When a data issue then surfaces, it becomes difficult to trace the root cause, let alone fix it properly.
To prevent this mishap, an organization's data pipeline needs to be clearly defined and carefully designed in areas including data assets, data modelling, business rules, and architecture. Effective communication is also needed to promote and enforce data sharing within the organization, which improves overall efficiency and reduces the quality issues that arise from duplication. At a high level, there are 3 key areas that need to be established to prevent duplicate data from taking hold:
- A data governance program, which clearly defines the ownership of a dataset and effectively communicates and promotes dataset sharing to avoid any department silos.
- Centralized data assets management and data modelling, which are reviewed and audited regularly.
- Clear logical design of data pipelines at the enterprise level, which is shared across the organization.
- Accurate Gathering of Data Requirements
An important aspect of good data quality is meeting the requirements and delivering the data that users actually need. This is not as simple as it appears!
- First, it is not easy to present the data properly, or to truly understand what a client wants.
- The requirements should cover all data conditions and scenarios; they are incomplete if any dependency or condition goes unreviewed.
- Clear documentation of the requirements, with easy access and sharing.
- Enforcement of Data Integrity
An important feature of relational databases is the effective enforcement of data integrity using techniques such as foreign keys, check constraints, and triggers. As data volume grows, along with ever more data sources and deliverables, not all datasets can live in a single database system. The referential integrity of the data then needs to be enforced by applications, following data governance and design best practices.
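The database-side techniques named above can be demonstrated in a few lines with Python's built-in sqlite3 module. The table names and rows are illustrative; note that SQLite, unlike most relational databases, requires foreign key checks to be switched on per connection:

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
        amount REAL CHECK (amount > 0)                          -- check constraint
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Jane')")
conn.execute("INSERT INTO orders VALUES (1, 1, 42.0)")  # valid row

# The database itself rejects rows that would break referential integrity.
try:
    conn.execute("INSERT INTO orders VALUES (2, 999, 10.0)")  # no such customer
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

When datasets span multiple systems and constraints like these are unavailable, the application layer has to perform the equivalent checks before writing.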
- Integration of Data Lineage Traceability into the Data Pipelines
Without data lineage traceability built into the pipeline, a data-related issue can take hours to trace to its root cause. Sometimes the hunt crosses multiple teams and requires data engineers to dig into the code itself.
Data Lineage traceability has 2 aspects:
- Meta-data: the ability to trace through the relationships between datasets, data fields and the transformation logic.
- Data itself: the ability to trace a data issue quickly and smoothly.
Data traceability can be difficult and complex! Below are some common techniques to make it work:
- Trace by unique keys of each dataset: this first requires that each dataset have one key or a group of unique keys. Unfortunately, not every dataset can be traced by unique keys. For example, when a dataset is aggregated, the keys from the source get lost in the process.
- Create a unique sequence number, such as transaction or record identifier when there are no obvious unique keys in the data.
- Build link tables when there are many-to-many relationships, rather than 1-to-1 or 1-to-many.
- Add a timestamp to each data record to indicate when it was added or changed.
- Log data changes in a log table, recording the value before the change and the timestamp of when it happened.
Data traceability takes time to design and implement. It is, however, strategically critical for data architects and engineers to build it into the pipeline in an effective manner.
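Several of the techniques above (a surrogate sequence key, per-record timestamps, a change-log table) can be sketched together in a few lines. This in-memory version is an illustrative assumption, not a production design; a real pipeline would persist these structures in database tables:

```python
import itertools
from datetime import datetime, timezone

# Surrogate sequence for records that have no natural unique key.
_seq = itertools.count(1)
# The "log table": (record_id, field, old_value, new_value, timestamp).
change_log = []

def ingest(record):
    """Stamp each incoming record with a unique id and a load timestamp."""
    record["record_id"] = next(_seq)
    record["loaded_at"] = datetime.now(timezone.utc)
    return record

def update(record, field, new_value):
    """Apply a change and log the before-value with a timestamp."""
    change_log.append((record["record_id"], field,
                       record[field], new_value, datetime.now(timezone.utc)))
    record[field] = new_value
    return record

rec = ingest({"amount": 10.0})
update(rec, "amount", 12.5)
# To trace an issue, follow record_id through the change log.
print(rec["record_id"], change_log[0][:4])
```

With this in place, answering "what was this value, and when did it change?" becomes a lookup instead of a multi-team investigation.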
- Automated Regression Testing as Part of Change Management
Data quality issues often appear when a new dataset is introduced or an existing dataset is modified. For effective change management, test plans should be implemented with 2 themes: 1) confirming the change meets the requirement; 2) ensuring the change has no unintended impact on data in the pipeline that should not change. For mission-critical datasets, regression testing should run for every deliverable whenever a change takes place.
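The second theme, catching unintended impact, is what a regression suite automates. A minimal sketch, assuming a hypothetical `transform` step and golden input/output cases captured before the change:

```python
# `transform` stands in for any pipeline step; the cases are illustrative.

def transform(row):
    """Pipeline step under test: normalize a country code."""
    return {**row, "country": row["country"].strip().upper()}

# Golden cases captured from known-good behavior before the change.
REGRESSION_CASES = [
    ({"country": " us "}, {"country": "US"}),
    ({"country": "gb"},   {"country": "GB"}),
]

def run_regression():
    """Return the cases where a change altered previously correct behavior."""
    failures = []
    for given, expected in REGRESSION_CASES:
        got = transform(dict(given))
        if got != expected:
            failures.append((given, expected, got))
    return failures

assert run_regression() == []  # re-run automatically on every change
print("regression suite passed")
```

Wired into the deployment process, a suite like this turns "did we break anything?" from a manual review into an automatic gate.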
- Capable Data Quality Control Teams
Lastly, 2 types of teams play critical roles in sustaining high data quality in an organization:
Quality Assurance: This team checks the quality of software and programs whenever changes come up. Rigorous change management performed by this team is required to ensure data quality in an organization.
Production Quality Control: Depending on the organization, this does not have to be a separate team; it can be a function of Quality Assurance or Business Analysts. The objective of this team is to identify any data quality issue and have it fixed in real time. This team should also partner with customer service teams to get direct feedback and quickly address customers' concerns.
In conclusion, good data quality requires good data governance, rigorous management of incoming data, accurate requirements gathering, thorough regression testing for change management, careful design of data pipelines, and quality control programs for the data delivered. As with all quality problems, it is much easier and less costly to prevent a data issue from happening than to depend on downstream defenses to deal with it.