7 Steps to Ensure and Sustain Data Quality Management by Stephanie Shen
Before data science became mainstream, data quality was mostly discussed in the context of reports delivered to internal or external clients. Nowadays, because machine learning requires large amounts of training data, the internal datasets within an organization are in high demand. In addition, analytics teams are always hungry for data and constantly search for data assets that could add value, which has led to the quick adoption of new datasets or data sources not explored or used before. This trend has made data management, and good practices for ensuring data quality, more important than ever.

The goal of this article is to give you a clear idea of how to build a data pipeline that creates and sustains good data quality from the beginning. In other words, data quality is not something that can be fundamentally improved by finding problems and fixing them after the fact. Instead, every organization should start by producing data with good quality in the first place.
First of all, what is data quality? Generally speaking, data is of high quality when it satisfies the requirements of its intended use for clients, decision-makers, and downstream applications and processes. A good analogy is the quality of a product from a manufacturer: good product quality is not the business outcome itself, but it drives customer satisfaction and impacts the value and life cycle of the product. Similarly, the quality of data is an important attribute that drives the value of the data and, hence, impacts aspects of the business outcome, such as regulatory compliance, customer satisfaction, and accuracy of decision making. Below are the 5 main criteria used to measure data quality:

• Accuracy: whatever the data describes, it needs to describe it correctly.
• Relevancy: the data should meet the requirements of its intended use.
• Completeness: the data should not have missing values or missing records.
• Timeliness: the data should be up to date.
• Consistency: the data should be in the expected format and be cross-referenceable with the same results.
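To make these criteria concrete, some of them can be turned into simple, measurable checks. The sketch below is a minimal illustration in Python, with hypothetical field names and thresholds; it scores completeness, consistency, and timeliness for a list of records. Accuracy and relevancy generally require business context and cannot be computed this mechanically.

```python
import re
from datetime import date, timedelta

def completeness(records, required_fields):
    """Fraction of records with a non-empty value for every required field."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return ok / len(records)

def consistency(records, field, pattern):
    """Fraction of non-empty values in `field` matching the expected format."""
    values = [r.get(field) for r in records if r.get(field)]
    if not values:
        return 0.0
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def timeliness(records, date_field, max_age_days, today=None):
    """Fraction of records updated within the last `max_age_days` days."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    dated = [r for r in records if r.get(date_field)]
    if not dated:
        return 0.0
    return sum(1 for r in dated if r[date_field] >= cutoff) / len(dated)
```

Each function returns a ratio between 0 and 1, which makes the scores easy to track on a dashboard over time.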
The standard for good data quality management can differ depending on the requirements and the nature of the data itself. For example, a company's core customer dataset needs to meet very high standards on the above criteria, while there may be a higher tolerance for errors or incompleteness in a third-party data source. For an organization to deliver data with good quality, it needs to manage and control each data store created in the pipeline, from the beginning to the end. Many organizations focus only on the final data and invest in data quality control just before it is delivered. This is not good enough: too often, when an issue is found at the end, it is already too late. Either it takes a long time to trace where the problem came from, or it becomes too costly and time-consuming to fix. However, if a company manages the data quality of each dataset at the time it is received or created, data quality is naturally guaranteed. There are 7 essential steps to making that happen:

1. Rigorous data profiling and control of incoming data

In most cases, bad data comes in at the point of receipt. Within an organization, data often originates from sources outside the control of the company or department: it could be data sent from another organization or, in many cases, data collected by third-party software. Its quality therefore cannot be guaranteed, and rigorous quality control of incoming data is perhaps the most important of all data quality control tasks. A good data profiling tool comes in handy here; such a tool should be capable of examining the following aspects of the data:

• Data format and data patterns
• Data consistency on each record
• Data value distributions and anomalies
• Completeness of the data
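As an illustration of what such a profiling tool examines (a minimal sketch, not any specific product), the code below profiles one string column of an incoming dataset: its null rate, distinct count, and the distribution of character-class patterns, then raises alerts when the profile violates expectations. The thresholds and expected patterns are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Profile one column: null rate, distinct count, and value patterns.

    Each non-empty string is reduced to a character-class pattern,
    e.g. "123-AB" -> "999-AA", so format drift is easy to spot.
    """
    total = len(values)
    nulls = sum(1 for v in values if v in (None, ""))

    def pattern(v):
        return "".join(
            "9" if c.isdigit() else "A" if c.isalpha() else c for c in v
        )

    patterns = Counter(pattern(v) for v in values if v not in (None, ""))
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct": len(set(values) - {None, ""}),
        "patterns": patterns,
    }

def alert_on_profile(profile, max_null_rate=0.01, expected_patterns=None):
    """Return alert messages when the profile violates expectations."""
    alerts = []
    if profile["null_rate"] > max_null_rate:
        alerts.append(f"null rate {profile['null_rate']:.1%} exceeds threshold")
    if expected_patterns:
        unexpected = set(profile["patterns"]) - set(expected_patterns)
        if unexpected:
            alerts.append(f"unexpected value patterns: {sorted(unexpected)}")
    return alerts
```

Commercial and open-source profiling tools cover far more than this, but the principle is the same: compute a compact summary of each column and compare it against explicit expectations.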
It is also essential to automate data profiling and data quality alerts so that the quality of incoming data is consistently controlled and managed whenever it is received. Never assume incoming data is as good as expected without profiling and checks. Lastly, each piece of incoming data should be managed using the same standards and best practices, and a centralized catalog and KPI dashboard should be established to record and monitor data quality accurately.

2. Careful data pipeline design to avoid duplicate data

Duplicate data arises when all or part of a dataset is created from the same source, using the same logic, but by different people or teams, likely for different downstream purposes. Once duplicate data is created, it very likely falls out of sync and leads to different results, with cascading effects across multiple systems or databases. In the end, when a data issue arises, it becomes difficult and time-consuming to trace the root cause, not to mention to fix it.

To prevent this from happening, the data pipeline needs to be clearly defined and carefully designed in areas including data assets, data modeling, business rules, and architecture. Effective communication is also needed to promote and enforce data sharing within the organization, which improves overall efficiency and reduces the data quality issues caused by duplication. This gets into the core of data management, the details of which are beyond the scope of this article. At a high level, there are 3 areas that need to be established to prevent duplicate data from being created:

• A data governance program, which clearly defines the ownership of each dataset and effectively communicates and promotes dataset sharing to avoid department silos.
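One lightweight way to detect duplicate datasets after the fact, sketched below under the assumption that datasets can be compared field by field, is to record an order-insensitive content fingerprint for each dataset in a central catalog: two copies built independently from the same source and logic then hash to the same value. The dataset and field names here are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(records, key_fields):
    """Order-insensitive fingerprint of a dataset's content.

    Each record is reduced to a canonical JSON string over the chosen
    fields; the sorted strings are hashed so row order does not matter.
    """
    canonical = sorted(
        json.dumps({f: r.get(f) for f in key_fields}, sort_keys=True)
        for r in records
    )
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()

def find_duplicates(catalog):
    """Group dataset names in a {name: fingerprint} catalog by shared content."""
    by_fp = {}
    for name, fp in catalog.items():
        by_fp.setdefault(fp, []).append(name)
    return [sorted(names) for names in by_fp.values() if len(names) > 1]
```

Fingerprinting only flags exact duplicates; near-duplicates built with slightly different logic still require the governance and design measures described above.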
• Quality Assurance: this team checks the quality of software and programs whenever changes happen. Rigorous change management performed by this team is essential to ensure data quality management in an organization that undergoes fast transformation with data-intensive applications.

• Production Quality Control: depending on the organization, this does not have to be a separate team. Sometimes it can be a function of the Quality Assurance or Business Analyst team. The team needs a good understanding of the business rules and business requirements, and must be equipped with the tools and dashboards to detect anomalies, outliers, broken trends, and any other unusual scenarios that occur in production. The objective of this team is to identify any data quality issue and have it fixed before users and clients find it. This team also needs to partner with customer service teams so that it can get direct feedback from customers and address their concerns quickly. With advances in modern AI technologies, efficiency can potentially be improved drastically.

However, as stated at the beginning of this article, quality control at the end is necessary but not sufficient for a company to create and sustain good data quality. The 6 steps stated above are also required.
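As a minimal illustration of the kind of automated outlier detection such a production control team might rely on (a simple z-score rule, not a prescription), the sketch below flags a new day's record count that deviates sharply from recent history:

```python
import statistics

def is_anomalous(history, new_count, z_threshold=3.0):
    """Flag a new daily record count that deviates from the historical mean
    by more than `z_threshold` sample standard deviations -- a crude proxy
    for the outliers and broken trends a production QC team watches for."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # needs at least 2 history points
    if stdev == 0:
        return new_count != mean
    return abs(new_count - mean) / stdev > z_threshold
```

In practice such a rule would run on many metrics (row counts, null rates, value distributions) and feed the dashboards mentioned above, with thresholds tuned per metric.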