
Guided Data Repair

Presentation Transcript


  1. Guided Data Repair

    Mohamed Yakout# Ahmed K. Elmagarmid* Jennifer Neville# Mourad Ouzzani* Ihab F. Ilyas* #Purdue University, West Lafayette, Indiana, USA. *Qatar Computing Research Institute, Qatar Foundation – Doha, Qatar. Machine Learning Seminar, Fall 2011
  2. Data Quality. Real data has problems: entry errors, incomplete information, errors from information extraction over text, data integration from heterogeneous sources, etc.
  3. Data Quality. Problems manifest themselves as duplicate records, violations of integrity constraints (e.g., zip code determines city), records with missing values, misalignment of attribute values, etc.
  4. Data Quality Problems – Example. (Sample table annotated with inconsistent values, duplicate records, and missing values.)
  5. Data Repair. The process of “correcting” data problems; an essential step in the traditional ETL process (the “T”). It has been the focus of multiple research communities: statistics, machine learning, database theory, and business intelligence.
  6. Approaches: Automatic Repair. (Annotated example table: inconsistent city values automatically repaired to “MICHIGAN CITY”, a duplicate record deleted, and missing values highlighted.)
  7. Approaches: Involve an Expert. Identify all problems and correct the discovered problems in a consistent way. Interactive systems (e.g., AJAX, Potter’s Wheel) help explore the problems while the expert manually specifies transformations. This is time consuming and does not scale to large datasets.
  8. Outline. GDR framework; updates generation; ranking groups of updates according to benefit estimation; active learning from user feedback; experimental results; conclusion.
  9. Guided Data Repair. A novel approach that leverages the scalability of automatic approaches and the fidelity of expert-based approaches: use automatic techniques to identify problems and suggest cleaning updates; use benefit estimation to prioritize quality problems and form user questions; run an iterative cleaning process that converges to the clean data state; use active learning to piggyback learning on expert interaction.
  10. GDR – System Architecture. (Architecture diagram; demo presented in SIGMOD 2010.)
  11. GDR – Updates Generation
  12. GDR – Updates Generation 1 : 2 : Suggested Update: replace City “FORT WAYNE” with “Westville” in t5 Automatic techniques rely on heuristics to decide RISKY Suggested Update: replace Zip “46391” with “46825” in t5
  13. GDR – Updates Generation. Contextual grouping of the suggested updates: Update Group g1: the city should be “Michigan City” for {t2, t3, t4}. Update Group g2: the zip should be “46825” for {t5, t8}. And so on.
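As an illustration only (not the paper's algorithm), the following minimal Python sketch detects violations of a single CFD of the form Zip → City and groups the suggested updates by their shared context. The record layout and the resolution heuristic (take the most frequent city per zip as the target value) are assumptions made for the sketch:

```python
from collections import Counter, defaultdict

def suggest_city_updates(records):
    """records: list of dicts with keys 'id', 'Zip', 'City' (assumed layout)."""
    by_zip = defaultdict(list)
    for r in records:
        by_zip[r["Zip"]].append(r)

    groups = []
    for zip_code, tuples in by_zip.items():
        # Assumed heuristic: the most frequent city for a zip is the target value.
        dominant_city, _ = Counter(t["City"] for t in tuples).most_common(1)[0]
        # Tuples whose city disagrees with that value violate the CFD.
        violating = [t["id"] for t in tuples if t["City"] != dominant_city]
        if violating:
            groups.append({
                "group": f"City should be '{dominant_city}' for zip {zip_code}",
                "tuples": violating,
            })
    return groups

records = [
    {"id": "t1", "Zip": "46360", "City": "MICHIGAN CITY"},
    {"id": "t2", "Zip": "46360", "City": "WESTVILLE"},
    {"id": "t3", "Zip": "46360", "City": "MICHIGAN CITY"},
]
print(suggest_city_updates(records))
# -> one contextual group suggesting City = 'MICHIGAN CITY' for {t2}
```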
  14. GDR – Ranking Updates
  15. GDR – Ranking Updates. Given the contextual groups of suggested updates (g1: the city should be “Michigan City” for {t2, t3, t4}; g2: the zip should be “46825” for {t5, t8}; and so on), seeking user feedback on g1 is more beneficial to data quality: it contains updates that are more likely to be correct, and a higher number of correct updates allows faster convergence to better quality.
  16. GDR – Ranking Updates. In decision theory, VOI (Value of Information) is a means of quantifying the potential benefit of determining the true value of some unknown (here, obtaining user feedback): define a loss (or utility) function over actions, then compare the loss before and after an action to help make decisions.
  17. GDR – Ranking Updates. We need to define a DQ loss function L. Given a group of updates c = {r1, r2, …}, where the probability that rj is correct is pj, the DQ benefit from verifying c is the expected reduction in the quality loss w.r.t. a rule (the formulas appeared as slide images; a hedged reconstruction follows below). Challenges: we know neither pj nor Dopt.
  18. GDR – Ranking Updates. We estimate pj using p̂j, the prediction probability obtained from the learning component.
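The loss and benefit formulas on slides 17 and 18 were slide images and did not survive extraction. The following is a hedged reconstruction consistent with the surrounding text (loss measured against the unknown clean database Dopt, pj replaced by the learner's estimate p̂j, and D ⊕ rj denoting D with update rj applied); the paper's exact notation may differ:

```latex
% Hedged reconstruction; the paper's exact definitions may differ.
\[
  L_{\varphi}(D) \;=\; \sum_{t \,\in\, \mathrm{vio}(D,\varphi)} \mathrm{dist}\bigl(t,\; t_{\mathrm{opt}}\bigr)
  \qquad \text{(quality loss of } D \text{ w.r.t. rule } \varphi\text{)}
\]
\[
  \mathrm{Benefit}(c) \;=\; \sum_{r_j \in c} \hat{p}_j \,\bigl(L(D) - L(D \oplus r_j)\bigr)
  \qquad \text{(expected gain from verifying group } c\text{)}
\]
```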
  19. GDR – Ranking Updates
  20. GDR – Active Learning. Learning from user feedback: there can be correlations between the attribute values and the correct updates. Example: when SRC = H2, the CT attribute is incorrect most of the time; this can help reject any suggested updates for ZIP and consider only updates for CT. Modeling these correlations with a machine learning algorithm can help minimize user involvement.
  21. GDR – Active Learning. We learn a classifier for each attribute. A training example is a tuple containing the original record values, the suggested replacement, the distance between the original and suggested values, and finally a label indicating whether the suggested value is correct. For example, a training example for the city attribute is <Tom, H2, REDWOOD DR, WESTVILLE, IN, 46360, MICHIGAN CITY, 0.7, CORRECT>, i.e., the original record values, the suggested city, the distance, and the label.
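A minimal sketch of the per-attribute classifier described above, assuming scikit-learn's RandomForestClassifier and a deliberately simplified feature encoding; the attribute names beyond SRC, CT, and ZIP, and the featurization itself, are assumptions for illustration, not the paper's representation:

```python
# Sketch (assumed encoding): build one training example for the City classifier
# and fit a random forest whose prediction probabilities feed the p_j estimates.
from sklearn.ensemble import RandomForestClassifier

def featurize(original, suggested_value, distance):
    """Map (original record, suggested value, distance) to a numeric vector."""
    cats = list(original.values()) + [suggested_value]
    return [hash(v) % 1000 for v in cats] + [distance]  # crude categorical encoding

# One training example mirroring the slide's tuple.
original = {"Name": "Tom", "SRC": "H2", "STR": "REDWOOD DR",
            "CT": "WESTVILLE", "STT": "IN", "ZIP": "46360"}
X = [featurize(original, "MICHIGAN CITY", 0.7)]
y = ["CORRECT"]  # label obtained from the user's feedback

city_clf = RandomForestClassifier(n_estimators=5, random_state=0)
city_clf.fit(X, y)
# Prediction probabilities for new suggested updates would serve as p̂_j.
print(city_clf.predict_proba(X))
```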
  22. GDR – Active Learning. Active learning is used when unlabeled instances are plentiful but labeling examples for training is costly: acquire feedback for the instances that would strengthen the learned model the most, i.e., rank updates by uncertainty. We used a Random Forest model, a set of decision trees forming a committee; uncertainty is quantified as the entropy of the fractions of predicted labels. Example with five trees T1–T5 voting on updates r1 and r2: Uncertainty(r1) = -1/5 log(1/5) - 4/5 log(4/5) = 0.72 and Uncertainty(r2) = -2/5 log(2/5) - 3/5 log(3/5) = 0.97 (log base 2).
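The committee-based uncertainty from the slide can be reproduced in a few lines; this sketch assumes one label vote per tree and log base 2, which matches the numbers above:

```python
# Entropy of the fractions of labels predicted by the committee of trees.
from math import log2

def uncertainty(votes):
    """votes: list of predicted labels, one per tree in the forest."""
    n = len(votes)
    fractions = [votes.count(label) / n for label in set(votes)]
    return -sum(f * log2(f) for f in fractions)

print(round(uncertainty(["C", "I", "I", "I", "I"]), 2))  # 1 vs 4 votes -> 0.72
print(round(uncertainty(["C", "C", "I", "I", "I"]), 2))  # 2 vs 3 votes -> 0.97
```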
  23. Experiments. Data and ground truth: Dataset 1 is 20,000 records of patients’ personal and address information, repaired manually using address lookup web sites to obtain a ground truth; Dataset 2 is the UCI Adult dataset, 23,000 records. Rules (CFDs): rules specified during the manual cleaning process for Dataset 1; for Dataset 2 we implemented a CFD discovery technique. User simulation: by consulting the ground truth dataset. Data quality metric: the improvement in data quality measured as the reduction in the loss L(D), which is computable here because Dopt is known.
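For concreteness, a simplified sketch of the evaluation metric, assuming the loss is measured as cell-level disagreement with the ground truth Dopt (the paper's L(D) may weight violations differently):

```python
# Assumed simplification: loss = number of cells that differ from the ground truth.
def loss(D, D_opt):
    """D and D_opt: lists of record dicts aligned by position."""
    return sum(1 for t, t_opt in zip(D, D_opt)
                 for a in t_opt if t.get(a) != t_opt[a])

def quality_improvement(D_before, D_after, D_opt):
    """Relative reduction in loss after applying the verified repairs."""
    before = loss(D_before, D_opt)
    return (before - loss(D_after, D_opt)) / before if before else 0.0
```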
  24. Evaluating: VOI Ranking. (Plot; x-axis: the amount of feedback as a percentage of the maximum number of updates required by a technique.)
  25. Overall Evaluation. (Plot; x-axis: the amount of feedback as a percentage of the number of initially identified dirty records.)
  26. Overall Evaluation
  27. Conclusion. GDR guides the user to focus effort on inspecting the updates that would improve quality fastest, while the user guides the system to automatically repair the data. It is a novel combination of decision theory and active learning as a new application for data repair. We presented GDR as an end-to-end framework for interactive data cleaning that provides fast convergence to better DQ. We are currently studying better ways to model the dependencies between the suggested updates, and how to leverage the uncertainty of the user input.
  28. Thank you

  29. Experiments: Results