XClean is an XML data cleaning system that targets common data quality problems: typos, inconsistent data formats, missing data, contradictions, and duplicates. It follows a clear methodology with defined processing stages. Cleaning programs are written in XClean/PL, a declarative, modular, and readable language built on a set of cleaning operators, and are compiled to XQuery for execution by an XQuery processor. A Java plugin, demonstrated at CIDR 2007, supports writing, compiling, and executing these programs.
XClean in Action
Melanie Weis, HPI Potsdam, Germany
Ioana Manolescu, INRIA Futurs, France
CIDR 2007, 05.11.2006
What is XClean?
• XClean is an XML data cleaning system.
• Types of errors that require data cleaning:
  • Typos
  • Different data formats (e.g., date, abbreviations, language)
  • Missing data
  • Contradictory data
  • Duplicates
Melanie Weis, Hasso Plattner Institut Potsdam, 18.01.2007
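To make the error types above concrete, here is a minimal Python sketch that flags three of them (different date formats, missing data, duplicates) in a small XML sample. It is an illustration only, not XClean or XClean/PL: the sample document, the two accepted date formats, and the helper names `normalize_date`, `normalize_title`, and `find_issues` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical dirty input: two spellings of one movie, one movie without a date.
DIRTY_XML = """
<movies>
  <movie><title>The Matrix</title><released>1999-03-31</released></movie>
  <movie><title>the  matrix</title><released>31.03.1999</released></movie>
  <movie><title>Signs</title></movie>
</movies>
"""

# Assumed set of date formats that may occur in the data.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def normalize_date(text):
    """Return an ISO date string, or None if the value is missing or unparseable."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_title(text):
    """Lowercase the title and collapse runs of whitespace."""
    return " ".join((text or "").lower().split())


def find_issues(xml_text):
    """Return (duplicate index pairs, indices of movies without a usable date)."""
    root = ET.fromstring(xml_text)
    seen = {}
    duplicates, missing = [], []
    for index, movie in enumerate(root.findall("movie")):
        title = normalize_title(movie.findtext("title"))
        released = normalize_date(movie.findtext("released"))
        if released is None:
            missing.append(index)
        key = (title, released)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates, missing


duplicates, missing = find_issues(DIRTY_XML)
print("duplicate pairs:", duplicates)
print("movies with missing date:", missing)
```

Exact matching on normalized values is the simplest possible duplicate test; it would not catch typos, which need a similarity measure instead.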
Where do we find duplicates?
[Figure: example labeled "False Duplicate"]
How do we get rid of dirty data?
No!
• Quick fix (get glasses)
• Start over again next year (get new, expensive glasses)
Yes!
• Clear methodology (clearly defined processing stages that combine)
• Possibility to reuse (parts of) a solution
Data Cleaning with XClean
• XClean/PL
  • Declarative
  • Modular
  • Readable
  • Set of clearly defined cleaning operators
[Diagram: XClean/PL → XQuery; dirty XML data → XQuery processor → clean XML data]
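The idea of building a cleaning program from small, clearly defined operators can be sketched in Python as a list of functions applied in order. This is not XClean/PL syntax and not XClean's operator set: the operators `trim_text` and `drop_exact_duplicates` and the driver `run_pipeline` are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET


def trim_text(root):
    """Hypothetical operator: strip surrounding whitespace from every text node."""
    for element in root.iter():
        if element.text is not None:
            element.text = element.text.strip()
    return root


def drop_exact_duplicates(root):
    """Hypothetical operator: remove children whose serialized form was already seen."""
    seen = set()
    for child in list(root):
        child.tail = None  # ignore layout whitespace between siblings
        key = ET.tostring(child)
        if key in seen:
            root.remove(child)
        else:
            seen.add(key)
    return root


def run_pipeline(xml_text, operators):
    """Parse the dirty document and apply each operator in the given order."""
    root = ET.fromstring(xml_text)
    for operator in operators:
        root = operator(root)
    return root


dirty = (
    "<people>"
    "<person><name> Ann </name></person>"
    "<person><name>Ann</name></person>"
    "</people>"
)
clean = run_pipeline(dirty, [trim_text, drop_exact_duplicates])
names = [person.findtext("name") for person in clean.findall("person")]
print(names)
```

Order matters here: trimming first makes the two `person` elements identical, so the duplicate operator can remove one. Keeping each operator independent is what allows parts of a solution to be reused.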
Come see the demo!
• XClean Java plugin
• Supports
  • Writing XClean/PL
  • Compiling XClean/PL to XQuery
  • Executing XQuery to obtain clean data