1 / 29

Advanced Database Systems F24DS2 / F29AT2

Advanced Database Systems F24DS2 / F29AT2. Data Quality and Data Cleaning 1. Acknowledgements. I adapted this material from various sources, most notably: A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs

karenflynn
Télécharger la présentation

Advanced Database Systems F24DS2 / F29AT2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Advanced Database SystemsF24DS2 / F29AT2 Data Quality and Data Cleaning 1 David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  2. Acknowledgements I adapted this material from various sources, most notably: • A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs • A paper called `Data Cleaning: Problems and Current Approaches’, by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany. My thanks to these researchers for making their materials freely available online. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  3. On Data Quality Suppose you have a database sitting in front of you, and I ask ``Is it a good quality database?’’ What is your answer? What does quality depend on? Note: this is about the data themselves, not the system in use to access it. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  4. A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent ; meaning … David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  5. A Conventional Definition of Data Quality, continued … Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  6. A Conventional Definition of Data Quality, continued … Complete: This refers to whether or not the database really contains everything it is supposed to contain. E.g. a patient’s medical records should contain references to all medication prescribed to date for that patient. The BBC Licensing DB should contain an entry for every address in the country. Does it? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  7. A Conventional Definition of Data Quality, continued … Unique: Every separate datum appears only once. How many `Data Quality errors’ can you find in the following table, and what types are they? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  8. A Conventional Definition of Data Quality, continued … Up-to-date: The data are kept up to date. The post office has just changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance? Can you think of a DB where it doesn’t matter whether or not the data are kept up to date?? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  9. A Conventional Definition of Data Quality, continued … Consistent: The data contains no logical errors or impossibilities. It makes sense in and of itself. Why is the following mini DB inconsistent? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  10. The Several Problems with the Conventional (or any?) Definition Fine, we can define Data Quality (DQ) in a way that makes it clear what kinds of issues we are thinking about. But is the definition any use beyond that? Can it be used: • to Measure DQ, so that we can say whether or not one DB is better than another, or so we can gauge improvements in DQ over time. • on a wide range of different DBs, of widely different purposes, sizes, etc. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  11. No: Four problems There are at least four specific problems with this definition if we are going to use it for any other purpose than just discussing DQ issues: Measurability, rigidity, context independence, unhelpfulness David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  12. DQ Def problems: measurability It is not clear how we might measure quality according to some of these items. Completeness: How will we know?? Uniqueness: It is hard to tell whether two entries are similar, or duplicates! Up-to-date-ness: How do we know? Consistent: consistency errors can be very hard to find, especially in a very large DB David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  13. DQ Def problems: rigidity The conventional definition is rigid and static. It is easy to think of two DBs that are high quality according to this definition, but are nevertheless of entirely different quality. E.g. one may contain much more potentially relevant information. One may have better metadata. Etc… David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  14. DQ Def problems: context independence The conventional definition does not take into account the context of the DB. Different things may be important for quality in different circumstances. E.g. a DB for public access must be highly accessible and interpretable; in a DB of experimental results, accuracy is more important than consistency (why?), etc. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  15. DQ Def problems: unhelpfulness/vagueness Partly as a result of the fact that we need different DQ measures of different DBs, depending on context, the conventional definition alone is vague and unhelpful when it comes to using it to provide clues for how to improve data quality. How do we improve accuracy? Depends on how the data are recorded. How do we improve consistency? Depends immensely on the kind of data we have, etc … David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  16. Towards modern definitions of DQ An ideal definition of data quality: • Is backed up by a good understanding of how and why errors occur. • Relates to how the data are used • Is helpful in leading to improvements in processes • Is helpful in leading to usable measures/metrics David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  17. The Data Quality Continuum It’s rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, into which errors can arise at each and every stage. The continuum is: • Data gathering • Data delivery • Data storage • Data integration • Data retrieval • Data analysis So, if we want to monitor DQ, we need to monitor it at each of these stages David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  18. DQ Continuum: Example This is an example I am familiar with, helping to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It’s a charity funded by various governments. Their role is to be the repository for recording all earthquake events on the planet. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  19. DQ Continuum: ISC example: gathering ISC Data gathering centres David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  20. DQ Continuum: ISC example: delivery Raw seismograph data from local collection points to DG centres. Email or ftp to ISC; some centres provide raw data, some provide interpreted data (e.g. maybe won’t send some data if they believe it in error in the first place) David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  21. DQ Continuum: ISC example: integration The ISC’s role is actually to figure out where and when the Earth tremors were (there are hundreds per month) based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand and brain, and record their findings in archival CDs David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  22. DQ Continuum: ISC example: retrieval/analysis You can get a CD from ISC anytime, for the earth tremor activity on any particular day. I’m not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  23. The ISC DQ Continuum Where might there occur errors, of: Accuracy? Completeness? Uniqueness? Timeliness? Consisency? What else is important in this case? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  24. Where DQ problems occur (gathering) • Manual data entry (how can we improve this?) • Lack of uniform standards for format and content. • Duplicates arising from parallel entry • Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints. • Measurement errors. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  25. Where DQ problems occur (delivery) • Multiple hops from source to DB – problems can happen anywhere • Inappropriate pre-processing (e.g. removing some `small’ seismograph readings before sending on to ISC; rounding up or down, when the destination needs more accurate data). • Transmission problems: buffer overflows, checks (did all files arrive, and all correctly?) David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  26. Where DQ problems occur (storage) • Poor, out of date or inappropriate metadata • Missing timestamps • conversion to storage format (e.g. to excel files, to higher/lower precision David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  27. Where DQ problems occur (integration) This is the business of combining datasets – e.g. from different parts of a company, from (previously) different companies following an acquisition; from different government agencies, etc. • Different keys, different fields, different formats • Different definitions (`customer’, `income’, …) • Sociological factors: reluctance to share! David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  28. Where DQ problems occur (retrieval/analysis) The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because: • The source DB is not properly understood! • Straightforward mistakes in the queries that retrieve the relevant data. E.g. A database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

  29. What this lecture was about • A conventional definition of DQ • Problems with that definition • The DQ Continuum • Where errors might occur along that continuum David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

More Related