
Dealing with Data Quality

Google Workshop, July 24, 2009. Faults can reduce the quantity and quality of the collected information. When ignored, faults in a dataset can lead to ambiguous or, worse, incorrect conclusions.


Presentation Transcript


  1. Dealing with Data Quality Google Workshop July 24, 2009

  2. [Example images, labeled: “?”, “Low light”, “Blurry”, “Missing”, “Blurry”]

  3. Faults can reduce the quantity and quality of the collected information.

  4. When ignored, faults in a dataset can lead to ambiguous or, worse, incorrect conclusions. [Figure: examples labeled “Circle” or “Square”]

  5. Unfortunately, faults in networked sensing systems are common. [Chart legend: Good Data, Network Faults, Data Faults. Numbers are approximations based on publications and personal communications.]
     [1] R. Szewczyk et al. An analysis of a large scale habitat monitoring application. In Proc. SenSys, 2004.
     [2] G. Tolle et al. A macroscope in the redwoods. In Proc. SenSys, 2005.
     [3] G. Werner-Allen et al. Fidelity and Yield in a Volcano Monitoring Sensor Network. In Proc. OSDI, 2006.
     [4] CMS database. http://cens.jamesreserve.edu/phpmyadmin

  6. Our experience is similar: almost 60% of the data was faulty in this soil deployment (Bangladesh, 2006). [Chart labels: Ammonium Calcium Carbonate Chloride Nitrate pH]

  7. Many methods to find faults. Examples include:
     • Visual inspection
     • Manual validation
     • Analytical validation: statistical, scientific models
     Statistical, e.g. outlier detection. Scientific, e.g. “temperature decreases with depth”. [Plot axes: Temperature, Depth]
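The two analytical checks named on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the deployments described here: the function names, the median/MAD outlier rule, the 3.5 cutoff, and the sample readings in the usage note are all assumptions chosen for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Statistical check: return indices of readings far from the median.

    Distance is measured with a robust z-score based on the median
    absolute deviation (MAD). The 3.5 cutoff is an assumed default.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def violates_depth_model(profile, tolerance=0.0):
    """Scientific check: return positions where temperature rises with depth.

    `profile` is a list of (depth, temperature) pairs. Returned indices
    refer to the profile sorted from shallowest to deepest.
    """
    ordered = sorted(profile)
    return [i for i in range(1, len(ordered))
            if ordered[i][1] > ordered[i - 1][1] + tolerance]
```

For example, `mad_outliers([20.1, 20.3, 19.9, 20.2, 55.0, 20.0])` flags index 4, and `violates_depth_model([(0, 22.0), (1, 20.5), (2, 21.4), (3, 18.0)])` flags position 2. A flagged position only says that two neighbouring readings disagree with the model; it does not say which of the two sensors is at fault.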

  8. Several methods to fix faults:
     • Go into the field and replace or fix the problem.
     • Remove the faulty data (“clean” the dataset) after the deployment is over.
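The second option, cleaning the dataset after the deployment, might look like the sketch below. The `clean` helper and the fault predicate are hypothetical; the predicate stands in for whatever fault definition a deployment settles on. The removed readings are returned rather than discarded so the cleaning step can be audited, and the yield shows how much data is left.

```python
def clean(readings, is_faulty):
    """Split readings into kept and removed lists and report the data yield.

    `is_faulty` is a caller-supplied predicate (an assumption of this
    sketch); yield is the fraction of readings that survive cleaning.
    """
    kept = [r for r in readings if not is_faulty(r)]
    removed = [r for r in readings if is_faulty(r)]
    data_yield = len(kept) / len(readings) if readings else 0.0
    return kept, removed, data_yield


# Assumed example: pH readings where None means a missing sample and
# values outside the conventional 0 to 14 scale are treated as faulty.
ph_readings = [6.8, None, 7.1, 42.0, 6.9]
kept, removed, data_yield = clean(
    ph_readings, lambda r: r is None or not 0 <= r <= 14
)
```

Here three of the five readings are kept, so the yield is 0.6.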

  9. Faults persist for a number of reasons, including: first, faults can be difficult to define and identify.

  10. Faults persist partly because they are difficult to define.

  11. Faults persist partly because they are difficult to define. [Figure: a nitrate deployment in the riverbed of the Merced River]

  12. Faults persist partly because they are difficult to define. [Figure: a nitrate deployment in the riverbed of the Merced River]

  13. Faults persist partly because they are difficult to define. [Figure: nitrate data taken from nearby locations in a deployment in the riverbed of the Merced River] Which one is correct? Are they both correct? Are they both faulty?

  14. Faults persist for a number of reasons, including: first, faults can be difficult to define and identify; second, faults are not always worth fixing.

  15. Not all faults need to be fixed [Schoellhammer ‘08]. Maintenance can be expensive. And if the analysis can happen without the faulty data, then what’s the point? [Two plots; axes: Temperature, Depth]

  16. Faults persist for a number of reasons, including: first, faults can be difficult to define and identify; second, faults are not always worth fixing. Answering these questions is hard.

  17. Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem. Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.

  18. Thank you. Nithya Ramanathan

  19. Collecting usable sensor data from a networked system is never easy. Whether the data consists of images or nitrate levels from a chemistry sensor, faults can reduce the quantity and quality of the collected information. When ignored, faults in a dataset can lead to ambiguous or, worse, incorrect conclusions. Unfortunately, faults in networked sensing systems are painfully common. Faults persist partly because they are difficult to define, and even once identified, they are not always worth fixing. Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem. Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.
