
Cleaning Uncertain Data with Quality Guarantees

Reynold Cheng, Jinchuan Chen, Xike Xie. VLDB 2008. Presented by SHAO Yufeng.


Presentation Transcript


  1. Cleaning Uncertain Data with Quality Guarantees • Reynold Cheng, Jinchuan Chen, Xike Xie • VLDB 2008 • Presented by SHAO Yufeng

  2. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  3. Uncertain Database (old model) • Uncertainty is inherent in many applications • Examples: • RFID data • Sensor networks • Data obscured for privacy reasons • In many models it is infeasible to eliminate all uncertainty

  4. Uncertain Database (new model) • Previous models focus on querying the uncertain database as it is • But what if we can reduce SOME of the uncertainty in such a database? • A new model is required to find an optimal solution

  5. Example 1: Sensor probing • Some sensors in a sensor network may have transmission problems and fail to update their data • Commands can be sent to refresh some sensors • New, certain data are obtained • Because bandwidth and battery power are limited, sensors cannot be probed too often

  6. Example 2: Movie rating • Movie ratings (IMDB, Netflix) collected from customers may contain some uncertainty • Managers can contact customers to verify the rating data • New, certain movie rating data are obtained • Limited by manpower or other resources

  7. [Figure: a query over the uncertain DB returns an ambiguous result; after the data cleaning procedure, the same query over the less uncertain DB returns a less ambiguous result]

  8. Real model example • A database of products and their (uncertain) prices • The price of product a has two possible values: 120 (prob. 0.7) or 80 (prob. 0.3)

  9. Query Example 1: Query 1 (range query): Select the products with price in the range [$100, $110] • Possible-world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
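
The possible-world results of Query 1 can be reproduced by brute-force enumeration. The sketch below is only an illustration: the slides give the values of product a alone, so the prices of b and c (and the existence of a second alternative b2) are hypothetical, chosen to be consistent with the result probabilities listed on the slide. The function name `range_query_results` is also ours.

```python
from collections import defaultdict
from itertools import product

# x-tuples as lists of (tuple_id, price, probability).
# Only product a's values come from the slides; the prices of b and c,
# and the alternative b2, are hypothetical.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
}


def range_query_results(db, lo, hi):
    """Map each distinct possible-world result to its total probability."""
    results = defaultdict(float)
    # A possible world picks exactly one alternative from every x-tuple.
    for world in product(*db.values()):
        prob = 1.0
        for _, _, p in world:
            prob *= p
        answer = frozenset(tid for tid, price, _ in world if lo <= price <= hi)
        results[answer] += prob
    return dict(results)


if __name__ == "__main__":
    ranked = sorted(range_query_results(DB, 100, 110).items(),
                    key=lambda kv: -kv[1])
    for answer, prob in ranked:
        print(sorted(answer), round(prob, 2))
```

The enumeration visits every combination of alternatives, so its cost grows exponentially with the number of x-tuples; it is useful only as a reference for small examples.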

  10. Query Example 2: Query 2 (max query): Select the product with the highest price • Possible-world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1,c2}, 0.054), ({c2}, 0.036), ({b1}, 0.036), ({c3}, 0.024)

  11. Clean up example • Suppose we have some resources available for cleaning part of the data • Assume we clean the information related to products a and c • Result: a new database with less uncertainty

  12. Clean up example (Cont.) • Run Query 1 again on the new database with less uncertainty: Select the products with price in the range [$100, $110] • New possible-world results: ({b1,c3}, 0.6), ({c3}, 0.4) • Old possible-world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2) • The result on the cleaned database is clearly less uncertain, but the cleaning procedure is limited by the budget

  13. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  14. Important related work • Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562 • Mentions the idea of cleaning for max/min and range queries, but gives no actual implementation • P. Andritsos, A. Fuxman, and R. Miller: Clean answers over dirty databases: A probabilistic approach. ICDE 2006 • Introduces a query rewriting technique

  15. Important related work (Cont.) • Jinchuan Chen, Reynold Cheng: Quality-Aware Probing of Uncertain Data with Resource Constraints. SSDBM 2008 • Similar cleaning method • Uncertainty represented by continuous pdfs • Supports fewer query types (range queries only) • Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: A Database Approach for Statistical Inference and Data Cleaning. SIGMOD 2010 • Uses attribute-level correlations to optimize cleaning

  16. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  17. System Structure

  18. Important Notations • Tuple ti (n tuples in total) • x-tuple τk (m x-tuples in total) • Uncertain attribute • Existential probability (ei) • [Figure: one x-tuple]

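  20. Query in possible world model • [Figure: possible-world results ({b1,c2}, 0.18) and ({b1,c3}, 0.1), with tuple-level answers (b1, 0.28), (c2, 0.18), (c3, 0.1); the value -1.44 also appears in the figure] • Qualification probability (pi) of tuple c2: 0.18 • Qualification probability (Pk) of x-tuple c: 0.28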

  21. Probabilistic Range Query (PRQ) • Given a closed interval [a, b], where a and b are real numbers and a ≤ b, a PRQ returns a set of tuples (ti, pi), where pi is the non-zero probability that the uncertain attribute value of ti lies in [a, b] • Range query: Select the products with price in the range [$100, $110] • Possible-world result set, each result rj listed with its probability of occurrence qj: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)

  22. Probabilistic Maximum Query (PMaxQ) • A PMaxQ returns a set of tuples (ti, pi), where pi, the qualification probability of ti, is the non-zero probability that the value of ti is at least the value of tj, where j = 1, ..., n and j ≠ i • Query: Select the product with the highest price • Possible-world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1,c2}, 0.054), ({c2}, 0.036), ({b1}, 0.036), ({c3}, 0.024)

  23. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  24. PWS-quality • Suppose we have two sets of possible-world results • [Figure: one set has six results with probabilities 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, among them {a1,b2,c1}, {a2,b1} and {b3,c2}; the other has two results, {a1,c1} and {b1}, with probabilities 0.9 and 0.1] • We need a measure that tells which result is more uncertain, and by how much • Solution: use an entropy-like measure to compute the PWS-quality (degree of uncertainty)

  25. PWS-Quality: Calculation • Let qj be the probability of getting the distinct PW-result rj • Let d be the number of distinct PW-results • S(D, Q) = Σ_{j=1}^{d} qj log qj • S(D, Q) is never positive: the larger the score, the better the quality • 0 means no uncertainty (only one possible-world result exists)
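
As a small illustration of the score, the sketch below evaluates it for the Query 1 results before and after cleaning products a and c (the probabilities are taken from the clean up example). The function name `pws_quality` is ours, and the logarithm base is an assumption: base 2 is used by default because it reproduces the -1.17 quoted later in the max-query computation example.

```python
import math


def pws_quality(probs, base=2):
    """Entropy-like score: sum of q * log(q) over the distinct results.

    The value is never positive, and it is 0 when a single result has
    probability 1. The base (2 by default) is an assumption.
    """
    return sum(q * math.log(q, base) for q in probs if q > 0)


# Probabilities of the distinct possible-world results of Query 1.
before_cleaning = [0.18, 0.12, 0.3, 0.12, 0.08, 0.2]
after_cleaning = [0.6, 0.4]

if __name__ == "__main__":
    print(round(pws_quality(before_cleaning), 3))
    print(round(pws_quality(after_cleaning), 3))
    print(pws_quality([1.0]))
```

The score of the cleaned database is closer to 0 than the score of the original one, which matches the intuition that cleaning removes ambiguity from the answer.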

  26. PWS-quality example • Suppose we have a set of possible-world results: {b2} with probability 0.5, {a1,c1} with probability 0.4, {b1} with probability 0.1 • PWS score (base-2 logarithm): S(D, Q) = 0.5 log2 0.5 + 0.4 log2 0.4 + 0.1 log2 0.1 ≈ -1.361

  27. PWS-quality problem • However, computing the PWS-quality over all possible worlds is too expensive • The number of possible-world results may be exponential • The computation needs to be sped up

  28. x-Form PWS-Quality • x-form of the PWS-quality • g(k, D, Q) = func(existential and qualification probabilities of the tuples in the k-th x-tuple) • S(D, Q) is the sum of the quality contributions g(k, D, Q) of all result x-tuples • Only x-tuples that have tuples in the query answer are considered

  29. x-Form of PRQ (Range Query) • Each g(k, D, Q) requires only O(|τk|) time • pi and Pk are the qualification probabilities of the current tuple ti and the current x-tuple τk, which are easy to compute
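
The slide's formula for g(k, D, Q) did not survive in the transcript. The sketch below is therefore our own derivation, not necessarily the authors' exact expression: assuming x-tuples are independent, a range query result is determined per x-tuple (either one of its tuples qualifies, with probability pi, or none does, with probability 1 - Pk), so the entropy-like score splits into one term per x-tuple. Base-2 logarithms and the names `g_range` and `quality_from_results` are assumptions.

```python
import math


def g_range(qual_probs):
    """Per-x-tuple contribution for a range query (our derivation).

    qual_probs holds the qualification probabilities p_i of the tuples of
    one x-tuple; P_k is their sum. Runs in time linear in the x-tuple size.
    """
    p_k = sum(qual_probs)
    outcomes = [p for p in qual_probs if p > 0]
    if 1 - p_k > 1e-12:
        outcomes.append(1 - p_k)  # no tuple of this x-tuple qualifies
    return sum(p * math.log2(p) for p in outcomes)


def quality_from_results(result_probs):
    """Reference value computed directly from the possible-world results."""
    return sum(q * math.log2(q) for q in result_probs if q > 0)


# Query 1: x-tuple b has p(b1) = 0.6; x-tuple c has p(c2) = 0.3, p(c3) = 0.2.
x_form = g_range([0.6]) + g_range([0.3, 0.2])
direct = quality_from_results([0.18, 0.12, 0.3, 0.12, 0.08, 0.2])

if __name__ == "__main__":
    print(round(x_form, 4), round(direct, 4))
```

Both values agree on the Query 1 example, while the per-x-tuple version never enumerates possible worlds.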

  30. x-Form of PMaxQ (Max Query) • Computing g(k, D, Q) for PMaxQ requires O(|τk|^2) time • The details of the proof are discussed at the end of the presentation

  31. x-form PWS-quality summary • Transforming the original PWS-quality calculation into the x-form calculation avoids the exponential computation time • Total computation time: O(m log(n/m)) • Compared with the query time, the x-form PWS-quality calculation time is small (as the experiments show)

  32. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  33. Cleaning with limited budget • With a limited budget of, say, 10 units, which x-tuples should we clean? • Cleaning cost: 5 units • Cleaning cost: 7 units • Cleaning cost: 10 units

  34. Example of cleaning • After cleaning, the existential probability of the remaining tuple becomes 1 • The x-tuple is contracted to a single tuple with a certain attribute value

  35. Quality improvement • Expected quality after cleaning • The set of x-tuples to be cleaned is denoted X = {τ1, ···, τ|X|} • Quality improvement • However, computing the quality improvement directly takes exponential time

  36. Computation example: Query 2 (max query): Select the product with the highest price • Suppose we decide to clean x-tuple c

  37. Computation example (Cont.): Query 2 (max query): Select the product with the highest price • We decide to clean x-tuple c • One possible outcome: c3 is the real-world value • New PWS-quality S(D’, Q) = -1.17

  38. Computation example (Cont.): Query 2 (max query): Select the product with the highest price • We decide to clean x-tuple c • Another possible outcome: c2 is the real-world value • New PWS-quality S(D’, Q) = -1.17

  39. Computation example (Cont.): Query 2 (max query): Select the product with the highest price • Cleaning x-tuple c has 3 possible real-world outcomes • Expected quality after cleaning x-tuple c = 0 * 0.5 + (-1.17) * 0.3 + (-1.17) * 0.2 = -0.585
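
The expected quality above can be checked by brute force on a small database. The sketch below is an illustration under stated assumptions: only product a's values are given in the slides, so the prices of b and c and the alternative b2 are hypothetical (chosen to agree with the quoted results), base-2 logarithms are assumed, and the function names are ours. With unrounded intermediate values the result is about -0.586, against the slide's -0.585 obtained from the rounded -1.17.

```python
import math
from collections import defaultdict
from itertools import product

# (tuple_id, price, probability); b and c prices are hypothetical.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
}


def max_query_quality(db):
    """PWS-quality of the max query, by enumerating all possible worlds."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(p for _, _, p in world)
        top = max(price for _, price, _ in world)
        # Ties are kept: every tuple with the highest price is an answer.
        answer = frozenset(tid for tid, price, _ in world if price == top)
        results[answer] += prob
    return sum(q * math.log2(q) for q in results.values() if q > 0)


def expected_quality_after_cleaning(db, name):
    """Average the quality over every value the cleaned x-tuple may take."""
    expected = 0.0
    for tid, price, p in db[name]:
        cleaned = dict(db)
        cleaned[name] = [(tid, price, 1.0)]  # the x-tuple becomes certain
        expected += p * max_query_quality(cleaned)
    return expected


if __name__ == "__main__":
    print(round(max_query_quality(DB), 3))
    print(round(expected_quality_after_cleaning(DB, "c"), 3))
```

The difference between the two printed values is the expected quality improvement of cleaning x-tuple c; computing it this way is exponential in the number of x-tuples, which is the cost the x-form is meant to avoid.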

  40. x-form quality improvement • In x-form, the quality improvement reduces to a sum over the x-tuples in X • X is the set of x-tuples that we are going to clean • Proof sketch: rewrite the original E(S(D’(t), Q)) as two parts; the left part is equal to 0 and the right part is unchanged after the cleaning

  41. Optimal Data Cleaning Algorithm • With the x-form quality improvement, the objective is to maximize the total quality improvement of the selected x-tuples while keeping their total cleaning cost within the budget • ck: the cleaning cost of the k-th x-tuple • C: total cleaning budget • Z: total number of x-tuples with pi in (0, 1) • Can be transformed into the 0/1 knapsack problem

  42. DP algorithm • Time complexity: O(CZ) • Space complexity: O(CZ^2) • C: total budget • Z: number of x-tuples
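
A textbook 0/1 knapsack dynamic program gives a feel for the selection step. This is a generic sketch, not the authors' algorithm as published: it assumes integer costs and an integer budget, and the improvement values in the example are hypothetical (only the costs 5, 7, 10 and the budget 10 echo the earlier slide). The function name `choose_xtuples` is ours.

```python
def choose_xtuples(improvement, cost, budget):
    """Pick x-tuples maximizing total improvement within the budget.

    improvement[k] and cost[k] describe the k-th candidate x-tuple.
    Returns (best total improvement, sorted indices of chosen x-tuples).
    """
    n = len(cost)
    # best[k][c]: best improvement using the first k x-tuples and budget c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        for c in range(budget + 1):
            best[k][c] = best[k - 1][c]  # option 1: skip x-tuple k-1
            if cost[k - 1] <= c:         # option 2: clean it, if affordable
                cand = best[k - 1][c - cost[k - 1]] + improvement[k - 1]
                if cand > best[k][c]:
                    best[k][c] = cand
    # Walk back through the table to recover the chosen x-tuples.
    chosen, c = [], budget
    for k in range(n, 0, -1):
        if best[k][c] != best[k - 1][c]:
            chosen.append(k - 1)
            c -= cost[k - 1]
    return best[n][budget], sorted(chosen)


if __name__ == "__main__":
    # Hypothetical improvements; costs and budget as in the budget example.
    print(choose_xtuples([1.0, 1.5, 1.2], [5, 7, 10], 10))
```

The two nested loops visit every (x-tuple, budget) pair once, which is where the O(CZ) running time comes from.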

  43. Other heuristic methods: • Random • MaxQP: select the x-tuples with the highest qualification probability • Greedy: rank x-tuples by expected quality improvement per unit of cleaning cost
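
The greedy heuristic can be sketched in a few lines. This is one plausible reading of the slide, assuming the ranked x-tuples are taken in order and skipped when they no longer fit the budget; the function name `greedy_select` and the improvement values are hypothetical.

```python
def greedy_select(improvement, cost, budget):
    """Take x-tuples in decreasing order of improvement per unit cost.

    Costs are assumed to be positive. Returns the sorted indices of the
    x-tuples that fit within the budget.
    """
    order = sorted(range(len(cost)),
                   key=lambda k: improvement[k] / cost[k],
                   reverse=True)
    chosen, spent = [], 0
    for k in order:
        if spent + cost[k] <= budget:
            chosen.append(k)
            spent += cost[k]
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical improvements with costs 5, 7, 10 and a budget of 10.
    print(greedy_select([1.0, 1.2, 1.9], [5, 7, 10], 10))
```

In this example the greedy choice is x-tuple 0 (total improvement 1.0), although cleaning x-tuple 2 alone would fit the budget and yield 1.9; the heuristic trades optimality for a cheaper selection step.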

  44. Outline • Background • Related work • Data and Query model • PWS-quality model • Cleaning procedure • Experimental results

  45. Experiment setup

  46. PWS-quality (S) vs. database size (Z) (PRQ)

  47. Quality evaluation performance (PRQ) vs. database size

  48. Running time of cleanup selection (PMaxQ) [Figure; x-axis: total budget]

  49. Quality improvement vs. budget (PRQ) [Figure; x-axis: total budget, y-axis: quality improvement]

  50. Quality improvement vs. budget (PMaxQ) [Figure; x-axis: total budget, y-axis: quality improvement]
