
Managing Uncertain Data




  1. Managing Uncertain Data. Anish Das Sarma, Stanford University

  2. What is Uncertain Data?

  3. Why Does It Arise? • Precision of devices • Lack of information • Uncertainty about the future • Anonymization

  4. Applications: Information Extraction

  5. Applications: Information Integration • (name, hPhone, oPhone, hAddr, oAddr) • (name, phone, address) • Combined View

  6. Applications: Deduplication • 80% match?

  7. Applications: Scientific & Medical Experiments • Probably not cancer

  8. How Do Database Management Systems (DBMS) Handle Uncertainty? They don’t.

  9. What Do (Most) Applications Do? • Clean: turn into data that DBMSs can handle • Loss of information • Errors compound insidiously

  10. Outline of The Talk • Part 1: Managing Uncertainty in a DBMS (theory → systems) • Part 2: Handling Uncertainty in Data Integration (systems → theory) • Other Research (trailer) • Future Plans

  11. Part 1: Managing Uncertain Data • Primarily in the context of the Trio project • Data • Uncertainty • Lineage • Today’s focus: how lineage helps

  12. Uncertain Data • An uncertain database represents a set of possible instances (or possible worlds) • Our work: finite sets of possible instances

  13. Representing Uncertain Data • 20+ years of work (mostly theoretical) • There appears to be a fundamental trade-off between expressiveness and intuitiveness • We spent some time exploring the space of models for uncertainty

  14. Hierarchy of Models [ICDE 06] • One end of the hierarchy: + Expressive, - Complex • Other end: + Intuitive, - Inexpressive • Next: consider a model M, isolate its inexpressiveness, solve the problem with lineage

  15. Running Example: Crime-Solver • Saw (witness, color, car) // may be uncertain • Drives (person, color, car) // may be uncertain • Suspects (person) = π_person(Saw ⋈ Drives)

  16. Simple Model M: 1. Alternatives: uncertainty about value; 2. ‘?’ (Maybe) annotations. Three possible instances.

  17. Simple Model M: 1. Alternatives; 2. ‘?’ (Maybe): uncertainty about presence. Six possible instances.
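
To make the two constructs concrete, here is a minimal Python sketch that enumerates the possible instances of a relation with alternatives and ‘?’ annotations. The rows are hypothetical (the slide's own table did not survive in this transcript); they are chosen so that one tuple has three alternatives and a second tuple is marked ‘?’, which gives six instances.

```python
from itertools import product

# Hypothetical Saw relation: each x-tuple has a list of alternatives
# and a "maybe" flag ('?' annotation). Illustrative data only.
saw = [
    {"alternatives": [("Amy", "red", "Honda"),
                      ("Amy", "red", "Toyota"),
                      ("Amy", "orange", "Mazda")],
     "maybe": False},
    {"alternatives": [("Betty", "blue", "Acura")],
     "maybe": True},
]

def possible_instances(relation):
    """Return every possible instance as a frozenset of ordinary tuples."""
    choices = []
    for xt in relation:
        options = list(xt["alternatives"])
        if xt["maybe"]:
            options.append(None)  # None stands for "tuple absent"
        choices.append(options)
    # Pick one option per x-tuple; drop the absent ones.
    return {
        frozenset(t for t in combo if t is not None)
        for combo in product(*choices)
    }

instances = possible_instances(saw)
print(len(instances))  # 3 alternatives * 2 presence choices
```

Dropping the ‘?’ from the second tuple leaves only the three instances that come from the alternatives.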

  18. Review: Relational Queries • [Diagram: query Q, e.g. π_person(σ_color=red), applied to database D yields result S]

  19. Queries on Uncertain Data • D is a representation of possible instances I1, I2, …, In • Q on each instance yields J1, J2, …, Jm, represented by D′ • Direct implementation goes from D to D′ • Closure: up-arrow always exists • Completeness: all sets of possible instances can be represented
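
The possible-instances semantics can be sketched directly: enumerate the instances, run the ordinary relational query on each one, and collect the distinct results. This is only the reference semantics, not how a system would implement it, and the data below is hypothetical. Each x-tuple is written as a list of options, with `None` meaning the tuple is absent.

```python
from itertools import product

def instances(relation):
    """All possible instances of a relation given as a list of x-tuples.
    Each x-tuple is a list of options; None means the tuple is absent."""
    return {
        frozenset(t for t in combo if t is not None)
        for combo in product(*relation)
    }

def suspects(saw, drives):
    """pi_person(Saw join Drives) on ordinary relations, joining on (color, car)."""
    return frozenset(
        person
        for (_witness, s_color, s_car) in saw
        for (person, d_color, d_car) in drives
        if (s_color, s_car) == (d_color, d_car)
    )

def query_over_instances(saw_rel, drives_rel):
    """Apply the query to every pair of possible instances."""
    return {
        suspects(s, d)
        for s in instances(saw_rel)
        for d in instances(drives_rel)
    }

# Hypothetical data: one sighting with two alternatives, two '?' drivers.
saw = [[("Cathy", "red", "Honda"), ("Cathy", "red", "Mazda")]]
drives = [
    [("Jimmy", "red", "Honda"), None],
    [("Billy", "red", "Mazda"), None],
]

result = query_over_instances(saw, drives)
for inst in sorted(result, key=sorted):
    print(sorted(inst))
```

With this data the result has three possible instances: empty, Jimmy alone, and Billy alone. The two suspects never appear together, because each one depends on a different alternative of the same sighting.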

  20. Model M is Not Closed • Suspects = π_person(Saw ⋈ Drives) • CANNOT be represented in M: result tuples marked ‘?’ do not correctly capture the possible instances in the result

  21. Lineage to the Rescue • Model M + Lineage = Completeness

  22. Example with Lineage • Suspects = π_person(Saw ⋈ Drives) • [result tuples each marked ‘?’]

  23. Example with Lineage • Suspects = π_person(Saw ⋈ Drives) • λ(31) = (11,2) ∧ (21,2) • λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2) • λ(33) = (11,1) ∧ 23 • Each result tuple is marked ‘?’ • Correctly captures possible instances in the result

  24. Trio’s Data Model: Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidence values (next) • Lineage • Theorem: ULDBs are closed and complete [VLDB 06] • Formally studied properties like minimization, equivalence, approximation and membership [VLDB 06, VLDB J. 08]

  25. Confidence Values in Trio • Confidence values supplied with base data • Default probabilistic interpretation • Problem: Compute confidence values on result data [ICDE 08] • 5-minute DBClip • Search “confidence computation” on YouTube.

  26. Problem Description • Cars = π_car(Saw ⋈ Drives) • Confidences of the result tuples: ?

  27. Operator-by-Operator • Saw ⋈ Drives: 0.5*0.9 = 0.45; 0.4; 0.6 • π_car: 0.45 + 0.4 - (0.45*0.4) = 0.67 • Wrong!!

  28. Operator-by-Operator • Join results: 0.45; 0.4; 0.6 • 0.45 + 0.4 - (0.45*0.4): the two join results are not independent!

  29. Database Query Processing 101 • Query Q → execution plans • Statistics, indexes • Pick and execute best plan

  30. Operator-by-Operator Confidence Computation • Query Q → plans • The set of plans can be much smaller or empty

  31. Decouple Data and Confidence Computation • Query Q → plans • Compute data • Use lineage to compute confidences (on demand) • Theorem: Arbitrary improvement. [ICDE 08]

  32. Our Approach • λ(41) = 11 ∧ (21 ∨ 22): 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49 (Correct!!) • λ(42) = 12 ∧ 23: 0.6
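
The arithmetic on these slides can be checked with a short Python sketch. It evaluates the lineage formula λ(41) = 11 ∧ (21 ∨ 22) by summing the probability of every truth assignment that satisfies it, and compares that with the operator-by-operator figure. The confidences 0.5, 0.9 and 0.8 are read off the slide's expression; treating the base tuples as independent is an assumption of the sketch, matching that expression.

```python
from itertools import product

# Base-tuple confidences taken from the slide's formula
# 0.5 * (0.9 + 0.8 - 0.9*0.8); independence is assumed.
conf = {"11": 0.5, "21": 0.9, "22": 0.8}

def lineage_41(world):
    """lambda(41) = 11 AND (21 OR 22)."""
    return world["11"] and (world["21"] or world["22"])

def probability(formula, conf):
    """Sum the weights of all assignments in which the formula holds."""
    names = list(conf)
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for name, present in world.items():
            weight *= conf[name] if present else 1.0 - conf[name]
        if formula(world):
            total += weight
    return total

lineage_based = probability(lineage_41, conf)

# Operator-by-operator: confidences of the two join results are combined
# as if independent, although both depend on base tuple 11.
j1 = conf["11"] * conf["21"]   # 0.45
j2 = conf["11"] * conf["22"]   # 0.40
operator_based = j1 + j2 - j1 * j2

print(round(lineage_based, 2), round(operator_based, 2))  # 0.49 0.67
```

The gap comes from tuple 11 being counted twice in the operator-by-operator plan.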

  33. Algorithm • 1. Expand lineage to base data • 2. Get confidence of base data • 3. Evaluate the probability λ(t) • Detecting independence • Memoization • Batch computation • [Diagram: tuple t in R with λ(t) = f(t4,t5,t6,t7), over tuples t1, t2, t4, t5, t6, t7; values shown: 0.9, 0.4, 1.0, 0.7, 0.4; result 0.823]
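
The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not Trio's implementation: the lineage table, tuple ids and confidences are hypothetical, base tuples are assumed independent, and step 3 uses brute-force enumeration rather than the independence detection or batch computation the slide mentions. Only the expansion step is memoized.

```python
from functools import lru_cache
from itertools import product

# Hypothetical lineage: derived tuple id -> formula over other tuple ids.
# A formula is a tuple id (str) or ("and" | "or", sub, sub, ...).
LINEAGE = {
    "t": ("and", "t4", ("or", "t5", "t6")),
    "t4": ("and", "t1", "t2"),
}
# Hypothetical confidences of base tuples (assumed independent).
BASE_CONF = {"t1": 0.5, "t2": 0.5, "t5": 0.9, "t6": 0.4}

@lru_cache(maxsize=None)
def expand(formula):
    """Step 1: rewrite a formula until it mentions base tuples only."""
    if isinstance(formula, str):
        return expand(LINEAGE[formula]) if formula in LINEAGE else formula
    op, *args = formula
    return (op, *[expand(a) for a in args])

def base_tuples(formula):
    """Collect the base tuple ids that occur in an expanded formula."""
    if isinstance(formula, str):
        return {formula}
    _op, *args = formula
    return set().union(*(base_tuples(a) for a in args))

def holds(formula, world):
    """Evaluate an expanded formula under a truth assignment."""
    if isinstance(formula, str):
        return world[formula]
    op, *args = formula
    values = [holds(a, world) for a in args]
    return all(values) if op == "and" else any(values)

def confidence(tuple_id):
    """Steps 2 and 3: look up base confidences, then sum satisfying worlds."""
    formula = expand(tuple_id)
    names = sorted(base_tuples(formula))
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        if holds(formula, world):
            weight = 1.0
            for n in names:
                weight *= BASE_CONF[n] if world[n] else 1.0 - BASE_CONF[n]
            total += weight
    return total

print(round(confidence("t"), 3))
```

Enumeration is exponential in the number of base tuples, which is why optimizations such as detecting independent subformulas matter in practice.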

  34. Some Other Trio Work • Modifications and Versioning [TR 08] • Stored derived relations • Modifications → versions • Indexes and Statistics [MUD 08] • Specialized indexes, histograms • Functional Dependencies & Schema Design [TR 07] • Definitions, sound and complete axiomatization of FDs • Lossless decomposition • FD testing, finding, and inference

  35. Related Work (sample) • Modeling Uncertainty: Plenty, covered in textbooks • Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

  36. Part 2: Data Integration • Reboot! or, wake up!

  37. Traditional Data Integration: Setup • Query: Who authored the most SIGMOD papers in the 90’s? (Mike Carey) • 1. Mediated Schema: Publication(title, author, conf, year) • 2. Schema Mappings, e.g. Mapping: SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid • 3. Query Answering • Sources D1 to D5, e.g. Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid); Bib(title, authors, conf, year) • Significant up-front effort
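
As a small illustration of what a schema mapping does, the Python sketch below runs the slide's mapping query, written as valid SQL, against an in-memory SQLite database and produces rows in the shape of the mediated schema Publication(title, author, conf, year). The table contents are hypothetical placeholders, and SQLite is simply a convenient stand-in for a source database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author(aid INTEGER, name TEXT);
CREATE TABLE Paper(pid INTEGER, title TEXT, year INTEGER);
CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);
-- Hypothetical rows, for illustration only.
INSERT INTO Author VALUES (1, 'A. Author');
INSERT INTO Paper VALUES (10, 'A Hypothetical Paper', 1995);
INSERT INTO AuthoredBy VALUES (1, 10);
""")

# The mapping from the source tables to Publication(title, author, conf, year).
# This source has no conference attribute, so conf is filled with NULL.
MAPPING = """
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid
"""

rows = conn.execute(MAPPING).fetchall()
print(rows)
```

Writing one such mapping per source by hand is the up-front effort the slide refers to.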

  38. “Pay-As-You-Go” Data Integration • Automated best-effort integration from the outset • Further improve the system over time with feedback • How advanced a starting point can we provide?

  39. Uncertainty to the Rescue • Automatic integration • Make guesses • Model probabilities • Specifically • Probabilistic schema mappings • Probabilistic mediated schema • >90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]

  40. Next • Probabilistic mediated schemas • Probabilistic schema mappings • Experimental results

  41. Mediated Schema • S1(name, email, phone-num, address) • S2(person-name, phone, mailing-addr) • Med-S (name, email, phone, addr) = {name, person-name} {email} {phone-num, phone} {address, mailing-addr} • A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.
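
The definition can be written as a small check. The Python sketch below treats a mediated schema as a list of attribute clusters and verifies that every cluster is non-empty, uses only attributes that occur in some source schema, and shares no attribute with another cluster. Reading "clustering" as requiring disjoint clusters is an assumption here, and the function name is illustrative.

```python
# Source schemas and clusters taken from the slide.
sources = {
    "S1": ["name", "email", "phone-num", "address"],
    "S2": ["person-name", "phone", "mailing-addr"],
}
med_schema = [
    {"name", "person-name"},
    {"email"},
    {"phone-num", "phone"},
    {"address", "mailing-addr"},
]

def is_mediated_schema(clusters, sources):
    """True if clusters are non-empty, disjoint, and drawn from source attributes."""
    all_attrs = {a for attrs in sources.values() for a in attrs}
    seen = set()
    for cluster in clusters:
        if not cluster or not cluster <= all_attrs:
            return False          # empty cluster or unknown attribute
        if seen & cluster:
            return False          # attribute already placed in another cluster
        seen |= cluster
    return True

print(is_mediated_schema(med_schema, sources))
```

Because the definition speaks of a subset of the attributes, the check does not require every source attribute to be covered.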

  42. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) ? S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, hPhone, oPhone FROM Med

  43. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med

  44. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med

  45. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med

  46. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med

  47. Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med

  48. Probabilistic Mediated Schema • S1(name, hPhone, oPhone, hAddr, oAddr) • S2(name, phone, address) • Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=0.5 • Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=0.5 • A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1,Pr(M1)), …, (Mk,Pr(Mk))} where • Mi is a med-schema; i≠j => Mi ≠ Mj • Pr(Mi) ∈ (0,1]; ΣPr(Mi) = 1
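
The three conditions in this definition translate directly into a validity check. The Python sketch below applies them to Med3 and Med4 from the slide; the function names and the floating-point tolerance are illustrative choices, not part of the definition.

```python
import math

def canonical(schema):
    """Order-independent form of a mediated schema (a set of clusters)."""
    return frozenset(frozenset(cluster) for cluster in schema)

def is_p_med_schema(pairs, tol=1e-9):
    """Check: schemas pairwise distinct, each Pr in (0, 1], and Pr sums to 1."""
    schemas = [canonical(s) for s, _ in pairs]
    probs = [p for _, p in pairs]
    distinct = len(set(schemas)) == len(schemas)
    in_range = all(0.0 < p <= 1.0 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return distinct and in_range and sums_to_one

med3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"}]
med4 = [{"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"}]

print(is_p_med_schema([(med3, 0.5), (med4, 0.5)]))
```

Listing the same schema twice, or giving probabilities that do not sum to 1, makes the check fail.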

  49. P-Mappings • PM1: four candidate mappings between S1(name, hP, oP, hA, oA) and Med3 (name, hPP, oP, hAA, oA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04 • PM2: four candidate mappings between S1(name, hP, oP, hA, oA) and Med4 (name, oPP, hP, oAA, hA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04 • [The attribute correspondences of each mapping were drawn as arrows and are not preserved in this transcript]

  50. Expressive Power of P-Med-Schema & P-Mapping • Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mappings) > (p-med-schema + mappings) • Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mappings)
