Probabilistic Queries and Uncertain Data


Presentation Transcript


  1. Probabilistic Queries and Uncertain Data Sunil Prabhakar Department of Computer Sciences Purdue University Email: sunil@cs.purdue.edu http://www.cs.purdue.edu/homes/sunil

  2. Introduction • The traditional database model expects data items to be modeled as sets (bags) of tuples consisting of precise attribute values. • However, real-world data does not easily fit into this model if there is uncertainty in the information. • Uncertainty comes from many sources: unreliable measurements and data sources, incomplete or missing information, irreconcilable facts, … • This problem has been recognized for a long time (e.g. NULL values) and numerous models have been proposed. Sunil Prabhakar, Probabilistic Queries and Uncertain Data, COMAD 2005b

  3. Introduction • Long history of ideas for incorporating uncertain data in databases • Many proposals for models • Recent renewed interest in the area • Some initial work on developing systems • This tutorial provides a sampling of the area. • More information at http://www.cs.purdue.edu/homes/sunil

  4. Outline • Motivating examples • Proposed Models • Implementation issues • Efficiency • Scalability • Prototypes • Open problems • References (Next section: Motivating examples)

  5. Application: Sensor databases [Figure: sensors in the external environment (e.g., temperature, moving objects, hazardous materials) send readings over a network channel to the database system; the user sends queries to the database system and receives results.]

  6. Data uncertainty • Due to limited network bandwidth and battery power, readings are sampled • The value of the entity being monitored (e.g., temperature, location) is changing • Most of the time the database stores old values • Query results can be incorrect!

  7. Answering a Minimum Query [Figure: recorded temperatures x0, y0 and current temperatures x1, y1 for sensors x and y on a 0-30 °F axis. Database answer (from recorded values): X. Correct answer (from current values): Y.]

  8. Bounding Uncertainty with Dead-Reckoning • Data values cannot change drastically • The system negotiates a bound d with the sensor [Figure: the sensor reports a value v; the system stores (v, d) and takes the current value to lie in [v-d, v+d].] • Trade-off between data uncertainty and update frequency
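The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration under an assumed update rule (the sensor transmits a new value only when its reading leaves [v-d, v+d]); the class name `DeadReckoningSensor`, the method `observe`, and the readings are hypothetical and do not come from the tutorial.

```python
class DeadReckoningSensor:
    """Hypothetical sensor that reports only when the agreed bound is violated."""

    def __init__(self, initial_value, d):
        self.d = d                      # negotiated bound (assumed fixed)
        self.reported = initial_value   # last value v sent to the system
        self.updates_sent = 1           # the initial report counts as one update

    def observe(self, reading):
        """Process a new reading; return True if an update is sent."""
        if abs(reading - self.reported) > self.d:
            self.reported = reading
            self.updates_sent += 1
            return True
        return False

    def bound(self):
        """Interval the system assumes for the current value."""
        return (self.reported - self.d, self.reported + self.d)


# Made-up temperature readings for illustration.
readings = [20.0, 20.5, 21.2, 22.4, 22.0, 19.0]
tight = DeadReckoningSensor(readings[0], d=1.0)
loose = DeadReckoningSensor(readings[0], d=3.0)
for r in readings[1:]:
    tight.observe(r)
    loose.observe(r)

print(tight.updates_sent, tight.bound())
print(loose.updates_sent, loose.bound())
```

With the smaller bound the sensor sends more updates but the stored interval is narrower; with the larger bound it stays silent and the system has to live with a wider interval.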

  9. Answering Minimum Query with Error-Bounded Readings [Figure: recorded temperatures x0, y0, each with a bound for the current temperature, on a 0-30 °F axis.] • x certainly gives the minimum temperature reading

  10. Answering Minimum Query with Error-Bounded Readings [Figure: recorded temperatures x0, y0, each with a bound for the current temperature and an uncertainty pdf over that bound, on a 0-30 °F axis.] • How do we determine the answer to this query? • Each sensor has some chance of giving the minimum reading. • Probabilistic Queries

  11. Probabilistic Queries • As attribute values become uncertain (actually, imprecise), operators (e.g., =, <, >) over these data need to be defined. • These operators may no longer return Boolean results. Instead, given the probability distributions, they can return probabilistic answers.

  12. Answering Minimum Query with Error-Bounded Readings [Figure: recorded temperatures x0, y0, each with a bound for the current temperature, on a 0-30 °F axis.] • (X,0.7), (Y,0.3) • Answers augmented with probabilistic guarantees
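One way to see where such probabilities come from is to sample from the uncertainty pdfs. The sketch below is a Monte Carlo estimate in Python, assuming each current value is uniformly distributed inside its bound; the function name `min_probabilities` and the two intervals are hypothetical, so the output is not meant to reproduce the 0.7 / 0.3 figures on the slide.

```python
import random


def min_probabilities(bounds, samples=100_000, seed=0):
    """Estimate, for each sensor, the probability that it holds the minimum.

    bounds maps a sensor name to its (low, high) bound; the current value is
    ASSUMED to be uniformly distributed inside the bound.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in bounds}
    for _ in range(samples):
        # Draw one possible current value per sensor.
        draw = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        # Credit the sensor with the smallest drawn value.
        wins[min(draw, key=draw.get)] += 1
    return {name: count / samples for name, count in wins.items()}


# Hypothetical overlapping bounds (in °F) for sensors x and y.
bounds = {"x": (5.0, 15.0), "y": (10.0, 20.0)}
result = min_probabilities(bounds)
print(result)
```

For these two intervals the exact answer is 0.875 for x and 0.125 for y, so the estimate should land close to those values. If the bounds do not overlap, the lower sensor wins in every sample, which matches the "x certainly gives the minimum" case.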

  13. Sensor Errors • In the previous examples, uncertainty was introduced in order to avoid incorrect results • Uncertainty may be inherent due to measurement errors, e.g. • Most scientific instruments have well known errors • GPS has a Gaussian distribution • Micro-array data have a Lorentzian distribution • Statistical results also have margins of error • Similar to previous case

  14. Data Privacy • Uncertainty may sometimes be desirable in order to provide privacy for individuals. • Instead of reporting an exact location to a Location-Based service provider, users can obfuscate their location to a small spatial region. • This naturally results in ambiguity (uncertainty) in query results.

  15. Application: Protein Annotation • Consider a protein database that records the functions of the proteins (annotations). • Some function information is experimentally derived and has high confidence (certainty). • More often, annotations are transferred based upon computational results • HMMs • Sequence similarity • Rule bases • Such annotations are inherently less reliable. • As these annotations propagate, so do the errors. • It is desirable to be able to capture the uncertainties in the annotations within the database.

  16. Application: Text Retrieval • In text retrieval systems, answers to queries are typically inexact. • For example, “Find documents on uncertain data management” • Results are ranked in order of relevance to the query • Thus, the answer can be viewed as having a probability of being part of the result relation • When multiple conditions are tested -- how do we combine these rankings? • Probabilistic modeling can help in this situation.

  17. Application: Data Integration & Cleaning • When integrating multiple databases, it is necessary to identify matches between tuples • For many pairs, there is no clear Yes/No answer to the matching question • Existing methods can provide a probability or degree of match which can be exploited in an application-specific manner. • How should these uncertainties in the result of cleaning or integration be handled?

  18. Unreliable Sources, Missing Data • Consider the following cases: • Information received from certain sources may not be entirely reliable (compromised sensors, poor quality of data, …). • Information from multiple sources may be inconsistent, even contradictory. • An attribute’s exact value may not be known, but it can be only one of a few possibilities. • Each of these cases is an example where the data is uncertain.

  19. Application Needs • In summary, we see that there are numerous applications for which uncertainty in data is either inherent or desirable. • Existing systems do not provide any support for uncertain data, thereby compelling applications to morph their data to fit the model. • There is a real need for the development of database systems that handle uncertain data. • The characteristics of uncertainty are diverse and often application-dependent.

  20. Outline • Motivating examples • Proposed Models • Implementation issues • Efficiency • Scalability • Prototypes • Open problems • References

  21. Uncertain Data Models • There have been numerous proposals for models. Some distinguishing features include: • Nature of uncertainty (probabilistic, …) • Types of databases (Relational, XML, …) • Complexity of uncertainty • Granularity of uncertainty • Handling correlations • Handling missing data • Types of uncertainty supported

  22. Types of uncertainty models • Qualitative models • NULL values • Definite, Indefinite, or Maybe [LS87, LS91] • Quantitative models • Probabilistic • Dempster-Shafer (evidence-based) [LSS96, Lee92] • Fuzzy sets (possibilities) [CUP06]

  23. Probabilistic Models • There are two main types of probabilistic data uncertainty addressed in recent work: • Attribute uncertainty • The value of an attribute of a tuple is not known precisely • Modeled as a set or range of possible values with associated probabilities • Tuple uncertainty • The membership (presence) of an entire tuple within a relation is uncertain • May be modeled as a probability attached to the tuple.

  24. Other Models • Some systems consider both types ([GUP06]) • Table uncertainty has also been proposed to handle coverage of a table (what percentage of tuples are present in the table) [Wid05]. • Probabilistic databases in the semi-structured model • XML data (Nierman & Jagadish) [NJ02] • Acyclic data structure (Hung, Getoor & Subrahmanian) [HGS03] • Fuzzy databases [GUP06] (possibility values) • Uncertainty in Deductive Databases [LS97, LS01, LS03]

  25. Tuple Uncertainty • There has been a significant amount of work in this domain dating back (at least) to 1979. • The basic idea is that the membership of a tuple in a relation is not certain. • This uncertainty may reflect the degree of confidence that this tuple belongs to the relation or the degree of relevance of the tuple to the relation (a query answer).

  26. Some Tuple Uncertainty Models • Cavallo and Pittarelli [CP87] • Fuhr and Roellke [FR97] • Fuhr [Fuhr95] • Dey and Sarkar [DS96] • TRIO [Wid05]

  27. Fuhr [FR97, Fuhr90, Fuhr95] • Input relations are assumed to have attributes that have probabilistic events associated with them. • These are assumed to be independent • The evaluation of queries results in new tuples with complex events associated with them. • These tuples may no longer be independent, which complicates probability computation. • Fuhr solves this problem using intensional semantics -- for each tuple, the complex event is derived. In the final step the probability value of this event is computed. • This is very expensive and complicated.

  28. Dalvi & Suciu [DS04, DS05] • Dalvi and Suciu explore extensional evaluations -- the probability values of tuples after the application of operators are computed. • However, this can lead to incorrect results in some cases, which motivates the notion of safe query plans. • An algorithm to identify a safe extensional plan for a query is developed. May not always return a result. • Heuristic plans and approximations are proposed for the case where the data complexity of the query is #P-complete. • [DS05] addresses the case where input relation tuples are not independent.
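A small worked example may help show why extensional evaluation can go wrong. The Python sketch below uses two hypothetical tuple-independent relations S(A,B) and T(C,D) with made-up probabilities and evaluates the Boolean query "is there an S tuple joining a T tuple on B = C?" in three ways: by enumerating possible worlds, by a plan that joins first and then projects, and by a plan that projects first and then joins. The relation contents and helper names are illustrative assumptions, not taken from [DS04].

```python
from itertools import product


def independent_project(probs):
    """Probability that at least one of several tuples, ASSUMED independent, is present."""
    absent = 1.0
    for p in probs:
        absent *= 1.0 - p
    return 1.0 - absent


# Hypothetical tuple-independent relations: tuple -> probability of presence.
S = {("m", 1): 0.8, ("n", 1): 0.5}
T = {(1, "p"): 0.6}

# Ground truth: enumerate all possible worlds (2^3 here) and sum the weights
# of the worlds in which some S tuple joins some T tuple on S.B = T.C.
tuples = [("S", t, p) for t, p in S.items()] + [("T", t, p) for t, p in T.items()]
exact = 0.0
for present in product([True, False], repeat=len(tuples)):
    weight = 1.0
    for (_, _, p), here in zip(tuples, present):
        weight *= p if here else 1.0 - p
    s_vals = {t[1] for (rel, t, _), here in zip(tuples, present) if here and rel == "S"}
    t_vals = {t[0] for (rel, t, _), here in zip(tuples, present) if here and rel == "T"}
    if s_vals & t_vals:
        exact += weight

# Plan 1: join first, then project. The two joined tuples share the same T
# tuple, so treating them as independent in the projection is wrong.
joined = [ps * pt for (_, b), ps in S.items() for (c, _), pt in T.items() if b == c]
plan_join_first = independent_project(joined)

# Plan 2: project S onto B first, then join with T. For this data every
# operator only ever combines independent events.
by_b = {}
for (_, b), ps in S.items():
    by_b.setdefault(b, []).append(ps)
s_b = {b: independent_project(ps) for b, ps in by_b.items()}
plan_project_first = independent_project(
    [pb * pt for b, pb in s_b.items() for (c, _), pt in T.items() if b == c]
)

print(exact, plan_join_first, plan_project_first)
```

On this data the join-first plan gives about 0.636 while the possible-worlds answer and the project-first plan both give 0.54, which is the kind of discrepancy that safe plans are meant to rule out.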

  29. Information Source Tracking • Fereidoon Sadri [FS91, FS95] • Sources of data are assigned a reliability • Query answers and derived data are also assigned a score that can be computed • Each tuple is assigned a propositional formula that describes its certainty (in terms of the reliability of sources) -- vectors • Sources are assumed to be independent • Computing a query implies computing the vectors for each tuple and then computing the corresponding certainty -- requires certainty of sources

  30. Information Source Tracking (Cont.) • Possible worlds semantics: k sources, 2^k possible relations • Provided definitions of extended operators that guarantee soundness and completeness, i.e., the result of these operators over uncertain relations has the same set of possible worlds as applying regular relational operators over the possible worlds of the input relations • Efficiency concerns due to the large size of the set of possible worlds. • Algorithms for aggregations also developed, but mostly expensive or NP-Complete
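The 2^k possible worlds can be made concrete with a brute-force sketch in Python. It assumes a simplified encoding that is not Sadri's vector notation: a tuple's formula is a list of source sets, and the tuple is present in a world if every source in at least one of the sets is reliable in that world. The function name `tuple_certainty` and the reliabilities are hypothetical.

```python
from itertools import product


def tuple_certainty(formula, reliability):
    """Certainty of a tuple by enumerating all 2^k reliability assignments.

    formula: list of sets of source names (a disjunction of conjunctions);
             this encoding is an illustrative ASSUMPTION.
    reliability: source name -> probability that the source is reliable;
             sources are assumed independent.
    """
    sources = sorted(reliability)
    certainty = 0.0
    for world in product([True, False], repeat=len(sources)):
        ok = dict(zip(sources, world))
        # Weight of this world under source independence.
        weight = 1.0
        for s in sources:
            weight *= reliability[s] if ok[s] else 1.0 - reliability[s]
        # The tuple is present if some supporting set is fully reliable.
        if any(all(ok[s] for s in term) for term in formula):
            certainty += weight
    return certainty


# Hypothetical reliabilities for two sources.
reliability = {"s1": 0.9, "s2": 0.5}
only_s1 = tuple_certainty([{"s1"}], reliability)        # reported by s1 alone
either = tuple_certainty([{"s1"}, {"s2"}], reliability)  # reported by s1 or by s2
both = tuple_certainty([{"s1", "s2"}], reliability)      # derived from s1 and s2 together
print(only_s1, either, both)
```

The loop over `product` is exactly the enumeration that becomes impractical as k grows, which is the efficiency concern raised on this slide.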

  31. Attribute Uncertainty • The earliest example of work in this area is the notion of NULL values (Codd) • The probabilistic data model (PDM) proposed in [BGP92] -- focus on discrete values • ProbView [LLR+97] • Continuous attribute case proposed for sensor data [CKP03]

  32. Codd’s model for uncertainty • NULL values are a means of capturing uncertainty with three-valued logic (T,F,M) • A-mark and I-mark also introduced along with a four-valued logic (T, F, A, I) • A-mark implies that the attribute value exists, but is not known. • I-mark implies that the attribute value is undefined, or does not exist.

  33. Probabilistic Data Model • Barbara, Garcia-Molina, Porter [BGP92] • Discrete attribute uncertainty • Key attributes are deterministic (precise) • Notion of attribute groups (handles dependent data) • Captures missing probability (no assumption) • Probabilities may be user defined, statistically determined, due to staleness, etc.

  34. Probabilistic Data Model (cont.) • Selects can refer to attributes or probabilities • Selection conditions specify cutoff probabilities • Two flavors -- must and maybe (without or with the missing probability, respectively) • SELECT APPLICANTS WHERE ACC_EVAL: V = [Y, *], P > 0.7 (Adam not in result -- Must semantics) • SELECT APPLICANTS WHERE ACC_EVAL: v = [Y, *], p > 0.7 (Adam in result -- Maybe semantics) • Natural joins allowed where join attribute must be key for one of the relations (not commutative) • Project similarly defined for dropping attributes from groups • Studied impact of missing probabilities on joins -- may lead to loss of information.
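The difference between the two flavors can be illustrated with a short Python sketch. It rests on an assumed reading of the slide: under must semantics only the explicitly assigned probability counts, while under maybe semantics the missing probability is added as if it could all fall on the selected value. The relation contents (including Adam's distribution), the single-attribute simplification of the [Y, *] pattern, and the function name `select` are hypothetical.

```python
# Hypothetical probabilistic relation: applicant -> distribution over one
# attribute. Probabilities that sum to less than 1 leave "missing" probability.
applicants = {
    "Adam": {"Y": 0.6, "N": 0.2},   # 0.2 missing
    "Beth": {"Y": 0.8, "N": 0.2},   # nothing missing
    "Carl": {"Y": 0.3, "N": 0.6},   # 0.1 missing
}


def select(relation, value, threshold, maybe=False):
    """Return keys whose probability for `value` exceeds `threshold`.

    maybe=False: must semantics, only the stated probability counts.
    maybe=True:  maybe semantics, the missing probability is added on top
                 (an ASSUMED interpretation for illustration).
    """
    result = []
    for key, dist in relation.items():
        p = dist.get(value, 0.0)
        if maybe:
            p += max(0.0, 1.0 - sum(dist.values()))
        if p > threshold:
            result.append(key)
    return result


must_result = select(applicants, "Y", 0.7)
maybe_result = select(applicants, "Y", 0.7, maybe=True)
print(must_result, maybe_result)
```

With these made-up numbers Adam is excluded under must semantics and included under maybe semantics, mirroring the two queries on the slide.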

  35. Probabilistic Data Model (contd.) • New operators: • -SELECT, -Join: Based upon similarity of probability distributions • STOCHASTIC: convert regular relation to probabilistic based upon given schema (freq gives probability) • DISCRETE: convert probabilistic relation to a regular relation (based upon expected values) • GROUP: merge two or more attribute groups into one

  36. ProbView [LLR+97] • Attribute values specified as alternative discrete values with probability intervals. • Attribute uncertainty is converted to tuple uncertainty. • Possible worlds are derived from this set with upper and lower bounds on probabilities. • Annotated relations obtained by flattening probabilistic relations with path (expressions on worlds) • Computing probabilities for queries is done via user-specified functions. • Relational algebra operations are extended to handle the probability bounds and paths.

  37. Continuous Attribute Uncertainty [Figure: uncertainty pdf f_i(x) over the uncertainty interval [L, R].] • Cheng, Kalashnikov, Prabhakar [CKP03a, CKP04] • Allow an attribute value to be a continuous range with an associated probability density function • The cumulative probability over the interval should be 1 • General continuous attribute uncertainty model • Covers models used in various application domains, e.g., • location uncertainty [WSCY99, PJ99] • DNA microarray data error [BWW+02]

  38. Probabilistic Nearest Neighbor Query [Figure: query point Q with uncertain objects A, B, C, D; A lies at distance r from Q.] • At distance r, A is the nearest neighbor of Q if: • A is at distance r from Q • B, C, D are all located at distances > r from Q. • The pdf p_A(r) can be computed.

  39. Probabilistic Nearest Neighbor Query [Figure: query point Q with uncertain objects A, B, C, D; n_A and f_A mark the shortest and longest distances of A to Q.] • Compute p_A(r) • From the shortest distance of A to Q (n_A) • To the longest distance of A to Q (f_A)
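The computation outlined on these two slides can be sketched numerically: the probability that A is the nearest neighbor is the integral, from n_A to f_A, of the density of A's distance at r times the probability that every other object is farther than r. The Python sketch below makes a strong simplifying assumption that each object's distance to Q is uniformly distributed between its shortest and longest distance; the function names and the distance ranges are hypothetical.

```python
def uniform_pdf(r, near, far):
    """Density of a distance ASSUMED uniform on [near, far]."""
    return 1.0 / (far - near) if near <= r <= far else 0.0


def uniform_cdf(r, near, far):
    """Probability that the distance is at most r."""
    if r <= near:
        return 0.0
    if r >= far:
        return 1.0
    return (r - near) / (far - near)


def nn_probability(target, distances, steps=10_000):
    """Midpoint-rule integral of pdf_target(r) * prod(1 - cdf_other(r))
    over r from the target's shortest to its longest distance."""
    near, far = distances[target]
    width = (far - near) / steps
    total = 0.0
    for i in range(steps):
        r = near + (i + 0.5) * width
        # Probability that all other objects are farther than r.
        others_farther = 1.0
        for name, (n, f) in distances.items():
            if name != target:
                others_farther *= 1.0 - uniform_cdf(r, n, f)
        total += uniform_pdf(r, near, far) * others_farther * width
    return total


# Hypothetical (shortest, longest) distances of each object to Q.
distances = {"A": (1.0, 3.0), "B": (2.0, 4.0), "C": (5.0, 6.0)}
probs = {name: nn_probability(name, distances) for name in distances}
print(probs)
```

Object C can never be closer than A's longest distance, so its probability is zero; this is the kind of interdependence between objects that makes nearest-neighbor results harder to compute than range results.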

  40. Classification of Probabilistic Results • Four classes of queries identified [CKP03b] • Nature of result values • Continuous: returns a single value e.g., Average query ([l,u], pdf) • Discrete: returns a set of objects e.g., Range query ({(Ti,pi), pi>0}) • Relationship between result values • Independent: whether an object satisfies a query is independent of others e.g., Range query • Interdependent: interplay between objects decides result e.g., Nearest-Neighbor query

  41. Classification of Probabilistic Queries The notion of query answer quality was also introduced. For each class of queries, a metric for query quality was specified. Intuitively, this metric captures the degree of uncertainty in the answer (as compared to an answer derived over precise data).

  42. Quality of Probabilistic Result • Probabilistic queries: notion of result "quality" • Example: range query (is Ti.z in range [l, u]?) • regular range query • "yes" or "no" • probabilistic range query

  43. Quality for Continuous-Interdependent Queries • Query result: [l,u], {p(x) : x ∈ [l,u]} • U[3,4] less ambiguous than U[1,100] • Differential entropy • Measures uncertainty associated with r.v. X with pdf p • max(H(X)) = log2(u-l) iff X ~ U[l,u] (most uncertain) • Metrics for other classes also proposed.
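The entropy-based comparison can be checked with a short Python sketch. It uses the standard definition of differential entropy, H(X) = -∫ p(x) log2 p(x) dx over [l,u], evaluated in closed form for uniform distributions and by a midpoint rule otherwise. The function names and the triangular pdf used as a non-uniform example are illustrative assumptions.

```python
import math


def uniform_differential_entropy(l, u):
    """Differential entropy, in bits, of the uniform distribution on [l, u]."""
    return math.log2(u - l)


def differential_entropy(pdf, l, u, steps=10_000):
    """Midpoint-rule estimate of -integral of p(x) * log2 p(x) over [l, u]."""
    width = (u - l) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(l + (i + 0.5) * width)
        if p > 0.0:
            total -= p * math.log2(p) * width
    return total


narrow = uniform_differential_entropy(3, 4)    # U[3,4]
wide = uniform_differential_entropy(1, 100)    # U[1,100]

# Hypothetical non-uniform pdf on [3,4]: p(x) = 2(x - 3), which integrates to 1.
triangular = differential_entropy(lambda x: 2.0 * (x - 3.0), 3.0, 4.0)

print(narrow, wide, triangular)
```

U[3,4] has entropy 0 bits and U[1,100] has log2(99) bits, so the narrower answer is less ambiguous; the triangular pdf on [3,4] comes out below the uniform one on the same interval, consistent with the uniform distribution being the most uncertain.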

  44. Outline • Motivating examples • Proposed Models • Implementation issues • Efficiency • Scalability • Prototypes • Open problems • References

  45. Implementation Challenges • Many proposals have not addressed the issues of implementation • Some models are known to be very expensive computationally, e.g. the model proposed in [FR97]. • Is it possible to avoid enumeration of all possible worlds in order to compute queries? • Notion of safe queries and extensional evaluation [DS04].

  46. Extensional Semantics [DS04] • Intensional evaluation is very expensive. • Propose new extensional evaluation where probabilities are continuously maintained. • Can lead to incorrect results -- develop the notion of safe extensional plans based upon PWD semantics. • Extensional plans not always available. • Some heuristics have been proposed. • Can one do better? • Work done in the context of queries with uncertain predicates (information retrieval). • What about other domains?

  47. Orion Query Evaluation [CKP03] • Probabilistic Range Query example • Result: {(T1,0.2),(T2,0.8)} [Figure: recorded temperature and uncertainty for the current temperature of sensors T1 and T2 on a 0-30 °F axis.]

  48. Probabilistic Threshold Range Query (PTRQ) • Users are likely to be concerned with results that meet a given cutoff probability. • Example: retrieve sensor ids with readings between 10 °F and 25 °F with probability ≥ 0.7 • PTRQ: Given [a,b] and p, return {Ti} where Prob(value of Ti is inside [a,b]) ≥ p • How to exploit indexes for such queries? • Use R-tree or interval index [AV96, KRVV96, MTT00] to find intervals intersecting [a,b] • For each object retrieved, evaluate its probability of being within [a,b]. Return objects with probability ≥ p
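The two-step evaluation described above (find intersecting intervals, then check each probability against p) can be sketched in Python. The sketch assumes uniform uncertainty pdfs and replaces the R-tree or interval index with a plain linear scan, so it shows the semantics of a PTRQ rather than an indexed implementation; the sensor intervals and function names are hypothetical.

```python
def prob_in_range(interval, a, b):
    """Probability that a value ASSUMED uniform on `interval` falls in [a, b]."""
    lo, hi = interval
    overlap = min(hi, b) - max(lo, a)
    return max(0.0, overlap) / (hi - lo)


def ptrq(objects, a, b, p):
    """Return the ids whose probability of lying inside [a, b] is at least p."""
    result = []
    for name, (lo, hi) in objects.items():
        # Step 1 (stand-in for the index lookup): skip non-intersecting intervals.
        if hi < a or lo > b:
            continue
        # Step 2: evaluate the probability and apply the threshold.
        if prob_in_range((lo, hi), a, b) >= p:
            result.append(name)
    return result


# Hypothetical uncertainty intervals (in °F) for three sensors.
sensors = {"T1": (5.0, 30.0), "T2": (12.0, 27.0), "T3": (26.0, 35.0)}
answer = ptrq(sensors, 10.0, 25.0, 0.7)
print(answer)
```

Here T1 intersects the query range but only reaches probability 0.6, so step 2 does work on an object that is then discarded. Candidates of this kind are what a probability-aware index would try to prune earlier.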

  49. Problem with Current Indexes • Current interval indexes do not consider probabilities during search • Many irrelevant objects (probability < p) may be processed. • New indexes for probabilistic data. Orion [CXP+04]: • Probability Threshold Indexing (PTI): 1D interval R-tree with uncertainty • Variance-based Clustering: transform intervals to 2D points and index based on variance

  50. Pruning in a 1D R-Tree [Figure: query Q over range [a, b] with threshold p = 0.3, shown against the intervals of an MBR.] • Some intervals in the MBR may satisfy Q • Need to retrieve the contents of the MBR and evaluate
