Efficient Query Evaluation on Probabilistic Databases

Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan

Outline • Motivation • Query Evaluation: • Intensional • Extensional • Query Optimization • Complexity • Unsafe Plans • Extensions • Conclusions

Databases Are Deterministic • Databases we see today are deterministic • A tuple is either in the query answer or not • They don’t deal with uncertainties

Future of Data Management • Uncertainties in Data • Biological Data • Sensor Data (Geographical Data) • Data extracted from various AI, data mining techniques (information extraction) • Uncertainties are represented as probabilities • Extend data management tools to handle probabilistic data

Example Review Text I have not used IPOD but Apple products are good Facts Table

Representing Uncertainty • Tuple-existence uncertainty • All attributes in a tuple are known precisely; existence of the tuple is uncertain • E.g. in previous slide. More later • Attribute-value uncertainty • Tuples (identified by keys) exist for certain; attributes (one or more) value are however uncertain • Tomorrow, it may rain (probability is 0.6)

Our Goal For Today • Understand how queries can be evaluated efficiently on Probabilistic Databases • For simplicity, we will deal with tuple-level uncertainties only • We also assume independence among tuples. i.e. P(t1, t2) = P(t1) * P(t2)

Cam Feat Cam Feat Cam Feat Cam Feat C21 Lens C29 Batt C21 Lens C21 Lens C31 Lens C29 Batt I1 I2 I3 C31 Lens I4 I5 (1-p1)(1-p2)(1-p3) p1(1-p2)(1-p3) (1-p1)p2(1-p3) p1(1-p2)p3 p1p2p3 Possible Worlds: Example 1 Total number of worlds: 2^count_tuples ; ∑Ii = 1

Possible Worlds: Example 2 s1 s2 0.8 0.5 0.6 t1 T S Possible Worlds pwd(Dp)

Query Evaluation • So, lets consider a query: • Q(D) :- S(A,B), T(C,D), B = C • S join T on B = C, project on D • Intuitively: • Execute the query on each possible world • The final result is a probabilistic relation that represents end result

Query Evaluation: Example S join T on B = C, project on D qpwd(Dp) =

Query Evaluation • Semantically correct • If T has ‘n’ tuples, there can be as many as 2^n possible worlds. • Exponential complexity, thus impractical Goal of the paper: Evaluate query efficiently

Intensional Query Evaluation • Define the complex event ep(t) for each tuple t • For each intermediate tuple, associate an explicit (complex) event expression • Compute the actual probabilities at the end • For this talk, we will look only select, join project queries

v E v v1 v2 E1 ˄ E2 v E1 v1 E1 v2 E2 v E2 v E Intensional Semantics E1 V E2 V…   X … …

Theorem (2) The intesional semantics and the possible world semantics on probabilistic databases are equivalent for conjunctive queries. pwd(qi(Dp)) = qpwd(Dp)

Intensional Semantics: Example T S Project on D S join T on B = C qrank(Dp) = Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) – (0.8 * 0.5 * 0.6) = 0.48 + 0.3 – 0.24 = 0.54

Intensional Semantics • Does not depend on the choice of plan • Impractical to use it: • The event expressions can become very large due to projections • For each tuple t, one has to compute Pr(e) for its event e, which is #P-complete problem • Thus very expensive

Extensional Semantics • Play with probabilities instead of event expressions • Much more efficient • Assume tuple independence • Not always correct. WHY?

v p v 1-(1-p1)(1-p2)… v1 v2 p1 p2 v p1 v1 p1 v2 p2 v p2 v p Extensional Semantics   x

Extensional Query Evaluation:Example Plan : πD(S joinB=C T) T S S join T on B = C Project on D Wrong?? Because the two tuples in the join are no longer independent!!

Extensional: Alternate Query Plan Plan : πD(πB(S) joinB=C T) T S Join with T on B=C Project S on B CORRECT!!

Observation The answer depends on query plan

Notations • R is a relation name. • D = instance of a database schema • Γ = set of functional dependencies • E = set of all complex events • q = query • PRels(q) = the probabilistic relation names in q • Attr(q) = all attributes in all relations in q • Head(q) = the set of attributes that are in output of the query q

Safe Plan • A plan is safe if it produces the correct result • Formally, given a schema Rp, Γp, a plan P for a query q is safe if Pe(Dp) = qrank(Dp) for all instances Dp of that schema

Theorem (3) • Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q’ be the conjunctive queries that do not share any relation name. Then • σ is always safe • x is always safe in q x q’ • Π is safe iff A1,…Ak, R.E → Head (q)

Example • Same example, Γp is : • S.A, S.B → S.E • T.C, T.D → T.E • S.E → S.A, S.B • T.E → T.C, T.D • Query :- S join T on B = C, project on D • Plan : πD(S joinB=C T) • Join is safe. We need to check the safeness of project. From theorem 3, we need to check • A1,…Ak, R.E → Head (q) • T.D, S.E → S.A, S.B, T.C, T.D (pass) • T.D, T.E → S.A, S.B, T.C, T.D (fails, why?) • Where • A1,…Ak is T.D • R.E is S.E and T.E • Head (q) is S.A, S.B, T.C, T.D

Example: Alternative Plan • Query :- S join T on B = C, project on D • Plan : πD(πB(S) joinB=C T) • Project on B is safe. We need to check the safeness of project on D. From theorem 3, we need to check • A1,…Ak, R.E → Head (q) • T.D, S.E → S.B, T.C, T.D • T.D, T.E → S.B, T.C, T.D • Where • A1,…Ak is T.D • R.E is S.E and T.E • Head (q) is S.B, T.C, T.D Plan is safe!!

Separation • Let q be a conjunctive query. Two relations R1, R2 are called connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q). The relations R1, R2 are called separate if they are not connected. • Two sets of relations Y1 and Y2 are said to form a separation for query q iff • They partition the set Rels(q) • For any pair of R1 and R2 s.t. R1 belongs to Y1 and R2 belongs to Y2, they are separate • Intuitively, • The query does not contains a join condition • If the query has join condition, output of query does contains both R1.A and R2.B

Separation: Example • Query :- S(A,B), T(C,D), B = C • qBC = (S joinB=C T) • Head(qBC) = {B,C,D} S join T on B = C Both B and C are present in head(qBC). Thus S and T are separate for this query

Finding Safe Plan • Authors proposed SAFE-PLAN algorithm to find safe plans for a query • Try to postpone all safe projections in the query plan • When no more safe projections possible, it tries to perform a join, by splitting q into q1 join q2 • Since we perform join in the last, all attributes of join condition must be in Head(q), thus making sure that relations involved in join are separate. • If a safe plan exist, the algorithm finds it

Finding Safe Plan: Example Processing :- SAFE-PLAN(πD(S joinB=C T)) Is πHead(q)(qA) is a safe operator? Conditions: T.D, S.E → S.A, T.D (safe) T.D, T.E → S.A, T.D (unsafe) Head(qA) = {A, D} qA = πD(S joinB=C T)) Z = {A} Head(q) = {D}

Finding Safe Plan: Example Processing :- SAFE-PLAN(πD(S joinB=C T)) Is πHead(q)(qB) is a safe operator? Conditions: T.D, S.E → S.B, T.D (safe) T.D, T.E → S.B, T.D (safe) Return πD(SAFE-PLAN(qB)) Head(qB) = {B, D} qB = πD(S joinB=C T)) Z = {B} Head(q) = {D}

Finding Safe Plan: Example Processing :- πD(SAFE-PLAN(qB)) • Is πHead(q)(qAB) is a safe operator? • Conditions: • T.D, S.E → S.A, S.B, T.D (safe) • T.D, T.E → S.A, S.B, T.D (unsafe) Head(qAB) = {A, B, D} qAB = πD(S joinB=C T)) Z = {A} Head(qB) = {B, D}

Finding Safe Plan: Example Processing :- πD(SAFE-PLAN(qB)) • Is πHead(q)(qBC) is a safe operator? • Conditions: • T.D, S.E → T.C, S.B, T.D (safe) • T.D, T.E → T.C, S.B, T.D (safe) • Return • πBD(SAFE-PLAN(qBC)) Head(qBC) = {B, C, D} qBC = πD(S joinB=C T)) Z = {C} Head(qB) = {B, D}

Finding Safe Plan: Example Processing :- πD(πBD(SAFE-PLAN(qBC)) • Is πHead(q)(qABC) is a safe operator? • Conditions: • T.D, S.E → S.A,T.C, S.B, T.D (safe) • T.D, T.E → S.A,T.C, S.B, T.D (unsafe) Head(qABC) = {A, B, C, D} qABC = πD(S joinB=C T)) Z = {A} Head(qBC) = {B, C, D}

Finding Safe Plan: Example Processing :- πD(πBD(SAFE-PLAN(qBC)) • Split qBC into q1 joinB=C q2, s.t. • q1(B) :- S(A,B) • q2(C,D) :- T(C,D) • We know that S and T are separate on query qBC!! • Return • SAFE-PLAN(q1) joinB=C • SAFE-PLAN(q2)) No projection possible!! qBC = πD(S joinB=C T)) Head(qBC) = {B, C, D}

Finding Safe Plan: Example πD(πBD(SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2))) • Is πHead(q1)(qA) is a safe operator? • Conditions: • S.B, S.E → S.A, S.B (safe) • Return • πB(SAFE-PLAN(S(A,B))) • i.e. πB(S(A,B)) Head(qA) = {A, B} qA = S(A,B) Z = {A} Head(q1) = {B}

Finding Safe Plan: Example • SAFE-PLAN(q2) = T(C,D) • Thus, final result : πD(πBD(πB(S) joinB=C T)) • πBD is redundant. Can be optimized. • SAFE-PLAN algorithm is sound and complete • How can we optimize our query plan? • Traditional equivalences do not work in extensional semantics. • Need to define extensional semantics equivalences

Query Optimization • Select behaves exactly like traditional select operator • Extensional joins are commutative R join S  S join R • Extensional joins are associative R join (S join T)  (R join S) join T • Cascading Projections πA(πAUB(R))  πA(R) • Pushing Projection below a join πA(R join S) => (πA(R)) join (πA(S)) • Lifting Projections Up a Join: only when it satisfies the project condition in theorem 3 (πA(R)) join S => πAUAttrs(S)(R join S) • Theorem (10) : Let Z1 and Z2 be two safe plans for a query q. Then Z1  Z2

Complexity Fundamentals • PTIME : solvable in polynomial time • NP complete : Is? Checks satisfiability. • #P complete : How many?

Complexity Analysis • The data complexity of a query q is the complexity of evaluating qrank(Dp) as a function of size of Dp • If q has a safe plan, then its data complexity is in PTIME • All extensional operators are in PTIME • If q does not has a safe plan, then its data complexity is in #P-complete. i.e. if SAFE-PLAN algorithm fails to return a plan

Unsafe Plans • What if there is no safe plan? The author proposes two solutions • Least Unsafe Plans • Monte-Carlo Approximations

Least Unsafe Plans • Minimize the error in computing the probabilities • Modify SAFE-PLAN algorithm • When splitting a query q in two sub-queries q1 and q2, allow joins b/w q1 and q2 on attributes not in Head(q), then project out these attributes • These projections will be unsafe. Minimize their degree of unsafety • Pick q1, q2 to be a minimum cut of graph (rather than separation) • Problem of finding minimum cut is in PTIME

Monte-Carlo Approximations • Let q’ be the query obtained from q by making it return all the variables in its body. • Evaluate q’ instead of q without any probability calculations • Group the tuples based on the values of attributes in Head(q) • Complex event expression of a group will be in CNF. i.e. Vni=1Ci where each Ci is in DNF. i.e. e1 ˄ e2 ˄ … • Back to same problem!! • Complexity of evaluating the probability of a boolean expression is in #P-complete

Monte-Carlo Approximations • Given a DNF formula with N clauses and any ε and δ, the probability can be approximated in time O(N/ε2 ln (1/δ)) • Probability of the error being greater than ε is less than δ. • If N is small, an exact algorithm may be applied in place of simulation

Extensions • Till now: • All the events in probabilistic relations are distinct • Dealt with select, project, join queries. • The authors have extended their solutions to non-distinct relations and additional operators

Handling Repeated Events • Multiple tuple can share a common event • 4 easy steps to handle them: • Normalize the schema – represents the same data in normalized form, s.t. no probabilistic table has repeated events TP :- T1 and TP2 • Translate original query into new schema • Find a safe plan • Translate back to original schema

Handling Repeated Events:Example • Consider two prob. Relations: R(A,B) and S(C,D) s.t. R has all distinct events while S has a distinct event for each value of D • Query q(x) :- R(x,y), S(y,z) • Step1: create a new schema. Decompose S into two relations: S1(C, D, EID) and S2(EID) • q’(x) :- R(x,y), S1(y,z, eid), S2(eid) • Using SAFE-PLAN, we get the following plan • P’ = πA(R joinB=C (πB,EID(S1) joinEID S2)) • Substitute back S1 and S2 accordingly

Additional Operators • Union, Difference and Groupby operators • Covers almost all queries with nested sub-queries, aggregates, group-by and existensial/universal quantifiers

Uncertain Predicates • q≈ predicate on a deterministic database • Syntactic closeness: String Matching. e.g. certain ~ uncertain • Edit distances, q-grams etc. • Semantic closeness: e.g. musical ~ opera • TF/IDF, ontologies from Wordnet • Numeric closeness: e.g. 25 ~ 26 • similar numeric values • Once distances are defined, they need to be meaningfully converted into probabilities • gaussian, student-T, normal-gamma • parameters can be learned (ideal case) or can be specified by user

Efficient Query Evaluation on Probabilistic Databases