Queries with Difference on Probabilistic Databases
E N D
Presentation Transcript
Queries with Difference on Probabilistic Databases SanjeevKhanna Sudeepa Roy Val Tannen University of Pennsylvania
Probabilistic Databases • To model and query uncertain data (sensor networks, information extraction…) • Possible worlds model • Each possible world W is a standard database instance, has a probability P[W] • Compact representation Dassuming independence S T R D
Query Semantics • Query Semantics on probabilistic databases: • Apply the query q on each possible world W • Add up the probabilities of the worlds that give the same query answer A P[q(D) = A] = ∑W : q(W) = AP[W] • Goal: Efficiently evaluate P[q(D) = A] • Data complexity; want time polynomial in n = |D| • Can we always efficiently compute P[q(D)]? • NO, in general it is #P-hard
Query Answering in Two Steps easy hard Event variables D S R T Probability Boolean query q():-R(x),S(x, y),T(y) Introduce event variables for tuples (P[w1] = 0.3, …) Step 1: Boolean provenance for q(D) [FR ’97, Z ’97] f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3 Step 2:ComputeP[q(D)] = P[f] given P[w1] = 0.3, P[v1] = 0.4, …
Probability Computation for Positive Queries • Dichotomy Result [DS ’04, ’07; DSS ’10] Givenq as input, we can efficiently decide if qis • Safe: Safe plans run in poly-time on all instances, or, • Unsafe: #P-hard, e.g. q() :- R(x) S(x, y) T(y) • Instance-by-instance approach [SDG ’10, RPT ’11] • Both q and D are given as input • Poly-time algorithm to compute P[q(D)] for special cases even if q is unsafe What about queries with difference?
Boolean Provenances for Difference S R T q1(x):- R(x, y), S(y, z) q2(x):- R(x, y), S(y, z), T(z) q = q1 – q2
Previous Work on Difference FOR ’11 • Framework for exact and approximate probability computation • But, no guarantee of polynomial running time In fact, we show in this paper that with difference, in some cases no approximation exists (unless NP = RP) How far can we go with difference in poly-time?
A Quick Comparison With difference • DNF of boolean provenance may be exponential in n • P[q(D)] may not be approximable Without difference • DNF of booleanprovenance is poly-size (n|q|) • P[q(D)] is always approximable (FPRAS) • FPRAS: Fully Polynomial Randomized Approx. Scheme • Compute with prob. ≥ ¾in time polynomial in n, 1/ε • p [(1-ε) P[q(D)], (1+ε) P[q(D)]
Our Results • We study queries of the form q1 – q2 and their generalization • FPRAS: If q1 is any UCQ, q2 is any safe CQ- • #P-hardness: Even if both q1 and q2 are safe CQ- • Inapproximability: Even if q1 is the trivial TRUE query and q2 is a UCQ • Our FPRAS result extends to a larger class of queries of which q1 – q2 is a special case [CQ-: Conjunctive queries without self-joins]
Difference Rank • Define difference rank (q) of query q recursively • (R) = 0 • (q1 - q2) = (q1) + (q2) + 1 • R – S : rank 1 • (q1 ⋈ q2) = (q1) + (q2) • (R – S1) ⋈ (R - S2) : rank 2 • (R - T1) ⋈ T2: rank 1 • (q1 q2) = max ((q1), (q2)) • (R – S1) ⋈ (R - S2) (R - T1) ⋈ T2 : rank 2 • Select, project: rank remains the same
FPRAS for queries q with (q) = 1given some conditions hold(inapproximable for (q) = 1 in general)
Steps in FPRAS • Step 1: Compute boolean provenance of q[D] for any query q with (q) = 1 • Step 2: Write the boolean provenance in a “Probability Friendly Form” (if possible) • Step 3: FPRAS inspired by Karp-Luby framework
Boolean Provenance for Queries q s.t. (q) = 1 Lemma: For any q with (q) = 1, on any D, the provenance f of q(D) has form f is poly-size in n = |D|, poly-time computable
Probability Friendly Form (PFF) f is in PFF, if the negated DNF-s can be written in poly-size d-DNNFs (next slide) If f is in PFF, we can approximate P[f] using Karp-Luby Framework
d-DNNF Darwiche ’01, ’02, DM ’02 deterministic - DecomposableNegation Normal Form At most one child of a +-node is satisfiable Children of a .-node do not share variables No internal node can have negation In general, can be a DAG + Probability can be computed in linear time +
Karp-Luby Framework [KL ’83] Given boolean expression DAGs F1, …, Fm f = F1 + F2 + ... + Fm P[f] can be computed in poly-time (in m, n) if in poly-time, i (1) P[Fi] can be computed (2) it can be checked if a given assignment satisfies Fi (3) a random satisfying assignment of Fican be sampled Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms: f = xyz + xyw + wuv
Conditions (1) and (2) hold for PFF Product of minterm and d-DNNF is another d-DNNF + + + + w2=1, z1=1
Condition (3) also holds Lemma: Generating a random satisfying assignment on a d-DNNF can be done in poly-time At random • Process in reverse topological order • Generate a random satisfying assignment bottom up v1 = 1, v2 = 0 + v1 = 0, v2 = 0 v2 = 0 v1 = 1 + v1 = 0 v2 = 1 v2 = 0
Expressibility in PFF So, if f is in PFF, we can approximate P[q(D)] But, can we decide in poly-time if some sub-expressions of a boolean expression have poly-size d-DNNFs? • Not known • But, there are natural sufficient conditionsthat can be verified in poly-time • If certain sub-queries are safe and hence generate read-once expressions [OH ’08] • If sub-queries generate poly-size OBDDs [JS ’11] • Extends to instance-by-instance approach (both q, D given)
#P-hardness: Steps in the proof “Hard” query q = q1 – q2 • q1() := R1(x, y1) R2(x, y2) R3(x, y3) R4(x, y4) • q2() := R1(x1, y) R2(x2, y) R3(x3, y) R4(x4, y) Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings Counting independent sets in 3-regular bipartite graphs (XZ ’06)
Other Related Work • Semantics of probabilistic query answering • Fuhr-Rollecke ’97, Zimanyi ‘97 • Dichotomy of CQ- ,CQ and UCQ queries • Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10 • Knowledge compilation techniques • Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11 • Instance-by-instance approach • Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11
Conclusions and Future work A step towards understanding complexity of exact and approximate computation for queries with difference operations Future work • Dichotomy results that classify syntactically difference queries (similar to positive UCQ)? • Extending FPRAS to queries with difference rank > 1? • Experimental evaluation of our algorithms
Thank you Questions?