
10 Years of Probabilistic Querying – What Next?



Presentation Transcript


  1. 10 Years of Probabilistic Querying – What Next? Martin Theobald, University of Antwerp. Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek

  2. “The important thing is not to stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.” – Albert Einstein, 1936. Cited in “The Marvelous Structure of Reality”, Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego

  3. Look, There is Structure! The important thing is not to stop questioning.

  4. Look, There is Structure! Text is not just “unstructured data” • Plethora of natural-language-processing techniques & tools: • Part-Of-Speech (POS) Tagging • Named-Entity Recognition & Disambiguation (NERD) • Dependency Parsing • Semantic Role Labeling

  5. Look, There is Structure! Text is not just “unstructured data” • Plethora of natural-language-processing techniques & tools: • Part-Of-Speech (POS) Tagging • Named-Entity Recognition & Disambiguation (NERD) • Dependency Parsing • Semantic Role Labeling • But: • Even the best NLP tools frequently yield errors • Facts found on the Web are logically inconsistent • Web-extracted knowledge bases are inherently incomplete

  6. Information Extraction. YAGO/DBpedia et al.: bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory) – >120 M facts for YAGO2 (mostly from Wikipedia infoboxes). New fact candidates: type(Jeff, Author)[0.9], author(Jeff, Drag_Book)[0.8], author(Jeff, Cind_Book)[0.6], worksAt(Jeff, Bell_Labs)[0.7], type(Jeff, CEO)[0.4] – 100’s M additional facts from Wikipedia free-text

  7. YAGO Knowledge Base: 3 M entities, 120 M facts, 100 relations, 200k classes, entity accuracy ≈ 95%. [Diagram: excerpt of the YAGO graph with subclass edges (Organization, Person, Location; Country, Scientist, Politician; State, Biologist, Physicist, City) and instanceOf/fact edges around Max_Planck: bornOn Apr 23, 1858; diedOn Oct 4, 1947; hasWon Nobel Prize; fatherOf Erwin_Planck (diedOn Oct 23, 1944); bornIn Kiel, locatedIn Schleswig-Holstein, locatedIn Germany; plus “means” edges mapping surface strings such as “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel” to entities.] http://www.mpi-inf.mpg.de/yago-naga/

  8. Linked Open Data. As of Sept. 2011: >200 linked-data sources, >30 billion RDF triples, >400 million owl:sameAs links. http://linkeddata.org/

  9. Maybe Even More Importantly: Linked Vocabularies! • LinkedData.org: instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more… • Schema.org: common vocabulary released by Google, Yahoo!, and BING to annotate Web pages, incl. links to DBpedia • Micro-formats: RDFa (W3C), e.g.:
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" version="XHTML+RDFa 1.0" xml:lang="en">
      <head><title>Martin's Home Page</title>
        <base href="http://adrem.ua.ac.be/~tmartin/"/>
        <meta property="dc:creator" content="Martin"/>
      </head>
  Source: http://en.wikipedia.org/wiki/Linked_data

  10. As of Sept. 2011: >5 million owl:sameAs links between DBpedia/YAGO/Freebase.

  11. Application I: Enrichment of Search Results. “Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane

  12. Application II: Machine Reading. It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill.” [Diagram: relations extracted from the text and overlaid on it, e.g. same, owns, uncleOf, hires, enemyOf, affairWith, headOf.] Etzioni, Banko, Cafarella: Machine Reading. AAAI’06; Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI’10

  13. Application III: Natural-Language Question Answering. evi.com (formerly trueknowledge.com)

  14. Application III: Natural-Language Question Answering. wolframalpha.com: >10 trillion(!) facts, >50,000 search algorithms, >5,000 visualizations

  15. IBM Watson: Deep Question Answering. Example clues: • William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel • This town is known as "Sin City" & its downtown is "Glitter Gulch" • As of 2010, this is the only former Yugoslav republic in the EU • 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain. Question classification & decomposition; knowledge back-ends. D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010. www.ibm.com/innovation/us/watson/index.htm

  16. Natural-Language QA over Linked Data. Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13. http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
  <question id="4" answertype="resource" aggregation="false" onlydbo="true">
    <string lang="en">Which river does the Brooklyn Bridge cross?</string>
    <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
    <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
    <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
    <string lang="fr">Quel cours d'eau est traversé par le pont de Brooklyn?</string>
    <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
    <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
    <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
    <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
    <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
    <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
    <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
    <query>
      PREFIX dbo: <http://dbpedia.org/ontology/>
      PREFIX res: <http://dbpedia.org/resource/>
      SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
    </query>
  </question>
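To make the gold-standard query concrete: the sketch below runs it against the public DBpedia SPARQL endpoint. It assumes the third-party SPARQLWrapper Python library and the endpoint URL https://dbpedia.org/sparql, neither of which is part of the benchmark itself, and the returned bindings may vary as DBpedia evolves.

```python
# Minimal sketch: execute the QALD-3 query over DBpedia (assumes SPARQLWrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")  # assumed public endpoint
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["uri"]["value"])  # expected to include the East River resource
```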

  17. Natural-Language QA over Linked Data. INEX Linked Data Track, CLEF 2012-13. https://inex.mmci.uni-saarland.de/tracks/lod/
  <topic id="2012374" category="Politics">
    <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
    <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
    <sparql_ft>
      SELECT ?s ?s1 WHERE {
        ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
        ?s1 <http://dbpedia.org/property/successor> ?s .
        FILTER FTContains (?s, "stepped down early") .
      }
    </sparql_ft>
  </topic>

  18. Outline
  • Probabilistic Databases
    • Stanford’s Trio System: Data, Uncertainty & Lineage
    • Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp)
  • Probabilistic & Temporal Databases
    • Sequenced vs. Non-Sequenced Semantics
    • Interval Alignment & Probabilistic Inference
  • Probabilistic Programming
    • Statistical Relational Learning
    • Learning “Interesting” Deduction Rules
  • Summary & Challenges

  19. Probabilistic Databases: A Panacea for All of the Aforementioned Tasks? Probabilistic databases combine first-order logic and probability theory in an elegant way: • Declarative: queries formulated in SQL/Relational Algebra/Datalog; support for updates, transactions, etc. • Deductive: well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization • Scalable (?): polynomial data complexity (SQL), but #P-complete for the probabilistic inference

  20. Probabilistic Database. A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.
  • Query semantics (“marginal probabilities”): run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di in which t exists.
  • Special cases: (I) tuple-independent PDBs and (II) block-independent PDBs. Note: (I) and (II) are not equivalent!
  [Figure: a two-tuple example whose four possible worlds have probabilities 0.42, 0.18, 0.28, and 0.12.]
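As a minimal illustration of this possible-worlds semantics, the sketch below enumerates the worlds of a tiny tuple-independent database. The two tuple probabilities (0.6 and 0.7) are an assumption chosen so that the four world probabilities match the values shown on the slide (0.42, 0.18, 0.28, 0.12).

```python
from itertools import product

# Hypothetical tuple-independent PDB: each tuple exists independently.
tuples = {"t1": 0.6, "t2": 0.7}

# Enumerate all deterministic instances D_i with their probabilities.
for world in product([True, False], repeat=len(tuples)):
    p = 1.0
    for present, prob in zip(world, tuples.values()):
        p *= prob if present else 1.0 - prob
    instance = {name for name, inc in zip(tuples, world) if inc}
    print(sorted(instance), round(p, 2))
# ['t1', 't2'] 0.42   ['t1'] 0.18   ['t2'] 0.28   [] 0.12
# The marginal of an answer tuple t is the sum of P(D_i) over all
# instances D_i in which the query returns t.
```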

  21. Stanford Trio System [Widom: CIDR 2005] • Alternatives • ‘?’ (Maybe) annotations • Confidence values • Lineage → Uncertainty-Lineage Databases (ULDBs)

  22. Trio’s Data Model. 1. Alternatives: uncertainty about a value. → Three possible instances

  23. Trio’s Data Model. 1. Alternatives. 2. ‘?’ (Maybe): uncertainty about presence. → Six possible instances

  24. Trio’s Data Model. 1. Alternatives. 2. ‘?’ (Maybe) annotations. 3. Confidences: weighted uncertainty. → Still six possible instances, each with a probability

  25. So Far: The Model is Not Closed. Suspects = πperson(Saw ⋈ Drives): the result relation does not correctly capture the possible instances in the result – alternatives, ‘?’ annotations, and confidences alone CANNOT represent it.

  26. Example with Lineage. Suspects = πperson(Saw ⋈ Drives), with lineage pointing back to the input tuples/alternatives: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23

  27. Example with Lineage. Suspects = πperson(Saw ⋈ Drives). With lineage, the representation correctly captures the possible instances in the result: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23

  28. Operational Semantics. A query Q over a representation Dp of the possible instances D1, D2, …, Dn can be implemented directly, yielding a representation Dp′; applying Q to each instance D1, D2, …, Dn individually yields the result instances D1′, D2′, …, Dm′. Closure: the “up arrow” always exists, i.e., Dp′ again represents exactly the set of possible result instances, so a direct implementation is possible. Completeness: any (finite) set of possible instances can be represented. But: data complexity is #P-complete!

  29. Summary on Trio’s Data Model. • Alternatives • ‘?’ (Maybe) annotations • Confidence values • Lineage → Uncertainty-Lineage Databases (ULDBs). Theorem: ULDBs are closed and complete. Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB-J. 2008]

  30. Basic Complexity Issue [Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]. Theorem [Valiant 1979]: For a Boolean expression E, computing Pr(E) is #P-complete. NP = class of problems of the form “is there a witness?” → SAT. #P = class of problems of the form “how many witnesses?” → #SAT. The decision problem for 2CNF is in PTIME; the counting problem for 2CNF is already #P-complete. (We will come back to this later…)
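The gap between deciding and counting is easy to see in code: the generic way to count witnesses is to enumerate all 2^n assignments. The brute-force counter below is a sketch with a hypothetical clause encoding (variable index plus polarity), fine for toy formulas and hopeless in general, which is exactly what #P-completeness predicts.

```python
from itertools import product

def count_models(clauses, n):
    """#SAT by exhaustive enumeration over all 2^n assignments."""
    count = 0
    for assignment in product([False, True], repeat=n):
        # a clause is satisfied if at least one of its literals is true
        if all(any(assignment[var] == pol for var, pol in clause)
               for clause in clauses):
            count += 1
    return count

# 2CNF example (x0 OR x1) AND (NOT x0 OR x2): satisfiability is decidable
# in PTIME, yet counting models of 2CNF formulas is already #P-complete.
print(count_models([[(0, True), (1, True)], [(0, False), (2, True)]], 3))  # 4
```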

  31. …back to Information Extraction: bornIn(Barack, Honolulu) vs. bornIn(Barack, Kenya) – two contradictory extracted facts.

  32. Uncertain RDF (URDF): Facts & Rules. • Extensional knowledge (the “facts”): high-confidence facts from an existing knowledge base (“ground truth”); new fact candidates, i.e., extracted fact candidates with confidences; Linked Data & integration of various knowledge sources via ontology merging or explicitly linked facts (owl:sameAs, owl:equivProp.) → a large “probabilistic database” of RDF facts. • Intensional knowledge (the “rules”): soft rules with deductive grounding & lineage (Datalog/SLD resolution); hard rules as consistency constraints (more general FOL rules). • Propositional & probabilistic inference → at query time!

  33. Soft Rules vs. Hard Rules. (Soft) deduction rules vs. (hard) consistency constraints:
  • People may live in more than one place: marriedTo(x,z) ∧ livesIn(z,y) ⇒ livesIn(x,y) [0.8]; hasChild(x,z) ∧ livesIn(z,y) ⇒ livesIn(x,y) [0.5]
  • People are not born in different places/on different dates: bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z; bornOn(x,y) ∧ bornOn(x,z) ⇒ y=z
  • People are not married to more than one person (at the same time, in most countries?): marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)

  34. Soft Rules vs. Hard Rules. The (soft) deduction rules above correspond to a deductive database: Datalog, the core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc. The (hard) consistency constraints correspond to more general FOL constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc.

  35. URDF Running Example. KB: RDF base facts plus rules.
  Rules: hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]; graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z.
  [Diagram: Jeff, Surajit, and David of type Computer_Scientist [1.0]; Stanford and Princeton of type University [1.0]; hasAdvisor(Surajit, Jeff) [0.8]; hasAdvisor(David, Jeff) [0.7]; worksAt(Jeff, Stanford) [0.9]; graduatedFrom(Surajit, Princeton) [0.7]; graduatedFrom(Surajit, Stanford) [0.6]; graduatedFrom(David, Princeton) [0.9]; further graduatedFrom edges marked [?].]
  Derived facts (see the sketch below): gradFr(Surajit, Stanford); gradFr(David, Stanford)
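The deductive part of the example can be sketched in a few lines: the soft rule is grounded by joining hasAdvisor with worksAt on the shared advisor variable. Fact names come from the running example; the code only derives the fact candidates and does not yet attach probabilities.

```python
# Grounding hasAdvisor(x,y) AND worksAt(y,z) => graduatedFrom(x,z) by a join on y.
has_advisor = [("Surajit", "Jeff"), ("David", "Jeff")]
works_at = [("Jeff", "Stanford")]

derived = [("graduatedFrom", x, z)
           for (x, y) in has_advisor
           for (y2, z) in works_at
           if y == y2]
print(derived)
# [('graduatedFrom', 'Surajit', 'Stanford'), ('graduatedFrom', 'David', 'Stanford')]
```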

  36. Basic Types of Inference. • MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute: argmax_y P(y | x) (NP-complete for MaxSAT). • Marginal/success probabilities: probability that query y is true in a random world under a given evidence x. Compute: ∑_y P(y | x) (#P-complete already for conjunctive queries).

  37. General Route: Grounding & MaxSAT Solving. Query: graduatedFrom(x, y)
  1) Grounding: consider only the facts (and rules) which are relevant for answering the query.
  2) Propositional formula in CNF, consisting of the grounded soft & hard rules and the weighted base facts:
    1000: (¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
    1000: (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
    0.4: (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
    0.4: (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
    0.9: worksAt(Jeff, Stanford); 0.8: hasAdvisor(Surajit, Jeff); 0.7: hasAdvisor(David, Jeff); 0.7: graduatedFrom(Surajit, Princeton); 0.6: graduatedFrom(Surajit, Stanford); 0.9: graduatedFrom(David, Princeton)
  3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized → MAP inference: compute the “most likely” possible world.
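For a CNF this small, the MAP world can be found by brute force, which makes the weighted-MaxSAT view concrete. The sketch below encodes the ten weighted clauses above (the variable names are hypothetical abbreviations, e.g. gF_S_Stan for graduatedFrom(Surajit, Stanford)) and enumerates all assignments; real systems of course use dedicated MaxSAT solvers instead.

```python
from itertools import product

# (weight, [(variable, polarity), ...]); polarity False encodes a negated literal.
clauses = [
    (1000, [("gF_S_Stan", False), ("gF_S_Prin", False)]),  # hard mutex rules
    (1000, [("gF_D_Stan", False), ("gF_D_Prin", False)]),
    (0.4, [("hA_S_J", False), ("wA_J_Stan", False), ("gF_S_Stan", True)]),  # soft rules
    (0.4, [("hA_D_J", False), ("wA_J_Stan", False), ("gF_D_Stan", True)]),
    (0.9, [("wA_J_Stan", True)]), (0.8, [("hA_S_J", True)]),  # weighted base facts
    (0.7, [("hA_D_J", True)]), (0.7, [("gF_S_Prin", True)]),
    (0.6, [("gF_S_Stan", True)]), (0.9, [("gF_D_Prin", True)]),
]
variables = sorted({v for _, lits in clauses for v, _ in lits})

def weight(assign):
    """Total weight of the clauses satisfied by the assignment."""
    return sum(w for w, lits in clauses
               if any(assign[v] == pol for v, pol in lits))

best = max((dict(zip(variables, vals))
            for vals in product([False, True], repeat=len(variables))),
           key=weight)
print(sorted(v for v, val in best.items() if val))
# ['gF_D_Prin', 'gF_S_Stan', 'hA_D_J', 'hA_S_J', 'wA_J_Stan']
```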

  38. URDF: MaxSAT Solving with Soft & Hard Rules. Special case: Horn clauses as soft rules & mutex constraints as hard rules [Theobald, Sozio, Suchanek, Nakashole: VLDS’12]. Find argmax_y P(y|x) → resolves to a variant of MaxSAT for propositional formulas.
  C: weighted Horn clauses (CNF), as on the previous slide. S: mutex constraints, e.g. { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) } and { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }.
  MaxSAT algorithm:
    Compute W0 = ∑_{clauses C} w(C) · P(C is satisfied);
    For each hard constraint S_t {
      For each fact f in S_t {
        Compute W_f,t+ = ∑_{clauses C} w(C) · P(C is sat. | f = true);
      }
      Compute W_t- = ∑_{clauses C} w(C) · P(C is sat. | S_t = false);
      Choose the truth assignment to the facts f in S_t that maximizes W_f,t+ resp. W_t-;
      Remove satisfied clauses C; t++;
    }
  Runtime: O(|S|·|C|); approximation guarantee of 1/2.

  39. Experiment (I): MAP Inference. YAGO knowledge base: 2 M entities, 20 M facts. Query answering: deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …). Asymptotic runtime checks via synthetic (random) soft-rule expansions. URDF (grounding & MaxSAT solving) vs. Markov Logic (MAP inference & MC-SAT). |C| – # literals in grounded soft rules; |S| – # literals in grounded hard rules.

  40. Basic Types of Inference. • MAP inference ✔: find the most likely assignment to the query variables y under a given evidence x. Compute: argmax_y P(y | x) (NP-complete for MaxSAT). • Marginal/success probabilities: probability that query y is true in a random world under a given evidence x. Compute: ∑_y P(y | x) (#P-complete already for conjunctive queries).

  41. Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog) [Yahya, Theobald: RuleML’11; Dylla, Miliaraki, Theobald: ICDE’13]
  Rules: hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]; graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z.
  Base facts: graduatedFrom(Surajit, Princeton) [0.7] (fact A); graduatedFrom(Surajit, Stanford) [0.6] (fact B); graduatedFrom(David, Princeton) [0.9]; hasAdvisor(Surajit, Jeff) [0.8] (fact C); hasAdvisor(David, Jeff) [0.7]; worksAt(Jeff, Stanford) [0.9] (fact D); type(Princeton, University) [1.0]; type(Stanford, University) [1.0]; type(Jeff, Computer_Scientist) [1.0]; type(Surajit, Computer_Scientist) [1.0]; type(David, Computer_Scientist) [1.0].
  Query: graduatedFrom(Surajit, y) → answers Q1 = graduatedFrom(Surajit, Princeton) with lineage A, and Q2 = graduatedFrom(Surajit, Stanford) with lineage B ∨ (C ∧ D).
  Grounding the hard rule yields ¬A ∨ ¬(B ∨ (C ∧ D)).

  42. Lineage & Possible Worlds [Das Sarma, Theobald, Widom: ICDE’08; Dylla, Miliaraki, Theobald: ICDE’13]. Query: graduatedFrom(Surajit, y)
  1) Deductive grounding: build the dependency graph of the query and trace the lineage of the individual query answers.
  2) Lineage DAG (not in CNF), consisting of grounded soft & hard rules over the probabilistic base facts: Φ(Q1) = A; Φ(Q2) = B ∨ (C ∧ D), where P(C ∧ D) = 0.8 × 0.9 = 0.72 and P(B ∨ (C ∧ D)) = 1 − (1 − 0.6) × (1 − 0.72) = 0.888.
  3) Probabilistic inference → compute marginals. P(Q): sum up the probabilities of all possible worlds that entail the query answer’s lineage; e.g., the worlds with Q1 but not Q2 carry 0.7 × (1 − 0.888) = 0.078, and those with Q2 but not Q1 carry (1 − 0.7) × 0.888 = 0.266. P(Q|H): additionally drop the “impossible worlds”.
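Since the base facts are independent, the numbers on this slide follow directly from the lineage formulas. A minimal sketch, using the fact probabilities A = 0.7, B = 0.6, C = 0.8, D = 0.9 from above:

```python
# Lineage evaluation over independent base facts (probabilities from the slide).
pA, pB, pC, pD = 0.7, 0.6, 0.8, 0.9

p_CD = pC * pD                     # P(C AND D)                   = 0.72
p_Q2 = 1 - (1 - pB) * (1 - p_CD)   # P(B OR (C AND D))            = 0.888
p_Q1_not_Q2 = pA * (1 - p_Q2)      # P(A AND NOT(B OR (C AND D))) = 0.0784
p_Q2_not_Q1 = (1 - pA) * p_Q2      # P(NOT A AND (B OR (C AND D)))= 0.2664
print(round(p_CD, 4), round(p_Q2, 4), round(p_Q1_not_Q2, 4), round(p_Q2_not_Q1, 4))
```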

  43. Possible Worlds Semantics. Hard rule H: ¬A ∨ ¬(B ∨ (C ∧ D)). Unconditioned marginals: P(Q1) = 0.0784, P(Q2) = 0.2664. Conditioning on the worlds that satisfy H: P(Q1|H) = 0.0784 / 0.412 = 0.1903; P(Q2|H) = 0.2664 / 0.412 = 0.6466.

  44. Inference in Probabilistic Databases. • Safe query plans [Dalvi, Suciu: VLDB-J’07]: can propagate confidences along with relational operators. • Read-once functions [Sen, Deshpande, Getoor: PVLDB’10]: can factorize a Boolean formula (in polynomial time) into read-once form, where every variable occurs at most once. • Knowledge compilation [Olteanu et al.: ICDT’10, ICDT’11]: can decompose a Boolean formula into an ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula. • Top-k pruning [Ré, Dalvi, Suciu: ICDE’07; Karp, Luby, Madras: J-Alg.’89]: can return top-k answers based on lower and upper bounds, even without knowing their exact marginal probabilities; multi-simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.

  45. Monte Carlo Simulation (I) [Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.’89]. Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3.
  Naïve sampling:
    cnt = 0
    repeat N times:
      randomly choose X1, X2, X3 ∈ {0,1}
      if E(X1, X2, X3) = 1 then cnt = cnt + 1
    P = cnt/N
    return P /* estimate for true Pr(E) */
  Works for any E (not in PTIME), but N may be very big for small Pr(E).
  Zero/One-Estimator Theorem: if N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
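The naïve estimator translates directly into code. A minimal sketch for the example formula, with fair coin flips for the three variables (so the true Pr(E) is 4/8 = 0.5):

```python
import random

def naive_mc(E, n_vars, N):
    """Zero/one estimator: sample uniform assignments, average the indicator."""
    cnt = sum(E([random.random() < 0.5 for _ in range(n_vars)]) for _ in range(N))
    return cnt / N

# E = X1X2 OR X1X3 OR X2X3 (the "majority of three" formula)
E = lambda x: (x[0] and x[1]) or (x[0] and x[2]) or (x[1] and x[2])
print(naive_mc(E, 3, 100_000))  # approximately 0.5, up to sampling noise
```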

  46. Monte Carlo Simulation (II) [Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.’89]. Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm.
  Importance sampling:
    cnt = 0; S = Pr(C1) + … + Pr(Cm)
    repeat N times:
      randomly choose i ∈ {1,2,…,m} with probability Pr(Ci)/S
      randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
      if C1 = 0 and C2 = 0 and … and Ci−1 = 0 then cnt = cnt + 1
    P = (cnt/N) × S
    return P /* estimate for true Pr(E) */
  This is better: only for E in DNF, but in PTIME. Theorem: if N ≥ m × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
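The Karp-Luby estimator can be sketched just as compactly. The version below assumes uniform Boolean variables, so Pr(Ci) = 2^-|Ci|; a sample counts only if the chosen clause is the first satisfied one, which avoids double-counting assignments covered by several clauses, and the count is scaled by S at the end.

```python
import random

def karp_luby(clauses, n_vars, N):
    """DNF probability via importance sampling (uniform Boolean variables).
    clauses: list of clauses, each a list of (variable, polarity) literals."""
    p = [0.5 ** len(c) for c in clauses]  # Pr(Ci) under fair coin flips
    S = sum(p)
    cnt = 0
    for _ in range(N):
        # pick a clause proportionally to its probability
        i = random.choices(range(len(clauses)), weights=p)[0]
        x = [random.random() < 0.5 for _ in range(n_vars)]
        for var, pol in clauses[i]:       # force the sample to satisfy Ci
            x[var] = pol
        # count only if no earlier clause is satisfied as well
        if not any(all(x[var] == pol for var, pol in clauses[j])
                   for j in range(i)):
            cnt += 1
    return S * cnt / N                    # unbiased estimate of Pr(E)

# E = X1X2 OR X1X3 OR X2X3, true Pr(E) = 0.5
clauses = [[(0, True), (1, True)], [(0, True), (2, True)], [(1, True), (2, True)]]
print(karp_luby(clauses, 3, 100_000))     # approximately 0.5
```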

  47. Top-k Ranking by Marginal Probabilities [Dylla, Miliaraki, Theobald: ICDE’13]. Query: graduatedFrom(Surajit, y). Datalog/SLD resolution: top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before the rules are fully grounded; subgoals may represent entire sets of answer candidates. First-order lineage formulas: Φ(Q1) = A; Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y). → Prune entire sets of answer candidates represented by Φ.

  48. Bounds for First-Order Formulas [Dylla, Miliaraki, Theobald: ICDE’13]. Theorem 1: Given a (partially grounded) first-order lineage formula Φ, e.g. Φ(Q2) = B ∨ ∃y gradFrom(S,y):
  • Lower bound P_low (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with false (or true if negated). P_low(Q2) = P(B ∨ false) = P(B) = 0.6.
  • Upper bound P_up (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with true (or false if negated). P_up(Q2) = P(B ∨ true) = P(true) = 1.0.
  Proof (sketch): substituting a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substituting with true increases them.
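Numerically, the two substitutions are one-liners; the sketch below also shows the refined bound obtained once the subgoal has been grounded to C ∧ D, reproducing the 0.888 used on the next slide.

```python
pB, pC, pD = 0.6, 0.8, 0.9     # base-fact probabilities from the example

def p_or(p, q):                # P(X OR Y) for independent X, Y
    return 1 - (1 - p) * (1 - q)

P_low = p_or(pB, 0.0)          # subgoal := false       -> P(B)    = 0.6
P_up = p_or(pB, 1.0)           # subgoal := true        -> P(true) = 1.0
P_refined = p_or(pB, pC * pD)  # subgoal ground to C AND D         = 0.888
print(P_low, P_refined, P_up)
```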

  49. Convergence of Bounds [Dylla, Miliaraki, Theobald: ICDE’13]. Theorem 2: Let Φ1, …, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into P_i,low and P_i,up creates a monotonic series of lower and upper bounds that converges to P(φ):
  0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1.
  Proof (sketch, via induction): substituting true by a subformula reduces the number of models that satisfy Φ; substituting false by a subformula increases this number.

  50. Top-k Pruning [Fagin et al.’01; Balke, Kießling’02; Dylla, Miliaraki, Theobald: ICDE’13] (“Fagin’s Algorithm”). • Maintain two disjoint queues: a top-k queue sorted by P_low and a candidates queue sorted by P_up. • Return the top-k queue at the t-th grounding step as soon as min{ P_t,low(Qk) | Qk ∈ Top-k } > max{ P_t,up(Qj) | Qj ∈ Candidates }, dropping such Qj from the candidates queue. [Figure: per-answer bound intervals [P_t,low(Qj), P_t,up(Qj)] narrowing monotonically over the number of SLD steps t; a candidate is pruned once its upper bound falls below the k-th lower bound.]
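The stopping test itself is tiny. A minimal sketch with hypothetical bound values: rank the answers by their current lower bounds and stop as soon as every remaining candidate's upper bound lies below the k-th lower bound.

```python
def topk_done(bounds, k):
    """bounds: answer -> (lower, upper). True once the top-k answers are final."""
    ranked = sorted(bounds.items(), key=lambda kv: kv[1][0], reverse=True)
    kth_lower = ranked[k - 1][1][0]           # k-th best lower bound
    return all(up < kth_lower for _, (lo, up) in ranked[k:])

# Hypothetical bounds after some SLD grounding steps:
bounds = {"Q1": (0.60, 0.65), "Q2": (0.40, 0.55), "Q3": (0.10, 0.35)}
print(topk_done(bounds, k=1))  # True: 0.55 < 0.60 and 0.35 < 0.60
```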
