
Probabilistic answers to relational queries (PARQ)


Presentation Transcript


  1. Probabilistic answers to relational queries (PARQ) Octavian Udrea Yu Deng Edward Hung V. S. Subrahmanian

  2. Content • Motivation and goals • Running example • Technical preliminaries • CPO model • CPO integration • CPO inference algorithms • Experimental results • Ongoing work


  4. Motivation • Query algebras do not take semantics into account when computing answers • Data is not always precise • Ambiguity, insufficient information • Goal: Use probabilistic ontologies to improve query answer recall and quality

  5. The probabilistic solution • Compute and return answers with high probability ( > pthr) • Keep probabilities hidden from the user • Problems • How do we assign a probability to each data item? • How do we choose pthr?
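
The thresholding idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `answers_above_threshold` and the 0.25 score are hypothetical, and the slides leave open how the scores and pthr are actually chosen.

```python
from typing import Dict, List


def answers_above_threshold(scored: Dict[str, float], p_thr: float) -> List[str]:
    """Return the answers whose probability exceeds p_thr, most probable first.

    The probabilities are used for filtering and ranking only; they are not
    part of the returned value, so they stay hidden from the user.
    """
    kept = [(p, answer) for answer, p in scored.items() if p > p_thr]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [answer for _, answer in kept]


# 0.75 is the figure quoted in the running example; 0.25 is made up.
candidates = {"board of directors meeting": 0.75, "advisory board meeting": 0.25}
print(answers_above_threshold(candidates, 0.5))
```

The comparison is strict, matching the "> pthr" wording on the slide.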

  6. Concepts • Constraint probabilistic ontologies • Is-a graph with edges labeled with probabilities • Including conditional probabilities • Disjoint decompositions • Ontologies associated with terms in a data source • Attributes in a relation/XML • Propositional entities in text sources


  8. Running example Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  9. Example: decompositions Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  10. Example: probability labels Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  11. Example: conditional probabilities Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  12. Running example: Sample queries • “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...” • What type of board meeting is being discussed? • Since Ed Masters is present, there is a 75% probability it is a board of directors meeting • What type of financial unit is referenced? • Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board.


  14. Technical preliminaries: POB • POB schema: • C is a finite set of classes • (C, =>) is a directed acyclic graph • me produces clusters (disjoint decompositions) for each node • me(OrganizationUnit) = {{Committee, Board, Team, Department}, {Legal, Executive, Financial, Marketing}} • The probability labeling maps each edge in (C, =>) to a positive rational number in [0,1]
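
One way to hold a POB schema in memory is sketched below. The class name `POBSchema`, its field names, the (subclass, superclass) edge direction and the 1/4 labels are all assumptions made for illustration; only the cluster contents come from the slide.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]  # (subclass, superclass); this direction is an assumption


@dataclass
class POBSchema:
    classes: Set[str]
    edges: Set[Edge]
    me: Dict[str, List[FrozenSet[str]]]  # node -> its clusters
    prob: Dict[Edge, Fraction]           # edge -> label in [0, 1]

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
        indegree = {c: 0 for c in self.classes}
        for _, parent in self.edges:
            indegree[parent] += 1
        ready = [c for c, n in indegree.items() if n == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for child, parent in self.edges:
                if child == node:
                    indegree[parent] -= 1
                    if indegree[parent] == 0:
                        ready.append(parent)
        return removed == len(self.classes)

    def labels_in_range(self) -> bool:
        """Every edge carries a label, and every label lies in [0, 1]."""
        return all(e in self.prob and 0 <= self.prob[e] <= 1 for e in self.edges)


unit = "OrganizationUnit"
clusters = [
    frozenset({"Committee", "Board", "Team", "Department"}),
    frozenset({"Legal", "Executive", "Financial", "Marketing"}),
]
members = [c for cluster in clusters for c in cluster]
schema = POBSchema(
    classes={unit, *members},
    edges={(c, unit) for c in members},
    me={unit: clusters},
    prob={(c, unit): Fraction(1, 4) for c in members},  # placeholder labels
)
print(schema.is_acyclic(), schema.labels_in_range())
```

Rational labels are stored as `Fraction` values so that later sums over a cluster stay exact.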

  15. Back to the example Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  16. Constraint probabilities • Simple constraints: • Only for entities NOT represented in the current ontology • Nil constraint: • Constraint probabilities: • Pair (p, γ), with p in [0,1] and γ a conjunction of simple constraints

  17. Labeling • Labeling should not be arbitrary • Invalid labeling may lead to time-consuming consistency algorithms • And to ambiguity in interpreting query answers • Valid labeling: • No constraint refers to the entities associated with this ontology • There is exactly one nil constraint probability on each edge


  19. The CPO model • CPO: (C, =>, me, φ) • C is a finite set of classes • (C, =>) is a directed acyclic graph • me produces clusters (disjoint decompositions) for each node • φ is a valid labeling for (C, =>) • Note there is no condition on the probabilities... yet!

  20. CPO enhanced data sources • Associate CPOs with some attributes of a relation. • Associate CPOs with elements in an XML data store. • Associate CPOs with some keywords for text files. • CPO_k • At most k probabilities on each edge • CPO_1 is a POB

  21. Answering queries • “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...” • What type of board meeting is being discussed? • Since Ed Masters is present, there is a 75% probability it is a board of directors meeting • Goal: Associate probabilities with possible answers.

  22. Probability path Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

  23. Probability path • if: • c => c1 => c2 => … => ck => d • f is a function defined on the chain • f selects one probability on each edge • is the set of constraints selected by f along with the probabilities

  24. CPO consistency • CPO (C, =>, me, φ) • An arbitrary universe of objects O • Interpretation ε is a mapping from C to 2^O • ε is a taxonomic model iff: • We assign objects to each class • Objects cannot be shared between classes in the same cluster • => edges imply subset relations on the sets of objects assigned to each class • If A => B is labeled with probability p, at least a fraction p of the objects in A are also assigned to B
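
The conditions listed on this slide can be checked mechanically for a given interpretation. The sketch below rests on several assumptions: each edge carries a single probability (the POB case, with no constraints), edges are stored as (child, parent), the label bounds the fraction of the parent's objects that also belong to the child, and "we assign objects to each class" is read as "no class is empty". The function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]  # (child, parent); an assumed direction


def is_taxonomic_model(
    eps: Dict[str, Set[int]],
    edges: Set[Edge],
    me: Dict[str, List[FrozenSet[str]]],
    prob: Dict[Edge, float],
) -> bool:
    # Every class receives at least one object (assumed reading of the slide).
    if any(not objs for objs in eps.values()):
        return False
    # Classes in the same cluster must not share objects.
    for clusters in me.values():
        for cluster in clusters:
            for a, b in combinations(sorted(cluster), 2):
                if eps[a] & eps[b]:
                    return False
    for child, parent in edges:
        # Edges imply subset relations.
        if not eps[child] <= eps[parent]:
            return False
        # At least a fraction p of the parent's objects fall in the child.
        if len(eps[child]) < prob[(child, parent)] * len(eps[parent]):
            return False
    return True


eps = {"Unit": {1, 2, 3, 4}, "Board": {1, 2}, "Team": {3}}
edges = {("Board", "Unit"), ("Team", "Unit")}
me = {"Unit": [frozenset({"Board", "Team"})]}
prob = {("Board", "Unit"): 0.5, ("Team", "Unit"): 0.25}
print(is_taxonomic_model(eps, edges, me, prob))
```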

  25. CPO consistency (cont’d) • A CPO is consistent iff it has a taxonomic probabilistic model • Deciding if a CPO is consistent is NP-complete • The weight formula satisfiability problem. • A non-deterministic algorithm for consistency checking is straightforward.

  26. Consistency approach • Identify a subclass of CPOs for which we can check consistency • Two parts: • Pseudoconsistency – this was done for POBs • Well-structuredness – particular to CPOs

  27. Pseudoconsistent CPO • CPO • No two classes in the same cluster have a common subclass • The graph is rooted • For every immediate distinct subclasses of c, they either: • Have no common subclass • Have a greatest common subclass different from them • No cycles • If c inherits from multiple clusters, all paths from descendants of c to the root go through c
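
Two of the conditions above are easy to test directly on the graph. The sketch below checks only those two (no common subclass within a cluster, and rootedness); it assumes edges are stored as (child, parent) pairs, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]  # (child, parent); an assumed direction


def descendants(cls: str, children: Dict[str, Set[str]]) -> Set[str]:
    """All strict subclasses of cls, found by an iterative depth-first walk."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def clusters_share_no_subclass(
    edges: Set[Edge], me: Dict[str, List[FrozenSet[str]]]
) -> bool:
    """No two classes in the same cluster have a common subclass."""
    children: Dict[str, Set[str]] = {}
    for child, parent in edges:
        children.setdefault(parent, set()).add(child)
    for clusters in me.values():
        for cluster in clusters:
            for a, b in combinations(sorted(cluster), 2):
                below_a = descendants(a, children) | {a}
                below_b = descendants(b, children) | {b}
                if below_a & below_b:
                    return False
    return True


def is_rooted(classes: Set[str], edges: Set[Edge]) -> bool:
    """In an acyclic graph, exactly one parentless class means a single root."""
    has_parent = {child for child, _ in edges}
    return len(classes - has_parent) == 1


me = {"Unit": [frozenset({"Board", "Team"})]}
ok_edges = {("Board", "Unit"), ("Team", "Unit"), ("X", "Board")}
bad_edges = ok_edges | {("X", "Team")}
print(clusters_share_no_subclass(ok_edges, me), clusters_share_no_subclass(bad_edges, me))
```

The remaining conditions (greatest common subclass, multiple inheritance across clusters) need more graph machinery and are left out of this sketch.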

  28. Pseudoconsistency

  29. Weight factor • A set P of not-nil constraint probabilities • If P is the empty set, wf(P) = 0 • If P = {(p,γ)}, wf(P) = p • wf(P ∪ Q) = wf(P) + wf(Q) – wf(P) * wf(Q) • Intuitive meaning: how many objects from class A do I have to assign to class B and satisfy the constraints?
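
The recursive definition of wf can be evaluated by folding the combination rule over the probabilities one at a time. The sketch below takes only the probability components p of the pairs (p, γ); ignoring γ and passing the probabilities as a plain iterable are simplifications made here.

```python
from functools import reduce
from typing import Iterable


def wf(probs: Iterable[float]) -> float:
    """Weight factor of a set of non-nil constraint probabilities.

    Starts from wf(empty set) = 0 and applies
    wf(P ∪ Q) = wf(P) + wf(Q) - wf(P) * wf(Q) with Q a single element.
    """
    return reduce(lambda acc, p: acc + p - acc * p, probs, 0.0)


print(wf([]), wf([0.3]), wf([0.5, 0.5]))
```

Because the rule a + b - a*b is associative and commutative, the order in which the elements are folded does not matter; the result equals 1 minus the product of the (1 - p) terms.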

  30. More weight factors • CPO • c => d an edge • We write: • We define: • Result: Conditions of taxonomic interpretation can be satisfied by selecting at most w(c,d)*|Od| objects from d into c.

  31. Well-structured CPO • Conditional constraints on edges from the same cluster must be disjoint • Otherwise, impossible to compute a weight factor for the cluster edges. • The sum of the weight factors for edges in a cluster is ≤ 1
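
The second condition, that the weight factors of a cluster's edges sum to at most 1, can be checked as sketched below. The per-edge weights are taken as given input here because their definition did not survive in the transcript; the function name, the (child, parent) edge keys and the sample values are hypothetical. The disjointness condition on constraints is not covered.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (child, parent); an assumed direction


def overweight_clusters(
    me: Dict[str, List[FrozenSet[str]]], w: Dict[Edge, Fraction]
) -> List[Tuple[str, FrozenSet[str], Fraction]]:
    """Return every (parent, cluster, total) whose edge weights sum above 1."""
    violations = []
    for parent, clusters in me.items():
        for cluster in clusters:
            total = sum((w[(child, parent)] for child in cluster), Fraction(0))
            if total > 1:
                violations.append((parent, cluster, total))
    return violations


me = {"Unit": [frozenset({"Board", "Team"})]}
fine = {("Board", "Unit"): Fraction(1, 2), ("Team", "Unit"): Fraction(1, 2)}
heavy = {("Board", "Unit"): Fraction(3, 4), ("Team", "Unit"): Fraction(1, 2)}
print(overweight_clusters(me, fine), overweight_clusters(me, heavy))
```

Exact fractions avoid a cluster that sums to exactly 1 being rejected because of floating-point rounding.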

  32. Well-structuredness

  33. Consistent CPOs revisited • A pseudoconsistent and well-structured CPO is consistent • Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation • Well-structuredness accounts for the assignment of objects to subclasses

  34. Consistency checking algorithm • Pseudoconsistency is O(n^2 e) and well-structuredness is O(n^2 k^2) • n – number of classes • e – the number of edges • k – the order of the CPO • Algorithm based on: • Topological sort • Dijkstra's algorithm and derivatives

  35. CPO enhanced algebras • CPO enhanced algebras formally defined for: • Relational data sources • XML data stores • Selection, projection, product, join, etc. • Ongoing work: • RDF enhanced query algebra • Directly related to RDF extraction from text.


  37. CPO integration: motivation ACME corp. CPO EVIL corp. CPO Email from ACME corp. to EVIL corp.: “During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting...”

  38. Merging CPOs Two scenarios: • One data source that refers to similar entities but from different application domains. • Example: ACME – EVIL correspondence • Queries across multiple data sources • Example: Two different CPOs associated with distinct relations during a join query.

  39. Interoperation constraints • Since the CPOs being merged refer to similar entities, some classes may be equivalent • Equality constraints c1:=:c2 • Possibility: immediate subclassing constraints • Not really used – hardly feasible

  40. The integration problem • Two CPOs S1 = (C1, =>1, me1, φ1), S2 = (C2, =>2, me2, φ2) • Set of interoperation constraints I • An integration witness is another CPO S = (C, =>, me, φ) that satisfies S1, S2 and I

  41. Integration witness • Every class c in C1 ∪ C2 • Appears in C OR • c:=:d appears in I and d ∈ C • i.e. no classes get “lost” • Similarly, no edges are lost • No constraints are lost • If two identical constraint probabilities are on the same edge in both CPOs, take a probability p between the two

  42. Integration witness • Immediate subclassing constraints add edges to S • No cluster can be split as a result of merging • S is pseudoconsistent and well-structured (if it’s not, it’s of no use) • Open problem: If it is not, how can we minimally change it such that it has these properties?

  43. CPOmerge algorithm • CPOmerge produces an integration witness if one exists • O(n^3) – costly • In practice, much more efficient through: • Caching • Some properties are preserved if the original ontologies are pseudoconsistent and well-structured

  44. Who writes the interop constraints? • User – not feasible • How to infer them? • Intuitive solution: if enough neighbors are in equality constraints, infer that the respective nodes are equivalent. • But we still need some equivalence constraints to get started – use lexical distance • How many neighbors are “enough”?

  45. ICI – Simple solution • Neighbor: parent, immediate child, sibling from the same cluster • We define • ne – number of neighbors in equality constraints • nc,d – number of neighbors of c,d • Why? Number of equal neighbors / Total number of neighbors (including self). • Always < 1 • ICI algorithm: if pe exceeds threshold, assume they are equal • Start with lexical distance
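
A rough sketch of the ICI idea follows. The exact formula for pe did not survive in this transcript, so the score below is one possible reading of "number of equal neighbors / total number of neighbors (including self)": matched neighbors on both sides, divided by all neighbors of c and d plus the two nodes themselves. Repeating the pass until nothing changes is also an assumption, and all names are hypothetical. The seed set stands in for the constraints obtained from lexical distance.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (class in the first CPO, class in the second CPO)


def equality_score(
    c: str, d: str,
    nbrs1: Dict[str, Set[str]], nbrs2: Dict[str, Set[str]],
    equalities: Set[Pair],
) -> float:
    """Assumed form of pe: matched neighbors / (all neighbors + c and d)."""
    n_c = nbrs1.get(c, set())
    n_d = nbrs2.get(d, set())
    matched_c = {a for a in n_c if any((a, b) in equalities for b in n_d)}
    matched_d = {b for b in n_d if any((a, b) in equalities for a in n_c)}
    # The "+ 2" counts c and d themselves, which keeps the score below 1.
    return (len(matched_c) + len(matched_d)) / (len(n_c) + len(n_d) + 2)


def ici(
    candidates: List[Pair],
    nbrs1: Dict[str, Set[str]], nbrs2: Dict[str, Set[str]],
    seed: Set[Pair], threshold: float,
) -> Set[Pair]:
    """Grow the equality set until no candidate pair exceeds the threshold."""
    equalities = set(seed)
    changed = True
    while changed:
        changed = False
        for c, d in candidates:
            if (c, d) in equalities:
                continue
            if equality_score(c, d, nbrs1, nbrs2, equalities) > threshold:
                equalities.add((c, d))
                changed = True
    return equalities


nbrs1 = {"Board": {"Unit", "Team"}}
nbrs2 = {"Brd": {"OrgUnit", "Group"}}
seed = {("Unit", "OrgUnit"), ("Team", "Group")}
print(ici([("Board", "Brd")], nbrs1, nbrs2, seed, threshold=0.5))
```

The loop terminates because the equality set only grows and is bounded by the seed plus the candidate pairs.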


  47. Give me a CPO… • Very little work so far on probabilistic ontologies. • Nothing resembling CPOs around • How do we infer them: • How do we build disjoint decompositions? • How do we infer probabilities?

  48. Building disjoint decompositions • Take regular ontologies from the Web • Many sources: daml.org, SchemaWeb, OntoBroker • Modify CPOmerge to ignore labeling • The merge result will contain disjoint decompositions • Equality constraints can be inferred through ICI

  49. Infer probabilities – simple methods • Simple methods: • Distribute probabilities uniformly within each cluster • For each cluster L in me(c), d=>c, • For any distance function (lexical or otherwise)
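
The first simple method, distributing probabilities uniformly within each cluster, can be sketched as follows. The function name and the (child, parent) edge keys are assumptions; the distance-based variant is not shown because its formula is missing from the transcript.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (child, parent); an assumed direction


def uniform_labels(me: Dict[str, List[FrozenSet[str]]]) -> Dict[Edge, Fraction]:
    """Give every edge of a non-empty cluster the same share, 1 / |cluster|."""
    prob: Dict[Edge, Fraction] = {}
    for parent, clusters in me.items():
        for cluster in clusters:
            share = Fraction(1, len(cluster))
            for child in cluster:
                prob[(child, parent)] = share
    return prob


me = {
    "OrganizationUnit": [
        frozenset({"Committee", "Board", "Team", "Department"}),
        frozenset({"Legal", "Executive", "Financial", "Marketing"}),
    ]
}
labels = uniform_labels(me)
print(labels[("Board", "OrganizationUnit")])
```

With this assignment the labels in each cluster sum to exactly 1.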

  50. Advanced methods • Probabilistic relational models with structural uncertainty • Work by Dr. Getoor et al. • Classification approach • Feature extraction determines entities of interest • Create conditional probabilities on those entities • User feedback approach • General, applicable to any of the above (ongoing work)
