
Finding and Approximating Top-k Answers in Keyword Proximity Search

Benny Kimelfeld and Yehoshua Sagiv. The Selim and Rachel Benin School of Engineering and Computer Science, The Hebrew University of Jerusalem.


Presentation Transcript


  1. Finding and Approximating Top-k Answers in Keyword Proximity Search. Benny Kimelfeld and Yehoshua Sagiv. The Selim and Rachel Benin School of Engineering and Computer Science, The Hebrew University of Jerusalem

  2. Keyword Proximity Search (KPS) The Goal: Extract meaningful parts of data w.r.t. the keywords • A paradigm for data extraction • Data have varying degrees of structure • Relational databases, XML, Web sites • Queries are sets of keywords • No structural constraints

  3. Querying Structure & Content by Keywords [Figure: a search box with the example keywords "Vardi" and "Databases"] • Keywords appear in different parts of the data • Answers show occurrences of keywords, as well as the associations among these occurrences • Proximity of the keywords in the answer indicates a close (strong) semantic association among them …

  4. Past Work on KPS (Keyword Proximity Search) • DataSpot (Sigmod 1998) • Information Units (WWW 2001) • BANKS (ICDE 2002, VLDB 2005) • DISCOVER (VLDB 2002) • DBXplorer (ICDE 2002) • XKeyword (ICDE 2003) • …

  5. The Goal of this Paper Devise efficient algorithms for finding high-quality answers in keyword proximity search

  6. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  7. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  8. Data Graphs • Structural and keyword nodes • Edges may have weights – Weak relationships are penalized by high weights

  9. Queries Queries are sets of keywords from the data graph, e.g., Q = {Summers, Cohen, coffee}

  10. Query Answers

  11. Query Answers An answer is a directed subtree of the data graph • Contains all keywords of the query • Has no redundant edges (and nodes) • The keywords of the query are the leaves • The root has two or more children
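The conditions on this slide can be checked mechanically. The following is a minimal sketch, assuming an answer is given as a list of `(parent, child, weight)` edges and that the query has at least two keywords; the function names `is_answer` and `rank`, and the structural node names in the usage note, are illustrative and do not come from the presentation. The rank follows the inverse-weight rule stated on the next slide.

```python
from collections import defaultdict

def is_answer(tree_edges, keywords):
    """Return True if the edges form a directed tree whose leaves are
    exactly the query keywords and whose root has two or more children."""
    children = defaultdict(list)
    parent = {}
    nodes = set()
    for u, v, _ in tree_edges:
        if v in parent:              # a node with two parents: not a tree
            return False
        parent[v] = u
        children[u].append(v)
        nodes.update((u, v))
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        return False
    root = roots[0]
    # Every node must be reachable from the root (rules out detached cycles).
    seen, stack = {root}, [root]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    if seen != nodes:
        return False
    leaves = {n for n in nodes if not children[n]}
    # A non-keyword leaf or a single-child root would be a redundant part.
    return leaves == set(keywords) and len(children[root]) >= 2

def rank(tree_edges):
    """Rank as the inverse of the total edge weight."""
    return 1.0 / sum(w for _, _, w in tree_edges)
```

For example, with edges `[("dept", "Summers", 1.0), ("dept", "emp", 2.0), ("emp", "Cohen", 1.0)]` and keywords `{"Summers", "Cohen"}`, the check succeeds and the rank is 0.25; adding an extra edge above `dept` makes the root have a single child, so the check fails.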

  12. Ranking: Inversely Proportional to Weight rank(A) = (weight(A))^(-1) • Smaller subtrees represent closer associations

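  13. Enumerating in Exact (Ranked) Order [Figure: a sequence of answers over the keywords A, B, C] If one answer is generated before another, then its weight is ≤ the weight of the other • The first k answers are the top-k answers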

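  14. Enumerating in a C-Approximate Order (C may be a function of G and Q) [Figure: a sequence of answers over the keywords A, B, C] If one answer is generated before another, then its weight is ≤ C times the weight of the other • The first k answers are a C-approximation of the top-k answers (Fagin et al., PODS’01)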
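As a small illustration of the C-approximate order, the sketch below checks a sequence of answer weights against the condition that an answer generated earlier is never more than C times heavier than one generated later. This reading of the condition, the function name `is_c_approximate_order`, and the assumption of positive weights are made for the example only.

```python
def is_c_approximate_order(weights, c):
    """weights: answer weights in the order the answers were generated.

    Returns True if every earlier weight is at most c times every later one.
    Tracking the largest weight seen so far is enough to test all pairs.
    """
    max_so_far = 0.0
    for w in weights:
        if max_so_far > c * w:
            return False
        max_so_far = max(max_so_far, w)
    return True
```

With C = 1 the condition is the exact ranked order; for instance, the sequence 2, 1, 3 is not in exact order but is in a 2-approximate order.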

  15. Polynomial Delay [Figure: answers over the keywords A, B, C generated one after another] Yardstick of efficiency: polynomial delay, i.e., polynomial time between generating successive answers • There can be exponentially many answers even for 2 keywords, so it is inefficient to generate all answers and then sort

  16. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  17. Top Answers are Steiner Trees • Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable • Therefore, one cannot enumerate all answers in ranked order with polynomial delay • However, the top answer can be found efficiently under data complexity • That is, the number of keywords is fixed • Approximations can be found efficiently under query-and-data complexity • There is a lot of work on Steiner-tree approximations

  18. So What Can Be Done? • Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity? • Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?

  19. Our Results Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay

  20. Our Results (cont’d) Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order

  21. The Meaning of the Results • KPS is tractable under data complexity • All results on Steiner trees can be applied to KPS • Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees • From a theoretical point of view, using heuristics is not the only option • Existing approaches to KPS are heuristics • Exponential delay in the worst case • No provable nontrivial approximation ratios

  22. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  23. Lawler’s Method • We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers • Each iteration generates the next answer by finding the top answer under constraints • Lawler’s method is designed for general (discrete) optimization problems • When applying it to a specific problem, one needs to deal with the following two issues

  24. Two Problems to Solve 1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?) 2. How can we efficiently find the top answer under constraints?
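To make Lawler's iterative scheme concrete, here is a minimal sketch on a toy problem that is not KPS: pick one number from each of several lists and enumerate the picks in increasing order of their sum. In this toy setting the "constraints" are a fixed prefix of picks plus a set of excluded indices at the next position, and the "top answer under constraints" is found greedily. The names `best_in_subspace` and `top_k` are hypothetical; the presentation's constraints for KPS are subtrees, and its constrained optimization is far harder.

```python
import heapq

def best_in_subspace(lists, prefix, excluded):
    """Cheapest pick that starts with `prefix` and avoids the indices in
    `excluded` at position len(prefix). Returns (cost, pick) or None."""
    j = len(prefix)
    allowed = [i for i in range(len(lists[j])) if i not in excluded]
    if not allowed:
        return None
    pick = list(prefix)
    pick.append(min(allowed, key=lambda i: lists[j][i]))
    for lst in lists[j + 1:]:            # remaining positions are unconstrained
        pick.append(min(range(len(lst)), key=lambda i: lst[i]))
    cost = sum(lists[p][i] for p, i in enumerate(pick))
    return cost, tuple(pick)

def top_k(lists, k):
    """Enumerate up to k picks in non-decreasing order of total cost."""
    heap = []
    first = best_in_subspace(lists, (), frozenset())
    if first is not None:
        heapq.heappush(heap, (first[0], first[1], 0, frozenset()))
    results = []
    while heap and len(results) < k:
        cost, pick, p, excluded = heapq.heappop(heap)
        results.append((cost, pick))
        # Partition the rest of this subspace into disjoint subspaces:
        # keep pick[:j] fixed and forbid pick[j] at position j.
        for j in range(p, len(lists)):
            banned = (excluded if j == p else frozenset()) | {pick[j]}
            cand = best_in_subspace(lists, pick[:j], banned)
            if cand is not None:
                heapq.heappush(heap, (cand[0], cand[1], j, banned))
    return results
```

Each iteration pops the best remaining answer and replaces its subspace with smaller constrained subspaces, solving one constrained optimization per new subspace. The two issues on this slide correspond to choosing the shape of `prefix`/`excluded` and implementing `best_in_subspace` efficiently.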

  25. Solving the First Problem • Constraints are subtrees of the graph • Pairwise node-disjoint • Their leaves are exactly the keywords of the query • An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

  26. Two Problems to Solve (One Left) 1. What exactly are the constraints? (That is, how can we apply Lawler in a way that the constraints enable finding the top answer efficiently?) 2. How can we efficiently find the top answer under constraints?

  27. Formulation of the Second Problem Input: constraints (node-disjoint subtrees, keywords as leaves) Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees) Next, an algorithm that solves “almost” this problem, namely: (Almost the same) Objective: A minimal supertree satisfying the constraints

  28. Finding a Minimal Supertree Input: G, T (constraints, i.e., subtrees) 1. Collapse each of the subtrees of T into a node 2. Find a Steiner tree T of the collapsed subtrees 3. Restore the collapsed subtrees in T (more details in the proceedings…)
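The collapse and restore steps can be sketched as follows. This is a simplified illustration for an undirected graph only; the presentation works with directed graphs and defers the details to the proceedings, so the representation (edges as `(u, v, weight)` triples, each constraint subtree as a `(nodes, edges)` pair) and the function names `collapse` and `restore` are assumptions made here. Step 2, finding a Steiner tree over the collapsed graph, is left to whatever solver is available and is not implemented.

```python
def collapse(edges, subtrees):
    """Step 1: merge every constraint subtree into one super-node.

    Returns {frozenset({x, y}): (weight, original_edge)}, keeping only the
    cheapest edge between any two (super-)nodes.
    """
    rep = {}
    for i, (nodes, _) in enumerate(subtrees):
        for n in nodes:
            rep[n] = ("subtree", i)          # hypothetical super-node label
    collapsed = {}
    for u, v, w in edges:
        cu, cv = rep.get(u, u), rep.get(v, v)
        if cu == cv:                         # edge inside a collapsed subtree
            continue
        key = frozenset((cu, cv))
        if key not in collapsed or w < collapsed[key][0]:
            collapsed[key] = (w, (u, v, w))
    return collapsed

def restore(chosen_keys, collapsed, subtrees):
    """Step 3: map the chosen collapsed edges back to original edges and
    put the edges of every constraint subtree back in."""
    tree = [collapsed[k][1] for k in chosen_keys]
    for _, sub_edges in subtrees:
        tree.extend(sub_edges)
    return tree
```

Because the constraint edges must appear in every supertree, only the edges between super-nodes are subject to optimization, which is why the problem reduces to a Steiner-tree computation over the collapsed graph.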

  29. This is not Enough! Input: constraints (node-disjoint subtrees, keywords as leaves) Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees) Not the same! (Almost the same) Objective: A minimal supertree satisfying the constraints

  30. Query Answers Revisited An answer is a directed subtree of the data graph • Contains all keywords of the query • Has no redundant edges (and nodes) • Keywords are the leaves • The root has two or more children

  31. An Example

  32. An Example [Figure: the minimal supertree satisfying the constraints next to the minimal answer satisfying the constraints] • In the minimal supertree, one edge is redundant, but it cannot be removed since it is a constraint • The minimal answer can be completely different from the minimal supertree • Furthermore, there can be no answer even if there is a supertree

  33. What if We Remove Edges of Constraints? • What if we first generate a minimal supertree and, if the root has only one child, just remove the root (until an answer is obtained)? • The constraints are violated, leading to a failure of Lawler’s method! • That is, some answers will be duplicated, while other answers will not be generated at all

  34. Our Approach [Figure: constraints, minimal supertree, transformation into new constraints, answer] The root of this subtree has more than one child and it must be the root of the answer

  35. This Process is Repeated [Figure: the constraints, followed by alternating Min. Supertree and Transform steps] • The best is the final answer • Up to 2^(#keywords) times (fixed & usually fewer)

  36. About the Transformation • The details of the exact transformation and the proof of correctness are intricate • All can be found in the proceedings… This concludes the algorithm for enumerating in the exact order

  37. A Different View: Chain of Reductions Enumerating answers in ranked order → (adapting Lawler’s method) → Finding the top answer under constraints → (transformation of constraints) → Finding minimal supertrees → (collapse and restore) → Finding Steiner trees

  38. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  39. Modifying the Chain of Reductions Enumeration in an approximate order → (similar) → Finding approximate answers under constraints → (completely different!) → Finding approximations of minimal supertrees → (similar) → Finding approximations of Steiner trees

  40. Exact Order Revisited [Figure: the constraints, followed by alternating Min. Supertree and Transform steps, up to 2^(#keywords) times] We cannot allow it under query-and-data complexity!

  41. The Algorithm From the constraints, compute two subtrees: • A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum • A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

  42. Combine the Subtrees • A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum • A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum • The combined subgraph contains an answer: ≤ (C+1) times the optimum
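The (C+1) factor follows from adding the two bounds on the slide. In the sketch below, the symbols are introduced here for illustration and are not the presentation's notation: w denotes total edge weight, opt is the weight of a minimal answer satisfying the constraints, S is the C-approximate supertree, M is the minimal answer for 3 or fewer constraints, and A is an answer contained in the combined subgraph.

```latex
\[
  w(A) \;\le\; w(S \cup M) \;\le\; w(S) + w(M)
       \;\le\; C \cdot \mathrm{opt} + 1 \cdot \mathrm{opt}
       \;=\; (C+1) \cdot \mathrm{opt}
\]
```

The first inequality holds because A uses only edges of the combined subgraph, and the second because the union cannot weigh more than the two parts counted separately.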

  43. Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

  44. Keyword Proximity Search • A common paradigm for keyword search over structured databases • In the formal model: • Data are directed and weighted graphs • Queries are sets of keywords (i.e., nodes) from the data graph • Query answers are non-redundant subtrees containing the keywords of the query • The goal is to find the top-k answers, where the rank is inversely proportional to the weight • A stronger goal: enumeration with poly. delay

  45. Our Results • Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay • Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order

  46. Our Chain of Reductions Enumerating answers in sorted order → (Lawler’s approach) → Finding the top answer under constraints → (the intricate part …) → Finding minimal supertrees → (subtree collapse/restore) → Finding Steiner trees

  47. Other Variants of KPS Our algorithms can be adapted to other popular variants of KPS

  48. Undirected Variant Answers are undirected trees

  49. Strong Variant Answers are undirected trees and keywords are leaves

  50. Open Problems • Can we improve the space efficiency of our algorithms? • Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but • The chain of reductions doesn’t work • The complexity of finding the top answer under constraints is unknown • Can our results hold for richer queries that also have structural constraints?
