Finding and Approximating Top-k Answers in Keyword Proximity Search

Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
The Hebrew University of Jerusalem

Keyword Proximity Search (KPS)
The Goal: Extract meaningful parts of data w.r.t. the keywords
• A paradigm for data extraction
• Data have varying degrees of structure
– Relational databases, XML, Web sites
• Queries are sets of keywords
– No structural constraints

Querying Structure & Content by Keywords
Example search: Vardi Databases
• Keywords appear in different parts of the data
• Answers show occurrences of keywords, as well as the associations among these occurrences
• Proximity of the keywords in the answer indicates a close (strong) semantic association among them
…

Past Work on KPS (Keyword Proximity Search)
• DataSpot (SIGMOD 1998)
• Information Units (WWW 2001)
• BANKS (ICDE 2002, VLDB 2005)
• DISCOVER (VLDB 2002)
• DBXplorer (ICDE 2002)
• XKeyword (ICDE 2003)
• …

The Goal of this Paper Devise efficient algorithms for finding high-quality answers in keyword proximity search

Contents • Introduction • Formal Setting • The Main Results • Enumerating in the Exact Order • Enumerating in an Approximate Order • Conclusion and Future Work

Data Graphs
• Structural and keyword nodes
• Edges may have weights
– Weak relationships are penalized by high weights

Queries
Queries are sets of keywords from the data graph
Q = {Summers, Cohen, coffee}


Query Answers
An answer is a directed subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
The root has two or more children
The keywords of the query are the leaves
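The conditions on this slide can be checked mechanically. The following is a minimal sketch, assuming a data graph stored as a dictionary from directed edges to weights; the graph contents, the node names other than the three query keywords, and the function name `is_answer` are hypothetical and only serve as an illustration.

```python
# Hypothetical data graph: (parent, child) -> edge weight.
GRAPH = {
    ("company", "dept"): 1,
    ("dept", "Summers"): 1,
    ("dept", "Cohen"): 1,
    ("dept", "supplies"): 2,
    ("supplies", "coffee"): 1,
    ("company", "coffee"): 5,
}
QUERY = {"Summers", "Cohen", "coffee"}


def is_answer(graph, tree_edges, keywords):
    """Return True if tree_edges is a directed subtree of graph whose leaves
    are exactly the keywords and whose root has two or more children."""
    if not tree_edges or any(edge not in graph for edge in tree_edges):
        return False
    parent, children = {}, {}
    for u, v in tree_edges:
        if v in parent:  # a second incoming edge means this is not a tree
            return False
        parent[v] = u
        children.setdefault(u, []).append(v)
    nodes = set(parent) | set(children)
    roots = nodes - set(parent)
    if len(roots) != 1:
        return False
    root = roots.pop()
    # Every node must be reachable from the root.
    seen, stack = {root}, [root]
    while stack:
        for v in children.get(stack.pop(), []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if seen != nodes:
        return False
    leaves = nodes - set(children)
    return leaves == set(keywords) and len(children[root]) >= 2


ANSWER = {("dept", "Summers"), ("dept", "Cohen"),
          ("dept", "supplies"), ("supplies", "coffee")}
REDUNDANT = ANSWER | {("company", "dept")}  # root "company" has a single child
```

Here `ANSWER` passes the check, while `REDUNDANT` fails because the edge from `company` to `dept` can be dropped without losing any keyword.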

Ranking: Inversely Proportional to Weight
rank(A) = (weight(A))^(-1)
Smaller subtrees represent closer associations
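As a small worked example of this ranking, the sketch below computes weight and rank for two candidate answers. The graph, the two subtrees, and the helper names `weight` and `rank` are assumptions made for illustration; only the formula rank(A) = 1 / weight(A) comes from the slide.

```python
# Hypothetical data graph: (parent, child) -> edge weight.
GRAPH = {
    ("company", "dept"): 1,
    ("dept", "Summers"): 1,
    ("dept", "Cohen"): 1,
    ("dept", "supplies"): 2,
    ("supplies", "coffee"): 1,
    ("company", "coffee"): 5,
}


def weight(graph, tree_edges):
    """Total weight of the edges of a subtree."""
    return sum(graph[edge] for edge in tree_edges)


def rank(graph, tree_edges):
    """rank(A) = 1 / weight(A): lighter subtrees rank higher."""
    return 1 / weight(graph, tree_edges)


# Two subtrees that both connect Summers, Cohen and coffee.
A = {("dept", "Summers"), ("dept", "Cohen"),
     ("dept", "supplies"), ("supplies", "coffee")}      # weight 5
B = {("company", "dept"), ("dept", "Summers"),
     ("dept", "Cohen"), ("company", "coffee")}          # weight 8

ranked = sorted([A, B], key=lambda t: weight(GRAPH, t))  # best answer first
```

Sorting by increasing weight is the same as sorting by decreasing rank, so `A` comes before `B`.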

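Enumerating in Exact (Ranked) Order
[Figure: answers over keywords A, B, C, listed in the order in which they are generated]
If answer A is generated before answer B, then weight(A) ≤ weight(B)
The first k answers generated are the top-k answers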

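Enumerating in a C-Approximate Order
[Figure: answers over keywords A, B, C, listed in the order in which they are generated]
If answer A is generated before answer B, then weight(A) ≤ C · weight(B)
C may be a function of G and Q
The first k answers generated are a C-approximation of the top-k answers (Fagin et al., PODS’01)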
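The two orders can be compared with a small checker. This is a sketch under the assumption that a C-approximate order means: whenever one answer is emitted before another, its weight is at most C times the weight of the later one (so C = 1 gives the exact order). The function name is hypothetical.

```python
def is_c_approximate_order(weights, c):
    """weights: weights of the answers in the order they were emitted.

    Returns True if every earlier answer weighs at most c times
    any later answer. With c == 1 this is the exact ranked order.
    """
    return all(
        weights[i] <= c * weights[j]
        for i in range(len(weights))
        for j in range(i + 1, len(weights))
    )


exact = [2, 3, 3, 5]   # non-decreasing weights: exact order
swapped = [3, 2, 5]    # not exact, but within a factor of 1.5
```

For instance, `swapped` is not an exact order, yet it is a 1.5-approximate order because 3 ≤ 1.5 · 2.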

Polynomial Delay
Yardstick of efficiency: polynomial delay
• Polynomial time between generating successive answers
• Exponentially many answers even for 2 keywords (it is inefficient to generate all answers and then sort)

Top Answers are Steiner Trees
• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable
• Therefore, one cannot enumerate all answers in ranked order with polynomial delay
• However, the top answer can be found efficiently under data complexity
– That is, the number of keywords is fixed
• Approximations can be found efficiently under query-and-data complexity
– There is a lot of work on Steiner-tree approximations

So What Can Be Done?
Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?
Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?

Our Results
Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay

Our Results (cont’d)
Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order

The Meaning of the Results
KPS is tractable under data complexity
All results on Steiner trees can be applied to KPS
Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees
From a theoretical point of view, using heuristics is not the only option
• Existing approaches to KPS are heuristics
• Exponential delay in the worst case
• No provable nontrivial approximation ratios

Lawler’s Method • We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers • Each iteration generates the next answer by finding the top answer under constraints • Lawler’s method is designed for general (discrete) optimization problems • When applying it to a specific problem, one needs to deal with the following two issues

Two Problems to Solve
1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)
2. How can we efficiently find the top answer under constraints?
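To make the two issues concrete, here is a sketch of Lawler's method on a toy problem rather than on KPS itself. The toy problem (pick one option per slot, minimize total cost), the constraint format (a fixed prefix plus options forbidden at the first free slot), and all function names are assumptions chosen so that both issues have easy solutions; the constraints used for KPS are different.

```python
import heapq
from itertools import count


def best_under_constraints(costs, fixed, forbidden):
    """Cheapest choice vector that starts with `fixed` and avoids the
    `forbidden` options at the first free slot; None if none exists."""
    choice = list(fixed)
    for slot in range(len(fixed), len(costs)):
        banned = forbidden if slot == len(fixed) else frozenset()
        allowed = [o for o in range(len(costs[slot])) if o not in banned]
        if not allowed:
            return None
        choice.append(min(allowed, key=lambda o: costs[slot][o]))
    total = sum(costs[s][o] for s, o in enumerate(choice))
    return total, tuple(choice)


def lawler_top_k(costs, k):
    """Return the k cheapest choice vectors in increasing order of cost."""
    tie = count()  # tie-breaker so the heap never compares constraints
    heap = []

    def push(fixed, forbidden):
        best = best_under_constraints(costs, fixed, forbidden)
        if best is not None:
            heapq.heappush(heap, (best[0], next(tie), best[1], fixed, forbidden))

    push((), frozenset())
    answers = []
    while heap and len(answers) < k:
        total, _, choice, fixed, forbidden = heapq.heappop(heap)
        answers.append((total, choice))
        # Partition the remaining solutions of this subproblem: agree with
        # `choice` on the first i slots and differ from it at slot i.
        for i in range(len(fixed), len(choice)):
            inherited = forbidden if i == len(fixed) else frozenset()
            push(choice[:i], inherited | {choice[i]})
    return answers


costs = [[1, 4], [2, 3], [0, 5]]  # costs[slot][option]
top3 = lawler_top_k(costs, 3)
```

Each iteration pops the best remaining solution and splits its subproblem into disjoint subproblems that together cover everything except the solution just emitted, which is why no solution is produced twice or skipped.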

Solving the First Problem
• Constraints are subtrees of the graph
– Pairwise node-disjoint
– Their leaves are exactly the keywords of the query
An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

Two Problems to Solve (One Left)
1. What exactly are the constraints? (That is, how can we apply Lawler’s method in a way that the constraints enable finding the top answer efficiently?)
2. How can we efficiently find the top answer under constraints?

Formulation of the Second Problem
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
Next, an algorithm that solves “almost” this problem, namely:
(Almost the same) Objective: A minimal supertree satisfying the constraints

Finding a Minimal Supertree
Input: G and a set of constraints (i.e., subtrees)
1. Collapse each of the constraint subtrees into a node
2. Find a Steiner tree T of the collapsed subtrees
3. Restore the collapsed subtrees in T
(more details in the proceedings…)
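The three steps can be illustrated on a tiny example. The sketch below is a simplification and not the procedure from the proceedings: it assumes an undirected graph with non-negative weights, and it uses a brute-force Steiner-tree search as a stand-in for whatever exact or approximate Steiner algorithm is plugged in. The graph, the supernode names `<T0>`, `<T1>`, and all function names are hypothetical.

```python
from itertools import combinations


def mst(nodes, edges):
    """Kruskal on the subgraph induced by `nodes`; edges are tuples that
    start with (weight, u, v). Returns (weight, edges) or None."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, chosen = 0, []
    for edge in sorted(edges):
        w, u, v = edge[:3]
        if u in parent and v in parent and find(u) != find(v):
            parent[find(u)] = find(v)
            total += w
            chosen.append(edge)
    return (total, chosen) if len(chosen) == len(nodes) - 1 else None


def steiner_brute_force(nodes, edges, terminals):
    """Exact Steiner tree by trying every set of optional nodes
    (exponential, so suitable for toy inputs only)."""
    optional = sorted(set(nodes) - set(terminals))
    best = None
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            result = mst(set(terminals) | set(extra), edges)
            if result is not None and (best is None or result[0] < best[0]):
                best = result
    return best


def minimal_supertree(edges, subtrees):
    """edges: list of (weight, u, v); subtrees: list of (node set, edge list),
    pairwise node-disjoint. Returns (weight, sorted edge list) or None."""
    # 1. Collapse every node of subtree i into the supernode "<Ti>".
    super_of = {n: f"<T{i}>" for i, (ns, _) in enumerate(subtrees) for n in ns}
    collapsed = []
    for w, u, v in edges:
        cu, cv = super_of.get(u, u), super_of.get(v, v)
        if cu != cv:  # edges inside one collapsed subtree disappear
            collapsed.append((w, cu, cv, u, v))  # keep the original endpoints
    terminals = {f"<T{i}>" for i in range(len(subtrees))}
    nodes = terminals | {n for e in collapsed for n in e[1:3]}
    # 2. Steiner tree connecting the collapsed subtrees.
    best = steiner_brute_force(nodes, collapsed, terminals)
    if best is None:
        return None
    # 3. Restore: original connecting edges plus all constraint edges.
    restored = [(w, u, v) for w, _, _, u, v in best[1]]
    for _, sub_edges in subtrees:
        restored.extend(sub_edges)
    return sum(w for w, _, _ in restored), sorted(restored)


EDGES = [(1, "a", "Summers"), (1, "a", "Cohen"), (2, "a", "b"),
         (1, "b", "coffee"), (4, "Summers", "coffee"),
         (1, "c", "Cohen"), (1, "c", "coffee")]
CONSTRAINTS = [({"a", "Summers", "Cohen"}, [(1, "a", "Summers"), (1, "a", "Cohen")]),
               ({"coffee"}, [])]
result = minimal_supertree(EDGES, CONSTRAINTS)
```

In this example the two collapsed subtrees are connected most cheaply through node `c`, so the restored supertree consists of the two constraint edges plus the two edges incident to `c`.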

This is not Enough!
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
(Almost the same) Objective: A minimal supertree satisfying the constraints
Not the same!

Query Answers Revisited
An answer is a directed subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
The root has two or more children
Keywords are the leaves


An Example
[Figure: the minimal supertree satisfying the constraints] This edge is redundant! But it cannot be removed, since it is a constraint!
[Figure: the minimal answer satisfying the constraints]
The minimal answer can be completely different from the minimal supertree
Furthermore, there can be no answer even if there is a supertree

What if We Remove Edges of Constraints?
• What if we first generate a minimal supertree and, if the root has only one child, simply remove the root (repeating until an answer is obtained)?
• The constraints are violated, leading to a failure of Lawler’s method! That is,
– Some answers will be duplicated
– While other answers will not be generated at all

Our Approach
[Diagram labels: Constraints, Min. Supertree, Transform, New constraints, Answer]
The root of this subtree has more than one child, and it must be the root of the answer

This Process is Repeated
[Diagram: starting from the constraints, the steps Min. Supertree and Transform alternate, shown four times]
Up to 2^(#keywords) times (fixed & usually fewer)
The best is the final answer

About the Transformation • The details of the exact transformation and the proof of correctness are intricate • All can be found in the proceedings… This concludes the algorithm for enumerating in the exact order

A Different View: Chain of Reductions
Enumerating answers in ranked order
↓ Adapting Lawler’s method
Finding the top answer under constraints
↓ Transformation of constraints
Finding minimal supertrees
↓ Collapse and restore
Finding Steiner trees

Modifying the Chain of Reductions
Enumeration in an approximate order
↓
Finding approximate answers under constraints
↓
Finding approximations of minimal supertrees
↓
Finding approximations of Steiner trees
[Diagram annotations on the three reductions: two are marked “Similar”, one is marked “Completely different!”]

Exact Order Revisited
[Diagram: starting from the constraints, the steps Min. Supertree and Transform alternate, shown four times]
Up to 2^(#keywords) times
We cannot allow it under query-and-data complexity!

The Algorithm
From the constraints, compute two subtrees:
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

Combine the Subtrees
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
The combined subgraph contains an answer: ≤ (C+1) times the optimum
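The (C+1) factor on this slide is just the sum of the two bounds. As a sketch of the arithmetic, let w denote total edge weight and introduce the following names for illustration only (they do not appear in the slides): S is the approximate supertree, A' is the minimal answer for the reduced constraints, H is their combined subgraph, and OPT is the weight of the minimal answer satisfying the constraints. Assuming non-negative weights:

```latex
w(H) \;\le\; w(S) + w(A')
     \;\le\; C \cdot \mathrm{OPT} + 1 \cdot \mathrm{OPT}
     \;=\; (C+1) \cdot \mathrm{OPT}
```

Any answer contained in H uses a subset of its edges, so its weight is also at most (C+1) · OPT.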

Keyword Proximity Search
• A common paradigm for keyword search over structured databases
• In the formal model:
– Data are directed and weighted graphs
– Queries are sets of keywords (i.e., nodes) from the data graph
– Query answers are non-redundant subtrees containing the keywords of the query
• The goal is to find the top-k answers, where the rank is inversely proportional to the weight
• A stronger goal: enumeration with polynomial delay

Our Results • Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay • Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order

Our Chain of Reductions
Enumerating answers in sorted order
↓ Lawler’s approach
Finding the top answer under constraints
↓ The intricate part …
Finding minimal supertrees
↓ Subtree Collapse/Restore
Finding Steiner trees

Other Variants of KPS
Our algorithms can be adapted to other popular variants of KPS

Undirected Variant Answers are undirected trees

Strong Variant Answers are undirected trees and keywords are leaves

Open Problems
• Can we improve the space efficiency of our algorithms?
• Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but
– The chain of reductions doesn’t work
– The complexity of finding the top answer under constraints is unknown
• Can our results hold for richer queries that also have structural constraints?
