Efficient Top-K Keyword Search in XML Databases: Join-Based Algorithm Strategy

Supporting Top-K Keyword Search in XML Databases (ICDE 2010)

Outline • Introduction • Motivation • Preliminaries • Join-based Algorithm • Join-based Top-k Algorithm • Experiments • Conclusions

Introduction • LCA: Lowest Common Ancestor

Motivation • The naive LCA-based semantics is straightforward, but it leads to exponential computation and result size. • Example with two keywords, {XML} and {data}: if the list of nodes containing XML has m entries and the list of nodes containing data has n entries, the total number of LCAs is m*n. • Existing algorithms focus on efficiency and cannot provide effective support for top-k processing.
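
The m*n blow-up can be illustrated with a minimal sketch. It assumes nodes carry Dewey-style labels (such as 1.1.2) so that the LCA of two nodes is the longest common prefix of their labels; the two node lists are hypothetical and are not taken from the slides.

```python
from itertools import product

def lca(a, b):
    """LCA of two Dewey-style labels: their longest common prefix."""
    prefix = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        prefix.append(x)
    return ".".join(prefix)

# Hypothetical keyword lists (assumed for illustration only).
xml_nodes = ["1.1.1", "1.1.2.1", "1.3.4.1"]   # m = 3 nodes containing "XML"
data_nodes = ["1.1.2.2", "1.3.4.2"]           # n = 2 nodes containing "data"

# The naive semantics computes one LCA per combination: m * n of them.
pairs = list(product(xml_nodes, data_nodes))
lcas = [lca(a, b) for a, b in pairs]

print(len(lcas))          # number of combinations examined
print(sorted(set(lcas)))  # distinct LCA labels among them
```

With k keywords the number of combinations is the product of all k list sizes, which is why the naive approach does not scale.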

Preliminaries 1. Query Semantics • k-keyword query • the list of nodes directly containing each keyword • the LCA of a set of nodes • ELCA semantics: the result is the set of nodes that contain at least one occurrence of all of the query keywords, either in their labels or in the labels of their descendant nodes, after excluding the occurrences of the keywords in subtrees that already contain at least one occurrence of all the query keywords

Cont. • SLCA: a subset of the LCAs such that no LCA in the subset is the ancestor of another LCA. • LCA: 1.1, 1.1.2, 1, 1.3.4, 1.3 • SLCA: 1.1.2, 1.3.4 • ELCA: 1.1.2, 1.3.4, 1
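
The SLCA filter on this slide can be sketched in a few lines. The sketch assumes the labels are Dewey-style, so that a node is an ancestor of another exactly when its label is a proper dotted prefix; the function names are illustrative. ELCA is not derived here because it needs the underlying tree, which the transcript does not preserve.

```python
def is_ancestor(a, b):
    """True if Dewey-style label a is a proper ancestor of label b."""
    return b.startswith(a + ".")

def slca(lcas):
    """Keep only the LCAs that are not ancestors of another LCA in the list."""
    return [v for v in lcas if not any(is_ancestor(v, u) for u in lcas)]

# The LCA list given on the slide.
lca_list = ["1.1", "1.1.2", "1", "1.3.4", "1.3"]
print(slca(lca_list))  # ['1.1.2', '1.3.4']
```

The quadratic scan is only for readability; it is not how an efficient algorithm would compute SLCAs.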

Cont. 2. Ranking Function

Cont. • d( ): a decreasing function

Join-based Algorithm 1. Node encoding

Join-based Algorithm 2. Algorithm • Two lists of nodes, one per keyword

Cont. (2,3) join (1): no match

Cont. (3,5,6) join (1,2,4): no match

Cont. (2,3,4,5) join (1,2,4) => (2,4) matched. The nodes numbered 2 and 4 at level 3 are the lowest ELCAs => erased

Cont. (2,3) join (1): no match

Cont. (1,1) join (1): matched => the root is an ELCA. The number 1 corresponds to two nodes (1.2.3 and 1.3.5.6); output one of them
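
Each step above joins two columns of node numbers. A minimal sketch of one such step follows, assuming each column is sorted and that a repeated number (as in (1,1)) should be reported once. It covers only the column intersection; the level-by-level traversal and the erasing of matched nodes are not modelled, and the function name is illustrative.

```python
def merge_join(left, right):
    """Intersect two sorted columns of node numbers, reporting each match once."""
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            value = left[i]
            matches.append(value)
            # Skip duplicates of the matched number on both sides.
            while i < len(left) and left[i] == value:
                i += 1
            while j < len(right) and right[j] == value:
                j += 1
    return matches

print(merge_join([2, 3], [1]))              # []
print(merge_join([3, 5, 6], [1, 2, 4]))     # []
print(merge_join([2, 3, 4, 5], [1, 2, 4]))  # [2, 4]
print(merge_join([1, 1], [1]))              # [1]
```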

Cont. Score(1.3.4.5.3.1.1) is greater than Score(1.3.5.6). But in the 4th column, 0.5*d(3) may be greater than or equal to 0.44.

Cont. Assume d( ): Join columns 5 and 4: no result.

Cont. Column 3: number 2 is matched. Its score is 0.73+0.41 = 1.14. The threshold for the unseen results in column 3 is max{0.7+0.3, 0.5+0.4} = 1.

Cont. Consider the unseen results in the other columns: columns 1 and 2 do not contain sequence s. Ignore. Consider column 2: the maximum scores are 0.7*0.9 and 0.5*0.9, so the threshold is 0.63+0.45 = 1.08 < 1.14. Therefore, node 2 at level 3 can be output.
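
The output test in this example can be written as a short sketch using the numbers on the slides. It assumes a candidate may be output once its score is at least the largest threshold over all columns that can still produce unseen results; the function and variable names are illustrative, and the 0.9 factor is simply the one that appears in the column 2 arithmetic above.

```python
def can_output(candidate_score, thresholds):
    """A candidate is safe to output when no unseen result can score higher."""
    return candidate_score >= max(thresholds)

# Matched node 2 in column 3.
candidate = 0.73 + 0.41

# Upper bound on unseen results in column 3.
col3_threshold = max(0.7 + 0.3, 0.5 + 0.4)

# Upper bound on unseen results in column 2 (maximum scores scaled by 0.9).
col2_threshold = 0.7 * 0.9 + 0.5 * 0.9

print(can_output(candidate, [col3_threshold, col2_threshold]))
```

If the column 2 bound had exceeded the candidate's score, the candidate would have to wait until more of that column had been examined.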

Experiments

Conclusions • 1. The join-based algorithm performs well when keyword frequency is high. • 2. The join-based top-k algorithm performs well when keyword correlation is high.
