This paper presents a join-based algorithm for top-k keyword search in XML databases. It addresses the inefficiency of existing methods and targets good performance under both high keyword frequency and high keyword correlation. It reviews the LCA (Lowest Common Ancestor) concept and the ELCA (Exclusive Lowest Common Ancestor) and SLCA (Smallest Lowest Common Ancestor) query semantics, and closes with experiments and conclusions.
Outline • Introduction • Motivation • Preliminaries • Join-based Algorithm • Join-based Top-k Algorithm • Experiments • Conclusions
Introduction • LCA: Lowest Common Ancestor
Motivation • The naive LCA-based semantics is straightforward, but leads to exponential computation and result size. • Two keywords: {XML} and {data}. With m nodes in the list for XML and n nodes in the list for data, the total number of LCAs is m*n. • Existing algorithms focus on efficiency and cannot provide effective support for top-k processing.
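The m*n blow-up can be illustrated with a minimal sketch. It assumes nodes carry Dewey-style labels such as 1.3.4, so that the LCA of two nodes is their longest common label prefix. The two node lists and the helper name `lca` are hypothetical and not taken from the slides.

```python
from itertools import product

def lca(a, b):
    """Longest common prefix of two Dewey-style labels such as '1.3.4'."""
    prefix = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        prefix.append(x)
    return ".".join(prefix)

# Hypothetical node lists for the two keywords (illustration only).
xml_nodes = ["1.1.2.1", "1.3.4.1", "1.2"]   # m = 3
data_nodes = ["1.1.2.2", "1.3.4.2"]         # n = 2

# The naive semantics pairs every XML node with every data node.
pairs = [(x, d, lca(x, d)) for x, d in product(xml_nodes, data_nodes)]
distinct_lcas = sorted({l for _, _, l in pairs})

print(len(pairs))      # 6 candidate pairs, i.e. m * n
print(distinct_lcas)   # ['1', '1.1.2', '1.3.4']
```

Even in this tiny case most pairs collapse to the root, which is why the naive semantics produces many low-value results.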
Preliminaries 1. Query Semantics • k-keyword query • For each keyword: the list of nodes that directly contain it • The LCA of nodes drawn from these lists • ELCA semantics: the result is the set of nodes that contain at least one occurrence of every query keyword, either in their labels or in the labels of their descendant nodes, after excluding keyword occurrences in subtrees that already contain at least one occurrence of every query keyword
Cont. • SLCA: a subset of the LCAs such that no LCA in the subset is the ancestor of another LCA. • LCA: 1.1, 1.1.2, 1, 1.3.4, 1.3 • SLCA: 1.1.2, 1.3.4 • ELCA: 1.1.2, 1.3.4, 1
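The SLCA definition above can be checked against the slide's own example with a short sketch. It assumes Dewey-style labels, where one node is an ancestor of another when its label is a proper dotted prefix. The helper names are illustrative. ELCA is not computed here because it depends on the keyword occurrences, not only on the LCA set.

```python
def is_ancestor(a, b):
    """True if Dewey label a is a proper ancestor of Dewey label b."""
    return b.startswith(a + ".")

def slca(lcas):
    """Keep only the LCAs that have no descendant in the same set."""
    return [n for n in lcas if not any(is_ancestor(n, m) for m in lcas)]

lcas = ["1.1", "1.1.2", "1", "1.3.4", "1.3"]
print(slca(lcas))  # ['1.1.2', '1.3.4']
```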
Cont. 2. Ranking Function
Cont. • d(·): a decreasing function
Join-based Algorithm 1. Node encoding
Join-based Algorithm 2. Algorithm • Two lists of nodes:
Cont. (2,3) join (1): no match
Cont. (3,5,6) join (1,2,4): no match
Cont. (2,3,4,5) join (1,2,4) => (2,4) matched. The nodes numbered 2 and 4 at level 3 are the lowest ELCAs => erased
Cont. (2,3) join (1): no match
Cont. (1,1) join (1): matched => the root is an ELCA. The number 1 corresponds to two nodes (1.2.3 and 1.3.5.6); output one of them
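The per-level joins in the walk-through can be reproduced with a plain sort-merge intersection. This is a sketch only: it assumes each column holds node numbers in sorted order, and it does not model the erasing of matched nodes or the choice of which source node to output. The function name `merge_join` is hypothetical.

```python
def merge_join(left, right):
    """Intersect two sorted lists of node numbers in one linear pass."""
    i = j = 0
    matched = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matched.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matched

print(merge_join([2, 3], [1]))              # []
print(merge_join([3, 5, 6], [1, 2, 4]))     # []
print(merge_join([2, 3, 4, 5], [1, 2, 4]))  # [2, 4]
print(merge_join([1, 1], [1]))              # [1]
```

Because both inputs are scanned once, the cost of each join is linear in the combined column length.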
Cont. Score(1.3.4.5.3.1.1) is greater than Score(1.3.5.6). But in the 4th column, 0.5*d(3) may be greater than or equal to 0.44
Cont. Assume d( ): join columns 5 and 4: no result
Cont. Column 3: number 2 is matched. Its score is 0.73+0.41=1.14. The threshold for the unseen results in column 3 is max{0.7+0.3, 0.5+0.4}=1
Cont. Consider the unseen results in the other columns: columns 1 and 2 do not contain sequence s, so they are ignored. Consider column 2: the maximum scores are 0.7*0.9 and 0.5*0.9, so the threshold is 0.63+0.45=1.08<1.14. Therefore, node 2 at level 3 can be output.
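The output decision above comes down to comparing the score of a seen result with upper bounds on the unseen ones. The sketch below only replays the arithmetic from the slides. The function name `can_output` and the variable names are hypothetical, and how the bounds are derived in general is not shown.

```python
def can_output(seen_score, bounds):
    """A seen result is safe to emit once it beats every bound on unseen results."""
    return all(seen_score > b for b in bounds)

seen_score = 0.73 + 0.41                  # node 2 at level 3
col3_bound = max(0.7 + 0.3, 0.5 + 0.4)    # unseen results in column 3
col2_bound = 0.7 * 0.9 + 0.5 * 0.9        # unseen results in column 2

# Rounded for display, since floating-point sums are not exact.
print(round(seen_score, 2), round(col3_bound, 2), round(col2_bound, 2))
print(can_output(seen_score, [col3_bound, col2_bound]))  # True
```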
Conclusions • 1. The join-based algorithm performs well when keywords have high frequency. • 2. The join-based top-k algorithm performs well when keywords are highly correlated.