This paper presents an innovative approach to improve keyword search in XML databases by addressing issues related to answer completeness and meaningfulness. We propose the concept of Valuable Lowest Common Ancestors (VLCA) and a novel Meaningful Dewey Code (MDC) for efficient data retrieval. Utilizing a stack-based algorithm, we demonstrate enhanced search efficiency and accuracy in retrieving relevant data. Through experimental studies, we establish the effectiveness of our proposed solutions over existing methodologies, thereby contributing to the advancement of XML database systems.
Effective Keyword Search for Valuable LCAs over XML Documents • Guoliang Li, Jianhua Feng, Jianyong Wang, Lizhu Zhou • Lin Shao • XML and Database Systems
Content • Introduction • Background and Motivation • Valuable LCA • Meaningful Dewey Code (MDC) • The Stack-Based Algorithm • Experimental Study • Conclusion
Introduction • Existing proposals for keyword search over XML databases suffer from two problems • Meaningfulness and completeness of the answers, and the scope of the search • The answer to a keyword search should not be limited to the LCAs of the keywords
Introduction • To solve these problems • Valuable LCA (VLCA) • Compact VLCA (CVLCA) • An efficient stack-based algorithm
Background and Motivation • Notations • u ≺ v: u is an ancestor of node v • u < v: u precedes v in the XML document • u ⪯ v: u ≺ v or u = v • For example • conf(2) ≺ paper(15) • paper(15) ⪯ author(17) • title(6) < author(17)
Background and Motivation • Example: false-positive problem of LCA • Search for: {“IR”, “Tom”} • False answer: conf(2) • Solutions • Meaningful LCA (MLCA) • Smallest LCA (SLCA) • XRank
Background and Motivation • Example: false-negative problem of SLCA • Search for: {“XML”, “Bob”} • paper(5) will not be in the SLCA set
Background and Motivation • Example: false-positive problem of SLCA • Search for: {“XML”, “John”} • False answer: conf(2)
Valuable LCA • Based on the homogenous / heterogenous concept • Given two nodes u and v with w = LCA(u, v), uSet and vSet are the sets of nodes on the paths w → u and w → v, respectively • If u and v have the same elementary type, they are homogenous (denoted u ~ v)
Valuable LCA • Avoids the false positives and false negatives introduced by SLCA • Definition: Given m nodes n_1, n_2, …, n_m with v = LCA(n_1, n_2, …, n_m), VLCA(n_1, n_2, …, n_m) = v iff these m nodes are pairwise homogenous, that is, ∀ 1 ≤ i < j ≤ m: n_i ~ n_j.
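The VLCA condition is a pairwise test over the m nodes. A minimal sketch of that quantifier, assuming the homogeneity relation is supplied as a predicate; the names `all_homogenous` and `homogenous` are hypothetical and not taken from the slides:

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

N = TypeVar("N")

def all_homogenous(nodes: Sequence[N], homogenous: Callable[[N, N], bool]) -> bool:
    """True iff n_i ~ n_j holds for every pair with 1 <= i < j <= m."""
    # combinations() yields each unordered pair exactly once.
    return all(homogenous(a, b) for a, b in combinations(nodes, 2))

# Toy relation for illustration only (an assumption, not data from the slides):
# the listed pairs are declared homogenous, every other pair is not.
TOY_PAIRS = {frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c"})}

def toy_homogenous(x: str, y: str) -> bool:
    return frozenset({x, y}) in TOY_PAIRS
```

With this toy relation, `all_homogenous(["a", "b", "c"], toy_homogenous)` holds, while adding a node `"d"` that is related to nothing makes the check fail, so the LCA of that group would not count as a VLCA.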
Valuable LCA • Example: heterogenous / homogenous • Search for: {“XML”, “John”} • conf(2): heterogenous • paper(23): homogenous
Meaningful Dewey Code (MDC) • Novel numbering scheme • Inspired by Dewey code • Numbers (encodes) the nodes based on the corresponding DTD • Ancestors and elementary types can be deduced from the code
Meaningful Dewey Code (MDC)
<!ELEMENT bib (conf)*>
<!ELEMENT conf (name,year,paper*,chair)>
<!ELEMENT paper (title,author+,bib?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT chair (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
Meaningful Dewey Code (MDC) • ε: root element • C_n: MDC of node n • O_n: ordinal number of node n • Encoding of a node, e.g. author(0.2.1)
Meaningful Dewey Code (MDC) • k: node n carries the k-th label • m: number of children in the DTD of parent(n)
Meaningful Dewey Code (MDC) • MDC example • Given MDC = 0.6.1 • Level 0 (root) = bib, m = 1 • Level 1 = conf, m = 4 • Level 2 = paper, m = 3
<!ELEMENT bib (conf)*>
<!ELEMENT conf (name,year,paper*,chair)>
<!ELEMENT paper (title,author+,bib?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT chair (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
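The example can be traced in code. This is a sketch under one assumption that the slides do not state explicitly: a code component c under a parent with m child labels carries the label at position c mod m in the parent's DTD child list. That rule is consistent with every code shown on the slides, such as author(0.2.1) and author(0.6.4). `DTD_CHILDREN` and `decode_labels` are hypothetical names.

```python
# Child labels per element, in DTD declaration order (copied from the DTD above).
DTD_CHILDREN = {
    "bib": ["conf"],
    "conf": ["name", "year", "paper", "chair"],
    "paper": ["title", "author", "bib"],
}

def decode_labels(mdc: str, root: str = "bib") -> list[str]:
    """Return the label of each node on the path encoded by a dotted MDC.

    Assumption: component c under a parent with m child labels
    carries the label at index c % m of the parent's child list.
    """
    labels = []
    parent = root
    for component in mdc.split("."):
        children = DTD_CHILDREN[parent]        # m = len(children)
        label = children[int(component) % len(children)]
        labels.append(label)
        parent = label                         # descend one level
    return labels

print(decode_labels("0.6.1"))  # ['conf', 'paper', 'author']
```

For 0.6.1 the steps are 0 mod 1 = 0 (conf), 6 mod 4 = 2 (paper), and 1 mod 3 = 1 (author), matching the m values listed per level.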
Meaningful Dewey Code (MDC) • Checking whether nodes are homogenous or heterogenous • Proof. If u and v have the same elementary type, then λ(u) = λ(v) and |{λ(u)} ∩ {λ(v)}| = 1 • Heterogenous: |wSet| − |{λ(u)} ∩ {λ(v)}| > |lSet|, where wSet = uSet ∪ vSet and lSet = {λ(x) | x ∈ wSet} • Check u(0.2.0) and v(0.6.4) • wSet = {conf(0), paper(0.2), title(0.2.0), paper(0.6), author(0.6.4)} • |wSet| = 5, |lSet| = 4, and |{λ(u)} ∩ {λ(v)}| = 0, so 5 − 0 > 4 and u and v are heterogenous
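The counting test can be replayed on the slide's own example. The sketch below assumes the modulo label rule (component c under a parent with m child labels has label index c mod m) and, following the wSet listed on the slide, includes the LCA node itself in wSet. All function names are hypothetical.

```python
DTD_CHILDREN = {
    "bib": ["conf"],
    "conf": ["name", "year", "paper", "chair"],
    "paper": ["title", "author", "bib"],
}

def parse(mdc: str) -> tuple[int, ...]:
    return tuple(int(part) for part in mdc.split("."))

def label_of(code: tuple[int, ...], root: str = "bib") -> str:
    """Label of the node with this code (assumed rule: index = c % m)."""
    label = root
    for c in code:
        children = DTD_CHILDREN[label]
        label = children[c % len(children)]
    return label

def is_heterogenous(u: str, v: str) -> bool:
    cu, cv = parse(u), parse(v)
    # Length of the common prefix, i.e. the code of w = LCA(u, v).
    k = 0
    while k < min(len(cu), len(cv)) and cu[k] == cv[k]:
        k += 1
    # wSet: w plus every node on the paths w -> u and w -> v.
    w_set = {cu[:i] for i in range(k, len(cu) + 1)}
    w_set |= {cv[:i] for i in range(k, len(cv) + 1)}
    l_set = {label_of(code) for code in w_set}
    shared = 1 if label_of(cu) == label_of(cv) else 0
    return len(w_set) - shared > len(l_set)

print(is_heterogenous("0.2.0", "0.6.4"))  # True: 5 - 0 > 4
print(is_heterogenous("1.2.0", "1.2.1"))  # False: 3 - 0 > 3 does not hold
```

Two authors of the same paper, such as 0.6.1 and 0.6.4, also come out homogenous: wSet has 3 nodes, lSet has 2 labels, and the shared label subtracts 1, so 2 > 2 fails.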
The Stack-Based Algorithm • VLCAStack improves the search efficiency • Stack-based algorithms exist for structural joins and twig joins • Differs from the existing studies (computes CVLCAs)
The Stack-Based Algorithm • Compact VLCA (CVLCA) • Is more compact than VLCA • Answer is more meaningful • Connected subtree rooted at the CVLCA • Idea behind the compact connected tree • Since node v is in a compact connected tree, it will not be in another, looser one that contains other irrelevant nodes
The Stack-Based Algorithm • Compact VLCA vs. SLCA • Example: false-negative problem of SLCA • Search for: {“XML”, “Bob”} • SLCA set = {paper(12)} • CVLCA set = {paper(5), paper(12)}
The Stack-Based Algorithm • VLCAStack • Input elements are sorted by their MDCs • VLCAStack maintains another stack to preserve the current LCAs
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA is empty • nMin = 0.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0.2.0 • nMin = 0.6.4
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0.6.4 • nMin = 1.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0 • nMin = 1.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 1.2.0 • nMin = 1.2.1
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 1.2 • nMin is empty
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • Answer of the keyword query = {(paper(1.2);title:XML(1.2.0);author:John(1.2.1))}
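The trace ends with paper(1.2) as the common ancestor of 1.2.0 and 1.2.1. For Dewey-style codes, the LCA of a set of nodes is their longest common prefix. The sketch below shows only that single step, not the VLCAStack algorithm itself; the function name `lca` is hypothetical.

```python
def lca(*mdcs: str) -> str:
    """Longest common prefix of dotted codes; "" stands for the root ε."""
    parts = [code.split(".") for code in mdcs]
    prefix = []
    # zip() stops at the shortest code, so the prefix never overruns it.
    for column in zip(*parts):
        if len(set(column)) != 1:   # components differ at this level
            break
        prefix.append(column[0])
    return ".".join(prefix)

print(lca("1.2.0", "1.2.1"))  # 1.2
print(lca("0.2.0", "0.6.4"))  # 0
```

Because the input elements arrive sorted by MDC, a stack-based method can compare each new code with the one on top of the stack and share this prefix work across neighbours.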
Experimental Study • Efficiency and Effectiveness Test • Datasets • Real Dataset: DBLP, SIGMOD Record, TreeBank • Synthetic Dataset: XMark • Tested Methods • Brute-Force • XSEarch • SLCA • GDMCT
Experimental Study • Efficiency
Experimental Study • Effectiveness • Precision • Recall • F-measure
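The three effectiveness measures can be computed from a returned answer set and a set of relevant answers. This sketch assumes the balanced F-measure (harmonic mean of precision and recall); the slides do not say which weighting was used. The function name and the toy ground truth are hypothetical and do not reproduce any experimental result.

```python
def precision_recall_f(returned: set, relevant: set) -> tuple[float, float, float]:
    """Precision, recall and balanced F-measure of a returned answer set."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy illustration: treat {paper(5), paper(12)} as the relevant answers
# (an assumption for this example) and score an SLCA-style answer set.
p, r, f = precision_recall_f({"paper(12)"}, {"paper(5)", "paper(12)"})
print(p, r)  # 1.0 0.5
```

In this toy case the missed answer paper(5) lowers recall but not precision, which is how a false negative shows up in these measures.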
Conclusion • Demonstrated the problems of keyword search over XML documents • Proposed VLCA and CVLCA to obtain meaningful results for keyword queries • Presented an optimization technique to compute CVLCAs and devised an efficient stack-based algorithm to identify meaningful compact connected trees