XML data management and approximate string matching

Presentation at Telecom ParisTech • Jiaheng Lu • Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China • November 22, 2010

Research experience
• Associate Professor, Renmin University of China: XML data management, cloud data management, approximate search
• Post-doc, University of California, Irvine: data integration, approximate string matching
• PhD, National University of Singapore: XML data management

Outline XML data management • XML twig query processing • XML keyword search • Graphical and interactive XML query processing Approximate string matching • Approximate string search • Approximate member extraction

XML twig query processing • XPath: Section[Title]/Paragraph//Figure • Twig pattern (figure): Section has a child Title and a child Paragraph; Paragraph has a descendant Figure

XML twig query processing (Cont.)
• Problem statement: given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
• Example (figure): the query is a twig rooted at Section with branches title and figure; the document contains the elements s1, t1, s2, t2, p1, and f1.
• Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

Previous work: TwigStack • TwigStack [1] is a holistic algorithm for XML twig matching based on the containment labeling scheme. • TwigStack works in two steps: • (1) intermediate path solutions are output for each root-to-leaf path of the query; and • (2) these intermediate path solutions are merged to produce the final results. [1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Running example: TwigStack algorithm
• Query: twig with root s and branches s//t and s//f
• Data streams: s: (1,12,1), (4,11,2) • t: (2,3,2), (5,6,3) • f: (8,9,4)
• Intermediate path solutions for s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
• Intermediate path solutions for s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)
• Final results: (1,12,1) (2,3,2) (8,9,4); (1,12,1) (5,6,3) (8,9,4); (4,11,2) (5,6,3) (8,9,4)
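The containment test behind this example can be sketched in a few lines. This is only an illustration, not TwigStack itself: it uses naive nested loops instead of stacks, and it assumes each triple is (start, end, level), which the slide does not state explicitly. The names `Region`, `is_ancestor`, `path_solutions`, and `merge_on_root` are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Containment label, assumed here to be (start, end, level)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    # a is an ancestor of d when a's interval strictly contains d's interval.
    return a.start < d.start and d.end < a.end


def path_solutions(ancestors, descendants):
    # Naive nested loop; TwigStack obtains the same pairs with stacks in one pass.
    return [(a, d) for a in ancestors for d in descendants if is_ancestor(a, d)]


def merge_on_root(left, right):
    # Join the two path-solution lists on their shared root element.
    return [(a, x, y) for (a, x) in left for (b, y) in right if a == b]


# Streams copied from the running example.
s = [Region(1, 12, 1), Region(4, 11, 2)]
t = [Region(2, 3, 2), Region(5, 6, 3)]
f = [Region(8, 9, 4)]

s_t = path_solutions(s, t)       # solutions for s//t
s_f = path_solutions(s, f)       # solutions for s//f
final = merge_on_root(s_t, s_f)  # twig answers (s, t, f)
```

With these streams the sketch yields three s//t pairs, two s//f pairs, and three final answers, matching the lists on the slide.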

Limitations of TwigStack • (1) TwigStack may output many useless intermediate results for queries with parent-child relationships • (2) TwigStack cannot process XML twig queries with ordered predicates, such as the "preceding" and "following" axes in XPath • (3) TwigStack cannot answer queries with wildcards in branching nodes. E.g., for a wildcard (*) branching node above B and C, the parent of B should be an ancestor of C

XML twig query processing (Cont.) • Several efficient pattern matching algorithms • TJFast (VLDB 05)(citation: 173) • iTwigJoin (SIGMOD 05) • TwigStackList (CIKM 04) • TreeMatch (TKDE 10)

Motivation: new labeling scheme • TwigStackList and iTwigJoin are both based on the containment labeling scheme • Why not try the Dewey labeling scheme for XML twig pattern queries? • Oh, it is really a novel idea!

Original Dewey Labeling Scheme
• In the Dewey labeling scheme, each element is represented by an integer sequence:
• (i) the root is labeled by the empty string ε
• (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
• Example (figure): s1 = ε; its children are t1 = 1, s2 = 2, f2 = 3; the children of s2 are t2 = 2.1 and f1 = 2.2
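The two labeling rules can be sketched as a short recursive traversal. The tree below is an assumed reading of the slide's figure (s1 with children t1, s2, f2, and s2 with children t2, f1); the tuple-based tree encoding and the function names are hypothetical.

```python
def dewey_labels(tree, label=()):
    """Yield (element name, Dewey label) pairs in document order.

    A tree is a (name, children) pair; a label is a tuple of integers,
    with the empty tuple standing for the root label ε.
    """
    name, children = tree
    yield name, label
    for position, child in enumerate(children, start=1):
        # Rule (ii): the x-th child of s gets label(s).x
        yield from dewey_labels(child, label + (position,))


def show(label):
    # Render a label tuple in the dotted notation used on the slide.
    return ".".join(map(str, label)) or "ε"


doc = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = {name: show(label) for name, label in dewey_labels(doc)}
```

Because a parent's label is a prefix of each child's label, ancestor tests reduce to prefix tests, which is the property the later slides build on.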

Main problem of the original Dewey • If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, it is no better than previous algorithms. • Solution: extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path of e from this label alone

Modular function
• We need to know some schema information: a DTD (Document Type Definition) or an XML schema
• Given DTD information: book → author, title, chapter*
• Our solution: using a modular function, we create a mapping between an element tag and an integer.
• We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last integer of the label of an element with tag t. The modulus 3 is the number of distinct tags under book.
• Example (figure): book = ε; its children are author = 0, title = 1, chapter = 2, and a second chapter = 5 (why not 3, as in the original Dewey?)

From a label, we can derive its tag name.
• DTD: book → author, title, chapter*
• Recall that we define: X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
• Exercise (figure): derive the element tags of the children of book (ε) labeled 0, 1, 2, and 5.
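The exercise can be worked through with a small lookup. This is a sketch under the slide's definitions only; the table `CHILD_TAGS` and the function `child_tag` are hypothetical names, and the list order encodes the remainders 0, 1, 2 assigned to author, title, and chapter.

```python
# Distinct child tags per parent tag, in the order of their assigned remainders.
CHILD_TAGS = {"book": ["author", "title", "chapter"]}


def child_tag(parent_tag, last_component):
    """Recover a child's tag from the last integer of its label."""
    tags = CHILD_TAGS[parent_tag]
    return tags[last_component % len(tags)]


derived = [child_tag("book", x) for x in (0, 1, 2, 5)]
```

For the four labels in the figure this gives author, title, chapter, and chapter, since 5 mod 3 = 2.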

More examples for assigning labels
• Let us consider a more complicated DTD: a → (b | c)*, d?, c+
• We define: X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2 (Why do we use mod 3 instead of 4?)
• Example (figure): a = ε; its children carry the labels b = 0, d = 2, c = 4, c = 7
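One way to reproduce the label components in these examples is to give each child the smallest integer that is larger than its left sibling's component and has the remainder assigned to its tag. This rule is an assumption that happens to match the figures on the slides; the exact definition in the original paper may be stated differently. The function name and the child order b, d, c, c are also assumptions.

```python
def assign_components(dtd_child_tags, children):
    """Assign the last label component to each child, left to right.

    dtd_child_tags: distinct child tags of the parent; index = remainder.
    children: tags of the actual children in document order.
    """
    n = len(dtd_child_tags)
    components = []
    previous = -1  # so that the first child may receive 0
    for tag in children:
        remainder = dtd_child_tags.index(tag)
        x = previous + 1
        while x % n != remainder:  # smallest x > previous with x mod n == remainder
            x += 1
        components.append(x)
        previous = x
    return components


under_a = assign_components(["b", "c", "d"], ["b", "d", "c", "c"])
under_book = assign_components(
    ["author", "title", "chapter"], ["author", "title", "chapter", "chapter"]
)
```

Components grow from left to right, so document order is still readable from the labels, while the remainder identifies the tag.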

Derive the path from a label
• By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
• DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)*
• FST (figure): from book, mod 3 = 0 leads to author, mod 3 = 1 to title, mod 3 = 2 to chapter; from chapter, mod 2 = 0 leads to paragraph and mod 2 = 1 to section; from section, mod 2 = 0 leads to paragraph and mod 2 = 1 to section
• Question: given the label 5.1.0, what is the corresponding path?
• Answer: following the FST, 5.1.0 denotes book/chapter/section/paragraph
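The derivation of 5.1.0 can be traced with a table-driven sketch of the transducer. The dictionary `FST` and the function `derive_path` are hypothetical names; the transitions are taken from the DTD on the slide, with each list ordered by remainder.

```python
# State = current tag; the child tag is chosen by (label component mod number of child tags).
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book"):
    """Derive the tag path of an element from its extended Dewey label.

    The label is a dotted string such as "5.1.0"; "" stands for the root (ε).
    A KeyError signals a label that walks below a tag with no children in the DTD.
    """
    path = [root]
    if label:
        for part in label.split("."):
            tags = FST[path[-1]]
            path.append(tags[int(part) % len(tags)])
    return "/".join(path)


path = derive_path("5.1.0")
```

Each component is consumed once, so the cost of deriving a path is linear in the depth of the element.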

Two properties of extended Dewey • Find Ancestor Label • From the label of any element, we can derive the labels of all its ancestors. • Find Ancestor Name • From the label of any element, we can derive the tag names of all its ancestors. • These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
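The first property follows directly from the prefix structure of Dewey labels, as this small sketch shows. Labels are represented as integer tuples, with the empty tuple for ε; the function names are hypothetical.

```python
def ancestor_labels(label):
    """Return the labels of all ancestors, from the root downwards.

    Every proper prefix of a Dewey label is the label of an ancestor.
    """
    return [label[:i] for i in range(len(label))]


def is_ancestor(a, d):
    # a is an ancestor of d when a is a proper prefix of d.
    return len(a) < len(d) and d[:len(a)] == a


ancestors = ancestor_labels((5, 1, 0))
```

The second property, deriving ancestor tag names, is what the finite state transducer provides: the tag of each prefix is the state reached after consuming that prefix.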

A new algorithm: TJFast • For each node n in the query, there is a corresponding input stream Tn. • Tn contains the extended Dewey labels of the elements with tag n, arranged in document order. • For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may participate in query answers. (How does this differ from TwigStackList?) • At any point during the computation, the size of the set Sb is bounded by the depth of the XML document.

An example for TJFast algorithm
• DTD: a → a*, d*, b*; b → d*, c*; d → c*
• Query (figure): twig rooted at A with the paths A//D and A/B//C. There is a set for the branching node A, initially { }.
• Document (figure), with extended Dewey labels: a1 = 0; a2 = 0.0, a3 = 0.3, b2 = 0.5; d1 = 0.0.1; d2 = 0.3.1, b1 = 0.3.2; c1 = 0.3.2.1; d3 = 0.5.0; c2 = 0.5.0.0
• Streams: TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0 (Why are there only two streams?)
• Step 1: by the finite state transducer of the extended Dewey labeling scheme, derive 0.0.1 → a1/a2/d1 and 0.3.2.1 → a1/a3/b1/c1.
• Step 2: both a1 and a3 may participate in query answers. (Why not a2?) Insert a1 and a3 into the set, which becomes {a1, a3}. Output path solutions: A//D: (a1, d1); A/B//C: (a3, b1, c1).
• Step 3: move the cursor of TD from d1 to d2. Output for A//D: (a1, d2), (a3, d2).
• Step 4: move the cursor of TD from d2 to d3. Output for A//D: (a1, d3).
• Step 5: move the cursor of TC from c1 to c2. Output for A/B//C: (a1, b2, c2).
• Path solutions at the end: A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3); A/B//C: (a3, b1, c1), (a1, b2, c2)

Sort and merge-join in TJFast
• Phase 1. Intermediate paths: A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2> • A/B//C: <a1, b2, c2>, <a3, b1, c1>
• Phase 2. Final solutions <A, D, B, C>, obtained by joining on A: <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
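The second phase can be illustrated with a plain sort-merge join on the branching node A. This is a generic textbook join, not necessarily the exact procedure used in TJFast; element names are compared as strings here, whereas a real implementation would compare labels in document order. The function name `merge_join` is hypothetical.

```python
def merge_join(left, right):
    """Sort-merge join of two lists of path solutions on their first element."""
    left, right = sorted(left), sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Combine every pair in the two runs, dropping the duplicate key.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append(l + r[1:])
            i, j = i_end, j_end
    return out


a_d = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
a_b_c = [("a1", "b2", "c2"), ("a3", "b1", "c1")]
solutions = merge_join(a_d, a_b_c)  # tuples in <A, D, B, C> order
```

On the intermediate paths from the slide, the join returns the same four final solutions.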

TJFast+L • By applying the extended Dewey labeling scheme to the tag+level streaming scheme, we extend TJFast into the TJFast+L algorithm • Two benefits of TJFast+L over TJFast • reduces I/O cost by reading fewer elements • enlarges the optimal query classes

Optimal query classes (figure) • Query classes shown: only P-C in all edges; only A-D in branching edges • Regions shown: optimal class of TJFast; optimal class of TJFast+L • The example twigs use the nodes A, B, C, and D

XML twig query processing • Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542 • Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204 • Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189 • Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119 • Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309 • Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178 • Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263 • Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298 • Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466 • ……

Outline XML data management • XML twig query processing • XML keyword search • Graphical and interactive XML query processing

Background: XQuery vs. keyword search
• Task: query papers by "Mike"
• XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings
• Keyword search: Mike, inproceedings

The proposed keyword search returns the set of smallest trees containing all keywords.
• Example (figure): keywords Mike, hobby, Paper over a bib tree with two author elements. Each author has name, publications, and hobby children (names "Mike ward" and "John Hopking"; hobbies "Paper folding" and "Read book"); the publications contain inproceedings and article entries, each with a title and a year.

XML keyword search • Search intention identification • Query result retrieval • Result ranking • Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data • Detailed paper: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 (one of the best papers, invited to the TKDE journal)

XML keyword search • Inspired by IR-style keyword search on the web • Enables users to access information in XML databases • XML data is modeled as a rooted, labeled tree • Recent research efforts • Efficiency • Effectiveness

Effectiveness
• Capture the user's search intention • Identify the target that the user intends to search for • Infer the predicate constraints that the user intends to search via
• Result ranking • Rank the query results according to their relevance to the user's search intention

State of the Art
Search semantics design
• LCA (Lowest Common Ancestor): node v is an LCA of keyword set K = {w1, w2, …, wk} if the sub-tree rooted at v contains at least one occurrence of every keyword in K, after excluding the sub-elements that already contain all keywords in K
• SLCA (Smallest LCA): node v is an SLCA of keyword set K = {w1, w2, …, wk} if (1) v is an LCA of K, and (2) no proper descendant of v is an LCA of K
• XSeek: infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
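The SLCA semantics can be made concrete with a brute-force sketch over Dewey labels. It is meant only to illustrate the definition, not any of the efficient algorithms cited in these slides. It relies on the observation that the SLCAs are exactly the deepest nodes whose sub-trees contain every keyword. The labels in the example are made up, and the function names are hypothetical.

```python
def prefixes(label):
    # The node itself plus all its ancestors, as Dewey label tuples.
    return {label[:i] for i in range(len(label) + 1)}


def slca(matches):
    """Compute SLCA nodes from keyword matches.

    matches: dict mapping each keyword to the Dewey labels (tuples)
    of the nodes that contain it.
    """
    containing = None
    for labels in matches.values():
        # Nodes whose sub-tree contains this keyword.
        covered = set()
        for label in labels:
            covered |= prefixes(label)
        containing = covered if containing is None else containing & covered
    containing = containing or set()
    # Keep only nodes with no proper descendant that also contains all keywords.
    return sorted(
        v for v in containing
        if not any(u != v and u[:len(v)] == v for u in containing)
    )


# Hypothetical matches: "Mike" under the first author, "hobby" under both authors.
result = slca({"Mike": [(1, 1)], "hobby": [(1, 3), (2, 3)]})
```

Here both the root and the first author (label 1) contain the two keywords, and only the first author is returned because it is the smaller sub-tree.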

State of the Art (cont.)
• Efficient result retrieval • Designed based on a certain search semantics • XKSearch, Multiway SLCA, etc.
• Result ranking • XRANK, XKSearch, EASE • They only consider • Structural compactness of matching results • Keyword proximity • Similarity at node level

Problems Unaddressed
• Neither SLCA nor XSeek addresses keyword ambiguity well, so the user's search intention is not addressed adequately!
• Meaningfulness of query results • SLCA is less meaningful in many cases
• Keyword ambiguity problems • A keyword can appear both as an XML node type and as the text value of some other nodes • A keyword can appear in the text values of different XML node types and carry different meanings

Problems: Keyword Ambiguity
• Q = "customer, interest, art"
• Ambiguity 1: customer, interest; Ambiguity 2: art
• Intention: find customers whose interest is art
• Less relevant or irrelevant results are returned as well: C1, C3, and B1's title
• (Figure: a storeDB tree with customers and books sub-trees. Customers C1 to C4 have children such as ID, name, contact/address, interests, and purchases; books B1 and B2 have ID, title, authors, and publisher. Values containing "art" include the name "Art Smith", the street "Art Street", the interests "art" and "street art", and the title "Art of Customer Interest Care".)

Problems: Keyword Ambiguity (cont.)
• Q = "customer, interest, art" (same storeDB figure as on the previous slide)
• "art" can be the value of an interest node (C2, C4), a name node (C3), the street node of a customer (C1), or the title node of a book (B1)
• "customer" can be the tag name of a customer node, or (part of) the value of a title (B1)
• How to rank C1 to C4 and B1?

Objectives & Challenges
• Address the following as a single problem • Search intention identification • Query result retrieval • Result ranking • Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
• Challenges • How to decide which sub-tree(s) with appropriate node types capture the information the user desires • How to return sub-trees of an appropriate size (i.e., containing enough but not overwhelming information) • How to rank those sub-trees by their relevance

Challenges: difficulty in applying TF*IDF to XML • An XML DB carries semantic information while a text DB contains pure text. XML TF*IDF must be aware of the underlying semantics. • All contents of XML data are stored in leaf nodes only • What is the analogy of a "flat document" in XML? • Sub-trees are classified according to their prefix paths • The normalization factor is not simply the size of the sub-tree • The structure of sub-trees may also affect the ranks

Our Approach
• Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents • by analyzing the statistics of the underlying XML data
• Major contributions • Identify the user's desired search-for node and search-via node(s) in a heuristic way • Define XML TF (term frequency) and XML DF (document frequency) • Confidence formulas for search-for and search-via candidates • Define XML TF*IDF similarity • Propose 3 guidelines specifically for XML keyword search • Take keyword ambiguity problems into account • Design a keyword search engine, XReal

XML keyword search • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934 • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109 • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010) • Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754 • Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 • Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537 • Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716 • ……

Outline XML data management • XML twig query processing • XML keyword search • Graphical and interactive XML query processing

Graphical and interactive XML search • Auto-completion XML search • Order-sensitive XML twig query • XML query suggestion • Demo online: http://datasearch.ruc.edu.cn:8080/LotusX/

Outline XML data management • XML twig query processing • XML keyword search • XML Keyword refinement • Graphical and interactive XML query processing Approximate string matching • Approximate string search • Approximate member extraction

Motivation: Data Cleaning
• Real-world data is dirty • Typos • Inconsistent representations (PO Box vs. P.O. Box)
• Approximately check against a clean dictionary
• Example (figure): a misspelled name that should clearly be "Niels Bohr". Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Motivation: Record Linkage
• We want to link records belonging to the same entity
• No exact match! The same entity may have similar representations
• Arnold Schwarzeneger versus Arnold Schwarzenegger
• Forrest Whittaker versus Forest Whittacker
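How close such name pairs are can be quantified with edit distance (Levenshtein distance), which is one common similarity measure for approximate string matching. The slides at this point do not say which measure is used, so this choice is an assumption. The sketch below is the standard dynamic program; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


d1 = edit_distance("Arnold Schwarzeneger", "Arnold Schwarzenegger")  # 1
d2 = edit_distance("Forrest Whittaker", "Forest Whittacker")         # 2
```

Both pairs are within two edits of each other, which is why a small distance threshold can link them even though no exact match exists.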
