Abdeslame ALILAOUAR, Florence SEDES

Fuzzy Querying of XML Documents: The Minimum Spanning Tree
Abdeslame ALILAOUAR, Florence SEDES
IRIT - CNRS
IRIT: Research Institute for Computer Science of Toulouse (France)

Talk Outline
• The XML model
• The problem of querying XML documents
• Proposed techniques
• Our approach
• Implementation details
• Conclusion and future tasks

The XML Data Model
Document-centric vs. Data-centric
Document-centric
• Less regular or irregular structure
• The order of sibling elements is important
• Examples: emails, books, etc.
Data-centric
• More structured
• The order of sibling elements is often unimportant
• Examples: sales orders, configuration files, etc.

The XML Data Model (continued)
• Data are commonly modeled by a tree structure
• Nodes represent objects
• Edges represent relationships between objects
• Atomic values are attached to leaf nodes

The XML Data Model (continued)
Variations in Structure
<?xml version="1.0" encoding="UTF-8"?>
<cotglist>
  <cottage identifier="23">
    <character>
      <room> <nbeds>4</nbeds> </room>
      <room> <nbeds>2</nbeds> </room>
    </character>
    <price>1700</price>
  </cottage>
  <cottage identifier="40">
    <character>
      <nbeds>4</nbeds>
      <price>
        <winter>1100</winter>
        <summer>1300</summer>
      </price>
    </character>
  </cottage>
</cotglist>
[Figure: tree representation of the document. A cotglist root has two cottage nodes (identifiers "23" and "40"); their character, room, nbeds and price nodes are arranged differently in each cottage, with leaf values 4, 2, 1700 and 4, 1100 (winter), 1300 (summer).]
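The structural variation in the cottage list can be checked with a few lines of Python. This is a minimal sketch using the standard `xml.etree.ElementTree` module; the helper `paths_to` is a hypothetical name introduced for illustration and is not part of the presented system. It shows that a rigid path query for the number of beds only reaches the first cottage, although both cottages carry that information.

```python
import xml.etree.ElementTree as ET

# The cottage list from the slide, with the markup repaired.
DOC = """<cotglist>
  <cottage identifier="23">
    <character>
      <room><nbeds>4</nbeds></room>
      <room><nbeds>2</nbeds></room>
    </character>
    <price>1700</price>
  </cottage>
  <cottage identifier="40">
    <character>
      <nbeds>4</nbeds>
      <price>
        <winter>1100</winter>
        <summer>1300</summer>
      </price>
    </character>
  </cottage>
</cotglist>"""


def paths_to(root, tag):
    """Return the label path from the root to every element named `tag`."""
    found = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        if node.tag == tag:
            found.append("/".join(path))
        for child in node:
            walk(child, path)

    walk(root, [])
    return found


root = ET.fromstring(DOC)

# A rigid structural query: it only matches the layout used by cottage "23".
rigid = root.findall("./cottage/character/room/nbeds")
print(len(rigid))

# The same information sits under different paths in the two cottages.
print(paths_to(root, "nbeds"))
print(paths_to(root, "price"))
```

The rigid path misses cottage "40" entirely, which is the kind of empty or incomplete answer that motivates flexible querying.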

The Problem of Querying XML Documents
[Diagram: a query (content + structure) is matched against an XML document (content + structure) through content matching (information retrieval) and structure matching; the document structure is unknown and irregular.]
Irregular structure
• Data has structural variations: relationships between objects are represented differently in different parts of the documents
• Data has ontology variations: different labels are used to describe objects of the same type (e.g. house, cottage)
Result
In most cases, queries return an empty or incomplete set of answers

The Problem of Querying XML Documents (continued)
Solution
• Queries should deal with different data structures
• Queries should not be rigid (structural) patterns
• Queries should be handled flexibly, in order to find not only the answers that match exactly, but also those with a similar structure and/or content

Proposed Techniques
• Query relaxation (S. Amer-Yahia, AT&T, 2002)
• Data relaxation (Damiani & Tanca, 2000)
• Tree-edit distance (D. Shasha, K. Zhang, 1989)
• Correlation (A. Tversky, 1977)

Our approach: the minimum spanning tree
The minimum spanning tree (MST) is an optimization problem
• Input: a weighted graph
• Output: the cheapest subset of edges that keeps the graph in one connected component

Proposed algorithms
Kruskal's algorithm (1956)
• It maintains a set of partial minimum spanning trees and repeatedly adds the shortest edge in the graph whose vertices are in different partial minimum spanning trees.
Prim's algorithm (1957)
• It computes a minimum spanning tree by beginning with any vertex as the current tree. At each step, it adds a least-weight edge between a vertex in the tree and a vertex not in the tree, and continues until all vertices have been added.
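The two algorithms can be sketched in Python as follows. This is a minimal illustration on a small made-up graph (the vertex numbers and weights in `EDGES` are assumptions chosen for the example, not data from the talk). Edges are `(weight, u, v)` tuples over vertices `0..n-1`, and the graph is assumed to be connected.

```python
import heapq


def kruskal(n, edges):
    """Scan edges by increasing weight; keep those joining two different partial trees."""
    parent = list(range(n))  # union-find forest: one partial tree per vertex at first

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # endpoints lie in different partial trees
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def prim(n, edges, start=0):
    """Grow a single tree from `start`, always adding the lightest edge leaving it."""
    adj = {i: [] for i in range(n)}
    for w, u, v in edges:
        adj[u].append((w, u, v))
        adj[v].append((w, v, u))

    in_tree = {start}
    frontier = list(adj[start])
    heapq.heapify(frontier)
    tree = []
    while frontier and len(in_tree) < n:
        w, u, v = heapq.heappop(frontier)
        if v in in_tree:
            continue  # both endpoints already in the tree
        in_tree.add(v)
        tree.append((w, u, v))
        for edge in adj[v]:
            if edge[2] not in in_tree:
                heapq.heappush(frontier, edge)
    return tree


# Illustrative graph with 4 vertices.
EDGES = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
k = kruskal(4, EDGES)
p = prim(4, EDGES)
print(sum(w for w, _, _ in k), sum(w for w, _, _ in p))
```

Both functions return a spanning tree of the same total weight; they differ only in the order in which edges are considered, which is why Kruskal's bottom-up merging of partial trees is the one echoed by the approach described next.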

Querying XML documents with MST
• Replace the criteria by preferences with their importance levels
• The importance level determines the priority between the preferences
• The satisfaction degree of a preference is at least equal to its importance level
• Represent the queries by a weighted tree pattern
Example: [Figure: weighted tree pattern rooted at cottage, with children price (1400) and nbeds (4) and edge weights 0.6 and 0.8]
• Define a similarity function used to estimate the matching degree of the preferences
• The answer subtrees are built gradually: evaluation starts from the leaf nodes and the most important preferences, then moves up until the answer tree is constructed, in the manner of Kruskal's algorithm.
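The preference-based evaluation can be illustrated with a small Python sketch. It is not the authors' implementation, and several points are assumptions made only for the example: the numeric similarity `sim` (one minus the relative difference), the assignment of importance 0.8 to nbeds and 0.6 to price, the rule that a preference is satisfied when its similarity reaches its importance level, and the weighted average used to rank answers. The sketch also simplifies the construction: it evaluates leaf preferences anywhere inside a cottage subtree and does not reproduce the Kruskal-like assembly of answer trees.

```python
import xml.etree.ElementTree as ET

DOC = """<cotglist>
  <cottage identifier="23">
    <character>
      <room><nbeds>4</nbeds></room>
      <room><nbeds>2</nbeds></room>
    </character>
    <price>1700</price>
  </cottage>
  <cottage identifier="40">
    <character>
      <nbeds>4</nbeds>
      <price><winter>1100</winter><summer>1300</summer></price>
    </character>
  </cottage>
</cotglist>"""

# Hypothetical weighted pattern: (tag, wanted value, importance level).
QUERY = [("nbeds", 4, 0.8), ("price", 1400, 0.6)]


def sim(wanted, found):
    """Assumed similarity in [0, 1]: 1 minus the relative difference."""
    if wanted == found:
        return 1.0
    return max(0.0, 1.0 - abs(wanted - found) / max(abs(wanted), abs(found)))


def best_match(cottage, tag, wanted):
    """Best similarity over numeric leaves found under any `tag` element, at any depth."""
    best = 0.0
    for element in cottage.iter(tag):
        for leaf in element.iter():
            try:
                value = float((leaf.text or "").strip())
            except ValueError:
                continue  # non-numeric or empty text
            best = max(best, sim(wanted, value))
    return best


def evaluate(cottage, query):
    """Check preferences by decreasing importance; return a score, or None if one fails."""
    total = weight_sum = 0.0
    for tag, wanted, importance in sorted(query, key=lambda p: -p[2]):
        degree = best_match(cottage, tag, wanted)
        if degree < importance:  # assumed satisfaction rule
            return None
        total += importance * degree
        weight_sum += importance
    return total / weight_sum


def answers(root, query):
    ranked = []
    for cottage in root.iter("cottage"):
        score = evaluate(cottage, query)
        if score is not None:
            ranked.append((round(score, 3), cottage.get("identifier")))
    return sorted(ranked, reverse=True)


result = answers(ET.fromstring(DOC), QUERY)
print(result)
```

Because `best_match` searches the whole cottage subtree, both cottages are returned even though their nbeds and price elements sit at different depths; only the ranking depends on how close the values are to the pattern.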

Example
[Figure: the weighted tree pattern (cottage with children nbeds = 4 and price = 1400, edge weights 0.6 and 0.8) matched against the cotglist document tree, whose cottages have identifiers "140" and "123". Similarity values shown on the figure: Sim(price, price) = 1; Sim(1300, 1400) = 0.9; Sim(1300, 1700) = 0.7; several matched nodes are labeled Sim = 1.0.]

Some Implementation Details
[Figure: The architecture of our querying system. Labeled components: Query, Query Processor, Answer list, XML document, XML collection, Index builder, Indexed collection, Tag Index, Attribute Index, Data Index, Term Index.]

Indexing method
• Dietz's method (1982): traversal order is used to determine the ancestor-descendant relationship
• Why Dietz's method?
  - A straightforward method
  - Efficiently determines the ancestors and descendants of any node: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
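The ancestor test stated above can be sketched in Python. This is a minimal illustration of the numbering idea, assuming an in-memory `xml.etree.ElementTree` tree; the function names `number_nodes` and `is_ancestor` are hypothetical and say nothing about how the actual index builder stores the ranks.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Assign each element its preorder and postorder rank in a single traversal."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre[node] = counters["pre"]  # rank on first visit
        counters["pre"] += 1
        for child in node:
            visit(child)
        post[node] = counters["post"]  # rank once all descendants are done
        counters["post"] += 1

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """x is an ancestor of y iff x comes before y in preorder and after y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]


root = ET.fromstring(
    "<cottage><character><room><nbeds>4</nbeds></room></character>"
    "<price>1700</price></cottage>"
)
pre, post = number_nodes(root)
character = root.find("character")
nbeds = root.find("./character/room/nbeds")
price = root.find("price")

print(is_ancestor(character, nbeds, pre, post))
print(is_ancestor(character, price, pre, post))
```

Once the two ranks are stored with each indexed node, the relationship between any pair of nodes is decided by two integer comparisons, without walking the tree again.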

Future work
• Experiments within INEX (Initiative for the Evaluation of XML Retrieval)
• Improving the similarity functions (using a thesaurus, etc.)
• Introducing qualitative preferences (cheapest, nearest, small, etc.)

Thank You
Questions?
