This seminar explores the concept of dataspaces, which comprise heterogeneous and partially unstructured data entities. Attendees will learn about the importance of effective querying strategies that combine keyword and structure-aware search capabilities. We will introduce an innovative indexing method, extending traditional inverted lists to capture both text values and structural information, improving the performance of structured queries on complex data. Key topics include attribute and association indexing, hybrid models, and predicate queries, aiming to address current inefficiencies in data retrieval.
Indexing Dataspaces Presenter : Aviv Alon Seminar in Databases (236826)
Dataspaces • Dataspaces are collections of heterogeneous and partially unstructured data.
Dataspaces – Why do we need them? • Query: looking for an architect with good reviews and cheap materials • Answer: return “Architect B” as the matching instance
Main Problem • How can we effectively query and search a dataspace? • Consider queries that are keyword-based but also structure-aware
Indexing Heterogeneous Data • An inverted list where each row represents a keyword and each column represents a data item from the data sources.
Indexing Heterogeneous Data • We model the data as a set of triples • Each triple is either of the form (instance, attribute, value) for example: (“Architect B”, name, ‘Shalom’) or of the form (instance, association, instance) for example: (“Architect B”, worksWith, “Architect A”)
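The triple model can be sketched in a few lines of Python. This is only an illustration of the two triple forms on the slide, not the presenter's implementation; the `Triple` type and its field names are assumptions made for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the dataspace (hypothetical type, for illustration only)."""
    subject: str           # instance identifier
    label: str             # attribute name or association name
    obj: str               # text value (attribute) or instance id (association)
    is_association: bool   # False: (instance, attribute, value)


# The two example triples from the slide.
triples = [
    Triple("Architect B", "name", "Shalom", False),
    Triple("Architect B", "worksWith", "Architect A", True),
]

attribute_triples = [t for t in triples if not t.is_association]
association_triples = [t for t in triples if t.is_association]
```

Keeping both forms in one list mirrors the slide's "set of triples"; splitting them by kind is convenient because attributes and associations are indexed differently.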
Indexing Heterogeneous Data • We also model:
Example • Person instances: p1, p2, p3 • Article instance: a1 • Conference instance: c1 • Attributes firstName, lastName and nickName are sub-attributes of name • Association contactAuthor is a sub-association of author
Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 1: (title, ‘Birch’) attribute predicate
Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 2: (publishedIn, ‘1996 Sigmod’) association predicate
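Neighborhood keyword queries • Set of keywords K1, ... , Kn • Relevant instances: instances that contain the keywords • Associated instances: instances associated with a relevant instance • Example: ‘Birch’ (the figure marks the relevant instance and its associated instances)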
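A neighborhood keyword query can be illustrated directly on the triples, before any index is involved. The sketch below is a toy under stated assumptions: an instance counts as relevant if it contains at least one keyword, associations are followed in both directions, and the attribute values are made up to match the running example.

```python
# Toy data; the concrete values are assumptions chosen to match the example.
attribute_triples = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "publishedIn", "c1"),
]


def neighborhood_query(keywords, attribute_triples, association_triples):
    """Return (relevant, associated) instance sets for the keyword set."""
    wanted = {k.lower() for k in keywords}
    # Relevant: some attribute value contains at least one keyword.
    relevant = {
        instance
        for instance, _attribute, value in attribute_triples
        if wanted & set(value.lower().split())
    }
    # Associated: linked to a relevant instance in either direction.
    associated = set()
    for source, _label, target in association_triples:
        if source in relevant:
            associated.add(target)
        if target in relevant:
            associated.add(source)
    return relevant, associated - relevant


relevant, associated = neighborhood_query(
    ["Birch"], attribute_triples, association_triples
)
```

Scanning every triple like this is what the index structures in the rest of the talk are meant to avoid.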
Existing methods • Build a separate index for each attribute to support structured queries on structured data. • Con: significant overhead to the index structure • Create an inverted list to support keyword search on unstructured data. • Con: Does not allow specifications on structure
Proposed solution • Capture both text values and structural information using an extended inverted list. • The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.
Inverted Lists - Example We cannot tell that “Tian” occurs as p1’s name and p3’s lastName
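The loss of structure is easy to reproduce with a plain inverted list. In this sketch (toy data; the values are assumptions consistent with the slide) the row for "tian" lists p1 and p3 but keeps no trace of which attribute the keyword came from.

```python
from collections import defaultdict

# Toy attribute triples; values are illustrative assumptions.
triples = [
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("a1", "title", "Birch"),
]


def build_inverted_list(triples):
    """Map each keyword to the set of instances that contain it."""
    index = defaultdict(set)
    for instance, _attribute, value in triples:
        # The attribute label is dropped here, which is exactly the problem.
        for keyword in value.lower().split():
            index[keyword].add(instance)
    return index


plain_index = build_inverted_list(triples)
```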
Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)
Indexing Attributes Attribute inverted lists (ATIL) • Whenever the keyword k appears in a value of attribute a, there is a row in the inverted list for k//a// • Example: attribute = year, keyword = 1996
Attribute inverted lists (ATIL) • To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//A//, ... , Kn//A//} • Example: (lastName, ‘Tian’) becomes “tian//lastName//” • The search yields p3
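A minimal ATIL can be sketched as a dictionary keyed by `keyword//attribute//`. The function names are hypothetical, the data is a toy, and the sketch assumes an instance matches a predicate if it contains at least one of the keywords, which the slides do not spell out.

```python
from collections import defaultdict


def build_atil(attribute_triples):
    """Build rows of the form 'keyword//attribute//' -> set of instances."""
    index = defaultdict(set)
    for instance, attribute, value in attribute_triples:
        for keyword in value.lower().split():
            index[f"{keyword}//{attribute}//"].add(instance)
    return index


def attribute_predicate(index, attribute, keywords):
    """Answer (A, {K1..Kn}) by looking up each Ki//A// row (assumed: any match)."""
    matches = set()
    for keyword in keywords:
        matches |= index.get(f"{keyword.lower()}//{attribute}//", set())
    return matches


atil = build_atil([
    ("p1", "name", "Tian"),       # illustrative values
    ("p3", "lastName", "Tian"),
    ("c1", "year", "1996"),
])
result = attribute_predicate(atil, "lastName", ["Tian"])
```

Each predicate keyword costs one exact-match lookup, so the query is as cheap as an ordinary keyword search.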
Indexing Associations Attribute-association inverted lists (AAIL) • Example: association = authoredPaper, keyword = Birch gives a row containing p1 and p2
Attribute-association inverted lists (AAIL) • To answer an association predicate query (R, {K1, ... , Kn}) we search for {K1//R//, ... , Kn//R//} • Example: (author, ‘Raghu’) becomes “raghu//author//”
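The association rows of an AAIL can be sketched as follows. The construction rule is an assumption inferred from the slide examples: if instance I has association R to instance J, and J contains keyword k in one of its attribute values, then row `k//R//` contains I. The data is a toy, and p2's name is assumed for illustration.

```python
from collections import defaultdict

attribute_triples = [
    ("a1", "title", "Birch"),
    ("p2", "name", "Raghu"),          # assumed value, for illustration
]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "author", "p2"),
]


def build_association_rows(attribute_triples, association_triples):
    """Build rows 'keyword//association//' -> instances holding the association."""
    keywords_of = defaultdict(set)
    for instance, _attribute, value in attribute_triples:
        keywords_of[instance].update(value.lower().split())

    index = defaultdict(set)
    for source, association, target in association_triples:
        # Index the source under every keyword of the associated target.
        for keyword in keywords_of.get(target, ()):
            index[f"{keyword}//{association}//"].add(source)
    return index


aail = build_association_rows(attribute_triples, association_triples)
```

A full AAIL would also carry the attribute rows of the ATIL; they are left out here to keep the association part visible.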
Indexing hierarchies • For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.
A Naïve method • To answer the query (name, ‘Tian’) we can search for: “tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//” • Can be very expensive!
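The cost of the naïve method comes from expanding one predicate into one lookup per attribute in the sub-tree. A small sketch, with a hypothetical `SUB_ATTRIBUTES` table and a hand-built toy ATIL:

```python
# Hypothetical hierarchy table: attribute -> direct sub-attributes.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName", "nickName"]}


def descendants(attribute):
    """Return the attribute followed by all of its sub-attributes."""
    result = [attribute]
    for child in SUB_ATTRIBUTES.get(attribute, []):
        result.extend(descendants(child))
    return result


def naive_query(atil, attribute, keyword):
    """OR together one ATIL lookup per attribute in the sub-tree."""
    lookups = [f"{keyword.lower()}//{a}//" for a in descendants(attribute)]
    matches = set()
    for key in lookups:
        matches |= atil.get(key, set())
    return matches, len(lookups)


# Toy ATIL rows (illustrative values).
atil = {"tian//name//": {"p1"}, "tian//lastName//": {"p3"}}
matches, lookup_count = naive_query(atil, "name", "Tian")
```

Here four lookups are issued to find two rows; with deep or wide hierarchies the number of lookups grows with the size of the sub-tree.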
Index with Duplication Attribute inverted lists with duplication (Dup-ATIL) • A keyword that appears in a sub-attribute value is also indexed under every ancestor attribute • Example: attribute = name, sub-attribute = nickName
Attribute inverted lists with duplication (Dup-ATIL) • To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//A//, ... , Kn//A//} • Example: (name, ‘Tian’) becomes “tian//name//” • The search yields both p3 and p1
Dup-ATIL (cont.) • Pro: simple query answering • Con: duplication may considerably expand the size of the index, especially when: • there are long paths from the root attribute to the leaf attributes • most values in the triple base belong to leaf attributes
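Both the simple lookup and the size overhead show up in a small sketch of Dup-ATIL. The `PARENT` table and function names are hypothetical and the data is a toy.

```python
from collections import defaultdict

# Hypothetical hierarchy table: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def with_ancestors(attribute):
    """Return the attribute followed by its ancestors up to the root."""
    chain = [attribute]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_dup_atil(attribute_triples):
    """Index each keyword under its attribute and every ancestor attribute."""
    index = defaultdict(set)
    for instance, attribute, value in attribute_triples:
        for keyword in value.lower().split():
            for a in with_ancestors(attribute):
                index[f"{keyword}//{a}//"].add(instance)
    return index


dup_atil = build_dup_atil([
    ("p1", "name", "Tian"),       # illustrative values
    ("p3", "lastName", "Tian"),
])
entry_count = sum(len(instances) for instances in dup_atil.values())
```

The two input triples produce three index entries, because p3 is stored under both lastName and name. Each extra level between a leaf attribute and the root adds another copy.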
Index with Hierarchy Path Attribute inverted lists with hierarchies (Hier-ATIL) • Example: attribute = name, sub-attribute = nickName
Attribute inverted lists with hierarchies (Hier-ATIL) • To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//a0//...//am//*, ... , Kn//a0//...//am//*}, where a0//...//am is the hierarchy path for attribute A • Example: (name, ‘Tian’) becomes “tian//name//*” • The search yields both p3 and p1
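A Hier-ATIL can be sketched by putting the whole hierarchy path into the row key and answering predicates with a prefix scan over the sorted keys. The `PARENT` table, the function names and the data are assumptions for illustration; the prefix scan uses the standard `bisect` module.

```python
import bisect
from collections import defaultdict

# Hypothetical hierarchy table: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def hierarchy_path(attribute):
    """Return the path a0//...//am from the root attribute down to this one."""
    path = [attribute]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "//".join(path)


def build_hier_atil(attribute_triples):
    """Build rows 'keyword//a0//...//am//' -> set of instances."""
    index = defaultdict(set)
    for instance, attribute, value in attribute_triples:
        for keyword in value.lower().split():
            index[f"{keyword}//{hierarchy_path(attribute)}//"].add(instance)
    return dict(index)


def prefix_search(index, sorted_keys, prefix):
    """Union the instances of every row whose key starts with the prefix."""
    matches = set()
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches |= index[key]
    return matches


hier_atil = build_hier_atil([
    ("p1", "name", "Tian"),       # illustrative values
    ("p3", "lastName", "Tian"),
])
sorted_keys = sorted(hier_atil)
result = prefix_search(hier_atil, sorted_keys, "tian//name//")
```

No row is duplicated, but the query now reads every row under the prefix instead of a single row.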
Hier-ATIL (cont.) • Pro: does not increase the number of indexed keywords (although it can lengthen many of them) • Longer keywords cost little, because real indexing systems typically record a keyword only by its difference from the previous keyword • Con: a predicate query is answered by transforming it into a prefix search, which can be more expensive than a keyword search
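Recording a keyword by its difference from the previous one is commonly called front coding. The sketch below illustrates the general idea only and is not taken from any particular indexing system: each key in sorted order is stored as the length of the prefix it shares with its predecessor plus the remaining suffix.

```python
import os


def front_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys from the (shared_prefix_length, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = ["tian//name//", "tian//name//lastName//", "tian//name//nickName//"]
encoded = front_encode(keys)
```

The long shared path `tian//name//` is written once; the following keys store only `lastName//` and `nickName//`, which is why the longer Hier-ATIL keys add little to the index size.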
Hybrid Index – Why? • Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors • Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors • Hybrid indexing combines the strengths of both methods
Hybrid Index Hybrid attribute inverted list (Hybrid-ATIL): • Inverted list that can answer any prefix search by reading no more than t rows. Tian//name//lastName// is shadowed by Tian//name// summary row
Hybrid Index • To answer a prefix query of the form k//a0//...//am//* we look at all the rows with prefix k//a0//...//am// except those shadowed by summary rows • Example: (name, ‘Tian’), t=1 becomes “tian//name//*” • The prefix search is answered after reading 1 row and yields both p1 and p3
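The role of summary rows can be shown with a deliberately simplified construction. This sketch is an assumption, not the algorithm from the talk: it adds a summary row for every prefix that covers more than t Hier-ATIL rows, and keeps the summaries in a separate dictionary, which may create more summary rows than a careful construction would. Function names and data are hypothetical.

```python
from collections import defaultdict


def build_summaries(hier_index, t):
    """Add a summary row for every prefix that covers more than t rows."""
    rows_under = defaultdict(list)
    for key in hier_index:
        parts = key.split("//")[:-1]          # drop the trailing empty piece
        keyword, path = parts[0], parts[1:]
        for depth in range(1, len(path) + 1):
            prefix = "//".join([keyword] + path[:depth]) + "//"
            rows_under[prefix].append(key)
    return {
        prefix: set().union(*(hier_index[k] for k in keys))
        for prefix, keys in rows_under.items()
        if len(keys) > t
    }


def hybrid_prefix_search(hier_index, summaries, prefix):
    """Return (instances, rows_read); a summary row shadows the rows below it."""
    if prefix in summaries:
        return summaries[prefix], 1
    keys = [k for k in hier_index if k.startswith(prefix)]
    matches = set()
    for key in keys:
        matches |= hier_index[key]
    return matches, len(keys)


# Toy Hier-ATIL rows (illustrative values).
hier_atil = {"tian//name//": {"p1"}, "tian//name//lastName//": {"p3"}}
summaries = build_summaries(hier_atil, t=1)
result, rows_read = hybrid_prefix_search(hier_atil, summaries, "tian//name//")
```

With t=1, the prefix `tian//name//` covers two rows, so it gets a summary row and the query reads one row. In this simplified scheme a prefix either has a summary (one read) or covers at most t rows, so no prefix search reads more than t rows.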
Neighborhood Keyword Queries • We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL • Example: “Birch”, t=1 becomes “birch//*” • The prefix search is answered after reading 1 row and yields p1, p2, a1, c1
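The KIL lookup can be sketched by putting attribute rows and association rows in one list and summarizing at the keyword level. This is a simplified illustration under assumptions: the rows are hand-built toy data, summary rows are only created for the bare keyword prefix, and the label `publishedPaper` is a hypothetical inverse of publishedIn introduced for the example.

```python
# Toy rows combining attribute and association entries for one keyword.
rows = {
    "birch//title//": {"a1"},                 # a1 contains the keyword
    "birch//authoredPaper//": {"p1", "p2"},   # authors of a1
    "birch//publishedPaper//": {"c1"},        # hypothetical label, see lead-in
}


def build_keyword_summaries(rows, t):
    """Add a 'keyword//' summary row when a keyword has more than t rows."""
    by_keyword = {}
    for key, instances in rows.items():
        keyword = key.split("//", 1)[0]
        by_keyword.setdefault(keyword, []).append(instances)
    return {
        f"{keyword}//": set().union(*groups)
        for keyword, groups in by_keyword.items()
        if len(groups) > t
    }


def neighborhood_lookup(rows, summaries, keyword):
    """Answer the prefix search 'keyword//*' and report the rows read."""
    prefix = f"{keyword.lower()}//"
    if prefix in summaries:
        return summaries[prefix], 1
    hits = [instances for key, instances in rows.items() if key.startswith(prefix)]
    return set().union(*hits), len(hits)


summaries = build_keyword_summaries(rows, t=1)
result, rows_read = neighborhood_lookup(rows, summaries, "Birch")
```

The single summary row returns the relevant instance a1 together with its associated instances p1, p2 and c1.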
Experimental Evaluation • Associations between disparate items on the desktop: • LaTeX and BibTeX files • Word documents • PowerPoint presentations • Emails and contacts • Web pages in the web cache • The instances and associations are stored in an RDF file; the size of the file is 52.4 MB
Experimental Evaluation [Figure panels: attribute clauses without sub-attributes; attribute clauses with sub-attributes; association clauses]
Observations about the results • Data set: 105,320 objects, 300,354 attributes, 468,402 associations • Predicate query: 15.2 ms • Neighborhood keyword query: 224.3 ms (with no more than 5 keywords) • Answering queries using the KIL was very efficient • Answering queries with and without sub-attributes took a similar amount of time • This shows the effectiveness of hybrid indexing
Comparison of methods Compared with KIL (on average): • The naïve method: query-answering time increased by a factor of 15.9 • XML index (SepIL): query-answering time increased by a factor of 2
Conclusions Main Contributions: • An indexing method designed to support flexible querying over dataspaces • Extended inverted lists that capture both the text and the structure of the data Future Work: • Extend the index to support value heterogeneity • Investigate appropriate ranking algorithms
THE END Questions ?