Advancements in Indexing Heterogeneous Data for Enhanced Keyword-Based Queries

Indexing Dataspaces • Presenter: Aviv Alon • Seminar in Databases (236826)

Dataspaces • Dataspaces are collections of heterogeneous and partially unstructured data.

Dataspaces – Why do we need them? Looking for an architect with good reviews and cheap materials? Return “Architect B” as the matching instance

Main Problem • Consider queries that are keyword-based but also structure-aware: how can we effectively query and search a dataspace?

Indexing Heterogeneous Data • An inverted list where each row represents a keyword and each column represents a data item from the data sources.

Indexing Heterogeneous Data • We model the data as a set of triples • Each triple is either of the form (instance, attribute, value) for example: (“Architect B”, name, ‘Shalom’) or of the form (instance, association, instance) for example: (“Architect B”, worksWith, “Architect A”)
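The triple model above can be sketched in a few lines of Python. This is only an illustration: the `Triple` type and its field names are assumptions made for this example, not something defined in the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the triple base (hypothetical representation)."""
    subject: str                  # instance identifier, e.g. "Architect B"
    label: str                    # attribute name or association name
    obj: str                      # text value (attribute) or instance id (association)
    is_association: bool = False  # False: (instance, attribute, value)


triples = [
    Triple("Architect B", "name", "Shalom"),
    Triple("Architect B", "worksWith", "Architect A", is_association=True),
]

# Split the triple base into the two forms described on the slide.
attribute_triples = [t for t in triples if not t.is_association]
association_triples = [t for t in triples if t.is_association]
```

Keeping both forms in one list mirrors the slide's "set of triples"; the flag is just one simple way to tell them apart.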

Indexing Heterogeneous Data • We also model:

Example • Person instances: p1, p2, p3 • Article instance: a1 • Conference instance: c1 • Attributes firstName, lastName and nickName are sub-attributes of name • Association contactAuthor is a sub-association of author.

Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 1: (title, ‘Birch’) attribute predicate

Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 2: (publishedIn, ‘1996 Sigmod’) association predicate

Neighborhood keyword queries • Set of keywords K1, ... , Kn • relevant instance • associated instances Example: ‘Birch’ relevant instance associated instances
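As a point of reference, a neighborhood keyword query can be evaluated directly on the triples, without any index. The sketch below assumes that an instance is relevant when at least one keyword appears in one of its attribute values, and that associations are followed in both directions; both are assumptions of this example, and the data is a hypothetical toy set using the instance names from the earlier example slide.

```python
def neighborhood_query(attribute_triples, association_triples, keywords):
    """Return (relevant, associated) instance sets for a keyword set."""
    wanted = {k.lower() for k in keywords}

    # Relevant: some keyword occurs in one of the instance's attribute values.
    relevant = {
        inst
        for inst, _attr, value in attribute_triples
        if wanted & set(value.lower().split())
    }

    # Associated: linked to a relevant instance (direction ignored here).
    associated = set()
    for src, _label, dst in association_triples:
        if src in relevant:
            associated.add(dst)
        if dst in relevant:
            associated.add(src)
    return relevant, associated - relevant


# Hypothetical toy data.
attrs = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
    ("p2", "firstName", "Raghu"),
    ("c1", "name", "Sigmod"),
]
assocs = [
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
    ("a1", "publishedIn", "c1"),
]

relevant, associated = neighborhood_query(attrs, assocs, ["Birch"])
```

With this toy data the relevant instance is a1 and the associated instances are p1, p2 and c1. Scanning every triple per query is what the index structures in the rest of the talk are meant to avoid.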

Existing methods • Build a separate index for each attribute to support structured queries on structured data. • Con: significant overhead to the index structure • Create an inverted list to support keyword search on unstructured data. • Con: Does not allow specifications on structure

Proposed solution • Capture both text values and structural information using an extended inverted list. • The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.

Inverted Lists - Example We cannot tell that “Tian” occurs as p1’s name and p3’s lastName

Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

Indexing Attributes Attribute inverted lists (ATIL) • Whenever the keyword k appears in a value of the a attribute, there is a row in the inverted list for k//a// Attribute = year keyword = 1996

Indexing Attributes Attribute inverted lists (ATIL) • Whenever the keyword k appears in a value of the a attribute, there is a row in the inverted list for k//a//

Attribute inverted lists (ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we need to search for {K1//A//, ... , Kn//A//} Example: (lastName, ‘Tian’) “tian//lastName//” The search will yield p3
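A minimal sketch of building and querying an ATIL follows. It assumes that each cell stores an occurrence count, that keywords are lower-cased whitespace tokens, and that an instance matches when it contains at least one of the query keywords; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def build_atil(attribute_triples):
    """Map each row key "keyword//attribute//" to {instance: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for inst, attr, value in attribute_triples:
        for word in value.lower().split():
            index[f"{word}//{attr}//"][inst] += 1
    return index


def answer_attribute_predicate(index, attr, keywords):
    """Look up one row per keyword and union the instances found."""
    hits = set()
    for k in keywords:
        hits |= set(index.get(f"{k.lower()}//{attr}//", {}))
    return hits


# Hypothetical toy data, consistent with the slide's example.
atil = build_atil([
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("c1", "year", "1996"),
])
result = answer_attribute_predicate(atil, "lastName", ["Tian"])
```

Here `result` contains only p3, because p1's occurrence of "tian" is stored under the separate row `tian//name//`.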

Indexing Associations Attribute-association inverted lists (AAIL): Association = authoredPaper with p1, p2 keyword = Birch

Indexing Associations Attribute-association inverted lists (AAIL):

Attribute-association inverted lists (AAIL) To answer an association predicate query (R, {K1, ... , Kn}) we need to search for {K1//R//, ... , Kn//R//} Example: (author, ‘Raghu’) “raghu//author//”
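The association rows of an AAIL can be sketched as below. The sketch assumes that the row `k//R//` lists every instance that has association R to an instance whose attribute values contain k, which matches the authoredPaper and author examples on the slides. It builds only the association rows (attribute rows would be added as in an ATIL); names and data are hypothetical.

```python
from collections import defaultdict


def build_association_rows(attribute_triples, association_triples):
    """Map "keyword//association//" to the instances on the source side."""
    # Collect the keywords appearing in each instance's attribute values.
    keywords_of = defaultdict(set)
    for inst, _attr, value in attribute_triples:
        keywords_of[inst].update(value.lower().split())

    index = defaultdict(set)
    for src, assoc, dst in association_triples:
        for word in keywords_of.get(dst, ()):
            index[f"{word}//{assoc}//"].add(src)
    return index


# Hypothetical toy data.
aail = build_association_rows(
    attribute_triples=[
        ("a1", "title", "Birch"),
        ("p2", "firstName", "Raghu"),
    ],
    association_triples=[
        ("a1", "author", "p2"),
        ("p1", "authoredPaper", "a1"),
        ("p2", "authoredPaper", "a1"),
    ],
)
authored_by_raghu = aail["raghu//author//"]
wrote_birch = aail["birch//authoredPaper//"]
```

The query (author, ‘Raghu’) becomes a single row lookup for `raghu//author//`, which returns a1 in this toy data.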

Indexing hierarchies • For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.

A Naïve method To answer the query (name, ‘Tian’) we can search for: “tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//” This can be very expensive, since every query expands into one lookup per attribute in the hierarchy!
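The query expansion performed by the naïve method can be sketched as follows. The hierarchy table, function names and index contents are hypothetical; the index is a plain ATIL with no hierarchy support.

```python
# Hypothetical attribute hierarchy: parent attribute -> direct sub-attributes.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName", "nickName"]}


def descendants(attr, hierarchy):
    """Return attr followed by all of its (transitive) sub-attributes."""
    result = [attr]
    for child in hierarchy.get(attr, []):
        result.extend(descendants(child, hierarchy))
    return result


def naive_query(index, attr, keyword, hierarchy):
    """OR together one row lookup per attribute in the sub-hierarchy."""
    hits, lookups = set(), 0
    for a in descendants(attr, hierarchy):
        lookups += 1
        hits |= index.get(f"{keyword.lower()}//{a}//", set())
    return hits, lookups


# Plain ATIL rows (toy data).
plain_atil = {
    "tian//name//": {"p1"},
    "tian//lastName//": {"p3"},
}
hits, lookups = naive_query(plain_atil, "name", "Tian", SUB_ATTRIBUTES)
```

The answer is correct (p1 and p3), but it takes four lookups for one keyword, and the number of lookups grows with the size of the hierarchy below the queried attribute.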

Indexing Attributes Attribute inverted lists with duplication (Dup-ATIL): Attribute = name Sub-attribute = nickName

Index with Duplication Attribute inverted lists with duplication (Dup-ATIL)

Attribute inverted lists with duplication (Dup-ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we need to search for {K1//A//, ... , Kn//A//} Example: (name, ‘Tian’) “tian//name//” The search will yield both p3 and p1

Dup-ATIL (cont.) • Pro: simple query answering • Con: may considerably expand the size of the index because of the duplication, especially when: • There are long paths from the root attribute to the leaf attributes • Most values in the triple base belong to leaf attributes.
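A minimal sketch of the duplication idea: each occurrence is indexed under its own attribute and under every ancestor attribute. The `PARENT` table, function names and data are hypothetical, and cells hold plain instance sets for brevity.

```python
from collections import defaultdict

# Hypothetical hierarchy: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def ancestors_and_self(attr, parent):
    """Return attr followed by its chain of ancestors up to the root."""
    chain = [attr]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def build_dup_atil(attribute_triples, parent):
    """Index every keyword under its attribute and all ancestor attributes."""
    index = defaultdict(set)
    for inst, attr, value in attribute_triples:
        for word in value.lower().split():
            for a in ancestors_and_self(attr, parent):
                index[f"{word}//{a}//"].add(inst)
    return index


dup_atil = build_dup_atil(
    [("p1", "name", "Tian"), ("p3", "lastName", "Tian")],
    PARENT,
)
name_hits = dup_atil["tian//name//"]
last_name_hits = dup_atil["tian//lastName//"]
```

A query on name is now a single row lookup returning p1 and p3. The cost is that p3 is stored twice, and a value at depth d in the hierarchy is stored d + 1 times, which is the index growth described above.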

Index with Hierarchy Path Attribute inverted lists with hierarchies (Hier-ATIL): Attribute = name Sub-attribute = nickName

Attribute inverted lists with hierarchies (Hier-ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we need to search for {K1//a0//...//am//*, ... , Kn//a0//...//am//*} where a0//...//am is the hierarchy path for attribute A Example: (name, ‘Tian’) “tian//name//*” The search will yield both p3 and p1

Hier-ATIL (cont.) • Pro: Does not increase the number of indexed keywords (although it can lengthen many of them) • Real indexing systems typically record a keyword only by the difference from its previous keyword, so the longer keys cost little extra space • Con: Answers a predicate query by transforming it into a prefix search, which can be more expensive than a keyword search.
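The prefix search used by Hier-ATIL can be sketched with a sorted key list and binary search. This is an illustration under assumptions: the `PARENT` table, the function names and the toy data are hypothetical, and a real index would use its own ordered storage rather than a Python list.

```python
import bisect
from collections import defaultdict

# Hypothetical hierarchy: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def hierarchy_path(attr, parent):
    """Return the path a0//...//am from the root attribute down to attr."""
    chain = [attr]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return "//".join(reversed(chain))


def build_hier_atil(attribute_triples, parent):
    """Index each keyword once, under its full hierarchy path."""
    rows = defaultdict(set)
    for inst, attr, value in attribute_triples:
        path = hierarchy_path(attr, parent)
        for word in value.lower().split():
            rows[f"{word}//{path}//"].add(inst)
    return sorted(rows), rows


def prefix_search(keys, rows, prefix):
    """Union all rows whose key starts with prefix (keys must be sorted)."""
    hits = set()
    i = bisect.bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        hits |= rows[keys[i]]
        i += 1
    return hits


keys, rows = build_hier_atil(
    [("p1", "name", "Tian"), ("p3", "lastName", "Tian"), ("c1", "year", "1996")],
    PARENT,
)
name_hits = prefix_search(keys, rows, "tian//name//")
```

Each occurrence is stored once, but the query for `tian//name//*` has to read every row under that prefix (two rows here), which is the extra cost compared with a single keyword lookup.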

Hybrid Index – Why? • Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors • Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors • Hybrid indexing combines the strengths of both methods

Hybrid Index Hybrid attribute inverted list (Hybrid-ATIL): • Inverted list that can answer any prefix search by reading no more than t rows. Tian//name//lastName// is shadowed by Tian//name// summary row

Hybrid Index To answer a prefix query of the form k//a0//...//am//* we look at all the rows with prefix k//a0//...//am// except those shadowed by summary rows Example: (name, ‘Tian’), t=1 “tian//name//*” The prefix search is answered after reading 1 row and yields both p1 and p3
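One possible way to realize summary rows is sketched below. The construction rule is an assumption of this example, not taken from the slides: a summary row is created for every prefix shared by more than t Hier-ATIL rows, it holds the union of those rows, and the shadowed rows are kept so that more specific queries still work. Function names and data are hypothetical.

```python
import bisect
from collections import defaultdict


def build_hybrid_atil(hier_rows, t):
    """Add a summary row for each prefix covering more than t rows."""
    members = defaultdict(list)
    for key in hier_rows:
        parts = key[:-2].split("//")  # drop trailing "//", then split
        for depth in range(2, len(parts) + 1):
            members["//".join(parts[:depth]) + "//"].append(key)

    rows = {k: set(v) for k, v in hier_rows.items()}
    summaries = set()
    for prefix, keys in members.items():
        if len(keys) > t:
            rows[prefix] = set().union(*(hier_rows[k] for k in keys))
            summaries.add(prefix)
    return rows, summaries


def hybrid_prefix_search(rows, summaries, prefix):
    """Prefix search that skips rows shadowed by a summary row."""
    keys = sorted(rows)  # a real index would keep its keys sorted
    hits, rows_read = set(), 0
    i = bisect.bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        key = keys[i]
        hits |= rows[key]
        rows_read += 1
        i += 1
        if key in summaries:
            # Everything under this key is already contained in the summary.
            while i < len(keys) and keys[i].startswith(key):
                i += 1
    return hits, rows_read


hier_rows = {
    "tian//name//": {"p1"},
    "tian//name//lastName//": {"p3"},
    "raghu//name//firstName//": {"p2"},
}
hybrid_rows, summary_keys = build_hybrid_atil(hier_rows, t=1)
hits, rows_read = hybrid_prefix_search(hybrid_rows, summary_keys, "tian//name//")
```

With t=1, `tian//name//` becomes a summary row that shadows `tian//name//lastName//`, so the query reads one row and returns p1 and p3. Under the stated construction rule, any queried prefix either has a summary row (one read) or covers at most t rows.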

Neighborhood Keyword Queries • We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL Example: “Birch”, t=1 “birch//*” The prefix search is answered after reading 1 row and yields p1, p2, a1, c1

Experimental Evaluation • Associations between disparate items on the desktop: • LaTeX and BibTeX files • Word documents • PowerPoint presentations • Emails and contacts • Webpages in the web cache • The instances and associations are stored in an RDF file. The size of the file is 52.4 MB

Experimental Evaluation Attribute clauses. No sub-attributes Attribute clauses. With sub-attributes Association clauses

Observations about the results • 105,320 objects • 300,354 attributes • 468,402 associations • Predicate query: 15.2 ms • Neighborhood keyword query: 224.3 ms (with no more than 5 keywords) • Answering queries using the KIL was very efficient! • Answering queries with / without sub-attributes consumed a similar amount of time • Effectiveness of hybrid indexing

Comparison of methods Compared with KIL (on average): • The Naïve method • query-answering time increased by a factor of 15.9 • XML Index (SepIL): • query-answering time increased by a factor of 2

Conclusions Main Contributions: • An indexing method that is designed to support flexible querying over dataspaces • Extended inverted lists to capture both the text and the structure of data Future Work • Extend the index to support value heterogeneity and investigate appropriate ranking algorithms

THE END Questions?
