
Indexing Dataspaces

Indexing Dataspaces. Presenter : Aviv Alon Seminar in Databases  (236826). Dataspaces. Dataspaces are collections of heterogeneous and partially unstructured data. Dataspaces – Why we need them?. Looking for an architect with good reviews and cheap materials?.


Presentation Transcript


  1. Indexing Dataspaces Presenter : Aviv Alon Seminar in Databases (236826)

  2. Dataspaces • Dataspaces are collections of heterogeneous and partially unstructured data.

  3. Dataspaces – Why do we need them? Looking for an architect with good reviews and cheap materials? Return “Architect B” as the matching instance

  4. Main Problem • How to effectively query and search a dataspace • Consider queries that are keyword-based but also structure-aware

  5. Indexing Heterogeneous Data • An inverted list where each row represents a keyword and each column represents a data item from the data sources.

  6. Indexing Heterogeneous Data • We model the data as a set of triples • Each triple is either of the form (instance, attribute, value) for example: (“Architect B”, name, ‘Shalom’) or of the form (instance, association, instance) for example: (“Architect B”, worksWith, “Architect A”)
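A minimal Python sketch of the triple model described on this slide. The `Triple` type and its field names are assumptions made for illustration; only the two example triples come from the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the dataspace (hypothetical representation)."""
    instance: str         # the instance the triple describes
    label: str            # attribute name or association name
    target: str           # a text value (attribute) or another instance (association)
    is_association: bool  # True for (instance, association, instance)


# The two example triples from the slide.
triples = [
    Triple("Architect B", "name", "Shalom", False),
    Triple("Architect B", "worksWith", "Architect A", True),
]

# Split the triple base into its two kinds.
attribute_triples = [t for t in triples if not t.is_association]
association_triples = [t for t in triples if t.is_association]
```

Keeping both kinds of triple in one type is a design choice for the sketch; two separate lists of plain tuples would work equally well.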

  7. Indexing Heterogeneous Data • We also model:

  8. Example • Person instances: p1, p2, p3 • Article instance: a1 • Conference instance: c1 • Attributes firstName, lastName and nickName are sub-attributes of name • Association contactAuthor is a sub-association of author.

  9. Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 1: (title, ‘Birch’) attribute predicate

  10. Predicate queries • Set of predicates of the form (v, {K1, ... , Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set Example 2: (publishedIn, ‘1996 Sigmod’) association predicate

  11. Neighborhood keyword queries • Set of keywords K1, ... , Kn • Returns relevant instances and associated instances Example: ‘Birch’ (figure labels: relevant instance, associated instances)

  12. Existing methods • Build a separate index for each attribute to support structured queries on structured data. • Con: significant overhead to the index structure • Create an inverted list to support keyword search on unstructured data. • Con: Does not allow specifications on structure

  13. Proposed solution • Capture both text values and structural information using an extended inverted list. • The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.

  14. Inverted Lists - Example We cannot tell that “Tian” occurs as p1’s name and p3’s lastName
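The limitation on this slide can be reproduced with a small Python sketch of a plain inverted list. The attribute values are illustrative assumptions, chosen only so that "Tian" occurs in p1's name and p3's lastName as the slide states.

```python
from collections import defaultdict

# Toy attribute triples (instance, attribute, value); the values are illustrative.
attribute_triples = [
    ("p1", "name", "Tian Zhang"),
    ("p3", "lastName", "Tian"),
    ("a1", "title", "Birch clustering"),
]

# Plain inverted list: keyword -> {instance: occurrence count}.
plain_index = defaultdict(dict)
for instance, _attribute, value in attribute_triples:  # the attribute is dropped
    for keyword in value.lower().split():
        row = plain_index[keyword]
        row[instance] = row.get(instance, 0) + 1

print(plain_index["tian"])  # {'p1': 1, 'p3': 1}
```

The row for "tian" lists p1 and p3 but no longer records which attribute each occurrence came from.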

  15. Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

  16. Indexing Attributes Attribute inverted lists (ATIL) • Whenever the keyword k appears in a value of attribute a, there is a row in the inverted list for k//a// Attribute = year keyword = 1996

  17. Indexing Attributes Attribute inverted lists (ATIL) • Whenever the keyword k appears in a value of attribute a, there is a row in the inverted list for k//a//

  18. Attribute inverted lists (ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//A//, ... , Kn//A//} Example: (lastName, ‘Tian’) “tian//lastName//” The search will yield p3
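A sketch of building and querying an ATIL in Python. The function names and toy values are assumptions, and so is the choice to return every instance that matches at least one of the keywords; the slides do not spell out how multiple keywords are combined.

```python
from collections import defaultdict

# Toy attribute triples (instance, attribute, value); the values are illustrative.
ATTRIBUTE_TRIPLES = [
    ("p1", "name", "Tian Zhang"),
    ("p3", "lastName", "Tian"),
    ("a1", "year", "1996"),
]


def build_atil(attribute_triples):
    """Map each 'keyword//attribute//' key to {instance: occurrence count}."""
    index = defaultdict(dict)
    for instance, attribute, value in attribute_triples:
        for keyword in value.lower().split():
            row = index[f"{keyword}//{attribute}//"]
            row[instance] = row.get(instance, 0) + 1
    return dict(index)


def answer_attribute_predicate(index, attribute, keywords):
    """Look up one row per keyword and union the instances (assumed semantics)."""
    hits = set()
    for keyword in keywords:
        hits.update(index.get(f"{keyword.lower()}//{attribute}//", {}))
    return hits


atil = build_atil(ATTRIBUTE_TRIPLES)
print(answer_attribute_predicate(atil, "lastName", ["Tian"]))  # {'p3'}
```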

  19. Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

  20. Indexing Associations Attribute-association inverted lists (AAIL): Association = authoredPaper with p1, p2 keyword = Birch

  21. Indexing Associations Attribute-association inverted lists (AAIL):

  22. Attribute-association inverted lists (AAIL) To answer an association predicate query (R, {K1, ... , Kn}) we search for {K1//R//, ... , Kn//R//} Example: (author, ‘Raghu’) “raghu//author//”
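One way to read the association rows is sketched below in Python: for a triple (I, R, I'), every keyword found in the attribute values of I' adds I to the row keyword//R//. That reading is an assumption inferred from the examples on slides 20 and 22, and the function name and toy values are hypothetical.

```python
from collections import Counter, defaultdict

# Toy data; the attribute values are illustrative.
ATTRIBUTE_TRIPLES = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
    ("p2", "firstName", "Raghu"),
]
ASSOCIATION_TRIPLES = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
]


def build_association_rows(attribute_triples, association_triples):
    """Map 'keyword//association//' to a Counter of source instances."""
    # Keywords that occur in the attribute values of each instance.
    keywords_of = defaultdict(Counter)
    for instance, _attribute, value in attribute_triples:
        keywords_of[instance].update(value.lower().split())

    rows = defaultdict(Counter)
    for source, association, target in association_triples:
        for keyword, count in keywords_of[target].items():
            rows[f"{keyword}//{association}//"][source] += count
    return dict(rows)


rows = build_association_rows(ATTRIBUTE_TRIPLES, ASSOCIATION_TRIPLES)
print(sorted(rows["birch//authoredPaper//"]))  # ['p1', 'p2']
print(sorted(rows["raghu//author//"]))         # ['a1']
```

A full AAIL would hold these rows alongside the attribute rows of the ATIL; only the association part is shown here.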

  23. Indexing hierarchies • For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.

  24. A Naïve method To answer the query (name, ‘Tian’) we can search for: “tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//” Can be very expensive!
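The naive expansion can be sketched in Python as follows. The `SUB_ATTRIBUTES` map and the small ATIL fragment are assumptions for illustration; the point is that one lookup is issued per attribute in the sub-hierarchy, which is where the cost comes from.

```python
# Hypothetical hierarchy: attribute -> direct sub-attributes.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName", "nickName"]}

# A fragment of an ATIL with illustrative contents.
ATIL = {
    "tian//name//": {"p1": 1},
    "tian//lastName//": {"p3": 1},
}


def self_and_descendants(attribute, sub_attributes):
    """Return the attribute followed by all of its descendants."""
    result = [attribute]
    for child in sub_attributes.get(attribute, []):
        result.extend(self_and_descendants(child, sub_attributes))
    return result


def naive_lookup_keys(attribute, keyword, sub_attributes):
    """One ATIL key per attribute in the sub-hierarchy (the OR-expansion)."""
    return [
        f"{keyword.lower()}//{a}//"
        for a in self_and_descendants(attribute, sub_attributes)
    ]


def naive_answer(index, attribute, keyword, sub_attributes):
    hits = set()
    for key in naive_lookup_keys(attribute, keyword, sub_attributes):
        hits.update(index.get(key, {}))  # one index lookup per expanded key
    return hits


print(naive_lookup_keys("name", "Tian", SUB_ATTRIBUTES))
# ['tian//name//', 'tian//firstName//', 'tian//lastName//', 'tian//nickName//']
```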

  25. Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

  26. Indexing Attributes Attribute inverted lists with duplication (Dup-ATIL): Attribute = name Sub-attribute = nickName

  27. Index with Duplication Attribute inverted lists with duplication (Dup-ATIL)

  28. Attribute inverted lists with duplication (Dup-ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//A//, ... , Kn//A//} Example: (name, ‘Tian’) “tian//name//” The search will yield both p3 and p1
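A Python sketch of the duplication idea, assuming each occurrence is indexed under its own attribute and under every ancestor attribute. The `PARENT` map, the function names, and the toy values are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

# Toy attribute triples; the values are illustrative.
ATTRIBUTE_TRIPLES = [
    ("p1", "name", "Tian Zhang"),
    ("p3", "lastName", "Tian"),
]


def self_and_ancestors(attribute, parent):
    """Return the attribute followed by its ancestors up to the root."""
    chain = [attribute]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def build_dup_atil(attribute_triples, parent):
    """Like an ATIL, but each occurrence is duplicated into ancestor rows."""
    index = defaultdict(dict)
    for instance, attribute, value in attribute_triples:
        for keyword in value.lower().split():
            for label in self_and_ancestors(attribute, parent):
                row = index[f"{keyword}//{label}//"]
                row[instance] = row.get(instance, 0) + 1
    return dict(index)


dup_atil = build_dup_atil(ATTRIBUTE_TRIPLES, PARENT)
print(sorted(dup_atil["tian//name//"]))      # ['p1', 'p3']
print(sorted(dup_atil["tian//lastName//"]))  # ['p3']
```

A single row lookup now answers (name, ‘Tian’), at the price of storing p3's occurrence twice.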

  29. Dup-ATIL (cont.) • Pro: simple query answering • Con: may considerably expand the size of the index because of the duplication, especially when: • There are long paths from the root attribute to the leaf attributes • Most values in the triple base belong to leaf attributes.

  30. Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

  31. Index with Hierarchy Path Attribute inverted lists with hierarchies (Hier-ATIL): Attribute = name Sub-attribute = nickName

  32. Attribute inverted lists with hierarchies (Hier-ATIL) To answer an attribute predicate query (A, {K1, ... , Kn}) we search for {K1//a0// ... //am//*, ... , Kn//a0// ... //am//*}, where a0// ... //am is the hierarchy path for attribute A Example: (name, ‘Tian’) “tian//name//*” The search will yield both p3 and p1
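A Python sketch of a Hier-ATIL with prefix search over sorted keys. The `PARENT` map, function names, and toy values are assumptions, and a sorted key list with `bisect` stands in for whatever prefix-search structure a real index would use.

```python
import bisect
from collections import defaultdict

# Hypothetical hierarchy: sub-attribute -> parent attribute.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

# Toy attribute triples; the values are illustrative.
ATTRIBUTE_TRIPLES = [
    ("p1", "name", "Tian Zhang"),
    ("p3", "lastName", "Tian"),
]


def hierarchy_path(attribute, parent):
    """Return 'a0//...//am//' from the root attribute down to `attribute`."""
    path = [attribute]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return "//".join(reversed(path)) + "//"


def build_hier_atil(attribute_triples, parent):
    """Keys are 'keyword//a0//...//am//'; the result is sorted by key."""
    index = defaultdict(dict)
    for instance, attribute, value in attribute_triples:
        path = hierarchy_path(attribute, parent)
        for keyword in value.lower().split():
            row = index[f"{keyword}//{path}"]
            row[instance] = row.get(instance, 0) + 1
    return dict(sorted(index.items()))


def prefix_search(index, prefix):
    """Union all rows whose key starts with `prefix` (the 'prefix*' query)."""
    keys = list(index)  # already in sorted order
    start = bisect.bisect_left(keys, prefix)
    hits = set()
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # matching keys are contiguous in sorted order
        hits.update(index[key])
    return hits


hier_atil = build_hier_atil(ATTRIBUTE_TRIPLES, PARENT)
print(sorted(prefix_search(hier_atil, "tian//name//")))  # ['p1', 'p3']
```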

  33. Hier-ATIL (cont.) • Pro: does not increase the number of indexed keywords (although it can lengthen many of them) • The longer keys cost little, because real indexing systems typically record a keyword only by its difference from the previous keyword • Con: a predicate query is answered by transforming it into a prefix search, which can be more expensive than a keyword search.

  34. Indexing structure outline • Indexing Attributes • Attribute inverted lists (ATIL) • Indexing Associations • Attribute-association inverted lists (AAIL) • Indexing hierarchies • Attribute inverted lists with duplication (Dup-ATIL) • Attribute inverted lists with hierarchies (Hier-ATIL) • Hybrid attribute inverted list (Hybrid-ATIL)

  35. Hybrid Index – Why? • Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors • Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors • Hybrid indexing combines the strengths of both methods

  36. Hybrid Index Hybrid attribute inverted list (Hybrid-ATIL): • Inverted list that can answer any prefix search by reading no more than t rows. Tian//name//lastName// is shadowed by Tian//name// summary row

  37. Hybrid Index To answer a prefix query of the form k//a0// ... //am//* we look at all the rows with prefix k//a0// ... //am// except those shadowed by summary rows Example: (name, ‘Tian’), t=1 “tian//name//*” The prefix search is answered after reading 1 row and yields both p1 and p3
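A simplified Python sketch of summary rows, under stated assumptions: a summary row is created for every prefix that would otherwise require reading more than t rows, it holds the merged contents of all rows under that prefix, and it replaces an ordinary row with the same key. The data layout and function names are hypothetical, and the query uses a linear scan for brevity.

```python
from collections import Counter

# A small Hier-ATIL with illustrative contents.
HIER_ROWS = {
    "tian//name//": {"p1": 1},
    "tian//name//lastName//": {"p3": 1},
    "zhang//name//": {"p1": 1},
}


def build_hybrid(hier_rows, t):
    """Add a summary row for each prefix covering more than t rows."""
    keys = sorted(hier_rows)
    hybrid = {k: {"summary": False, "row": dict(hier_rows[k])} for k in keys}

    # Candidate prefixes: the keyword plus one or more path components.
    prefixes = set()
    for key in keys:
        parts = key.split("//")[:-1]  # drop the empty tail after the last '//'
        for depth in range(2, len(parts) + 1):
            prefixes.add("//".join(parts[:depth]) + "//")

    for prefix in prefixes:
        covered = [k for k in keys if k.startswith(prefix)]
        if len(covered) > t:
            merged = Counter()
            for k in covered:
                merged.update(hier_rows[k])
            hybrid[prefix] = {"summary": True, "row": dict(merged)}
    return dict(sorted(hybrid.items()))


def hybrid_prefix_search(hybrid, prefix):
    """Return (instances, rows read), skipping rows shadowed by a summary row."""
    hits, rows_read, shadow = set(), 0, None
    for key, entry in hybrid.items():  # keys are in sorted order
        if not key.startswith(prefix):
            continue
        if shadow is not None and key.startswith(shadow):
            continue  # shadowed: already covered by the summary row
        hits.update(entry["row"])
        rows_read += 1
        if entry["summary"]:
            shadow = key
    return hits, rows_read


hybrid = build_hybrid(HIER_ROWS, t=1)
hits, rows_read = hybrid_prefix_search(hybrid, "tian//name//")
print(sorted(hits), rows_read)  # ['p1', 'p3'] 1
```

With t=1 the row tian//name//lastName// is shadowed by the tian//name// summary row, matching the example on the slide.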

  38. Neighborhood Keyword Queries • We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL

  39. Neighborhood Keyword Queries • We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL Example: “Birch”, t=1 “birch//*” The prefix search is answered after reading 1 row and yields p1, p2, a1, c1
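A Python sketch of answering a neighborhood keyword query as a prefix search on keyword//. The rows below are illustrative and omit summary rows; in particular, `publishedPaper` is a hypothetical association name, since the slides do not name the association that links c1 to a1.

```python
# Illustrative rows of an attribute-association index without summary rows.
KIL_ROWS = {
    "birch//authoredPaper//": {"p1": 1, "p2": 1},
    "birch//publishedPaper//": {"c1": 1},  # hypothetical association name
    "birch//title//": {"a1": 1},
    "sigmod//name//": {"c1": 1},
}


def neighborhood_query(rows, keywords):
    """Union every row whose key starts with 'keyword//' for any keyword."""
    hits = set()
    for keyword in keywords:
        prefix = f"{keyword.lower()}//"
        for key, row in rows.items():
            if key.startswith(prefix):
                hits.update(row)
    return hits


print(sorted(neighborhood_query(KIL_ROWS, ["Birch"])))  # ['a1', 'c1', 'p1', 'p2']
```

The attribute row contributes the relevant instance a1, and the association rows contribute the associated instances p1, p2 and c1. With a summary row for birch//, the same answer would come from a single row, as in the t=1 example.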

  40. Experimental Evaluation • Associations between disparate items on the desktop: • LaTeX and BibTeX files • Word documents • PowerPoint presentations • Emails and contacts • Webpages in the web cache • The instances and associations are stored in an RDF file. The size of the file is 52.4 MB

  41. Experimental Evaluation (figure panels: attribute clauses without sub-attributes; attribute clauses with sub-attributes; association clauses)

  42. Observations about the results • Data set: 105,320 objects, 300,354 attributes, 468,402 associations • Predicate query: 15.2 ms • Neighborhood keyword query: 224.3 ms (with no more than 5 keywords) • Answering queries using the KIL was very efficient! • Answering queries with / without sub-attributes consumed a similar amount of time • Effectiveness of hybrid indexing

  43. Comparison of methods Compared with KIL (on average): • The Naïve method • query-answering time increased by a factor of 15.9 • XML Index (SepIL): • query-answering time increased by a factor of 2

  44. Conclusions Main Contributions: • An indexing method that is designed to support flexible querying over dataspaces • Extended inverted lists to capture both text and structure of data Future Work • Extend the index to support value heterogeneity and investigate appropriate ranking algorithms

  45. THE END Questions ?
