
Computer Science and Engineering

Computer Science and Engineering. Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search. Chengyuan Zhang (1), Ying Zhang (1), Wenjie Zhang (1), Xuemin Lin (2,1). (1) The University of New South Wales, Australia; (2) East China Normal University.


Presentation Transcript


  1. Computer Science and Engineering. Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search. Chengyuan Zhang (1), Ying Zhang (1), Wenjie Zhang (1), Xuemin Lin (2,1). (1) The University of New South Wales, Australia; (2) East China Normal University

  2. Background • An enormous number of spatio-textual objects are available in many applications • online local search, e.g., online yellow pages • social network services, e.g., Facebook, Flickr

  3. [Figure: example spatio-textual objects on a map] p1 (pizza, coffee, sushi) • p2 (pizza, coffee, steak) • p3 (pizza, sushi) • p4 (coffee, sushi) • p5 (pizza, steak, seafood) • Query keywords: pizza, coffee

  4. Top k spatial keyword search (TOPK-SK). Data • A set of spatio-textual objects • Each object is represented by a location and a set of keywords. Query • Query location (q.loc) • A set of query keywords (q.T). Answer • The closest k objects, each of which contains all query keywords
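The TOPK-SK definition above can be sketched as a brute-force scan. This is only an illustration of the query semantics, not an algorithm from the talk: the `SpatialObject` type, the function name `topk_sk_naive`, and all coordinates are assumptions; the keyword sets are those of the restaurant example.

```python
import math
from typing import NamedTuple


class SpatialObject(NamedTuple):
    name: str
    loc: tuple            # (x, y) location
    keywords: frozenset   # keywords attached to the object


def topk_sk_naive(objects, q_loc, q_keywords, k):
    """Return the k closest objects that contain ALL query keywords (AND semantics)."""
    q_keywords = frozenset(q_keywords)
    # Visit objects in order of increasing Euclidean distance to the query location.
    by_distance = sorted(objects, key=lambda o: math.dist(o.loc, q_loc))
    result = []
    for obj in by_distance:
        if q_keywords <= obj.keywords:   # object contains every query keyword
            result.append(obj)
            if len(result) == k:
                break
    return result


# Keyword sets from the example; the coordinates are hypothetical.
objects = [
    SpatialObject("p1", (1.0, 1.0), frozenset({"pizza", "coffee", "sushi"})),
    SpatialObject("p2", (3.0, 3.0), frozenset({"pizza", "coffee", "steak"})),
    SpatialObject("p3", (0.5, 0.5), frozenset({"pizza", "sushi"})),
    SpatialObject("p4", (2.0, 0.0), frozenset({"coffee", "sushi"})),
    SpatialObject("p5", (4.0, 1.0), frozenset({"pizza", "steak", "seafood"})),
]

top1 = topk_sk_naive(objects, (0.0, 0.0), {"pizza", "coffee"}, k=1)
top2 = topk_sk_naive(objects, (0.0, 0.0), {"pizza", "coffee"}, k=2)
```

With these made-up coordinates p3 is nearest to the query but lacks "coffee", so it is skipped; the scan still has to look at it, which is the cost the index structures in the following slides try to avoid.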

  5. Naïve Approach. Running example • 11 spatio-textual objects • Vocabulary {t1, t2, t3} • Query q with q.T = {t1, t2} and k = 1. Objects: p1 (t1,t2), p2 (t1,t2), p3 (t1,t3), p4 (t1), p5 (t2,t3), p6 (t2,t3), p7 (t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2). Objects accessed: p3, p4, p7, p8, p5, p1 (p1 is the first accessed object that contains both t1 and t2)

  6. Inverted R-tree [Y. Zhou, et al., CIKM 2005]. For each keyword t, construct an R-tree for the objects containing t. Example with k = 1, q.T = {t1, t2}. [Figure: R-trees R1 (t1), R2 (t2), R3 (t3), each with entries E1 and E2.] Objects accessed: p3, p4, p5, p1
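The effect of keeping one index per keyword can be sketched with plain inverted lists. This is a simplification under stated assumptions: a sorted candidate list stands in for the per-keyword R-trees, the names `build_inverted_lists` and `topk_inverted` are made up, and the coordinates are hypothetical, chosen only so that the distance order matches the running example.

```python
import math
from collections import defaultdict

# Keyword sets from the running example; coordinates are hypothetical and
# chosen so that the distance order from q = (0, 0) is p3, p4, p7, p8, p5, p1, ...
OBJECTS = {
    "p3": ((1, 0), {"t1", "t3"}), "p4": ((2, 0), {"t1"}),
    "p7": ((3, 0), {"t3"}),       "p8": ((4, 0), {"t3"}),
    "p5": ((5, 0), {"t2", "t3"}), "p1": ((6, 0), {"t1", "t2"}),
    "p2": ((7, 0), {"t1", "t2"}), "p6": ((8, 0), {"t2", "t3"}),
    "p9": ((9, 0), {"t2"}),       "p10": ((10, 0), {"t1"}),
    "p11": ((11, 0), {"t2"}),
}


def build_inverted_lists(objects):
    """Map each keyword to the names of the objects that contain it."""
    lists = defaultdict(list)
    for name, (_loc, keywords) in objects.items():
        for t in keywords:
            lists[t].append(name)
    return lists


def topk_inverted(objects, lists, q_loc, q_keywords, k):
    """Only objects containing at least one query keyword are candidates."""
    candidates = set()
    for t in q_keywords:
        candidates.update(lists.get(t, []))
    accessed, result = [], []
    for name in sorted(candidates, key=lambda n: math.dist(objects[n][0], q_loc)):
        accessed.append(name)
        if set(q_keywords) <= objects[name][1]:   # AND semantics checked per object
            result.append(name)
            if len(result) == k:
                break
    return result, accessed


lists = build_inverted_lists(OBJECTS)
result, accessed = topk_inverted(OBJECTS, lists, (0, 0), ["t1", "t2"], k=1)
```

Under these assumptions the objects that contain only t3 (p7, p8) are never touched, but every candidate containing t1 or t2 that lies closer than the answer is still accessed.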

  7. IR2-tree [I. D. Felipe, et al., ICDE 2008]. Index structure • Combination of an R-tree and the signature technique • Each node contains a rectangle and a signature (a fixed-length bitmap) • Each word is hashed to a particular bit • The signature of a node is the bitwise OR of the signatures of all its child nodes
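The signature idea can be sketched in a few lines. This is a minimal illustration, not the IR2-tree implementation: the use of CRC32 as the hash, the signature width `num_bits`, and all function names are assumptions made for the example.

```python
import zlib


def word_bit(word, num_bits):
    """Hash a word to a single bit position (CRC32 is an arbitrary, deterministic choice)."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % num_bits)


def signature(words, num_bits):
    """Signature of an object: OR of the bits of all its words."""
    sig = 0
    for w in words:
        sig |= word_bit(w, num_bits)
    return sig


def node_signature(child_signatures):
    """Signature of a node: bitwise OR of the signatures of its children."""
    sig = 0
    for s in child_signatures:
        sig |= s
    return sig


def may_contain_all(sig, query_sig):
    """True if every bit of the query signature is set; may be a false positive."""
    return (sig & query_sig) == query_sig


query = signature({"t1", "t2"}, num_bits=8)
# An object that really contains all query keywords always passes (no false negatives).
real_match = may_contain_all(signature({"t1", "t2", "t3"}, num_bits=8), query)
# With a 1-bit signature every word collides, so an unrelated object also passes.
false_positive = may_contain_all(signature({"t3"}, num_bits=1),
                                 signature({"t1", "t2"}, num_bits=1))
```

The last line shows why the filter is only a filter: when unrelated words hash to the same bits, the test succeeds and the object still has to be fetched and checked.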

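  8. Example (IR2-tree) with k = 1, q.T = {t1, t2}. Signatures: t1 = 10, t2 = 01, t3 = 01. False positive! Objects accessed: p3, p4, p7, p1. [Figure: IR2-tree over nodes E1 to E12 with objects p1 and p5 labeled; the tree layout is not recoverable from the transcript.]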

  9. Observations. Cost model: n = s*p, where n is the number of objects accessed, s is the number of objects within the search region, and p is the average probability that an object is accessed. Naïve approach • Disadvantages: all objects in the search region are accessed (large s and p = 1). Inverted R-tree • Advantages: excludes unrelated objects (small s) • Disadvantages: cannot take advantage of the AND semantics (p = 1). IR2-tree • Advantages: has a filtering technique to reduce p • Disadvantages: large s, and p is affected by non-related objects. Other single augmented R-trees • Other spatial keyword search: KR tree [R. Hariharan, et al., SSDBM 2007], WIR tree [D. Wu, et al., TKDE 2011] • Spatial keyword ranking query: IR tree [G. Cong, et al., PVLDB 2009], CM-CDIR tree [D. Wu, et al., VLDBJ 2012] • Their shortcomings: same as IR2-tree

  10. Motivation. Index structure • has a small number of objects within the search region • can prune objects within the search region. Properties • falls in the category of inverted indexes • exploits the AND semantics • adapts to the distribution of the objects for each keyword

  11. Motivation. Signature of a region regarding a keyword: 1 if the region is non-empty for that keyword, 0 if it is empty. Example objects: p1: t1 • p2: t1, t2 • p3: t2. Query keywords: t1, t2. [Figure: signature bits of t1 and t2 for each region.]

  12. Linear Quadtree. Structure • Regular space-partition-based indexing • Each node can be identified by its split sequence (Morton code, a.k.a. Z-order) • In the figure, a circle denotes a non-leaf node and a square denotes a leaf node • A leaf node is black if it is not empty; otherwise it is a white leaf node • Only the black leaf nodes are kept (in a B+ tree). [Figure labels: NE = 1100; SW, SE = 0001]
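How a split sequence turns into a code can be sketched as follows. The quadrant encoding (SW = 00, SE = 01, NW = 10, NE = 11) and the zero padding to a fixed length are assumptions inferred from the two figure labels; the function names and the unit-square domain are hypothetical.

```python
def split_sequence(x, y, depth, x0=0.0, y0=0.0, size=1.0):
    """Code of the depth-`depth` quadtree cell containing (x, y).

    Each level appends two bits: north/south first, then east/west,
    so SW = "00", SE = "01", NW = "10", NE = "11" (assumed encoding).
    """
    bits = []
    for _ in range(depth):
        size /= 2.0
        east = x >= x0 + size
        north = y >= y0 + size
        bits.append(("1" if north else "0") + ("1" if east else "0"))
        # Move the origin to the lower-left corner of the chosen quadrant.
        if east:
            x0 += size
        if north:
            y0 += size
    return "".join(bits)


def pad_code(code, max_depth):
    """Pad a shallow node's code with zeros so all codes share one length."""
    return code.ljust(2 * max_depth, "0")


ne = pad_code(split_sequence(0.9, 0.9, depth=1), max_depth=2)   # NE at depth 1
sw_se = split_sequence(0.3, 0.1, depth=2)                       # SW, then SE

# Sorting black leaves by code visits them in Z-order; a sorted list stands in
# for the B+ tree here.
leaves_in_z_order = sorted([ne, sw_se, split_sequence(0.1, 0.6, depth=2)])
```

Because the codes are totally ordered, any ordered one-dimensional structure can hold the black leaves. A practical key would probably also need to record the node depth, since padding alone cannot distinguish a node from its first descendant.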

  13. IL-Quadtree • For each keyword ti ∈ V we build a linear quadtree, denoted by LQi, for the objects that contain the keyword ti • Besides the black leaf nodes, we also keep the quadtree node information (signature): 1 for black leaf nodes and non-leaf nodes, and 0 otherwise
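A rough sketch of this structure, assuming the unit square as the data space and string codes with SW = 00, SE = 01, NW = 10, NE = 11. It is not the authors' implementation: the function names, the set-based signature, the default split threshold, and the coordinates of p1 to p3 are all assumptions; only the keyword sets come from the motivation example.

```python
from collections import defaultdict


def build_linear_quadtree(points, c=1, max_depth=8):
    """Build one quadtree over (name, x, y) points in the unit square.

    Returns (black_leaves, signature): black_leaves maps a node code to its
    points, and signature is the set of codes whose bit is 1 (non-leaf nodes
    and black leaves). Codes absent from the set have signature 0.
    """
    black_leaves, signature = {}, set()

    def insert(code, x0, y0, size, pts):
        if not pts:
            return                      # white leaf: nothing is stored
        signature.add(code)
        if len(pts) <= c or len(code) // 2 >= max_depth:
            black_leaves[code] = pts    # non-empty leaf
            return
        half = size / 2.0
        children = defaultdict(list)
        for name, x, y in pts:
            children[(y >= y0 + half, x >= x0 + half)].append((name, x, y))
        for (north, east), child_pts in children.items():
            child_code = code + ("1" if north else "0") + ("1" if east else "0")
            insert(child_code, x0 + half * east, y0 + half * north, half, child_pts)

    insert("", 0.0, 0.0, 1.0, list(points))
    return black_leaves, signature


def build_il_quadtree(objects, c=1, max_depth=8):
    """One linear quadtree per keyword, over the objects containing that keyword."""
    by_keyword = defaultdict(list)
    for name, x, y, keywords in objects:
        for t in keywords:
            by_keyword[t].append((name, x, y))
    return {t: build_linear_quadtree(pts, c, max_depth) for t, pts in by_keyword.items()}


def node_may_match(index, code, query_keywords):
    """AND semantics: a node survives only if its signature is 1 for every keyword."""
    return all(t in index and code in index[t][1] for t in query_keywords)


# Keyword sets from the motivation example; coordinates are hypothetical.
OBJECTS = [("p1", 0.1, 0.1, {"t1"}), ("p2", 0.8, 0.8, {"t1", "t2"}), ("p3", 0.8, 0.2, {"t2"})]
index = build_il_quadtree(OBJECTS)
```

With these coordinates, only the NE quadrant (code "11") has signature 1 for both t1 and t2, so a query on {t1, t2} can discard the SW and SE quadrants without reading any object in them.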

  14. Search Algorithm. Example with k = 1, q.T = {t1, t2}. Objects accessed: p4, p1

  15. Direction-aware spatial keyword search [G. Li, et al., ICDE 2012] • Data • A set of spatio-textual objects • Each object has a location and a set of keywords • Query • A location (q.loc) • A set of query keywords (q.T) • A direction [, ] • Answer • The closest k objects, each of which contains all keywords in q.T and lies in the search direction

  16. Spatial Keyword Based Ranking [G. Cong, et al., PVLDB 2009, VLDBJ 2012]. Query • Spatial location • Query keywords. Returns the k best objects ranked by • Spatial distance to the query location • Textual relevance to the query keywords. Spatio-textual ranking score • The spatial proximity (δ) is the normalized Euclidean distance between p and q • The textual relevance (θ) is the tf-idf based textual similarity between the description of p and the query keywords. Our solution • The maximal keyword weight replaces the bit signature (aggregate inverted linear quadtree) • The spatial distance ranking function is replaced by the spatio-textual ranking score function • Score-based pruning uses the weight and region of the quadtree node
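The slide names the two components of the score but not how they are combined. The sketch below assumes a linear combination with a hypothetical weight `alpha`, where a lower score is better; the node-level bound likewise assumes that each quadtree node stores the maximal textual relevance of the objects below it. All names and numbers are illustrative.

```python
import math


def ranking_score(delta, theta, alpha=0.5):
    """Assumed score: lower is better. delta = normalized distance, theta = relevance in [0, 1]."""
    return alpha * delta + (1.0 - alpha) * (1.0 - theta)


def min_dist_to_cell(q_loc, x0, y0, size):
    """Smallest Euclidean distance from the query point to a square cell."""
    dx = max(x0 - q_loc[0], 0.0, q_loc[0] - (x0 + size))
    dy = max(y0 - q_loc[1], 0.0, q_loc[1] - (y0 + size))
    return math.hypot(dx, dy)


def node_lower_bound(q_loc, cell, max_weight, max_dist, alpha=0.5):
    """Best score any object in the cell could reach: closest point, maximal weight."""
    delta_min = min(min_dist_to_cell(q_loc, *cell) / max_dist, 1.0)
    return ranking_score(delta_min, max_weight, alpha)


q = (0.0, 0.0)
max_dist = math.sqrt(2.0)            # diagonal of the unit square, used for normalization
cell = (0.5, 0.5, 0.5)               # NE quadrant: origin (0.5, 0.5), side 0.5
bound = node_lower_bound(q, cell, max_weight=0.8, max_dist=max_dist)

# A hypothetical object inside that cell with relevance 0.6 cannot beat the bound.
obj_score = ranking_score(math.dist((0.75, 0.75), q) / max_dist, theta=0.6)
```

Under these assumptions a node can be skipped whenever its bound is no better than the score of the current k-th best object, which is the kind of score-based pruning the slide refers to.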

  17. Experimental Setting. Implemented in Java. Debian Linux • Intel Xeon 2.40 GHz dual CPU • 4 GB memory. Datasets • GN: US Board on Geographic Names • Tiger, Cars: spatial datasets from Rtree-Portal, textual content from 20 Newsgroups • SYN: synthetic dataset. Queries (1000): a location and l query keywords. Evaluated: response time and number of I/Os

  18. Important Statistics Parameters evaluated

  19. Tuning • w': minimal depth of the black leaf node • c: the split threshold. Best performance: • w' = 8 and c = 64

  20. l: the number of query keywords. Grid: [M. Christoforaki, et al., CIKM 2011]. Grid+SIG: the extension of Grid that utilizes the signature technique

  21. Algorithms Evaluated. ILQ • Inverted Linear Quadtree based techniques. IVR • Inverted R-tree [Y. Zhou, et al., CIKM 2005]. MIR2 • [I. D. Felipe, et al., ICDE 2008]. KR • [R. Hariharan, et al., SSDBM 2007]. WIR • [D. Wu, et al., TKDE 2011]. IR • [G. Cong, et al., PVLDB 2009]. CM-CDIR • [D. Wu, et al., VLDBJ 2012]

  22. Evaluation on different datasets

  23. Comparison – Varying l

  24. Comparison – Varying k

  25. Comparison – Varying Parameters

  26. Conclusion • Identified important properties of indexing techniques that support top k spatial keyword search • Proposed the inverted linear quadtree structure to efficiently support top k spatial keyword search • Extensive experiments on both real and synthetic data. Future work • Enhance the region-based signature technique: group objects to reduce false positives • Support top k spatial keyword search on other metric spaces

  27. Thank you!

  28. Spatial Keyword Ranking Query • Our Algorithm • Aggregate ILQ • Compare with • IR [G. Cong, et al., PVLDB 2009] • CM-CDIR [D. Wu, et al., VLDBJ 2012] • Dataset: Tiger

  29. Direction-Aware TOPK-SK Query • Our Algorithm • ILQ • Compare with • DESKS [G. Li, et al., ICDE 2012]

  30. Comparison – Varying k

  31. IR-Tree

  32. KR* Tree
