Computer Science and Engineering

Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
Chengyuan Zhang1, Ying Zhang1, Wenjie Zhang1, Xuemin Lin2,1
1 The University of New South Wales, Australia
2 East China Normal University

Background
• An enormous amount of spatio-textual objects available in many applications
  • online local search, e.g., online yellow pages
  • social network services, e.g., Facebook, Flickr

Example (query keywords: pizza, coffee)
• p1 (pizza, coffee, sushi)
• p2 (pizza, coffee, steak)
• p3 (pizza, sushi)
• p4 (coffee, sushi)
• p5 (pizza, steak, seafood)

Top k spatial keyword search (TOPK-SK)
Data
• A set of spatio-textual objects
• Each object is represented by a location and a set of keywords
Query
• Query location (q.loc)
• A set of query keywords (q.T)
Answer
• The closest k objects, each of which contains all query keywords
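The TOPK-SK definition above can be sketched as a brute-force scan. This is only an illustration of the query semantics, not the indexing method of the talk; the function name `topk_sk`, the dictionary layout, and the coordinates are assumptions made for the example (the keyword sets are taken from the restaurant example).

```python
import heapq
import math

def topk_sk(objects, q_loc, q_keywords, k):
    """Return the ids of the k closest objects that contain every query keyword."""
    q_keywords = set(q_keywords)
    matches = (
        (math.dist(loc, q_loc), oid)          # Euclidean distance to q.loc
        for oid, (loc, keywords) in objects.items()
        if q_keywords <= set(keywords)        # AND semantics: all keywords present
    )
    return [oid for _, oid in heapq.nsmallest(k, matches)]

# Hypothetical coordinates; keyword sets follow the example objects p1..p5.
objects = {
    "p1": ((1.0, 1.0), {"pizza", "coffee", "sushi"}),
    "p2": ((4.0, 4.0), {"pizza", "coffee", "steak"}),
    "p3": ((0.5, 0.5), {"pizza", "sushi"}),
    "p4": ((2.0, 0.0), {"coffee", "sushi"}),
    "p5": ((3.0, 1.0), {"pizza", "steak", "seafood"}),
}

nearest = topk_sk(objects, (0.0, 0.0), {"pizza", "coffee"}, k=1)
```

Here p3 is the closest object to the query point but lacks "coffee", so it is skipped. The scan touches every object, which is the cost the index structures on the following slides try to avoid.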

Naïve Approach
Running Example
• 11 spatio-textual objects
• Vocabulary {t1, t2, t3}
• Query q with q.T = {t1, t2} and k = 1
Objects: p1 (t1,t2), p2 (t1,t2), p3 (t1,t3), p4 (t1), p5 (t2,t3), p6 (t2,t3), p7 (t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)
Objects Accessed: p3, p4, p7, p8, p5, p1!

Inverted R-tree [Y. Zhou, et al., CIKM 2005]
For each keyword t, construct an R-tree for objects containing t
Example: k=1, q.T={t1, t2}
• R-trees R1 (t1), R2 (t2), R3 (t3), each with nodes E1 and E2
Objects Accessed: p3, p4, p5, p1!

IR2-tree [I. D. Felipe, et al., ICDE 2008]
Index Structure
• Combination of an R-tree and signature technique
• Each node contains a rectangle and a signature (a fixed-length bitmap)
• Each word is hashed to a particular bit
• The signature of a node is the bitwise OR of all the signatures of its child nodes
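The signature idea can be sketched as follows. This is a minimal illustration, not the IR2-tree implementation: the names `BIT_OF`, `signature`, `node_signature`, and `may_contain` are hypothetical, and a fixed lookup table stands in for the hash function so that t2 and t3 deliberately collide on the same bit.

```python
# Assumed word-to-bit mapping for a 2-bit signature: t1 -> 10, t2 -> 01, t3 -> 01.
BIT_OF = {"t1": 1, "t2": 0, "t3": 0}

def signature(words):
    """Bitmap with one bit set per word (colliding words share a bit)."""
    sig = 0
    for word in words:
        sig |= 1 << BIT_OF[word]
    return sig

def node_signature(child_signatures):
    """Signature of a node: bitwise OR of its children's signatures."""
    sig = 0
    for child in child_signatures:
        sig |= child
    return sig

def may_contain(sig, query_sig):
    """True if every query bit is set; this can be a false positive."""
    return (sig & query_sig) == query_sig

query = signature({"t1", "t2"})
p3 = signature({"t1", "t3"})    # passes the filter without containing t2
p4 = signature({"t1"})          # filtered out
parent = node_signature([signature({"t1"}), signature({"t2"})])
```

Because t3 shares a bit with t2, an object with keywords (t1, t3) passes the filter for the query {t1, t2} and must be fetched before it can be rejected. A parent whose children separately hold t1 and t2 also passes, even though no single child satisfies the query.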

Example
k=1, q.T={t1, t2}
Signatures: t1 = 10, t2 = 01, t3 = 01
Objects Accessed: p3, p4, p7, p1!
False positive!
[Figure: IR2-tree over the running example with nodes E1 to E12]

Observations
Cost model: n = s*p
• n: number of objects accessed
• s: number of objects within the search region
• p: average probability that an object is accessed
Naïve approach
• Disadvantages: all objects in the search region are accessed (large s and p=1)
Inverted R-tree
• Advantages: excludes unrelated objects (small s)
• Disadvantages: cannot take advantage of AND semantics (p=1)
IR2-tree
• Advantages: has a filtering technique to reduce p
• Disadvantages: large s, and p is affected by non-related objects
Other Single Augmented R-trees
• Other spatial keyword search: KR tree [R. Hariharan, et al., SSDBM 2007], WIR tree [D. Wu, et al., TKDE 2011]
• Spatial keyword ranking query: IR tree [G. Cong, et al., PVLDB 2009], CM-CDIR tree [D. Wu, et al., VLDBJ 2012]
• Their shortcomings: same as IR2-tree

Motivation
Index structure
• has a small number of objects within the search region
• can prune objects within the search region
Properties
• falls in the category of inverted index
• exploits the AND semantics
• adapts to the distribution of the objects for each keyword

Motivation
Signature of a region regarding a keyword: 1 if the region is non-empty, 0 if it is empty
Example
• Objects: p1: t1; p2: t1, t2; p3: t2
• Query keywords: t1, t2
[Figure: per-region signature bits for t1 and t2]

Linear Quadtree
Structure
• Regular space partition based indexing
• Each node can be identified by its split sequence (Morton code, a.k.a. Z-order)
• A circle denotes a non-leaf node and a square denotes a leaf node
• A leaf node is black if it is not empty; otherwise it is a white leaf node
• Keep the black leaf nodes (B+ tree)
Example codes: NE → 1100; SW, SE → 0001
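A split sequence can be turned into a Morton code by interleaving coordinate bits. The sketch below assumes the quadrant numbering SW = 00, SE = 01, NW = 10, NE = 11 (inferred from the example "SW, SE 0001"), integer cell coordinates with x growing east and y growing north, and a fixed depth; `morton_code` and `code_string` are hypothetical helper names.

```python
def morton_code(x, y, depth):
    """Interleave the bits of cell (x, y): one (y_bit, x_bit) pair per level."""
    code = 0
    for level in reversed(range(depth)):   # most significant split first
        y_bit = (y >> level) & 1           # 1 = north half at this level
        x_bit = (x >> level) & 1           # 1 = east half at this level
        code = (code << 2) | (y_bit << 1) | x_bit
    return code

def code_string(x, y, depth):
    """Morton code as a zero-padded binary string, two bits per level."""
    return format(morton_code(x, y, depth), "0{}b".format(2 * depth))

# Cells of a 4x4 grid (depth 2), sorted into Z-order.
cells = [(1, 0), (2, 2), (0, 0), (3, 3)]
z_order = sorted(cells, key=lambda c: morton_code(c[0], c[1], 2))
```

With depth 2, the cell reached by splitting SW and then SE is (x=1, y=0), and `code_string(1, 0, 2)` gives "0001". Sorting by the code yields a one-dimensional key, which is what allows the black leaf nodes to be stored in a B+ tree.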

IL-Quadtree
• For each keyword ti ∈ V we build a linear quadtree, denoted by LQi, for the objects which contain the keyword ti
• Besides the black leaf nodes, we also keep a signature for each quadtree node: 1 for black leaf nodes and non-leaf nodes, and 0 otherwise
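How per-keyword node signatures support AND pruning can be sketched as below. This is a simplified illustration under several assumptions: a fixed-depth quadtree (no split threshold), signatures held in Python sets rather than a B+ tree, and no distance ordering. The names `cell_code`, `build_signatures`, and `candidate_leaves`, and the coordinates, are hypothetical.

```python
from collections import defaultdict

DEPTH = 2  # assumed fixed depth for this sketch

def cell_code(x, y, depth=DEPTH):
    """Binary split sequence of cell (x, y): (y_bit, x_bit) per level."""
    bits = []
    for level in reversed(range(depth)):
        bits.append(str((y >> level) & 1))
        bits.append(str((x >> level) & 1))
    return "".join(bits)

def build_signatures(objects, depth=DEPTH):
    """Per keyword, the set of node codes whose signature is 1."""
    sig = defaultdict(set)
    for (x, y), keywords in objects.values():
        code = cell_code(x, y, depth)
        for t in keywords:
            for level in range(depth + 1):
                sig[t].add(code[: 2 * level])   # "" is the root node
    return sig

def candidate_leaves(sig, query_keywords, depth=DEPTH):
    """Leaves whose signature is 1 for every query keyword on the whole path."""
    result, stack = [], [""]
    while stack:
        node = stack.pop()
        if not all(node in sig.get(t, ()) for t in query_keywords):
            continue                            # some keyword has bit 0: prune subtree
        if len(node) == 2 * depth:
            result.append(node)
        else:
            stack.extend(node + quad for quad in ("00", "01", "10", "11"))
    return sorted(result)

# Hypothetical coordinates for three objects: p1: t1; p2: t1, t2; p3: t2.
objects = {
    "p1": ((0, 0), {"t1"}),
    "p2": ((3, 3), {"t1", "t2"}),
    "p3": ((0, 3), {"t2"}),
}
sig = build_signatures(objects)
leaves = candidate_leaves(sig, {"t1", "t2"})
```

Only the region holding p2 survives for the query {t1, t2}; the regions of p1 and p3 are pruned at the first level without touching the objects. A surviving leaf still needs verification, since two different objects in the same cell could each supply one of the keywords.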

Search Algorithm
Example: k=1, q.T={t1, t2}
Objects Accessed: p4, p1!

Direction-aware spatial keyword search [G. Li, et al., ICDE 2012]
• Data
  • A set of spatio-textual objects
  • Each object has a location and a set of keywords
• Query
  • A location (q.loc)
  • A set of query keywords (q.T)
  • A direction [, ]
• Answer
  • The closest k objects, each of which contains all keywords in q.T and lies in the search direction

Spatial Keyword Based Ranking [G. Cong, et al., PVLDB 2009, VLDBJ 2012]
Query
• Spatial location
• Query keywords
Returns the k best objects ranked by
• Spatial distance to the query location
• Textual relevance to the query keywords
Spatio-textual ranking score
• The spatial proximity (δ) is the normalized Euclidean distance between p and q
• The textual relevance (θ) is the tf-idf based textual similarity between the description of p and the query keywords
Our Solution
• The maximal keyword weight replaces the bit signature (aggregate inverted linear quadtree)
• The spatial distance ranking function is replaced by the spatio-textual ranking score function
• Score based pruning uses the weight and region of the quadtree node
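Score based pruning can be sketched as follows. The slide does not reproduce the score formula, so this example assumes a linear combination with a weight `alpha` in which lower scores are better; that form, the parameter `alpha`, and the function names are assumptions for illustration only, not the exact function used in the talk.

```python
def ranking_score(delta, theta, alpha=0.5):
    """Assumed score: weighted sum of distance and textual dissimilarity.

    delta: normalized spatial distance in [0, 1] (smaller is closer)
    theta: textual relevance in [0, 1] (larger is more relevant)
    """
    return alpha * delta + (1 - alpha) * (1 - theta)

def node_lower_bound(min_delta, max_theta, alpha=0.5):
    """Best score any object under a node can reach.

    min_delta: normalized minimum distance from the query to the node's region
    max_theta: maximal keyword weight stored for the node
    """
    return ranking_score(min_delta, max_theta, alpha)

def can_prune(min_delta, max_theta, kth_best_score, alpha=0.5):
    """A node is skipped when even its best case cannot beat the current k-th result."""
    return node_lower_bound(min_delta, max_theta, alpha) >= kth_best_score

far_weak_node = can_prune(min_delta=0.6, max_theta=0.5, kth_best_score=0.3)
near_strong_node = can_prune(min_delta=0.1, max_theta=0.9, kth_best_score=0.3)
```

Storing the maximal keyword weight per node, in place of a single bit, is what makes the bound computable: the region gives the smallest possible distance and the weight gives the largest possible relevance.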

Experimental Setting
Implemented in Java
Debian Linux
• Intel Xeon 2.40GHz dual CPU
• 4 GB memory
Dataset
• GN: US Board on Geographic Names
• Tigers, Cars:
  • Spatial datasets from Rtree-Portal
  • Textual content from 20 Newsgroups
• SYN: synthetic dataset
Query (1000): location, #l query keywords
Evaluate: response time and # I/O

Important Statistics Parameters evaluated

Tuning
• w’: Minimal depth of the black leaf node
• c: The split threshold
Best performance:
• w’ = 8 and c = 64

l: The number of query keywords
Grid: [M. Christoforaki, et al., CIKM 2011]
Grid+SIG: the extension of Grid, utilizing the signature technique

Algorithms Evaluated
• ILQ: Inverted Linear Quadtree based techniques
• IVR: Inverted R-tree [Y. Zhou, et al., CIKM 2005]
• MIR2: [I. D. Felipe, et al., ICDE 2008]
• KR: [R. Hariharan, et al., SSDBM 2007]
• WIR: [D. Wu, et al., TKDE 2011]
• IR: [G. Cong, et al., PVLDB 2009]
• CM-CDIR: [D. Wu, et al., VLDBJ 2012]

Evaluation on different datasets

Comparison – Varying l

Comparison – Varying k

Comparison – Varying Parameters

Conclusion
• Identify important properties of indexing techniques to support top k spatial keyword search
• Propose the inverted linear quadtree structure to efficiently support top k spatial keyword search
• Extensive experiments on both real and synthetic data
Future work
• Enhance the region based signature technique: group objects to reduce false positives
• Support top k spatial keyword search on other metric spaces

Thank you!

Spatial Keyword Ranking Query
• Our Algorithm
  • Aggregate ILQ
• Compare with
  • IR [G. Cong, et al., PVLDB 2009]
  • CM-CDIR [D. Wu, et al., VLDBJ 2012]
• Dataset: Tiger

Direction-Aware TOPK-SK Query
• Our Algorithm
  • ILQ
• Compare with
  • DESKS [G. Li, et al., ICDE 2012]

Comparison – Varying k

IR-Tree

KR* Tree
