Location-Based Services & Continuous kNN Query Processing

Location-Based Services & Continuous kNN Query Processing. Tai Do, Data Systems Group, UCF, Fall 2005

Outline • Introduction to Data Management in Mobile Computing. • Discussion of Location-Based Services and their enabling technologies. • In-depth discussion of Continuous kNN queries (two recent papers: [MHP05] and [XMA05]).

Data Management in Mobile Computing • Our interest: application-driven research that involves data management in mobile computing. • Services/applications that inspire data management research: • Location-based services, Transactional services, Data mining applications. • Research problems to support these novel services efficiently: • Spatiotemporal Query Processing. • Data dissemination over limited bandwidth channels. • Data consistency guarantees. • Advanced interfaces for mobile computers.

Location-Based Services (LBS) • Location-based services can be defined as services that integrate a mobile device’s location or position with other information so as to provide added value to a user. • Examples: • Military and government sectors • Emergency services (E911 in the US and 112 in Europe) • Commercial sector: Advanced Traveler Information Systems (DoT), location-aware games, advertising services • Commercial potential of LBS ([S03]): • Optimistic prediction: $4B by 2002, $81.9B by 2005 (Europe only) • Pessimistic prediction: $11M by 2002, $167M by 2005 (USA only) • Enabling technologies: • Mobile positioning methods • Location update techniques • Location-based query processing

Mobile Positioning • GPS (Global Positioning System): accuracy of roughly 3 meters or coarser. • Cell-ID (Europe): accuracy of 100 m to 3 km. • Figure: overview of LBS applications and the level of accuracy required ([SV04]).

Location Update Techniques • Dead-Reckoning Location Update Policies ([GS05]) • Periodic Updates

Concept of Uncertainty • Uncertainty is an inherent feature of databases that store location information. • Sources of uncertainty: • Mobile positioning methods • Location update techniques • Capturing uncertainty in the data model and the query language is ongoing research.

Location-Based Queries • Two kinds of location-based queries: • Snapshot queries: “Tell me the 3 nearest cars around me now.” • Continuous queries: “Monitor the 3 nearest restaurants around me for the next 10 minutes.” • We focus on continuous kNN (CkNN) query processing. • Main-memory solution: Conceptual Partitioning Model, CPM [MHP05]. • Disk-based solution: Shared Execution Algorithm, SEA-CNN [XMA05].

Common Assumptions • Parameters and their possible values: • Underlying network: unconstrained (Euclidean) or transportation network (shortest path). • Movement pattern: unpredictable or trajectory. • Location update: query-aware (safe region) or query-blind (periodic or dead reckoning). • Mutability: moving queries over static objects, static queries over moving objects, or moving queries over moving objects. • Processing type: distributed or centralized. • Storage: disk-resident or main memory.

SEA-CNN: Overview • Overview: • Objects are stored on disk; everything else is in memory. • Centralized processing. • Supports all kinds of mutability between objects and queries. • No assumed movement pattern; objects move in open space. • Goal: • Minimize I/O cost and CPU time. • Two important features: • Incremental evaluation of queries • Shared execution

SEA-CNN: Data Structures

SEA-CNN: Incremental Search • Key points: • For each query q, define a search region based on the past answer and the recent movements of q and the objects. • Only objects inside the search region are checked against q. • Let q.AR_t0 be the answer radius of q at time t0 (q.AR = distance from q to its kth-NN object). At time t1, the search radius of q (q.SR_t1) is computed as follows: • Step 1: check whether any object moves in q.AR_t0 during [t0, t1]. If yes, q.SR_t1 = q.AR_t0. If no, q.SR_t1 = 0. • Step 2: check whether any object that was in q.AR_t0 moves out of q.AR_t0 during [t0, t1]. If yes, q.SR_t1 equals the distance from q to the furthest such object. • Step 3: check whether q moves during [t0, t1]. If yes: • If q.SR_t1 = 0, then q.SR_t1 = q.AR_t0 + |q.Loc_t1 - q.Loc_t0|. • If q.SR_t1 != 0, then q.SR_t1 = q.SR_t1 + |q.Loc_t1 - q.Loc_t0|.
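The three steps above can be sketched in Python as follows. This is a minimal illustration, not the code of [XMA05]: the function name and its parameters are hypothetical, and it assumes the caller has already determined whether an object moved in the answer region and has collected the distances from q to the objects that left it.

```python
import math

def search_radius(ar_t0, q_loc_t0, q_loc_t1, object_moved_in_region, moved_out_dists):
    """Return q.SR at time t1, following Steps 1-3 of the slide.

    ar_t0                  -- answer radius q.AR at time t0
    q_loc_t0, q_loc_t1     -- (x, y) location of the query at t0 and t1
    object_moved_in_region -- True if some object moved in q.AR_t0 (Step 1)
    moved_out_dists        -- distances from q to objects that left q.AR_t0
                              (assumed input; the slide only says "furthest object")
    """
    # Step 1: an object moved in the old answer region.
    sr = ar_t0 if object_moved_in_region else 0.0
    # Step 2: an old answer object left the region; reach out to the furthest one.
    if moved_out_dists:
        sr = max(moved_out_dists)
    # Step 3: the query itself moved; enlarge the radius by the displacement.
    shift = math.dist(q_loc_t0, q_loc_t1)
    if shift > 0:
        sr = (ar_t0 if sr == 0 else sr) + shift
    return sr
```

A returned radius of 0 means that nothing relevant changed, so the query does not need to be re-evaluated in this round.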

SEA-CNN: Incremental Search (An Example) • Q1: O5 and Q1 move during [T0, T1], so Q1.SR_T1 = Q1.AR_T0 + |Q1.Loc_T1 - Q1.Loc_T0|. • Q2: O8 moves out of Q2.AR_T0 during [T0, T1], so Q2.SR_T1 = |Q2.Loc_T1 - O8.Loc_T0|.

SEA-CNN: Shared Execution • Key points: • Utilize shared execution to reduce repeated I/O operations. • Group similar queries together. Evaluating this set of queries is reduced to a spatial join between the objects and the queries.
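One way to picture shared execution is a join between a grid of objects and a table that maps each grid cell to the queries whose search regions touch it. The sketch below is an in-memory illustration under assumed names (`CELL`, `cell_of`, `shared_knn`); a dictionary stands in for the disk-resident object grid, and `reads` counts how many cells would be fetched.

```python
import math
from collections import defaultdict

CELL = 1.0  # side length of a grid cell (assumed value)

def cell_of(p):
    """Grid cell (i, j) that contains point p."""
    return (int(p[0] // CELL), int(p[1] // CELL))

def cells_overlapping(center, radius):
    """Cells intersecting the bounding box of a circular search region (conservative)."""
    x0, y0 = cell_of((center[0] - radius, center[1] - radius))
    x1, y1 = cell_of((center[0] + radius, center[1] + radius))
    return [(i, j) for i in range(x0, x1 + 1) for j in range(y0, y1 + 1)]

def shared_knn(objects, queries, k):
    """objects: {oid: (x, y)}; queries: {qid: ((x, y), search_radius)}.
    Returns ({qid: [oid, ...]}, number_of_cell_reads)."""
    grid = defaultdict(list)            # stands in for the disk-resident grid
    for oid, pos in objects.items():
        grid[cell_of(pos)].append(oid)

    query_table = defaultdict(list)     # cell -> queries whose region touches it
    for qid, (center, radius) in queries.items():
        for c in cells_overlapping(center, radius):
            query_table[c].append(qid)

    candidates = defaultdict(list)
    reads = 0
    for c, qids in query_table.items():  # each relevant cell is read once
        reads += 1
        for oid in grid.get(c, []):
            for qid in qids:             # one read serves every registered query
                center, radius = queries[qid]
                d = math.dist(center, objects[oid])
                if d <= radius:
                    candidates[qid].append((d, oid))
    answers = {qid: [oid for _, oid in sorted(candidates[qid])[:k]] for qid in queries}
    return answers, reads
```

With two nearby queries whose search regions overlap, the number of cell reads is smaller than the sum of the cells each query would touch on its own, which is the saving the slide refers to.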

SEA-CNN: Algorithm

CPM: Overview • Overview: • Objects and queries are stored in memory. • Centralized processing. • Support all kinds of mutability between objects and queries. • No movement pattern, in open space. • Goal: • Minimize CPU time. • Important features: • Conceptual Partitioning • Simulate traditional kNN search (using branch-and-bound search with breadth-first (or best-first) traversal) • Roadmap: • Initial NN Computation (conceptual partitioning + branch and bound search + breadth-first traversal) • Handling Updates

CPM: Data Structures

CPM: NN Computation (Conceptual Partitioning) • Conceptual Partitioning: • What is CP? A partitioning of the cells into rectangles based on their proximity to the query cell. Each rectangle has a direction and a level. • Why CP? It gives a natural processing order of the cells and facilitates NN search (only a minimal set of cells is searched).

CPM: NN Computation (Algorithm by Example) • Search heap content (always sorted): H = {<c4,4, 0>, <U0, 0.1>, <L0, 0.2>, <R0, 0.8>, <D0, 0.9>} • De-heap c4,4: do nothing. • De-heap U0: • Insert the cells of U0. • Insert U1. • Continue until <c3,3, 1> is de-heaped and the 1st candidate p1 is found: best_dist = dist(p1, q) = 1.7. • Continue until c2,4 is de-heaped and p2 is found: best_dist = dist(p2, q) = 1.3. • Terminate because the next entry in the heap has min_dist >= best_dist.
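The heap-driven search and its termination test can be illustrated with a simplified Python sketch. It is not the CPM algorithm itself: CPM en-heaps conceptual rectangles lazily so that far cells are never touched, whereas this sketch simply en-heaps every non-empty-or-empty cell it is given with its min_dist. The function names and the grid layout are assumptions made for illustration.

```python
import heapq
import math

def mindist_cell(q, cell, delta):
    """Minimum distance from point q to the square grid cell (i, j) of side delta."""
    i, j = cell
    dx = max(i * delta - q[0], 0.0, q[0] - (i + 1) * delta)
    dy = max(j * delta - q[1], 0.0, q[1] - (j + 1) * delta)
    return math.hypot(dx, dy)

def grid_nn(q, grid, delta):
    """grid: {(i, j): [(pid, (x, y)), ...]}.
    Returns (best_pid, best_dist, cells_visited)."""
    # Simplification: en-heap all cells up front (CPM would en-heap rectangles lazily).
    heap = [(mindist_cell(q, c, delta), c) for c in grid]
    heapq.heapify(heap)
    best_pid, best_dist, visited = None, math.inf, []
    # Stop as soon as the next entry has min_dist >= best_dist.
    while heap and heap[0][0] < best_dist:
        _, cell = heapq.heappop(heap)
        visited.append(cell)
        for pid, pos in grid[cell]:
            d = math.dist(q, pos)
            if d < best_dist:
                best_pid, best_dist = pid, d
    return best_pid, best_dist, visited
```

The list of visited cells plays the role of the bookkeeping information that CPM keeps per query for later updates.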

CPM: Handling Updates • Key points: • Focus on moving objects and static queries. Moving queries are treated as new queries. • Re-examine only the queries whose influence regions overlap with updated cells. • Re-compute affected queries incrementally, using bookkeeping information to save computation time.

CPM: Handling Updates (Algorithm by Example) • Example: p2 moves from c2,4 to c0,6 (new position p2’). • c2,4 has q in its influence list and dist(q, p2’) > best_dist = dist(q, p2) ⇒ mark q as an affected query. • c0,6 has an empty influence list ⇒ ignore. • Re-compute the NN for q with the NN Re-computation algorithm. • NN Re-computation algorithm: • Input: grid G, affected query q. • Output: new NN for q. • Similar to NN Computation; it utilizes the bookkeeping information in visit_list and the search heap.
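A minimal sketch of the "which queries are affected?" test from the example is given below. The data layout (`influence`, `queries`) and the function name are hypothetical. The first loop mirrors the slide (a current NN moves farther away than best_dist); the second loop, for an object arriving closer than best_dist, is an assumption added for completeness and is not shown on the slide.

```python
import math

def affected_queries(obj_id, new_pos, old_cell, new_cell, influence, queries):
    """influence: {cell: set of query ids};
    queries: {qid: {"pos": (x, y), "best_dist": float, "nn": set of object ids}}."""
    affected = set()
    # Old cell: a current NN moved farther away than best_dist (case on the slide).
    for qid in influence.get(old_cell, ()):
        q = queries[qid]
        if obj_id in q["nn"] and math.dist(q["pos"], new_pos) > q["best_dist"]:
            affected.add(qid)
    # New cell: an empty influence list means nothing to do (as on the slide).
    # Assumption: a non-NN object arriving closer than best_dist also needs handling.
    for qid in influence.get(new_cell, ()):
        q = queries[qid]
        if obj_id not in q["nn"] and math.dist(q["pos"], new_pos) < q["best_dist"]:
            affected.add(qid)
    return affected
```

Only the queries returned here would be passed to NN re-computation; all other queries keep their current answers.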

SEA-CNN & CPM: A Comparison • Common features between the two: • Performance metrics: • Use query processing time (or CPU time) at the centralized server as the primary metric. • Ignore communication cost. • Employ Grid-based Indexing (simple, fast maintenance). • Keep a search region for each query to handle updates. • Are the differences significant? • CPM saves some computations over SEA-CNN (as shown in the CPM paper) because CPM uses an optimal search algorithm. • However, is saving in CPU time still very important?

Summary • Monitoring queries to support LBS has been an intensive research area in the past few years: • The short-term research trend seems to be proposals of new, more advanced query types (our next presentation will discuss Reverse NN and Group NN). • Long-term research could be on Moving Objects Databases. Recommended: the “Moving Objects Databases” textbook to gain perspective: • Location-management perspective vs. spatio-temporal data perspective. • Many LBS-based commercial products: Verilocation, uLocate, meetro, EarthComber, CellSpotting. • Standards and development software: Natural Area Coding System, Mobile Location Services Reference Architecture by Sun. • For updated LBS information: try LBSZone.

References • [B99] D. Barbara. Mobile Computing and Databases - A Survey. IEEE Transactions on Knowledge and Data Engineering, 11(1), 108-117, 1999. • [S03] http://www.wirelessdevnet.com/features/nacjan03/ • [GS05] R. H. Güting, M. Schneider. Moving Objects Databases. Book. • [SV04] J. Schiller, A. Voisard. Location-Based Services. Book. • [MHP05] K. Mouratidis, M. Hadjieleftheriou, D. Papadias. Conceptual Partitioning: An Efficient Method for Continuous Nearest Neighbor Monitoring. SIGMOD, 2005. • [YPK05] X. Yu, K. Pu, N. Koudas. Monitoring K-Nearest Neighbor Queries Over Moving Objects. ICDE, 2005. • [XMA05] X. Xiong, M. Mokbel, W. Aref. SEA-CNN: Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. ICDE, 2005. • [CDT+00] J. Chen, D. J. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. SIGMOD, 2000. • [CF02] S. Chandrasekaran, M. J. Franklin. Streaming Queries over Streaming Data. VLDB, 2002. (PSoup system.)

Note • Your presentation slides are due on November 14, 2005.

Aggregate NN Queries in Spatial Databases and Location-based Services. Tai Do, Data Systems Group, UCF, November 11, 2005

Outline • Aggregate Nearest Neighbor (ANN) queries: • Introduction to ANN. • Solutions for Group Nearest Neighbor (GNN) Queries, a specific type of ANN. • Solutions for Continuous Group Nearest Neighbor Queries (CGNN).

Aggregate NN: Examples and Applications • Applications: • Business decision making (construction of new facilities) • Military rescue (earliest pick-up time) • Severe weather monitoring (most dangerous area)

Aggregate NN: Definition • What is ANN? • A generalized form of NN search (multiple query points vs. a single query point). • Formally: • Given P = {p1, …, pN} (set of data points) and Q = {q1, …, qn} (set of query points). • Aggregate distance function: adist(p, Q) = f(|pq1|, …, |pqn|). • An ANN query returns the data point p with the minimum aggregate distance. • Note: AkNN is similar (find k >= 1 data points); we focus only on ANN. • When f = sum, the ANN query is called a Group Nearest Neighbor (GNN) query.
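As a point of reference for the definition, a brute-force ANN query can be written in a few lines of Python. The function names are illustrative, and the linear scan over P is only a baseline for checking results, not one of the methods discussed in these slides.

```python
import math

def adist(p, Q, f=sum):
    """Aggregate distance adist(p, Q) = f(|pq1|, ..., |pqn|)."""
    return f(math.dist(p, q) for q in Q)

def ann(P, Q, f=sum):
    """Data point of P with the minimum aggregate distance to Q (linear scan)."""
    return min(P, key=lambda p: adist(p, Q, f))
```

Passing `f=sum` gives the GNN query; another aggregate such as `f=max` can be passed in the same way.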

Group NN Queries • Assumptions: • Queries are in memory. • Data points are on disk and indexed by an R-tree. • Goal: • Minimize the extent and cost of the search (I/O and CPU time). • Roadmap: 3 solutions • Multiple query method • Single point method • Minimum bound method

Multiple Query Method (MQM) • Apply multiple conventional NN queries, then combine the results. • MQM is a straightforward application of the threshold algorithm ([FLN03]): • Each query point visits incrementally its NN data points (1st NN, then 2nd NN, …) • Compute the aggregate distance of the current NN data point • Do the two above steps until we have seen the best data point. • Main idea: • Question: how do we know that the aggregate distance of the seen data point is smaller than the aggregate distance of unseen data points? • Answer: Predict minimum aggregate distance of unseen data points (or in other words, use a threshold)
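The threshold idea can be sketched in Python as below. This is an illustration under stated assumptions rather than the published implementation: sorted lists stand in for incremental NN search on the R-tree, query points are visited round-robin, and the names `mqm`, `streams`, `t` are hypothetical.

```python
import math

def mqm(P, Q):
    """Group NN (f = sum) of query points Q over data points P via a threshold scan."""
    # Assumption: a pre-sorted list per query point replaces incremental NN on an R-tree.
    streams = [sorted(P, key=lambda p, q=q: math.dist(p, q)) for q in Q]
    pos = [0] * len(Q)      # next unread rank in each stream
    t = [0.0] * len(Q)      # t_i: distance of the last NN returned for q_i
    best_nn, best_dist = None, math.inf
    i = 0
    while any(pos[j] < len(P) for j in range(len(Q))):
        if pos[i] < len(P):
            # Step 1: next NN of q_i; update t_i (and hence T = sum of t_i).
            p = streams[i][pos[i]]
            pos[i] += 1
            t[i] = math.dist(p, Q[i])
            # Step 2: aggregate distance of the point just seen.
            d = sum(math.dist(p, q) for q in Q)
            if d < best_dist:
                best_nn, best_dist = p, d
            # Any unseen point has adist >= T, so stop once best_dist <= T.
            if best_dist <= sum(t):
                break
        i = (i + 1) % len(Q)  # go to the next query point
    return best_nn, best_dist
```

The stop test is safe because every unseen point is at least t_i away from each q_i, so its aggregate distance cannot be below T.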

MQM: An Example (1) • Q = {q1, q2} • P = {p1,…, p12}

MQM: An Example (2) • Initial state: t1 = 0, t2 = 0, T = 0, best_dist = ∞, best_NN = null. • Seen points (ID, dist(q1), dist(q2), sum/adist): none yet.

MQM: An Example (3) • Step 1: • Find the next (1st) NN of q1: (p10, 2). • Update t1 and T: t1 = 2, T = t1 + t2 = 2 + 0 = 2.

MQM: An Example (4) • Step 2: • If the current aggregate distance < best_dist, update best_dist and best_NN. • If the current best aggregate distance <= T, stop. • Else go to the next NN of the next query point and repeat step 1. • Seen points (ID, dist(q1), dist(q2), sum/adist): (p10, 2, 5, 7). • State: t1 = 2, T = 2; best_dist changes from ∞ to 7, best_NN = p10.

MQM: An Example (5) • Step 1: • Find the next (1st) NN of q2: (p11, 3). • Update t2 and T: t1 = 2, t2 = 3, T = t1 + t2 = 2 + 3 = 5. • Seen points (ID, dist(q1), dist(q2), sum/adist): (p10, 2, 5, 7). • State: best_dist = 7, best_NN = p10.

MQM: An Example (6) • Step 2: • If the current aggregate distance < best_dist, update best_dist and best_NN. • If the current best aggregate distance <= T, stop. • Else go to the next NN of the next query point and repeat step 1. • Seen points (ID, dist(q1), dist(q2), sum/adist): (p10, 2, 5, 7), (p11, 3, 3, 6). • State: t1 = 2, t2 = 3, T = 5; best_dist changes from 7 to 6, best_NN = p11.

MQM: An Example (7) • Step 1: • Find the next (2nd) NN of q1: (p11, 3). • Update t1 and T: t1 = 3, t2 = 3, T = t1 + t2 = 3 + 3 = 6. • Seen points (ID, dist(q1), dist(q2), sum/adist): (p10, 2, 5, 7), (p11, 3, 3, 6). • State: best_dist = 6, best_NN = p11.

MQM: An Example (8) • Step 2: • If the current aggregate distance < best_dist, update best_dist and best_NN. • If the current best aggregate distance <= T, stop. • Else go to the next NN of the next query point and repeat step 1. • Seen points (ID, dist(q1), dist(q2), sum/adist): (p10, 2, 5, 7), (p11, 3, 3, 6); p11 is seen again with (3, 3, 6), so there is no update. • State: t1 = 3, T = 6, best_dist = 6, best_NN = p11. • best_dist <= T, so STOP.

Single Point Method (SPM) • Problem with MQM: • It accesses the same node multiple times and retrieves the same data point (e.g., p11) through different queries. • SPM processes the query with a single traversal. • Strategy: • Compute the centroid q of Q, which is a point with small adist(q, Q). • The GNN is a point of P “near” q. • Challenges: • The computation of q. • The range around q in which we should look for points of P before we can conclude that no better GNN can be found.
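The slides do not spell out how q is computed, so the sketch below swaps in a standard technique for this kind of problem: Weiszfeld's iteration, which approximates the point minimizing the sum of Euclidean distances to Q. Treat it as one possible way to obtain a point with small adist(q, Q); the function name, the iteration count, and the tolerance are assumptions.

```python
import math

def adist(p, Q):
    """Sum of Euclidean distances from p to the query points in Q."""
    return sum(math.dist(p, q) for q in Q)

def approx_centroid(Q, iters=100, eps=1e-9):
    """Approximate the point with minimum adist(q, Q) using Weiszfeld's iteration."""
    # Start from the arithmetic mean of the query points.
    x = sum(q[0] for q in Q) / len(Q)
    y = sum(q[1] for q in Q) / len(Q)
    for _ in range(iters):
        wsum = wx = wy = 0.0
        for qx, qy in Q:
            d = math.hypot(x - qx, y - qy)
            if d < eps:
                # The iterate sits on a query point; stop to avoid dividing by zero.
                return (x, y)
            wsum += 1.0 / d
            wx += qx / d
            wy += qy / d
        nx, ny = wx / wsum, wy / wsum
        if math.hypot(nx - x, ny - y) < eps:
            return (nx, ny)
        x, y = nx, ny
    return (x, y)
```

SPM does not need the exact minimizer: any q is valid for pruning, and a smaller adist(q, Q) only makes the pruning bound tighter.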

SPM: Illustration

SPM: The Computation of q

SPM: Finding the range • To define the range around q, find heuristics that can safely prune nodes in the R-tree. • Lemma 1: adist(p, Q) >= n*|pq| - adist(q, Q). • For each query point qi, the triangle inequality gives |pqi| + |qiq| >= |pq|. • Summing up the n inequalities: Σ_i (|pqi| + |qiq|) >= n*|pq| ⇒ adist(p, Q) + adist(q, Q) >= n*|pq| ⇒ adist(p, Q) >= n*|pq| - adist(q, Q) (1) • Lemma 1 can be used for pruning intermediate nodes: • Node N can be pruned if mindist(N, q) >= (1/n) * [best_dist + adist(q, Q)] (2) • Because: transforming this pruning rule gives n * mindist(N, q) - adist(q, Q) >= best_dist (3) • For any p in node N, dist(p, q) >= mindist(N, q), so n * dist(p, q) - adist(q, Q) >= best_dist (4) • Using Lemma 1, adist(p, Q) >= best_dist, hence node N can be safely pruned.

SPM: Pruning Illustration • Both N1 and N2 can be pruned: • best_dist = adist(best_NN, Q) = 9 • adist(q, Q) = 3 • (1/n)(best_dist + adist(q,Q)) = ½ (9 + 3) = 6 • mindist(N1,q) = 10 and mindist(N2,q) = 6
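The pruning rule and the numbers of this illustration can be checked with a short Python sketch. The function names are illustrative; the inputs are the quantities named on the slides.

```python
def spm_lower_bound(dist_p_q, adist_q_Q, n):
    """Lemma 1: lower bound on adist(p, Q) given |pq| and adist(q, Q)."""
    return n * dist_p_q - adist_q_Q

def spm_can_prune(mindist_N_q, best_dist, adist_q_Q, n):
    """Pruning rule: mindist(N, q) >= (1/n) * (best_dist + adist(q, Q))."""
    return mindist_N_q >= (best_dist + adist_q_Q) / n

# Numbers from the illustration: best_dist = 9, adist(q, Q) = 3, n = 2.
threshold = (9 + 3) / 2
prune_n1 = spm_can_prune(10, 9, 3, 2)  # mindist(N1, q) = 10
prune_n2 = spm_can_prune(6, 9, 3, 2)   # mindist(N2, q) = 6
```

Both calls succeed because 10 and 6 are each at least the threshold of 6; a node at distance slightly below 6 would have to be visited.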

Minimum Bound Method (MBM) • Like SPM, MBM performs a single query, but uses the minimum bounding rectangle M of Q (instead of a centroid q) to prune the search space. • Is MBM obviously better than SPM? No clear reason. Must evaluate through experiments. • Strategy: • Use good heuristics to identify the qualifying nodes

Minimum Bound Method: Heuristics • Heuristic 1: A node N cannot contain qualifying points if mindist(N, M) >= (1/n) * best_dist, because for any data point p in N, adist(p, Q) >= n * mindist(N, M) >= best_dist. • Heuristic 1 prunes N1 but not N2. • Heuristic 2: A node N can be safely pruned if Σ_{qi in Q} mindist(N, qi) >= best_dist. • Heuristic 2 prunes both N1 and N2.
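Both heuristics reduce to a few rectangle distance computations, sketched here in Python. Rectangles are assumed to be axis-aligned tuples (x_min, y_min, x_max, y_max); the helper names and the sample rectangles in the usage note are hypothetical and are not the N1 and N2 of the original figure.

```python
import math

def mindist_rect_point(rect, p):
    """Minimum distance between an axis-aligned rectangle and a point."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - p[0], 0.0, p[0] - x1)
    dy = max(y0 - p[1], 0.0, p[1] - y1)
    return math.hypot(dx, dy)

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-aligned rectangles."""
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def mbr(Q):
    """Minimum bounding rectangle M of the query points."""
    xs, ys = [q[0] for q in Q], [q[1] for q in Q]
    return (min(xs), min(ys), max(xs), max(ys))

def heuristic1(N, Q, best_dist):
    """Prune N if mindist(N, M) >= best_dist / n."""
    return mindist_rect_rect(N, mbr(Q)) >= best_dist / len(Q)

def heuristic2(N, Q, best_dist):
    """Prune N if the sum over qi of mindist(N, qi) >= best_dist."""
    return sum(mindist_rect_point(N, q) for q in Q) >= best_dist
```

For example, with Q = [(0, 0), (4, 0)] and best_dist = 5, a node (5, 0, 6, 1) is kept by Heuristic 1 (distance 1 to M is below 2.5) but pruned by Heuristic 2 (5 + 1 = 6 >= 5). Heuristic 1 needs one distance computation per node while Heuristic 2 needs n, which is the usual reason to try the cheaper test first.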

Performance Study

Continuous Group NN • Assumptions: • Both query points and data points are in memory. • Method: • Use a grid index. • Utilize conceptual partitioning of the space around query Q. • Apply Minimum Bound Method.

Continuous GNN: Details • amindist(c, Q) = Σ_{qi in Q} mindist(c, qi). • amindist(c, Q) is a lower bound of adist(p, Q) for any data point p in cell c. • The GNN computation is similar to the NN computation presented in the previous class.

Summary • Threshold Algorithm: • Simple, useful, and reusable. • Aggregate Nearest Neighbor Queries in Spatial Databases: • Practical applications. • Good heuristics are important. • Open question: is optimal ANN search still unsolved?
