Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations. Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han. University of Illinois at Urbana-Champaign; Microsoft Research Asia. Motivation: trajectory query by locations. Huge volume of spatial trajectories
Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations • Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han • University of Illinois at Urbana-Champaign • Microsoft Research Asia
Motivation: trajectory query by locations • Huge volume of spatial trajectories • Need to search trajectories by a set of point locations
k-Nearest Neighboring trajectory query • The trajectories may not pass exactly through those locations • Query the top k trajectories with the minimum aggregated distance to the given locations (figure: query locations q1, q2, q3)
k-NNT query • Task Definition: Given the trajectory dataset D and a set of query points Q, the k-NNT query retrieves k trajectories K from D, K = {R1, R2, …, Rk}, such that for ∀ Ri ∈ K, ∀ Rj ∈ D - K, dist(Ri,Q) ≤ dist(Rj,Q). • Challenges • Huge trajectory dataset: high I/O cost to scan all the trajectories • Aggregated distance computation • Non-uniform distribution: • the trajectories are sparse in some regions and dense in others • the user-given query locations may be far from all the trajectories
The aggregate distance in k-NNT query • 1. For each query point, find the closest point on the trajectory (i.e., the shortest matching pair) • 2. Sum up the lengths of all matching pairs • Example for R1: dist(R1, q1) = dist(p1,2, q1) = 20 m; dist(R1, q2) = dist(p1,3, q2) = 50 m; dist(R1, q3) = dist(p1,5, q3) = 15 m; dist(R1, Q) = ∑ dist(R1, qi) = 85 m • Example for R2: dist(R2, q1) = dist(p2,3, q1) = 30 m; dist(R2, q2) = dist(p2,4, q2) = 5 m; dist(R2, q3) = dist(p2,6, q3) = 40 m; dist(R2, Q) = ∑ dist(R2, qi) = 75 m
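The two steps above, together with the task definition, can be sketched as a brute-force baseline. This is a minimal illustration, not the authors' implementation: it assumes planar coordinates with Euclidean distance, and the function names, the dictionary layout and the sample coordinates are all hypothetical.

```python
import heapq
import math

def traj_to_point_dist(traj, q):
    """Step 1: length of the shortest matching pair between trajectory and q."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in traj)

def agg_dist(traj, queries):
    """Step 2: sum the shortest matching pairs over all query points."""
    return sum(traj_to_point_dist(traj, q) for q in queries)

def knnt_brute_force(dataset, queries, k):
    """Scan every trajectory and keep the k with the smallest aggregate distance."""
    return heapq.nsmallest(k, dataset, key=lambda tid: agg_dist(dataset[tid], queries))

# Hypothetical data: trajectory id -> list of (x, y) sample points.
dataset = {
    "R1": [(0, 0), (10, 0), (20, 0)],
    "R2": [(0, 5), (10, 5), (20, 5)],
}
queries = [(0, 3), (20, 4)]
print(knnt_brute_force(dataset, queries, k=1))
```

Scanning the whole dataset like this is exactly the high I/O cost listed under Challenges, which is what the candidate-generation-and-verification framework is meant to avoid.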
Related Work: k-BCT query • k-Best Connected Trajectory (k-BCT) query [SIGMOD2010]: defines a similarity function between a trajectory R and query locations Q (formula not preserved in this transcript) • Problem: the ranking produced by this function changes with the distance unit (inconsistent) • An example: query Q has two points q1 and q2; dist(R1, q1) = dist(R1, q2) = 2.4 km = 1.49 miles; dist(R2, q1) = 1.5 km = 0.93 miles; dist(R2, q2) = 5 km = 3.1 miles • Using the unit “mile”, Sim(R1, Q) = 0.45 > Sim(R2, Q) = 0.43 • Using the unit “km”, Sim(R1, Q) = 0.18 < Sim(R2, Q) = 0.22
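The unit dependence in this example can be checked numerically. The similarity formula itself is not preserved in this transcript, so the sketch below assumes an exponential form, Sim(R, Q) = ∑ exp(-dist(R, qi)), chosen because it reproduces the slide's values to within rounding. The function names are hypothetical.

```python
import math

KM_PER_MILE = 1.609344

def sim_exp(dists):
    """ASSUMED similarity: sum of exp(-distance) over the query points."""
    return sum(math.exp(-d) for d in dists)

def agg(dists):
    """k-NNT aggregate distance: plain sum of the matching-pair lengths."""
    return sum(dists)

# Distances from the slide, in km: [dist(R, q1), dist(R, q2)].
r1_km, r2_km = [2.4, 2.4], [1.5, 5.0]
r1_mi = [d / KM_PER_MILE for d in r1_km]
r2_mi = [d / KM_PER_MILE for d in r2_km]

# Under the assumed similarity the winner flips with the unit.
print("miles:", round(sim_exp(r1_mi), 2), round(sim_exp(r2_mi), 2))
print("km:   ", round(sim_exp(r1_km), 2), round(sim_exp(r2_km), 2))

# The summed distance only rescales, so its ranking cannot flip.
print("R1 closer (km):   ", agg(r1_km) < agg(r2_km))
print("R1 closer (miles):", agg(r1_mi) < agg(r2_mi))
```

A plain sum is multiplied by the same constant for every trajectory when the unit changes, which is why the k-NNT ordering is unit-independent.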
Advantages of k-NNT over k-BCT • The distance function of k-BCT changes over units (inconsistent) • The distance function of k-BCT is sensitive to a query • (figure: query points q1, q2, q3 with the trajectories returned by k-BCT only, by k-NNT only, and by both k-BCT and k-NNT)
Query framework: candidate-generation-and-verification • Candidate generation • Best-first search based individual heaps • Coordination by a global heap • Candidate verification • Lower-bound estimation • Efficient pruning with the global heap • Qualifier expectation-based method
Candidate Generation • Given a query Q = {q1, q2, …, qm}, generate a trajectory candidate set including all the k-NNTs (i.e., complete set) • Step 1: searching k-NN points using best-first-based individual heap • Step 2: generating the candidate trajectories by the global heap
Step 2: generating candidate trajectories • Global heap • A minimum heap sorting matching pairs by the distance • Retrieves new matching pair from individual heaps • Pops the matching pairs to the candidate set
Example: Search based on the global heap (step 1) • Query points q1, q2, q3 with individual heaps h1, h2, h3 • Candidate set: empty • Global heap: empty • Pairs supplied by the individual heaps: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>
Example: Search based on the global heap (step 2) • Candidate set: R1 (partial match) • Global heap: <p1,4, q2>, <p1,6, q3>, <p1,2, q1> • Next pair from the individual heaps: <p5,5, q2>
Example: Search based on the global heap (step 3) • Candidate set: R1: <p1,4, q2> (partial match) • Global heap: <p1,6, q3>, <p1,2, q1>, <p5,5, q2> • Next pair from the individual heaps: <p4,5, q3>
Example: Search based on the global heap (step 4) • Candidate set: R1: <p1,4, q2>, <p1,6, q3> (partial match); R5 (partial match) • Global heap: <p1,2, q1>, <p5,5, q2>, <p4,5, q3> • Next pair from the individual heaps: <p4,4, q2>
Example: Search based on the global heap (final state) • Candidate set: R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3> (full match); R4: <p4,5, q3> (partial match); R5: <p5,5, q2> (partial match) • Global heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3> • Stop criterion: stop when there are k full-matching candidates (in this example, k = 1) • Property 1: The candidate set is complete if the global heap G has popped out k full-matching candidates • Advantages • guarantees that all k-NNTs are included in the candidate set • generates compact candidate sets
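The candidate-generation loop walked through in this example can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each individual heap is replaced by a fully sorted list of point-to-query distances (a real system would produce that order lazily by best-first search over a spatial index), distances are Euclidean, and all names and sample coordinates are hypothetical.

```python
import heapq
import math

def generate_candidates(dataset, queries, k):
    """dataset: {traj_id: [(x, y), ...]}; returns {traj_id: {query_idx: dist}}."""
    m = len(queries)
    # Stand-in for the individual heaps: one ascending stream of
    # (distance, traj_id, point_index) per query point.
    individual = [
        iter(sorted((math.dist(p, q), tid, i)
                    for tid, traj in dataset.items()
                    for i, p in enumerate(traj)))
        for q in queries
    ]
    # Global min-heap: holds one pending matching pair per query point.
    global_heap = []
    for qi, stream in enumerate(individual):
        first = next(stream, None)
        if first is not None:
            heapq.heappush(global_heap, (first[0], qi, first[1], first[2]))

    candidates = {}  # traj_id -> {query_idx: shortest matching-pair distance}
    full = 0
    while global_heap and full < k:  # stop at k full-matching candidates
        d, qi, tid, _ = heapq.heappop(global_heap)
        matched = candidates.setdefault(tid, {})
        if qi not in matched:  # the first pop for (tid, qi) is its shortest pair
            matched[qi] = d
            if len(matched) == m:
                full += 1
        nxt = next(individual[qi], None)  # refill from the same individual heap
        if nxt is not None:
            heapq.heappush(global_heap, (nxt[0], qi, nxt[1], nxt[2]))
    return candidates

# Hypothetical data: "C" is far from every query point.
dataset = {"A": [(0, 0), (10, 0)], "B": [(0, 1), (10, 1)], "C": [(100, 100)]}
queries = [(0, 0), (10, 0)]
print(generate_candidates(dataset, queries, k=1))
```

Trajectories whose points never reach the top of the global heap, such as "C" in the sample data, are never touched, which is where the compact candidate set comes from.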
Candidate verification • Candidate set: R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3> (full match); R4: <p4,5, q3> (partial match); R5: <p5,5, q2> (partial match) • The full-matching candidate may not be the final k-NNT • The system has to retrieve the partial-matching trajectories (R4 and R5) to compute their aggregate distance (I/O cost) • Question: can we compute a lower bound for R4 and R5 without retrieving their details? • If LB(R4) or LB(R5) > dist(R1,Q), that trajectory can be pruned directly
Candidate verification • The lower bound LB(R) of a partial-matching trajectory R (formula not preserved in this transcript) • If LB(R) is larger than the distance of the full-matching candidate, R can be pruned directly • Candidate set: R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>, dist(R1) = 95; R4: <p4,5, q3>, LB(R4) = 114 (pruned); R5: <p5,5, q2>, LB(R5) = 90 (passed) • Global heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3>
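The pruning rule can be sketched as below. The lower-bound formula is not preserved in this transcript, so the sketch assumes one plausible form: for every query point the trajectory has already matched, use that pair's distance; for every unmatched query point, substitute the distance of the pair the global heap currently holds for it. The individual distances are invented so that the results match the slide's LB(R4) = 114, LB(R5) = 90 and dist(R1) = 95; only those three values come from the source, and all names are hypothetical.

```python
def lower_bound(matched, heap_dist):
    """ASSUMED bound: known pair distances plus global-heap distances for the gaps.

    matched:   {query_idx: distance} for pairs already popped for this trajectory
    heap_dist: {query_idx: distance} of the pair currently in the global heap
    """
    return sum(matched.get(qi, d) for qi, d in heap_dist.items())

def prune(partials, threshold, heap_dist):
    """Split partial-matching candidates into (kept, pruned) by LB > threshold."""
    kept, pruned = [], []
    for tid, matched in partials.items():
        if lower_bound(matched, heap_dist) > threshold:
            pruned.append(tid)
        else:
            kept.append(tid)
    return kept, pruned

# Invented distances (metres) for the pairs held by the global heap.
heap_dist = {1: 40.0, 2: 35.0, 3: 45.0}
# Invented distances for the single pair each partial candidate has matched.
partials = {"R4": {3: 39.0}, "R5": {2: 5.0}}

kept, pruned = prune(partials, threshold=95.0, heap_dist=heap_dist)
print(kept, pruned)
```

Under this assumption only the kept trajectories need to be fetched from storage to compute their exact aggregate distance.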
Problem of Outlier Query Location • A query location is an outlier if it is far from all the trajectories • Too many partial-matching candidates will be generated before a full-matching candidate is found
Qualifier expectation-based method • The system can make up the missing pairs of a partial-matching trajectory by retrieving all its points • Two key issues: • Guaranteeing the completeness of the candidate set. Property 2: If there are k made-up candidates (qualifiers) with distance smaller than the sum of the pairs in the global heap, the candidate set is complete • Which candidate should be selected for making up? The qualifier expectation measure
Example of Qualifier Expectation • Candidate set (missing pairs left blank): R1: <p1,1, q1>, <p1,4, q2>, (q3 missing); R2: <p2,1, q1>, <p2,5, q2>, (q3 missing); R4: (q1 missing), <p4,4, q2>, (q3 missing) • Global heap: <p2,1, q1>, <p4,4, q2>, <p1,7, q3>; total distance sum(G) = 200 m • Qualifier expectation: R1: 40 m; R2: 30 m; R4: 15 m • After making up R1: <p1,1, q1>, <p1,4, q2>, <p1,7, q3>; dist(R1) = 160 m < sum(G), so R1 is a qualifier
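The completeness test of Property 2 reduces to a simple comparison, sketched here with the numbers from this example. The function names are hypothetical, and the qualifier expectation measure that decides which candidate to make up is not defined in this transcript, so it is not modelled.

```python
def is_qualifier(made_up_dist, heap_sum):
    """A made-up candidate qualifies if its aggregate distance is below sum(G)."""
    return made_up_dist < heap_sum

def candidate_set_complete(made_up_dists, heap_sum, k):
    """Property 2: complete once at least k made-up candidates are qualifiers."""
    return sum(1 for d in made_up_dists if is_qualifier(d, heap_sum)) >= k

# Values from the slide: dist(R1) = 160 m after making up, sum(G) = 200 m.
print(is_qualifier(160.0, 200.0))
print(candidate_set_complete([160.0], heap_sum=200.0, k=1))
```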
Experiment Setup • Real dataset: collected from the Microsoft GeoLife and T-Drive projects, with over 20,000 real trajectories • Synthetic datasets with both uniform distribution and biased distribution • Randomly generated query Q • The proposed methods are compared with Fagin’s Algorithm (FA) and Threshold Algorithm (TA) (used in k-BCT)
Evaluations on synthetic dataset (biased distribution) • GH (global heap) is faster than the baselines and incurs lower I/O cost • QE (global heap + qualifier expectation) is an order of magnitude faster than the others
Evaluations on real dataset • When |Q| is small, the probability of an outlier location is low, and GH achieves the best performance • When |Q| is larger, the probability of an outlier location is high, and QE is more efficient
Conclusion • k-Nearest Neighboring Trajectory (k-NNT) query • retrieve trajectories by a set of locations • Candidate-generation-and-verification framework • Generate candidate trajectories with global heap • Efficient lower-bound computation • Outlier query location: qualifier expectation-based method
Released Datasets: • T-Drive taxi trajectories • GeoLife GPS trajectories • Thanks! Yu Zheng, yuzheng@microsoft.com