
Efficient Query Processing On Massive Multi-dimension Data



Presentation Transcript


  1. Efficient Query Processing On Massive Multi-dimension Data. Presenter: Ying ZHANG (DBG@UNSW)

  2. Outline
     • Background
     • Problems investigated
     • Data stream computation
     • Skyline computation
     • Spatial keyword search

  3. Background: Massive multidimensional data are collected every day.
     • Location data from various observational mechanisms:
       - Smart phones: 0.36 billion this year in China, the largest smart phone market, with 0.45 billion expected next year. Baidu's location-based service receives 3.5 billion location requests per day on average.
       - Sensors
       - Radio Frequency Identification (RFID)
       - Global Positioning System (GPS)

  4. Background
     • Other multi-dimensional data from various applications:
       - Environment monitoring: measurements of light, temperature, humidity, ...
       - Finance and economic data: purchase transactions, stock transactions, ...
       - User behavior data: click streams, shopping records, ...
       - Network data: network monitoring data
       - etc.

  5. Problems Investigated: Given a large number of multi-dimensional objects, we investigate the following representative and fundamental queries.
     • Rank-based queries: top-k query, quantile query
     • Dominance-based queries: skyline query, representative skyline query
     • Proximity-based queries: range search, nearest neighbor search, and reverse nearest neighbor search

  6. Rank-based queries: 1. Top-k query. [Figure: points p1 to p8 plotted with X = academic score and Y = research score, ranked by the scoring function f(p) = x + y.]
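
The top-k query on this slide can be sketched as follows. This is a minimal illustration, not the presenter's algorithm: the coordinates of p1 to p8 are hypothetical (the slide's figure is not recoverable from the transcript), and `top_k` is an illustrative name. Only the scoring function f(p) = x + y comes from the slide.

```python
import heapq

# Hypothetical (academic score, research score) pairs for p1..p8.
points = {
    "p1": (1.0, 6.0), "p2": (6.0, 2.0), "p3": (8.0, 1.0), "p4": (3.0, 9.0),
    "p5": (4.0, 6.0), "p6": (2.0, 8.5), "p7": (5.0, 1.0), "p8": (7.0, 4.0),
}

def score(p):
    """Scoring function from the slide: f(p) = x + y."""
    x, y = p
    return x + y

def top_k(points, k):
    """Return the ids of the k points with the largest score, best first."""
    return heapq.nlargest(k, points, key=lambda pid: score(points[pid]))

print(top_k(points, 3))
```

`heapq.nlargest` keeps only k candidates at a time, so the full data set never has to be sorted.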

  7. Rank-based queries (cont.): 2. Quantile computation (order statistics). The Φ-quantile summarizes the score distribution: it is the first element in a sorted list whose cumulative weight is not smaller than Φ, where Φ is a number in (0, 1]. [Figure: sorted elements 3 3 6 7 8 9 12 13 15 20, with the 0.5-quantile (median) and the 0.8-quantile marked.]
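
The Φ-quantile definition can be checked on the slide's ten sorted elements. The sketch below assumes every element carries equal weight 1/n, which the slide does not state explicitly; `phi_quantile` is an illustrative name.

```python
def phi_quantile(sorted_values, phi):
    """Return the first element whose cumulative weight i/n is >= phi.

    Assumes equal weights 1/n and an input that is already sorted.
    """
    if not 0 < phi <= 1:
        raise ValueError("phi must be in (0, 1]")
    n = len(sorted_values)
    if n == 0:
        raise ValueError("sorted_values must not be empty")
    for i, value in enumerate(sorted_values, start=1):
        if i / n >= phi:
            return value

data = [3, 3, 6, 7, 8, 9, 12, 13, 15, 20]  # sorted elements from the slide
print(phi_quantile(data, 0.5), phi_quantile(data, 0.8))
```

Under the equal-weight assumption, the 0.5-quantile is the 5th element (8) and the 0.8-quantile is the 8th element (13).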

  8. Rank-based queries (cont.)
     • Other statistics [Figure: frequency histogram over the elements 1 to 20]:
       - Find all elements with frequency > 0.1%
       - Top-k most frequent elements
       - What is the frequency of element 3?
       - What is the total frequency of elements between 8 and 14?
       - How many elements have non-zero frequency?
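
For a data set small enough to count exactly, the frequency questions on this slide can be answered with a plain counter. This is only a baseline sketch with a hypothetical input `stream` and illustrative helper names; it stores one counter per distinct element and is not a streaming synopsis.

```python
from collections import Counter

stream = [3, 3, 3, 8, 10, 14, 14, 20, 1, 3]  # hypothetical elements from 1..20
counts = Counter(stream)

def frequent_elements(counts, fraction):
    """Elements whose relative frequency is strictly greater than `fraction`."""
    total = sum(counts.values())
    return sorted(e for e, c in counts.items() if c / total > fraction)

def range_frequency(counts, lo, hi):
    """Total frequency of elements e with lo <= e <= hi."""
    return sum(c for e, c in counts.items() if lo <= e <= hi)

print(frequent_elements(counts, 0.3))  # elements with frequency > 30%
print(counts.most_common(2))           # top-2 most frequent elements
print(counts[3])                       # frequency of element 3
print(range_frequency(counts, 8, 14))  # total frequency between 8 and 14
print(len(counts))                     # elements with non-zero frequency
```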

  9. Dominance-based queries
     • n-dimensional numeric space D = (D1, …, Dn)
     • On each dimension, a user preference ≺ is defined.
     • For two points u and v, u dominates v (u ≺ v) if
       - ∀ Di (1 ≤ i ≤ n), u.Di ⪯ v.Di, and
       - ∃ Dj (1 ≤ j ≤ n), u.Dj ≺ v.Dj
     [Figure: points p1 to p8 plotted with X = academic score and Y = research score.]

  10. Dominance-based queries (cont.): Skyline: the points not dominated by any other point.
     - Candidates for the best options in multi-criteria decision applications.
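
The dominance test and the skyline definition can be sketched with a quadratic-time scan. The sketch assumes that larger values are preferred on every dimension and uses hypothetical coordinates for p1 to p8; it is a naive baseline, not one of the algorithms from the cited papers.

```python
def dominates(u, v):
    """u dominates v: at least as good on every dimension, strictly better on one.

    Assumes larger values are preferred on all dimensions.
    """
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def skyline(points):
    """Ids of the points that no other point dominates (O(n^2) comparisons)."""
    return [pid for pid, p in points.items()
            if not any(dominates(q, p) for q in points.values())]

# Hypothetical (academic score, research score) pairs.
points = {
    "p1": (1.0, 6.0), "p2": (6.0, 2.0), "p3": (8.0, 1.0), "p4": (3.0, 9.0),
    "p5": (4.0, 6.0), "p6": (2.0, 8.5), "p7": (5.0, 1.0), "p8": (7.0, 4.0),
}
print(skyline(points))
```

A point never dominates itself, because the strict-improvement condition fails, so no special case is needed when a point is compared with itself.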

  11. Proximity-based queries
     • Range search
     • Nearest neighbor search
     • Reverse nearest neighbor search
     [Figure: query point q among the points p2 to p7.]
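
The three proximity queries can be stated precisely with brute-force scans. The sketch below assumes Euclidean distance and hypothetical coordinates for q and p2 to p7; a practical system would use a spatial index instead of scanning every point.

```python
import math

def range_search(points, q, radius):
    """Ids of the points within `radius` of q."""
    return [pid for pid, p in points.items() if math.dist(p, q) <= radius]

def nearest_neighbor(points, q):
    """Id of the point closest to q."""
    return min(points, key=lambda pid: math.dist(points[pid], q))

def reverse_nearest_neighbors(points, q):
    """Ids of the points that have q as their nearest neighbor.

    A point p qualifies if no other data point is strictly closer to p than q.
    """
    result = []
    for pid, p in points.items():
        d_q = math.dist(p, q)
        if all(math.dist(p, o) >= d_q
               for oid, o in points.items() if oid != pid):
            result.append(pid)
    return result

q = (0.0, 0.0)  # hypothetical query location
points = {      # hypothetical coordinates
    "p2": (1.0, 0.0), "p3": (3.0, 0.0), "p4": (0.0, -4.0),
    "p5": (5.0, 5.0), "p6": (5.0, 6.0), "p7": (-2.0, 0.0),
}
print(range_search(points, q, 2.5))
print(nearest_neighbor(points, q))
print(reverse_nearest_neighbors(points, q))
```

Note that the nearest neighbor of q (here p2) and the reverse nearest neighbors of q (here p2, p4, and p7) are different sets in general.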

  12. New Challenges (1): Massive streaming data
     • Data arrive at high speed, and the volume of the data is extremely large.
       - Twitter: 140 million users and over 340 million tweets per day
       - 200 Mb/sec from a single sensor node reading weather data
       - AT&T collects 600-800 gigabytes of NetFlow data each day
       - Square Kilometre Array (SKA) project: a few exabytes (10^18 bytes) of data per day for a single beam per square kilometer

  13. Streaming Algorithm [Diagram: data streams enter a stream processing engine, which maintains synopses in memory and returns (approximate) answers.]
     • One scan only
     • Processing time (fast)
     • Synopsis size (small)
     • Accuracy (a good tradeoff with synopsis size)
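
As one concrete illustration of these requirements, the sketch below uses the Misra-Gries frequent-items summary. It is a well-known textbook synopsis chosen here as an example, not necessarily a technique from the talk. It reads the stream once, keeps at most k - 1 counters, and guarantees that every element occurring more than n/k times in a stream of length n remains in the summary, with each stored count undershooting the true count by at most n/k. The name `misra_gries` and the sample stream are illustrative.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of length 100: two heavy elements plus ten rare ones.
stream = [1] * 60 + [2] * 30 + list(range(100, 110))
summary = misra_gries(stream, k=4)
print(summary)
```

Raising k enlarges the synopsis and tightens the error bound n/k, which is the accuracy versus synopsis-size tradeoff listed on the slide.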

  14. New Challenges (2): The data may be uncertain for various reasons.
     • Limits of the measuring devices
     • Noise
     • Delay or loss in data transfer
     • Privacy
     • Data integration
     The uncertainty of the data may be described continuously or discretely.

  15. New Challenges (3): Enriched spatial data
     • Textual data: Twitter, Weibo, Foursquare
     • User profiles: age, gender, preference, etc.
     • Multimedia data: photos, videos

  16. Rank-based queries
     • Top-k computation: find the top k objects for a given scoring/preference function.
     • Quantile computation: focus on approximate solutions in the context of data streams.
     • Others: counting the number of distinct objects.
     Research outcome: ICDE’06, ICDE’07, TKDE’10, ADC’10 (best paper award), etc.

  17. Dominance-based queries
     • Skyline computation
     • Skyline computation on uncertain data
     • Skyline computation over uncertain data streams
     Research outcome: ICDE’09, Info.Sys’11, ICDE’11, TODS’12, etc.

  18. Dominance-based queries (cont.)
     • Representative skyline computation (ICDE’07): find the k skyline objects that together dominate the largest number of distinct objects.
     • Top-k dominating query on uncertain data (VLDBJ’10): rank objects by their dominance power, i.e., the number of objects they dominate.

  19. Proximity-based queries
     • Range search on uncertain data: report the objects whose appearance probability with respect to a search region r is larger than p.
       - The query is uncertain and the targets are certain objects (ICDE’09, TKDE’12).
       - The objects are uncertain (TKDE’10, EDBT’12, TKDE’13).
     • Nearest neighbor search
       - Top-k nearest neighbor search on uncertain data (ICDE’10)
       - Top-k spatial keyword search (ICDE’13, EDBT’14)
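
The range query over uncertain objects can be sketched for the discrete case, where each object is a set of weighted instances. The sketch assumes an axis-aligned rectangular region and hypothetical objects A, B, and C; the function names are illustrative, and the code is a brute-force baseline rather than the indexed techniques from the cited papers.

```python
def appearance_probability(instances, region):
    """Total probability of the instances that fall inside the rectangle.

    `instances` is a list of ((x, y), probability) pairs whose probabilities
    sum to 1; `region` is ((x_lo, y_lo), (x_hi, y_hi)).
    """
    (x_lo, y_lo), (x_hi, y_hi) = region
    return sum(prob for (x, y), prob in instances
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi)

def uncertain_range_search(objects, region, p):
    """Ids of the objects whose appearance probability in `region` exceeds p."""
    return [oid for oid, instances in objects.items()
            if appearance_probability(instances, region) > p]

# Hypothetical uncertain objects with discrete instances.
objects = {
    "A": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.25), ((9.0, 9.0), 0.25)],
    "B": [((8.0, 8.0), 0.5), ((1.0, 2.0), 0.5)],
    "C": [((7.0, 7.0), 1.0)],
}
region = ((0.0, 0.0), (3.0, 3.0))
print(uncertain_range_search(objects, region, p=0.6))
```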

  20. Proximity-based queries (cont.)
     • Reverse nearest neighbor search
       - Reverse nearest neighbor search (ICDE’11, ICDE’14)
       - Continuous monitoring of reverse nearest neighbors (VLDB’09, 2 VLDBJ’12)

  21. Two recent research topics
     • Skyline computation on uncertain data
     • Spatial keyword search

  22. Dominance Relation: easy for certain objects, non-trivial for uncertain objects.

  23. Uncertain Skyline Computation: (1) probabilistic skyline, (2) stochastic order. The dominance relation is non-trivial for uncertain objects. [Figure: uncertain objects A and B, with the values 1K, 500K, 100K, 80K, 50K, and 10K labeled.]
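
One common formulation of the probabilistic skyline assigns each uncertain object the probability that none of the other objects dominates it. The sketch below follows that formulation under loudly stated assumptions: objects are independent, each object is a discrete set of weighted instances, larger values are preferred, and the objects A and B are hypothetical. The slide does not give the exact model used in the talk.

```python
def dominates(u, v):
    """u dominates v, assuming larger values are preferred on every dimension."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def skyline_probability(objects, target):
    """Probability that `target` is not dominated by any other object.

    `objects` maps an id to a list of (point, probability) instances.
    Objects are assumed to be mutually independent.
    """
    total = 0.0
    for u, p_u in objects[target]:
        not_dominated = 1.0
        for oid, instances in objects.items():
            if oid == target:
                continue
            # Probability mass of this object's instances that dominate u.
            dominating_mass = sum(p_v for v, p_v in instances if dominates(v, u))
            not_dominated *= 1.0 - dominating_mass
        total += p_u * not_dominated
    return total

# Hypothetical uncertain objects, each with two equally likely instances.
objects = {
    "A": [((5.0, 5.0), 0.5), ((1.0, 1.0), 0.5)],
    "B": [((3.0, 3.0), 0.5), ((6.0, 6.0), 0.5)],
}
print(skyline_probability(objects, "A"), skyline_probability(objects, "B"))
```

Neither A nor B dominates the other outright here, which is why a probability, rather than a yes/no dominance test, is needed for uncertain objects.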

  24. Spatial keyword search: Spatial-textual objects
     • An enormous number of spatio-textual objects are available in many applications.
     • Online local search, e.g., online yellow pages
     • Social network services, e.g., Facebook, Flickr, Twitter

  25. Top-k spatial keyword search (ICDE’13) [Figure: a query with keywords (pizza, coffee) and the objects p1 (pizza, coffee, sushi), p2 (pizza, coffee, steak), p3 (pizza, sushi), p4 (coffee, sushi), p5 (pizza, steak, seafood).]
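
A brute-force version of this query can be sketched using the keyword sets from the slide. Two assumptions are made here and are not confirmed by the transcript: an object qualifies only if it contains all query keywords, and qualifying objects are ranked by Euclidean distance to the query. The locations are hypothetical, and `top_k_spatial_keyword` is an illustrative name.

```python
import heapq
import math

def top_k_spatial_keyword(objects, q_loc, q_keywords, k):
    """Ids of the k closest objects that contain every query keyword."""
    candidates = [(math.dist(loc, q_loc), oid)
                  for oid, (loc, keywords) in objects.items()
                  if q_keywords <= keywords]  # subset test: all keywords present
    return [oid for _, oid in heapq.nsmallest(k, candidates)]

# Keyword sets from the slide; locations are hypothetical.
objects = {
    "p1": ((1.0, 1.0), {"pizza", "coffee", "sushi"}),
    "p2": ((4.0, 0.0), {"pizza", "coffee", "steak"}),
    "p3": ((0.5, 0.0), {"pizza", "sushi"}),
    "p4": ((0.0, 2.0), {"coffee", "sushi"}),
    "p5": ((3.0, 3.0), {"pizza", "steak", "seafood"}),
}
print(top_k_spatial_keyword(objects, (0.0, 0.0), {"pizza", "coffee"}, k=2))
```

Only p1 and p2 contain both query keywords, so closer objects such as p3 are filtered out before the distance ranking.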

  26. Diversified spatial keyword search on road networks (EDBT’14)
     • Consider the road network distance
       - Develop new signature techniques to improve I/O efficiency
     • Consider the diversity of the results (spatial dispersion)
       - Ranking score: a linear combination of the distances from the objects to the query object (relevance) and the sum of pairwise distances among the resulting objects (diversity)
       - Develop incremental diversified top-k search algorithms
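
The relevance and diversity tradeoff can be illustrated with a simple greedy selection. The sketch makes several assumptions that the slide does not confirm: Euclidean distance stands in for road network distance, relevance is the negated distance to the query, the weight `lam` balances the two terms, and results are chosen one at a time by marginal gain. It is not the EDBT’14 algorithm, and keyword filtering is omitted.

```python
import math

def diversified_top_k(objects, q, k, lam=0.5):
    """Greedily pick k objects that are close to q and far from each other.

    Marginal gain of an object: lam * (-distance to q)
    + (1 - lam) * (sum of distances to the objects already selected).
    """
    selected = []
    remaining = dict(objects)
    while remaining and len(selected) < k:
        def gain(oid):
            loc = remaining[oid]
            relevance = -math.dist(loc, q)
            diversity = sum(math.dist(loc, objects[s]) for s in selected)
            return lam * relevance + (1 - lam) * diversity
        best = max(remaining, key=gain)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical object locations and query.
objects = {"a": (1.0, 0.0), "b": (1.1, 0.0), "c": (-1.2, 0.0), "d": (5.0, 0.0)}
q = (0.0, 0.0)
print(diversified_top_k(objects, q, k=2, lam=0.5))  # balances both terms
print(diversified_top_k(objects, q, k=2, lam=1.0))  # relevance only
```

With `lam=1.0` the two nearest objects a and b are returned even though they are almost co-located; with `lam=0.5` the second pick moves to c on the other side of the query.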

  27. Streaming spatial keyword search: Spatial-textual objects arrive in a streaming fashion in many applications (e.g., Twitter and Weibo).
     • Size estimation for spatial keyword search
     • Continuous monitoring of local hot spots
     • Continuous spatial keyword queries

  28. Summary
     • Massive multi-dimensional data arise in various applications.
     • Three fundamental problems for massive multi-dimensional data analysis
     • New challenges and research opportunities

  29. Thanks!
