
Continuous Processing of Preference Queries in Data Streams : a Survey

M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos. Data Engineering Lab, Department of Informatics, Aristotle University of Thessaloniki.


Presentation Transcript


  1. Continuous Processing of Preference Queries in Data Streams: a Survey • M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos • Data Engineering Lab, Department of Informatics, Aristotle University of Thessaloniki

  2. Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

  3. Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

  4. Data Streams • A data stream is an infinite sequence of objects. • Each object can be one-dimensional or multi-dimensional. • Streaming time series are finite sequences of objects. • Streaming time series change over time. • The arrival rate of objects usually varies.

  5. Sliding Window Model (1) • Count-based window: the sliding window contains the W most recent tuples ("active"). • Older tuples expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]

  6. Sliding Window Model (2) • Time-based window: the sliding window contains the tuples ("active") of the W most recent timestamps. • Older records expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
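
The two window models on slides 5 and 6 can be sketched as follows. This is a minimal illustration, not code from the surveyed systems: the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based window is assumed to keep timestamps in the range (now - W, now].

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recently arrived tuples."""

    def __init__(self, w):
        self.items = deque(maxlen=w)

    def insert(self, item):
        # When the window is full, the oldest tuple expires on insertion.
        expired = self.items[0] if len(self.items) == self.items.maxlen else None
        self.items.append(item)
        return expired


class TimeWindow:
    """Time-based window: keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp, item):
        self.items.append((timestamp, item))
        expired = []
        # Assumption: a tuple is active while its timestamp > now - W.
        while self.items[0][0] <= timestamp - self.w:
            expired.append(self.items.popleft()[1])
        return expired
```

With W=5, inserting t1 through t8 into a `CountWindow` leaves t4 to t8 active. A `TimeWindow` may hold more or fewer than W tuples, because several tuples can share a timestamp.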

  7. Database System [Figure with labels: Query, Result, User / Application, Input]

  8. Continuous Evaluation in a Data Stream System [Figure with labels: Query, Result, User / Application, Query processor]

  9. Motivation (1) • Numerous data stream contexts • Financial data analysis • Network management • Astronomical data analysis • Sensor network • Telecommunication data management

  10. Motivation (2) • Preference queries • Useful decision support tool • Many applications in data streams • Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls. • Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers. [Slide callouts: continuous top-k dominating query, continuous skyline query]

  11. Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Conclusions

  12. Skyline Query • Skyline: contains all the tuples not dominated by any other tuple. • Dominant tuple: A tuple t dominates another tuple t' if • t is not worse than t' in all dimensions, and • t is better than t' in at least one dimension. [Figure: tuples T1 to T6 plotted on price vs. distance axes]
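
The dominance test and the skyline definition on slide 12 translate almost directly into code. The sketch below assumes that smaller values are better in every dimension. The coordinates are hypothetical (distance, price) pairs, since the slide's figure data is not available.

```python
def dominates(t, u):
    """t dominates u: not worse in all dimensions, better in at least one."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))


def skyline(tuples):
    """Names of the tuples that no other tuple dominates (quadratic scan)."""
    return {
        name
        for name, t in tuples.items()
        if not any(dominates(u, t) for other, u in tuples.items() if other != name)
    }


# Hypothetical (distance, price) values, smaller is better in both.
hotels = {
    "T1": (1, 9), "T2": (2, 6), "T3": (8, 1),
    "T4": (4, 7), "T5": (6, 5), "T6": (3, 9),
}
print(sorted(skyline(hotels)))  # ['T1', 'T2', 'T3', 'T5']
```

Here T4 and T6 drop out because T2 dominates both. The streaming methods that follow avoid repeating this quadratic scan on every arrival and expiration.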

  13. Continuous Skyline Query • Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series. • Application example: network data • Computers with suspicious behavior. • Network traffic, number of connections, number of destinations.

  14. Basic Idea • Skyline changes due to • The insertion of a new skyline tuple. • The expiration of a skyline tuple. • LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06] • Use of a spatial index • Advantage: simple implementation • Disadvantage: the expiration of a skyline tuple is not handled efficiently

  15. Event Approach (1) • Existing skyline tuple expires: • How can we find new skyline tuples? • Very costly operation • Skyline influence time (SIT) • Earliest time at which a tuple may become a skyline tuple. • Generate events based on SIT

  16. Event Approach (2) • Eager [Tao, TKDE06] • Advantage: handles skyline expiration • Disadvantage: processing time per tuple • Tuple K can be discarded due to tuple L (younger and better) • K.SIT=19 [Figure: tuples A(1) to L(12), labeled with arrival times, W=10]

  17. n-of-N Skyline Queries (1) n-of-N definition • S6 = {a,c} • S4 = {c,g} source: icde05

  18. n-of-N Skyline Queries (2) n-of-N definition • S6 = {c,h} • S4 = {e,h} source: icde05

  19. Method cnN (1) • Method cnN [Lin, ICDE05] is also based on events. • Tuple K is redundant because tuple L is better and younger than K. • Tuple L is dominated by D and E. • The dominance relation between L and E is critical because E is the youngest tuple which dominates L. [Figure: tuples A(1) to L(12), labeled with arrival times, W=10]

  20. Method cnN (2) • Generate intervals • For the skyline tuples, e.g. C = (0,3] • For the critical dominance relations, e.g. C -> G = (3,7] • Use an interval tree to store them • The dominance graph contains all the critical dominance relations. [Figure: dominance graph over tuples A(1) to G(7), marking redundant tuples and critical dominance relations]

  21. Method cnN (3) • A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M-n+1, where M is the total number of elements seen so far. • To answer an n-of-N query, apply a stabbing query at M-n+1. • Example with M = 7 and intervals C = (0,3], D = (0,4], C -> G = (3,7], D -> E = (4,5], D -> F = (4,6]: • For n = 4, M-n+1 = 4, so S4 = {D, G} • For n = 6, M-n+1 = 2, so S6 = {C, D} [Figure: dominance graph over tuples A(1) to G(7)]
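
The stabbing query on slide 21 can be reproduced with the intervals listed there. The sketch below is a simplification: it scans a plain list linearly instead of using the interval tree the method relies on, and it labels each interval with the tuple it reports (for a relation such as C -> G, the dominated tuple G). The function name `n_of_n_skyline` is hypothetical.

```python
# Intervals from slide 21 as (reported tuple, open lower bound, closed upper bound).
intervals = [
    ("C", 0, 3),  # skyline tuple C = (0,3]
    ("D", 0, 4),  # skyline tuple D = (0,4]
    ("G", 3, 7),  # critical dominance relation C -> G = (3,7]
    ("E", 4, 5),  # critical dominance relation D -> E = (4,5]
    ("F", 4, 6),  # critical dominance relation D -> F = (4,6]
]


def n_of_n_skyline(intervals, m, n):
    """Tuples whose half-open interval (lo, hi] contains the point M-n+1."""
    q = m - n + 1
    return {name for name, lo, hi in intervals if lo < q <= hi}


print(sorted(n_of_n_skyline(intervals, 7, 4)))  # ['D', 'G']
print(sorted(n_of_n_skyline(intervals, 7, 6)))  # ['C', 'D']
```

Both results match S4 and S6 on the slide. Because the intervals are open on the left, D -> E = (4,5] does not contain 4, which is why E is absent from S4.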

  22. Method cnN (4) • Advantages • Good use of skyline properties • Multiple query processing • Disadvantages • Processing time per tuple • Increased memory requirements

  23. Frequent Skyline - Motivation • Highly dynamic environment • The skyline results are meaningful only if the skyline tuples appear consistently • Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

  24. Streaming Model • Client/Server architecture • Server receives object updates from the clients. • Each object can be represented as a d-dimensional point. • Object update (point movement in the d-dimensional space). • at least a value in one dimension changes • Object insertion or deletion • Point movement from/to a nonexistent position • Minimization of communication cost

  25. Filter • Safe region technique • Skyline remains unchanged if each object stays in a safe region • Communication happens only when the safe region is violated • Safe region approach leads to communication optimization An object as a point and its filter (safe region) source: sigmod09

  26. Sampling • All clients report their skyline at the same sampled time • The clients are synchronized with the same random seed • Guaranteed quality if sampling rate is high enough

  27. Hybrid • Hybrid solution • Combines Filter and Sampling • Small changes: apply Filter • Larger changes: apply Sampling • Disadvantage of all three methods • energy consumption is not uniform (critical in sensor networks)

  28. k-dominant Skyline Query - Motivation • Skyline: contains tuples not dominated by any other tuple. • Disadvantage: High dimensionality problem. • Solution: Relax the notion of dominance. • k-dominant tuple: A tuple t k-dominates another tuple t' if • t is not worse than t' in at least k dimensions and • t is better than t' in at least one of them. • k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
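
The relaxed dominance test on slide 28 can be written as a small function. This is a sketch under two assumptions: smaller values are better, and the three 3-dimensional points are hypothetical values chosen to show that k-dominance, unlike conventional dominance, can be cyclic.

```python
def k_dominates(t, u, k):
    """t k-dominates u: not worse in at least k dimensions, better in one of them."""
    not_worse = [i for i in range(len(t)) if t[i] <= u[i]]
    better = [i for i in not_worse if t[i] < u[i]]
    return len(not_worse) >= k and len(better) >= 1


def k_dominant_skyline(points, k):
    """Names of the points that no other point k-dominates."""
    return {
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    }


# Hypothetical points: a 2-dominates b, b 2-dominates c, c 2-dominates a.
points = {"a": (1, 2, 3), "b": (2, 3, 1), "c": (3, 1, 2)}
print(k_dominant_skyline(points, 3))  # conventional skyline: all three points
print(k_dominant_skyline(points, 2))  # set(): the cycle eliminates every point
```

With k equal to the number of dimensions the test reduces to conventional dominance. For smaller k the result can even be empty, as the second call shows.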

  29. k-dominant Skyline Query - Example • T1 4-dominates T3 • T1 5-dominates T4 • T1 dominates T5 • Conventional skyline: {T1, T2, T3, T4} • 5-dominant skyline: {T1, T2, T3} • 4-dominant skyline: {T1, T2} • Smaller k, fewer tuples in the k-dominant skyline

  30. Observations • Traditional or streaming skyline methods are inappropriate • Skyline properties do not hold • E.g. transitive property • k-dominance can be cyclic • Existence of multiple users and multiple queries.

  31. Method CoSMuQ (1) • A query on D dimensions arrives. • Given a parameter value k, split the query to subqueries of d=k dimensions. • Compute the conventional skyline of each subquery. • The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
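
The decomposition on slide 31 can be illustrated with a static, non-streaming sketch. It assumes that smaller values are better and that the subqueries are all subsets of k dimensions; enumerating every subset is an assumption of this example, not a description of how CoSMuQ maintains or shares subqueries over a stream. The points are hypothetical.

```python
from itertools import combinations


def dominates(t, u, dims):
    """Conventional dominance restricted to the dimensions in dims."""
    return all(t[i] <= u[i] for i in dims) and any(t[i] < u[i] for i in dims)


def subspace_skyline(points, dims):
    """Conventional skyline of one k-dimensional subquery."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p, dims) for other, q in points.items() if other != name)
    }


def k_dominant_skyline(points, k):
    """Intersect the skylines of all k-dimensional subqueries."""
    d = len(next(iter(points.values())))
    result = set(points)
    for dims in combinations(range(d), k):
        result &= subspace_skyline(points, dims)
    return result


points = {"p1": (1, 1, 5), "p2": (2, 2, 1), "p3": (3, 3, 3)}
print(sorted(k_dominant_skyline(points, 3)))  # ['p1', 'p2']
print(sorted(k_dominant_skyline(points, 2)))  # ['p1']
```

The intersection works because a point is k-dominated exactly when some other point dominates it in at least one k-dimensional subspace. Only conventional dominance checks are needed, which is the advantage the next slide lists.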

  32. Method CoSMuQ (2) • Advantages • Based on conventional skyline (simple domination checks) • Properties of conventional skylines can be used • Exploits the overlap between different queries. • Disadvantages • Memory requirements increase in high dimensionality.

  33. Continuous Skyline methods - Summary

  34. Presentation Layout • Data streams - Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

  35. Top-k Query - Example • Given a preference function, a top-k query returns the k tuples with the best scores. • F = price + distance [Figure: tuples T1 to T6 plotted on price vs. distance axes, with the top-k results for k=1 and k=2]
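
A static top-k query with the preference function F = price + distance from slide 35 can be sketched as follows. The example assumes that a lower score is better, which the slide does not state, and the (distance, price) values are hypothetical.

```python
import heapq


def top_k(tuples, k, score):
    """Names of the k tuples with the lowest score (lower is assumed better)."""
    return heapq.nsmallest(k, tuples, key=lambda name: score(tuples[name]))


# Hypothetical (distance, price) values.
hotels = {
    "T1": (1, 9), "T2": (2, 6), "T3": (8, 1),
    "T4": (4, 7), "T5": (6, 5), "T6": (3, 9),
}


def f(t):
    """Preference function F = price + distance."""
    return t[0] + t[1]


print(top_k(hotels, 1, f))  # ['T2']
print(top_k(hotels, 2, f))  # ['T2', 'T3']
```

Unlike the skyline, the result depends on the choice of F: a different weighting of price and distance can return different tuples.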

  36. Continuous Top-k Query • Problem definition: Continuous evaluation of top-k query in multidimensional streaming time series. • Application Example: network data • top-100 flows with the largest individual throughput • Common destination • DDoS attack

  37. Basic Idea • New tuple changes the top-k • It must lie in the influence region of the query • Top-k tuple expiration • From scratch query computation • TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06] • Advantage: simple implementation • Disadvantage: no efficient handling of an expired top-k tuple [Figure (source: sigmod06): the influence region in the (x1, x2) plane, bounded by the line F = score(tk) = x1 + x2 through tuple tk]

  38. Skyband - Example • k-skyband: contains all the tuples which are dominated by at most k-1 other tuples. • 1-skyband: tuples not dominated by other tuples (the 1-skyband is the skyline) • 2-skyband: tuples dominated by at most 1 other tuple • A tuple dominated by 2 other tuples belongs to the 3-skyband [Figure: tuples A to E]
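
The k-skyband definition on slide 38 generalizes the skyline test by counting dominators. The sketch below assumes that smaller values are better. The coordinates for A to E are hypothetical and are not taken from the slide's figure.

```python
def dominates(t, u):
    """t dominates u: not worse in all dimensions, better in at least one."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))


def k_skyband(points, k):
    """Names of the points dominated by at most k-1 other points."""
    band = set()
    for name, p in points.items():
        dominators = sum(
            1 for other, q in points.items() if other != name and dominates(q, p)
        )
        if dominators <= k - 1:
            band.add(name)
    return band


# Hypothetical 2-d points, smaller is better.
points = {"A": (1, 8), "B": (3, 4), "C": (4, 9), "D": (2, 10), "E": (5, 6)}
print(sorted(k_skyband(points, 1)))  # ['A', 'B']
print(sorted(k_skyband(points, 2)))  # ['A', 'B', 'D', 'E']
print(sorted(k_skyband(points, 3)))  # ['A', 'B', 'C', 'D', 'E']
```

Each skyband contains the previous one, so the k-skyband grows with k.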

  39. Skyband Approach (1) • Transform tuples into the (score, expiration_time) space, with F = price + distance • Dominance counter (DC): number of tuples that are younger and better • Rule: Keep tuples with DC < k • Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space. [Figure: tuples T1 to T6 in the original (price, distance) space and in the transformed (score, exp_time) space, annotated with their DC values and the top-1 tuple]
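
The pruning rule on slide 39 can be checked on a small example. The sketch assumes that a lower score is better and that "younger" means a later expiration time. The (score, exp_time) pairs are hypothetical, and the helper names are not from the original method.

```python
def dominance_counters(tuples):
    """DC of each tuple: how many tuples are both better and younger."""
    counters = {}
    for name, (score, exp_time) in tuples.items():
        counters[name] = sum(
            1
            for other, (s, e) in tuples.items()
            if other != name and s < score and e > exp_time
        )
    return counters


def candidates(tuples, k):
    """Rule from the slide: keep only the tuples with DC < k."""
    return {name for name, dc in dominance_counters(tuples).items() if dc < k}


def top_k(tuples, names, k, now):
    """Top-k among the given names that are still active at time now."""
    active = [n for n in names if tuples[n][1] > now]
    return sorted(active, key=lambda n: tuples[n][0])[:k]


# Hypothetical (score, exp_time) pairs.
tuples = {
    "T1": (10, 11), "T2": (8, 12), "T3": (9, 13),
    "T4": (11, 14), "T5": (13, 15), "T6": (12, 16),
}
print(dominance_counters(tuples))
# {'T1': 2, 'T2': 0, 'T3': 0, 'T4': 0, 'T5': 1, 'T6': 0}
```

A tuple with DC >= k is always outranked by k tuples that outlive it, so it can never enter a top-k result. For this data, evaluating `top_k` over `candidates(tuples, k)` gives the same answer as over all tuples at every time step.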

  40. Skyband Approach (2) • SMA (Skyband Monitoring Algorithm) proposed in [Mouratidis, SIGMOD06] • Advantage: independent of the dimensionality • 2-dimensional space (score-exp_time) • Disadvantage: • k-skyband may contain fewer than k tuples • In this case, a top-k tuple expiration will cause query computation from scratch

  41. Distributed Top-k • Continuously report the k largest values obtained from distributed data streams. • Objective is to minimize communication cost • Proposed by [Babcock, SIGMOD03]

  42. Streaming Model • Nodes: N1, N2, …, Nm; coordinator node: N0 • Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn • Value updates are represented as <Oi, Nj, Δ> tuples: • Nj detects a change Δ in the value Vi of Oi. • The change is not seen by other nodes Nk (k ≠ j) • The value Vi for an object Oi: $V_i = \sum_j V_{i,j}$ • where Vi,j is the value of the i-th object in the j-th node

  43. Method (1) • Initialize a top-k set at the coordinator node • Set arithmetic constraints at monitor nodes • Depend on current top-k set • Constraints valid => No communication • Constraints invalidated • Client communicates with server • Possibly new top-k set • Recomputation of constraints

  44. Method (2) - Adjustment Factors • Adjustment factors (AF): to keep the results valid, the AFs for each object sum to zero • Local top-k similar to global => Low communication cost • Local top-ks differ from global top-k => Unnecessary constraint violations => Increased communication cost • Disadvantage: Energy consumption is not uniform [Figure: two nodes and two objects with global Top-1 = {O1}; Node 1 has local Top-1 = {O1} and Node 2 has local Top-1 = {O2}; adjusted values shown as "Node 2: V1,2 = 3+0 = 3" and "Node 2: V2,1 = 1+3 = 4"]
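
The role of adjustment factors can be illustrated with a simplified two-node example. This is a sketch under stated assumptions, not the protocol from [Babcock, SIGMOD03]: the factors are spread over the monitor nodes only, the local constraint is assumed to be "every adjusted top-k value is at least every adjusted non-top-k value", and all numbers and function names are hypothetical.

```python
def global_values(local):
    """V_i as the sum of the partial values V_{i,j} over all nodes j."""
    totals = {}
    for node_values in local.values():
        for obj, value in node_values.items():
            totals[obj] = totals.get(obj, 0) + value
    return totals


def violated_nodes(local, adj, topk):
    """Nodes whose adjusted local values contradict the global top-k set."""
    bad = []
    for node, values in local.items():
        adjusted = {obj: v + adj[node][obj] for obj, v in values.items()}
        worst_top = min(adjusted[obj] for obj in topk)
        best_rest = max(adjusted[obj] for obj in adjusted if obj not in topk)
        if worst_top < best_rest:
            bad.append(node)
    return bad


# Hypothetical partial values: globally O1 = 12 and O2 = 5, so top-1 = {O1}.
local = {"N1": {"O1": 9, "O2": 1}, "N2": {"O1": 3, "O2": 4}}
no_adj = {"N1": {"O1": 0, "O2": 0}, "N2": {"O1": 0, "O2": 0}}
# The factors of each object sum to zero across the nodes.
adj = {"N1": {"O1": -2, "O2": 0}, "N2": {"O1": 2, "O2": 0}}

print(violated_nodes(local, no_adj, {"O1"}))  # ['N2']
print(violated_nodes(local, adj, {"O1"}))     # []
```

Without factors, node N2 ranks O2 above O1 locally and reports a violation even though the global top-1 is correct. Shifting two units of O1's slack from N1 to N2 removes that violation. Because the factors sum to zero, adding up the local constraints still implies the global ordering.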

  45. Uncertain Data • Pk-topk query: returns the k tuples most likely to be in the top-k. • Example: Top-2: {6,5} with prob. {0.64, 0.5} [Figure (source: pvldb08): compute the probability of 6 tuples over 16 possible worlds by summing the world probabilities]

  46. Pk-topk Query • Solution proposed by [Jin, PVLDB08] • Compact set based • Space-efficient solution • Discard unnecessary tuples and • Apply several compression schemes to compress data • Disadvantages • Model assumption: tuple probabilities are assumed to be random and independent of each other.

  47. Continuous Top-k Methods - Summary

  48. Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

  49. Top-k Dominating Query - Example • Skyline: contains all the tuples not dominated by any other tuple. Disadvantage: High dimensionality problem. • Top-k: Given a preference function, a top-k query returns the k tuples with the best scores. Disadvantage: user-defined preference function. • Top-k dominating: the answer contains the k tuples with highest domination power. Combines the advantages of skyline and top-k queries and avoids their disadvantages. [Figure: tuples T1 to T6 plotted on price vs. distance axes, with F = price + distance and the results for k=1 and k=2]
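
A static top-k dominating query can be sketched by scoring each tuple with the number of tuples it dominates. The example takes "domination power" to mean that count, assumes that smaller values are better, and uses hypothetical (distance, price) values. Ties are broken by name only to keep the output deterministic.

```python
def dominates(t, u):
    """t dominates u: not worse in all dimensions, better in at least one."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))


def domination_power(tuples):
    """For each tuple, the number of other tuples it dominates."""
    return {
        name: sum(1 for other, u in tuples.items() if other != name and dominates(t, u))
        for name, t in tuples.items()
    }


def top_k_dominating(tuples, k):
    """The k tuples with the highest domination power."""
    power = domination_power(tuples)
    return sorted(power, key=lambda name: (-power[name], name))[:k]


# Hypothetical (distance, price) values, smaller is better in both.
hotels = {
    "T1": (1, 9), "T2": (2, 6), "T3": (8, 1),
    "T4": (4, 7), "T5": (6, 5), "T6": (3, 9),
}
print(top_k_dominating(hotels, 2))  # ['T2', 'T1']
```

No preference function is needed, and the result size is fixed at k regardless of dimensionality.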

  50. Continuous Top-k Dominating Query • Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series. • Application Example: sensor network • Areas with high probability of fire outbreak • Temperature, humidity and wind speed
