Categorical Skylines for Streaming Data

Categorical Skylines for Streaming Data Nikos Sarkas, Gautam Das, Nick Koudas and Anthony K. H. Tunge Univ. of Toronto, Univ. of Texas at Arlington, and National Univ. of Singapore SIGMOD 2008 Nam, Kwang-hyun Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea Center for E-Business Technology Seoul National University Seoul, Korea

Contents • Introduction • Motivation • Background • Efficient skyline maintenance for categorical tuples • Experimental Evaluation • Conclusions

Introduction • Current situation • It has become increasingly difficult to process abundant data in order to isolate useful and relevant information • To facilitate the exploration of a data space • One such successful tool is the skyline query • The skyline of a data set • The subset of tuples which are not dominated on all of their attributes by any other tuple • The tuples that have a uniquely interesting combination of attribute values that no other tuple can match.

Motivation • Recent work • Considered the on-demand evaluation of skyline queries on tuples with partially ordered categorical attributes • But in an offline environment • Inappropriate for a highly dynamic data streaming environment • Objective • To identify and study the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes • To realize two novel techniques • Constitute the building blocks of STARS (STreaming ARrangement Skyline) to solve the problem.

Background • A limited-capacity, sliding-window buffer B in memory • To only store the n most recent tuples from the stream • The contents of the buffer constitute a data set denoted by D • The data set is comprised of n tuples with d categorical attributes • The domain Domi of each the attributes is partially ordered • Partial order (an antisymmetric preorder) • A binary relation"≤" over a set P which is reflexive, antisymmetric, and transitive, i.e., for all a, b, and c in P, we have that: • a ≤ a (reflexivity) • if a ≤ b and b ≤ a then a = b (antisymmetry) • if a ≤ b and b ≤ c then a ≤ c (transitivity) • Poset (Partially ordered set) • A set with a partial order • Commonly represented as DAG (directed acyclic graph)

Vertex : Domain value • Directed edge : For each pair of comparable values Background • Example 1 • Value a and b -> ‘Comparable and a dominates b’ • Value b and c -> ‘Incomparable’ • Assume tuples comprised of d categorical attributes • Tuple t1dominates tuple t2 ( t2 < t1 or t1 > t2 ) • If for every attribute Xi, t2.Xi ≤i t1.Xi and there is at least one attribute Xj such that t2.Xj < t1.Xj • Tuples t1 and t2 do not dominate one another • Call this situation ‘tied’ ( t1 ~ t2) • Then, the skyline of data set D is the subset of all tuples that are not dominated by any other tuple

i) t2 < t1 ∵ b < a and d < b ii) t2 < t3 ∵ b = b and d < c iii) t1 ~ t3 ∵ b < a, but b ~ c ∴ The skyline of this data set consists of tuples t1 and t3. Background • Example 2 • Consider three tuples t1, t2 and t3 with two categorical attributes • The domain of both attributes is the poset of Figure 1 • Assume t1 = (a,b), t2 = (b,d), and t3 = (b,c) • Skyline maintenance solution • Checking whether a tuple a dominated by the current skyline • Retrieving the tuples in the buffer that are dominated by the outgoing skyline tuple • Since only these tuples are candidates for entering the skyline • Lemma 1 • Let t1,t2∈B be two tuples so that t1 < t2. Then, if t2 arrived after t1 in the stream, t1 will never be in the skyline of B

Efficient skyline maintenance for categorical tuples • Two novel techniques for realizing the building blocks of the skyline maintenance frame work • Indexing the skybuffer (the relevant part of the buffer for skyline) • To enable efficiently identify the skybuffer tuples dominated by a query tuple • Organizing the skyline • To enable rapidly answer whether a query tuple is dominated by the skyline

Topological sorting • Topological sort • A numbering of the vertices of a DAG • Every edge from a vertex numbered i to a vertex numbered j satisfies i < j • If for two values a, b, a > b, then a will appear before b. • For a given value x, all values dominated by x will appear after it in the linear order, while x can never dominate any value that precedes it in the ordering • Definition 1 • Let v be a value of a partially-ordered domain. We denote by r(v) the integer corresponding to v’s position in a ceratin topological sort of the domain

Topological sorting • Lemma 2 • Let v1, … , vm be the m values of a partially-ordered domain Dom. Then vi > vjonly if r(vi) < (vj) • Lemma 3 • Let t1, t2 be two tuples with d partially-ordered categorical attributes X1, …, Xd. Then t1 > t2 only if r(t1.Xi) < r(t2.Xi), 1≤i≤d • Lemma 4 • Let t1, t2 be two tuples with d partially-ordered categorical attributes X1, … Xd. If ∃i, j such that r(t1.Xi) < r(t2.Xi) and r(t1.Xj) > r(t2.Xj), then t1 ~ t2

Organizing the skybuffer tuples in a grid • One of goals is indexing the skybuffer • To efficiently insert and delete tuples, as well as identify the tuples dominated by a query tuple • Build skybuffer indexing solution around a simple grid • Example 3 • Consider a set of tuples with two categorical attributes, the domain of both attributes being the poset of Figures 2. Suppose that the poset values have been mapped to integers according to topological sort (i) in Figure 2. Then, we create the grid by using this topological sort as the grid scales and place the skybuffer tuples in its cell. • Each grid cell corresponds to a unique combination of attribute values and only contains tuples with these exact values • The cell corresponding to tuples with attribute values (d, e) has been marked with an “x”.

Improving the skybuffer organization • 1. Visiting only relevant cells • One of advantage of the grid-based index • Directly identify and process precisely the cells that contain dominated tuples (directly visit the cells marked with •) • Lemma 5 (Focused Search) • Let t be a tuple with d partially-ordered categorical attributes X1 ∈ Domi, …, Xd ∈ Domd. Let dom(t.Xi) be the values in Domi such that t.Xi ≥ v, v ∈ Domi. Then, t dominates a tuple s if and only if s ∈ dom(t.X1) x … x dom(t.Xd) • Example • The query tuple is (d,e) in figure 3. The domain values dominated by d are {d,g}. The domain values dominated by e are {e,g,h}. Then only tuples with values in {d,g} x {e,g,h} are dominated by {d,e}.

Improving the skybuffer organization • 2. Controlling the grid granularity • A problem with the grid-based index • The lack of control over granularity of the grid • Example • If the tuples have 4 attributes and each domain size is about 500, then the grid would be comprised of 62.5 billion cells • Definition 2 • Consider a DAG and its vertices. A vertex is a source if it has no incoming edges. Then, the depth of a vertex in a DAG is the length of the longest path from a source to the vertex. • Lemma 6 • Let v1, …, vm bethe m values of a partially-ordered domain Dom. Then vi > vj onlyif depth(vi) < depth(vj)

Improving the skybuffer organization • Example 4 • Consider a data set consisting of tuples with 2 partially-ordered categorical attributes, the domain of both of them being the poset of Figure 4(a). Then, we can group domain values that lie on the same depth and create the grid of Figure 4(b). Every cell would contain tuples with the same attribute values. Now, tuples with corresponding attributes that lie at the same depth level are placed in the same cell. • tuple (d,e) lies in the cell marked with “x” in Figure 4(b), along with other tuples with values in {c,d,e} x {c,d,e}. • Problem • The cells can contain tuples both dominated and not dominated by the query tuple • A dominance check is required to identify the dominated tuples

Improving the skybuffer organization • 3. Poset partitioning • Example 5 • Suppose that our data set contains tuples with two partially-ordered categorical attributes, the domain of both being the poset of Figure 5(a). The values of the domain have been grouped. If we order the groups in ascending order of their depth value, breaking ties arbitrarily, we propose a valid topological sort for the poset • Figure 5(b) presents the resulting grid by using as scales the grouping of Figure 5(a) • The cells that contain candidate are marked with • • This allows us to use focused search in order to directly access relevant cells. All cells that can potentially contain tuples that are dominated by (d,e) are located in the rectangular area.

Experimental Evaluation • SDC (Stratification by Dominance Classification) • Introduced in Stratified computation of skylines with partially-ordered domains written by Chan et al. in 2005. • Improves over BBS+ in terms of both progressiveness and speed • Suggested in the numerical skyline maintenance solution in offline environment • Performance evaluation of STARS and SDC • Goal • To identify the impact of the buffer size, data dimensionality and domain structure on performance • Measure • The average time required to process a buffer update • i.e., A combined tuple arrival and expiration

Experimental Evaluation STARS SDC • Performance on synthetic data • STARS outperforms SDC by an order of magnitude • The time required by STARS is in the order of a millisecond or less • Rendering its use in real life applications entirely realistic Effect of buffer size and dimensionality on performance

Experimental Evaluation STARS SDC • Performance on a stream of real, skewed and correlated data • Used DMV data set and three categorical attributes of the “cars” table: Maker/Model, Color, and Year • The result can be attributed to the presence of skew and correlation in the data that leads considerably larger skylines as the buffer size increases 3 dimensional tuples Performance on skewed real data

Experimental Evaluation • Further evlauation of STARS • Pruning efficency • Measured as the fraction of skyline tuples that need to be examined on average in order to answer a dominance query • Found to be resilient to increases in data dimensionality Every depth level has twice the number of values from the level above All depth levels have the same number of values Tree wall Pruning efficiency of arrangement skyline organization

Experimental Evaluation 3 dimensional tuples • Demonstrates the potential perfomance benefits by utilizing the techniques of ‘improving the skybuffer organization’ • The benefit of increasing the grid granularity is eventually offset(상쇄) by the overhead of visiting many sparse cells • Knee in the curves Every depth level has twice the number of values from the level above All depth levels have the same number of values Tree wall Effect of grid granularity on performance

Conclusions • Identified and motivated the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes • Realized two novel techniques • To constitute the building blocks of an efficient solution to the problem. • Introduce a lightweight data structure for indexing the tuples in the streaming buffer

Skyline Nam, Kwang-hyun Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea Center for E-Business Technology Seoul National University Seoul, Korea

Introduction of Skyline Query • Motivation • The amount of information is explosively increasing • Users often face information overload • How can different attributes be compared? • Solution • Return objects which are not dominated by any other object in all attributes • Intuitive querying • No overhead of formulation Low A B Price High Low High Speed

Categorical Skylines for Streaming Data • Motivation • Existing work is not proper to maintain skylines for streaming data • Skyline maintenance of streaming tuples • Incoming tuple • Consider whether incoming tuple can become part of skyline • Outgoing tuple • Consider whether the tuples dominated by outgoing tuple can be candidates for entering skyline • Skyline maintenance solution • Checking whether a tuple is dominated by the current line • Retrieving the tuples in the buffer that are dominated by the outgoing sky line tuple

Categorical Skylines for Streaming Data • Proposal • Indexing the skybuffer • To insert and delete tuples, as well as identify the skybuffer tuples dominated by a query tuple efficiently • 1. Topological sorting • 2. Organizaing the skybuffer tuples in a grid • 3. Improving the skybuffer • Visiting only relevant cells, Controlling the grid granularity, Poset partitioning • Advantages • Optimize query evaluation • Applicable to a wide range of dimensionality • Adapt any size of poset

Categorical Skylines for Streaming Data

Categorical Skylines for Streaming Data

Presentation Transcript

Categorical Data Analysis

Categorical Data

Categorical Data Analysis

Categorical Data

Categorical Data

Analyzing categorical data

Displaying Categorical data

Categorical Data

Inference for Categorical Data

Categorical Data

Interpreting Categorical Data

Categorical Data

Categorical Data

Categorical Data

Categorical Data Analysis

Categorical Data

Classifying Categorical Data

Spatial Smoothing for Categorical Data

WLS for Categorical Data

Analyzing Categorical Data

Categorical Data Analysis

Categorical data