CSE 6339: DATA EXPLORATION Anitha Josephine Royappan

Ad-hoc Top-k Query Answering for Data Streams • Gautam Das, Dimitrios Gunopulos, Nick Koudas, Nikos Sarkas

Outline • Introduction • Data Stream • DBMS VS DSMS • Top-k query answering in Primal Plane • Primal Plane • Arrangements • Top-k query answering in Dual Plane • Dual Plane • Operations • Tuple pruning • Principles • Implementation • Experimental Evaluation • Conclusion

DATA STREAM • Traffic monitoring applications: monitor traffic slowdowns or accidents using data sent by each car on the road every few seconds or minutes. • Sensor monitoring applications: monitor a room and switch on the AC, heater, lights or fire alarm based on data received from sensors.

DATA STREAM • These applications need to process data at a high input rate. Data is processed continuously over a long period of time and arrives in the form of a data stream. • Data stream: an unbounded (never-ending) sequence of data items that are usually ordered.

Characteristics of a data stream • Data arrives continuously in multiple rapid, high-rate, possibly unpredictable and unbounded streams. • Data items belonging to the same data stream are usually processed in the order they arrive. • A main memory buffer maintains incoming tuples. • Data streams are usually generated by external sources or other applications and are sent to a Data Stream Management System (DSMS).

Data stream • A large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed. • DBMSs are not designed for rapid and continuous loading of individual data items. • One-time queries: evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user.

Data stream • Continuous queries: • Evaluated continuously as data streams continue to arrive. • Predefined queries: • Queries are fixed and data keeps changing. • Ad-hoc queries: • Both queries and data keep changing. • Continuous queries are not supported. • DSMS: Data Stream Management System

Data stream processing using DBMS (RFID) • Increased response time

Data stream processing using DSMS in RFID • Data rate is high; data is stored in a buffer. • Incoming streams are processed directly by the DSMS. • Response time decreases; data may also be persisted for archival.

DBMS vs DSMS • DSMS: transient data, continuous queries, sequential access • DBMS: persistent data, one-time queries, random access

Buffer management for incoming tuples • Memory space is limited, so aged tuples are removed in order to free space for fresh incoming tuples. • Sliding window of a specific size W. • Time-based (timestamp) • Tuple-count-based (most recent N records) • Example: the k highest ranking tuples according to temperature, humidity, or both. • Rank the tuples in the buffer according to ad-hoc preferences towards their attributes.
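The two window types can be sketched as follows. This is a minimal illustration, not the buffer manager of any particular DSMS; the class names `CountWindow` and `TimeWindow` and the `span` parameter are assumptions introduced here.

```python
from collections import deque


class CountWindow:
    """Tuple-count-based window: keep only the most recent `size` tuples."""

    def __init__(self, size):
        # A bounded deque drops the oldest tuple automatically when full.
        self.buffer = deque(maxlen=size)

    def insert(self, tup):
        self.buffer.append(tup)


class TimeWindow:
    """Time-based window: keep tuples newer than (latest timestamp - span)."""

    def __init__(self, span):
        self.span = span
        self.buffer = deque()  # (timestamp, tuple) pairs in arrival order

    def insert(self, timestamp, tup):
        self.buffer.append((timestamp, tup))
        # Expire aged tuples from the front of the buffer.
        while self.buffer and self.buffer[0][0] <= timestamp - self.span:
            self.buffer.popleft()
```

In both cases expiry happens at insertion time, so the buffer never holds more than one window's worth of tuples.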

Technique to support efficient evaluation of ad-hoc top-k queries • Novel geometric representation that utilizes an arrangement of geometric objects to perform indexing and ad-hoc top-k query answering. • Algorithms for updating and querying. • Tuple pruning to minimize the number of tuples that need to be indexed. • The performance of the technique is evaluated using real and synthetic data sets.

Top-k query problem and solution • A top-k query retrieves the k highest scoring tuples from a data set with respect to a scoring function defined on the attributes of a tuple. • This is not directly applicable to highly dynamic environments and on-line applications, like data streams. (predefined queries) • Introduce a novel geometric representation for the top-k query problem based on DCEL. • Query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries

Top-k answering in the primal plane • Data set D = {t1, t2, t3, ..., tn} • d numeric attributes {x1, x2, x3, ..., xd} • Domi: domain of the i-th attribute • Each domain is taken to be the unit interval [0,1] • Tuple t = (t.x1, ..., t.xd) • Top-k query: Q = (S, k) • Scoring function S defined over Dom1 × … × Domd • Consider d = 2
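A brute-force evaluation of a query Q = (S, k) makes the definitions concrete. The sketch below assumes an additive scoring function of the form S(t) = w1·x1 + ... + wd·xd; the function name `top_k` and the sample tuples are illustrative only.

```python
import heapq


def top_k(tuples, weights, k):
    """Return the k highest scoring tuples under an additive scoring function."""

    def score(t):
        # Assumed form: weighted sum of the attribute values.
        return sum(w * x for w, x in zip(weights, t))

    return heapq.nlargest(k, tuples, key=score)


data = [(0.9, 0.2), (0.5, 0.5), (0.1, 0.8)]
by_first_attribute = top_k(data, (1.0, 0.0), 2)
by_mostly_second = top_k(data, (0.2, 0.8), 2)
```

This scan touches every tuple in the buffer for every query, which is the cost the arrangement-based index is meant to avoid.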

Primal plane • [Figure: tuples P1, P2, P3 plotted as points in the unit square (x and y axes from 0 to 1), with the weight vectors w and w' of two additive scoring functions]

Arrangement of lines • The arrangement A(S) of a finite collection S of geometric objects is the decomposition of the d-dimensional space into connected open cells of dimensions 0, . . . , d induced by S • Geometric objects: Lines, curves, hyper planes, triangles, circles • Cell Range: 0 to d • l-cell : cell of dimensionality l ( 0 ≤ l ≤ d ) • Vertices, Edges and Faces • d-cell: cell of maximum dimensionality (l = d)

Example • 3 vertices, 9 edges, 7 faces

combinatorial complexity • The combinatorial complexity of an arrangement is the overall number of cells of all dimensions in the arrangement. • The combinatorial complexity of an l-cell is the number of cells of the arrangement of dimension less than l that are contained in the boundary of the cell. • zone of a surface : set of d-cells intersecting the surface.

combinatorial complexity • The faces are convex, but can be unbounded. • An arrangement of n lines is composed of O(n^2) vertices, O(n^2) edges and O(n^2) faces. • Theorem • Complexity results
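As a quick sanity check on these bounds, the sketch below computes exact cell counts under the assumption (not stated on the slide) that the n lines are in general position: no two are parallel and no three pass through a common point. The function name is illustrative.

```python
def arrangement_counts(n):
    """Cell counts for an arrangement of n lines in general position."""
    vertices = n * (n - 1) // 2   # every pair of lines meets exactly once
    edges = n * n                 # each line is cut into n pieces by the other n - 1
    faces = n * (n + 1) // 2 + 1  # follows from Euler's formula for the plane
    return vertices, edges, faces


print(arrangement_counts(3))  # (3, 9, 7)
```

For n = 3 this reproduces the 3 vertices, 9 edges and 7 faces of the earlier example, and all three counts grow quadratically in n.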

Representation (DCEL)

representation • Each halfedge maintains a pointer to its twin, e.g., e2 to e9 and vice versa. • For each vertex, a circular list of the incident halfedges is maintained, in clockwise order. For example, vertex v3 maintains a list with edges (e2, e8, e6, e4). • Each halfedge has pointers to its source and target vertices, for example edge e1 to vertices v1 and v2. • Each halfedge stores a pointer to its incident face, e.g., edges e1, e2, e3 store a link to face f1. • For each face, the halfedges forming its boundary are organized in a doubly connected circular list. • The total space complexity of the data structure: O(n^2)
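The pointers listed above can be sketched as plain record types. This is a minimal illustration of a doubly connected edge list, not the data structure of any specific library; all class, field and function names are assumptions, and the clockwise ordering of a vertex's incident halfedges is not enforced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Vertex:
    x: float
    y: float
    incident: List["HalfEdge"] = field(default_factory=list)  # halfedges leaving this vertex


@dataclass(eq=False)
class Face:
    boundary: Optional["HalfEdge"] = None  # any one halfedge on the boundary cycle


@dataclass(eq=False)
class HalfEdge:
    source: Vertex
    target: Vertex
    twin: Optional["HalfEdge"] = None
    face: Optional[Face] = None
    next: Optional["HalfEdge"] = None  # next halfedge around the same face
    prev: Optional["HalfEdge"] = None  # previous halfedge around the same face


def make_edge(u, v):
    """Create the two twin halfedges for the segment u-v and link them."""
    e, t = HalfEdge(u, v), HalfEdge(v, u)
    e.twin, t.twin = t, e
    u.incident.append(e)
    v.incident.append(t)
    return e


def face_boundary(face):
    """Walk the circular next-pointers once and list the boundary halfedges."""
    edges, e = [], face.boundary
    while True:
        edges.append(e)
        e = e.next
        if e is face.boundary:
            return edges
```

Because every record holds only a constant number of pointers, the total space is proportional to the number of vertices, edges and faces in the arrangement.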

INDEXING for top-k query answering in the dual plane • Each tuple t = (x1, x2) is mapped to a line e_t: y = (1 − x2)x + (1 − x1) in the dual plane. • A query Q can be represented as a point p(Q) = (w2 / w1, 0), where w1, w2 are the weights of its scoring function. • Theorem 2
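To see why this mapping is useful, assume an additive scoring function S(t) = w1·x1 + w2·x2 with w1 > 0. Evaluating the line of t at x = w2 / w1 gives y = (w1 + w2 − S(t)) / w1, so a higher score means a lower line above p(Q), and the top-k tuples are the k lowest lines met by a vertical ray from p(Q). The sketch below checks this on a small example; the function names are illustrative.

```python
def dual_line(t):
    """Map tuple (x1, x2) to the (slope, intercept) of its dual line."""
    x1, x2 = t
    return 1 - x2, 1 - x1


def query_abscissa(w1, w2):
    """x-coordinate of p(Q) = (w2 / w1, 0); requires w1 > 0."""
    return w2 / w1


def top_k_dual(tuples, w1, w2, k):
    """Top-k tuples = the k lowest dual lines above p(Q)."""
    x = query_abscissa(w1, w2)

    def height(t):
        slope, intercept = dual_line(t)
        return slope * x + intercept

    return sorted(tuples, key=height)[:k]


data = [(0.9, 0.2), (0.5, 0.5), (0.1, 0.8)]
ranking = top_k_dual(data, 0.2, 0.8, 3)
# Same order as sorting by score 0.2*x1 + 0.8*x2 from highest to lowest.
```

This sketch sorts all lines for every query; the arrangement stores these orderings so that a query only has to walk upwards from p(Q).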

Top-k query answering in the dual plane

Top-k query answering in the dual plane • Solution for the top-k query answering problem. • Map the tuples of a data set to lines in the dual plane. • The arrangement maintains the rankings of all possible top-k queries in a non-redundant, self-organizing manner. • All queries that produce the exact same tuple ranking are mapped to a continuous interval on the x axis and use the same part of the arrangement to retrieve that ranking. • An intersection signifies a change in the ranking of two tuples. • Self-organizing under line insertion and deletion.
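The link between intersections and ranking changes can be illustrated directly. The sketch below, with hypothetical function names, computes the x-coordinates at which two dual lines cross (using the mapping y = (1 − x2)x + (1 − x1)) and the ranking seen from a given point on the x axis. Between two consecutive crossing points the ranking stays the same.

```python
from itertools import combinations


def ranking_breakpoints(tuples):
    """Positive x-coordinates where two dual lines intersect."""
    xs = set()
    for (a1, a2), (b1, b2) in combinations(tuples, 2):
        if a2 != b2:  # parallel dual lines never cross
            # Solve (1 - a2)x + (1 - a1) = (1 - b2)x + (1 - b1) for x.
            x = (a1 - b1) / (b2 - a2)
            if x > 0:
                xs.add(x)
    return sorted(xs)


def ranking_at(tuples, x):
    """Tuples ordered from lowest to highest dual line at abscissa x."""
    return sorted(tuples, key=lambda t: (1 - t[1]) * x + (1 - t[0]))


data = [(0.8, 0.2), (0.5, 0.6), (0.2, 0.7)]
breakpoints = ranking_breakpoints(data)  # roughly 0.75, 1.2 and 3.0
```

Queries at x = 0.8 and x = 1.1 fall between the first two crossing points and therefore see the same ranking, while a query at x = 0.7 sees the first two tuples swapped.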

Operating on the arrangement • Arrangement Representation

Arrangement representation • The point representing a query can only lie on the positive part of the x axis of the dual plane. • Since the domain of the tuples is the unit square, the lines that result from the mapping to the dual plane have the form y = ax + b, where 0 ≤ a, b ≤ 1. • The selected mapping therefore places all the elements of the top-k query answering problem in the positive quadrant of the dual plane. • The arrangement is bounded by a frame of dimensions [0, M] × [0, M + 1] (for x ≤ M, y = ax + b ≤ M + 1).

Top-k retrieval

line insertion

indexing • The complexity of the main loop of the insertion procedure is determined by the number of arrangement edges that must be traversed and the number of new edges that must be inserted. • An edge insertion is an O(1) operation. Since a line can intersect up to n other lines, the cost of inserting the new edges is O(n). • The zone of a line has complexity O(n), so the number of edges traversed is also O(n). • The space complexity of the solution is O(n^2), and the cost of each query answering, insert and delete operation is O(n).

Tuple pruning • Minimizes the number of tuples that need to be indexed, while maintaining the capability to correctly answer any top-k query. • Let Q be a top-k query. • Rk(Q): the point in the dual plane where a ray shooting upwards from p(Q) meets the k-th line. • Let Q1 < Q2 denote the fact that p(Q1) lies left of p(Q2), and let [Q1, Q2] be the interval along the x axis of the dual plane between p(Q1) and p(Q2).

LEMMA 1 Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l1(Q1, Q2) be the line in the dual plane that passes through the origin and Rk(Q1). Then, any line that is located above l1(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2]

LEMMA 2 Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l2(Q1, Q2) be the horizontal line in the dual plane that passes through Rk(Q2). Then, any line that is located above l2(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2].

Theorem 3 Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also I(Q1, Q2) be the intersection point of lines l1(Q1, Q2) (Lemma 1) and l2(Q1, Q2) (Lemma 2). We refer to this point as the pruning point. Then, any line that is located above I(Q1, Q2) cannot be in the result of any top-k query that lies inside [Q1, Q2].
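The pruning point can be computed directly from the two query results. The sketch below represents each dual line as a (slope, intercept) pair with both values in [0, 1] and each query by the x-coordinate of p(Q). It assumes x1 > 0 and a non-zero height for Rk(Q1); the function names are illustrative, and the k-th line is found by sorting rather than by walking an arrangement.

```python
def kth_height(lines, x, k):
    """y-coordinate of R_k(Q): height of the k-th lowest line at abscissa x."""
    return sorted(a * x + b for a, b in lines)[k - 1]


def pruning_point(lines, x1, x2, k):
    """Intersection of l1 (through the origin and R_k(Q1)) with
    l2 (the horizontal line through R_k(Q2))."""
    y1 = kth_height(lines, x1, k)
    y2 = kth_height(lines, x2, k)
    # l1 is y = (y1 / x1) * x, l2 is y = y2; they meet at x = x1 * y2 / y1.
    return x1 * y2 / y1, y2


def prune(lines, x1, x2, k):
    """Keep only the lines that do not pass above the pruning point."""
    px, py = pruning_point(lines, x1, x2, k)
    return [(a, b) for a, b in lines if a * px + b <= py]
```

Every line dropped by `prune` lies above the pruning point, so by Theorem 3 the top-k answer for any query in [Q1, Q2] can be computed from the surviving lines alone.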

pruning • Given two top-k queries Q1, Q2 and their results, we can filter out a portion of the data set D that is definitely irrelevant to any top-k query in [Q1, Q2]. • In other words, only the part of the data set that is not pruned, denoted by D∗, needs to be stored in the arrangement. • A top-k query in [Q1, Q2] can be answered for any k up to K, the number of results we choose to return for the queries Q1 and Q2. • K determines the position of the pruning point.

pruning • Consider a set B of m + 1 top-k queries B = {B1, ..., Bm+1}, such that Bi < Bj for i < j, p(B1) = (0, 0) and p(Bm+1) = (M, 0). We refer to these queries as borders. • Treating each border as a query, we can compute its top-k result. • The interval between two consecutive borders Bi and Bi+1 forms a strip Si. For each strip we compute the pruning point I(Si) = I(Bi, Bi+1) and identify the part of the full data set D that is needed to answer any top-k query that lies inside the strip. We denote the filtered data set associated with strip Si by D∗i.

borders

borders • Full arrangement • Strip arrangement • Since each strip is responsible for answering only the queries that lie between two borders Bi and Bi+1, we only need to construct and maintain the arrangement in the interval [Bi, Bi+1]. • This greatly reduces the arrangement complexity. • The complexity of arrangement operations is reduced to O(|D∗|) instead of O(n), n being the size of the buffer. In the case of uncorrelated data, the cost of arrangement operations is only O(k ln n).
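The border scheme can be sketched end to end as follows. Lines are (slope, intercept) pairs with values in [0, 1], borders are given by the sorted x-coordinates of p(B1), ..., p(Bm+1), and each strip keeps only the lines that do not pass above its pruning point. All function names are illustrative, and each filtered set is stored as a plain list that is sorted per query rather than as an arrangement.

```python
from bisect import bisect_right


def kth_height(lines, x, k):
    """Height of the k-th lowest line at abscissa x."""
    return sorted(a * x + b for a, b in lines)[k - 1]


def strip_dataset(lines, left, right, k):
    """Filtered data set D*_i for the strip between borders at left and right."""
    y1 = kth_height(lines, left, k)
    y2 = kth_height(lines, right, k)
    # Abscissa of the pruning point; for the border at the origin the line
    # through the origin and R_k is vertical, so the point sits at x = 0.
    # Assumes y1 > 0 whenever left > 0.
    px = 0.0 if left == 0 else left * y2 / y1
    return [(a, b) for a, b in lines if a * px + b <= y2]


def build_strips(lines, borders, k):
    """One filtered data set per pair of consecutive borders."""
    return [strip_dataset(lines, lo, hi, k) for lo, hi in zip(borders, borders[1:])]


def answer(strips, borders, x, k):
    """Answer a top-k query at abscissa x (0 <= x <= M) from its strip only."""
    i = min(bisect_right(borders, x) - 1, len(strips) - 1)
    return sorted(strips[i], key=lambda line: line[0] * x + line[1])[:k]
```

Adding more borders makes the strips narrower and the filtered sets smaller, at the price of maintaining more strips whenever a tuple enters or leaves the buffer.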

Experimental evaluations • Pruning Efficiency

Evaluating the performance

conclusion • Primal Plane • Dual Plane • Use of arrangements • Tuple Pruning Technique • Borders

references • Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. • Sharma Chakravarthy. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. • http://www.vldb.org/archives/website/2007/program/slides/s183-das.pdf

Questions? • Thank you

