Continuous Query Languages for DSMS

Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo

Blocking Operators • A blocking query operator is ‘one that is unable to produce the first tuple of the output until it has seen the entire input’ [Babcock et al. PODS02] • But continuous queries cannot wait for the end of the stream: must return results while the data is streaming in. Blocking operators cannot be used. • Only non-blocking (nb) queries and operators can be used on data streams (i.e. those that return their results before they have detected the end of the input). • Current DBMSs make heavy usage of blocking computations: • For operators that are intrinsically blocking: e.g., SQL aggregates, • And for those that are not: e.g., sort-based implementation of joins and group by • We only need to be concerned with 1: find a characterization for blocking & nonblocking independent of implementation.

Partial Ordering Let S = [ t1, ¼, tn] be a sequence and 0 £k £n. Then [t1, ¼, tk ] is said to be the presequence of S, of length k, denoted by Sk. We write L  S to denote that L is a presequence of S,  Defines a Partial Order: reflexive, antisymmetric and transitive.  generalizes to the subset notion when order and duplicates are immaterial The empty sequence,[ ], is a subsequence of every other sequence.

employees(E#, Dept) select dept, count(E#) from employees group by dept Traditional SQL-2 aggregates: Blocking select dept, count(E#) over (partition bydeptrange unbounded preceding) from employees SQL:2003 OLAP functions: Non-Blocking Continuous countreturns, for each new tuple, the count so far. Consider a sequence of length n. At each step j<n, j is returned  cumulative return up to j: sumj(S)= [1,2, …, j]independent on whether j=n or j<n. Traditional count: For each j<n --nothing: sumj(S)=[] Final: sumn(S)=[n]

Examples • Traditional SQL-2 aggregates are blocking—SQL:2003 OLAP functions are not. • Selection is nonblocking. • Continuous count (i.e., the unlimited preceding count of OLAP functions) is non-blocking • Also window aggregates are non-blocking • In between cases: e.g., traditional aggregates on input that is already sorted on group-by values.

Characterization of NonBlocking (nb) • Many functions expressible by nb-computations can also be expressed by blocking ones. E.g., joins can be implemented using sorting. Ditto for projections with duplicate elimination. • But many functions implemented using blocking computation cannot be given an nb-implementation. • We must distinguish between the two kinds of functions, since one can be supported in our DSMS (via suitable nb-implementation) and the other cannot. Theorem: Queries can be expressed via nb computations iff they are monotonic w.r.t. the presequence ordering.

NB-completeness • A query language Lcan express a given set of functions F on its input (DB, sequences, data streams)---the larger F, the greater the expressive power of L. • Non-monotonic functions are intrinsically blocking and they cannot be used on data streams. Thus, if we use L in a DSMS, we give up the non-monotonic subset of F with no regret. However, let us make sure that we do not give up anything more! • More? Yes, because for continuous queries of streams, we will normally disallow L’sblocking (i.e. nonmonotonic) operators & constructs, and only allow nb (i.e., monotonic ) operators. • But are ALL the monotonic functions expressible by L using the nb-operators of L ? Or by disallowing blocking operators did we also lose the ability of expressing some monotonic queries? Definition:L is said to be nb-complete when it can express all the monotonic queries expressible by L using only its nb-operators.

Expressive Power and NB-Completeness • NB-completeness is a test that a language is as suitable for continuous queries on data streams as it is on stored database. • In a language L lacking nb-completeness, there are monotonic functions that L cannot express as continuous queries, that L can express if the stream had been stored in a database. • For instance, Relational Algebra and SQL are not nb-complete (in addition to the shortcomings they might have on DBs).

Sets versus Sequences • Sets are sequences where duplicates are allowed and order is immaterial. • Thus S1 is a subset of S2 iff S1 can be reordered in a presequence of S2. • Theorem [Lifted from sequences to sets]. A function is is nb iff it is monotonic. • NB=monotonic: selection, projection, and OLAP functions • Blocking=Non-Monotonic: e.g. Traditional aggregates. • Operators of more than one argument: • Join are monotonic (i.e., NB) in both arguments. • R-S is monotonic on R and antimonotonic on S: i.e., will block on S but not on R (after it has seen the whole S, though)

Relational Algebra (RA) • Set difference can produce monotonic queries: Intersection: R1Ç R2= R1- (R1- R2) • Are these still expressible without set diff? • Intersection can be expressed as a joins: product+select • But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA. • Example: Temporal domain isomorfic to nonnegative integers.Intervals closed to the left but open to the right: p(0, 3). % 0,1, and 2 are in p but 3 is not p(2, 4). % 3 is not a hole because is covered by this p(4, 5). % 5 is a hole because not covered by any other interval p(6, 8).

Coalesce p (cp) & p Until q p(0, 3). p(2, 4). p(4, 5). p(6, 8). cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0,5). cp contains intervals from the start point of any p interval to the endpoint of any p interval unless the endpoint of an interval in between is a hole. cp(I1, J2) ¬ p(I1, J1), p(I2, J2), J1 < J2, Øhole(I1, J2). hole(I1, J2) ¬ p(I1, J1), p(I2, J2), p(_,K), J1 £ K, K < I2, Øcep(K). cep(K) ¬ p(_, K), p(I, J), I £ K, K < J. q(5,_) holdsif cp has an interval that starts at 0 & contains 5pUntil q(yes) ¬ q(0, J). pUntil q(yes) ¬ cp(0, I), q(J, _), I ³ J .

Relational Algebra • NonMonotonic (i.e., blocking) RA operators: set difference and division • We are left with: select, project, join, and union. Can these express all FO monotonic queries? • Some interesting temporal queries: coalesce and until • They are expressible in RA (by double negation) • They are monotonic • They cannot be expressed in nb-RA. Theorem: RA and SQL are not nb-complete. SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.

E-Bay Example • Auctions: a stream of bids on an item. bidStream(Item#, BidValue, Time) • Items for which sum of bids is > 100K SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000; • This is a monotonic query. Thus it can be expressed in a language containing suitable query operators, but not in SQL-2. SQL-2 is not nb-complete; thus it is ill-suited for continuous queries on data streams. • So SQL-2 is not nb-complete because of its blocking aggregates. What about relational algebra?

Incompleteness of Relational QL • The coalesce and until queries • can be expressed in safe nonrecursive Datalog, thus • They are expressible in RA, • They are monotonic • They cannot be expressed in nb-RA Theorem: RA and SQL are not nb-complete. • A new limitation for DB query languages (which were already severely challenged in terms of expressive power)

Embedding SQL Queries in a PL • In DB applications, SQL can be embedded in a PL (Java, C++…) where the PL accesses the tuples returned by SQL using a Get Next of Cursor statement. • Operations that could not be expressed in SQL can then be expressed in the PL: • an effective remedy for the lack of expressive power of SQL • But cursors is a ‘pull-based’ mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL request them. • The DSMS can only deliver its output to the PL as a stream—This is OK to drive a GUI. But if most of the work has not been done yet, who is the DSMS? • Contrast this to DBMS who are useful even with a weak QL.

Reviewing the Situation • SQL’s lack of expressive power is a major problem for database-centric applications. • These problems are significantly more serious for data streams since: • Only monotonic queries can be used, • Actually, not even all the monotonic ones since SQL is not nb-complete, • These problems cannot be really by using PLs with embedded SQL statements on streams • DSMS will be impaired--unless significant improvements can be made.

UDAs to the Rescue • Full support for UDAs with all window combinations—effective on UDAs written in SQL, PLs, and even built-ins • Support for continuous queries and ad hoc queries, under a simple and unified semantics • Turing completeness --all possible queries • nb-completeness all monotonic queries using only non-blocking operators (e.g., window UDAs & those without TERMINATE) • Effective on a broad range of data-intensive applications: data/stream mining, approximate queries, sequential patters (XML not there) • Making a strong case for the DB-oriented approach to data streams.

Conclusion • Language Technology: • ESL a very powerful language for data stream and DB applications • Simple semantics and unified syntax conforming to SQL:2003 standards • Strong case for the DB-oriented approach to data streams • System Technology: • Some performance-oriented techniques well-developed—e.g., buffer management for windows • For others: work is still in progress—stay tuned for latest news • Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill

References [1]ATLaS user manual. http://wis.cs.ucla.edu/atlas. [2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of Lecture Notes in Computer Science. Springer, 1999. [4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries over streams and relations. Technical report, Stanford University, 2002. [5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002. [9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002. [10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995. [11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002. [12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, pages 379-390, May 2000. [13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003. [14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5-14, 2003. [15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. [16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006 [17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in Data Stream Management Systems. ICDE 2007.

References (Cont.) [18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database Sequences and Data Streams. VLDB 2004: 492-503 [19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In SIGMOD, pages 49-61, 2002. [20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003. [21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted relational query language, 1998. [23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001. [24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001. [25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998. [26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232-239, Taipei, Taiwan, March 1995. [27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM SIGMOD 1994, pages 430-441. ACM Press, 1994. [28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996. [29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In SIGMOD, pages 321-330, 6 1992. [30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003. [31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.

DSMS Research Projects • Aurora (Brandeis/Brown/MIT) http://www.cs.brown.edu/research/aurora/ • Cougar (Cornell) http://www.cs.cornell.edu/database/cougar/ • Telegraph (Berkeley)- http://telegraph.cs.berkeley.edu • STREAM (Stanford) –http://www-db.stanford.edu/stream • Niagara (OGI/Wisconsin)-http://www.cs.wisc.edu/niagara/ • OpenCQ (Georgia Tech) – http://disl.cc.gatech.edu/CQ • Tapestry (Xerox) – electronic documents stream filtering • Hancock (AT&T) http://www.research.att.com/~kfisher/hancock/ • Cape (WPI) http://davis.wpi.edu/dsrg/CAPE/home.html • Tribeca (Bellcore) – network monitoring • Stream Mill (UCLA) – http://wis.cs.ucla.edu/stream-mill • Gigascope …

CQLs for DSMS • Most of DSMS projects use SQL for continuous queries—for good reasons, since • Many applications span data streams and DB tables • A CQL based on SQL will be easier to learn & use • Moreover: the fewer the differences the better! • But DSMS were designed for persistent data and transient queries---not for persistent queries on transient data • Adaptation of SQL and its enabling technology presents difficult research challenges • These combine with traditional SQL problem, such as inability to deal with sequences, DM tasks, and other complex query tasks---i.e., lack of expressive power

Language Problems • Most DSMS projects use SQL — queries spanning both data streams and DBs will be easier. But … • Even for persistent data, SQL is far from perfect.Important application areas poorly supported include: • Data Mining, and we need to mine data streams, • Sequence queries, and data streams are infinite time series! • Major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data) • Only NonBlocking operators in DSMS: blocking forbidden • Distinction not clear in DBMS which often use blocking implementations for nonblocking operators • The distinction needs to formally characterized • and so is the loss of query power of the QL.

Continuous Query Languages for DSMS

Continuous Query Languages for DSMS

Presentation Transcript

Semantic Query Languages

Relational Query Languages

Query Languages for XML

Logical Query Languages

Continuous Query Languages for DSMS

Query Languages

Query Languages for XML: XQuery

XML Query Languages

XML Query Languages

Query Languages

Query Languages

Query Languages

Query Languages for XML

XML Query Languages

Query Languages for XML

Relational Query Languages

RDF Query Languages

Relational Query Languages

XML Query Languages