280 likes | 356 Vues
BUILDING A DATABASE SYSTEM FOR ORDER. New England Database Seminars April 2002 Alberto Lerner – ENST Paris Dennis Shasha – NYU {lerner,shasha}@cs.nyu.edu. Agenda. Motivation SQL + Order Transformations Conclusion. Motivation The need for ordered data. Some queries rely on order
E N D
BUILDING A DATABASE SYSTEM FOR ORDER New England Database Seminars April 2002 Alberto Lerner – ENST ParisDennis Shasha – NYU {lerner,shasha}@cs.nyu.edu
Agenda • Motivation • SQL + Order • Transformations • Conclusion
MotivationThe need for ordered data • Some queries rely on order • Examples: • Moving averages • Top N • Rank • “SQL can handle it.” Can it really?
MotivationMoving Averages: algorithmically linear Sales(month, total) SELECT t1.month+1 AS forecastMonth, (t1.total+ t2.total + t3.total)/3 AS 3MonthMovingAverageFROM Sales AS t1, Sales AS t2, Sales AS t3WHERE t1.month = t2.month - 1 AND t1.month = t3.month – 2 Can optimizer make a 3-way (in general, n-way) join linear time? Ref: Data Mining and Statistical Analysis Using SQL Trueblood and LovettApress, 2001
MotivationTop N Employee(Id, salary) SELECT DISTINCT count(*), t1.salaryFROM Employee AS t1, Employee AS t2WHERE t1.salary <= t2.salaryGROUP BY t1.salaryHAVING count(*) <= N How many elements of cross-product have salaries at least as large as t1.salary? Will optimizer see essential sort-count trick? Ref: SQL for Smarties Joe CelkoMorgan Kauffman, 1995
MotivationProblems Extending SQL with Order • Queries are hard to read • Cost of execution is often non-linear (would not pass basic algorithms course) • Few operators preserve order, so optimization hard.
Agenda • Motivation • SQL + Order • Transformations • Conclusion
SQL + OrderDesirable Features • Express order-dependent predicates and clauses in a readable, clear way • Make optimization opportunities explicit (by getting rid of complex idioms, see above) • Execution in linear (or n log n) time when possible
SQL + Orderthree steps in solution • Give SQL a vector-oriented semantics – Database is a set of array-tables “arrables”; variables in the queries do not refer to a single tuple at a time anymore, but to a whole column vector • Provide new vector-to-vector functions – Supporting order-based manipulations of column vectors • Streaming: new data may need special treatment.
SQL + OrderMoving Averages Sales(month, total) SELECT month, avgs(8, total)FROM Sales ASSUMING ORDER month avgs: vector-to-vector function, order-dependant and size-preserving order to be used on vector-to-vector functions • Execution (Sales is an arrable): • FROM clause – enforces the order in ASSUMING clause • SELECT clause – for each month yields the moving average (window size 8) ending at that month. No 8-way join.
SQL + OrderTop N Employee(ID, salary) SELECT first(N, salary) FROM Employee ASSUMING ORDER Salary first: vector-to-vector function, order-dependant and non size-preserving • Execution: • FROM clause – orders arrable by Salary • SELECT clause – applies first() to the ‘salary’ vector, yielding first N values of that vector given the order. Could get the top earning IDs by saying first(N, ID).
SQL + OrderRanking SalesReport(salesPerson, territory, total) SELECT territory, salesPerson, total, rank(total)FROM SalesReport WHERE rank(total) < N rank: vector-to-vector function, non order-dependant and size-preserving • Execution: • FROM clause – assuming is NOT needed. • rank is applied to the ‘total’ vector and maps each position into an integer.
SQL + OrderVector-to-Vector Functions size-preserving non size-preserving prev, next, $, [] avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse, … drop, first, last order-dependant rank, tile min, max, avg, count non order-dependant
SQL + OrderComplex queries: Best spread In a given day, what would be the maximum difference between a buying and selling point of each security? Ticks(ID, price, tradeDate, timestamp, …) SELECT ID, max(price – mins(price))FROM Ticks ASSUMING ORDER timestampWHERE tradeDate = ‘99/99/99’GROUP BY ID max bestspread running min min • Execution: • For each security, compute the running minimum vector for price and then subtract from the price vector itself; result is a vector of spreads. • Note that max – min would overstate spread.
SQL + OrderComplex queries: Crossing averages part I When does the 21-day average cross the 5-month average? Market(ID, closePrice, tradeDate, …)TradedStocks(ID, Exchange,…) INSERT INTO temp FROMSELECT ID, tradeDate, avgs(21 days, closePrice) AS a21, avgs(5 months, closePrice) AS a5, prev(avgs(21 days, closePrice)) AS pa21, prev(avgs(5 months, closePrice)) AS pa5FROM TradedStocks NATURAL JOIN Market ASSUMING ORDER tradeDateGROUP BY ID
SQL + OrderComplex queries: Crossing averages part I • Execution: • FROM clause – order-preserving join • GROUP BY clause – groups are defined based on the value of the Id column • SELECT clause – functions are applied; non-grouped columns become vector fields so that target cardinality is met. Violates first normal form Vectorfield groups in ID and non-grouped column grouped ID and non-grouped column two columns withthe same cardinality
SQL + OrderComplex queries: Crossing averages part II Get the result from the resulting non first normal form relation temp SELECT ID, tradeDateFROM flatten(temp)WHERE a21 > a5 AND pa21 <= pa5 • Execution: • FROM clause – flatten transforms temp into a first normal form relation (for row r, every vector field in r MUST have the same cardinality). Could have been placed at end of previous query. • Standard query processing after that.
SQL + OrderRelated Work: Research • SEQUIN – Seshadri et al. • Sequences are first-class objects • Difficult to mix tables and sequences. • SRQL – Ramakrishnan et al. • Elegant algebra and language • No work on transformations. • SQL-TS – Sadri et al. • Language for finding patterns in sequence • But: Not everything is a pattern!
SQL + OrderRelated Works: Products • RISQL – Red Brick • Some vector-to-vector, order-dependent, size-preserving functions • Low-hanging fruit approach to language design. • Analysis Functions – Oracle 9i • Quite complete set of vector-to-vector functions • But: Can only be used in the select clause; poor optimization (our preliminary study) • KSQL – Kx Systems • Arrable extension to SQL but syntactically incompatible. • No cost-based optimization.
Agenda • Motivation • SQL + Order • Transformations • Conclusion
TransformationsEarly sorting + order preserving operators SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq ASSUMING ORDER hq.TradeDateGROUP BY Id avgs avgs avgs g-by sort g-by op avgs op sort g-by g-by op op sort op (1) Sort then joinpreserving order (2) Preserve existingorder (3) Join then sortbefore grouping (4) Join then sortafter grouping
TransformationsUDFs evaluation order Gene(geneId, seq)SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)FROM Gene AS t1, Gene AS tWHERE dist(t1.seq, t2.seq) < 5 AND posA(t1.seq, t2.seq) posA asks whether sequences have Nucleo A in same position. Dist gives edit distance between two Sequences. posA dist Switch dynamicallybetween (1) and (2) depending on the execution history dist posA (1) (2) (3)
TransformationsOrder preserving joins select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid from order, lineitem assuming order lineid where order.date > 45 and order.date < 55 and lineitem.orderid = order.orderid • Basic strategy 1: restrict based on date. Create hash on order. Run through lineitem, performing the join and pulling out the qty. • Basic strategy 2: Arrange for lineitem.orderid to be an index into order. Then restrict order based on date giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows. • The relevant order rows are then fetched using the surviving • lineitem.orderid. • Strategy 2 is often 3-10 times faster.
Transformations Building Blocks • Order optimization • Simmens et al. `96 – push-down sorts over joins, and combining and avoiding sorts • Order preserving operators • KSQL – joins on vector • Claussen et al. `00 – OP hash-based join • Push-down aggregating functions • Chaudhuri and Shim `94, Yan and Larson `94 – evaluate aggregation before joins • UDF evaluation • Hellerstein and Stonebraker ’93 – evaluate UDF according to its ((output/input) – 1)/cost per tuple • Porto et al. `00 – take correlation into account
Agenda • Motivation • SQL + Order • Transformations • Conclusion
Conclusion • Arrable-based approach to ordered databases may be scary – dependency on order, vector-to-vector functions – but it’s expressive and fast. • SQL extension that includes order is possible and reasonably simple. • Optimization possibilities are vast.