1 / 50

User-Defined Aggregates for Advanced Database Applications

User-Defined Aggregates for Advanced Database Applications. Haixun Wang hxwang@cs.ucla.edu Computer Science Dept. University of California, Los Angeles. State of the Art: the big picture. Relational Databases: Single data model of spartan simplicity Logic-based database languages

stewartpaul
Télécharger la présentation

User-Defined Aggregates for Advanced Database Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. User-Defined Aggregates for Advanced Database Applications Haixun Wang hxwang@cs.ucla.edu Computer Science Dept. University of California, Los Angeles

  2. State of the Art: the big picture • Relational Databases: • Single data model of spartan simplicity • Logic-based database languages • Commercial success and dominance through SQL standards (in the 80’s) • A new wave of DB applications (in the 90’s) • Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, … • Underscores limitations of RDBMS • Prompted major extensions leading to SQL3 standards (SQL99 is a subset of SQL3)

  3. State of the Art: R&D highlights • Deductive DBs • Rule based syntax • Logic based formalization of query language semantics – e.g., nonmonotonic reasoning and stratification • Recursive queries • OO-DBs • Complex data types / inheritance • Expressive power by merging PL and query languages • OR-DBs • Rich data types / Path Expression (SQL) • UDFs and Database Extenders (Data Blades)

  4. State of the Art: the seamy side • A patchwork of major extensions • DBMSs have become more powerful but much hard and complex to build and use • Still not powerful enough • Data Mining: clustering, classification, association • New language constructs not helping either • Limited expressive power in other applications • Bill of Materials (BoM) type of applications • Temporal reasoning

  5. This thesis: Many of the problems can be solved by UDAs • User Defined Aggregates (UDAs): • insufficient support in commercial world and DB standards • Our claim: UDAs provide a more general and powerful mechanism for DB extensions • AXL – a system to make it easier to define UDAs • AXL – where SQL and Data Mining intersect

  6. A Brief History of AXL(and of my thesis) • Logic formalization of aggregates [DDLP’98, LBAI’00] • Early returns, monotonic aggregates: used freely in recursive queries. Extensions of LDL++ (Logic Database Language) • SQL-AG: Implementing and extending SQL3 UDAs[DBPL’99] • Implemented on DB2 using extended user-defined aggregates with ‘early returns’ • SADL: Simple Aggregate Definition Language [ICDE’00] • using SQL to define new aggregates • easy to use, but with limited performance and power • AXL: Aggregate eXtension Language [VLDB’00] • powerful, efficient and still SQL-based

  7. Defining UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return • INITIALIZE: gives an initial value to the aggregate • ITERATE : computes the intermediate aggregate value for each new record • TERMINATE: returns the final value computed for the aggregate • myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language

  8. Limitation of SQL3 UDAs UDAs in SQL3, Postgres, Informix, and early versions of LDL++ share the same limitations: • Aggregates can not be used inside recursion • No support for early returns and on-line aggregation • Also Ease of Use is a major issue (except for LDL++)

  9. Ease of Use • THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. • ASOLUTION: Use a high-level language for defining UDAs. But there are many potential problems with any new language. • THE IDEAL SOLUTION: use SQL to define new aggregates. Substantial benefits: • Users are already familiar with SQL • No impedance mismatch of data types and programming paradigms • DB advantages: scalability, data independence, optimizability, parallelizability

  10. Simple Aggregates AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE:{ INSERT INTO state(value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }

  11. Avoiding Multiple Scans Show the average salary of senior managers who make 3 timesmore than the average employees. • SQL: SELECT avg(salary) FROM employee WHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee) Two scans of the employee table required • With AXL UDAs: SELECT sscan(title, salary) FROM employee

  12. AXL: Using a Single Scan AGGREGATE sscan(title CHAR(20), salary INT) : REAL { TABLE state(sum INT, cnt INT) ASVALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state); } }

  13. Ordered Sequences and Time Series We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events. • SQL: • Group-by on the start time of each interval, and count! • With AXL UDAs: SELECT density(from, end) FROM events

  14. AXL: Using a Single Scan AGGREGATE density(start TIME, end TIME) : (time TIME, count INT) { TABLE state(time TIME, count INT) AS (0, 0); TABLE active(endpoint TIME); INITIALIZE: ITERATE: { DELETEFROM active WHERE endpoint < start; INSERTINTO active VALUES(end); UPDATE state SET time=start, count =count + 1 WHERE count < (SELECT count(*) FROM active); } TERMINATE: { INSERT INTO RETURN SELECT time, count FROM state; } }

  15. Early Returns • AVG normally converges early: an early approximation is all is needed in several applications • Online aggregation means that early returns are produced during the computation • Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc. • These might depend on the order – same as new OLAP extensions in SQL3

  16. Return avg for Every 100 Records AGGREGATEolavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }

  17. Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; } }

  18. Recursive Aggregates • In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;

  19. Check Point • Simple applications: AXL UDAs provide a solution with better performance and good ease of use. • Many advanced database applications • Time Series, Temporal Database, Spatial Database… • In particular data mining applications

  20. Data Mining and Database Systems • Current Approach: • Cursor-based: loose-coupling, stored procedures • UDFs: ease of use problems • Cache-Mine: • Using DBMSs as containers of data • Many attempts to closely integrate data mining functions into DBMS have shown major problems

  21. What we need … SQL-aware Data Mining Systems Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998

  22. Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)

  23. Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn); } } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT classify(recId, col, val, YorN) FROM col-val-pairs;

  24. Categorical Classifier in AXL [ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT); [ 3]TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value}); [ 4] TABLE mincol(Col INT, MinGini REAL); [ 5] TABLE ginitable(Col INT, Gini REAL); [ 6] INITIALIZE: ITERATE: { [ 7] INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN); [ 8]UPDATE summary [ 9]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue; [10]INSERTINTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0; [11] } [12] TERMINATE: { [13]INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc) [14]FROM summary GROUPBY Col; [15] INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable; [16] INSERT INTO result SELECT iNode, Col FROM mincol; [17]SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN) [18] FROM treenodes AS t, [19] (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m [20]WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m [21]WHERE t.RecId = m.RecId GROUP BY m.Value; [22] } [23]}

  25. Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C

  26. SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3]TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) ASVALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9]UPDATE summary [10]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11]WHERE Col=iCol; [12] } [13] TERMINATE: { [14]INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17]WHERE t.Col = m.Col AND m.Gini > 0; [18]SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20]GROUP BY s.LeftOrRight; [21] } [22]}

  27. Comparison with Other Architectures Datasets with 6.6 million records, support level .25%.

  28. Implementation of AXL Standalone Mode DB2 Add-on Mode .axl .axl Berkeley DB .cc .cc DB2 .lib SQL .exe UDFs DB2

  29. Implementation of AXL • Open interface of physical data model. • Currently using Berkeley DB as our storage manager • In memory tables • Limited Optimization • Using B+-Tree indexes to support equality/range query • Predicate push-down / push-up • User Defined Aggregates • Hash based • Return multiple rows: ‘early return’ • Return multiple columns: employee’s name and salary

  30. Implementation of AXL (cont’d) • Non-blocking aggregation • Keeping the state of aggregation between calls to the aggregate routines • Local tables defined inside aggregation are passed as parameters to the aggregates • Explicit sorting (and implicit hash-based aggregation) • AXL V1.2: above 30,000 lines of code

  31. Check Point • Simple applications: AXL UDAs provide a solution with better performance and good ease of use. • Data Mining applications • Formal Semantics of Aggregates and Monotonic Aggregation

  32. Aggregates in Recursion • Stratification: • shaves(barber, X) :- villager(X), shaves(X, X). villager(barber). • Aggregates: p  count(p) =0 • Aggregates in many applications are actually monotonic (and should be allowed inside recursion).

  33. Beyond Stratification • Significant previous efforts… • I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990 • A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993 • K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997 • S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998

  34. Formal Semantics of Aggregates • choice((X),(Y)) • Enforcing functional dependency. FD: X->Y • Multiplicity of stable models, monotonic transformation • Ordering a domain • Once (X,Y) is generated, choice ensures this is the only arc leaving source node X and entering sink node Y • Formal semantics of UDA • …return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V).…

  35. AGGREGATE mcount(): INT { TABLE state(cnt INT) AS VALUES (0); INITIALIZE: ITERATE: { UPDATE state SET cnt=cnt+1; INSERT INTO RETURNSELECT cnt FROM state; } } Early Returns  Monotonic Aggregates Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment:

  36. Early Returns  Monotonic Aggregates SELECT mcount(*) FROM employee; v.s. SELECT count(*) FROM employee; mcount{John, Mary, Tom} {1,2,3} count{John, Mary, Tom} 3 mcount{John, Mary, Tom, Jerry} {1,2,3,4} count{John, Mary, Tom, Jerry} 4

  37. Return sum at the nth value AGGREGATE sumat(value INT, n INT): INT { TABLE state (sum INT, cnt INT) ASVALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt +1; INSERT INTO RETURN SELECT sum FROM state WHERE cnt = n; } }

  38. Monotonic Aggregation • Monotonic aggregates can be used without any restriction and without changing the underlying implementation. • This solves the problem that had eluded database researchers since the introduction of relational systems: • BoM, Company Control, Join-the-Party… • Greedy optimization algorithms, such as Dijkstra’s single source shortest path.

  39. Join-the-Party Problem Some people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there. WITH wllcm(Name) AS((SELECT Pname FROM sure)UNION ALL (SELECT f.PnameFROM friend AS f, wllcm AS wWHERE w.Name =f.FnameGROUPBY f.PnameHAVING mcount()=3))SELECT Name FROM wllcm; Density-based Clustering [M. Ester et al. KDD 96]

  40. BoM: Cost of Parts 001 assembly 1 5 1 004 3 002 003 100 2 3 part-cost 5 10 8 006 fan-out 005

  41. WITH cst(part, cost) AS ((SELECT part, cost FROM part-cost) UNION ALL (SELECT a.part, sumat(a.qty * c.cost, p.ChC) FROM assembly AS a, cst AS c, fan-out AS p WHERE a.subpart = c.part AND p.part = a.part GROUP BY a.part)) SELECT part, cost FROM cst; Bottom up solution. Computes the cost for each part once and only once. Monotonic sumat(cost, n) returns sum when exactly n items are aggregated. Works in DB2 after AXL rewrites callings of sumat() to callings of automatically generated UDFs. BoM: Using AXL

  42. BoM: Using Recursive SQL WITH mpath(subpart, qty) AS ((SELECT subpart, qty FROM assembly WHERE part = ‘001’) UNION ALL (SELECT c.subpart, m.qty * c.qty FROM mpath m, assembly c WHERE m.subpart = c.part)) SELECTsum(m.qty * c.cost) FROM mpath m, part_cost c WHERE m.subpart = c.part ; • Top down solution: computes the cost of part ‘001’ • Explosion: all edges that descend from part ‘001’ are “stored” in mpath • What if we want to compute the cost of each part?

  43. Check Point • Simple applications • Data Mining and Decision Support • Formal Semantics & Monotonic aggregates • OLAP and other aggregate extensions • SUCH THAT • CUBE, ROLLUP, GROUPING SET • OLAP Functions

  44. For each division, show the average salary of senior managers who make 3 times more than the average employees, and the average salary of senior engineers who make 2 times more than the average employees (in the same output record). D. Chatziantoniou, Kenneth Ross, VLDB 1996 SELECT division, avg(X.salary), avg(Y.salary) FROM employee GROUP BY division: X, Y SUCH THATX.title = ‘senior manager’ AND X.salary > 3*avg(salary) AND Y.title = ‘senior engineer’ AND Y.salary > 2*avg(salary) “Such That”

  45. Expressing “Such That” in AXL TABLE seniors(salary INT); AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: {UPDATE state SET sum=sum+salary, cnt=cnt+1;INSERT INTO seniors VALUES (salary) WHERE title = qtitle; } TERMINATE: { SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > ratio * (SELECT sum/cnt FROM state); } }

  46. Using UDA sscan2 SELECT division, sscan2(title, salary, ‘senior manager’, 3), sscan2(title, salary, ‘senior engineer’, 2) FROM employee GROUP BY division; • No joins or sub-queries required. • One pass through the employee relation (standard SQL requires at least 2 passes).

  47. Other Aggregate Extensions • GROUPING SETS, ROLL-UP, CUBE • New OLAP extensions • Windows containing a partitioning, an ordering of rows and an aggregate group • “… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle Introduction to OLAP Functions

  48. Contributions • Adding extended UDAs to O-R systems • high level language, minimal additions to SQL • monotonic aggregates, recursive aggregates • designed for general purpose applications • Tightly couple data mining functions with DBMS • SPRINT Algorithm, Categorical Classifier, … • Performance • More efficient than cursor-based languages like PL/SQL, JDBC and UDF-based approaches …

  49. Future Directions • Parallelization • Extenders/Data Blades • build on top of UDAs instead of UDFs • Decision Support • the Apriori algorithm • Windows and OLAP functions • Spatial/Temporal extensions • the TENOR system

  50. Future Direction: Parallelization • Current parallel aggregation algorithms valid for AXL: • Inter-Partition parallelism: all tuples of the same group-by value are in one node • Two phase algorithm: user provides a COMBINE routine • Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE • Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs

More Related