500 likes | 518 Vues
User-Defined Aggregates for Advanced Database Applications. Haixun Wang hxwang@cs.ucla.edu Computer Science Dept. University of California, Los Angeles. State of the Art: the big picture. Relational Databases: Single data model of spartan simplicity Logic-based database languages
E N D
User-Defined Aggregates for Advanced Database Applications Haixun Wang hxwang@cs.ucla.edu Computer Science Dept. University of California, Los Angeles
State of the Art: the big picture • Relational Databases: • Single data model of spartan simplicity • Logic-based database languages • Commercial success and dominance through SQL standards (in the 80’s) • A new wave of DB applications (in the 90’s) • Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, … • Underscores limitations of RDBMS • Prompted major extensions leading to SQL3 standards (SQL99 is a subset of SQL3)
State of the Art: R&D highlights • Deductive DBs • Rule based syntax • Logic based formalization of query language semantics – e.g., nonmonotonic reasoning and stratification • Recursive queries • OO-DBs • Complex data types / inheritance • Expressive power by merging PL and query languages • OR-DBs • Rich data types / Path Expression (SQL) • UDFs and Database Extenders (Data Blades)
State of the Art: the seamy side • A patchwork of major extensions • DBMSs have become more powerful but much hard and complex to build and use • Still not powerful enough • Data Mining: clustering, classification, association • New language constructs not helping either • Limited expressive power in other applications • Bill of Materials (BoM) type of applications • Temporal reasoning
This thesis: Many of the problems can be solved by UDAs • User Defined Aggregates (UDAs): • insufficient support in commercial world and DB standards • Our claim: UDAs provide a more general and powerful mechanism for DB extensions • AXL – a system to make it easier to define UDAs • AXL – where SQL and Data Mining intersect
A Brief History of AXL(and of my thesis) • Logic formalization of aggregates [DDLP’98, LBAI’00] • Early returns, monotonic aggregates: used freely in recursive queries. Extensions of LDL++ (Logic Database Language) • SQL-AG: Implementing and extending SQL3 UDAs[DBPL’99] • Implemented on DB2 using extended user-defined aggregates with ‘early returns’ • SADL: Simple Aggregate Definition Language [ICDE’00] • using SQL to define new aggregates • easy to use, but with limited performance and power • AXL: Aggregate eXtension Language [VLDB’00] • powerful, efficient and still SQL-based
Defining UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return • INITIALIZE: gives an initial value to the aggregate • ITERATE : computes the intermediate aggregate value for each new record • TERMINATE: returns the final value computed for the aggregate • myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language
Limitation of SQL3 UDAs UDAs in SQL3, Postgres, Informix, and early versions of LDL++ share the same limitations: • Aggregates can not be used inside recursion • No support for early returns and on-line aggregation • Also Ease of Use is a major issue (except for LDL++)
Ease of Use • THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. • ASOLUTION: Use a high-level language for defining UDAs. But there are many potential problems with any new language. • THE IDEAL SOLUTION: use SQL to define new aggregates. Substantial benefits: • Users are already familiar with SQL • No impedance mismatch of data types and programming paradigms • DB advantages: scalability, data independence, optimizability, parallelizability
Simple Aggregates AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE:{ INSERT INTO state(value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }
Avoiding Multiple Scans Show the average salary of senior managers who make 3 timesmore than the average employees. • SQL: SELECT avg(salary) FROM employee WHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee) Two scans of the employee table required • With AXL UDAs: SELECT sscan(title, salary) FROM employee
AXL: Using a Single Scan AGGREGATE sscan(title CHAR(20), salary INT) : REAL { TABLE state(sum INT, cnt INT) ASVALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state); } }
Ordered Sequences and Time Series We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events. • SQL: • Group-by on the start time of each interval, and count! • With AXL UDAs: SELECT density(from, end) FROM events
AXL: Using a Single Scan AGGREGATE density(start TIME, end TIME) : (time TIME, count INT) { TABLE state(time TIME, count INT) AS (0, 0); TABLE active(endpoint TIME); INITIALIZE: ITERATE: { DELETEFROM active WHERE endpoint < start; INSERTINTO active VALUES(end); UPDATE state SET time=start, count =count + 1 WHERE count < (SELECT count(*) FROM active); } TERMINATE: { INSERT INTO RETURN SELECT time, count FROM state; } }
Early Returns • AVG normally converges early: an early approximation is all is needed in several applications • Online aggregation means that early returns are produced during the computation • Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc. • These might depend on the order – same as new OLAP extensions in SQL3
Return avg for Every 100 Records AGGREGATEolavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } }
Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; } }
Recursive Aggregates • In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’;
Check Point • Simple applications: AXL UDAs provide a solution with better performance and good ease of use. • Many advanced database applications • Time Series, Temporal Database, Spatial Database… • In particular data mining applications
Data Mining and Database Systems • Current Approach: • Cursor-based: loose-coupling, stored procedures • UDFs: ease of use problems • Cache-Mine: • Using DBMSs as containers of data • Many attempts to closely integrate data mining functions into DBMS have shown major problems
What we need … SQL-aware Data Mining Systems Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998
Decision Tree Classifiers Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)
Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn); } } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT classify(recId, col, val, YorN) FROM col-val-pairs;
Categorical Classifier in AXL [ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT); [ 3]TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value}); [ 4] TABLE mincol(Col INT, MinGini REAL); [ 5] TABLE ginitable(Col INT, Gini REAL); [ 6] INITIALIZE: ITERATE: { [ 7] INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN); [ 8]UPDATE summary [ 9]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue; [10]INSERTINTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0; [11] } [12] TERMINATE: { [13]INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc) [14]FROM summary GROUPBY Col; [15] INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable; [16] INSERT INTO result SELECT iNode, Col FROM mincol; [17]SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN) [18] FROM treenodes AS t, [19] (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m [20]WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m [21]WHERE t.RecId = m.RecId GROUP BY m.Value; [22] } [23]}
Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C
SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3]TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) ASVALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9]UPDATE summary [10]SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11]WHERE Col=iCol; [12] } [13] TERMINATE: { [14]INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17]WHERE t.Col = m.Col AND m.Gini > 0; [18]SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20]GROUP BY s.LeftOrRight; [21] } [22]}
Comparison with Other Architectures Datasets with 6.6 million records, support level .25%.
Implementation of AXL Standalone Mode DB2 Add-on Mode .axl .axl Berkeley DB .cc .cc DB2 .lib SQL .exe UDFs DB2
Implementation of AXL • Open interface of physical data model. • Currently using Berkeley DB as our storage manager • In memory tables • Limited Optimization • Using B+-Tree indexes to support equality/range query • Predicate push-down / push-up • User Defined Aggregates • Hash based • Return multiple rows: ‘early return’ • Return multiple columns: employee’s name and salary
Implementation of AXL (cont’d) • Non-blocking aggregation • Keeping the state of aggregation between calls to the aggregate routines • Local tables defined inside aggregation are passed as parameters to the aggregates • Explicit sorting (and implicit hash-based aggregation) • AXL V1.2: above 30,000 lines of code
Check Point • Simple applications: AXL UDAs provide a solution with better performance and good ease of use. • Data Mining applications • Formal Semantics of Aggregates and Monotonic Aggregation
Aggregates in Recursion • Stratification: • shaves(barber, X) :- villager(X), shaves(X, X). villager(barber). • Aggregates: p count(p) =0 • Aggregates in many applications are actually monotonic (and should be allowed inside recursion).
Beyond Stratification • Significant previous efforts… • I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990 • A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993 • K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997 • S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998
Formal Semantics of Aggregates • choice((X),(Y)) • Enforcing functional dependency. FD: X->Y • Multiplicity of stable models, monotonic transformation • Ordering a domain • Once (X,Y) is generated, choice ensures this is the only arc leaving source node X and entering sink node Y • Formal semantics of UDA • …return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V).…
AGGREGATE mcount(): INT { TABLE state(cnt INT) AS VALUES (0); INITIALIZE: ITERATE: { UPDATE state SET cnt=cnt+1; INSERT INTO RETURNSELECT cnt FROM state; } } Early Returns Monotonic Aggregates Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment:
Early Returns Monotonic Aggregates SELECT mcount(*) FROM employee; v.s. SELECT count(*) FROM employee; mcount{John, Mary, Tom} {1,2,3} count{John, Mary, Tom} 3 mcount{John, Mary, Tom, Jerry} {1,2,3,4} count{John, Mary, Tom, Jerry} 4
Return sum at the nth value AGGREGATE sumat(value INT, n INT): INT { TABLE state (sum INT, cnt INT) ASVALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt +1; INSERT INTO RETURN SELECT sum FROM state WHERE cnt = n; } }
Monotonic Aggregation • Monotonic aggregates can be used without any restriction and without changing the underlying implementation. • This solves the problem that had eluded database researchers since the introduction of relational systems: • BoM, Company Control, Join-the-Party… • Greedy optimization algorithms, such as Dijkstra’s single source shortest path.
Join-the-Party Problem Some people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there. WITH wllcm(Name) AS((SELECT Pname FROM sure)UNION ALL (SELECT f.PnameFROM friend AS f, wllcm AS wWHERE w.Name =f.FnameGROUPBY f.PnameHAVING mcount()=3))SELECT Name FROM wllcm; Density-based Clustering [M. Ester et al. KDD 96]
BoM: Cost of Parts 001 assembly 1 5 1 004 3 002 003 100 2 3 part-cost 5 10 8 006 fan-out 005
WITH cst(part, cost) AS ((SELECT part, cost FROM part-cost) UNION ALL (SELECT a.part, sumat(a.qty * c.cost, p.ChC) FROM assembly AS a, cst AS c, fan-out AS p WHERE a.subpart = c.part AND p.part = a.part GROUP BY a.part)) SELECT part, cost FROM cst; Bottom up solution. Computes the cost for each part once and only once. Monotonic sumat(cost, n) returns sum when exactly n items are aggregated. Works in DB2 after AXL rewrites callings of sumat() to callings of automatically generated UDFs. BoM: Using AXL
BoM: Using Recursive SQL WITH mpath(subpart, qty) AS ((SELECT subpart, qty FROM assembly WHERE part = ‘001’) UNION ALL (SELECT c.subpart, m.qty * c.qty FROM mpath m, assembly c WHERE m.subpart = c.part)) SELECTsum(m.qty * c.cost) FROM mpath m, part_cost c WHERE m.subpart = c.part ; • Top down solution: computes the cost of part ‘001’ • Explosion: all edges that descend from part ‘001’ are “stored” in mpath • What if we want to compute the cost of each part?
Check Point • Simple applications • Data Mining and Decision Support • Formal Semantics & Monotonic aggregates • OLAP and other aggregate extensions • SUCH THAT • CUBE, ROLLUP, GROUPING SET • OLAP Functions
For each division, show the average salary of senior managers who make 3 times more than the average employees, and the average salary of senior engineers who make 2 times more than the average employees (in the same output record). D. Chatziantoniou, Kenneth Ross, VLDB 1996 SELECT division, avg(X.salary), avg(Y.salary) FROM employee GROUP BY division: X, Y SUCH THATX.title = ‘senior manager’ AND X.salary > 3*avg(salary) AND Y.title = ‘senior engineer’ AND Y.salary > 2*avg(salary) “Such That”
Expressing “Such That” in AXL TABLE seniors(salary INT); AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: {UPDATE state SET sum=sum+salary, cnt=cnt+1;INSERT INTO seniors VALUES (salary) WHERE title = qtitle; } TERMINATE: { SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > ratio * (SELECT sum/cnt FROM state); } }
Using UDA sscan2 SELECT division, sscan2(title, salary, ‘senior manager’, 3), sscan2(title, salary, ‘senior engineer’, 2) FROM employee GROUP BY division; • No joins or sub-queries required. • One pass through the employee relation (standard SQL requires at least 2 passes).
Other Aggregate Extensions • GROUPING SETS, ROLL-UP, CUBE • New OLAP extensions • Windows containing a partitioning, an ordering of rows and an aggregate group • “… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle Introduction to OLAP Functions
Contributions • Adding extended UDAs to O-R systems • high level language, minimal additions to SQL • monotonic aggregates, recursive aggregates • designed for general purpose applications • Tightly couple data mining functions with DBMS • SPRINT Algorithm, Categorical Classifier, … • Performance • More efficient than cursor-based languages like PL/SQL, JDBC and UDF-based approaches …
Future Directions • Parallelization • Extenders/Data Blades • build on top of UDAs instead of UDFs • Decision Support • the Apriori algorithm • Windows and OLAP functions • Spatial/Temporal extensions • the TENOR system
Future Direction: Parallelization • Current parallel aggregation algorithms valid for AXL: • Inter-Partition parallelism: all tuples of the same group-by value are in one node • Two phase algorithm: user provides a COMBINE routine • Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE • Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs