Association Rules Mining with SQL
Kirsten Nelson, Deepen Manek
November 24, 2003
Organization of Presentation
• Overview – Data Mining and RDBMS
• Loosely-coupled data and programs
• Tightly-coupled data and programs
• Architectural approaches
• Methods of writing efficient SQL
  • Candidate generation, pruning, support counting
  • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
• Integrating taxonomies
• Mining sequential patterns
Early data mining applications
• Most early mining systems were developed on file systems, with specialized data structures and buffer management strategies devised for each
• All data was read into memory before computation began
• This limits the amount of data that can be mined to what fits in main memory
Advantage of SQL and RDBMS
• Make use of database indexing and query processing capabilities
• More than a decade has been spent making these systems robust, portable, scalable, and concurrent
• Exploit underlying SQL parallelization
• For long-running algorithms, use checkpointing and space management
Organization of Presentation
• Overview – Data Mining and RDBMS
• Loosely-coupled data and programs
• Tightly-coupled data and programs
• Architectural approaches
• Methods of writing efficient SQL
  • Candidate generation, pruning, support counting
  • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
• Integrating taxonomies
• Mining sequential patterns
Use of Database in Data Mining
• “Loose coupling” of application and data
• How would you write an Apriori program?
  • Use SQL statements in an application
  • Use a cursor interface to read through records sequentially for each pass (see the sketch below)
• Still two major performance problems:
  • Copying of records from the database to application memory
  • Process context switching for each record retrieved
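As an illustration, a minimal sketch of the loose-coupling read loop, in the same embedded-SQL pseudocode style used later for tight coupling; the table name sales and the host variables :tid and :item are illustrative assumptions, not from the original:

  exec sql declare c cursor for
    select TID, ITEMID from sales order by TID;
  exec sql open c;
  while more rows do {
    exec sql fetch c into :tid, :item;
    /* application code: update in-memory support counts of
       the candidate itemsets affected by this (tid, item) */
  }
  exec sql close c;

Every fetch crosses the DBMS/application boundary, which is exactly where the record-copying and context-switching costs above are incurred.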
Organization of Presentation
• Overview – Data Mining and RDBMS
• Loosely-coupled data and programs
• Tightly-coupled data and programs
• Architectural approaches
• Methods of writing efficient SQL
  • Candidate generation, pruning, support counting
  • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
• Integrating taxonomies
• Mining sequential patterns
Tightly-coupled applications
• Push computations into the database system to avoid performance degradation
• Take advantage of user-defined functions (UDFs)
• Does not require changes to the database software
• Two types of UDFs we will use:
  • Ones that are executed only a few times, regardless of the number of rows
  • Ones that are executed once for each selected row
Tight-coupling using UDFs

  Procedure TightlyCoupledApriori():
  begin
    exec sql connect to database;
    exec sql select allocSpace() into :blob from onerecord;
    exec sql select * from sales where GenL1(:blob, TID, ITEMID) = 1;
    notDone := true;
Tight-coupling using UDFs

    while notDone do {
      exec sql select aprioriGen(:blob) into :blob from onerecord;
      exec sql select * from sales where itemCount(:blob, TID, ITEMID) = 1;
      exec sql select GenLk(:blob) into :notDone from onerecord;
    }
Tight-coupling using UDFs

    exec sql select getResult(:blob) into :resultBlob from onerecord;
    exec sql select deallocSpace(:blob) from onerecord;
    compute Answer using resultBlob;
  end
Organization of Presentation
• Overview – Data Mining and RDBMS
• Loosely-coupled data and programs
• Tightly-coupled data and programs
• Architectural approaches
• Methods of writing efficient SQL
  • Candidate generation, pruning, support counting
  • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
• Integrating taxonomies
• Mining sequential patterns
Methodology
• Comparisons were done for association rules mining on IBM DB2
• Only the generation of frequent itemsets using the Apriori algorithm is considered
• Five alternatives considered:
  • Loose-coupling through a SQL cursor interface – as described earlier
  • UDF tight-coupling – as described earlier
  • Stored procedure to encapsulate the mining algorithm
  • Cache-mine – caching data and mining on the fly
  • SQL implementations to force processing in the database
• Consider two classes of SQL implementations:
  • SQL-92 – four different implementations
  • SQL-OR (with object-relational extensions) – six implementations
Architectural Options
• Stored procedure
  • Apriori algorithm encapsulated as a stored procedure
  • Implication: runs in the same address space as the DBMS
  • Mined results are stored back into the DBMS
• Cache-mine
  • Variation of stored procedure
  • Read the entire data once from the DBMS and temporarily cache it in a lookaside buffer on local disk
  • Cached data is discarded when execution completes
  • Disadvantage – requires additional disk space for caching
  • Uses Intelligent Miner’s “space” option
Organization of Presentation
• Overview – Data Mining and RDBMS
• Loosely-coupled data and programs
• Tightly-coupled data and programs
• Architectural approaches
• Methods of writing efficient SQL
  • Candidate generation, pruning, support counting
  • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
• Integrating taxonomies
• Mining sequential patterns
Terminology
• Use the following terminology (a minimal schema sketch follows below)
• T: table of transaction data
  • {tid, item} pairs
  • Data is normally sorted by transaction id
• Ck: candidate k-itemsets
  • Obtained by joining and pruning the frequent itemsets from the previous iteration
• Fk: frequent itemsets of length k
  • Obtained from Ck and T
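For concreteness, a minimal schema sketch for these tables at k = 3; the column names follow the slides, while the integer types are an assumption:

  create table T  (tid   integer, item  integer);
  create table C3 (item1 integer, item2 integer, item3 integer);
  create table F3 (item1 integer, item2 integer, item3 integer, cnt integer);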
Candidate Generation in SQL – join step
• Generate Ck from Fk-1 by joining Fk-1 with itself (a k = 3 instantiation is sketched below)

  insert into Ck
  select I1.item1, …, I1.itemk-1, I2.itemk-1
  from Fk-1 I1, Fk-1 I2
  where I1.item1 = I2.item1 and … and I1.itemk-2 = I2.itemk-2
    and I1.itemk-1 < I2.itemk-1
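For instance, a sketch of this join step instantiated for k = 3 (pruning is handled separately, as described on the Pruning slide):

  insert into C3
  select I1.item1, I1.item2, I2.item2
  from F2 I1, F2 I2
  where I1.item1 = I2.item1
    and I1.item2 < I2.item2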
Candidate Generation Example
• F3 is {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}
• C4 after the join step is {{1,2,3,4}, {1,3,4,5}}

Table F3 (both I1 and I2):
  item1  item2  item3
  1      2      3
  1      2      4
  1      3      4
  1      3      5
  2      3      4
Pruning
• Modify the candidate generation algorithm to ensure that, for each candidate in Ck, all k of its subsets of length (k-1) are in Fk-1
• Do a k-way join, skipping itemn-2 when joining with the nth table (2 < n ≤ k)
• Create a primary index (item1, …, itemk-1) on Fk-1 to efficiently process the k-way join
• For k = 4, this becomes

  insert into C4
  select I1.item1, I1.item2, I1.item3, I2.item3
  from F3 I1, F3 I2, F3 I3, F3 I4
  where I1.item1 = I2.item1 and I1.item2 = I2.item2 and I1.item3 < I2.item3
    and I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3
    and I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
Pruning Example
• Evaluate the join with I3 using the previous example
• {1,2,3,4} survives because {2,3,4} is in F3; {1,3,4,5} is pruned because {3,4,5} is not in F3
• C4 is {{1,2,3,4}}

Table F3 (aliased I1, I2, and I3): same contents as on the previous slide
Support counting using SQL
• Two different approaches
• Use the SQL-92 standard
  • Use ‘standard’ SQL syntax such as joins and subqueries to find the support of itemsets
• Use object-relational extensions of SQL (SQL-OR)
  • User-defined functions (UDFs) and table functions
  • Binary large objects (BLOBs)
Support Counting using SQL-92
• 4 different methods, two of which are detailed in the papers
  • K-way Join
  • SubQuery
• Other methods not discussed because of unacceptable performance
  • 3-way join
  • 2 Group-Bys
SQL-92: K-way join
• Obtain Fk by joining Ck with k copies of the table T of (tid, item) pairs
• Perform a group by on the itemset

  insert into Fk
  select item1, …, itemk, count(*)
  from Ck, T t1, …, T tk
  where t1.item = Ck.item1 and … and tk.item = Ck.itemk
    and t1.tid = t2.tid and … and tk-1.tid = tk.tid
  group by item1, …, itemk
  having count(*) > :minsup
K-way join example
• C3 = {B,C,E} and the minimum support required is 2
• The instantiated 3-way join (sketched below) inserts {B,C,E,2} into F3
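A sketch of the generic K-way join instantiated for k = 3 with this example; note that the worked example treats the minimum support of 2 inclusively, hence >= rather than the template's strict >:

  insert into F3
  select C3.item1, C3.item2, C3.item3, count(*)
  from C3, T t1, T t2, T t3
  where t1.item = C3.item1 and t2.item = C3.item2 and t3.item = C3.item3
    and t1.tid = t2.tid and t2.tid = t3.tid
  group by C3.item1, C3.item2, C3.item3
  having count(*) >= 2  -- inclusive: {B,C,E} with count 2 qualifies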
K-way join: Pass-2 optimization
• When calculating C2, no pruning is required after we join F1 with itself
• Don’t calculate and materialize C2 – replace C2 in the 2-way join algorithm with a join of F1 with itself

  insert into F2
  select I1.item1, I2.item1, count(*)
  from F1 I1, F1 I2, T t1, T t2
  where I1.item1 < I2.item1
    and t1.item = I1.item1
    and t2.item = I2.item1
    and t1.tid = t2.tid
  group by I1.item1, I2.item1
  having count(*) > :minsup
SQL-92: SubQuery based
• Split support counting into a cascade of k subqueries
• The nth subquery Qn finds all tids that match the distinct itemsets formed by the first n items of Ck

  insert into Fk
  select item1, …, itemk, count(*)
  from (Subquery Qk) t
  group by item1, item2, …, itemk
  having count(*) > :minsup

  Subquery Qn (for any n between 1 and k):
  select item1, …, itemn, tid
  from T tn, (Subquery Qn-1) as rn-1,
       (select distinct item1, …, itemn from Ck) as dn
  where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1
    and rn-1.tid = tn.tid
    and tn.item = dn.itemn
Example of SubQuery based
• Using the previous example from class
• C3 = {B,C,E}, minimum support = 2
• Q0: there is no subquery Q0
• Q1 in this case becomes

  select item1, tid
  from T t1, (select distinct item1 from C3) as d1
  where t1.item = d1.item1
Example of SubQuery based cont’d
• Q2 becomes

  select item1, item2, tid
  from T t2, (Subquery Q1) as r1,
       (select distinct item1, item2 from C3) as d2
  where r1.item1 = d2.item1
    and r1.tid = t2.tid
    and t2.item = d2.item2
Example of SubQuery based cont’d
• Q3 becomes

  select item1, item2, item3, tid
  from T t3, (Subquery Q2) as r2,
       (select distinct item1, item2, item3 from C3) as d3
  where r2.item1 = d3.item1
    and r2.item2 = d3.item2
    and r2.tid = t3.tid
    and t3.item = d3.item3
Example of SubQuery based cont’d
• Output of Q3 is two rows: (B,C,E,tid) for each of the two transactions that contain all of B, C, and E
• Insert statement becomes

  insert into F3
  select item1, item2, item3, count(*)
  from (Subquery Q3) t
  group by item1, item2, item3
  having count(*) > :minsup

• This inserts the row {B,C,E,2}
• For Q2, the pass-2 optimization can be used
Performance Comparisons of SQL-92 approaches
• Used Version 5 of DB2 UDB on an RS/6000 Model 140
  • 200 MHz CPU, 256 MB main memory, 9 GB of disk space, transfer rate of 8 MB/sec
• Used 4 different datasets based on real-world data
• Built the following indexes, which are not included in any cost calculations:
  • Composite index (item1, …, itemk) on Ck
  • k different indices, one on each of the k items in Ck
  • (item, tid) and (tid, item) indexes on the data table T
Performance Comparisons of SQL-92 approaches
• The best performance among the SQL-92 approaches was obtained by SubQuery
• Even SubQuery was only comparable to loose-coupling in some cases, failing to complete in others
  • For Dataset-C with a support of 2%, SubQuery outperforms loose-coupling, but decreasing the support to 1% makes SubQuery take 10 times as long to complete
  • Lower support increases the size of Ck and Fk at each step, causing the joins to process more rows
Support Counting using SQL with object-relational extensions
• 6 different methods, four of which are detailed in the papers
  • GatherJoin
  • GatherCount
  • GatherPrune
  • Vertical
• Other methods not discussed because of unacceptable performance
  • Horizontal
  • SBF
SQL Object-Relational Extension: GatherJoin
• Generates all possible k-item combinations of the items contained in a transaction and joins them with Ck
• An index is created on all items of Ck
• Uses the following table functions:
  • Gather: outputs records {tid, item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid
  • Comb-K: returns all k-item combinations from the transaction
    • Its output has k attributes T_itm1, …, T_itmk
GatherJoin

  insert into Fk
  select item1, …, itemk, count(*)
  from Ck,
       (select t2.T_itm1, …, t2.T_itmk
        from T,
             table(Gather(T.tid, T.item)) as t1,
             table(Comb-K(t1.tid, t1.item-list)) as t2) as t2
  where t2.T_itm1 = Ck.item1 and … and t2.T_itmk = Ck.itemk
  group by Ck.item1, …, Ck.itemk
  having count(*) > :minsup
Example of GatherJoin
• t1 (output from Gather) contains one {tid, item-list} row per transaction (table shown in the original slide)
• t2 (generated by Comb-K from t1) will be joined with C3 to obtain F3
  • 1 row from Tid 10
  • 1 row from Tid 20
  • 4 rows from Tid 30 (a four-item transaction yields C(4,3) = 4 three-item combinations)
• Insert {B,C,E,2}
GatherJoin: Pass 2 optimization
• When calculating C2, no pruning is required after we join F1 with itself
• Don’t calculate and materialize C2 – replace C2 with a join to F1 before the table function
• Gather is then only passed frequent 1-itemset rows

  insert into F2
  select t2.T_itm1, t2.T_itm2, count(*)
  from (select t2.T_itm1, t2.T_itm2
        from T, F1 I1,
             table(Gather(T.tid, T.item)) as t1,
             table(Comb-K(t1.tid, t1.item-list)) as t2
        where T.item = I1.item1) as t2
  group by t2.T_itm1, t2.T_itm2
  having count(*) > :minsup
Variations of GatherJoin – GatherCount
• Perform the GROUP BY inside the table function Comb-K for the pass-2 optimization
• Output of the table function Comb-K is then:
  • Not the candidate frequent itemsets (Ck)
  • But the actual frequent itemsets (Fk) along with their corresponding support
• Uses a 2-dimensional array inside Comb-K to store the counts of possible frequent itemsets
  • May lead to excessive memory use
Variations of GatherJoin – GatherPrune
• Push the join with Ck into the table function Comb-K
• Ck is converted into a BLOB and passed as an argument to the table function
• Disadvantage: the BLOB must be passed on every invocation of Comb-K – once per row of table T
SQL Object-Relational Extension: Vertical
• For each item, create a BLOB containing the tids of the transactions the item belongs to
• Use the table function Gather to generate {item, tid-list} pairs, storing the results in table TidTable (a population sketch follows below)
• Tid-lists are all in the same sorted order
• Use the UDF Intersect to compare two tid-lists and extract the common values
• Pass-2 optimization can also be used for Vertical
  • Similar to the K-way join method
  • The upcoming example does not show this optimization
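As an illustrative sketch only (this statement is an assumption, not shown in the slides), TidTable could be populated by running Gather over T with the argument roles swapped, so it collects tids per item rather than items per tid:

  -- hypothetical population of TidTable; Gather here gathers tids per item
  insert into TidTable
  select t1.item, t1.tid-list
  from T, table(Gather(T.item, T.tid)) as t1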
Vertical

  insert into Fk
  select item1, …, itemk, count(tid-list) as cnt
  from (Subquery Qk) t
  where cnt > :minsup

  Subquery Qn (for any n between 2 and k):
  select item1, …, itemn, Intersect(rn-1.tid-list, tn.tid-list) as tid-list
  from TidTable tn, (Subquery Qn-1) as rn-1,
       (select distinct item1, …, itemn from Ck) as dn
  where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1
    and tn.item = dn.itemn

  Subquery Q1: (select * from TidTable)
Example of Vertical
• Using the previous example from class
• C3 = {B,C,E}, minimum support = 2
• Q1 is simply the TidTable itself (contents shown in the original slide)
Example of Vertical cont’d
• Q2 becomes

  select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list
  from TidTable t2, (Subquery Q1) as r1,
       (select distinct item1, item2 from C3) as d2
  where r1.item1 = d2.item1
    and t2.item = d2.item2
Example of Vertical cont’d
• Q3 becomes (the final counting step is sketched below)

  select item1, item2, item3, Intersect(r2.tid-list, t3.tid-list) as tid-list
  from TidTable t3, (Subquery Q2) as r2,
       (select distinct item1, item2, item3 from C3) as d3
  where r2.item1 = d3.item1
    and r2.item2 = d3.item2
    and t3.item = d3.item3
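To complete the example, a sketch of the final insert following the Fk template from the Vertical slide; per the slides' notation, count(tid-list) denotes the number of tids in the intersected list, and the worked example treats the minimum support of 2 inclusively:

  insert into F3
  select item1, item2, item3, count(tid-list) as cnt
  from (Subquery Q3) t
  where cnt >= 2  -- inclusive: {B,C,E} with two tids yields {B,C,E,2}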
Performance comparison of SQL object-relational approaches
• Vertical has the best overall performance, sometimes an order of magnitude better than the other 3 approaches
  • The majority of its time is spent transforming the data into {item, tid-list} pairs
  • Vertical spends too much time on the second pass
• Pass-2 optimization has a huge impact on the performance of GatherJoin
  • For Dataset-B with a support of 0.1%, the running time for Pass 2 went from 5.2 hours to 10 minutes
  • Comb-K in GatherJoin generates a large number of potential frequent itemsets that must be processed
Hybrid approach
• Previous charts and algorithm analysis show:
  • Vertical spends too much time on pass 2 compared to other algorithms, especially when the support is decreased
  • GatherJoin degrades when the number of frequent items per transaction increases
• To improve performance, use a hybrid algorithm:
  • Use Vertical for most cases
  • When the size of the candidate itemset is too large, GatherJoin is a good option if the number of frequent items per transaction (Nf) is not too large
  • When Nf is large, GatherCount may be the only good option
Architecture Comparisons
• Compare five alternatives
• Loose-coupling and Stored-procedure
  • Basically the same except for the address space the program runs in
  • Because of the limited difference in performance, the following charts focus solely on stored procedure
• Cache-mine
• UDF tight-coupling
• Best SQL approach (Hybrid)