Data Mining Systems and Languages
CS240A Notes
Knowledge Discovery (KDD) Process
• Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline. Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
DM Experience for DBMS: Dreams vs. Reality
• Decision support and business intelligence:
  • OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
  • Relational DBMS extensions for DM queries: a flop
  • OR-DBMSs do not fare much better [Sarawagi ’98]
• A ‘high-road’ approach was suggested by Imielinski & Mannila [CACM ’96], who called for a quantum leap in functionality based on:
  • Simple declarative extensions of SQL for Data Mining (DM)
  • Efficiency through DM query optimization techniques (yet to be invented)
• The research area of Inductive DBMS was thus born, producing
  • Interesting language work: DMQL, Mine Rule, MSQL, …
  • Implementation technology that lacks generality and has performance limitations
  • Real questions about whether optimizers will ever take us there
DBMS Limitations
• DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
• Extending DBMSs for mining has proven much harder
  • Limited expressive power
  • Limited flexibility of the languages
• Apriori in DB2 [Sarawagi ’98]
  • For lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
• Cache mining: move the data from the database to a cache, then use procedural-language (PL) algorithms to mine the cache
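The cache-mining idea above can be illustrated with a minimal sketch. It assumes a hypothetical `sales(tid, item)` table in an in-memory SQLite database and counts frequent item pairs with one Apriori-style pruning step; it is not the DB2 implementation discussed in [Sarawagi ’98], only an illustration of "pull the data out, then mine it in a programming language".

```python
import sqlite3
from collections import Counter
from itertools import combinations

def load_baskets(conn):
    """Move the data out of the DBMS: read every (tid, item) row into memory."""
    baskets = {}
    for tid, item in conn.execute("SELECT tid, item FROM sales"):
        baskets.setdefault(tid, set()).add(item)
    return list(baskets.values())

def frequent_pairs(baskets, min_support):
    """Mine the cached baskets for item pairs that occur at least min_support times."""
    # Apriori pruning: a pair can be frequent only if both of its items are frequent.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, n in item_counts.items() if n >= min_support}
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical toy data; table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "beer"),
    (3, "milk"), (3, "beer"),
    (4, "bread"), (4, "milk"),
])

baskets = load_baskets(conn)
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

The only SQL involved is a plain scan; all the mining logic lives outside the DBMS, which is why data transfer cost becomes the price of this approach.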
Mining Systems Desiderata
• Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
  • Generality over a wide spectrum of mining tasks
  • Ease of use for naïve users, plus flexibility and customizability for experts
  • Efficiency and scalability
• Databases: where the data is. But DBMSs do not support KDD tasks well.
• Three approaches
  • Inductive DBMS
  • Commercial DBMS extensions
  • Dedicated KDD systems with DBMS connections
Inductive DBMSs vs. Vendor Extensions
• Imielinski & Mannila introduced the notion of
  • A high-level Data Mining Query Language for DBMSs
  • Optimization techniques for DM queries
• Inductive DBMS: a new research field
  • MSQL, DMQL, Mine Rule: DM query languages
  • Performance and generality remain an open problem
• DBMS vendors
  • Ad hoc approaches based on mining libraries
DBMS extensions: DB2 Intelligent Miner
• Model creation
• Training
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask',
                                    'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel');
• Prediction
• Stored procedures and virtual mining views
• Most of the implementation is outside the DBMS (cache mining)
  • Data transfer delays
• http://www-306.ibm.com/software/data/iminer/
Oracle Data Miner
• Algorithms
  • Adaptive Naïve Bayes
  • SVM regression
  • K-means clustering
  • Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first-class objects
  • Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
• http://www.oracle.com/technology/products/bi/odm/index.html
OLE DB for DM by Microsoft
• Model creation: descriptive phase
• Prediction joins
• Other features
  • Nested cases
• http://research.microsoft.com/dmx/DataMining/
• PMML: a descriptive XML language for exchanging information between systems
OLE DB for DM (DMX) (cont.)
• Mining objects as first-class objects
• Schema rowsets
  • Mining_Models
  • Mining_Model_Content
  • Mining_Functions
• Other features
  • Column value distribution
  • Nested cases
• http://research.microsoft.com/dmx/DataMining/
OLE DB for DM (DMX): 3 steps
• Model creation
    CREATE MINING MODEL MemCard_Pred (
        CustomerId LONG KEY,
        Age LONG CONTINUOUS,
        Profession TEXT DISCRETE,
        Income LONG CONTINUOUS,
        Risk TEXT DISCRETE PREDICT)
    USING Microsoft_Decision_Trees;
• Training
    INSERT INTO MemCard_Pred
    OPENROWSET("'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
• Prediction join
    SELECT C.CustomerId, C.Risk, PredictProbability(MP.Risk)
    FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
    ON MP.Profession = C.Profession AND MP.Income = C.Income AND MP.Age = C.Age;
Defining a Mining Model
• E.g., a model to predict students' plans to attend college
• A definition specifies:
  • The format of the "training cases" (top-level entity)
  • Attributes, input/output type, distribution
  • Algorithms and parameters
• Example
    CREATE MINING MODEL CollegePlanModel (
        StudentID LONG KEY,
        Gender TEXT DISCRETE,
        ParentIncome LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans TEXT DISCRETE PREDICT)
    USING Microsoft_Decision_Trees
Training
    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')
Prediction Join
    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
[Figure: the CPModel mining model joined with the NewStudents table]
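To make the semantics of a prediction join concrete, here is a toy emulation. It is only a sketch: the "model" is a lookup table holding the most common outcome per combination of input values, not a decision tree, and the column names and data (`Gender`, `Encouragement`, `Plan`) are hypothetical. The point it illustrates is that the ON clause matches new cases to the model by column values, the way a relational join matches rows.

```python
from collections import Counter, defaultdict

def train(cases, inputs, target):
    """Toy 'model': the most common target value for each combination of inputs."""
    votes = defaultdict(Counter)
    for case in cases:
        key = tuple(case[col] for col in inputs)
        votes[key][case[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def prediction_join(model, inputs, new_cases):
    """Pair each new case with the model's prediction for its input values."""
    for case in new_cases:
        key = tuple(case[col] for col in inputs)
        # None means the toy model never saw this combination of values.
        yield case["ID"], model.get(key)

# Hypothetical training cases and new cases, invented for this sketch.
inputs = ["Gender", "Encouragement"]
training_cases = [
    {"Gender": "F", "Encouragement": "High", "Plan": "Yes"},
    {"Gender": "F", "Encouragement": "High", "Plan": "Yes"},
    {"Gender": "F", "Encouragement": "High", "Plan": "No"},
    {"Gender": "M", "Encouragement": "Low", "Plan": "No"},
]
new_students = [
    {"ID": 1, "Gender": "F", "Encouragement": "High"},
    {"ID": 2, "Gender": "M", "Encouragement": "Low"},
    {"ID": 3, "Gender": "M", "Encouragement": "High"},
]

model = train(training_cases, inputs, "Plan")
predictions = list(prediction_join(model, inputs, new_students))
print(predictions)
```

A real mining model generalizes to unseen combinations instead of returning nothing for them, which is exactly what separates a prediction join from an ordinary equi-join.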
Summary of Vendors’ Approaches
• Built-in library of mining methods
• Script language or GUI tools
• Limitations
  • Closed systems (internals hidden from users)
  • Adding new algorithms or customizing old ones is difficult
  • Poor integration with SQL
  • Limited interoperability across DBMSs
    • Predictive Model Markup Language (PMML) as a palliative
PMML
• Predictive Model Markup Language
• XML-based language for vendor-independent definition of statistical and data mining models
• Models can be shared among PMML-compliant products
• A descriptive language
• Supported by all major vendors
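As a rough illustration of what "descriptive" and "vendor independent" mean in practice, the sketch below writes and then reads back a PMML-style fragment using Python's standard XML library. It covers only a `Header` and a `DataDictionary` with `DataField` entries; a real PMML document also carries a model element and should be validated against the official schema. The field names, the description text, and the version string are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

def build_pmml(fields, version="4.4"):
    """Build a PMML-style document holding only a header and a data dictionary."""
    root = ET.Element("PMML", version=version)  # version string is a placeholder
    ET.SubElement(root, "Header", description="hypothetical college-plans model")
    dictionary = ET.SubElement(root, "DataDictionary", numberOfFields=str(len(fields)))
    for name, optype, data_type in fields:
        ET.SubElement(dictionary, "DataField", name=name, optype=optype, dataType=data_type)
    return root

def read_fields(xml_text):
    """What a consuming tool would do: recover the field definitions from the XML."""
    root = ET.fromstring(xml_text)
    return [(f.get("name"), f.get("optype"), f.get("dataType"))
            for f in root.findall("DataDictionary/DataField")]

# Hypothetical fields, loosely following the college-plans example.
fields = [
    ("Gender", "categorical", "string"),
    ("ParentIncome", "continuous", "double"),
    ("CollegePlans", "categorical", "string"),
]

xml_text = ET.tostring(build_pmml(fields), encoding="unicode")
recovered = read_fields(xml_text)
print(xml_text)
```

The producer and the consumer share nothing but the XML text, which is the sense in which PMML lets one product export a model description and another import it.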
Much Competition
Vendors
• SAS Institute (Enterprise Miner)
• IBM (DB2 Intelligent Miner for Data)
• Oracle (ODM option to Oracle 10g)
• SPSS (Clementine)
• Unica Technologies, Inc. (Pattern Recognition Workbench)
• Insightful (Insightful Miner)
• KXEN (Analytic Framework)
• Prudsys (Discoverer and its family)
• Microsoft (SQL Server 2005)
• Angoss (KnowledgeServer and its family)
• DBMiner (DB2)
Platforms
• IBM
• Oracle
• SAS
Tools
• SPSS
• Angoss
• KXEN
• Megaputer
• Fair Isaac
• Insightful
Stand-Alone Systems
• WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand
• It provides many different machine learning algorithms
• It is applicable to generic data described in the Attribute-Relation File Format (ARFF)
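ARFF is what lets the same algorithm run over arbitrary data: the file declares its own relation name and attributes before the data rows. The sketch below is a minimal reader for the dense form of the format, written for illustration only. It ignores quoted values, sparse rows, and missing-value markers, and it is not Weka's own loader; the sample data set is hypothetical.

```python
def parse_arff(text):
    """Parse a simple dense ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.lower()
        if keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Hypothetical toy data set, invented for this sketch.
sample = """% toy weather data
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute play {yes,no}
@data
sunny,85,no
overcast,83,yes
"""

relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```

Because the attribute list comes from the file itself, nothing in the reader depends on how many columns the data has.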
Weka
• A comprehensive set of DM algorithms and tools
• Generic algorithms over arbitrary data sets
  • Independent of the number of columns in tables
• Open and extensible system based on Java
• Also free …