Data Mining Systems and Languages

Data Mining Systems and Languages: CS240A Notes

Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process
• Stages shown in the figure, from source to result: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge
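The stages of the KDD process can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the function names, the two toy "stores", and the counting-based mining step are all hypothetical and stand in for whatever cleaning, integration, and mining methods a real system would use.

```python
# Minimal sketch of the KDD stages named on the slide.
# All data and helper names here are hypothetical.
from collections import Counter

def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def select(rows, columns):
    """Selection: keep only the task-relevant attributes."""
    return [{c: r[c] for c in columns} for r in rows]

def mine(rows, attr, target):
    """Data mining: count how often each (attr, target) pair occurs."""
    return Counter((r[attr], r[target]) for r in rows)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

store_a = [{"age": "young", "income": "low", "risk": "high"},
           {"age": "young", "income": None, "risk": "high"}]
store_b = [{"age": "young", "income": "high", "risk": "high"},
           {"age": "old", "income": "high", "risk": "low"}]

warehouse = clean(integrate(store_a, store_b))
relevant = select(warehouse, ["age", "risk"])
knowledge = evaluate(mine(relevant, "age", "risk"), min_count=2)
print(knowledge)
```

Only the mining step changes from one task to another; the surrounding stages stay the same, which is why data mining is described as the core of the process rather than the whole of it.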

DM Experience for DBMS: Dreams vs. Reality
• Decision support and business intelligence:
  • OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
  • Relational DBMS extensions for DM queries: a flop
  • OR-DBMSs do not fare much better [Sarawagi '98]
• A 'high-road' approach was proposed by Imielinski & Mannila [CACM '96], who called for a quantum leap in functionality based on:
  • Simple declarative extensions of SQL for Data Mining (DM)
  • Efficiency through DM query optimization techniques (yet to be invented)
• The research area of Inductive DBMS was thus born, producing:
  • Interesting language work: DMQL, Mine Rule, MSQL, …
  • Implementation technology that lacks generality and has performance limitations
  • Real questions as to whether optimizers will ever take us there

DBMS Limitations
• DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
• Extending DBMSs for mining has proven much harder:
  • Limited expressive power and flexibility of the languages
• Apriori in DB2 [Sarawagi '98]:
  • Because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
• Cache mining: move the data from the database to a cache, then use algorithms written in a programming language (PL) to mine the cache
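A minimal sketch of cache mining, assuming SQLite as a stand-in for the DBMS and a hypothetical `sales(tid, item)` table. The data is pulled out of the database into memory once, and a plain Python Apriori then mines the cached baskets. This is not the DB2 implementation studied in [Sarawagi '98], only an illustration of the pattern.

```python
import sqlite3
from itertools import combinations

def load_baskets(conn):
    """Cache step: pull (tid, item) rows out of the DBMS into in-memory sets."""
    baskets = {}
    for tid, item in conn.execute("SELECT tid, item FROM sales"):
        baskets.setdefault(tid, set()).add(item)
    return list(baskets.values())

def apriori(baskets, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets and keep the
        # size-k unions whose (k-1)-subsets are all frequent (Apriori pruning).
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) != k:
                    continue
                subsets = combinations(union, k - 1)
                if all(frozenset(sub) in prev for sub in subsets):
                    candidates.add(union)
        # Support counting: one pass over the cached baskets per level.
        counts = {c: sum(1 for basket in baskets if c <= basket)
                  for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "beer"),
    (3, "bread"), (3, "beer"),
    (4, "milk"),
])
frequent_itemsets = apriori(load_baskets(conn), min_support=2)
```

The only SQL involved is a single scan of the table. Candidate generation and support counting happen outside the DBMS, which is what makes the approach simple to program but dependent on moving the data first.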

Mining Systems Desiderata
• Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
  • Generality over a wide spectrum of mining tasks
  • Ease of use for naïve users, with flexibility and customizability for experts
  • Efficiency and scalability
• Databases are where the data is, but DBMSs do not support KDD tasks well. Three approaches:
  • Inductive DBMS
  • Commercial DBMS extensions
  • Dedicated KDD systems with DBMS connections

Inductive DBMSs vs. Vendor Extensions
• Imielinski & Mannila introduced the notion of:
  • A high-level Data Mining Query Language for DBMS
  • Optimization techniques for DM queries
• Inductive DBMS: a new research field
  • MSQL, DMQL, Mine Rule: DM query languages
  • Performance and generality remain an open problem
• DBMS vendors
  • Ad hoc approaches based on mining libraries

DBMS extensions: DB2 Intelligent Miner
• Model creation
• Training:
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask',
                                    'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel');
• Prediction
• Stored procedures and virtual mining views
• Most of the implementation is outside the DBMS (cache mining)
  • Data transfer delays
• http://www-306.ibm.com/software/data/iminer/

Oracle Data Miner
• Algorithms
  • Adaptive Naïve Bayes
  • SVM regression
  • K-means clustering
  • Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first-class objects
  • Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
• http://www.oracle.com/technology/products/bi/odm/index.html

OLE DB for DM by Microsoft
• Model creation: the descriptive phase
• Prediction joins
• Other features
  • Nested cases
• http://research.microsoft.com/dmx/DataMining/
• PMML: a descriptive XML language for exchanging information between systems

OLE DB for DM (DMX) (cont.)
• Mining objects as first-class objects
• Schema rowsets
  • Mining_Models
  • Mining_Model_Content
  • Mining_Functions
• Other features
  • Column value distribution
  • Nested cases
• http://research.microsoft.com/dmx/DataMining/

OLE DB for DM (DMX): 3 steps
• Model creation:
    CREATE MINING MODEL MemCard_Pred (
      CustomerId LONG KEY,
      Age LONG CONTINUOUS,
      Profession TEXT DISCRETE,
      Income LONG CONTINUOUS,
      Risk TEXT DISCRETE PREDICT)
    USING Microsoft_Decision_Trees;
• Training:
    INSERT INTO MemCard_Pred
    OPENROWSET("'sqloledb', 'sa', 'mypass'",
      'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
• Prediction join:
    SELECT C.CustomerId, C.Risk, PredictProbability(MP.Risk)
    FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
    ON MP.Profession = C.Profession AND MP.Income = C.Income AND MP.Age = C.Age;

Defining a Mining Model
• E.g., a model to predict students' plans to attend college
• The definition specifies:
  • The format of the "training cases" (top-level entity)
  • Attributes, input/output type, distribution
  • Algorithms and parameters
• Example:
    CREATE MINING MODEL CollegePlanModel (
      StudentID LONG KEY,
      Gender TEXT DISCRETE,
      ParentIncome LONG NORMAL CONTINUOUS,
      Encouragement TEXT DISCRETE,
      CollegePlans TEXT DISCRETE PREDICT)
    USING Microsoft_Decision_Trees

Training
    INSERT INTO CollegePlanModel
      (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
      'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
       FROM CollegePlansTrainData')

Prediction Join
    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
      OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
• The figure shows the model CPModel joined with the table NewStudents
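The three DMX steps (create, train, prediction join) can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not the DMX API: `MiningModel` is a hypothetical class, the learner is a simple frequency table rather than Microsoft_Decision_Trees, and only discrete input columns are handled (ParentIncome is left out).

```python
from collections import Counter, defaultdict

class MiningModel:
    """Hypothetical in-memory stand-in for a mining model (not the DMX API)."""

    def __init__(self, key, inputs, predict):
        # Step 1, model creation: only the column roles are declared.
        self.key, self.inputs, self.predict = key, inputs, predict
        self.counts = defaultdict(Counter)  # input values -> label counts
        self.overall = Counter()            # label counts over all cases

    def train(self, cases):
        # Step 2, training: consume the training cases row by row.
        for case in cases:
            signature = tuple(case[c] for c in self.inputs)
            self.counts[signature][case[self.predict]] += 1
            self.overall[case[self.predict]] += 1

    def prediction_join(self, new_cases):
        # Step 3, prediction: match each new case to the model on the
        # input columns; unseen combinations fall back to the overall counts.
        for case in new_cases:
            signature = tuple(case[c] for c in self.inputs)
            labels = self.counts.get(signature) or self.overall
            label, hits = labels.most_common(1)[0]
            yield case[self.key], label, hits / sum(labels.values())

training = [
    {"StudentID": 1, "Gender": "F", "Encouragement": "yes", "CollegePlans": "yes"},
    {"StudentID": 2, "Gender": "F", "Encouragement": "yes", "CollegePlans": "yes"},
    {"StudentID": 3, "Gender": "F", "Encouragement": "yes", "CollegePlans": "no"},
    {"StudentID": 4, "Gender": "M", "Encouragement": "no", "CollegePlans": "no"},
    {"StudentID": 5, "Gender": "M", "Encouragement": "no", "CollegePlans": "no"},
]
model = MiningModel(key="StudentID",
                    inputs=["Gender", "Encouragement"],
                    predict="CollegePlans")
model.train(training)

new_students = [
    {"StudentID": 10, "Gender": "F", "Encouragement": "yes"},
    {"StudentID": 11, "Gender": "M", "Encouragement": "yes"},
]
predictions = list(model.prediction_join(new_students))
```

Each result tuple holds the case key, the predicted label, and its estimated probability, roughly the role that the selected key column, the predicted column, and PredictProbability play in the DMX query.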

Summary of Vendors' Approaches
• Built-in library of mining methods
• Script language or GUI tools
• Limitations
  • Closed systems (internals hidden from users)
  • Adding new algorithms or customizing old ones is difficult
  • Poor integration with SQL
  • Limited interoperability across DBMSs
• Predictive Model Markup Language (PMML) as a palliative

PMML
• Predictive Model Markup Language
• XML-based language for vendor-independent definition of statistical and data mining models
• Share models among PMML-compliant products
• A descriptive language
• Supported by all major vendors

PMML Example
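The original example from this slide is not available in these notes. As a stand-in, here is a small hypothetical PMML document (a two-leaf classification tree for the college-plans scenario) together with a Python sketch that reads it using the standard library. The model content is invented for illustration, and the evaluator handles only `<True/>` and equality predicates, so it is far from a conformant PMML consumer.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document; element names follow the PMML 4.4 schema.
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Hypothetical college-plans tree"/>
  <DataDictionary numberOfFields="2">
    <DataField name="Encouragement" optype="categorical" dataType="string"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanTree" functionName="classification">
    <MiningSchema>
      <MiningField name="Encouragement"/>
      <MiningField name="CollegePlans" usageType="target"/>
    </MiningSchema>
    <Node>
      <True/>
      <Node score="yes">
        <SimplePredicate field="Encouragement" operator="equal" value="yes"/>
      </Node>
      <Node score="no">
        <True/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_4"}

def matches(node, record):
    """Evaluate a node's predicate; only <True/> and equality are handled."""
    if node.find("p:True", NS) is not None:
        return True
    pred = node.find("p:SimplePredicate", NS)
    return (pred is not None
            and pred.get("operator") == "equal"
            and record.get(pred.get("field")) == pred.get("value"))

def score(node, record):
    """Descend into the first matching child; return the last node's score."""
    for child in node.findall("p:Node", NS):
        if matches(child, record):
            return score(child, record)
    return node.get("score")

root = ET.fromstring(PMML_DOC)
fields = [f.get("name")
          for f in root.findall("p:DataDictionary/p:DataField", NS)]
top = root.find("p:TreeModel", NS).find("p:Node", NS)

plan_if_encouraged = score(top, {"Encouragement": "yes"})
plan_otherwise = score(top, {"Encouragement": "no"})
```

The document only describes the model (fields, schema, tree). Any PMML-compliant product is free to score it with its own engine, which is the sense in which PMML is a descriptive language.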

Much Competition
• Vendors
  • SAS Institute (Enterprise Miner)
  • IBM (DB2 Intelligent Miner for Data)
  • Oracle (ODM option to Oracle 10g)
  • SPSS (Clementine)
  • Unica Technologies, Inc. (Pattern Recognition Workbench)
  • Insightful (Insightful Miner)
  • KXEN (Analytic Framework)
  • Prudsys (Discoverer and its family)
  • Microsoft (SQL Server 2005)
  • Angoss (KnowledgeServer and its family)
  • DBMiner (DB2)
• Platforms: IBM, Oracle, SAS
• Tools: SPSS, Angoss, KXEN, Megaputer, Fair Isaac, Insightful

Stand-Alone Systems
• WEKA is open-source Java software created by researchers at the University of Waikato in New Zealand
• It provides many different machine learning algorithms
• Applicable to generic data described in the Attribute-Relation File Format (ARFF)

Weka
• A comprehensive set of DM algorithms and tools
• Generic algorithms over arbitrary data sets
• Independent of the number of columns in tables
• Open and extensible system based on Java
• Also free …
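To show why ARFF lets algorithms stay independent of the number of columns, here is a minimal Python sketch that reads a small subset of the format: `%` comments, `@relation`, nominal and numeric `@attribute` declarations, dense `@data` rows, and `?` for missing values. The data set is hypothetical, quoting and sparse rows are not handled, and this is not Weka's own loader.

```python
# Hypothetical toy data set in ARFF.
ARFF_TEXT = """% Hypothetical toy data set
@relation college_plans
@attribute gender {F,M}
@attribute parent_income numeric
@attribute plans {yes,no}
@data
F,52000,yes
M,31000,no
F,?,no
"""

NUMERIC_TYPES = ("numeric", "real", "integer")

def parse_arff(text):
    """Parse a small ARFF subset into (relation, attributes, rows)."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None          # "?" marks a missing value
                elif kind in NUMERIC_TYPES:
                    row[name] = float(value)
                else:
                    row[name] = value         # nominal value kept as text
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(ARFF_TEXT)
```

Because the header carries the schema, the same code works for a file with three attributes or three hundred. A generic algorithm only needs the `attributes` list to know how to treat each column.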
