
Data Mining Systems and Languages






Presentation Transcript


  1. Data Mining Systems and Languages: CS240A Notes

  2. Knowledge Discovery (KDD) Process
     • Data mining: the core of the knowledge discovery process
     • [Figure: the KDD pipeline] Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
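The stages of the KDD process can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the record layout, the function names (`clean`, `integrate`, `select`, `mine`, `evaluate`) and the "mining" step itself (a plain frequency count) are hypothetical and do not come from the notes.

```python
from collections import Counter

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several cleaned sources into one 'warehouse' list."""
    return [r for src in sources for r in src]

def select(warehouse, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in warehouse]

def mine(data, field):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(r[field] for r in data)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only the patterns that occur often enough."""
    return {p: c for p, c in patterns.items() if c >= min_count}

# Two hypothetical source databases; one record has a missing value.
db_a = [{"id": 1, "item": "milk"}, {"id": 2, "item": None}]
db_b = [{"id": 3, "item": "milk"}, {"id": 4, "item": "tea"}]

warehouse = integrate(clean(db_a), clean(db_b))
task_data = select(warehouse, ["item"])
knowledge = evaluate(mine(task_data, "item"), min_count=2)
print(knowledge)
```

Each function corresponds to one box of the pipeline, so a real system would swap the toy `mine` step for an actual mining algorithm without touching the other stages.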

  3. DM Experience for DBMS: Dreams vs. Reality
     Decision support and business intelligence:
     • OLAP & data warehouses: a resounding success for DBMS vendors, via
        • Simple extensions of SQL (aggregates & analytics)
     • Relational DBMS extensions for DM queries: a flop
        • OR-DBMSs do not fare much better [Sarawagi ’98].
     • Imielinski & Mannila [CACM ’96] proposed a ‘high-road’ approach, calling for a quantum leap in functionality based on:
        • Simple declarative extensions of SQL for Data Mining (DM)
        • Efficiency through DM query optimization techniques (yet to be invented)
     • The research area of Inductive DBMSs was thus born, producing
        • Interesting language work: DMQL, Mine Rule, MSQL, …
        • Implementation technology that lacks generality and has performance limitations
        • Real questions about whether optimizers will ever take us there.

  4. DBMS Limitations
     • DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
     • Extending DBMSs for mining has proven much harder
        • Limited expressive power
        • Limited flexibility of the languages
     • Apriori in DB2 [Sarawagi ’98]
        • Because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
     • Cache mining: move the data from the database to a cache, and then use PL algorithms to mine the cache.
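A minimal sketch of the cache-mining idea follows: the transactions are first pulled out of the database into memory, and a level-wise Apriori written in a general-purpose language then mines the in-memory copy. The assumptions are loud ones: `sqlite3` merely stands in for the DBMS, and the `baskets` table, its columns, and the sample data are hypothetical. This is not the DB2 implementation discussed in [Sarawagi ’98].

```python
import sqlite3
from itertools import combinations

def load_transactions(conn):
    """Cache step: pull all (tid, item) rows out of the DBMS into memory."""
    cache = {}
    for tid, item in conn.execute("SELECT tid, item FROM baskets"):
        cache.setdefault(tid, set()).add(item)
    return list(cache.values())

def apriori(transactions, min_support):
    """Level-wise Apriori over the cache; returns {itemset: support count}."""
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while candidates:
        # Count the support of every candidate k-itemset in one pass over the cache.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical basket data stored in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO baskets VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
])

frequent = apriori(load_transactions(conn), min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The database is touched only once, in `load_transactions`; all candidate generation and counting happens outside the DBMS, which is exactly why data transfer cost becomes the concern with this approach.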

  5. Mining Systems Desiderata
     • Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
        • Generality over a wide spectrum of mining tasks
        • Ease of use for naïve users, and flexibility and customizability for experts
        • Efficiency, scalability
     • Databases: where the data is. But DBMSs do not support KDD tasks well. Three approaches:
        • Inductive DBMS
        • Commercial DBMS extensions
        • Dedicated KDD systems with DBMS connections.

  6. Inductive DBMSs vs. Vendor Extensions
     • Imielinski & Mannila introduced the notion of
        • A high-level Data Mining Query Language for DBMS
        • Optimization techniques for DM queries
     • Inductive DBMS: a new research field
        • MSQL, DMQL, Mine Rule: DM query languages
        • Performance and generality: an open problem.
     • DBMS vendors
        • Ad hoc approaches based on mining libraries

  7. DBMS Extensions: DB2 Intelligent Miner
     • Model creation
     • Training
         CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask',
                                         'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel');
     • Prediction
     • Stored procedures and virtual mining views
     • Most of the implementation is outside the DBMS (cache mining)
        • Data transfer delays
     • http://www-306.ibm.com/software/data/iminer/

  8. Oracle Data Miner
     • Algorithms
        • Adaptive Naïve Bayes
        • SVM regression
        • K-means clustering
        • Association rules, text mining, etc.
     • PL/SQL with extensions for mining
     • Models as first-class objects
        • Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
     • http://www.oracle.com/technology/products/bi/odm/index.html

  9. OLE DB for DM by Microsoft
     • Model creation. Descriptive phase
     • Prediction joins
     • Other features
        • Nested cases
     • http://research.microsoft.com/dmx/DataMining/
     • PMML: a descriptive XML language for exchanging information between systems

  10. OLE DB for DM (DMX) (cont.)
     • Mining objects as first-class objects
     • Schema rowsets
        • Mining_Models
        • Mining_Model_Content
        • Mining_Functions
     • Other features
        • Column value distribution
        • Nested cases
     • http://research.microsoft.com/dmx/DataMining/

  11. OLE DB for DM (DMX): 3 steps
     • Model creation
         CREATE MINING MODEL MemCard_Pred (
             CustomerId LONG KEY,
             Age LONG CONTINUOUS,
             Profession TEXT DISCRETE,
             Income LONG CONTINUOUS,
             Risk TEXT DISCRETE PREDICT
         ) USING Microsoft_Decision_Trees;
     • Training
         INSERT INTO MemCard_Pred
         OPENROWSET("'sqloledb', 'sa', 'mypass'",
             'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
     • Prediction join
         SELECT C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
         FROM MemCard_Pred AS MP
         PREDICTION JOIN Customers AS C
         ON MP.Profession = C.Profession
            AND MP.Income = C.Income
            AND MP.Age = C.Age;

  12. Defining a Mining Model
     • E.g., a model to predict students’ plans to attend college
     • The format of “training cases” (top-level entity)
     • Attributes, input/output type, distribution
     • Algorithms and parameters
     • Example
         CREATE MINING MODEL CollegePlanModel (
             StudentID LONG KEY,
             Gender TEXT DISCRETE,
             ParentIncome LONG NORMAL CONTINUOUS,
             Encouragement TEXT DISCRETE,
             CollegePlans TEXT DISCRETE PREDICT
         ) USING Microsoft_Decision_Trees

  13. Training
         INSERT INTO CollegePlanModel
             (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
         OPENROWSET('<provider>', '<connection>',
             'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
              FROM CollegePlansTrainData')

  14. Prediction Join
         SELECT t.ID, CPModel.Plan
         FROM CPModel
         PREDICTION JOIN OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
         ON CPModel.Gender = t.Gender
            AND CPModel.IQ = t.IQ
     [Figure: CPModel joined with NewStudents]
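The three DMX steps (create the model, train it by inserting cases, predict by joining new cases against it) can be mimicked in a few lines of ordinary code. The sketch below is only an analogy under stated assumptions: the `MiningModel` class and its method names are hypothetical, the sample data is made up, and the "algorithm" is a simple majority vote per combination of input values rather than Microsoft's decision trees.

```python
from collections import Counter, defaultdict

class MiningModel:
    """Toy analog of a mining model: named input columns plus one predicted column."""

    def __init__(self, inputs, predict):
        # Step 1 (model creation): only the column layout is declared here.
        self.inputs = inputs
        self.predict = predict
        self.votes = defaultdict(Counter)

    def insert_into(self, cases):
        # Step 2 (training): consume training cases, counting the observed
        # outcomes for every combination of input values.
        for case in cases:
            key = tuple(case[c] for c in self.inputs)
            self.votes[key][case[self.predict]] += 1

    def prediction_join(self, new_cases):
        # Step 3 (prediction join): match each new case to the model on the
        # input columns and attach the most frequent outcome, or None when
        # that combination of input values was never seen in training.
        for case in new_cases:
            key = tuple(case[c] for c in self.inputs)
            outcomes = self.votes.get(key)
            best = outcomes.most_common(1)[0][0] if outcomes else None
            yield {**case, self.predict: best}

# Hypothetical training cases, using column names from the CollegePlanModel slide.
training = [
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "No"},
    {"Gender": "M", "Encouragement": "Low", "CollegePlans": "No"},
]
new_students = [
    {"ID": 1, "Gender": "F", "Encouragement": "High"},
    {"ID": 2, "Gender": "M", "Encouragement": "Low"},
    {"ID": 3, "Gender": "M", "Encouragement": "High"},
]

model = MiningModel(inputs=["Gender", "Encouragement"], predict="CollegePlans")
model.insert_into(training)
predictions = list(model.prediction_join(new_students))
for row in predictions:
    print(row["ID"], row["CollegePlans"])
```

As in DMX, the model behaves like a table that is populated by an insert and then queried by a join whose condition names the input columns; the predicted column is filled in by the model rather than read from the new data.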

  15. Summary of Vendors’ Approaches
     • Built-in library of mining methods
     • Script language or GUI tools
     • Limitations
        • Closed systems (internals hidden from users)
        • Adding new algorithms or customizing old ones: difficult
        • Poor integration with SQL
        • Limited interoperability across DBMSs
     • Predictive Model Markup Language (PMML) as a palliative

  16. PMML
     • Predictive Model Markup Language
     • XML-based language for vendor-independent definition of statistical and data mining models
     • Share models among PMML-compliant products
     • A descriptive language
     • Supported by all major vendors

  17. PMML Example
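To give a feel for what a PMML document contains, the sketch below builds a very small tree model as XML and then reads it back to score a case. It is not the example from the original slide. The element and attribute names (`DataDictionary`, `DataField`, `TreeModel`, `MiningSchema`, `MiningField`, `Node`, `SimplePredicate`, `True`) follow the PMML 4.x vocabulary, but the fragment is deliberately minimal, has not been validated against the PMML schema, and the `score` function is a hypothetical reader that only understands this one-level tree.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"  # PMML 4.4 namespace

def build_pmml():
    """Producer side: describe a one-split classification tree as PMML-style XML."""
    pmml = ET.Element("PMML", {"version": "4.4", "xmlns": NS})
    ET.SubElement(pmml, "Header", {"description": "Toy college-plans tree"})

    fields = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": "2"})
    for name in ("Encouragement", "CollegePlans"):
        ET.SubElement(fields, "DataField",
                      {"name": name, "optype": "categorical", "dataType": "string"})

    tree = ET.SubElement(pmml, "TreeModel", {"functionName": "classification"})
    schema = ET.SubElement(tree, "MiningSchema")
    ET.SubElement(schema, "MiningField", {"name": "Encouragement"})
    ET.SubElement(schema, "MiningField", {"name": "CollegePlans", "usageType": "target"})

    # Root node: always true, default score "No".
    root = ET.SubElement(tree, "Node", {"score": "No"})
    ET.SubElement(root, "True")
    # Child node: score "Yes" when Encouragement equals "High".
    child = ET.SubElement(root, "Node", {"score": "Yes"})
    ET.SubElement(child, "SimplePredicate",
                  {"field": "Encouragement", "operator": "equal", "value": "High"})
    return ET.tostring(pmml, encoding="unicode")

def score(pmml_text, case):
    """Consumer side (hypothetical): apply the one-level tree to a single case."""
    ns = {"p": NS}
    top = ET.fromstring(pmml_text).find("p:TreeModel/p:Node", ns)
    result = top.get("score")
    for child in top.findall("p:Node", ns):
        pred = child.find("p:SimplePredicate", ns)
        if (pred is not None and pred.get("operator") == "equal"
                and case.get(pred.get("field")) == pred.get("value")):
            result = child.get("score")
    return result

document = build_pmml()
print(document)
print(score(document, {"Encouragement": "High"}))
print(score(document, {"Encouragement": "Low"}))
```

The point of the split between `build_pmml` and `score` is that the producer and the consumer share nothing but the XML text, which is how PMML lets one product train a model and another apply it.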

  18. Much Competition
     Vendors
     • SAS Institute (Enterprise Miner)
     • IBM (DB2 Intelligent Miner for Data)
     • Oracle (ODM option to Oracle 10g)
     • SPSS (Clementine)
     • Unica Technologies, Inc. (Pattern Recognition Workbench)
     • Insightful (Insightful Miner)
     • KXEN (Analytic Framework)
     • Prudsys (Discoverer and its family)
     • Microsoft (SQL Server 2005)
     • Angoss (KnowledgeServer and its family)
     • DBMiner (DB2)
     Platforms
     • IBM
     • Oracle
     • SAS
     Tools
     • SPSS
     • Angoss
     • KXEN
     • Megaputer
     • Fair Isaac
     • Insightful

  19. Stand-Alone Systems
     • WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
     • It provides many different machine learning algorithms
     • Applicable to generic data described in the Attribute-Relation File Format (ARFF)
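An ARFF file is plain text: a `@relation` line, one `@attribute` line per column (either `numeric` or a brace-enclosed list of nominal values), and a `@data` section of comma-separated rows. The sketch below writes and reads a tiny file of that shape. It is a simplification under stated assumptions: the helper names `to_arff` and `parse_arff` and the sample relation are hypothetical, and quoting, missing values, sparse rows, and the other attribute types are not handled.

```python
def to_arff(relation, attributes, rows):
    """Render a data set as ARFF text.

    attributes: list of (name, kind), where kind is the string 'numeric'
    or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

def parse_arff(text):
    """Read back attribute names and raw string rows from simple ARFF text."""
    names, data, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            data.append(line.split(","))
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, data

# Hypothetical data set with one numeric and one nominal attribute.
arff_text = to_arff(
    "students",
    [("ParentIncome", "numeric"), ("CollegePlans", ["Yes", "No"])],
    [(42000, "Yes"), (18000, "No")],
)
names, data = parse_arff(arff_text)
print(arff_text)
print(names, data)
```

Because the header carries the column names and types, an algorithm reading the file needs no prior knowledge of how many columns the data set has.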

  20. Weka
     • A comprehensive set of DM algorithms and tools.
     • Generic algorithms over arbitrary data sets.
        • Independent of the number of columns in tables.
     • Open and extensible system based on Java.
     • Also free …
