Data mining and the knowledge discovery process

Presentation Transcript


  1. Data mining and the knowledge discovery process Summer Course 2005 H.H.L.M. Donkers

  2. Content • Opening / acquaintance • What is data mining • Data mining methodology • Course perspective • Course contents

  3. Data - Information - Knowledge - • Data: symbols • Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions • Knowledge: application of data and information; answers "how" questions • Understanding: appreciation of "why" • Wisdom: evaluated understanding. (Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)

  4. Data - Information - Knowledge - (figure; source: http://www.outsights.com/systems/dikw/dikw.htm)

  5. What is Data Mining – Traditionally “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.” Witten & Frank (2000). Data Mining.

  6. What is Data Mining – Traditionally “The application of specific algorithms for extracting patterns from data, it is a part of knowledge discovery from databases” Fayyad (1997). From data mining to knowledge discovery in databases.

  7. What is Data Mining – Traditionally “Data mining is a process, not just a series of statistical analyses.” SAS Institute (2003). Finding the solution to data mining.

  8. What is Data Mining – Traditionally • Computer Science: (semi-)automated application of algorithms for pattern discovery; algorithms developed in the field of Artificial Intelligence (machine learning); part of the process of knowledge discovery • Statistics: process of discovering patterns in data; (manual) application of a series of statistical techniques (among which machine learning); incorporates exploration, sampling, modeling, validation • Data mining = Statistics + Marketing

  9. What is Data Mining – A Fusion “An analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction.” Statsoft (2003). Data Mining Techniques.
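The "validate the findings by applying the detected patterns to new subsets of data" step in the definition on slide 9 can be illustrated with a holdout split. This is only a minimal sketch: the toy data, the threshold rule, and the function names `split_holdout`, `learn_threshold`, and `accuracy` are assumptions made for illustration and do not come from the course material.

```python
import random

def split_holdout(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and set aside a fraction as the 'new subset'."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(rows, threshold):
    """Fraction of rows where the rule 'x >= threshold' matches the label."""
    return sum((x >= threshold) == label for x, label in rows) / len(rows)

def learn_threshold(train):
    """Detect the pattern: the threshold that best fits the training rows."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))

# Hypothetical toy data: the label is True exactly when x >= 50.
rows = [(x, x >= 50) for x in range(100)]

train, test = split_holdout(rows)
threshold = learn_threshold(train)            # pattern found on the training subset
train_accuracy = accuracy(train, threshold)
test_accuracy = accuracy(test, threshold)     # pattern applied to unseen rows
print(f"threshold={threshold} train={train_accuracy:.2f} test={test_accuracy:.2f}")
```

The rule fits the training rows perfectly by construction; the accuracy on the held-out rows is the more honest estimate of how well the pattern would predict new data.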

  10. What is Data Mining – A Fusion “An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.” Rudjer Boskovic Institute (2001). DMS Tutorial.

  11. Data Mining In This Course • We use the book by Witten & Frank • Computer science (machine learning) approach • Emphasis on algorithms for pattern discovery and rule extraction: what the underlying models are; what the properties of the algorithms are; when to use them (for which tasks); how to apply and tune them; how to interpret and assess the results

  12. Data Mining Process • These algorithms are only part of a process that computer scientists call Knowledge Discovery and the statisticians call Data Mining • The process starts with the recognition of a problem and ends with the control of a deployed solution • The whole process needs to be supported for a successful application

  13. Methodologies for Data Mining • As Data Mining is coming of age, several methodologies have been developed, each with its own perspective. We will discuss three of them: • Fayyad et al. (computer science), supported by, e.g., WEKA • SEMMA (SAS) (statistics), supported by SAS Enterprise Miner • CRISP-DM (SPSS, OHRA, among others) (business), supported by SPSS Clementine

  14. Fayyad’s KDD Methodology (diagram) • data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
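The chain of steps in Fayyad's KDD methodology (slide 14) can be read as a pipeline of functions, each consuming the previous step's output. The sketch below is an illustration only: the records, the age bands, and every function name are hypothetical, and counting co-occurrences stands in for a real data mining algorithm.

```python
from collections import Counter

# Hypothetical raw data; record 3 has a missing value.
raw = [
    {"id": 1, "age": 23, "buys": "no"},
    {"id": 2, "age": 31, "buys": "yes"},
    {"id": 3, "age": None, "buys": "yes"},
    {"id": 4, "age": 45, "buys": "yes"},
    {"id": 5, "age": 27, "buys": "no"},
    {"id": 6, "age": 52, "buys": "yes"},
]

def select(records, fields):
    """Selection: data -> target data (keep only the relevant fields)."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Preprocessing & cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: discretize age into two assumed bands."""
    return [("young" if r["age"] < 30 else "older", r["buys"]) for r in records]

def mine(items):
    """Data mining (stand-in): count (age band, outcome) co-occurrences."""
    return Counter(items)

def interpret(patterns, min_count=2):
    """Interpretation / evaluation: keep patterns seen at least min_count times."""
    return [(p, n) for p, n in patterns.most_common() if n >= min_count]

knowledge = interpret(mine(transform(clean(select(raw, ("age", "buys"))))))
print(knowledge)
```

Writing each step as its own function keeps the intermediate products (target data, processed data, transformed data, patterns) visible, which mirrors the labels on the diagram.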

  15. SEMMA Methodology • SAMPLE: Input data, Sampling, Data partition • EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection • MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen • MODEL: Regression, Tree, Neural Network, Ensemble • ASSESS: Assessment, Score, Report • Supported by the SAS Enterprise Miner environment
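To make the five SEMMA phases concrete, here is a small sketch with one simple stand-in per phase: a data partition (Sample), summary statistics (Explore), an outlier filter (Modify), a least-squares line (Model), and a validation error (Assess). It does not reproduce SAS Enterprise Miner; the toy data, the fixed every-third-row partition, the 2.5 standard deviation cut-off, and all function names are assumptions chosen for illustration.

```python
import statistics

# Hypothetical toy data: y = 2x + 1, with one corrupted measurement.
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]
rows[4] = (4.0, 500.0)

def sample(rows):
    """SAMPLE: partition the data; every third row goes to validation."""
    train = [r for i, r in enumerate(rows) if i % 3 != 0]
    valid = [r for i, r in enumerate(rows) if i % 3 == 0]
    return train, valid

def explore(rows):
    """EXPLORE: mean and standard deviation of the target variable."""
    ys = [y for _, y in rows]
    return statistics.mean(ys), statistics.stdev(ys)

def modify(rows, mean_y, stdev_y, max_z=2.5):
    """MODIFY: filter outliers further than max_z deviations from the mean."""
    return [(x, y) for x, y in rows if abs(y - mean_y) <= max_z * stdev_y]

def model(rows):
    """MODEL: fit y = intercept + slope * x by ordinary least squares."""
    x_bar = statistics.mean([x for x, _ in rows])
    y_bar = statistics.mean([y for _, y in rows])
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def assess(intercept, slope, rows):
    """ASSESS: mean absolute error on rows not used for fitting."""
    return statistics.mean([abs(y - (intercept + slope * x)) for x, y in rows])

train, valid = sample(rows)
mean_y, stdev_y = explore(train)
train_clean = modify(train, mean_y, stdev_y)
intercept, slope = model(train_clean)
mae = assess(intercept, slope, valid)
print(f"intercept={intercept:.2f} slope={slope:.2f} validation MAE={mae:.2f}")
```

A random partition would be closer to practice; the fixed split is used here only so that the run is reproducible.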

  16. CRISP-DM Methodology • Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission • Tool-independent / industry-independent • Hierarchical process model: (1) generic phases, (2) generic tasks, (3) specific tasks, (4) task instances • Supported by the SPSS Clementine environment

  17. CRISP-DM Methodology (phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment) • Phase: Business understanding • Tasks: Business objective, Assess situation, Data mining goals, Project plan

  18. CRISP-DM Methodology • Phase: Data understanding • Tasks: Collect data, Describe data, Explore data, Verify data quality

  19. CRISP-DM Methodology • Phase: Data preparation • Tasks: Select data, Clean data, Construct data, Integrate data, Format data

  20. CRISP-DM Methodology • Phase: Modeling • Tasks: Select modeling techniques, Design the test, Build model, Assess model

  21. CRISP-DM Methodology • Phase: Evaluation • Tasks: Evaluate results, Review process, Determine next steps

  22. CRISP-DM Methodology • Phase: Deployment • Tasks: Plan deployment, Plan monitoring and maintenance, Final report, Review project
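The phases and tasks listed on slides 17 to 22 fit naturally into a small lookup structure. The sketch below stores them as an ordered mapping and adds a hypothetical helper, `next_open_task`, that walks the phases in order. The strictly linear walk is a simplifying assumption made here; the slides do not prescribe it.

```python
# Phases and tasks as listed on slides 17 to 22 (dicts keep insertion order).
CRISP_DM_TASKS = {
    "Business understanding": [
        "Business objective", "Assess situation",
        "Data mining goals", "Project plan",
    ],
    "Data understanding": [
        "Collect data", "Describe data", "Explore data", "Verify data quality",
    ],
    "Data preparation": [
        "Select data", "Clean data", "Construct data",
        "Integrate data", "Format data",
    ],
    "Modeling": [
        "Select modeling techniques", "Design the test",
        "Build model", "Assess model",
    ],
    "Evaluation": [
        "Evaluate results", "Review process", "Determine next steps",
    ],
    "Deployment": [
        "Plan deployment", "Plan monitoring and maintenance",
        "Final report", "Review project",
    ],
}

def next_open_task(done):
    """Return (phase, task) for the first task not in `done`, or None."""
    for phase, tasks in CRISP_DM_TASKS.items():
        for task in tasks:
            if task not in done:
                return phase, task
    return None

# Usage: once all business-understanding tasks are done, data collection is next.
done = set(CRISP_DM_TASKS["Business understanding"])
print(next_open_task(done))
```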

  23. A Comparison (diagram placing the three methodologies side by side) • Fayyad’s KDD: Selection, Preprocessing & cleaning, Transformation & feature selection, Data Mining, Interpretation / Evaluation • SEMMA: Sample, Explore, Modify, Model, Assess • CRISP-DM: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment

  24. A Small Poll (July 2002) Source: http://www.kdnuggets.com/polls/2002/methodology.htm

  25. Course perspective and goal • The perspective is from computer science (machine learning): Fayyad’s approach • The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the Model phase of SEMMA and the Modeling phase of CRISP-DM) • The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice

  26. Course contents • Data preparation (Wednesday): selection, preprocessing, transformation • Techniques, algorithms and models: decision trees (Monday); instance-based and Bayesian learning (Tuesday); neural networks (Tuesday); association rules (Thursday); clustering (Thursday); Support Vector Machines (Friday) • Evaluation of learned models (Wednesday)

  27. Course contents • For each technique you learn: • For which tasks it is suitable (classification, rules, prediction, …) • Restrictions on input data (numerical, symbolic, etc.) • What algorithms are available • What parameters should be tuned • How to interpret the results • How to evaluate the model
