Data Mining
Dr. Mohsen Kahani, http://www.um.ac.ir/~kahani/
Motivation: “Necessity is the Mother of Invention”
• Data explosion problem:
  • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
• We are drowning in data, but starving for knowledge!
Expert Systems and Knowledge Engineering, Dr. Kahani
Related Fields
[Figure: fields related to Data Mining and Knowledge Discovery: Machine Learning, Visualization, Statistics, Databases]
Knowledge Discovery Process
[Figure: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge (Understanding)]
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases from the bottom layer to the top]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Definition of Data Mining
“…The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]
Need for Data Mining
• Data accumulate and double every 9 months
• There is a big gap between stored data and knowledge, and the transition won’t occur automatically
• Manual data analysis is not new, but it is a bottleneck
• Fast-developing computer science and engineering generate new demands
• Seeking knowledge from massive data
When is DM useful?
• Data-rich world
• Large data (dimensionality and size)
  • Image data (size)
  • Gene chip data (dimensionality)
• Little knowledge about data (exploratory data analysis)
• What if we have some knowledge?
Challenges
• Increasing data dimensionality and data size
• Various data forms
• New data types
  • Streaming data, multimedia data
• Efficient search and access to data/knowledge
• Intelligent update and integration
• Privacy concerns
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events
Data Mining versus OLAP
• OLAP: On-line Analytical Processing
• Provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
Data Mining Versus Statistical Analysis
Data Analysis
• Tests for statistical correctness of models
  • Are statistical assumptions of models correct?
  • E.g., is the R-square good?
• Hypothesis testing
  • Is the relationship significant?
  • Use a t-test to validate significance
• Tends to rely on sampling
• Techniques are not optimised for large amounts of data
• Requires strong statistical skills
Data Mining
• Originally developed to act as expert systems to solve problems
• Less interested in the mechanics of the technique
  • If it makes sense then let’s use it
• Does not require assumptions to be made about data
• Can find patterns in very large amounts of data
• Requires understanding of data and business problem
Data Mining Taxonomy
• Predictive Method: …predict the value of a particular attribute…
• Descriptive Method: …foundation of human-interpretable patterns that describe the data…
Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Deviation Detection [Predictive]
Data Mining Tasks: Classification
• Learn a method for predicting the instance class from pre-labeled (classified) instances
• Many approaches: Statistics, Decision Trees, Neural Networks, ...
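Classification: Linear Regression
• Linear regression as a classifier: assign a point (x, y) to one class when w0 + w1 x + w2 y >= 0 and to the other class otherwise
• Regression computes the weights wi from the data so as to minimize the squared error, i.e. to ‘fit’ the data
• Not flexible enough: the decision boundary is always a straight line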
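The least-squares rule on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the eight points, the +1/-1 encoding of the two classes, and the helper names `solve`, `fit`, and `predict` are all hypothetical and do not come from the slides.

```python
# Least-squares fit of a linear decision rule w0 + w1*x + w2*y >= 0.
# The data points and the +1/-1 target encoding are hypothetical.

def solve(a, b):
    """Solve the small linear system a . w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(points, targets):
    """Return [w0, w1, w2] minimizing the squared error (normal equations)."""
    rows = [[1.0, float(x), float(y)] for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)

def predict(w, x, y):
    """Apply the slide's rule: class +1 if w0 + w1*x + w2*y >= 0, else -1."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1

points = [(4, 4), (5, 3), (4, 5), (5, 5), (1, 1), (2, 1), (1, 2), (0, 0)]
targets = [1, 1, 1, 1, -1, -1, -1, -1]
w = fit(points, targets)
print([predict(w, x, y) for x, y in points])
```

Because the boundary is a single line, data such as the blue/green regions on the decision-tree slide cannot be separated this way, which is what "not flexible enough" refers to.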
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plane partitioned at X = 2, X = 5, and Y = 3]
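The rule chain on this slide translates directly into code. The sketch below is a literal transcription of the four rules; the function name `classify` and the sample points are hypothetical.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's rules in order; the first matching test decides."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"

# Hypothetical sample points, one for each branch of the rule chain.
sample_points = [(6, 1), (1, 4), (3, 1), (1, 1)]
labels = [classify(x, y) for x, y in sample_points]
print(labels)
```

Only points with 2 < X <= 5 and Y <= 3 come out green; every other region is blue.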
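Example Decision Tree
Attribute types: Refund (categorical), MarSt (categorical), TaxInc (continuous); the leaves carry the class (YES/NO). Refund, MarSt, and TaxInc are the splitting attributes.
• Refund = Yes: NO
• Refund = No: test MarSt
  • MarSt = Married: NO
  • MarSt = Single, Divorced: test TaxInc
    • TaxInc < 80K: NO
    • TaxInc > 80K: YES
The splitting attribute at a node is determined based on the Gini index.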
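As a rough illustration of how the Gini index scores a candidate split, the sketch below computes the impurity of a node and the weighted impurity after splitting it. The class labels are hypothetical illustration data (ten records, three of class "Yes"), not values taken from the slide, and the function names are likewise assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a two-way split, weighted by the size of each branch."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical class labels on each side of a Refund = Yes / No split.
refund_yes = ["No", "No", "No"]
refund_no = ["No", "No", "No", "No", "Yes", "Yes", "Yes"]

parent_impurity = gini(refund_yes + refund_no)
after_split = split_gini(refund_yes, refund_no)
print(round(parent_impurity, 4), round(after_split, 4))
```

A tree builder would evaluate `split_gini` for each candidate attribute and pick the one with the lowest value, i.e. the largest drop from the parent's impurity.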
Classification: Neural Networks
• Efficiently model large and complex problems
• May be used in classification problems or for regressions
• Structure: input layer => hidden layer => output layer
[Figure: network with nodes numbered 1 to 6, grouped as Inputs, Hidden Layer, and Output]
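The input => hidden => output flow can be sketched as a single forward pass. This assumes the six nodes in the figure are two inputs, three hidden units, and one output; the sigmoid activation and every weight and bias below are hypothetical values chosen for illustration, not trained parameters from the slides.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: inputs -> hidden layer -> single output unit."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical parameters for a 2-3-1 network.
w_hidden = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]  # one row per hidden unit
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.7, -0.2, 0.4]
b_out = 0.05

output = forward([0.5, -1.0], w_hidden, b_hidden, w_out, b_out)
print(output)
```

For classification the output can be thresholded (for example at 0.5); training would adjust the weights, which this sketch does not attempt.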
Neural Networks (cont.)
• can be easily implemented to run on massively parallel computers
• cannot be easily interpreted
• require an extensive amount of training time
• require a lot of data preparation (very careful data cleansing, selection, preparation, and pre-processing)
• require a sufficiently large data set and a high signal-to-noise ratio
Classification Example
[Figure: a classifier is learned from the Training Set (two categorical attributes, one continuous attribute, and a class label); the resulting Model is then applied to the Test Set]
Classification Application
• Direct Marketing
• Fraud Detection
• Customer Attrition/Churn
• Sky Survey Cataloging
Data Mining Tasks: Clustering
• Goal is to identify categories
  • Natural grouping of customers by processing all the available data about them
• Other applications
  • Market segmentation, discovering affinity groups, and defect analysis
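One common way to find such natural groupings is k-means, sketched below. The slides do not name k-means, so it is swapped in here purely as an example technique; the customer points, the choice of two clusters, and the starting centroids are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in points]

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign(points, centroids), centroids

# Hypothetical customers described by two attributes.
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
cluster_labels, centers = kmeans(customers, [customers[0], customers[3]])
print(cluster_labels)
```

No class labels are supplied: the two groups emerge only from the distances between customers, which is what makes clustering a descriptive task.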
Kohonen Network Description
• Unsupervised
• Seeks to describe the dataset in terms of natural clusters of cases
Data Mining Tasks: Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection
• Produce dependency rules that predict the occurrence of an item based on occurrences of other items
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
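Rules like these are usually judged by support and confidence. The sketch below computes both for the two rules on the slide; the five transactions are hypothetical illustration data (the slide's record table did not survive), and the function names are assumptions.

```python
# Hypothetical market-basket records, one set of items per transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", round(support(lhs, rhs), 2),
          "confidence", round(confidence(lhs, rhs), 2))
```

A rule-discovery algorithm would search over many itemsets and keep only the rules whose support and confidence exceed chosen thresholds; this sketch only scores two given rules.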
Association Rule Discovery Application
• Marketing and Sales Promotion
• Supermarket Shelf Management
• Inventory Management
Deviation Detection & Pattern Discovery
• Deviation Detection: “…discovering most significant changes in data from previously measured or normative values…” (V. Kumar, M. Joshi, Tutorial on High Performance Data Mining)
• Sequential Pattern Discovery: “…process of looking for patterns and rules that predict strong sequential dependencies among different events…” (V. Kumar, M. Joshi, Tutorial on High Performance Data Mining)
Sequential Patterns
• Identify frequently occurring sequences from given records
• Example: 40 percent of female customers buy a gray skirt six months after buying a red jacket
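A statistic like the one on this slide can be computed by scanning each customer's purchase history. The sketch below assumes the percentage is taken over customers who bought the first item and that "six months after" means within six months of the first purchase; both readings, the purchase histories, and the function name are hypothetical.

```python
def sequence_confidence(histories, first, then, max_gap):
    """Among customers who bought `first`, the fraction who bought `then`
    afterwards, no more than `max_gap` months after the earliest `first`."""
    bought_first = 0
    followed = 0
    for events in histories.values():
        first_months = [month for month, item in events if item == first]
        if not first_months:
            continue  # this customer never bought the first item
        bought_first += 1
        start = min(first_months)
        if any(item == then and 0 < month - start <= max_gap
               for month, item in events):
            followed += 1
    return followed / bought_first if bought_first else 0.0

# Hypothetical histories: customer id -> list of (month, item) purchases.
histories = {
    "c1": [(1, "red jacket"), (7, "gray skirt")],
    "c2": [(2, "red jacket"), (5, "gray skirt")],
    "c3": [(3, "red jacket")],
    "c4": [(1, "gray skirt"), (4, "red jacket")],   # skirt came first
    "c5": [(1, "red jacket"), (12, "gray skirt")],  # gap too long
    "c6": [(2, "blue scarf")],                      # never bought the jacket
}

rate = sequence_confidence(histories, "red jacket", "gray skirt", max_gap=6)
print(rate)
```

Unlike an association rule, the order and timing of the events matter here: customer c4 bought both items but in the wrong order, so that history does not support the pattern.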
Data Mining Methodology: SAS
• Sample
  • Extract a portion of the dataset for data mining
• Explore
• Modify
  • Create, select, and transform variables with the intention of building a model
• Model
  • Specify a relationship of variables that reliably predicts a desired goal
• Assess
  • Evaluate the practical value of the findings and the model resulting from the data mining effort
Data Mining Methodology: CRISP-DM
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
CRISP-DM Phases
Phases and Tasks
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
Major Application Areas for Data Mining Solutions
• Fraud/Non-Compliance, Anomaly detection
  • Isolate the factors that lead to fraud, waste and abuse
  • Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Parts failure prediction
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
  • Build profiles of customers likely to use which services
• Web Mining
• Health Care
Controversial Issues
• Data mining (or simple analysis) on people may produce a profile that raises controversial issues of
  • Discrimination
  • Privacy
  • Security
• Examples:
  • Should males between 18 and 35 from countries that produced terrorists be singled out for search before a flight?
  • Can people be denied a mortgage based on age, sex, or race?
  • Women live longer. Should they pay less for life insurance?
Data Mining and Discrimination
• Can discrimination be based on features like sex, age, national origin?
• In some areas (e.g. mortgages, employment), some features cannot be used for decision making
• In other areas, these features are needed to assess the risk factors
  • E.g. people of African descent are more susceptible to sickle cell anemia
Data Mining and Privacy
• Can information collected for one purpose be used for mining data for another purpose?
  • In Europe, generally no, without explicit consent
  • In US, generally yes
• Companies routinely collect information about customers and use it for marketing, etc.
• People may be willing to give up some of their privacy in exchange for some benefits
• See Data Mining And Privacy Symposium, www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data Mining and Privacy
• Data Mining looks for patterns, not people!
• Technical solutions can limit privacy invasion
  • Replacing sensitive personal data with an anonymous ID
  • Giving randomized outputs
  • Multi-party computation: distributed data
  • …
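Two of the technical solutions listed above can be sketched briefly. The slides do not say how the anonymous ID or the randomized output is produced, so the choices below are assumptions: a keyed hash (HMAC-SHA256) stands in for the anonymous ID, and a simple randomized-response scheme stands in for randomized outputs. The key, the identifiers, and all function names are hypothetical.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept by the data owner

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a personal identifier with a stable keyed-hash ID."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "anon-" + digest[:16]

def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the true answer with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(observed_rate: float, p_truth: float) -> float:
    """Invert observed = p_truth * true + (1 - p_truth) * 0.5 for the true rate."""
    return (observed_rate - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(0)  # fixed seed so the sketch is repeatable
answers = [True] * 30 + [False] * 70  # hypothetical sensitive attribute
reported = [randomized_response(a, 0.5, rng) for a in answers]
observed = sum(reported) / len(reported)

print(pseudonymize("alice@example.com"))
print(estimate_true_rate(observed, 0.5))
```

The same person always maps to the same anonymous ID, so patterns can still be mined across records, while any single reported answer may be random and only the aggregate rate can be estimated.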
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve with the stages rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]
Final Remarks
• Data Mining can be utilized in any field that needs to find patterns or relationships in its data.