Introduction to Data Mining Dr. Hany Saleeb
Why Data Mining? Potential Applications
• Direct marketing
  – Identify which prospects should be included in a mailing list
• Market segmentation
  – Identify common characteristics of customers who buy the same products
• Market basket analysis
  – Identify which products are likely to be bought together
• Insurance claims analysis
  – Discover patterns of fraudulent transactions
  – Compare current transactions against those patterns
What Is Data Mining?
• Combination of AI and statistical analysis to discover information that is “hidden” in the data
  – Associations (e.g. linking purchase of pizza with beer)
  – Sequences (e.g. tying events together: marriage and purchase of furniture)
  – Classifications (e.g. recognizing patterns such as the attributes of employees that are most likely to quit)
  – Forecasting (e.g. predicting buying habits of customers based on past patterns)
• Expert systems or small ML/statistical programs
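An association such as "pizza with beer" is commonly quantified with two measures: support (the fraction of all transactions that contain every item in the rule) and confidence (among transactions containing the antecedent, the fraction that also contain the consequent). The following is a minimal sketch, assuming a hypothetical toy list of baskets; the data and the function names `count_containing`, `support`, and `confidence` are illustrative and do not come from the slides.

```python
# Hypothetical toy transactions; each basket is the set of items in one purchase.
baskets = [
    {"pizza", "beer"},
    {"pizza", "beer", "chips"},
    {"pizza", "soda"},
    {"beer", "chips"},
    {"pizza", "beer", "soda"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the whole itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    n_antecedent = count_containing(antecedent, baskets)
    if n_antecedent == 0:
        return 0.0  # the rule never applies in this data
    n_both = count_containing(set(antecedent) | set(consequent), baskets)
    return n_both / n_antecedent
```

With these five baskets, the rule pizza → beer has support 3/5 and confidence 3/4: three of the four baskets that contain pizza also contain beer.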
What can data mining do?
• Classification
  – Classify credit applicants as low, medium, or high risk
  – Classify insurance claims as normal or suspicious
• Estimation
  – Estimate the probability of a direct mailing response
  – Estimate the lifetime value of a customer
• Prediction
  – Predict which customers will leave within six months
  – Predict the size of the balance that will be transferred by a credit card prospect
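Classification assigns a new record to one of a set of classes known in advance, using labeled examples. One simple way to illustrate the credit-risk case is a k-nearest-neighbour vote; this is only one of many possible classifiers, and the slides do not prescribe it. The sketch below assumes hypothetical applicants described by two made-up features (an income score and a debt ratio, both scaled to [0, 1]); the data and the name `classify` are illustrative.

```python
import math
from collections import Counter

# Hypothetical labeled applicants: (income score, debt ratio), both scaled to [0, 1].
training = [
    ((0.9, 0.1), "low"),
    ((0.8, 0.2), "low"),
    ((0.5, 0.5), "medium"),
    ((0.4, 0.6), "medium"),
    ((0.2, 0.9), "high"),
    ((0.1, 0.8), "high"),
]


def classify(applicant, training, k=3):
    """Label an applicant by majority vote among the k nearest training examples."""
    nearest = sorted(training, key=lambda example: math.dist(applicant, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, `classify((0.88, 0.12), training)` returns `"low"`, because two of the three closest labeled applicants are low risk.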
What can data mining do? (cont’d)
• Association
  – Find out which items customers are likely to buy together
  – Find out what books to recommend to Amazon.com users
• Clustering
  – Difference from classification: classes are unknown!
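Because clustering has no predefined classes, the groups have to emerge from the data. A common illustration is k-means (Lloyd's algorithm), shown here as a minimal one-dimensional sketch. The spending figures, the starting centers, and the function name `kmeans_1d` are assumptions made for the example, not something taken from the slides.

```python
def kmeans_1d(values, centers, max_iterations=100):
    """Cluster 1-D values around the given initial centers (Lloyd's algorithm)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [v for v, label in zip(values, labels) if label == i]
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical annual spending per customer; no class labels are supplied.
spending = [100, 120, 110, 900, 950, 1000]
centers, labels = kmeans_1d(spending, centers=[100, 1000])
```

Here the algorithm settles on centers 110.0 and 950.0 and splits the customers into a low-spending and a high-spending group, without ever being told that such groups exist.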
Market Analysis and Management
• Where are the data sources for analysis?
  – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  – Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  – Associations/correlations between product sales
  – Prediction based on the association information
Data Mining: Confluence of Multiple Disciplines
[Diagram: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  – Object-oriented and object-relational databases
  – Spatial databases
  – Time-series data and temporal data
  – Text databases and multimedia databases
  – Heterogeneous and legacy databases
  – WWW
Data Mining Process
[Diagram; labels shown: Learning, Collecting relevant data, Model building, Understanding of business, Problem identification, Business strategy and evaluation, Action]
Requirements/challenges in Data Mining
• User interface
• Mining methodology
• Performance
• Data source
• Social and security
Requirements/challenges in Data Mining (2)
• User interface
  – Data visualization
    • Understandability and interpretation of results
    • Information representation and rendering
    • Screen real estate
  – Interactivity
    • Manipulation of mined knowledge
    • Focus and refine mining tasks
    • Focus and refine mining results
Requirements/challenges in Data Mining (3)
• Mining methodology
  – Mining different kinds of knowledge in databases
  – Interactive mining of knowledge at multiple levels of abstraction
  – Incorporation of background knowledge
  – Query languages
  – Expression and visualization of results
  – Handling noise and incomplete data
  – Pattern evaluation
Requirements/challenges in Data Mining (4)
• Performance
  – Efficiency and scalability of data mining algorithms
  – Linear algorithms needed
  – Parallel and distributed methods
  – Incremental methods
  – Divide and conquer?
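An incremental method updates its result as each new record arrives instead of rescanning the whole database. As a small illustration of the idea (not a technique named in the slides), the sketch below maintains a running mean and variance with Welford's method, using constant time and memory per record. The class name `RunningStats` and the sample values are assumptions made for the example.

```python
class RunningStats:
    """Mean and variance maintained incrementally (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics in O(1) time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the observations seen so far."""
        return self._m2 / self.n if self.n else 0.0


# Hypothetical stream of transaction amounts, processed one record at a time.
stats = RunningStats()
for amount in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(amount)
```

After the eight updates, `stats.mean` is approximately 5 and `stats.variance` approximately 4, the same values a full pass over the data would give.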
Requirements/challenges in Data Mining (5)
• Data source
  – Diversity of data types
    • Handling complex types of data
    • Mining information from heterogeneous databases or information repositories
    • Can we expect a DM algorithm to do well on all types of data?
  – Data glut
    • Are we collecting the right data for the right answer?
    • Distinguish between important and unimportant data
Requirements/challenges in Data Mining (6)
• Social and security
  – Social impact
    • Private and sensitive data is gathered and mined without the individual’s knowledge and/or consent
    • Appropriate use and distribution of discovered knowledge
  – Regulations
    • Need for privacy and DM policies
DBMiner: A free tool
• DBMiner: a data mining system that originated in the Intelligent Database Systems Lab and was further developed by DBMiner Technology Inc.
• OLAM (on-line analytical mining) architecture for interactive mining of multi-level knowledge in both RDBMS and data warehouses
• Mining knowledge on Microsoft SQL Server 7.0 databases and/or data warehouses
• Multiple mining functions: discovery-driven OLAP, association, classification, and clustering
Input and Output
• Input: SQL Server 7.0 data cubes, constructed from single or multiple relational tables, data warehouses, or spreadsheets (with OLEDB and RDBMS connections)
• Multiple outputs
  – Summarization and discovery-driven OLAP: crosstabs and graphical outputs using MS Excel 2000
  – Association: rule tables, rule planes, and ball graphs
  – Classification: decision trees and decision tables
  – Clustering: maps and summarization graphs
• Others
  – Data and cube views
  – Visualization of concept hierarchies
  – Visualization for task management
  – Visualization of 2-D and 3-D boxplots
Data Mining Tasks
• DBMiner covers the following functions
  – Discovery-driven, OLAP-based multi-dimensional analysis
  – Association and frequent pattern analysis
  – Classification (decision tree analysis)
  – Cluster analysis
  – 3-D cube viewer and analyzer
• Other functions
  – OLAP service, cube exploration, statistical analysis
  – Sequential pattern analysis
  – Visual classification
Summary
• Knowing one’s business is critical; technologies are coming together to support data mining.
• Data mining is both the process and the result of knowledge production, knowledge discovery, and knowledge management.