
IE 483/583 Knowledge Discovery and Data Mining

Dr. Siggi Olafsson, Fall 2003


Presentation Transcript


  1. IE 483/583 Knowledge Discovery and Data Mining Dr. Siggi Olafsson Fall 2003 Data Mining

  2. What is Data Mining? (… and should I be here?)

  3. Dilbert Replies ...

  4. Some Definitions
     • “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
     • “Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”

  5. What can Data Mining Do?
     • Supervised
       • Classification
       • Prediction
     • Unsupervised
       • Association discovery
       • Clustering

  6. Applications of Data Mining
     • Manufacturing Process Improvement
     • Sales and Marketing
     • Mapping the Human Genome
     • Diagnosing Breast Cancer
     • Financial Crime Identification
     • Portfolio Management

  7. Technical Background
     • Machine Learning
       • Data mining: business-oriented use of AI
     • Statistics
       • Regression, sampling, DOE, etc.
     • Decision Support
       • Data warehousing, data marts, OLAP, etc.
     • Interdisciplinary tools put together to form the process of knowledge discovery in databases …

  8. Historical Perspective
     • Before the 40s
       • Stat: Bayes theorem, regression, etc.
     • 40s
       • AI: Neural networks
     • 50s
       • AI: Nearest neighbor, single link, perceptron
       • Stat: Resampling, bias reduction, jackknife
     • 60s
       • Stat: Linear models for classification, exploratory data analysis (EDA)
       • IR: Similarity measures, clustering
       • DB: Relational data model
     • 70s
       • IR: Smart IR systems
       • AI: Genetic algorithms
       • Stat: EM algorithm, k-means clustering
     • 80s
       • AI: Kohonen maps, decision trees
     • 90s
       • DB: Association rule algorithms, web & search engines, data warehousing, OLAP

  9. What Changed?
     • Very large databases
     • Increased computational power as enabler
     • Business perspective

  10. Knowledge Discovery in Databases
     [Diagram labels: Data Warehouse Systems Engineering; Databases; Data warehouse; Prepared Data; Knowledge; Model/Structures; Knowledge Discovery and Data Mining]

  11. Course Information
     • We assume data is ready for mining
     • Thus, we focus on:
       • models and structures, and
       • algorithms
     • More information on course homepage: http://www.public.iastate.edu/~olafsson/mining.html

  12. Data Mining

  13. Course Outline
     • Introduction
     • Exploratory Data Mining
     • Supervised Learning
     • Unsupervised Learning
     • Optimization Methods in Learning
     • Selected Advanced Topics
     • Mining the Web
     • Customer Relationship Management (CRM)
     • Course Review

  14. Questions?

  15. Data Mining
     • Discover patterns in data
       • automatic or semi-automatic process
       • meaningful or useful pattern
       • large amounts of data
     • What does such a pattern look like?
       • Black box
       • Transparent box

  16. Describing Structural Patterns
     • Some ways of representing knowledge:
       • Decision tables
       • Decision trees
       • Classification rules
       • Association rules
       • Regression trees
       • Clusters

  17. The Weather Problem

  18. A Decision List
     If outlook = sunny and humidity = high then play = no
     If outlook = rainy and windy = true then play = no
     If outlook = overcast then play = yes
     If humidity = normal then play = yes
     If none of the above then play = yes
     • These are classification rules
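A decision list is evaluated top to bottom, and the first rule whose conditions all hold decides the class. The following is a minimal Python sketch of the list above; the names `RULES`, `DEFAULT`, and `classify` are illustrative and not part of the course material.

```python
# Each rule is (conditions, outcome); rules are tried in order, first match wins.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]
DEFAULT = "yes"  # "If none of the above then play = yes"


def classify(instance):
    """Return the play decision for one instance (a dict of attribute values)."""
    for conditions, outcome in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return outcome
    return DEFAULT


if __name__ == "__main__":
    day = {"outlook": "rainy", "humidity": "normal", "windy": True}
    # The second rule fires before the "humidity = normal" rule is reached.
    print(classify(day))
```

Because order matters, moving the `humidity = normal` rule to the top would change the answer for the rainy, windy day in the usage example.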

  19. Association Rules
     • Many association rules can be inferred:
     If temperature = cool then humidity = normal
     If humidity = normal and windy = false then play = yes
     If outlook = sunny and play = no then humidity = high

  20. Three Layers of the Process
     • Inputs
     • Outputs
     • Algorithms

  21. Inputs
     • Three forms
     • Concepts
       • concept description: what you want to learn
     • Instances
       • examples: what you learn from
     • Attributes
       • features of instances: variables you have values for

  22. Concepts: Styles of Learning
     • Classification (supervised) learning
     • Association learning
     • Clustering
     • Numeric prediction

  23. Instances: Learn from Examples
     • Set of instances to be classified, or associated, or clustered
     • Example of concept to be learned
     • Data set: flat file (single relation)
       • denormalization
     • Family tree example
       • concept: sister
       • example: family tree

  24. Family Tree

  25. Denormalizing Relational Data
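The slide's table is not part of this transcript, so the following is only a rough Python sketch of what flattening a family tree into a single relation for the concept "sister" might look like. The people, the field names, and the `denormalize` helper are all hypothetical.

```python
from itertools import permutations

# Hypothetical family-tree relation: name -> (gender, parent1, parent2).
PEOPLE = {
    "Ann": ("female", None, None),
    "Bob": ("male", None, None),
    "Cara": ("female", "Ann", "Bob"),
    "Dan": ("male", "Ann", "Bob"),
    "Eve": ("female", "Ann", "Bob"),
}


def denormalize(people):
    """Build one flat instance per ordered pair of people.

    The class attribute sister_of is "yes" when the second person is a
    sister of the first (female, with the same known parents).
    """
    rows = []
    for first, second in permutations(people, 2):
        g1, *parents1 = people[first]
        g2, *parents2 = people[second]
        share_parents = None not in parents1 and parents1 == parents2
        rows.append({
            "first": first, "first_gender": g1, "first_parents": tuple(parents1),
            "second": second, "second_gender": g2, "second_parents": tuple(parents2),
            "sister_of": "yes" if share_parents and g2 == "female" else "no",
        })
    return rows


if __name__ == "__main__":
    rows = denormalize(PEOPLE)
    print(len(rows))
    print([(r["first"], r["second"]) for r in rows if r["sister_of"] == "yes"])
```

Five people already produce 20 flat instances, most of them negative examples, which hints at how quickly the single relation grows.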

  26. Denormalization Problems
     • Computational and storage costs
     • Trivial regularities (customers and products, product and supplier, supplier and supplier address)
     • Infinite relations

  27. Content of Instances: Attributes
     • Instance characterized by values of its (predefined) set of attributes
     • Numeric (“continuous”) and nominal (categorical): focus in this class
     • Ordinal (rank)
     • Interval
     • Ratio

  28. Data Preparation
     • Data …
       • assembly: set of instances / denormalizing relational data
       • integration: enterprise-wide database / data warehouse
       • cleaning: missing data
       • aggregation: good information

  29. ARFF Format
     • Used by the Java package Weka
     • Independent, unordered instances
     • No relationship between instances

  30. Weather Data

  31. Features
     • % = comments
     • @relation <name>
     • @attribute <name> <type>
       • Attribute types: nominal and numeric
     • @data
       • List of instances
     • Missing values represented by ?
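To make the ARFF features listed above concrete, here is a small Python sketch that embeds a toy ARFF file and reads it with a hand-written parser. The data values are invented for illustration, and `parse_arff` is a simplified stand-in that covers only the listed features; it is not Weka's own reader.

```python
ARFF_TEXT = """\
% Toy weather relation (values invented for illustration)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
rainy,70,yes
"""


def parse_arff(text):
    """Parse the subset of ARFF described on the slide.

    Returns (relation name, list of (attribute, type), list of instance dicts).
    Nominal types become lists of allowed values; '?' becomes None.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None  # missing value
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            instances.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):  # nominal: {a, b, c}
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, instances


if __name__ == "__main__":
    relation, attributes, instances = parse_arff(ARFF_TEXT)
    print(relation, attributes)
    for instance in instances:
        print(instance)
```

Each data line is an independent instance, so the parser keeps no state between rows beyond the attribute declarations.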

  32. Other Issues
     • Missing data
     • Inaccurate values
     • Look at the data!!!

  33. Recall the Three Layers of the Data Mining Process
     • Inputs (done)
     • Outputs: structural patterns (next)
     • Algorithms

  34. Describing Structural Patterns
     • Ways of representing knowledge:
       • Decision tables
       • Decision trees
       • Classification rules
       • Association rules
       • Regression trees
       • Clusters

  35. The Weather Problem

  36. A Decision List
     If outlook = sunny and humidity = high then play = no
     If outlook = rainy and windy = true then play = no
     If outlook = overcast then play = yes
     If humidity = normal then play = yes
     If none of the above then play = yes

  37. A Decision Tree
     [Tree diagram; labels recovered from the slide:]
     • Outlook = Sunny: test Humidity (High: Play=No)
     • Outlook = Overcast: Play=Yes
     • Outlook = Rainy: test Windy (TRUE: Play=No)

  38. Concepts: Styles of Learning
     • Classification (supervised) learning
     • Association learning
     • Clustering
     • Numeric prediction

  39. Classification Rules
     • Classification rules are easily read off decision trees
       • How?
     • Other direction possible, but not as straightforward:
     If a and b then x
     If c and d then x
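One way to answer "How?": follow every path from the root to a leaf and turn the tests along the path into one rule. The sketch below does this in Python for the weather tree. The `humidity = normal` and `windy = false` branches are not visible in this transcript, so they are assumed here to lead to play = yes, which matches the decision list; the `TREE` encoding and `tree_to_rules` are illustrative names.

```python
# A node is either a leaf (class label string) or (attribute, {value: subtree}).
# Branches for humidity = normal and windy = false are assumptions (see lead-in).
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(node, path=()):
    """Return one (conditions, outcome) rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path so far is the rule's antecedent
        return [(list(path), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, path + ((attribute, value),)))
    return rules


if __name__ == "__main__":
    for conditions, outcome in tree_to_rules(TREE):
        tests = " and ".join(f"{attr} = {value}" for attr, value in conditions)
        print(f"If {tests} then play = {outcome}")
```

The resulting rules are mutually exclusive, so unlike a decision list they can be applied in any order.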

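  40. Corresponding Decision Tree
     [Tree diagram: the root tests a, followed by tests on b, c, and d with branches labeled y and n; the subtree testing c and then d appears more than once; leaves are labeled x]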

  41. Replicated Subtree Problem
     [Tree diagram: the root tests x = 1; both of its branches (n and y) test y = 1; leaves are b, a, a, b]
     If x=1 and y=0 then a
     If x=0 and y=1 then a
     If x=0 and y=0 then b
     If x=1 and y=1 then b

  42. Replicated Subtree Problem
     If x=1 and y=1 then a
     If z=1 and w=1 then a
     Otherwise b
     • x, y, z, w take values 1, 2, 3

  43. Rules with exceptions
     • Account for new instances
     • Exceptions from exceptions, etc.
     If x and y then a EXCEPT if z then b

  44. Association Rules
     • Coverage (support): number of instances it predicts correctly
     • Accuracy (confidence): coverage divided by number of instances it applies to
     • Example: If temperature = cool then humidity = normal
       • Coverage = 4
       • Accuracy = 100%
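The coverage and accuracy figures can be checked by counting. The weather table itself is not reproduced in this transcript, so the Python sketch below assumes the standard 14-instance nominal weather data that ships with Weka; `coverage_and_accuracy` is an illustrative helper name.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")

# Assumption: the standard nominal weather data distributed with Weka.
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance[attr] == value for attr, value in conditions.items())


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage = instances predicted correctly; accuracy = coverage / instances the rule applies to."""
    applies = [x for x in data if matches(x, antecedent)]
    correct = [x for x in applies if matches(x, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


if __name__ == "__main__":
    print(coverage_and_accuracy(DATA, {"temperature": "cool"}, {"humidity": "normal"}))
```

On this data the rule applies to four cool days and all four have normal humidity, which agrees with the slide's Coverage = 4 and Accuracy = 100%.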

  45. Interpretation
     If windy = false and play = no then outlook = sunny and humidity = high
     If windy = false and play = no then outlook = sunny
     If windy = false and play = no then humidity = high
     If humidity = high and windy = false and play = no then outlook = sunny

  46. The Shapes Problem
     • Shaded = standing
     • Unshaded = lying

  47. Instances

  48. Classification Rules
     If width ≥ 3.5 and height < 7.0 then lying
     If height ≥ 3.5 then standing
     • Work well to classify these instances
     • Problems?

  49. Relational Rules
     If width > height then lying
     If height > width then standing
     • Rules comparing attributes to constants are called propositional rules
     • Structural patterns?
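The difference between the two kinds of rules shows up on blocks outside the training range. The Python sketch below compares the propositional rules from the previous slide with the relational rules above. It assumes the comparison symbols lost from the transcript are ≥, and the block dimensions used are hypothetical.

```python
def propositional(width, height):
    """Rules that compare attributes to constants (operators assumed to be >=)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules that compare attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height: neither rule applies


if __name__ == "__main__":
    # Hypothetical blocks as (width, height).
    for block in [(4.0, 2.0), (10.0, 8.0), (2.0, 3.0)]:
        print(block, propositional(*block), relational(*block))
```

A large block that is wider than it is tall, such as (10.0, 8.0), is called standing by the constant thresholds but lying by the relational rule, which is the kind of problem the slide's question points at.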

  50. CPU Performance Example
