Data Mining, CSCI 307, Spring 2019. Lecture 2: Describing Patterns, Simple Examples
Data versus Information • Society produces huge amounts of data • Sources: business, science, medicine, economics, geography, environment, sports, … • Potentially valuable resource • Raw data is useless: need techniques to automatically extract information from it • Data: recorded facts • Information: patterns underlying the data • Be careful not to create patterns from random noise: https://xkcd.com/2101/
Information is Crucial • Example 1: in vitro fertilization • Given: embryos described by 60 features • Problem: select embryos that will survive • Data: historical records of embryos and outcomes • Example 2: cow culling • Given: cows described by 700 features • Problem: select cows to cull • Data: historical records and farmers’ decisions
Data Mining • Extracting • implicit, • previously unknown, • potentially useful • information from data • Needed: programs that detect patterns and regularities in the data • Strong patterns ==> good predictions • Problem 1: most patterns are not interesting • Problem 2: patterns may be inexact (or spurious) • Problem 3: data may be garbled or missing
Black Box vs. Structural Descriptions • Machine learning can produce different types of patterns: • Black box descriptions: • Can be used to predict the outcome in a new situation • Are opaque: they offer no way to examine how the prediction is made [Diagram: Data Input → BLACK BOX → Output, e.g. classification]
Black Box vs. Structural Descriptions • Structural descriptions: • Represent patterns explicitly (e.g. by a set of rules or a decision tree). • Can be used to predict outcome in new situation • Can be used to understand and explain how prediction is derived • (may be even more important) • Methods originate from artificial intelligence, statistics, and research on databases
Structural Descriptions Example: if-then rules (from the contact-lens data) If tear production rate = reduced then recommendation = none Otherwise, if age = young and astigmatic = no then recommendation = soft
The Weather Problem: A Simple Example Conditions for playing a certain game If outlook = sunny and humidity = high then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity = normal then play = yes If none of the above then play = yes These rules must be checked in order; otherwise, examples will be classified incorrectly.
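Rules like these form an ordered decision list: checked top to bottom, the first matching rule fires. A minimal Python sketch (the function name, the dict representation, and the boolean encoding of windy are assumptions for illustration, not from the course materials):

```python
# Ordered decision list for the weather rules on this slide.
# Instances are plain dicts; "windy" is encoded as a boolean (an assumption).
def classify_play(instance):
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "normal":
        return "yes"
    return "yes"  # the "none of the above" default rule

print(classify_play({"outlook": "sunny", "humidity": "high", "windy": False}))  # prints "no"
```

Note that a rainy, windy day with normal humidity is classified "no" only because the windy rule is checked before the humidity rule; reordering the rules would change the answer, which is why the order matters.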
Classification versus Association Rules Classification rule: predicts value of a given attribute (the classification of an example) If outlook = sunny and humidity = high then play = no If temperature = cool then humidity = normal If humidity = normal and windy = false then play = yes If outlook = sunny and play = no then humidity = high If windy = false and play = no then outlook = sunny and humidity = high Association rule: predicts value of arbitrary attribute (or combination)
Weather Data with Mixed Attributes Some attributes have numeric values If outlook = sunny and humidity > 83 then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity < 85 then play = yes If none of the above then play = yes
The Contact Lenses Data
Contact Lens Data is Complete Attributes: Age: young, pre-presbyopic, presbyopic Prescription: myope, hypermetrope Astigmatism: yes, no Tear production rate: reduced, normal All possible combinations of attribute values are represented. Question: How many instances is that? Note: Real input sets are not usually complete. They may have missing values, or not all combinations are present.
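The count can be checked mechanically by enumerating the Cartesian product of the attribute values; a small sketch using Python's itertools (value names follow the slide):

```python
from itertools import product

ages = ["young", "pre-presbyopic", "presbyopic"]
prescriptions = ["myope", "hypermetrope"]
astigmatism = ["yes", "no"]
tear_production = ["reduced", "normal"]

# One instance per combination of attribute values: 3 * 2 * 2 * 2
instances = list(product(ages, prescriptions, astigmatism, tear_production))
print(len(instances))  # prints 24
```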
A Complete and Correct Rule Set If tear production rate = reduced then recommendation = none If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none In real life, the classifier may not always produce the correct class. This is a large set of rules. Would a smaller set be better?
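The nine rules above can be collapsed into one ordered function, which also shows how the rules partition the 24 instances. A sketch (the function and parameter names are invented for illustration; value strings follow the slide):

```python
def recommend(age, prescription, astigmatic, tear_rate):
    # Ordered encoding of the slide's nine contact-lens rules.
    if tear_rate == "reduced":
        return "none"
    # From here on, tear production rate is "normal".
    if astigmatic == "no":
        if age in ("young", "pre-presbyopic"):
            return "soft"
        if prescription == "myope":   # age is presbyopic
            return "none"
        return "soft"                 # presbyopic hypermetrope
    # Astigmatic cases:
    if prescription == "myope":
        return "hard"
    if age == "young":                # hypermetrope, astigmatic
        return "hard"
    return "none"                     # older hypermetrope, astigmatic
```

Because the data set is complete, this function assigns a recommendation to every one of the 24 possible instances.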
A Decision Tree for this Same Problem Pruned decision tree produced by J48. More on this in Chapter 3. We might use a tree to determine the outcome. Notice it is less cumbersome.
Age, Prescription, Astigmatism, Tear Rate, Recommendation
Instance 8: Young, Hypermetrope, Yes, Normal, Hard
Instance 18: Presbyopic, Myope, No, Normal, None
Here both the attributes and the outcome are nominal (aka categorical): a preset, finite set of possibilities.
Classifying Iris Flowers This famous data set's rules are cumbersome, and there might be a better way to classify. Note that the attributes are numeric, but the outcome is a category (setosa, versicolor, or virginica). If petal-length < 2.45 then Iris-setosa If sepal-width < 2.10 then Iris-versicolor If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor ...
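The first three rules shown can again be written as ordered conditionals; a sketch (the elided rules stay elided, so the function returns None when none of the listed rules fires):

```python
def classify_iris(petal_length, sepal_width):
    # The first three rules from the slide, checked in order.
    if petal_length < 2.45:
        return "Iris-setosa"
    if sepal_width < 2.10:
        return "Iris-versicolor"
    if sepal_width < 2.45 and petal_length < 4.55:
        return "Iris-versicolor"
    return None  # remaining rules ("...") are not shown on the slide
```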
Predicting CPU Performance Example: 209 different computer configurations are the instances. In this case both the attributes and the outcome are numeric. Linear regression function: PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX More on how to do this in Chapter 4.
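The fitted function is just a weighted sum of the six numeric attributes; a minimal sketch of using it to predict (the sample attribute values in the call are invented for illustration):

```python
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    # Linear regression function from the slide (coefficients as given there).
    return (-55.9 + 0.0489 * myct + 0.0153 * mmin + 0.0056 * mmax
            + 0.6410 * cach - 0.2700 * chmin + 1.480 * chmax)

# A hypothetical configuration:
print(round(predict_prp(125, 256, 6000, 16, 4, 16), 1))
```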
Data from Labor Negotiations Here the attributes are in rows instead of the usual columns, so the instances are in columns. The class appears as one of the rows.
Decision Trees for the Labor Data This decision tree is simple, but does not always predict correctly. The tree makes intuitive sense: a bigger wage increase and more holidays are usually positive for an employee.
Decision Trees for the Labor Data This decision tree is more accurate, but may not be as intuitive. It likely reflects compromises made so that a contract is accepted by both the employer and the employees.
Decision Trees for the Labor Data This tree is simple and approximate: it does not classify every training instance correctly. The full tree (right) is more accurate on training data, BUT may not actually work better in real life. It may be "overfitted." The simple tree above is a pruned version of the one to the right.
Soybean Classification An early Machine Learning success story! Attributes (by category), with number of values and a sample value:
Environment: time of occurrence (7 values, e.g. July); precipitation (3, e.g. above normal); …
Seed: condition (2, e.g. normal); mold growth (2, e.g. absent); …
Fruit: condition of fruit pods (4, e.g. normal); fruit spots (5, e.g. ?)
Leaf: condition (2, e.g. abnormal); leaf spot size (3, e.g. ?); …
Stem: condition (2, e.g. abnormal); stem lodging (2, e.g. yes); …
Root: condition (3, e.g. normal)
Diagnosis: 19 values, e.g. Diaporthe stem canker
A domain expert produced rules (72% correct) that did not perform as well as computer-generated rules (97.5% correct).
The Role of Domain Knowledge If leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown then diagnosis is rhizoctonia root rot If leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown then diagnosis is rhizoctonia root rot Is "leaf condition is normal" the same as "leaf malformation is absent"? In this domain, a normal leaf condition implies that malformation is absent, so "leaf condition is normal" is a special case of "leaf malformation is absent." The malformation attribute only comes into play when the leaf condition is not normal.
So far: examples of toy problems. These are small research problems; we will use them a lot because they make it easier to understand the algorithms and techniques. • What about real applications? • Use data mining to: • Make a decision • Do a task faster than an expert • Let the expert make the scheme better • etc.