Machine Learning

Machine Learning Reading: Chapter 18

Text Classification • Is texti a finance new article? Positive Negative

Investors 2 Dow 2 Jones 2 Industrial 1 Average 3 Percent 5 Gain 6 Trading 8 Broader 5 stock 5 Indicators 6 Standard 2 Rolling 1 Nasdaq 3 Early 10 Rest 12 More 13 first 11 Same 12 The 30 20 attributes

Men’s Basketball Championship UConn Huskies Georgia Tech Women Playing Crown Titles Games Rebounds All-America early rolling Celebrates Rest More First The same 20 attributes

Example

Constructing the Decision Tree • Goal: Find the smallest decision tree consistent with the examples • Find the attribute that best splits examples • Form tree with root = best attribute • For each value vi (or range) of best attribute • Selects those examples with best=vi • Construct subtreei by recursively calling decision tree with subset of examples, all attributes except best • Add a branch to tree with label=vi and subtree=subtreei

Choosing the Best Attribute: Binary Classification • Want a formal measure that returns a maximum value when attribute makes a perfect split and minimum when it makes no distinction • Information theory (Shannon and Weaver 49) • Entropy: a measure that characterizes the impurity of a collection of examples • Information gain: the expected reduction in entropy caused by partitioning the xamples according to this attribute

Formula for Entropy n H(P(v1),…P(vn))=∑-P(vi)log2P(vi) where P(v) = probability of v Examples: Suppose we have a collection of 10 examples, 5 positive, 5 negative:H(1/2,1/2)=-1/2log21/2-1/2log21/2=1 bit Suppose we have a collection of 100 examples, 1 positive and 99 negative: H(1/100,99/100)=-.01log2.01-.99log2.99=.08 bits i=1

Choosing the Best Attribute: Information Gain • Information gain (from attribute test) = difference between the original information requirement and new requirement • Gain(A)=H(p/p+n,n/p+n)-Remainder(A) • H=entropy • Highest when the set is equally divided between positive (p) and negative (n) examples (.5,.5) (value of 1) • Lower as the set becomes more unbalanced (e.g., (.9, .1) )

Information based on attributes P=n=10, so H(1/2,1/2)= 1 bit = Remainder (A)

Text Classification • Is texti a finance new article? Positive Negative

Example

stock rolling 10 <5 <5 5-10 10 5-10 9,10 1,5,6,8 2,3,4,7 1,8,9,10 2,3,4,5,6,7

Algorithm as specified so far is designed for binary classification, attributes with discrete values Attributes: • Outlook: sunny, overcast, rain • Temperature: hot, mild, cool • Humidity: normal, high • Wind: weak, strong Classification • PlayTennis?: Yes, No

Humidity E=.940 (9/14 yes) Wind E=.94 Strong Weak High Normal E=.985 E=.811 E=.592 E=.1.0 Gain(humidity)=.940-(7/14).985-(7/14).592 Gain(wind)=.940-(6/14).811-(8/14)1.0 Temperature E=.940 Outlook E=.940 Overcast Cool Mild Sunny Hot Rain Gain(outlook)=.940-(4/14)0-(5/14).79-(5/14).79 Gain(S,Outlook)=.246, Gain(S,Humidity)=.151, Gain(S,Wind)=.048, Gain(S,Temperature)=.029 Outlook is selected because it has highest gain

Extending the algorithm for continuous valued attributes • Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals • For continuous A, create Ac that is true if A<c, false otherwise • How to select the best value for threshold c? • Sort examples by continuous attribute • Identify adjacent examples that differ in target classification • Generate a set of candidate thresholds midway between corresponding values of A • Choose threshold c that maximizes information gain

Example: temperature as continuous value • Two candidate thresholds: • (48+60)/2 • (80+90)/2 • Information gain greater for Temperature>54 than for Temperature>85

Other cases • What if class is discrete valued, not binary? • What if an attribute has many values (e.g., 1 per instance)?

Training vs. Testing • A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data • Collect a large set of examples (with classifications) • Divide into two disjoint sets: the training set and the test set • Apply the learning algorithm to the training set, generating hypothesis h • Measure the percentage of examples in the test set that are correctly classified by h • Repeat for different sizes of training sets and different randomly selected training sets of each size.

Overfitting • Learning algorithms may use irrelevant attributes to make decisions • For news, day published and newspaper • When else can overfitting occur? • Solution #1: Decision tree pruning • Prune away attributes with low information gain • Use statistical significance to test whether gain is meaningful

K-fold Cross Validation • Solution #2: To reduce overfitting • Run k experiments • Use a different 1/k of data for testing each time • Average the results • 5-fold, 10-fold, leave-one-out

Cross-Validation Labeled data (1566) Split into 10 folds 9 folds (approx. 1409) 1 fold (approx. 157) Train Model Evaluate Lather, rinse, repeat (10 times) Report average

Example

Ensemble Learning • Learn from a collection of hypotheses • Majority voting • Enlarges the hypothesis space

Boosting • Uses a weighted training set • Each example has an associated weight wj0 • Higher weighted examples have higher importance • Initially, wj=1 for all examples • Next round: increase weights of misclassified examples, decrease other weights • From the new weighted set, generate hypothesis h2 • Continue until M hypotheses generated • Final ensemble hypothesis = weighted-majority combination of all M hypotheses • Weight each hypothesis according to how well it did on training data

AdaBoost • If input learning algorithm is a weak learning algorithm • L always returns a hypothesis with weighted error on training slightly better than random • Returns hypothesis that classifies training data perfectly for large enough M • Boosts the accuracy of the original learning algorithm on training data

Machine Learning