
Alternative measures for selecting attributes



1. Alternative measures for selecting attributes
   • Recall the intuition behind the information gain measure:
   • We want to choose the attribute that, by itself, does the most work in classifying the training examples.
   • So we measure how much information is gained (i.e., how much entropy decreases) once that attribute's value is known.
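The slide's criterion can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the helper names `entropy` and `information_gain` and the list encoding of the PlayTennis data are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), in bits, for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """Gain(S, A) from (attribute value, class label) pairs:
    Entropy(S) minus the size-weighted entropy of each value's subset."""
    labels = [lab for _, lab in pairs]
    gain = entropy(labels)
    for v in set(val for val, _ in pairs):
        subset = [lab for val, lab in pairs if val == v]
        gain -= len(subset) / len(pairs) * entropy(subset)
    return gain

# Outlook vs. PlayTennis from the standard 14-example table (D1..D14)
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# 0.247 to three decimals; the slide's .94 - .694 = .246 rounds intermediates
print(round(information_gain(list(zip(outlook, play))), 3))
```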

2. However, the information gain measure favors attributes with many values.
   • Extreme example: suppose we add an attribute "Date" to each training example, and each training example has a different date.

3. The training data with the "Date" attribute added:

   Day  Date  Outlook   Temp  Humidity  Wind    PlayTennis
   D1   3/1   Sunny     Hot   High      Weak    No
   D2   3/2   Sunny     Hot   High      Strong  No
   D3   3/3   Overcast  Hot   High      Weak    Yes
   D4   3/4   Rain      Mild  High      Weak    Yes
   D5   3/5   Rain      Cool  Normal    Weak    Yes
   D6   3/6   Rain      Cool  Normal    Strong  No
   D7   3/7   Overcast  Cool  Normal    Strong  Yes
   D8   3/8   Sunny     Mild  High      Weak    No
   D9   3/9   Sunny     Cool  Normal    Weak    Yes
   D10  3/10  Rain      Mild  Normal    Weak    Yes
   D11  3/11  Sunny     Mild  Normal    Strong  Yes
   D12  3/12  Overcast  Mild  High      Strong  Yes
   D13  3/13  Overcast  Hot   Normal    Weak    Yes
   D14  3/14  Rain      Mild  High      Strong  No

   Gain(S, Outlook) = .94 − .694 = .246
   What is Gain(S, Date)?
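The question above can be answered directly: since every date is unique, each Date value selects exactly one example, every subset is pure (entropy 0), and the gain collapses to the entropy of the whole set. A short sketch, using my own `entropy` helper and list encoding of the table (not code from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

play  = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
         "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
dates = [f"3/{d}" for d in range(1, 15)]  # D1..D14 each get a distinct date

# Each Date value's subset holds one example, so entropy(subset) is 0
# and the weighted remainder term vanishes entirely:
remainder = 0.0
for d in set(dates):
    subset = [lab for dd, lab in zip(dates, play) if dd == d]
    remainder += len(subset) / len(play) * entropy(subset)

gain_date = entropy(play) - remainder
print(round(gain_date, 2))  # 0.94 — the maximum possible gain
```

So Gain(S, Date) = Entropy(S) ≈ 0.94, larger than any real attribute's gain.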

4. Date will be chosen as the root of the tree.
   • But of course the resulting tree will not generalize to unseen examples.

5. Gain Ratio
   • Quinlan proposed another method of selecting attributes, called the "gain ratio". Suppose attribute A splits the training data S into m subsets S1, S2, ..., Sm. We can consider the set of subset proportions:

       { |S1|/|S|, |S2|/|S|, ..., |Sm|/|S| }

   • The penalty term (Quinlan's "split information") is the entropy of this distribution:

       SplitInformation(S, A) = − Σ_{i=1..m} (|Si|/|S|) log2(|Si|/|S|)

   • The gain ratio divides the information gain by this penalty:

       GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

   • For example: what is the penalty term for the "Date" attribute? How about for "Outlook"?

6. The training data again, for working out the penalty terms:

   Day  Date  Outlook   Temp  Humidity  Wind    PlayTennis
   D1   3/1   Sunny     Hot   High      Weak    No
   D2   3/2   Sunny     Hot   High      Strong  No
   D3   3/3   Overcast  Hot   High      Weak    Yes
   D4   3/4   Rain      Mild  High      Weak    Yes
   D5   3/5   Rain      Cool  Normal    Weak    Yes
   D6   3/6   Rain      Cool  Normal    Strong  No
   D7   3/7   Overcast  Cool  Normal    Strong  Yes
   D8   3/8   Sunny     Mild  High      Weak    No
   D9   3/9   Sunny     Cool  Normal    Weak    Yes
   D10  3/10  Rain      Mild  Normal    Weak    Yes
   D11  3/11  Sunny     Mild  Normal    Strong  Yes
   D12  3/12  Overcast  Mild  High      Strong  Yes
   D13  3/13  Overcast  Hot   Normal    Weak    Yes
   D14  3/14  Rain      Mild  High      Strong  No
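The penalty terms for the table above can be computed from the subset sizes alone: Date splits S into 14 singletons, Outlook into subsets of sizes 5 (Sunny), 4 (Overcast), and 5 (Rain). A sketch with my own helper name `entropy_of_counts` and gain values copied from the earlier slides:

```python
import math

def entropy_of_counts(counts):
    """Entropy (in bits) of the distribution of subset sizes: the penalty term."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

split_info_date = entropy_of_counts([1] * 14)      # 14 singleton subsets
split_info_outlook = entropy_of_counts([5, 4, 5])  # Sunny, Overcast, Rain

gain_date, gain_outlook = 0.940, 0.246             # from the earlier slides

print(round(split_info_date, 3))                    # 3.807  (= log2 14)
print(round(split_info_outlook, 3))                 # 1.577
print(round(gain_date / split_info_date, 3))        # 0.247
print(round(gain_outlook / split_info_outlook, 3))  # 0.156
```

Note that on this tiny data set Date's gain ratio (0.247) still edges out Outlook's (0.156); the penalty shrinks Date's advantage but does not eliminate it. C4.5 adds a further safeguard, applying the gain-ratio test only among attributes whose information gain is at least average.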

7. UCI ML Repository
   • http://archive.ics.uci.edu/ml/
   • Optical Recognition of Handwritten Digits dataset: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
   • Files: optdigits-pictures, optdigits.info, optdigits.names

8. Homework 1
   • How to download the homework and data
   • Demo of C4.5
   • Accounts on Linuxlab?
   • How to get to the Linux Lab
   • Need help on Linux?
   • Newer version, C5.0: http://www.rulequest.com/see5-info.html
