
Classification Prof. Navneet Goyal BITS, Pilani CS C415/IS C415 – Data Mining


  1. Classification Prof. Navneet Goyal, BITS, Pilani, CS C415/IS C415 – Data Mining

  2. Classification & Prediction • What is Classification? • What is Prediction? • Any relationship between the two? • Supervised or Unsupervised? • Issues • Applications • Algorithms • Classifier Accuracy

  3. Classification vs. Prediction • Classification: • predicts categorical class labels • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values

  4. Classification Problem • Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f : D → C where each ti is assigned to one class. • Prediction is similar, but may be viewed as having an infinite number of classes.

  5. Applications • Credit approval • Medical diagnosis • Categorizing cells as malignant or benign based on MRI scans • Classifying galaxies based on their shapes • Classifying emails as spam or not spam • Intrusion detection (rare event classification) • …

  6. Classification: 2 Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction: training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects • Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur
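As a concrete illustration (not from the slides), here is a minimal sketch of the two steps using scikit-learn; the iris data set is a stand-in for any labeled training data:

```python
# Minimal sketch of the two-step process with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

# Step 1: model construction from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- classify held-out test samples and compare
# the predicted labels with the known labels.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {accuracy:.2%}")
```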

  7. Classifier Accuracy • Partition: training-and-testing • use two independent data sets, e.g., training set (2/3) and test set (1/3) • used for data sets with a large number of samples • Cross-validation • divide the data set into k subsamples • use k−1 subsamples as training data and one subsample as test data: k-fold cross-validation • for data sets of moderate size • Bootstrapping (leave-one-out) • for small data sets
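A minimal sketch of k-fold cross-validation, again assuming scikit-learn and a stand-in data set:

```python
# Sketch of k-fold cross-validation: each of the k subsamples serves
# once as the test set while the other k-1 subsamples train the model.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(f"Mean 5-fold accuracy: {sum(scores) / len(scores):.2%}")
```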

  8. Defining Classes: Partitioning-Based vs. Distance-Based [Figure taken from the Dunham book: partitioning-based and distance-based class definitions over the region 0 ≤ x ≤ 8, 0 ≤ y ≤ 10]

  9. Height Example Data [Example table taken from the Dunham book]

  10. Confusion Matrix Example Using the height data example, with Output1 as the correct assignment and Output2 as the actual assignment.
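Since the height table itself is not reproduced here, the sketch below builds a confusion matrix from invented correct (Output1) and actual (Output2) assignments, just to show the mechanics:

```python
# Sketch: build a confusion matrix from correct vs. actual class
# assignments. The labels below are invented for illustration.
from collections import Counter

correct = ["Short", "Medium", "Tall", "Medium", "Tall", "Short"]
actual  = ["Short", "Tall", "Tall", "Medium", "Medium", "Short"]

# Count (correct, actual) pairs; off-diagonal cells are errors.
matrix = Counter(zip(correct, actual))
classes = sorted(set(correct))
for c in classes:
    row = [matrix[(c, a)] for a in classes]
    print(f"{c:>7}: {row}")
```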

  11. Classification Techniques • Decision Tree-based Methods • Rule-based Methods • Distance-based Methods • Memory-based Reasoning • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines

  12. General Approach [Figure taken from the textbook (Tan, Steinbach, Kumar)]

  13. Classification by Decision Tree Induction • A decision tree is a classification scheme • Represents a model of the different classes • Generates a tree & a set of rules • A node without children is a leaf node; otherwise it is an internal node • Each internal node has an associated splitting predicate, e.g., binary predicates • Example predicates (see the sketch below): • Age <= 20 • Profession in {student, teacher} • 5000*Age + 3*Salary – 10000 > 0
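A tiny sketch of how such splitting predicates can be expressed; the record attributes and values are made up:

```python
# Sketch: the three example splitting predicates as Python functions.
# A record is a dict; each predicate answers True/False for a split.
predicates = {
    "age":        lambda r: r["Age"] <= 20,
    "profession": lambda r: r["Profession"] in {"student", "teacher"},
    "linear":     lambda r: 5000 * r["Age"] + 3 * r["Salary"] - 10000 > 0,
}

record = {"Age": 19, "Profession": "student", "Salary": 12000}
for name, p in predicates.items():
    print(name, p(record))
```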

  14. Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample • Test the attribute values of the sample against the decision tree
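A minimal sketch of the model-usage step: testing a sample's attribute values against a tree represented as nested dicts. The tree shown is a simplified, hypothetical one:

```python
# Sketch: classify an unknown sample by testing its attribute values
# against a decision tree. Leaves are class labels; internal nodes
# map an attribute name to its branches.
def classify(tree, sample):
    if not isinstance(tree, dict):        # leaf node: a class label
        return tree
    attribute, branches = next(iter(tree.items()))
    return classify(branches[sample[attribute]], sample)

# Hypothetical, simplified tree: split on Outlook, then Windy.
tree = {"Outlook": {"sunny": "Play", "overcast": "Play",
                    "rain": {"Windy": {True: "No Play", False: "Play"}}}}
print(classify(tree, {"Outlook": "rain", "Windy": True}))  # -> No Play
```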

  15. Classification by Decision Tree Induction Decision tree classifiers are very popular. WHY? • They do not require domain knowledge or parameter setting, and are therefore suitable for exploratory knowledge discovery • DTs can handle high-dimensional data • Representation of the acquired knowledge in tree form is intuitive and easy for humans to assimilate • The learning and classification steps are simple & fast • Good accuracy

  16. Classification by Decision Tree Induction Main Algorithms • Hunt's algorithm • ID3 • C4.5 • CART • SLIQ, SPRINT

  17. Example of a Decision Tree [Figure taken from the textbook (Tan, Steinbach, Kumar): training data with categorical attributes (Refund, MarSt), a continuous attribute (TaxInc), and a class label. Model: split on Refund (Yes → NO); Refund = No → split on MarSt (Married → NO); MarSt = Single, Divorced → split on TaxInc (< 80K → NO, > 80K → YES)]

  18. Another Example of a Decision Tree [Figure taken from the textbook (Tan, Steinbach, Kumar): the same training data fit by a different tree. Split on MarSt (Married → NO); MarSt = Single, Divorced → split on Refund (Yes → NO); Refund = No → split on TaxInc (< 80K → NO, > 80K → YES)] There could be more than one tree that fits the same data!

  19. Some Questions • Which tree is better, and why? • How many possible decision trees are there? • How do we find the optimal tree? • Is it computationally feasible? • (Try constructing a suboptimal tree in a reasonable amount of time: a greedy algorithm) • What should be the order of splits? • Look for answers in the "20 Questions" & "Guess Who" games!

  20–25. Apply Model to Test Data [Figures taken from the textbook (Tan, Steinbach, Kumar): starting from the root of the tree, the test record's attribute values are tested at successive internal nodes (Refund, then MarSt) until a leaf is reached; since the record's MarSt value is Married, it reaches a NO leaf and Cheat is assigned to "No"]
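The routing shown in these slides can be written out as explicit tests. This sketch hard-codes the tree from slide 17; the test record's values are assumptions, since the figure itself is not reproduced here:

```python
# Sketch: the tree from slide 17 as explicit if/else tests.
def cheat(record):
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: fall through to the continuous split.
    return "Yes" if record["TaxInc"] > 80_000 else "No"

# Assumed test record; the slides end with Cheat assigned to "No".
print(cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))
```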

  26. Decision Trees: Example Training Data Set
  Outlook    Temp  Humidity  Windy  Class
  Sunny       79      90     true   No Play
  Sunny       56      70     false  Play
  Sunny       79      75     true   Play
  Sunny       60      90     true   No Play
  Overcast    88      88     false  Play
  Overcast    63      75     true   Play
  Overcast    88      95     false  Play
  Rain        78      60     false  Play
  Rain        66      70     false  No Play
  Rain        68      60     true   No Play
  Numerical attributes: Temperature, Humidity. Categorical attributes: Outlook, Windy. Class: the class label (Play / No Play).
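For the sketches that follow, the same training table can be written down directly:

```python
# The slide's training set as (Outlook, Temp, Humidity, Windy, Class).
training = [
    ("Sunny", 79, 90, True, "No Play"),
    ("Sunny", 56, 70, False, "Play"),
    ("Sunny", 79, 75, True, "Play"),
    ("Sunny", 60, 90, True, "No Play"),
    ("Overcast", 88, 88, False, "Play"),
    ("Overcast", 63, 75, True, "Play"),
    ("Overcast", 88, 95, False, "Play"),
    ("Rain", 78, 60, False, "Play"),
    ("Rain", 66, 70, False, "No Play"),
    ("Rain", 68, 60, True, "No Play"),
]
```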

  27. Decision Trees: Example Sample Decision Tree [Figure: root split on Outlook; sunny → split on Humidity (<= 75 → Play, > 75 → No Play); overcast → Play; rain → split on Windy (true → No Play, false → Play)] Five leaf nodes, each represents a rule.

  28. Decision Trees: Example Rules corresponding to the given tree • If it is a sunny day and humidity is not above 75%, then play. • If it is a sunny day and humidity is above 75%, then do not play. • If it is overcast, then play. • If it is rainy and not windy, then play. • If it is rainy and windy, then do not play. Is it the best classification?
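These five rules can be expressed as one small function. The sketch below reuses the training list from the previous sketch to check how many training records the rules reproduce:

```python
# Sketch: the five rules from the sample tree as a single function.
def predict(outlook, humidity, windy):
    if outlook == "Sunny":
        return "Play" if humidity <= 75 else "No Play"
    if outlook == "Overcast":
        return "Play"
    return "No Play" if windy else "Play"   # Rain branch

# How many training records do the rules reproduce?
hits = sum(predict(o, h, w) == label for o, t, h, w, label in training)
print(f"Training records reproduced: {hits} of {len(training)}")
```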

  29. Decision Trees: Example Classification of a new record New record: outlook = rain, temp = 70, humidity = 65, windy = true. Class: "No Play". The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.

  30. Decision Trees: Example Test Data Set
  Outlook    Temp  Humidity  Windy  Class
  Sunny       79      90     true   Play
  Sunny       56      70     false  Play
  Sunny       79      75     true   No Play
  Sunny       60      90     true   No Play
  Overcast    88      88     false  No Play
  Overcast    63      75     true   Play
  Overcast    88      95     false  Play
  Rain        78      60     false  Play
  Rain        66      70     false  No Play
  Rain        68      60     true   Play
  Rule 1 (sunny & humidity <= 75, predict Play): two records, one correctly classified. Accuracy = 50%. Rule 2 (sunny & humidity > 75, predict No Play): Accuracy = 50%. Rule 3 (overcast, predict Play): Accuracy = 66%.
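The predict function from the earlier sketch can be scored against this test set to reproduce the per-rule figures:

```python
# Sketch: the slide's test set, scored with predict() defined above.
test = [
    ("Sunny", 79, 90, True, "Play"), ("Sunny", 56, 70, False, "Play"),
    ("Sunny", 79, 75, True, "No Play"), ("Sunny", 60, 90, True, "No Play"),
    ("Overcast", 88, 88, False, "No Play"), ("Overcast", 63, 75, True, "Play"),
    ("Overcast", 88, 95, False, "Play"), ("Rain", 78, 60, False, "Play"),
    ("Rain", 66, 70, False, "No Play"), ("Rain", 68, 60, True, "Play"),
]

# Overall accuracy over the whole test set.
hits = sum(predict(o, h, w) == label for o, t, h, w, label in test)
print(f"Correctly classified: {hits} of {len(test)}")

# Per-rule accuracy, e.g., Rule 1: sunny days with humidity <= 75.
rule1 = [r for r in test if r[0] == "Sunny" and r[2] <= 75]
ok = sum(predict(o, h, w) == c for o, t, h, w, c in rule1)
print(f"Rule 1: {ok} of {len(rule1)} correct")
```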

  31. Practical Issues of Classification • Underfitting and Overfitting • Missing Values • Costs of Classification

  32. Overfitting the Data • A classification model commits two kinds of errors: • Training Errors (TE) (also called resubstitution or apparent errors) • Generalization Errors (GE) • A good classification model must have low TE as well as low GE • A model that fits the training data too well can have a higher GE than a model with a higher TE • This problem is known as model overfitting

  33. Underfitting and Overfitting Underfitting: when the model is too simple, both training and test errors are large. TE & GE are large when the size of the tree is very small. This occurs because the model has yet to learn the true structure of the data, and as a result it performs poorly on both the training and test sets. [Figure taken from the textbook (Tan, Steinbach, Kumar)]

  34. Overfitting the Data • When a decision tree is built, many of the branches may reflect anomalies in the training data due to noise or outliers. • We may grow the tree just deeply enough to perfectly classify the training data set. • This problem is known as overfitting the data.

  35. Overfitting the Data • The TE of a model can be reduced by increasing the model complexity • The leaf nodes of the tree can be expanded until it perfectly fits the training data • TE for such a complex tree = 0 • GE can be large because the tree may accidentally fit noise points in the training set • Overfitting & underfitting are two pathologies that are related to model complexity
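A small sketch, assuming scikit-learn and a synthetic, deliberately noisy data set, that shows TE shrinking to zero as the tree grows while the test-set estimate of GE typically does not keep improving:

```python
# Sketch: TE falls as tree depth grows; GE (estimated on held-out
# data) can stop improving or rise, because noisy labels get fit.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 6, None):   # None grows the tree until it fits
    m = DecisionTreeClassifier(max_depth=depth, random_state=0)
    m.fit(X_tr, y_tr)
    print(f"depth={depth}: TE={1 - m.score(X_tr, y_tr):.2f}, "
          f"GE~{1 - m.score(X_te, y_te):.2f}")
```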

  36. Occam's Razor • Given two models with similar generalization errors, one should prefer the simpler model over the more complex one • For complex models, there is a greater chance that the model was fitted accidentally to errors in the data • Therefore, one should include model complexity when evaluating a model

  37. Definition A decision tree T is said to overfit the training data if there exists some other tree T’ which is a simplification of T, such that T has smaller error than T’ over the training set but T’ has a smaller error than T over the entire distribution of the instances.

  38. Problems of Overfitting Overfitting can lead to many difficulties: • Overfitted models are incorrect. • Require more space and more computational resources • Require collection of unnecessary features • They are more difficult to comprehend

  39. Overfitting Overfitting can be due to: • Presence of noise • Lack of representative samples

  40. Overfitting: Example Presence of Noise: Test Set [Table taken from the textbook (Tan, Steinbach, Kumar)]

  41. Overfitting: Example Presence of Noise: Models [Figures taken from the textbook (Tan, Steinbach, Kumar): Model M1 splits on Body Temp (cold-blooded → Non-mammals), then Gives Birth (No → Non-mammals), then 4-legged (Yes → Mammals, No → Non-mammals); TE = 0%, GE = 30%. Model M2 splits on Body Temp (cold-blooded → Non-mammals), then Gives Birth (Yes → Mammals, No → Non-mammals); TE = 20%, GE = 10%. Find out why!]

  42. Overfitting: Example Lack of Representative Samples: Training Set [Table taken from the textbook (Tan, Steinbach, Kumar)]

  43. Overfitting: Example Lack of Representative Samples: Model [Figure taken from the textbook (Tan, Steinbach, Kumar): Model M3 splits on Body Temp (cold-blooded → Non-mammals), then Hibernates (No → Non-mammals), then 4-legged (Yes → Mammals, No → Non-mammals); TE = 0%, GE = 30%. Find out why!]

  44. Overfitting due to Noise The decision boundary is distorted by a noise point. [Figure taken from the textbook (Tan, Steinbach, Kumar)]

  45. Overfitting due to Insufficient Examples The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly. An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task. [Figure taken from the textbook (Tan, Steinbach, Kumar)]

  46. How to Address Overfitting • Pre-Pruning (Early Stopping Rule) • Stop the algorithm before it becomes a fully-grown tree • Typical stopping conditions for a node: • Stop if all instances belong to the same class • Stop if all the attribute values are the same • More restrictive conditions: • Stop if the number of instances is less than some user-specified threshold • Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test) • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
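A minimal sketch of pre-pruning, assuming scikit-learn, where two of the stopping conditions above map onto constructor parameters:

```python
# Sketch: pre-pruning via early-stopping conditions in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pre_pruned = DecisionTreeClassifier(
    min_samples_split=20,        # stop: too few instances at the node
    min_impurity_decrease=0.01,  # stop: split barely improves impurity
).fit(X, y)
print("leaves:", pre_pruned.get_n_leaves())
```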

  47. How to Address Overfitting… • Post-pruning • Grow the decision tree to its entirety • Trim the nodes of the decision tree in a bottom-up fashion • If the generalization error improves after trimming, replace the sub-tree with a leaf node • The class label of the leaf node is determined from the majority class of instances in the sub-tree • Can use MDL (Minimum Description Length) for post-pruning
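The slides mention MDL; as a stand-in, here is a minimal post-pruning sketch using scikit-learn's cost-complexity pruning, with a validation split playing the role of the generalization-error check:

```python
# Sketch: post-pruning with cost-complexity pruning in scikit-learn.
# Larger ccp_alpha trims more of the fully grown tree; the value is
# chosen by validation accuracy, a proxy for generalization error.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)
best = max(path.ccp_alphas,
           key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                         .fit(X_tr, y_tr).score(X_val, y_val))
print(f"chosen ccp_alpha = {best:.4f}")
```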

  48. Post-pruning • Post-pruning removes branches of a fully grown tree • Subtree replacement replaces a subtree with a single leaf node [Figure: a Price subtree under Alt ($ → Yes, $$ → Yes, $$$ → No) is replaced by a single Yes leaf]

  49. Post-pruning • Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent [Figure: under Alt → Res, a Price subtree ($ → Yes, $$ → Yes, $$$ → No) is raised to replace its parent Res node]

  50. Overfitting: Example Presence of Noise: Training Set [Table taken from the textbook (Tan, Steinbach, Kumar)]
