This text discusses the different methods of evaluating learned models, including cross-validation, leave one out, bootstrapping, and counting the cost. It also touches on information retrieval measures and how to apply action rules to change detractors to promoters.
Cross-Validation
• The foundation is a simple idea, “holdout”: hold out a certain amount of data for testing and use the rest for training
• The separation should NOT be a matter of “convenience”
  • It should at least be random
  • Better: “stratified” random, where the division preserves the relative proportion of classes in both training and test data
• Enhanced: repeated holdout
  • Enables using more data in training, while still getting a good test
• 10-fold cross-validation has become standard
  • This is improved if the folds are chosen in a “stratified” random way
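The stratified fold selection described on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names `stratified_folds` and `cross_validation_splits` are made up here, and dealing each class round-robin into the folds is one simple way to preserve class proportions, not necessarily what any particular tool does.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds so each fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    pos = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random within the class
        # Deal the shuffled members round-robin across the folds.
        for idx in members:
            folds[pos % k].append(idx)
            pos += 1
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test
```

Each instance is tested exactly once and trained on in the other k - 1 folds, so with k = 10 every model sees 90% of the data.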
For Small Datasets
• Leave One Out
• Bootstrapping
• To be discussed in turn
Leave One Out
• Train on all but one instance, test on that one (percent correct on each test is always 100% or 0%)
• Repeat until every instance has been tested on, then average the results
• Equivalent to N-fold cross-validation where N = number of instances available
• Plusses:
  • Always trains on the maximum possible training data (without cheating)
  • Efficient in one respect: no repeated runs are needed, since fold contents are not randomized and the result is deterministic
  • No stratification or random sampling necessary
• Minuses:
  • Guarantees a non-stratified sample: the correct class will always be at least a little under-represented in the training data
  • Statistical tests are not appropriate
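A small sketch may make the "non-stratified" minus concrete. The learner below is a deliberately trivial stand-in chosen for illustration (it just predicts the majority class of the training labels), and both function names are hypothetical. On a perfectly balanced two-class dataset, holding out one instance always leaves its class in the minority, so this learner is wrong every time.

```python
from collections import Counter


def majority_class(train_labels):
    """Trivial stand-in learner: predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def leave_one_out_accuracy(labels, learner=majority_class):
    """Hold out each instance once, train on the rest, average the hits."""
    correct = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]   # all but instance i
        if learner(train) == held_out:        # each test scores 0 or 1
            correct += 1
    return correct / len(labels)


balanced = ["yes"] * 5 + ["no"] * 5
print(leave_one_out_accuracy(balanced))  # 0.0
```

A real learner would not fail this badly, but the held-out instance's class is still slightly under-represented in every training set.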
Bootstrapping
• Sampling is done with replacement to form the training dataset
• Particular approach: the 0.632 bootstrap
  • A dataset of n instances is sampled n times
  • Some instances will be included multiple times
  • Those not picked are used as test data
  • On a large enough dataset, about 0.632 of the distinct instances end up in the training dataset; the rest (about 0.368) are in the test set
• This gives a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs. 90% in 10-fold cross-validation)
• May try to balance this by weighting in the performance when predicting the training data (p 129) <but this doesn’t seem fair>
• The procedure can be repeated any number of times, allowing statistical tests
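The sampling step can be sketched as follows; the function names are made up for this illustration. The 0.632 figure arises because each instance is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The weighting function reflects the usual 0.632 / 0.368 combination of test and training error as I understand it; the slide only alludes to it, so treat the exact weights as an assumption.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; unpicked indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]   # duplicates are expected
    picked = set(train)
    test = [i for i in range(n) if i not in picked]
    return train, test


def weighted_632_error(test_error, train_error):
    """Assumed weighting: 0.632 * test error + 0.368 * training error."""
    return 0.632 * test_error + 0.368 * train_error


train, test = bootstrap_split(10_000)
distinct_fraction = len(set(train)) / 10_000   # close to 0.632 for large n
```

Calling `bootstrap_split` with different seeds gives the repeated runs mentioned in the last bullet.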
Counting the Cost
• Some mistakes are more costly to make than others
  • Giving a loan to a defaulter is more costly than denying somebody who would be a good customer
  • Sending a mail solicitation to somebody who won’t buy is less costly than missing somebody who would buy (opportunity cost)
• Looking at a confusion matrix, each position could have an associated cost (or a benefit, for the correct positions)
• The measurement could be the average profit/loss per prediction
• To be fair in a cost-benefit analysis, should also factor in the cost of collecting and preparing the data, building the model …
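One way to read "average profit/loss per prediction" is to pair each confusion-matrix cell with a cost and average over all predictions. In the sketch below the counts and the cost values are invented purely for illustration (they are not from the slides), and `average_cost` is a hypothetical helper name.

```python
def average_cost(confusion, cost):
    """Average cost per prediction: sum(count * cost) / total predictions.

    Both arguments are square lists of lists with rows = actual class
    and columns = predicted class.
    """
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total


# Hypothetical loan example. Rows: actual good customer, actual defaulter.
# Columns: predicted good (loan granted), predicted defaulter (loan denied).
confusion = [[80, 20],
             [10, 90]]
# Made-up costs: denying a good customer costs 1, lending to a defaulter 5.
cost = [[0, 1],
        [5, 0]]
print(average_cost(confusion, cost))  # 0.35
```

Negative entries in the cost matrix can stand for benefits from correct predictions.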
Information Retrieval (IR) Measures
• The IR community has developed 3 measures:
  • Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant)
  • Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved)
  • F-measure = (2 * recall * precision) / (recall + precision)
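The three measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is illustrative: the function name and the document identifiers are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slide specifies.

```python
def ir_measures(retrieved, relevant):
    """Return (recall, precision, f_measure) for two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # documents retrieved that are relevant
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Hypothetical query: 5 documents retrieved, 6 relevant, 3 in common.
recall, precision, f_measure = ir_measures(
    retrieved=["d1", "d2", "d3", "d4", "d5"],
    relevant=["d1", "d2", "d3", "d6", "d7", "d8"],
)
# recall = 3/6, precision = 3/5
```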
WEKA
• Part of the results provided by WEKA (that we’ve ignored so far)
• Let’s look at an example (Naïve Bayes on my-weather-nominal)

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.667     0.125     0.8         0.667    0.727       yes
0.875     0.333     0.778       0.875    0.824       no

=== Confusion Matrix ===
 a b   <-- classified as
 4 2 | a = yes
 1 7 | b = no

• TP rate and recall are the same = TP / (TP + FN)
  • For Yes = 4 / (4 + 2); For No = 7 / (7 + 1)
• FP rate = FP / (FP + TN)
  • For Yes = 1 / (1 + 7); For No = 2 / (2 + 4)
• Precision = TP / (TP + FP)
  • For Yes = 4 / (4 + 1); For No = 7 / (7 + 2)
• F-measure = 2TP / (2TP + FP + FN)
  • For Yes = 2*4 / (2*4 + 1 + 2) = 8 / 11
  • For No = 2*7 / (2*7 + 2 + 1) = 14 / 17
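The hand calculations on this slide can be checked with a short sketch. The helper name `binary_metrics` is made up for illustration; the four counts are read off the confusion matrix above, with rows as the actual class and columns as the predicted class.

```python
def binary_metrics(tp, fn, fp, tn):
    """Per-class measures from the four cells of a two-class confusion matrix."""
    return {
        "tp_rate": tp / (tp + fn),                 # identical to recall
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Treating "yes" as the positive class, then "no" as the positive class.
yes = binary_metrics(tp=4, fn=2, fp=1, tn=7)
no = binary_metrics(tp=7, fn=1, fp=2, tn=4)

for name, m in (("yes", yes), ("no", no)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

Rounded to three decimals, these reproduce the rows of the "Detailed Accuracy By Class" table.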
WEKA – with more than two classes
• Contact Lenses with Naïve Bayes

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.8       0.053     0.8         0.8      0.8         soft
0.25      0.1       0.333       0.25     0.286       hard
0.8       0.444     0.75        0.8      0.774       none

=== Confusion Matrix ===
 a  b  c   <-- classified as
 4  0  1 | a = soft
 0  1  3 | b = hard
 1  2 12 | c = none

• Class exercise – show how to calculate recall, precision, f-measure for each class
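For more than two classes, each class can be treated in turn as "positive" and all others as "negative". In this one-vs-rest sketch (the function name is hypothetical), TP is the diagonal cell, FN is the rest of that class's row, and FP is the rest of its column.

```python
def per_class_metrics(matrix, class_names):
    """One-vs-rest measures; rows = actual class, columns = predicted class."""
    def ratio(num, den):
        return num / den if den else 0.0   # convention chosen here for 0/0

    total = sum(sum(row) for row in matrix)
    results = {}
    for k, name in enumerate(class_names):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # rest of row k
        fp = sum(row[k] for row in matrix) - tp   # rest of column k
        tn = total - tp - fn - fp
        results[name] = {
            "recall": ratio(tp, tp + fn),
            "fp_rate": ratio(fp, fp + tn),
            "precision": ratio(tp, tp + fp),
            "f_measure": ratio(2 * tp, 2 * tp + fp + fn),
        }
    return results


contact_lenses = [[4, 0, 1],
                  [0, 1, 3],
                  [1, 2, 12]]
metrics = per_class_metrics(contact_lenses, ["soft", "hard", "none"])
```

For "hard", for example, TP = 1, FN = 3 and FP = 2, giving recall 1/4, precision 1/3 and F-measure 2/7.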
Applying Action Rules to change Detractor to Passive
/Accuracy – Precision, Coverage – Recall/
• Let’s assume that we built action rules from the classifiers for Promoter & Detractor
• The goal is to change Detractors -> Promoters
• The confidence of the action rule: 0.993 * 0.849 ≈ 0.84
• Our action rule can target only 4.2 (out of 10.2) detractors
• So we can expect about 4.2 * 0.84 ≈ 3.53 detractors to move to promoter status
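The arithmetic on this slide can be written out as a small sketch. The function name is hypothetical, and reading 0.993 and 0.849 as the confidences contributed by the two classifiers is an assumption; the slide gives only the product.

```python
def expected_conversions(factors, targetable):
    """Multiply the confidence factors, then scale the targetable count."""
    rule_confidence = 1.0
    for factor in factors:
        rule_confidence *= factor
    return rule_confidence, targetable * rule_confidence


# 4.2 is the number of detractors the rule can target (out of 10.2).
confidence, expected = expected_conversions([0.993, 0.849], targetable=4.2)
print(round(confidence, 3), round(expected, 2))  # 0.843 3.54
```

With the unrounded product the expected count comes out near 3.54, slightly above the figure obtained by first rounding the confidence to 0.84.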