
Non-Traditional Metrics


Presentation Transcript


  1. Non-Traditional Metrics Evaluation measures from the medical diagnostic community Constructing new evaluation measures that combine metric and statistical information

  2. Part I Borrowing new performance evaluation measures from the medical diagnostic community (Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)

  3. The need to borrow new performance measures: an example • The performance measures commonly used in Machine Learning are not well suited to assessing performance on problems in which the two classes are equally important. • Accuracy takes both classes into account, but it does not distinguish between them. • Other measures, such as Precision/Recall, F-Score and ROC Analysis, focus on only one class and say nothing about performance on the other class.

  4. Learning Problems in which the classes are equally important • Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are: • opinion/sentiment identification • classification of negotiations • An example of a traditional problem that requires equal focus on both classes and a distinction between false positive and false negative rates is: • Medical Diagnostic Tests • What measures have researchers in the Medical Diagnostic Test Community used that we can borrow?

  5. Performance Measures in use in the Medical Diagnostic Community • Common performance measures in use in the Medical Diagnostic Community are: • Sensitivity/Specificity (also in use in Machine learning) • Likelihood ratios • Youden’s Index • Discriminant Power [Biggerstaff, 2000; Blakeley & Oddone, 1995]

  6. Sensitivity/Specificity • The sensitivity of a diagnostic test is: • P[+|D], i.e., the probability of obtaining a positive test result in the diseased population. • The specificity of a diagnostic test is: • P[-|Ď], i.e., the probability of obtaining a negative test result in the disease-free population. • Sensitivity and specificity are of limited use on their own, however, since what one is really interested in, both in the medical testing community and in Machine Learning, is P[D|+] (PVP: the Predictive Value of a Positive) and P[Ď|-] (PVN: the Predictive Value of a Negative). • We can apply Bayes' Theorem to derive the PVP and PVN.
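A minimal Python sketch of these four quantities (the counts are made up; computing PVP and PVN directly from sample counts like this implicitly assumes the sample prevalence matches the pre-test probability p[D] discussed on the next slide):

```python
# Minimal sketch, made-up counts: sensitivity, specificity, PVP and PVN
# from the four cells of a diagnostic test's confusion matrix.
def diagnostic_measures(tp, fn, fp, tn):
    """tp/fn: diseased cases testing +/-; fp/tn: disease-free cases testing +/-."""
    sensitivity = tp / (tp + fn)   # P[+|D]
    specificity = tn / (tn + fp)   # P[-|Ď]
    pvp = tp / (tp + fp)           # P[D|+], valid only if sample prevalence equals p[D]
    pvn = tn / (tn + fn)           # P[Ď|-], same caveat
    return sensitivity, specificity, pvp, pvn

print(diagnostic_measures(tp=90, fn=10, fp=50, tn=850))
```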

  7. Deriving the PVPs and PVNs • The problem with deriving the PVP and PVN of a test is that we need to know P[D], the pre-test probability of the disease, which cannot be obtained directly. • As usual, however, we can set ourselves in the context of a comparison of two tests (with P[D] being the same in both cases). • Doing so, and using Bayes' Theorem: P[D|+] = (P[+|D] P[D])/(P[+|D] P[D] + P[+|Ď] P[Ď]), we get the following relationships (see Biggerstaff, 2000): • P[D|+Y] > P[D|+X] ↔ ρ+Y > ρ+X • P[Ď|-Y] > P[Ď|-X] ↔ ρ-Y < ρ-X where X and Y are two diagnostic tests, and +X and –X stand for test X confirming the presence and the absence of the disease, respectively (and similarly for +Y and –Y). • ρ+ and ρ- are the likelihood ratios defined on the next slide.

  8. Likelihood Ratios • ρ+ and ρ- are actually easy to derive. • The likelihood ratio of a positive test is: • ρ+ = P[+|D] / P[+|Ď], i.e., the ratio of the true positive rate to the false positive rate. • The likelihood ratio of a negative test is: • ρ- = P[-|D] / P[-|Ď], i.e., the ratio of the false negative rate to the true negative rate. • Note: we want to maximize ρ+ and minimize ρ-. • This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios.
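A short Python check of the relationship stated on the previous slide (the sensitivities, specificities and prevalence below are assumptions chosen purely for illustration): for any fixed P[D], the test with the larger ρ+ also has the larger PVP.

```python
# Illustrative sketch, made-up numbers: ranking two tests by likelihood ratios
# reproduces the ranking by predictive values, for any fixed prevalence P[D].
def likelihood_ratios(sens, spec):
    rho_pos = sens / (1 - spec)    # ρ+ = P[+|D] / P[+|Ď]
    rho_neg = (1 - sens) / spec    # ρ- = P[-|D] / P[-|Ď]
    return rho_pos, rho_neg

def pvp(sens, spec, prev):
    # Bayes' Theorem: P[D|+] = P[+|D]P[D] / (P[+|D]P[D] + P[+|Ď]P[Ď])
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

test_X = (0.85, 0.90)              # (sensitivity, specificity) of test X
test_Y = (0.92, 0.80)              # (sensitivity, specificity) of test Y
prev = 0.10                        # pre-test probability P[D], same for both tests

print(likelihood_ratios(*test_X), pvp(*test_X, prev))   # ρ+ = 8.5, PVP ≈ 0.49
print(likelihood_ratios(*test_Y), pvp(*test_Y, prev))   # ρ+ = 4.6, PVP ≈ 0.34
```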

  9. Youden’s Index and Discriminant Power • Youden’s Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples. • Youden’s Index: γ = sensitivity – (1 – specificity) = P[+|D] – (1 – P[-|Ď]) • Discriminant Power: DP = (√3/π)(log X + log Y), where X = sensitivity/(1 – sensitivity) and Y = specificity/(1 – specificity)
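Both indices follow directly from sensitivity and specificity; a minimal sketch (made-up inputs; the natural logarithm is assumed in DP):

```python
import math

# Minimal sketch, made-up inputs: Youden's Index and Discriminant Power
# computed from sensitivity and specificity, per the formulas above.
def youden_index(sens, spec):
    return sens - (1 - spec)                 # equivalently sens + spec - 1

def discriminant_power(sens, spec):
    x = sens / (1 - sens)                    # odds of detecting the diseased class
    y = spec / (1 - spec)                    # odds of detecting the disease-free class
    return (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))  # natural log assumed

print(youden_index(0.85, 0.90), discriminant_power(0.85, 0.90))
```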

  10. Comparison of the various measures on the outcome of e-negotiation • DP is below 3 → insignificant

  11. What does this all mean? Traditional ML Measures

  12. What does this all mean? New measures that are more appropriate for problems where both classes are equally important

  13. Part I: Discussion • The variety of results obtained with the different measures suggests two conclusions: • It is very important for practitioners of Machine Learning to understand their domain deeply, to understand exactly what it is that they want to evaluate, and to reach that goal using appropriate measures (existing or new ones). • Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.

  14. Part II Constructing new evaluation measures (William Elamzeh, Nathalie Japkowicz and Stan Matwin)

  15. Motivation for our new evaluation method • ROC Analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates. • Though identifying the significant portion of a ROC curve is an important step towards a more useful assessment, the analysis remains biased in favour of the large class in the case of severe class imbalances. • We would like to combine the information provided by the ROC curve with information about how balanced the classifier is with regard to the misclassification of positive and negative examples.

  16. ROC’s bias in the case of severe class imbalances • The ROC curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d), where, in the confusion matrix, a = true positives, b = false negatives, c = false positives and d = true negatives. • When the number of positive examples is significantly lower than the number of negative examples (a+b << c+d), a/(a+b) climbs faster than c/(c+d) as we change the class probability threshold. • ROC therefore gives the majority class (-) an unfair advantage. • Ideally, a classifier should classify both classes proportionally.
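A tiny made-up calculation (the class sizes are assumptions, not from the slides) makes the imbalance effect concrete:

```python
# Made-up class sizes: the same absolute number of examples crossing the
# threshold moves a/(a+b) a hundred times more than c/(c+d) when a+b << c+d.
n_pos, n_neg = 100, 10_000     # a+b and c+d
shift = 10                     # 10 more examples predicted positive
print("TPR change if they are positives:", shift / n_pos)   # 0.1
print("FPR change if they are negatives:", shift / n_neg)   # 0.001
```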

  17. Correcting for ROC’s bias in the case of severe class imbalances • Though we keep ROC as a performance evaluation measure, since rate information is useful, we propose to favour, for confidence estimation, classifiers that make a similar number of errors in both classes. • More specifically, as in Tango’s test, we favour classifiers with a lower difference in classification errors between the two classes, (b-c)/n. • This quantity (b-c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right.
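As a minimal sketch (the counts are made up), (b-c)/n can be read straight off the confusion matrix alongside the two ROC rates:

```python
# Minimal sketch, made-up counts: the ROC rates and the normalized error
# difference (b - c)/n, using the cell names a = TP, b = FN, c = FP, d = TN.
def rates_and_difference(a, b, c, d):
    n = a + b + c + d
    tpr = a / (a + b)          # true positive rate
    fpr = c / (c + d)          # false positive rate
    diff = (b - c) / n         # normalized difference in errors between the classes
    return tpr, fpr, diff

print(rates_and_difference(a=40, b=10, c=300, d=2650))
```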

  18. Proposed Evaluation Method for Severely Imbalanced Data Sets • Our method consists of five steps: • Generate a ROC curve R for a classifier K applied to data D. • Apply Tango’s confidence test in order to identify the confident segments of R. • Compute CAUC, the area under the confident ROC segment. • Compute AveD, the average normalized difference (b-c)/n over all points in the confident ROC segment. • Plot CAUC against AveD → an effective classifier shows low AveD and high CAUC.
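A sketch of these five steps in Python, under stated assumptions: `tango_confident_mask` is a hypothetical callback standing in for Tango's confidence test (assumed to return a boolean mask over the ROC points), scikit-learn's `roc_curve` supplies the ROC points and thresholds, and the absolute value of (b-c)/n is taken when averaging.

```python
import numpy as np
from sklearn.metrics import roc_curve

def errors_at(y_true, y_score, thresholds):
    """Yield (b, c) = (false negatives, false positives) at each threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    for t in thresholds:
        pred_pos = y_score >= t
        yield ((y_true == 1) & ~pred_pos).sum(), ((y_true == 0) & pred_pos).sum()

def evaluate_confident_roc(y_true, y_score, tango_confident_mask):
    fpr, tpr, thr = roc_curve(y_true, y_score)           # 1. ROC curve R
    keep = tango_confident_mask(y_true, y_score, thr)    # 2. confident segment(s) of R
    cauc = np.trapz(tpr[keep], fpr[keep])                # 3. CAUC: area under that segment
    n = len(y_true)
    ave_d = np.mean([abs(b - c) / n                      # 4. AveD: average |b - c| / n
                     for b, c in errors_at(y_true, y_score, thr[keep])])
    return cauc, ave_d                                   # 5. plot CAUC against AveD
```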

  19. Experiments and Expected Results • We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% of its examples in the small class, while the least imbalanced one had as many as 26%. • We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes. • We expected the following results: • Weak performance from the Decision Stumps • Stronger performance from the Decision Trees • Even stronger performance from the Random Forests • We expected Naïve Bayes to perform reasonably well, but had no idea of how it would compare to the tree family of learners (Decision Stumps, Decision Trees and Random Forests belong to the same family of learners).

  20. Results using our new method: our expectations are met • Note: classifiers in the top left corner (low AveD, high CAUC) outperform those in the bottom right corner. • Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases). • Surprise 1: Decision Trees outperform Random Forests on the two most balanced data sets. • Surprise 2: Naïve Bayes consistently outperforms Random Forests.

  21. AUC Results • Our more informed results contradict the AUC results, which claim that: • Decision Stumps are sometimes as good as or superior to Decision Trees (!) • Random Forests outperform all other systems in all but one case.

  22. Part II: Discussion • In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of the evaluation simultaneously. • In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation. • In our case, above, our evaluation measure allowed us to study the tradeoff between the classification difference and the area under the confident segment of the ROC curve, thus producing more reliable results.
