Machine Learning and Bioinformatics 機器學習與生物資訊學
Presentation Transcript
Evaluation: the key to success
Three datasets, for all of which the answers (class labels) must be known
Note on parameter tuning • It is important that the testing data is not used in any way to create the classifier • Some learning schemes operate in two stages • build the basic structure • optimize parameters • The testing data cannot be used for parameter tuning • the proper procedure uses three sets: training, tuning and testing data
Data is usually limited • Error on the training data is NOT a good indicator of performance on future data • otherwise 1NN would be the optimum classifier • Not a problem if lots of (answered) data is available • split data into training, tuning and testing sets • However, (answered) data is usually limited • More sophisticated techniques need to be used
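The three-set split described above can be sketched as follows. This is an illustrative sketch in Python; the function name, the 60/20/20 fractions and the fixed seed are all assumptions for demonstration, not anything prescribed by the slides.

```python
import random

def three_way_split(data, train_frac=0.6, tune_frac=0.2, seed=0):
    """Shuffle labelled data and split it into training, tuning and
    testing sets; the fractions are illustrative defaults."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_tune = int(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]  # must play no part in building or tuning
    return train, tune, test

train, tune, test = three_way_split(list(range(100)))
```

The point of the fixed test slice is that it is produced once and never consulted again until the final error estimate.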
Issues in evaluation • Statistical reliability of estimated differences in performance: significance tests • Choice of performance measures • number of correctly classified samples • ratio of correctly classified samples • error in numeric predictions • Costs assigned to different types of errors • many practical applications involve costs
Training and testing sets • Testing set must play no part, including parameter tuning, in classifier formation • Ideally, both training and testing sets are representative samples of the underlying problem, but they may differ in nature • e.g., we got data from two different towns A and B and want to estimate the performance of our classifier in a completely new town
Which (training vs. tuning/testing) should be more similar to the target new town?
Making the most of the data • Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data • A dilemma • generally, the larger the training data the better the classifier (though returns diminish) • the larger the testing data the more accurate the error estimate
Holdout procedure • Method of splitting original data into training and testing sets • Reserve a certain amount for testing and use the remainder for training • usually one third for testing and the rest for training • The samples might not be representative • e.g., a class might be missing in the testing data • Stratification • ensures that each class is represented with approximately equal proportions in both subsets
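A stratified holdout can be sketched by splitting each class separately, so every class keeps roughly the same proportion in both subsets. A minimal Python sketch, assuming the one-third test fraction from the slide; the function name and fixed seed are illustrative.

```python
import random
from collections import defaultdict

def stratified_holdout(samples, labels, test_frac=1/3, seed=0):
    """Hold out roughly test_frac of EACH class for testing, so no
    class can be missing from either subset (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])    # per-class test share
        train.extend(items[n_test:])   # remainder for training
    return train, test
```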
Repeated holdout procedure • Holdout procedure can be made more reliable by repeating the process with different subsamples • in each iteration, a certain proportion is randomly selected for testing (possibly with stratification) • the error rates on the different iterations are averaged to yield an overall error rate • This is called the repeated holdout procedure • A problem is that the different testing sets overlap
Cross-validation • Cross-validation avoids overlapping testing sets • split data into n subsets of equal size • use each subset in turn for testing, the remainder for training • the error estimates are averaged to yield an overall error estimate • Called n-fold cross-validation • Often the subsets are stratified before the cross-validation is performed
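The fold bookkeeping behind n-fold cross-validation can be sketched as below: each of the n subsets serves as the testing set exactly once, so the test folds never overlap. Python and the function name are assumptions for illustration.

```python
def n_fold_indices(n_samples, n_folds):
    """Yield (test_indices, train_indices) for each of the n folds.
    Each index appears in exactly one test fold."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # training set is everything outside the current test fold
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield test_idx, train_idx
```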
More on cross-validation • Stratified ten-fold cross-validation • Why ten? • extensive experiments have shown that this is the best choice to get an accurate estimate • there is also some theoretical evidence for this • Repeated stratified cross-validation • e.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
Leave-One-Out cross-validation • A particular form of cross-validation • set the number of folds to the number of training instances • Advantages: makes best use of the data and involves no random subsampling • Disadvantage: very computationally expensive
LOO-CV and stratification • Stratification is not possible • there is only one instance in the testing set • An extreme example • a random dataset split equally into two classes • the best inducer predicts the majority class • 50% accuracy on fresh data • but the LOO-CV estimate is 100% error, since leaving an instance out always makes its own class the minority in the training set
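The extreme example above can be verified directly: on a perfectly balanced two-class dataset, a majority-class predictor is wrong on every single leave-one-out fold. A small Python sketch (the function name is illustrative):

```python
from collections import Counter

def loo_cv_error_majority(labels):
    """LOO-CV error rate of a classifier that always predicts the
    majority class of the training fold."""
    errors = 0
    for i, true_label in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        majority = Counter(train).most_common(1)[0][0]
        errors += (majority != true_label)
    return errors / len(labels)

# balanced classes: removing one '+' leaves '-' as the majority, and vice versa
print(loo_cv_error_majority(['+'] * 50 + ['-'] * 50))  # → 1.0
```

So LOO-CV reports 100% error for a classifier whose true accuracy on fresh data is about 50%.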
Cost
Counting the cost • In practice, different types of classification errors often incur different costs • Examples • terrorist profiling, where always predicting ‘negative’ achieves 99.99% accuracy • loan decisions • oil-slick detection • fault diagnosis • promotional mailing
Confusion matrix • rows: actual class; columns: predicted class • for two classes, the four cells are true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN)
Classification with costs • Two cost matrices • Error rate is replaced by average cost per prediction
Cost-sensitive learning • A basic idea is to only predict the high-cost class when very confident about the prediction • Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost • dot product of class probabilities and the appropriate column in the cost matrix • choose the column (class) that minimizes expected cost • This happens at prediction time, not at training time • Most learning schemes do not perform cost-sensitive learning • they generate the same classifier no matter what costs are assigned to the different classes
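The expected-cost rule above (dot product of class probabilities with a cost-matrix column, then pick the cheapest column) can be sketched in a few lines of Python; the function name and the example cost matrix are illustrative assumptions.

```python
def min_cost_class(probs, cost_matrix):
    """Pick the prediction that minimizes expected cost.

    probs[i]          : estimated probability that the true class is i
    cost_matrix[i][j] : cost of predicting class j when the true class is i

    Expected cost of predicting j = dot product of probs with column j.
    """
    n_classes = len(cost_matrix[0])
    expected = [sum(p * cost_matrix[i][j] for i, p in enumerate(probs))
                for j in range(n_classes)]
    return min(range(n_classes), key=expected.__getitem__)

# illustrative costs: missing class 1 (false negative) is 10x as expensive
costs = [[0, 1],
         [10, 0]]
print(min_cost_class([0.8, 0.2], costs))  # → 1
```

Note that class 0 is the most likely class, yet the minimum-expected-cost prediction is class 1: expected costs are 0.2 × 10 = 2 for predicting 0 versus 0.8 × 1 = 0.8 for predicting 1.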
A simple method for cost-sensitive learning
Resampling of instances according to costs
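One simple way to realize cost-by-resampling is to draw a new training set in which each instance's sampling weight is the relative cost of misclassifying its class, so a cost-blind learner is pushed toward the expensive class. A hedged Python sketch; the function name, the weight mapping and the fixed seed are assumptions for illustration.

```python
import random

def cost_resample(samples, labels, class_weight, seed=0):
    """Resample the training set with replacement so each class appears
    roughly in proportion to its misclassification cost.
    class_weight maps label -> relative cost weight."""
    rng = random.Random(seed)
    weights = [class_weight[y] for y in labels]
    idx = rng.choices(range(len(samples)), weights=weights, k=len(samples))
    return [samples[i] for i in idx], [labels[i] for i in idx]
```

An ordinary (cost-insensitive) learner trained on the resampled data then behaves as if the costly class had been more frequent.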
Measures
Lift charts • In practice, costs are rarely known • Decisions are usually made by comparing possible scenarios • E.g., promotional mail to 1,000,000 households • mail to all; 0.1% respond (1000) • a data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400) • another tool identifies subset of 400,000 most promising, 0.2% respond (800) • Which is better? • A lift chart allows a visual comparison
Generating a lift chart • Sort instances according to predicted probability of being positive • x-axis is sample size; y-axis is number of true positives
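The two steps above (rank by predicted probability, then accumulate true positives) can be sketched as follows; Python and the function name are assumptions for illustration.

```python
def lift_points(scored_labels):
    """Given (predicted_probability, is_positive) pairs, return the
    points of a lift chart: (sample size, cumulative true positives),
    best-scored instances first."""
    ranked = sorted(scored_labels, key=lambda sl: sl[0], reverse=True)
    points, tp = [], 0
    for size, (_, positive) in enumerate(ranked, start=1):
        tp += positive               # count true positives seen so far
        points.append((size, tp))
    return points

pairs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(lift_points(pairs))  # → [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]
```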
A hypothetical lift chart
ROC curves • ROC curves are similar to lift charts • stands for “receiver operating characteristic” • used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel • Differences to lift chart • y-axis shows percentage of true positives in sample rather than absolute number • x-axis shows percentage of false positives in sample rather than sample size
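Swapping the lift chart's axes for the two rates described above gives the ROC construction; a hedged Python sketch (the function name is illustrative), sweeping the decision threshold from the highest score down.

```python
def roc_points(scored_labels):
    """Return (false-positive rate, true-positive rate) points of an ROC
    curve from (predicted_probability, is_positive) pairs."""
    pos = sum(1 for _, y in scored_labels if y)
    neg = len(scored_labels) - pos
    ranked = sorted(scored_labels, key=lambda sl: sl[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        tp, fp = tp + y, fp + (not y)      # admit one more instance
        points.append((fp / neg, tp / pos))  # percentage axes, not counts
    return points
```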
A sample ROC curve • jagged curve: obtained from one set of test data • smooth curve: obtained using cross-validation
More measures • Precision = TP / (TP + FP), percentage of reported samples that are positive • Recall = TP / (TP + FN), percentage of positive samples that are reported • Precision/recall curves have hyperbolic shape • Three-point average is the average precision at 20%, 50% and 80% recall • F-measure = 2 × Precision × Recall / (Precision + Recall), harmonic mean of precision and recall • makes precision and recall as equal as possible • Specificity = TN / (TN + FP), percentage of negative samples that are not reported • Area under the ROC curve (AUC)
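These measures follow directly from the four confusion-matrix counts; a small Python sketch (the function name and example counts are illustrative):

```python
def classification_measures(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and specificity from the
    four confusion-matrix counts."""
    precision = tp / (tp + fp)     # reported samples that are truly positive
    recall = tp / (tp + fn)        # positive samples that are reported
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    specificity = tn / (tn + fp)   # negative samples that are not reported
    return precision, recall, f_measure, specificity

p, r, f, s = classification_measures(tp=40, fp=10, fn=20, tn=30)
```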
Summary of some measures
Evaluating numeric prediction • Same strategies apply, including independent testing sets, cross-validation, significance tests, etc.
Measures in numeric prediction • Actual target values: a1, a2, …, an • Predicted target values: p1, p2, …, pn • The most popular measure is mean squared error (MSE), ((p1 − a1)² + … + (pn − an)²) / n, because it is easy to manipulate mathematically
Other measures • Root mean squared error (RMSE) = √MSE • Mean absolute error (MAE), (|p1 − a1| + … + |pn − an|) / n, is less sensitive to outliers than MSE • Sometimes relative error values are more appropriate
Improvement on the mean • How much does the scheme improve on simply predicting the average ā? • Relative squared error = ((p1 − a1)² + … + (pn − an)²) / ((ā − a1)² + … + (ā − an)²) • Relative absolute error = (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
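The numeric-prediction measures of the last three slides can be computed together; a Python sketch in which the function name is an illustrative assumption. The relative errors divide by the error of always predicting the mean of the actual values, so a value below 1 means the scheme improves on the mean.

```python
import math

def error_measures(actual, predicted):
    """MSE, RMSE, MAE and the two relative errors from the slides."""
    n = len(actual)
    mean_a = sum(actual) / n                       # the trivial predictor
    mse = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(p - a) for a, p in zip(actual, predicted)) / n
    rse = (sum((p - a) ** 2 for a, p in zip(actual, predicted))
           / sum((mean_a - a) ** 2 for a in actual))
    rae = (sum(abs(p - a) for a, p in zip(actual, predicted))
           / sum(abs(mean_a - a) for a in actual))
    return mse, rmse, mae, rse, rae
```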
Correlation coefficient / 相關係數 • Measures the statistical correlation between the predicted values and the actual values • Scale independent, between –1 and +1 • Good performance leads to large values
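The (Pearson) correlation coefficient can be computed from the covariance of predicted and actual values divided by the product of their standard deviations; a Python sketch with an illustrative function name.

```python
import math

def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values;
    scale independent, between -1 and +1."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)
```

Note the scale independence: doubling every prediction leaves the coefficient unchanged, which is why it is reported alongside, not instead of, the error measures.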
http://upload.wikimedia.org/wikipedia/commons/8/86/Correlation_coefficient.gif
Which measure? • Best to look at all of them • Often it doesn’t matter • D the best; C the second-best; A and B are arguable
Today’s exercise
Parameter tuning Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon).
Possible ways • Enlarge the parameter range in CV • Stratified, repeated… • minimize the variance • Make a tuning set • use a large training set; make the tuning set as similar to the target stocks as possible • Cost matrix • resampling, otherwise it would be very difficult • Change measures • or plot ROC curves to understand your classifiers • The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.