Rohit Kate

Computational Intelligence in Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate
Machine Learning: Some Topics and Weka Software

Learning Curves
• Train the classifier with an increasing number of training examples and plot accuracy vs. size of the training set
• Helps to answer:
  • Has maximum accuracy nearly been reached, or will more training examples help?
  • Is one technique better when training data is limited?
• Most learners eventually converge to the maximum accuracy given sufficient training examples
[Figure: learning curves for Method 1 and Method 2; x-axis: # training examples, y-axis: test accuracy (up to 100%); both curves approach the maximum accuracy]
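The procedure on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the course: the data is synthetic (two Gaussian classes), the nearest-centroid classifier is a stand-in for any learner, and all function names (`make_data`, `train_centroids`, `learning_curve`) are hypothetical.

```python
import random

def make_data(n, rng):
    """Synthetic 2-D data: two Gaussian classes (made up for illustration)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data

def train_centroids(examples):
    """Nearest-centroid learner: store the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict(centroids, point):
    """Return the class whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )

def accuracy(centroids, examples):
    correct = sum(1 for p, label in examples if predict(centroids, p) == label)
    return correct / len(examples)

def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and record test accuracy."""
    return [(n, accuracy(train_centroids(train[:n]), test)) for n in sizes]

rng = random.Random(0)
train = make_data(400, rng)
test = make_data(1000, rng)   # held-out examples, never used for training
sizes = [5, 10, 25, 50, 100, 200, 400]
curve = learning_curve(train, test, sizes)
for n, acc in curve:
    print(f"{n:4d} training examples -> test accuracy {acc:.3f}")
```

Plotting the printed pairs gives the learning curve. In practice the accuracy at each size is often averaged over several random training subsets to smooth the curve.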

Comparing Learning Curves
• Gap usually has a “banana shape”
• Often a better picture emerges if learning curves are compared “horizontally” instead of “vertically”
[Figure: test accuracy vs. # training examples for Method 1 and Method 2, with the maximum accuracy below 100%. Method 1 reaches 85% accuracy at 300 training examples; Method 2 needs 600.]
• Method 1 can achieve 85% accuracy with half the training data needed by Method 2!
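A horizontal comparison asks how many training examples each method needs to reach the same accuracy. The sketch below assumes each learning curve is a list of (training size, accuracy) pairs and interpolates linearly between measured points. The curve values are made up to match the figure's 300 vs. 600 example, and `examples_needed` is a hypothetical helper.

```python
def examples_needed(curve, target):
    """Training-set size at which `curve` first reaches `target` accuracy.

    `curve` is a list of (n_examples, accuracy) pairs sorted by n_examples.
    Linear interpolation is used between measured points.
    Returns None if the target is never reached.
    """
    prev_n, prev_acc = None, None
    for n, acc in curve:
        if acc >= target:
            if prev_n is None:
                return float(n)
            frac = (target - prev_acc) / (acc - prev_acc)
            return prev_n + frac * (n - prev_n)
        prev_n, prev_acc = n, acc
    return None

# Made-up learning curves, for illustration only.
method_1 = [(100, 0.70), (200, 0.80), (300, 0.85), (600, 0.90), (1200, 0.92)]
method_2 = [(100, 0.55), (200, 0.68), (300, 0.75), (600, 0.85), (1200, 0.91)]

n1 = examples_needed(method_1, 0.85)
n2 = examples_needed(method_2, 0.85)
print(f"Method 1 needs about {n1:.0f} examples, Method 2 about {n2:.0f}")

# The vertical view at 300 examples shows only a 10-point accuracy gap.
vertical_gap = dict(method_1)[300] - dict(method_2)[300]
print(f"Accuracy gap at 300 examples: {vertical_gap:.2f}")
```

With these numbers the vertical gap at 300 examples looks modest, while the horizontal view shows that Method 2 needs twice the training data to reach the same accuracy.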

Datasets
• Datasets are important for empirically evaluating machine learning techniques
• It is important to test them on a variety of domains. Testing on 20+ data sets is common.
• Variety of freely available datasets
  • UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
  • KDD Cup (large data sets for data mining): http://www.kdnuggets.com/datasets/kddcup.html

Which is the Best Machine Learning Technique?
• There is no single machine learning technique that performs better than every other technique on every dataset
• One can always come up with a dataset on which a particular machine learning technique will do miserably
  • Flip its predictions and call them the correct answers
• As such, there is no basis for preferring one label over another when classifying a never-before-seen test example, even after seeing a lot of training data
  • It is unknown, so it could be anything!
• Hence every machine learning technique makes some assumptions (“bias”) which help it generalize from training data to test data
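The "flip its predictions" argument can be made concrete with a tiny sketch. The classifier here is a hypothetical fixed threshold rule and the inputs are made up. The same construction works for any deterministic classifier.

```python
def threshold_classifier(x):
    """A fixed, hypothetical classifier: predict 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

def adversarial_labels(classifier, inputs):
    """Label each input with the opposite of what the classifier predicts."""
    return [1 - classifier(x) for x in inputs]

def accuracy(classifier, inputs, labels):
    hits = sum(1 for x, y in zip(inputs, labels) if classifier(x) == y)
    return hits / len(inputs)

test_inputs = [0.1, 0.3, 0.6, 0.9]          # made-up test examples
flipped = adversarial_labels(threshold_classifier, test_inputs)
worst_case = accuracy(threshold_classifier, test_inputs, flipped)
print(flipped)      # [1, 1, 0, 0]
print(worst_case)   # 0.0
```

On the dataset built this way the classifier gets every test example wrong, which is why some assumption about the data is needed before any technique can be expected to generalize.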

Which is the Best Machine Learning Technique?
• Depending upon how well the assumptions of a machine learning technique hold in a given dataset, some techniques perform better than others
Assumptions:
• Naïve Bayes & Bayesian networks: Conditional independence assumptions
• SVM & NN: A hyperplane can separate the examples
• Decision Trees: Some feature values separate the examples

Training Data
• Training data is critical for applying any machine learning technique
• Obtaining it is often the most difficult part
  • Availability of data, particularly medical data
  • Obtaining correct labels, often manually done by experts, expensive and labor intensive
• As learning curves show, “more data is better data”
  • But it is expensive to get more training data
• Some approaches have been designed to compensate for the lack of training data

Various Forms of Supervision
• If all the training data have correct labels, it is called supervised learning
• Some methods also utilize unlabeled training data in addition to the labeled data; this is called semi-supervised learning
• Most learning methods can be extended to leverage unlabeled training data
  • Predict labels for unlabeled examples, take them as the correct labels, and train again; iterate a few times
  • Often helps as if by magic!
• Some methods, like clustering examples into groups, learn completely unsupervised, but they are useful only in limited situations
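The "predict, relabel, retrain" loop described on this slide is often called self-training. Below is a minimal sketch, assuming a 1-D nearest-mean classifier and synthetic data. The helper names (`fit_means`, `self_train`, `draw`) are hypothetical, and the sketch makes no claim about how much the extra unlabeled data helps in general.

```python
import random

def fit_means(examples):
    """1-D nearest-mean learner: return {label: mean feature value}."""
    totals, counts = {}, {}
    for x, label in examples:
        totals[label] = totals.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def predict(means, x):
    """Return the label whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def accuracy(means, examples):
    return sum(1 for x, y in examples if predict(means, x) == y) / len(examples)

def self_train(labeled, unlabeled, rounds=3):
    """Train, label the unlabeled pool with the current model, retrain; repeat."""
    means = fit_means(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(means, x)) for x in unlabeled]
        means = fit_means(labeled + pseudo)
    return means

rng = random.Random(1)

def draw(label):
    """One synthetic example: class 0 is centered at -2, class 1 at +2."""
    return (rng.gauss(2.0 if label == 1 else -2.0, 1.0), label)

labeled = [draw(0) for _ in range(3)] + [draw(1) for _ in range(3)]
unlabeled = [draw(rng.randint(0, 1))[0] for _ in range(300)]  # labels discarded
test = [draw(rng.randint(0, 1)) for _ in range(500)]

supervised = fit_means(labeled)
semi = self_train(labeled, unlabeled)
print(f"supervised only: {accuracy(supervised, test):.3f}")
print(f"self-training:   {accuracy(semi, test):.3f}")
```

A common refinement, not shown here, is to keep only the pseudo-labels the model is most confident about in each round.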

Weka: The Most Well-Known Machine Learning Software
• Freely available
• Includes several machine learning techniques
• Download from the website: http://www.cs.waikato.ac.nz/ml/weka/
• A tutorial (only the classification part): http://prdownloads.sourceforge.net/weka/weka.ppt

ARFF Format for Data
• Once the data is in the ARFF format (attribute-relation file format), you can play with several machine learning techniques using Weka!
• See Weka tutorial slides 5 & 6
• More description of the ARFF format: http://weka.wikispaces.com/ARFF+%28book+version%29
• Plain text file (use notepad etc. to open or create)
• Save with .arff extension
• See several examples: http://repository.seasr.org/Datasets/UCI/arff/
• Comments after ‘%’ character
• Unknowns marked by ‘?’
• If the last attribute is nominal then it is a classification task; if it is numeric then it is a regression task
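To make the format concrete, the sketch below embeds a tiny ARFF file as a string and reads it with a deliberately minimal Python parser. The dataset values are made up, and `parse_arff` and `task_type` are hypothetical helpers that handle only simple files (no quoted names, sparse rows, string or date attributes). Weka itself reads ARFF directly, so this is only meant to show how the pieces fit together.

```python
ARFF_TEXT = """\
% Toy dataset (made-up values, for illustration only)
@relation patients

@attribute age numeric
@attribute smoker {yes,no}
@attribute diagnosis {sick,healthy}

@data
63,yes,sick
41,no,healthy
?,no,healthy
"""

def parse_arff(text):
    """Return (attributes, rows) for a simple ARFF file.

    attributes: list of (name, kind); kind is a list of allowed values for a
    nominal attribute, or a type name such as 'numeric'.
    rows: list of value lists, with '?' (unknown) converted to None.
    """
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

def task_type(attributes):
    """Classification if the last attribute is nominal, regression if numeric."""
    _, kind = attributes[-1]
    if isinstance(kind, list):
        return "classification"
    if kind in ("numeric", "real", "integer"):
        return "regression"
    return "unsupported"

attributes, rows = parse_arff(ARFF_TEXT)
print(task_type(attributes))   # classification
print(rows[2])                 # [None, 'no', 'healthy']
```

Saving the contents of `ARFF_TEXT` to a file with the .arff extension gives a file of the kind the slide describes.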
