An Exercise in Machine Learning
Explore the power of WEKA software in machine learning, from data preparation to building classifiers and interpreting results. Discover the state-of-the-art learning algorithms, classification strengths, regression, association rules, clustering, and extensibility. Dive into API documentation, tutorials, and Weka-related projects. Delve into data formats, filters, classifiers, and performance measures with practical exercises. Learn to use decision trees, Naive Bayes, and J48 classifiers for accurate predictions and model evaluations.
An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/ • Cornelia Caragea
Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results
Machine Learning Software • Suites (General Purpose) • WEKA (Source: Java) • MLC++ (Source: C++) • SAS • List from KDNuggets (Various) • Specific • Classification: C4.5, SVMlight • Association Rule Mining • Bayesian Net … • Commercial vs. Free
What does WEKA do? • Implementations of state-of-the-art learning algorithms • Main strength is classification • Also provides regression, association rule, and clustering algorithms • Extensible, so new learning schemes can be tried • Large variety of handy tools (dataset transformation, filters, visualization, etc.)
WEKA resources • API Documentation, Tutorials, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…
Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results
Preparing Data • ARFF Data Format • Header – describing the attribute types • Data – (instances, examples) comma-separated list
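To make the header/data split concrete, here is a minimal sketch in Python (not WEKA's own loader). The toy dataset and the function name parse_arff are hypothetical, and the parser assumes simple unquoted values with no missing entries.

```python
# Minimal ARFF reader: a sketch, not WEKA's loader. It handles only '%'
# comments, nominal and numeric attributes, and unquoted comma-separated data.
ARFF_TEXT = """\
% Hypothetical toy dataset for illustration
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_arff(text):
    attributes = []  # header: list of (name, type or list of nominal values)
    instances = []   # data: one list of values per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                is_numeric = isinstance(kind, str) and kind in NUMERIC_TYPES
                row.append(float(value) if is_numeric else value)
            instances.append(row)
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, instances


attributes, instances = parse_arff(ARFF_TEXT)
print(attributes)
print(instances)
```

By convention the last attribute (here play) is often used as the class attribute, but that choice is made when a classifier is run, not in the file itself.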
Launching WEKA • java -jar weka.jar
Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. • One filter creates stratified cross-validation folds of the given dataset, so that the class distribution is approximately retained within each fold • A typical split is 2/3 of the data for training and 1/3 for testing
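The idea behind stratified folds can be sketched in a few lines of Python. This is an illustration of the technique, not WEKA's filter; the function name stratified_folds and the toy labels are assumptions. Holding out one of three folds gives the 2/3 training, 1/3 testing split.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds, dealing each class out in turn."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # randomize order within the class
        for index in indices:
            folds[position % k].append(index)  # round-robin over folds
            position += 1
    return folds


# Hypothetical class column: 6 "yes" and 3 "no" instances.
labels = ["yes"] * 6 + ["no"] * 3
folds = stratified_folds(labels, k=3)

# Hold out one fold for testing (1/3) and train on the rest (2/3).
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
print(len(train_idx), len(test_idx))
```

Because each class is dealt out round-robin, every fold here ends up with two "yes" and one "no" instance, matching the 2:1 ratio of the full dataset.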
Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results
Building Classifiers • A classifier model is a mapping from the dataset attributes to the class (target) attribute; how the model is built and what form it takes differ between algorithms • Decision Tree and Naïve Bayes Classifiers • Which one is the best? • No Free Lunch!
(1) weka.classifiers.rules.ZeroR • Class for building and using a 0-R classifier • Majority class classifier • Predicts the mean (for a numeric class) or the mode (for a nominal class)
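The behaviour described above is simple enough to sketch directly. The class below is a plain-Python illustration of the idea, not WEKA's ZeroR code; the name ZeroRSketch is hypothetical.

```python
from collections import Counter


class ZeroRSketch:
    """Baseline that ignores all attributes and looks only at the class."""

    def fit(self, class_values):
        if all(isinstance(v, (int, float)) for v in class_values):
            # Numeric class: predict the mean.
            self.prediction = sum(class_values) / len(class_values)
        else:
            # Nominal class: predict the mode (majority class).
            self.prediction = Counter(class_values).most_common(1)[0][0]
        return self

    def predict(self, instance=None):
        # The instance is ignored: every input gets the same prediction.
        return self.prediction


nominal = ZeroRSketch().fit(["yes", "yes", "no"])
numeric = ZeroRSketch().fit([1.0, 2.0, 3.0])
print(nominal.predict(), numeric.predict())
```

A baseline like this is useful mainly as a reference point: any other classifier should do at least as well as always predicting the majority class.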
Exercise 1 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex1.html
(2) weka.classifiers.bayes.NaiveBayes • Class for building a Naive Bayes classifier
(3) weka.classifiers.trees.J48 • Class for generating a pruned or unpruned C4.5 decision tree
Test Options • Percentage Split (2/3 Training; 1/3 Testing) • Cross-validation • Estimates the generalization error by resampling when data are limited; the reported error is the average over the folds • stratified • 10-fold • leave-one-out (LOO)
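A rough Python sketch of k-fold cross-validation follows. It is an illustration under simplifying assumptions, not WEKA's evaluation code: folds are formed by interleaving indices (not stratified), a majority-class predictor stands in for a real classifier, and the function names and labels are hypothetical. Setting k to the number of instances gives leave-one-out.

```python
from collections import Counter


def majority_class(train_labels):
    """Stand-in learner: always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_error(labels, k):
    """Average the test error over k folds (fold = index modulo k)."""
    n = len(labels)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_labels = [labels[i] for i in range(n) if i % k != fold]
        prediction = majority_class(train_labels)
        wrong = sum(1 for i in test_idx if labels[i] != prediction)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k


# Hypothetical class column with 7 "yes" and 3 "no" instances.
labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes"]
five_fold_error = cross_validated_error(labels, k=5)
loo_error = cross_validated_error(labels, k=len(labels))  # leave-one-out
print(five_fold_error, loo_error)
```

Every instance is used for testing exactly once, which is why cross-validation is attractive when there is too little data to set aside a separate test set.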
Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results
Exercise 2 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex2.html
Performance Measures • Accuracy & Error rate • Confusion matrix (contingency table) • True Positive rate & False Positive rate (plotted against each other as the Receiver Operating Characteristic curve, summarized by the area under it) • Precision, Recall & F-Measure • Sensitivity & Specificity • For more information on these, see • uisp09-Evaluation.ppt
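For a two-class problem, the measures above all derive from the four cells of the confusion matrix. The sketch below uses the standard definitions; the function name binary_metrics and the toy predictions are hypothetical, and it is not WEKA's output format.

```python
def ratio(numerator, denominator):
    """Guarded division so an empty row or column yields 0.0."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(actual, predicted, positive="yes"):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # true positive
            else:
                fp += 1  # false positive
        else:
            if a == positive:
                fn += 1  # false negative
            else:
                tn += 1  # true negative

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called TP rate or sensitivity
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "error_rate": ratio(fp + fn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "fp_rate": ratio(fp, fp + tn),  # equals 1 - specificity
    }


# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
metrics = binary_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```

Sensitivity is the same quantity as recall and the TP rate, and specificity is one minus the FP rate, so the measures on this slide overlap more than the list suggests.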
Decision Tree Pruning • Used to overcome overfitting • Pre-pruning and Post-pruning • Reduced error pruning • Subtree raising with different confidence levels • Comparing tree size and accuracy
Subtree replacement • Replaces a subtree with a single leaf • Bottom-up: a tree is considered for replacement only after all its subtrees have been considered
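The bottom-up order can be sketched as a recursive function. This illustration decides each replacement with the reduced-error criterion on held-out data, chosen for simplicity; it is not J48's default confidence-based estimate. The dictionary tree layout, the function names, and the toy data are all assumptions.

```python
def classify(node, instance):
    """Follow the tree until a leaf; fall back to the node's majority class."""
    while "children" in node:
        child = node["children"].get(instance[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def errors(node, data):
    return sum(1 for features, label in data if classify(node, features) != label)


def prune(node, data):
    """Subtree replacement, bottom-up: prune the children first, then test
    whether a single leaf does at least as well on the held-out data."""
    if "children" not in node:
        return node
    for value, child in list(node["children"].items()):
        subset = [(f, l) for f, l in data if f[node["attr"]] == value]
        node["children"][value] = prune(child, subset)
    leaf = {"label": node["majority"]}
    if errors(leaf, data) <= errors(node, data):
        return leaf  # replace the whole subtree with a leaf
    return node


# Hypothetical tree: attribute 0 is outlook, attribute 1 is windy.
tree = {
    "attr": 0, "majority": "yes",
    "children": {
        "sunny": {
            "attr": 1, "majority": "yes",
            "children": {"true": {"label": "no"}, "false": {"label": "yes"}},
        },
        "rainy": {"label": "no"},
    },
}

# Hypothetical held-out pruning set of (features, class) pairs.
held_out = [
    (("sunny", "true"), "yes"),
    (("sunny", "false"), "yes"),
    (("rainy", "false"), "no"),
    (("rainy", "true"), "no"),
]

pruned = prune(tree, held_out)
print(pruned)
```

In this example the windy test under sunny is replaced by a leaf because it misclassifies one held-out instance, while the root is kept because replacing it would add errors.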
Subtree Raising • Deletes a node and redistributes its instances • Slower than subtree replacement
Exercise 3 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex3.html