Explore the power of WEKA software in machine learning, from data preparation to building classifiers and interpreting results. Discover the state-of-the-art learning algorithms, classification strengths, regression, association rules, clustering, and extensibility. Dive into API documentation, tutorials, and Weka-related projects. Delve into data formats, filters, classifiers, and performance measures with practical exercises. Learn to use decision trees, Naive Bayes, and J48 classifiers for accurate predictions and model evaluations.
An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/ • Cornelia Caragea
Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results
Machine Learning Software • Suites (General Purpose) • WEKA (Source: Java) • MLC++ (Source: C++) • SAS • List from KDNuggets (Various) • Specific • Classification: C4.5, SVMlight • Association Rule Mining • Bayesian Net … • Commercial vs. Free
What does WEKA do? • Implements state-of-the-art learning algorithms • Main strength is classification • Also provides regression, association rule, and clustering algorithms • Extensible, so new learning schemes can be tried • Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources • API Documentation, Tutorials, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…
Preparing Data • ARFF Data Format • Header – describing the attribute types • Data – (instances, examples) comma-separated list
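The header/data layout of an ARFF file can be illustrated with a small sketch. The relation name, attributes, and rows below are made-up toy values, and `parse_arff` is a hypothetical helper that handles only the simplest cases (no quoting, no sparse format); it is not WEKA's own loader.

```python
# Toy ARFF text; the relation, attributes, and rows are illustrative only.
ARFF_TEXT = """\
% Header: one @attribute line per column, giving its name and type
@relation toy_weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
% Data: one comma-separated instance per line
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

def parse_arff(text):
    """Split simple ARFF text into (attributes, instances)."""
    attributes, instances, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            instances.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, instances

attributes, instances = parse_arff(ARFF_TEXT)
for name, attr_type in attributes:
    print(name, "->", attr_type)
print(len(instances), "instances")
```

By convention the class attribute is the last one declared in the header (here `play`).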
Launching WEKA • java -jar weka.jar
Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. • Can create stratified cross-validation folds of a dataset, so that the class distribution is approximately retained within each fold • Data is typically split into 2/3 for training and 1/3 for testing
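A 2/3 to 1/3 split that keeps class proportions roughly intact might look like the sketch below. It is plain Python, not a WEKA filter; `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)  # per-class training size
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy class labels: 9 "yes" and 6 "no" instances.
labels = ["yes"] * 9 + ["no"] * 6
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training /", len(test_idx), "testing")
```

Because each class is cut at the same fraction, the 3:2 ratio of `yes` to `no` is preserved in both parts.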
Building Classifiers • A classifier model is a mapping from the dataset attributes to the class (target) attribute. How the model is created and what form it takes differ between classifiers. • Decision Tree and Naïve Bayes Classifiers • Which one is the best? • No Free Lunch!
(1) weka.classifiers.rules.ZeroR • Class for building and using a 0-R classifier • Majority class classifier • Predicts the mean (for a numeric class) or the mode (for a nominal class)
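The idea behind a 0-R classifier fits in a few lines: ignore every attribute and always predict the mode (nominal class) or the mean (numeric class) of the training targets. The function below is a plain-Python sketch of that idea with made-up targets, not the WEKA implementation.

```python
from collections import Counter
from statistics import mean

def zero_r(targets):
    """Return the single value a 0-R model would predict for every instance."""
    if all(isinstance(t, (int, float)) for t in targets):
        return mean(targets)                      # numeric class: mean
    return Counter(targets).most_common(1)[0][0]  # nominal class: mode

nominal_prediction = zero_r(["yes", "yes", "no", "yes", "no"])
numeric_prediction = zero_r([1.0, 2.0, 6.0])
print(nominal_prediction, numeric_prediction)
```

Such a model is mainly useful as a baseline: any other classifier should do better than it.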
Exercise 1 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex1.html
(2) weka.classifiers.bayes.NaiveBayes • Class for building a Naive Bayes classifier
(3) weka.classifiers.trees.J48 • Class for generating a pruned or unpruned C4.5 decision tree
Test Options • Percentage Split (2/3 Training; 1/3 Testing) • Cross-validation • estimates the generalization error by resampling when data is limited; the error estimates from the folds are averaged • stratified • 10-fold • leave-one-out (LOO)
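The cross-validation options can be sketched as follows. To keep the example short, the "classifier" is just a majority-class predictor, the labels are toy values, and both function names are hypothetical; leave-one-out is obtained by setting the number of folds equal to the number of instances.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal instance indices into k folds, class by class, round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

def cross_validated_error(labels, k, seed=0):
    """Average test error of a majority-class predictor over k folds.

    Assumes k <= len(labels) so that no fold is empty.
    """
    errors = []
    for test in stratified_folds(labels, k, seed):
        held_out = set(test)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = Counter(train_labels).most_common(1)[0][0]
        wrong = sum(labels[i] != prediction for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)

# Toy class labels: 12 "yes" and 8 "no" instances.
labels = ["yes"] * 12 + ["no"] * 8
error_10_fold = cross_validated_error(labels, k=10)
error_loo = cross_validated_error(labels, k=len(labels))  # leave-one-out
print(error_10_fold, error_loo)
```

Every instance is used for testing exactly once, which is why cross-validation is attractive when there is too little data to set aside a separate test set.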
Exercise 2 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex2.html
Performance Measures • Accuracy & Error rate • Confusion matrix – contingency table • True Positive rate & False Positive rate (plotted against each other as the Receiver Operating Characteristic curve and summarized by the area under it) • Precision, Recall & F-Measure • Sensitivity & Specificity • For more information on these, see • uisp09-Evaluation.ppt
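For a two-class problem, all of these measures follow from the four cells of the confusion matrix. The sketch below computes them from made-up actual and predicted labels; `binary_metrics` is a hypothetical helper, and ratios with a zero denominator are reported as 0.0 by assumption.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(actual, predicted, positive):
    """Compute common performance measures for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)  # same as sensitivity and TP rate
    return {
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "accuracy": safe_divide(tp + tn, len(pairs)),
        "error_rate": safe_divide(fp + fn, len(pairs)),
        "tp_rate": recall,
        "fp_rate": safe_divide(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_divide(tn, tn + fp),
        "f_measure": safe_divide(2 * precision * recall, precision + recall),
    }

# Toy labels: 4 actual positives and 6 actual negatives.
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
metrics = binary_metrics(actual, predicted, positive="yes")
for name, value in metrics.items():
    print(name, value)
```

Note that sensitivity, recall, and the true positive rate are the same quantity, and specificity is one minus the false positive rate.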
Decision Tree Pruning • Overcomes over-fitting • Pre-pruning and Post-pruning • Reduced error pruning • Subtree raising with different confidence factors • Comparing tree size and accuracy
Subtree replacement • Bottom-up: a tree is considered for replacement only after all of its subtrees have been considered
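The bottom-up order can be illustrated with a small sketch of subtree replacement driven by reduced-error pruning on hold-out data. This is not J48's code: the dictionary tree layout, the `majority` field (the majority training class at a node), and the toy data are all assumptions made for the example.

```python
def classify(node, instance):
    """Follow the tree from this node down to a leaf and return its label."""
    while "children" in node:
        child = node["children"].get(instance[node["attribute"]])
        if child is None:  # unseen attribute value: fall back to the majority
            return node["majority"]
        node = child
    return node["label"]

def count_errors(node, data):
    """Number of (instance, label) pairs the subtree misclassifies."""
    return sum(classify(node, x) != y for x, y in data)

def prune(node, data):
    """Prune all subtrees first, then consider replacing this node by a leaf."""
    if "children" not in node:
        return node
    for value, child in list(node["children"].items()):
        subset = [(x, y) for x, y in data if x[node["attribute"]] == value]
        node["children"][value] = prune(child, subset)
    leaf = {"label": node["majority"]}
    # Replace the subtree when a single leaf does no worse on the hold-out data.
    if count_errors(leaf, data) <= count_errors(node, data):
        return leaf
    return node

# Toy tree: the "windy" test under outlook=sunny does not help on hold-out data.
tree = {
    "attribute": "outlook",
    "majority": "yes",
    "children": {
        "sunny": {
            "attribute": "windy",
            "majority": "yes",
            "children": {"true": {"label": "no"}, "false": {"label": "yes"}},
        },
        "rainy": {"label": "no"},
    },
}
holdout = [
    ({"outlook": "sunny", "windy": "true"}, "yes"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
]
pruned = prune(tree, holdout)
print(pruned)
```

In this toy run the inner `windy` subtree is replaced by a `yes` leaf, while the root split on `outlook` survives because a single leaf would misclassify the two `rainy` instances.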
Subtree Raising • Deletes a node and redistributes its instances • Slower than subtree replacement
Exercise 3 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex3.html