
Advanced data mining with TagHelper and Weka

This guide covers selecting a classifier, feature selection, optimization, and semi-supervised learning with TagHelper and Weka. It includes tips for improving performance and for choosing among algorithms.





Presentation Transcript


  1. Advanced data mining with TagHelper and Weka Carolyn Penstein Rosé Carnegie Mellon University Funded through the Pittsburgh Science of Learning Center and The Office of Naval Research, Cognitive and Neural Sciences Division

  2. Outline • Selecting a classifier • Feature Selection • Optimization • Semi-supervised learning

  3. Selecting a Classifier

  4. Classifier Options * The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)

  5. Classifier Options • Rules of thumb: • SMO is state-of-the-art for text classification • J48 is best with small feature sets – also handles contingencies between features well • Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
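
The "accumulating evidence" rule of thumb for Naïve Bayes can be illustrated with a minimal sketch. This is not TagHelper's or Weka's implementation; it is a small multinomial Naïve Bayes with Laplace smoothing, and the toy documents, labels, and function names are assumptions made up for illustration. Each word adds its own log-probability to a running score per class, so no single word decides the outcome.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and words per class; docs are token lists."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def log_scores(model, tokens):
    """Return one log-score per class; every known token adds evidence."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

def predict(model, tokens):
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only
docs = [
    ["great", "clear", "explanation"],
    ["clear", "helpful"],
    ["confusing", "wrong"],
    ["wrong", "unclear", "confusing"],
]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)

# Two weak "pos" cues outweigh one "neg" cue
mixed_prediction = predict(model, ["clear", "helpful", "wrong"])
```

In this toy run the mixed input is classified by the balance of evidence across all three words, which is the behavior the rule of thumb describes. A tree learner such as J48 would instead branch on individual features.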

  6. Feature Selection

  7. Why do irrelevant features hurt performance? • They might confuse a classifier • They waste time

  8. Two Solutions • Use a feature selection algorithm • Only extract a subset of possible features

  9. Feature Selection * Click on the AttributeSelectedClassifier

  10. Feature Selection • Feature selection algorithms pick out a subset of the features that work best • Usually they evaluate each feature in isolation
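
To make "evaluate each feature in isolation" concrete, here is a minimal sketch that scores every feature by information gain against the class label and keeps the top k. Information gain is one common scoring choice and is used here as an assumption; the slides do not say which evaluator TagHelper uses. The feature names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Score one feature on its own, ignoring every other feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def top_k_features(rows, labels, k):
    """rows: list of dicts mapping feature name -> value. Returns k names."""
    names = sorted(rows[0])
    scored = [(information_gain([r[n] for r in rows], labels), n)
              for n in names]
    # Highest score first; ties broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical binary text features, for illustration only
rows = [
    {"has_because": 1, "has_the": 1, "has_question_mark": 0},
    {"has_because": 1, "has_the": 1, "has_question_mark": 0},
    {"has_because": 0, "has_the": 1, "has_question_mark": 1},
    {"has_because": 0, "has_the": 1, "has_question_mark": 0},
]
labels = ["explanation", "explanation", "other", "other"]
selected = top_k_features(rows, labels, 2)
```

Because each feature is scored alone, a ranking like this cannot detect features that are only predictive in combination.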

  11. Feature Selection * First click here * Then pick your base classifier just like before * Finally you will configure the feature selection

  12. Setting Up Feature Selection

  13. Setting Up Feature Selection • The number of features you pick should not be larger than the number of features available • The number should not be larger than the number of coded examples you have
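
The two limits above amount to capping the requested number of features. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def choose_num_features(requested, num_available, num_examples):
    """Cap the requested feature count by both limits from the slide."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return min(requested, num_available, num_examples)

# Hypothetical numbers: 1200 extracted features, 80 coded examples
num_to_select = choose_num_features(500, 1200, 80)
```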

  14. Examining Which Features are Most Predictive • You can find a ranked list of features in the Performance Report if you use feature selection * Predictiveness score * Frequency

  15. Optimization

  16. Key idea: combine multiple views on the same data in order to increase reliability

  17. Boosting • In boosting, a series of models is trained, and each trained model is influenced by the strengths and weaknesses of the previous model • New models should be experts in classifying examples that the previous model got wrong • It specifically seeks to train multiple models that complement each other • In the final vote, each model's prediction is weighted based on that model's performance
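
The boosting loop described above can be sketched in a few lines. This is a minimal illustration of the standard binary AdaBoost scheme with one-dimensional decision stumps as the weak learner. It is not Weka's AdaBoostM1 code, and the data and function names are assumptions. Misclassified examples gain weight so the next stump focuses on them, and each stump votes with a weight (alpha) derived from its weighted error.

```python
import math

def stump_predict(threshold, polarity, x):
    """A decision stump: predicts polarity when x >= threshold."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(threshold, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Labels must be +1 or -1. Returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-10)          # guard against log of infinity
        if err >= 0.5:                 # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)  # this model's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Raise weights of misclassified examples, lower the rest
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_preds = [ensemble_predict(single, x) for x in xs]
boosted_preds = [ensemble_predict(boosted, x) for x in xs]
```

On this toy set, one stump misclassifies some training points while three boosted stumps fit them all, which illustrates a weak learner being combined into a stronger one. Fitting the training data perfectly says nothing about held-out performance.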

  18. More about Boosting • The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting) • On the other hand, Boosting sometimes overfits • Boosting can turn a weak classifier into a strong classifier

  19. Boosting • Boosting is an option listed in the Meta folder, near the Attribute Selected Classifier • It is listed as AdaBoostM1 • Go ahead and click on it now

  20. Boosting * Now click here

  21. Setting Up Boosting * Select a classifier * Set the number of cycles of boosting

  22. Semi-Supervised Learning

  23. Using Unlabeled Data • If you have a small amount of labeled data and a large amount of unlabeled data: • you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data • The stable regularities might be easier to spot in the larger set than the smaller set • Less likely to overfit your labeled data

  24. Co-training • Train two different models based on a few labeled examples • Each model is learning the same labels but using different features • Use each of these to label the unlabeled data • For each approach, take the example most confidently labeled negative and most confidently labeled positive and add them to the labeled data • Now repeat the process until all of the data is labeled
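
The co-training steps above can be sketched as follows. This is a minimal illustration under several assumptions: each "view" is a single numeric feature, the classifier for each view is a nearest-centroid stand-in (any classifier that yields a confidence would do), labels are 1 (positive) and 0 (negative), and the seed set contains at least one example of each class. All names and data are hypothetical.

```python
def train_centroids(examples, view):
    """Mean of the chosen view's feature for each class (0 and 1)."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for features, label in examples:
        sums[label] += features[view]
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def signed_confidence(centroids, features, view):
    """Positive means closer to class 1; the magnitude is the confidence."""
    x = features[view]
    return abs(x - centroids[0]) - abs(x - centroids[1])

def co_train(labeled, unlabeled):
    """Grow the labeled set until the unlabeled pool is empty."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        for view in (0, 1):            # one model per view
            if not pool:
                break
            model = train_centroids(labeled, view)

            def score(features):
                return signed_confidence(model, features, view)

            most_pos = max(pool, key=score)
            most_neg = min(pool, key=score)
            picks = []
            if score(most_pos) >= 0:   # most confidently positive
                picks.append((most_pos, 1))
            if score(most_neg) < 0:    # most confidently negative
                picks.append((most_neg, 0))
            for features, label in picks:
                pool.remove(features)
                labeled.append((features, label))
    return labeled

# Hypothetical seed and pool; each example is (view_0_value, view_1_value)
seed = [((1.0, 1.2), 0), ((9.0, 8.5), 1)]
pool = [(2.0, 1.0), (8.0, 9.0), (3.0, 2.5), (7.0, 7.5), (1.5, 3.0), (8.5, 6.5)]
result = co_train(seed, pool)
```

Each pass retrains on the enlarged labeled set, so labels added by one view's model become training data for the other view's model on its next turn.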

  25. Semi-supervised Learning • Remember the basic idea: • Train on a small amount of data • Add the positive and negative examples you are most confident about to the training data • Retrain • Keep looping until you label all the data

  26. Questions?
