
My JSM Presentation

Dick De Veaux. Williams College. My JSM Presentation. Note to self: Basically the same talk I gave somewhere else where I had about 10 times longer to give it. I hope it goes better this time where I only have 15 minutes.


Presentation Transcript


  1. Dick De Veaux, Williams College. My JSM Presentation. Note to self: Basically the same talk I gave somewhere else where I had about 10 times longer to give it. I hope it goes better this time where I only have 15 minutes.

  2. Some recent results in Statistics that I've either come up with myself or borrowed heavily from the literature.

  3. The theory behind boosting is easy to understand via a binary classification problem. Therefore, for the time being, assume that the goal is to classify the members of some population into two categories. For instance, the goal might be to determine whether a medical patient has a certain disease or not. Typically these two categories are given numerical representations such that the positive outcome (the patient has the disease) equals 1 and the negative outcome (the patient does not have the disease) equals −1. Using this notation, each example can be represented by a pair (y, x), where y ∈ {−1, 1} and x ∈ ℝᵖ. The boosting algorithm starts with a constant function, e.g. the mean or median of the response values. After this, the algorithm proceeds iteratively. During every iteration it trains a weak learner (defined as a rule that can classify examples slightly better than random guessing) on a training set that weights more heavily those examples that the previous weak learners found difficult to classify correctly. Iterating in this manner produces a set of weak learners that can be viewed as a committee of classifiers working together to correctly classify each training example. Within the committee each weak learner has a vote on the final prediction. These votes are typically weighted so that weak learners that perform well with respect to the training set have more relative influence on the final prediction. The weighted predictions are then added together. The sign of this sum forms the final prediction (resulting in a prediction of either +1 or −1) of the committee. And a GREAT space and time consumer is to put lots of unnecessary bibliographic material on your visuals, especially if they point to your own previous work and you can get them into a microscopically small font like this one that even you can't read and have NO IDEA why you ever even put it in there in the first place!!
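The committee-of-weak-learners loop described above is short enough to sketch. Below is a minimal version (a discrete AdaBoost-style variant) using scikit-learn decision stumps as the weak learners; the number of rounds and the depth-1 trees are illustrative assumptions, not anything specified in the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Fit a committee of decision stumps, reweighting the hard examples each round (y in {-1, +1})."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # start every example with equal weight
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a "weak learner": one split
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()            # weighted training error
        if err <= 0 or err >= 0.5:                    # perfect, or no better than guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)         # this learner's vote weight
        w = w * np.exp(-alpha * y * pred)             # up-weight the examples it got wrong
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """The committee's prediction: the sign of the weighted sum of votes."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```

Because err < 1/2 is required, alpha is positive, so stumps that do well get bigger votes, and the exponential factor is exactly the "weight more heavily the examples the committee got wrong" step in the description above.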

  4. Statisticians and Averages Averaging

  5. [1] Yang, P.; Yang, Y. H.; Zomaya, A. Y. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics, 2010, 5, 296–308.
  [2] Okun, O. Feature Selection and Ensemble Methods for Bioinformatics: Algorithmic Classification and Implementations; SMARTTECCO: Malmö, 2011.
  [3] Dettling, M.; Bühlmann, P. Boosting for Tumor Classification with Gene Expression Data. Seminar für Statistik, 2002, 19, 1061–1069.
  [4] Politis, D. N. Bagging Multiple Comparisons from Microarray Data. In Proceedings of the 4th International Conference on Bioinformatics Research and Applications; Mandoiu, I.; Sunderraman, R.; Zelikovsky, A., Eds.; Berlin/Heidelberg, Germany, 2008; pp. 492–503.
  [5] Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, 2009.
  [6] Fumera, G.; Roli, F.; Serrau, A. A Theoretical Analysis of Bagging as a Linear Combination of Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30, 1293–1299.
  [7] Duffy, N.; Helmbold, D. Boosting Methods for Regression. Machine Learning, 2002, 47, 153–200.
  [8] Breiman, L. Random Forests. Machine Learning, 2001, 45, 5–32.
  [9] Freund, Y.; Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55, 119–139.

  6. Some Background

  7. Simulation Results

  8. Our method
  • Performed really well
  • In fact, in all the data sets we found
  Our method
  • Was the best
  • Was better than the other methods
  • In all the data sets we simulated
  Our method
  • Was the best
  • Outperformed the other methods
  Our method
  • Is faster and easier to compute and has the smallest asymptotic variance compared to the other method that we found in the literature
  • So, now I'd like to show some of the results from our simulation studies, where we simulated data sets and tuned our method to optimize performance, comparing it to the other method, which we really didn't know how to use

  9. Penalized Regression

  10. Penalized Regression • Least squares

  11. Penalized Regression • Least squares

  12. Penalized Regression • Least squares • Ridge Regression

  13. Variations on a Regression Theme • Least squares • Ridge Regression

  14. Variations on a Regression Theme • Least squares • Ridge Regression • Lasso
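The formulas on slides 10-15 did not survive the transcript, so here is a small sketch of the three estimators named above using scikit-learn. The penalty values and the random toy data are placeholders, not settings from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))                  # toy design matrix
beta = np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0, 0.5])
y = X @ beta + rng.normal(scale=1.0, size=100)  # sparse true signal plus noise

ols = LinearRegression().fit(X, y)        # least squares: minimize ||y - Xb||^2
ridge = Ridge(alpha=1.0).fit(X, y)        # ridge: add lambda * ||b||_2^2  (shrinks coefficients)
lasso = Lasso(alpha=0.1).fit(X, y)        # lasso: add lambda * ||b||_1    (shrinks and selects)

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(name, np.round(model.coef_, 2))
```

On data like this, ridge pulls all eleven coefficients toward zero, while the lasso sets most of the noise coefficients exactly to zero, which is the contrast the next few slides build on.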

  15. Stepwise Regression Review
  Forward Stepwise Regression
  • If we standardize the x's, start with r = y
  • Find the x most correlated with r
  • Add x to the fit, set r = y − fit
  • Find the x most correlated with r
  • Continue until no x is correlated enough
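A rough sketch of the forward stepwise loop just described; the correlation threshold and the refit-over-all-selected-variables step are my assumptions about details the slide leaves implicit.

```python
import numpy as np

def forward_stepwise(X, y, corr_tol=0.1):
    """Greedy forward selection: repeatedly add the predictor most correlated with the residual."""
    X = (X - X.mean(0)) / X.std(0)              # standardize the x's
    yc = y - y.mean()
    r = yc.copy()                               # start with r = y (centered)
    selected = []
    while True:
        corr = X.T @ r / len(y)                 # correlation of each column with the residual
        if selected:
            corr[selected] = 0.0                # ignore variables already in the model
        j = int(np.argmax(np.abs(corr)))
        if np.abs(corr[j]) < corr_tol:          # no x is correlated enough: stop
            break
        selected.append(j)
        Xs = X[:, selected]
        coef = np.linalg.lstsq(Xs, yc, rcond=None)[0]   # refit on the selected columns
        r = yc - Xs @ coef                      # r = y - fit
    return selected
```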

  16. LAR is similar (Forward Stagewise)
  Only enter as much of x as it "deserves":
  • Find the xj most correlated with r and update βj ← βj + δj, where δj = ε · sign(⟨r, xj⟩), until another variable xk is equally correlated with the residual
  • Move βj and βk in the direction defined by their joint least squares coefficient of the current residual on (xj, xk), until some other competitor x has as much correlation with the current residual
  • Set r ← r − (new predictions) and repeat these steps many times
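Here is a sketch of the ε-step forward stagewise idea on slide 16 (the simple one-variable-at-a-time version, without the LAR joint-direction move); the step size and the number of iterations are illustrative choices.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise: tiny coefficient updates toward the most correlated predictor."""
    X = (X - X.mean(0)) / X.std(0)
    r = y - y.mean()
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ r                        # inner product of each xj with the current residual
        j = int(np.argmax(np.abs(corr)))      # the xj most correlated with r
        delta = eps * np.sign(corr[j])        # delta_j = eps * sign(<r, xj>)
        beta[j] += delta
        r -= delta * X[:, j]                  # r <- r - new predictions
    return beta
```

Taken to the limit of tiny steps, this path traces out essentially the lasso/LAR coefficient profiles, which is the connection slide 17 points to.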

  17. LAR and Lasso

  18. Boosting fits an Additive Model
  Forward Stagewise Additive Modeling
  • Start with f0(x) = 0
  • For m = 1, …, M:
    Compute (βm, γm) = argmin over (β, γ) of Σi L(yi, fm−1(xi) + β·b(xi; γ))
    Set fm(x) = fm−1(x) + βm·b(x; γm)

  19. Adaboost? AdaBoost is a stagewise additive model with exponential loss L(y, f(x)) = exp(−y·f(x)). The basis functions are just the classifiers Gm(x) ∈ {−1, 1}.

  20. Solve the Exponential Loss Problem

  21. How to Solve That?

  22. Two Steps

  23. First Step

  24. End of First Step

  25. Almost There – what about b?

  26. Take Derivative

  27. Almost There

  28. Almost There

  29. And?

  30. Wait? What?
  So AdaBoost:
  • Finds the next "weak learner" that minimizes the sum of the weighted exponential misclassifications
  • With overall weights equal to wi(m) = exp(−yi·fm−1(xi))
  • Adds this to the previous estimates
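The equations on slides 20-29 did not survive the transcript, so here is a compact sketch of the standard exponential-loss argument those slide titles outline (essentially the textbook treatment in reference [5] above); the notation is mine.

```latex
% Stagewise step: choose the next classifier G and its vote weight beta
% to minimize exponential loss, holding f_{m-1} fixed.
\[
(\beta_m, G_m) = \arg\min_{\beta,\, G} \sum_{i=1}^{n} w_i^{(m)} \exp\big(-\beta\, y_i G(x_i)\big),
\qquad w_i^{(m)} = \exp\big(-y_i f_{m-1}(x_i)\big).
\]
% First step: split the sum over correctly and incorrectly classified points,
\[
\sum_i w_i^{(m)} e^{-\beta y_i G(x_i)}
  = \big(e^{\beta} - e^{-\beta}\big)\sum_i w_i^{(m)}\,\mathbf{1}\{y_i \neq G(x_i)\}
  + e^{-\beta}\sum_i w_i^{(m)},
\]
% so for any beta > 0 the best G minimizes the weighted misclassification error.
% Second step ("take derivative"): setting the derivative in beta to zero gives
\[
\beta_m = \tfrac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m},
\qquad
\mathrm{err}_m = \frac{\sum_i w_i^{(m)}\,\mathbf{1}\{y_i \neq G_m(x_i)\}}{\sum_i w_i^{(m)}},
\qquad
f_m(x) = f_{m-1}(x) + \beta_m G_m(x).
\]
```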

  31. Where are we?
  AdaBoost is:
  • A forward stagewise additive model
  • With exponential loss function
  • Sensitive to misclassification, since we use exponential (not misclassification) loss

  32. Now what – extend to regression?
  Why exponential loss? Squared error loss is least squares. What's more robust?
  • Absolute error loss
  • Huber loss
  • Truncated loss
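To make the comparison concrete, here is a small sketch of the candidate loss functions named on slide 32; the Huber crossover δ and the truncation point are arbitrary illustrative values.

```python
import numpy as np

def squared_loss(r):
    """Least squares: smooth, but large residuals dominate."""
    return r ** 2

def absolute_loss(r):
    """Absolute error: linear in the residual, so outliers pull less."""
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Quadratic near zero, linear in the tails (delta is the crossover point)."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

def truncated_loss(r, c=2.0):
    """Squared loss capped at a constant, so gross outliers stop influencing the fit."""
    return np.minimum(r ** 2, c ** 2)

residuals = np.array([-5.0, -1.0, -0.1, 0.0, 0.5, 3.0])
for loss in (squared_loss, absolute_loss, huber_loss, truncated_loss):
    print(loss.__name__, np.round(loss(residuals), 2))
```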

  33. Next Idea: Gradient Boosting Machine
  Take these ideas:
  • A loss function L(y, f(x))
  • Find the f minimizing Σi L(yi, f(xi))
  • Just solve this by steepest descent

  34. Oops --- one small problem
  • The derivatives are based only on the training data
  • They won't generalize
  • So calculate them and "fit" them using trees
  • Boosted Trees
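A bare-bones sketch of the gradient boosting idea on slides 33-34, using squared error loss (so the negative gradients are just the residuals) and scikit-learn regression trees as the base learners; the tree depth, learning rate, and number of rounds are illustrative, not the settings from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learn_rate=0.1, max_depth=2):
    """Boosted trees: each round fits a small tree to the negative gradient of the loss."""
    f = np.full(len(y), y.mean())      # start with a constant function
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - f               # for squared error loss, -dL/df is just the residual
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, neg_grad)          # "fit" the pointwise gradients with a tree so they generalize
        f += learn_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gradient_boost_predict(intercept, trees, X, learn_rate=0.1):
    """Sum of the constant start plus the shrunken tree contributions."""
    return intercept + learn_rate * sum(t.predict(X) for t in trees)
```

Swapping the squared-error gradient for the gradient of an absolute, Huber, or truncated loss is what makes the procedure more robust, which is where the next slide picks up.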

  35. Making it more Robust • Can handle 25% contamination of the original data set

  36. Some Applications

  37. Wine Data Set • 11 predictors and a rating
  Input variables (based on physicochemical tests):
  1 - fixed acidity
  2 - volatile acidity
  3 - citric acid
  4 - residual sugar
  5 - chlorides
  6 - free sulfur dioxide
  7 - total sulfur dioxide
  8 - density
  9 - pH
  10 - sulphates
  11 - alcohol
  Output variable (based on sensory data):
  12 - quality (score between 0 and 10)
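The variables on slide 37 match the UCI Wine Quality data, so if you want to follow along, one possible way to load it is sketched below; the URL and the semicolon separator are my assumptions about the usual UCI copy, not something stated in the talk.

```python
import pandas as pd

# Assumed location of the UCI Wine Quality (red) file; swap in the white-wine file for slide 39.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

wine = pd.read_csv(URL, sep=";")          # the UCI copy is semicolon-delimited
X = wine.drop(columns="quality")          # the 11 physicochemical predictors
y = wine["quality"]                       # sensory rating, 0-10
print(X.shape)
print(y.describe())
```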

  38. Performance Under Contamination: Red Wine

  39. Performance Under Contamination: White Wine

  40. Our method
  • Performed really well
  • In fact, in all the data sets we found
  Our method
  • Was the best
  • Was better than the other methods
  • In all the data sets we simulated
  Our method
  • Was the best
  • Outperformed the other methods
  Our method
  • Is faster and easier to compute and has the smallest asymptotic variance compared to the other method that we found in the literature
