
Statistics in MSmcDESPOT


Presentation Transcript


  1. Statistics in MSmcDESPOT Jason Su (borrowed heavily from STATS191 and Prof. Jonathan Taylor)

  2. Comparison of 2 Populations
  • Null hypothesis (H0): the populations are the same
  • Alternative hypothesis (HA): the populations are different
  • The t-test is the standard tool here
    • Assumes the two populations are Gaussian distributed, but the test statistic follows a t-distribution since we must estimate the mean and standard deviation from the data
  • Wilcoxon rank-sum (or Mann–Whitney U) test
    • A non-parametric alternative; does not assume a distribution
    • Compares the medians of the populations instead of the means
  • Typically reject when the p-value < 0.05
    • Interpretation: assuming the null hypothesis, the p-value is the chance of observing a difference at least as extreme as the one in the second sample
    • Rejecting at 0.05 means we tolerate being wrong 5% of the time when the populations are actually the same
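The two tests on this slide can be sketched in a few lines. The deck itself works in MATLAB, so this is a Python/SciPy sketch on invented data; the group sizes (26 normals) echo the slides, but the values and the "DV-like" metric are made up for illustration.

```python
# Sketch: comparing two hypothetical populations with a t-test and a
# Wilcoxon rank-sum test. All data here are synthetic, not MSmcDESPOT data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normals = rng.normal(loc=0.10, scale=0.02, size=26)   # hypothetical DV-like metric
patients = rng.normal(loc=0.14, scale=0.03, size=20)

# Two-sample t-test: assumes Gaussian populations; the statistic follows a
# t-distribution because the means and variances are estimated from the data.
t_stat, t_p = stats.ttest_ind(normals, patients, equal_var=False)

# Rank-sum (Mann-Whitney U): non-parametric, no distributional assumption.
u_stat, u_p = stats.mannwhitneyu(normals, patients, alternative='two-sided')

reject = t_p < 0.05  # the typical rejection threshold from the slide
```

With a clear separation between the groups, both tests reject the null; on borderline data the two can disagree, which is one reason the deck reports both.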

  3. Simple Linear Regression
  • y = a*x + b
  • Least-squares fit of the predictor to the outcome, equivalent to maximum likelihood if the assumptions hold
  • Assumptions: the design has full column rank, and the residuals are independent N(0, σ^2) with constant variance
  • In MSmcDESPOT the predictor is log(DV) and the outcome is EDSS
    • EDSS = a*log(DV) + b
  • R^2 measures how much of the variability of the outcome is explained by the predictor
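The slide's model EDSS = a*log(DV) + b and its R^2 can be sketched as follows. This is a Python sketch on synthetic data; the variable names log_dv and edss follow the slides, but the coefficients and value ranges are invented.

```python
# Sketch of the slide's simple regression EDSS = a*log(DV) + b,
# fit by least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
log_dv = rng.uniform(-3, -1, size=30)                 # predictor: log(DV)
edss = 2.0 * log_dv + 7.0 + rng.normal(0, 0.3, 30)    # outcome with Gaussian noise

# Least-squares line fit (maximum likelihood under the Gaussian-error assumption).
a, b = np.polyfit(log_dv, edss, deg=1)

# R^2: fraction of the outcome's variance explained by the predictor.
pred = a * log_dv + b
ss_res = np.sum((edss - pred) ** 2)
ss_tot = np.sum((edss - edss.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```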

  4. Diagnostics
  • What can go wrong?
    • Wrong regression function
    • Incorrect model for the errors
      • Not normal
      • Not independent
      • Non-constant variance
  • Tools
    • Q-Q plot: plot the quantiles of the residuals vs. those of a normal distribution; the relationship should be linear
    • Plot the residuals vs. the predictor
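The Q-Q plot diagnostic can be sketched with SciPy's `probplot`, which computes exactly the residual-vs-normal quantile pairs the slide describes, plus a correlation coefficient r that quantifies how linear the relationship is. The residuals here are synthetic, standing in for residuals from a fit like the one above.

```python
# Sketch of the Q-Q plot diagnostic: residual quantiles vs. normal quantiles.
# r near 1 means the points lie close to a line, i.e. plausibly normal residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, size=100)   # well-behaved synthetic residuals

(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist='norm')
# With matplotlib one would plot osm vs. osr and look for a straight line;
# heavy tails or skew in the residuals show up as systematic curvature.
```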

  5. Q-Q Plot

  6. Non-constant Variance

  7. Multiple Linear Regression
  • Y = X*a
  • a = pinv(X)*Y, the least-squares solution, where pinv(X) = inv(X'X)X'
  • X is now a matrix whose columns are predictors
  • The outcome is linear in each predictor after accounting for all the others
  • The same assumptions from simple linear regression apply
  • Adding predictors, even columns of random noise, can only increase R^2
  • Adjusted R^2: use mean square error instead of the sum of squared errors; this favors simpler models
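The slide's pseudoinverse solution and adjusted R^2 translate directly to NumPy. A sketch on synthetic data, with two of the deck's predictors (age, log(DV)) as stand-ins; the coefficients are invented.

```python
# Sketch of a = pinv(X)*Y and adjusted R^2 for multiple linear regression.
import numpy as np

rng = np.random.default_rng(3)
n = 40
age = rng.uniform(20, 60, n)
log_dv = rng.uniform(-3, -1, n)
X = np.column_stack([np.ones(n), age, log_dv])   # intercept + predictor columns
y = 0.05 * age + 2.0 * log_dv + 5.0 + rng.normal(0, 0.3, n)

a = np.linalg.pinv(X) @ y   # least-squares solution; equals inv(X'X) X' y at full rank

resid = y - X @ a
ss_res = resid @ resid
ss_tot = np.sum((y - y.mean()) ** 2)
p = X.shape[1] - 1           # number of predictors, excluding the intercept
r2 = 1 - ss_res / ss_tot
# Adjusted R^2 replaces sums of squares with mean squares, penalizing extra
# predictors; it is always <= R^2.
adj_r2 = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))
```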

  8. Diagnostics
  • The old tools are still good
  • New tools measure the influence of an observation, useful for identifying outliers
    • DFFITS: how much the regression function changes at the i-th observation when the i-th row is removed from X
    • Cook's distance: how much the entire regression function changes when the i-th row is removed
    • DFBETAS: how much the coefficients change when the i-th row is removed
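Of the three influence measures, Cook's distance has a compact closed form via the hat matrix, so the leave-one-out refits never have to be computed explicitly. A sketch on synthetic data with one deliberately corrupted observation; this is an illustration of the general formula, not of any MSmcDESPOT fit.

```python
# Sketch of Cook's distance from the hat matrix H = X pinv(X):
# it measures how much the whole fit changes when row i is removed.
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 1, n)
y = 3 * x + 1 + rng.normal(0, 0.2, n)
y[0] += 3.0                         # deliberately corrupt one observation

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.pinv(X)           # hat matrix; its diagonal is the leverage
h = np.diag(H)
resid = y - H @ y
p = X.shape[1]
mse = resid @ resid / (n - p)

# Cook's distance D_i = r_i^2 / (p * MSE) * h_i / (1 - h_i)^2
cooks = resid**2 / (p * mse) * h / (1 - h) ** 2

outlier = int(np.argmax(cooks))     # flags the corrupted row
```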

  9. Model Selection
  • As adjusted R^2 suggests, what we really want is a parsimonious model
    • One that predicts the outcome well with only a few predictors
  • This is a combinatorially hard problem
  • Models are evaluated with a criterion:
    • Adjusted R^2
    • Mallows' Cp – estimated predictive power of the model
    • Akaike information criterion (AIC) – related to Cp
    • Bayesian information criterion (BIC)
    • Cross-validation with MSE
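AIC and BIC for a Gaussian linear model reduce to simple formulas in the residual sum of squares. A sketch (using the common n*log(RSS/n) form of the log-likelihood, with additive constants dropped) comparing a correct model against the same model plus a useless noise predictor; all data are synthetic.

```python
# Sketch of AIC and BIC for OLS. Both trade fit against model size;
# BIC's log(n) penalty punishes extra parameters harder than AIC's 2.
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)       # a useless extra predictor
y = 2 * x1 + rng.normal(0, 0.5, n)

def criteria(X, y):
    """Return (AIC, BIC) for an OLS fit of y on X (X includes the intercept)."""
    n, k = X.shape
    resid = y - X @ np.linalg.pinv(X) @ y
    ss_res = resid @ resid
    ll_term = n * np.log(ss_res / n)  # -2*log-likelihood up to a constant
    return ll_term + 2 * k, ll_term + k * np.log(n)

ones = np.ones(n)
aic_small, bic_small = criteria(np.column_stack([ones, x1]), y)
aic_big, bic_big = criteria(np.column_stack([ones, x1, noise_col]), y)
# Lower is better for both criteria; BIC's gap between the two models
# exceeds AIC's, reflecting its stronger preference for simpler models.
```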

  10. Search Strategy
  • If the model is small enough, we can search over all subsets
    • In MSmcDESPOT this is probably feasible; our predictors are: age, PVF, log(DV), gender, PP, SP, RR, CIS
    • 127 possibilities
  • Stepwise
    • A popular search method: the algorithm is given a starting point, then adds or removes predictors one at a time until the criterion stops improving
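The exhaustive search the slides ask about (and slide 15 wonders whether MATLAB provides) is a short loop over predictor subsets. A Python sketch scoring every non-empty subset by adjusted R^2, using four of the deck's predictor names on invented data; the true model here uses only age and log(DV), so the search should recover them.

```python
# Sketch of exhaustive best-subset search scored by adjusted R^2.
# With p predictors there are 2^p - 1 non-empty subsets, cheap for small p.
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 60
names = ['age', 'PVF', 'log_DV', 'gender']   # a subset of the slide's list
cols = {
    'age': rng.uniform(20, 60, n),
    'PVF': rng.uniform(0.6, 0.9, n),
    'log_DV': rng.uniform(-3, -1, n),
    'gender': rng.integers(0, 2, n).astype(float),
}
y = 0.05 * cols['age'] + 2.0 * cols['log_DV'] + rng.normal(0, 0.3, n)

def adj_r2(X, y):
    n, k = X.shape
    resid = y - X @ np.linalg.pinv(X) @ y
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - (ss_res / (n - k)) / (ss_tot / (n - 1))

best_score, best_subset = -np.inf, None
for r in range(1, len(names) + 1):
    for subset in itertools.combinations(names, r):
        X = np.column_stack([np.ones(n)] + [cols[v] for v in subset])
        score = adj_r2(X, y)
        if score > best_score:
            best_score, best_subset = score, subset
```

Swapping in AIC, BIC, or cross-validated MSE as the scoring function changes only the `adj_r2` call.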

  11. New Results with All 26 Normals

  12. Old Plot with 22 Normals

  13. New Plot with 26 Normals

  14. Discussion
  • All the relevant rank-sum tests (normals vs. classes of MS, RR vs. SP) are still below the p < 0.01 threshold, as before
  • The drop in correlation is probably due to N024, who shows an unusually high amount of demyelination for a normal, at half the level of the lowest CIS patients, and could be an outlier
  • I'm not certain that log() is the correct transform for DV; more diagnostics are needed
  • How accurate is EDSS?

  15. Stepwise Model Selection
  • Stepwise model selection keeps age, PVF, and log(DV), shown on the left
  • On the right is a Q-Q plot of a model with age, PP, and SP
  • Is there a MATLAB function for exhaustive model search?
