1 / 34

Workshop in R & GLMs: #2

Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE) Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"):

Ava
Télécharger la présentation

Workshop in R & GLMs: #2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca

  2. Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE)

  3. Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"): diane$treatment OR attach(diane); treatment At end of session, remember to: detach(diane)

  4. Summary statistics length (x) mean (x) var (x) cor (x,y) sum (x) summary (x) minimum, maximum, mean, median, quartiles What is the correlation between two variables in your dataset?

  5. Factors • A factor has several discrete levels (e.g. control, herbicide) • If a vector contains text, R automatically assumes it is a factor. • To manually convert numeric vector to a factor: • x <- as.factor(x) • To check if your vector is a factor, and what the levels are: • is.factor(x) ; levels(x)

  6. Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small Area 5.1 to 10: medium Area > 10 ha: large You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels 3. Make AreaFactor into a factor, then check that it is a factor

  7. Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small Area 5.1 to 10: medium Area > 10 ha: large You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large" 3. Make AreaFactor into a factor, then check that it is a factor AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)

  8. Linear regression ALT+126 model <- lm (y ~ x, data = diane) insert your dataframe name here invent a name for your model linear model insert your x vector name here insert your y vector name here

  9. Linear regression model <- lm (Species ~ Elevation, data = diane) summary (model) Call: lm(formula = Species ~ Elevation, data = diane) Residuals: Min 1Q Median 3Q Max -7.29467 -2.75041 -0.04947 1.83054 15.00270 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.421568 2.426983 3.882 0.000551 *** Elevation -0.0026090.003663 -0.712 0.482070 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.811 on 29 degrees of freedom Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167 F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821

  10. Linear regression model2 <- lm (Species ~ AreaFactor, data = diane) summary (model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.833 1.441 8.210 6.17e-09 *** AreaFactormedium -2.405 1.723 -1.396 0.174 AreaFactorsmall -8.288 1.792 -4.626 7.72e-05 *** Large has mean species richness of 11.8 Medium has mean species richness of 11.8 - 2.4 = 9.4 Small has a mean species richness of 11.8 - 8.3 = 3.5 mean(Species[AreaFactor=="medio"])

  11. ANOVA model2 <- lm (Species ~ AreaFactor, data = diane) anova (model2) Analysis of Variance Table Response: Species Df Sum Sq Mean Sq F value Pr(>F) AreaFactor 2 333.85 166.92 13.393 8.297e-05 *** Residuals 28 348.99 12.46

  12. F tests in regression model3 <- lm (Species ~ Elevation + pH, data = diane) anova (model, model3) Model 1: Species ~ Elevation Model 2: Species ~ Elevation + pH Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 671.10 2 28 502.06 1 169.05 9.4279 0.004715 ** F 1, 28 = 9.43

  13. Exercise • Fit the model: Species~pH • Fit the model: Species~pH+pH2 • ("pH2" is just pH2) • Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH

  14. Distributions: not so normal! • Review assumptions for parametric stats (ANOVA, regressions) • Why don’t transformations always work? • Introduce non-normal distributions

  15. Tests before ANOVA, t-tests Normality Constant variances

  16. Tests for normality: exercise data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data How many datapoints are there?

  17. Tests for normality: exercise • Shapiro-Wilks (if sig, not normal) shapiro.test (data) If error message, make sure the stats package is loaded, then try again: library(stats); shapiro.test (data)

  18. Tests for normality: exercise • Kolmogorov-Smirnov (if sig, not normal) ks.test(data,”pnorm”,mean(data),sd=sqrt(var(data)))

  19. Tests for normality: exercise • Quantile-quantile plot (if wavers substantially off 1:1 line, not normal) par(pty="s") qqnorm(data); qqline(data) opens up a single plot window

  20. Tests for normality: exercise

  21. Tests for normality: exercise If the distribution isn´t normal, what is it? freq.data<-table(data); freq.data barplot(freq.data)

  22. Non-normal distributions • Poisson (count data, left-skewed, variance = mean) • Negative binomial (count data, left-skewed, variance >> mean) • Binomial (binary or proportion data, left-skewed, variance constrained by 0,1) • Gamma (variance increases as square of mean, often used for survival data)

  23. Exercise model2 <- lm (Species ~ AreaFactor, data = diane) anova (model2) 1. Test for normality of residuals resid2<- resid (model2) ...you do the rest! 2. Test for homogeneity of variances summary (lm (abs (resid2) ~ AreaFactor))

  24. Regression diagnostics • Residuals are normally distributed • Absolute value of residuals do not change with predicted value (homoscedastcity) • Residuals show no pattern with predicted values (i.e. the function “fits”) • No datapoint has undue leverage on the model.

  25. Regression diagnostics model3 <- lm (Species ~ Elevation + pH, data = diane) par(mfrow=c(2,2)); plot(model3)

  26. 1. Residuals are normally distributed • Straight “Normal Q-Q plot” Std. deviance resid. Theoretical Quantiles

  27. 2. Absolute residuals do not change with predicted values • No fan shape in Residuals vs fitted plot • No upward (or downward) trend in Scale-location plot MALO BUENO Residuals Sqrt (abs (SD resid.)) Fitted values Fitted values

  28. Examples of neutral and fan-shapes

  29. 3. Residuals show no pattern Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)

  30. 4. No unusual leverage Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)

  31. Transformations Try transforming your y-variable to improve the regression diagnostic plot • replace Species with log(Species) • replace Species with sqrt(Species)

  32. Poisson distribution • Frequency data • Lots of zero (or minimum value) data • Variance increases with the mean

  33. What do you do with Poisson data? • Correct for correlation between mean and variance by log-transforming y (but log (0) is undefined!!) • Use non-parametric statistics (but low power) • Use a “generalized linear model” specifying a Poisson distribution

  34. The problem: Hard to transform data to satisfy all requirements! Tarea: Janka example Janka dataset: Asks if Janka hardness values are a good estimate of timber density? N=36

More Related