
R for psychologists



  1. R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University Thomas.Baguley@ntu.ac.uk

  2. 0. Overview 1. What is R? 2. Why use R? 3. R basics 4. R graphics 5. Linear models in R 6. ANOVA and ANCOVA 7. Writing your own R functions 8. Simulation and bootstrapping 9. Advanced statistical modelling in R

  3. 1. What is R? R is a software environment for statistical analysis “the lingua franca of computational statistics” De Leeuw & Mair (2007)

  4. 2. Why use R? It is: • free • open source • cross-platform (Mac, Linux, PC) It has: • excellent basic statistics functions • powerful and versatile graphics • hundreds of user-contributed packages • a large community of users

  5. 3. R basics Some common R objects: • characters (text strings, e.g., 'a' or '1') • numbers (numbers like 2 or 1e+3) • vectors (1D set of numbers or other elements) • data frames (like vectors organized in columns) • matrices (r by c array of numbers) • lists (objects that contain other objects)
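A minimal sketch creating one of each type (all names and values here are illustrative, not from the slides):

  chars <- c('a', '1')                              # character vector (text strings)
  nums <- c(2, 1e+3)                                # numbers
  vec <- c(1, 9, 25)                                # a numeric vector
  df <- data.frame(id = 1:3, score = c(10, 12, 9)) # a data frame: named columns of equal length
  mat <- matrix(1:6, nrow = 2, ncol = 3)            # a 2 by 3 matrix
  lst <- list(chars, nums, df)                      # a list holding other objects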

  6. Assignment Input: > vector1 <- 6 > vector1 Output: [1] 6

  7. Arithmetic > 6 + 6 * 2 [1] 18 > vector1^2 [1] 36

  8. Calling functions > c(1, 9, 25) [1] 1 9 25 > log(vector1) [1] 1.791759 > rnorm(1, 100, 15) [1] 123.5336 > ?rnorm > help(rnorm)

  9. Loading data > dat1 <- read.csv('pride.csv') > library(foreign) > h05 <- read.spss('Hayden_2005.sav', to.data.frame = TRUE)

  10. Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31

  11. Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31 Or combine the steps with h05$days[1:9]

  12. Data frames > is.data.frame(h05) [1] TRUE > dim(h05) [1] 43 2 > names(h05) [1] "names" "days" • this data frame has 43 rows and 2 named columns • it can be helpful to think of a data frame as a set of named variables (vectors) bound into columns
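A small data frame can also be built by hand; a brief sketch echoing the h05 structure (the values are invented for illustration):

  toy <- data.frame(names = c('A', 'B', 'C'), days = c(2864, 1460, 2921))
  toy$days   # address a column by name, as with h05$days
  toy[1, ]   # address the first row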

  13. Matrix objects (matrices)
  > cells <- c(3677, 56, 3924, 31)
  > cat.names <- list(c("Before", "After"), c("Alive", "Dead"))
  > checklist <- matrix(cells, 2, 2, byrow = TRUE, dimnames = cat.names)
  > checklist
         Alive Dead
  Before  3677   56
  After   3924   31

  14. The power of objects
  > chisq.test(checklist)
  Pearson's Chi-squared test with Yates' continuity correction
  data: checklist
  X-squared = 8.1786, df = 1, p-value = 0.004239
  > chisq.test(c(2, 18))
  Chi-squared test for given probabilities
  data: c(2, 18)
  X-squared = 12.8, df = 1, p-value = 0.0003466

  15. Calling functions: defaults and named arguments
  > chisq.test(checklist, correct = FALSE)
  Pearson's Chi-squared test
  data: checklist
  X-squared = 8.8072, df = 1, p-value = 0.003000
  • the default argument (for matrix input) is correct = TRUE
  • setting correct = FALSE overrides the default
  • because unnamed arguments to functions are matched in order, this argument must be named (otherwise naming the arguments is optional)
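A brief sketch of how argument matching works, using rnorm() (whose arguments are n, mean and sd):

  rnorm(5, 100, 15)                   # positional: n = 5, mean = 100, sd = 15
  rnorm(sd = 15, n = 5, mean = 100)   # named: order no longer matters
  rnorm(5)                            # omitted arguments use the defaults (mean = 0, sd = 1)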

  16. Exercise 1 (R Basics) • entering data • working with objects • simple statistical functions
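One possible warm-up along these lines (a sketch only; the actual exercise sheet may differ):

  scores <- c(12, 15, 9, 11, 14)   # entering data
  scores - mean(scores)            # working with objects: vectorized arithmetic
  mean(scores); sd(scores)         # simple statistical functions
  summary(scores)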

  17. 4. R graphics > boxplot(h05$days) [boxplot of h05$days; n = 43]

  18. > hist(h05$days) > plot(density(h05$days))

  19. Distribution functions e.g., Normal distribution dnorm(x, mean = 0, sd = 1) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE) rnorm(n, mean = 0, sd = 1) > rdat <- rnorm(30, 100, 15)
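The d/p/q/r naming scheme works the same way across distributions; a short illustration for the normal:

  dnorm(100, mean = 100, sd = 15)   # density at x = 100
  pnorm(115, mean = 100, sd = 15)   # P(X <= 115), roughly 0.84
  qnorm(0.975)                      # 97.5th percentile of the standard normal, roughly 1.96
  rnorm(3, 100, 15)                 # three random draws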

  20. > curve(dnorm(x), xlim=c(-4,4), col='purple') > curve(dt(x, 1), col='red', add=TRUE, lty=3)

  21. Exercise 2 (R Graphics) • exploratory plots • plot parameters • plotting distribution functions • plotting a serial position curve (optional)
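For the plot parameters bullet, a hedged sketch of some common base graphics arguments (the title and labels are illustrative):

  hist(h05$days, breaks = 10, col = 'grey', main = 'Histogram of days', xlab = 'days')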

  22. 5. Linear models in R
  > expenses <- read.csv('expenses.csv')
  > t.test(majority ~ problem, data = expenses)
  Welch Two Sample t-test
  data: majority by problem
  t = -3.7477, df = 505.923, p-value = 0.000199
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: -2044.457 -638.160
  sample estimates:
  mean in group 0 mean in group 1
         7092.712        8434.021
  Data courtesy of Mark Thompson http://markreckons.blogspot.com/

  23. > expenses <- read.csv('expenses.csv')
  > lm(majority ~ problem, data = expenses)
  Call:
  lm(formula = majority ~ problem, data = expenses)
  Coefficients:
  (Intercept)     problem
         7093        1341

  24. > lm.out <- lm(majority ~ problem, data = expenses)
  > max(rstandard(lm.out))
  [1] 2.629733
  > AIC(lm.out)
  [1] 12674.85
  > lm.out$coefficients
  (Intercept)     problem
     7092.712    1341.308
  > ?lm

  25. > summary(lm.out)
  Call:
  lm(formula = majority ~ problem, data = expenses)
  Residuals:
      Min      1Q  Median      3Q     Max
  -8397.0 -3403.5  -260.9  3118.8 11543.3
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   7092.7      218.9  32.397  < 2e-16 ***
  problem       1341.3      357.0   3.758 0.000187 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Residual standard error: 4395 on 644 degrees of freedom
  Multiple R-squared: 0.02145, Adjusted R-squared: 0.01993
  F-statistic: 14.12 on 1 and 644 DF, p-value: 0.0001871

  26. > glm(problem ~ I(majority/10000), family = 'binomial', data = expenses)
  Call:
  glm(formula = problem ~ I(majority/10000), family = "binomial", data = expenses)
  Coefficients:
        (Intercept)  I(majority/10000)
            -1.0394             0.6878
  Degrees of Freedom: 645 Total (i.e. Null); 644 Residual
  Null Deviance: 855.5
  Residual Deviance: 841.6  AIC: 845.6

  27. > lrm.out <- glm(problem ~ I(majority/10000), family = 'binomial', data = expenses) > plot(expenses$majority, lrm.out$fitted, col = 'dark green')

  28. Exercise 3 (Linear models) • using a formula in a call to a model function • linear models • plotting a regression line (see the sketch below) • generalized linear models (optional) • plotting confidence bands (optional)
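For the regression line bullet, a minimal sketch using abline() to add the fitted line (object names follow the earlier slides):

  lm.out <- lm(majority ~ problem, data = expenses)
  plot(expenses$problem, expenses$majority)
  abline(lm.out, col = 'red')   # adds the fitted regression line to the scatterplot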

  29. 6. ANOVA
  > factor1 <- gl(4, 10)
  > factor1
   [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
  Levels: 1 2 3 4
  > library(foreign)
  > diag.data <- read.spss("diagram.sav", to.data.frame = TRUE)
  > diag.fit <- lm(descript ~ group, data = diag.data)

  30. Regression model (dummy coding)
  > diag.fit
  Call:
  lm(formula = descript ~ group, data = diag.data)
  Coefficients:
        (Intercept)       groupPicture  groupFull Diagram     groupSegmented
               17.8                0.5                4.6                9.5

  31. aov()
  > aov(diag.fit)
  Call:
  aov(formula = diag.fit)
  Terms:
                   group Residuals
  Sum of Squares   583.7    2440.2
  Deg. of Freedom      3        36

  32. anova()
  > summary(aov(diag.fit))
  > anova(diag.fit)
  Analysis of Variance Table
  Response: descript
            Df Sum Sq Mean Sq F value  Pr(>F)
  group      3  583.7 194.567  2.8704 0.04977 *
  Residuals 36 2440.2  67.783
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  33. Factorial ANOVA and ANCOVA
  > lm(DV ~ factor1 + factor2 + factor1:factor2, data = data2)
  > lm(DV ~ factor1 * factor2, data = data2)
  > lm(descript ~ group + time, diag.data)
  > lm(descript ~ 0 + group + scale(time, scale = F), diag.data)
  > diag.ancov <- lm(descript ~ group + scale(time, scale = F), diag.data)

  34. Type I and Type II sums of squares
  > drop1(diag.ancov, test = 'F')
  Single term deletions
  Model: descript ~ group + scale(time, scale = F)
                         Df Sum of Sq    RSS    AIC F value    Pr(F)
  <none>                              2154.4 169.46
  group                   3    821.45 2975.9 176.38  4.4484 0.009478 **
  scale(time, scale = F)  1    285.78 2440.2 172.44  4.6427 0.038149 *
  • sequential sums of squares (Type I) tests are given by default
  • hierarchical sums of squares (Type II) tests are given by the drop1() function
  [Software such as SAS and SPSS defaults to unique (Type III) SS]

  35. Repeated measures (within-subjects) factors
  • a weakness in the base R installation, but easily done using packages such as ez, nlme or lme4 (the latter two able to fit a wider range of repeated measures models)
  > pride.long <- read.csv("pride_long.csv")
  > pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long)
  > summary(pride.mixed)
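For comparison, a hedged sketch of the same design fitted as a mixed-effects model with lme4 (the random-effects structure shown is an assumption, not taken from the slides):

  library(lme4)
  pride.lmer <- lmer(accuracy ~ emotion * condition + (1 | participant), data = pride.long)
  summary(pride.lmer)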

  36. Multiple comparisons and contrasts • you can run contrasts by changing default contrasts in R • Fisher LSD, Tukey’s HSD and p adjustment (e.g., Hochberg, Holm, Bonferroni, FDR) in R base installation • powerful modified Bonferroni (e.g., Shaffer, Westfall) and general linear contrasts available in multcomp package • flexible contrast and estimable functions in gmodels package
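A few base R illustrations of these facilities (diag.data as loaded earlier; the adjustment methods shown are examples):

  pairwise.t.test(diag.data$descript, diag.data$group, p.adjust.method = 'holm')
  TukeyHSD(aov(descript ~ group, data = diag.data))     # Tukey's HSD intervals
  p.adjust(c(0.01, 0.02, 0.04), method = 'bonferroni')  # adjust a vector of p values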

  37. Exercise 4 (ANOVA) • factors • aov() and anova() • ANCOVA • repeated measures (within-subjects) • contrasts and multiple comparisons (optional)

  38. 7. Writing your own functions
  sd.pool <- function(sd.1, n.1, sd.2, n.2 = n.1) {
    num <- (n.1 - 1) * sd.1^2 + (n.2 - 1) * sd.2^2
    denom <- n.1 + n.2 - 2
    output <- sqrt(num / denom)
    output
  }
  > sd.pool(6.1, 20, 5.3, 25)
  [1] 5.66743
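Because n.2 defaults to n.1, equal group sizes need only three arguments:

  > sd.pool(6.1, 20, 5.3)   # equivalent to sd.pool(6.1, 20, 5.3, 20)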

  39. Exercise 5 (Write a function) • pick a simple statistical procedure • write a function to automate it e.g., pooled standard deviation

  40. 8. Simulation and bootstrapping • distribution functions > rnorm(100, 10, 1) • R boot package • sample() and replicate()
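A sketch of how these pieces combine (the names are invented for illustration):

  x <- rnorm(100, 10, 1)                 # simulate a sample
  sample(x, length(x), replace = TRUE)   # one bootstrap resample
  boot.means <- replicate(1000, mean(sample(x, length(x), replace = TRUE)))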

  41. Bootstrap CIs for a median or trimmed mean

  42. The bootstrap (and other Monte Carlo methods)
  Bootstrapping involves repeatedly resampling with replacement from a data set
  e.g., Data set = {0,1}
  7 simulated samples: {1,0}, {0,1}, {0,0}, {0,0}, {0,1}, {0,0}, {1,1}
  Bootstrapping effectively treats the sample distribution as the population distribution
  > x <- c(0, 1)
  > replicate(7, sample(x, 2, replace = TRUE))

  43. Bootstrapping using R
  > library(boot)
  > x1.boot <- boot(samp, function(x, i) median(x[i]), R = 10000)
  > boot.ci(x1.boot)
  BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
  Based on 10000 bootstrap replicates
  CALL:
  boot.ci(boot.out = x1.boot)
  Intervals:
  Level      Normal              Basic
  95%   (-0.1747,  0.6948)  (-0.1648,  0.6614)
  Level     Percentile            BCa
  95%   (-0.1538,  0.6724)  (-0.2464,  0.6620)
  Calculations and Intervals on Original Scale
  Normal 95% CI for mean (using t): [-2.00, 1.33]

  44. Exercise 6 (Bootstrapping) • resample data • construct a simple percentile bootstrap (see the sketch below) • run a BCa bootstrap using the boot package (optional)
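One possible shape for the percentile bootstrap bullet, using only sample(), replicate() and quantile() (samp as on the earlier slide):

  boot.medians <- replicate(10000, median(sample(samp, length(samp), replace = TRUE)))
  quantile(boot.medians, c(0.025, 0.975))   # 95% percentile interval for the median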

  45. 9. Advanced statistical modelling in R
  Multilevel modelling: the nlme, lme4 and MCMCglmm packages
  Meta-analysis: the meta and metafor packages
  … and many, many more

  46. metafor package
  > m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73)
  > n.e <- c(31, 57, 9, 17, 10)
  > m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47)
  > n.c <- c(14, 52, 9, 18, 10)
  > sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74)
  > diff <- m.c - m.e
  > se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)

  47. > install.packages('metafor')
  > library(metafor)
  > meta.out <- rma(yi = diff, sei = se.diffs, method = 'FE')
  > meta.out
  Fixed-Effects Model (k = 5)
  Test for Heterogeneity:
  Q(df = 4) = 21.0062, p-val = 0.0003
  Model Results:
  estimate      se     zval    pval    ci.lb    ci.ub
   -4.1003  1.0750  -3.8141  0.0001  -6.2073  -1.9933  ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  48. > funnel(trimfill(meta.out))
