Lecture 1 Introduction

Advanced Research Skills Lecture 1 Introduction Olivier MISSA, om502@york.ac.uk

Aims • Introduce the use of R for advanced statistical analyses beyond "Statistics for Ecologists". • Demonstrate these analyses on a broad range of questions and situations. • Develop your understanding of statistical programming. • Empower you to tackle future analytical challenges on your own.

Aims • Other skills will be developed too. • Produce posters using CorelDraw (graphics package). • Learn how to write a grant proposal.

Learning Outcomes • At the end of the module, you should be able to: • Determine which test to use for significance testing. • Explore the inherent structure of your data through a wide range of multivariate techniques. • Work out which model "best explains" the variable you are interested in. • Produce high-quality graphs (ready for publication) using the full graphical capabilities of R.

Organisation • Staff • Olivier Missa (OM), module organiser, R sessions, om502@york.ac.uk • Emma Rand (ER), R sessions, er13@york.ac.uk • Phil Roberts (PTR), CorelDraw session, ptr2@york.ac.uk • Peter Mayhew (PJM), Grant writing session, pjm19@york.ac.uk

Organisation • Structure • 9 theoretical lectures (OM) on advanced stats. • 9 practical sessions (OM & ER) on using R. • 1 practical session (PTR) on CorelDraw. • 1 tutorial session (PJM) on Grant writing.

Organisation • Content • L1 Introduction • L2 – L4 Linear Models • L5 – L6 GLMs & Mixed-effects models • L7 Non-Linear Models • L8 – L9 Multivariate Analyses • Each lecture is accompanied by a practical session

Organisation • Assessment • Open Data Analysis exercise. • Written report with Introduction, Material & Methods, Results, Discussion. • Particular emphasis on justifying the analyses and interpreting the results properly.

What is R ? • "R is a language and environment for statistical computing and graphics" (R website) • A programming language, actually a dialect of S, which was developed from the mid-1970s by John Chambers and colleagues at Bell Labs. • Bell Labs then sold S to MathSoft (now Insightful Co.), which developed it further into S-Plus, a commercial statistical package. • In the 90s, the S language was reimplemented from scratch as R by two statisticians working in New Zealand, Ross Ihaka & Rob Gentleman. • Since then R has continued to grow in scale and scope and is currently maintained by about 20 people across the globe.

Why use R ? • The Key Benefits: • Free: it won't cost you a penny, ever • Open: how things are calculated is not hidden • Fully customisable: the user is in full control • Cutting edge: stats pros use it to create new techniques • Very widespread (increasingly so): thousands of contributors (packages), millions of users • Supported by an international user community happy to provide help and assistance

Why use R ? • The Drawback : • Steep Learning Curve • You need to learn the language • You need to know what you are doing (stats)

What is R Good for ? • Absolutely everything (to do with data) • Statistics • Modelling • Programming / Simulations • Graphics (from very simple to complex, 2D, 3D, ...) • Database (simple relational functions) • Bioinformatics (Bioconductor project) • Platform interacting with other software (e.g. GGobi, WinBUGS, MySQL, GRASS GIS)

Example of a session • > data(volcano) • > dim(volcano) • [1] 87 61 • > volcano • [,1] [,2] [,3] [,4] [,5] [,6] [,7] . . . [,61] • [1,] 100 100 101 101 101 101 101 . . . 103 • [2,] 101 101 102 102 102 102 102 . . . 104 • . . . . . . . . . . . . . . . . . . . . . . . . . . • [87,] 97 97 97 98 98 99 99 . . . 94 • > volcano[1:3,1:3] • [,1] [,2] [,3] • [1,] 100 100 101 • [2,] 101 101 102 • [3,] 102 102 103

> range(volcano) • [1] 94 195 • > mean(volcano) • [1] 130.1879 • > sd(volcano) • [1] 6.902227 7.565538 8.203669 8.735686 . . . • [8] 11.165554 11.735217 12.733854 13.668694 . . . • . . . • > ?sd ## help('sd') does the same • > sd • function (x, na.rm = FALSE) • { if (is.matrix(x)) • apply(x, 2, sd, na.rm = na.rm) • else if (is.vector(x)) • sqrt(var(x, na.rm = na.rm)) • else if (is.data.frame(x)) • sapply(x, sd, na.rm = na.rm) • else sqrt(var(as.vector(x), na.rm = na.rm)) • } . . .
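The column-wise result above follows from the definition of sd() printed on this slide. More recent versions of R no longer treat a matrix specially in sd() and return a single value for all elements, so the sketch below asks for each behaviour explicitly rather than relying on the default. The object names col.sd and all.sd are introduced here for illustration only.

```r
## Column-wise standard deviations, requested explicitly with apply()
col.sd <- apply(volcano, 2, sd)   ## MARGIN = 2 means "by column"
length(col.sd)                    ## one value per column of volcano
col.sd[1:4]

## Standard deviation of all elevations taken together
all.sd <- sd(as.vector(volcano))
all.sd
```

Writing the intent out this way gives the same answer whichever definition of sd() is installed.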

> sd(as.vector(volcano)) • [1] 25.83233 • > summary(as.vector(volcano)) • Min. 1st Qu. Median Mean 3rd Qu. Max. • 94.0 108.0 124.0 130.2 150.0 195.0 • > volcano.v <- as.vector(volcano) • > dim(volcano.v) • NULL • > length(volcano.v) • [1] 5307 • > 61*87 • [1] 5307 • > volcano.v[1:87] == volcano[,1] • [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE . . . • . . . . . . . . . . . . . . . . . . . . . . • [87] TRUE • > volcano.v[1:61] == volcano[1,] • . . . only three values (out of 61) show "TRUE"
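The last two comparisons turn out this way because as.vector() unrolls a matrix column by column. A small sketch on a toy matrix (the name m is hypothetical) makes the ordering visible, and shows that transposing first gives the row-by-row order instead.

```r
m <- matrix(1:6, nrow = 2)   ## filled column by column
as.vector(m)                 ## 1 2 3 4 5 6 : column 1, then column 2, ...
as.vector(t(m))              ## 1 3 5 2 4 6 : row 1, then row 2

## Same idea on the volcano data
volcano.v <- as.vector(volcano)
first.col.ok <- all(volcano.v[1:87] == volcano[, 1])               ## first column
first.row.ok <- all(as.vector(t(volcano))[1:61] == volcano[1, ])   ## first row
first.col.ok
first.row.ok
```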

> plot(volcano) • Not useful: it only shows that the elevations in columns 1 and 2 tend to be correlated.

> plot(volcano) • > plot(volcano.v, pch=20) • > hist(volcano, prob=TRUE, • + xlab="volcano elevation (m)") • > x <- seq(90,200,1) • > curve(dnorm(x, mean=mean(volcano.v), • + sd=sd(volcano.v)), add=TRUE) • > shapiro.test(volcano.v) • Error in shapiro.test(volcano.v) : • sample size must be between 3 and 5000 • > smpl <- sample(volcano.v, 5000) • > shapiro.test(smpl) • Shapiro-Wilk normality test • data: smpl • W = 0.9358, p-value < 2.2e-16
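The histogram, theoretical curve and sub-sampling steps above can be gathered into one short script. This is a sketch only: set.seed() and its value are an addition (not part of the session above) so that the random sub-sample can be reproduced, and curve() is left to build its own grid of x values between the from and to limits. The name sw is hypothetical.

```r
volcano.v <- as.vector(volcano)

hist(volcano.v, prob = TRUE, xlab = "volcano elevation (m)")
## curve() evaluates the expression in x over its own grid of values
curve(dnorm(x, mean = mean(volcano.v), sd = sd(volcano.v)),
      from = 90, to = 200, add = TRUE)

## shapiro.test() accepts between 3 and 5000 values, so take a sub-sample
set.seed(1)                       ## arbitrary seed, assumed for reproducibility
smpl <- sample(volcano.v, 5000)   ## sampling without replacement
sw <- shapiro.test(smpl)
sw$statistic                      ## W
sw$p.value                        ## compare with the chosen significance level
```

Because the sub-sample is random, W will vary slightly from run to run unless the seed is fixed.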

> library(nortest) ## Package of Normality tests • > ad.test(volcano) ## Anderson-Darling • Anderson-Darling normality test • data: volcano • A = 106.2715, p-value < 2.2e-16 • > cvm.test(volcano) ## Cramer-von Mises • > lillie.test(volcano) ## Lilliefors • > pearson.test(volcano) ## Pearson (Chi2) • > sf.test(smpl) ## Shapiro-Francia • > qqnorm(volcano.v) • > qqline(volcano.v, col="red")

> x <- 10*(1:nrow(volcano)) ## 10, 20, ..., 870 • > y <- 10*(1:ncol(volcano)) ## 10, 20, ..., 610 • > image(x, y, volcano)
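image() expects x to hold one coordinate per row of the matrix and y one coordinate per column, which is why x is built from nrow() and y from ncol(). A quick check, assuming the grid spacing of 10 m implied by the factor of 10:

```r
x <- 10 * (1:nrow(volcano))   ## one coordinate per row (87 values)
y <- 10 * (1:ncol(volcano))   ## one coordinate per column (61 values)
range(x)                      ## 10 870
range(y)                      ## 10 610

## Stop early if the coordinates do not match the matrix dimensions
stopifnot(length(x) == nrow(volcano), length(y) == ncol(volcano))
image(x, y, volcano)
```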

> x <- 10*(1:nrow(volcano)) • > y <- 10*(1:ncol(volcano)) • > image(x, y, volcano) • > image(x, y, volcano, asp=1)

> x <- 10*(1:nrow(volcano)) • > y <- 10*(1:ncol(volcano)) • > image(x, y, volcano) • > image(x, y, volcano, asp=1) • > image(x, y, volcano, asp=1, • + col = terrain.colors(100), • + axes = FALSE)

> x <- 10*(1:nrow(volcano)) • > y <- 10*(1:ncol(volcano)) • > image(x, y, volcano) • > image(x, y, volcano, asp=1) • > image(x, y, volcano, asp=1, • + col = terrain.colors(100), • + axes = FALSE) • > contour(x, y, volcano, • + levels = seq(90, 200, by=5), • + add = TRUE, col = "peru")

> x <- 10*(1:nrow(volcano)) • > y <- 10*(1:ncol(volcano)) • > image(x, y, volcano) • > image(x, y, volcano, asp=1) • > image(x, y, volcano, asp=1, • + col = terrain.colors(100), • + axes = FALSE) • > contour(x, y, volcano, • + levels = seq(90, 200, by=5), • + add = TRUE, col = "peru") • > image(x, y, volcano, asp=1, • + col = terrain.colors(100), • + axes = FALSE) • > contour(x, y, volcano, • + levels = seq(90, 200, by=10), • + add = TRUE, col = "peru")

Gallery of other Volcano Graphs • image + contour • persp • persp with shading • surface3d
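The slide only names the functions behind the gallery, so the following is a minimal sketch of how the two persp() panels might be produced. The viewing angles, colour and shading values are assumptions chosen for illustration, not the settings used for the original figures, and pmat is a hypothetical name. surface3d comes from an add-on package and is not sketched here.

```r
x <- 10 * (1:nrow(volcano))
y <- 10 * (1:ncol(volcano))

## Plain wireframe surface; scale = FALSE keeps the true height/width ratio
persp(x, y, volcano, theta = 135, phi = 30, scale = FALSE,
      xlab = "x (m)", ylab = "y (m)", zlab = "elevation (m)")

## Same surface, shaded as if lit from one side
## (persp() invisibly returns its 4 x 4 viewing transformation matrix)
pmat <- persp(x, y, volcano, theta = 135, phi = 30, scale = FALSE,
              col = "green3", border = NA,      ## filled facets, no mesh lines
              shade = 0.75, ltheta = 120,       ## shading strength, light direction
              xlab = "x (m)", ylab = "y (m)", zlab = "elevation (m)")
```

theta and phi rotate the viewpoint, so changing them is the quickest way to explore the surface.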

More Classical Graphs • Histogram + Theoretical curve • Boxplot • Stripchart • Pie chart • Barplot • 3D models
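Several of these classical graphs can be tried on the volcano elevations already used in this lecture. The sketch below is illustrative only: the elevation classes, their break points and the names elev.class and counts are assumptions made up for the example, not part of the original slides.

```r
volcano.v <- as.vector(volcano)

## Hypothetical elevation classes, break points chosen arbitrarily
elev.class <- cut(volcano.v, breaks = c(90, 120, 150, 200),
                  labels = c("low", "mid", "high"))
counts <- table(elev.class)

op <- par(mfrow = c(2, 2))                 ## 2 x 2 grid of panels
boxplot(volcano.v ~ elev.class, ylab = "elevation (m)")
stripchart(volcano.v ~ elev.class, method = "jitter",
           vertical = TRUE, pch = 20, ylab = "elevation (m)")
barplot(counts, ylab = "number of grid cells")
pie(counts)
par(op)                                    ## restore previous settings
```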
