Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Lecture 1: Course Structure & Introduction to R MBP1010 Dr. Paul C. Boutros Winter 2014 † Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE) DEPARTMENT OF MEDICAL BIOPHYSICS This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others †

Who Am I? • Got my PhD here in Medical Biophysics in 2008 • Started a lab that year at OICR • Research focuses on statistical techniques for developing biomarkers for personalized cancer treatment • The interface of clinical research, molecular biology, computer science and biostatistics • Five MBP graduate students • This is the first full grad-course I am teaching • TAs

Who Are You? • MSc Students? PhD Students? Others? • First Year? Second Year? Third Year? Others? • Prior use of R? • (Bio)statistics in your thesis project? • Computational biology in your thesis project? • Genomics in your thesis project? • What do you want to get out of this course?

My Philosophy For This Course • Learn how to do first (application), theory second • Cover less material, but make sure it is clear when and how to use it • Sometimes, the correct answer is “I’ll ask a real statistician” • I use this answer routinely • Grades are mostly based on the ability to get things done

Course Overview • Lecture 1: What is Statistics? Introduction to R • Lecture 2: Univariate Analyses I: continuous • Lecture 3: Univariate Analyses II: discrete • Lecture 4: Multivariate Analyses I: specialized models • Lecture 5: Multivariate Analyses II: general models • Lecture 6: Sequence Analysis • Lecture 7: Microarray Analysis I: Pre-Processing • Lecture 8: Microarray Analysis II: Multiple-Testing • Lecture 9: Machine-Learning • Final Exam (written)

How Will You Be Graded? • 9% Participation: 1% per week • 56% Assignments: 8 x 7% each • 35% Final Examination: in-class • Each individual will get their own, unique assignment • Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run) • Final Exam will include multiple-choice and written answers

What Resources Can I Use? • Lecture notes alone should be sufficient • Tutorial sessions • Recommended Book: • Introductory Statistics with R; Peter Dalgaard • Extensive documentation for R itself • Several online tutorials • Course Email: quantitativebiology.utoronto@gmail.com

House Rules • Cell phones to silent • No side conversations • Hands up for questions • Others?

What Is Statistics? • The study of all aspects of data itself: • Collection • Organization • Quantifying uncertainty • Data presentation • Reporting/Description • Visualization • Analysis/Inference • Distinct from, but closely related to, probability theory: • Statistics: learning from data • Probability Theory: inferring from the underlying population

Population vs. Sample
Population: all possible measurements
Sample: the portion of the population we are studying
All MBP Students = Population
MBP Students in 1010 = Sample
Is that sample representative?
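
The distinction can be tried out in R. This is a minimal sketch, assuming a simulated population: the numbers, the seed, and the names Population and Sample are all made up for illustration and are not course data.

```r
# Hypothetical population: 1000 simulated measurements (not course data)
set.seed(1010)                                   # fixed seed so the draw is repeatable
Population <- rnorm(1000, mean = 170, sd = 10)

# A random sample of 30 units drawn from that population
Sample <- sample(Population, size = 30)

# Compare the sample estimate with the population value
PopulationMean <- mean(Population)
SampleMean <- mean(Sample)
print(c(population = PopulationMean, sample = SampleMean))
```

A random draw like this tends to be representative; a sample chosen by convenience (such as one course within a department) may not be.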

When Do We Use Statistics? • Ubiquitous in modern biology • Every class I will show a use of statistics in a (very, very) recent Nature paper. January 2, 2014

Figure 1: At Least 6 P-Values

How Do You Report Statistical Analyses? • Ideas? • What is a P-Value? • What is an Effect-Size? • Which matters to you as a biologist? Why? • Always report both
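
To make "always report both" concrete, here is a sketch that assumes two simulated groups and uses a two-sample t-test purely as an example of a test. The data, the seed, and the names Control and Treated are assumptions; the effect size shown is the simple difference in means.

```r
# Simulated data, for illustration only
set.seed(42)
Control <- rnorm(20, mean = 10, sd = 2)
Treated <- rnorm(20, mean = 12, sd = 2)

Result <- t.test(Treated, Control)           # Welch two-sample t-test (R's default)
PValue <- Result$p.value                     # how surprising the data are under "no difference"
EffectSize <- mean(Treated) - mean(Control)  # how large the difference is
ConfInt <- Result$conf.int                   # 95% confidence interval for that difference

cat("Difference in means:", round(EffectSize, 2), "\n")
cat("95% CI:", round(ConfInt[1], 2), "to", round(ConfInt[2], 2), "\n")
cat("P-value:", signif(PValue, 3), "\n")
```

The P-value says whether a difference is distinguishable from noise; the effect size says whether it is big enough to matter biologically.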

R Latest version: 3.0.2 (released September). I am using v3.0.1. The differences are minimal regarding the functionality we are going to use, and are mostly minor bug-fixes. Either version will be perfectly fine, and most older versions should work as well until the last few lectures.

R Studio Don’t use this! Not ready for production-use!

Why Are You Learning R? Why not Excel? Even If You Do It Right… Spreadsheets Are Hard
“What we know about spreadsheet errors” Journal of End User Computing 10(2):15-21
Spreadsheet Error Rate: 88% Cell Error Rate: 2-7%
“The accuracy of statistical distributions in Microsoft Excel 2007” “On the accuracy of statistical procedures in Microsoft Excel 2007” Computational Statistics & Data Analysis 52
“researchers should continue to avoid using the statistical functions in Excel 2007 for any scientific purpose”
“it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.”

Other Reasons to Use R • Emerging as the lingua franca of statistics • New methods are first developed for and implemented in R • Extraordinarily flexible: • Moving from simple to sophisticated analyses is easy • Free • Community development leading to rapid improvements • Works identically on any type of computer (PC, Mac, Linux) • Extraordinarily high-quality visualizations possible • Reproducible research

Complex Data Visualization in R

Let’s Look at the Parts of R • Overall Editor Experience • R can act as a very good calculator • It can store variables • But you should always save your R commands in a separate file containing nothing else • Why? Reproducibility, separation of code & data, reusability

Different Data-Types • Scalar vs. Vector • String vs. Numeric • Categorical Data • Male vs. Female • Days of the Week • Colours • Functions

expressions
R evaluates expressions. Entering expressions allows you to use R like a calculator.
> 2+2
[1] 4
> exp(-2)
[1] 0.1353353
> pi
[1] 3.141593
> sin(2*pi)
[1] -2.449294e-16
> 0/0
[1] NaN
Tip: Predefined symbols: pi, letters, month.name
Special symbols: NA, NaN, Inf, NULL, TRUE, FALSE
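
The special symbols behave predictably in arithmetic and can be tested for. A short sketch using only base R functions:

```r
1/0                             # Inf
-1/0                            # -Inf
0/0                             # NaN
is.nan(0/0)                     # TRUE
is.na(NA)                       # TRUE
is.na(0/0)                      # TRUE: is.na() also treats NaN as missing
is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
letters[1:3]                    # "a" "b" "c"
month.name[1]                   # "January"
```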

strings
R has a string datatype. Although you can accomplish all of your string-handling needs in R, other programming languages may be more suitable.
> "Hello"
[1] "Hello"
> x <- paste("Hello", "World")
> x
[1] "Hello World"
> m <- gregexpr("(\\b\\w{2})", x, perl=TRUE)
> y <- regmatches(x, m)
> y
[[1]]
[1] "He" "Wo"
> paste(y[[1]], collapse='')
[1] "HeWo"
Task: Assign a first and a last name to two variables. Create a third variable that contains the initials.

dates
R has a date datatype.
> format(ISOdate(2000, 1:12, 1), "%b")
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun"
 [7] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> format(Sys.time(), "%W")
[1] "20"
See strptime() for formatting options.
Task: What weekday will your birthday be this year?

help
R has extensive help available for all of its functions and objects.
> help(pi)
> ?pi
> ?sqrt
> ?Special
Task: Print pi to 10 digits.
Fix: help "sqrt"

help searches
If you don't know the function name, try a keyword search.
> help.search("trigonometry")
> ??input
However, often a Google search will give you more immediate results.
Tip: A table of all available packages is at: http://cran.r-project.org/
R Manuals are at: http://cran.r-project.org/manuals.html
For a list of all functions in the base package see e.g.: http://ugrad.stat.ubc.ca/R/library/base/html/00Index.html

assignments
We need to be able to store intermediate results. In R, we can assign data to variables.
> x <- 1/sqrt(4)
> y <- sin(pi/6)
> x+y
[1] 1
The R community prefers "<-" to "=". Both are possible. "<-" is more general. Don't confuse "=" with "==" !!!
Tip: Be explicit in variable names. Avoid "-" and "_", use mixed case or dots instead. Make variables upper-case nouns, functions lower-case verbs.
Good: GeneIDs, simulateAlleles()
Poor: q, calculate-number
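
The difference between assigning and comparing is easy to see at the console. A minimal sketch; the variable names GeneCount and IsFive are made up for illustration:

```r
GeneCount <- 5            # assignment with "<-"
GeneCount = 5             # "=" also assigns when used at the top level
GeneCount == 5            # "==" compares: TRUE
GeneCount == 6            # FALSE
IsFive <- GeneCount == 5  # the result of a comparison can itself be stored

# Inside a function call, "=" names an argument rather than assigning
mean(x = c(1, 2, 3))      # returns 2; this call does not create a variable "x"
```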

vectors
We can't do much statistics with scalars. R is built to handle lists of numbers and other elements efficiently. Lists (vectors) can be created with the "c" function (concatenate).
> Weight <- c(60,72,75,90,95,72)
> Weight[1]
[1] 60
> Weight[2]
[1] 72
> Weight
[1] 60 72 75 90 95 72
> Height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
> BMI <- Weight/Height^2 # vector based operation
> BMI
[1] 19.59184 22.22222 27.54821 24.93075 31.37799 19.73630

vector operations
If you apply an operation to a vector, it is applied to each element of the vector.
> x <- 1:5
> x+2
[1] 3 4 5 6 7
If you apply an operation to two vectors, it is applied to each matching pair of elements.
> y <- 6:2
> x+y
[1] 7 7 7 7 7
Exercise: What happens if the vectors have different types (numeric, character, logical)? What happens if the vectors have different lengths?

vector operations Exercise: Create a vector "x" with the following elements 1,3,10,-1. Print the square of these elements. Take the square root of x. Take the log of all values in x after adding 1.

vector types
• R vectors can be of type: • numeric • character • logical.
> x <- c(1, 5, 8) # Numeric
> x
[1] 1 5 8
> x <- c(TRUE, TRUE, FALSE, TRUE) # Logical
> x
[1]  TRUE  TRUE FALSE  TRUE
> x <- c("Hello","world") # Character
> x
[1] "Hello" "world"
> x <- c(1, TRUE, "Thursday") # Mixed
> x
[1] "1"        "TRUE"     "Thursday"
Task: Show that "TRUE" is no longer a logical type.

missing and special values
We have already encountered the NaN symbol meaning Not-a-Number, and Inf, -Inf. In practical data analysis a data point is frequently unavailable. In R, missing values are denoted by NA ("Not Available").
> Weight[5] <- NA
> mean(Weight)
[1] NA
> mean(Weight, na.rm=TRUE)
[1] 73.8
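
Missing values can also be located and counted before deciding how to handle them. A short sketch that reuses the Weight values from the slides; NumMissing and Complete are illustrative names:

```r
Weight <- c(60, 72, 75, 90, NA, 72)   # slide values, with the fifth one missing

is.na(Weight)                         # FALSE FALSE FALSE FALSE TRUE FALSE
NumMissing <- sum(is.na(Weight))      # 1
Complete <- Weight[!is.na(Weight)]    # drop the missing value explicitly
mean(Complete)                        # 73.8, same as mean(Weight, na.rm=TRUE)

NA == NA                              # NA, so test with is.na() rather than "=="
```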

matrices and arrays
A matrix is a two dimensional array of numbers. Matrices can be used to perform statistical operations (linear algebra). However, they can also be used to hold tables.
> a <- matrix(1:12,nrow=3,byrow=TRUE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> a <- matrix(1:12,nrow=3,byrow=FALSE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> rownames(a) <- c("A","B","C")
> a
  [,1] [,2] [,3] [,4]
A    1    4    7   10
B    2    5    8   11
C    3    6    9   12
> colnames(a) <- c("1","2","x","y")
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> length(x)
[1] 12
> dim(x)
NULL
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
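
As a small illustration of the linear-algebra use mentioned on this slide, the following sketch applies a few base R matrix operations to the same 3 x 4 matrix. The result names (ta, P, E, RowTotals, ColTotals) are made up for the example.

```r
a <- matrix(1:12, nrow = 3, byrow = FALSE)

dim(a)                    # 3 4
ta <- t(a)                # transpose: 4 rows, 3 columns
P <- a %*% ta             # matrix product: (3x4) %*% (4x3) gives a 3x3 matrix
E <- a * a                # "*" multiplies element by element; it is not a matrix product

RowTotals <- rowSums(a)   # 22 26 30
ColTotals <- colSums(a)   # 6 15 24 33
```

Keeping "%*%" and "*" apart matters: they give differently shaped results for the same inputs.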

matrices and arrays
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
Exercise: Print the values of the second column of a. Print the values of the second row of a. Print the value of the element in the lower left corner.

matrices and arrays
Matrices can also be formed by "gluing" rows and columns using cbind and rbind. This is the equivalent of c for vectors.
> x1 <- 1:4 # Define three vectors
> x2 <- 5:8
> y1 <- c(3,9)
> MyMatrix <- rbind(x1,x2)
> MyMatrix
   [,1] [,2] [,3] [,4]
x1    1    2    3    4
x2    5    6    7    8
> MyNewMatrix <- cbind(MyMatrix,y1)
> MyNewMatrix
           y1
x1 1 2 3 4  3
x2 5 6 7 8  9

factors
It is common to have categorical data in statistical data analysis (e.g. Male/Female). In R such variables are referred to as factors. This makes it possible to assign meaningful names to categories. A factor has a set of levels.
> Pain <- c(0,3,2,2,1)
> SevPain <- as.factor(c(0,3,2,2,1))
> levels(SevPain) <- c("none","mild","medium","severe")
> is.factor(SevPain)
[1] TRUE
> is.vector(SevPain)
[1] FALSE
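
An equivalent way to build the same factor in one step is factor() with explicit levels and labels. This sketch is an alternative to the two-step version on the slide, not a replacement for it; Counts is an illustrative name.

```r
Pain <- c(0, 3, 2, 2, 1)

# Map the scores 0..3 to named levels in a single call
SevPain <- factor(Pain, levels = 0:3,
                  labels = c("none", "mild", "medium", "severe"))

SevPain                    # none severe medium medium mild
levels(SevPain)            # "none" "mild" "medium" "severe"
Counts <- table(SevPain)   # number of observations per level
as.integer(SevPain)        # internal codes 1 4 3 3 2, not the original 0-3 scores
```

Stating the levels explicitly keeps the mapping between score and label visible in the code.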

lists
Lists can be used to combine objects (of possibly different kinds/sizes) into a larger composite object. The components of the list are named according to the arguments used.
> A <- c(31,32,40)
> S <- as.factor(c("F","M","M","F"))
> L <- c("London","School")
> MyFriends <- list(age=A,sex=S,meta=L)
> MyFriends
$age
[1] 31 32 40
$sex
[1] F M M F
Levels: F M
$meta
[1] "London" "School"
Components can be extracted with the double bracket operator [[ ]]. Alternatively, named components can be accessed with the "$" separator.
> MyFriends[[1]]
[1] 31 32 40
> MyFriends$age
[1] 31 32 40
Exercise: Combine Pain and SevPain into a list with a meaningful name.

data frames
A data frame is a matrix or a "set" of data. It is a list of vectors and/or factors of the same length that are related "across", such that data in the same position come from the same experimental unit (subject, animal, etc).
> Probands <- data.frame(age=c(31,32,40,50),sex=S)
> Probands
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> Probands$age
[1] 31 32 40 50
Why do we need data frames if they do the same as a list? More efficient storage, and indexing! R's read...() functions return data frames.
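
What "related across" buys you can be sketched as follows: selecting rows keeps each subject's values together. The data frame is rebuilt here so the example is self-contained; Older and MeanAgeBySex are illustrative names.

```r
Probands <- data.frame(
    age = c(31, 32, 40, 50),
    sex = factor(c("F", "M", "M", "F"))
)

str(Probands)                                 # one numeric column and one factor column
Older <- Probands[Probands$age > 35, ]        # whole rows: age and sex stay paired
MeanAgeBySex <- tapply(Probands$age, Probands$sex, mean)   # mean age within each sex
```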

names
Names of an R object can be accessed and/or modified with the names() function.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("a", "b", "c")
> x
a b c
1 2 3
> names(Probands)
[1] "age" "sex"
> names(Probands) <- c("age", "gender")
> names(Probands)[1] <- c("Age")
Tip: Give explicit names to variables. Names can be used for indexing.

indexing (extracting)
Indexing (> ?Extract) is a great way to directly access elements of interest.
> # Indexing a matrix
> MyNewMatrix[1,1]
[1] 1
> MyNewMatrix[1,]
            y1
 1  2  3  4  3
> MyNewMatrix[,1]
x1 x2
 1  5
> MyNewMatrix[,-2]
         y1
x1 1 3 4  3
x2 5 7 8  9
> # Indexing a list
> MyFriends[3]
$meta
[1] "London" "School"
> MyFriends[[3]]
[1] "London" "School"
> MyFriends[[3]][1]
[1] "London"
> # Indexing a data frame
> Probands[1,]
  Age gender
1  31      F
> Probands[2,]
  Age gender
2  32      M
> # Indexing a vector
> Pain <- c(0,3,2,2,1)
> Pain[1]
[1] 0
> Pain[2]
[1] 3
> Pain[1:2]
[1] 0 3
> Pain[c(1,3)]
[1] 0 2
> Pain[-5]
[1] 0 3 2 2

indexing by name
Names can also be used to index an R object.
> Probands["Age"]
  Age
1  31
2  32
3  40
4  50
> Probands[1]
  Age
1  31
2  32
3  40
4  50
> Probands[[1]]
[1] 31 32 40 50
> MyFriends$age
[1] 31 32 40
> MyFriends["age"]
$age
[1] 31 32 40
> MyFriends[["age"]]
[1] 31 32 40
Exercise: Can the results of "[ ]" and "[[ ]]" extractions both be used in vector operations?

conditional indexing
Indexing can be conditional on another variable.
> Pain; SevPain
[1] 0 3 2 2 1
[1] none   severe medium medium mild
Levels: none mild medium severe
> Age <- c(45,51,45,32,90)
> Pain[SevPain=="medium" | SevPain=="severe"]
[1] 3 2 2
> Pain[Age>32]
[1] 0 3 2 1
Note: the conditional variable does not have to be part of the same data object.
Exercise: Extract elements for "none" and for Age < 90.
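
Conditional indexing works because the condition is first evaluated to a logical vector, which then selects elements. A sketch with the same Pain and Age values; the names Mask, Both and Either and the thresholds are chosen only for illustration.

```r
Pain <- c(0, 3, 2, 2, 1)
Age <- c(45, 51, 45, 32, 90)

Mask <- Age > 32                       # TRUE TRUE TRUE FALSE TRUE
Pain[Mask]                             # 0 3 2 1, the same result as Pain[Age>32]
which(Mask)                            # positions that satisfy the condition: 1 2 3 5

Both <- Pain[Age > 40 & Pain >= 2]     # "&": both conditions must hold -> 3 2
Either <- Pain[Age > 80 | Pain == 0]   # "|": at least one must hold  -> 0 1
```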

data input
Normally, you would start your R session by reading in some data to be analysed. This can be done with the read.table function. Download the sample data to your local directory...
> GvHD <- read.table("GvHD.txt", header=TRUE)
> GvHD[1:10,]
   FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
1         321        199      308      220       157     339
2         303        210      319      271       223     350
3         318        170      215      148       119     221
4         202         49      104       49       284     178
5         353        248      262      167       144     156
6         192         68      423       97       344     113
7         322        225      236      214       141     209
8         350        152      258       82       253     205
9         351        223      286      128       172     220
10        269         78      169      289       224     537
Tip: Alternatively, use the RStudio GUI.
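
If the sample file is not at hand, read.table() can be tried on a small inline table through its text argument. This is a sketch: RawText copies only the first three rows and columns shown on the slide, and Small is an illustrative name.

```r
# A few values copied from the slide, held in a string instead of a file
RawText <- "FSC.Height SSC.Height CD4.FITC
321 199 308
303 210 319
318 170 215"

Small <- read.table(text = RawText, header = TRUE)   # header=TRUE: first line holds column names

dim(Small)                   # 3 rows, 3 columns
names(Small)                 # column names taken from the header line
summary(Small$FSC.Height)    # quick numeric summary of one column
```

After reading real data, checking dim() and names() is a cheap way to confirm the file was parsed as expected.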

functions and arguments
Many things in R are done using function calls: commands that look like an application of a mathematical function to one or several variables, e.g. log(x), plot(Weight,Height).
When you use plot(Weight, Height), R assumes that the first argument is the x variable and the second is the y. If you do not know how to specify the arguments, look at ?plot.
Most function arguments have sensible defaults and can thus be omitted, e.g. plot(Weight, Height, col=1).
If you do not specify the names of the arguments, R matches them by position, in the order given in the function's definition.
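
Positional matching, named arguments and defaults can be compared directly with log(), whose second argument is called base. A brief sketch; the result names are made up for illustration.

```r
ByPosition <- log(8, 2)                 # second positional argument is "base"
ByName <- log(x = 8, base = 2)          # the same call with explicit names
Reordered <- log(base = 2, x = 8)       # named arguments may come in any order
DefaultBase <- log(8)                   # base omitted: defaults to the natural logarithm
Rounded <- round(3.14159, digits = 2)   # naming an optional argument aids readability
```

Naming arguments costs a few keystrokes but protects against silently swapping them.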

libraries
Many contributed functionalities of R are available in R packages/libraries. Some of these are distributed with R while others need to be downloaded and installed separately.
> library(survival)
Loading required package: splines
> library(samr)
Error in library(samr) : there is no package called 'samr'
> install.packages("samr")
--- Please select a CRAN mirror for use in this session ---
also installing the dependencies ‘R.methodsS3’, ‘impute’, ‘matrixStats’
trying URL 'http://probability.ca/cran/bin/macosx/leopard/contrib/2.13/R.methodsS3_1.2.1.tgz'
Content type 'application/x-gzip' length 47709 bytes (46 Kb)
opened URL
==================================================
downloaded 46 Kb
[...]
The downloaded packages are in /var/folders/dq/dqPEEPbFGFWs6MKN40ApRU+++TI/-Tmp-//RtmpNDvKDp/downloaded_packages
> library(samr)
Loading required package: impute
Loading required package: matrixStats
Loading required package: R.methodsS3
R.methodsS3 v1.2.1 (2010-09-18) successfully loaded. See ?R.methodsS3 for help.
matrixStats v0.2.2 (2010-10-06) successfully loaded. See ?matrixStats for help.
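
In a script it can be convenient to check whether a package is available before loading it, rather than hitting the error shown above. A sketch, assuming the package of interest is survival (normally shipped with R); PackageName and IsInstalled are illustrative names.

```r
PackageName <- "survival"   # assumption: any installed package name would do here

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
IsInstalled <- requireNamespace(PackageName, quietly = TRUE)

if (IsInstalled) {
    library(PackageName, character.only = TRUE)   # character.only: the name is held in a variable
} else {
    message("Package not installed; run install.packages(\"", PackageName, "\") first")
}
```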

R programming: conditional statements
R is a full-featured programming language.
# if statement
> x <- -2
> if(x>0) {
+     print(x)
+ } else {
+     print(-x)
+ }
[1] 2
>
> if(x>0) {
+     print(x)
+ } else if(x==0) {
+     print(0)
+ } else {
+     print(-x)
+ }
[1] 2

R programming: loops
# for loop
n <- 1000000
x <- rnorm(n,10,1)
y <- x^2
y <- rep(0,n)
for (i in 1:n) {
    y[i] <- sqrt(x[i])
}
# while loop
Counter <- 1
while (Counter <= n) {
    y[Counter] <- sqrt(x[Counter])
    Counter <- Counter+1
}
Exercise: Apply sqrt() to x as a vector and compare execution speed.
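
One detail worth knowing when writing for loops: 1:n counts backwards when n is 0, so a loop over an empty vector still runs. The sketch below shows seq_along() as a common safeguard; it is a suggestion rather than something the loop above requires, and Empty and Iterations are illustrative names.

```r
x <- c(4, 9, 16)
y <- numeric(length(x))          # pre-allocate the result vector
for (i in seq_along(x)) {        # seq_along(x) yields 1, 2, 3
    y[i] <- sqrt(x[i])
}
y                                # 2 3 4

Empty <- numeric(0)
1:length(Empty)                  # 1 0: a loop over this would run twice
seq_along(Empty)                 # integer(0): a loop over this runs zero times

Iterations <- 0
for (i in seq_along(Empty)) {
    Iterations <- Iterations + 1 # never reached for an empty vector
}
```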

creating your own functions
Function objects can simply be assigned.
Oracle <- function() {
    WiseWords <- c(
        "Joy", "Plan", "Disappear", "Perhaps", "Sorrow", "Hope", "Change"
    )
    n <- sample(WiseWords, 1)
    return(n)
}
> Oracle()
[1] "Disappear"
Exercise: Write a function to return the inverse of a number. Warn if input == 0.
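
Functions usually take arguments, and arguments can have defaults. The sketch below uses a hypothetical helper, computeBMI(), which is not part of the course material; its default height of 1.75 is an arbitrary assumption made for the example.

```r
# Hypothetical helper for illustration only
computeBMI <- function(Weight, Height = 1.75) {
    if (any(Height <= 0)) {
        stop("Height must be positive")   # refuse input that makes no sense
    }
    return(Weight / Height^2)
}

computeBMI(60)                            # default height used: 19.59184
computeBMI(72, 1.80)                      # 22.22222
computeBMI(c(60, 72), c(1.75, 1.80))      # vectors work because the arithmetic is vectorized
```

Checking inputs at the top of a function keeps errors close to their cause.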
