340 likes | 514 Vues
Programming with R and Bioconductor for microarray expression analysis and interpretation. Ramsi Haddad Rhaddad@genetics.wayne.edu 577-2569. Outline. Lecture 1: Overview and background for R Lecture 2: BioC & microarray diagnostics Lecture 3: BioC analysis and interpretation
E N D
Programming with R and Bioconductor for microarray expression analysis and interpretation Ramsi Haddad Rhaddad@genetics.wayne.edu 577-2569
Outline • Lecture 1: Overview and background for R • Lecture 2: BioC & microarray diagnostics • Lecture 3: BioC analysis and interpretation • Lecture 4: Cluster computing on BlueGene
Lecture 1 Overview and background for R
What is R? • Open source software that implements the S programming language • can be used on the command line, or scripted • Available from www.r-project.org • Good documentation • Not so user friendly • “R is user-friendly, it just chooses its friends carefully”, Kyle Furge
Starting R • use Putty to connect to genetics.wayne.edu • then use ssh to connect to bluegene.biosci.wayne.edu • password is mgb8680 • type R to start • q() # this is how we shut down
Help! • If you have a workstation, use help.start() and a web browser will pop up • For us help(command) for manual pages • can also use ?command • Excellent documentation at web site • “Introductory statistics with R”, P. Dalgaard • great little book
Basics • R is command driven • It waits for you to give it instructions “>prompt” • R operates on data structures and objects • Data structures include vectors, lists, arrays (matrices) and dataframes • No scalars per se, only vectors with 1 element
Command or Assignment • Everything is either a functional command (I.e. statement)or an assignment • Assignment operator is arrow pointing to variable name (no space allowed within arrow!): “<-” • x <- 10 # I.e. x = (10) • y <- c(1,2,3) # c(), concatenate • z <- c(1:20) # help(rep, seq) • print(X) • won’t work since R is case sensitive: print(x) • [1] 10 #[1] index of first element # on line
Functions • 1 + 1 # answer is a vector [1] • z * 2 • jack <- z *2 • Jack <- 1:40 • length(Jack) • all functions include () even if there is nothing: list work space: ls() • if you type ls without (), R gives you the contents of the function. • some functions behave differently on different objects, polymorphic functions. • lots of built in functions • mean(Jack); median(Jack); sd(Jack)
Don’t use single character names • single letter object names cannot be: • c, q, t, C, D, F, I, T • try to use other variable names of at least 3 characters • I like to use names for quick stuff.
Vectors, intro to data structures • All data structures are objects that can be assigned to variable names • All objects have a mode and a length • Vector: an ordered collection of items (all of which are of the same mode) • jack <- c(1:10) • mode(jack) # mode is numeric • typeof(jack) # integer type • length(jack) # 10 • mode can be Character, numeric, complex, logical • typeof can be integer, double, complex etc.
Vector generating functions • seq(from, to, by=) # test #this out, look at help #page, ?seq • rep(x,times,length.out,each) # test out.
Vectorized Arithmetic • two vectors can be multiplied, one element at a time, if they are the same length • if they are different length, shorter vector is “recycled” • if longer vector is not a multiple of the recycled vector, there is a warning. • make two vectors of different length and test this out.
Logical Vectors • can be written out (rare) • a <- c(TRUE, TRUE, FALSE, TRUE) • can be generated • a <- x >10 # expect 10X FALSE, 10X TRUE • print(a); length(a) • logical operators: <, <=, ==, >, >=, !=
Arrays & Matrices • arrays are multi-dimensional vectors, matrices are two-dimensional vectors • can assign dimensions to vector z to create matrix • z <- 1:20; dim(z) <- c(5,4) # z matrix has 5 rows • table.z <- matrix(z, nrow=4, ncol=5) • table.z # make sure it worked the way we wanted. • lets look at ?matrix to see how else to do that • maybe add “byrow=TRUE”
Matrix Functions • dim(z) <- c(row, column) # previous slide • matricies have rownames & colnames # attributes • vectors use names • rownames(z) <- LETTERS[1:20] • t(z) # transposition so that columns become rows and row become colums • very important when generating a matrix from a matrix • rownames(z) <- c(paste(“row”, 1:20)) • what happens if you forget the terminal ) ?
Matrix Functions con’t • cbind, rbind # column bind and row bind • cart <- matrix(1:100, ncol=20, byrow=T) • horse <- 10:15 • cart.horse <- rbind(cart, horse) • this wouldn’t work with cbind, why? • rownames(cart.horse) <- NULL • # remove the evidence
Lists • Lists are generic vectors that can hold any data type (don’t all need to be char or logical or numeric • ramsi <- list(name=“ramsi”, age=36, married=TRUE, children=c(“jack”, “tommy”) • str(ramsi) # gives you structure of ramsi list • ramsi$children • # $ operator gets to variables within a list
Data Frame • table containing numbers and text • I.e. the matrix equivalent of a list • some columns can be numbers, others text, still others logical. • sounds like microarray data • probe set ID – character • expression values –numerical • treatment – logical…..
Numeric Indexing • vectors, matrices, lists • need to have some form of indexing. • use square brackets [] • z[1] # first element of z (there is no z[0]) • z[1:4] # sub-sequence of z • z[1,4] # no go, need to us z[c(1,4)] • I usually write closing bracket with opener, less confusing.
Logical Indexing of vectorsI.e. Conditional Selecion • z <- 1:50 • a <- z > 13 #make a logical vector • length(a) # still 50, just some F • z[a] #apply the logical vector • length(z[a]) # this is 37 • length(z[z>13]) • negative index drops item.
Logical Indexing of matricies • a must be the same length as z for this to work. • matrices have index [row, column] • table.z <- matrix(z, nrow=4, ncol=5, by.row=TRUE) • table.z[3,2] # should be a 12, verify it…. • table.z[2,] # all of row 2 • table.z[,4] # all of column 4 • indexes dropped with minus sign • Try this out
Looping through matrices • R has explicit loops like other languages • not very efficient, nobody uses them • apply(x, margin, fun, …) #fun must work on vectors • lets look at its help page, make a big table with rnorm() and take the mean of the rows and columns.
Functions • look at the help page for t.test • this is a typical statistical function • it requires an input and has an output • the input usually has default values • input can be defined by order of appearance (positional matching) or explicitly • always better to go explicit, expecially when scripting. • positional and named arguments can be mixed in the same function call.
More on the t.test() function • t.test(x, y, alternative=“two.sided”, mu=0, paired=FALSE, var.equal =FALSE, conf.level=0.95, …) • t.test(x,y,”two.sided”, ,,TRUE,0.99) # this is not readable!
t.test() results list • list with multiple values. • since it’s a list, use the $ operator to extract values • also possible to use list[[]], (not worth it) • catherine <- c(rnorm(10, 5, 2), rnorm(10, 3, 3)) #make a vector • ramsi <- t.test(catherine[1:10], catherine[11:20]) • ramsi # this shows the standard t.test() output • str(ramsi) # lets see the variables
library(multtest) • look at your handout, we’ll fiddle with it. • this is a library of routines or functions that someone has written. • must first be called with library(multtest) • one of the functions is mt.sample.teststat()
writing functions • my.function <- function(a,b,c) {this, that,return(whatever)} • cubism <- function(a = 3) { b <- (a*a*a) • return(b) } • also c(b) instead of return • cubism() # default is 27 • cubism(4) # better be 64!
another not-so-silly function • number to times a number (variable name number) is bigger than a series of numbers (variable name series) freq <- function(series=1:20, number=11){ instances <- series[series > number] answer <- length(instances) return(answer) }
Permuted t.test() • take a vector of gene expression data. • calculate a t-statistic for it using a t.test() • jumble the data and take another t.test() • rinse and repeat for all possibilities • how many times is the real t-statistic better than or equal to the permutations? • this is your real p-value • this permuted t-test is distribution free since you are using sampling to generate the actual distribution of the data instead of relying on some theoretical normal dist.
making your own function • homework assignment will be to generate at permuted.t.test() for microarray data • once this works, you’ll have it for ever. • you’ll forget how to use gene spring, but you’ll always have a record of your function or your entire data analysis. • I have only seen farther because I stood on the shoulder’s of giants. • use multtest….can it be done without multtest?
What is Bioconductor? • set of routines and functions to perform microarray analysis • all are used within R and can therefore be manipulated and altered. • tough to use without R • my analysis is about ½ and ½ R and Bioconductor libraries… • open group with mailing list.
Lecture 2 • you all have 24 arrays of muscle tissue from normal, obese and morbidly obese men • we want to read these all in, perform some QC/QA and see which genes are differentially expressed using a permuted t test.