Carlo Colantuoni carlo@illuminatobiotech

Summer Inst. of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis
1:30pm – 5:00pm in Room W2015
Carlo Colantuoni carlo@illuminatobiotech.com
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

Class Outline - Detailed
• Basic Biology & Gene Expression Analysis Technology
  • The Biology of Our Genome & Transcriptome
  • Genome and Transcriptome Structure & Databases
  • Gene Expression & Microarray Technology
• Data Preprocessing, Normalization, & QC
  • Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  • Background correction (PM-MM, RMA, GCRMA)
  • Global Mean Normalization
  • Loess Normalization
  • Quantile Normalization (RMA & GCRMA)
  • Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  • Quality Control: PCA and MDS for dimension reduction
  • SVA: Surrogate Variable Analysis
• Measures of Differential Expression
  • Basic Statistical Concepts
  • T-tests and Associated Problems
  • Significance Analysis of Microarrays (SAM) [& Empirical Bayes]
  • Complex ANOVAs (limma package in R)
• Multiple Comparison Problem
  • Bonferroni
  • False Discovery Rate Analysis (FDR)
• Differential Expression of Functional Gene Groups
  • Functional Annotation of the Genome
  • Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
  • Gene Set Enrichment Analysis (GSEA)
  • Parametric Analysis of Gene Set Enrichment (PAGE)
  • geneSetTest
• Notes on Experimental Design
• Clustering and Classification
  • Hierarchical clustering
  • K-means
  • Classification
  • LDA (PAM), kNN, Random Forests
  • Cross-Validation
• Additional Topics
  • eQTL (expression + SNPs)
  • Next-Gen Sequencing data: RNAseq, ChIPseq
  • Epigenetics?
• The R Statistical Language: http://www.r-project.org/
• Bioconductor: http://www.bioconductor.org/docs/install/
• Affymetrix data processing example

DAY #3:
• Measures of Differential Expression:
  • Review of basic statistical concepts
  • T-tests and associated problems
  • Significance Analysis of Microarrays (SAM)
  • (Empirical Bayes)
  • Complex ANOVAs (“limma” package in R)
• Multiple Comparison Problem:
  • Bonferroni
  • FDR
• Differential Expression of Functional Gene Groups
• Notes on Experimental Design

Slides from Rob Scharpf

Fold-Change? T-Statistics? Some genes are more variable than others.


X1 − X2 is normally distributed if X1 and X2 are normally distributed. Is this the case in microarray data? (Slides from Rob Scharpf)

Problem 1: T-statistic not t-distributed. Implication: p-values/inference incorrect

P-values by permutation • The assumptions used to derive a statistic's null distribution often do not hold well enough to yield useful p-values (e.g. when t-statistics are not t-distributed). • An alternative is to use permutations.

p-values by permutations
We focus on one gene only. For the bth iteration, b = 1, …, B:
• Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the remaining n2 as “controls”.
• Calculate the corresponding two-sample t-statistic, tb, on the permuted data.
After all B permutations are done: p = #{b : |tb| ≥ |tobserved|} / B
This does not yet address the issue of multiple tests!
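The permutation procedure for a single gene can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the example values, and the choice of the Welch form of the two-sample t-statistic are assumptions made for the sketch.

```python
import math
import random
import statistics

def t_statistic(treat, ctrl):
    """Two-sample t-statistic (Welch form; an assumption of this sketch)."""
    se = math.sqrt(statistics.variance(treat) / len(treat)
                   + statistics.variance(ctrl) / len(ctrl))
    return (statistics.mean(treat) - statistics.mean(ctrl)) / se

def permutation_p_value(treat, ctrl, n_perm=2000, seed=0):
    """p = #{b : |t_b| >= |t_observed|} / B for one gene."""
    rng = random.Random(seed)
    t_obs = t_statistic(treat, ctrl)
    pooled = list(treat) + list(ctrl)
    n1 = len(treat)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the n data points
        t_b = t_statistic(pooled[:n1], pooled[n1:])  # first n1 = "treatments"
        if abs(t_b) >= abs(t_obs):
            hits += 1
    return hits / n_perm

# Hypothetical log-expression values for one gene
treat = [8.1, 7.9, 8.4, 8.6, 8.2]
ctrl = [7.0, 7.2, 6.9, 7.4, 7.1]
print(permutation_p_value(treat, ctrl))
```

With only a few samples per group the number of distinct permutations is small, which limits how small the permutation p-value can get.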

Another problem with t-tests: The volcano plot shows, for a particular test, the negative log p-value against the effect size (M).

Remember this?

Problem 2: t-statistic bigger for genes with smaller standard error estimates. Implication: Ranking might not be optimal

Problem 2 • With low N’s, SD estimates are unstable • Solutions: • Significance Analysis of Microarrays (SAM) • Empirical Bayes methods and Stein estimators

Significance Analysis of Microarrays (SAM) • A clever adaptation of the t-ratio to borrow information across genes • Implemented in Bioconductor in the siggenes package • Reference: Tusher et al., “Significance analysis of microarrays applied to the ionizing radiation response”, PNAS 2001

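SAM d-statistic • For gene i: $d_i = (\bar{x}_{i1} - \bar{x}_{i2}) / (s_i + s_0)$, where $\bar{x}_{i1}$ = mean of sample 1, $\bar{x}_{i2}$ = mean of sample 2, $s_i$ = standard deviation of repeated measurements for gene i, $s_0$ = exchangeability factor estimated using all genes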

The exchangeability factor (s0) is chosen to minimize the average coefficient of variation (CV) of d across all genes.
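A minimal Python sketch of the d-statistic for one gene follows. It assumes a pooled two-sample standard error for the gene-specific term, and it takes the exchangeability factor `s0` as a number supplied by the caller; estimating `s0` from all genes is not shown. The function name and the example values are hypothetical.

```python
import math
import statistics

def sam_d(group1, group2, s0):
    """d = (mean1 - mean2) / (s_i + s0) for a single gene."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    # Pooled sum of squares around each group mean
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    # Gene-specific scatter s_i (pooled standard error; an assumption here)
    s_i = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s_i + s0)

# A gene with a tiny difference but very small variance (hypothetical values)
g1 = [5.00, 5.01, 5.02]
g2 = [4.98, 4.99, 5.00]
print(sam_d(g1, g2, s0=0.0))  # ordinary t-statistic
print(sam_d(g1, g2, s0=0.1))  # damped by the exchangeability factor
```

Adding `s0` to the denominator keeps genes with near-zero variance from dominating the ranking.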

A fix for this problem: [Figure: scatter plots of relative difference (d) vs. standard deviation (s) of repeated expression measurements. Panels: relative difference for a permutation of the data that was balanced between cell lines 1 and 2; random fluctuations in the data, measured by balanced permutations (for cell lines 1 and 2).]

eBayes: Borrowing Strength • An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes • Empirical Bayes gives us a formal way of doing this • “Shrinkage” of variance estimates toward a “prior”: moderated t-statistics, which damp extreme statistics caused by small variance estimates. • Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple two-sample designs.
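The shrinkage idea can be sketched in a few lines of Python. The posterior variance below is a degrees-of-freedom weighted average of the gene's own variance and a prior variance. In this sketch the prior variance and prior degrees of freedom are simply passed in as hypothetical numbers; limma estimates them from all genes, and that estimation step is not shown. The function and parameter names are illustrative only.

```python
import math

def moderated_t(diff, s2_gene, df_gene, s2_prior, df_prior, n1, n2):
    """Two-sample t-statistic with the gene variance shrunk toward a prior."""
    # Weighted average of prior and gene-specific variance estimates
    s2_post = (df_prior * s2_prior + df_gene * s2_gene) / (df_prior + df_gene)
    se = math.sqrt(s2_post * (1 / n1 + 1 / n2))
    return diff / se

# Gene with a small difference and a very small variance estimate (hypothetical)
ordinary = moderated_t(0.02, 0.0001, 4, s2_prior=0.04, df_prior=0, n1=3, n2=3)
shrunk = moderated_t(0.02, 0.0001, 4, s2_prior=0.04, df_prior=4, n1=3, n2=3)
print(ordinary, shrunk)
```

Setting `df_prior=0` recovers the ordinary pooled t-statistic, which makes the effect of the prior easy to see.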

The Multiple Comparison Problem (some slides courtesy of John Storey)

Hypothesis Testing
• Test for each gene: Null Hypothesis: no differential expression.
• Two types of errors can be committed:
  • Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis).
  • Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis).

Hypothesis testing: Once you have a given score for each gene, how do you decide on a cut-off? p-values are most common. How do we decide on a cut-off when we are looking at many 1000’s of “tests”? Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?

Multiple Comparison Problem • Even if we have good approximations of our p-values, we still face the multiple comparison problem. • When performing many independent tests, p-values no longer have the same interpretation.
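A quick calculation shows the scale of the problem. The sketch below assumes every null hypothesis is true and the tests are independent; the function names and the test counts are chosen for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when all nulls are true."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more false positives across independent tests."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 20, 1000, 10000):
    print(m,
          expected_false_positives(m, 0.05),
          round(prob_at_least_one_false_positive(m, 0.05), 4))
```

At a cut-off of 0.05, a list of 10,000 genes with no real differential expression would still be expected to produce hundreds of "significant" genes.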

Bonferroni Procedure
• α = 0.05
• # Tests = 1000
• Adjusted α = 0.05 / 1000 = 0.00005
• or, equivalently, adjusted p = p × 1000 (capped at 1)
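The two equivalent forms of the Bonferroni correction can be written as a small Python sketch; the function names and the example p-values are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cut-off."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Adjusted p-values: p * (number of tests), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

print(bonferroni_threshold(0.05, 1000))
print(bonferroni_adjust([0.00001, 0.01, 0.5]))
```

Comparing raw p-values with the adjusted threshold, or adjusted p-values with the original α, calls the same genes significant.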

Bonferroni Procedure: Too conservative. How else can we interpret many 1000’s of observed statistics? Instead of evaluating each statistic individually, can we assess a list of statistics? FDR (Benjamini & Hochberg 1995)

FDR • Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false. Null = Equivalent Expression; Alternative = Differential Expression

False Discovery Rate • The “false discovery rate” measures the proportion of false positives among all genes called significant: FDR = (# false positives) / (# genes called significant) • This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives • The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends

[Figure: Distribution of Statistics, N=90. Curves: Permuted, Observed. x-axis: Statistic]

[Figure: Distribution of Statistics. FDR = False Pos. / Total Pos. = Permuted / Observed (counts beyond the cut-off). x-axis: Statistic]

[Figure: Distribution of p-values, N=90. Curves: Observed, Permuted. x-axis: p-value]

SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.


Selected genes: Beyond expected distribution

FDR = False Positives / Total Positive Calls
• This FDR analysis requires enough samples in each condition to estimate a statistic for each gene: observed statistic distribution.
• And enough samples in each condition to permute many times and recalculate this statistic: null statistic distribution.
• What if we don’t have this?
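The ratio above can be estimated from observed and permuted statistics roughly as in this Python sketch. It assumes one statistic per gene, a list of permuted statistic vectors (one per permutation), and a symmetric cut-off on the absolute value. The function name and all numbers are hypothetical.

```python
def permutation_fdr(observed, permuted, cutoff):
    """Estimated FDR = mean # permuted stats beyond cutoff / # observed beyond cutoff."""
    called = sum(1 for d in observed if abs(d) >= cutoff)
    if called == 0:
        return 0.0  # convention for this sketch: no calls, no false discoveries
    false_counts = [sum(1 for d in perm if abs(d) >= cutoff) for perm in permuted]
    expected_false = sum(false_counts) / len(permuted)
    return min(1.0, expected_false / called)

observed = [3.1, -2.8, 0.4, 0.2, -0.1, 2.5, 0.3, -0.6]
permuted = [
    [0.5, -2.1, 0.3, 0.1, -0.4, 1.2, 0.2, -0.9],
    [1.1, -0.7, 0.6, -0.2, 0.3, 0.8, -1.5, 0.4],
]
print(permutation_fdr(observed, permuted, cutoff=2.0))
```

In practice many more permutations and genes would be used; two permutations of eight genes are shown only to keep the arithmetic easy to follow.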

[Figure annotation: FDR = 0.05 for statistics beyond ±0.9]

False Positive Rate versus False Discovery Rate • False positive rate is the rate at which truly null genes are called significant • False discovery rate is the rate at which significant genes are truly null

False Positive Rate and P-values • The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate) • P-value is defined to be the minimum false positive rate at which the statistic can be called significant • Can be described as the probability a truly null statistic is “as or more extreme” than the observed one

False Discovery Rate and Q-values • The q-value is a measure of significance in terms of the false discovery rate • Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant • Can be described as the probability a statistic “as or more extreme” is truly null
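One common way to turn p-values into FDR-based significance measures is the Benjamini-Hochberg adjustment, sketched below in Python. The adjusted values correspond to q-values under the conservative assumption that all genes are null; q-value methods that first estimate the proportion of null genes are not shown. The function name and example p-values are illustrative.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

print(bh_q_values([0.01, 0.04, 0.03, 0.5]))
```

Calling every gene with an adjusted value at or below 0.05 significant targets an FDR of about 5% for the resulting list.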

Power and Sample Size Calculations are Hard
• Need to specify:
  • α (Type I error rate, false positives) or FDR
  • σ (stdev: will be sample- and gene-specific)
  • Effect size (how do we estimate?)
  • Power (1 − β, β = Type II error rate)
  • Sample Size
• Some papers:
  • Mueller, Parmigiani et al. JASA (2004)
  • Rich Simon’s group Biostatistics (2005)
  • Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.
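For a single gene, the quantities listed above can be tied together with the textbook normal-approximation formula for a two-sided, two-sample comparison. The Python sketch below is only a rough guide under that approximation; it is not the method of the papers cited, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_group(alpha, power, sigma, effect_size):
    """n per group = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / effect_size)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sigma / effect_size) ** 2
    return math.ceil(n)

# Effect size equal to one standard deviation, 80% power
print(samples_per_group(alpha=0.05, power=0.8, sigma=1.0, effect_size=1.0))
# A stricter per-gene alpha, as a multiple-testing correction would require
print(samples_per_group(alpha=0.001, power=0.8, sigma=1.0, effect_size=1.0))
```

Because σ and the effect size differ from gene to gene, any single number from a formula like this is a planning aid rather than a guarantee.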

Beyond Individual Genes: Functional Gene Groups • Borrow statistical power across entire dataset • Integrate preexisting biological knowledge
