## Carlo Colantuoni carlo@illuminatobiotech

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Summer Inst. of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis**

1:30pm – 5:00pm in Room W2015

Carlo Colantuoni, carlo@illuminatobiotech.com

http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

**Class Outline**

- Basic Biology & Gene Expression Analysis Technology
- Data Preprocessing, Normalization, & QC
- Measures of Differential Expression
- Multiple Comparison Problem
- Clustering and Classification
- The R Statistical Language and Bioconductor
- GRADES: independent project with Affymetrix data

**Class Outline - Detailed**

- Basic Biology & Gene Expression Analysis Technology
  - The Biology of Our Genome & Transcriptome
  - Genome and Transcriptome Structure & Databases
  - Gene Expression & Microarray Technology
- Data Preprocessing, Normalization, & QC
  - Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  - Background correction (PM-MM, RMA, GCRMA)
  - Global Mean Normalization
  - Loess Normalization
  - Quantile Normalization (RMA & GCRMA)
  - Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  - Quality Control: PCA and MDS for dimension reduction
  - SVA: Surrogate Variable Analysis
- Measures of Differential Expression
  - Basic Statistical Concepts
  - T-tests and Associated Problems
  - Significance analysis in microarrays (SAM) [& Empirical Bayes]
  - Complex ANOVAs (limma package in R)
- Multiple Comparison Problem
  - Bonferroni
  - False Discovery Rate Analysis (FDR)
- Differential Expression of Functional Gene Groups
  - Functional Annotation of the Genome
  - Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
  - Gene Set Enrichment Analysis (GSEA)
  - Parametric Analysis of Gene Set Enrichment (PAGE)
  - geneSetTest
- Notes on Experimental Design
- Clustering and Classification
  - Hierarchical clustering
  - K-means
  - Classification
    - LDA (PAM), kNN, Random Forests
    - Cross-Validation
- Additional Topics
  - eQTL (expression + SNPs)
  - Next-Gen Sequencing data: RNAseq, ChIPseq
  - Epigenetics?
- The R Statistical Language: http://www.r-project.org/
- Bioconductor: http://www.bioconductor.org/docs/install/
- Affymetrix data processing example

**DAY #3: Measures of Differential Expression**

- Review of basic statistical concepts
- T-tests and associated problems
- Significance analysis in microarrays (SAM)
- (Empirical Bayes)
- Complex ANOVAs (“limma” package in R)
- Multiple Comparison Problem:
  - Bonferroni
  - FDR
- Differential Expression of Functional Gene Groups
- Notes on Experimental Design

**Fold-Change? T-Statistics?**

Some genes are more variable than others.

[Figure: "distribution of" panels. Slides from Rob Scharpf.]

X1 - X2 is normally distributed if X1 and X2 are normally distributed. Is this the case in microarray data? (Slides from Rob Scharpf.)

**Problem 1: T-statistic not t-distributed**

Implication: p-values/inference incorrect.

**P-values by permutation**

- The assumptions used to derive the statistics often do not hold closely enough to yield useful p-values (e.g. when T-statistics are not t-distributed).
- An alternative is to use permutations.

**p-values by permutations**

We focus on one gene only. For the $b$th iteration, $b = 1, \ldots, B$:

1. Permute the $n$ data points for the gene ($x$). The first $n_1$ are referred to as “treatments”, the second $n_2$ as “controls”.
2. Calculate the corresponding two-sample t-statistic, $t_b$.

After all $B$ permutations are done:

$$p = \frac{\#\{\, b : |t_b| \ge |t_{\text{observed}}| \,\}}{B}$$

This does not yet address the issue of multiple tests!

**Volcano plot**

The volcano plot shows, for a particular test, the negative log p-value against the effect size (M).
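The single-gene permutation procedure above can be sketched in a few lines. This is a minimal illustration, not the course's own code (the course works in R): the function names `t_statistic` and `permutation_p_value`, the choice of the Welch (unequal-variance) form of the two-sample t-statistic, the default of 2000 permutations, and the example values are all assumptions made here for illustration.

```python
import math
import random
import statistics


def t_statistic(treatment, control):
    """Two-sample t-statistic (Welch form, an assumption of this sketch)."""
    n1, n2 = len(treatment), len(control)
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treatment) / n1
                   + statistics.variance(control) / n2)
    if se == 0.0:
        # Degenerate case: both groups are constant.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(treatment, control, n_perm=2000, seed=0):
    """p = #{b : |t_b| >= |t_observed|} / B for a single gene."""
    rng = random.Random(seed)
    n1 = len(treatment)
    pooled = list(treatment) + list(control)
    t_obs = abs(t_statistic(treatment, control))
    hits = 0
    for _ in range(n_perm):
        # Shuffle the labels: the first n1 values play "treatments",
        # the remaining n2 play "controls".
        rng.shuffle(pooled)
        t_b = t_statistic(pooled[:n1], pooled[n1:])
        if abs(t_b) >= t_obs:
            hits += 1
    return hits / n_perm


# Hypothetical log-expression values for one gene.
treated = [8.1, 7.9, 8.4, 8.0, 8.3]
controls = [7.2, 7.0, 7.4, 7.1, 6.9]
print(permutation_p_value(treated, controls))
```

With only five samples per group there are just 252 distinct label assignments, so the smallest achievable p-value is bounded away from zero. This is one reason small experiments limit what permutation can deliver.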
**Another problem with t-tests**

**Problem 2: t-statistic bigger for genes with smaller standard error estimates**

Implication: Ranking might not be optimal.

**Problem 2**

- With small N, SD estimates are unstable
- Solutions:
  - Significance Analysis in Microarrays (SAM)
  - Empirical Bayes methods and Stein estimators

**Significance analysis in microarrays (SAM)**

- A clever adaptation of the t-ratio to borrow information across genes
- Implemented in Bioconductor in the siggenes package

Significance analysis of microarrays applied to the ionizing radiation response, Tusher et al., PNAS 2001

**SAM d-statistic**

For gene $i$:

$$d_i = \frac{\bar{x}_{i,1} - \bar{x}_{i,2}}{s_i + s_0}$$

- $\bar{x}_{i,1}$: mean of sample 1
- $\bar{x}_{i,2}$: mean of sample 2
- $s_i$: standard deviation of repeated measurements for gene $i$
- $s_0$: exchangeability factor estimated using all genes

**Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements**

[Figure panels: a fix for this problem; relative difference for a permutation of the data that was balanced between cell lines 1 and 2; random fluctuations in the data, measured by balanced permutations (for cell lines 1 and 2).]

**eBayes: Borrowing Strength**

- An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes
- Empirical Bayes gives us a formal way of doing this
- “Shrinkage” of variance estimates toward a “prior” yields moderated t-statistics, which eliminates extreme statistics caused by small variance estimates.
- Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple two-sample designs.

**The Multiple Comparison Problem**

(some slides courtesy of John Storey)

**Hypothesis Testing**

- Test for each gene: Null Hypothesis: no differential expression.
- Two types of errors can be committed:
  - Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis).
  - Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis).

**Hypothesis testing**

Once you have a given score for each gene, how do you decide on a cut-off? p-values are most common. How do we decide on a cut-off when we are looking at many 1000’s of “tests”? Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?

**Multiple Comparison Problem**

- Even if we have good approximations of our p-values, we still face the multiple comparison problem.
- When performing many independent tests, p-values no longer have the same interpretation.

**Bonferroni Procedure**

- $\alpha = 0.05$, number of tests = 1000
- Adjusted threshold: $\alpha / 1000 = 0.05 / 1000 = 0.00005$
- or, equivalently, adjust each p-value: $p_{\text{adj}} = p \times 1000$

**Bonferroni Procedure**

Too conservative. How else can we interpret many 1000’s of observed statistics? Instead of evaluating each statistic individually, can we assess a list of statistics: FDR (Benjamini & Hochberg 1995)

**FDR**

- Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false.

Null = Equivalent Expression; Alternative = Differential Expression

**False Discovery Rate**

- The “false discovery rate” measures the proportion of false positives among all genes called significant: FDR = (# false positives) / (# genes called significant)
- This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives
- The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends

**Distribution of Statistics**

[Figure: permuted and observed distributions of the statistic, N=90.]

**Distribution of Statistics**

False Pos. / Total Pos.
= FDR

[Figure: permuted and observed distributions of the statistic.]

**Distribution of p-values**

[Figure: observed and permuted distributions of p-values, N=90.]

SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.

**Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements**

[Figure panels: a fix for this problem; relative difference for a permutation of the data that was balanced between cell lines 1 and 2; random fluctuations in the data, measured by balanced permutations (for cell lines 1 and 2).]

**FDR = False Positives / Total Positive Calls**

This FDR analysis requires enough samples in each condition to estimate a statistic for each gene (the observed statistic distribution), and enough samples in each condition to permute many times and recalculate this statistic (the null statistic distribution). What if we don’t have this?

**FDR = 0.05**

Beyond ±0.9

**False Positive Rate versus False Discovery Rate**

- False positive rate is the rate at which truly null genes are called significant
- False discovery rate is the rate at which significant genes are truly null

**False Positive Rate and P-values**

- The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate)
- P-value is defined to be the minimum false positive rate at which the statistic can be called significant
- Can be described as the probability a truly null statistic is “as or more extreme” than the observed one

**False Discovery Rate and Q-values**

- The q-value is a measure of significance in terms of the false discovery rate
- Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant
- Can be described as the probability a statistic “as or more extreme” is truly null

**Power and Sample Size Calculations are Hard**

- Need to specify:
  - α (Type I error rate, false positives) or FDR
  - σ (stdev: will be sample- and gene-specific)
  - Effect size (how do we estimate?)
  - Power (1 - β, β = Type II error rate)
  - Sample Size
- Some papers:
  - Mueller, Parmigiani et al. JASA (2004)
  - Rich Simon’s group. Biostatistics (2005)
  - Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.

**Beyond Individual Genes: Functional Gene Groups**

- Borrow statistical power across entire dataset
- Integrate preexisting biological knowledge
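As a recap of the Bonferroni and FDR slides from the multiple comparison part of this session, the two adjustments can be compared side by side on a list of p-values. The sketch below is an illustration only: the function names `bonferroni` and `benjamini_hochberg` and the ten example p-values are made up here, and the FDR adjustment shown is the Benjamini & Hochberg (1995) step-up procedure, not the permutation-based FDR estimate that SAM uses.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * (number of tests), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)
print(sum(q <= 0.05 for q in bonf), "genes pass Bonferroni at 0.05")
print(sum(q <= 0.05 for q in bh), "genes pass BH at 0.05")
```

With 1000 tests, `bonferroni` reproduces the slide's rule $p_{\text{adj}} = p \times 1000$. The BH adjustment multiplies each p-value by (number of tests) / (its rank), which is why it is less conservative for all but the smallest p-value.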