Differential expression and testing

Consider a case where we have observed two genes with estimated fold changes of 2. Is this worth reporting? Some journals require measures of statistical significance (“p-values”).

Repeated experiment

Back to Basics If we have two measurements X, Y, then Y-X may have a different distribution under the null hypothesis for different genes. More specifically, the standard deviation σ of Y-X may be different. We could consider (Y-X)/σ instead. But we do not know σ! What is σ? Why is it not 0? How about taking samples and using the t-statistic?

Back to Basics Observations: Averages: SD² or variances:

Back to Basics t - statistic:
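
In standard textbook notation, these quantities can be written as follows. This is a sketch under assumptions: N replicates per group, and the symbols X_i, Y_i, s_X, s_Y are chosen here for illustration and are not taken from the slides.

```latex
% Observations: N replicate measurements per condition (assumed notation)
X_1, \dots, X_N \quad \text{and} \quad Y_1, \dots, Y_N

% Averages
\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i , \qquad
\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i

% SD^2 or variances
s_X^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 , \qquad
s_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2

% t-statistic: difference of averages divided by its estimated standard error
t = \frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_Y^2}{N} + \dfrac{s_X^2}{N}}}
```

The denominator plays the role of the unknown σ in (Y-X)/σ: it is an estimate of the standard deviation of the difference of averages.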

Back to Basics If the number of replicates N is very large, the t-statistic is normally distributed with mean 0 and SD of 1. If the observed data is normally distributed, then the t-statistic follows a t-distribution regardless of sample size. We can then compute the probability that the t-statistic is as extreme or more extreme when the null hypothesis is true: the p-value. Where does this probability come from? We will see that the t-statistic is not a good strategy when N is small…

Estimating the variance The t-test compares the difference between group means to the standard deviation of the data within groups. The F-test is a generalization of this idea to more than 2 groups. But with few replicates, estimates of the SE are not stable. This explains why the t-test is not powerful. There are many proposals for estimating variation. Many share information across genes. Empirical Bayesian approaches are popular. SAM, an ad-hoc procedure, is even more popular. Many are what some call “moderated” t-tests.

Some Examples of Tests Notation: T is the average log expression in Tx; C is the average log expression in Control; S is the SD. Note that taking the log before the average is important. Tests: Average log fold-change: (T-C). t-statistic: (T-C) / S. SAM shrunken t-statistic: (T-C) / (S + S0). Bayesian posteriors: (T-C) / √(S² + K²). Wilcoxon rank test. Ad-hoc pairwise comparison: no formula.
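
The formula-based scores above can be sketched for a single gene as follows. This is a minimal illustration, not the SAM or empirical Bayes software: the function name `gene_scores` is hypothetical, S is taken here to be the estimated standard error of T-C (an assumption, since the slide only calls it "SD"), and `s0` and `k` are plain user-supplied constants standing in for S0 and K, whereas the real methods estimate them from all genes.

```python
import math
import statistics


def gene_scores(tx, ctl, s0=0.0, k=0.0):
    """Score one gene from log-expression replicates.

    tx, ctl: lists of log-scale expression values (treatment, control).
    s0, k:   illustrative constants for the SAM-style and Bayesian-style
             denominators (assumed inputs, not estimated here).
    """
    T = statistics.mean(tx)   # average log expression in treatment
    C = statistics.mean(ctl)  # average log expression in control
    # Assumed definition of S: standard error of T - C (unpooled variances).
    # S is zero if all replicates in both groups are identical.
    S = math.sqrt(statistics.variance(tx) / len(tx)
                  + statistics.variance(ctl) / len(ctl))
    diff = T - C
    return {
        "log_fold_change": diff,
        "t": diff / S,
        "sam": diff / (S + s0),
        "bayes": diff / math.sqrt(S ** 2 + k ** 2),
    }


scores = gene_scores([8.1, 8.4, 8.2], [7.0, 7.3, 7.1], s0=0.1, k=0.1)
```

With positive `s0` or `k`, the shrunken scores are smaller in absolute value than the plain t-statistic, which is what protects against genes whose S happens to be tiny by chance.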

One final problem Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate? For each gene, test the null hypothesis: no differential expression. Notice that if you look at 10,000 genes for which the null is true, you expect to see 500 attain p-values below 0.05. This is called the multiple comparison problem. Statisticians fight about it. But not about the above. Main message: p-values can’t be interpreted in the usual way. A popular solution is to report FDR instead.
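
The "500 out of 10,000" figure can be checked with a small simulation. This is a sketch under simplifying assumptions: every gene is null, the data are standard normal with known SD of 1, and a two-sided z-test is used instead of a t-test so that only the Python standard library is needed. The function name `count_null_hits` and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def count_null_hits(m=10_000, n=5, alpha=0.05, seed=1):
    """Count genes with p <= alpha when no gene is differentially expressed."""
    rng = random.Random(seed)
    std_normal = NormalDist()       # standard normal, mean 0 and SD 1
    se = math.sqrt(2.0 / n)         # SE of a difference of two means, known SD = 1
    hits = 0
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # "control" replicates
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]   # "treatment" replicates
        z = (sum(y) / n - sum(x) / n) / se
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))      # two-sided p-value
        if p <= alpha:
            hits += 1
    return hits


hits = count_null_hits()
print(hits, "of 10000 null genes have p <= 0.05")
```

The count should land near m × alpha = 500, even though none of these genes is truly differentially expressed.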

Multiple Hypothesis Testing • What happens if we call all genes significant with p-values ≤ 0.05, for example? Null = Equivalent Expression; Alternative = Differential Expression

Error Rates Here m is the number of hypotheses tested, R is the number of hypotheses rejected (called significant), and V is the number of Type I errors (false positives) among them. • Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses, PCER = E(V)/m • Per family error rate (PFER): the expected number of Type I errors, PFER = E(V) • Family-wise error rate (FWER): the probability of at least one Type I error, FWER = Pr(V ≥ 1) • False discovery rate (FDR): the rate at which false discoveries occur, FDR = E(V/R; R>0) = E(V/R | R>0) Pr(R>0) • Positive false discovery rate (pFDR): the rate at which discoveries are false, pFDR = E(V/R | R>0).
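
To make the quantities inside these expectations concrete, the following sketch tallies them for a single experiment in which the truth is known (as in a simulation). The function name `tally_errors` and the toy p-values are hypothetical. The error rates themselves are expectations or probabilities over many repetitions, so one tally only gives the realized values.

```python
def tally_errors(pvalues, is_null, alpha=0.05):
    """Tally realized error counts for one experiment with known truth.

    pvalues: one p-value per gene.
    is_null: True where the gene is truly not differentially expressed.
    """
    m = len(pvalues)
    rejected = [p <= alpha for p in pvalues]
    R = sum(rejected)                                      # genes called significant
    V = sum(r and h0 for r, h0 in zip(rejected, is_null))  # Type I errors
    return {
        "m": m,
        "R": R,
        "V": V,
        "V_over_m": V / m,                     # averages to PCER over repetitions
        "any_type_I": V >= 1,                  # its frequency over repetitions is FWER
        "V_over_R": V / R if R > 0 else 0.0,   # averages to FDR (taken as 0 when R = 0)
    }


# Toy example with made-up p-values: two true effects, four null genes.
result = tally_errors(
    pvalues=[0.001, 0.02, 0.04, 0.30, 0.03, 0.60],
    is_null=[False, False, True, True, True, True],
)
print(result)
```

In this toy example four genes are called significant and two of them are null, so half of the discoveries are false even though every reported p-value is below 0.05.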

p >> n Goal: find statistically significant associations of biological conditions or phenotypes with gene expression. Consider the two class problem. Data: n (10…100) points in a p-dimensional (5000…30000) space. Problem: There are infinitely many ways to separate the space into two regions by a hyperplane such that the two groups are perfectly separated. This is a simple geometrical fact and holds as long as n<p!

p >> n Problem: If I find such a perfectly separating hyperplane, it doesn’t mean anything. It is not surprising. It is not a significant finding. I would always find it, no matter how random the data are! Answer: regularization. Rather than searching in the huge space of all hyperplanes in p-dimensional space, restrict ourselves to a much smaller space. Two major approaches: - only the hyperplanes perpendicular to one of the p coordinate axes → gene-by-gene discrimination, gene-by-gene hypothesis testing. - any other reasonable, not too complex set of hypersurfaces → machine learning
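
The claim that pure noise can always be separated perfectly when n < p can be checked numerically. The sketch below uses a plain perceptron as the separating-hyperplane finder; that choice, the function name `perceptron_separates`, and the sizes n = 10 and p = 50 are assumptions made for illustration, not something the slides prescribe.

```python
import random


def perceptron_separates(n=10, p=50, seed=0, max_epochs=1000):
    """Fit a perceptron to random data with random labels.

    Returns True if a hyperplane through the origin classifies
    all n training points correctly.
    """
    rng = random.Random(seed)
    # n random points in p dimensions: pure noise, no real group structure
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # class labels assigned by coin flip
    y = [rng.choice([-1, 1]) for _ in range(n)]
    w = [0.0] * p
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score <= 0:                       # misclassified point
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:                             # perfect separation reached
            return True
    return False


print(all(perceptron_separates(seed=s) for s in range(5)))
```

Every random data set is separated perfectly although the labels carry no information, which is why such a separation is not a finding in itself.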

Gene by gene tests: t-test; Wilcoxon; F-test / more complex linear models; Cox regression. Problem: treating each gene independently of the others wastes information: many properties may be shared among genes, e.g. their within-group variability.

A useful plot The volcano plot shows, for a particular test, the negative log p-value against the effect size (M).

MA and volcano

Complications • What is the distribution of the SAM statistic? • How about the t-statistic: is it really t-distributed? • How can we get p-values when we don’t know the distribution?

p-values by permutations We focus on one gene only. For the bth iteration, b = 1, …, B: permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”. Calculate the corresponding two-sample t-statistic, tb. After all B permutations are done, put p = #{b: |tb| ≥ |tobserved|}/B (p is lower if we use >).
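
The procedure above can be sketched for one gene as follows. The function names `t_stat` and `permutation_pvalue` and the toy data are hypothetical, and the unequal-variance (Welch) form of the two-sample t-statistic is an assumption, since the slide does not say which variant is meant.

```python
import math
import random
import statistics


def t_stat(a, b):
    """Two-sample t-statistic (Welch form, an assumed choice)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0   # guard: no variation in either group
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_pvalue(x, n1, B=2000, seed=0):
    """Permutation p-value for one gene.

    x:  all n data points; the first n1 are treatments, the rest controls.
    B:  number of random permutations.
    """
    rng = random.Random(seed)
    t_observed = abs(t_stat(x[:n1], x[n1:]))
    values = list(x)
    count = 0
    for _ in range(B):
        rng.shuffle(values)                        # permute the n data points
        t_b = abs(t_stat(values[:n1], values[n1:]))
        if t_b >= t_observed:                      # as extreme or more
            count += 1
    return count / B


# Toy data: five treatment values followed by five control values.
x = [5.1, 5.3, 4.9, 5.2, 5.0, 3.0, 3.2, 2.9, 3.1, 3.05]
p = permutation_pvalue(x, n1=5)
print(p)
```

No distributional assumption is needed here: the null distribution of the statistic is built from the data by relabelling, so the same loop works for the SAM statistic or any other score.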
