1 / 19

Differential expression and testing

Differential expression and testing. Consider a case where we have observed two genes with estimated fold changes of 2 Is this worth reporting? Some journals require measures of statistical significance (“p-values”). Repeated experiment. Repeated experiment. Back to Basics.

steffi
Télécharger la présentation

Differential expression and testing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Differential expression and testing Consider a case where we have observed two genes with estimated fold changes of 2 Is this worth reporting? Some journals require measures of statistical significance (“p-values”)

  2. Repeated experiment

  3. Repeated experiment

  4. Back to Basics If we have two measurements X, Y then Y-X may have a different distribution under the null hypothesis for different genes More specifically the standard deviation  of Y-X may be different. We could consider (Y-X) /  instead But we do not know ! What is ? Why is it not 0? How about taking samples and using the t-statistic?

  5. Back to Basics Observations: Averages: SD2 or variances:

  6. Back to Basics t - statistic:

  7. Back to Basics If the number of replicates N is very large the t-statistic is normally distributed with mean 0 and and SD of 1 If the observed data is normally distributed then the t-statistic follows a t-distribution regardless of sample size We can then compute probability that t-statistic is as extreme or more, when null hypothesis is true: the p-value Where does this probability come from? We will see that the t-statistic is not a good strategy when N is small…

  8. Estimating the variance The t-test considers difference between group means to standard deviation of data within groups F-test is a generalization of this idea to more than 2 groups But with few replicates, estimates of SE are not stable This explains why t-test is not powerful There are many proposals for estimating variation Many share information across genes Empirical Bayesian Approaches are popular SAM, an ad-hoc procedure, is even more popular Many are what some call “moderated” t-tests

  9. Notation: T is average log expression in Tx C is average log expression in Control S is SD Note taking log before average is important Tests: Average log fold-change: (T-C) t-statistic: (T-C) / S SAM shrunken t-statistic: (T-C) / (S + S0) Bayesian posteriors: (T-C) / √(S2+K2) Wilcoxon Rank test Ad-hoc pairwise comparison No formula Some Examples of Tests

  10. Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate? Test for each gene null hypothesis: no differential expression. Notice that if you have look at 10,000 genes for which the null is true you expect to see 500 attain p-values of 0.05 This is called the multiple comparison problem. Statisticians fight about it. But not about the above. Main message: p-values can’t be interpreted in the usual way A popular solution is to report FDR instead. One final problem

  11. Multiple Hypothesis Testing • What happens if we call all genes significant with p-values ≤ 0.05, for example? Null = Equivalent Expression; Alternative = Differential Expression

  12. Error Rates • Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses PCER = E(V)/m • Per family error rate (PFER): the expected number of Type I errors PFER = E(V) • Family-wise error rate: the probability of at least one Type I error FEWR = Pr(V ≥ 1) • False discovery rate (FDR) rate that false discoveries occur FDR = E(V/R; R>0) = E(V/R | R>0)Pr(R>0) • Positive false discovery rate (pFDR): rate that discoveries are false pFDR = E(V/R | R>0).

  13. p >> n Goal: find statistically significant associations of biological conditions or phenotypes with gene expression. Consider the two class problem. Data: n (10…100) points in a p-dimensional (5000…30000) space. Problem: There are infinitely many ways to separate the space into two regions by a hyperplane such that the two groups are perfectly separated. This is a simple geometrical fact and holds as long as n<p!

  14. p >> n Problem: If I find such a perfectly separating hyperplane, it doesn’t mean anything. It is not surprising. It is not a significant finding. I would always find it, no matter how random the data are! Answer: regularization Rather than searching in the huge space of all hyperplanes in n-1 dimensional space, restrict ourselves to a much smaller space. Two major approaches: - only the hyperplanes perpendicular to one of the n coordinate axis  gene-by-gene discrimination, gene-by-gene hypothesis testing. - any other reasonable, not too complex set of hypersurfaces  machine learning

  15. Gene by gene tests t-test Wilcoxon F-test / more complex linear models Cox-regression Problem: Treating each gene independently of each other wastes information – many properties may be shared among genes. E.g. their within-group variability.

  16. The volcano plot shows, for a particular test, negative log p-value against the effect size (M) A useful plot

  17. MA and volcano

  18. Complications • What is the distribution of SAM statistic? • How about t-statistic is it really t-distributed? • How can we get p-values when we don’t know the distribution?

  19. p-values by permutations We focus on one gene only. For the bth iteration, b = 1,  , B; Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”. For each gene, calculate the corresponding two sample t-statistic, tb. After all the B permutations are done; Put p = #{b: |tb| ≥ |tobserved|}/B (p lower if we use >).

More Related