
Statistical data analysis in functional genomics






Presentation Transcript


  1. Statistical data analysis in functional genomics Wolfgang Huber Genome Biology Unit EMBL Predoc Course 2009

  2. Download dataset E-MEXP-1422 (e.g. using Bioconductor) Go to ArrayExpress website

  3. Microarray experiment Colon cancer cell line; conditions: si-GFP (control), si-PROX1 (#1), si-PROX1 (#2).

  4. marray data Arrays: probes = gene-specific DNA strands. Samples: mRNA from tissue biopsies, cell lines; fluorescent detection of the amount of sample-probe binding.

              tissue A  tissue B  tissue C
     ErbB2      0.02      1.12      2.12
     VIM        1.1       5.8       1.8
     ALDH4      2.2       0.6       1.0
     CASP4      0.01      0.72      0.12
     LAMA4      1.32      1.67      0.67
     MCAM       4.2       2.93      3.31

  5. Boxplots

  6. Boxplot A boxplot is a graphical representation of the five-number summary: minimum, 1st quartile, median, 3rd quartile, maximum. The box spans the interquartile range and contains the middle 50% of the data; the median splits the data into halves (50% above, 50% below), and 25% of the data lies below the box, 25% above it. Outliers are data points whose distance from the box is larger than 1.5 times the interquartile range; the whiskers extend to the last points that are not outliers.
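In Python (a sketch with made-up numbers, not data from the lecture), the five-number summary and the 1.5 × IQR outlier rule look like this:

```python
import numpy as np

# hypothetical sample, not from the lecture
x = np.array([0.2, 0.5, 0.7, 1.0, 1.1, 1.3, 1.6, 1.9, 2.2, 6.0])

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# points farther than 1.5 * IQR from the box are outliers
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lo) | (x > hi)]

# whiskers extend to the most extreme points that are not outliers
inside = x[(x >= lo) & (x <= hi)]
whiskers = inside.min(), inside.max()
```

Here the value 6.0 is flagged as an outlier and the whiskers stop at 0.2 and 2.2.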

  7. Scatterplots

  8. Regression Slope: different amount of extracted RNA; different labeling efficiency. Baseline (intercept): different background fluorescence.
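The slide's point, that a straight-line fit between two arrays' intensities has a slope (RNA amount, labeling efficiency) and a baseline (background fluorescence), can be illustrated on simulated intensities; all numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 500)                  # array 1 intensities (arbitrary units)
y = 50 + 1.8 * x + rng.normal(0, 2, 500)      # array 2: baseline 50, slope 1.8, noise

# least-squares line: slope captures scale differences, intercept the background
slope, intercept = np.polyfit(x, y, 1)
```

The fitted slope and intercept recover the simulated 1.8 and 50 closely.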

  9. A complex measurement process lies between RNA concentrations and intensities The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

  10. Spike-in data (log2 scale). Cope et al., Bioinformatics 2003

  11. Normalisation Should really be called calibration; it has nothing to do with the normal distribution. Four major approaches: linear scaling: x → ax; affine (e.g. vsn): x → ax + b; monotone smooth (e.g. loess): x → f(x); rank (e.g. quantile normalisation): x → rank(x).
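Of the four approaches, the rank-based one is easy to sketch. The lecture's own analyses use R/Bioconductor; the following Python with invented numbers is only an illustration of quantile normalisation:

```python
import numpy as np

def quantile_normalize(mat):
    """Force every column (array) to share the same empirical
    distribution: the mean of the sorted columns."""
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)  # per-column ranks
    ref = np.sort(mat, axis=0).mean(axis=1)              # reference quantiles
    return ref[ranks]

# toy matrix: rows = genes, columns = arrays (made-up values)
mat = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
norm = quantile_normalize(mat)
# after normalisation, every column holds exactly the same set of values
```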

  12. Statistics 101 Bias determines accuracy; variance determines precision.

  13. Basic dogma of data analysis One can always increase sensitivity at the cost of specificity, or vice versa; the art is to optimize both, then find the best trade-off.

  14. Receiver Operating Characteristic (ROC) curve
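A ROC curve plots the true positive rate against the false positive rate as the decision threshold is swept across the scores. A minimal sketch, with hypothetical scores and ground-truth labels:

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold from high to low and record (FPR, TPR) pairs."""
    order = np.argsort(-scores)                  # descending score
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()       # true positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false positive rate
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])   # invented classifier scores
labels = np.array([1, 1, 0, 1, 0, 0])                # invented ground truth
fpr, tpr = roc_points(scores, labels)

# area under the curve via the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

For these toy numbers the AUC equals 8/9, the fraction of (positive, negative) pairs that the scores rank correctly.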

  15. Hypothesis testing Two hypotheses in competition: • H0: the null hypothesis, usually the more conservative, more specific, better understood • H1 or HA: alternative hypothesis, often corresponds to the statement that we want to demonstrate Examples of null hypotheses: • The coin is fair • The new drug is no better (or worse) than a placebo • There is no difference in weight between two groups of mice

  16. Karl Popper (1902-1994) Logical asymmetry between verification and falsifiability: no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive: it shows the theory is false. Neyman-Pearson theory of hypothesis testing (1930s): the goal is to use the data to reject H0 (with high confidence), which implies the complement of H0.

  17. Hypothesis testing We need something to measure how far an observation is from what we expect to see if H0 is correct: a test statistic. Example: the number of heads obtained when tossing a coin a given number of times. We expect this number to be close to half the number of tosses, but what is "close"? We can use combinatorics / probability theory to quantify this. For example, in 12 tosses, the probability of seeing exactly 8 heads is C(12,8) (1/2)^12 = 495/4096 ≈ 0.12.
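As a quick check of the binomial calculation (the formula itself was on a figure that did not survive the transcript):

```python
from math import comb

# P(exactly 8 heads in 12 tosses of a fair coin)
p = comb(12, 8) * 0.5 ** 12
# comb(12, 8) = 495, so p = 495/4096 ≈ 0.121
```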

  18. Significance level The p-value is the probability, computed under H0, that the test statistic is as or more extreme than the observed value. Practitioners like tests where this probability can be explicitly computed through a mathematical formula (t-test, Wilcoxon, hypergeometric, ...). If not, one can often still estimate it by random permutation or bootstrapping. Historically, some people agreed that for many instances a significance level α = 0.05 was a useful cutoff, but there is nothing magic about this number. Two views of the p-value: (i) strength of evidence for a certain (negative) statement; (ii) rational decision support.

  19. Binomial distribution How is the test statistic distributed if H0 is correct? For 10 tosses of a fair coin: P(Heads <= 1) = 0.01074; P(Heads >= 9) = 0.01074.
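The quoted tail probabilities correspond to 10 tosses of a fair coin, which can be verified directly:

```python
from math import comb

def binom_cdf(k, n):
    """P(X <= k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

p_low = binom_cdf(1, 10)          # P(Heads <= 1)
p_high = 1 - binom_cdf(8, 10)     # P(Heads >= 9)
# both equal 11/1024 ≈ 0.01074, by the symmetry of the fair-coin distribution
```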

  20. One-sided vs two-sided tests If HA is "the coin is biased", we do not specify the direction of the bias and must be ready for both alternatives (many heads or many tails). This is a two-sided test: if the total level of the test is e.g. α = 0.05, we allow α/2 for bias towards tails and α/2 for bias towards heads. If HA is "the coin is biased towards heads", we specify the direction of the bias and the test is one-sided. A one-sided test is more powerful, but it requires a greater amount of prior information.

  21. Errors in hypothesis testing

                            Truth: H0 true       Truth: HA true
      Decision: reject H0   type I error (FP)    correct
      Decision: keep H0     correct              type II error (FN)

  22. One sample t-test The t-statistic (1908, William Sealy Gosset, pen-name "Student"): t = √n · x̄ / s, where x̄ is the sample mean, s the sample standard deviation, and n the number of observations. Without the factor √n: a z-score, a relative measure of distance. With √n: the t-statistic; its null distribution can be computed: the t-distribution with a parameter called "degrees of freedom", equal to n-1.

  23. t-test If H0 is correct, t follows a known distribution: the t-distribution. The shape of the t-distribution depends on the number of observations: if the average is made of n observations, it follows the t-distribution with n-1 degrees of freedom (tn-1). If n is large, tn-1 is close to a normal distribution. If n is small, tn-1 is more spread out than a normal distribution (a penalty because we had to estimate the standard deviation from the data).

  24. One sample t-test: example Consider the following 10 data points: -0.01, 0.65, -0.17, 1.77, 0.76, -0.16, 0.88, 1.09, 0.96, 0.25. We are wondering if these values come from a distribution with a true mean of 0. H0: the true mean is 0; HA: the true mean is not 0. These 10 data points have a mean of 0.60 and a standard deviation of 0.62. From these numbers we can calculate the t-statistic: t = 0.60 / 0.62 × √10 ≈ 3.0.

  25. One sample t-test: example Since we have 10 observations, we compare the observed t-statistic to the t distribution with 9 degrees of freedom (t9), whose 2.5% critical values are ±2.26. Since t = 3.01 > 2.26, the difference is significant at the 5% level. Exact p-value: P( |t| >= 3.01 ) = 0.014.
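The worked example can be verified numerically with the slide's 10 data points:

```python
import math
import statistics as st

x = [-0.01, 0.65, -0.17, 1.77, 0.76, -0.16, 0.88, 1.09, 0.96, 0.25]

mean = st.mean(x)                      # ≈ 0.60
sd = st.stdev(x)                       # ≈ 0.62 (sample sd, n - 1 denominator)
t = mean / (sd / math.sqrt(len(x)))    # one-sample t-statistic against mu = 0
# t ≈ 3.05, matching the slide's rounded value of about 3.0,
# and exceeding the 5% critical value 2.26 of the t9 distribution
```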

  26. 5% 2.5% 2.5% One-sided vs two-sided test One-sided e.g. HA: μ>0 Two-sided e.g. HA: μ=0

  27. Two samples t-test Do two different samples have the same mean? Test statistic: t = (x̄ − ȳ) / SE, where x̄ and ȳ are the averages of the observations in the two samples and SE is the standard error for the difference (see a book for details). If H0 is correct (no difference in means), the test statistic follows a t distribution with n+m-2 degrees of freedom (n, m: the number of observations in each sample).
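A sketch of the pooled-variance form of the statistic, with invented samples:

```python
import math
import statistics as st

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic; under H0 it follows a
    t distribution with len(x) + len(y) - 2 degrees of freedom."""
    n, m = len(x), len(y)
    sp2 = ((n - 1) * st.variance(x) + (m - 1) * st.variance(y)) / (n + m - 2)
    se = math.sqrt(sp2 * (1 / n + 1 / m))      # standard error of the difference
    return (st.mean(x) - st.mean(y)) / se

# hypothetical measurements from two conditions
t = two_sample_t([4.2, 4.8, 5.1, 4.9], [3.9, 4.1, 3.7, 4.3])
```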

  28. Comments and pitfalls The derivation of the t-distribution assumes that the observations are independent and that they follow a normal distribution. If the data are not normal (e.g. heavier tails), that is actually rarely a problem for the t-test. If the data are dependent, then p-values will likely be totally wrong (e.g., for positive correlation, too optimistic).

  29. different data distributions – independent case

  30. different data distributions – correlated case

  31. P-values in the PROX1 dataset

  32. Multiple testing Classical hypothesis test: null hypothesis H0, alternative H1; test statistic X ↦ t(X); α = P( t(X) ∈ Γrej | H0 ): type I error (false positive); β = P( t(X) ∉ Γrej | H1 ): type II error (false negative). When n tests are performed, what is the extent of type I errors, and how can it be controlled? E.g.: 20,000 tests at α = 0.05, all with H0 true: expect 1,000 false positives.

  33. p-values: a mixture of null and alternative

  34. Experiment-wide type I error rates Family-wise error rate: P(V > 0), the probability of one or more false positives. For large m0, this is difficult to keep small. False discovery rate: E[ V / max{R, 1} ], the expected fraction of false positives among all discoveries.

  35. Benjamini-Hochberg multiple testing adjustment: control siRNA vs PROX1 siRNA #1; slope: α / #genes

  36. Benjamini-Hochberg multiple testing adjustment: control siRNA vs PROX1 siRNA #1

             rawp     Bonferroni   BH
      [1,]  9.67e-06    0.215     0.215
      [2,]  2.94e-05    0.655     0.242
      [3,]  3.25e-05    0.725     0.242
      [4,]  7.28e-05    1.000     0.261
      [5,]  8.01e-05    1.000     0.261
      [6,]  8.90e-05    1.000     0.261
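The Bonferroni and Benjamini-Hochberg columns of such a table can be reproduced with a short sketch; the p-values below are invented, not the PROX1 values (the lecture presumably produced the table in R):

```python
import numpy as np

def bonferroni(p):
    """Bonferroni-adjusted p-values: min(m * p, 1)."""
    return np.minimum(np.asarray(p, dtype=float) * len(p), 1.0)

def bh_adjust(p):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)      # p_(i) * m / i
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity from the right
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
```

As in the slide's table, BH-adjusted values are never larger than Bonferroni ones and are non-decreasing along the sorted raw p-values.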

  37. p-values and fold changes

  38. Independent filtering From the set of 22,000 probes, first filter out those that seem to report negligible signal (say, 40%), then formally test for differential expression on the rest. Conditions under which we expect negligible signal: the target gene is absent, or just weakly expressed, in both samples (probes will still report noise and cross-hybridization); or the probe fails to detect the target.

  39. Increased detection rates Stage 1 filter: compute the variance, across samples, for each probeset, and remove the fraction θ that are smallest. Stage 2: standard two-sample t-test. (ALL data)

  40. Increased power? Increased detection rate implies increased power only if we are still controlling type I errors at the nominal level. Concerns: have we thrown away good genes? Use a data-driven criterion in stage 1, but do the type I error consideration only on the number of genes in stage 2. Informal justification: the filter does not use covariate information. (ALL data)

  41. Marginal control: independence of stage 1 and stage 2 statistics under the null hypothesis For genes for which the null hypothesis is true (X1, ..., Xn exchangeable), f and g are statistically independent in both of the following cases: Normally distributed data: f (stage 1): overall variance (or mean); g (stage 2): the standard two-sample t-statistic, or any test statistic which is scale and location invariant. Non-parametrically: f: any function that does not depend on the order of the arguments, e.g. overall variance, IQR; g: the Wilcoxon rank sum test statistic. Both can be extended to the multi-class context: ANOVA and Kruskal-Wallis.
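The two-stage procedure (label-blind variance filter f, then two-sample t-statistic g) can be sketched on simulated data; the matrix size, the fraction θ = 0.4 and the group sizes below are invented for illustration, not the ALL dataset:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# hypothetical expression matrix: 1000 probesets x 8 samples (4 vs 4)
expr = rng.normal(0.0, 1.0, (1000, 8))
expr[:50, 4:] += 2.0                     # 50 probesets truly differential

# Stage 1 (f): overall variance, computed without using the class labels
theta = 0.4
overall_var = expr.var(axis=1, ddof=1)
keep = overall_var >= np.quantile(overall_var, theta)

# Stage 2 (g): two-sample t-statistic on the surviving probesets only
def tstat(row):
    x, y = row[:4], row[4:]
    sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2          # pooled variance, n = m = 4
    return (y.mean() - x.mean()) / math.sqrt(sp2 / 2)  # sp2 * (1/4 + 1/4)

t = np.array([tstat(r) for r in expr[keep]])
```

Because the filter ignores the order of the samples, under the null hypothesis it is independent of the location/scale-invariant t-statistic, which is what licenses doing the multiple-testing correction on the reduced set only.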

  42. How to know when it works? (power)

  43. Diagnostics (ALL data): absolute value of the t-statistic (stage 2) plotted against the overall standard deviation (stage 1); odds ratio = 4.4
