
Linear models and Limma



  1. Linear models and Limma Københavns Universitet, 19 August 2009 Mark D. Robinson Bioinformatics, Walter+Eliza Hall Institute Epigenetics Laboratory, Garvan Institute (with many slides taken from Gordon Smyth)

  2. Limma = linear models for microarray data
  • Morning theory: introduction; background correction; moderated t-tests; simple linear models
  • Morning practical: demonstration of smoothing; limma objects (beta7); background correction and normalization (beta7); simple experimental designs; 2-colour example (beta7); Affymetrix example (cancer)
  • Afternoon theory: more advanced designs / linear modelling; moderated F-tests; gene set tests; other analyses limma can do
  • Afternoon practical: factorial design (estrogen); gene set testing (cancer); time-course experiment (SAHA/depsipeptide)

  3. Expression measures: y_ga denotes the expression measure for probe or gene g on array a
  • Two-colour: y_ga = log2(R/G)
  • Affymetrix: y_ga = log-intensity (summarized over probes)
  • Illumina: y_ga = log-intensity (summarized over beads)

  4. Questions of Interest • What genes have changed in expression? (e.g. between disease/normal, affected by treatment) Gene discovery, differential expression • Is a specified group of genes all up-regulated in a particular condition? Gene set differential expression • Can the expression profile predict outcome? Class prediction, classification • Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways? Class discovery, clustering. Today will cover the first two questions: differential expression

  5. Microarray Differential Expression Studies • 10^3 – 10^6 genes/probes/exons on a chip or glass slide • Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (for 2-colour data) • Several steps to go from raw data to table of “expression”: background correction, normalization • Idea: Fit a linear model to the expression data for each gene

  6. Two colour microarrays http://en.wikipedia.org/wiki/DNA_microarray

  7. Two-colour data: Log-Intensities. For each probe, subtract an estimated background from the foreground intensity in each channel: R = Rf − Rb and G = Gf − Gb. There are various ways to calculate the background; the correction is often modified to ensure Rf − Rb > 0 and Gf − Gb > 0.

  8. Two-colour data: Means and Differences. For each probe: M = log2(R) − log2(G) (“minus”) and A = (log2(R) + log2(G)) / 2 (“add”).
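These per-probe summaries are simple to compute; a minimal sketch with made-up intensities (the values here are purely illustrative, not from any real array):

```python
import numpy as np

# Background-corrected red/green intensities for three hypothetical probes
R = np.array([1200.0, 150.0, 9000.0])
G = np.array([600.0, 300.0, 9200.0])

M = np.log2(R) - np.log2(G)        # "minus": log-ratio, measures differential expression
A = (np.log2(R) + np.log2(G)) / 2  # "add": average log-intensity, measures abundance

print(M)  # M[0] = log2(1200/600) = 1.0, i.e. a 2-fold change
print(A)
```

Plotting M against A for all probes gives the MA plot shown on the next slide.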

  9. Data Summaries. For each gene, the channel intensities (G1, R1), …, (G4, R4) from four arrays are summarized into log-ratios M1, …, M4 and average log-intensities A1, …, A4.

  10. MA Plot

  11. Log-Ratios or Single Channel Intensities? • Traditional analysis treats log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative • Alternative approach treats individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr) • Single channel approach makes new analyses possible but • makes stronger assumptions • requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on the same spot

  12. BG correction affects DE results • Importance of careful pre-processing and quality control cannot be over-emphasized for microarray data • Can have dramatic effect on differential expression results • Consider here the normexp method of adaptive background correction • background correction step of the RMA algorithm • Can also be applied to two colour data

  13. Additive + multiplicative error model. Observe the intensity for one probe on one array: Intensity = background + signal, i.e. I = B + S, where the background B is subject to additive errors and the signal S to multiplicative errors. This idea underlies the variance-stabilizing transformations vsn (for two-colour data) and vst (for Illumina data).

  14. normexp convolution model: Intensity = Background + Signal, i.e. I = B + S with B ~ N(μ, σ²) and S ~ Exponential(α).

  15. Conditional expectation under the normexp model. The posterior of S given I = x is a normal distribution truncated to S > 0, so E(S | I = x) = μ_Sx + σ φ(μ_Sx/σ) / Φ(μ_Sx/σ), with μ_Sx = x − μ − ασ², where α is the exponential rate and φ, Φ are the standard normal density and distribution functions.
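A sketch of this conditional-mean computation, assuming the model of slide 14 with α written as the exponential rate (note that limma's own normexp code parametrizes the exponential by its mean instead; the parameter values below are purely illustrative):

```python
import numpy as np
from scipy.stats import norm

def normexp_signal(x, mu, sigma, alpha):
    """E(S | I = x) under I = B + S, B ~ N(mu, sigma^2), S ~ Exp(rate alpha).

    Given I = x, S is normal with mean m = x - mu - alpha*sigma^2 and
    sd sigma, truncated to S > 0, so the conditional mean is
    m + sigma * phi(m/sigma) / Phi(m/sigma), which is always positive.
    """
    m = x - mu - alpha * sigma**2
    z = m / sigma
    return m + sigma * norm.pdf(z) / norm.cdf(z)

# Even an observed intensity below the mean background yields a positive signal
print(normexp_signal(x=80.0, mu=100.0, sigma=20.0, alpha=0.01))
```

This positivity is exactly the property exploited on slide 18: background-corrected intensities never go negative, so logarithms are always defined.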

  16. normexp background correction • Estimate the three parameters • Replace I with E(S|I) • For Affymetrix data, I is the “Perfect Match” data intensity • For two-colour data, I=Rf-Rb or I=Gf-Gb • In the RMA algorithm, parameter estimation uses an ad hoc density kernel method • In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood

  17. PM data on log2 scale: raw and fitted model

  18. Background-corrected intensity = E(Signal | Observed Intensity). Adaptive background correction produces a positive signal. [Plot: E(Signal | Intensity) against observed intensity]

  19. Offsets to stabilise the variance. After background correction, a small offset k can be added to each channel before forming log-ratios, M = log2((R + k)/(G + k)). The offset reduces variability at low intensities.

  20. Why do offsets stabilize the variance?

  21. Why do offsets stabilize the variance? • log2(80/10) = log2(8/1) = 3 (8-fold) • Additive noise affects small numbers more • Offsets introduce bias: log2[(80+10)/(10+10)] = log2(90/20) ≈ 2.17 • But the tradeoff (a drop in variance for an increase in bias) is usually worth it
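The effect is easy to see in a quick simulation (illustrative intensities, noise level and offset only, not from any real array):

```python
import numpy as np

rng = np.random.default_rng(0)

# A low-intensity probe with true ratio 1, measured with additive noise (sd 10)
true = 50.0
R = true + rng.normal(0, 10, 10000)
G = true + rng.normal(0, 10, 10000)
R, G = np.clip(R, 1, None), np.clip(G, 1, None)  # keep intensities positive

k = 50.0  # offset added to both channels
M_raw = np.log2(R / G)
M_off = np.log2((R + k) / (G + k))

# The offset roughly halves the spread of the log-ratios here
print(M_raw.std(), M_off.std())
```

The true log-ratio is 0 in both cases, so the offset buys a large drop in variance for no bias at all for non-differential genes; bias only appears for genes with a genuine fold change, as in the arithmetic above.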

  22. A self-self experiment: two background methods (553 spots not plotted)

  23. Comparison of 2-colour BG correction methods. [Plot: false discoveries (limma) against number of genes selected; Ritchie et al. 2007]

  24. References • Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddle point approximation] • Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows “normexp” performs best for 2-colour data] • Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [Describes RMA BG correction, but doesn't give much detail of the normexp convolution model.]

  25. Normalization

  26. Normalization Two-colour BG correction and normalization are closely connected Even after BG correction, some effects remain.

  27. Normalization One-colour Similarly for single channel data, adjustments need to be made for all samples to be comparable.

  28. Moderated t-tests

  29. Borrowing information across genes • Small data sets: few arrays, inference for individual genes is uncertain • Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power • Benefit of parallelism: same model is fitted for every gene. Can borrow information from one gene to another

  30. Hard and soft shrinkage • Hard: simplest way to borrow information is to assume that one or more parameters are constant across genes • Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage …

  31. A very common experiment: wild-type mouse × 2 versus mutant mouse × 2, with n1 = n2 = 2 Affymetrix arrays and 25,000 probe-sets. Which genes (such as gene X) are differentially expressed?

  32. Ordinary t-tests give very high false discovery rates Residual df = 2

  33. Another very common experiment: wild-type mouse 1 versus mutant mouse 1, and wild-type mouse 2 versus mutant mouse 2, on n = 2 two-colour arrays with 30,000 probes. Which genes are differentially expressed?

  34. Ordinary t-tests give very high false discovery rates Residual df = 1

  35. Small sample size, many tests The problem: • These experiments would be under-powered even with just one gene • Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power The solution: The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.

  36. t-tests with common variance: replace each genewise residual standard deviation s_g by a single value s̄ pooled across all genes, giving t_g = β̂_g / (s̄ √v_g), where β̂_g is the estimated log-fold-change and v_g its unscaled variance. More stable, but ignores gene-specific variability.

  37. A better compromise: shrink the genewise standard deviations towards a common value s0. Moderated variance: s̃_g² = (d0 s0² + d_g s_g²) / (d0 + d_g), where d_g is the residual degrees of freedom for gene g and d0 the prior degrees of freedom. Moderated t-statistics: t̃_g = β̂_g / (s̃_g √v_g).
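A minimal numeric sketch of this shrinkage, with hypothetical values for d0, s0 and the genewise statistics (not limma's full pipeline, which also estimates these hyper-parameters from the data):

```python
import numpy as np

def moderated_t(beta_hat, s, v, d_g, d0, s0):
    """Moderated t-statistic in the style of limma (Smyth 2004).

    s      : genewise residual standard deviations
    v      : unscaled variance of beta_hat (1/n1 + 1/n2 = 1 for two groups of 2)
    d_g    : residual degrees of freedom per gene
    d0, s0 : prior degrees of freedom and prior standard deviation
    """
    s_tilde2 = (d0 * s0**2 + d_g * s**2) / (d0 + d_g)  # shrink s^2 toward s0^2
    return beta_hat / np.sqrt(s_tilde2 * v)

# Three genes, all with estimated log-fold-change 1, but very different s_g
s = np.array([0.01, 0.5, 2.0])
t_ord = 1.0 / np.sqrt(s**2 * 1.0)                     # ordinary t
t_mod = moderated_t(1.0, s, v=1.0, d_g=2, d0=4, s0=0.5)

print(t_ord)  # first gene's tiny variance gives a wildly inflated t of 100
print(t_mod)  # after shrinkage its t is modest; the middle gene (s = s0) is unchanged
```

The first gene illustrates the false-discovery problem on slide 32: with df = 2, tiny sample variances occur by chance and ordinary t-statistics explode, while shrinkage towards s0 tames them.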

  38. Shrinkage of standard deviations: the data decide whether s̃_g should be closer to the genewise s_g or to the common value s0.

  39. Why does it work? • We learn the typical variability level by looking at all genes, but allow some flexibility from this for individual genes • Adaptive

  40. Hierarchical model for variances. Data: s_g² | σ_g² ~ (σ_g²/d_g) χ²_{d_g}. Prior: 1/σ_g² ~ (1/(d0 s0²)) χ²_{d0}. Posterior: the posterior distribution of σ_g² given s_g² yields the moderated estimator s̃_g².

  41. Posterior Statistics. Posterior variance estimators: s̃_g² = (d0 s0² + d_g s_g²) / (d0 + d_g). Moderated t-statistics: t̃_g = β̂_g / (s̃_g √v_g). Baldi & Long 2001, Wright & Simon 2003, Smyth 2004

  42. Exact distribution for moderated t. An unexpected piece of mathematics shows that, under the null hypothesis, t̃_g ~ t on d0 + d_g degrees of freedom: the degrees of freedom add! The Bayes prior in effect adds d0 extra arrays for estimating the variance. Wright and Simon 2003, Smyth 2004
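The practical payoff of the extra d0 degrees of freedom is a sharper p-value for the same statistic. A small sketch with hypothetical numbers (d_g = 2 as in the mouse example, d0 = 4 assumed):

```python
from scipy.stats import t

d_g, d0 = 2, 4
t_stat = 3.46  # a moderated t-statistic for some gene

# An ordinary t would be referred to t on d_g df; the moderated t gains d0 extra df
p_ordinary = 2 * t.sf(t_stat, df=d_g)
p_moderated = 2 * t.sf(t_stat, df=d0 + d_g)
print(p_ordinary, p_moderated)
```

With only d_g = 2 residual degrees of freedom the same statistic is far from significant; referred to d0 + d_g = 6 degrees of freedom it clears the conventional 0.05 threshold comfortably.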

  43. More on empirical Bayes statistics

  44. Hierarchical model for means. Data: β̂_g | β_g, σ_g² ~ N(β_g, v_g σ_g²). Prior: β_g = 0 with probability 1 − p; β_g | β_g ≠ 0 ~ N(0, c0 v_g σ_g²). Lönnstedt and Speed 2002, Smyth 2004

  45. Posterior Odds. Posterior odds of differential expression: O_g = P(β_g ≠ 0 | data) / P(β_g = 0 | data). O_g is a monotonic increasing function of |t̃_g|, hence the moderated t gives the best possible ranking of genes. Lönnstedt and Speed 2002, Smyth 2004

  46. Estimating Hyper-Parameters. Closed-form estimators with good properties are available: for s0 and d0, in terms of the first two moments of log s_g²; for c0, in terms of quantiles of the moderated t-statistics.

  47. Marginal Distributions. Under the usual likelihood model, s_g is independent of the estimated coefficients β̂_g. Under the hierarchical model, s_g is instead independent of the moderated t-statistics t̃_g.

  48. Moment estimators for s0 and d0. Write z_g = log s_g². Its marginal moments lead to estimators of s0 and d0: E(z_g) = log s0² + ψ(d_g/2) − ψ(d0/2) + log(d0/d_g) and var(z_g) = ψ′(d_g/2) + ψ′(d0/2), where ψ and ψ′ are the digamma and trigamma functions. Estimate d0 by solving ψ′(d0/2) = var̂(z_g) − ψ′(d_g/2); finally, obtain s0² from the mean equation.
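These moment equations can be sketched directly; the following simulation check under the hierarchical model (s_g² ~ s0² F(d_g, d0) marginally) is an illustration with invented parameter values, not limma's exact implementation:

```python
import numpy as np
from scipy.special import polygamma
from scipy.optimize import brentq
from scipy.stats import f

def estimate_d0_s0(s2, d_g):
    """Moment estimators of the prior (d0, s0) from genewise variances s2.

    With z = log s_g^2, marginally:
      var(z)  = trigamma(d_g/2) + trigamma(d0/2)
      mean(z) = log s0^2 + digamma(d_g/2) - digamma(d0/2) + log(d0/d_g)
    """
    z = np.log(s2)
    rhs = z.var(ddof=1) - polygamma(1, d_g / 2)   # = trigamma(d0/2)
    if rhs <= 0:
        # no excess spread beyond sampling variation: d0 is effectively infinite
        return np.inf, np.sqrt(np.exp(z.mean()))
    d0 = 2 * brentq(lambda x: float(polygamma(1, x)) - rhs, 1e-6, 1e6)
    log_s02 = (z.mean() - polygamma(0, d_g / 2) + polygamma(0, d0 / 2)
               - np.log(d0 / d_g))
    return d0, np.sqrt(np.exp(log_s02))

# Simulate genewise variances from the hierarchical model and recover (d0, s0)
rng = np.random.default_rng(1)
d_g, d0_true, s0_true = 4, 8.0, 0.25
s2 = s0_true**2 * f.rvs(d_g, d0_true, size=50000, random_state=rng)
print(estimate_d0_s0(s2, d_g))  # close to the true (8, 0.25)
```

Inverting the trigamma function numerically (here by root-finding) is the only non-closed-form step; everything else is plain moment matching on log s_g².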

  49. Quantile Estimation of c0. Let r be the rank of |t̃_g| in descending order, and let F(·; d0 + d_g) be the distribution function of the t-distribution. c0 can be estimated by equating empirical to theoretical quantiles of the |t̃_g|; an overall estimator of c0 is obtained by averaging the individual estimators from the top p/2 proportion of the |t̃_g|.

  50. Short note on multiple testing
