Statistics for RNA-seq Analysis

Moscow Genomic Data Analysis 2012. Mark Reimers, PhD.


Presentation Transcript


  1. Statistics for RNA-seq Analysis. Moscow Genomic Data Analysis 2012. Mark Reimers, PhD

  2. Basic Statistics
     • Variation in read counts should follow a Poisson distribution, and it almost does when replicates are run by the same lab, in the same batch, on the same machine, from the same library
     • The distribution of counts within one sample typically follows a power law

  3. Distribution of Counts within One Sample Follows a Power Law
     If we plot the number of genes with a given count against that count on a log-log plot, we see something close to a straight line (with a bump at 0)
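The log-log check described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the counts are toy data built to follow an assumed power law with exponent 2 (not real RNA-seq counts), and `count_frequency` and `loglog_slope` are hypothetical helper names.

```python
import math
from collections import Counter

def count_frequency(counts):
    """Map each read count to the number of genes that have that count."""
    return Counter(counts)

def loglog_slope(freq):
    """Least-squares slope of log(number of genes) against log(count).

    Zero counts are skipped because log(0) is undefined.
    """
    pts = [(math.log(c), math.log(n)) for c, n in freq.items() if c > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Toy data (an assumption): the number of genes with count c is
# proportional to c**-2, plus an arbitrary excess of zero-count genes.
toy_counts = []
for c in range(1, 51):
    toy_counts.extend([c] * round(100000 / c ** 2))
toy_counts.extend([0] * 3000)

slope = loglog_slope(count_frequency(toy_counts))
print(f"fitted log-log slope: {slope:.3f}")
```

A slope that stays roughly constant across the range of counts is what a straight line on the log-log plot means; on real data the high-count tail is sparse, so some binning is usually needed before fitting.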

  4. RNA-Seq Significance Testing

  5. Approaches to Significance
     • Continuous (easy)
       • At current read depths (>50M reads), most genes of interest are well above the threshold for the continuity approximation
     • Discrete (hard)
       • All data are counts, and many are quite low, well below the n > 5 usually considered acceptable for a continuous approximation
       • Cost-effective studies will use multiplexing, so counts will remain low

  6. Issues with Continuous Approximation
     • Data are NOT anywhere near Gaussian
     • Discrete counts under five may be poorly approximated by continuous distributions
     • Select only those genes with a mean count of at least five
     • Ad hoc fix: Winsorize the data and do t-tests
     • Typically there are excess zeroes, which show up as extreme values
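As a rough illustration of the filter, Winsorize, then t-test recipe on this slide, here is a minimal sketch in plain Python. The two groups are invented numbers, the 10% Winsorizing fraction is an arbitrary choice, and `winsorize` and `welch_t` are hypothetical helper names. Welch's unequal-variance form of the t statistic is an assumption, since the slide does not say which t-test is meant.

```python
import math
import statistics

def winsorize(values, fraction=0.1):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value."""
    xs = sorted(values)
    k = int(fraction * len(xs))
    lo, hi = xs[k], xs[len(xs) - 1 - k]
    return [min(max(v, lo), hi) for v in values]

def welch_t(a, b):
    """Welch's t statistic (unequal variances) for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

# Invented counts for one gene: one excess zero and one extreme value.
group_a = [12, 15, 11, 14, 13, 0, 16, 12, 14, 13]
group_b = [22, 25, 21, 80, 24, 23, 26, 22, 24, 23]

# Filter from the slide: keep the gene only if its mean count is at least five.
keep_gene = statistics.mean(group_a + group_b) >= 5

t_raw = welch_t(group_a, group_b)
t_wins = welch_t(winsorize(group_a), winsorize(group_b))
print(keep_gene, round(t_raw, 2), round(t_wins, 2))
```

In this toy case the two outlying values inflate the variance, so the Winsorized statistic is much larger in magnitude than the raw one. A p-value would come from a t distribution with the Welch degrees of freedom, which is left out here to keep the sketch dependency-free.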

  7. Models for Count Data
     • Poisson model
       • Standard model for count data
     • Negative Binomial model
       • Higher variance than Poisson
     • Zero-inflated (mixture) model
       • Allows excess 0 counts beyond either of the above
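To make the mixture idea concrete, here is a minimal sketch of a zero-inflated probability mass function. It uses a Poisson base distribution purely as an assumption for brevity (the slide allows either base model), and `zip_pmf` and the mixing weight `pi0` are hypothetical names.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def zip_pmf(k, mu, pi0):
    """Zero-inflated Poisson: with probability pi0 the count is a
    structural zero, otherwise it is drawn from Poisson(mu)."""
    base = (1 - pi0) * poisson_pmf(k, mu)
    return pi0 + base if k == 0 else base

mu, pi0 = 5.0, 0.2  # illustrative values only
print(poisson_pmf(0, mu), zip_pmf(0, mu, pi0))
```

The extra mass at zero is the only difference from the base model; all positive counts are scaled down by `1 - pi0` so the probabilities still sum to one.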

  8. Poisson Model
     • Describes counts of independent events where each has a small probability of occurring, such as reads from one gene
     [Figure: Poisson distributions with various means]
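The "many independent events, each with small probability" description can be checked numerically: a Binomial(n, p) count with large n and small p is very close to a Poisson with mean np. The sketch below assumes a hypothetical library of one million reads and a hypothetical per-read probability of 1e-5 for the gene of interest.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) when each of n independent reads hits the gene with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

n_reads = 1_000_000      # hypothetical library size
p_gene = 1e-5            # hypothetical chance that a read comes from this gene
mu = n_reads * p_gene    # expected count, about 10

# Largest pointwise difference between the two distributions.
max_gap = max(abs(binomial_pmf(k, n_reads, p_gene) - poisson_pmf(k, mu))
              for k in range(30))

# For a Poisson distribution the variance equals the mean.
mean = sum(k * poisson_pmf(k, mu) for k in range(60))
var = sum((k - mean) ** 2 * poisson_pmf(k, mu) for k in range(60))
print(max_gap, mean, var)
```

The variance-equals-mean property is the baseline against which over-dispersion in replicate data is judged.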

  9. Variance Increases with Mean

  10. Why Is a GLM for RNA-Seq Hard?
     • Data from biological replicates are greatly over-dispersed compared to a Poisson distribution
     • Common ad hoc fix: model the counts with a Negative Binomial
     • Typically there are excess zeroes beyond what the NB or any standard discrete distribution predicts

  11. Negative Binomial Distribution
     • Generalization of the geometric distribution
     • Repeat Bernoulli(p) trials
     • Count the number of non-selected outcomes until r selected outcomes have occurred
     [Figure: Negative Binomial distributions for p = 0.2 and various r. Note: r = 0.5 is defined by analogy]
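A non-integer r such as 0.5 can be handled by writing the binomial coefficient with gamma functions. The sketch below assumes that p is the per-trial probability of the counted (non-selected) outcome, which is the reading consistent with the mean pr/(1-p) given on the next slide; `nb_pmf` is a hypothetical helper name.

```python
import math

def nb_pmf(k, r, p):
    """P(K = k): k counted outcomes (probability p each) occur before the
    r-th stopping outcome (probability 1 - p each).

    Gamma functions replace factorials, so r need not be an integer.
    Requires r > 0 and 0 < p < 1.
    """
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(p) + r * math.log(1 - p))

# r = 1 recovers the geometric distribution; r = 0.5 is the
# "defined by analogy" case from the figure.
for r in (0.5, 1, 4):
    print(r, [round(nb_pmf(k, r, 0.2), 4) for k in range(4)])
```

Working on the log scale with `math.lgamma` avoids overflow for large counts, which matters for highly expressed genes.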

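  12. Alternate Parameterization by Mean and Over-Dispersion
     The Negative Binomial may also be parameterized by its mean $\mu$ and variance $\sigma^2$: $\mu = pr/(1-p)$, $\sigma^2 = pr/(1-p)^2$
     Over-dispersion parameter $\theta$: $\sigma^2 = \mu + \theta\mu^2$, with $\theta = 1/r$ and $p = \theta\mu/(1+\theta\mu)$
     If $\theta = 0$, the distribution reduces to the Poisson
     [Figure: Negative Binomial distributions for $\mu = 10$ and various $\theta$]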
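The two parameterizations on this slide convert into each other directly. The sketch below implements the slide's formulas, using the names `mu` for the mean and `theta` for the over-dispersion parameter. The function names are hypothetical, and the moment-based dispersion estimate at the end is an added illustration obtained by solving the variance formula for the dispersion; it is not something the slides propose.

```python
import statistics

def to_mean_dispersion(r, p):
    """(r, p) -> (mean, over-dispersion), using mean = p*r/(1-p) and theta = 1/r."""
    return p * r / (1 - p), 1 / r

def to_r_p(mu, theta):
    """(mean, over-dispersion) -> (r, p), using r = 1/theta and p = theta*mu/(1 + theta*mu)."""
    return 1 / theta, theta * mu / (1 + theta * mu)

def nb_variance(mu, theta):
    """Variance under the mean/over-dispersion form: mu + theta * mu**2."""
    return mu + theta * mu ** 2

def moment_dispersion(sample):
    """Crude moment estimate of theta from replicates, clamped at 0 (assumption:
    a simple plug-in estimate, shown only to illustrate the variance formula)."""
    m, v = statistics.mean(sample), statistics.variance(sample)
    return max((v - m) / m ** 2, 0.0)

mu, theta = to_mean_dispersion(r=4, p=0.2)
theta_hat = moment_dispersion([3, 9, 14, 6, 18])  # invented replicate counts
print(mu, theta, nb_variance(mu, theta), theta_hat)
```

With only a handful of replicates such a per-gene estimate is very noisy, so it should be read as an illustration of the formula rather than a usable estimator.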

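  13. Using the Negative Binomial Model to Test for Differential Expression
     • Assume the dispersion parameter $\theta$ is identical between samples
     • Test for a difference of means using a likelihood ratio test:
       $2\log\left(\frac{P(x_1 \mid \mu_1, \theta)\,P(x_2 \mid \mu_2, \theta)}{P((x_1, x_2) \mid \mu, \theta)}\right) \sim \chi^2_1$ (approximately), where $\mu_1$ and $\mu_2$ are the separate means and $\mu$ is the common mean under the null hypothesis
     • A t-test can also be used if the covariance matrix of the parameter estimates is estimated
     • Issue: what if library sizes differ?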
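The likelihood ratio test on this slide can be sketched as follows. Several assumptions are made here and are not from the slides: the shared dispersion (`theta` in the code) is treated as known and positive, library sizes are taken to be equal (so the open issue in the last bullet is not addressed), each group mean is estimated by its sample mean (the maximum-likelihood estimate when the dispersion is fixed), and the statistic is twice the log likelihood ratio compared against a chi-square distribution with one degree of freedom. The function names and the counts are invented.

```python
import math

def nb_loglik(counts, mu, theta):
    """Negative Binomial log-likelihood with mean mu and over-dispersion theta > 0."""
    if mu == 0:
        # A zero mean puts all probability on zero counts.
        return 0.0 if all(k == 0 for k in counts) else float("-inf")
    r = 1 / theta
    p = theta * mu / (1 + theta * mu)
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + k * math.log(p) + r * math.log(1 - p)
        for k in counts
    )

def nb_lrt(x1, x2, theta):
    """Likelihood ratio test for equal means in two groups with shared theta."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    m0 = (sum(x1) + sum(x2)) / (len(x1) + len(x2))
    stat = 2 * (nb_loglik(x1, m1, theta) + nb_loglik(x2, m2, theta)
                - nb_loglik(x1 + x2, m0, theta))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Invented counts for one gene in two conditions.
stat, p_value = nb_lrt([10, 12, 9, 11], [40, 35, 50, 44], theta=0.1)
print(stat, p_value)
```

The chi-square approximation is asymptotic, so with very few replicates or very low counts the p-values are only rough.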
