Estimation of alternative splicing isoform frequencies from RNA-Seq data

Marius Nicolae, Computer Science and Engineering Department, University of Connecticut. Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.


Presentation Transcript


  1. Estimation of alternative splicing isoform frequencies from RNA-Seq data • Marius Nicolae • Computer Science and Engineering Department • University of Connecticut • Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky

  2. Outline • Introduction • EM Algorithm • Results • Conclusions and future work

  3. RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Gene Expression (GE) • Isoform Expression (IE) • Isoform Discovery (ID) [Figure: exon diagram, exons labeled A to E]

  4. Gene Expression Challenges • Read ambiguity (multireads) • What is the gene length? [Figure: exon diagram, exons labeled A to E]

  5. Previous approaches to GE • Ignore multireads • [Mortazavi et al. 08] • Fractionally allocate multireads based on unique read estimates • [Pasaniuc et al. 10] • EM algorithm for solving ambiguities • Gene length: sum of lengths of exons that appear in at least one isoform → Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

  6. Read Ambiguity in IE [Figure: exon diagram with labels A B C D E and A C]

  7. Previous approaches to IE • [Jiang & Wong 09] • Poisson model, single reads only • [Li et al. 10] • EM Algorithm, single reads only • [Feng et al. 10] • Convex quadratic program, pairs used only for ID • [Trapnell et al. 10] • Extends Jiang’s model to paired reads • Fragment length distribution

  8. Our contributions • EM Algorithm for IE • Single and paired reads • Fragment length distribution • Strand information • Base quality scores • Solving GE by adding isoform levels

  9. Outline • Introduction • EM Algorithm • Results • Conclusions and future work

  10. Read-Isoform Compatibility

  11. Fragment length distribution • Paired reads • Single reads [Figure: reads placed on isoforms with exons A B C and A C]
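The figure for this slide is lost, so here is a minimal sketch of the idea it appears to illustrate: the same read pair implies a different fragment length on each isoform, and a fragment length model scores those lengths differently. The exon lengths, read positions, and function names are hypothetical; the normal model with mean 250 and std. dev. 25 is borrowed from the experimental setup slide. This is not taken from the authors' code.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def transcript_position(isoform, exon_lengths, exon, offset):
    """Transcript coordinate of a base `offset` positions into `exon`.

    Returns None when the isoform does not contain that exon.
    """
    pos = 0
    for name in isoform:
        if name == exon:
            return pos + offset
        pos += exon_lengths[name]
    return None

def fragment_weight(isoform, exon_lengths, left, right, mean=250.0, std=25.0):
    """Score a read pair against one isoform via the implied fragment length.

    `left` and `right` are (exon, offset) positions of the fragment ends.
    """
    start = transcript_position(isoform, exon_lengths, *left)
    end = transcript_position(isoform, exon_lengths, *right)
    if start is None or end is None:
        return 0.0  # the pair is incompatible with this isoform
    return normal_pdf(end - start + 1, mean, std)

# Hypothetical gene: exon lengths are made up for illustration.
exon_lengths = {"A": 200, "B": 150, "C": 200}
left, right = ("A", 100), ("C", 49)

w_abc = fragment_weight(["A", "B", "C"], exon_lengths, left, right)  # length 300
w_ac = fragment_weight(["A", "C"], exon_lengths, left, right)        # length 150
print(w_abc > w_ac)
```

In this toy case the pair fits isoform A B C much better than A C, because a 300-base fragment is far more plausible under the model than a 150-base one.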

  12. IsoEM algorithm E-step M-step
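The update formulas on this slide did not survive extraction. As a stand-in, here is a minimal sketch of a generic EM loop for isoform frequencies, assuming each read carries a compatibility weight per isoform and each isoform has a known length. The function name, the input layout, and the toy data are assumptions, and the exact IsoEM updates may differ.

```python
def em_isoform_frequencies(read_weights, lengths, iterations=100):
    """Generic EM sketch for isoform frequencies.

    read_weights: one dict per read, mapping isoform -> compatibility weight.
    lengths: dict mapping isoform -> (effective) length.
    """
    isoforms = list(lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start
    for _ in range(iterations):
        # E-step: split each read among isoforms in proportion to freq * weight.
        expected = {i: 0.0 for i in isoforms}
        for weights in read_weights:
            total = sum(freq[i] * w for i, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for i, w in weights.items():
                expected[i] += freq[i] * w / total
        # M-step: length-normalize the expected read counts, then renormalize.
        per_base = {i: expected[i] / lengths[i] for i in isoforms}
        norm = sum(per_base.values())
        if norm == 0.0:
            break  # no usable reads; keep the current estimate
        freq = {i: per_base[i] / norm for i in isoforms}
    return freq

# Toy input: 3 reads unique to X, 1 unique to Y, 4 ambiguous between both.
reads = [{"X": 1.0}] * 3 + [{"Y": 1.0}] + [{"X": 1.0, "Y": 1.0}] * 4
freq = em_isoform_frequencies(reads, {"X": 1000, "Y": 1000})
gene_level = sum(freq.values())  # gene expression as the sum of isoform levels
print(freq)
```

The ambiguous reads end up allocated according to the frequencies the unique reads support. Summing isoform levels per gene, as in the last line, follows the "Solving GE by adding isoform levels" idea from the contributions slide.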

  13. Outline • Introduction • EM Algorithm • Results • Conclusions and future work

  14. Experimental setup • Human genome UCSC known isoforms • GNFAtlas2 gene expression levels • Uniform/geometric expression of gene isoforms • Normally distributed fragment lengths • Mean 250, std. dev. 25

  15. Accuracy measurements • Error Fraction (EF) • Percentage of isoforms (or genes) with relative error larger than given threshold t • Median Percent Error (MPE) • Threshold t for which EF is 50% • r² • Coefficient of determination
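The three measures can be sketched as follows. This is a minimal illustration, assuming relative error is |estimate − truth| / truth, that entries with zero true value are skipped, and that r² is computed as the squared Pearson correlation. These conventions and the function names are assumptions, not details from the slides.

```python
import statistics

def relative_errors(estimated, truth):
    """Relative error per isoform (or gene); entries with truth == 0 are skipped."""
    return [abs(e - t) / t for e, t in zip(estimated, truth) if t > 0]

def error_fraction(estimated, truth, threshold):
    """Percentage of entries whose relative error exceeds `threshold`."""
    errs = relative_errors(estimated, truth)
    return 100.0 * sum(1 for err in errs if err > threshold) / len(errs)

def median_percent_error(estimated, truth):
    """Median relative error in percent, i.e. the threshold at which EF is 50%."""
    return 100.0 * statistics.median(relative_errors(estimated, truth))

def r_squared(estimated, truth):
    """Squared Pearson correlation between estimates and true values."""
    n = len(truth)
    mean_e = sum(estimated) / n
    mean_t = sum(truth) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, truth))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov * cov / (var_e * var_t)

# Made-up values for illustration only.
truth = [10.0, 20.0, 30.0, 40.0]
estimated = [11.0, 20.0, 24.0, 60.0]
ef = error_fraction(estimated, truth, 0.15)
mpe = median_percent_error(estimated, truth)
print(ef, mpe, r_squared(estimated, truth))
```

Sweeping `threshold` over a range and plotting `error_fraction` against it gives an error fraction curve of the kind referred to on the following slides.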

  16. Isoform Error Fraction Curves • 30M single reads of length 25 • Main difference between IsoEM and RSEM is fragment length modeling

  17. Gene Error Fraction Curves • 30M single reads of length 25

  18. Read Length Effect • Fixed sequencing throughput (750Mb) • 50bp reads better than 100bp!

  19. Effect of Pairs & Strand Information • 1-60M 75bp reads • Pairs help, strand info doesn’t • [Trapnell et al. 10] r² = 0.95 for 13M PE reads

  20. Outline • Introduction • EM Algorithm • Results • Conclusions and future work

  21. Conclusions & Future Work • Presented EM algorithm for isoform frequency estimation that exploits fragment length distribution for both single and paired reads • Significant accuracy improvement over existing methods • Code and datasets to be released publicly soon • Ongoing extensions • Confidence intervals • Allele-specific isoform expression • Testing for novel isoforms • Integration with isoform discovery

  22. Questions?
