1 / 52

Cufflinks

Cufflinks. Matt Paisner , Hua He, Steve Smith and Brian Lovett. The Vision. RNAseq can be used for transcript discovery and abundance estimation

acton
Télécharger la présentation

Cufflinks

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett

  2. The Vision • RNAseq can be used for transcript discovery and abundance estimation • What’s missing: algorithms which aren’t restricted by prior gene annotations (which are often incomplete) and account for alternative transcription and splicing. • Hence, Cufflinks.

  3. The Need • Evidence of ambiguous assignment of isoforms. TSS site/promoter changes and splice site changes were found previously by the authors • Longer reads and pair end reads do not do enough

  4. The Biology • General assumption of randomization of reads • Central Dogma • Transcription Start Site (TSS) • Splice site • Isoform

  5. Central Dogma and Regulation

  6. Splicing • Thisblahblahblahblahblahblahisblahblahimportant • Thisblahblahblahblahblahblahisblahblahimportant • “This” “is” “important” - Exons • “blah” - Introns (Intrusions)

  7. Major Change 1

  8. Major Change 2

  9. Why it matters: Isoforms • Not only different sizes, but different shapes • Shape determines function • Isoforms would map to the same section of the genome: undetected without Cufflinks • Separating transcripts into isoforms elucidates a more realistic representation of what is happening

  10. TopHat Mapping short reads Trapnellet. al,Bioinformatics, 2009

  11. TopHat • No genome reference annotations are needed • The output of TopHat is the input of Cufflinks. • Input: Reads and genome • Output: Read mappings • Short reads present computational challenges • BOWTIE

  12. How does TopHat Work?! Big Idea: “Exon Inference”!!

  13. Step 1: Initial Mapping via Bowtie • Group 1: Mapped Reads (Segments) • Group 2: Initially Unmapped (IUM)Reads • possibly intron-spanning read • Based on Group 1, we want to get intron-spanning reads from Group 2 Reference Mapped Reads

  14. Step 2: Generate Putative Exons

  15. Step 3: Look for Potential Splice Signals Putative Exons

  16. Step 4: Seed-and-Extend

  17. Cufflinks Isoform/Transcript Detection and Quantification Trapnellet al, Nature Biotech, 2010

  18. Step 5: Identify Compatible Reads Two reads are compatible if their overlap contains the exact same implied introns (or none). If two reads are not compatible they are incompatible.

  19. Step 6: Less BIOLOGY, and NOW it is the time for some GRAPH THEORIES……. “We emphasize that the definition of a transcription locus is not biological……” - Authors

  20. Step 6: Create Overlap Graph Connect compatible reads in order Create a DAG

  21. A path in this graph corresponds to a transcript isoform

  22. Theory • Solving minimum pathcover (isoforms) in the overlap graph implies the fewest transcriptsnecessary to explain the reads. • Solve minimumpath cover by finding largest set of individual reads such that no two are compatible. • According to Dilworth Thereom, find a maximum matching in a bipartitegraph

  23. Step 8: Convert a DAG into a Bipartite Graph

  24. Step 9: Looking for Maximum Matching inside a bipartite graph via Bipartite Matching Algorithm BIPARTITE-MATCHING Algorithm: Add augmenting path via BFS, repeatedly adding the paths into the matching until none can be added.

  25. A path in this graph corresponds to a transcript isoform

  26. Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but

  27. How should expression levels be estimated? • A-B are distinguished by the presence of splice junction (a) or (b). • A-C are distinguished by the presence of splice junction (a) and change in UTR • B-C are distinguished by the presence of splice junction (b) and change in UTR (a) (b)

  28. How should expression levels be estimated? • Longer transcripts contain more reads. • Reads that could have originated from multiple transcripts are informative. • Relative abundance estimation requires “discriminatory reads”. (a) (b)

  29. A model for RNA-Seq • r = Transcript proportions for assignment of reads to transcripts • L = Likelihood of this assignment • R = all reads

  30. A model for RNA-Seq • T = All transcripts

  31. A model for RNA-Seq • Define: • Expected possible positions for an arbitrary fragment in Transcript t • F(i) = pr(random fragment has length i) • l(t) = Full length of transcript t

  32. A model for RNA-Seq

  33. A model for RNA-Seq • It (r) = Implied length of r’s fragment if r is assigned to transcript t • Recall: F(i) = pr(fragment length = i)

  34. Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but

  35. A model for RNA-Seq • Now we have a maximum likelihood function in terms of r, the distribution of reads among transcripts. • Non-negative linear model

  36. Inference with the sequencing model • Maximum likelihood function is concave - optimization using the EM algorithm. • Asymptotic MLE theory leads to a covariance matrix for the estimator in the form of the inverse of the observed Fisher information matrix • Importance sampling from the posterior distribution used for estimating the abundances from the posterior expectation, and 95% confidence intervals for the estimates. • This approach extends the log linear model of H. Jiang and W. Wong, Bioinformatics 2009 to a linear model for paired end reads. • For more background see Li et al., Bioinformatics, 2010 and Bullard et al., BMC Bioinformatics, 2010.

  37. Utility of Cufflinks • mRNA as proxy for gene expression & action • Control points • transcriptional vs • post transcriptional • Does isoform-level discovery & quantification matter? • Apparently, yes • Putatively discovered about 12K new isoforms while recovering about 13K known • Plus other stuff…

  38. The skeletal myogenesistranscriptomeRNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation 168 hours 120 hours -24 hours 60 hours myotube myoctyte • 123,575,666reads • 84,369,078 reads • 140,384,062reads • 82,138,212reads • 66,541,668alignments • 103,681,081alignments • 47,431,271alignments • 89,162,512alignments • 10,754,363to junctions • 19,194,697to junctions • 9,015,806to junctions • 17,449,848to junctions • 58,008transfrags • 69,716transfrags • 55,241transfrags • 63,664transfrags fusion differentiation (starting at 0 hours) Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832 Slide courtesy of Hector CorradaBravo

  39. Validation

  40. Validation

  41. Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but Slide courtesy of Hector CorradaBravo

  42. Discovery is necessary for accurate abundance estimates Slide courtesy of Hector CorradaBravo

  43. Some Questions… • Do isoforms of a given gene have interesting temporal patterns? • Increasing, decreasing, more complex… • What does this mean biologically? • What about transcriptional versus post transcriptional regulation? • Differential transcription • Differential splicing

  44. Dynamics of Myc expression Slide courtesy of Hector CorradaBravo

  45. Overloading Metric using Jensen-Shannon Divergence Average Entropy Entropy of Average Metric: One-sided t-test under the null hypothesis that there is no difference in abundance; Type I errors controlled with Benjamini-Hotchbergcorrection (FDR)

  46. Regulatory Overloading # Genes (FDR < 0.05) Differential splicing Fibronectin Tropomyosin 1 Mef2d … 231 17 Fhl3 Fhl1 Myl1 … 101 Differential TSS preference

  47. Dynamics of Myc expression d( , ) Slide courtesy of Hector CorradaBravo

  48. New TSS = New Points of Regulation Whatwould a “collapsed” RNA-seq alignment look like? Microarray? TSS=Transcription Start Site

More Related