Adrian Caciula Department of Computer Science Georgia State University Joint work with

Transcriptome Reconstruction from Single RNA-Seq Reads Using EM Algorithm with Expected Deviation Minimization Enhancement • Adrian Caciula, Department of Computer Science, Georgia State University • Joint work with Serghei Mangul (UCLA), Ion Mandoiu (UCONN), Alex Zelikovsky (GSU) • ISBRA 2013, Charlotte, NC

Outline • RNA-Seq: Background and Related work • EM-EDM: EM Algorithm with Expected Deviation Minimization • 1. Candidate transcripts construction • 2. EM for Isoform Expression Estimation • 3. EDM: Expected Deviation Minimization • Experimental Results • Conclusions

Alternative Splicing [Griffith and Marra 07]

Advances in Next Generation Sequencing • High-throughput RNA sequencing (RNA-Seq) reduces the sequencing cost and significantly increases data throughput. • Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length • Roche/454 FLX Titanium: 400-600 million reads/run, 400bp avg. length • SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length • Ion Proton Sequencer • http://www.economist.com/node/16349358

Genome-Guided RNA-Seq Protocol • From RNA (through the process of hybridization), make cDNA and shatter it into fragments • Sequence fragment ends • Map reads to genome • Gene Expression (GE) • Isoform Expression (IE) • Isoform Discovery (ID) • [Figure: reads mapped to a gene with exons A B C D E and its isoforms] [Nicolae et al., 10]

Transcriptome Reconstruction • Given partial or incomplete information, we need to make an informed guess about the missing or unknown data.

Transcriptome Reconstruction Types • Genome-independent reconstruction (de novo) • de Bruijn k-mer graph • Genome-guided reconstruction (ab initio) • Spliced read mapping • Exon identification • Splice graph • Annotation-guided reconstruction • Use existing annotation (known transcripts) • Focus on discovering novel transcripts

Previous approaches • Genome-independent reconstruction • Trinity (2011), Velvet (2008), TransABySS (2008) • Genome-guided reconstruction • Scripture (2010) • Reports “all” transcripts • Cufflinks (2010), IsoLasso (2011), SLIDE (2012) • Minimize the set of transcripts explaining the reads • Annotation-guided reconstruction • RABT (2011), DRUT (2011)

Outline • RNA-Seq: Background and Related work • EM-EDM: EM Algorithm with Expected Deviation Minimization • 1. Candidate transcripts construction • 2. EM for Isoform Expression Estimation • 3. EDM: Expected Deviation Minimization • Experimental Results • Conclusions

EM Algorithm with Expected Deviation Minimization • The EM-EDM algorithm starts with a set of N known candidate transcripts and initializes their frequencies (expression levels), f_t, with EM estimates. • It then incorporates EDM to improve the accuracy of EM.

EM Algorithm with Expected Deviation Minimization • Step 1: Map the RNA-Seq reads to the genome (using TopHat) • Step 2: Construct the splice graph G(V, E) • V: exons • E: splicing events • Step 3: Build the candidate transcripts • depth-first search (DFS) • Step 4: Apply EM-EDM to compute expression levels for all candidates • Step 5: Filter candidate transcripts based on expression levels
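Steps 2 and 3 can be illustrated with a minimal sketch. It assumes that exons are numbered in genomic order, that each splicing event is a pair (source exon, destination exon) pointing forward so the graph is acyclic, and that the start and end exons of transcripts are known. The function names `build_splice_graph` and `enumerate_candidates` and the toy junction list are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_splice_graph(junctions):
    """Adjacency list: exon -> sorted exons reachable by one splicing event."""
    graph = defaultdict(set)
    for src, dst in junctions:
        graph[src].add(dst)
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def enumerate_candidates(graph, starts, ends):
    """Depth-first search listing every path from a start exon to an end exon.

    Assumes the graph is acyclic (edges always point downstream).
    """
    ends = set(ends)
    candidates = []

    def dfs(node, path):
        path.append(node)
        if node in ends:
            candidates.append(tuple(path))
        for nxt in graph.get(node, []):
            dfs(nxt, path)
        path.pop()  # backtrack before exploring sibling branches

    for start in starts:
        dfs(start, [])
    return candidates


# Toy input (assumed): exon 2 is optionally skipped between exons 1 and 3.
junctions = [(1, 2), (1, 3), (2, 3), (3, 4)]
graph = build_splice_graph(junctions)
candidates = enumerate_candidates(graph, starts=[1], ends=[4])
```

On this toy graph the search yields two candidate transcripts, one including exon 2 and one skipping it. The number of paths can grow exponentially with the number of alternative exons, which is one reason a later filtering step on expression levels is useful.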

1. Candidate transcripts construction: Gene representation • Pseudo-exons (pse_i): regions of a gene between consecutive transcriptional or splicing events • Gene: set of non-overlapping pseudo-exons • [Figure: transcripts Tr1: e1 e5, Tr2: e1 e3 e5, Tr3: e2 e4 e6, split into pseudo-exons pse1 to pse7 with boundary labels Spse_i and Epse_i]
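The pseudo-exon definition above can be sketched as an interval-splitting routine. The sketch assumes exons are given as half-open (start, end) coordinate pairs and treats every exon boundary as a transcriptional or splicing event; the function name `pseudo_exons` and the coordinates are hypothetical, since the slide gives no coordinates.

```python
def pseudo_exons(exons):
    """Split a gene into non-overlapping pseudo-exons.

    exons: list of (start, end) half-open intervals, possibly overlapping.
    Returns the segments between consecutive boundaries that are covered
    by at least one exon (uncovered segments are intronic and dropped).
    """
    boundaries = sorted({b for start, end in exons for b in (start, end)})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        if any(start <= left and right <= end for start, end in exons):
            result.append((left, right))
    return result


# Toy input (assumed): two overlapping exons and one separate exon.
exons = [(0, 100), (50, 150), (200, 300)]
pse = pseudo_exons(exons)
```

Here the two overlapping exons are cut into three pseudo-exons, while the intronic gap between 150 and 200 produces none.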

1. Candidate transcripts construction: Splice Graph Construction • [Figure: pseudo-exons pse1 to pse9 laid out along the genome between TSS and TES, connected by single spliced reads]

2. EM for Isoform Expression Estimation • Read Ambiguity in Isoform Expression • [Figure: exon structures A B C D E and A C]

Previous approaches to Isoform Expression • [Nicolae et al. 10] • Fragment length distribution • [Li et al. 10] • EM Algorithm, single reads • [Feng et al. 10] • Convex quadratic program, pairs used only for ID • [Trapnell et al. 10] • Extends Jiang’s model to paired reads • Fragment length distribution

Read-Isoform Compatibility • [Figure: transcripts and reads] • Q_a represents the probability of observing the read from the genome locations described by the alignment a. • This is computed from the base quality scores as described in [Nicolae et al., 10]

Fragment length distribution • [Figure: isoforms A B C and A C with read positions i and j and the values F_a(i) and F_a(j)] • For single reads, F_a is defined as the probability of observing a fragment with a length of u bases or fewer. • For more details see IsoEM [Nicolae et al., 10]

Generic EM algorithm • Initialization: uniform transcript frequencies f_t • E-step: Compute the expected number n_t of reads sampled from transcript t, assuming current transcript frequencies f_t • M-step: For each transcript t, set f_t = portion of reads emitted by transcript t among all reads in the sample • ML estimates for f_t = n_t / (n_1 + ... + n_T)
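The generic EM loop above can be sketched as follows. This is a simplified illustration, not the IsoEM or EM-EDM implementation: it assumes each read is given as a dictionary mapping compatible transcript indices to a compatibility weight (set to 1.0 here), and it ignores transcript lengths and fragment length distributions. The function name `em_frequencies` and the toy reads are hypothetical.

```python
def em_frequencies(reads, num_transcripts, iterations=100):
    """Estimate transcript frequencies f_t with a plain EM loop.

    reads: list of dicts {transcript_index: compatibility_weight}.
    """
    # Initialization: uniform transcript frequencies.
    freqs = [1.0 / num_transcripts] * num_transcripts
    for _ in range(iterations):
        # E-step: expected number n_t of reads sampled from each transcript.
        counts = [0.0] * num_transcripts
        for compat in reads:
            total = sum(freqs[t] * w for t, w in compat.items())
            if total == 0.0:
                continue  # read cannot be explained by current frequencies
            for t, w in compat.items():
                counts[t] += freqs[t] * w / total
        # M-step: f_t = n_t / (n_1 + ... + n_T).
        assigned = sum(counts)
        freqs = [c / assigned for c in counts]
    return freqs


# Toy input (assumed): two reads unique to transcript 0, one ambiguous
# read, and one read unique to transcript 1.
reads = [{0: 1.0}, {0: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}]
freqs = em_frequencies(reads, num_transcripts=2)
```

For this toy input the update reduces to f_0 = (2 + f_0) / 4, so the estimate converges to f_0 = 2/3 and f_1 = 1/3: the ambiguous read is split in proportion to the evidence from the unambiguous ones.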

3. EDM: Expected Deviation Minimization • EDM Motivation: Reducing the error rate is critical for detecting similar transcripts, especially when one is a subset of another. • EDM is a fine-tuning step for frequency estimation that further improves its accuracy.

Expected Deviation Minimization method (EDM). Let l_t be the adjusted length of transcript t ∈ H (i.e., the length of t minus the average fragment length), where H is the set of all candidate transcripts. The expected read frequency e′_i

Expected Deviation Minimization method (EDM). • The transcript frequencies can be estimated by the following iterative process: • Initialize f_t with the corresponding EM frequency • EDM increments and decrements transcript frequencies in order to decrease the total deviation (between observed and expected read frequencies).

Expected Deviation Minimization method (EDM). Each iteration consists of the following three steps: Step 1: Set D = 1 and C = 0.05.

Expected Deviation Minimization method (EDM). Step 2:

Expected Deviation Minimization method (EDM). Step 3:

Outline • RNA-Seq: Background and Related work • EM-EDM: Expectation Maximization Algorithm with Expected Deviation Minimization Enhancement • 1. Gene representation and candidate transcripts • 2. EM for Isoform Expression Estimation • 3. EDM: Expected Deviation Minimization • Experimental Results • Conclusions

Simulation Setup • Human genome data (UCSC hg18) • UCSC database: 66,803 isoforms, 19,372 genes • Single error-free reads: 60M of length 100bp • For the partially annotated genome: remove exactly one isoform from every gene
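The partially annotated setting can be mimicked with a short sketch. The slide does not say how the removed isoform is chosen, so the uniformly random choice below is an assumption, as are the function name `drop_one_isoform`, the dictionary layout, and the toy gene and isoform identifiers (assumed unique within a gene).

```python
import random


def drop_one_isoform(annotation, rng):
    """Remove exactly one isoform from every gene.

    annotation: dict {gene_id: [isoform_id, ...]} with unique isoform ids.
    Returns (partial_annotation, hidden), where hidden maps each gene to
    the isoform that was removed. The random choice is an assumption.
    """
    partial, hidden = {}, {}
    for gene, isoforms in annotation.items():
        removed = rng.choice(isoforms)
        hidden[gene] = removed
        partial[gene] = [iso for iso in isoforms if iso != removed]
    return partial, hidden


# Toy annotation (assumed identifiers, not taken from the UCSC data).
annotation = {"geneA": ["A.1", "A.2", "A.3"], "geneB": ["B.1", "B.2"]}
partial, hidden = drop_one_isoform(annotation, random.Random(0))
```

The hidden isoforms then play the role of novel transcripts that a reconstruction method should recover from the reads.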

[Figure: Distribution of isoform lengths and gene cluster sizes in the UCSC dataset]

Comparison Between Methods

Conclusions • We proposed EM-EDM, an annotation-guided method for transcriptome discovery and reconstruction • EM-EDM outperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks) in terms of sensitivity • For future work we plan to improve the filtering algorithm in order to increase the PPV and to extend our work to paired-end reads.
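For reference, sensitivity and PPV can be computed as in the sketch below. It uses the standard definitions (sensitivity = TP / (TP + FN), PPV = TP / (TP + FP)) and assumes a predicted transcript counts as correct only when its exon tuple exactly equals an annotated one; the slides do not state the matching criterion, and the function name `sensitivity_ppv` and the toy transcripts are hypothetical.

```python
def sensitivity_ppv(predicted, annotated):
    """Return (sensitivity, PPV) for sets of transcripts.

    A transcript is a tuple of exon ids; exact equality counts as a match
    (an assumed criterion).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Toy input (assumed): three annotated transcripts, four predictions.
annotated = [("e1", "e5"), ("e1", "e3", "e5"), ("e2", "e4", "e6")]
predicted = [("e1", "e5"), ("e1", "e3", "e5"), ("e1", "e2", "e5"), ("e2", "e6")]
sens, ppv = sensitivity_ppv(predicted, annotated)
```

In this toy case two of three annotated transcripts are recovered and two of four predictions are correct, which shows how reporting more candidates can raise sensitivity while lowering PPV.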

Thanks!
