Assembling Genome

Assembling Genome. Timothee Cezard, EBI NGS workshop, 16/10/2012. Assembly algorithms. Goal: find the shortest sequence that contains every read in a set (the shortest common superstring). This is an NP-hard problem, so we need to use an approximation algorithm. Main algorithms used: Overlap-Layout-Consensus and De Bruijn graphs.


Presentation Transcript


  1. Assembling Genome. Timothee Cezard, EBI NGS workshop, 16/10/2012

  2. Assembly Algorithms
     • Goal: Find the shortest sequence that contains every read in a set (the shortest common superstring).
     • This is an NP-hard problem, so we need to use an approximation algorithm.
     Main algorithms used:
     • Overlap-Layout-Consensus
     • De Bruijn graphs

  3. Overlap-layout-consensus Step 1: Find Overlapping Reads
     • Needs an efficient alignment algorithm
     • Doesn't scale well when the number of reads is high
     • Use seed-based alignment with extension

     TACATAGATTACACAGATTACTGA
     ||  ||||||||||||||||||
     TAGTTAGATTACACAGATTACTAGA
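A minimal Python sketch of the seed-and-extend idea from this slide, assuming exact matches only (real overlappers tolerate mismatches and indels, as the alignment on the slide does). The function name `find_overlaps` and the parameters `seed_len` and `min_overlap` are illustrative choices, not taken from any assembler.

```python
from collections import defaultdict


def find_overlaps(reads, seed_len=4, min_overlap=5):
    """Return {(i, j): n} where the last n bases of reads[i] equal
    the first n bases of reads[j] (exact matches only)."""
    if min_overlap < seed_len:
        raise ValueError("min_overlap must be at least seed_len")

    # Index each read by its prefix seed, so candidate partners are
    # found by lookup instead of an all-vs-all comparison.
    prefix_index = defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= seed_len:
            prefix_index[read[:seed_len]].append(idx)

    overlaps = {}
    for i, a in enumerate(reads):
        # Start at 1 so that a read is never reported as overlapping
        # over its full length from position 0.
        for pos in range(1, len(a) - min_overlap + 1):
            seed = a[pos:pos + seed_len]
            for j in prefix_index.get(seed, ()):
                if j == i:
                    continue
                b = reads[j]
                n = len(a) - pos
                # Extension step: the seed hit must extend to the end of a.
                if n <= len(b) and a[pos:] == b[:n] and n > overlaps.get((i, j), 0):
                    overlaps[(i, j)] = n
    return overlaps


reads = ["ATTCACGTAG", "CGTAGTGGCAT"]
print(find_overlaps(reads))
```

The prefix index is what keeps the number of comparisons manageable: only pairs that share a seed are ever extended.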

  4. Overlap-layout-consensus Step 2: Construct overlap graph
     A graph is constructed:
     • Nodes are reads
     • Edges represent overlapping reads
     [Figure: overlap graph with reads ATTCACGTAG and CGTAGTGGCAT]

  5. Overlap-layout-consensus Step 3: Find Contigs
     Try to find a Hamiltonian path:
     • a path in the graph that contains each node exactly once
     • Computationally expensive
     [Figure: overlap graph with reads CGTAGTGGCAT and ATTCACGTAG]
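To make the cost of this step concrete, here is a brute-force Python sketch that tries every ordering of the reads and keeps the Hamiltonian path with the largest total overlap. It is only feasible for a handful of reads, which is the point of the "computationally expensive" remark. The third read `TGGCATTA` and the names `overlap`, `best_hamiltonian_path` and `layout` are made up for illustration.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def best_hamiltonian_path(reads, min_len=3):
    """Try every read order (n! of them) and keep the best valid path."""
    best_order, best_score = None, -1
    for order in permutations(range(len(reads))):
        score = 0
        for i, j in zip(order, order[1:]):
            w = overlap(reads[i], reads[j], min_len)
            if w == 0:
                break  # no edge between i and j, so this is not a path
            score += w
        else:
            if score > best_score:
                best_order, best_score = order, score
    return best_order, best_score


def layout(reads, order, min_len=3):
    """Merge reads along the path, dropping each overlapping prefix."""
    contig = reads[order[0]]
    for i, j in zip(order, order[1:]):
        contig += reads[j][overlap(reads[i], reads[j], min_len):]
    return contig


reads = ["CGTAGTGGCAT", "ATTCACGTAG", "TGGCATTA"]
order, score = best_hamiltonian_path(reads)
print(order, score, layout(reads, order))
```

With n reads the loop visits n! orderings, so practical assemblers rely on heuristics rather than an exhaustive search like this one.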

  6. Overlap-layout-consensus
     • This approach is used in Celera (CABOG), Newbler, Mira, SGA…
     • It is mostly used with Sanger or 454 data.
     • Can't resolve repeats longer than the read length
     • Could make a comeback if reads get longer.

  7. De Bruijn Graphs example
     "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, ..."
     Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman & Hall
     Velvet example courtesy of J. Leipzig 2010

  8. De Bruijn Graphs example
     itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
     Generate random 'reads'. How do we assemble?
     fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof …etc. to tens of millions of reads
     Traditional all-vs-all comparisons of datasets this size require immense computational resources.
     De Bruijn solution: Construct a graph efficiently

  9. De Bruijn Graphs Step 1: "Kmerize" the data

     Reads:        theageofwi  sthebestof  astheageof  worstoftim  imesitwast
     Kmers (k=3):  the         sth         ast         wor         ime
                   hea         the         sth         ors         mes
                   eag         heb         the         rst         esi
                   age         ebe         hea         sto         sit
                   geo         bes         eag         tof         itw
                   eof         est         age         oft         twa
                   ofw         sto         geo         fti         was
                   fwi         tof         eof         tim         ast

     …etc. for all reads in the dataset
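The kmerization step can be sketched in a few lines of Python. This is an illustrative version using the five reads from the slide; the helper names `kmerize` and `count_kmers` are assumptions, and real assemblers use far more compact data structures than a `Counter`.

```python
from collections import Counter


def kmerize(read, k):
    """Return every k-long substring of `read`, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmerize(read, k))
    return counts


reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
print(kmerize(reads[0], 3))
print(count_kmers(reads, 3).most_common(3))
```

A read of length L yields L - k + 1 kmers, so each 10-letter read here contributes 8 kmers at k=3.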

  10. De Bruijn Graphs Step 2: Build the graph
     Look for k-1 overlaps, given by the reads.
     [Figure: graph linking kmers that overlap by k-1 characters, for example ast → sth → the → hea → eag → age → geo → eof → ofw → fwi, with the → heb → ebe → bes → est → sto → tof branching off at "the"]
     …etc. for all 'kmers' in the dataset

  11. De Bruijn Graphs Step 3: Simplify the graph
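One common simplification is to merge every unbranched chain of kmers into a single node. The Python sketch below does this for a graph stored as a dict mapping each kmer to the set of kmers that follow it; that representation, the function name `compress` and the tiny example graph are assumptions for illustration. It does not remove tips or bubbles, and isolated cycles are skipped.

```python
def compress(graph):
    """Merge unbranched chains of kmers into longer strings."""
    # Collect predecessors so that in-degrees are available.
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for s in succs:
            preds.setdefault(s, set()).add(node)

    def succs_of(node):
        return graph.get(node, set())

    def linked(u, v):
        # u -> v can be merged when u has no other successor
        # and v has no other predecessor.
        return len(succs_of(u)) == 1 and len(preds[v]) == 1

    contigs = []
    for node in preds:
        if len(preds[node]) == 1:
            (p,) = preds[node]
            if linked(p, node):
                continue  # node sits inside a chain, not at its start
        path, cur = node, node
        while len(succs_of(cur)) == 1:
            (nxt,) = succs_of(cur)
            if not linked(cur, nxt):
                break
            path += nxt[-1]  # each step adds one new character
            cur = nxt
        contigs.append(path)
    return contigs


graph = {
    "sth": {"the"}, "the": {"hea", "heb"},
    "hea": {"eag"}, "eag": {"age"}, "age": set(),
    "heb": {"ebe"}, "ebe": set(),
}
print(sorted(compress(graph)))
```

The chain stops at "the" because it has two successors, which is exactly where the graph becomes ambiguous.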

  12. De Bruijn Graphs Step 4: Create contigs
     No single solution! Break the graph to give the final assembly

  13. De Bruijn example
     The final assembly (k=3): foolishness, itwasthe, st, wor, wisdom, times, age, incredulity, epoch, be, of, belief
     Repeat with a longer "kmer" length.
     A better assembly (k=10): itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
     Why not always use the longest 'k' possible? Sequencing errors:
     • k=3: the read sthebentof gives kmers sth the heb ebe ben ent nto tof, so most kmers are unaffected
     • k=10: the read sthebentof gives the single kmer sthebentof, a 100% wrong kmer
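The effect of a single sequencing error at different values of k can be checked with a short Python sketch. It compares the kmers of the erroneous read `sthebentof` with those of the true sequence `sthebestof`; the function name `wrong_kmer_fraction` is an illustrative assumption.

```python
def kmerize(read, k):
    """Return every k-long substring of `read`, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_kmer_fraction(true_read, observed_read, k):
    """Fraction of kmers in the observed read that are absent from the true read."""
    true_kmers = set(kmerize(true_read, k))
    observed = kmerize(observed_read, k)
    wrong = [kmer for kmer in observed if kmer not in true_kmers]
    return len(wrong) / len(observed)


for k in (3, 10):
    print(k, wrong_kmer_fraction("sthebestof", "sthebentof", k))
```

A single wrong base can corrupt up to k kmers, so as k approaches the read length every kmer from an erroneous read becomes unusable.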

  14. Strengths and problems of the De Bruijn approach
     Strengths:
     • No need to calculate the overlaps
     • Size of the final graph is a function of the genome size
     • Repeats are collapsed
     Problems:
     • Can only resolve repeats up to k long
     • Loses connectivity when creating the contigs

  15. Resolve repeats through scaffolding
     • Contigs from the assembly
     • Align reads from a short-insert or long-insert library
     • Join contigs using evidence from paired-end data
     • Scaffold

  16. De Bruijn assemblers
     • Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
     • ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
     • SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
     • ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
     • IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

  17. What makes an assembly good?
     • High coverage: 50 to 100X
     • Different but precise insert-size libraries
     • Few or no sequencing errors
     • Avoid a large number of variants
     • Try different assemblers
     • Need a machine with a lot of memory (from 16 GB to 1 TB)

  18. What makes your assembly better?
     • Error correction: correct the reads before assembly
       http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full
       • SOAPdenovo
       • Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
       • SGA: https://github.com/jts/sga
     • Joining overlapping reads:
       • COPE: ftp://ftp.genomics.org.cn/pub/cope/
       • FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml

  19. What makes your assembly better?
     • Gap filling: IMAGE (Tsai et al., Genome Biology 2010)

  20. Assembly validation
     • N50 is the most commonly used metric:
       • Weighted median such that 50% of your assembly is contained in contigs of length >= N50
     • CEGMA: Core Eukaryotic Genes Mapping Approach
       • Looks in your assembly for genes that should be there
       • Usually the best assemblies have the best CEGMA scores
       • http://korflab.ucdavis.edu/datasets/cegma/
     • There is no magic tool
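The N50 definition above translates directly into a few lines of Python. This is a minimal sketch; the function name `n50` and the contig lengths in the example are made up for illustration.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold
    at least half of the total assembly size."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


contig_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical assembly
print(n50(contig_lengths))
```

Because N50 only looks at lengths, an assembler that joins contigs incorrectly can still report a high value, which is one reason to pair it with a content-based check such as CEGMA.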
