
The Changing Face of Sequencing



Presentation Transcript

  1. The Changing Face of Sequencing: Strategies for de novo sequencing of complex genomes

  2. Quick Review: • BACs • Whole Genome Shotgun

  3. First some history…
     • 2000: Arabidopsis (BAC)
     • 2005: Rice (BAC & WGS)
     • 2006: Poplar (WGS)
     • 2007: Grapevine (WGS)
     • 2008: Maize (BAC)
     • 2008: Papaya (WGS)
     • 2009: Sorghum (WGS)

  4. BAC-based vs WGS

  5. What made WGS possible?
     • Long, high-quality Sanger reads (700-800bp)
     • Paired-end libraries
     • Range of insert sizes
       • 3kb
       • 8-10kb
       • 40kb fosmids
     • Assemblers tailored to these data types
     • Still not guaranteed…
       • The public maize project went BAC by BAC

  6. NGS changes all the rules
     • Quantity, not quality, is now the focus
       • New platforms generate huge quantities of data
       • Read length and paired-end (PE) protocols initially limited de novo applications
     • Rapid cycle of improvements
       • No time for standard approaches to spread beyond genome centers before the next cycle begins
       • Third-party software is sometimes slow to catch up
     • Cost model has changed
       • Library construction used to be a minor component of cost
       • The unit used to be 96 or 384 reads…
     • Choice is now more complex than BAC vs WGS

  7. One size does not fit all
     • Every project has individual needs
     • A monolithic reference genome is rarely needed now
     • How bad are the repeat structures?
       • Is it important to get them right?
     • How important is it to anchor all the sequence to a genome location?
     • What other genome data can be leveraged?

  8. BACs and NGS: the problem
     • Pre-NGS, to sequence a BAC:
       • Make 1 sequencing library: ~$50-100
       • Sequence two 384-well plates of clones: ~$750
       • ~6x coverage
     • With NGS, to sequence a BAC with 454:
       • Make 1 sequencing library: ~$300
       • Sequence 1/8 plate of 454: ~$1,000
       • ~600x coverage
     • Too expensive, and too much coverage…
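The coverage and cost figures on this slide follow from simple arithmetic. The sketch below reproduces them under three assumptions that are not stated on the slide: a 100 kb BAC insert, 800 bp Sanger reads, and roughly 150,000 reads of 400 bp from one eighth of a 454 plate. The function name is illustrative only.

```python
def fold_coverage(n_reads, read_len_bp, target_bp):
    """Average depth: total sequenced bases divided by the target size."""
    return n_reads * read_len_bp / target_bp


BAC_SIZE_BP = 100_000  # assumption: typical BAC insert size, not given on the slide

# Pre-NGS: two 384-well plates of Sanger reads (read length assumed to be 800 bp)
sanger_cov = fold_coverage(2 * 384, 800, BAC_SIZE_BP)

# 454: one eighth of a plate (read count assumed to be ~150,000 at 400 bp)
ngs_cov = fold_coverage(150_000, 400, BAC_SIZE_BP)

# Per-BAC cost = library + sequencing, using the slide's upper library estimate
sanger_cost = 100 + 750
ngs_cost = 300 + 1000

print(f"Sanger: {sanger_cov:.1f}x for ${sanger_cost}")
print(f"454:    {ngs_cov:.0f}x for ${ngs_cost}")
```

Under these assumptions the 454 route costs more per BAC and delivers about a hundred times more depth than an assembly needs, which is the motivation for pooling several BACs into each library.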

  9. New BAC-based approaches • One library per BAC is cost-prohibitive • Map-based BAC pooling • Retain some of the assembly benefits of BACs • Reduced library costs over BAC-by-BAC • If contiguous, retains the genome localization benefits

  10. BAC pooling strategy (workflow figure, from Rounsley et al. (2009))
     • Chr3 short arm: select FPC contigs on the short arm
     • Select overlapping BACs and bin them into 3Mb pools
     • Pyrosequencing of BAC pools and assembly of raw sequences: ~20x 454 Titanium reads (~400bp each) give contigs from individual BAC pools
     • Contigs are organized into scaffolds using 454 FLX paired-end sequences (~250bp each), giving scaffolds from individual BAC pools
     • Use BAC ends for very long scaffolds: generate superscaffolds using BAMBUS and BAC end sequences, giving superscaffolds that span pool boundaries

  11. Results: Chr3S of Oryza barthii
     • 6 x 3Mb BAC pools
     • 1 Titanium run
     • 0.5 FLX run
     • ~$12k in reagents
     • Contig N50: 14.3 kb
     • Scaffold N50: 370.9 kb
     • Scaffold N50: 3,165.1 kb (after BAC ends)
     • Nucleotide accuracy: 2.2 errors per 10kb
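The contig and scaffold N50 values reported here use the standard assembly statistic: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation follows; the toy lengths are made up for illustration and are not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


toy_contigs = [2, 3, 4, 5, 6]  # hypothetical lengths, total = 20
print(n50(toy_contigs))
```

Because a few long sequences dominate the statistic, joining scaffolds with BAC end sequences can raise the N50 sharply even when the total amount of assembled sequence barely changes.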

  12. 2D pooling: An alternative to contiguous BAC pools
     • Place ordered clones in plates
       • 1 library from each row
       • 1 library from each column
     • Identify reads from each individual clone by sequence overlap
     • Then assemble each clone
       • Assembly unit reduced to roughly a single BAC
     • Library cost drops with size of grid
       • 10x10: 100 clones, 20 libraries
       • 50x50: 2500 clones, 100 libraries
     • 3D grid lowers cost even further
       • 10x10x10: 1000 clones, 30 libraries
       • 20x20x20: 8000 clones, 60 libraries
     • Repeats may misbehave, but one can choose to ignore them
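The pooling arithmetic can be sketched as follows: a grid holds the product of its dimensions in clones but needs only the sum of its dimensions in libraries. The second function is a deliberately simplified deconvolution that assigns a clone every read identifier found in both its row pool and its column pool. Exact set intersection stands in for the sequence-overlap step described on the slide, and all names and toy data here are hypothetical.

```python
from math import prod


def pooling_design(*dims):
    """Return (clones, libraries) for a grid with the given dimensions."""
    return prod(dims), sum(dims)


def deconvolute(row_pools, col_pools):
    """Assign to clone (r, c) the reads seen in both row pool r and column pool c."""
    return {
        (r, c): row_reads & col_reads
        for r, row_reads in enumerate(row_pools)
        for c, col_reads in enumerate(col_pools)
    }


for dims in [(10, 10), (50, 50), (10, 10, 10)]:
    print(dims, pooling_design(*dims))

# Toy 2x2 grid; "rep" is a repeat present in two different clones
clones = {
    (0, 0): {"a1", "a2"},
    (0, 1): {"b1", "rep"},
    (1, 0): {"c1", "rep"},
    (1, 1): {"d1"},
}
row_pools = [set().union(*(v for (r, _), v in clones.items() if r == i)) for i in range(2)]
col_pools = [set().union(*(v for (_, c), v in clones.items() if c == j)) for j in range(2)]
assigned = deconvolute(row_pools, col_pools)
print(assigned[(0, 0)])
```

In the toy grid the repeat is wrongly assigned to clone (0, 0), which never contained it. That is one way repeats can misbehave under pooling.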

  13. The ideal… • One library per BAC clone • Barcoded • Sequence all clones from the BAC library in one combined, barcoded pool • BUT: currently not cost-effective • Individual DNA preps for thousands of BAC clones are costly

  14. Is WGS with NGS feasible yet?
     • With 454:
       • 400bp reads, plus 4kb and 20kb insert PE protocols
     • Success may be species- and goal-dependent:
       • Arabidopsis (Roche & Ecker)
         • Small, with low repeat content
         • 21kb contig N50; 2.6Mb scaffold N50
       • Cassava (Roche & JGI)
         • 800Mb, lots of repeats
         • 5.3kb contig N50; 180kb scaffold N50
         • Missing half of the genome (the repetitive half)

  15. WGS with Solexa/Illumina
     • Improved read lengths and PE protocols
     • Improved third-party assemblers
       • e.g. SOAPdenovo, Velvet
     • Cucumber genome (BGI)
       • 300Mb genome
       • 50x coverage with 50bp PE
       • 5kb contig N50, 60kb scaffold N50
       • Much better when mixed with 4x Sanger
       • Missing half of genome (repeats)
     • Panda genome (BGI)
       • 3Gb genome
       • 50x coverage with 75bp PE
       • 300kb contig N50 (?)
     • Big question: what is the misassembly rate?

  16. Building contigs from overlapping clones
     • 5 overlapping BAC clones form a small contig
     • Cut with a restriction enzyme (R.E.)
     • Overlapping BACs share common fragments

  17. Building contigs from overlapping clones
     • Measure lengths: overlapping BACs will share fragments of the same size
     • Make a sequencing library and sequence from each cut site: overlapping BACs will share sequence tags next to each cut site

  18. A BAC-WGS hybrid? Whole genome profiling by Keygene
     • A: Solexa-based BAC map
       • Construct BAC library; array into 2D pools
       • Cut with restriction enzyme, and make 1 library per pool
       • Generate sequence from libraries
       • Deconvolute pools to identify the Solexa reads from each BAC
       • Build a map from overlaps
       • Map has a short sequence tag every 1-2kb in the genome
     • B: WGS sequencing with Solexa
       • Assemble short contigs (high stringency)
       • Use the above map to locate each contig in the genome
       • Map can identify misassemblies
     • C: Result:
       • High-quality map-based genome at a fraction of the cost

  19. Simulation of tag-based map building
     • Rice: 372Mb, 12 chromosomes
     • Simulate a 10x BAC library
       • 28,600 clones
     • Cut the sequence for each clone with HindIII
     • Simulate a short read sequence from each site
       • 2.2 million sequence tags
     • Build a map from these: overlapping clones share tags
       • 33 contigs built (<3 contigs per chromosome)
       • Only 1 misassembly!
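The core of such a simulation can be sketched in a few lines: find each HindIII recognition site (AAGCTT) in a clone, take a short tag next to it, and call two clones overlapping when they share tags. This is an illustrative toy, not the simulation code behind the slide. The 8 bp tag length, the choice to read only downstream of the recognition site, and the hand-built genome are all assumptions.

```python
HINDIII_SITE = "AAGCTT"


def site_tags(clone_seq, tag_len=8):
    """Collect the tag_len bases immediately downstream of every HindIII site."""
    tags = set()
    pos = clone_seq.find(HINDIII_SITE)
    while pos != -1:
        start = pos + len(HINDIII_SITE)
        tag = clone_seq[start:start + tag_len]
        if len(tag) == tag_len:  # skip tags truncated by the clone end
            tags.add(tag)
        pos = clone_seq.find(HINDIII_SITE, pos + 1)
    return tags


def shared_tags(clone_a, clone_b, tag_len=8):
    """Tags common to two clones; a non-empty result suggests they overlap."""
    return site_tags(clone_a, tag_len) & site_tags(clone_b, tag_len)


# Hand-built toy genome: four sites, each followed by a distinct T-free tag
toy_tags = ["ACGACGAC", "CCGGCAGG", "GAGAGAGA", "CACACACA"]
spacer = "C" * 30
toy_genome = spacer + "".join(HINDIII_SITE + t + spacer for t in toy_tags)

clone_a = toy_genome[0:100]    # covers sites 1 and 2
clone_b = toy_genome[60:150]   # covers sites 2 and 3
clone_c = toy_genome[150:206]  # covers site 4 only

print(shared_tags(clone_a, clone_b))  # overlapping clones share a tag
print(shared_tags(clone_a, clone_c))  # non-overlapping clones share none
```

As a consistency check on the slide's numbers, a 10x library of 28,600 clones over 372Mb implies a mean clone size of about 130kb (3,720Mb / 28,600).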

  20. So you want to sequence a genome?
     • Lots of choices to make:
       • BACs, WGS
       • Which NGS technology?
       • Single end, paired end?
       • What size paired ends?
       • What depth of coverage from each?
     • How do you pick?
       • Do lots of testing of strategies: $$$$$
       • Guess: free
       • Copy what someone else did: free
       • Educated guess based on simulation

  21. How to decide on a strategy? Simulating Genome Sequencing • “Plantagora” • Plant Genome Assembly Simulation Platform • Use existing genomes to simulate sequencing reads • Combine reads in many combinations • Assemble • Score the results with meaningful metrics • Report results on web site
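The first step listed above, simulating sequencing reads from an existing genome, can be sketched as below. This is not Plantagora's implementation. It assumes error-free, single-end reads sampled uniformly from one strand, and the function name and parameters are hypothetical.

```python
import math
import random


def simulate_reads(reference, read_len, coverage, seed=0):
    """Sample error-free reads uniformly until the target fold coverage is reached."""
    if read_len > len(reference):
        raise ValueError("read_len exceeds reference length")
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    n_reads = math.ceil(coverage * len(reference) / read_len)
    last_start = len(reference) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(last_start + 1)
        reads.append(reference[start:start + read_len])
    return reads


toy_reference = "ACGT" * 250  # 1,000 bp placeholder sequence
reads = simulate_reads(toy_reference, read_len=50, coverage=10)
print(len(reads), "reads of", len(reads[0]), "bp")
```

A realistic simulator would also model sequencing errors, paired ends with an insert-size distribution, and platform-specific read-length profiles, so that mixes of data types can be compared before any sequencing money is spent.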

  22. Summary
     • No longer BACs vs WGS
     • Different ways of using BACs
       • Linear pooling
       • 2D pooling
       • BACs for map, WGS for sequence
     • WGS works on easy parts of the genome
     • Simulation is valuable in evaluating strategies
