This pilot project focuses on sequencing and assembling a 3 Megabase (Mbp) segment of the cacao genome using a hybrid approach that integrates traditional genomics techniques with next-generation sequencing (NGS). By utilizing BAC libraries, physical mapping, and Roche 454 sequencing, the project aims to reduce assembly complexity and prioritize critical genomic regions. The strategy allows for flexible pooling of BACs and parallel sequencing, which minimizes investment costs while maximizing coverage and efficiency. The outcome will contribute to a pseudomolecule sequence aiding future cacao genomic studies.
CUGI Pilot Sequencing/Assembly Projects Christopher Saski
Sequencing the Cacao Genome: 3 Megabases at a Time • Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome • IBM in silico assembly project – testing the assembly pipeline
Sequencing the Cacao Genome: 3 Megabases at a Time • Combination of: • “Old School Genomics” • BAC libraries, physical mapping, and clone-by-clone sequencing • Roche 454 Titanium and FLX de novo sequencing • Key: • No eukaryotic genome has yet been accurately assembled with NGS alone • Reduce assembly complexity
3 Megabase segments Rounsley et al., 2009
Advantages • Reduce assembly complexity • Limit number of sequencing libraries • Prioritize critical genomic regions • Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer • Flexibility – Start slow with minimal investment • Could redesign strategy to reduce sequence runs
Strategy Components • Integrated Physical/Genetic framework • Pool development and sequencing: • BAC-end • Titanium 454 (paired/non-paired) • Draft sequence • Assembly and integration: • Newbler • Celera (CABOG)
Cacao Integrated Physical/Genetic Framework • Represents ~29X coverage (3 BAC libraries) • Assembled into a small number of large contigs • Suggests reasonable levels of heterozygosity • Manageable amounts of repetitive sequence • 220 anchored genetic markers spanning 10 linkage groups • Marker order resembles the recombination-derived order
Pool Development • Select contiguous BAC clones from the minimum tiling path (MTP) • Pools will contain 25-30 clones • 20-30 kb overlap • Complete cacao MTP will require 120-150 pools • Repetitive-type regions: • BAC-end sequence and physical map data as a predictive tool • Modify pools accordingly
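The pooling step can be sketched as splitting an ordered MTP clone list into consecutive groups. This is only an illustration: the function name `make_pools`, the clone names, and the fixed pool size of 30 are assumptions, and the sketch ignores the repeat-aware pool adjustments described on the slide.

```python
def make_pools(mtp_clones, max_pool_size=30):
    """Split an ordered list of MTP clone names into consecutive pools.

    Clones stay in tiling-path order, so each pool covers one
    contiguous stretch of the physical map.
    """
    if max_pool_size < 1:
        raise ValueError("max_pool_size must be at least 1")
    return [
        mtp_clones[i:i + max_pool_size]
        for i in range(0, len(mtp_clones), max_pool_size)
    ]


# Hypothetical example: 100 placeholder clone names in tiling-path order.
clones = [f"clone_{i:04d}" for i in range(1, 101)]
pools = make_pools(clones, max_pool_size=30)
print([len(pool) for pool in pools])
```

A real pooling scheme would probably rebalance the short final pool and split pools at contig boundaries; neither is attempted here.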
Pool Development • Estimate contig size using the Consensus Band (CB) algorithm • Example: Cacao cp genome is 160,604 bp • Hybridization revealed a cp-containing contig, estimated at ~160 kb by the CB algorithm • Purified pool DNA can be produced at CUGI • Treat with ATP-dependent DNase
Sequencing • 3 Levels of Sequence: • Paired BAC-end Sequence – 20 kb increments • End sequencing of pool members • 454 sequencing of BAC pools • Paired 3.5X-5.1X coverage (Roche 454/FLX) • Non-paired 17X-26X coverage (Titanium)
454 Runs—Whole Genome • 454 Titanium non-paired – 26X coverage/pool • 4 pools per slide (up to 150 pools total) • Up to 38 slide runs • 454 FLX paired-end (3kb) – 5X coverage/pool • 16 pools per slide (up to 150 pools total) • Up to 10 slide runs total
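The slide-run counts above follow from dividing the number of pools by the pools per slide and rounding up. A minimal check, where the helper name `slide_runs` is an assumption:

```python
import math


def slide_runs(total_pools, pools_per_slide):
    """Number of slide runs needed; a partly filled slide still counts as one run."""
    return math.ceil(total_pools / pools_per_slide)


# Upper bound of 150 pools, using the pools-per-slide figures from the slide.
titanium_runs = slide_runs(150, 4)   # non-paired Titanium, 4 pools per slide
flx_runs = slide_runs(150, 16)       # paired-end FLX, 16 pools per slide
print(titanium_runs, flx_runs)
```

With 150 pools this gives 38 Titanium runs and 10 FLX runs, matching the "up to" figures quoted.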
Assembly/Curation of 3Mbp Segment • Preprocessing • Filter reads to remove: • Paired-end reads that do not contain both ends • BAC vector • E. coli (host DNA) • Newbler Assembler (Roche) • Celera Assembler (CABOG) • Improvements in homopolymer calls and in handling heterogeneous read lengths • Recently shown to give N50 contig sizes double those of Newbler • Human (50% repetitive) and microbes
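N50 is the metric used above to compare the assemblers. For reference, it can be computed from a list of contig lengths as follows; the function name and the toy lengths are illustrative and are not taken from either assembler's output.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example: 300 bases in total, so the half-way point is 150.
print(n50([100, 80, 60, 40, 20]))
```

In the toy example the two largest contigs (100 + 80) are the first to pass 150 bases, so the N50 is 80.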
Assembly Curation of 3Mbp Segment • Assembly at various depths (5X, 10X, 15X) • Determine optimal sequencing coverage • Utilize available data to scaffold contigs: • BAC end sequences every 20kb • Genetic marker sequences • RNA-seq clusters • Arabidopsis – Cacao synteny • Draft Sequence (2X) • Augment approach by covering regions missed by clones – assist in selecting MTP
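Assembling at several depths implies drawing read subsets that reach a target coverage. A rough sketch of that subsampling, assuming reads are described only by their lengths; the function name, the fixed seed, and the toy read set are all hypothetical.

```python
import random


def subsample_to_depth(read_lengths, region_size, target_depth, seed=0):
    """Pick a random subset of reads whose bases total about
    target_depth x region_size. Returns indices into read_lengths."""
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    target_bases = target_depth * region_size
    chosen, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        chosen.append(idx)
        total += read_lengths[idx]
    return chosen


# Toy data: 1,000 reads of 400 bp over a 10 kb region (placeholder values).
reads = [400] * 1000
for depth in (5, 10, 15):
    subset = subsample_to_depth(reads, region_size=10_000, target_depth=depth)
    print(depth, len(subset))
```

Each subset would then be assembled separately and the resulting contig statistics compared to choose a coverage level.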
Assembly Curation of 3Mbp Segment • Deliverable will be a pseudomolecule sequence for the 3Mbp region • Gaps will be strings of N • Assess and employ lab-based gap filling strategies • Make every attempt to close gaps
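One way to picture the deliverable is as ordered contigs joined by runs of N. The sketch below assumes the contigs are already ordered and oriented. The gap length is a placeholder, since the slides do not specify one, and both function names are hypothetical.

```python
import re


def build_pseudomolecule(contigs, gap_length=100):
    """Join ordered, oriented contigs with fixed-length runs of N."""
    return ("N" * gap_length).join(contigs)


def gap_intervals(sequence):
    """Return (start, end) coordinates of every run of N, 0-based, end-exclusive."""
    return [m.span() for m in re.finditer(r"N+", sequence)]


pseudo = build_pseudomolecule(["ACGT", "GG", "TTA"], gap_length=3)
print(pseudo)
print(gap_intervals(pseudo))
```

The gap coordinates give a worklist for the lab-based gap-filling step.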
Assembly Validation and Correction • In silico virtual digest of scaffold sequence, compared against physical map restriction fragments • Draft sequence integration (DSI) via FPC • Integrate and visualize physical map, 3 Mbp segments, and draft sequence
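A virtual digest can be sketched as finding every recognition site in the scaffold and reporting the fragment sizes between cut positions. The slides do not name the enzyme used for the cacao physical map, so HindIII (recognition site AAGCTT, cutting after the first A) stands in as an assumption. The sequence is treated as linear, and N-filled gaps are not handled specially.

```python
def virtual_digest(sequence, site="AAGCTT", cut_offset=1):
    """Return fragment lengths from cutting a linear sequence at each site.

    cut_offset is the cut position within the site (1 = after the first base).
    The default site is palindromic, so searching one strand is sufficient.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    boundaries = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Toy scaffold with two sites.
scaffold = "GGGG" + "AAGCTT" + "CCCCC" + "AAGCTT" + "TT"
print(virtual_digest(scaffold))
```

The predicted sizes can then be compared, within a size tolerance, to the band sizes recorded for the clones spanning the same region.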
IBM in silico Sequences • IBM will provide a set of sequences that mimic the pilot cacao sequences • Input error • Indels, homopolymer calls, nucleotide substitutions • Simulated data to test pipeline: • Physical map • Simulated BAC end sequences • Simulated pseudo-reads from pooled BACs • EST clusters • Indicate reference species for syntenic comparisons
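As an illustration of introducing errors into pseudo-reads, the sketch below applies random substitutions, insertions, and deletions to a read. It is not IBM's simulator. The error rates are arbitrary placeholders, and homopolymer-length errors are not modelled separately.

```python
import random


def add_errors(read, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, rng=None):
    """Return a copy of read with random substitutions, insertions and deletions."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in read:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: replace with a different nucleotide
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice("ACGT"))  # insertion after this base
    return "".join(out)


template = "ACGTACGTACGTACGTACGT"
noisy = add_errors(template, rng=random.Random(42))
print(template)
print(noisy)
```

Passing a seeded `random.Random` makes a simulated data set reproducible across pipeline test runs.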
Pilot Project Budget • BAC-end sequencing (30K BACs), 20Kb increments • $206,605.00 • Assembly/curation/validation of cacao 3Mbp • $16,720.00 • Assembly of IBM in-silico derived sequences • $15,400.00
ESTIMATED Budget – Whole Genome Assembly • Assembly, curation, validation of 130-150, 3Mbp segments • $147,620.00 • Automated structural/functional annotation • $8,800.00
Acknowledgements • USDA-ARS • Mars Inc. • Dr. Alex Feltus • Stephen Ficklin • Dr. Keith Murphy • Dr. Margaret Staton