Novel multi-platform next generation assembly methods for mammalian genomes.
The Baylor College of Medicine, the Australian Government and the University of Connecticut teamed up (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for a marsupial model species, the tammar wallaby (M. eugenii). The tammar wallaby is an important model organism: it is a Metatherian mammal and harbors unique life history traits and genome features. Genome sequencing was done at several institutions, using several technologies, over several years. We developed a pipeline to integrate long and short read sequences and existing assemblies using well-known mapping and assembly tools such as Bowtie and Phrap.
Overview of the Data
“Paired” Read Overview (diagram: two reads separated by an insert)
• All reads used in the reassembly were “paired”.
• Read orientation was not considered.
• The SOLiD reads had an insert size of 1.395 kb.
• The Illumina reads had insert sizes of 3 and 8 kb in roughly equal proportions.
• The 454 reads had insert sizes of 8, 12, 20, and 30 kb, with the majority split between 20 and 30 kb.
Local Assembly Pipeline
Pipeline stages: Initial Assembly, Map Short Reads, Scaffold Assembly, Final Mapping, Gap Re-estimation, Quality Calling, Finished Assembly.
Initial Assembly
• Sanger reads were assembled using Atlas (BCM).
• Initial scaffolding was done using SOLiD mate pairs.
Map Short Reads
• Short reads are mapped to the scaffolds using Bowtie. There are two cases for a pair:
• Orphaned: one mate is unmapped.
• Complete: both mates are mapped.
• Reads that map to multiple locations (non-unique) are not considered.
Figure key: 454, Illumina, Sanger, unmapped read; dotted line: estimated distance.
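The two mapping cases can be sketched as a small classifier. This is a minimal illustration and not the consortium's code: the function names and the hit-count input are hypothetical, and it assumes that a non-unique mate is simply treated as not placed, so a pair with one unique mate and one non-unique mate counts as orphaned.

```python
from collections import Counter

def classify_pair(hits_mate1, hits_mate2):
    """Classify one read pair from the number of alignments of each mate.

    hits_mate1 / hits_mate2: how many locations each mate aligned to
    (0 = unmapped, 1 = unique, >1 = non-unique).
    """
    # Assumption: non-unique reads are not considered, so they count as not placed.
    placed1 = hits_mate1 == 1
    placed2 = hits_mate2 == 1
    if placed1 and placed2:
        return "complete"    # both mates uniquely mapped
    if placed1 or placed2:
        return "orphaned"    # exactly one mate uniquely mapped
    return "discarded"       # neither mate can be placed

def summarize(pairs):
    """Count how many pairs fall into each class; pairs is a list of (hits1, hits2)."""
    return Counter(classify_pair(h1, h2) for h1, h2 in pairs)
```

In practice the hit counts would come from the aligner's output; how they are extracted depends on the Bowtie options used and is left out here.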
Pipeline continued
Scaffold Assembly
• Feed contigs, complete pairs and orphaned pairs to Phrap and re-assemble.
Final Mapping
• Map all data to the final contigs.
Gap Re-estimation
• Re-estimate all gap distances in scaffolds using complete pairs mapped on different contigs.
Quality Calling
• Map all reads and calculate a quality for each base.
Output: contigs, quality scores, scaffolds (AGP file).
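To make the scaffold output concrete, the sketch below writes one scaffold as AGP lines. It is an illustration under assumptions, not the pipeline's writer: it assumes the nine-column, tab-separated AGP 2.0 layout, the function name `scaffold_to_agp` is hypothetical, and the gap type ("scaffold"), linkage evidence ("paired-ends") and the clamping of non-positive gap estimates to 1 bp are choices made here.

```python
def scaffold_to_agp(scaffold_id, contigs, gaps):
    """Return AGP lines for one scaffold.

    contigs: list of (contig_id, length, orientation) in scaffold order.
    gaps:    estimated gap lengths between consecutive contigs.
    """
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap between each pair of contigs")
    lines = []
    pos = 1    # 1-based coordinate on the scaffold
    part = 1   # running part number within the scaffold
    for idx, (contig_id, length, orientation) in enumerate(contigs):
        end = pos + length - 1
        # Component line: type "W" (WGS contig), using the full contig 1..length.
        fields = [scaffold_id, pos, end, part, "W", contig_id, 1, length, orientation]
        lines.append("\t".join(map(str, fields)))
        pos, part = end + 1, part + 1
        if idx < len(contigs) - 1:
            # Assumption: a non-positive estimate is written as a 1 bp gap.
            gap = max(int(round(gaps[idx])), 1)
            gap_end = pos + gap - 1
            # Gap line: type "N" (gap of estimated size) linked by paired ends.
            fields = [scaffold_id, pos, gap_end, part, "N", gap,
                      "scaffold", "yes", "paired-ends"]
            lines.append("\t".join(map(str, fields)))
            pos, part = gap_end + 1, part + 1
    return lines
```

For example, `scaffold_to_agp("s1", [("c1", 100, "+"), ("c2", 50, "-")], [20])` yields two component lines separated by one 20 bp gap line.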
Local Assembly Close-up
Figure panels: A, B, C.
• Screenshot taken from CodonCode Aligner, which uses Phrap to map reads.
• Red and blue denote read orientation.
• A: Two contigs may be fused using short reads.
• B: A contig is extended with short reads at one end.
• C: Single-nucleotide and other small errors are corrected using the higher coverage of the short reads.
• We are confident in these changes because, even at <10x coverage, each read is either uniquely mapped itself or has its approximate position supported by a uniquely mapped read.
Gap Estimation Methods
• The gap between two contigs is estimated using an expectation-maximization algorithm. The two steps are repeated until the estimated parameters no longer change.
• Step 1 (Maximization): compute the gap estimate x, letting the mean insert length of the N pairs equal μ (the initial value of μ is the library average).
• Step 2 (Sampling): given x and the lengths of the contigs, sample μ from completely mapped reads spanning gaps.
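The two alternating steps can be sketched as a fixed-point iteration. This is one possible reading of the slide, not the authors' implementation, and every name in it is hypothetical. It assumes that each complete pair across the gap contributes a distance a on the left contig (read start to contig end) and b on the right contig (contig start to read end), so its insert is a + b + x. It also assumes that "sampling μ" means averaging an empirical set of library insert sizes restricted to those that could span a gap of size x given the two contig lengths.

```python
from statistics import mean

def estimate_gap(spans, library_inserts, len_a, len_b, tol=0.5, max_iter=100):
    """Estimate the gap x between contig A and contig B.

    spans:           list of (a, b) distances for pairs mapped across the gap.
    library_inserts: empirical insert sizes, e.g. from pairs mapped in one contig.
    len_a, len_b:    lengths of the two contigs flanking the gap.
    Returns (x, mu): the gap estimate and the final mean insert length.
    """
    observed = mean(a + b for a, b in spans)  # mean mapped part of the inserts
    mu = mean(library_inserts)                # initial value: library average
    x = mu - observed                         # step 1 with the initial mu
    for _ in range(max_iter):
        # Step 2: only inserts longer than the gap and short enough to fit on
        # both contigs plus the gap could have produced a spanning pair.
        eligible = [i for i in library_inserts if x < i <= x + len_a + len_b]
        if not eligible:
            break                             # nothing to resample from; keep x
        mu = mean(eligible)
        new_x = mu - observed                 # step 1: updated gap estimate
        converged = abs(new_x - x) < tol
        x = new_x
        if converged:
            break
    return x, mu
```

The iteration cap guards against the estimate oscillating between two values, which can happen with a small empirical sample.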
Gap Estimation Results
• When using libraries with different insert sizes and standard deviations, it is necessary to bundle the estimates.
• Example: bundling two libraries, where n_x = number of reads in library x, e_x = gap estimate from library x, and s_x = standard deviation of library x.
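One common way to combine estimates from libraries with different spread is inverse-variance weighting, where each library's estimate is weighted by n / s². The sketch below uses that scheme as an assumption; it is not necessarily the formula the authors used, and the function name is hypothetical.

```python
import math

def bundle_estimates(libraries):
    """Combine per-library gap estimates by inverse-variance weighting.

    libraries: iterable of (n, e, s) tuples, where n is the number of reads,
               e the gap estimate and s the library standard deviation.
    Returns (combined_estimate, standard_error).
    """
    # The standard error of a library mean is s / sqrt(n), so its
    # inverse variance (the weight) is n / s**2.
    weighted = [(n / (s * s), e) for n, e, s in libraries]
    total = sum(w for w, _ in weighted)
    combined = sum(w * e for w, e in weighted) / total
    return combined, math.sqrt(1.0 / total)
```

With this weighting, a library with many reads and a tight insert distribution dominates the bundled estimate, while a sparse or widely spread library contributes little.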
Conclusion and Future Work
• This method is a viable way of improving existing draft genomes with short read technologies at limited (<10x depth) coverage.
• This method is robust and easily parallelized, so it is practical for large mammalian genomes.
• Better results may be obtained through multiple iterations.
• Re-scaffolding of the contigs should be done between iterations.
• A contig-aware assembly algorithm could improve local assembly performance.