Large Plant Genome Assemblies using Phusion2



Presentation Transcript


  1. Large Plant Genome Assemblies using Phusion2. Zemin Ning, The Wellcome Trust Sanger Institute

  2. NGS Data Assembly: Phusion2 Assembly Pipeline
      Input data: Pair End Reads (170-800bp); Mate Pair Reads (2k-40k)
      Stages and tools shown: Filtering (Unikalow); Fermi; Clustering (Phusion2); Contig Generation; Contig Merge; ABySS; SOAPdenovo; Scaffolding (Spinner); Consensus Bases (Smalt & Gap5)

  3. iCAS – an Illumina Clone Assembly System ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2

  4. Data filtering using Unikalow Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/

  5. Assembly Method. Sequencing reads: 1. Overlap graph; 2. de Bruijn graph; 3. String graph
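As a rough illustration of the second option on this slide, a de Bruijn graph can be built by splitting reads into k-mers and linking each k-mer's prefix to its suffix. This is a toy sketch only, not how Phusion2, ABySS or SOAPdenovo are implemented; the function name `de_bruijn_graph` and the example reads are invented for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not from the talk).

    Nodes are (k-1)-mers. Every k-mer found in a read adds one edge from
    its (k-1)-base prefix to its (k-1)-base suffix, so repeated k-mers
    show up as repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two made-up overlapping reads.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Keeping duplicate edges makes k-mer multiplicity visible; a real assembler would typically store counts and handle reverse complements and sequencing errors, which this sketch ignores.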

  6. Scaffold Merge: ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/ (figure labels: Ref, Base, Sup). Contig Merge (figure labels: Ref, Base, Ctg)

  7. Contig Consensus using Gap5

  8. Can we really trust Single Molecule Sequencing? PacBio, Capillary, Illumina

  9. Clone Assemblies vs Assemblers. 5 BAC clones and 3 fosmids. Clone coverage: 99.7%; Base quality: Q39

  10. Spinner – a scaffolding tool. ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/ Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
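The gap estimate mentioned on this slide can be sketched as follows. This is not Spinner's actual algorithm, which the slides do not describe in detail. It is a minimal illustration that assumes each linking mate pair has its forward read on the left contig and its reverse read on the right contig, and that the expected insert size equals the bases spanned on both contigs plus the gap. The function name, the tuple layout and all coordinates are hypothetical.

```python
from collections import defaultdict
from statistics import median


def estimate_gaps(links, contig_len, insert_size):
    """Estimate gap sizes between linked contigs (illustrative assumption).

    links: iterable of (contig_a, start_a, contig_b, end_b), where
      start_a is the 0-based start of the forward read on contig_a and
      end_b is the 0-based exclusive end of the reverse read on contig_b.
    Returns {(contig_a, contig_b): (median_gap, number_of_links)}.
    """
    per_edge = defaultdict(list)
    for a, start_a, b, end_b in links:
        span_on_a = contig_len[a] - start_a  # bases from read start to end of a
        per_edge[(a, b)].append(insert_size - span_on_a - end_b)
    return {edge: (median(gaps), len(gaps)) for edge, gaps in per_edge.items()}


# Made-up contigs and mate pair placements for a 3 kb library.
contig_len = {"c1": 10000, "c2": 8000}
links = [
    ("c1", 9000, "c2", 1500),
    ("c1", 8800, "c2", 1400),
    ("c1", 9500, "c2", 2100),
]
gaps = estimate_gaps(links, contig_len, insert_size=3000)
print(gaps)
```

Taking the median over all pairs that link the same two contigs damps the effect of individual pairs with unusual insert sizes; the link count can serve as edge support in the contig graph.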

  11. Spinner – walks through a loop. These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats and graph flow concepts.

  12. Spinner vs SSPACE

      Assembly         Genome size   SSPACE N50   SSPACE average   Spinner N50   Spinner average
      Assemblathon 1   119 Mb        608 Kb       86.8 Kb          11 Mb         450 Kb
      Grass Carp (F)   900 Mb        2.3 Mb       14.4             5.85 Mb       17.1 Kb
      Grass Carp (M)   1000 Mb       0.34 Mb      11.2 Kb          2.27 Mb       8.2 Kb
      Bamboo           2.0 Gb        322 Kb       7404             488 Kb        7689
      Parrot           1.23 Gb       906 Kb       4675             1.32 Mb       6969
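For readers unfamiliar with the two statistics compared on this slide, the sketch below computes N50 and the average length from a list of scaffold lengths. It uses the common definition of N50 (the length at which the longest sequences, taken in decreasing order, first cover at least half of the total bases). The slides do not say which exact definition was used, and the example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (common definition)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def average_length(lengths):
    """Return the mean sequence length, or 0.0 for an empty list."""
    return sum(lengths) / len(lengths) if lengths else 0.0


# Invented scaffold lengths in bases.
scaffold_lengths = [100, 80, 60, 40, 20]
print(n50(scaffold_lengths), average_length(scaffold_lengths))
```

N50 rewards a few long scaffolds while the average is pulled down by many short ones, which is why the slide reports both.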

  13. Grass Phylogeny

  14. Bamboo Genome: Size Estimation
      Gs = (Kn - Ks)/D = 1.97 x 10^9
      Kn = 80.5 x 10^9: total number of kmer words
      Ks = 9.5 x 10^9: number of single copy kmer words
      D = 36: depth of kmer occurrence
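The arithmetic on this slide can be checked with a few lines of code. The function below simply evaluates Gs = (Kn - Ks)/D with the values quoted in the talk; the function and variable names are illustrative, and nothing here reproduces how Kn, Ks or D were actually counted.

```python
def genome_size(kn, ks, depth):
    """Evaluate Gs = (Kn - Ks) / D.

    kn:    total number of kmer words
    ks:    number of single copy kmer words
    depth: depth of kmer occurrence
    """
    return (kn - ks) / depth


# Values quoted for bamboo (slide 14) and the wheat D genome (slide 21).
bamboo = genome_size(80.5e9, 9.5e9, 36)
wheat_d = genome_size(59.8e9, 4.3e9, 13)
print(f"bamboo: {bamboo:.3g} bases, wheat D: {wheat_d:.3g} bases")
```

With the quoted inputs this gives about 1.97 x 10^9 for bamboo, matching the slide, and a value between 4.2 x 10^9 and 4.3 x 10^9 for the wheat D genome.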

  15. Bamboo Genome Assembly
      Solexa reads:
      Number of read pairs: 877 Million; Finished genome size: 2.0 Gb;
      Read length: 2x100bp; Estimated read coverage: ~90X;
      Insert size: 500/50-600 bp; Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
      Number of reads clustered: 757 Million

      Assembly features (stats):
                                    Contigs     Scaffolds
      Total number of contigs:      744,286     277,278
      Total bases of contigs:       1.86 Gb     2.05 Gb
      N50 contig size:              11,622      328,698
      Largest contig:               188,163     4,869,017
      Average contig size:          2,500       7,400
      Contig coverage on genome:    ~90%        >95%
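The "~90X" read coverage on this slide is consistent with a simple calculation: total sequenced bases divided by genome size. The sketch below assumes that every pair contributes two full-length 100 bp reads and that the genome size is 2.0 x 10^9 bases; the function name is illustrative.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Estimate coverage as total sequenced bases divided by genome size.

    Assumes paired reads of equal length with no trimming or filtering.
    """
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


# Bamboo figures quoted on the slide: 877 million pairs, 2x100bp, 2.0 Gb genome.
bamboo_coverage = read_coverage(877e6, 100, 2.0e9)
print(f"{bamboo_coverage:.1f}X")
```

This gives roughly 88X, in line with the rounded figure on the slide.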

  16. Bamboo Genome Assembly QC using Finished BACs

  17. Evolution of the Wheat Genome

  18. Size of the Wheat Genome: 17Gb

  19. International Wheat Genome Sequencing Consortium

  20. Sequencing of D Genome Libraries & Insert Sizes

      Library            Insert size
      WHEjyyDADDBAAPE    167
      WHEjjzDADDCBAPE    199
      WHEjjzDADDCCAPE    223
      WHEjjzDADDCABPE    230
      WHEjyyDAEDDAAPE    250
      WHEjyyDAEDDABPE    250
      WHEjyyDAEDDBAPE    250
      WHEjyyDAEDDBBPE    250
      WHEjyyDAEDDCAPE    250
      WHEjyyDAEDDCBPE    250
      WHEjyyDAEDDDAPE    250
      WHEjjzDADDCACPE    254
      WHEjyyDAEDIAAPE    500
      WHEjyyDAEDIBAPE    500
      WHEjyyDADDIAAPE    502
      WHEjyyDADDIDAPE    510
      WHEjyyDADDICAPE    527
      WHEjyyDADDIBAPE    532
      WHEjyyDADDIBBPE    551
      WHEjyyDADDKAAPE    682
      WHEjyyDADDMBAPE    706
      WHEjyyDADDKCAPE    725
      WHEjyyDADDMAAPE    764
      WHEjyyDAADWAAPE    2000
      WHEjyyDAADWBAPE    2000
      WHEjyyDAADWCAPE    2000
      WHEjyyDAADWDAPE    2000
      WHEjyyDACDWAAPE    2002
      WHEjyyDAEDWAAPE    2008
      WHEjyyDACDWBBPE    2500
      WHEjyyDAADLAAPE    5000
      WHEjyyDAADLBAPE    5000
      WHEjyyDAADLBBPE    5000
      WHEjyyDAEDLAAPE    5004
      WHEjjzDADLBBPE     8300
      WHEjyyDAADTAAPE    10000
      WHEjyyDABDTAAPE    10000
      WHEjyyDADDTAAPE    10000
      WHEjyyDADDTBBPE    10000
      WHEjyyDAIDUAAPE    20000

  21. D Genome: Size Estimation
      Gs = (Kn - Ks)/D = 4.2 x 10^9
      Kn = 59.8 x 10^9: total number of kmer words
      Ks = 4.3 x 10^9: number of single copy kmer words
      D = 13: depth of kmer occurrence

  22. Wheat D Genome Assembly
      Solexa reads:
      Number of read pairs: 805 Million; Estimated genome size: 4.2 Gb;
      Read length: 45-95bp; Estimated read coverage: ~40X;
      Insert size: 167-800 bp; Mate pair data: 2k - 20k
      Number of reads clustered: 558 Million

      Assembly features (stats):
                                    Contigs
      Total number of contigs:      3,228,623
      Total bases of contigs:       3.34 Gb
      N50 contig size:              3,084
      Largest contig:               86,064
      Average contig size:          1,035
      Contig coverage on genome:    ~80%

  23. Grass carp (F & M): 55,277 / 130,221; 0.88 Gb / 0.97 Gb; 40,353 / 18,252; 5.89 Mb / 2.27 Mb. Miscanthus. Wild rice.

  24. Acknowledgements:
      • Joe Henson
      • German Tischler
      • Andrew Whitwham
      • Chinese Academy of Agricultural Sciences
      • Jizeng Jia
      • Guangyue Zhao
      • National Gene Research Centre, Chinese Academy of Sciences
      • Han Bin
      • Hengyun Lu
