
NGS sequencing and Genome Assemblies from Animals and Large Plants

Zemin Ning, The Wellcome Trust Sanger Institute. Outline of the talk: NGS sequencing technologies; Oxford Nanopore; assembly algorithms and assemblers; Phusion2 pipeline; Tasmanian devil genome project.


Presentation Transcript


  1. NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute

  2. Outline of the Talk: • NGS sequencing technologies • Oxford Nanopore • Assembly algorithms and Assemblers • Phusion2 pipeline • Tasmanian Devil genome project • Assemblies of Large plant genomes • Future work

  3. Next-Generation Sequencing

  4. NGS Platforms & Performances

  5. Oxford Nanopore: End of Short Read Sequencing? Read length: up to 100 kb; human genome at 50x in 15 minutes; $10 per Gb

  6. Can we really trust Single Molecule Sequencing? PacBio Capillary Illumina

  7. Kmer Size and Assemblability

  8. Assembly Method Sequencing reads: 1. Overlap graph 2. de Bruijn graph 3. String graph
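As an illustration of the second method listed on slide 8, the following is a minimal sketch of building a de Bruijn graph from reads. The function name `build_de_bruijn`, the toy reads and the choice k = 4 are assumptions made for this example; they are not taken from the talk, and real assemblers add error correction, reverse-complement handling and graph simplification on top of this.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (hypothetical), overlapping by five bases.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Nodes are (k-1)-mers and each k-mer contributes one edge, so overlapping reads share edges without any pairwise read comparison, which is the main contrast with the overlap-graph approach.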

  9. Various Assembly Pipelines

  10. Phusion2 Assembly Pipeline (flow diagram; box labels: Illumina Reads, 2x75 or 2x100 bp; Data Process; Reads Group; Assembly; Contigs; Base Correction; Consensus Generation)

  11. Phusion2 Assembly Pipeline (flow diagram; box labels: Illumina Reads, 2x75 or 2x100 bp; Assembly; Contigs; Supercontig; AGPcontig; Flow-sorting Reads; Map Markers; Mate Pair Reads; BAC Ends)

  12. Spinner – a scaffolding tool. Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
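The gap estimate mentioned on slide 12 can be sketched as simple arithmetic. This is not Spinner's actual code: the function name `estimate_gap`, the coordinate convention and the use of a median are all assumptions for illustration. It assumes each mate pair spans from contig A into contig B, so that the bases covered on A, plus the gap, plus the bases covered on B add up to the expected insert size.

```python
from statistics import median


def estimate_gap(len_a, links, insert_size):
    """Estimate the gap between contig A and the contig B that follows it.

    links: list of (pos_a, pos_b) per mate pair, where pos_a is the 0-based
    start of the fragment on A and pos_b is the number of bases of B that
    the fragment covers (both conventions are assumptions of this sketch).
    """
    estimates = [insert_size - (len_a - pos_a) - pos_b for pos_a, pos_b in links]
    # The median is less sensitive than the mean to a few chimeric pairs.
    return median(estimates)


# Hypothetical example: a 10 kb contig and three pairs from a 2 kb library.
gap = estimate_gap(10_000, [(9200, 700), (9500, 1000), (9000, 480)], 2000)
print(gap)
```

A negative estimate under this convention would suggest that the two contigs overlap rather than being separated by a gap.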

  13. Spinner – still to do. These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.

  14. Tasmanian tiger Tasmanian devil Australian Tasmanian

  15. Tasmanian devil Tasmanian devil Wallaby Opossum

  16. Tasmanian devil facial tumour disease (DFTD) • Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils • Transmitted by biting • Commonly metastasises • First observed in 1996 • Primarily affects adults >1yr • Death in 4 – 6 months

  17. DFTD samples for sequencing (map; labels: Area still DFTD free; DFTD originated here c. 1996; Narawntapu 2007; Mt William 2007 or 2008; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007; Strain 1, tetraploid; Strain 2; Strain 3; “Evolved”; Unknown strain)

  18. Devil Genomes Sequenced (map; labels: Tumour 2 (53T); Tumour 1 (87T); Narawntapu 2007; Mt William; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007). Salem: a female Tasmanian devil that lived at Taronga Zoo in Sydney.

  19. Sequencing T. Devil on Illumina: Strategy. Tumour or normal genomic DNA; fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb; sequencing: 2x100 bp reads (short insert), 2x50 bp mate pairs. Sequencing performed at Illumina.

  20. Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes (figure: opossum chromosomes painted with devil chromosome probes). Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:361-370.

  21. Genome size: flow cytometry analysis of a chromosomal mixture of devil and opossum (figure: flow-sorting plot with peaks labelled by chromosome for Tasmanian devil and opossum)

  22. Table 1: Run ID, template names, number of reads and chromosome size

      Run ID   Template  Lane          Number of reads  Chromosome size
      4972_1   chr1      IL20_4972:1   19.8             571
      4967_1   chr2      IL21_4967:1   20.0             610
      4971_1   chr3      IL30_4971:1   21.7             556
      4964_1   chr4      IL14_4964:1   7.26             450
      4969_1   chr5      IL17_4969:1   7.06             341
      4969_2   chr6      IL17_4969:2   8.59             277
      4969_3   chrx      IL17_4969:3   9.43             122

      Read mapping coefficient: e = Size_of_Chr / Num_reads_in_lane
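The read mapping coefficient defined on slide 22 can be evaluated directly from the Table 1 values. The sketch below only applies the stated formula e = Size_of_Chr / Num_reads_in_lane; the variable and function names are assumptions, and the units of both columns are taken as given on the slide, which does not state them.

```python
# (chromosome, number of reads, chromosome size), copied from Table 1.
# Units are as on the slide, which does not state them.
table1 = [
    ("chr1", 19.8, 571),
    ("chr2", 20.0, 610),
    ("chr3", 21.7, 556),
    ("chr4", 7.26, 450),
    ("chr5", 7.06, 341),
    ("chr6", 8.59, 277),
    ("chrx", 9.43, 122),
]


def read_mapping_coefficient(chr_size, num_reads):
    """e = Size_of_Chr / Num_reads_in_lane, as defined on the slide."""
    return chr_size / num_reads


coeffs = {name: read_mapping_coefficient(size, reads) for name, reads, size in table1}
for name, e in coeffs.items():
    print(f"{name}: e = {e:.2f}")
```

The coefficient is larger for lanes that yielded few reads relative to chromosome size, so it differs markedly between, for example, chr4 and chrx.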

  23. Perfect - Reads from the same library were mapped to the contig

  24. Acceptable - Majority of the reads were from the same library, but there were reads from other libraries

  25. Bad – mis-assembly error Majority of the reads in one region were from one library. But there is a transition from which we see a new library, i.e. switch to another chromosome.
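The three categories on slides 23 to 25 can be sketched as a simple classifier over the flow-sorted library labels of the reads mapped to one contig. This is an assumed illustration, not the procedure used in the talk: the function name `classify_contig`, the fixed window of reads and the majority vote per window are all hypothetical choices, and a short final window can be dominated by a single stray read.

```python
from collections import Counter


def classify_contig(read_libs, window=4):
    """Classify a contig from the library label of each mapped read.

    read_libs: library labels (e.g. "chr1") ordered by mapping position.
    window: number of consecutive reads per voting window (an assumption).
    """
    if not read_libs:
        return "unassigned"  # no flow-sorted reads mapped to this contig
    if len(set(read_libs)) == 1:
        return "perfect"  # every read comes from the same library
    # Majority library in each consecutive window of reads.
    majorities = {
        Counter(read_libs[i:i + window]).most_common(1)[0][0]
        for i in range(0, len(read_libs), window)
    }
    # One majority throughout: stray reads only. Several: a library switch.
    return "acceptable" if len(majorities) == 1 else "bad"


print(classify_contig(["chr1"] * 8))
print(classify_contig(["chr1", "chr1", "chr3", "chr1", "chr1", "chr1", "chr1", "chr2"]))
print(classify_contig(["chr1"] * 4 + ["chr5"] * 4))
```

A contig flagged as bad under this scheme would be a candidate for breaking at the position where the majority library changes.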

  26. Unassigned contigs were placed by supercontigs using mate pairs

  27. Scaffolds Assigned to Chromosomes using Flow-sorting Data

      Chr_ID      Chr_size  Scaffolds_assigned  Bases_assigned (Mb)
      Chr1        571       6729                684
      Chr2        610       8381                740
      Chr3        556       7197                641
      Chr4        450       4817                487
      Chr5        341       3188                300
      Chr6        277       2844                263
      Chrx        122       2378                86.6
      Unassigned            440                 1.23

  28. Genome Assembly: Normal T. Devil

      Solexa reads:
      Number of read pairs:       1130 million
      Finished genome size:       3.1 Gb
      Read length:                2x100 bp
      Estimated read coverage:    ~80X
      Insert size:                410/50-600 bp
      Mate pair data:             2k, 4k, 5k, 6k, 8k, 10k
      Number of reads clustered:  1010 million

      Assembly features (stats)   Contigs    Supercontigs
      Total number of contigs:    178,711    26,954
      Total bases of contigs:     2.95 Gb    3.08 Gb
      N50 contig size:            28,921     2,244,460
      Largest contig:             214,456    6,014,864
      Averaged contig size:       16,511     114,451
      Contig coverage on genome:  ~94%       >99%
      Ratio of placed PE reads:   ~92%       ?
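For readers unfamiliar with the N50 statistic reported on slide 28, here is a minimal sketch of how it is usually computed from a list of contig lengths. The function name `n50` and the toy lengths are assumptions for illustration; the talk does not show how its figures were calculated.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy contig lengths (hypothetical), 290 bases in total.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because N50 weights contigs by length, a few long supercontigs can raise it far above the average size, which is consistent with the gap between the N50 and averaged sizes in the table.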

  29. Devil Tumour Genome Assemblies

      Solexa reads                Tumour_53T    Tumour_87T
      Number of read pairs:       760 million   669 million
      Finished genome size:       3.1 Gb        3.1 Gb
      Read length:                2x100         2x100
      Estimated read coverage:    ~75X          ~56X
      Insert size:                300 bp        300 bp
      Number of reads clustered:  710 million   603 million

      Assembly features (stats)   Tumour_53T    Tumour_87T
      Total number of contigs:    335,215       335,531
      Total bases of contigs:     3.05 Gb       2.98 Gb
      N50 contig size:            21,582        19,346
      Largest contig:             175,353       139,414
      Averaged contig size:       9,096         8,892
      Contig coverage on genome:  ~95%          ~95%
      Ratio of placed PE reads:   ~92%          ~92%

  30. Variant calling: catalogue of variants in all 4 genomes. *Data source: Illumina. Variants removed: those within 500 bp of a contig end, with Q(indel) < 30, and with Q(GT) < 5.

  31. Homozygous SNPs

  32. Homozygous SNPs

  33. Homozygous Base Corrections: 46,039 candidates; 40,689 bases changed

  34. Homozygous Indel Corrections: 51,654 candidates; 45,337 deletions changed

  35. DFTD1 (figure: chromosome diagram with segments labelled A to M, derivative chromosomes der1, der2, der5 and der6, and marker chromosomes M1 to M4)

  36. DFTD2 (figure: chromosome diagram with segments labelled B to M, derivative chromosomes der1, der5 and der6, and marker chromosomes M1 to M3)

  37. Grass carp; Bamboo. N_scaffolds: 358,998 / 61,232; N_bases: 2.08 Gb / 0.88 Gb; N50 contigs: 11,882 / 40,353; N50 scaffolds: 321,729 / 2.37 Mb. Miscanthus; Wild rice.

  38. Acknowledgements: • Elizabeth Murchison • Joe Henson • German Tischler • Fengtang Yang • Mike Stratton • Han Bin • Feng Qi • Zhao Qiang • Ole Schulz-Trieglaff • David Bentley

  39. BGI - FINISHED SPECIES

  40. Preliminary assembled species

  41. Sequencing of species

  42. Dipus Genome Project
