
Phusion2 and The Genome Assembly of Tasmanian Devil

Zemin Ning, The Wellcome Trust Sanger Institute. Outline of the talk: challenges in genome assemblies from pure Illumina reads; the Phusion2 pipeline; the Tasmanian devil genome project; the Devil genome assembly.


Presentation Transcript


  1. Phusion2 and The Genome Assembly of Tasmanian Devil. Zemin Ning, The Wellcome Trust Sanger Institute

  2. Outline of the Talk:
     • Challenges in genome assemblies from pure Illumina reads
     • The Phusion2 pipeline
     • The Tasmanian devil genome project
     • The Devil genome assembly
     • Other assemblies: human cancer, zebrafish, rice, etc.

  3. Challenges in Whole Genome Assembly using Pure Illumina Reads
     • Large genomes and huge datasets
       • For human: 100 Gb at 30x
     • Repetitive/duplication structures: Alus, LINEs, SVAs
       • 30-40% of the genome in human and mouse; 50-60% in rice and other plant genomes
     • Tandem repeats: how many copies are there?
       • TATATATATATATATATATATATATATA
       • GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
       • GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
       • AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
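The tandem-repeat question on slide 3 can be illustrated with a small sketch. It assumes two made-up alleles that differ only in the copy number of a "TA" repeat (the flanking sequences and copy numbers are hypothetical) and shows that, once the repeat is longer than the word size, both alleles produce exactly the same set of k-mers, so short words alone cannot tell the copy number.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical flanking sequences and copy numbers, chosen only for illustration.
LEFT_FLANK = "ACGTTGCA"
RIGHT_FLANK = "GGATCCAT"
short_allele = LEFT_FLANK + "TA" * 20 + RIGHT_FLANK   # 20 copies of the unit
long_allele = LEFT_FLANK + "TA" * 40 + RIGHT_FLANK    # 40 copies of the unit

K = 12
# No 12-mer can span the whole 40-base (or 80-base) repeat, so the word content
# of the two alleles is identical even though their lengths differ.
same_words = kmer_set(short_allele, K) == kmer_set(long_allele, K)
```

In this toy case only information that spans the repeat, such as read pairs with a known insert size or read depth, could separate the two alleles.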

  4. De Bruijn vs Read Overlap. Figure labels: Missing sequences; Missing from de Bruijn contigs

  5. Phusion2 Assembly Pipeline. Diagram labels: Data Process; Assembly; Solexa Reads; 2x75 or 2x100; Base Correction; Reads Group; Fuzzypath; Velvet; Phrap; Contigs; Long Insert Reads; PRono; Supercontig

  6. Repetitive Contig and Read Pairs. Figure labels: Depth (on three plots); Grouped Reads by Phusion

  7. Kmer Word Hashing
     Contiguous base hash, K = 12:
       ATGGCGTGCAGTCCATGTTCGGATCA
       ATGGCGTGCAGT
       TGGCGTGCAGTC
       GGCGTGCAGTCC
       GCGTGCAGTCCA
       CGTGCAGTCCAT
     Gap-hash, 4x3:
       ATGGCGTGCAGTCCATGTTCGGATCA
       ATGGGCAGATGT
       TGGCCAGTTGTT
       GGCGAGTCGTTC
       GCGTGTCCTTCG
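The two word types on slide 7 can be reproduced with a short sketch. The gap pattern used here (three blocks of 4 bases, each separated by 3 skipped bases) is inferred from the example words on the slide, not taken from the Phusion2 source, and the 2-bit base encoding is an assumed, commonly used convention.

```python
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCA"  # example sequence from the slide

def contiguous_words(seq, k=12):
    """All contiguous k-base words, one per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_words(seq, block=4, blocks=3, gap=3):
    """Words built from `blocks` runs of `block` bases, skipping `gap` bases
    between runs (pattern inferred from the slide's examples)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + b * step: start + b * step + block]
                 for b in range(blocks)]
        words.append("".join(parts))
    return words

def encode(word):
    """Pack a word into an integer using 2 bits per base (A=0, C=1, G=2, T=3)."""
    code = 0
    for base in word:
        code = (code << 2) | "ACGT".index(base)
    return code
```

Both functions yield 12-base words, so either kind fits in 24 bits under this encoding; the gapped form samples an 18-base window and tolerates differences that fall in the skipped positions.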

  8. Word use distribution for the mouse sequence data at ~7.5-fold coverage. Plot labels: Useful Region; Real Data Curve; Poisson Curve
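As a rough sketch of the Poisson curve on slide 8: if every word were sampled independently at a mean depth of 7.5, the number of times a word is used would follow a Poisson distribution. The interpretation of the "useful region" as a window of occurrence counts, and the window bounds below, are assumptions made for illustration; the slide does not give them.

```python
import math

def poisson_pmf(n, lam):
    """Probability that a word is seen exactly n times at mean depth lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def fraction_in_window(lam, low, high):
    """Expected fraction of words whose use count lies in [low, high]."""
    return sum(poisson_pmf(n, lam) for n in range(low, high + 1))

MEAN_DEPTH = 7.5          # fold coverage quoted on the slide
LOW, HIGH = 2, 20         # hypothetical window bounds, not from the slide

most_likely_count = max(range(40), key=lambda n: poisson_pmf(n, MEAN_DEPTH))
kept = fraction_in_window(MEAN_DEPTH, LOW, HIGH)
```

Real data departs from this curve at both ends: words containing sequencing errors pile up at very low counts, and words from repeats form a long tail at high counts.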

  9. Sorted List of Each k-Mer and Its Read Indices
     Diagram labels: High bits; Low bits; a 64-bit word with fields of 64-2k and 2k bits

     k-mer        Read
     ACAGAAAAGC   10h06.p1c
     ACAGAAAAGC   12a04.q1c
     ACAGAAAAGC   13d01.p1c
     ACAGAAAAGC   16d01.p1c
     ACAGAAAAGC   26g04.p1c
     ACAGAAAAGC   33h02.q1c
     ACAGAAAAGC   37g12.p1c
     ACAGAAAAGC   40d06.p1c
     ACAGAAAAGG   16a02.p1c
     ACAGAAAAGG   20a10.p1c
     ACAGAAAAGG   22a03.p1c
     ACAGAAAAGG   26e12.q1c
     ACAGAAAAGG   30e12.q1c
     ACAGAAAAGG   47a01.p1c
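One way to obtain a list like the one on slide 9 is to pack each k-mer and its read index into a single 64-bit integer and sort the integers. The sketch below assumes the k-mer occupies the high 2k bits and the read index the remaining 64-2k bits; the slide labels the field widths but the assignment of fields to high and low bits is an assumption here, as are all function names.

```python
K = 10                      # word length matching the 10-base words on the slide
INDEX_BITS = 64 - 2 * K     # bits left for the read index
BASES = "ACGT"

def encode_kmer(kmer):
    """2 bits per base, first base in the most significant position."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASES.index(base)
    return code

def decode_kmer(code, k=K):
    return "".join(BASES[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))

def pack(kmer, read_index):
    """Assumed layout: k-mer in the high bits, read index in the low bits."""
    return (encode_kmer(kmer) << INDEX_BITS) | read_index

def unpack(entry):
    return decode_kmer(entry >> INDEX_BITS), entry & ((1 << INDEX_BITS) - 1)

def sorted_kmer_list(reads):
    """One packed entry per k-mer occurrence, sorted so equal k-mers are adjacent."""
    entries = [pack(seq[i:i + K], read_index)
               for read_index, seq in enumerate(reads)
               for i in range(len(seq) - K + 1)]
    entries.sort()
    return entries

example = [unpack(e) for e in
           sorted_kmer_list(["ACAGAAAAGG", "ACAGAAAAGC", "ACAGAAAAGC"])]
```

With this layout a plain integer sort groups all reads sharing a word into one contiguous run, which is what makes counting shared words between reads a single linear pass.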

  10. Relation Matrix: R(i,j) – number of kmer words shared between read i and read j

      i \ j    1    2    3    4    5    6   …   N
        1      -   41    0    0    0    0
        2     41    -   37    0    0    0
        3      0   37    -    0   22    0
        4      0    0    0    -    0   27
        5      0    0   22    0    -    0
        6      0    0    0   27    0    -
        …
        N

      Group 1: (1,2,3,5)
      Group 2: (4,6)
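The grouping on slide 10 can be reproduced by treating every nonzero R(i,j) as a link and collecting reads that are connected through any chain of links. The sketch below uses a union-find structure for that; the `min_shared` threshold is a hypothetical parameter added for illustration, and this is not presented as Phusion2's actual clustering code.

```python
# Nonzero entries of the relation matrix shown on the slide.
R = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}

def group_reads(n_reads, shared, min_shared=1):
    """Group reads 1..n_reads that are linked by at least min_shared words."""
    parent = list(range(n_reads + 1))

    def find(x):
        # Follow parents to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)

    groups = {}
    for read in range(1, n_reads + 1):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values())

groups = group_reads(6, R)
```

Read 5 joins Group 1 even though it shares words only with read 3, because membership is transitive. Raising `min_shared` splits groups that are held together by weak links.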

  11. Relation Matrix: R(i,j) – Implementation. Diagram labels: rows 1, 2, 3, 4, 5, … N; columns 1, 2, 3, 4, 5, 6, … j … 500; Read index; Number of shared kmer words (< 63)

  12. Break contigs without read pair coverage
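A minimal sketch of the idea on slide 12, assuming each read pair has already been placed on the contig and is represented by the half-open interval from the first base of one read to the last base of its mate. Merging overlapping intervals leaves exactly the stretches in which every junction between neighbouring bases is spanned by at least one pair; the contig is broken everywhere else. The function name and the input format are hypothetical.

```python
def break_contig(pair_spans):
    """Return contig segments (start, end) supported by read pairs.

    pair_spans: iterable of (start, end) intervals, end exclusive, one per
    read pair placed on the contig. Two spans are merged only if they overlap
    by at least one base, so a point that no pair crosses becomes a break.
    """
    segments = []
    for start, end in sorted(pair_spans):
        if segments and start < segments[-1][1]:
            # Overlaps the current segment: extend it.
            segments[-1][1] = max(segments[-1][1], end)
        else:
            # No pair links this span to the previous segment: start a new one.
            segments.append([start, end])
    return [tuple(segment) for segment in segments]

# Hypothetical 1,000-base contig with no pair crossing positions 500-650.
pieces = break_contig([(0, 300), (200, 500), (650, 900), (800, 1000)])
```

A production version would more likely require several spanning pairs rather than one, and would take insert-size limits into account when deciding which pairs count.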

  13. Tasmanian devil. Figure labels: Tasmanian devil; Wallaby; Opossum

  14. Tasmanian devil facial tumour disease (DFTD)
      • Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
      • Transmitted by biting
      • Commonly metastasises
      • First observed in 1996
      • Primarily affects adults >1 yr
      • Death in 4–6 months

  15. DFTD samples. Map labels, in extraction order: Area still DFTD free; DFTD originated here c.1996; Narawntapu; Mt William (2); Upper Natone; 2006; Frankford; Wisedale (?); Railton; 2007; St Mary’s (2); West Pencil Pine (3); Reedy Marsh; 2008; Trowunna (2); Bronte Park; Coles Bay; Tarraleah; Kempton (2); Mangalore; Fentonbury (no host); Nugent (2); 4; 14; Forestier (33); 13

  16. DFTD samples for sequencing. Map labels: Area still DFTD free; DFTD originated here c.1996; Narawntapu 2007; Mt William 2007 or 2008; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007. Legend: Strain 1, tetraploid; Strain 2; Strain 3; “Evolved”; Unknown strain

  17. Sequencing T. Devil on Illumina: Strategy
      • Tumour or normal genomic DNA
      • Fragments of defined size: 0.5, 5, 7 kb
      • Sequencing: 100 bp reads (short insert), 75 bp reads (long insert); sequencing performed at Illumina
      • Alignment using bwa, ssaha2; de novo assembly
      • Somatic mutations; germline variants

  18. Paired Reads Separated by “NN”

  19. Error Bases Correction

  20. Genome Assembly – T. Devil
      Solexa reads:
      • Number of read pairs: 528 million
      • Finished genome size: 3.5 Gb
      • Read length: 2x100 bp
      • Estimated read coverage: ~30x
      • Insert size: 410/50-600 bp
      • Number of reads clustered: 458 million
      Assembly features (contig stats):
                                    Phusion2     ABySS
      Total number of contigs       1,420,262    7,796,722
      Total bases of contigs        3.29 Gb      2.28 Gb
      N50 contig size               7,618        2,013
      Largest contig                76,418       31,045
      Average contig size           2,314        292
      Contig coverage on genome     ~94%         65%
      Mis-assembly errors           ?            ?
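The figures on slide 20 can be sanity-checked with a few lines of arithmetic. The sketch below uses the standard definition of N50 (the contig length at which the running total of lengths, taken from largest to smallest, first reaches half of all assembled bases) and the usual coverage estimate of sequenced bases divided by genome size. The function names are illustrative, and the contig lengths in the N50 example are made up.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def read_coverage(read_pairs, read_len, genome_size):
    """Fold coverage from paired reads: two reads per pair."""
    return 2 * read_pairs * read_len / genome_size

# Numbers quoted on the slide for the Tasmanian devil data.
devil_coverage = read_coverage(528_000_000, 100, 3_500_000_000)   # about 30x
phusion2_mean = 3.29e9 / 1_420_262    # close to the quoted 2,314
abyss_mean = 2.28e9 / 7_796_722       # close to the quoted 292

toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])   # made-up contig lengths
```

The small difference between the recomputed Phusion2 mean and the quoted 2,314 is consistent with the total base count being rounded to 3.29 Gb on the slide.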

  21. Figure labels: Brown Bear; Dog; Macropus eugenii (Wallaby); Monodelphis domestica (Opossum); Sminthopsis macroura (Dunnart)

  22. Tasmanian devil. Figure labels: Tasmanian devil; Wallaby; Opossum

  23. Melanoma cell line COLO-829 (Paul Edwards, Departments of Pathology and Oncology, University of Cambridge)

  24. Human Cancer Genome Assembly – Normal Cell
      Solexa reads:
      • Number of read pairs: 557 million
      • Finished genome size: 3.0 Gb
      • Read length: 2x75 bp
      • Estimated read coverage: ~25x
      • Insert size: 190/50-300 bp
      • Number of reads clustered: 458 million
      Assembly features (contig stats):
      • Total number of contigs: 1,020,346
      • Total bases of contigs: 2.713 Gb
      • N50 contig size: 8,344
      • Largest contig: 107,613
      • Average contig size: 2,659
      • Contig coverage over the genome: ~90%
      • Mis-assembly errors: ?

  25. Genome Assembly – Tumour Cell
      Solexa reads:
      • Number of read pairs: 562 million
      • Finished genome size: 3.0 Gb
      • Read length: 2x75 bp
      • Estimated read coverage: ~25x
      • Insert size: 190/50-300 bp
      • Number of reads clustered: 449 million
      Assembly features (contig stats):
      • Total number of contigs: 1,249,719
      • Total bases of contigs: 2.690 Gb
      • N50 contig size: 6,073
      • Largest contig: 72,123
      • Average contig size: 2,152
      • Contig coverage over the genome: ~90%
      • Mis-assembly errors: ?

  26. Rice Genome Assembly: one of the most difficult genomes on Earth?
      Solexa reads:
      • Number of read pairs: 97.9 million
      • Finished genome size: 440 Mb
      • Read length: 2x76 bp
      • Estimated read coverage: ~33x
      • Insert size: 500/50-600 bp
      • Number of reads clustered: 81.2 million
      Assembly features (contig stats):
      • Total number of contigs: 374,713
      • Total bases of contigs: 365 Mb
      • N50 contig size: 7,639
      • Largest contig: 72,321
      • Average contig size: 973
      • Contig coverage over the genome: ~83%
      • Mis-assembly errors: ?

  27. Acknowledgements:
      • Elizabeth Murchison
      • Erin Pleasance
      • Mike Stratton
      • Dirk Evers
      • Ole Schulz-Trieglaff
      • Qi Feng
      • Bin Han
