
Genomics and bioinformatics of tropical disease-causing parasites

Genomics and bioinformatics of tropical disease-causing parasites. Jane Carlton, Ph.D. Associate Investigator The Institute for Genomic Research. 1. Tropical disease-causing parasites and their genome sequencing projects 2. About TIGR 3. How to sequence a genome


Presentation Transcript


  1. Genomics and bioinformatics of tropical disease-causing parasites. Jane Carlton, Ph.D., Associate Investigator, The Institute for Genomic Research

  2. Outline
     1. Tropical disease-causing parasites and their genome sequencing projects
     2. About TIGR
     3. How to sequence a genome
     4. The Plasmodium genome sequencing projects
     5. How the accumulation of sequence data is pushing the development of bioinformatics
     6. Take-home messages

  3. 1. Projects. Apicomplexan sequencing projects:

  4. Kinetoplastid sequencing projects:

  5. Amitochondriate sequencing projects:

  6. Pathogenic worm sequencing projects: EST (expressed sequence tag) projects: www.nematodes.org

  7. 2. About TIGR. Parasite genome sequencing projects at TIGR
     • Jane Carlton: Plasmodium vivax, P. yoelii, Trichomonas vaginalis
     • Malcolm Gardner: Plasmodium falciparum, Theileria parva
     • Vish Nene: Amblyomma variegatum, Rhipicephalus appendiculatus
     • Najib El-Sayed: Trypanosoma brucei, T. cruzi, Schistosoma mansoni
     • Elodie Ghedin: Brugia malayi
     • Brendan Loftus: Entamoeba histolytica & spp., Anopheles gambiae, Aedes aegypti
     • Ruobing Wang: parasite vaccine development
     • Ian Paulsen*: Toxoplasma gondii
     • Herve Tettelin*: Biomphalaria glabrata

  8. TIGR
     • Not-for-profit research institute founded by J. Craig Venter, 1992
     • Known for development of whole genome shotgun (WGS) sequencing technology
     • First to sequence the entire genome of a free-living organism, Haemophilus influenzae, 1995
     • 1995-1998: Mycoplasma genitalium, Methanococcus jannaschii, Helicobacter pylori, Borrelia burgdorferi, Treponema pallidum
     • Over 40 microbial genomes sequenced, e.g. Bacillus anthracis
     • Eukaryotic genomes: rice, Arabidopsis, human chromosome 16, Plasmodium falciparum, P. vivax, Toxoplasma gondii, trypanosomes, etc.

  9. TIGR organization
     • Faculty: principal investigators
     • Laboratory staff: laboratory technicians, research associates, post-docs
     • SeqCore: library team, template team, sequencing team, closure team, R&D
     • IFX: software engineers, bioinformatics analysts, staff scientists, database managers, IT
     • Administrative staff: accounting, human resources, legal counsel, grants management
     • Publications, Conferences, Education & Training

  10. TIGR Education and Training programs
     • Educational programs: visiting TIGR MdBioLab MSc Genomics/Bioinformatics
     • Professional development: genomics course for educators prokaryotic annotation training DNA sequencing genomics boot camp
     • Student research: semester fellowship program summer fellowship program
     TIGR Conferences
     • GSAC: international Genome Sequencing and Analysis Conference
     • Microbial Genomes Conference: ASM & TIGR
     • Computational Genomics Conference: Jackson Lab & TIGR

  11. Average number of projects/month 50 Joint Technology Center (JTC): 45 million sequence reads per year

  12. Applied Biosystems 3730xl / ABI 3700 DNA Analyzer
     • 96 capillary array
     • 2 hours run time
     • 50% higher capacity than ABI 3700
     • POP-7™ separation matrix
     • Sequence reads up to 1,100 bases
     • Integrated auto-sampler and plate stacker
     • Internal bar code reader
     • Automated basecalling
     • Reduced reagent consumption
     • Walkaway automation
     • Increased sample throughput
     • High performance capillary arrays

  13. TIGR software tools Gene finding/annotation Sequencing/finishing Alignment

  14. Functional Genomics at TIGR
     • TIGR Gene Indices
     • Microarrays
     • Genotyping
     • Proteomics
     Image label: picking and gridding QBot

  15. PFGRC

  16. 3. How to sequence a genome
     • development of sequencing strategy and source of funding
     • procurement of DNA and initial library construction
     • test sequencing
     • large-scale random sequencing of small (2-3 kb), medium (10 kb) and large (>50 kb) insert libraries
     • analysis of raw sequence data with BLAST, RepeatFinder, etc.
     • release of genome data onto TIGR website
     • at 8-10X coverage, random sequencing stops
     • closure of sequence gaps and physical gaps
     • comparison to physical map
     • gene model prediction
     • final gene model annotation
     • release of data to GenBank and publication

  17. Full shotgun sequencing (flow diagram): Genomic DNA → large insert library (20-500 kb), with Marker1 and Marker2 → minimal tiling path → shotgun libraries: small (2-3 kb) and medium (10 kb) → sequencing (8-10X) → assembly (contigs, scaffolds) → gap closure → gene prediction, annotation and analysis

  18. Partial shotgun sequencing (flow diagram): Genomic DNA → shotgun libraries: small (2-3 kb) and medium (10 kb) → sequencing (5X) → assembly (contigs, scaffolds) → analysis
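
The coverage figures quoted for full (8-10X) and partial (5X) shotgun sequencing can be turned into rough read counts. The sketch below is illustrative only: it uses the standard relation coverage = (number of reads × mean read length) / genome size, plus the idealised Poisson estimate that a fraction e^(-coverage) of bases is hit by no read. The 500 bp read length and the 23 Mb genome size in the demo are assumed values, not figures from the slides, and real projects leave more gaps than this model predicts because of cloning bias and repeats.

```python
import math


def reads_needed(genome_size_bp, coverage, mean_read_length_bp):
    """Reads required for a target average coverage (coverage = N * L / G)."""
    return math.ceil(coverage * genome_size_bp / mean_read_length_bp)


def expected_uncovered_fraction(coverage):
    """Idealised Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome_size = 23_000_000  # assumed genome size in bp (hypothetical input)
    read_length = 500         # assumed mean high-quality read length in bp
    for fold in (5, 8, 10):
        n_reads = reads_needed(genome_size, fold, read_length)
        missed = expected_uncovered_fraction(fold)
        print(f"{fold}X: {n_reads} reads, ~{missed:.5f} of bases uncovered")
```

Under this model the uncovered fraction drops steeply between 5X and 8-10X, which is one way to see why partial shotgun data supports gene identification but not a finished sequence.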

  19. Genome sequencing terms
     • Raw sequence: unassembled sequence reads produced from sequencing of inserts from individual recombinant clones of a genomic DNA library.
     • Finished sequence: complete sequence of a genome with no gaps and an accuracy of > 99.9%.
     • Genome coverage: average number of times a nucleotide is represented by a high-quality base in random raw sequence.
     • Full shotgun coverage: genome coverage in random raw sequence required to produce finished sequence, usually 8-10 fold (‘8-10X’).
     • Partial shotgun coverage: typically 3-6X random coverage of a genome, which produces sequence data of sufficient quality to enable gene identification but which is not sufficient to produce a finished genome sequence.
     • Paired reads: sequence reads determined from both ends of a cloned insert in a recombinant clone.
     • Contig: contiguous DNA sequence produced from joining overlapping raw sequence reads.
     • Singleton: single sequence read that cannot be joined (‘assembled’) into a contig.
     • Scaffold: a group of ordered and orientated contigs known to be physically linked to each other by paired read information.
     • EST: expressed sequence tag generated by sequencing one end of a recombinant clone from a cDNA library. ESTs are single-pass reads and therefore prone to contain sequence errors.
     • GSS: genome survey sequence generated by sequencing one end of a recombinant clone from a genomic DNA library. The genomic DNA library can in some instances be enriched for the presence of coding regions, for example through use of mung bean nuclease digestion of genomic DNA prior to cloning.
     • SNP: single nucleotide polymorphism.
     • ORF: open reading frame, a stretch of codons in the same reading frame uninterrupted by STOP codons and calculated from a six-frame translation of DNA sequence.
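
The ORF definition above (a stop-free run of codons found by scanning all six reading frames) can be sketched in a few lines. This is a minimal illustration, not the gene finder used at TIGR: the function names and the `min_codons` threshold are hypothetical, the standard stop codons TAA/TAG/TGA are assumed, and, following the definition as given, no start codon is required.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard stop codons (assumed)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of stop-free codon runs in one frame; end is exclusive."""
    run_start = offset
    pos = offset
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if (pos - run_start) // 3 >= min_codons:
                yield run_start, pos
            run_start = pos + 3          # next run begins after the stop codon
        pos += 3
    if (pos - run_start) // 3 >= min_codons:
        yield run_start, pos             # run that reaches the end of the sequence


def six_frame_orfs(seq, min_codons=2):
    """Return (strand, frame_offset, orf_sequence) for all six reading frames."""
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset, min_codons):
                results.append((strand, offset, strand_seq[start:end]))
    return results


if __name__ == "__main__":
    for orf in six_frame_orfs("ATGAAATTTTAAGGG", min_codons=3):
        print(orf)
```

In an AT-rich genome stop codons (all AT-rich themselves) occur often by chance, so a realistic `min_codons` threshold would be far higher than the toy value used here.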

  20. 4. Sequencing the Plasmodium genomes
     Four species of malaria infect man: Plasmodium falciparum, P. vivax, P. malariae, P. ovale
     Four species of malaria infect rodents: P. yoelii, P. berghei, P. chabaudi, P. vinckei

  21. Plasmodium falciparum
     • ~30 million base pairs (Mb)
     • 80% (A+T)
     • 14 chromosomes
     • DNA “unstable” in E. coli
     • No large insert DNA clones suitable for sequencing
     • Too large for whole genome shotgun (in 1996)
     • Whole chromosome shotgun strategy was selected
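
Base composition figures such as the 80% (A+T) quoted above are simple counts over the sequence. The helper below is a small sketch with a hypothetical function name; it ignores ambiguous bases such as N, which is one common convention but an assumption here.

```python
def at_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are A or T."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("no unambiguous bases in sequence")
    return sum(base in "AT" for base in counted) / len(counted)


if __name__ == "__main__":
    toy_read = "ATATTTAAGCATTTANNTATA"  # made-up sequence, not real parasite data
    print(f"A+T = {100 * at_content(toy_read):.1f}%")
```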

  22. Comparison of genome features

     | Feature | P. y. yoelii | P. falciparum |
     |---|---|---|
     | Size (Mb) | 23.1 | 22.9 |
     | No. chromosomes | 14 | 14 |
     | Coverage (fold) | 5 | 14.5 |
     | No. gaps | 5,812 | 93 |
     | (G+C) content (%) | 22.6 | 19.4 |
     | No. genes | 5,878 | 5,268 |
     | Mean gene length (bp) | 1,298 | 2,283 |
     | Gene density (bp/gene) | 2,566 | 4,338 |
     | Genes with introns (%) | 54.2 | 53.9 |
     | Genes with ESTs (%) | 48.9 | 49.1 |
     | Genes with proteomic data (%) | 18.2 | 51.8 |
     | Exons: mean no./gene | 2.0 | 2.4 |
     | Exons: (G+C) content (%) | 24.8 | 23.7 |
     | Introns: (G+C) content (%) | 21.1 | 13.5 |
     | Intergenic sequences: (G+C) content (%) | 20.7 | 13.6 |
     | RNAs: no. tRNAs | 39 | 43 |
     | RNAs: no. 5S rRNAs | 3 | 3 |
     | RNAs: no. rRNA units | 4 | 7 |

  23. P. falciparum genome status

  24. Eukaryotic annotation (pipeline diagram). Components shown: Annotation Station/Manatee; Project DB; Annotation DB; gene finders; EGC; DDS/DPS; gene models; BLAST; PFAM/TIGRFAM; SignalP/TMHMM; alignments of genomic sequence to proteins and ESTs; functional assignments

  25. PFB0680w

  26. The P. falciparum genome

  27. Distribution of gene lengths 15.5% 3.0-3.6%

  28. The P. falciparum proteome

  29. Florens et al. (2002) Nature 419:520-526: 52% of predicted gene products detected by proteomics

  30. Metabolism and transport
     • Analysis based on similarity searches with sequences of known enzymes
     • 14% (733) of genes encode enzymes
     • Lower than in bacterial genomes (25-33%)
     • Either enzymes are more difficult to identify, due to the AT-rich genome and the evolutionary distance between P.f. and other sequenced organisms
     • Or P.f. has a smaller proportion of its genome devoted to enzymes, i.e. reduced metabolic potential

  31. Analysis of transporters in P. falciparum

  32. Organization of multi-gene families in P. falciparum

  33. P. falciparum Genome Summary

  34. Where do I find Plasmodium genome data? 1. TIGR and Sanger websites (www.tigr.org), or e-mail us!

  35. TIGR Gene Indices home page www.tigr.org/tdb/tgi.shtml Newest addition…

  36. 2. PlasmoDB http://PlasmoDB.org

  37. 3. GenBank * Genomes & Maps * dbEST * dbGSS * nr * Malaria Genetics & Genomics

  38. 5. How the accumulation of sequence data is pushing the development of bioinformatics Comparative genomics of malaria parasites as an example: MUMmer Position Effect OWEN Functional constraint analysis

  39. Comparative genomics terms
     • Homologs: genes related to each other by descent from a common ancestral DNA sequence.
     • Orthologs: homologous genes generated by speciation, i.e. related to each other by vertical descent.
     • Paralogs: homologous genes generated by duplication, i.e. related to each other by horizontal descent.
     • Synteny: strictly, this refers to the presence of two or more genes on the same chromosome in the same species. However, it is frequently used to mean conservation of orthologous gene location between species, i.e. the presence of orthologous genes that are syntenic in one species and also located on the same chromosome in a second species, without regard to gene order.
     • Conserved synteny: two or more genes located on the same chromosome in different species regardless of gene order.
     • Conserved linkage: a group of genes conserved in synteny and order between species.

  40. 3,310 (~63%) bi-directional orthologs identified between 5,268 P. falciparum (Pfal) proteins and 5,878 P. y. yoelii (Pyy) proteins
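
One common way to call bi-directional orthologs is the reciprocal best hit rule: protein A in one species and protein B in the other are paired only if each is the other's top-scoring match. The slide does not say which method was used, so the sketch below is an assumption about the general approach, with hypothetical function names and made-up scores standing in for real similarity-search output.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


if __name__ == "__main__":
    # Toy scores with invented protein ids, not real search results.
    pf_vs_py = {("pf1", "py1"): 200, ("pf1", "py2"): 150, ("pf2", "py2"): 90}
    py_vs_pf = {("py1", "pf1"): 198, ("py2", "pf1"): 150, ("py2", "pf2"): 80}
    print(reciprocal_best_hits(pf_vs_py, py_vs_pf))
    # The quoted percentage is relative to the 5,268 P. falciparum proteins.
    print(f"{100 * 3310 / 5268:.1f}%")
```

In the toy data pf2's best hit is py2, but py2 prefers pf1, so only one pair survives. That asymmetry is what the reciprocal requirement is designed to filter out.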

  41. P.y.yoelii orthologs of P. falciparum candidate vaccine/drug genes

  42. Paralogous gene families in Plasmodium

  43. Gene synteny between species: “Conservation of orthologous gene location between different species.”
     Diagram panels: conserved synteny (Species 1 vs Species 2); conserved linkage/order (Species 1 vs Species 2)
     In Plasmodium:
     • whole-chromosome synteny between rodent malaria spp.
     • synteny of chromosome blocks between rodent malaria spp. and Pfal
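
The distinction between conserved synteny (same chromosome, any order) and conserved linkage (same chromosome and same order) can be made concrete with a small classifier. This is a sketch under stated assumptions: the function name, the gene ids and the position dictionaries are hypothetical, and a group whose order is exactly reversed in the second species is still counted as conserved linkage, a choice the slides do not address.

```python
def classify_gene_group(genes, species1, species2):
    """Classify a group of orthologous genes across two species.

    genes: list of ortholog ids present in both species.
    species1, species2: dict mapping ortholog id -> (chromosome, position).
    """
    chroms1 = {species1[g][0] for g in genes}
    chroms2 = {species2[g][0] for g in genes}
    if len(chroms1) != 1 or len(chroms2) != 1:
        return "not syntenic"            # group is split across chromosomes
    order1 = sorted(genes, key=lambda g: species1[g][1])
    order2 = sorted(genes, key=lambda g: species2[g][1])
    if order1 == order2 or order1 == order2[::-1]:
        return "conserved linkage"       # same chromosome and same gene order
    return "conserved synteny"           # same chromosome, order rearranged


if __name__ == "__main__":
    sp1 = {"a": ("chr1", 1), "b": ("chr1", 2), "c": ("chr1", 3)}
    sp2 = {"a": ("chrX", 20), "b": ("chrX", 10), "c": ("chrX", 30)}
    print(classify_gene_group(["a", "b", "c"], sp1, sp2))
```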

  44. Creation of a genome-wide synteny map (figure): P. yoelii contigs arranged in a tiling path along Pf chromosome 10. Tiling path evidence shown: clone mate evidence, EST evidence, physical map evidence, PCR, or no evidence. Gene pairs (gene 1, gene 2) are marked either as conserved synteny or as synteny not conserved (synteny break point?)
