
PHYTOME a plant comparative genomics resource

www.PHYTOME.org: a plant comparative genomics resource. Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann. Outline of today’s presentation. What kind of data is stored in Phytome, and how did we generate this data? How can you search Phytome?


Presentation Transcript


  1. www.PHYTOME.org a plant comparative genomics resource Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann

  2. Outline of today’s presentation • What kind of data is stored in Phytome - and how did we generate this data? • How can you search Phytome? • What kind of results will Phytome give you?

  3. Phytome integrates
     • organismal phylogeny
     • gene family information:
       o sequences
       o alignments
       o phylogenies
     • genetic and physical maps

  4. Phytome: applications
     • Starting with a gene family
       o resolve orthology/paralogy relationships
       o identify coevolving families
     • Starting with a species
       o explore lineage-specific diversification
       o guide comparative mapping bench-work
     • Starting with a chromosome segment
       o identify homologous segments
       o predict unobserved gene content (candidate QTL)

  5. overview of the pipeline

  6. data acquisition
     [Figure labels: DNA, pre-RNA, mRNA, protein, cDNA, cDNA clone]
     • ESTs (expressed sequence tags)
       o are partial sequences of expressed genes
       o are error-prone: they contain sequencing and frameshift errors
       o are very useful for discovering new genes, provide data on gene expression, and make up much of the sequence data
     • EST contig assemblies
       o contigs: contiguous sequences assembled from multiple overlapping ESTs
       o singletons: ESTs that don’t match other ESTs in the dataset
     • sources
       o TIGR, PlantGDB, NCBI, TAIR, Sputnik, Plant Genome Network
       o for each species, we used the source with the largest number of ESTs

  7. data acquisition/organismal phylogenies

  8. protein sequence prediction
     • from EST contigs to peptide sequences: ESTwise
       o translate the cDNA sequence (ESTs) in all reading frames
       o compare the translated DNA to a database of known proteins (Swiss-Prot, TrEMBL)
       o use this information for gene prediction/translation
       o correct frameshift errors based on the homology information
     [Figure: example ESTwise alignment of a known protein against an EST, with the EST codons shown beneath the aligned residues]
     protein:        TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV
     translated EST: TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV
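The "translate in all reading frames" step can be illustrated with a minimal Python sketch. This is not ESTwise itself, which additionally uses homology to a protein database to pick the frame and repair frameshifts. The function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases such as N are translated to X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organelle or alternative code is needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

A homology-based tool would then compare each of the six peptides to known proteins and keep the frame (or combination of frames, across a frameshift) that aligns best.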

  9. protein family clustering (Tribe-MCL)
     • input:
       o a set of proteins
       o all-vs-all BLAST values
     • method:
       o construct a weighted graph
       o convert it into a Markov matrix
       o alternate expansion and inflation until the matrix no longer changes
     • output:
       o clusters of related proteins: protein families

  10. protein family clustering (Tribe-MCL)
      • input:
        o a set of proteins
        o all-vs-all BLAST values
      • method:
        o construct a weighted graph
        o convert it into a Markov matrix
        o alternate expansion and inflation until the matrix no longer changes
      • output:
        o clusters of related proteins: protein families
      image taken from the MCL homepage: http://micans.org/mcl/
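The expansion/inflation loop can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the MCL program used by Phytome: the input is assumed to be a symmetric matrix of non-negative similarity weights (for example, scores derived from BLAST results), self-loops of weight 1 are added, and the function names and the default inflation value of 2.0 are hypothetical.

```python
def normalize_columns(m):
    """Scale each column to sum to 1, giving a column-stochastic matrix."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (one more step of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a weighted graph; returns sorted lists of node indices."""
    n = len(weights)
    # Add self-loops, then convert the weighted graph into a Markov matrix.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # the matrix no longer changes
            break
    # Each row with nonzero entries lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Toy graph: proteins 0-2 hit each other, proteins 3-4 hit each other.
toy = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(toy))  # [[0, 1, 2], [3, 4]]
```

Raising the inflation value generally produces finer-grained clusters, which is why the parameter matters when defining protein families.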

  11. protein family clustering(Tribe-MCL)

  12. multiple sequence alignment
      tested program   quality   speed     algorithm
      ClustalW         +         ++        progressive
      Mafft i          ++        +         iterative
      Mafft p          ++        +++       progressive
      T-Coffee         +++       memory!   consistency-based/progressive
      Dialign          +++       time!     consistency-based
      progressive sequence alignment:
      1. generate pairwise distances between all sequences
      2. use the distances to construct a guide tree
      3. start by aligning the most similar sequences
      4. progressively add more sequences to the existing alignment

  13. multiple sequence alignment
      • identification of homologous proteins, clustering these into a Phytome family, generation of a multiple sequence alignment
      • identification of homologous sequence positions within the homologous proteins, i.e., of columns of amino acids that share a common ancestral amino acid

  14. multiple sequence alignment
      1. find columns that will be retained
         • remove columns with low average pairwise scores
         • remove columns with a high percentage of gaps

  15. multiple sequence alignment
      1. find columns that will be retained
         • remove columns with low average pairwise scores
         • remove columns with a high percentage of gaps
      2. find sequences that will be retained
         • remove sequences with a high proportion of gaps within the retained columns
         • remove misaligned sequences (i.e., with a low overall score)
      3. final check
         • are enough sequences left for a phylogeny?
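The gap-based parts of this filtering can be sketched as follows. This is only an illustration: it covers the gap criteria and the final check, leaves out the score-based criteria (low average pairwise score, low overall score), and the function names and the thresholds (0.5, 0.5, and a minimum of 3 sequences) are hypothetical, not the values used by Phytome.

```python
def gap_fraction(chars, gap="-"):
    """Fraction of gap characters in a column or sequence."""
    return sum(c == gap for c in chars) / len(chars)


def trim_alignment(alignment, max_col_gaps=0.5, max_seq_gaps=0.5, min_seqs=3):
    """Filter an alignment given as {name: aligned_sequence}.

    Returns (kept_sequences, enough_for_phylogeny).
    """
    names = list(alignment)
    length = len(alignment[names[0]])

    # Step 1: retain columns whose gap percentage is not too high.
    keep_cols = [
        j for j in range(length)
        if gap_fraction([alignment[n][j] for n in names]) <= max_col_gaps
    ]
    trimmed = {n: "".join(alignment[n][j] for j in keep_cols) for n in names}

    # Step 2: retain sequences that are not mostly gaps in the kept columns.
    kept = {n: s for n, s in trimmed.items()
            if s and gap_fraction(s) <= max_seq_gaps}

    # Step 3: final check, are enough sequences left for a phylogeny?
    return kept, len(kept) >= min_seqs


example = {"a": "MK-LV", "b": "MKALV", "c": "MK-LV", "d": "---L-"}
kept, ok = trim_alignment(example)
print(sorted(kept), ok)  # ['a', 'b', 'c'] True
```

Filtering columns before sequences matters: a sequence is judged only on the columns that survive, so a fragment is not penalized for gaps in columns that were discarded anyway.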

  16. phylogenetic inference
      • generate distance matrix
      • generate unrooted neighbor-joining tree
      • midpoint-root the tree
      • do molecular clock test
      • software: PHYLIP, TreePuzzle
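The first two steps can be illustrated with a small sketch. It uses a plain p-distance (fraction of differing residues over ungapped columns) as a stand-in for the model-based distances a package such as PHYLIP would compute, and it shows only the pair-selection criterion of neighbor-joining, not the full tree construction, rooting, or clock test. All function names are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns ungapped in both sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no shared ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances for aligned sequences."""
    n = len(seqs)
    return [[0.0 if i == j else p_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]


def nj_first_pair(d):
    """Return the pair (i, j) that neighbor-joining would merge first.

    Uses the criterion Q(i, j) = (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])
    and picks the pair with the smallest Q.
    """
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2]


# Five-taxon example distances (arbitrary units).
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_pair(d))  # (0, 1)
```

Note that the pair with the smallest raw distance (taxa 3 and 4) is not the one joined first; the Q criterion corrects for how far each taxon is from all the others.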

  17. defining subfamilies
      [Figure: numbered groupings illustrating how subfamilies are defined]

  18. webflow, overview
      • search pages
      • result pages

  19.   

  20. Lab meeting, Sept 13, 2004: Phytome demo
      Dihui - BLAST search
      • A friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, it's salt-tolerant, and that's all I know about it). She just cloned a cDNA and wants to find out more about it: what it does, and which genes in which other taxa it is related to.
      • Though Lophopyrum is not among the species represented in Phytome, I offered to see whether I could find out more about her gene.
      • Best to use for this: the single BLAST search.
      • Navigate to the single BLAST search and explain the page. Mention batch BLAST.
      • Paste the friend's sequence into the appropriate field:
        MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGMAGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH
      • Explain the results page.
      • View the best result: taes7111 from wheat.
      • Go to the best-scoring family: 1980.
      Stefanie - Unigene search
      • http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167
      • Search Phytome for InterproEntry 000167.
      • Look at the hvul1175 entry:
        o the family and subfamily ID
        o InterPro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily
        o the species name
        o a link to the primary source for this unigene sequence
        o a list of related unigenes (from all sources) that contain common GenBank accession numbers in their assembly
        o the predicted peptide sequence (available for download in FASTA format)
      Jason - "restrict by species" search
      • You can search for families that do or do not contain members from particular species. Navigate to the "restrict by species" search and explain the page.
      • The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information), and you can select families to include or exclude using the radio buttons to the right of each species name.
      • If the default "either" is selected, Phytome will return a family regardless of whether it has members from that species.
      • I'm interested in monocot gene families (from Hordeum, barley, to Allium, onion): I want to exclude all other taxa and only use gene families with monocot members. NOTE: explain the difference between "include" monocots and "either" monocots: because species with small numbers of Unipeptides will necessarily lack members in most families, selecting "include" will return NO families!
      • 119273 families were retrieved. Their family IDs are shown.
      • Click on family number 1980.
      Stefanie - family results page
      • The "Family Information Page" includes
        o related families, if this family is part of a superfamily (?)
        o hyperlinks to subfamilies (these will work if the "Subfamily" tab is selected)
        o a link to a list of family members excluded from the reduced alignment by REAP
        o a list of the species represented within the family (these will work with the default species tab)
      • The tabs below allow one to view
        o a list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected; from these lists, you may select members to include in a multiple alignment and/or phylogeny
        o InterPro and GO assignments for an exemplar of each subfamily
      • By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component GenBank accession numbers.
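The include/exclude/either logic of the "restrict by species" search can be sketched as a simple set filter. This is a guess at the semantics described in the demo notes, not Phytome's actual query code: "include" is assumed to require at least one member from that species, "exclude" to require none, and "either" to impose no constraint. The function name, the toy family IDs, and the family contents are hypothetical.

```python
def restrict_by_species(families, choices):
    """Return the IDs of families that satisfy the per-species choices.

    families: dict mapping family ID -> set of species with members in it.
    choices:  dict mapping species -> "include", "exclude", or "either";
              species not listed default to "either".
    """
    required = {s for s, c in choices.items() if c == "include"}
    forbidden = {s for s, c in choices.items() if c == "exclude"}
    return sorted(
        fid for fid, species in families.items()
        if required <= species and not (forbidden & species)
    )


# Toy data (hypothetical family IDs and memberships).
families = {
    1: {"Hordeum vulgare", "Oryza sativa"},
    2: {"Oryza sativa", "Arabidopsis thaliana"},
    3: {"Allium cepa"},
}

# Exclude a non-monocot, leave the monocots on "either".
print(restrict_by_species(families, {"Arabidopsis thaliana": "exclude"}))  # [1, 3]

# Setting several monocots to "include" requires all of them at once.
print(restrict_by_species(
    families, {"Hordeum vulgare": "include", "Allium cepa": "include"}
))  # []
```

The second query shows why the note warns against "include": every included species must be present in the same family, so species with few Unipeptides quickly empty the result set.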

  21. protein family clustering (Tribe-MCL)
      [Figure: cluster labels assigned to the same set of proteins at five inflation values, I = 5, 3.6, 2.8, 2.0, 1.2]

  22. ...some numbers
      • almost 1 million EST contigs/singletons
      • ESTwise translation: 730,000 unigenes
      • all-vs-all BLAST: 640,000 unigenes, 110,000 singletons
      • to be clustered into families

  23. data acquisition
      source columns on the slide: NCBI, PGDB, PGN, SPNK, TIGR (one X per species; the column of each X is not recoverable from this transcript)
      species                                 tax_id   common name
      Allium cepa                             4679     onion
      Amborella trichopoda                    13333    amborella
      Arabidopsis thaliana                    3702     thale cress
      Avena sativa                            4498     oat
      Beta vulgaris                           161934   sugarbeet
      Brassica napus                          3708     rape
      Capsicum annuum                         4072     (ornamental) pepper
      Ceratopteris richardii                  49495    water sprite or Indian fern
      Citrus sinensis                         2711     orange
      Cryptomeria japonica                    3369     Japanese cedar
      Cucumis sativus                         3659     cucumber
      Cycas rumphii                           58031    sago palm or seashore cycad
      Eschscholzia californica                3467     California poppy
      Glycine max                             3847     soybean
      Gossypium hirsutum                      3635     cotton (tetraploid)
      Helianthus annuus                       4232     sunflower
      Hordeum vulgare                         4513     barley
      Lactuca sativa                          4236     lettuce
      Lotus corniculatus                      47247    lotus
      Lycopersicon esculentum                 4081     tomato
      Marchantia polymorpha                   3197     marchantia
      Medicago truncatula                     3880     barrel medic
      Mesembryanthemum crystallinum           3544     ice plant
      Nicotiana benthamiana                   4100     wild tobacco
      Oryza sativa                            4530     rice
      Physcomitrella patens                   3218     Physcomitrella moss
      Pinus taeda                             3352     loblolly pine
      Phaseolus coccineus                     3886     scarlet runner bean
      Populus tremula x Populus tremuloides   47664    aspen
      Prunus persica                          3760     peach
      Saccharum officinarum                   4547     plume grass or sugar cane
      Secale cereale                          4550     rye
      Solanum tuberosum                       4113     potato
      Sorghum bicolor                         4558     sorghum
      Stevia rebaudiana                       55670    candyleaf
      Theobroma cacao                         3641     cacao
      Triticum aestivum                       4565     wheat
      Vitis vinifera                          29760    wine grape
      Zea mays                                4577     corn
      Zinnia elegans                          34245    zinnia

  24. multiple sequence alignment
      tested program   quality   speed     algorithm
      ClustalW         +         ++        progressive
      Mafft i          ++        +         iterative
      Mafft p          ++        +++       progressive
      T-Coffee         +++       memory!   consistency-based/progressive
      Dialign          +++       time!     consistency-based
