
HOGENOM a phylogenomic database

HOGENOM a phylogenomic database. Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel. Several phylogenomic databases developed at LBBE/PRABI.


Presentation Transcript


  1. HOGENOM a phylogenomic database. Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel

  2. Several phylogenomic databases developed at LBBE/PRABI • HOVERGEN: vertebrate proteins from UniProt; clustering with SiLiX • HOMOLENS: proteins from Ensembl complete genomes; clustering from Ensembl; trees calculated and annotated (S, D, L) with new methods (PhylDog, LBBE) • HOGENOM: proteins from all available complete genomes (Bacteria, Eukaryota, Archaea); clustering with SiLiX and post-processing with HiFiX; trees will be annotated (S, D, L, T)

  3. HOGENOM characteristics • All complete genomes from the whole tree of life (not restricted to a particular phylum) • Proposes "gene families": full-length homologous sequences (different from "domain families")

  4. Domain vs. gene families: protein domain family • Families of homologous protein domains (ProDom): evolution by domain shuffling (duplication, loss, translocation)

  5. Domain vs. gene families: homologous gene family • Families of homologous protein domains (ProDom): evolution by domain shuffling (duplication, loss, translocation) • Homologous gene families (HOGENOM): evolution of homologous genes by speciation, by gene duplication, or by horizontal transfer; sequences are homologous over their entire length (or almost)

  6. Orthologs and paralogs in HOGENOM • HOGENOM is centered on phylogenetic trees of gene families • Information on orthologs and paralogs can be deduced from gene trees: from the annotation of gene trees (Duplication, Speciation, Transfer), and from query tools such as tree-pattern matching

  7. Building • Compare all proteins against each other (BLAST) • Cluster homologous sequences into families (SiLiX + HiFiX) • Compute multiple alignments for each family • Compute phylogenetic trees for each family • Annotate phylogenetic trees (gene duplications, losses, transfers)

  8. Compare all proteins against each other • Iterative BLAST calculation • Use of a non-redundant protein sequence database (all known proteins, about 20,000,000 non-redundant sequences), associated with a resulting BLAST hits database (from which BLAST hits may be extracted) • Cluster, grid and cloud computing

  9. Building • Compare all proteins against each other (BLAST) • Cluster homologous sequences into families (SiLiX + HiFiX) • Compute multiple alignments for each family • Compute phylogenetic trees for each family • Annotate phylogenetic trees (gene duplications, losses, transfers)

  10. SiLiX, 1st step: similarity search • Protein database • Local pairwise alignments • BLASTP, BLOSUM62, E ≤ 10⁻⁴
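The E-value cutoff on this slide can be illustrated with a minimal sketch. It assumes the hits are available in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `filter_hits` and the choice to skip self-hits are assumptions made for illustration, not part of the HOGENOM pipeline as presented.

```python
import csv
import io

E_VALUE_MAX = 1e-4  # cutoff quoted on the slide (E <= 10^-4)


def filter_hits(tabular_text, e_max=E_VALUE_MAX):
    """Return (query, subject, evalue) for hits with E-value <= e_max.

    Self-hits (query == subject) are skipped; this is an assumption
    of the sketch, not something stated in the slides.
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # ignore blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= e_max:
            kept.append((query, subject, evalue))
    return kept


# Toy input: three hits for query A, in 12-column tabular form.
rows = [
    ["A", "B", "40.0", "100", "60", "0", "1", "100", "1", "100", "1e-20", "80.0"],
    ["A", "C", "30.0", "50", "35", "0", "1", "50", "1", "50", "0.5", "20.0"],
    ["A", "A", "100.0", "100", "0", "0", "1", "100", "1", "100", "1e-50", "200"],
]
example = "\n".join("\t".join(r) for r in rows) + "\n"
hits = filter_hits(example)
```

In this toy input only the A/B hit survives: A/C is above the cutoff and A/A is a self-hit.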

  11. SiLiX, 2nd step: SiLiX clustering • Use the all-against-all BLAST hits

  12. SiLiX: selection of consistent HSPs [Figure: HSPs S1 to S4 between Seq. A and Seq. B, then the retained HSPs S1' and S2, with lengths lgHSP1, lgHSP2 and offsets ∆lg1, ∆lg2, ∆lg3]

  13. SiLiX: single linkage clustering • Criteria: HSP ≥ 80% of the length, identity ≥ 35% [Figure: sequences A, B and C grouped into one cluster (A, B, C)]

  14. SiLiX: single linkage clustering with alignment coverage constraints (Miele et al., BMC Bioinformatics 2011) • Computing efficiency: ultra-fast, memory efficient, scalable (parallel architecture) • Clustering quality: at least as good as the previously published methods
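Single linkage clustering under the thresholds quoted on the slides (coverage ≥ 80%, identity ≥ 35%) can be sketched with a union-find structure. This is an illustrative sketch, not the SiLiX implementation: it assumes each hit already carries a precomputed coverage and identity (in SiLiX the coverage comes from the selected consistent HSPs), and the names `find` and `single_linkage` are hypothetical.

```python
def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_linkage(sequences, hits, min_cov=0.80, min_id=0.35):
    """Group sequences connected by at least one accepted hit.

    hits: iterable of (seq_a, seq_b, coverage, identity), with coverage
    and identity as fractions in [0, 1] (assumed precomputed).
    """
    parent = {s: s for s in sequences}
    for a, b, coverage, identity in hits:
        if coverage >= min_cov and identity >= min_id:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters
    clusters = {}
    for s in sequences:
        clusters.setdefault(find(parent, s), set()).add(s)
    return sorted(sorted(c) for c in clusters.values())


families = single_linkage(
    ["A", "B", "C", "D"],
    [("A", "B", 0.90, 0.50), ("B", "C", 0.85, 0.40), ("C", "D", 0.50, 0.90)],
)
```

A and C end up in the same cluster without a direct hit between them, because both are linked to B; D stays alone because its only hit fails the coverage constraint. This transitivity is what makes single linkage sensitive to the over-extended alignments discussed on the next slide.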

  15. However … • Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in the same family • The risk of alignment over-extension is low, but becomes a problem for very large protein families • Use more stringent clustering criteria? No: optimal clustering criteria are not the same for all families

  16. HiFiX • The mode and tempo of evolution are specific to each protein family • A multiple alignment provides information about the specific pattern of evolution of a family • => this can be used to decide whether or not a new sequence belongs to that family

  17. HiFiX • Step 1: rapid clustering (SiLiX), giving pre-families • Step 2: sub-clustering of pre-families into homogeneous protein clusters, giving sub-families • Step 3: progressive merging of sub-families into families, with evaluation of multiple alignment quality at each step, giving families

  18. HiFiX

  19. HiFiX

  20. HiFiX

  21. Results of clustering • About 7,000,000 proteins clustered into 300,000 families • Family size distribution (number of sequences: number of families) • at least 2: 296,920 • 2 to 10: 242,398 • 10 to 500: 53,450 • 500 to 2000: 1,026 • more than 2000: 79

  22. Building • Compare all proteins against each other (BLAST) • Cluster homologous sequences into families (SiLiX + HiFiX) • Compute multiple alignments for each family • Compute phylogenetic trees for each family • Annotate phylogenetic trees (gene duplications, losses, transfers)

  23. Compute multiple alignments • All alignments (~300,000) have been calculated with Clustal Omega

  24. Building • Compare all proteins against each other (BLAST) • Cluster homologous sequences into families (SiLiX + HiFiX) • Compute multiple alignments for each family • Compute phylogenetic trees for each family • Annotate phylogenetic trees (gene duplications, losses, transfers)

  25. Compute phylogenetic trees • Question: what about alternative splicing?

  26. Alternative splicing • In eukaryotes, due to alternative splicing, a single gene may be transcribed into several transcripts

  27. Transcripts in HOGENOM6 • We selected all the transcripts for each gene, because the longest transcript is not always the best!

  28. Selection of a representative isoform in HOGENOM, because: • We don't want several proteins for the same gene in a phylogenetic tree: they may be seen as a duplication • We want 1 protein per gene for statistical comparison among organisms

  29. Selection of a representative isoform: how?

  30. Selection of a representative isoform: how? • Eukarya: 1 or more transcripts per gene • Archaea and bacteria: 1 transcript per gene

  31. Selection of a representative isoform: how? [Figure: clustering of sequences from Eukarya, Archaea and bacteria]

  32. Selection of a representative isoform: how? • First step: when a gene has isoforms in different families, choose a family for the gene

  33. Selection of a representative isoform: how? • We select the family with the highest number of eukaryotic genes (and not proteins) [Figure: three candidate families containing 2, 2 and 3 genes]

  34. Selection of a representative isoform: how? • We select the family with the highest number of eukaryotic genes (and not proteins) • If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins [Figure: three candidate families containing 2, 2 and 3 genes]

  35. Selection of a representative isoform: how? • We select the family with the highest number of eukaryotic genes (and not proteins) • If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins • If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins [Figure: three candidate families containing 2, 2 and 3 genes]

  36. Selection of a representative isoform: how? • We select the family with the highest number of eukaryotic genes (and not proteins) • If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins • If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins • The "rejected" isoforms are called "ISOFORMEX" • Some families may finally be empty after this [Figure: three candidate families containing 2, 2 and 3 genes]
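The three-level rule of this first step maps naturally onto a tuple comparison. The sketch below is a hedged illustration: the data layout (a dict from family id to a list of `(protein_id, gene_id, is_eukaryote)` tuples) and the function names are assumptions, and the slides do not say what happens if all three counts are tied, so here the first family in sorted id order wins.

```python
def family_key(members):
    """Ranking key: (eukaryotic genes, eukaryotic proteins, all proteins)."""
    euk = [(prot, gene) for prot, gene, is_euk in members if is_euk]
    return (len({gene for _, gene in euk}), len(euk), len(members))


def choose_family(gene, families):
    """Pick the family that keeps `gene`, among families holding its isoforms.

    families: dict family_id -> list of (protein_id, gene_id, is_eukaryote).
    Remaining ties go to the first family id in sorted order (assumption).
    """
    candidates = [
        fid for fid, members in sorted(families.items())
        if any(g == gene for _, g, _ in members)
    ]
    return max(candidates, key=lambda fid: family_key(families[fid]))


def isoformex(gene, families):
    """Isoforms of `gene` left in the non-selected families."""
    chosen = choose_family(gene, families)
    return sorted(
        prot
        for fid, members in families.items() if fid != chosen
        for prot, g, _ in members if g == gene
    )


toy = {
    "F1": [("p1", "g1", True), ("p2", "g2", True)],
    "F2": [("p3", "g1", True), ("p4", "g3", True), ("p5", "g4", True)],
}
chosen = choose_family("g1", toy)
rejected = isoformex("g1", toy)
```

Gene g1 has one isoform in each family; F2 holds three eukaryotic genes against two for F1, so F2 is kept and p1 becomes an ISOFORMEX.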

  37. Selection of a representative isoform: how? • Second step: when a gene has isoforms in a family, choose a representative isoform for the gene [Figure: three families containing 2, 2 and 3 genes]

  38. Selection of a representative isoform: how? • Second step: when a gene has isoforms in a family, choose a representative isoform for the gene [Figure: the same three families (2, 2 and 3 genes), marked with question marks]

  39. Selection of a representative isoform: how? • We use the alignment

  40. Selection of a representative isoform: how? • We use the alignment • Removal of ISOFORMEX sequences

  41. Selection of a representative isoform: how? • We use the alignment • Selection of positions with < 50% gaps
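The column filter on this slide can be sketched as follows. The sketch assumes the alignment is a list of equal-length strings with `-` as the gap character and that ISOFORMEX sequences have already been removed; `keep_columns` is a hypothetical helper name.

```python
def keep_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return indices of columns whose gap fraction is strictly below the limit."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    return [
        i for i in range(n_cols)
        if sum(seq[i] == gap for seq in alignment) / n_rows < max_gap_fraction
    ]


columns = keep_columns(["AC-T", "A--T", "ACGT", "A--T"])
```

Here the second column has exactly 50% gaps and is dropped, since the slide asks for strictly less than 50%.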

  42. Selection of a representative isoform: how? • For each isoform of a given gene, for each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes • The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN [Figure: worked example on an alignment; counts shown: 1 2 2 2]

  43. Selection of a representative isoform: how? • For each isoform of a given gene, for each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes • The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN [Figure: worked example on an alignment; counts shown: 1 2 1 2 2]

  44. Selection of a representative isoform: how? • For each isoform of a given gene, for each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes • The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN [Figure: worked example on an alignment; counts shown: 1 2 1 2 2 2 2]
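One possible reading of this scoring rule is sketched below. It interprets the rule as: at each retained position, an isoform earns one point per other eukaryotic gene that has at least one isoform with the same residue. That reading, the decision not to score gap characters, the alphabetical tie-break and the `REPRESENTATIVE` label are all assumptions of the sketch; only the ISOFORMIN tag comes from the slides.

```python
def score_isoform(seq, other_genes, columns, gap="-"):
    """Count, over the given columns, the other genes sharing seq's residue."""
    score = 0
    for i in columns:
        if seq[i] == gap:
            continue  # gaps are not scored (assumption)
        for isoforms in other_genes.values():
            if any(other[i] == seq[i] for other in isoforms):
                score += 1
    return score


def pick_representative(gene, aligned, columns):
    """aligned: gene -> {isoform_id: aligned sequence}, eukaryotic genes only.

    Returns (best_isoform_id, tags); ties go to the first id in sorted order.
    """
    others = {g: list(iso.values()) for g, iso in aligned.items() if g != gene}
    scores = {
        iso_id: score_isoform(seq, others, columns)
        for iso_id, seq in aligned[gene].items()
    }
    best = max(sorted(scores), key=scores.get)
    tags = {
        iso_id: "REPRESENTATIVE" if iso_id == best else "ISOFORMIN"
        for iso_id in scores
    }
    return best, tags


aligned = {
    "g1": {"g1.a": "ACDE", "g1.b": "AC-E"},
    "g2": {"g2.a": "ACDE"},
    "g3": {"g3.a": "ACDF", "g3.b": "ACDE"},
}
best, tags = pick_representative("g1", aligned, columns=[0, 1, 2, 3])
```

In this toy family, g1.a matches both other genes at all four positions (score 8), while g1.b loses the gapped position (score 6), so g1.a is kept and g1.b is tagged ISOFORMIN.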

  45. Tree calculation

  46. Tree calculation [Figure: alignment of sequences a to g, with three rows tagged ISOFORMIN and one row tagged ISOFORMEX]

  47. Tree calculation [Figure: alignment of sequences a to g, with three rows tagged ISOFORMIN and one row tagged ISOFORMEX]

  48. Tree calculation • Gblocks • PhyML, FastTree [Figure: alignment of sequences a to g with ISOFORMIN and ISOFORMEX rows, and the resulting tree]

  49. Building • Compare all proteins against each other (BLAST) • Cluster homologous sequences into families (SiLiX + HiFiX) • Compute multiple alignments for each family • Compute phylogenetic trees for each family • Annotate phylogenetic trees (gene duplications, losses, transfers)

  50. Annotate phylogenetic trees • Several methods are currently being developed in the ANCESTROM project: Speciation, Duplication and Loss; Speciation, Duplication, Transfer and Loss • See Vincent Daubin's talk tomorrow
