Identification of Ortholog Groups by OrthoMCL

OrthoMCL workflow: protein sequences from the organisms of interest are compared by all-against-all BLASTP with a similarity cutoff (P-value, % overlap). Between species, reciprocal best similarity pairs are putative orthologs; within species, reciprocal better similarity pairs are (recent) paralogs.

reedv

Presentation Transcript


  1. Identification of Ortholog Groups by OrthoMCL
     • Input: protein sequences from the organisms of interest
     • All-against-all BLASTP
     • Similarity cutoff: P-value, % overlap
     • Between species: reciprocal best similarity pairs (putative orthologs)
     • Within species: reciprocal better similarity pairs ((recent) paralogs)

  2. Similarity matrix, then Markov clustering
     • Cluster tightness: inflation value (I)
     • Result: ortholog groups with (recent) paralogs

  3. Similarity Matrix (entries are similarity scores)

            A1    A2    B1    B2
     A1      ─   200   150     0
     A2    200     ─     0     0
     B1    150     0     ─   220
     B2      0     0   220     ─

     Species A: A1, A2. Species B: B1, B2.
     Graph: A2 and A1 (200, paralog), A1 and B1 (150, ortholog), B1 and B2 (220, paralog).
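The pair rules from slide 1 can be sketched on the example similarity matrix above. This is a minimal illustration, not the OrthoMCL implementation: it assumes that "reciprocal best" means each sequence is the other's top-scoring hit in the other species, and that "reciprocal better" means two sequences from one species score higher with each other than either does with any sequence from another species. The helper names (`score`, `best_score`, `classify_pairs`) are hypothetical.

```python
from itertools import combinations

# Similarity scores from the example matrix (symmetric; missing pairs score 0).
SCORES = {("A1", "A2"): 200, ("A1", "B1"): 150, ("B1", "B2"): 220}
SPECIES = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}


def score(x, y):
    """Look up a symmetric similarity score; unlisted pairs score 0."""
    return SCORES.get((x, y), SCORES.get((y, x), 0))


def best_score(x, keep_species):
    """Best score between x and any other sequence whose species passes the filter."""
    return max(
        (score(x, y) for y in SPECIES if y != x and keep_species(SPECIES[y])),
        default=0,
    )


def classify_pairs():
    """Return (putative orthologs, recent paralogs) under the assumed rules."""
    orthologs, paralogs = [], []
    for x, y in combinations(sorted(SPECIES), 2):
        s = score(x, y)
        if s == 0:
            continue  # no similarity above the cutoff
        sx, sy = SPECIES[x], SPECIES[y]
        if sx != sy:
            # Between species: each must be the other's best hit in that species.
            if s == best_score(x, lambda sp: sp == sy) and s == best_score(
                y, lambda sp: sp == sx
            ):
                orthologs.append((x, y))
        else:
            # Within species: the pair must beat every cross-species hit of both.
            if s > best_score(x, lambda sp: sp != sx) and s > best_score(
                y, lambda sp: sp != sy
            ):
                paralogs.append((x, y))
    return orthologs, paralogs


if __name__ == "__main__":
    print(classify_pairs())
    # -> ([('A1', 'B1')], [('A1', 'A2'), ('B1', 'B2')])
```

The result matches the labels on the slide: one ortholog edge (A1, B1) and one paralog edge within each species.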

  4. Markov Clustering (MCL) Algorithm
     • Similarity matrix
     • Markov matrix (transition probability matrix)
     • Matrix inflation (entry powering)
     • Matrix expansion (matrix powering)
     • Terminate when no further change
     • Final matrix as clustering
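The MCL loop listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the MCL program itself: self-loops are set to the largest score in each column (an assumption; the slides do not say how the diagonal is filled), expansion is a single matrix squaring, and clusters are read from the rows of the final matrix that still hold non-zero entries. All function names are hypothetical.

```python
def normalize_columns(m):
    """Scale each column to sum to 1, giving a transition probability matrix."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: matrix powering (here, squaring)."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(m, inflation):
    """Inflation: raise every entry to the power I, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def mcl(similarity, inflation=1.1, max_iter=200, tol=1e-9):
    """Alternate expansion and inflation until the matrix stops changing."""
    n = len(similarity)  # assumes at least two sequences
    m = [list(row) for row in similarity]
    for j in range(n):
        # Assumed self-loop weight: the largest score in the column (1.0 if none).
        m[j][j] = max(similarity[i][j] for i in range(n) if i != j) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    return m


def clusters(m, eps=1e-6):
    """Read clusters from the final matrix: each non-empty row is one cluster."""
    found = []
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members and members not in found:
            found.append(members)
    return found


if __name__ == "__main__":
    names = ["A1", "A2", "B1", "B2"]
    # Example similarity matrix from slide 3, with 0 on the diagonal.
    sim = [[0, 200, 150, 0], [200, 0, 0, 0], [150, 0, 0, 220], [0, 0, 220, 0]]
    for inflation in (1.1, 2.0, 4.0):
        groups = [sorted(names[j] for j in c) for c in clusters(mcl(sim, inflation))]
        print(inflation, groups)
```

Running the script with several inflation values shows how the parameter I affects the grouping of the four example sequences; larger values favor tighter clusters.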

  5. Application of OrthoMCL to Plasmodium, human and other model organisms
     Genomes: Plasmodium falciparum, human, Arabidopsis, worm, fly, yeast, E. coli …
     6241 ortholog groups, among them:
     • 1182 only Metazoa
     • 551 only Eukaryotes
     • 160 all included
     • 114 Plasmodium, not human
     • 24 only Plasmodium & Arabidopsis

  6. An Example of Gamma-tubulin Ortholog Group

  7. Comparing OrthoMCL with INPARANOID (two species)
     • INPARANOID clusters both orthologs and in-paralogs from two species by pairwise similarity
     • Find two-way best hits from the pairwise similarity scores; each such hit is a main ortholog pair
     • For each main ortholog, add further orthologs (in-paralogs) from the same species by comparing the score between the main ortholog and each putative in-paralog with the score of the main ortholog pair
     • Resolve overlapping groups by merging, deleting or dividing them according to a set of rules
     • OrthoMCL can cluster orthologs and in-paralogs from multiple species

  8. I. Yeast – Worm dataset (estimation)
     Input: yeast, 6358 proteins; worm, 19774 proteins
     • OrthoMCL (I = ?): 4428 proteins (yeast 2158, worm 2270)
     • INPARANOID: 4985 proteins (yeast 2283, worm 2702)
     • 3931 proteins are the same from both methods
     • 1805 groups; ? (paralog groups?); ? coherent grouping

  9. Coherent groups = same groups + contained groups
     Contained groups: an OrthoMCL group contained in an INPARANOID group, or an INPARANOID group contained in an OrthoMCL group
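The definition of coherent groups can be written as a small set comparison. This is a sketch under assumptions: groups are compared as plain sets of protein names, "contained" is taken to mean a proper subset in either direction, and the label "incoherent" and both function names are hypothetical. The slides do not state how the reported percentages were computed, so `coherent_fraction` is only one plausible reading. The example groups are the heat shock proteins from slide 16.

```python
def classify_group(group, other_groups):
    """Label one group as 'same', 'contained' or 'incoherent' against another method."""
    g = frozenset(group)
    others = [frozenset(o) for o in other_groups]
    if g in others:
        return "same"
    # Proper subset in either direction counts as contained (assumption).
    if any(g < o or o < g for o in others):
        return "contained"
    return "incoherent"


def coherent_fraction(groups, other_groups):
    """Fraction of groups that are same or contained (assumed definition)."""
    labels = [classify_group(g, other_groups) for g in groups]
    return sum(label != "incoherent" for label in labels) / len(labels)


if __name__ == "__main__":
    ego = {"Hsp-1", "Hsc70-4", "SSA2"}
    orthomcl = {"Hsp-1", "Hsc70-1", "Hsc70-4", "SSA1", "SSA2", "SSA3", "SSA4"}
    print(classify_group(ego, [orthomcl]))  # -> contained
```

Here the EGO group is a proper subset of the OrthoMCL group, so it counts as a contained, and therefore coherent, group.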

  10. Inflation value (I) regulates cluster tightness
      [Figure: results for a range of inflation values, axis labelled tight / loose. * Percentage of the 3931 sequences identified by both OrthoMCL and INPARANOID]
      So, choose I = 1.1 as the optimal inflation value

  11. Possible reasons for including different sequences

  12. Default parameters
      • Similarity cutoff: P-value < 1e-5, overlap > 50%
      • Cluster tightness: inflation value I = 1.1
      Yeast – Worm result (yeast: 6358 proteins; worm: 19774 proteins)
      • OrthoMCL (I = 1.1): 3949 proteins (yeast 1927, worm 2022)
      • INPARANOID: 4985 proteins (yeast 2283, worm 2702)
      • 3765 proteins are the same from both methods
      • 1805 groups; 1614 groups
      • 86.3% same groups; 98.1% coherent groups
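The default similarity cutoff can be expressed as a simple filter over BLASTP hits. This is a hedged sketch: the `Hit` record, its field names and the example hits are hypothetical, and the slides do not specify how the overlap percentage is measured, so it is treated here as a precomputed number.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise BLASTP hit (hypothetical record layout)."""
    query: str
    subject: str
    p_value: float
    overlap_pct: float  # how overlap is measured is not stated in the slides


def passes_cutoff(hit, max_p=1e-5, min_overlap=50.0):
    """Keep hits with P-value < 1e-5 and overlap > 50% (the default parameters)."""
    return hit.p_value < max_p and hit.overlap_pct > min_overlap


if __name__ == "__main__":
    hits = [
        Hit("A1", "B1", 1e-40, 85.0),  # strong, long match: kept
        Hit("A1", "B2", 1e-3, 90.0),   # P-value too high: dropped
        Hit("A2", "B1", 1e-20, 30.0),  # overlap too short: dropped
    ]
    print([h for h in hits if passes_cutoff(h)])
```

Only hits that pass both thresholds would contribute scores to the similarity matrix.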

  13. II. Worm – Fly dataset (test)
      Input: worm, 19774 proteins; fly, 13288 proteins
      • OrthoMCL (I = 1.1): 9623 proteins (worm 4997, fly 4626)
      • INPARANOID: 10100 proteins (worm 5399, fly 4761)
      • 8856 proteins are the same from both methods
      • 3988 groups; 3764 groups
      • 86% same groups; 98% coherent groups
      In conclusion: OrthoMCL and INPARANOID show similar clustering behavior when comparing two species

  14. Comparison of OrthoMCL with EGO (multiple species)
      III. Yeast – Worm – Fly dataset
      • EGO: TC/NP sequences (10260 seqs); BLASTP to protein sequences (4776 proteins); remove redundancy: 4776 unique proteins formed 3125 unique groups
      • OrthoMCL: 12459 proteins formed 4033 groups

  15. Yeast – Worm – Fly comparison results
      • 4392 same proteins from both
      • 44.2% same groups
      • 2.3% OrthoMCL contained in EGO
      • 62% EGO contained in OrthoMCL
      • 93.8% coherent groups

  16. An Example: EGO Groups contained by OrthoMCL Groups
      Worm: Hsp-1. Fly: Hsc70-1, Hsc70-4. Yeast: SSA1, SSA2, SSA3, SSA4.
      • EGO: Hsp-1, Hsc70-4, SSA2
      • OrthoMCL: Hsp-1, Hsc70-1, Hsc70-4, SSA1, SSA2, SSA3, SSA4

  17. Back to Apicomplexa …
      5333 proteins:
      • 1846 orthologous to the other 6 organisms
      • 483 orthologous to E. coli
      • 1421 orthologous to yeast
      • 1771 orthologous to fly, worm or human
      • 1693 orthologous to Arabidopsis
      • 1824 non-orthologous to human

  18. Summary
      • OrthoMCL automatically delineates the many-to-many orthologous relationships across multiple eukaryotic genomes
      • When applied to a pairwise comparison of two species, the performance of OrthoMCL is comparable to that of INPARANOID, which was designed for comparing two species
      • When applied to multiple species and compared with the EGO database, OrthoMCL tends to identify more orthologous genes
      • The underlying object-based relational storage model permits integration with organismal data, and queries based on a user-defined species distribution provide a snapshot of shared/diversified biological processes across species

  19. Related Posters and References
      • 114A. Web-Based Biological Discovery using an Integrated Database.
      • 146A. The Genomics Unified Schema (GUS).
      • 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
      • Remm et al. Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. J. Mol. Biol. (2001) 314.
      • Lee et al. Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. (2002) 12.
      • Enright et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30.
