Genome Analysis to Select Targets which Probe Fold and Function Space

Genome Analysis to Select Targets which Probe Fold and Function Space. How many protein superfamilies and families can we identify in the proteomes? How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families?

Presentation Transcript


  1. Genome Analysis to Select Targets which Probe Fold and Function Space
     • How many protein superfamilies and families can we identify in the proteomes?
     • How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families?
     • Targeting Universal Recurrent Superfamilies (SCOP/CATH/Pfam) to optimise coverage of fold and function space
     Midwest Consortium: Russell Marsden, Alastair Grant, David Lee, Annabel Todd, Janet Thornton, Andrzej Joachim
     MCSG Site Visit, Argonne, January 30, 2003

  2. Protein Families in Complete Genomes with Structural/Functional Annotations
     Gene3D: Buchan, Thornton, Orengo, Genome Research (2002)
     • 800,000 protein sequences from 120 completed genomes
     • 14 eukaryotic genomes, including human, mouse, rat, plant, fly, worm and fugu
     • 92 bacterial genomes
     • 14 archaeal genomes

  4. Clustering Sequences into Protein Superfamilies of Known Domain Composition
     PFscape - Protein Family Landscape
     • BLAST all the sequences from the 120 completed genomes against each other and cluster them into protein families
     • For each sequence, identify CATH and Pfam domains
     TRIBE-MCL - Markov clustering: Enright & Ouzounis, Genome Research, 2002
     SAM-T99 - sequence mapping of CATH & Pfam: Karplus et al., NAR, 2000
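The clustering step on this slide can be illustrated with a small sketch. It is a simplified stand-in, not the PFscape pipeline: it links sequences by single-linkage over pairwise hits instead of running TRIBE-MCL's Markov clustering, and the function name, the hit tuples and the 1e-5 E-value cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_hits(sequence_ids, hits, max_evalue=1e-5):
    """Group sequences into families by single-linkage over pairwise hits.

    hits is an iterable of (query_id, subject_id, evalue) tuples, for example
    parsed from an all-against-all BLAST run. The default cutoff is an
    assumed value, not one taken from the slides.
    """
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(x):
        # Walk up to the root of x's cluster, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    clusters = defaultdict(set)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].add(seq_id)
    # Largest clusters first; clusters of size 1 are the singletons.
    return sorted(clusters.values(), key=len, reverse=True)
```

Calling `cluster_by_hits(ids, hits)` returns a list of sets; those with two or more members play the role of the gene superfamilies counted on the next slide, and the rest are singletons. Single-linkage tends to over-merge multi-domain proteins, which is one reason a Markov clustering method may be preferred in practice.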

  5. Clustering ~800,000 genes from 120 complete genomes
     [Diagram: PFscape clusters the genes into Gene Superfamilies 1 to 4]
     ~50,000 gene superfamilies of 2 or more sequences, 150,000 singletons

  6. Mapping CATH and Pfam Domains onto Genome Sequences
     • Library of HMMs built for representative sequences from each CATH and Pfam domain superfamily
     [Diagram: protein sequences from genomes are scanned against the CATH & Pfam SAM-T99 HMM library to assign domains to CATH and Pfam superfamilies]

  7. Performance of Sequence Mapping Method
     Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99
     [Plot: percentage of homologues found against error rate for the 1D-HMM (SAM-T99) method]
     Library of 1D-HMM models detects ~80% of remote homologues

  8. Use HMMs to annotate Gene Superfamilies with CATH and Pfam domains
     [Diagram: Gene Superfamilies 1 to 4 labelled as CATH, Pfam or NewFam]
     50,000 Gene Superfamilies

  9. Merge superfamilies with the same domain combinations
     [Diagram: Gene Superfamilies 1, 2 and 3 being merged]
     Gene3D: 50,000 -> 36,000 Superfamilies
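The merge on this slide amounts to grouping gene superfamilies by their domain annotation. The sketch below is one possible reading, with assumptions stated: a "domain combination" is taken to be the ordered tuple of domain identifiers, superfamilies with no annotated domain are never merged, and the function name and identifiers such as `domA` are hypothetical.

```python
from collections import defaultdict


def merge_by_domain_combination(superfamily_domains):
    """Merge gene superfamilies that share an identical domain combination.

    superfamily_domains maps a superfamily name to a sequence of domain
    identifiers in N- to C-terminal order (treating order as significant is
    an assumption). Returns a list of merged groups of superfamily names.
    """
    groups = defaultdict(list)
    unannotated = []
    for name, domains in superfamily_domains.items():
        if domains:
            groups[tuple(domains)].append(name)
        else:
            # No domain evidence, so there is nothing to justify a merge.
            unannotated.append(name)
    return list(groups.values()) + [[name] for name in unannotated]
```

For instance, two superfamilies both annotated `("domA", "domB")` end up in one group, while one annotated `("domB", "domA")` stays separate under the ordered-tuple assumption.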

  10. Superfamilies Further Classified into Families
      • Multi-linkage clustering: relatives in each sequence family have 35% or more sequence identity
      [Diagram: a Superfamily divided into Families (35% ID)]
      • For good homology models, one structure is needed for each family within a superfamily
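A rough sketch of the family split follows. It assumes that "multi-linkage" means every pair of relatives within a family must reach the 35% threshold, and it uses a simple greedy pass, so the result depends on input order; neither detail is confirmed by the slides. The function names and the layout of the identity table are hypothetical.

```python
def pair_identity(pairwise_identity, a, b):
    """Look up percent identity for an unordered pair; missing pairs count as 0."""
    return pairwise_identity.get((a, b), pairwise_identity.get((b, a), 0.0))


def split_into_families(members, pairwise_identity, min_identity=35.0):
    """Greedily partition a superfamily into families.

    A sequence joins the first existing family in which it has at least
    min_identity percent identity to every current member; otherwise it
    starts a new family.
    """
    families = []
    for seq in members:
        for family in families:
            if all(
                pair_identity(pairwise_identity, seq, other) >= min_identity
                for other in family
            ):
                family.append(seq)
                break
        else:
            families.append([seq])
    return families
```

Under the slide's rule of one structure per family, `len(split_into_families(...))` gives the number of structures needed for the superfamily.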

  11. Percentage of Sequence Families with and without Close Structural Homologues (>35% identity)
      [Bar chart: percentage of families with no close PDB homologue, for CATH, Pfam and NewFam]
      Number of domain superfamilies and families with no close structural homologue:
      CATH (1400) + Pfam (4100) + NewFam (46,384) = 51,844 Superfamilies
      CATH (60,360) + Pfam (53,907) + NewFam (56,973) = 171,240 Families

  12. Preferentially Target Largest Superfamilies
      [Plots for CATH, Pfam and NewFam: number of superfamilies containing a given number of non-identical relatives, as a percentage of the total, against number of non-identical relatives]
      Fitted power-laws (with gradients): CATH (-0.4), Pfam (-1.0), NewFam (-1.9)
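A power-law gradient of the kind quoted on this slide can be estimated as the slope of a straight line on log-log axes. The sketch below uses ordinary least squares on log10 values; whether the original fits used this procedure, or any binning, is not stated in the slides, and the function name is hypothetical.

```python
import math


def fit_power_law_gradient(sizes, counts):
    """Return the least-squares slope of log10(count) against log10(size).

    sizes are numbers of non-identical relatives; counts are the numbers
    (or percentages) of superfamilies with that many relatives. Points with
    a non-positive value are skipped because their logarithm is undefined.
    """
    points = [
        (math.log10(s), math.log10(c))
        for s, c in zip(sizes, counts)
        if s > 0 and c > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all sizes are identical; slope is undefined")
    return sxy / sxx
```

A steeper (more negative) gradient means that large superfamilies are rarer relative to small ones, which matches the ordering CATH, Pfam, NewFam on the slide.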

  13. Proteome Coverage by Superfamilies
      [Plot: percentage of proteomes (number of non-identical proteins in 120 completed genomes) against superfamilies ordered by size]
      ~70% of proteomes are contained in the < 2500 largest CATH + Pfam + NewFam target superfamilies
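The coverage curve on this slide is a cumulative sum over superfamilies ranked by size. A minimal sketch, assuming each protein is counted in exactly one superfamily (multi-domain proteins would complicate this) and using hypothetical function names:

```python
def coverage_curve(superfamily_sizes):
    """Cumulative fraction of proteins covered, taking largest superfamilies first."""
    total = sum(superfamily_sizes)
    if total <= 0:
        raise ValueError("superfamily sizes must sum to a positive number")
    curve = []
    running = 0
    for size in sorted(superfamily_sizes, reverse=True):
        running += size
        curve.append(running / total)
    return curve


def targets_needed(superfamily_sizes, fraction):
    """Smallest number of largest-first superfamilies reaching the given coverage.

    Returns None if the requested fraction cannot be reached.
    """
    for rank, covered in enumerate(coverage_curve(superfamily_sizes), start=1):
        if covered >= fraction:
            return rank
    return None
```

With real cluster sizes, `targets_needed(sizes, 0.70)` would give the count that the slide reports as fewer than 2500 superfamilies.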

  14. Proteome Coverage by Superfamilies
      [Plot: percentage of proteomes (120 completed genomes) against superfamilies ordered by size, with separate curves for CATH (superfamilies of known fold), Pfam and NewFam]

  15. What Fraction of the Proteomes is covered by Bacterial Family Targets?
      [Plot: percentage of proteomes (120 completed genomes) against number of target families (0 to 200,000), with curves for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes]
      ~100,000 prokaryotic targets cover nearly 60% of proteomes

  16. How many family targets cover a significant proportion of the eukaryotes and/or prokaryotes?
      [Plot: percentage of kingdom proteomes (120 completed genomes) against number of target families, with curves for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes; 25,000, 30,000 and 45,000 are marked on the target axis]
      25,000 - 45,000 family targets cover 70% of proteomes (< 2500 largest superfamily targets)

  17. Target Selection Strategy
      • The largest < 2500 superfamily targets give 70% of proteomes
      • This corresponds to 25,000 - 45,000 family targets
      • Accurate homology models are not needed for all families
      • Target families of biological interest or containing human homologues with disease association
      • Target families from functionally diverse superfamilies to understand how changes in the structure can modify function
      • For example, Universal, Highly Recurrent Superfamilies are an interesting biological subset with diverse functions

  18. Universal CATH Domain Superfamilies
      [Plot: proportion of CATH domain annotations (0 to 100) for 30 representative eukaryotic and prokaryotic organisms]
      ~60-70% of CATH domain annotations within each organism are from < 200 CATH universal superfamilies common to all kingdoms of life, some of which are very extensively duplicated
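The idea of a universal superfamily can be expressed as a set intersection. The sketch below assumes that "universal" means present in at least one organism of every kingdom; the slides do not say whether presence in every organism is required. The function names, the input layout and identifiers such as `sfA` are hypothetical.

```python
def universal_superfamilies(annotations_by_kingdom):
    """Superfamilies found in every kingdom.

    annotations_by_kingdom maps kingdom -> {organism -> list of superfamily
    identifiers, one entry per domain annotation}.
    """
    per_kingdom = []
    for organisms in annotations_by_kingdom.values():
        seen = set()
        for domains in organisms.values():
            seen.update(domains)
        per_kingdom.append(seen)
    return set.intersection(*per_kingdom) if per_kingdom else set()


def universal_fraction(domains, universal):
    """Fraction of one organism's domain annotations that fall in universal superfamilies."""
    if not domains:
        return 0.0
    return sum(1 for d in domains if d in universal) / len(domains)
```

Applying `universal_fraction` to each organism's annotation list would give the per-organism proportions plotted on this slide.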

  19. Domain Recurrences in the Genomes
      [Plot: number of superfamilies against occurrences; the labels 730 and 570 appear on the plot, and the Highly Recurrent, Extensively Duplicated Superfamilies are highlighted]

  20. 56 Universal and Highly Recurrent Superfamilies
      [Chart: COG functional annotation (25 functional categories; letter codes U O S R Z Y V W T N M D L A J B K Q P I H G F E C), grouped as Metabolism; Information storage and processing; Cellular processes and signalling; Poorly characterised]
      Analysis in bacterial genomes showed that 56 Universal Superfamilies recurred in proportion to the genome size and accounted for 45% of the CATH domain annotations
      Categories highlighted: E (Amino acid metabolism), J (Translation and protein biosynthesis), K (Transcription), T (Signal Transduction)
      15,000 bacterial family targets

  21. In Functionally Diverse Superfamilies Select More Targets
      • Select the relative with the most neighbours for which a homology model can be built or function assigned
      • For >95% confidence when inheriting functional properties, homologues should have at least 60% identity (Todd, Valencia, Rost)
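Choosing the relative with the most neighbours can be sketched as a count over a pairwise identity table. The 60% default below is the function-inheritance threshold from this slide; for homology modelling the 35% family threshold from slide 10 could be passed instead. The function name and the table layout are assumptions made for illustration, and ties are broken by input order.

```python
def pick_representative(members, pairwise_identity, min_identity=60.0):
    """Return (member, neighbour_count) for the member with the most neighbours.

    A neighbour is any other member with at least min_identity percent
    sequence identity. Pairs missing from the table count as 0% identity.
    """
    def ident(a, b):
        return pairwise_identity.get((a, b), pairwise_identity.get((b, a), 0.0))

    best, best_count = None, -1
    for candidate in members:
        count = sum(
            1
            for other in members
            if other != candidate and ident(candidate, other) >= min_identity
        )
        if count > best_count:  # strict comparison keeps the first of any tie
            best, best_count = candidate, count
    return best, best_count
```

In a functionally diverse superfamily the chosen member's neighbours can be removed and the selection repeated, so that more targets are picked until the remaining relatives are covered.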

  22. Representative Structures for Superfamilies will help identify Functional Families
      [Diagram: a Superfamily divided into functional clusters S60_1 to S60_5]
      Functional clusters identified by sequence conservation; annotations (GO, KEGG, Pfam, EC, COGs, SWISS-PROT) stored in Gene3D

  23. Target Selection Strategy
      • Targeting the 2500 largest superfamilies will cover a significant proportion (70%) of the proteomes
      • For good homology models, between 25,000 and 45,000 family targets are needed
      • Preferentially select targets from medically important and/or structurally and functionally diverse superfamilies
      • For example, targeting Universal and Recurrent superfamilies which exhibit significant structural and functional divergence will help to improve function prediction methods
