Theoretical methods for predicting gene function II. predicting protein domains

Theoretical methods for predicting gene function II. predicting protein domains and their function from sequence analysis S. Wodak, ULB Inter-university DEA/DES in Bioinformatics

The main steps
[3.1] Predict domains
[3.2] Predict function of individual domains
[Figure: a protein divided into domains, each assigned to a family with its function(s): Family G, Funct(s) Y; Family A, Funct(s) X; Family M, Funct(s) Z; Family F, Funct(s) W]

Domain analysis
Proteins tend to be modular, i.e. built from domains. A first step in functional prediction/annotation can be a scan for known domains in a newly sequenced protein.
Scan databases of ‘fingerprints’ of classified domains:
- PROSITE (Bairoch et al., 1997): consensus sequence strings for more than 1000 domains
- PROFILESCAN:
- BLOCKS (Henikoff et al., 1998): ungapped alignments and pattern matching
- PRINTS (Attwood et al., 1998): a set of multiple sequence motifs separated along the sequence
- PFAM (Bateman et al., ): HMMs from multiple alignments
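As a rough illustration of what scanning a sequence against a PROSITE-style consensus string involves, the sketch below translates a pattern into a regular expression and reports where it matches. The function names, the toy sequence and the pattern itself are assumptions made for this example (the pattern is merely written in PROSITE syntax and is not quoted from the database), and only a subset of the syntax is handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only: single residues, x (any residue), [ABC] (allowed set),
    {ABC} (forbidden set) and repeat counts such as x(2) or x(2,4).
    """
    tokens = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core "x" and count "2,4".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            token = "."                      # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # forbidden residues
        else:
            token = core                     # literal residue or [allowed set]
        if count:
            token += "{" + count + "}"
        tokens.append(token)
    return "".join(tokens)

def scan(sequence, pattern):
    """Return (1-based start, matched substring) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# Illustrative pattern and toy sequence (both assumptions, not database entries).
PATTERN = "G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]"
TOY_SEQUENCE = "MKTGHEAAGIVESVGEGVTK"
print(prosite_to_regex(PATTERN))
print(scan(TOY_SEQUENCE, PATTERN))
```

Short patterns of this kind match by chance quite often, which is one reason profile and HMM based resources are generally more sensitive and specific than plain consensus strings.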

Example: The alcohol dehydrogenase domain (Demo) (PDB code 8ADH)
- CATH: http://www.biochem.ucl.ac.uk/bsm/cath_new/domains/8adh02.html
- PDBsum
- Swiss-Prot
- PROSITE: pattern associated with zinc binding/active site
- PFAM, PROSITE, etc.

Zinc binding constellation in carbonic anhydrase

Predicting function of individual domains based on sequence similarity
1- Intrinsic feature analysis
- compositional biases
- transmembrane regions (stretches of hydrophobic residues)
- coiled-coil segments (heptad repeats of polar/hydrophobic residues)
- Pro-rich, Glu-rich regions
If not eliminated first, these can lead to spurious hits, and thus to erroneous inference of function.
2- Sequence alignments
- Pairwise alignments (Blast, Fasta): >40% sequence identity
- Multiple alignments: <40% sequence identity; more sensitive
  - Psi-Blast
  - SAM-98 (HMM)/PFAM
Erroneous inference of function can still be made, because sequence similarity does not guarantee structural similarity.
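One intrinsic feature listed above, stretches of hydrophobic residues, can be flagged with a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the threshold of 1.6 are conventional choices assumed here rather than values taken from the slides, and the function name and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Return merged (start, end) segments, 1-based and inclusive, covered by
    windows whose mean hydropathy is at least `threshold`.
    Assumes `seq` contains only the 20 standard one-letter codes."""
    segments = []
    for i in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[i:i + window]) / window
        if mean >= threshold:
            start, end = i + 1, i + window
            if segments and start <= segments[-1][1] + 1:
                # Overlaps or touches the previous segment: extend it.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments

# Toy sequence (an assumption): polar flanks around a 20-residue Leu stretch.
toy = "DEKRDEKRDE" + "L" * 20 + "DEKRDEKRDE"
print(hydrophobic_segments(toy))
```

Segments flagged this way would typically be masked before a database search, so that a match driven only by hydrophobic composition is not mistaken for evidence of homology.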

Predicting function based on sequence alignments
>40% sequence identity
- Pairwise alignments (Blast, Fasta) can be used to ‘safely’ infer function for orthologs: close homologs, i.e. genes that evolved as a result of speciation (not duplication) and are likely to perform the same function in different species.
- Comparisons of the sequence tree and the species tree can help identify orthologs.
- Inferring function for non-orthologous homologs is much more error prone.
7/10 genes will have a homolog in the sequence DBs, and some fraction of those will have a known 3D structure.
<40% sequence identity
- But the structural and functional features of the homolog cannot be transferred without additional analysis.
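The 40% threshold above presupposes a definition of sequence identity. A minimal sketch is given below, assuming identity is counted over columns in which neither sequence has a gap (other conventions divide by the alignment length or by the shorter sequence length, and give different numbers). The function names and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned sequences ('-' marks a gap),
    counted over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def suggested_approach(identity, threshold=40.0):
    """Map an identity value onto the two regimes described in the slide."""
    if identity > threshold:
        return "pairwise alignment; function transfer plausible for orthologs"
    return "multiple-alignment methods; additional analysis needed"

# Toy alignment (an assumption, not real data).
pid = percent_identity("ACDE-FGH", "ACDQWFG-")
print(round(pid, 1), suggested_approach(pid))
```

Identity alone does not establish orthology; as noted above, comparing the sequence tree with the species tree is what helps separate orthologs from other homologs.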

Detection of remote homologs
- Multiple alignments:
  - Psi-Blast: Position-Specific Iterated Blast
  - HMM: Hidden Markov Models
- Other:
  - ISS: Intermediate sequence search
[Figure: sequences A, B and C illustrating the intermediate sequence search]
Sequence comparisons using multiple sequence alignments detect 3x as many homologs as pairwise alignments. Park et al. (1998) J. Mol. Biol. 284, 1201-1210
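The idea behind an intermediate sequence search can be sketched as a two-step lookup: if A matches B and B matches C, then C becomes a candidate remote homolog of A even when A and C do not match directly. The code below is a simplified illustration only; the hit table, sequence names and function name are hypothetical, and a real search would also have to check that both matches fall on the same region of the intermediate, which is omitted here.

```python
def intermediate_links(hits, query):
    """Return {target: set of intermediates} for sequences that are reachable
    from `query` through exactly one intermediate but are not direct hits.

    `hits` maps each sequence name to the set of names it matches in a
    pairwise search (assumed to be already filtered by significance)."""
    direct = hits.get(query, set())
    indirect = {}
    for intermediate in direct:
        for target in hits.get(intermediate, set()):
            if target != query and target not in direct:
                indirect.setdefault(target, set()).add(intermediate)
    return indirect

# Hypothetical pairwise hit table: A matches B, B matches C, A does not match C.
pairwise_hits = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}
print(intermediate_links(pairwise_hits, "A"))
```

Chaining matches in this way raises sensitivity, but it can also propagate a single false match, so the significance cut-off applied to each individual step matters.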

Sequence comparisons using multiple sequence alignments detect 3x as many homologs as pairwise alignments
Park et al. (1998) J. Mol. Biol. 284, 1201-1210

Databases:
- PDBD40-J: database of 935 sequences with ≤40% sequence identity and known evolutionary relationships, from SCOP
- NRDB90: database of 152,228 non-redundant sequences (<90% sequence identity) from other sequence DBs

% homologs recognised:
Method       error rate 1/100,000   error rate 1/1000
Gap-Blast    14                     19
Fasta        16                     23
Psi-Blast    27                     44
SAM-98       29                     50
ISS          24                     34
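A figure such as "% homologs recognised at a given error rate" can be computed from a ranked list of hits whose true relationships are known. The sketch below is one possible reading, assuming the error rate is the number of false positives tolerated per sequence comparison; that definition, the function name and the toy scores are all assumptions and are not taken from Park et al.

```python
def coverage_at_error_rate(scored_pairs, n_true_pairs, n_comparisons, error_rate):
    """Percentage of true homologous pairs recovered before the number of
    false positives exceeds error_rate * n_comparisons.

    `scored_pairs` is a list of (score, is_homolog) tuples, where a higher
    score means a more confident hit. Tied scores are not treated specially."""
    allowed_errors = error_rate * n_comparisons
    true_found = 0
    false_found = 0
    for score, is_homolog in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if is_homolog:
            true_found += 1
        else:
            false_found += 1
            if false_found > allowed_errors:
                break  # error budget exhausted: stop accepting hits
    return 100.0 * true_found / n_true_pairs

# Toy ranked hits (an assumption, not benchmark data).
toy_hits = [(9.0, True), (8.0, True), (7.0, False),
            (6.0, True), (5.0, False), (4.0, True)]
print(coverage_at_error_rate(toy_hits, n_true_pairs=4, n_comparisons=10, error_rate=0.1))
print(coverage_at_error_rate(toy_hits, n_true_pairs=4, n_comparisons=10, error_rate=0.0))
```

Allowing a looser error rate lets the walk continue further down the ranked list, which is why coverage can only stay the same or increase as the tolerated error rate grows.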

Structural proteomics: extending structure information to sequences
[Figure: New sequence + Library of known folds -> Assign known fold from library -> Function; Build detailed atomic model]

Detection of remote homologs across genomes Pfam... Slide incomplete
