

  1. Protein Family Classification Using AI Techniques (Profile-HMMs, SVM) 2003. 12.04 Susie Jo Bio-Information System Laboratory BioSystem Dept., KAIST

  2. Table of Contents • Introduction • Profile HMM • SVM-Fisher • SOM

  3. Protein homology detection • Proteins and DNA can be encoded as primary sequences (amino acid residues [20 types], nucleotides [A/G/C/T]) • Similar sequence implies similar function, so sequence similarity lets us derive the function of a functionally unknown sequence from a functionally annotated one

  4. Protein Classification – SCOP (Structural Classification of Proteins database) • SCOP hierarchy of protein domains, with varying degrees of similarity at each level • Fold: proteins belong to the same fold if they have the same major secondary structures in the same arrangement and with the same topological connections; proteins placed together in the same fold category may not have a common evolutionary origin • Superfamily: members may have lower sequence similarity than members of a family, but have structural and functional features that suggest a common evolutionary origin • Family (primary level): 1. proteins within a single family show clear evolutionary relationships; 2. they typically have more than 30% pairwise identity at the sequence level

  5. GPCR • The three major subfamilies include the receptors related to the "light receptor" rhodopsin and the β2-adrenergic receptor (family A): can be subdivided into six major subgroups; overall homology among all type A receptors is low, but a few key residues are highly conserved, e.g. the Asp-Arg-Tyr (DRY) motif • Receptors related to the glucagon receptor (family B) • Receptors related to the metabotropic neurotransmitter receptors (family C) • Yeast pheromone receptors (families D, E) • cAMP receptors (family F)

  6. GPCR: 3 major subfamilies [figure: low sequence homology among the subfamilies]

  7. Protein remote homology detection – Methods • 1. Pairwise similarities between proteins • Simple sequence similarity, using Smith-Waterman dynamic programming • BLAST, FASTA • (+) simple, easy • (−) low accuracy • Example: urotensin is very similar to somatostatin 4, yet the two actually have different ligands

  8. Protein remote homology detection – Methods • 2. Profiles and hidden Markov models (HMMs) • Profile-based methods iteratively collect homologous sequences from a large database and incorporate the resulting statistics into a single model • PSI-BLAST and SAM-T98 • 3. SVM-Fisher method (Jaakkola et al., 1999, 2000) • Couples an iterative HMM training scheme with an SVM

  9. Profile HMM

  10. Motivation • Objective: given a family of related sequences, what is an effective way to capture what they have in common, so that we can recognize other members of the family? • Some standard methods for characterization: multiple alignments, profiles, regular expressions, consensus sequences, hidden Markov models

  11. Using a Family Profile • Use an MSA of the family, i.e. identify the most highly conserved regions, e.g. RWDAGCVN / RWDSGCVN / RWHHGCVQ / RWKGACYN / RWLWACEQ • Single motif: regular expression (PROSITE, eMOTIF), e.g. R-W-X(2)-[AG]-C-X-[NQ] • Multiple motifs: frequency matrix, weight matrix (PRINTS, Blocks) • Whole domain: profile, HMM (Pfam) – encodes information including gaps (A. Gaulton & T. K. Attwood, Bioinformatics approach for the classification of GPCRs, Current Opinion in Pharmacology 2003, 3:114–120)

  12. Methods of characterizing a family of nucleotide sequences • 1. Regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC] • But a regular expression cannot distinguish between the highly implausible T G C T - - A G G and the consensus A C A C - - A T C (see the sketch below) • 2. Consensus sequence: A C A C - - A T C • It is unclear what the consensus means • We need some kind of similarity table between nucleotides to measure the probability of a sequence
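A minimal sketch (not from the slides) of the first point: the pattern [AT][CG][AC][ACTG]*A[TG][GC] translated into a Python regular expression and applied to the slide's consensus and implausible sequences with gaps removed. Both match, which is exactly the weakness the slide points out.

```python
import re

# The slide's pattern [AT][CG][AC][ACTG]*A[TG][GC] as a Python regex
pattern = re.compile(r"^[AT][CG][AC][ACTG]*A[TG][GC]$")

# Consensus ACAC--ATC and implausible TGCT--AGG, gaps removed
for seq in ["ACACATC", "TGCTAGG"]:
    print(seq, bool(pattern.match(seq)))   # both match: the regex cannot tell them apart
```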

  13. HMM • A model that generates a "sequence" • A symbol sequence (the observations) is generated by moving through states • The state sequence itself is hidden • Components: states, symbol emission probabilities, state transition probabilities • Hidden state sequence S, observed symbol sequence X, joint probability P(X, S | HMM) (Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755–763)

  14. HMM (example: gene sequence) [figure: a chain of match states M M M I M M M with an insertion state I, annotated with emission and transition probabilities]

  15. Deriving an HMM & Scoring • Deriving the HMM from a known alignment: each column in the alignment generates a state; count the occurrences of [ATGC] in each column to determine the emission probabilities of each state; transition probabilities to insertion states are obtained in a similar way (with some caution…) • Scoring: P(ACACAGC) = 0.8·1·0.8·1·0.8·0.6·0.4·0.6·1·1·0.2·1·0.8 ≈ 0.012 • Log-odds score = log[P(S)/0.25^L] ≈ 5.3 (see the sketch below)
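A minimal sketch of this counting-and-scoring idea, assuming a toy ungapped alignment (the sequences below are hypothetical, not the slide's) and a uniform 0.25 null model; a real profile HMM would also model insert/delete states and use pseudocounts (slide 25).

```python
import math

# Estimate per-column emission probabilities from an ungapped toy alignment
alignment = ["ACACATC", "ACACATC", "TCAATGC", "ACACAGC"]  # hypothetical data
alphabet = "ACGT"

length = len(alignment[0])
emissions = []
for col in range(length):
    counts = {a: 0 for a in alphabet}
    for seq in alignment:
        counts[seq[col]] += 1
    total = sum(counts.values())
    emissions.append({a: counts[a] / total for a in alphabet})

def log_odds(seq, emissions, null=0.25):
    """log [ P(seq | model) / null^L ], natural log as on the slide."""
    score = 0.0
    for col, residue in enumerate(seq):
        p = emissions[col][residue]
        if p == 0.0:
            return float("-inf")  # unseen residue; pseudocounts (slide 25) fix this
        score += math.log(p / null)
    return score

print(log_odds("ACACATC", emissions))
```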

  16. Probability & Log-odds • Probability: depends on the sequence length L • Penalizes insertions and favors deletions • Log-odds is computed against a null model • The null model treats the sequence of nucleotides as random • A better estimate: use the overall frequency of nucleotides (or amino acids) in the organism's genome

  17. Profile HMM • HMM architecture for representing profiles of multiple sequence alignments • Linear left-right model • Match states • Insert states • Delete states [figure: a three-column alignment (CAF, CGW, CDY, CVF, CKY) mapped onto match states 1, 2, 3] (Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755–763)

  18. Elements of Profile HMMs • N – the number of hidden states • Q – the set of states, Q = {1, 2, …, N} • M – the number of symbols • V – the set of symbols, V = {1, 2, …, M} • A – the state-transition probability matrix • B – the observation (emission) probability distribution • π – the initial state distribution • λ = (A, B, π) – the entire model (see the sketch below) (Visual Recognition Tutorial; T. Starner & A. Pentland, Visual Recognition of American Sign Language Using HMMs, International Workshop on Automatic Face and Gesture Recognition, 1995, pp. 189–194)
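A small illustrative container for this notation (an assumption, not code from the talk), using a generic discrete HMM rather than the full profile-HMM topology; the toy numbers are arbitrary.

```python
import numpy as np

# Generic discrete HMM lambda = (A, B, pi) with N hidden states and M symbols
class HMM:
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # N x N state-transition matrix
        self.B = np.asarray(B, dtype=float)    # N x M symbol-emission matrix
        self.pi = np.asarray(pi, dtype=float)  # initial state distribution
        self.N, self.M = self.B.shape

# Toy two-state, two-symbol model with arbitrary numbers
model = HMM(A=[[0.9, 0.1], [0.2, 0.8]],
            B=[[0.7, 0.3], [0.1, 0.9]],
            pi=[0.5, 0.5])
print(model.N, model.M)
```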

  19. Three Basic Problems • EVALUATION – given an observation sequence O = (o1, o2, …, oT) and a model λ, efficiently compute P(O | λ) • Hidden states complicate the evaluation • Given two models λ1 and λ2, this can be used to choose the better one • DECODING – given an observation O = (o1, o2, …, oT) and a model λ, find the optimal state sequence q = (q1, q2, …, qT) • The optimality criterion has to be decided (e.g. maximum likelihood) • An "explanation" of the data • LEARNING – given O = (o1, o2, …, oT), estimate the model parameters λ that maximize P(O | λ)

  20. Solution to Problem 1 • 1. Forward algorithm • Define the forward variable α_t(i) = P(o1 o2 … ot, qt = i | λ): the probability of observing the partial sequence o1 … ot such that the state at time t is i • Initialization: α_1(i) = π_i b_i(o1) • Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] · b_j(o_{t+1}) • Termination: P(O | λ) = Σ_i α_T(i) (see the sketch below) [figure: trellis showing states 1…N at time t feeding state j at time t+1 via a_ij and b_j(o_{t+1})]
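A hedged sketch of the forward recursion; the toy transition, emission, and initial probabilities and the observation indices are illustrative assumptions, not from the talk.

```python
import numpy as np

A  = np.array([[0.9, 0.1], [0.2, 0.8]])   # transitions a_ij
B  = np.array([[0.7, 0.3], [0.1, 0.9]])   # emissions b_j(o)
pi = np.array([0.5, 0.5])                 # initial distribution
obs = [0, 1, 1, 0]                        # observation indices o_1..o_T

T, N = len(obs), A.shape[0]
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]                      # initialization
for t in range(1, T):                             # induction
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
likelihood = alpha[-1].sum()                      # termination: P(O | lambda)
print(likelihood)
```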

  21. Solution to Problem 1 • 2. Backward algorithm • Define the backward variable β_t(i) = P(o_{t+1} … o_T | qt = i, λ): the probability of observing the remaining partial sequence given that the state at time t is i • 1. Initialization: β_T(i) = 1 • 2. Induction: β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j) (see the sketch below)
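A matching hedged sketch of the backward recursion with the same toy parameters (assumptions, not from the talk); its termination value agrees with the forward result.

```python
import numpy as np

A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]

T, N = len(obs), A.shape[0]
beta = np.zeros((T, N))
beta[-1] = 1.0                                        # initialization
for t in range(T - 2, -1, -1):                        # induction, backwards in t
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
likelihood = np.sum(pi * B[:, obs[0]] * beta[0])      # same P(O | lambda) as the forward pass
print(likelihood)
```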

  22. Solution to Problem 2 • Choose the most likely path • Find the path (q1, q2, …, qT) that maximizes the likelihood P(q1, …, qT | O, λ) • Solution by dynamic programming • Define δ_t(i): the highest probability of a path ending in state i at time t • By induction: δ_{t+1}(j) = [max_i δ_t(i) a_ij] · b_j(o_{t+1})

  23. Solution to Problem 2 • Viterbi algorithm • Initialization: δ_1(i) = π_i b_i(o1), ψ_1(i) = 0 • Recursion: δ_t(j) = [max_i δ_{t-1}(i) a_ij] · b_j(o_t), ψ_t(j) = argmax_i δ_{t-1}(i) a_ij • Termination: P* = max_i δ_T(i), q*_T = argmax_i δ_T(i) • Path (state sequence) backtracking: q*_t = ψ_{t+1}(q*_{t+1}) (see the sketch below)
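A hedged sketch of the Viterbi algorithm with backtracking; the toy parameters and observations are assumptions, not from the talk.

```python
import numpy as np

A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]

T, N = len(obs), A.shape[0]
delta = np.zeros((T, N))               # best path probability ending in each state
psi = np.zeros((T, N), dtype=int)      # backpointers
delta[0] = pi * B[:, obs[0]]                            # initialization
for t in range(1, T):                                   # recursion
    trans = delta[t - 1][:, None] * A                   # trans[i, j] = delta_{t-1}(i) * a_ij
    psi[t] = trans.argmax(axis=0)
    delta[t] = trans.max(axis=0) * B[:, obs[t]]
path = [int(delta[-1].argmax())]                        # termination
for t in range(T - 1, 0, -1):                           # backtracking
    path.insert(0, int(psi[t][path[0]]))
print(path, delta[-1].max())
```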

  24. Solution to Problem 3 • Baum-Welch algorithm • Estimate λ to maximize P(O | λ) • No analytic method because of complexity – iterative solution • Baum-Welch algorithm (actually an EM algorithm): 1. Let the initial model be λ0 2. Compute a new model λ based on λ0 and the observations O 3. If the improvement log P(O | λ) − log P(O | λ0) falls below a threshold, stop 4. Else set λ0 ← λ and go to step 2 (see the sketch below)
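A compact hedged sketch of the Baum-Welch re-estimation loop for a single observation sequence, combining the forward and backward passes above; the toy parameters, the fixed iteration count, and the data are assumptions, not from the talk.

```python
import numpy as np

A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]
T, N = len(obs), A.shape[0]

for iteration in range(20):
    # E-step: forward and backward variables
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    gamma = alpha * beta / likelihood                   # P(q_t = i | O, lambda)
    xi = np.zeros((T - 1, N, N))                        # P(q_t = i, q_{t+1} = j | O, lambda)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

    # M-step: re-estimate lambda = (A, B, pi)
    pi = gamma[0]
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    for k in range(B.shape[1]):
        B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)

print(likelihood)
```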

  25. Preventing Overfitting • Pseudocounts (fake counts) • It is dangerous to estimate a probability distribution from just a few examples • Pretend you saw an amino acid in a position even though it wasn't there (see the sketch below) • Sequence weighting • Some sequences are more frequent than others • Get more data!
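A tiny hedged sketch of the pseudocount idea (Laplace-style smoothing); the column counts below are hypothetical.

```python
# One alignment column with sparse observed counts (hypothetical)
alphabet = "ACGT"
column_counts = {"A": 4, "C": 0, "G": 1, "T": 0}

pseudocount = 1  # "pretend you saw" each residue once even if it wasn't there
total = sum(column_counts.values()) + pseudocount * len(alphabet)
emission = {a: (column_counts[a] + pseudocount) / total for a in alphabet}
print(emission)   # no residue gets probability zero
```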

  26. Pseudocount

  27. SAM-T98 (software tool)

  28. SVM-Fisher

  29. Discriminative Framework for Detecting Remote Protein Homologies • A variant of support vector machines using a new kernel function • The kernel function is derived from a generative statistical model for a protein family, in this case an HMM • Generative statistical models built from multiple sequences (here, HMMs) are used as a way of extracting features from protein sequences; this maps all protein sequences to points in a Euclidean feature space of fixed dimension (Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology)

  30. Method • X = [x1, …, xn]: a protein sequence, where xi is an amino acid residue • H1: an HMM estimated for a particular protein family • P(X | H1): the corresponding probability model • A likelihood ratio score (against a null model) is used in place of the simple probability P(X | H1)

  31. Method • 1. Discriminative approaches • By Bayes' rule, P(H1 | X) = P(X | H1) P(H1) / P(X) • P(H1 | X): the posterior probability of the model, i.e. the posterior probability that the sequence X belongs to the protein family being modeled • Score function L(X): the log posterior odds score, L(X) = log [P(H1 | X) / P(H0 | X)] (see the sketch below)
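A tiny hedged sketch of the log posterior odds score (an assumption, not code from the paper); the log-likelihood values and the default equal priors below are placeholders.

```python
import math

def log_posterior_odds(logp_x_given_h1, logp_x_given_h0, prior_h1=0.5):
    """L(X) = log [P(H1|X) / P(H0|X)] = log P(X|H1) - log P(X|H0) + log [P(H1)/P(H0)]."""
    return logp_x_given_h1 - logp_x_given_h0 + math.log(prior_h1 / (1.0 - prior_h1))

print(log_posterior_odds(-120.0, -135.0))   # positive score favors family membership
```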

  32. Method • 2. Kernel methods • K(Xi, X): the kernel function, a measure of pairwise "similarity" between a training example Xi and a new example X • The discriminant function L(X) is a weighted combination of kernel values between X and the training examples; its sign determines the predicted class for any sequence X

  33. Method • 3. The Fisher kernel • Fisher score U_X = ∇_θ log P(X | H1, θ): the gradients with respect to the parameters of the HMM • P(X | H1, θ): the corresponding probability model, an HMM estimated for a particular family of proteins • θ: includes the output (emission) and transition probabilities of an HMM trained to model the family

  34. Method • Fisher score U_X = ∇_θ log P(X | H1, θ): the gradients with respect to the parameters of the HMM • Based on the probability value the HMM assigns to each sequence

  35. Method • Fisher score • The derivatives of log P(X | H1, θ) with respect to the emission probabilities reduce to the expected posterior frequency of visiting each state and generating each residue

  36. Method • Fisher score vector relative to the emission probabilities • A vector whose components are indexed by (x, s) and whose values are given by the expected posterior frequency of visiting state s and generating residue x • Dimension of the Fisher score vector = 20m (m: number of states) (see the sketch below)
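A simplified hedged sketch that follows the slide's description: it computes the expected posterior counts of emitting each residue from each state (via forward-backward posteriors) and flattens them into a fixed-length score vector. The toy HMM and encoded sequence are assumptions; a full SVM-Fisher implementation differentiates log P(X | H1, θ) with respect to the trained family HMM's parameters.

```python
import numpy as np

A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]                       # encoded sequence X
T, N = len(obs), A.shape[0]
M = B.shape[1]

# Forward-backward to get posterior state occupancies gamma_t(i)
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
gamma = alpha * beta / alpha[-1].sum()

# Expected posterior count of visiting state s while generating residue x
expected_counts = np.zeros((N, M))
for t, o in enumerate(obs):
    expected_counts[:, o] += gamma[t]

U_X = expected_counts.flatten()          # fixed-length vector (dim N*M; 20m for proteins)
print(U_X)
```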

  37. Method • A natural (squared) distance between the gradient vectors, D²(X, X') = ||U_X − U_X'||², quantifies the similarity between the two fixed-length gradient vectors U_X and U_X' corresponding to two sequences X and X' • Gaussian kernel: K(X, X') = exp(−D²(X, X') / (2σ²)) (see the sketch below)
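A hedged sketch of the Gaussian kernel on Fisher score vectors; the example vectors and σ are placeholders.

```python
import numpy as np

def fisher_kernel(U_x, U_x2, sigma=1.0):
    """K(X, X') = exp(-||U_X - U_X'||^2 / (2 sigma^2))."""
    d2 = np.sum((U_x - U_x2) ** 2)        # squared Euclidean distance between gradient vectors
    return np.exp(-d2 / (2.0 * sigma ** 2))

print(fisher_kernel(np.array([0.2, 1.1, 0.0]), np.array([0.1, 0.9, 0.3])))
```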

  38. Method • Method summary (SVM-Fisher method) • 1. Begin with an HMM trained from positive examples to model a given protein family • 2. Use this HMM to map each new protein sequence X we want to classify into a fixed-length vector, its Fisher score • 3. Compute the kernel function on the basis of the Euclidean distance between the score vector for X and the score vectors for known positive and negative examples Xi of the protein family • 4. The resulting discriminant function is a kernel-weighted combination over the training examples Xi (see the pipeline sketch below)
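A hedged end-to-end sketch of steps 1–4 (function names and data are hypothetical, not from the paper): it assumes each sequence has already been mapped to its Fisher score vector by a trained family HMM, as sketched after slide 36, and trains an SVM with a Gaussian kernel on those vectors.

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_fisher(score_vectors, labels, sigma=1.0):
    X = np.asarray(score_vectors)
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))  # Gaussian kernel on score vectors
    clf.fit(X, labels)
    return clf

def is_family_member(clf, score_vector):
    # Step 4: the sign of the discriminant decides family membership
    return clf.decision_function([score_vector])[0] > 0

# Synthetic demonstration data: random vectors standing in for Fisher scores
rng = np.random.default_rng(0)
scores = np.vstack([rng.normal(1.0, 1.0, (10, 6)), rng.normal(-1.0, 1.0, (10, 6))])
labels = [1] * 10 + [0] * 10
clf = train_svm_fisher(scores, labels)
print(is_family_member(clf, scores[0]))
```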

  39. Method • 4. Combination of scores • In many cases, we can construct more than one HMM for the family or superfamily of interest • Combine the scores from the multiple models rather than selecting just one • Li(X): the score for the query sequence X based on the i-th model

  40. Result

  41. Result 2 • GPCR level 1 subfamily recognition • GPCR level 2 subfamily recognition

  42. SOM

  43. References • A. Gaulton & T. K. Attwood, Bioinformatics approach for the classification of GPCRs, Current Opinion in Pharmacology 2003, 3:114–120 • Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755–763 • Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology • Visual Recognition Tutorial; Thad Starner & Alex Pentland, Visual Recognition of American Sign Language Using HMMs, International Workshop on Automatic Face and Gesture Recognition, pages 189–194, 1995

  44. Thank You !
