Learning to Align: a Statistical Approach

IDA 2007
Elisa Ricci, University of Perugia (IT)
Tijl De Bie, Nello Cristianini, University of Bristol (UK)

Outline
• Sequence alignment
• Z-score as a function of the alignment parameters
• Z-score computation by dynamic programming
• Inverse Parametric Sequence Alignment Problem (IPSAP)
• Z-score maximization to solve IPSAP
• Experimental results
  • Artificial data
  • PALI dataset of protein structure alignments
• Conclusions

Sequence Alignment
• Definition: given two sequences $S_1$ and $S_2$, a global alignment $A$ is an assignment of gaps that lines up each letter in one sequence with either a gap or a letter in the other sequence.
• It is used to determine the similarity between biological sequences.
Example: $\Sigma = \{A, T, G, C\}$, $S_1, S_2 \in \Sigma^*$
S1 = ATGCTTTC
S2 = CTGTCGCC
A:
ATGCTTTC---
---CTGTCGCC

Sequence Alignment
• Score of the alignment: a linear function of the parameters.
• 3-parameter model: matches are rewarded with $a_m$, mismatches are penalized by $a_s$, gaps are weighted by $a_g$:
$f(S_1, S_2, A) = a_m m + a_s s + a_g g = a^T x$
with $x^T = [m \; s \; g]$ = [#matches #mismatches #gaps] and $a^T = [a_m \; a_s \; a_g]$.
Example: for the alignment
ATGCTTTC---
---CTGTCGCC
the score is $f(S_1, S_2, A) = 4a_m + a_s + 6a_g$.
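The feature vector in this example can be checked mechanically. The following is a minimal sketch, assuming the alignment is given as two equal-length gapped strings with `-` as the gap symbol; the function names `alignment_features` and `alignment_score` and the parameter values are illustrative, not taken from the talk.

```python
def alignment_features(row1, row2, gap="-"):
    """Count (matches, mismatches, gaps) column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    m = s = g = 0
    for c1, c2 in zip(row1, row2):
        if c1 == gap or c2 == gap:
            g += 1          # a letter lined up with a gap
        elif c1 == c2:
            m += 1          # identical letters
        else:
            s += 1          # different letters
    return m, s, g


def alignment_score(a, x):
    """Linear score a^T x for parameters a and feature vector x."""
    return sum(ai * xi for ai, xi in zip(a, x))


x = alignment_features("ATGCTTTC---", "---CTGTCGCC")
print(x)                                       # (4, 1, 6)
print(alignment_score((1.0, -1.0, -0.5), x))   # 4 - 1 - 3 = 0.0
```

The counts (4, 1, 6) agree with the example score $4a_m + a_s + 6a_g$.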

Sequence Alignment
• 4-parameter model: affine function for gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty $a_o$) or if it continues (gap extension penalty $a_e$).
• 211/212-parameter model: gap penalties plus a symmetric scoring matrix with elements $a_{yt}$, $y, t \in \Sigma$, $\Sigma$ = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}.
Example: C paired with D gives $a_{CD} = a_{DC}$.
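The two parameter counts can be recovered from the size of the symmetric matrix over the 20 amino acids. The split of the remaining parameters between a single gap weight and an opening/extension pair is an assumption consistent with the 3- and 4-parameter models, not something stated on the slide:

```latex
\[
\frac{20 \cdot 21}{2} = 210 \ \text{(symmetric matrix entries)}, \qquad
210 + 1\ (a_g) = 211, \qquad
210 + 2\ (a_o,\, a_e) = 212 .
\]
```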

Sequence Alignment
• Optimal alignment: the highest-scoring alignment. Optimality depends on the parameters $a$.
• The number of possible alignments $N$ is exponential in the length of the sequences.
• The optimal alignment is computed using dynamic programming (DP) in $O(nm)$ time, where $n$ and $m$ are the sequence lengths [Needleman-Wunsch, 1970].
• Alignments can be represented as paths from the upper-left to the lower-right corner in the alignment graph.
[Figure: alignment graph with S1 = ATGCTTTC on one axis and S2 = CTGTCGCC on the other]
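The DP for the optimal score can be sketched as follows for the 3-parameter model. This is a minimal illustration of the Needleman-Wunsch recursion that returns only the best score, not the alignment itself; the function name `nw_score` and the parameter values in the usage lines are assumptions.

```python
def nw_score(s1, s2, a):
    """Best global alignment score for parameters a = (a_m, a_s, a_g)."""
    a_m, a_s, a_g = a
    n, m = len(s1), len(s2)
    # best[i][j] = best score for aligning s1[:i] with s2[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * a_g          # s1[:i] against gaps only
    for j in range(1, m + 1):
        best[0][j] = j * a_g          # s2[:j] against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = a_m if s1[i - 1] == s2[j - 1] else a_s
            best[i][j] = max(best[i - 1][j - 1] + pair,   # match or mismatch
                             best[i - 1][j] + a_g,        # gap in s2
                             best[i][j - 1] + a_g)        # gap in s1
    return best[n][m]


print(nw_score("ATGC", "ATGC", (1, -1, -1)))   # 4.0
print(nw_score("AC", "A", (1, -1, -1)))        # 0.0 (one match, one gap)
```

Each of the $(n+1)(m+1)$ cells is filled in constant time, which is where the $O(nm)$ bound comes from.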

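Moments of the scores
• The mean and the variance of the scores can be expressed as functions of the parameters $a$.
Example: for the 3-parameter model, since the score is linear in $a$,
$\mu(S_1, S_2) = a^T \mu_x = a_m \mu_m + a_s \mu_s + a_g \mu_g$
$\sigma^2(S_1, S_2) = a^T C a$
where $\mu_x$ and $C$ are the mean and the covariance matrix of the feature vectors $x$ over all possible alignments.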

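The Z-score
• Definition: let $\mu(S_1, S_2)$ and $\sigma^2(S_1, S_2)$ be the average score and the variance of the scores over all possible alignments between $S_1$ and $S_2$. Let $\bar{A}$ be the optimal alignment between $S_1$ and $S_2$ for a given $a$, and $\bar{x}$ the associated feature vector. We define the Z-score
$Z(S_1, S_2) = \dfrac{f(S_1, S_2, \bar{A}) - \mu(S_1, S_2)}{\sigma(S_1, S_2)}$
where $f(S_1, S_2, \bar{A}) = a^T \bar{x}$.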
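For very short sequences the definition can be evaluated directly by enumerating every alignment. The sketch below is a brute-force illustration only (the number of alignments grows exponentially, so it is not the DP method of the talk); the names `all_feature_vectors` and `z_score` are hypothetical, and alignments are taken to be paths in the alignment graph, so the two orders of adjacent gaps count as different alignments.

```python
import math


def all_feature_vectors(s1, s2):
    """List (matches, mismatches, gaps) for every path in the alignment graph."""
    def walk(i, j, m, s, g):
        if i == len(s1) and j == len(s2):
            yield (m, s, g)
            return
        if i < len(s1) and j < len(s2):      # pair s1[i] with s2[j]
            hit = s1[i] == s2[j]
            yield from walk(i + 1, j + 1, m + hit, s + (not hit), g)
        if i < len(s1):                      # s1[i] against a gap
            yield from walk(i + 1, j, m, s, g + 1)
        if j < len(s2):                      # s2[j] against a gap
            yield from walk(i, j + 1, m, s, g + 1)
    return list(walk(0, 0, 0, 0, 0))


def z_score(s1, s2, a):
    """(best score - mean score) / standard deviation over all alignments."""
    scores = [sum(ai * xi for ai, xi in zip(a, x))
              for x in all_feature_vectors(s1, s2)]
    mu = sum(scores) / len(scores)
    var = sum((f - mu) ** 2 for f in scores) / len(scores)
    return (max(scores) - mu) / math.sqrt(var)   # undefined if all scores are equal


print(len(all_feature_vectors("A", "A")))   # 3
print(z_score("A", "A", (1, -1, -1)))       # scores 1, -2, -2 give about 1.4142
```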

Computing the Z-score
• Given a parameter vector $a$, the Z-score can be computed with DP for the 3-, 4-, 211- and 212-parameter models.
Example: for the 3-parameter model, 9 DP routines are required.
[Figure: DP table]

Computing the Z-score
• 2 DP tables: $p$ (number of alignments) and $\mu_m$ (mean number of matches).
• Inductive assumption: $p(i, j-1)$, $p(i-1, j)$, $p(i-1, j-1)$ are the correct numbers of alignments, and $\mu_m(i, j-1)$, $\mu_m(i-1, j)$, $\mu_m(i-1, j-1)$ are the correct mean values.
• Each cell is filled with the following rules:
$p(i, j) = p(i-1, j-1) + p(i, j-1) + p(i-1, j)$
$\mu_m(i, j)\, p(i, j) = \left[\mu_m(i-1, j-1) + M\right] p(i-1, j-1) + \mu_m(i, j-1)\, p(i, j-1) + \mu_m(i-1, j)\, p(i-1, j)$
where $M = 1$ if $S_1(i) = S_2(j)$ and $M = 0$ if $S_1(i) \neq S_2(j)$.
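The two recursions can be sketched as follows. The border initialization (one all-gap alignment and zero matches in the first row and column) is an assumption, since the slide only gives the rule for interior cells; the function name `count_and_mean_matches` is illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def count_and_mean_matches(s1, s2):
    """Return (number of alignments, mean number of matches) via DP."""
    n, m = len(s1), len(s2)
    # Border cells: exactly one alignment (all gaps), hence zero matches.
    p = [[1] * (m + 1) for _ in range(n + 1)]
    mu = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M = 1 if s1[i - 1] == s2[j - 1] else 0
            p[i][j] = p[i - 1][j - 1] + p[i][j - 1] + p[i - 1][j]
            # Total number of matches summed over all alignments ending here.
            total = ((mu[i - 1][j - 1] + M) * p[i - 1][j - 1]
                     + mu[i][j - 1] * p[i][j - 1]
                     + mu[i - 1][j] * p[i - 1][j])
            mu[i][j] = total / p[i][j]
    return p[n][m], mu[n][m]


print(count_and_mean_matches("A", "A"))   # (3, Fraction(1, 3))
```

For "A" against "A" there are three alignments, only one of which contains a match, so the mean is 1/3.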

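Computing the Z-score
Basic principle:
• Mean values: averages over all $N$ alignments, e.g. $\mu_m = \frac{1}{N} \sum_{k=1}^{N} m_k$.
• Variances are computed by centering the second-order moments, e.g. $\sigma_m^2 = \frac{1}{N} \sum_{k=1}^{N} m_k^2 - \mu_m^2$.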

IPSAP
• Inverse Parametric Sequence Alignment Problem (IPSAP): given a training set of pairwise global alignments, learn the parameters $a$ in such a way that the given alignments have the best scores among all possible alignments.
Training set: $T = \{(S_i^1, S_i^2, \bar{A}_i)\}$, $i = 1, \dots, \ell$
Find $a$ s.t. $a^T \bar{x}_i \geq a^T x$ for every feature vector $x$ of an alignment of $(S_i^1, S_i^2)$, $i = 1, \dots, \ell$
• Exponential number of linear constraints.
• Iterative approaches: linear programming [Kececioglu and Kim 06], max margin [Joachims et al. 05].

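Z-score maximization
• Idea: a global objective function, more naturally suited to non-separable cases.
• Z-score maximization: choose the parameters $a$ that maximize the Z-score of the given alignments.
• Minimize the number of alignments with a score higher than the given one.
[Figure: labels $\sigma$ and $\mu$]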

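Z-score maximization
• Z-score of a training set: $Z(a) = \dfrac{a^T b^*}{\sqrt{a^T C^* a}}$, with $b^* = \sum_i b_i$ and $C^* = \sum_i C_i$, where $b_i = \bar{x}_i - \mu_i$ is the feature vector of the given alignment minus the mean feature vector of pair $i$, and $C_i$ is the covariance matrix of the feature vectors of pair $i$.
• Convex optimization (QP).
• Most linear constraints are satisfied.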
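To illustrate why the unconstrained problem is easy, assume the training-set Z-score has the ratio form $a^T b^* / \sqrt{a^T C^* a}$ and that $C^*$ is invertible (both assumptions here). A ratio of this form is maximized by any positive multiple of $C^{*-1} b^*$, a standard consequence of the Cauchy-Schwarz inequality. The sketch below uses made-up values for `b_star` and `C_star` and a small hand-written solver; it does not handle the constrained QP of the talk.

```python
import math


def solve_linear(C, b):
    """Solve C a = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(bi)] for row, bi in zip(C, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("C must be invertible for this sketch")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def z_of(a, b, C):
    """Evaluate a^T b / sqrt(a^T C a)."""
    num = sum(ai * bi for ai, bi in zip(a, b))
    quad = sum(a[i] * C[i][j] * a[j]
               for i in range(len(a)) for j in range(len(a)))
    return num / math.sqrt(quad)


C_star = [[2.0, 0.0], [0.0, 1.0]]   # illustrative values, not from the talk
b_star = [2.0, 1.0]                 # illustrative values, not from the talk
a = solve_linear(C_star, b_star)    # direction C*^{-1} b*
print(a)                            # [1.0, 1.0]
print(z_of(a, b_star, C_star) > z_of([1.0, 0.0], b_star, C_star))   # True
```

Because the ratio is invariant to positive rescaling of $a$, fixing $a^T b^* = 1$ and minimizing $a^T C^* a$ gives the same direction, which is one way to see the problem as a QP.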

Iterative algorithm
• Explicitly impose the violated constraints.
• The result is again a convex optimization problem.
• Repeat until no constraint is violated.
• If necessary, relax the constraints (e.g. add slack variables for non-separable problems).

Iterative algorithm
INPUT: training set $T$
$C \leftarrow \emptyset$
Compute $b_i$, $C_i$ for all $i = 1 \dots \ell$ (moments computation)
Compute $b^* = \sum_i b_i$, $C^* = \sum_i C_i$
Find $a$ solving QP (Z-score maximization)
repeat
  for $i = 1 \dots \ell$ do
    Compute $x_i' = \arg\max_x a^T x$, the feature vector of the highest-scoring alignment of $(S_i^1, S_i^2)$ (identify the most violated constraint)
    if $a^T x_i' > a^T \bar{x}_i$ then
      $C \leftarrow C \cup \{ a^T (\bar{x}_i - x_i') > 0 \}$
      Find $a$ solving QP s.t. $C$ (constrained Z-score maximization)
    endif
  endfor
until $C$ is not changed during the current iteration.
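The control flow of this loop can be sketched independently of the solver. In the sketch below, `solve_qp` and `best_features` are hypothetical callbacks supplied by the caller: the first returns parameters for the current constraint list, the second returns the feature vector of the highest-scoring alignment under the current parameters. Neither is implemented here, and `max_rounds` is an added safety limit that is not part of the slide.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def learn_parameters(training, solve_qp, best_features, max_rounds=100):
    """Constraint-generation loop.

    training      : list of (s1, s2, x_given) with x_given the feature
                    vector of the given alignment
    solve_qp      : hypothetical callback, constraints -> parameter vector a
    best_features : hypothetical callback, (s1, s2, a) -> feature vector of
                    the highest-scoring alignment under a
    """
    constraints = []                       # each d encodes a^T d > 0
    a = solve_qp(constraints)              # unconstrained Z-score maximization
    for _ in range(max_rounds):
        changed = False
        for s1, s2, x_given in training:
            x_best = best_features(s1, s2, a)
            if dot(a, x_best) > dot(a, x_given):      # given alignment is beaten
                constraints.append(
                    tuple(g - b for g, b in zip(x_given, x_best)))
                a = solve_qp(constraints)             # re-solve with new constraint
                changed = True
        if not changed:                    # no violated constraint in a full pass
            break
    return a, constraints
```

Keeping the solver and the alignment routine behind callbacks means the same loop applies to the 3-, 4-, 211- and 212-parameter models, as long as both callbacks agree on the feature layout.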

Experimental results
• Test error as a function of the training set size.
• Distribution of correctly reconstructed alignments as a function of the number of additional constraints.

Experimental results
• Experiments with no constraints.
• Test error as a function of the training set size.
• Given and computed substitution matrices.

Experimental results
• Real sequences of amino acids: 5 multiple alignments from the PALI database of structural protein alignments.
Error rates and added constraints (in parentheses).

Summary
• New method for IPSAP:
  • Accurate and fast (few constraints are required).
  • Easy to implement: DP for computing the moments and a simple convex optimization problem.
  • Mean and variance computations are parallelizable for large training sets.
• Future work:
  • Approximate moment estimation with sampling techniques is a suitable option.
  • Possible extension to other problems: sequence labeling learning and sequence parse learning with context-free grammars.
