Learning to Align: a Statistical Approach

Learning to Align: a Statistical Approach. IDA 2007. Elisa Ricci, University of Perugia (IT) Tijl De Bie, Nello Cristianini, University of Bristol (UK). Outline. Sequence alignment Z-score as function of the alignments parameters Z-score computation by dynamic programming

Presentation Transcript


  1. Learning to Align: a Statistical Approach. IDA 2007. Elisa Ricci, University of Perugia (IT); Tijl De Bie, Nello Cristianini, University of Bristol (UK)

  2. Outline • Sequence alignment • Z-score as a function of the alignment parameters • Z-score computation by dynamic programming • Inverse Parametric Sequence Alignment Problem (IPSAP) • Z-score maximization to solve IPSAP • Experimental results: artificial data; PALI dataset of protein structure alignments • Conclusions

  3. Sequence Alignment • Definition: Given two sequences $S_1$, $S_2$, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence. • It is used to determine the similarity between biological sequences. Example: $\Sigma = \{A, T, G, C\}$, $S_1, S_2 \in \Sigma^*$, with $S_1$ = ATGCTTTC and $S_2$ = CTGTCGCC. One alignment $A$ is:
     ATGCTTTC---
     ---CTGTCGCC

  4. Sequence Alignment • Score of the alignment: a linear function of the parameters. • 3-parameter model: matches are rewarded with $\alpha_m$, mismatches are penalized by $\alpha_s$, gaps are weighted by $\alpha_g$: $f(S_1, S_2, A) = \alpha_m m + \alpha_s s + \alpha_g g = \alpha^T x$, with $x^T = [m\ s\ g]$ = [#matches #mismatches #gaps] and $\alpha^T = [\alpha_m\ \alpha_s\ \alpha_g]$. Example: for the alignment $A$ = ATGCTTTC--- / ---CTGTCGCC, $f(S_1, S_2, A) = 4\alpha_m + \alpha_s + 6\alpha_g$.
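
The 3-parameter score above can be checked on the slide's example with a short Python sketch. The function names and the numeric parameter values are illustrative assumptions, not taken from the slides; each gap column is counted as one gap, which is what the example's count of 6 implies.

```python
def alignment_features(row1, row2):
    """Return (matches, mismatches, gaps) for two equal-length aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    m = s = g = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            g += 1          # a letter lined up with a gap
        elif a == b:
            m += 1          # match
        else:
            s += 1          # mismatch
    return m, s, g


def alignment_score(row1, row2, alpha):
    """Linear score alpha^T x with alpha = (alpha_m, alpha_s, alpha_g)."""
    x = alignment_features(row1, row2)
    return sum(a * xi for a, xi in zip(alpha, x))


features = alignment_features("ATGCTTTC---", "---CTGTCGCC")
# Hypothetical parameter values, chosen only for illustration.
score = alignment_score("ATGCTTTC---", "---CTGTCGCC", (1, -1, -2))
```

With these assumed values the score is 4·1 + 1·(−1) + 6·(−2).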

  5. Sequence Alignment • 4-parameter model: affine function for gap penalties, i.e. different costs if a gap starts at a given position (gap opening penalty $\alpha_o$) or if it continues (gap extension penalty $\alpha_e$). • 211/212-parameter model: gap penalties plus a symmetric scoring matrix with elements $\alpha_{yt}$, $y, t \in \Sigma$, $\Sigma = \{A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$. Example: C paired with D, $\alpha_{CD} = \alpha_{DC}$.

  6. Sequence Alignment • Optimal alignment: the highest-scoring alignment. Optimality depends on the parameters $\alpha$. • The number of possible alignments $N$ is exponential in the length of the sequences. • The optimal alignment is computed using dynamic programming (DP) in $O(nm)$ time [Needleman-Wunsch, 1970]. • Alignments can be represented as paths from the upper-left to the lower-right corner of the alignment graph, whose axes are labeled by the two sequences (here ATGCTTTC and CTGTCGCC).
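
As an illustration of the $O(nm)$ dynamic program, here is a minimal Needleman-Wunsch sketch for the 3-parameter model. It returns only the optimal score, not the alignment itself; the function name and the parameter values in the usage lines are assumptions made for this example.

```python
def needleman_wunsch_score(s1, s2, alpha_m, alpha_s, alpha_g):
    """Score of the best global alignment under the 3-parameter model."""
    n, m = len(s1), len(s2)
    # best[i][j] = best score for aligning s1[:i] with s2[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * alpha_g      # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        best[0][j] = j * alpha_g      # s2[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = alpha_m if s1[i - 1] == s2[j - 1] else alpha_s
            best[i][j] = max(
                best[i - 1][j - 1] + pair,    # diagonal step: letter vs letter
                best[i - 1][j] + alpha_g,     # letter of s1 vs gap
                best[i][j - 1] + alpha_g,     # gap vs letter of s2
            )
    return best[n][m]


identical = needleman_wunsch_score("AT", "AT", 1, -1, -1)
single_mismatch = needleman_wunsch_score("A", "T", 1, -1, -2)
```

Each cell depends on three neighbours, so filling the table takes time proportional to the product of the sequence lengths.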

  7. Moments of the scores • The mean and the variance of the scores can be expressed as functions of the parameters $\alpha$. Example: for the 3-parameter model:
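
The example's equations did not survive in the transcript. Since the score is linear in $\alpha$, they presumably have the following form; the symbols $\mu_m, \mu_s, \mu_g$ (mean numbers of matches, mismatches and gaps over all alignments) and $C$ (the covariance matrix of the feature vector $x$) are notation assumed here.

```latex
\begin{align*}
\mu(S_1, S_2) &= \alpha_m \mu_m + \alpha_s \mu_s + \alpha_g \mu_g
               = \alpha^T \mu_x, \\
\sigma^2(S_1, S_2) &= \alpha^T C \alpha
  = \alpha_m^2 \sigma_m^2 + \alpha_s^2 \sigma_s^2 + \alpha_g^2 \sigma_g^2
  + 2\alpha_m \alpha_s \sigma_{ms}
  + 2\alpha_m \alpha_g \sigma_{mg}
  + 2\alpha_s \alpha_g \sigma_{sg}.
\end{align*}
```

The mean is linear and the variance is a quadratic form in $\alpha$, so both can be evaluated for any $\alpha$ once the moments of $x$ are known.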

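  8. The Z-score • Definition: Let $\mu(S_1, S_2)$ and $\sigma^2(S_1, S_2)$ be the average score and the variance of the scores over all possible alignments between $S_1$ and $S_2$. Let $\bar{A}$ be the optimal alignment between $S_1$ and $S_2$ for a given $\alpha$, and $\bar{x}$ the associated feature vector. We define the Z-score $Z(S_1, S_2) = \dfrac{f(S_1, S_2, \bar{A}) - \mu(S_1, S_2)}{\sigma(S_1, S_2)}$, where $f(S_1, S_2, \bar{A}) = \alpha^T \bar{x}$.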
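
For very short sequences the definition can be checked by brute force, enumerating every global alignment. This is only a sketch to make the definition concrete: it is exponential in the sequence lengths, it uses the 3-parameter model, and the function names are assumptions. Alignments are counted as distinct paths in the alignment graph.

```python
from statistics import mean, pstdev


def enumerate_features(s1, s2):
    """Yield (matches, mismatches, gaps) for every global alignment of s1 and s2."""
    def walk(i, j, m, s, g):
        if i == len(s1) and j == len(s2):
            yield (m, s, g)
            return
        if i < len(s1) and j < len(s2):       # letter against letter
            if s1[i] == s2[j]:
                yield from walk(i + 1, j + 1, m + 1, s, g)
            else:
                yield from walk(i + 1, j + 1, m, s + 1, g)
        if i < len(s1):                       # letter of s1 against a gap
            yield from walk(i + 1, j, m, s, g + 1)
        if j < len(s2):                       # gap against letter of s2
            yield from walk(i, j + 1, m, s, g + 1)
    yield from walk(0, 0, 0, 0, 0)


def z_score(s1, s2, alpha):
    """(best score - mean score) / standard deviation over all alignments."""
    scores = [sum(a * x for a, x in zip(alpha, feat))
              for feat in enumerate_features(s1, s2)]
    return (max(scores) - mean(scores)) / pstdev(scores)


n_alignments = sum(1 for _ in enumerate_features("AT", "AT"))
z = z_score("AT", "AT", (1, -1, -1))
```

A large Z-score means the best alignment stands far above the bulk of the score distribution.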

  9. Computing the Z-score • Given a parameter vector $\alpha$, the Z-score can be computed with DP for the 3-, 4-, 211- and 212-parameter models. Example: for the 3-parameter model, 9 DP routines are required.

  10. Computing the Z-score • 2 DP tables: $p$, $\mu_m$. • Inductive assumption: $p(i, j-1)$, $p(i-1, j)$, $p(i-1, j-1)$ are the correct numbers of alignments; $\mu_m(i, j-1)$, $\mu_m(i-1, j)$, $\mu_m(i-1, j-1)$ are the correct mean values. • Each cell is filled with the following rules: $p(i, j) = p(i-1, j-1) + p(i, j-1) + p(i-1, j)$ and $\mu_m(i, j)\, p(i, j) = \big(\mu_m(i-1, j-1) + M\big)\, p(i-1, j-1) + \mu_m(i, j-1)\, p(i, j-1) + \mu_m(i-1, j)\, p(i-1, j)$, where $M = 1$ if $S_1(i) = S_2(j)$ and $M = 0$ if $S_1(i) \neq S_2(j)$.
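
The two recurrences translate directly into code. The sketch below fills the tables $p$ (number of alignments) and $\mu_m$ (mean number of matches); the boundary values $p = 1$ and $\mu_m = 0$ on the first row and column are an assumption consistent with a single all-gap alignment there, and the function name is illustrative.

```python
def count_and_mean_matches(s1, s2):
    """Return (number of global alignments, mean number of matches)."""
    n, m = len(s1), len(s2)
    # Boundary cells: one all-gap alignment, containing zero matches.
    p = [[1] * (m + 1) for _ in range(n + 1)]
    mu = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M = 1 if s1[i - 1] == s2[j - 1] else 0
            p[i][j] = p[i - 1][j - 1] + p[i][j - 1] + p[i - 1][j]
            total = ((mu[i - 1][j - 1] + M) * p[i - 1][j - 1]
                     + mu[i][j - 1] * p[i][j - 1]
                     + mu[i - 1][j] * p[i - 1][j])
            mu[i][j] = total / p[i][j]
    return p[n][m], mu[n][m]


count, mean_matches = count_and_mean_matches("AT", "AT")
```

For the two-letter example there are 13 alignments containing 6 matches in total, so the mean is 6/13, obtained without enumerating the alignments.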

  11. Computing the Z-score Basic principle: • Mean values: • Variances are computed by centering the second-order moments:
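
The formulas for this slide are missing from the transcript. A plausible reading of "centering the second-order moments" is the standard identity below, shown for the number of matches $m$ and for the pair (matches, mismatches). The table name $v_m$ and its recurrence are an assumed derivation in the style of the $\mu_m$ recurrence, not the authors' stated formulas.

```latex
\begin{align*}
\sigma_m^2 &= \mathbb{E}[m^2] - \mu_m^2, \qquad
\sigma_{ms} = \mathbb{E}[ms] - \mu_m \mu_s, \\
v_m(i,j)\, p(i,j) &= \big(v_m(i-1,j-1) + 2M\,\mu_m(i-1,j-1) + M\big)\, p(i-1,j-1) \\
  &\quad + v_m(i,j-1)\, p(i,j-1) + v_m(i-1,j)\, p(i-1,j),
\end{align*}
```

Here $v_m(i,j)$ denotes the second-order moment $\mathbb{E}[m^2]$ over the alignments of the two prefixes, and the term $M$ stands for $M^2$, which equals $M$ because $M \in \{0, 1\}$.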

  12. IPSAP • Inverse Parametric Sequence Alignment Problem (IPSAP): given a training set of pairwise global alignments, learn the parameters $\alpha$ in such a way that the given alignments have the best scores among all possible alignments. Training set: $T = \{(S_i^1, S_i^2, A_i)\}_{i=1}^{\ell}$. Find $\alpha$ s.t. $f(S_i^1, S_i^2, A_i) > f(S_i^1, S_i^2, A)$ for all alignments $A \neq A_i$ of $S_i^1$ and $S_i^2$, $i = 1, \dots, \ell$. • Exponential number of linear constraints. • Iterative approaches: linear programming [Kececioglu and Kim 06], max margin [Joachims et al. 05].

  13. Z-score maximization • Idea: global objective function, more naturally suited for non-separable cases. • Z-score maximization: • Minimize the number of alignments with a score higher than that of the given one.

  14. Z-score maximization • Z-score of a training set: • Convex optimization (QP). • Most linear constraints are satisfied.
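
The training-set Z-score formula is not preserved in the transcript. Using the quantities $b^* = \sum_i b_i$ and $C^* = \sum_i C_i$ that appear in the algorithm on a later slide, it plausibly takes the form below. Reading $b_i$ as the given alignment's feature vector minus the mean feature vector, and $C_i$ as the feature covariance matrix for pair $i$, is an assumption.

```latex
\begin{align*}
Z(\alpha) &= \frac{b^{*T} \alpha}{\sqrt{\alpha^T C^* \alpha}},
\qquad b^* = \sum_{i=1}^{\ell} b_i, \quad C^* = \sum_{i=1}^{\ell} C_i, \\
\max_{\alpha} Z(\alpha) \;&\Longleftrightarrow\;
\min_{\alpha}\ \alpha^T C^* \alpha \quad \text{s.t.} \quad b^{*T} \alpha = 1 .
\end{align*}
```

The equivalence relies on $Z$ being invariant to positive rescaling of $\alpha$, so the numerator can be fixed to 1. When $C^*$ is invertible, this equality-constrained QP has the closed-form solution $\alpha \propto C^{*-1} b^*$.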

  15. Iterative algorithm • Explicitly impose the violated constraints. • Again a convex optimization problem. • Iterative algorithm. • If necessary, relax the constraints (e.g. add slack variables for non-separable problems).

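  16. Iterative algorithm
      INPUT: training set $T$
      $\mathcal{C} \leftarrow \emptyset$
      Compute $b_i$, $C_i$ for all $i = 1, \dots, \ell$ (moments computation)
      Compute $b^* = \sum_i b_i$, $C^* = \sum_i C_i$
      Find $\alpha$ solving the QP (Z-score maximization)
      repeat
        for $i = 1, \dots, \ell$ do
          Compute $x_i' = \arg\max_x \alpha^T x$, the feature vector of the optimal alignment of $S_i^1$, $S_i^2$ under the current $\alpha$ (identify the most violated constraint)
          if $\alpha^T x_i' > \alpha^T \bar{x}_i$, where $\bar{x}_i$ is the feature vector of the given alignment $A_i$, then
            $\mathcal{C} \leftarrow \mathcal{C} \cup \{\alpha^T (\bar{x}_i - x_i') > 0\}$
            Find $\alpha$ solving the QP s.t. $\mathcal{C}$ (constrained Z-score maximization)
          end if
        end for
      until $\mathcal{C}$ is unchanged during the current iteration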

  17. Experimental results • Test error as a function of the training set size. • Distribution of correctly reconstructed alignments as a function of the number of additional constraints.

  18. Experimental results • Experiments with no constraints. • Test error as a function of the training set size. • Given and computed substitution matrices.

  19. Experimental results • Real sequences of amino acids: 5 multiple alignments from the PALI database of structural protein alignments. Error rates and added constraints (in parentheses).

  20. Summary • New method for IPSAP: • Accurate and fast (few constraints are required). • Easy to implement: DP for computing the moments and a simple convex optimization problem. • Mean and variance computations are parallelizable for large training sets. • Future work: • Approximate moment estimation with sampling techniques is a suitable option. • Possible extension to other problems: sequence labeling learning and sequence parse learning with context-free grammars.