Sequence Alignment Tutorial #3. © Ydo Wexler & Dan Geiger
Sequence Alignment (Reminder)
Global Alignment:
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps)
Example:
• S1 = GCGCATGGATTGAGCGA
• S2 = TGCGCCATTGATGACC
• A possible alignment:
  S’1 = -GCGC-ATGGATTGAGCGA
  S’2 = TGCGCCATTGAT-GACC--
Goal: determine how similar the two sequences S1 and S2 are
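The global alignment problem above is usually solved with a quadratic dynamic program. The following is a minimal sketch, not the tutorial's own code: the function name `global_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, since the slide does not fix a scoring scheme at this point.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, s', t').

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    # V[i][j] = best score of aligning the prefixes s[:i] and t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                          V[i - 1][j] + gap,       # s[i-1] against a gap
                          V[i][j - 1] + gap)       # t[j-1] against a gap
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return V[n][m], ''.join(reversed(a)), ''.join(reversed(b))


score, s1_aligned, s2_aligned = global_align("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC")
print(score)
print(s1_aligned)
print(s2_aligned)
```

The two returned strings have equal length, and removing the gaps from each gives back the corresponding input, which is exactly the output condition stated on the slide.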
Sequence Alignment (Reminder)
Local Alignment:
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are substrings of S1, S2 with possibly additional gaps)
Example:
• S1 = GCGCATGGATTGAGCGA
• S2 = TGCGCCATTGATGACC
• A possible alignment:
  S’1 = ATTGA-G
  S’2 = ATTGATG
Goal: find the pair of substrings of the two input sequences that have the highest similarity
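Local alignment can be sketched with the same table as global alignment, with two changes: a cell may restart at score 0, and the traceback begins at the best cell rather than at the corner. The name `local_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration only.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming; returns (score, s', t').

    s' and t' are gapped substrings of s and t. Scoring values are
    illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    # V[i][j] = best score of an alignment ending at s[i-1], t[j-1] (or 0)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j
    # Trace back from the best cell until a cell with score 0 is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return best, ''.join(reversed(a)), ''.join(reversed(b))


score, sub1, sub2 = local_align("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC")
print(score)
print(sub1)
print(sub2)
```

Under these assumed scores the optimal local alignment need not be the one shown on the slide, which is presented only as a possible alignment.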
Sequence Alignment (Reminder)
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)
Scoring:
• Score each position independently
• The score of an alignment is the sum of its position scores
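Scoring each position independently and summing can be written in a few lines. This is a small sketch; the function name `alignment_score` and the default values (match +1, mismatch -1, indel -1) are assumptions, since the slide does not give numeric scores here.

```python
def alignment_score(a, b, match=1, mismatch=-1, indel=-1):
    """Sum of independent column scores of a pairwise alignment.

    a and b are gapped sequences of equal length; '-' marks a gap.
    The default score values are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            total += indel       # insertion or deletion
        elif x == y:
            total += match       # perfect match
        else:
            total += mismatch    # mismatch
    return total


print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
```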
Breaking Number
• Input: two sequences M, E over the same alphabet (|M| ≥ |E|)
• Output: the smallest k such that there exist partitions M = M1M2…Mk and E = E1E2…Ek in which Ei is a substring of Mi for all i = 1..k. If no such k exists, return ∞.
Example: M = AAAATTTAAATTTA, E = AATTATA
M1 = AAAATTT  M2 = AAATTT  M3 = A
E1 = AATT     E2 = AT      E3 = A
AAAATTTAAATTTA
--AATT---AT--A
Find an O(|M||E|) algorithm for finding the breaking number of M, E.
Breaking Number (cont)
• Solution: reduce the problem to global alignment with modifications:
  • Do not allow mismatches
  • Do not allow gaps in M
  • No penalty for gaps at the start/end of the sequence
  • Constant penalty per gap, regardless of its length (an affine gap penalty with zero elongation cost)
• Scoring scheme:
  • Match: 0
  • Mismatch: -∞
  • Gap introduction: -1
  • Gap elongation: 0
AAAATTTAAATTTA
--AATT---AT--A
breaking number = -(score of the alignment) + 1
Breaking Number (cont)
• Complexity: standard O(|M||E|) dynamic programming
• Correctness: two-way argument
  (1) An alignment of score -(k-1) corresponds to a partition of M, E into k blocks
  (2) A partition of M, E into k blocks has an alignment of score -(k-1)
• Optimal alignment has score -∞ ⇒ there is no valid partition (by 2)
• Optimal alignment has score -k ⇒
  • there is a valid partition into k+1 blocks (by 1)
  • there is no valid partition into fewer blocks (by 2)
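The reduction can be sketched as a two-table dynamic program that counts opened internal gaps as a cost (the negated score). This is one possible reading of the scoring scheme, not the tutorial's own code: the name `breaking_number`, the two tables `match` and `gap`, and the handling of an empty E are assumptions.

```python
import math


def breaking_number(M, E):
    """Breaking number of M, E in O(|M||E|) time, or math.inf if none exists.

    match[i][j]: min number of internal gaps when M[i-1] is matched to E[j-1]
    gap[i][j]:   min number of internal gaps when M[i-1] faces a gap in E
                 and j characters of E have been consumed
    Assumption: E is non-empty (the slides do not discuss the empty case).
    """
    m, n = len(M), len(E)
    if n == 0:
        raise ValueError("E is assumed to be non-empty")
    INF = math.inf
    match = [[INF] * (n + 1) for _ in range(m + 1)]
    gap = [[INF] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        gap[i][0] = 0                      # leading gap is free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if M[i - 1] == E[j - 1]:       # mismatches are forbidden
                match[i][j] = min(match[i - 1][j - 1], gap[i - 1][j - 1])
            # Opening a gap costs 1, except after all of E is consumed
            # (trailing gap); extending a gap is always free.
            open_cost = 0 if j == n else 1
            gap[i][j] = min(gap[i - 1][j], match[i - 1][j] + open_cost)
    best = min(match[m][n], gap[m][n])
    return best + 1 if best < INF else INF


print(breaking_number("AAAATTTAAATTTA", "AATTATA"))
```

Because gaps are only ever placed in E, every character of E must be matched to an equal character of M, which enforces both the "no mismatches" and the "no gaps in M" restrictions.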
Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
[Figure: two possible alignments of S1, S2, S3]
Multiple Sequence Alignment (cont)
• Input: sequences S1, S2, …, Sk over the same alphabet
• Output: gapped sequences S’1, S’2, …, S’k of equal length
  • |S’1| = |S’2| = … = |S’k|
  • Removing the spaces from S’i yields Si
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.
Multiple Sequence Alignment Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match: 0, mismatch/indel: -1
SP score: -4 - 3 - 5 = -12
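The SP score of this example can be checked mechanically. The sketch below uses the slide's scoring scheme (match 0, mismatch/indel -1) and treats a column with two gaps as scoring 0, matching the convention δ(-,-) = 0 used later in the tutorial. The function names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score of the pairwise alignment induced on two rows of an MSA."""
    total = 0
    for x, y in zip(a, b):
        if x == '-' and y == '-':
            continue             # gap against gap contributes nothing
        if x == '-' or y == '-':
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows):
    """Sum-of-pairs score: sum of the induced scores over all pairs of rows."""
    if len(set(len(r) for r in rows)) > 1:
        raise ValueError("all rows must have equal length")
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```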
Multiple Sequence Alignment Complexity
Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
• Instead of a 2-dimensional table we have a k-dimensional table
• Each dimension has length n+1
• Each entry depends on 2^k - 1 adjacent entries
Complexity: O(2^k n^k)
This problem is known to be NP-hard (no polynomial-time algorithm is known)
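The count of adjacent entries comes from the fact that, in each of the k coordinates, the last column of the alignment either consumes a character (step back by one) or holds a gap (no step), and the all-gap column is excluded. A small sketch, with the hypothetical helper name `predecessors`:

```python
from itertools import product


def predecessors(cell):
    """Cells of the k-dimensional DP table that the given cell depends on.

    Each coordinate steps back by 0 or 1; the all-zero step (the cell itself)
    is skipped, and cells outside the table are dropped.
    """
    k = len(cell)
    result = []
    for step in product((0, 1), repeat=k):
        if not any(step):
            continue
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:
            result.append(prev)
    return result


print(len(predecessors((1, 1, 1))))  # 7 = 2^3 - 1
```

Cells on the border of the table have fewer predecessors, so 2^k - 1 is the count for interior cells.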
Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score, and look for an alignment of minimal cost.
Assumption: the cost function δ is a distance function
• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)
(e.g. the cost of a mismatch ≤ the cost of two indels)
D(S,T): the cost of a minimum-cost global alignment between S and T
Multiple Sequence Alignment Approximation Algorithm
The ‘star’ algorithm:
Input: Γ, a set of k strings S1, …, Sk.
• Find the string S’ (the center) that minimizes $\sum_{S \in \Gamma} D(S', S)$
• Denote S1 = S’ and the rest of the strings as S2, …, Sk
• Iteratively add S2, …, Sk to the alignment as follows:
  • Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1
  • Align Si to S’1 to produce S’i and S’’1, aligned
  • Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1
  • Replace S’1 by S’’1
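The steps of the star algorithm can be sketched as follows. Several choices here are assumptions made for illustration: the unit cost function `delta` (0 for equal symbols, 1 otherwise, so δ(-,-) = 0 and the distance axioms hold), the center taken as the string minimizing the sum of pairwise distances D to all other strings, and the names `align` and `star_align`.

```python
GAP = '-'


def delta(x, y):
    """Assumed unit-cost distance: 0 if the symbols are equal, else 1."""
    return 0 if x == y else 1


def align(s, t):
    """Min-cost global alignment of s (may already contain gaps) with t.

    Returns (cost, s', t', inserted), where inserted[c] is True when
    column c of s' is a space newly added to s by this alignment.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], GAP),
                          D[i][j - 1] + delta(GAP, t[j - 1]))
    cols = []                              # built right to left
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            cols.append((s[i - 1], t[j - 1], False)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], GAP):
            cols.append((s[i - 1], GAP, False)); i -= 1
        else:
            cols.append((GAP, t[j - 1], True)); j -= 1
    cols.reverse()
    a = ''.join(c[0] for c in cols)
    b = ''.join(c[1] for c in cols)
    return D[n][m], a, b, [c[2] for c in cols]


def star_align(seqs):
    """Star alignment; returns the aligned center first, then the other rows."""
    k = len(seqs)
    dist = [[align(a, b)[0] for b in seqs] for a in seqs]
    c = min(range(k), key=lambda i: sum(dist[i]))   # center string
    center = seqs[c]
    rows = []
    for s in (seqs[i] for i in range(k) if i != c):
        _, new_center, new_row, inserted = align(center, s)
        adjusted = []
        for row in rows:                   # copy new spaces into earlier rows
            it = iter(row)
            adjusted.append(''.join(GAP if ins else next(it) for ins in inserted))
        rows = adjusted + [new_row]
        center = new_center
    return [center] + rows


for row in star_align(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Since a space already present in the center can face a new space at cost δ(-,-) = 0, the cost between the center row and each added row stays equal to the pairwise optimum, which is the property the approximation analysis relies on.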
Multiple Sequence Alignment Approximation Algorithm
Time analysis:
• Choosing S1: execute DP for all sequence pairs, O(k^2 n^2)
• Adding Si to the alignment: execute DP for Si and S’1, O(i·n^2) (in the i-th stage the length of S’1 can be up to i·n)
• Total complexity: $O(k^2 n^2) + \sum_{i=2}^{k} O(i \cdot n^2) = O(k^2 n^2)$
Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
• M*: an optimal alignment
• M: the alignment produced by this algorithm
• d(i,j): the distance M induces on the pair Si, Sj
For all i: d(1,i) = D(S1,Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)
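Multiple Sequence Alignment Approximation Algorithm
Approximation ratio (v(·) denotes the SP cost of an alignment, d*(i,j) the distance M* induces on the pair Si, Sj):
• Triangle inequality:
  $2\,v(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(i,1) + d(1,j) \bigr] = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l)$
• Definition of S1:
  $2\,v(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i,S_j) \ge k \sum_{l=2}^{k} D(S_1,S_l)$
• Therefore:
  $\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} = 2 - \frac{2}{k} < 2$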