BNFO 602 Multiple sequence alignment

Usman Roshan

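Optimal pairwise alignment
• Sum of pairs (SP) optimization: find the alignment of two sequences that maximizes the similarity score given an arbitrary cost matrix. We can find the optimal alignment in O(mn) time and space using the Needleman-Wunsch algorithm.
• Recursion: $M(i,j) = \max\{\, M(i-1,j-1) + s(x_i,y_j),\; M(i-1,j) + g,\; M(i,j-1) + g \,\}$, with $M(0,0) = 0$, $M(i,0) = i\,g$, and $M(0,j) = j\,g$
• Traceback: start at M(m,n) and at each cell move to the predecessor whose term attained the maximum, until M(0,0) is reached
where M(i,j) is the score of the optimal alignment of x1..i and y1..j, s(xi,yj) is a substitution scoring matrix, and g is the gap penalty (a negative value added to the score)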
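The recursion and traceback can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `needleman_wunsch`, the `_` gap character, and the scoring values in the usage line are assumptions, and g is taken to be a negative number added to the score.

```python
def needleman_wunsch(x, y, s, g):
    """Global alignment with a linear gap penalty g (assumed negative).

    s is a function s(a, b) returning the substitution score.
    Returns (score, aligned_x, aligned_y), with '_' marking gaps.
    """
    m, n = len(x), len(y)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # x1..i aligned entirely to gaps
        M[i][0] = M[i - 1][0] + g
    for j in range(1, n + 1):          # y1..j aligned entirely to gaps
        M[0][j] = M[0][j - 1] + g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          M[i - 1][j] + g,
                          M[i][j - 1] + g)
    # Traceback: follow the term that attained the maximum.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + g):
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    # Hypothetical scoring: match = +1, mismatch = -1.
    return 1 if a == b else -1


score, ax, ay = needleman_wunsch("ACGT", "AGT", simple_score, -2)
print(score, ax, ay)
```

Both the table and the traceback take O(mn) time; the table also takes O(mn) space, matching the bound stated on the slide.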

Affine gap penalties
• Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extending a gap than for opening one. We define g as the gap open penalty (first gap position) and e as the gap extension penalty (each additional gap position)
• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
• Alignment:
  ACACCCT      ACACCCC
  AC_CT_T      AC__CTT
  Score = 0    Score = 0.9
• Two separate gaps each pay the open penalty (4 matches - 2 - 2 = 0), while one gap of length two pays it once (3 matches - 2 - 0.1 = 0.9)
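To check the two scores by hand, a small scoring function is enough. The sketch below is illustrative only: the name `score_alignment` and the `_` gap character are assumptions, and the gap positions in the two example alignments are chosen so that they reproduce the scores 0 and 0.9 stated on the slide.

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0,
                    gap_open=-2.0, gap_ext=-0.1, gap="_"):
    """Score a given pairwise alignment under affine gap penalties.

    The first position of a gap run costs gap_open; every further
    position in the same run (same row) costs gap_ext.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    prev_gap_row = None  # row holding a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("column with two gaps is not allowed")
        if a == gap or b == gap:
            row = "top" if a == gap else "bottom"
            score += gap_ext if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


print(round(score_alignment("ACACCCT", "AC_CT_T"), 2))
print(round(score_alignment("ACACCCC", "AC__CTT"), 2))
```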

Affine penalty recursion
• M(i,j) is the best score over alignments of x1..i and y1..j ending with a match/mismatch (xi paired with yj).
• E(i,j) is the best score over alignments of x1..i and y1..j such that yj is paired with a gap.
• F(i,j) is defined similarly, with xi paired with a gap.
• Recursion takes O(mn) time where m and n are lengths of x and y respectively.
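One common way to write the three-matrix recursion is sketched below. The exact form used in the lecture is not shown in the transcript, so treat this as an assumption: `affine_score` is a hypothetical name, g and e are negative values added to the score, and a gap that switches from one sequence to the other is charged a fresh gap open.

```python
NEG_INF = float("-inf")


def affine_score(x, y, s, g, e):
    """Optimal global alignment score with gap open g and gap extension e.

    M[i][j]: best score ending with x_i paired with y_j.
    E[i][j]: best score ending with y_j paired with a gap.
    F[i][j]: best score ending with x_i paired with a gap.
    """
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for j in range(1, n + 1):
        E[0][j] = g + (j - 1) * e      # y1..j all paired with gaps
    for i in range(1, m + 1):
        F[i][0] = g + (i - 1) * e      # x1..i all paired with gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
            E[i][j] = max(M[i][j - 1] + g, F[i][j - 1] + g, E[i][j - 1] + e)
            F[i][j] = max(M[i - 1][j] + g, E[i - 1][j] + g, F[i - 1][j] + e)
    return max(M[m][n], E[m][n], F[m][n])


def trivial(a, b):
    # Trivial cost matrix from the affine gap example: match=+1, mismatch=0.
    return 1.0 if a == b else 0.0


print(round(affine_score("ACACCCC", "ACCTT", trivial, -2.0, -0.1), 2))
```

Each cell is filled with a constant number of comparisons, which gives the O(mn) running time.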

Multiple sequence alignment
• “Two sequences whisper, multiple sequences shout out loud” (Arthur Lesk)
• Computationally very hard: finding an optimal sum-of-pairs multiple alignment is NP-hard

Multiple sequence alignment
Unaligned sequences
  GGCTT
  TAGGCCTT
  TAGCCCTTA
  ACACTTC
  ACTT
Aligned sequences
  _G__GCTT_
  TAGGCCTT_
  TAGCCCTTA
  A__CACTTC
  A__C_CTT_
Conserved regions help us to identify functionality

Sum of pairs score

Sum of pairs score
• What is the sum of pairs score of this alignment?
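The sum of pairs score of a multiple alignment adds up, over every pair of rows, the score of the pairwise alignment that the multiple alignment induces on that pair. A minimal sketch follows; the function name `sp_score`, the linear gap penalty, the choice to score gap-gap columns as 0, and the toy alignment are all assumptions made for illustration.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=0, gap_penalty=-1, gap="_"):
    """Sum of pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for a, b in combinations(alignment, 2):   # every pair of rows
        for p, q in zip(a, b):                # every column of that pair
            if p == gap and q == gap:
                continue                      # gap-gap columns score 0 here
            if p == gap or q == gap:
                total += gap_penalty
            else:
                total += match if p == q else mismatch
    return total


toy = ["AC_T",
       "ACGT",
       "A_GT"]
print(sp_score(toy))
```

For k rows of length L this takes O(k^2 L) time, since every pair of rows is compared column by column.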

Profile
• A profile can be described by a set of vectors of nucleotide/residue frequencies.
• For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
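Computing a profile from an alignment can be sketched as follows. The name `profile`, the `_` gap character, and the decision to normalize by the number of non-gap characters in each column are assumptions; other conventions (for example counting gaps as a fifth symbol) are equally possible.

```python
def profile(alignment, alphabet="ACGT", gap="_"):
    """Return one frequency vector (a dict) per alignment column."""
    ncols = len(alignment[0])
    vectors = []
    for c in range(ncols):
        column = [row[c] for row in alignment]
        counts = {a: column.count(a) for a in alphabet}
        total = sum(counts.values())          # gaps are not counted
        vectors.append({a: (counts[a] / total if total else 0.0)
                        for a in alphabet})
    return vectors


for vec in profile(["ACGT", "AGGT", "ACG_", "TCGT"]):
    print(vec)
```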

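Aligning a profile vector to a nucleotide
• ClustalW/MUSCLE
• Let f be the profile vector
• $\mathrm{Score}(f,j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j)$
• where S(i,j) is the substitution scoring matrix and f_i is the frequency of nucleotide i in the profile vector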
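The profile-to-nucleotide score is a frequency-weighted average of substitution scores. A minimal sketch, assuming the profile vector is a dict of frequencies and S is a nested dict; the name `profile_score` and the match=+1, mismatch=-1 matrix are illustrative choices only.

```python
def profile_score(f, j, S):
    """Score of aligning profile vector f to nucleotide j: sum_i f[i] * S[i][j]."""
    return sum(freq * S[i][j] for i, freq in f.items())


alphabet = "ACGT"
# Hypothetical substitution matrix: match = +1, mismatch = -1.
S = {a: {b: (1.0 if a == b else -1.0) for b in alphabet} for a in alphabet}

f = {"A": 0.75, "C": 0.0, "G": 0.0, "T": 0.25}
print(profile_score(f, "A", S))
print(profile_score(f, "T", S))
```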

Iterative alignment (heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to closest one t in S
• Remove t from S and compute profile of alignment
• While sequences remain in S
  • Do |S| pairwise alignments and align to closest one t
  • Remove t from S

Iterative alignment
• Once the alignment is computed, randomly divide it into two parts
• Compute the profile of each sub-alignment and realign the profiles
• If the sum-of-pairs score of the new alignment is better than the previous one then keep it; otherwise continue with a different division until the specified iteration limit

Progressive alignment
• Idea: perform profile alignments in the order dictated by a tree
• Given a guide tree, do a post-order traversal and align sequences in that order
• Widely used heuristic
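The order of profile alignments implied by a guide tree can be listed with a short post-order traversal. In this sketch the tree is a nested tuple of sequence names, which is an assumed representation, and `alignment_order` is a hypothetical helper that only reports which groups get merged, without performing the alignments.

```python
def alignment_order(tree):
    """List the (left_group, right_group) merges in post-order.

    A leaf is a sequence name (str); an internal node is a pair (left, right).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])        # finish the left subtree first
        right = visit(node[1])       # then the right subtree
        steps.append((left, right))  # then align the two resulting profiles
        return left + right

    visit(tree)
    return steps


guide_tree = (("s1", "s2"), ("s3", ("s4", "s5")))
for left, right in alignment_order(guide_tree):
    print(left, "+", right)
```

Each internal node is visited only after both of its children, so every merge aligns two groups that have already been aligned internally.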

Expected accuracy alignment
• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative.
• We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

Posterior probability of xi aligned to yj
• Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.
• We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as
  $P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$
Do et al., Genome Research, 2005

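Expected accuracy of alignment
• We can define the expected accuracy of an alignment a as
  $E[\mathrm{accuracy}(a, a^*)] = \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$
• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, using the posterior probabilities as match scores and no gap penalties
Do et al., Genome Research, 2005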
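Given a matrix of posterior probabilities, the maximum expected accuracy alignment follows from a Needleman-Wunsch style recursion in which the match score is the posterior and gaps cost nothing. The sketch below assumes that formulation; `max_expected_accuracy` is a hypothetical name, P[i][j] holds the posterior for residue i+1 of x and residue j+1 of y, and the 2x2 matrix is made-up data.

```python
def max_expected_accuracy(P):
    """Return (expected accuracy, aligned pairs) for posterior matrix P.

    Pairs are 1-based (i, j) indices meaning x_i is aligned to y_j.
    """
    m, n = len(P), len(P[0])
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],  # align x_i with y_j
                          A[i - 1][j],                        # x_i against a gap
                          A[i][j - 1])                        # y_j against a gap
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:                        # only the diagonal term attains the maximum
            pairs.append((i, j))
            i -= 1
            j -= 1
    pairs.reverse()
    return A[m][n] / min(m, n), pairs


P = [[0.9, 0.1],
     [0.2, 0.7]]
accuracy, pairs = max_expected_accuracy(P)
print(round(accuracy, 3), pairs)
```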

Example for expected accuracy
• True alignment
  AC_CG
  ACCCA
• Expected accuracy = (1+1+0+1+1)/4 = 1
• Estimated alignment
  ACC_G
  ACCCA
• Expected accuracy = (1+1+0.1+0+1)/4 = 0.775

Estimating posterior probabilities
• If correct posterior probabilities can be computed then we can compute the correct alignment. It remains to estimate these probabilities from the data
• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)
• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices

Benchmarking alignment programs
• http://nar.oxfordjournals.org/content/38/15/4917.abstract
