Sequence Similarity

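The Viterbi algorithm for alignment
• Compute the following matrices (DP)
• M(i, j): most likely alignment of x1…xi with y1…yj ending in state M
• I(i, j): most likely alignment of x1…xi with y1…yj ending in state I
• J(i, j): most likely alignment of x1…xi with y1…yj ending in state J
• δ: probability of opening a gap; ε: probability of extending a gap
$$M(i, j) = \log P(x_i, y_j) + \max\{\, M(i-1, j-1) + \log(1 - 2\delta),\ I(i-1, j-1) + \log(1 - \epsilon),\ J(i-1, j-1) + \log(1 - \epsilon) \,\}$$
$$I(i, j) = \max\{\, M(i-1, j) + \log\delta,\ I(i-1, j) + \log\epsilon \,\}$$
$$J(i, j) = \max\{\, M(i, j-1) + \log\delta,\ J(i, j-1) + \log\epsilon \,\}$$
[Figure: pair-HMM state diagram. State M emits pairs with P(xi, yj); states I and J emit P(xi) and P(yj). Transition scores: M to M log(1 − 2δ); M to I and M to J log δ; I to I and J to J log ε; I to M and J to M log(1 − ε).]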
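The recurrences can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes state M takes its predecessors from cell (i-1, j-1), treats the emissions of I and J as log 1 = 0 (as the recurrences do), and the function name `pair_hmm_viterbi` and the emission callback `p_match` are hypothetical.

```python
import math

def pair_hmm_viterbi(x, y, p_match, delta, eps):
    """Log-probability of the best state path in a 3-state pair HMM (M, I, J).

    p_match(a, b): hypothetical emission probability of the pair (a, b) in state M.
    delta: gap-open probability; eps: gap-extend probability.
    Emissions in I and J are left out (log 1 = 0), an assumption of this sketch.
    """
    neg = float("-inf")
    n, m = len(x), len(y)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    I = [[neg] * (m + 1) for _ in range(n + 1)]
    J = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # start in M with nothing emitted
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # M consumes x_i and y_j
                M[i][j] = math.log(p_match(x[i - 1], y[j - 1])) + max(
                    M[i - 1][j - 1] + math.log(1 - 2 * delta),
                    I[i - 1][j - 1] + math.log(1 - eps),
                    J[i - 1][j - 1] + math.log(1 - eps),
                )
            if i > 0:  # I consumes x_i only
                I[i][j] = max(M[i - 1][j] + math.log(delta),
                              I[i - 1][j] + math.log(eps))
            if j > 0:  # J consumes y_j only
                J[i][j] = max(M[i][j - 1] + math.log(delta),
                              J[i][j - 1] + math.log(eps))
    return max(M[n][m], I[n][m], J[n][m])

def toy_emission(a, b):
    """Toy emission values for illustration only (not a normalized distribution)."""
    return 0.9 if a == b else 0.1
```

A traceback over the three matrices (not shown) would recover the alignment itself rather than only its score.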

One way to view the state paths
[Figure: three grids, one each for State M, State I, and State J, each with x1 … xm along one axis and y1 … yn along the other.]

Putting it all together
• States I(i, j) are connected with states I and M at (i-1, j)
• States J(i, j) are connected with states J and M at (i, j-1)
• States M(i, j) are connected with states M, I, and J at (i-1, j-1)
• Optimal solution is the best scoring path from top-left to bottom-right corner
• This gives the likeliest alignment according to our HMM
[Figure: grid with x1 … xm along one axis and y1 … yn along the other.]

Yet another way to represent this model
[Figure: chain of match states Mx1 … Mxm along Sequence X between BEGIN and END, with insert states Ix and Iy.]
• We are aligning, or threading, sequence Y through sequence X
• Every time yj lands in state xi, we get substitution score s(xi, yj)
• Every time yj is gapped, or some xi is skipped, we pay gap penalty

From this model, we can compute additional statistics
• P(xi ~ yj | x, y): the probability that positions i, j align, given that sequences x and y align
$$P(x_i \sim y_j \mid x, y) = \sum_{\alpha} P(\alpha \mid x, y)\, \mathbf{1}(x_i \sim y_j \in \alpha)$$
where the sum ranges over all alignments α of x and y
• We will not cover the details, but this quantity can also be calculated with DP
[Figure: the pair-HMM state diagram with states M, I, and J.]

Fast database search: BLAST (Basic Local Alignment Search Tool)
Main idea:
• Construct a dictionary of all the words in the query
• Initiate a local alignment for each word match between query and DB
Running time: O(MN) in the worst case; in practice, however, orders of magnitude faster than Smith-Waterman
[Figure: query words matched against the DB.]

BLAST: Original Version
• Dictionary: all words of length k (~11 nucl.; ~4 aa). Alignment initiated between words of alignment score ≥ T (typically T = k)
• Alignment: ungapped extensions until score falls below statistical threshold
• Output: all local alignments with score > statistical threshold
[Figure: query words scanned against the DB.]
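The dictionary and ungapped-extension steps can be illustrated with a toy seed-and-extend routine. This is a simplified sketch, not BLAST itself: it seeds only on exact word matches (no neighborhood words scoring at least T), and it stops extending with a simple drop-off rule (`xdrop`) that stands in for the statistical threshold. All function names and scoring values are assumptions.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def extend(query, db, qi, di, step, match, mismatch, xdrop):
    """Ungapped extension from (qi, di) in direction step (+1 or -1).

    Returns (best score gain, number of columns kept)."""
    best = score = 0
    best_len = n = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        score += match if query[qi] == db[di] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= xdrop:  # fell too far below the best: stop
            break
        qi += step
        di += step
    return best, best_len

def seed_and_extend(query, db, k, match=1, mismatch=-1, xdrop=3):
    """Return (score, query_start, db_start, length) for every word hit."""
    index = build_word_index(query, k)
    hits = []
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], []):
            right, rlen = extend(query, db, q + k, d + k, 1, match, mismatch, xdrop)
            left, llen = extend(query, db, q - 1, d - 1, -1, match, mismatch, xdrop)
            hits.append((k * match + left + right, q - llen, d - llen, k + llen + rlen))
    return hits
```

Several seeds on the same diagonal can report the same extended hit; a fuller version would deduplicate them and filter by score.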

PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M (the profile)
   • Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence

BLAST Variants • BLASTN – genomic sequences • BLASTP – proteins • BLASTX – translated genome versus proteins • TBLASTN – proteins versus translated genomes • TBLASTX – translated genome versus translated genome • PSIBLAST – iterated BLAST search http://www.ncbi.nlm.nih.gov/BLAST

Multiple Sequence Alignments

Protein Phylogenies • Proteins evolve by both duplication and species divergence

Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
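A short sketch of how induced pairwise alignments lead to a sum-of-pairs score: for each pair of rows, drop the columns where both rows have a gap, then score what remains. The scoring values (match +1, mismatch -1, gap -2) and the function names are illustrative assumptions, not values from the slides.

```python
from itertools import combinations

def induced_pair(a, b):
    """Pairwise alignment induced by two rows of a multiple alignment:
    columns in which both rows have a gap are removed."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)

def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Sum the scores of all induced pairwise alignments (linear gap cost)."""
    total = 0
    for a, b in combinations(rows, 2):
        ia, ib = induced_pair(a, b)
        for p, q in zip(ia, ib):
            if p == "-" or q == "-":
                total += gap
            elif p == q:
                total += match
            else:
                total += mismatch
    return total

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # x, y, z from the example
```

With these toy scores, the three induced alignments of the example score 0 (x, y), -4 (x, z), and 3 (y, z).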

Sum Of Pairs (cont'd)
• Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Duck, Chicken.]
• Weighted SOP:
$$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$$
• $w_{kl}$: weight decreasing with distance

A Profile Representation
• Given a multiple alignment M = m1…mn
• Replace each column mi with profile entry pi
  • Frequency of each letter in Σ
  • # gaps
  • Optional: # gap openings, extensions, closings
• Can think of this as a "likelihood" of each letter in each position

Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Profile (one column per alignment column):
col  1   2   3   4   5   6   7   8   9   10  11  12  13  14
A    0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C    .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G    0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T    .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-    .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0
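As a small illustration, the profile of the example alignment can be computed by counting letters column by column. The function name `profile` and the dictionary-per-column layout are choices made for this sketch; gap characters are written as ASCII "-".

```python
from collections import Counter

def profile(rows, alphabet="ACGT-"):
    """Per-column letter frequencies of a multiple alignment.

    Returns one dict per alignment column, mapping each symbol
    (including the gap "-") to its frequency in that column."""
    n = len(rows)
    result = []
    for column in zip(*rows):  # iterate over alignment columns
        counts = Counter(column)
        result.append({c: counts[c] / n for c in alphabet})
    return result

ROWS = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
P = profile(ROWS)
```

Here `P[0]` is the first column, with C at 0.6 and both T and the gap at 0.2.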

Multiple Sequence Alignments Algorithms

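Multidimensional DP
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \ldots, i_N)$: score of the optimal alignment up to $(i_1, \ldots, i_N)$
$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$, where the max is over all neighbors of the cell in the N-dimensional cube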

Multidimensional DP
• Example: in 3D (three sequences):
• 7 neighbors/cell
$$
F(i,j,k) = \max \left\{
\begin{array}{l}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{array}
\right.
$$
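The three-sequence recurrence can be written directly as a loop over the cube. The sketch below computes only the optimal score; the column score used here (sum of pairs with match +1, mismatch -1, gap -2, gap-gap 0) and the function names are assumptions made for illustration.

```python
from itertools import combinations, product

def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score of one alignment column (gap-gap pairs score 0)."""
    s = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        s += gap if "-" in (a, b) else (match if a == b else mismatch)
    return s

# The 7 neighbors of a cell: every non-empty choice of which sequences advance.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3_score(x, y, z, col_score=sp_column):
    """Optimal 3-way global alignment score by DP over an (|x|+1)(|y|+1)(|z|+1) cube."""
    F = {(0, 0, 0): 0}
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:  # lexicographic order: predecessors come first
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            col = (x[i - 1] if di else "-",
                   y[j - 1] if dj else "-",
                   z[k - 1] if dk else "-")
            best = max(best, F[pi, pj, pk] + col_score(col))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]
```

The same pattern extends to N sequences by using `repeat=N`, which makes the exponential cost of this approach easy to see.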

Multidimensional DP
Running Time:
• Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
• Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$

Multidimensional DP
• How do gap states generalize?
• VERY badly!
• Require $2^N$ states, one per combination of gapped/ungapped sequences
• Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
[Figure: the gap states for three sequences X, Y, Z: X, Y, Z, XY, XZ, YZ, XYZ.]

Progressive Alignment
• When evolutionary tree is known:
  • Align closest first, in the order of the tree
  • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
  • Tree edges have weights, proportional to the divergence in that edge
  • New profile is a weighted average of two old profiles
[Figure: tree with leaves x, y, z, w; internal nodes pxy and pzw; root pxyzw.]

Progressive Alignment: Example
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
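The profile arithmetic in this example can be checked with a few lines of code. The substitution function `toy_s` (match +1, mismatch -1, gap -2, gap-gap 0) is an assumption chosen only to make the sketch runnable; the averaging uses equal weights, which reproduces the results in the example.

```python
ALPHABET = "ACGT-"

def toy_s(a, b):
    """Hypothetical substitution/gap scores, for illustration only."""
    if a == "-" and b == "-":
        return 0
    if "-" in (a, b):
        return -2
    return 1 if a == b else -1

def profile_score(px, py, s=toy_s):
    """Expected score of aligning two profile columns:
    sum over all letter pairs of px[a] * py[b] * s(a, b)."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
gap_column = (0, 0, 0, 0, 1.0)  # a column consisting only of gaps
```

In the weighted version, `wx` and `wy` would be derived from the tree edge weights instead of being fixed at 0.5.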

Progressive Alignment
• When evolutionary tree is unknown:
  • Perform all pairwise alignments
  • Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  • Construct a tree
  • Align on the tree
[Figure: sequences x, y, z, w with the tree still unknown.]
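One common way to construct a tree from a distance matrix is UPGMA hierarchical clustering. The steps above do not prescribe a particular method, so the sketch below is just one possible choice; the function name, the nested-tuple tree representation, and the example distances are all assumptions.

```python
from itertools import combinations

def upgma(names, distances):
    """Build a guide tree by UPGMA.

    names: list of sequence labels.
    distances: dict mapping a pair (a, b) of labels to their distance.
    Returns the tree as nested tuples, e.g. (("x", "y"), ("z", "w"))."""
    size = {name: 1 for name in names}  # cluster -> number of leaves
    dist = {frozenset(pair): d for pair, d in distances.items()}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        # Distance from the merged cluster is the size-weighted average.
        for c in size:
            if c not in (a, b):
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size.pop(a) + size.pop(b)
    return next(iter(size))

example_distances = {
    ("x", "y"): 2, ("z", "w"): 2,
    ("x", "z"): 6, ("x", "w"): 6, ("y", "z"): 6, ("y", "w"): 6,
}
tree = upgma(["x", "y", "z", "w"], example_distances)
```

The resulting nested tuples give the order in which a progressive aligner would merge sequences and profiles.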

Heuristics to improve alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …

Iterative Refinement
One problem of progressive alignment:
• Initial alignments are "frozen" even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove xj, and realign it to x1…xj-1xj+1…xN
2. Repeat step 1 until convergence
[Figure: x and z held fixed while y is allowed to vary; projection.]

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
Variant: refinement on a tree ("tree partitioning")

Iterative Refinement
Example not handled well:
x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA
• Realigning any single yi changes nothing

Some Resources
• http://www.ncbi.nlm.nih.gov/BLAST : BLAST & PSI-BLAST
• http://www.ebi.ac.uk/clustalw/ : CLUSTALW (most widely used)
• http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py : MUSCLE (most scalable)
• http://probcons.stanford.edu/ : PROBCONS (most accurate)

MUSCLE at a glance
• Fast measurement of all pairwise distances between sequences
  • D_DRAFT(x, y) defined in terms of # common k-mers (k~3); O(N^2 L log L) time
• Build tree T_DRAFT based on D_DRAFT, with a hierarchical clustering method (UPGMA)
• Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
• Measure distances D(x, y) based on M_DRAFT
• Build tree T based on D
• Progressive alignment over T, to build M
• Iterative refinement; for many rounds, do:
  • Tree partitioning: split M on one branch and realign the two resulting profiles
  • If new alignment M' has better sum-of-pairs score than previous one, accept
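A k-mer based draft distance can be sketched as follows. The exact formula MUSCLE uses is not given here, so this is an assumed stand-in: one minus the fraction of k-mers shared between the two sequences, counted with multiplicity. The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all length-k words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """Assumed draft distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    Both sequences must have length >= k."""
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    possible = min(len(x), len(y)) - k + 1
    return 1.0 - shared / possible
```

No alignment is needed to compute this, which is why such a distance is cheap enough for a first draft tree.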

PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Figure: pair HMM with states MATCH (emits xi and yj), INSERT X (emits xi against a gap), and INSERT Y (emits yj against a gap).]

A pair-HMM model of pairwise alignment
[Figure: pair HMM with states MATCH (emits xi and yj), INSERT X (emits xi against a gap), and INSERT Y (emits yj against a gap).]
• Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
• Transition probabilities ~ gap penalties
• Emission probabilities ~ substitution matrix
Example alignment:
x: ABRACA-DABRA
y: AB-ACARDI---

Computing Pairwise Alignments
• The conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y
• The Viterbi algorithm identifies the highest-probability alignment, α_viterbi, in O(L^2) time
• Caveat: the most likely alignment is not necessarily the most accurate
• Alternative: find the alignment of maximum expected accuracy
[Figure: the distribution P(α | x, y) over alignments, with α_viterbi marked.]

The Lazy-Teacher Analogy
• 10 students take a 10-question true-false quiz
• How do you make the answer key?
• Approach #1: Use the answer sheet of the best student!
• Approach #2: Weighted majority vote!
[Figure: the ten students' answers to question 4 (T or F), shown with their grades, which range from A to C.]

Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
• picks single alignment with highest chance of being completely correct
• mathematically, finds the alignment α that maximizes $E_{\alpha^*}[\mathbf{1}\{\alpha = \alpha^*\}]$
Maximum Expected Accuracy
• picks alignment with highest expected number of correct predictions
• mathematically, finds the alignment α that maximizes $E_{\alpha^*}[\text{accuracy}(\alpha, \alpha^*)]$
[Figure: the lazy-teacher quiz again; the students' answers to question 4 with their grades.]
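The lazy-teacher analogy can be made concrete with a toy example. Taking the best student's whole sheet plays the role of Viterbi, and a per-question weighted vote plays the role of maximum expected accuracy. The answer sheets, weights, and function names below are invented for illustration.

```python
def best_student_key(sheets, weights):
    """Approach #1: copy the whole answer sheet of the highest-weighted student."""
    best = max(range(len(sheets)), key=lambda i: weights[i])
    return sheets[best]

def weighted_vote_key(sheets, weights):
    """Approach #2: decide each question separately by weighted majority vote."""
    key = []
    for q in range(len(sheets[0])):
        total = sum(w if sheet[q] == "T" else -w
                    for sheet, w in zip(sheets, weights))
        key.append("T" if total > 0 else "F")
    return "".join(key)

# Three hypothetical students, two questions; weights reflect how much we trust each.
sheets = ["TT", "FT", "FF"]
weights = [3, 2, 2]
```

Here the best student answers "TT", but on the first question the other two students together outweigh that student, so the vote yields "FT".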
