Chaps. 12 & 13: Multiple Sequence Alignment

• Pairwise Alignment
• Dynamic Programming
• Multi-sequence Alignment
• FASTA (Fast Alignment)
• BLAST (Basic Local Alignment Search Tool)

FASTA
• Pearson and Lipman, 1988 (FASTA: Fast Alignment)
• http://www.ebi.ac.uk/Tools/sss/fasta/
• Steps
  • Find exact matches between subsequences of the query, of length at least ktup, and subsequences of the database sequences (ktup default: 2 amino acids)
  • Search for diagonal regions in the alignment matrix that contain many subsequence matches with small distances between them
  • Then see whether these initial regions can be joined by allowing gaps
• Time saved
  • Dynamic programming is performed only on the initially filtered sequences, which are already similar
  • Only pathways through the alignment matrix that remain within a band centered on the highest-scoring initial regions are considered
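The first two FASTA steps can be sketched as follows. This is a minimal illustration, not the FASTA implementation: the function name `ktup_diagonal_hits` and the short example strings are assumptions made for the sketch. It indexes the k-tuples of the target, then counts exact k-tuple matches on each diagonal (query position minus target position) of the alignment matrix.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, target, ktup=2):
    """Count exact k-tuple matches on each diagonal (i - j) of the alignment matrix."""
    # Index every k-tuple of the target by its start positions.
    index = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        index[target[j:j + ktup]].append(j)
    # A shared k-tuple at query position i and target position j lies on diagonal i - j.
    diagonals = defaultdict(int)
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            diagonals[i - j] += 1
    return dict(diagonals)

# Hypothetical short sequences: GT, TW and WY match on the main diagonal.
hits = ktup_diagonal_hits("FSGTWYA", "VAGTWYS", ktup=2)
best_diagonal = max(hits, key=hits.get)
```

Diagonals with many hits are the candidates that the later steps try to join with gaps.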

BLAST
• Exact methods are good and can pick up very distant relationships
• Approximate methods detect only close relationships well
  • They may be adequate when the probe sequence is fairly similar to one or more sequences in the databank
• Take a small k-tuple (k residues) from the probe sequence and find all instances of it in the database
• For the selected candidate sequences, an approximate optimal alignment is performed
• Particularly useful in multi-sequence alignments

BLAST
• Detects
  • the best region of local alignment between a query and the target
  • whether there are other plausible alignments
• Computational efficiency comes from “seeding” the search with a small subset of substrings in the query
• Substrings from two sequences may be highly conserved in biological applications
• Temple Smith and Michael Waterman, 1981
• Biologically relevant diagonal matches are likely to have a higher score

Word Length and Threshold
• Select a word length w (similar to ktup; default is 3 amino acids) and a threshold T
• Given a word length w, scan the database for words of length w that score higher than the threshold T
• Example: for a human RBP query …FSGTWYA… (query word shown in red on the slide)
• A list of words (w=3) is:
    FSG SGT GTW TWY WYA
    YSG TGT ATW SWY WFA
    FTG SVT GSW TWF WYS

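Neighborhood words for the query word GTW, scored according to BLOSUM62 (T = 11)

    Word   Residue scores   Total
    GTW    6, 5, 11         22      neighborhood word hit (above threshold)
    GSW    6, 1, 11         18      neighborhood word hit (above threshold)
    ATW    0, 5, 11         16      neighborhood word hit (above threshold)
    NTW    0, 5, 11         16      neighborhood word hit (above threshold)
    GTY    6, 5, 2          13      neighborhood word hit (above threshold)
    GNW                     10      below threshold
    GAW                     9       below threshold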
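The word scores above can be reproduced with a small sketch. Only the BLOSUM62 entries quoted on the slide are included, so the dictionary `BLOSUM62_SUBSET` is deliberately incomplete, and the helper names are assumptions made for illustration. The comparison `>= threshold` is also an assumed boundary convention (the slide says "higher than"); it makes no difference for these words.

```python
# Only the BLOSUM62 entries quoted on the slide; not the full matrix.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, candidate):
    """Sum the per-residue substitution scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))

def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words whose score reaches the threshold T."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]

candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
scores = {w: word_score("GTW", w) for w in candidates}
kept = neighborhood("GTW", candidates, threshold=11)
```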

Effect of Threshold
• You can modify the threshold parameter.
• The default value for blastp is 11.
• To change it, enter “-f 16” or “-f 5” in the advanced options (this is the legacy blastall flag; in the BLAST+ command-line programs the corresponding option is “-threshold”).
• (To find BLAST+ go to BLAST → help → download.)

• 2. Scan the database for entries matching the compiled list
• 3. Extend each hit in both directions; extension terminates when the score falls more than a certain distance below the best score reached so far

    KENFDKARFSGTWYAMAKKDPEG   50   RBP (query)
    MKGLDIQKVAGTWYSLAMAASD.   44   lactoglobulin (hit)
      ← extend   Hit!   extend →
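The termination rule in step 3 can be sketched for one direction as below. This is an illustration under simplifying assumptions, not BLAST's code: the function name `extend_right`, the +1/-1 match/mismatch scores (BLAST uses a substitution matrix) and the `x_drop` value are all hypothetical. The extension stops once the running score has fallen more than `x_drop` below the best score seen so far, and the best-scoring prefix is reported.

```python
def extend_right(query, target, q_start, t_start, x_drop=2, match=1, mismatch=-1):
    """Ungapped extension to the right of a hit.

    Returns (best_score, best_length): the highest running score and the
    number of residue pairs included to reach it.
    """
    score = best = 0
    best_len = steps = 0
    i, j = q_start, t_start
    while i < len(query) and j < len(target):
        score += match if query[i] == target[j] else mismatch
        steps += 1
        if score > best:
            best, best_len = score, steps
        elif best - score > x_drop:
            break  # fell too far below the best score reached so far
        i += 1
        j += 1
    return best, best_len

# Four matches, then a run of mismatches that triggers termination.
result = extend_right("AAAACCCC", "AAAAGGGG", 0, 0, x_drop=2)
```

Extension to the left works the same way with the indices decreasing.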

• Identify all exact matches of k-tuple words (no gaps, no mismatches)
• Extend exact matches in both directions (no gaps, no mismatches)
• Put extended matches together with mismatches and gaps, only in limited regions containing preliminary matches

BLAST, 1997
• Refined to require two independent hits
• The hits must occur in close proximity to each other
• With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search

Comparison of Search Methods
• FASTA vs. BLAST
  • FASTA is more sensitive for DNA-DNA searches, especially for highly diverged sequences
  • BLAST is better at finding short regions of high similarity, while FASTA is better at finding long regions of lower similarity
  • BLAST will miss similar sequences if they do not share a single identical word
• Protein similarity searches can find more distant similarities
  • DNA has only four letters, so the probability of chance matches is much greater
  • Protein databanks are much smaller, and searches can be more sensitive
• William Pearson (FASTA author):
  • “The number one thing that you should learn is that in general, you should try not do DNA sequence comparison.”
• A protein-level search with BLASTX (translated DNA query against a protein database) produces more sensitive results than a DNA-DNA search

BLAST (Basic Local Alignment Search Tool)

    Program   Input     Database   Number of comparisons
    blastn    DNA       DNA        1
    blastp    protein   protein    1
    blastx    DNA       protein    6
    tblastn   protein   DNA        6
    tblastx   DNA       DNA        36

DNA can encode six proteins

    5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
    3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

• Forward frames (top strand, read 5’ to 3’): CAT CAA … / ATC AAC … / TCA ACT …
• Reverse frames (bottom strand, read 5’ to 3’): GTG GGT … / TGG GTA … / GGG TAG …
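The six reading frames of the slide's example sequence can be enumerated with a short sketch. The function names are chosen for illustration; the sketch only splits each frame into codons and does not translate them.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse so the result reads 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return codon lists for the three forward and three reverse frames."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Stop at len - 2 so that every codon has exactly three bases.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames

dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(dna)
first_codons = [frame[:2] for frame in frames]
```

This is why blastx and tblastn involve 6 comparisons and tblastx 36: each translated sequence contributes six frames.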

Effect of word size
• For blastn, the word size is typically 7, 11, or 15 (EXACT match). Changing the word size is like changing the threshold for proteins.
• w=15 gives fewer matches and is faster than w=11 or w=7.
• For megablast, the word size is 28 and can be adjusted to 64. What will this do? Megablast is VERY fast for finding closely related DNA sequences!

Word-matching problem
• Consider two sequences of lengths N and M
• A simple local alignment algorithm looks for the longest exactly matching word
• Score for alignment = the length of the longest match (l)
  • e.g., l=6
• Let n denote the length of matching words starting at two random sites (in red on the slide; for example, n = 3 for ‘TCC’)
• Then, l=max(n)
• F(l) has an extreme value distribution (EVD)

    GGATATCCAGCGCTCC
    TATCCGATATCTTGG
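A minimal sketch of this "longest exactly matching word" score follows. It assumes the two example sequences on the slide are GGATATCCAGCGCTCC and TATCCGATATCTTGG (the split between the two is a reading of the flattened slide text), and the function name is hypothetical. The table entry for (i, j) holds the length n of the matching word ending at those two sites, so the score is l = max(n).

```python
def longest_common_word(a, b):
    """Return the longest exactly matching word (contiguous substring) of a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous pair of sites.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

word = longest_common_word("GGATATCCAGCGCTCC", "TATCCGATATCTTGG")
score = len(word)
```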

Word-matching problem: theory
• Prob[two bases at randomly selected sites are equal] = $a$
• $n$: length of the run of matching bases starting at these random sites
• $P[n \ge l] = a^{l}$
• Define $\lambda = -\ln(a) = \ln(1/a)$, so that $P[n \ge l] = e^{-\lambda l}$
• Compute $E[l]$, the expected number of matches of length at least $l$
  • There are $N$ ways of choosing the first starting site and $M$ ways for the second
  • If the $NM$ selected pairs of sites were independent, $E[l] = NM e^{-\lambda l}$
  • However, words starting at different sites overlap and are not independent
  • Thus, $E[l] = kNM e^{-\lambda l}$ with $k < 1$
• Since $P[n < l] = 1 - e^{-\lambda l}$, $n$ follows the exponential distribution with density $P[n = l] = \lambda e^{-\lambda l}$
• Distribution of the longest matching word, of length $l$:
  • $F(l) = \lambda e^{-\lambda (l-u)} \exp(-e^{-\lambda (l-u)})$, with $u = \ln(kMN)/\lambda$

EVD Distribution
• e.g., number of sequences S=2000; length of each sequence N=200; only two bases, C and G, with equal probability
• $m$ (the number of matching positions between two sequences) has a binomial distribution
• $m_{max}$, the largest $m$ over the S sequences, is EVD distributed
• Thus, $F(m_{max})$ is NOT a Gaussian distribution
• $F(m_{max})$ is an extreme value distribution (EVD), or Gumbel distribution
• $F(m_{max}) = \lambda e^{-\lambda (m_{max}-u)} \exp(-e^{-\lambda (m_{max}-u)})$
• Single peak, skewed to one side
• An EVD arises whenever we deal with the maximum value taken from a large number of independent alternatives
• Thus, $m_{max}$ is likely to be considerably higher than the value of $m$ obtained from just two typical sequences

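EVD distribution
• Take a random variable $x$ with distribution $P(x)$ and a large sample of size $S$
• Find $F(x_{max})$, the distribution of the largest value $x_{max}$ in the sample
• $\mathrm{Prob}[x < x_{max}] = \int_{-\infty}^{x_{max}} P(x)\,dx$
• $F(x_{max})$ = Prob[choose one with $x_{max}$ and the rest $< x_{max}$] $= S \, P(x_{max}) \, \{\mathrm{Prob}[x < x_{max}]\}^{S-1}$
• When $P(x) = \lambda e^{-\lambda x}$, $\mathrm{Prob}[x < x_{max}] = 1 - e^{-\lambda x_{max}}$
• $F(x_{max}) = S \lambda e^{-\lambda x_{max}} (1 - e^{-\lambda x_{max}})^{S-1} \approx S \lambda e^{-\lambda x_{max}} \exp(-S e^{-\lambda x_{max}})$, using $(1-a)^{n} \approx \exp(-na)$
• Set $u = \ln(S)/\lambda$, i.e. $S = \exp(\lambda u)$
• $F(x_{max}) = \lambda e^{-\lambda (x_{max}-u)} \exp(-e^{-\lambda (x_{max}-u)})$
• Single peak at $x_{max} = u$
• The width of the peak is controlled by $\lambda$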
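The approximation step in this derivation can be checked numerically. The sketch below compares the exact density of the maximum of S independent exponential variables with the Gumbel (EVD) form; the function names and the values S = 2000, λ = 1 are chosen only for illustration.

```python
import math

def exact_max_density(x, S, lam):
    """Exact density of the maximum of S independent exponential(lam) variables."""
    return S * lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * x)) ** (S - 1)

def gumbel_density(x, S, lam):
    """EVD (Gumbel) approximation with u = ln(S) / lam."""
    u = math.log(S) / lam
    z = math.exp(-lam * (x - u))
    return lam * z * math.exp(-z)

S, lam = 2000, 1.0
u = math.log(S) / lam
# Largest absolute difference between the two densities on a grid around u.
grid = [u + 0.25 * step for step in range(-12, 25)]
max_gap = max(abs(exact_max_density(x, S, lam) - gumbel_density(x, S, lam)) for x in grid)
peak_value = gumbel_density(u, S, lam)  # equals lam / e at the peak x = u
```

For large S the two curves are nearly indistinguishable, which is what justifies replacing $(1-a)^n$ by $\exp(-na)$.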

[Figure: the probability density function of the EVD (characteristic value u = 0, decay constant λ = 1) compared with the normal distribution; x axis from -5 to 5, probability axis from 0 to 0.40]

Significance of matches
• Consider the significance of a match
• The observed value of the top-hit score for the query is $m_{obs}$
• The probability of obtaining a value $m_{max} \ge m_{obs}$ by chance is given by the area under the tail of the distribution: $p(m_{obs}) = 1 - \exp(-\exp(-\lambda (m_{obs} - u)))$
• A small p implies that the match is less likely to have arisen by chance (the smaller p, the greater the significance)
• In the same example, $m_{obs} = 130$ gives p = 3.3% (just significant at the 5% level)
• As S increases, $F(m_{max})$ shifts to the right, so a higher score is needed to reach the same significance
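The tail probability can be written as a one-line function. The parameter values used below (u = 100, 110 and λ = 0.5) are placeholders for illustration only; the slide does not give the u and λ behind its 3.3% example, so that number is not reproduced here.

```python
import math

def evd_p_value(m_obs, u, lam):
    """Probability that the top chance score is at least m_obs under the EVD."""
    return 1.0 - math.exp(-math.exp(-lam * (m_obs - u)))

# Placeholder parameters, not taken from the slides.
p_at_peak = evd_p_value(100.0, u=100.0, lam=0.5)    # 1 - 1/e at m_obs = u
p_far_tail = evd_p_value(120.0, u=100.0, lam=0.5)
p_shifted = evd_p_value(120.0, u=110.0, lam=0.5)    # larger u: distribution moved right
```

With a larger u (a larger database), the same observed score yields a larger p, so it is less significant.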

Alignment stats (reality)
• BLAST and similar tools work by looking for high-scoring local alignments
• When gaps are not allowed, pairwise local alignment scores have been shown to be EVD distributed (Karlin and Altschul, 1990)
• With gaps, scores are believed to be EVD distributed as well
• But the EVD parameters $\lambda$ and $u$ are not known and have to be estimated empirically
• Once the scoring system and the EVD parameters for a given search algorithm are known, one can estimate the significance of a match
• E: expected number of sequences with a score ≥ the observed score S
• $E(S) = kMN \exp(-\lambda S)$ (N: length of the query sequence; M: total length of all the sequences in the database)

BLAST Result: E value
• The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
• An E value is related to a probability value p.
• The key equation describing an E value is:
  • $E = Kmn\,e^{-\lambda S}$

$E = Kmn\,e^{-\lambda S}$
• This equation is derived from a description of the extreme value distribution
• S = the score
• E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
• m, n = the lengths of the two sequences
• λ, K = Karlin-Altschul statistics

Properties
• The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
• The expected score for aligning a pair of random residues must be negative! Otherwise, long random alignments would acquire high scores.
• Parameter K describes the search space (database).
• For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.

From Raw scores to Bit scores
• There are two kinds of scores:
  • raw scores (calculated from a substitution matrix) and
  • bit scores (normalized scores)
• Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes
• $S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
• The E value corresponding to a given bit score is:
  • $E = mn\,2^{-S'}$
• Bit scores allow you to compare results between different database searches, even when different scoring matrices are used.
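The two E-value formulas agree by construction, which a short sketch can confirm. The values of λ, K, m and n below are placeholders chosen for illustration, not parameters reported by any BLAST run, and the function names are hypothetical.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

# Placeholder parameters (assumptions for this example only).
lam, K = 0.3, 0.1
m, n = 250, 1_000_000
raw = 80
bits = bit_score(raw, lam, K)
e_raw = e_value_from_raw(raw, m, n, lam, K)
e_bits = e_value_from_bits(bits, m, n)
```

Because λ and K are absorbed into S', the bit-score form needs only the search-space size mn.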

E and p
• The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
• A p value is a different way of representing the significance of an alignment.
• $p = 1 - e^{-E}$
• Default value of E: 10
• p < 0.05 is considered to be significant

Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values.

    E        p
    10       0.99995460
    5        0.99326205
    2        0.86466472
    1        0.63212056
    0.1      0.09516258  (about 0.1)
    0.05     0.04877058  (about 0.05)
    0.001    0.00099950  (about 0.001)
    0.0001   0.0001000
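The table can be regenerated directly from $p = 1 - e^{-E}$. In this sketch the function name `p_from_e` is an assumption; `math.expm1` is used so that very small E values do not lose precision.

```python
import math

def p_from_e(e_value):
    """p = 1 - exp(-E), computed as -expm1(-E) for accuracy at small E."""
    return -math.expm1(-e_value)

e_values = [10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001]
table = {e: p_from_e(e) for e in e_values}
for e, p in table.items():
    print(f"{e:<8} {p:.8f}")
```

For small E the series $1 - e^{-E} = E - E^2/2 + \dots$ shows why p and E nearly coincide.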

Two problems standard BLAST cannot solve
• Use human beta globin as a query against human RefSeq proteins, and blastp does not “find” human myoglobin. This is because the two proteins are too distantly related. PSI-BLAST at NCBI as well as hidden Markov models easily solve this problem.
• How can we search using 10,000 base pairs as a query, or even millions of base pairs? Many BLAST-like tools for genomic DNA are available, such as PatternHunter, Megablast, BLAT, and BLASTZ.
Page 141

Position specific iterated BLAST: PSI-BLAST
• The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query.
Page 146

PSI-BLAST is performed in five steps
• [1] Select a query and search it against a protein database
• [2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
Page 146

Inspect the blastp output to identify empirical “rules” regarding the amino acids tolerated at each position, e.g.:
    R,I,K   C   D,E,T   K,R,T   N,L,Y,G

Position-specific scoring matrix (PSSM). Columns: the 20 amino acids. Rows: all the amino acids from position 1 to the end of your PSI-BLAST query protein.

    Pos AA    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      1 M    -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
      2 K    -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
      3 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
      4 V     0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
      5 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
      6 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
      7 L    -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
      8 L    -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
      9 L    -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
     10 L    -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
     11 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
     12 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
     13 W    -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
     14 A     3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
     15 A     2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
     16 A     4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
    ...
     37 S     2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
     38 G     0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
     39 T     0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
     40 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
     41 Y    -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
     42 A     4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0

(Same PSSM as above.) Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on the position in the protein: compare positions 6, 14, 15, and 16.

(Same PSSM as above.) Note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan, depending on the position in the protein: compare positions 3 and 13.
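The position dependence can be made concrete with a lookup sketch. Only four rows of the slide's PSSM are copied in (positions 3, 6, 13 and 14); the names `PSSM_ROWS` and `pssm_score` are hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order used on the slide

# Four rows copied from the slide's PSSM: position -> (query residue, scores).
PSSM_ROWS = {
    3:  ("W", [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3]),
    6:  ("A", [5, -2, -2, -2, -1, -1, -1, 0, -2, -2, -2, -1, -1, -3, -1, 1, 0, -3, -2, 0]),
    13: ("W", [-2, -3, -4, -4, -2, -2, -3, -4, -3, 1, 4, -3, 2, 1, -3, -3, -2, 7, 0, 0]),
    14: ("A", [3, -2, -1, -2, -1, -1, -2, 4, -2, -2, -2, -1, -2, -3, -1, 1, -1, -3, -3, -1]),
}

def pssm_score(position, residue):
    """Score for aligning `residue` against the given query position."""
    _, row = PSSM_ROWS[position]
    return row[AMINO_ACIDS.index(residue)]

alanine_scores = (pssm_score(6, "A"), pssm_score(14, "A"))
tryptophan_scores = (pssm_score(3, "W"), pssm_score(13, "W"))
```

A fixed substitution matrix would give the same alanine-alanine score everywhere; the PSSM does not.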

PSI-BLAST is performed in five steps
• [1] Select a query and search it against a protein database
• [2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
• [3] The PSSM is used as a query against the database
• [4] PSI-BLAST estimates statistical significance (E values)
• [5] Repeat steps [3] and [4] iteratively, typically 5 times
  • At each new search, a new profile is used as the query
Page 146

PSI-BLAST
• PSI (Position Specific Iterated) BLAST: Altschul et al., 1997
• Use the original BLAST algorithm and retrieve database sequences with significant matches (E < 0.01)
• A multiple alignment is performed
  • Place all locally aligned sections of the database sequences below the query sequence
  • When a gap is inserted in the query sequence, the corresponding residue in the database sequence is removed, so that all sequences are of the same length (for speed and simplicity)
• The multiple sequences are then used as input to the 2nd run
  • Use a PSSM (Position Specific Scoring Matrix) for scoring
  • The score depends on the frequencies of the residues in the column of the alignment (V is more likely to be aligned with a column containing many V’s or other hydrophobic residues)
• Repeat the process until no more new sequences are added
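How column frequencies turn into position-specific scores can be sketched with a simple log-odds calculation. This is a simplified illustration, not the PSI-BLAST procedure: it assumes a uniform background frequency of 1/20 and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and more elaborate pseudocounts. The alignment and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score (base 2) of each amino acid in each alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm

# Hypothetical gap-free alignment of equal-length segments.
alignment = ["VKW", "VRW", "VKF", "IKW"]
pssm = build_pssm(alignment)
```

In the first column V is frequent, so matching V scores highest, the less frequent I scores lower, and unseen residues score below zero.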

Search results
• During iteration in PSI-BLAST,
  • New, distantly related sequences can be found that are not detected by straightforward BLAST
  • This is due to the extra information in the aligned group of sequences that is not present in any one sequence
• On the other hand, as sequences are added, the range of sequences can become too broad
  • The alignment may end up having little relationship to the original query
• The ranking of hits gives some information about the degree of relatedness
  • But the top hits are not necessarily the most meaningful in terms of evolution
  • Several top hits to human genes were from bacteria, which led to claims of horizontal gene transfer (disproved by phylogenetic methods)
  • Frequently, the top hits are not the closest relatives

Multiple Sequence Alignment
• Based on local sequence alignments
• The aim is to recognize resemblance even when sequences share only weak similarities
• Problem Statement
  • Given k strings $v_1, \dots, v_k$ of lengths $n_1, \dots, n_k$ over an alphabet A (with $A' = A \cup \{-\}$),
  • and a k-dimensional score matrix δ,
  • find a $k \times n$ matrix over $A'$ such that
    • row i contains the characters of $v_i$ in order (interspersed with gaps)
    • every column contains at least one symbol from A
    • the sum of the scores of the columns is maximum
• The global alignment approach can be extended to k dimensions

Multiple Sequence Alignment Example: k=3
• With three sequences v, w, and u and a 3-D δ (the score of a column with x, y, and z)
• In pairwise global alignment:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j) \\
s_{i-1,j-1} + \delta(v_i, w_j)
\end{cases}
$$

• With three sequences:

$$
s_{i,j,k} = \max \begin{cases}
s_{i-1,j,k} + \delta(v_i, -, -) \\
s_{i,j-1,k} + \delta(-, w_j, -) \\
s_{i,j,k-1} + \delta(-, -, u_k) \\
s_{i-1,j-1,k} + \delta(v_i, w_j, -) \\
s_{i-1,j,k-1} + \delta(v_i, -, u_k) \\
s_{i,j-1,k-1} + \delta(-, w_j, u_k) \\
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k)
\end{cases}
$$
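The three-sequence recurrence can be sketched directly as a 3-D dynamic program. The scoring function δ is not specified on the slide; the sum-of-pairs score with match +1, mismatch -1 and gap -1 used here is an assumption for illustration, as are the function names. The seven moves correspond to the seven cases of the recurrence.

```python
from itertools import product

def sp_column_score(x, y, z, match=1, mismatch=-1, gap=-1):
    """Assumed delta: sum-of-pairs score of one column ('-' is a gap)."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == "-" and b == "-":
            continue  # a gap-gap pair contributes nothing
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def align3_score(v, w, u):
    """Optimal global alignment score of three sequences."""
    n1, n2, n3 = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # The seven non-empty subsets of sequences that advance in one column.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (v[i - 1] if di else "-",
                      w[j - 1] if dj else "-",
                      u[k - 1] if dk else "-")
            candidate = s[pi][pj][pk] + sp_column_score(*column)
            if candidate > s[i][j][k]:
                s[i][j][k] = candidate
    return s[n1][n2][n3]

identical = align3_score("ACG", "ACG", "ACG")
```

Each cell inspects 7 predecessors, and there are on the order of n³ cells, which previews why the approach scales poorly with k.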

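Multiple Sequence Alignment
• Time complexity is $O((2n)^{k})$
• Heuristics 1
  • Compute all $\binom{k}{2}$ optimal pairwise alignments and combine them
  • Does not always work (the optimal pairwise alignments cannot always be combined into a single multiple alignment)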

Multiple Sequence Alignment
• Heuristics 2: Greedy progressive multiple alignment
  • Select the two strings with the greatest similarity
  • Merge the two into a new string
  • Works well for very close sequences
  • May depend on the choice of the two seed sequences
  • Clustal uses this approach
