Pairwise Sequence Alignment (II)

Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

A T C G T A C w v A T 0 G 0 T 0 T 0 A 0 T 0 0 Review: Dynamic Programming for LCS • Edit graph representation of alignment • Path = alignment • Incrementally fill in the table • Backtrack to find the best alignment 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 3 1 2 2 3 3 3 3 4 1 2 2 3 4 4 4 5 1 2 2 3 4 4 4 6 1 2 2 3 4 5 5 1 7 2 2 3 4 5 5

Matching score Insertion/deletion score The LCS Recurrence Revisited • The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of indels are 0: si-1, j-1+1 if vi = wj si,j = max si-1, j + 0 si, j-1 + 0 How do we improve scoring?

How do we improve the scoring of alignments? Can we still find an alignment efficiently?

Outline • Improve Scoring • Scoring Matrix • Affine Gap Penalty • Variants of Alignment • Global vs. Local alignment • Assessing Score Significance

Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrixδ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score with comparison of a gap character “-”. This will simplify the scoring algorithm as follows: si-1,j-1 + δ(vi, wj) si,j = max s i-1,j + δ(vi, -) s i,j-1 + δ(-, wj) { The same dynamic programming algorithm would still work!

The Global Alignment Problem Find the best alignment between two strings under a given scoring matrix Input : Strings v & w and a scoring matrix δ Output : Alignment of maximum score Algorithm: Dynamic programming si-1,j-1 + δ (vi, wj) si,j = max s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) { The only question left is how to define the scoring matrix…

Measuring Similarity • Measuring the extent of similarity between two sequences • Based on percent sequence identity • Based on conservation

Percent Sequence Identity • The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G mismatch indel 70% identical

Simple Scoring • When mismatches are penalized by some constant –μ, indels are penalized by some other constant –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels)

Making a Better Scoring Matrix • Scoring matrices are created based on biological evidence. • Alignments can be thought of as two sequences that differ due to mutations in the sequence. • Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(vi , wj), will be less harsh than others.

AKRANR KAAANK -1 + (-1) + (-2) + 5 + 7 + 3 = 11 Scoring Matrix: Example • Notice that although R and K are different amino acids, they have a positive score. • Why? They are both positively charged amino acids will not greatly change function of protein.

Scoring matrices • Amino acid substitution matrices • PAM • BLOSUM • DNA substitution matrices • DNA: less conserved than protein sequences • Less effective to compare coding regions at nucleotide level • Simple scoring is often used

PAM • Point Accepted Mutation (Dayhoff et al.) • 1 PAM = PAM1 = 1% average change of all amino acid positions • After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may have not changed at all

PAMX • PAMx = PAM1x • PAM250 = PAM1250 • PAM250 is a widely used scoring matrix: Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10

BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance • BLOSUMx was created using sequences sharing no more than x% identity • E.g., BLOSUM62 <-> 62% identity

The Blosum50 Scoring Matrix Probability of seeing x aligned with y Val(x,y)=log(p(x,y)/p(x)p(y)) Probability of seeing x (or y) alone

Deficiency in Scoring of Indels • A fixed penalty σis given to every indel: • -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels

In nature, this is more likely. Deficiency in Scoring of Indels (cont.) • In nature, many times indels come as a unit, not just at 1 nucleotide at a time. ATA__GC ATATTGC ATAG_GC AT_GTGC Normal scoring would give the same score for both alignments

Accounting for Gaps • Gaps- contiguous sequence of spaces in one of the rows • Score for a gap of length x is: -(ρ +σx), where ρ >0 is the penalty for introducing a gap. ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties • Gap penalties: • -ρ-σ when there is 1 indels, -ρ-2σ when there are 2 indels, -ρ-3σ when there are 3 indels, etc. • -ρ- x *σ (-gap opening - x gap extensions) • Somehow reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges

Affine Gap Penalty Recurrences Continue Gap in w (deletion) si,j = s i-1,j - σ max s i-1,j –(ρ+σ) si,j = s i,j-1 - σ max s i,j-1 –(ρ+σ) si,j = si-1,j-1 + δ(vi, wj) max s i,j s i,j Start Gap in w (deletion) Continue Gap in v (insertion) Start Gap in v (insertion) Match or Mismatch End deletion End insertion Once again, the same dynamic programming algorithm would work!

Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

Local vs. Global Alignment (cont’d) • Global Alignment • Local Alignment—better alignment to find conserved segment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignments: Why? • Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. • Example: • Homeobox genes have a short region called the homeodomain that is highly conserved between species. • A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

The Local Alignment Problem • Goal: Find the best local alignment between two strings • Input : Strings v, w and scoring matrix δ • Output : Alignment of substrings of v & w whose alignment score is maximum among all possible alignment of all possible substrings

Compute a “mini” Global Alignment to get Local Local Alignment in Edit Graph Local alignment Global alignment

The Problem with this Problem • Problem of this, long run time O(n4): - There are ~n2 pairs of vertices (i,j) - For each pair of vertices computing an alignment takes O(n2) time. • Solution: Dynamic programming again! • Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph?

Notice there is only this change from the original recurrence of a Global Alignment The Local Alignment Recurrence • The largest value of si,j over the whole edit graph is the score of the best local alignment. • The recurrence is shown below: 0 si,j = max si-1,j-1 + δ(vi, wj) s i-1,j + δ(vi, -) s i,j-1 + δ(-, wj) { This is the well-known Waterman-Smith local alignment algoirthm

Assessing Score Signficance • In general, larger s  more significant. The question is how large should s be? • Factors to be considered: • Sequence length: longer sequences are expected to give higher scores • # sequences in the database: the score of the best alignment is expected to be higher for a larger DB • Evolution time: longer evolution causes more mismatches, making a lower score more significant • The Challenge is how to quantify all these…

Two Basic Approaches • The classical approach: Extreme value distribution (EVD) • Assume a null (random) model for scores MR • P(Score > s|MR, a(x, y))=? (a(x,y)=alignment of x, y) • The Bayesian approach: Model comparison • Assume two models for a(x,y): random R; aligned: M • P(M|a(x,y))/P(R|a(x,y))=? prior Log-odds score of the alignment

EVD of the Best Score in Ungapped Local Alignment • The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean • The probability that there is a match of score greater than S is • K and  can be fit using randomly generated data • This gives a way to test statistical significance p(x>21)= 0.01 vs. p(x>21)=0.3 Parameters Sequence lengths

BLOSUM Scoring Bayesian Model Comparison Assumptions: • M is a model for related sequences • R is a model for unrelated sequences (random) • Ungapped alignment n=m • Alignment of each pair is independent Score S(x,y) Prior (Subjective!) This partially addresses Q1: how to design the scoring function?

Pairwise Alignment Summary (Modeling evolution) Q1: How should we define s? Q2: How should we define A? (Application-specific) Model: scoring function s: A X=x1,…,xn X=x1,…,xn Possible alignments of X and Y: A ={a1,…,ak} Find the best alignment(s) … S(a*)= 21 Y=y1,…,ym Y=y1,…,ym Q4: Is the alignment biologically Meaningful or just the best alignment of two unrelated sequences? Q3: How can we find a* quickly? (Dynamic programming) Q1 & Q4 are related! (Models for scores)

What You Should Know • Alignment Scoring Methods (Matrix & Gap) • Global vs. Local alignments • How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies • Basic idea in assessing score significance

Pairwise Sequence Alignment (II)

Pairwise Sequence Alignment (II)

Presentation Transcript

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment (I)

Pairwise sequence Alignment

Pairwise Sequence Alignment

Pairwise sequence alignment Unit 7

Pairwise Sequence Alignment

Pairwise sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment Exercise 2

Pairwise sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment (cont.)

Pairwise Sequence Alignment

Pairwise sequence alignment

Pairwise sequence alignment (practice)

Pairwise Sequence Alignment (II)

Pairwise sequence alignment