Sequence Alignment CSCE 769 Guest Lecture November 1, 2012 Stephanie Irausquin, PhD

Sequence Alignment: Definition and Importance • Sequence alignment is the comparison of two or more homologous sequences; it involves identifying the insertions or deletions that might have occurred in either lineage since their divergence from a common ancestor • A powerful tool for discovering biological function and establishing evolutionary relationships

Sequence Alignment • The same principles for sequence alignment can be used to align both nucleotide and amino acid sequences • More reliable alignments are usually obtained by using amino acid sequences • Amino acids change less frequently during evolution than nucleotides • There are 20 amino acids and only 4 nucleotides, so the probability for 2 sites to be identical by chance is lower at the amino acid level than at the nucleotide level
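
As a rough illustration of the last bullet, assume (unrealistically) that all residues are equally frequent and that sites are independent. Under that simplifying assumption, which is not part of the original slide, the chance that two unrelated sites are identical is:

```latex
P(\text{identical by chance}) = \sum_{a} p_a^{2} =
\begin{cases}
4 \times (1/4)^{2} = 1/4 = 0.25 & \text{nucleotides} \\
20 \times (1/20)^{2} = 1/20 = 0.05 & \text{amino acids}
\end{cases}
```

Real residue frequencies are not uniform, so the true values differ, but the ordering (amino acids lower than nucleotides) is what the bullet relies on.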

Sequence Alignment • A DNA sequence alignment consists of a series of paired bases (one base from each sequence) • There are 3 types of aligned pairs • Match – it is assumed that the nucleotide at this site has not changed since the divergence between the two sequences • Mismatch – at least one substitution has occurred in one of the sequences since their divergence from each other • Gap – a deletion has occurred in one sequence, or an insertion has occurred in the other (the alignment itself does not allow us to distinguish between these two possibilities)

Types of Alignment • Manual • Dot matrix • Distance and similarity methods • Alignment algorithms

Manual Alignment • When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection, using either specialized alignment editors or plain text editors • Advantages: uses the brain and allows direct integration of additional data (e.g., domain structure) • Disadvantages: is subjective, and results cannot be compared to those derived using other methods

Dot Matrix • The two sequences to be aligned are written out as the column and row headings of a two-dimensional matrix • A dot is placed in the matrix at every position where the nucleotides in the two sequences are identical • The alignment is defined by a path through the matrix starting at the upper left and ending at the lower right
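
The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a plotting tool: the function names `dot_matrix` and `render` are made up for this example, `*` marks a dot (identical residues) and `.` marks an empty cell. The two sequences are the ones used in the distance example in this lecture.

```python
def dot_matrix(row_seq, col_seq):
    """Return one string per row: '*' where residues are identical, '.' elsewhere."""
    return ["".join("*" if r == c else "." for c in col_seq) for r in row_seq]


def render(row_seq, col_seq):
    """Format the dot matrix with the sequences as row and column headings."""
    lines = ["  " + col_seq]
    for residue, row in zip(row_seq, dot_matrix(row_seq, col_seq)):
        lines.append(residue + " " + row)
    return "\n".join(lines)


# Seq2 labels the rows, Seq1 labels the columns.
print(render("TCGGAGCTG", "TCAGACGATTG"))
```

Runs of `*` along a diagonal correspond to stretches of matches; a shift from one diagonal to another corresponds to a gap.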

Dot Matrix • Advantages: • a simple method • useful for unraveling important information about the evolution of sequences • Disadvantages: • may become very cluttered • may require human intervention to recognize patterns • may not be reliable • limited to two sequences

Dot Matrix Examples [Figure: two example dot plots, panels a and b]

Distance and similarity methods • The best possible alignment between two sequences is the one that minimizes the number of mismatches and gaps • However, reducing the number of mismatches usually results in an increase in the number of gaps (and vice versa)

Distance and similarity methods • Consider the following example:
Seq1: TCAGACGATTG (LengthSeq1 = 11)
Seq2: TCGGAGCTG (LengthSeq2 = 9)
• We can reduce the number of mismatches to 0, but the number of gaps in this case is 6:
Seq1: TCAG-ACG-ATTG
Seq2: TC-GGA-GC-T-G

Distance and similarity methods • Our example, yet again:
Seq1: TCAGACGATTG (LengthSeq1 = 11)
Seq2: TCGGAGCTG (LengthSeq2 = 9)
• Conversely, we can reduce the number of gaps to a single gap having the minimum possible size |LengthSeq1 – LengthSeq2| = 2 nucleotides, which increases the number of mismatches to 5 (marked with *):
Seq1: TCAGACGATTG
        *  ****
Seq2: TCGGAGCTG--

Distance and similarity methods • Our example, yet again:
Seq1: TCAGACGATTG (LengthSeq1 = 11)
Seq2: TCGGAGCTG (LengthSeq2 = 9)
• We can also choose an alignment that minimizes neither the number of gaps nor the number of mismatches. In the case below, the number of gaps is 4 and the number of mismatches is 2 (marked with *):
Seq1: TCAG-ACGATTG
              * *
Seq2: TC-GGA-GCTG-
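
The counts quoted for the three candidate alignments can be checked mechanically. The sketch below is only an illustration: the helper name `summarize_alignment` is made up here, and it reports gaps as a list of gap lengths so that "one gap of length 2" and "two gaps of length 1" stay distinguishable.

```python
from itertools import groupby


def summarize_alignment(aligned1, aligned2):
    """Return (number of mismatches, list of gap lengths) for a pairwise alignment."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    states = []
    for a, b in zip(aligned1, aligned2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gap characters")
        if a == "-":
            states.append("gap_in_1")
        elif b == "-":
            states.append("gap_in_2")
        elif a == b:
            states.append("match")
        else:
            states.append("mismatch")
    mismatches = states.count("mismatch")
    # A gap is a maximal run of gap characters within the same sequence.
    gap_lengths = [len(list(run)) for state, run in groupby(states)
                   if state.startswith("gap")]
    return mismatches, gap_lengths


print(summarize_alignment("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"))  # (0, [1, 1, 1, 1, 1, 1])
print(summarize_alignment("TCAGACGATTG", "TCGGAGCTG--"))      # (5, [2])
print(summarize_alignment("TCAG-ACGATTG", "TC-GGA-GCTG-"))    # (2, [1, 1, 1, 1])
```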

Distance and similarity methods • Which of the three alignments is preferable? • In order to determine that, we need to find a common denominator (the gap penalty) that allows us to compare gaps and mismatches • Gap penalty – a factor (or set of factors) by which gap values (the numbers and lengths of gaps) are multiplied in order to make the gaps equivalent in value to mismatches • Based on how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions • Of course, mismatch penalties also need to be assigned; they reflect how frequently substitutions occur

Distance and similarity methods • For any given alignment, we can calculate a distance or dissimilarity index (D) as: $D = \sum_i m_i y_i + \sum_k w_k z_k$ where $y_i$ is the number of mismatches of type $i$, $m_i$ is the mismatch penalty for an $i$-type mismatch, $z_k$ is the number of gaps of length $k$, and $w_k$ is a positive number representing the penalty for gaps of length $k$

Distance and similarity methods • In the most frequently used gap penalty systems, it is assumed that the gap penalty includes two components: • Gap-opening penalty • Gap-extension penalty • Further complications in the penalty system may be introduced by distinguishing among different mismatches (e.g., between amino acids) • A Leu/Ile substitution (two chemically similar residues) should cost less than an Arg/Glu substitution (oppositely charged residues)
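
To make the comparison concrete, the sketch below evaluates D for the three candidate alignments of TCAGACGATTG and TCGGAGCTG. The penalty values are hypothetical and chosen only for illustration: a single mismatch type with penalty 1, and a gap penalty of the assumed form w_k = opening + extension × k with opening = 2 and extension = 1. The function names are likewise made up for this example.

```python
def affine_gap_penalty(k, opening=2, extension=1):
    """Hypothetical penalty w_k for one gap of length k (opening plus extension)."""
    return opening + extension * k


def distance(mismatch_counts, mismatch_penalties, gap_counts, gap_penalty):
    """D = sum_i m_i * y_i + sum_k w_k * z_k.

    mismatch_counts:    {mismatch type i: y_i}
    mismatch_penalties: {mismatch type i: m_i}
    gap_counts:         {gap length k: z_k}
    gap_penalty:        function k -> w_k
    """
    mismatch_term = sum(mismatch_penalties[i] * y for i, y in mismatch_counts.items())
    gap_term = sum(gap_penalty(k) * z for k, z in gap_counts.items())
    return mismatch_term + gap_term


penalties = {"any": 1}  # one mismatch type, penalty 1 (assumption)

# Counts taken from the three alignments in the example.
d_no_mismatch = distance({"any": 0}, penalties, {1: 6}, affine_gap_penalty)
d_single_gap = distance({"any": 5}, penalties, {2: 1}, affine_gap_penalty)
d_compromise = distance({"any": 2}, penalties, {1: 4}, affine_gap_penalty)

print(d_no_mismatch, d_single_gap, d_compromise)  # 18 9 14
```

Under these particular penalties the single-gap alignment has the smallest D; a different choice of penalties can change the ranking, which is why the penalties need to reflect how often each kind of event occurs.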

BLOSUM • BLOSUM (BLOcks SUbstitution Matrix) is a family of substitution matrices used for sequence alignment of proteins • First introduced in a paper by Henikoff and Henikoff (1992) • They scanned highly conserved regions of protein families and counted the relative frequencies of amino acids and their substitution probabilities • They then calculated a log-odds score for each possible substitution among the 20 standard amino acids • Several sets of matrices exist • High-numbered matrices (e.g., BLOSUM80) are designed for comparing closely related sequences • Low-numbered matrices (e.g., BLOSUM45) are designed for comparing distantly related sequences

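BLOSUM50 Substitution Matrix [Figure: BLOSUM50 matrix]
Amino acid codes: A Ala, C Cys, D Asp, E Glu, F Phe, G Gly, H His, I Ile, K Lys, L Leu, M Met, N Asn, P Pro, Q Gln, R Arg, S Ser, T Thr, V Val, W Trp, Y Tyr
Each entry is a log-odds score of the form $s(x, y) \propto \log \frac{P_{xy}}{P_x P_y}$, where $P_{xy}$ is the probability that x and y appear aligned in evolutionarily related sequences, $P_x$ is the probability of occurrence of x, and $P_y$ is the probability of occurrence of y.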

Sequence Alignment Algorithms • The purpose of any alignment algorithm is to choose the alignment associated with the smallest D from all possible alignments • The number of possible alignments can be very large • Fortunately, there are computer algorithms that find the optimal alignment between two sequences • Fundamentally, there are two different types of alignment algorithms: • Global (Needleman-Wunsch): both sequences are aligned along their entire lengths and the best alignment is found • Local (Smith-Waterman): the best subsequence alignment is found

Global Alignment: Needleman-Wunsch • Every letter of each sequence is aligned to a letter or a gap • Alignment takes place in a 2D matrix • Each cell corresponds to a pairing of one letter from each sequence and contains a score derived from a scoring scheme, along with a corresponding pointer • The algorithm has three major phases (initialization, fill, and trace-back) • In order to examine each phase, let's align the words HEAGAWGHE and PAWHEAE using the following scoring scheme: • gap penalty of -8 • match score and mismatch penalty determined using the BLOSUM50 matrix

Global Alignment: Needleman-Wunsch • Initialization • Values for the first row and column are assigned • The score of each cell is set to the gap penalty (-8) multiplied by the distance from the origin
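
The initialization step can be sketched as follows. This is a minimal illustration: the function name `init_matrix` and the choice to store the matrix as a list of rows are assumptions of this example, with HEAGAWGHE written along the top (horizontal) and PAWHEAE down the side (vertical).

```python
def init_matrix(horizontal, vertical, gap=-8):
    """Create the score matrix and fill in the first row and first column."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = gap * j  # j gap steps away from the origin along the top
    for i in range(1, rows):
        score[i][0] = gap * i  # i gap steps away from the origin down the side
    return score


matrix = init_matrix("HEAGAWGHE", "PAWHEAE")
print(matrix[0])                      # [0, -8, -16, -24, -32, -40, -48, -56, -64, -72]
print([row[0] for row in matrix])     # [0, -8, -16, -24, -32, -40, -48, -56]
```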

Global Alignment: Needleman-Wunsch • Fill • Three scores are computed for each cell • Diagonal Score – sum of the diagonal cell score and the score for a match/mismatch (BLOSUM50 matrix) • Horizontal Score – sum of the score of the cell to the left and the gap penalty • Vertical Score – sum of the score of the cell above and the gap penalty • The entire matrix is then filled by assigning to each cell the max score (obtained from the 3 computed scores) and the corresponding pointer
Example, cell (P, H):
Diagonal Score: 0 + (-2) = -2
Horizontal Score: -8 + (-8) = -16
Vertical Score: -8 + (-8) = -16
Max Score = -2

Global Alignment: Needleman-Wunsch • Fill (continued)
Example, cell (P, E):
Diagonal Score: -8 + (-1) = -9
Horizontal Score: -2 + (-8) = -10
Vertical Score: -16 + (-8) = -24
Max Score = -9
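
The fill rule for a single cell can be written as a small function. This is a sketch under stated assumptions: the name `fill_cell` is made up for this example, the substitution score is passed in by the caller (here, the two BLOSUM50 values quoted in the worked examples), and ties are broken in favor of the diagonal, which is a choice of this example rather than something the slides specify.

```python
def fill_cell(diagonal, left, above, substitution, gap=-8):
    """Return (max score, pointer) for one cell of the Needleman-Wunsch matrix.

    diagonal, left, above: scores of the three neighboring cells
    substitution: match/mismatch score for the pair of letters at this cell
    """
    candidates = [
        ("diagonal", diagonal + substitution),
        ("horizontal", left + gap),
        ("vertical", above + gap),
    ]
    # max() returns the first maximum, so the diagonal wins ties.
    pointer, best = max(candidates, key=lambda item: item[1])
    return best, pointer


print(fill_cell(diagonal=0, left=-8, above=-8, substitution=-2))    # (-2, 'diagonal'), cell (P, H)
print(fill_cell(diagonal=-8, left=-2, above=-16, substitution=-1))  # (-9, 'diagonal'), cell (P, E)
```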

Global Alignment: Needleman-Wunsch • Fill • Continue calculating max score for all cells along with corresponding pointer

Global Alignment: Needleman-Wunsch • Trace-back • Allows one to recover the alignment from the matrix • Trace back your transition from the bottom right corner to the top left corner by referring back to the completed matrix

Global Alignment: Needleman-Wunsch • Trace-back • Horizontal transition represents a gap in the vertical sequence • Vertical transition represents a gap in the horizontal sequence • Diagonal transition pairs the corresponding characters of the two sequences (a match or a mismatch) • Final Alignment:
H E A G A W G H - E
- - P - A W H E A E
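
The three phases can be put together in a short sketch. Several things here are assumptions of this example rather than part of the lecture: the function and variable names, the tie-breaking rule (diagonal first, then horizontal, then vertical), and the score table, which contains only the BLOSUM50 entries needed for these two words and should be checked against a full BLOSUM50 table before being reused.

```python
# BLOSUM50 entries for the residues in this example only (rows P, A, W, H, E;
# columns H, E, A, G, W). Transcribed by hand; verify before reuse.
_COLUMNS = "HEAGW"
_ROWS = {
    "P": (-2, -1, -1, -2, -4),
    "A": (-2, -1, 5, 0, -3),
    "W": (-3, -3, -3, -3, 15),
    "H": (10, 0, -2, -2, -3),
    "E": (0, 6, -1, -3, -3),
}
SCORES = {}
for row_letter, values in _ROWS.items():
    for col_letter, value in zip(_COLUMNS, values):
        SCORES[(row_letter, col_letter)] = SCORES[(col_letter, row_letter)] = value


def needleman_wunsch(horizontal, vertical, gap=-8):
    """Global alignment; returns (score, aligned horizontal, aligned vertical)."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    score = [[0] * cols for _ in range(rows)]
    pointer = [[None] * cols for _ in range(rows)]
    # Initialization: first row and column hold accumulated gap penalties.
    for j in range(1, cols):
        score[0][j], pointer[0][j] = gap * j, "left"
    for i in range(1, rows):
        score[i][0], pointer[i][0] = gap * i, "up"
    # Fill: keep the best of the diagonal, horizontal and vertical scores.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = (vertical[i - 1], horizontal[j - 1])
            candidates = [
                (score[i - 1][j - 1] + SCORES[pair], "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])
    # Trace-back: follow pointers from the bottom right to the top left.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            top.append(horizontal[j - 1]); bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":   # gap in the vertical sequence
            top.append(horizontal[j - 1]); bottom.append("-")
            j -= 1
        else:                  # "up": gap in the horizontal sequence
            top.append("-"); bottom.append(vertical[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("HEAGAWGHE", "PAWHEAE"))  # (-9, 'HEAGAWGH-E', '--P-AWHEAE')
```

With the diagonal preferred on ties, the trace-back reproduces the final alignment shown on the slide; other tie-breaking rules can return a different alignment with the same score.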

Local Alignment: Smith-Waterman • A slight modification of the Needleman-Wunsch algorithm: • Edges of the matrix are initialized to zero • Max score is never less than zero, and no pointer is recorded unless the score is greater than zero • Trace-back starts from the highest score in the matrix and ends at a score of zero

Local Alignment: Smith-Waterman • Again, let's align the words HEAGAWGHE and PAWHEAE using the same scoring scheme: • gap penalty of -8 • match score and mismatch penalty determined using the BLOSUM50 matrix • Start from the largest score and trace back to determine the best local alignment • Horizontal transition represents a gap in the vertical sequence • Vertical transition represents a gap in the horizontal sequence • Diagonal transition pairs the corresponding characters of the two sequences (a match or a mismatch) • Final Alignment (the locally aligned region is AWGHE over AW-HE; the flanking residues are shown for context):
H E A G A W G H E - -
- - - P A W - H E A E
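
The modifications that turn the global algorithm into the local one can be sketched as below. As before, this is an illustration under stated assumptions: the names are made up for this example, ties are broken diagonal first, and the score table holds only the BLOSUM50 entries needed for these two words (hand-transcribed, so verify before reuse). The function returns only the locally aligned region, without the unaligned flanks shown on the slide.

```python
# BLOSUM50 entries for the residues in this example only (rows P, A, W, H, E;
# columns H, E, A, G, W). Transcribed by hand; verify before reuse.
_COLUMNS = "HEAGW"
_ROWS = {
    "P": (-2, -1, -1, -2, -4),
    "A": (-2, -1, 5, 0, -3),
    "W": (-3, -3, -3, -3, 15),
    "H": (10, 0, -2, -2, -3),
    "E": (0, 6, -1, -3, -3),
}
SCORES = {}
for row_letter, values in _ROWS.items():
    for col_letter, value in zip(_COLUMNS, values):
        SCORES[(row_letter, col_letter)] = SCORES[(col_letter, row_letter)] = value


def smith_waterman(horizontal, vertical, gap=-8):
    """Local alignment; returns (score, aligned horizontal part, aligned vertical part)."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    # Edges (and every other cell, initially) are zero; pointers start empty.
    score = [[0] * cols for _ in range(rows)]
    pointer = [[None] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = (vertical[i - 1], horizontal[j - 1])
            candidates = [
                (score[i - 1][j - 1] + SCORES[pair], "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            value, move = max(candidates, key=lambda c: c[0])
            if value > 0:  # scores never drop below zero; no pointer at zero
                score[i][j], pointer[i][j] = value, move
                if value > best:
                    best, best_cell = value, (i, j)
    # Trace-back: start at the highest score, stop at a cell with score zero.
    top, bottom = [], []
    i, j = best_cell
    while pointer[i][j] is not None:
        move = pointer[i][j]
        if move == "diag":
            top.append(horizontal[j - 1]); bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":   # gap in the vertical sequence
            top.append(horizontal[j - 1]); bottom.append("-")
            j -= 1
        else:                  # "up": gap in the horizontal sequence
            top.append("-"); bottom.append(vertical[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("HEAGAWGHE", "PAWHEAE"))  # (28, 'AWGHE', 'AW-HE')
```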

Local Alignment: Smith-Waterman • Does it matter which "word"/sequence is horizontal and which is vertical? • To answer this question, let's align PAWHEAE (horizontal) to HEAGAWGHE (vertical) using the same scoring scheme as before: • gap penalty of -8 • match score and mismatch penalty determined using the BLOSUM50 matrix • Start from the largest score and trace back to determine the best local alignment • Horizontal transition represents a gap in the vertical sequence • Vertical transition represents a gap in the horizontal sequence • Diagonal transition pairs the corresponding characters of the two sequences (a match or a mismatch) • Final Alignment:
- - - P A W - H E A E
H E A G A W G H E - -

So does it matter which "word"/sequence is horizontal and which is vertical? No, it does not. Either way, the final alignment is the same and is considered to be the "optimal" alignment • Final Alignment:
H E A G A W G H E - -
- - - P A W - H E A E

Global or Local? • When is a global alignment more useful? • When sequences in a query set are similar and close in size • When is a local alignment more useful? • When sequences in a query set are dissimilar but suspected to contain regions of similarity • When sequences (amino acid or nucleotide) are sufficiently similar, local and global alignments give essentially the same result

Helpful Charts AA chart: http://sofbiology.blogspot.com/2010/12/protein-synthesis-amino-acid-table.html IUPAC chart: http://www.bioinformatics.org/sms/iupac.html

Except where otherwise noted (i.e. items on the slide labeled “Helpful Charts”), most information contained in this presentation was obtained from: Graur, Dan and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Second Edition. Sunderland, Massachusetts: Sinauer Associates, Inc., Publishers, 2000. Some of the information related to global & local alignment algorithms was obtained from and can be accessed at: http://etutorials.org/Misc/blast/Part+II+Theory/Chapter+3.+Sequence+Alignment/
