310 likes | 440 Vues
Explore the fundamental concepts of sequence alignment in bioinformatics in this lecture presented by ChengXiang Zhai from the University of Illinois. Delve into the problem of aligning two sequences using dynamic programming, with a focus on the longest common subsequence (LCS) problem. The lecture outlines various scoring methods for alignments and discusses the construction of alignment matrices and edit graphs for optimal sequence matching. Ideal for students interested in algorithms and their applications in biological data analysis.
E N D
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
Comparing Genes in Two Genomes • Small islands of similarity corresponding to similarities between exons • Such comparisons are quite common in biology research
Alignment of sequences is one of the most basic and most important problems in bioinformatics…
Outline • Defining the problem of alignment • The longest common subsequence problem • Dynamic programming algorithms for alignment
Aligning Two Strings Given the strings: • v = ATGTTAT • w = ATCGTAC One possible alignment of the strings: AT_GTTAT_ ATCGT_A_C 1st row – string v with with space symbols “-” inserted 2nd row – string w with with space symbols “-” inserted
Aligning Two Strings (cont’d) Another way to represent each row shows the number of symbols of the sequence present up to a given position. For example the above sequences can be represented as: 0 1 2 2 3 4 5 6 7 7 AT_GTTAT_ ATCGT_A_C 0 1 2 3 4 5 5 6 6 7
0 1 2 2 3 4 5 6 7 7 AT_GTTAT_ ATCGT_A_C 0 1 2 3 4 5 5 6 6 7 Alignment Matrix Both rows of the alignment can be represented in the resulting matrix: 0 1 2 2 3 4 5 6 7 7 0 1 2 3 4 5 5 6 6 7 Each column in this matrix is a coordinate in a two-dimensional nxm grid
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1)
A T C G T A C w v A T G T T A T Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 0 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2) 0 1 2 3 4 5 6 7
A T C G T A C w v A T G T T A T Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 0 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4) 0 1 2 3 4 5 6 7
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - End Result -
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment:
How to Score an Alignment? • Simplest • Every match scores 1 • Every mismatch scores 0 • An alignment is scored based on the number of common symbols • Lead to the longest common subsequence problem • More sophisticated • ? • ? • To be covered later
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignments in Edit Graph (cont’d) • and represent indels in v and w • Score 0. • represent exact matches. • Score 1.
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignments in Edit Graph (cont’d) The score of the alignment path in the graph is 5.
The Longest Common Subsequence (LCS) Problem • Find the longest subsequence common to two strings. Input: Two strings, v and w. Output: The longest common subsequence of v and w. v = ATGTTAT w = ATCGTAC v = AT GTTAT | | | | | “ATGTA” w = ATCGT AC A subsequence is not necessarily consecutive Longest common subsequence Best alignment
Brute Force Approach • Enumerate all the sequences up to length min(|v|,|w|) • For each one, check to see if it is a subsequence of v and w • Very expensive…. (How many sequences do we have to enumerate? )
The Idea of Dynamic Programming • Think of an alignment as a path in an edit graph • We only need to keep track of the best alignment (i.e., the longest common subsequence) • Score a longer alignment based on shorter alignments
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),(7,7) Use each cell to store the best alignment so far…
Use this scoring algorithm si,j = si-1, j-1+1 if vi = wj max si-1, j si, j-1 { Alignment: Dynamic Programming
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 A 6 0 T 7 0 0 Dynamic Programming Example • There are no matches in the beginning of the sequence • Label column i=1 to be all zero, and row j=1 to be all zero 0 0 0 0 0 0 0 0
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 Si,j = Si-1, j-1 max Si-1, j Si, j-1 A 6 0 { T 7 0 0 Dynamic Programming Example 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 value from NW +1, if vi = wj value from North (top) value from West (left) 1 1 1 1 Keep track of the best alignment score and the path contributing to it 1 1
Alignment: Backtracking Arrows show where the score originated from. if from the top if from the left if vi = wj
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 A 6 0 T 7 0 0 Dynamic Programming Example 0 0 0 0 0 0 0 0 Continuing with the scoring algorithm gives this result. 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 3 3 3 3 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5
LCS(v,w) for i 1 to n Si,0 0 for j 1 to m S0,j 0 for i 1 to n for j 1 to m si-1,j si,j max si,j-1 si-1,j-1 + 1, if vi = wj “ “ if si,j = si-1,j bi,j “ “ if si,j = si,j-1 “ “ if si,j = si-1,j-1 + 1 return (sn,m, b) LCS Algorithm { {
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 0 0 0 0 0 0 0 0 T 2 0 1 1 1 1 1 1 1 G 3 1 0 2 2 2 2 2 2 T 1 4 2 2 3 3 3 3 0 1 2 T 2 3 4 4 4 5 0 1 2 2 3 4 4 4 A 6 0 1 2 2 3 4 5 5 T 1 2 2 3 4 7 0 5 5 0 Now What? • LCS(v,w) created the alignment grid • Now we need a way to read the best alignment of v and w • Follow the arrows backwards from the (|v|,|w|) cell
LCS Runtime • To create the nxm matrix of best scores from vertex (0,0) to all other vertices, it takes O(nm) time. • Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix.
How do we improve the scoring of alignments? Can we still find an alignment efficiently? We’ll talk about these later…
Matching score Insertion/deletion score The LCS Recurrence Revisited • The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of indels are 0: si-1, j-1+1 if vi = wj si,j = max si-1, j + 0 si, j-1 + 0
What You Should Know • How an alignment corresponds to a path in an edit graph • How the LCS problem corresponds to alignment with a simple scoring method • How the dynamic programming algorithm solves the LCS problem (= simple alignment)