Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming

Introduction to Bioinformatics: Lecture IVSequence Similarity and Dynamic Programming Jarek Meller Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC JM - http://folding.chmcc.org

Outline of the lecture • Wrapping up the previous lecture: a quick look at the NCBI Map Viewer and suffix trees by way of an example • Inexact string matching: from generalizations of suffix trees to dynamic programming • The dynamic programming algorithm for sequence alignment: how it works • The dynamic programming algorithm for sequence alignment: why it works • Limitations and faster heuristic approaches JM - http://folding.chmcc.org

Web watch: NCBI Map Viewer With the knowledge about STSs and physical maps (hopefully) acquired last week we can have another look at the NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview http://www.ncbi.nlm.nih.gov/genome/guide/human/ JM - http://folding.chmcc.org

Computationally efficient and elegant solutions for the exact string matching problem: http://www-igm.univ-mlv.fr/~lecroq/string/index.html Christian Charras and Thierry Lecroq JM - http://folding.chmcc.org

The idea of the suffix tree method Phase 1: Preprocessing of the “text” A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T=cabca c b $ a 4 c a b a c $ b $ a 5 c $ 3 1 a $ 2 For simplicity one leaf, due to the terminal character $, is not included. Problem What is the reason for adding the terminal character? JM - http://folding.chmcc.org

Suffix tree based matching: why does it work? Phase II: Search A substring of a string is a prefix of a suffix in that string. For example, a substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P occurs in T there is a leaf in the suffix tree that has a label starting with P. c b $ a 4 c a b a c $ b $ a 5 c $ 3 1 a $ 2 Problem Does the size of the alphabet matter (and if so, how)? Hint: how many edges may originate in a node, given that label of each edge out of a node has to start with a different character? JM - http://folding.chmcc.org

Generalized suffix tree for a set of strings and the longest common substring problem Consider for example two strings: T=cabca and U=bbcb. U4 $ U3 $ $ c b b U2 b a $ T4 c a a b $ c b b $ a T5 c $ c T3 b T1 a $ $ U1 T2 Remark By building the generalized suffix tree for a set of k strings of the total length m one can find the longest prefix-suffix match for all pairs of strings in O(m+k2) time (an additional trick is required for that). JM - http://folding.chmcc.org

Assembling DNA from fragment and the suffix-prefix matching problem Hierarchical sequencing: physical maps, clone libraries and shotgun (see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse) Definition The algorithmic problem of shotgun sequence assembly is to deduce the sequence of the DNA string from a set of sequenced and partially overlapping short substrings derived from that string. Analogy to physical map assembly: DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs) and the suffix-prefix match determines if two substrings would be assembled together. In general, the shortest superstring problem (find the shortest string that contains each string from a certain set of strings as its substring) is NP-hard and heuristics are being developed to address the problem. JM - http://folding.chmcc.org

Inexact or approximate string matching • Two major reasons for the importance of approximate matching in • computational molecular biology are: • Measurement (e.g. sequencing) errors and fuzzy nature of underlying • molecular processes (e.g. hybridization may occur despite some • mismatches) • ii) Redundancy in biology with evolutionary processes resulting in closely • related, yet, different sequences that require approximate matching in • order to detect their relatedness and identify variable as well as • conserved features that may reveal fingerprints of structure and function • Either generalizations of exact string matching methods, such as suffix trees, • or dynamic programming (or their heuristic combinations) are being used to • solve this problem. JM - http://folding.chmcc.org

Redundancy in biological systems An example: two globin-like sequences: --MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE MLS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI--AIIHVLHSRH DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI I VL S+H DLKKHGATVLTALGGILKKKGHHEAE-KPLAQSHATKHKIPVKYLEFISEC-IQVLQSKH PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG PG+FGADAQGAMNKALELFRKD+A+ YKE PGDFGADAQGAMNKALELFRKDMASNYKE----- • Note that there are two types of mismatches: • Due to point mutations • Due to insertions and deletions (gaps) JM - http://folding.chmcc.org

Gap penalties: evolutionary and computational considerations • Linear gap penalties: g(g) = - g d for a gap of length g and constant d • Affine gap penalties: g(g) = - [ d + (g -1) e ] where d is opening gap penalty and e an extension gap penalty. JM - http://folding.chmcc.org

Dynamic programming algorithm for string alignment • Our goal is to find an optimal matching for two strings S1 = a1a2…an and S2 = b1b2…bmover a certain alphabet S, given a scoring matrix s(a,b) for each a and b in Sand (for simplicity) a linear gap penalty • Relation to minimal edit distance (number of insertions, deletions and substitutions required to transform one string into the other) problem • The similarity measure (scoring matrix) should represent biological relatedness and separate true matches from random alignments (find more in Chapter 2 of “Biological Sequence Analysis” by Durbin et. al.) JM - http://folding.chmcc.org

How many alignments are there? All the possible alignments (with gaps) may be represented in the Form of a DP graph (DP table). Consider an example with two strings of length 2: \  a1b1a2b2 \ | b1a1b2a2 \_ 0 1 1 \ \ _ |_ |  a1b1b2a2 _ a1a2b1b2 \ | 1 3 5 |_ _ \ |  b1a1a2b2 \ 1 5 13 | _ _ |_ _ |_ _ _ |_ _ | |_ |_ | | b1b2a1a2 | | |_ JM - http://folding.chmcc.org

Computing the number of alignments with gaps Definition A string of length n+m, obtained by intercalating two strings S1 = a1a2…anandS2 = b1b2…bm , while preserving the order of the symbols in S1andS2, will be referred to as an intercalated string and denoted by S1/2. Note that S1 and S2are subsequences of S1/2 but in general they are not substrings of S1/2. Definition Two alignments are called redundant if their score is identical. The relationship of “having the same score” may be used to define equivalence classes of non-redundant alignments. For example, the class a1b1b2a2: a1b1b2a2  a1-a2 a1a2- b1b2- ; b1-b2 JM - http://folding.chmcc.org

Computing the number of alignments with gaps Lemma There is one-to-one correspondence (bijection) between the set of the non-redundant gapped alignments of two strings S1andS2and the set of the intercalated strings {S1/2}. Corollary The number of non-redundant gapped alignments of two strings, of length n and m, respectively, is equal to (n+m)!/[m!n!]. Proof Since the order of each of the sequences is preserved when intercalating them, we have in fact n+m positions to put m elements of the second sequence (once this is done the position of each of the elements of the first sequence is fixed unambiguously). Hence, the total number of intercalated sequences S1/2is given by the number of m-element combinations of n+m elements and the corollary is a simple consequence of the one-to-one correspondence between alignments and intercalated sequences stated in the lemma. QED JM - http://folding.chmcc.org

Computing the number of alignments with gaps Problem Consider for simplicity two strings of the same length and using the Stirling formula (x! ~ (2p)1/2 xx+1/2 e-x ) show that: (n+n)!/[n!n!] ~ 22n / (2pn)1/2 Note that for a very short by biology standards sequence of length n=50 one needs to perform about 1030 basic operations for an exhaustive search, making the naïve approach infeasible. Dynamic programming provides in polynomial time an optimal solution for a class of optimization problems with exponentially scaling search space, including the approximate string matching. JM - http://folding.chmcc.org

Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming