Algorithms in Bioinformatics: A Practical Introduction

Suffix tree

Overview • What is a suffix tree? • Simple applications of suffix trees • Linear time algorithm for constructing a suffix tree • Suffix array • FM-index • 1-mismatch search

Suffix Trie • E.g. consider the string S = acacag$ • Suffixes of S: 1 acacag$, 2 cacag$, 3 acag$, 4 cag$, 5 ag$, 6 g$, 7 $ • Suffix trie: a trie of all possible suffixes of S [Figure: suffix trie of S = acacag$, with leaves labeled 1 to 7]
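The suffix trie above can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: the trie is stored as nested dictionaries, the key "leaf" (a hypothetical marker, not from the slides) records the 1-based suffix number, and S is assumed to end with a unique terminator such as $.

```python
def build_suffix_trie(s):
    """Return a nested-dict trie that contains every suffix of s."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # one trie node per character, created on demand
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1  # 1-based suffix number, as on the slide
    return root


def leaf_numbers(node):
    """Collect the suffix numbers stored at the leaves below node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(leaf_numbers(child))
    return sorted(found)


trie = build_suffix_trie("acacag$")
print(leaf_numbers(trie))
```

Because every character gets its own node, this structure can take quadratic space, which is what motivates the compressed suffix tree on the next slide.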

Suffix Tree • Suffix tree for S = acacag$: the suffix trie in which every node with only one child is merged into its edge • “ca” is an edge label • The path label of node v is “aca” • An edge that ends at a leaf is a leaf edge [Figure: suffix tree of S = acacag$, with an internal node v and leaves labeled 1 to 7]

Size of Suffix Tree (I) • How big is a suffix tree? • A suffix tree has exactly n leaves and at most 2n-1 edges • The total length of all edge labels is O(n^2) • Can we store a suffix tree in o(n^2) bits of space? [Figure: suffix tree of S = acacag$]

Size of Suffix Tree (II) • A suffix tree has exactly n leaves and at most 2n-1 edges • Each edge label can be represented using 2 indices into S, e.g. 7,7 for “$”, 6,7 for “g$”, 1,1 for “a”, 2,3 for “ca” and 4,7 for “cag$” • Thus, a suffix tree can be represented using O(n log n) bits • Note: the end index of every leaf edge is 7, the last index of S. Thus, for leaf edges, we only need to store the start index. [Figure: suffix tree of S = acacag$ with each edge label replaced by its pair of indices]

Property of suffix tree • Fact: For any internal node v in the suffix tree, if the path label of v is ap (a character a followed by a string p), then there exists another node w in the suffix tree whose path label is p. • Proof: Skipped. • Definition of suffix link: for any internal node v with path label ap, define its suffix link sl(v) = w.

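Suffix Link example • S = acacag$ • The node with path label “aca” links to the node with path label “ca”, which links to the node with path label “a”, which links to the root [Figure: suffix tree of S = acacag$ with its suffix links]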

Generalized suffix tree • Build a suffix tree for two or more strings • E.g. S1 = acgat#, S2 = cgt$ [Figure: generalized suffix tree of S1 and S2; each leaf is labeled with a suffix number of S1 or S2]

Applications of Suffix Tree

Exact string matching problem • To find all occurrences of Q in S (searching) • Search for the node x in the suffix tree which represents Q • All the leaves in the subtree rooted at x are the occurrences • Time: O(|Q| + occ), where occ is the total number of occurrences • E.g. S = acacag$, Q = aca • Occurrences: 1, 3 [Figure: suffix tree of S = acacag$]
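The search described above can be tried out on a small scale. The sketch below is an assumption-laden illustration, not the slide's data structure: it walks an uncompressed suffix trie (nested dictionaries with a hypothetical "leaf" key), so it does not achieve the O(|Q| + occ) bound of a real suffix tree, but the two phases (follow Q from the root, then report the leaves below) are the same.

```python
def build_suffix_trie(s):
    """Nested-dict suffix trie; "leaf" stores the 1-based suffix number."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root


def find_occurrences(trie, q):
    """Return the sorted 1-based start positions of q in the indexed string."""
    node = trie
    for ch in q:                 # phase 1: locate the node x that represents q
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:                 # phase 2: report every leaf below x
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("acacag$")
print(find_occurrences(trie, "aca"))  # [1, 3]
```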

Longest repeated substring problem • To find the longest repeated substring in S • Find the deepest internal node, i.e. the internal node with the longest path label • Time: O(n) • E.g. S = acacag$ • The longest repeat is aca. [Figure: suffix tree of S = acacag$]
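As a rough illustration of "find the deepest internal node", the sketch below looks for the deepest branching node of an uncompressed suffix trie; the branching nodes of the trie are exactly the internal nodes of the suffix tree. The nested-dict layout and the "leaf" key are our own assumptions, S must end with a unique terminator, and the running time is quadratic rather than O(n).

```python
def build_suffix_trie(s):
    """Nested-dict suffix trie; "leaf" stores the 1-based suffix number."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root


def longest_repeated_substring(s):
    """Path label of the deepest node with at least two children."""
    best = ""
    stack = [(build_suffix_trie(s), "")]
    while stack:
        node, label = stack.pop()
        children = [(k, c) for k, c in node.items() if k != "leaf"]
        # a branching node means its label is followed by two different
        # characters, so the label occurs at least twice in s
        if len(children) >= 2 and len(label) > len(best):
            best = label
        for ch, child in children:
            stack.append((child, label + ch))
    return best


print(longest_repeated_substring("acacag$"))  # aca
```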

Longest common substring problem • To find the longest common substring of two or more sequences • Note: in 1970, Don Knuth conjectured that a linear time algorithm for this problem is impossible • Now, we know that it can be solved in linear time. • E.g. consider two strings S1 and S2, • Build a generalized suffix tree for S1# and S2$ • Then, mark each internal node whose subtree has leaves representing suffixes of both S1 and S2. • Report the deepest marked node

Example for the longest common substring • E.g. S1 = acgat#, S2 = cgt$ • The longest common substring is “cg”. Its length is 2. [Figure: generalized suffix tree of S1 and S2]
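The marking idea can be sketched as follows. This is a simplified stand-in under our own assumptions: a generalized suffix trie (not a compressed tree) in which every node records the set of strings passing through it, in the hypothetical fields "ids" and "kids". A node is "marked" when its set contains every input string. The sketch is quadratic, unlike the linear time method on the slides.

```python
def longest_common_substring(strings):
    """Deepest node of a generalized suffix trie that is reached by all strings."""
    root = {"ids": set(), "kids": {}}
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"ids": set(), "kids": {}})
                node["ids"].add(sid)   # string sid contains this path label
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["kids"].items():
            # only descend through marked nodes (common to all strings)
            if len(child["ids"]) == len(strings):
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring(["acgat#", "cgt$"]))  # cg
```

The distinct terminators # and $ never lie on a marked path, so they cannot appear in the answer.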

Longest common prefix (I) • Given a string S, for any i, j, let lcp(i, j) denote the length of the longest common prefix of suffix i and suffix j of S. • E.g. for S = acacag$, the longest common prefix of suffix 1 and suffix 3 is aca, so lcp(1, 3) = 3 [Figure: suffix tree of S = acacag$]

Longest common prefix (II) • Note that the lowest common ancestor (lca) of leaves i and j identifies the longest common prefix. • lcp(i, j) is the length of the path label of lca(i, j). • A well-known result: • Consider a tree of size n. After an O(n) time preprocessing, the lca of any two nodes can be returned in O(1) time. • First obtained by Harel and Tarjan (SIAM J. Comp. 1984) • Simplified by Schieber and Vishkin (SIAM J. Comp. 1988) • Based on the above result, • After an O(n) time preprocessing, • For any suffix i and suffix j, we can compute the length of their longest common prefix in O(1) time.

Finding Palindrome (I) • Given a string S, a palindrome is a substring u of S such that u = u^r, where u^r is the reverse of u • E.g. ACAGACA • Consider a palindrome u = S[i..i+|u|-1]; u is called a maximal palindrome if S[i’..j’] is not a palindrome for any interval [i’..j’] that properly contains [i..i+|u|-1]. • Note that every palindrome is contained in a maximal palindrome. • Thus, maximal palindromes are a compact way to represent all palindromes. • A complemented palindrome is a string u such that u = ū^r, where ū is the complement of u • E.g. ACAUGU • A maximal complemented palindrome is defined similarly.

Finding Palindrome (II) • Recall that the recognition site of a restriction enzyme is usually a complemented palindrome. • This motivates the following two problems: • The palindrome problem: • Given a string S (representing the genome) of length n, the problem is to locate all maximal palindromes in S. • The complemented palindrome problem: • Given a string S (representing the genome) of length n, the problem is to locate all maximal complemented palindromes in S.

Properties of palindrome (I) • Let S^r be the reverse of S. If S[i..i+k-1] = S^r[n-i+1..n-i+k], then u = S[i-k+1..i+k-1] is an odd length palindrome

Properties of palindrome (II) • Let S^r be the reverse of S. If S[i..i+k-1] = S^r[n-i+2..n-i+k+1], then u = S[i-k..i+k-1] is an even length palindrome

Solution to the palindrome problem • Preprocess S and S^r so that any longest common prefix query can be answered in constant time. • For i = 1 to n, • Find the longest common prefix of suffix i of S and suffix n-i+1 of S^r. If its length is k, we find an odd length maximal palindrome S[i-k+1..i+k-1]. • Find the longest common prefix of suffix i of S and suffix n-i+2 of S^r. If its length is k, we find an even length maximal palindrome S[i-k..i+k-1].
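The loop above can be checked with a small sketch. One swapped-in simplification, labeled here as our assumption: the longest common prefix is computed by direct character comparison instead of the constant time lca query, so the sketch runs in quadratic time. It reports, for each centre, the longest palindrome around that centre as a 1-based (start, end) pair.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b (naive scan)."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Longest palindrome around every centre, as 1-based (start, end) pairs."""
    n = len(s)
    r = s[::-1]                                  # S^r, the reverse of S
    found = set()
    for i in range(1, n + 1):
        # odd length: suffix i of S against suffix n-i+1 of S^r
        k = lcp(s[i - 1:], r[n - i:])
        found.add((i - k + 1, i + k - 1))
        # even length: suffix i of S against suffix n-i+2 of S^r
        k = lcp(s[i - 1:], r[n - i + 1:])
        if k > 0:
            found.add((i - k, i + k - 1))
    return sorted(found)


print(maximal_palindromes("ACAGACA"))
```

For ACAGACA the output contains (1, 7), the whole string, as the palindrome centred at position 4.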

Extracting embedded suffix tree from a generalized suffix tree • Input: The generalized suffix tree T of K strings S1, …, SK. • Aim: Compute the suffix tree Ti of the string Si. • E.g. S1 = acgat#, S2 = cgt$ [Figure: the generalized suffix tree T of S1 and S2, and the suffix tree T1 of S1 embedded in it]

Extracting embedded suffix tree from a generalized suffix tree • Observation: Ti is a subtree of T such that • The leaves of Ti are the leaves of T corresponding to Si. • The internal nodes of Ti are the lowest common ancestors of some leaves for Si. • The edges of Ti can be inferred from the ancestor-descendant relationship among those nodes. • E.g. S1 = acgat#, S2 = cgt$ [Figure: the generalized suffix tree T of S1 and S2, and the suffix tree T1 of S1 embedded in it]

Extracting embedded suffix tree from a generalized suffix tree [Figure: step-by-step extraction of T1 from T, adding one leaf of S1 at a time]

Common substrings of more than 2 strings (I) • Given a set of strings (protein or DNA sequences), we want to know which substrings are common to a large number of these strings. • Why is this question important? • DNA and protein sequences evolve. If a substring occurs in a wide range of species, this may mean that the substring is critical for correct function.

Common substrings of more than 2 strings (II) • Given K strings whose total length is n. • For every 2 ≤ k ≤ K, define l(k) to be the length of the longest substring common to at least k of these strings. • The problem is to compute l(k) for all k.

Common substrings of more than 2 strings (III) • Example: • Consider a set of 5 strings { sandollar, sandlot, handler, grand, pantry } • Then, we have l(2) = 4 (e.g. sand), l(3) = 3 (and), l(4) = 3 (and), l(5) = 2 (an)

Common substrings of more than 2 strings (IV) • Illustrating the solution by example: • S1 = aacg$, S2 = acgc#, S3 = cga%. (K=3) • Build a generalized suffix tree T for the K strings in O(n) time. [Figure: generalized suffix tree T of S1, S2 and S3]

Common substrings of more than 2 strings (V) • By traversing T, for each internal node v, compute its string depth. In total, O(n) time. [Figure: T with the string depth written at each internal node]

Common substrings of more than 2 strings (VI) • By traversing T, for each internal node v, compute C(v). [C(v) is defined as the number of distinct termination symbols in the subtree rooted at v] • This step takes O(Kn) time. [Figure: T with C(v) written at each internal node]

Common substrings of more than 2 strings (VII) • Traverse T and visit every internal node v. For each v, if V(C(v)) < string-depth of v, set V(C(v)) = string-depth of v. [After this step, V(k) = the length of the longest substring common to exactly k of these strings.] • Set l(K) = V(K). For i = K-1 down to 2, l(i) = max{l(i+1), V(i)}. • These two steps take O(n) time. • For our example, V(2) = 3, V(3) = 2. • Thus, l(3) = 2, l(2) = 3. • In total, this algorithm takes O(Kn) time. • Actually, we can improve this algorithm to O(n) time by means of lcp!
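The V(k) and l(k) steps can be sketched as below. As a labeled simplification of ours, the sketch uses an uncompressed generalized suffix trie whose nodes store the set of strings that reach them (hypothetical fields "ids" and "kids"); len(ids) plays the role of C(v) and the trie depth plays the role of the string depth. It is quadratic, not O(Kn), but it produces the same V and l values.

```python
def common_substring_lengths(strings):
    """Return {k: l(k)} for 2 <= k <= K, following the V(k) / l(k) steps."""
    K = len(strings)
    root = {"ids": set(), "kids": {}}
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"ids": set(), "kids": {}})
                node["ids"].add(sid)
    # V[k]: longest path label shared by exactly k strings
    V = [0] * (K + 1)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in node["kids"].values():
            c = len(child["ids"])            # C(v) for this node
            V[c] = max(V[c], depth + 1)      # depth + 1 is its string depth
            stack.append((child, depth + 1))
    # l(K) = V(K); l(i) = max(l(i+1), V(i)) for i = K-1 down to 2
    l = [0] * (K + 2)
    for k in range(K, 1, -1):
        l[k] = max(l[k + 1], V[k])
    return {k: l[k] for k in range(2, K + 1)}


print(common_substring_lengths(["aacg$", "acgc#", "cga%"]))  # {2: 3, 3: 2}
```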

Linear time algorithm for constructing suffix tree

Straightforward construction of suffix tree • Consider S = s1s2…sn where sn = $ • Algorithm: • Initialize the tree with only a root • For i = n down to 1 • Insert S[i..n] into the tree • Time: O(n^2)
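A minimal sketch of this quadratic construction, assuming a hypothetical Node class of our own: each node stores its incoming edge label as a pair of indices into S (0-based, end exclusive) rather than as a string, and an edge is split when a new suffix diverges in the middle of it. S must end with a unique terminator.

```python
class Node:
    """Suffix tree node; the incoming edge label is s[start:end]."""
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end
        self.children = {}      # first character of edge label -> child Node
        self.leaf = leaf        # 1-based suffix number for leaves, else None


def naive_suffix_tree(s):
    assert s.count(s[-1]) == 1, "s must end with a unique terminator"
    n, root = len(s), Node()
    for i in range(n - 1, -1, -1):           # suffixes from shortest to longest
        node, pos = root, i
        while True:
            child = node.children.get(s[pos])
            if child is None:                # no edge starts with s[pos]: new leaf
                node.children[s[pos]] = Node(pos, n, leaf=i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:               # whole edge matched: go deeper
                node = child
                continue
            mid = Node(child.start, k)       # mismatch inside the edge: split it
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[pos]] = Node(pos, n, leaf=i + 1)
            break
    return root


def leaf_labels(node, s, prefix=""):
    """Map each leaf number to the path label read from the root."""
    out = {}
    for child in node.children.values():
        label = prefix + s[child.start:child.end]
        if child.leaf is not None:
            out[child.leaf] = label
        out.update(leaf_labels(child, s, label))
    return out


def count_edges(node):
    return sum(1 + count_edges(c) for c in node.children.values())


tree = naive_suffix_tree("acacag$")
print(count_edges(tree), sorted(leaf_labels(tree, "acacag$")))
```

The index pairs chosen for internal edges depend on the insertion order, so they may differ from the pairs drawn on the slides while still spelling the same labels.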

Example of construction • S = acca$ • Init, then the for-loop [Figure: the trees I5, I4, I3, I2 and I1 obtained as the suffixes of S are inserted one by one]

Construction of generalized suffix tree • S’ = c# • Init, then the for-loop [Figure: starting from the suffix tree I1 of S, the trees J2 and J1 obtained as the suffixes of S’ are inserted]

Can we construct a suffix tree in o(n^2) time? • Yes. We can construct it in O(n) time. • Weiner’s algorithm [1973] • Linear time for constant size alphabet, but much space • McCreight’s algorithm [JACM 1976] • Linear time for constant size alphabet, more space-economical than Weiner’s algorithm • Ukkonen’s algorithm [Algorithmica, 1995] • Online algorithm, linear time for constant size alphabet, less space • Farach’s algorithm [FOCS 1997] • Linear time for general alphabet • Hon, Sadakane, and Sung’s algorithm [FOCS 2003] • O(n) bit space, O(n log^ε n) time for 0 < ε < 1 • O(n) bit space, O(n) time for suffix array construction • We will discuss Farach’s algorithm later.

Idea • Build the odd suffix tree and the even suffix tree • Then, merge the odd and even suffix trees. [Figure: odd suffix tree and even suffix tree]

Idea • Input: a string S of length n • Recursively compute the suffix tree To of all suffixes beginning at the odd positions. • To is of size n/2. • From To, compute Te which is the suffix tree for all suffixes beginning at the even positions. • Merge To and Te to form the suffix tree for S.

Stage 1: Constructing odd suffix tree • Given a string S[1..n], we generate a new string S’[1..n/2] by mapping pairs of characters to single characters: • Form the pairs S[1..2], S[3..4], S[5..6], …, S[n-1..n]. • Remove the duplicates from the pairs of characters and sort them by radix sort. • S’[i] = rank of S[2i-1..2i] in the sorted list, for i = 1, 2, …, n/2. • By recursion, we get the suffix tree T’ for S’ • Convert T’ to the odd suffix tree To.

Example (I) • S = aaabbbabbaba$ • S[1..2]=aa, S[3..4]=ab, S[5..6]=bb, S[7..8]=ab, S[9..10]=ba, S[11..12]=ba. • By stable sort, aa < ab < ba < bb. • Rank(aa)=1, Rank(ab)=2, Rank(ba)=3, Rank(bb)=4. • So, S’=124233$.
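The pair-ranking step of this example can be reproduced with a short sketch. Two assumptions of ours: Python's built-in sorted stands in for the radix sort, and the terminator $ is stripped before pairing so that the remaining length is even, as in the example.

```python
def compress_pairs(s):
    """Map S (ending in '$') to the rank sequence S' and the rank table."""
    body = s[:-1]                                   # drop the terminator
    assert len(body) % 2 == 0, "sketch assumes an even number of characters"
    pairs = [body[i:i + 2] for i in range(0, len(body), 2)]
    # rank the distinct pairs in lexicographic order, starting from 1
    rank = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [rank[p] for p in pairs], rank


ranks, table = compress_pairs("aaabbbabbaba$")
print(ranks)   # [1, 2, 4, 2, 3, 3]
print(table)   # {'aa': 1, 'ab': 2, 'ba': 3, 'bb': 4}
```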

Example (II) • By recursion, construct the suffix tree T’ for S’ = 124233$ [Figure: suffix tree T’ of S’, with leaves labeled 1 to 7]

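Example (III) • Convert T’ to the odd tree: replace each rank on an edge label by its pair of characters, and relabel each leaf i as 2i-1 • The result is not yet a suffix tree, since two edges leaving the same node may begin with the same character [Figure: the converted tree, with leaves 13, 5, 11, 9, 7, 1, 3]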

Example (IV) • Refine the odd tree To: where sibling edges begin with the same character, move that shared character onto a new edge leading to a new internal node [Figure: the refined odd tree To, with leaves 13, 5, 11, 9, 7, 1, 3]

Time complexity for building the odd tree • Let Time(n) be the time to build a suffix tree for a string of length n. • Stable sorting and refinement of the odd tree take O(n) time. • Building the suffix tree for S’ takes Time(n/2). • So, Stage 1 takes Time(n/2)+O(n) time.
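To see why this recurrence is compatible with a linear total, here is a short worked bound. It assumes that the remaining stages (building the even tree and merging) also cost O(n), so that the whole algorithm satisfies the same recurrence, and it writes the O(n) term as cn for some constant c, which is our notation.

```latex
\begin{align*}
\mathrm{Time}(n) &= \mathrm{Time}(n/2) + cn \\
                 &\le cn + \frac{cn}{2} + \frac{cn}{4} + \cdots \\
                 &\le 2cn = O(n)
\end{align*}
```

The geometric series is what keeps the recursion from adding a logarithmic factor.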

Stage 2: Build the even tree • Generate the lex-ordering of the leaves in Te. • For any two adjacent leaves 2i and 2j, we find lcp(2i, 2j). • Construct the even tree Te from left to right (according to the lex-ordering).

Build the even tree (Step 1) • We get the lex-ordering of the leaves in To. • Generate the lex-ordering of the leaves in Te. • For each leaf i in To, get the preceding character c = S[i-1] and form a pair (c, i). Each pair represents an even suffix i-1. • Perform stable sorting on those pairs by c. We get the lex-ordering of the leaves in Te.

Example S = aaabbbabbaba$ • Lex-ordering of the leaves in To: • 13 < 1 < 7 < 3 < 11 < 9 < 5 • The pairs are: • (a, 13), ($, 1), (b, 7), (a, 3), (a, 11), (b, 9), (b, 5). • After stable sorting, we have • ($, 1), (a, 13), (a, 3), (a, 11), (b, 7), (b, 9), (b, 5). • The pair ($, 1) has no preceding character in S and corresponds to no even suffix, so it is dropped. • Hence, the lex-ordering of the leaves of Te: • 12 < 2 < 10 < 6 < 8 < 4
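The derivation of the even ordering from the odd ordering can be sketched as follows. The function name is hypothetical, and Python's stable list sort stands in for the stable (radix) sort; leaf 1 is skipped because it has no preceding character.

```python
def even_order_from_odd(s, odd_order):
    """Lex-order of even suffixes, given the lex-order of odd suffixes (1-based)."""
    # pair (S[i-1], i) for each odd leaf i; s is 0-based, so S[i-1] is s[i - 2]
    pairs = [(s[i - 2], i) for i in odd_order if i > 1]
    # stable sort on the character only: ties keep the odd lex-order
    pairs.sort(key=lambda pair: pair[0])
    return [i - 1 for _, i in pairs]


s = "aaabbbabbaba$"
print(even_order_from_odd(s, [13, 1, 7, 3, 11, 9, 5]))  # [12, 2, 10, 6, 8, 4]
```

Stability matters: two even suffixes with the same first character are ordered by the odd suffixes that follow them, which is exactly the order the pairs arrive in.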

Build the even tree (Step 2) • For any two adjacent leaves 2i and 2j, we first find lcp(2i, 2j). • Observation: lcp(2i, 2j) = • lcp(2i+1, 2j+1)+1 if S[2i] = S[2j] • 0 otherwise • Proof: • If S[2i] ≠ S[2j], lcp(2i, 2j) = 0. • Otherwise, lcp(2i, 2j) = 1 + lcp(2i+1, 2j+1).

Example • Recall the lex-ordering of the leaves: • 12 < 2 < 10 < 6 < 8 < 4. • By the previous observation, we have • lcp(8, 4) = lcp(9, 5)+1 = 2 • Similarly, we have • lcp(12, 2) = 1, lcp(2, 10) = 1, lcp(10, 6) = 0, lcp(6, 8) = 1, lcp(8, 4) = 2 [Figure: the refined odd tree To, from which the odd lcp values are read]
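These lcp values can be checked with a small sketch of the observation. One labeled substitution: the odd lcp(2i+1, 2j+1) is computed here by direct comparison, whereas the algorithm on the slides would read it from To with a constant time lca query. Function names are our own.

```python
def lcp_suffixes(s, i, j):
    """Naive lcp of suffix i and suffix j of s (1-based positions)."""
    k = 0
    while (i - 1 + k < len(s) and j - 1 + k < len(s)
           and s[i - 1 + k] == s[j - 1 + k]):
        k += 1
    return k


def even_lcp(s, a, b):
    """lcp of even suffixes a and b via the odd suffixes a+1 and b+1."""
    if s[a - 1] != s[b - 1]:          # first characters differ
        return 0
    return 1 + lcp_suffixes(s, a + 1, b + 1)


s = "aaabbbabbaba$"
order = [12, 2, 10, 6, 8, 4]
print([even_lcp(s, a, b) for a, b in zip(order, order[1:])])  # [1, 1, 0, 1, 2]
```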

Build the even tree (Step 3) • Construct the even tree Te from left to right. [Figure: the partial even tree after inserting leaf 12, then leaf 2, then leaf 10]
