170 likes | 258 Vues
Exploring DNA sequences, motifs, and evolution using multiple sequence comparison methods PMS1 and PMS2. Primary applications include finding transcription factor binding sites and drug target identification.
E N D
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering University of Connecticut, Storrs, CT
Introduction • General Problem: Multiple Sequence Comparison • Biological Basis • DNA structure/function • Sequence of nucleotides • Modeled as strings • Genes code for proteins • Structure Function • Evolution – Result of DNA mutations and selective pressures Image credit: www.britannica.com
Introduction • Goals of Multiple Sequence Comparison • Deduce evolutionary relationships. • Protein and gene function studies. • Find transcription factor/ regulatory protein binding sites. • Approaches: • Find common subpatterns and deduce a biological relationship. • Find common subpatterns between DNA sequences with a known biological relationship.
Planted Motif Search • Motifs- Common functional subsequences in a set of biological sequences • Planted (l,d) motif search problem: • Input are n strings (S1, S2, … , Sn) of length m and two integers l and d. Find all strings x such that |x| = l and every input string contains at least one variant of x at a Hamming distance of at most d. • Primary applications: Finding transcription factor binding sites; drug target identification
Algorithm PMS1 • Generate the set of all l-mers in each input sequence. Let Ci correspond to the l-mers of Si. • For each l-mer u in Ci (1 <i<n), generate all l-mers v such that v is at a Hamming distance of at most d from u (v is a “neighbor” of u). Let Li correspond to all l-mers u and v from input sequence Si. • Alphabetically sort each set of neighbors Li and eliminate any duplicates. • Merge and intersect all sets Li to find the l-mer that appears in each neighborhood. Such l-mers constitute the motifs in the input sequences.
Algorithm PMS2 • Algorithm PMS2 exploits these observations • If M occurs in each input sequence, then at least l-k+1 length-k substrings of M occur in each input sequence. • In each input sequence there must be at least one position ij such that a k-mer of M occurs at each position ij – ij+l – k .
Algorithm PMS2 • Use a modified PMS1 to solve the planted (d+c, d)-motif problem. Let R contain the (d+c)-motifs. • Find all of the occurrences of R in an arbitrary input sequence Sj,. Let Li contain the (d+c)-motifs of R with variants starting at position i of Sj. • For each position i in Sj • A is the l-mer of Sj starting at position i • M1 and M2 are members of Li and Li+l – (d+c). • If the last 2(d+c) – l characters of M1 are equal to the first 2(d+c) – l characters of M2, form an l-mer B by appending the last l – (d+c) characters of M2 to M1. • If dH(A,B) <d, add B to a list of candidates C. Once the list of candidates is complete, check if each candidate is a motif.
Results • n =20and m = 600; arbitrary motif inserted in each input sequence • Each implementation gave the correct planted motif for each (l,d) case • PMS1 was faster than PMS2 for the challenging instances (9,2) and (11,3) • Otherwise, PMS2 could be faster, depending on the value of c • Low values of c lead to a high number of (d+c,d)-motifs, which leads to a high number of candidate strings • Conslusions • PMS1 better-suited for challenge problems • PMS2 better suited for larger l Runtimes, in seconds, of algorithms PMS1 and PMS2
Minimization of Consensus Sequences • Consensus Sequence • An expression that can be used to describe two or more sequences • Two forms: • {c1, c2, … ,cn} – Presence of one of the given characters c in the list • {i1, i2, … in}c – Character c may occur any number of times ik • Examples: • Merging abcde, abccde, abcdee, and abccdee gives ab{1,2}cd{1,2}e • Merging agtgc and actgc gives a{c,g}tgc • Problem Statement: Output a minimum number of consensus sequences for a given set of input sequences.
Minimization of Consensus Sequences • Algorithm • To start, all input sequences are “alive” • An arbitrary alive sequence S is chosen and compared with every other alive sequence T to check if they can be merged. • Dynamic programming is used to optimally align S and T • The optimal alignments of S and T will have loops corresponding to insertions, deletions, and replacements. • Merging may occur only if all loops can be resolved • All mismatches can be resolved • Insertions and deletions can be resolved only if there is a match of the inserted/deleted character to the left or right of the loop. • If S and T can be merged, a consensus sequence is generated and added to the list of “alive” sequences. S and T are killed. • This process continues until no two alive sequences can be merged. At that point, all remaining alive input and consensus sequences are output.
Summary • Planted Motif Search Problem: Find an l-mer that differs in d or less places from at least one l-mer in each input sequence • Algorithms PMS1 and PMS2 are based on a model that generates the neighborhood of every input sequence and intersects them to find the motifs • PMS1 is best suited for challenge problems; use PMS2 for larger l • Future work will include the minimization of consensus sequences (regular expressions)
Acknowledgements • Special thanks to: • Sanguthevar Rajasekaran • National Science Foundation • University of Connecticut Department of Computer Science and Engineering
Levenshtein Distance • Formal definition: The lowest number of edit operations, consisting of insertion (I), deletion (D), and replacement (R), necessary to convert one string to another. • Algorithm • Let Di,j be the edit distance of S1(1…i) and S2(1…j). • Add a blank space to the beginning of each string and align the strings along the edges of a matrix. • By definition, Di,0= i and D0,j= j. • Recurrence relation: Di,j= min(Di-1,j+ 1, Di,j-1+ 1, Di-1,j-1 + ti,j ) • ti,j = 0 if S1[i] = S2[j] , 1 otherwise (substitution) • By definition of Di,j, Dn,m, where n = |S1| and m = |S2|, is the edit distance of S1 and S2
Example • S1 = vintner, S2 = writers Adapted from Algorithms on Strings, Trees, and Sequences by Dan Gusfield, 1999. • Value in bottom-right cell gives Levenshtein distance (5)
Optimal Alignment from Levenshtein Distance • Working from the bottom right of the matrix, insert pointers • Set a pointer from cell (i,j) to • Cell (i-1, j) if Di,j = Di-1,j + 1 • Corresponds to a deletion of S1(i) from S1 • Cell (i, j-1) if Di,j = Di,j-1 + 1 • Corresponds to an insertion of S2(j) into S1 • Cell (i-1, j-1) if Di,j = Di-1,j-1 + ti,j • Corresponds to match (t=0) or replacement (t=1) • Follow the pointers from Dn,m to D(0,0) to get optimal alignment • Some cells may have two pointers, in which case more than one optimal alignment exists • 3 optimal alignments in the example: