Learn about local alignment methods using DAWG to reveal sequence similarity. Understand key techniques like Needleman–Wunsch algorithm, dynamic programming, and scoring matrices (BLOSUM, PAM). Explore meaningful alignment, BLAST scores, and improving DP table efficiency. Discover DAWG construction and its application in sequence alignment.
Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang
Sequence Similarity • Alignment • Arrange DNA/protein sequences to show their similarity • A gap symbol denotes an insertion/deletion event
Other variations • Edit distance • Longest common substring • Affine gap scoring • Using scoring matrix (BLOSUM, PAM)
Alignment score computation • Needleman–Wunsch • Dynamic programming
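As a minimal sketch of the dynamic programming behind Needleman–Wunsch, the function below fills the global alignment table and returns the optimal score. The function name, the linear gap cost, and the default scores (match 1, mismatch -1, gap -1) are assumptions made for illustration; they are not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap cost."""
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match / mismatch
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]
```

The table has (n+1)(m+1) entries and each is computed in constant time, so the running time is proportional to the product of the two sequence lengths.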
Local alignment • Local alignment • Find the best alignment between two substrings, one taken from each sequence
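A local alignment score can be sketched with the Smith–Waterman recurrence, which differs from the global case only in that every entry is clamped at zero and the answer is the maximum over the whole table. The function name, the linear gap cost, and the default scores are illustrative assumptions.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score between a substring of a and a substring of b."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)            # previous DP row
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                      # start a new alignment here
                         prev[j - 1] + sub,      # match / mismatch
                         prev[j] + gap,          # gap in b
                         cur[j - 1] + gap)       # gap in a
            best = max(best, cur[j])
        prev = cur
    return best
```

Only two rows are kept because the score alone is requested; recovering the aligned substrings would require the full table or traceback pointers.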
BWTSW • Motivation • Scoring 75% similarity • Most entries of the local alignment table are zero • Meaningful alignment • Suffix tree • Meaningful alignment • Meaningful alignment with gap • How good is it?
Meaningful alignment (1) • Sequence similarity sometimes implies functional similarity. • Biologists are usually NOT interested in sequences with less than 70% similarity. • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2
Meaningful alignment (2) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • A gapless alignment needs more than 75% matches to have a nonzero score
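For the scores listed above, the fraction of matches needed for a positive score can be worked out directly. The derivation assumes a gapless alignment of length L with x matching positions; L and x are notation introduced here for illustration.

```latex
\[
\text{Score} \;=\; 1\cdot x \;-\; 3\,(L - x) \;=\; 4x - 3L \;>\; 0
\quad\Longleftrightarrow\quad
\frac{x}{L} \;>\; \frac{3}{4}.
\]
```

So a gapless alignment needs more than 75% matching positions to score above zero, and gaps, which only subtract from the score, push the required fraction higher.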
Meaningful alignment (3) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • How many nonzero entries are there in the local alignment DP table?
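One way to explore this question empirically is to fill the local alignment table and count the positive entries. The sketch below uses an affine gap recurrence and assumes the BLAST-style convention that a gap of length k costs 5 + 2k; that convention, the function name, and the three-table layout are assumptions, not details given on the slide.

```python
NEG_INF = float("-inf")

def count_nonzero_entries(a, b, match=1, mismatch=-3,
                          gap_open=-5, gap_extend=-2):
    """Return (number of positive entries, table size) of the local DP table."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    nonzero = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend,
                          H[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          H[i - 1][j] + gap_open + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            if H[i][j] > 0:
                nonzero += 1
    return nonzero, n * m
```

For example, `count_nonzero_entries("ACGT", "ACGT")` finds positive entries only on the main diagonal of the 4 x 4 table.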
How to improve? • Idea: • Not storing zero score entries • Using suffix tree to prune off early
BWTSW details • FM index for suffix tree representation • Prune zero entries • Store DP vector using linked list
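The idea of keeping only the nonzero entries of a DP vector can be sketched as follows. This is a simplified, gapless illustration and not the BWT-SW implementation: a Python dict stands in for the linked list, gaps are ignored, and the function name is hypothetical. The vector maps an end position j in the pattern to the positive score of aligning the whole text substring x against the pattern substring ending at j.

```python
def sparse_gapless_dp(x, pattern, match=1, mismatch=-3):
    """Sparse vector {j: score > 0} for aligning x with pattern[j-len(x):j]."""
    # First character of x: every matching pattern position starts an alignment.
    vec = {j + 1: match for j, p in enumerate(pattern) if x and p == x[0]}
    for c in x[1:]:
        if not vec:
            break                   # nothing positive survives: prune here
        nxt = {}
        for j, s in vec.items():
            if j < len(pattern):
                t = s + (match if pattern[j] == c else mismatch)
                if t > 0:           # zero and negative entries are not stored
                    nxt[j + 1] = t
        vec = nxt
    return vec
```

Once the vector becomes empty, no extension of x can produce a positive entry in this gapless setting, which is the condition that lets a traversal of the text's substrings stop early.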
Analysis • Text length = N • Pattern length = M • Alphabet size = $\sigma$
Average running time (1) • Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0 • $F(L) = \left|\{(S_1,S_2) : \mathrm{Len}(S_1)=\mathrm{Len}(S_2)=L,\ \mathrm{Score}(S_1,S_2)>0\}\right|$ • F(L) counts the number of pairs with 75% identity. • $F(L) = \sum_{i=0}^{L/4} \binom{L}{i} (\sigma-1)^{i}$, where $\sigma$ is the alphabet size • $F(L) \le k_1 k_2^{L}$ • $F(\log N) \le k_3 N^{0.68}$
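The sum for F(L) can be evaluated and checked against brute force for small L. The sketch below assumes the garbled term in the formula is (σ-1)^i with σ the alphabet size, fixes S1, and counts the strings S2 with at most L/4 mismatches (no gaps); the function names and the default σ = 4 are illustrative.

```python
from itertools import product
from math import comb

def F(L, sigma=4):
    """Closed-form count: choose i <= L/4 mismatch positions, sigma-1 letters each."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))

def F_brute(L, sigma=4):
    """Enumerate every S2 against a fixed S1 and count those with <= L/4 mismatches."""
    s1 = (0,) * L
    count = 0
    for s2 in product(range(sigma), repeat=L):
        mismatches = sum(x != y for x, y in zip(s1, s2))
        if 4 * mismatches <= L:
            count += 1
    return count
```

Note that when L is divisible by 4 the term i = L/4 corresponds to exactly 75% identity, where the gapless score is 0 rather than strictly positive, so the sum as written counts pairs with at least 75% identity.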
Average running time (2) • Given S1, $\Pr(\mathrm{Score}(S_1,S_2) > 0 \mid S_1) = F(L)/\sigma^{L}$ ($\sigma$ = alphabet size) • For M < log(N) • The number of entries is • $O(M \cdot F(M)) < O(\log(N) \cdot F(\log(N)))$ • For M > log(N) • $O(M \cdot N \cdot F(M)/\sigma^{L})$ • On average • Time $= O(M \cdot F(\log(N))) = O(M \cdot N^{0.68})$
Possible improvement of BWTSW • Worst-case running time $O(N^{2} M)$ • When M = N • $O(M N^{0.68} + M^{3})$ when the pattern is a substring of the text • What about ST vs. ST?
What we used in BWTSW is a suffix trie (not a suffix tree). • #Prove it# • A suffix trie has $O(N^{2})$ nodes • A DAWG is a similar structure with $O(N)$ nodes
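The quadratic size of a suffix trie can be seen by counting its nodes directly: each distinct substring corresponds to one node, plus the root for the empty string. The helper below is a brute-force sketch with a hypothetical name, meant only for small inputs.

```python
def suffix_trie_nodes(w):
    """Number of nodes in the suffix trie of w: distinct substrings plus the root."""
    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    return len(substrings) + 1
```

For a string of the form a^k b^k (length N = 2k) every a^i b^j with i, j <= k is a distinct substring, so the trie has (k+1)^2 nodes, which grows quadratically in N.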
DAWG (2) • DAWG: Directed Acyclic Word Graph • A DAWG is an acyclic automaton that recognizes all the substrings of the given string.
DAWG (3) • Example: • DAWG of “abcbc”
DAWG (4) • End-set view
Trivial DAWG construction • Using End-set class
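A naive construction from end-sets can be sketched as follows: compute, for every substring, the set of positions where its occurrences end, merge substrings with equal end-sets into one state, and add an edge labelled c from the state of x to the state of xc. The function names and the representation of states as frozensets are assumptions; the running time is far from linear and the code is meant only to illustrate the definition.

```python
def dawg_by_end_sets(w):
    """Return (root, states, edges) of the DAWG of w built from end-set classes."""
    n = len(w)
    end_set = {"": set(range(n + 1))}       # the empty string ends everywhere
    for i in range(n):
        for j in range(i + 1, n + 1):
            end_set.setdefault(w[i:j], set()).add(j)
    # A state is an end-set; substrings sharing an end-set share a state.
    state_of = {x: frozenset(e) for x, e in end_set.items()}
    root = state_of[""]
    edges = {}
    for x, q in state_of.items():
        for c in set(w):
            if x + c in state_of:
                edges[(q, c)] = state_of[x + c]
    return root, set(state_of.values()), edges

def recognizes(root, edges, x):
    """Follow the edges spelled by x from the root; succeed if no edge is missing."""
    q = root
    for c in x:
        if (q, c) not in edges:
            return False
        q = edges[(q, c)]
    return True
```

For w = "abcbc" this yields 8 states and 9 edges, and every substring of w (and nothing else) is spelled by a path from the root.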
DAWG properties • For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
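These bounds can be checked on small examples with the standard online suffix-automaton construction, which builds the same states in time linear in |w| for a fixed alphabet. The slides do not say which construction was used, so this algorithm is swapped in for illustration; the function names are hypothetical.

```python
def build_dawg(w):
    """Online suffix-automaton construction; returns the list of transition dicts."""
    nxt, link, length = [{}], [-1], [0]      # state 0 is the root
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings of q.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt

def dawg_size(w):
    """Return (number of states, number of edges) of the DAWG of w."""
    nxt = build_dawg(w)
    return len(nxt), sum(len(t) for t in nxt)
```

A quick check such as `dawg_size("abcbc")` gives 8 states and 9 edges, within the bounds 2|w|-1 = 9 and 3|w|-4 = 11.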
D(w) and ST(wR) • There is a map between nodes in DAWG and implicit ST(wR) • Example: w=abcbc, wR=cbcba • Store DAWG using ST, which uses only o(N) bits • [Figure: ST(wR) for wR=cbcba, with edge labels a, cb, b, cba]
D(w) and ST(wR) (2) • List all incoming edges of node q in D(w) using ST(wR)
Local Alignment using DAWG • Basis • Induction
Extensions • Meaningful alignment using DAWG • Prune the nodes whose score is less than zero • Shortest path pruning style • Cache log(N) nodes • The worst-case running time is M*N*log(N); the average case is the same for M << N.