An Approx. Algo. For Alignment MSA using Motif Discovery

An Approx. Algo. For Alignment MSA using Motif Discovery. B.J. Chen, M.W. Chang, C.H. Tsai, H.Y. Chuang (NTU). Multiple Sequence Alignment (MSA): given N sequences, align them, possibly with gaps, so as to bring out the best commonality among the N sequences.

Presentation Transcript


  1. An Approx. Algo. For Alignment MSA using Motif Discovery • B.J. Chen, M.W. Chang, C.H. Tsai, H.Y. Chuang • NTU

  2. MSA • Multiple Sequence Alignment: given N sequences, align them, possibly with gaps, so as to bring out the best commonality among the N sequences. • An alignment is usually scored by penalizing mismatches and gaps and rewarding matches.

  3. Previous work • [Alt89], [CHM94], [CL88], [GKS95], etc. work best for small values of N (about 2-6). • [ZZ96] and [Vih98] handle the case N > 2 by first applying pairwise alignment techniques.

  4. Motivation of this work • For larger values of N, additional constraints are needed to give a biologically meaningful alignment. • MOTIF: a pattern common to several sequences.

  5. Motivation of this work • Alignment number K (2 ≤ K ≤ N): a user-controlled parameter that constrains the alignment to have at least K sequences agree on a character, whenever possible. • The commonality across most of the sequences must be detected.

  6. Problem Description • Align multiple strings with gaps • Hope: bring out the best commonality • Parameter: K, the alignment number • The alignment should have at least K sequences agree on a character, whenever possible. • K = 1 is meaningless

  7. Problem Definition • Given N sequences, the i-th of length n_i • Let the matrix A be an alignment; the size of A is N × T (T is the length after alignment, T ≥ n_i for every i) • Let A_i be the i-th row of A; deleting all gaps from A_i yields the original sequence i
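
The definition above can be checked mechanically. The following is a minimal sketch, assuming the gap symbol is "-" and that rows and sequences are plain strings; the function name and the sample sequences are hypothetical and not from the slides.

```python
GAP = "-"  # assumed gap symbol; the slides do not fix one

def is_valid_alignment(rows, sequences):
    """Return True if rows form an N x T alignment of the given sequences."""
    if len(rows) != len(sequences):
        return False
    width = len(rows[0]) if rows else 0
    for row, seq in zip(rows, sequences):
        # Every row has length T, and removing gaps must give back the sequence.
        if len(row) != width or row.replace(GAP, "") != seq:
            return False
    return True
```

For instance, `is_valid_alignment(["AC-GT", "A-CGT"], ["ACGT", "ACGT"])` holds, while a row that drops or changes a character does not.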

  8. Problem Definition (2) • Define $E_A(a,i,j) = 1$ if $A[i,j] = a$, and $E_A(a,i,j) = 0$ otherwise • $\#(a,j) = \sum_i E_A(a,i,j)$ • Character $a$ in column $j$ is “bad” if $\#(a,j) < K$ • Define the badness of column $j$ as $\sum_{a \,:\, 0 < \#(a,j) < K} \#(a,j)$

  9. Problem Definition (3) • Example: given K = 4, the column A A B B B B C C is marked X X O O O O X X (X = bad, O = not bad), so its badness is #A + #C = 2 + 2 = 4
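
The badness measure can be sketched in a few lines. This is an illustration only, assuming that gaps (written "-") are not counted as characters, which the slides do not state; the function names are hypothetical.

```python
from collections import Counter

def column_badness(column, k, gap="-"):
    """Sum of #(a, j) over characters a that occur in the column fewer than k times."""
    counts = Counter(c for c in column if c != gap)  # assumption: gaps are ignored
    return sum(n for n in counts.values() if n < k)

def alignment_badness(rows, k, gap="-"):
    """Total badness of an alignment given as equal-length row strings."""
    return sum(column_badness(col, k, gap) for col in zip(*rows))
```

For the column A A B B B B C C with K = 4, `column_badness("AABBBBCC", 4)` counts the two A's and the two C's and ignores the four B's.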

  10. Problem Definition (4) • Minimize the total badness: K-MSA • The paper proves that this problem is MAX SNP-hard (I'll report on this part later)

  11. Stage 1: Motif Discovery • We begin by defining a motif in a sequence • Given a string s over an alphabet Σ and an integer K, 2 ≤ K ≤ |s| • Definition 1 (K-motif): a string m over Σ ∪ {'.'} is a K-motif with location list L_m = (l_1, l_2, …, l_p) if all of the following hold • m[0] and m[|m|-1] belong to Σ • p ≥ K • For every “don't care” character at position j in m, there exist at least two distinct occurrences l_{i1} and l_{i2}, 1 ≤ i1, i2 ≤ p, such that s[l_{i1} + j] ≠ s[l_{i2} + j]

  12. Definition 2 • Maximal motif: let p_1, p_2, …, p_k be all the motifs in the sequence s. A motif p_i is maximal if and only if • there is no p_j (j ≠ i) such that p_i is a substring of p_j, or • whenever p_i is a substring of p_j, there exists at least one occurrence of p_i in s that is not covered by p_j

  13. Definition 3 • Redundant motif: • A maximal motif m, with location list Lm, is redundant if there exist maximal motifs mi, 1 ≤i≤p , such that Lm = Lm1 ∪ Lm2 ∪ … ∪ Lmp • A motif that is not redundant is called an irredundant motif

  14. Example 1 • Let the input string s have the following form: ac1c2c3baXc2c3bYac1Xc3bYYac1c2Xb • Occurrences (+) of each motif in the four segments of s:
      motif      ac1c2c3b   aXc2c3bY   ac1Xc3bYY   ac1c2Xb
      a...b         +          +           +          +
      a..c3b        +          +           +
      a.c2.b        +          +                      +
      ac1..b        +                      +          +
      a.c2c3b       +          +
      ac1.c3b       +                      +
      ac1c2.b       +                                 +
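
The location lists in Example 1 can be recomputed with a short sketch. It assumes the string is tokenized into symbols (so that c1, c2, c3 count as single characters) and uses brute-force matching; the function names are hypothetical and this is not the discovery algorithm the slides refer to.

```python
def occurrences(motif, s):
    """Start positions where motif (list of symbols, '.' = don't care) matches s."""
    m = len(motif)
    return [
        start
        for start in range(len(s) - m + 1)
        if all(p == "." or p == s[start + j] for j, p in enumerate(motif))
    ]

def is_k_motif(motif, s, k):
    """Check the three conditions of Definition 1 for the given motif."""
    locs = occurrences(motif, s)
    if motif[0] == "." or motif[-1] == "." or len(locs) < k:
        return False
    # Each don't-care position must see at least two different symbols.
    return all(
        len({s[l + j] for l in locs}) >= 2
        for j, p in enumerate(motif)
        if p == "."
    )

# Example 1, with c1, c2, c3 treated as single symbols.
S = "a c1 c2 c3 b a X c2 c3 b Y a c1 X c3 b Y Y a c1 c2 X b".split()
ALL = occurrences(["a", ".", ".", ".", "b"], S)
PART1 = occurrences(["a", ".", ".", "c3", "b"], S)
PART2 = occurrences(["a", ".", "c2", ".", "b"], S)
```

Here the location list of a...b equals the union of the location lists of a..c3b and a.c2.b, which is the location-list condition used in Definition 3.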

  15. Motif Discovery • The motifs of interest for the sequence alignment problem are the irredundant motifs (Lemma 4) • There exists a polynomial-time algorithm to extract them from the input • In the rest of the paper we use motifs that occur in multiple sequences, exactly once in each sequence in which they occur • If a motif occurs more than once in a sequence, then for the arguments that follow each occurrence is treated as a distinct motif

  16. Lemma 4 • Lemma 4: if p is a redundant motif, then using the motif p does not improve the cost of the K-MSA optimization problem • Proof: let p be rendered redundant by motifs p_1, p_2, …, p_n, n ≥ 1. By definition, p has fewer solid characters than each p_i, 1 ≤ i ≤ n. Thus if an alignment can use p, it can certainly use all of p_1, p_2, …, p_n, which gives a larger number of solid characters and hence a higher cost for the K-MSA optimization problem

  17. Stage 2: Sequence Alignment • Definition 4 • Consider two irredundant motifs p_i and p_j and a sequence s containing both • Let n_i and n_j be the sizes of p_i and p_j, and let l_i and l_j be their locations (offsets) in s • Motifs overlap: p_i and p_j overlap in s if the intervals [l_i, l_i + n_i] and [l_j, l_j + n_j] have a non-empty intersection

  18. Definition 5 • Pairwise compatible motifs: two motifs p1 and p2 are pairwise compatible (also called pairwise feasible) if there exists an alignment of the sequences that does not introduce gaps in p1 or p2

  19. Lemma 1 • Two irredundant motifs p_i and p_j are pairwise feasible if and only if both of the following hold (without loss of generality, p_i is the one on the left) • If p_i and p_j do not overlap in any sequence, then p_i is to the left of p_j in every sequence containing both; a violation is a domain crossing mismatch • If p_i and p_j overlap in some sequence, then p_i is at the same fixed distance d to the left of p_j in every sequence containing both; a violation is an overlap mismatch
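
The two conditions of Lemma 1 can be sketched as a small test. This is an illustration under assumptions: motif occurrences are given as a mapping from sequence id to offset, intervals are treated as half-open [l, l + n) so that adjacent motifs do not count as overlapping, and the function names are hypothetical.

```python
def overlaps(l_i, n_i, l_j, n_j):
    """Half-open intervals [l, l + n) intersect (the half-open choice is an assumption)."""
    return l_i < l_j + n_j and l_j < l_i + n_i

def pairwise_feasible(locs_i, n_i, locs_j, n_j):
    """locs_* map sequence id -> offset of the motif in that sequence."""
    common = locs_i.keys() & locs_j.keys()
    if not common:
        return True  # the motifs never meet, so nothing can conflict
    diffs = {locs_j[s] - locs_i[s] for s in common}
    if any(overlaps(locs_i[s], n_i, locs_j[s], n_j) for s in common):
        return len(diffs) == 1  # overlap: same fixed distance everywhere
    # no overlap: one motif is consistently on the same side of the other
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
```

The checks used to exercise this sketch are hypothetical offsets, not data from the slides: two motifs of length 4 at offsets (0, 0) and (2, 3) in two shared sequences fail the fixed-distance condition, and two non-overlapping motifs that swap sides between sequences fail the left-of condition.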

  20. Mismatch • [Figure: Motif A and Motif B in two panels, (i) and (ii), illustrating overlap mismatches and domain-crossing mismatches]

  21. Definition 6 • Motif alignment: given a set M of motifs, a motif alignment of the sequences s1, s2, …, sm is an alignment in which every motif in M is aligned across all the sequences in which it appears, without breaking any motif • Feasible set: if such an alignment exists, M is called a feasible set

  22. Definition 7 • Linear ordering of motifs: given a set of feasible motifs, a linear ordering is a consistent ordering of the motifs such that, in every sequence, the motifs present in that sequence appear in this left-to-right order

  23. Example 2 • Sequence 1: HIAJGLB • Sequence 2: AMCNBQD • Sequence 3: COPEDRF • Sequence 4: HISJGETUF • Sequence 5: HIVJG • Motif (i): A...B in sequences 1 and 2 • Motif (ii): C...D in sequences 2 and 3 • Motif (iii): E..F in sequences 3 and 4 • Motif (iv): HI.JG in sequences 1, 4 and 5

  24. Example 2 • Alignment of the five sequences (motifs from left to right: (iv), (i), (ii), (iii)):
      H I A J G L B - - - -
      - - A M C N B Q D - -
      - - - - C O P E D R F
      H I S J G - - E T U F
      H I V J G - - - - - -

  25. Example 3 • Sequences:
      H I A J G L B - - - -
      - - A M C N B Q D - -
      - - - - C O P E D R F
      - - - - - H I E J G F
      - - - - - H I K J G -
      • No alignment due to overlap mismatch • There is no linear ordering of the motifs

  26. Example 4 • Sequence 1: AGHDXY • Sequence 2: AICDJF • Sequence 3: XYCDKF • Motif (i): A..D in sequences 1 and 2 • Motif (ii): CD.F in sequences 2 and 3 • Motif (iii): XY in sequences 1 and 3 • The linear order of the motifs is (i), (iii), (ii) • No alignment exists, due to a crossing mismatch
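
The linear order stated for Example 4 can be reproduced with a topological sort. This is a sketch under assumptions: motifs are ordered within a sequence by start offset, `graphlib` from the standard library (Python 3.9 or later) does the sorting, and the function name and data layout are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def linear_ordering(positions):
    """positions: {sequence id: {motif id: start offset}}.

    Returns one ordering compatible with every sequence, or None if the
    left-to-right constraints contain a cycle.
    """
    preds = {}
    for motifs in positions.values():
        ordered = sorted(motifs, key=motifs.get)  # left to right by start offset
        for m in ordered:
            preds.setdefault(m, set())
        for left, right in zip(ordered, ordered[1:]):
            preds[right].add(left)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None

# Start offsets read off Example 4: (i) A..D, (ii) CD.F, (iii) XY.
EXAMPLE_4 = {
    1: {"i": 0, "iii": 4},   # AGHDXY
    2: {"i": 0, "ii": 2},    # AICDJF
    3: {"iii": 0, "ii": 2},  # XYCDKF
}
```

As Example 4 shows, the existence of such an ordering is necessary but not sufficient for an alignment to exist.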

  27. Definition 8 • Domain crossing error: given a set of motifs m1, m2, …, mn, a domain crossing error is said to occur if there exists a linear ordering m_{i1}, m_{i2}, …, m_{in} of the motifs, yet no alignment respects all n motifs

  28. Lemma 2 • A set of irredundant motifs p1, p2, …, pn is feasible if and only if none of the following holds: • There exist distinct motifs p_i and p_j that are pairwise infeasible • There exists a non-empty subset of the motifs without a linear ordering • There exists a non-empty subset of the motifs that exhibits a domain crossing error

  29. The Graph-theoretic Formulation • Construct a directed graph G = (V, E) where every motif p_i corresponds to a vertex v_i, thus n = |V|. The directed edges are introduced as follows: • There is no edge between two vertices whose motifs do not occur together in any sequence • If p_i is to the left of p_j in every sequence in which both motifs are present, a directed edge is placed from v_i to v_j • The edges are labeled as follows: • Forbidden: if the motifs are not pairwise feasible • Overlap: if the motifs corresponding to the two endpoints overlap • Non-overlap: if the motifs do not overlap

  30. Handling Domain Crossing Mismatches - intro • Introduce the notion of a graph that is consistent w.r.t. a vertex. • First, define a distance on edges: $D_{v_1,v_2}$ = the minimum distance between v1 and v2 over all sequences in which both appear.

  31. Consistent Graph w.r.t. a Vertex • G = (V, E); each edge (u, v) has a label in {forbidden, overlap, non-overlap} and a weight $D_{u,v}$ • Definitions: • Valid path: contains no “forbidden” edge • Overlap-path: all edges in the path are “overlap” • Weight of a path $P$: $D_P = \sum_{e \in P} D_e$

  32. Consistent Graph w.r.t. a Vertex • Let $p \in V$. The graph is consistent w.r.t. $p$ if, for every $q \in V$ and every pair of vertex-disjoint valid paths $P_1$ and $P_2$ from $p$ to $q$: 1. $D_{P_1} = D_{P_2}$, if $P_1$ and $P_2$ are both overlap-paths, or, 2. $D_{P_1} \ge D_{P_2}$, if $P_1$ is an overlap-path and $P_2$ is not.

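  33. An Example • Consider a previous example: sequences AGHDXY, AICDJF, XYCDKF with motifs v1 = A..D, v2 = CD.F, v3 = XY • Edges: v1 → v3 with weight 4 (non-overlap), v3 → v2 with weight 2 (non-overlap), v1 → v2 with weight 2 (overlap) • P1: v1 -> v2, D = 2 • P2: v1 -> v3 -> v2, D = 6 • Not consistent!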
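
The path weights in this example can be recomputed from the motif offsets. The sketch below takes the edge weight to be the minimum start-to-start distance, and it assumes a particular reading of the consistency rule when exactly one path is an overlap-path (the overlap-path weight must be at least the other path's weight); that reading is an assumption chosen because it reproduces the "Not consistent!" verdict. All names are hypothetical.

```python
# Motif offsets per sequence and motif lengths, read off the example.
LOCS = {
    "v1": {1: 0, 2: 0},  # A..D in AGHDXY and AICDJF
    "v2": {2: 2, 3: 2},  # CD.F in AICDJF and XYCDKF
    "v3": {1: 4, 3: 0},  # XY in AGHDXY and XYCDKF
}
LENGTH = {"v1": 4, "v2": 4, "v3": 2}

def edge(u, v):
    """(weight, is_overlap) for edge u -> v, assuming u lies to the left of v."""
    common = LOCS[u].keys() & LOCS[v].keys()
    dists = [LOCS[v][s] - LOCS[u][s] for s in common]
    is_overlap = any(LOCS[v][s] < LOCS[u][s] + LENGTH[u] for s in common)
    return min(dists), is_overlap

def path_weight(path):
    return sum(edge(u, v)[0] for u, v in zip(path, path[1:]))

def is_overlap_path(path):
    return all(edge(u, v)[1] for u, v in zip(path, path[1:]))

def consistent(p1, p2):
    """Assumed rule: equal weights if both are overlap-paths; otherwise the
    overlap-path must be at least as heavy as the other path."""
    o1, o2 = is_overlap_path(p1), is_overlap_path(p2)
    w1, w2 = path_weight(p1), path_weight(p2)
    if o1 and o2:
        return w1 == w2
    if o1:
        return w1 >= w2
    if o2:
        return w2 >= w1
    return True
```

With these definitions, the direct path v1 -> v2 has weight 2 and the detour v1 -> v3 -> v2 has weight 4 + 2 = 6, so the pair is reported as inconsistent.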

  34. Graph Consistency vs. Domain Crossing Mismatches • Use the graph-consistency property to deal with the domain crossing mismatch problem. • If the induced graph is consistent w.r.t. every vertex, then there are no domain crossing mismatches. • Moreover, it may rule out overlap mismatches.

  35. Lemma Rewritten • Given a subset of vertices (motifs) v1…vn, construct the graph as previously defined. The induced subgraph on v1…vn is feasible if the following hold: 1. there is no edge labeled forbidden in the subgraph, 2. the induced subgraph is acyclic, and, 3. the subgraph is consistent w.r.t. every v_i.

  36. Algorithm • Idea: turn an infeasible set into a feasible set

  37. Algorithm • Detect the basic infeasible subsets. • Eliminate motifs to obtain a feasible set that maximizes the cost. • Render the alignment.

  38. Algorithm • Before we construct the basic infeasible sets, we have one more definition: Given a graph G and two cycles C1 and C2 on G, if all vertices defining C1 are also in C2, then C2 is redundant with respect to C1.

  39. Algorithm – Step1 • Compute the following sets: 1. F_i: contains the 2 endpoints of the i-th forbidden edge. 2. C_j: contains the vertices that form a directed non-redundant cycle in the graph. 3. P_k: contains the vertices that form a non-redundant closed path in the graph.

  40. Algorithm – Step1 • The total number of basic infeasible sets obtained in Step1 is bounded. • More precisely, there would be no more than […]. • This is easy to prove, since the number is bounded by the number of cycles or closed paths.

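  41. Algorithm – Step2 • Mapping to the SETCOVER problem. • $\{v_1, v_2, \dots, v_n\} = F_1 \cup \dots \cup F_{n_f} \cup C_1 \cup \dots \cup C_{n_c} \cup P_1 \cup \dots \cup P_{n_p}$ • $U = \{F_1, \dots, F_{n_f}, C_1, \dots, C_{n_c}, P_1, \dots, P_{n_p}\}$ • $S_i = \{F_l \mid v_i \in F_l,\ 1 \le l \le n_f\} \cup \{C_l \mid v_i \in C_l,\ 1 \le l \le n_c\} \cup \{P_l \mid v_i \in P_l,\ 1 \le l \le n_p\}$ • $\mathrm{cost}_i$ = (number of solid characters in motif $i$) × (number of sequences containing motif $i$) • Example: AB..CD has 4 solid characters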
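
The mapping can be sketched as follows: each vertex v_i becomes a subset S_i holding the names of the basic infeasible sets that contain it, and its cost is the product given above. The function names, the dictionary layout and the sample sets in the usage note are hypothetical.

```python
def solid_count(motif):
    """Number of solid (non don't-care) characters in a motif string."""
    return sum(1 for ch in motif if ch != ".")

def motif_cost(motif, n_sequences):
    """cost_i = solid characters in the motif * sequences containing it."""
    return solid_count(motif) * n_sequences

def build_set_cover(infeasible_sets):
    """infeasible_sets: {set name: set of vertices}.

    Returns {vertex: names of the basic infeasible sets containing it},
    i.e. the subsets S_i of the set cover instance.
    """
    cover = {}
    for name, vertices in infeasible_sets.items():
        for v in vertices:
            cover.setdefault(v, set()).add(name)
    return cover
```

For instance, `solid_count("AB..CD")` gives the 4 solid characters mentioned on the slide, and a made-up input such as `{"F1": {1, 2}, "C1": {2, 3}}` puts vertex 2 in both F1 and C1.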

  42. Algorithm – Step3 • Find the feasible subgraph containing only overlap edges. • For each sequence, find an ordering of the motifs in that subgraph, say p1, p2, …, pj. • Align the sequences.

  43. Example • A set of 3 sequences: • GFPCQFSAG • GFPCQFSGG • GPCQSAGK • K=2

  44. Example • Irredundant motifs (found by Teiresias) • PCQ in Seq. 1, 2, 3 • GFPCQFS.G in Seq. 1, 2 • PCQ..G in Seq. 2, 3 • SAG in Seq. 1, 3

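  45. Example • Sequences to graph • Vertices: 1 = PCQ, 2 = GFPCQFS.G, 3 = PCQ..G, 4 = SAG • Edges (weight, label): 2 → 1 (2, O.V.), 1 → 3 (0, O.V.), 2 → 3 (2, O.V.), 1 → 4 (3, non.), 3 → 4 (3, O.V.), 2 → 4 (6, O.V.) • O.V. = overlap, non. = non-overlap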

  46. Example • Basic infeasible sets • No F • No C • Closed path sets P: P1 = {2, 1, 4}, P2 = {1, 2, 3}, P3 = {1, 3, 4}, P4 = {2, 3, 4}

  47. Example • Make a feasible set by solving the set cover (S.C.) problem • G = {P1, P2, P3, P4} • S1 = {P1, P2, P3}, w1 = 3 • S2 = {P1, P2, P4}, w2 = 9 • S3 = {P2, P3, P4}, w3 = 6 • S4 = {P1, P3, P4}, w4 = 3 • Solution = S1, S4
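
The slides do not say which set cover procedure is used. As an illustration only, the standard greedy heuristic for weighted set cover (repeatedly pick the subset with the lowest weight per newly covered element) can be run on this instance; the function name is hypothetical.

```python
def greedy_set_cover(universe, subsets, weights):
    """Greedy weighted set cover: a heuristic, not necessarily the slides' method.

    subsets: {name: set of elements}, weights: {name: positive number}.
    Raises ValueError if the subsets cannot cover the universe.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (name for name in subsets if subsets[name] & uncovered),
            key=lambda name: weights[name] / len(subsets[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

UNIVERSE = {"P1", "P2", "P3", "P4"}
SUBSETS = {
    "S1": {"P1", "P2", "P3"},
    "S2": {"P1", "P2", "P4"},
    "S3": {"P2", "P3", "P4"},
    "S4": {"P1", "P3", "P4"},
}
WEIGHTS = {"S1": 3, "S2": 9, "S3": 6, "S4": 3}
SOLUTION = greedy_set_cover(UNIVERSE, SUBSETS, WEIGHTS)
```

On this input the heuristic selects S1 and then S4, for a total weight of 6, which agrees with the solution stated on the slide.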

  48. Example • Get aligned blocks from the overlap graph • Remaining motifs: vertex 2 (GFPCQFS.G) and vertex 3 (PCQ..G), joined by an overlap edge of weight 2

  49. Example • Result:
      G F P C Q F S a G
      G F P C Q F S G G
      - g P C Q s a G K
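
The rendering step for this small example can be sketched by padding each sequence so that one shared anchor lands in the same column. This is a deliberate simplification that only adds leading and trailing gaps and handles a single aligned block; the function name and the choice of "PCQ" as the anchor are assumptions, and the lowercase marking of characters in the slide's result is not reproduced.

```python
def render(sequences, anchor_offsets, gap="-"):
    """Pad sequences so that each anchor offset lands in a common column."""
    col = max(anchor_offsets)
    padded = [gap * (col - off) + seq for seq, off in zip(sequences, anchor_offsets)]
    width = max(len(row) for row in padded)
    return [row.ljust(width, gap) for row in padded]

SEQS = ["GFPCQFSAG", "GFPCQFSGG", "GPCQSAGK"]
# Anchor on the start of "PCQ", which lies inside both remaining motifs.
ROWS = render(SEQS, [s.find("PCQ") for s in SEQS])
```

The three rows come out as GFPCQFSAG, GFPCQFSGG and -GPCQSAGK, the same column layout as the result above.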

  50. Demonstration • IBM Bioinformatics Group http://cbcsrv.watson.ibm.com/Tmsa.html
