
Multiple Sequence Composition Alignment

Multiple Sequence Composition Alignment. Name: Yip Chi Kin Date: 21-12-2006. Studied Papers. [B03] Composition Alignment. [S98] Divide-and-conquer Alignment. [M99] DIALIGN Algorithm. [SMS03] DCA + Segment-based. Main Aspects. ․Dynamic Programming


Presentation Transcript


  1. Multiple Sequence Composition Alignment. Name: Yip Chi Kin. Date: 21-12-2006.

  2. Studied Papers: [B03] Composition Alignment; [S98] Divide-and-conquer Alignment; [M99] DIALIGN Algorithm; [SMS03] DCA + Segment-based.

  3. Main Aspects: ․Dynamic Programming ․Composition Alignment ․Meta-code MSA ․Simultaneous MSA. Figure labels: Pairwise Library (Global & Local); Consistency & Ungapped; Divide-and-conquer; Segment-based (Optimal scores).

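  4. Dynamic Programming: Edit Graph, Dot Matrix and DP Matrix. Edit-graph moves are matches, scored s(ai,bi), and deletions and insertions, each scored -d. [Figure: edit graph, dot matrix and DP matrix; sequence labels on the slide: CTG, CTA, CTGA.]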

  5. Global Alignment: Needleman-Wunsch Algorithm. Scoring matrix for GCATC (rows) against CTTCT (columns):

            -    C    T    T    C    T
       -    0   -2   -4   -6   -8  -10
       G   -2   -1   -3   -5   -7   -9
       C   -4   -1   -2   -4   -4   -6
       A   -6   -3   -2   -3   -5   -5
       T   -8   -5   -2   -1   -3   -4
       C  -10   -7   -4   -3    0   -2

     GA Results:

       G C A T C -
       - C T T C T
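The matrix on slide 5 can be reproduced with a short Needleman-Wunsch sketch. The scoring values used here (match +1, mismatch -1, gap -2) are not stated on the slide; they are an assumption inferred from the matrix entries, and the function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a (rows) against b (columns).

    Scoring defaults are assumptions inferred from the slide's matrix.
    Returns (score matrix, aligned a, aligned b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: gaps only
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: gaps only
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match / mismatch
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Traceback from the bottom-right corner, preferring the diagonal.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F, "".join(reversed(out_a)), "".join(reversed(out_b))


F, top, bottom = needleman_wunsch("GCATC", "CTTCT")
print(F[-1][-1])   # optimal global score
print(top)
print(bottom)
```

Under these assumed scores the sketch yields the same final score and the same alignment as the slide; other tie-breaking rules in the traceback could return a different alignment with an equal score.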

  6. Local Alignment: Smith-Waterman Algorithm. Scoring matrix for GAACGGT (rows) against TTTACAGGCAG (columns):

          -  T  T  T  A  C  A  G  G  C  A  G
       -  0  0  0  0  0  0  0  0  0  0  0  0
       G  0  0  0  0  0  0  0  2  2  0  0  2
       A  0  0  0  0  2  0  2  0  1  1  2  0
       A  0  0  0  0  2  1  2  1  0  0  3  1
       C  0  0  0  0  0  4  2  1  0  2  1  2
       G  0  0  0  0  0  2  3  4  3  2  1  3
       G  0  0  0  0  0  0  1  5  6  4  2  3
       T  0  2  2  2  0  0  0  3  4  5  3  1

     GA Results:

       - G A A C - G G T - -
       T T T A C A G G C A G
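A minimal Smith-Waterman sketch for the sequences on slide 6 follows. The scores (match +2, mismatch -1, gap -2) are an assumption inferred from the slide's matrix, not stated on it; with these values the sketch gives the same maximum of 6 but differs from the slide in one cell (second G row, last C column: 1 rather than the 2 shown). The function name is illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a (rows) against b (columns).

    Scoring defaults are assumptions inferred from the slide's matrix.
    Returns (best score, aligned part of a, aligned part of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, local_a, local_b = smith_waterman("GAACGGT", "TTTACAGGCAG")
print(score)
print(local_a)
print(local_b)
```

The traceback returns only the best-scoring local region, whereas the slide's result rows show both full sequences padded with gaps.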

  7. MSA Methods ․Consistency-based ․Exact method ․Progressive method ․Iterative method ․Stochastic method ․Hidden Markov method

  8. MSA Concepts: Consistency-based method. PSAs; trace formulation; latter formulation. [Figure: the sequences CGTCTC, TGTCC and CGATATT, shown pairwise.]

  9. MSA Results: results of the MSA, with aligned regions marked as Realized, Unrealized and Consistent. [Figure: gapped alignments of the sequences from slide 8.]

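  10. Divide-and-conquer: each sequence S1, S2, S3 is divided at a cut position C1, C2, C3 into a prefix and a suffix. The pieces are divided again, the smallest pieces are aligned optimally, and the resulting alignments are concatenated.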

  11. Prefix and Suffix. Sequence: GTTCATGCCAGGTGTAAATC. DP distance matrices for CTATAC (columns) against GTATC (rows).

     Wopt (prefix):

            -   C   T   A   T   A   C
       -    0   2   4   6   8  10  12
       G    2   1   3   5   7   9  11
       T    4   3   1   3   5   7   9
       A    6   5   3   1   3   5   7
       T    8   7   5   3   1   3   5
       C   10   8   7   5   3   2   3

     Wopt (suffix):

            C   T   A   T   A   C   -
       G    3   4   3   4   6   8  10
       T    4   2   3   2   4   6   8
       A    6   4   2   2   2   4   6
       T    8   6   4   2   1   2   4
       C   10   8   6   4   2   0   2
       -   12  10   8   6   4   2   0

     Additional cost: C_S1,S2[C1,C2] = Wopt (prefix) + Wopt (suffix) - Wopt (total)

  12. Additional-cost. Additional-cost matrix for S1 = CTATAC (columns) and S2 = GTATC (rows):

            -   C   T   A   T   A   C
       -    0   3   4   7  11  15  19
       G    3   0   3   4   8  12  16
       T    7   4   0   2   4   8  12
       A   11   8   4   0   1   4   8
       T   15  12   8   4   0   0   4
       C   19  15  12   8   4   1   0

     Cost of Diagonal: C_S1,S2[1,1] = 0, C_S1,S2[2,2] = 0, C_S1,S2[3,3] = 0, C_S1,S2[4,4] = 0, C_S1,S2[5,4] = 0, C_S1,S2[6,5] = 0.

     C_S1,S2[2,2] = Wopt[CT,GT] + Wopt[ATAC,ATC] - Wopt[CTATAC,GTATC] = 1 + 2 - 3 = 0
     C_S1,S2[4,3] = Wopt[CTAT,GTA] + Wopt[AC,TC] - Wopt[CTATAC,GTATC] = 3 + 1 - 3 = 1
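The additional-cost values on slides 11 and 12 can be checked with the sketch below. It assumes a distance scoring of match 0, mismatch 1 and gap 2, which is inferred from the matrix values rather than stated on the slides; the function names are illustrative.

```python
def prefix_distances(s1, s2, mismatch=1, gap=2):
    """P[i][j] = optimal alignment distance between s1[:i] and s2[:j].

    Costs (match 0, mismatch 1, gap 2) are assumptions inferred from the slides.
    """
    n, m = len(s1), len(s2)
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        P[i][0] = i * gap
    for j in range(1, m + 1):
        P[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch
            P[i][j] = min(P[i - 1][j - 1] + sub,
                          P[i - 1][j] + gap,
                          P[i][j - 1] + gap)
    return P


def additional_cost(s1, s2):
    """C[c1][c2] = Wopt(prefixes) + Wopt(suffixes) - Wopt(total)."""
    n, m = len(s1), len(s2)
    P = prefix_distances(s1, s2)
    # Suffix distances are prefix distances of the reversed sequences.
    R = prefix_distances(s1[::-1], s2[::-1])
    total = P[n][m]
    return [[P[c1][c2] + R[n - c1][m - c2] - total for c2 in range(m + 1)]
            for c1 in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])          # cut positions used on slide 12
zero_cuts = [(c1, c2) for c1 in range(7) for c2 in range(6) if C[c1][c2] == 0]
print(zero_cuts)                 # cuts that lie on an optimal alignment path
```

A cut pair with additional cost 0 can be used to divide the two sequences without losing optimality of the concatenated alignment, which is the property the divide-and-conquer step relies on.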

  13. Space & Time: instead of full sequence searching, a 'chain' of boxes along the diagonal is used to reduce searching time.

  14. DIALIGN: Consistent diagonals; Non-Consistent (Simultaneous); Non-Consistent (Cross over); GA Results. [Figure: diagonals among the sequences YIAVLFAED, LACVIFGS and PWDDVTFDAE, and the resulting alignment.]

  15. Weighting. Sequences: YIAVLFAYDD, LACVIFGS, SWDDVMFYAE.

     Diagonal weights: w(D) = -log P(l_D, S_D), where l_D is the length of diagonal D and S_D is the sum of the similarity values on that diagonal.

     Overlap weighting: w(D1) = 1.9, w(D2) = 1.7, w(D3) = 1.5, w(D4) = 2.6, w(D5) = 0.2.

     Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7
     Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
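The comparison on slide 15 amounts to summing the weights of each candidate set of diagonals and keeping the heavier set. A small sketch, using only the weights and the two candidate sets listed on the slide (the names `weights`, `candidates` and `total_weight` are illustrative, and no consistency check is performed here):

```python
# Diagonal weights as listed on the slide.
weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# The two candidate sets of diagonals compared on the slide.
candidates = [("D1", "D4", "D5"), ("D1", "D2", "D3", "D5")]


def total_weight(diagonals, weights):
    """Score of a set of diagonals: the sum of their weights."""
    return sum(weights[d] for d in diagonals)


best = max(candidates, key=lambda c: total_weight(c, weights))
for c in candidates:
    print(c, round(total_weight(c, weights), 1))
print("best:", best)
```

Which sets of diagonals are mutually consistent is taken as given from the slide; deciding that is a separate step.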

  16. Consistency check: overlap weights; fragments checking; transitivity frontier [1,9]. [Figure: fragments f1, f2 and f3 on sequences S1 (positions 1-18), S2 (positions 1-16) and S3 (positions 1-18).]

  17. Greedy Approach (Greedy Strategy): tandem duplications between S1 and S2; consistency conflicts among S1, S2 and S3. [Figure: motifs M1 (1), M1 (2), M2 (1), M2 (2), M2 and M3 placed on the sequences.]

  18. Composition Alignment: composition matches; single character match; CM of Prefix Length; matching prefix length. [Figure: Sequence #1 and Sequence #2 with their matching prefix lengths.]

  19. Composition Matching: match length and prefix length for the sequences 111010001 and 001101110. [Figure: match lengths along the prefix, with entries marked "Replaced by 7" and "Replaced by 2".]
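As a rough illustration of composition matching on the two sequences from slide 19, the sketch below treats two equal-length strings as a composition match when they contain the same characters with the same counts, and lists the prefix lengths at which the two prefixes match in that sense. This reading of "composition match" is an assumption; the slides do not spell out the definition, and the function names are illustrative.

```python
from collections import Counter


def is_composition_match(u, v):
    """True if u and v have equal length and the same character counts."""
    return len(u) == len(v) and Counter(u) == Counter(v)


def matching_prefix_lengths(s1, s2):
    """Prefix lengths k >= 1 at which s1[:k] and s2[:k] have equal composition."""
    diff = Counter()          # running count difference per character
    lengths = []
    for k, (x, y) in enumerate(zip(s1, s2), start=1):
        diff[x] += 1
        diff[y] -= 1
        if all(v == 0 for v in diff.values()):
            lengths.append(k)
    return lengths


s1, s2 = "111010001", "001101110"
print(is_composition_match(s1, s2))
print(matching_prefix_lengths(s1, s2))
```

Keeping a running difference of character counts avoids recounting each prefix from scratch, so all prefix lengths are checked in a single pass.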

  20. Composition Matching: CM of Prefix Length (Total = 9) for Sequence #1 and Sequence #2. [Figure: four panels of binary segment pairs, labelled CM = 2, CM = -1, CM = 1 and CM = 0.]

  21. Meta-Code: code about code. Components: Input code; Original Code; Code for Testing; Code Reservoir; Match code; Mismatch code; Meta-Code; Control Rule.

  22. Reservoir Codes: codes from S1 (e.g. 'A', 'G') are stored in Reservoir S1, and codes from S2 (e.g. 'T', 'C') are stored in Reservoir S2. If a code is found in both Reservoir S1 and Reservoir S2, these two codes are deleted. The reservoir code joins the two reservoirs around 'R': codes 'AG' and 'CT' give the reservoir code AGRCT.

  23. Meta-Code Rule (looping for creating meta-code): if the reservoir code = r, then stop the looping. If the CM length is valid, set reservoir code = r and position = p (values of r and p). Copy the codes from S1 and S2, set p = p - 1, and output the meta-code (e.g. AMT).

  24. CM (Lengths & Codes): composition matching of S1 and S2 in prefix length, with the reservoir codes in S1 and in S2.

     S1: TAACAGAGATACAGGAGTACGGGAACGAT
     S2: TTCTTTTGTTCCTCCCCCGACCTTCTC

     Length:     0   1    1    2      1    1    2      2      1    1    2      2
     Meta Code:  R   ART  ART  AGRTT  ART  ART  AGRCT  AGRCC  ARC  GRC  GARTC  GARTC

  25. CM of Metacode: composition matching over prefix length, with invalid lengths marked. [Figure: meta-codes ART, ARC, AGRTT, AGRCT and GARTC placed against prefix length.]

  26. Composition MSA: composition matching and meta-code MSA of S1, S2 and New S2. [Figure: segmented sequences with meta-codes such as TMG, GMT, CMT, GMC, AMG and TMA.]

  27. Fixed Segment: ․Semi-global alignment ․Least overlap problem ․Simple segmentation ․Composition alignment ․Weekly behaviour. Segment Length LS = 5.

     Code catalogue: A = Currency / Cards; B = Stock / Structured P.; C = Unit Trusts / Bonds; D = Insurance / Finance; E = Mortgages / Loans; …

     Time Granularities (Week #1, Week #2, …): 1t1 1t2 1t3 1t4 1t5 2t1 2t2 2t3 2t4 2t5 3t1 3t2 …

     Branch bank #1: A C B C E B A A E B C A …
     Branch bank #2: B E B A A A C E D B E A …
     Branch bank #3: A C A A B C E E D B E E …

  28. Fixed-Segment Composition MSA: composition alignment and PSA of the meta-codes for Branch bank #1, Branch bank #2 and Branch bank #3, followed by Family Classifications into Family Groups. [Figure: meta-code segments of the three branch banks and their family groups.]

  29. Further Problems (Meta-Code Composition MSA): ․Fixed-segment length ․Prior sequence choice ․Speed-up PSAs ․Nos. of Segments/Codes

  30. Conclusions: ․Fixed-segment Composition (Least Overlap Problems) ․Meta-code Approach (Easier Transform Applications) ․Widespread use of MSA (Simultaneous Multiple Sequences)
