
Sequence Alignment

Kun-Mao Chao (趙坤茂), Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. WWW: http://www.csie.ntu.edu.tw/~kmchao


Presentation Transcript


  1. Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao

  2. GenBank 200.0

  3. GenBank 215.0

  4. GenBank 233.0

  5. orz’s sequence evolution • the origin? • their evolutionary relationships? • their putative functional relationships? • orz (kid) • OTZ (adult) • Orz (big head) • Crz (motorcycle driver) • on_ (soldier) • or2 (bottom up) • oΩ (back high) • STO (the other way around) • Oroz (me)

  6. What? THETR UTHIS MOREI MPORT ANTTH ANTHE FACTS The truth is more important than the facts.

  7. Dot Matrix

  8. On July 20, 1969, Armstrong and Apollo 11 Lunar Module (LM) pilot Buzz Aldrin became the first people to land on the Moon.

  9. Pairwise Alignment
     Sequence A: CTTAACT
     Sequence B: CGGATCAT
     An alignment of A and B:
       C---TTAACT   (Sequence A)
       CGGATCA--T   (Sequence B)

  10. Pairwise Alignment
      Sequence A: CTTAACT
      Sequence B: CGGATCAT
      An alignment of A and B:
        C---TTAACT
        CGGATCA--T
      Column types labeled in the figure: match, mismatch, insertion gap, deletion gap.

  11. Alignment Graph
      Sequence A: CTTAACT
      Sequence B: CGGATCAT
      [Figure: alignment graph with C G G A T C A T (Sequence B) along one axis and CTTAACT (Sequence A) along the other]
        C---TTAACT
        CGGATCA--T

  12. A simple scoring scheme
      • Match: +8 (w(x, y) = 8, if x = y)
      • Mismatch: -5 (w(x, y) = -5, if x ≠ y)
      • Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
         C  -  -  -  T  T  A  A  C  T
         C  G  G  A  T  C  A  -  -  T
        +8 -3 -3 -3 +8 -5 +8 -3 -3 +8  = +12 (alignment score)
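
The scoring scheme above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `score_alignment` and the convention of writing a gap as `-` inside a string are assumptions made for illustration.

```python
def score_alignment(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gap symbols")
        if x == "-" or y == "-":
            total += gap        # w(-, x) = w(x, -) = -3
        elif x == y:
            total += match      # w(x, y) = 8 if x = y
        else:
            total += mismatch   # w(x, y) = -5 if x != y
    return total


# The alignment from the slide: C---TTAACT over CGGATCA--T
print(score_alignment("C---TTAACT", "CGGATCA--T"))  # 12
```

The default weights are the ones on the slide; passing other values for `match`, `mismatch`, and `gap` scores the same alignment under a different scheme.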

  13. An optimal alignment: the alignment of maximum score
      • Let A = a1a2…am and B = b1b2…bn.
      • S_{i,j}: the score of an optimal alignment between a1a2…ai and b1b2…bj.
      • With proper initializations, S_{i,j} can be computed as follows:
        S_{i,j} = max { S_{i-1,j} + w(ai, -),  S_{i,j-1} + w(-, bj),  S_{i-1,j-1} + w(ai, bj) }

  14. Computing S_{i,j}
      [Figure: the three ways to reach entry (i, j), with weights w(ai, bj), w(ai, -), and w(-, bj); the last entry is S_{m,n}]
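
The table computation can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the presenter's code: the name `global_align`, the boundary initialization (row 0 and column 0 filled with multiples of the gap weight), and the tie-breaking order in the traceback are choices made here. When several optimal alignments exist, the traceback returns just one of them.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the score table S and recover one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,   # align a_i with b_j
                          S[i - 1][j] + gap,     # align a_i with a gap
                          S[i][j - 1] + gap)     # align b_j with a gap
    # Traceback from S[m][n] to S[0][0].
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + w:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(S[3][5], S[7][8])  # 5 14
print(row_a)
print(row_b)
```

The table takes O(mn) time and space; the traceback adds O(m + n) steps.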

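  15. Match: 8, Mismatch: -5, Gap symbol: -3. Initializations. [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  16. Match: 8, Mismatch: -5, Gap symbol: -3. S_{3,5} = ? [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  17. Match: 8, Mismatch: -5, Gap symbol: -3. S_{3,5} = ? [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  18. Match: 8, Mismatch: -5, Gap symbol: -3. S_{3,5} = 5. [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A), with the optimal score marked]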

  19.  C  T  T  A  A  C  -  T
       C  G  G  A  T  C  A  T
      +8 -5 -5 +8 -5 +8 -3 +8  = 14
      [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  20. Now try this example in class.
      Sequence A: CAATTGA
      Sequence B: GAATCTGC
      Their optimal alignment?

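  21. Match: 8, Mismatch: -5, Gap symbol: -3. Initializations. [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  22. Match: 8, Mismatch: -5, Gap symbol: -3. S_{4,2} = ? [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  23. Match: 8, Mismatch: -5, Gap symbol: -3. S_{4,2} = ? [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  24. Match: 8, Mismatch: -5, Gap symbol: -3. S_{5,5} = ? [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  25. Match: 8, Mismatch: -5, Gap symbol: -3. S_{5,5} = ? [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  26. Match: 8, Mismatch: -5, Gap symbol: -3. S_{5,5} = 14. [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A), with the optimal score marked]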

  27.  C  A  A  T  -  T  G  A
       G  A  A  T  C  T  G  C
      -5 +8 +8 +8 -3 +8 +8 -5  = 27
      [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  28. Longest Common Subsequence (LCS) A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

  29. Alignment vs. LCS
      Sequence A: CAATTGA
      Sequence B: GAATCTGC
      Compute their optimal alignment under the following scoring scheme: Match: 1, Mismatch: 0, Gap symbol: 0.

  30. Alignment score = LCS length
      Match: 1, Mismatch: 0, Gap symbol: 0. [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A), with the optimal score marked]

  31.  C  A  A  T  -  T  G  A
       G  A  A  T  C  T  G  C
       0 +1 +1 +1 +0 +1 +1 +0  = 5
      LCS: AATTG
      [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A), with the optimal score marked]
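
The correspondence between alignment score and LCS length can be tried out with a short Python sketch. It is an illustration only; the name `lcs_length` is an assumption, and the code is simply the alignment recurrence with the weights Match: 1, Mismatch: 0, Gap symbol: 0, kept to two rows of the table because only the final score is needed.

```python
def lcs_length(a, b):
    """LCS length as an optimal alignment score with match=1, mismatch=0, gap=0."""
    prev = [0] * (len(b) + 1)          # row 0: all-gap prefixes score 0
    for x in a:
        cur = [0]                      # column 0: all-gap prefix scores 0
        for j, y in enumerate(b, start=1):
            w = 1 if x == y else 0     # match: 1, mismatch: 0
            cur.append(max(prev[j - 1] + w,  # align x with y
                           prev[j],          # gap in B (weight 0)
                           cur[j - 1]))      # gap in A (weight 0)
        prev = cur
    return prev[-1]


print(lcs_length("CAATTGA", "GAATCTGC"))  # 5
print(lcs_length("president", "pred"))    # 4
```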

  32. Edit distance
      • The edit distance (Levenshtein distance) between Sequence A and Sequence B is the minimum number of operations (deletion, insertion, or substitution) required to transform Sequence A into Sequence B.
        C A A T - T G A
        G A A T C T G C
        edit distance = 3

  33. Alignment vs. Edit distance
      Sequence A: CAATTGA
      Sequence B: GAATCTGC
      Alignment score: maximized. Edit distance: minimized.
      Compute their optimal alignment under the following scoring scheme: Match: 0, Mismatch: -1, Gap symbol: -1.

  34. |Optimal alignment score| = Edit distance
      Match: 0, Mismatch: -1, Gap symbol: -1. [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A)]

  35. |Optimal alignment score| = Edit distance
      Match: 0, Mismatch: -1, Gap symbol: -1. [Table: G A A T C T G C (Sequence B) against CAATTGA (Sequence A), with the optimal score marked]
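
A small Python sketch of this relationship follows. The helper names `optimal_alignment_score` and `edit_distance` are assumptions for illustration; the sketch computes the optimal global alignment score under Match: 0, Mismatch: -1, Gap symbol: -1 and negates it.

```python
def optimal_alignment_score(a, b, match, mismatch, gap):
    """Optimal global alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: j gap symbols
    for i in range(1, len(a) + 1):
        cur = [i * gap]                           # column 0: i gap symbols
        for j in range(1, len(b) + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + w,       # match or substitution
                           prev[j] + gap,         # deletion
                           cur[j - 1] + gap))     # insertion
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Edit distance as the absolute value of the optimal score under 0/-1/-1."""
    return -optimal_alignment_score(a, b, match=0, mismatch=-1, gap=-1)


print(edit_distance("CAATTGA", "GAATCTGC"))  # 3
```

Maximizing a score in which every mismatch and gap symbol costs 1 is the same as minimizing the number of edit operations, which is why the sign flip suffices.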

  36. Global Alignment vs. Local Alignment • global alignment: • local alignment:

  37. Maximum-sum interval
      • Given a sequence of real numbers a1a2…an, find a consecutive subsequence with the maximum sum.
        9 -3 1 7 -15 2 3 -4 2 -7 6 -2 8 4 -9
      For each position, we can compute the maximum-sum interval ending at that position in O(n) time. Therefore, a naive algorithm runs in O(n^2) time.

  38. Computing a segment sum in O(1) time?
      • Input: a sequence of real numbers a1a2…an
      • Query: the sum ai + ai+1 + … + aj

  39. Computing a segment sum in O(1) time
      • prefix-sum(i) = a1 + a2 + … + ai
      • All n prefix sums are computable in O(n) time.
      • sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
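
As a sketch of the prefix-sum idea in Python (the function names `build_prefix_sums` and `segment_sum` are assumptions for illustration, and positions are 1-based as on the slides):

```python
from itertools import accumulate


def build_prefix_sums(values):
    """Return [prefix-sum(0), prefix-sum(1), ..., prefix-sum(n)] in O(n) time."""
    return [0] + list(accumulate(values))


def segment_sum(prefix, i, j):
    """sum(i, j) = prefix-sum(j) - prefix-sum(i-1), for 1 <= i <= j <= n."""
    return prefix[j] - prefix[i - 1]


values = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
prefix = build_prefix_sums(values)
print(segment_sum(prefix, 11, 14))  # 16, the sum of 6 -2 8 4
```

Storing prefix-sum(0) = 0 at index 0 lets the same formula handle intervals that start at position 1.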

  40. Maximizing sum(i, j): O(n)-time Method 1
      • sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
      • For each location j, prefix-sum(j) is fixed. The maximum-sum interval ending at position j can therefore be found by finding the minimum prefix-sum before position j.

  41. Maximum-sum interval
      prefix-sum(j) = a1 + a2 + … + aj
      prefix-min(j): the minimum prefix-sum before position j
      max_sum(j) = prefix-sum(j) - prefix-min(j)

      j                0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
      Sequence             9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
      prefix-min(j)    0   0   0   0   0   0  -1  -1  -1  -1  -1  -5  -5  -5  -5  -5
      prefix-sum(j)    0   9   6   7  14  -1   1   4   0   2  -5   1  -1   7  11   2
      max_sum(j)       0   9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7

      The maximum sum: 16
      The maximum-sum interval: 6 -2 8 4
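
Method 1 can be sketched in Python as a single pass that tracks the running prefix sum and the smallest prefix sum seen before the current position. The function name `max_sum_interval_prefix_min` and the choice to report 1-based positions and to require a non-empty interval are assumptions for illustration.

```python
def max_sum_interval_prefix_min(values):
    """Return (best_sum, start, end) with 1-based inclusive positions."""
    if not values:
        raise ValueError("values must be non-empty")
    prefix = 0          # prefix-sum(j)
    min_prefix = 0      # prefix-min(j): smallest prefix-sum before position j
    min_index = 0       # position at which that minimum was reached
    best = None
    for j, v in enumerate(values, start=1):
        prefix += v
        candidate = prefix - min_prefix   # max_sum(j)
        if best is None or candidate > best[0]:
            best = (candidate, min_index + 1, j)
        if prefix < min_prefix:           # update the minimum for later positions
            min_prefix, min_index = prefix, j
    return best


values = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
print(max_sum_interval_prefix_min(values))  # (16, 11, 14)
```

Positions 11 through 14 hold 6 -2 8 4, which agrees with the interval on the slide.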

  42. Maximum-sum interval (The recurrence relation): O(n)-time Method 2
      • Define S(i) to be the maximum sum of the intervals ending at position i.
        S(i) = ai + max { S(i-1), 0 }
      If S(i-1) < 0, concatenating ai with its previous interval gives a smaller sum than ai itself.

  43. Maximum-sum interval (Tabular computation)
      Sequence   9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
      S(i)       9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
      The maximum sum: 16

  44. Maximum-sum interval (Traceback)
      Sequence   9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
      S(i)       9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
      The maximum-sum interval: 6 -2 8 4
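
Method 2 and its traceback can be sketched in Python as below. This is an illustrative sketch: the function name `max_sum_interval` is an assumption, and instead of walking back through the table afterwards it remembers where the current interval started, which yields the same interval.

```python
def max_sum_interval(values):
    """Return (table of S(i), maximum sum, maximum-sum interval)."""
    if not values:
        raise ValueError("values must be non-empty")
    s_table = [values[0]]                 # S(1) = a1
    best_sum = values[0]
    best_start = best_end = cur_start = 0
    for i in range(1, len(values)):
        if s_table[-1] < 0:
            # S(i-1) < 0: starting fresh at a_i beats extending the old interval
            s_table.append(values[i])
            cur_start = i
        else:
            s_table.append(s_table[-1] + values[i])
        if s_table[-1] > best_sum:
            best_sum, best_start, best_end = s_table[-1], cur_start, i
    return s_table, best_sum, values[best_start:best_end + 1]


values = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
table, best, interval = max_sum_interval(values)
print(table)     # [9, 6, 7, 14, -1, 2, 5, 1, 3, -4, 6, 4, 12, 16, 7]
print(best)      # 16
print(interval)  # [6, -2, 8, 4]
```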

  45. An optimal local alignment
      • S_{i,j}: the score of an optimal local alignment ending at (i, j) between a1a2…ai and b1b2…bj.
      • With proper initializations, S_{i,j} can be computed as follows:
        S_{i,j} = max { 0,  S_{i-1,j} + w(ai, -),  S_{i,j-1} + w(-, bj),  S_{i-1,j-1} + w(ai, bj) }

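  46. Match: 8, Mismatch: -5, Gap symbol: -3. Local alignment. [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  47. Match: 8, Mismatch: -5, Gap symbol: -3. Local alignment. [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A)]

  48. Match: 8, Mismatch: -5, Gap symbol: -3. Local alignment. [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A), with the best score marked]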

  49.  A  -  C  -  T
       A  T  C  A  T
      +8 -3 +8 -3 +8  = 18
      [Table: C G G A T C A T (Sequence B) against CTTAACT (Sequence A), with the best score marked]
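
Local alignment with a simple per-symbol gap weight can be sketched in Python as follows; this style of computation is commonly known as the Smith-Waterman algorithm. The function name `local_align`, the zero-filled boundary, and the tie-breaking order in the traceback are assumptions made for illustration, so when several best local alignments exist the one returned may differ from the slide while having the same score.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned row of a, aligned row of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # boundary entries stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                     # start a new local alignment
                          S[i - 1][j - 1] + w,   # align a_i with b_j
                          S[i - 1][j] + gap,     # align a_i with a gap
                          S[i][j - 1] + gap)     # align b_j with a gap
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback from the best entry until a 0 entry is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, row_a, row_b = local_align("CTTAACT", "CGGATCAT")
print(best)  # 18
print(row_a)
print(row_b)
```

The only changes from the global computation are the extra 0 in the maximum, the zero boundary, and starting the traceback at the largest entry rather than at the last one.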
