
Pairwise Sequence Alignments


annora

Presentation Transcript


  1. Pairwise Sequence Alignments

  2. Topics to be Covered • Comparison methods • Global alignment • Local alignment

  3. Introduction to Alignment • Analyze the similarities and differences at the individual base level or amino acid level • Aim is to infer structural, functional and evolutionary relationships among sequences

  4. Sequence Alignment
      982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGACCTCTAC 1022
          | ||||||||  |  |  | ||| ||||||||||     |||||
      961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCA---TCTAC 1001
     • Two sequences are written out, one on top of the other
     • Identical or similar characters are placed in the same column
     • Nonidentical characters are either placed in the same column as a mismatch or opposite a gap in the other sequence
     • The overall quality of the alignment is then evaluated with a formula that counts the number of identical (or similar) pairs minus the number of mismatches and gaps

  5. Pairwise Sequence Alignments
     • Why compare? Similarity search is necessary for:
     • Family assignment
     • Sequence annotation
     • Construction of phylogenetic trees
     • Learning about evolutionary relationships
     • Classifying sequences
     • Identifying functions
     • Homology modeling

  6. Essential Elements of an Alignment Algorithm • Defining the problem (Global, local alignment) • Scoring scheme (Gap penalties) • Distance Matrix (PAM, BLOSUM series)

  7. Global and Local Alignments
     • Global: an attempt is made to align the entire sequences, using as many characters as possible, up to both ends of each sequence
     • Local: stretches of sequence with the highest density of matches are aligned
     Global alignment:
         LGPSSKQTGKGS-SRIWDN
         |    |  |||   |  |
         LN-ITKSAGKGAIMRLGDA
     Local alignment:
         ------TGKG------
                |||
         ------AGKG------

  8. Local vs. Global Alignment (cont'd)
     • Global alignment (match markers omitted, their column positions are not recoverable):
         --T---CC-C-AGT---TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
         AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
     • Local alignment (better for finding a conserved segment):
                           TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC
                              ||||||||||||
         AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC

  9. Global and Local Alignments
     • Global: used when two sequences are of approximately equal length; the goal is to obtain the maximum score by aligning them completely
     • Local: used when one sequence is a substring of the other, or when the goal is to find the maximum local score (for example, protein motif searches in a database)

  10. Dynamic programming algorithm • Dynamic programming: build up the optimal alignment from previously computed optimal alignments of subsequences

  11. Aligning Sequences without Insertions and Deletions: Hamming Distance
      Given two DNA sequences v and w:
          v: ATATATAT
          w: TATATATA
      • The Hamming distance dH(v, w) = 8 is large, but the sequences are very similar

  12. Aligning Sequences with Insertions and Deletions
      By shifting one sequence over by one position:
          v: ATATATAT-
          w: -TATATATA
      • The edit distance: d(v, w) = 2
      • Hamming distance neglects insertions and deletions in DNA

  13. Edit Distance
      • Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other
      • d(v, w) = minimum number of elementary operations needed to transform v into w

  14. Edit Distance vs Hamming Distance
      • Hamming distance always compares the i-th letter of v with the i-th letter of w
          v = ATATATAT
          w = TATATATA
      • Hamming distance: d(v, w) = 8
      • Computing Hamming distance is a trivial task
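The position-by-position comparison described here can be sketched in a few lines of Python. The function name `hamming_distance` is an illustrative choice, not something taken from the slides.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Compare the i-th letter of v only with the i-th letter of w.
    return sum(a != b for a, b in zip(v, w))


# The slide's example: every position differs.
print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```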

  15. Edit Distance vs Hamming Distance
      • Hamming distance always compares the i-th letter of v with the i-th letter of w:
          v = ATATATAT
          w = TATATATA
        Hamming distance: d(v, w) = 8. Computing Hamming distance is a trivial task.
      • Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
          v = -ATATATAT
          w = TATATATA-
        Edit distance: d(v, w) = 2. Computing edit distance is a non-trivial task.

  16. Edit Distance vs Hamming Distance
      • Hamming distance always compares the i-th letter of v with the i-th letter of w:
          v = ATATATAT
          w = TATATATA
        Hamming distance: d(v, w) = 8
      • Edit distance may compare the i-th letter of v with the j-th letter of w:
          v = -ATATATAT
          w = TATATATA-
        Edit distance: d(v, w) = 2 (one insertion and one deletion)
      • How do we find which j goes with which i?

  17. Edit Distance: Example
      TGCATAT → ATCCGAT in 5 steps:
          TGCATAT  (delete last T)
          TGCATA   (delete last A)
          TGCAT    (insert A at front)
          ATGCAT   (substitute C for the G at position 3)
          ATCCAT   (insert G before last A)
          ATCCGAT  (done)

  18. Edit Distance: Example
      TGCATAT → ATCCGAT in 5 steps:
          TGCATAT  (delete last T)
          TGCATA   (delete last A)
          TGCAT    (insert A at front)
          ATGCAT   (substitute C for the G at position 3)
          ATCCAT   (insert G before last A)
          ATCCGAT  (done)
      • What is the edit distance? 5?

  19. Edit Distance: Example (cont'd)
      TGCATAT → ATCCGAT in 4 steps:
          TGCATAT   (insert A at front)
          ATGCATAT  (delete the T at position 6)
          ATGCAAT   (substitute G for the A at position 5)
          ATGCGAT   (substitute C for the G at position 3)
          ATCCGAT   (done)

  20. Edit Distance: Example (cont'd)
      TGCATAT → ATCCGAT in 4 steps:
          TGCATAT   (insert A at front)
          ATGCATAT  (delete the T at position 6)
          ATGCAAT   (substitute G for the A at position 5)
          ATGCGAT   (substitute C for the G at position 3)
          ATCCGAT   (done)
      • Can it be done in 3 steps?
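One way to settle the question is to compute the edit distance directly. The sketch below is a minimal dynamic-programming version with unit cost for insertions, deletions, and substitutions; the function name `edit_distance` and the table layout are illustrative assumptions rather than anything given on the slides.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the prefixes v[:i] and w[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all i letters of v[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all j letters of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion from v
                d[i][j - 1] + 1,                 # insertion into v
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
    return d[n][m]


print(edit_distance("TGCATAT", "ATCCGAT"))    # 4
print(edit_distance("ATATATAT", "TATATATA"))  # 2
```

Under this unit-cost model the distance for the example is 4, so the four-step transformation is already optimal and three steps are not enough.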

  21. The Alignment Grid • Every alignment path is from source to sink

  22. Alignment as a Path in the Edit Graph
      (Figure: edit graph for v = ATGTTAT and w = ATCGTAC, each indexed 0 to 7.)
          0 1 2 2 3 4 5 6 7 7
      v =  A T _ G T T A T _
      w =  A T C G T _ A _ C
          0 1 2 3 4 5 5 6 6 7
      Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
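The correspondence between an alignment and its path can be made concrete with a short Python sketch. It assumes the alignment is given as two equal-length rows with `_` as the gap symbol; `alignment_to_path` is a hypothetical helper name.

```python
def alignment_to_path(row_v: str, row_w: str, gap: str = "_"):
    """Return the edit-graph vertices (i, j) visited by an alignment."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same number of columns")
    i = j = 0
    path = [(0, 0)]  # every path starts at the source
    for a, b in zip(row_v, row_w):
        if a == gap and b == gap:
            raise ValueError("a column cannot hold two gaps")
        if a != gap:
            i += 1  # a letter of v was consumed
        if b != gap:
            j += 1  # a letter of w was consumed
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
# [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
```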

  23. Alignment as a Path in the Edit Graph
      (Figure: edit graph for v = ATGTTAT and w = ATCGTAC, each indexed 0 to 7.)
      • Every path in the edit graph corresponds to an alignment

  24. Alignment as a Path in the Edit Graph
      (Figure: edit graph for v = ATGTTAT and w = ATCGTAC, each indexed 0 to 7.)
      Old alignment:
          0 1 2 2 3 4 5 6 7 7
      v =  A T _ G T T A T _
      w =  A T C G T _ A _ C
          0 1 2 3 4 5 5 6 6 7
      New alignment:
          0 1 2 2 3 4 5 6 7 7
      v =  A T _ G T T A T _
      w =  A T C G _ T A _ C
          0 1 2 3 4 4 5 6 6 7

  25. From LCS to Alignment: Change up the Scoring
      • The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches)
      • In the LCS problem, we scored 1 for matches and 0 for indels
      • Consider penalizing indels and mismatches with negative scores
      • Simplest scoring scheme:
          +1 : match premium
          -μ : mismatch penalty
          -σ : indel penalty

  26. Simple Scoring • When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is: #matches - μ(#mismatches) - σ(#indels)
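As a small illustration of this formula, the sketch below scores an alignment that is already written out as two rows. The gap symbol `_` and the name `simple_score` are assumptions made for the example.

```python
def simple_score(row_v: str, row_w: str, mu: float, sigma: float, gap: str = "_") -> float:
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same number of columns")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels: 5 - 0 - 4 * 2
print(simple_score("AT_GTTAT_", "ATCGT_A_C", mu=1, sigma=2))  # -3
```

Because every indel costs the same σ here, two adjacent indels score exactly the same as two separate ones, which is the limitation that affine gap penalties address.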

  27. Dynamic programming algorithm
      • Define a matrix F(i,j): F(i,j) is the score of the optimal alignment of the prefixes A1...i and B1...j
      • Iterative build-up: F(0,0) = 0
      • Define each element (i,j) from:
          (i-1, j): Ai aligned to a gap (gap in sequence B)
          (i, j-1): Bj aligned to a gap (gap in sequence A)
          (i-1, j-1): alignment of Ai to Bj
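The build-up of F can be sketched as follows. This is a minimal global-alignment example with a linear gap score; the default values (match +1, mismatch -1, gap -2), the `-` gap symbol, and the name `global_alignment` are illustrative assumptions, not parameters taken from the slides.

```python
def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # a[i-1] aligned to b[j-1]
                          F[i - 1][j] + gap,    # a[i-1] aligned to a gap
                          F[i][j - 1] + gap)    # b[j-1] aligned to a gap
    # Traceback from the sink (n, m) to the source (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several alignments tie for the best score, this traceback returns just one of them, preferring the diagonal move.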

  28. Dynamic programming

  29. Sequence Comparison Scoring Matrices
      • The choice of a scoring matrix can strongly influence the outcome of sequence analysis
      • Scoring matrices implicitly represent a particular theory of evolution
      • Elements of the matrices specify the similarity or the distance of replacing one residue (base) by another
      • Distance and similarity matrices are inter-convertible by some mathematical transformation

  30. Protein Scoring Matrices • The two most popular matrices are the PAM and the BLOSUM matrix

  31. Scoring Insertions and Deletions
      Without a gap:
           ATGTAATGCA
          TATGTGGAATGA
      With a gap (insertion / deletion):
           ATGT--AATGCA
          TATGTGGAATGA
      • The creation of a gap is penalized with a negative score value

  32. Why Gap Penalties?
      • The optimal alignment of two similar sequences is usually the one that maximizes the number of matches and minimizes the number of gaps
      • Permitting the insertion of arbitrarily many gaps can lead to high-scoring alignments of non-homologous sequences
      • Penalizing gaps forces alignments to have relatively few gaps

  33. Why Gap Penalties?
      Scoring: match = 5, mismatch = -4
      Gaps not permitted. Score: 10 (14 matches × 5 - 15 mismatches × 4)
          1 GTGATAGACACAGACCGGTGGCATTGTGG 29
            |||   |  | |||    |   || || |
          1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
      Gaps allowed but not penalized. Score: 88 (20 matches × 5 - 3 mismatches × 4)
          1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
            ||| || | | | |||  ||  |  |  || || |
          1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29

  34. Gap Penalties
      • Linear gap penalty score: γ(g) = -gd
      • Affine gap penalty score: γ(g) = -d - (g - 1)e
      • γ(g) = gap penalty score of a gap of length g
      • d = gap opening penalty
      • e = gap extension penalty
      • g = gap length
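The two penalty functions translate directly into code. The sketch below uses the slide's symbols d, e, and g; the function names and the sample values (d = 3, e = 0.1) are chosen only for illustration.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: d to open the gap, e for each further position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


# A gap of length 3 with d = 3 and e = 0.1
print(linear_gap(3, 3))       # -9
print(affine_gap(3, 3, 0.1))  # about -3.2

# A long gap of 100 positions is far cheaper under the affine scheme.
print(linear_gap(100, 3), affine_gap(100, 3, 0.1))
```

For a gap of length 1 the two schemes agree, since both charge exactly d.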

  35. Scoring Indels: Naive Approach • A fixed penalty σ is given to every indel: • -σ for 1 indel • -2σ for 2 consecutive indels • -3σ for 3 consecutive indels, etc. • This can be too severe a penalty for a series of 100 consecutive indels

  36. Affine Gap Penalties
      • In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:
          This is more likely:
              ATA__GC
              ATATTGC
          This is less likely:
              ATAG_GC
              AT_GTGC
      • Normal scoring would give the same score to both alignments

  37. Accounting for Gaps
      • Gap: a contiguous sequence of spaces in one of the rows
      • The score for a gap of length x is -(ρ + σx)
      • ρ > 0 is the penalty for introducing a gap: the gap opening penalty
      • σ is the gap extension penalty
      • ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap

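  38. Affine Gap Penalty Recurrences
      Three scores are kept per cell: $s^{\downarrow}_{i,j}$ (alignments ending in a gap in w), $s^{\rightarrow}_{i,j}$ (alignments ending in a gap in v), and $s_{i,j}$ (the middle level).
      $$s^{\downarrow}_{i,j} = \max \begin{cases} s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\ s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle} \end{cases}$$
      $$s^{\rightarrow}_{i,j} = \max \begin{cases} s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\ s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle} \end{cases}$$
      $$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\ s^{\downarrow}_{i,j} & \text{end deletion: from top} \\ s^{\rightarrow}_{i,j} & \text{end insertion: from bottom} \end{cases}$$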
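A minimal three-table sketch of these recurrences is given below. It assumes the gap score -(ρ + σx) for a gap of length x, a simple match/mismatch δ, and illustrative defaults (match +1, mismatch -1, ρ = 2, σ = 1); the table names `down`, `right`, and `mid` are hypothetical labels for the three levels.

```python
def affine_global_score(v: str, w: str, match: float = 1.0, mismatch: float = -1.0,
                        rho: float = 2.0, sigma: float = 1.0) -> float:
    """Best global alignment score when a gap of length x costs rho + sigma * x."""
    NEG = float("-inf")
    n, m = len(v), len(w)
    down = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with a gap in w (deletion)
    right = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap in v (insertion)
    mid = [[NEG] * (m + 1) for _ in range(n + 1)]    # best score over all three cases
    mid[0][0] = 0.0
    for i in range(1, n + 1):  # first column: one gap of length i in w
        down[i][0] = -(rho + sigma * i)
        mid[i][0] = down[i][0]
    for j in range(1, m + 1):  # first row: one gap of length j in v
        right[0][j] = -(rho + sigma * j)
        mid[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            down[i][j] = max(down[i - 1][j] - sigma,         # continue gap in w
                             mid[i - 1][j] - (rho + sigma))  # start gap in w
            right[i][j] = max(right[i][j - 1] - sigma,         # continue gap in v
                              mid[i][j - 1] - (rho + sigma))   # start gap in v
            delta = match if v[i - 1] == w[j - 1] else mismatch
            mid[i][j] = max(mid[i - 1][j - 1] + delta,  # match or mismatch
                            down[i][j],                 # end deletion
                            right[i][j])                # end insertion
    return mid[n][m]


# Two matches and one gap of length 2: 2 - (2 + 1 * 2)
print(affine_global_score("AAAA", "AA"))  # -2.0
```

Only the score is returned here; recovering the alignment itself would need a traceback across the three tables.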

  39. Scoring Insertions and Deletions
      match = 1, mismatch = 0
      Without a gap, total score: 4
           ATGTTATAC
          TATGTGCGTATA
      With a gap (insertion / deletion), total score: 8 - 3.2 = 4.8
           ATGT---TATAC
          TATGTGCGTATA
      Gap parameters: d = 3 (gap opening), e = 0.1 (gap extension), g = 3 (gap length)
      γ(g) = -3 - (3 - 1) × 0.1 = -3.2

  40. Modification of Gap Penalties
      Score matrix: BLOSUM62
      Gap opening penalty = 3, gap extension penalty = 0.1, score = 6.3:
          1 ...VLSPADKFLTNV 12
                ||||
          1 VFTELSPAKTV.... 11
      Gap opening penalty = 0, gap extension penalty = 0.1, score = 11.3:
          1 V...LSPADKFLTNV 12
            |   |||| |  | |
          1 VFTELSPA.K..T.V 11

  41. Pairwise Sequence Alignment • Local Alignment • Semi-Global Alignment

  42. Local Alignment
      • A local alignment between sequence s and sequence t is an alignment with maximum similarity between a substring of s and a substring of t
      • T. F. Smith & M. S. Waterman, "Identification of Common Molecular Subsequences", J. Mol. Biol., 147:195-197 (1981)

  43. Why choose a local alignment algorithm?
      • More meaningful: points out conserved regions between two sequences
      • Aligns two sequences of different lengths
      • Aligns two partially overlapping sequences
      • Aligns two sequences where one is a subsequence of the other

  44. Dynamic Programming: Local Alignment
      S(i,j) = MAXIMUM [
          S(i-1,j-1) + s(ai,bj)   (match/mismatch in the diagonal),
          S(i,j-1) + w            (gap in sequence #1),
          S(i-1,j) + w            (gap in sequence #2),
          0 ]
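The recurrence, together with the initialization, matrix fill, and traceback steps, can be sketched in Python as follows. The scoring values (match +2, mismatch -1, w = -2) and the name `local_alignment` are assumptions for illustration; w is treated as a negative gap score because it is added in the recurrence.

```python
def local_alignment(a: str, b: str, match: int = 2, mismatch: int = -1, w: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # Initialization: first row and first column stay at 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    # Matrix fill: the extra 0 lets an alignment restart anywhere.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                          S[i][j - 1] + w,      # gap in sequence #1 (a)
                          S[i - 1][j] + w,      # gap in sequence #2 (b)
                          0)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback: start at the highest-scoring cell and stop at the first 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i][j - 1] + w:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
        else:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("CCCCACGTCCCC", "GGGGACGTGGGG"))  # (8, 'ACGT', 'ACGT')
```

In the usage example the flanking C and G runs never match each other, so only the shared ACGT segment is reported.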

  45. Initialization Step

  46. Matrix Fill Step

  47. Traceback Step

  48. Traceback Step

  49. Traceback Step

  50. An Introduction To Multiple Sequence Alignment (MSA)
