Pairwise and Multiple Sequence Alignment Lesson 2

angus

Presentation Transcript


  1. Pairwise and Multiple Sequence Alignment, Lesson 2

  2. Motivation
     ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

     MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
     || ||  ||||| ||| || |   ||||||||||||||||||||
     MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

  3. What is sequence alignment?
     Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
     MVNLTSDEKTAVLALWNKVDVEDCGGE
     || ||  ||||| ||| || |   |||
     MVHLTPEEKTAVNALWGKVNVDAVGGE

  4. Why perform a pairwise sequence alignment?
     • Finding homology between two sequences
     • e.g., predicting characteristics of a protein, premised on: similar sequence (or structure) → similar function

  5. Local vs. Global
     • Local alignment – finds regions of high similarity in parts of the sequences
       ADLGAVFALCDRYFQ
       ||||     |||| |
       ADLGRTQN CDRYYQ
     • Global alignment – finds the best alignment across the entire length of the two sequences
       ADLGAVFALCDRYFQ
       ||||     |||| |
       ADLGRTQN-CDRYYQ

  6. Evolutionary changes in sequences
     Three types of nucleotide changes:
     • Substitution – a replacement of one (or more) sequence characters by another: AAGA → AACA
     • Insertion – an insertion of one (or more) sequence characters: AAGA → AAGTA
     • Deletion – a deletion of one (or more) sequence characters: AAGA → AGA
     Insertion + Deletion = Indel

  7. Choosing an alignment
     • Many different alignments between two sequences are possible:
       AAGCTGAATTCGAA
       AGGCTCATTTCTGA

       AAGCTGAATT-C-GAA
       AGGCT-CATTTCTGA-

       A-AGCTGAATTC--GAA
       AG-GCTCA-TTTCTGA-
       . . .
     How do we determine which is the best alignment?

  8. Toy exercise
     Scoring scheme: substitution matrix (over A, C, G, T) plus gap penalty (opening = extending)
     Compute the score of each of the following alignments using this naïve scoring scheme:
     • Match: +1
     • Mismatch: -2
     • Indel: -1

       AAGCTGAATT-C-GAA
       AGGCT-CATTTCTGA-

       A-AGCTGAATTC--GAA
       AG-GCTCA-TTTCTGA-
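
     The toy exercise can be checked with a few lines of Python. This is a minimal sketch: the function name `score_alignment` is hypothetical, and it assumes both rows have the same length and that no column pairs a gap with a gap.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Score a gapped pairwise alignment column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # one sequence has a gap in this column
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score

# The two candidate alignments from the slide.
first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)
```

     Because opening and extending a gap cost the same here, every gap column is charged independently of its neighbors.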

  9. Substitution matrices: accounting for biological context
     Tr = transition, Tv = transversion
     • Which best reflects the biological reality regarding nucleotide mismatch penalty?
       • Tr > Tv > 0
       • Tv > Tr > 0
       • 0 > Tr > Tv
       • 0 > Tv > Tr
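
     As a reminder of the terms, a transition swaps two purines (A, G) or two pyrimidines (C, T), while a transversion swaps a purine for a pyrimidine or the reverse. The sketch below classifies a single nucleotide pair; the function name `substitution_type` is hypothetical, and the input is assumed to use only the letters A, C, G, T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    """Classify an aligned nucleotide pair as match, transition, or transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"), substitution_type("A", "C"))
```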

  10. Scoring schemes: accounting for biological context
     • Which best reflects the biological reality regarding these mismatch penalties?
       • Arg->Lys > Ala->Phe
       • Arg->Lys > Thr->Asp
       • Asp->Val > Asp->Glu

  11. PAM matrices
     • Family of matrices: PAM80, PAM120, PAM250, …
     • The number attached to a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based
     • The (i,j)-th cell in a PAMn matrix denotes the probability that amino acid i will be replaced by amino acid j over evolutionary distance n: P(i→j, n)
     • Greater n denotes greater distance
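
     The PAMn mutation probability matrix is obtained by raising the PAM1 matrix to the n-th power. The sketch below illustrates that step on a hypothetical 2-letter alphabet; the values in `toy_pam1` are invented for illustration, whereas the real PAM1 matrix is 20×20 and estimated from data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical PAM1-like matrix: each row sums to 1, with 1% change per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

     With larger n the off-diagonal replacement probabilities grow, which is why higher-numbered PAM matrices suit more distant sequences.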

  12. PAM – limitations
     • Based on only one original dataset
     • Examines only closely related proteins with few differences (at least 85% identity)
     • Based mainly on small globular proteins, so the matrix is biased

  13. BLOSUM matrices
     • Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments)
     • BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity
     • The (i,j)-th cell in a BLOSUM matrix denotes the log-odds ratio of the observed frequency to the expected frequency of amino acids i and j in the same position in the data: log(p_ij / (q_i · q_j))
     • Higher n denotes higher identity between the sequences on which the matrix is based
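
     The log-odds formula can be tried on made-up numbers. This sketch follows the slide's formula directly; the function name `log_odds` and the frequencies are hypothetical, and published BLOSUM matrices additionally scale and round the values.

```python
import math

def log_odds(p_ij, q_i, q_j, base=2):
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(p_ij / (q_i * q_j), base)

# Hypothetical frequencies: both residues have background frequency 0.1.
often_paired = log_odds(0.02, 0.1, 0.1)    # seen twice as often as expected
rarely_paired = log_odds(0.005, 0.1, 0.1)  # seen half as often as expected
print(often_paired, rarely_paired)
```

     Pairs observed more often than chance get positive scores, and pairs observed less often get negative scores.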

  14. PAM vs. BLOSUM
     Approximate equivalences, ordered from closer to more distant sequences:
     • PAM100 ≈ BLOSUM90
     • PAM120 ≈ BLOSUM80
     • PAM160 ≈ BLOSUM60
     • PAM200 ≈ BLOSUM52
     • PAM250 ≈ BLOSUM45
     Which matrix to use:
     • BLOSUM62 for general use
     • BLOSUM80 for close relations
     • BLOSUM45 for distant relations
     • PAM120 for general use
     • PAM60 for close relations
     • PAM250 for distant relations

  15. Substitution matrices exercise
     • Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:
       • Human – chimp
       • Human – yeast
       • Human – fish
     PAM options: PAM60, PAM120, PAM250
     BLOSUM options: BLOSUM45, BLOSUM62, BLOSUM80

  16. Substitution matrices
     • Nucleic acids:
       • Transition-transversion
     • Amino acids:
       • Based on evolutionary (empirical) data: PAM, BLOSUM
       • Based on physico-chemical properties: Grantham, McLachlan

  17. Gap penalty
       AAGCGAAATTCGAAC
       A-G-GAA-CTCGAAC

       AAGCGAAATTCGAAC
       AGG---AACTCGAAC
     • Which alignment has a higher score?
     • Which alignment is more likely?
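
     One way to explore these two questions is to score both alignments under a linear gap penalty (opening = extending) and under an affine one (opening costs more than extending). The sketch below is illustrative only: the function name and every parameter value are assumptions, not values taken from the slides.

```python
def score_with_gaps(row1, row2, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Score an alignment; a gap run costs gap_open once, then gap_extend per extra column."""
    score = 0
    prev_gap_row = None  # which row (1 or 2) held a gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score

top = "AAGCGAAATTCGAAC"
scattered = "A-G-GAA-CTCGAAC"  # three separate one-column gaps
single = "AGG---AACTCGAAC"     # one three-column gap

linear = (score_with_gaps(top, scattered), score_with_gaps(top, single))
affine = (score_with_gaps(top, scattered, gap_open=-3, gap_extend=-1),
          score_with_gaps(top, single, gap_open=-3, gap_extend=-1))
print(linear, affine)
```

     With these assumed values the linear scheme favors the three scattered gaps, while the affine scheme favors the single longer gap, which corresponds to one indel event rather than three.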

  18. Pairwise alignment algorithm, matrix representation: formulation
     • Two sequences, S1 and S2, and a scoring scheme: match = 1, mismatch = -1, gap = -2
     • V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j]
     • Recurrence:
       V[i+1,j+1] = max { V[i,j]   + S(S1[i+1], S2[j+1]),
                          V[i+1,j] + S(gap),
                          V[i,j+1] + S(gap) }

  19. Pairwise alignment algorithm, matrix representation: initialization
     Scoring scheme: match = 1, mismatch = -1, indel (gap) = -2
     (Matrix figure with S1 and S2 along the axes)

  20. Pairwise alignment algorithm, matrix representation: filling the matrix
     Scoring scheme: match = 1, mismatch = -1, indel (gap) = -2
     (Matrix figure with S1 and S2 along the axes)

  21. Pairwise alignment algorithm, matrix representation: trace back
     (Matrix figure with S1 and S2 along the axes)

  22. Pairwise alignment algorithm, matrix representation: trace back
     (Matrix figure with S1 and S2 along the axes)
       AAAC
       AG-C
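
     The initialization, matrix filling, and trace back steps can be put together in a short sketch of global alignment by dynamic programming (the approach usually called Needleman-Wunsch). Assumptions: the input sequences AAAC and AGC are inferred from the aligned result on the slide, the function name is hypothetical, and ties in the trace back are broken by trying a gap in S2 first, then the diagonal, then a gap in S1.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s1, aligned_s2) for an optimal global alignment."""
    n, m = len(s1), len(s2)

    def sub(i, j):
        # Substitution score for S1[i] against S2[j] (1-based positions).
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Initialization: aligning a prefix against nothing costs one gap per character.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap

    # Filling: each cell takes the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sub(i, j),
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the origin.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            row1.append(s1[i - 1]); row2.append("-"); i -= 1
        elif i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub(i, j):
            row1.append(s1[i - 1]); row2.append(s2[j - 1]); i -= 1; j -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1]); j -= 1
    return V[n][m], "".join(reversed(row1)), "".join(reversed(row2))

result = global_align("AAAC", "AGC")
print(result)
```

     Several alignments tie for the optimal score here, so a different tie-breaking order would return a different, equally good alignment.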

  23. Assessing the significance of an alignment score
     True sequences: AAGCTGAATTCGAA and AGGCTCATTTCTGA
       AAGCTGAATTC-GAA
       AGGCTCATTTCTGA-
       Score: 28.0
     Random sequences: AGATCAGTAGACTA and GAGTAGCTATCTCT
       AGATCAGTAGACTA-----
       ----GAGTAG-CTATCTCT
       Score: 26.0
     . . .
     Random sequences: CGATAGATAGCATA and GCATGTCATGATTC
       CGATAGATAGCATA---------
       ---------GCATGTCATGATTC
       Score: 16.0
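
     The comparison of a true score against scores of random sequences can be sketched as a shuffling test. Assumptions in this sketch: random sequences are produced by shuffling the two inputs (which preserves length and composition), the scoring scheme is match = 1, mismatch = -1, gap = -2 rather than whatever scheme produced the scores on the slide, and all function names are hypothetical.

```python
import random

def global_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only one matrix row in memory."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [i * gap]
        for j, b in enumerate(s2, 1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffle_p_value(s1, s2, n_shuffles=200, seed=0):
    """Fraction of shuffled pairs scoring at least as high as the true pair."""
    rng = random.Random(seed)
    observed = global_score(s1, s2)
    hits = 0
    for _ in range(n_shuffles):
        r1, r2 = list(s1), list(s2)
        rng.shuffle(r1)
        rng.shuffle(r2)
        if global_score("".join(r1), "".join(r2)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

observed, p_value = shuffle_p_value("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(observed, p_value)
```

     A small p-value suggests the true score is unlikely to arise from unrelated sequences of the same composition.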

  24. Web servers for pairwise alignment

  25. BLAST 2 sequences (bl2seq) at NCBI
     • Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
     • Does not use an exact algorithm but a heuristic

  26. Back to NCBI

  27. BLAST – bl2seq

  28. Bl2seq – query
     • blastn – nucleotide
     • blastp – protein

  29. Bl2seq results

  30. Bl2seq results
     (Annotated output figure; labels: match, similarity, dissimilarity, gaps, low complexity)

  31. Query type: AA or DNA?
     • For coding sequences, AA (protein) data are better
     • Selection operates most strongly at the protein level → the homology is more evident
     • AA has a 20-character alphabet, DNA a 4-character alphabet → lower chance of random similarity for AA

  32. BLAST – programs Query: DNA Protein Database: DNA Protein

  33. BLAST – Blastp

  34. Blastp - results

  35. Blastp – results (cont’)

  36. BLAST scores
     • Bit score – a score for the alignment according to the number of similarities, identities, etc. It is expressed in standardized units, so it can be compared across scoring schemes
     • Expect value (E-value) – the number of alignments with the same or a higher score that one can "expect" to see by chance when searching a random database of a given size with a random query of a given length. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog
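
     The link between the two scores can be sketched with the standard relation E ≈ m · n · 2^(−S'), where S' is the bit score, m the query length, and n the database length. The function name and the example lengths are hypothetical, and BLAST itself uses length-corrected ("effective") values, so reported E-values will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: chance alignments expected at or above this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: a 300-residue query against 10**9 database residues.
weak = expected_hits(30, 300, 10**9)
strong = expected_hits(60, 300, 10**9)
print(weak, strong)
```

     Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.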

  37. Multiple Sequence Alignment (MSA)

  38. Multiple sequence alignment
       Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
       Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
       Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
       Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
       Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
       Seq6 ATLVCLISDFYPGA--VTVAWKADS--
       Seq7 AALGCLVKDYFPEP--VTVSWNSG---
       Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
     • Similar to pairwise alignment, BUT n sequences are aligned instead of just 2
     • Each row represents an individual sequence
     • Each column represents the 'same' position

  39. Why perform an MSA? MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species

  40. Multiple sequence alignment: variable and conserved columns
       Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
       Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
       Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
       Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
       Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
       Seq6 ATLVCLISDFYPGA--VTVAWKADS--
       Seq7 AALGCLVKDYFPEP--VTVSWNSG---
       Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
     (The figure marks example columns as "variable" or "conserved")

  41. Alignment methods
     No practical method guarantees an optimal MSA, so all methods in use are heuristics:
     • Progressive/hierarchical alignment (ClustalX)
     • Iterative alignment (MAFFT, MUSCLE)

  42. Progressive alignment, first step: compute pairwise distances
     • Compute the pairwise alignments for all sequences against all (10 pairwise alignments for the five sequences A, B, C, D, E)
     • The similarities are converted to distances and stored in a table
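
     The all-against-all step can be sketched as follows. As a stand-in for a scored pairwise alignment, this sketch uses the edit distance (the minimum number of substitutions and indels), normalized by the longer sequence length; that choice, the function names, and the five toy sequences are assumptions for illustration, and the sequences are assumed to be non-empty.

```python
from itertools import combinations

def edit_distance(s, t):
    """Minimum number of substitutions and indels needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or substitution
                           prev[j] + 1,             # character of s against a gap
                           cur[j - 1] + 1))         # character of t against a gap
        prev = cur
    return prev[-1]

def distance_table(seqs):
    """Normalized distance for every unordered pair of named sequences."""
    return {
        frozenset((a, b)): edit_distance(seqs[a], seqs[b])
        / max(len(seqs[a]), len(seqs[b]))
        for a, b in combinations(sorted(seqs), 2)
    }

# Hypothetical toy sequences named A to E.
toy = {"A": "AAGCTGAATTCGAA", "B": "AAGCTGATTTCGAA", "C": "AGGCTCATTTCTGA",
       "D": "AGATCAGTAGACTA", "E": "CGATAGATAGCATA"}
table = distance_table(toy)
print(len(table))
```

     Five sequences give 5 · 4 / 2 = 10 unordered pairs, matching the count on the slide.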

  43. Progressive alignment, second step: build a guide tree
     • Cluster the sequences to create a tree (guide tree), which:
       • represents the order in which pairs of sequences are to be aligned
       • places similar sequences as neighbors in the tree
       • places distant sequences far from each other in the tree
     The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
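
     A guide tree can be built from the distance table by repeatedly joining the two closest clusters. The sketch below uses average-linkage clustering as an assumed, simple choice; real tools may use other clustering methods, and the distances in `toy_dist` are invented for illustration.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Join the closest clusters until one nested-tuple tree remains."""
    # Each cluster is (member names, subtree built so far).
    clusters = [((name,), name) for name in names]

    def average_distance(members1, members2):
        total = sum(dist[frozenset((a, b))] for a in members1 for b in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]][0], clusters[p[1]][0]))
        (members1, tree1), (members2, tree2) = clusters[i], clusters[j]
        merged = (members1 + members2, (tree1, tree2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

# Hypothetical distances: A-B and D-E are close, C sits nearer to A and B.
names = ["A", "B", "C", "D", "E"]
toy_dist = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
toy_dist[frozenset(("A", "B"))] = 0.1
toy_dist[frozenset(("D", "E"))] = 0.2
toy_dist[frozenset(("A", "C"))] = 0.3
toy_dist[frozenset(("B", "C"))] = 0.3

tree = guide_tree(names, toy_dist)
print(tree)
```

     The nesting of the result gives the alignment order: the innermost pairs are aligned first, then joined with their neighbors.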

  44. Progressive alignment, third step: align sequences in a bottom-up order
     (Figure: guide tree over sequences A, B, C, D, E)
     • Align the most similar (neighboring) pairs
     • Align pairs of pairs
     • Align sequences clustered to pairs of pairs deeper in the tree

  45. Main disadvantages of progressive alignments
     (Figure: guide tree over sequences A, B, C, D, E)
     • Guide-tree topology may be considerably wrong
     • Globally aligning pairs of sequences may create errors that will propagate through to the final result

  46. Iterative alignment
     (Figure: cycle of pairwise distance table → guide tree → MSA over sequences A, B, C, D, E)
     • Iterate until the MSA does not change (convergence)

  47. Blastp – acquiring sequences

  48. blastp – acquiring sequences

  49. blastp – acquiring sequences

  50. MSA input: multiple sequence FASTA file
     >gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
     MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
     AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
     ALAHKYH
     >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
     AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
     ALAHKYH
     >gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
     MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
     SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
     ALAHKYH
     >gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
     MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
     SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
     ALSSRYH
     >gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
     MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
     SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
     ALSSRYH
     >gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
     MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA
     VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK
     YR
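
     A multi-sequence FASTA file like this one is simple to read: each record starts with a ">" header line, and the sequence continues over the following lines until the next header. The sketch below is a minimal reader under that assumption; the function name `parse_fasta` and the short example records are hypothetical.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []          # start a new record
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical two-record example with a wrapped sequence.
example = """>seq1 toy record
MVHLT
PEEK
>seq2 toy record
MVNLT
"""
parsed = parse_fasta(example)
print(parsed)
```

     Joining the wrapped lines matters because alignment tools treat each record as one continuous sequence.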