1 / 41

Before we begin…

Before we begin….

fell
Télécharger la présentation

Before we begin…

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Before we begin… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

  2. Pairwise Sequence AlignmentLesson 2

  3. What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. MVNLTSDEKTAVLALWNKVDVEDCGGE |||| ||||| ||| |||| || MVHLTPEEKTAVNALWGKVNVDAVGGE

  4. Why sequence alignment? Predict characteristics of a protein – use the structure or function information on known proteins with similar sequences available in databases in order to predict the structure or function of an unknown protein Assumptions: similar sequences produce similar proteins

  5. Local vs. Global Global alignment: forces alignment in regions which differ • Global alignment – finds the best alignment across the whole two sequences. • Local alignment – finds regions of high similarity in parts of the sequences. ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ Local alignment concentrates on regions of high similarity ADLG CDRYFQ |||| |||| | ADLG CDRYYQ

  6. Sequence evolution In the course of evolution, the sequences changed from the ancestral sequence by random mutations Three types of changes: • Insertion - an insertion of a letter or several letters to the sequence. AAGA AAGTA Insertion AAG A T

  7. Sequence evolution In the course of evolution, the sequences changed from the ancestral sequence by random mutations Three types of changes : • Insertion - an insertion of a letter or several letters to the sequence. AAGA AAGTA • Deletion – a deletion of a letter (or more) from the sequence. AAGA AGA Deletion A A AG

  8. Evolutionary changes in sequences In the course of evolution, the sequences changed from the ancestral sequence by random mutations Three types of mutations: • Insertion - an insertion of a letter or several letters to the sequence. AAGA AAGTA • Deletion - deleting a letter (or more) from the sequence. AAGA AGA • Substitution – a replacement of one (or more) sequence letter by another AAGA AACA Substitution AA A C G Insertion + Deletion Indel

  9. Sequence alignment AAGCTGAATTCGAA AGGCTCATTTCTGA One possible alignment: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2mismatches 4 indels (gap) 10 perfect matches

  10. Choosing an alignment: • Many different alignments are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Which alignment is better?

  11. Scoring an alignment:example - naïve scoring system: • Match: +1 • Mismatch: -2 • Indel: -1 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Score: =(+1)x10 + (-2)x2 + (-1)x4= 2 Score: =(+1)x9 + (-2)x2 + (-1)x6 = -1 Higher score  Better alignment

  12. Scoring system: • Different scoring systems can produce different optimal alignments • Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, physico-chemical properties based • Some mismatches are more plausible • Transition vs. Transversion • LysArg ≠ LysCys • Gap extension Vs. Gap opening

  13. Substitutions Matrices • Nucleic acids: • Transition-transversion • Amino acids: • Evolution (empirical data) based: (PAM, BLOSUM) • Physico-chemical properties based (Grantham, McLachlan)

  14. PAM Matrices • Family of matrices PAM 80, PAM 120, PAM 250 • The number with PAM matrices represent evolutionary distance • Greater numbers denote greater distances

  15. Which PAM matrix to use? • Low PAM numbers: strong similarities • High PAM numbers: weak similarities • PAM120 for general use (40% identity) • PAM60 for close relations (60% identity) • PAM250 for distant relations (20% identity) • If uncertain, try several different matrices • PAM40, PAM120, PAM250

  16. PAM - limitations • Based on only one original dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins so the matrix is biased

  17. BLOSUM Matrices • Different BLOSUMn matrices are calculated independently from BLOCKS • BLOSUMn is based on sequences that share at least n percent identity • BLOSUM62 represents closer sequences than BLOSUM45

  18. Example : Blosum62 derived from blocks of sequences that share at least 62% identity

  19. Which BLOSUM matrix to use? • Low BLUSOM numbers for distant sequences • High BLUSOM numbers for similar sequences • BLOSUM62 for general use • BLOSUM80 for close relations • BLOSUM45 for distant relations

  20. PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences

  21. Gap penalty • We expect to penalize gaps • A different score for gap opening and for extension • Insertions and deletions are rare in evolution • But once they occur, they are easy to extend • Gap-extension penalty < gap-opening penalty

  22. Web servers for pairwise alignment

  23. BLAST 2 sequences (bl2Seq) at NCBI Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool)engine for local alignment • Does not use an exact algorithm but a heuristic

  24. Back to NCBI

  25. BLAST – bl2seq

  26. Bl2Seq - query • blastn – nucleotide blastp – protein

  27. Bl2seq results

  28. Bl2seq results Dissimilarity Low complexity Gaps Similarity Match

  29. Bl2seq results: • Bits score– A score for the alignment according to the number of similarities, identities, etc. • Expected-score (E-value) –The number of alignments with the same score one can “expect” to see by chance when searching a database of a particular size. The closer the e-value approaches zero, the greater the confidence that the hit is real

  30. BLAST – programs Query: DNA Protein Database: DNA Protein

  31. BLAST – Blastp

  32. Blastp - results

  33. Blastp – results (cont’)

  34. Blastp – acquiring sequences

  35. blastp – acquiring sequences (cont’)

  36. Fasta format – multiple sequences >gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH >gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI ALAHKYH >gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS ALSSRYH

  37. Searching for remote homologs • Sometimes BLAST isn’t enough • Large protein family, and BLAST only finds close members. We want more distant members • PSI-BLAST • Profile HMMs (not discussed in this exercise)

  38. PSI-BLAST • Position Specific Iterated BLAST Regular blast Construct profile from blast results Blast profile search Final results

  39. PSI-BLAST • Advantage: PSI-BLAST looks for seq’s that are close to the query, and learns from them to extend the circle of friends • Disadvantage: if we obtained a WRONG hit, we will get to unrelated sequences (contamination). This gets worse and worse each iteration

  40. BLAST – PSI-Blast

  41. PSI-Blast - results

More Related