This exercise focuses on the principles of pairwise sequence alignment, showcasing the methods used to compare DNA, RNA, or protein sequences. It covers the importance of identifying similarities and differences to predict characteristics of proteins based on sequence conservation. The exercise distinguishes between local and global alignment approaches, explains scoring systems for evaluating alignments (including match, mismatch, and indel scores), and discusses substitution matrices like PAM and BLOSUM for accurate analysis. It provides practical examples and exercises for enhanced understanding.
Motivation ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
Why sequence alignment?
Predict characteristics of a protein, premised on the assumption that similar sequence (or structure) implies similar function.
Local vs. Global
• Global alignment: finds the best alignment across the full length of both sequences, forcing alignment even in regions that differ.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global alignment:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment (concentrates on regions of high similarity):
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
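The difference between the two approaches can be sketched with a small dynamic-programming routine, in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. The function name `best_score` and the scores (match +1, mismatch -2, indel -1) are assumptions chosen for illustration; the sequences are the ones on the slide with the gap removed.

```python
def best_score(a: str, b: str, local: bool = False,
               match: int = 1, mismatch: int = -2, indel: int = -1) -> int:
    """Return the optimal alignment score of a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            table[i][0] = i * indel
        for j in range(1, cols):
            table[0][j] = j * indel

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                       table[i - 1][j] + indel,      # gap in b
                       table[i][j - 1] + indel)      # gap in a
            if local:
                # Local alignment: a region may restart anywhere at score 0.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell

    # Global: score of aligning both full sequences. Local: best region found.
    return best if local else table[-1][-1]


seq_a = "ADLGAVFALCDRYFQ"
seq_b = "ADLGRTQNCDRYYQ"
global_score = best_score(seq_a, seq_b)
local_score = best_score(seq_a, seq_b, local=True)
print("global:", global_score, "local:", local_score)
```

The global score is dragged down by the dissimilar middle region, while the local score reflects only the best-matching segment.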
Evolutionary changes in sequences
Three types of changes:
• Substitution: a replacement of one (or more) sequence letters by another (e.g. AAGA → AACA)
• Insertion: an insertion of one or several letters into the sequence
• Deletion: a deletion of one (or more) letters from the sequence
Insertion + Deletion = Indel
Choosing an alignment
• Many different alignments are possible for the same two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
Exercise: compute both alignment scores
• Match: +1
• Mismatch: -2
• Indel: -1
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
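One way to check a hand calculation is a column-by-column scorer such as the sketch below. The function name `score_alignment` is made up for this example; the default scores are the ones given in the exercise, and "-" is assumed to mark an indel.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -2, indel: int = -1) -> int:
    """Sum the per-column scores of an already gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel        # a letter aligned against a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print("alignment 1:", first)
print("alignment 2:", second)
```

Changing the keyword arguments shows how the preferred alignment depends on the scoring scheme.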
Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of nucleotide sequences? (Tr = transition, Tv = transversion)
• Tr > Tv > 0
• Tr < Tv < 0
• 0 > Tr > Tv
• 0 > Tv > Tr
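As a reminder of the terms, a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), while a transversion swaps between the two classes. The sketch below classifies a substitution and looks up a score; the function names and the numeric scores are illustrative assumptions, not values from the slides.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
NUCLEOTIDES = PURINES | PYRIMIDINES

# Hypothetical scores: transitions occur more often than transversions,
# so they are penalized less here.
ASSUMED_SCORES = {"match": 1, "transition": -1, "transversion": -2}


def substitution_type(a: str, b: str) -> str:
    """Classify a pair of nucleotides as match, transition or transversion."""
    a, b = a.upper(), b.upper()
    if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
        raise ValueError("expected single letters from A, C, G, T")
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def nucleotide_score(a: str, b: str) -> int:
    """Score one aligned column using the assumed transition/transversion scheme."""
    return ASSUMED_SCORES[substitution_type(a, b)]


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```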
Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of amino-acid sequences?
• Asp->Asn > Asp->Glu
• Arg->His > Ala->Phe
• Arg->His < Thr->Met
Substitution matrices
• Nucleic acids:
  • Transition-transversion
• Amino acids:
  • Based on evolutionary (empirical) data: PAM, BLOSUM
  • Based on physico-chemical properties: Grantham, McLachlan
PAM matrices
• A family of matrices, e.g. PAM80, PAM120, PAM250
• The number in a PAM matrix name represents the evolutionary distance between the sequences on which the matrix is based
• Greater numbers denote greater distances
PAM - limitations • Based on only one original dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins so the matrix is biased
BLOSUM matrices
• The different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)
• For BLOSUMn, sequences within a block that share at least n percent identity are clustered and counted as a single sequence before substitutions are tallied
• BLOSUM62 represents closer sequences than BLOSUM45
Substitution matrices exercise
Pick the best substitution matrix (one PAM and one BLOSUM) for each pairwise alignment:
• Human – chimp
• Human – yeast
• Human – fish
PAM options: PAM60, PAM120, PAM250
BLOSUM options: BLOSUM45, BLOSUM62, BLOSUM80
PAM vs. BLOSUM
Roughly equivalent matrices, with more distant sequences toward the bottom:
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
BLOSUM:
• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations
PAM:
• PAM120 for general use
• PAM60 for close relations
• PAM250 for distant relations
Gap penalty
Alignment 1:
AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC
Alignment 2:
AAGCGAAATTCGAAC
AGG---AACTCGAAC
• Which alignment is more likely?
• Which alignment has a higher score?
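A common way to make the score reflect that one long indel event is more plausible than several separate ones is to charge more for opening a gap than for extending it (an affine gap penalty). The sketch below scores both alignments from this slide under a flat per-position penalty and under an affine one; the function name `gapped_score` and all numeric values are assumptions for illustration.

```python
def gapped_score(top: str, bottom: str, match: int = 1, mismatch: int = -2,
                 gap_open: int = -4, gap_extend: int = -1) -> int:
    """Score a gapped alignment; the first position of each gap run costs
    gap_open and every further position of the same run costs gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    previous_gap_row = None  # row ("top"/"bottom") holding the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            total += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")   # three separate gaps
contiguous = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")  # one gap of length three

# Flat penalty: every gap position costs -1, opening is not special.
flat = [gapped_score(*aln, gap_open=-1, gap_extend=-1) for aln in (scattered, contiguous)]
# Affine penalty: opening a gap (-4) is costlier than extending it (-1).
affine = [gapped_score(*aln) for aln in (scattered, contiguous)]
print("flat penalty:  ", flat)
print("affine penalty:", affine)
```

Under the flat penalty the alignment with scattered gaps scores higher; under the assumed affine penalty the single contiguous gap wins.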
BLAST 2 sequences (bl2seq) at NCBI
• Produces a local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
• Uses a heuristic rather than an exact algorithm
Bl2seq - query
• blastn – nucleotide
• blastp – protein
Bl2seq results
[Figure: example bl2seq output annotated with match, similarity, dissimilarity, gaps, and low-complexity regions]
BLAST – programs
The program is chosen by the type of the query and of the database:
• Query: DNA or protein
• Database: DNA or protein
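The query/database combinations can be written as a small lookup table. The program names below are the standard BLAST programs rather than something spelled out on this slide, and the helper `choose_program` is a hypothetical name used only for this sketch.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",          # nucleotide query vs nucleotide database
    ("protein", "protein"): "blastp",  # protein query vs protein database
    ("dna", "protein"): "blastx",      # translated nucleotide query vs protein database
    ("protein", "dna"): "tblastn",     # protein query vs translated nucleotide database
}
# Not covered by the table: tblastx compares a translated nucleotide query
# against a translated nucleotide database.


def choose_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program for a query/database type combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return BLAST_PROGRAMS[key]


print(choose_program("DNA", "protein"))
```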
BLAST scores:
• Bit score: a score for the alignment according to the number of similarities, identities, etc.
• Expect value (E-value): the number of alignments with an equal or better score one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
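The link between the two scores is usually written as E = m · n · 2^(-S'), where S' is the bit score and m and n are the query and database lengths. This formula is not given on the slide; the sketch below simply evaluates it, with the function name `expected_hits` and the example lengths chosen as assumptions (BLAST itself works with adjusted "effective" lengths).

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E-value for a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same bit score is less convincing in a larger database,
# and each additional bit halves the E-value.
small_db = expected_hits(40, query_length=150, database_length=1_000_000)
large_db = expected_hits(40, query_length=150, database_length=1_000_000_000)
one_more_bit = expected_hits(41, query_length=150, database_length=1_000_000)
print(small_db, large_db, one_more_bit)
```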
FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
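In this format each record starts with a ">" header line, followed by one or more sequence lines. A minimal reader could look like the sketch below; `parse_fasta` is a name made up for this example, and the two short records are invented test data, not real database entries.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Invented example input with wrapped sequence lines.
example = """>seq1 example record
MVHLTPEEK
TAVNALW
>seq2 example record
MVHLTPEEKSAV
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```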
Searching for remote homologs • Sometimes BLAST isn’t enough • Large protein family, and BLAST only finds close members. We want more distant members • PSI-BLAST
PSI-BLAST
• Position-Specific Iterated BLAST
1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST search with the profile (steps 2 and 3 are iterated)
4. Final results
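The "construct profile" step can be illustrated with a toy position-specific scoring matrix built from ungapped, equal-length hits. This is a simplified sketch, not PSI-BLAST's actual procedure: the uniform background, the pseudocount, the function names, and the five-residue example segments (taken from the starts of the globin sequences shown earlier) are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount: float = 1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    length = len(aligned_hits[0])
    if any(len(hit) != length for hit in aligned_hits):
        raise ValueError("all hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background frequency
    denominator = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, sequence: str) -> float:
    """Score a sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


hits = ["MVHLT", "MVHLT", "MVHFT", "MGHFT"]
profile = build_profile(hits)
print(round(profile_score(profile, "MVHLT"), 2))
print(round(profile_score(profile, "WWWWW"), 2))
```

Residues seen often in a column score above zero and unseen residues score below zero, which is what lets a profile search reach relatives that a single query sequence would miss.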
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query and learns from them to extend the circle of friends
• Disadvantage: if a WRONG hit is included, the search drifts toward unrelated sequences (contamination), and this gets worse with each iteration
PSI-BLAST
Which of the following is/are correct?
• PSI-BLAST is expected to give more hits than BLAST
• PSI-BLAST is an iterative search method
• PSI-BLAST is faster than BLAST
• Each iteration of PSI-BLAST can only improve the results of the previous iteration