Pairwise and multiple sequence alignment

Pairwise and multiple sequence alignment MLW2013 - 2011 BiGCaT bioinformatics

Contents • PART 1: PAIRWISE SEQUENCE ALIGNMENT • Basics • Substitution matrices • Algorithms for sequence alignment • PART 2: MULTIPLE SEQUENCE ALIGNMENT

Further reading • Bioinformatics and Functional Genomics by Jonathan Pevsner • Chapters 3 and 10 • Available from the library (SL) • Understanding Bioinformatics by M. Zvelebil and J.O. Baum • Chapter 4 (extra: 5.1, 6.4 and 6.5) • Available from the library (SL) • Chapter 4 freely available from: http://www.garlandscience.com/textbooks/0815340249/pdf/UBchapter4.pdf

PART 1: PAIRWISE SEQUENCE ALIGNMENT

Basics

Pairwise sequence alignment • Is used: • to decide if two proteins / genes are related structurally or functionally • to identify domains or motifs that are shared between proteins • in the analysis and annotation of genomes • Images: Homologene and PDB

Pairwise sequence alignment • Is used: • in phylogenetic analysis, making it possible to look back millions of years • to identify which transcripts or genes have been retrieved in an experiment • Is the basis of BLAST searching (next lecture) • Image: http://tolweb.org/Eutheria/15997

Definitions • Identity: the extent to which two sequences are exactly the same • Conservation: changes at a specific position of an amino acid (or less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue • Similarity: the extent to which nucleotide or protein sequences are related, based on identity and conservation

Conservation • Conservation is relevant for alignment, since an amino acid is more likely to be replaced by another one from the same group than by one from a different group

Definition • Homology: similarity attributed to descent from a common ancestor • NOTE: Two genes are either homologous or they are not; they cannot be "somewhat homologous" • Don’t confuse this term with similarity or identity!

Definitions • Two types of homology: Orthologs • Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function Paralogs • Homologous sequences within a single species that arose by gene duplication

Example

Two types of non-homology • Homoplasy: convergent evolution, two similar sequences that evolved independently (without a common ancestor) • Xenology: horizontal gene transfer, genes that occur in unrelated organisms, caused by exchange of genomic material • Images: Wikipedia

Definition Pairwise sequence alignment: • The process of lining up two sequences to achieve maximal levels of identity and conservation in order to assess the degree of similarity and the possibility of homology.

Gaps • To deal with insertions/deletions and improve alignments, we need to introduce gaps • A gap is a position at which a letter is paired with nothing (an empty space)

Example protein alignment (Retinol-binding protein, RBP, against b-lactoglobulin)
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
    . ||| | . |. . . | : .||||.:| :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
    : | | | | :: | .| . || |: || |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
    || ||. | :.|||| | . .|
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
    . | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Legend • Identity (bar) • Very similar (two dots) • Somewhat similar (one dot)

Example protein alignment (the same RBP / lactoglobulin alignment; gaps are shown as dots) • Internal gap: a gap within the aligned region, e.g. the two dots in IVT..QTM • Terminal gap: a gap at either end of a sequence, e.g. the three dots before MKCLL

Alignment of protein sequences can be more informative than DNA • protein is more informative (20 vs. 4 characters) • many amino acids share related biophysical properties • codons are degenerate: changes in the third position often do not alter the amino acid that is specified • protein sequences offer a longer “look-back” time

DNA sequences can be translated into protein for use in pairwise alignments • DNA can be translated into six potential proteins (three reading frames on each strand)
Forward frames: 5’ CAT CAA • 5’ ATC AAC • 5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reverse frames: 5’ GTG GGT • 5’ TGG GTA • 5’ GGG TAG
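A minimal sketch of the six-frame translation on this slide, assuming the standard genetic code and an input made up of A, C, G and T only (any other letter would raise a KeyError). The function names are illustrative and do not come from the lecture.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Return frames +1, +2, +3 followed by -1, -2, -3."""
    reverse = reverse_complement(dna)
    forward_frames = [translate(dna[offset:]) for offset in range(3)]
    reverse_frames = [translate(reverse[offset:]) for offset in range(3)]
    return forward_frames + reverse_frames


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
labels = ["+1", "+2", "+3", "-1", "-2", "-3"]
for label, protein in zip(labels, six_frame_translation(seq)):
    print(label, protein)
```

The first two codons of each frame are the ones shown on the slide (CAT CAA, ATC AAC, TCA ACT, GTG GGT, TGG GTA, GGG TAG).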

When to use DNA alignments? • DNA alignments are still appropriate in several cases: • to confirm the identity of a cDNA • to study noncoding regions of DNA • to study DNA polymorphisms • to study genomes • example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

Which alignment is best? • Often, sequences can be aligned in several ways • To determine the ‘best’ alignment we need a quantitative score, that reflects the degree of similarity • The alignment with the highest score is considered to be the best one • A very simple score would be to determine the percentage of identities
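As a sketch of the "percentage of identities" score, the snippet below counts identical columns in two already-aligned rows. It assumes gaps are written as "-" and that the percentage is taken over all alignment columns, gap columns included; other conventions (for example dividing by the shorter sequence length) exist. The example rows are the Neanderthal vs modern human DNA alignment from the lecture.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Return (identities, columns, percentage) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identities = sum(
        1 for x, y in zip(row_a, row_b) if x == y and x != gap
    )
    columns = len(row_a)
    return identities, columns, 100.0 * identities / columns


query = "catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac"
sbjct = "catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac"

identities, columns, pid = percent_identity(query, sbjct)
print(f"{identities} identities in {columns} columns = {pid:.1f}% identity")
```

Here 55 of the 60 columns are identical (four mismatches and one gap), so the simple score is about 91.7%.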

Dot-plot • A dot-plot can graphically represent identities • Put one sequence along the horizontal axis • The other along the vertical axis • Put a dot where they are identical • A diagonal group of dots indicates a stretch of identities

Dot-plot • Since we are not interested in coincidental matches, we can use a window of width n: a dot is placed only if at least m matches are present within those n residues • For example, using a window of width 4 and an m of 3:
V O O R B E E L D
|     | | | | | |
V E U R B E E L D
Window 1-4: 2 < 3 • Window 2-5: 2 < 3 • Window 3-6: 3 >= 3 • Window 4-7: 4 >= 3 • Window 5-8: 4 >= 3 • Window 6-9: 4 >= 3
The stretch R B E E L D is scored:
R B E E L D
| | | | | |
R B E E L D
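The windowed dot-plot can be sketched as follows. This is an illustration under simple assumptions: a dot is recorded at the start position of each window, windows run along diagonals only, and the function name and parameters are made up for this example.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (i, j) window start positions that get a dot.

    A dot is placed when the window of seq_x starting at i and the window
    of seq_y starting at j share at least min_matches identical positions.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for j, residue in enumerate(seq_y):
        row = ["*" if (i, j) in dots else "." for i in range(len(seq_x))]
        lines.append(residue + " " + " ".join(row))
    return "\n".join(lines)


dots = dot_plot("VOORBEELD", "VEURBEELD", window=4, min_matches=3)
print(render("VOORBEELD", "VEURBEELD", dots))
```

With a window of 4 and m = 3, the windows starting at positions 1 and 2 on the main diagonal get no dot (2 matches each), while those starting at positions 3 to 6 do, as in the worked example.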

Example • Dot-plot with no window • Dot-plot with window size 10, 3 matches required

More advanced scoring • A more elaborate score should reward identity and conservation, and penalise mismatches and gaps • We can improve the scoring scheme by also giving a (smaller) positive score to conserved but not identical residues (e.g. from the same group of amino acids) • And by giving very unalike residues and gaps a certain negative score as well
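A minimal sketch of such a scoring scheme is given below. The amino acid grouping and the score values (+2, +1, -1, -2) are assumptions chosen for illustration only; the lecture does not specify them, and real alignments use a substitution matrix instead.

```python
# Hypothetical grouping of the 20 amino acids by broad physico-chemical class.
GROUPS = ["AVLIMC", "FWY", "STNQ", "KRH", "DE", "GP"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def score_pair(x, y, identity=2, conserved=1, mismatch=-1, gap=-2):
    """Score one alignment column; '-' denotes a gap."""
    if x == "-" or y == "-":
        return gap
    if x == y:
        return identity
    if GROUP_OF[x] == GROUP_OF[y]:
        return conserved  # different residue, same group
    return mismatch


def score_alignment(row_a, row_b):
    """Sum the column scores of two aligned rows of equal length."""
    return sum(score_pair(x, y) for x, y in zip(row_a, row_b))


# Fragment of the RBP / lactoglobulin alignment shown earlier.
print(score_alignment("GTWYAMA", "GTWYSLA"))
```

In this fragment there are five identities, one conserved pair (M/L) and one mismatch (A/S), giving 5 x 2 + 1 - 1 = 10 under the assumed values.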

Gap penalties • Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is penalised more than the length of the gap (opening a gap costs more than extending it) • Terminal gaps and mismatches are not penalised and not considered to be aligned • Figure legend: positive score, negative score, not scored
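The idea that opening a gap costs more than extending it is usually written as an affine gap penalty. The sketch below assumes hypothetical values (opening 10, extension 1) and the convention that the first gap position pays the opening cost and every further position pays the extension cost; programs differ on this convention.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of the given length (values are illustrative)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
print(one_long_gap, four_short_gaps)
```

One gap of four residues is penalised 13, whereas four separate gaps of one residue are penalised 40 in total, which reflects that a single mutational event is more plausible than four independent ones.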

Substitution matrices

How are the exact scores determined in reality? • The score for each specific match or mismatch is based on the real substitution rates of residues • Thus for each pair of amino acids, there is a unique score (or two, if changes in both directions have a different score) • A matrix that contains these scores is called a substitution matrix

How are these matrices computed? – Overview • First one determines all mutations that have occurred in a sample set of sequences • Note that the results will indeed differ for different protein families, species et cetera • Based on these, one can compute how often each amino acid has been replaced by each other amino acid • Each value is corrected for the number of occurrences of the amino acid in all sequences and the overall number of mutations • The values are log-scaled for easier computation
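The first steps of this overview (counting replacements and counting how often each residue occurs) can be sketched as below. The aligned pairs are toy data invented for illustration, the function name is hypothetical, and the sketch stops before the normalisation and log-scaling steps.

```python
from collections import Counter


def count_substitutions(aligned_pairs, gap="-"):
    """Count residue occurrences and unordered replacements per column."""
    pair_counts = Counter()     # e.g. ("A", "B") -> number of A<->B columns
    residue_counts = Counter()  # occurrences of each residue
    for row_a, row_b in aligned_pairs:
        for x, y in zip(row_a, row_b):
            if x == gap or y == gap:
                continue  # gap columns carry no substitution information
            residue_counts[x] += 1
            residue_counts[y] += 1
            if x != y:
                pair_counts[tuple(sorted((x, y)))] += 1
    return pair_counts, residue_counts


# Toy aligned pairs over a three-letter alphabet (not real protein data).
toy_pairs = [("AAAB", "ABAB"), ("ACA", "ABA")]
pair_counts, residue_counts = count_substitutions(toy_pairs)
print(dict(pair_counts))
print(dict(residue_counts))
```

Dividing each replacement count by the occurrence counts of the residues involved is the kind of correction the fourth bullet refers to.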

PAM matrices • Often used substitution matrices are the PAM matrices (Point Accepted Mutations) by Dayhoff et al. • PAM is based on global (full length) alignments of a sample set of closely related proteins (>85% identity) from several species • This ensures that each of the differences has probably occurred by only one mutation

Divergent sequences • In less similar sequences, these situations may and will occur: • multiple substitutions • back substitution • coincidental substitutions • parallel substitutions • convergent substitutions • In order to correctly deduce these, you will need the ancestry of the protein (a phylogenetic tree) • Figure: example trees over the sequences AAA, ABA and ACA, with the sequences present in the data set shown in thick boxes

Accepted point mutations (percentages x 10) only part of the matrix is shown

Relative amino acid mutabilities (alanine is normalised to 100)
Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18

Several PAM matrices • Based on these estimates, PAM matrices have been computed for several evolutionary distances • PAM1 is the matrix computed from sequences that are no more than 1% divergent • Other matrices are computed from this matrix by multiplying PAM1 with itself repeatedly • For example, PAM250 is suitable for proteins with 250 accepted mutations per 100 residues • This is possible, since residues will be changed more than once over longer evolutionary time
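Deriving a higher PAM matrix from PAM1 amounts to raising the mutation probability matrix to a power. The sketch below uses a made-up three-letter alphabet with a 1% chance of change per step, not Dayhoff's 20 x 20 data, and stores the original residue in the rows (the slides put it in the columns).

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, exponent):
    """Multiply the matrix with itself `exponent` times (identity for 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical "PAM1" for residues A, B, C: each row sums to 1 and each
# residue stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % value for value in row])
```

After 250 steps the diagonal values are far below 0.99 even though each single step changes only 1% of the residues, because positions can change more than once and can change back.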

Dayhoff’s PAM1 mutation probability matrix • Each element of the matrix shows the probability (x 10^4) that an original amino acid (top) will be replaced by another amino acid (side) • Only part of the matrix is shown

Dayhoff’s PAM250 mutation probability matrix • Each element of the matrix shows the percentage of an original amino acid (top) to be replaced by another amino acid (side) only part of the matrix is shown

Relation between PAM and % identity • At PAM1, two proteins are 99% identical • At PAM10.7, there are 10 differences per 100 residues • At PAM80, there are 50 differences per 100 residues • At PAM250, there are 80 differences per 100 residues • Figure: % identity plotted against evolutionary distance in PAMs, with the twilight zone and the midnight zone marked

Remarks • Why will two proteins with 50% identity have undergone 80 changes per 100 residues on average? • Because some residues have changed multiple times • Because some residues will have changed back to the original residue • As a rule of thumb, two proteins sharing > 30% identity over a substantial region are usually homologous • Proteins with 20% to 30% identity are in the “twilight zone”; they may still be statistically significantly related

Why do we transform to log scale? • It is easier to compute total scores: instead of multiplying probabilities, we can now just sum scores! • The scores are ‘log odds’ scores • Positive scores mean that the likelihood of the change is higher than expected by chance, negative means lower, 0 is neutral
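The log-odds idea can be illustrated with a few made-up numbers. The sketch assumes a score of 10 x log10(observed / expected), a scaling commonly used for PAM-style matrices; the frequencies are hypothetical and only serve to show that summing scores is equivalent to multiplying the underlying odds ratios.

```python
import math


def log_odds(observed, expected):
    """Score of one aligned pair: 10 * log10 of the odds ratio."""
    return 10 * math.log10(observed / expected)


# Hypothetical (observed, expected-by-chance) frequencies for three columns.
columns = [(0.020, 0.005), (0.003, 0.006), (0.010, 0.010)]

scores = [log_odds(obs, exp) for obs, exp in columns]
total_score = sum(scores)
product_of_ratios = math.prod(obs / exp for obs, exp in columns)

print([round(s, 2) for s in scores])
print(round(total_score, 2), round(10 * math.log10(product_of_ratios), 2))
```

The first pair occurs more often than chance (positive score), the second less often (negative score) and the third exactly as often as expected (score 0); the summed score equals the log of the multiplied ratios.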

PAM250 scoring matrix

Different matrices give different results • Example: analysing two distantly related proteins with a, say, PAM40 matrix finds almost no match • A PAM250 matrix is much more tolerant of mismatches
hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC
            * ** *  **
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                *     ****  *        * *   *   *               **  *
rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
            **        *  ** **
(stars mark identical residues)

BLOSUM matrices • BLOSUM (BLOCKS substitution) matrices are based on observed local alignments (only parts of the sequences may align) • They are based not only on closely related, but also on more distantly related proteins • The BLOCKS database contains thousands of groups of multiple sequence alignments

BLOSUM also comes in flavours • BLOSUM62 is built from blocks in which sequences sharing 62% identity or more are clustered, so it reflects substitutions between sequences that are less than 62% identical • It is often used as it works well for comparing moderately distant proteins, but also performs well in detecting closer relationships • A search for distant relatives may be more sensitive with a different matrix

BLOSUM62 scoring matrix

Note that a higher PAM number corresponds to a lower BLOSUM number!

Algorithms for sequence alignment

How can you find the best alignment? – Overview • Set up a matrix with one sequence on top and one sequence to the left • Fill up the matrix with scores • The scores depend on the scoring system you use: identities, similarities, substitution matrix-based scores et cetera • Find the optimal alignment using a ‘dynamic programming’ algorithm • Extract the alignment

Two dynamic programming methods • Advantage: only part of all possibilities has to be searched, and still the optimal one is found! • Two main methods • Global alignment (Needleman & Wunsch, 1970) • Total lengths of both sequences are aligned (“global”) • Gaps are permitted • Local alignment (Smith & Waterman, 1981) • Alignment may contain just a portion of either sequence • Appropriate for finding matched domains between sequences • Still computationally intensive
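A compact sketch of local alignment in the style of Smith & Waterman is given below. The scores (match +2, mismatch -1, gap -2 per position) are hypothetical, a simple linear gap penalty is used instead of separate opening and extension costs, and only one best local alignment is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1];
    # scores are never allowed to drop below 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("VOORBEELD", "VEURBEELD"))
```

For the two words used in the dot-plot example, the local alignment keeps only the shared stretch RBEELD and leaves the poorly matching start of each word out.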

Illustration • Example: Needleman-Wunsch with percentage identity as scoring scheme • Start with an identity matrix • Fill in the matrix starting from the bottom right • Compute the best possible score for the sequence alignment up till that point
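The illustration can be sketched in code as follows, assuming the simplest identity scoring: 1 for an identical pair, 0 for a mismatch and 0 for a gap. As on the slide, the matrix is filled starting from the bottom right, so each cell holds the best score for aligning the remaining ends of both sequences; the alignment is then read off from the top left. Function and variable names are illustrative.

```python
def needleman_wunsch_identity(a, b):
    """Global alignment that maximises the number of identities."""
    n, m = len(a), len(b)
    # score[i][j] = best number of identities when aligning a[i:] with b[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            diagonal = score[i + 1][j + 1] + (1 if a[i] == b[j] else 0)
            score[i][j] = max(diagonal, score[i + 1][j], score[i][j + 1])

    # Extract one optimal alignment, walking from the top left.
    i = j = 0
    top, bottom = [], []
    while i < n and j < m:
        diagonal = score[i + 1][j + 1] + (1 if a[i] == b[j] else 0)
        if score[i][j] == diagonal:
            top.append(a[i]); bottom.append(b[j]); i += 1; j += 1
        elif score[i][j] == score[i + 1][j]:
            top.append(a[i]); bottom.append("-"); i += 1
        else:
            top.append("-"); bottom.append(b[j]); j += 1
    # Whatever is left of one sequence is paired with terminal gaps.
    top.extend(a[i:]); top.extend("-" * (m - j))
    bottom.extend("-" * (n - i)); bottom.extend(b[j:])
    return score[0][0], "".join(top), "".join(bottom)


identities, row_a, row_b = needleman_wunsch_identity("VOORBEELD", "VEURBEELD")
print(identities)
print(row_a)
print(row_b)
```

With gaps costing nothing, this scheme only counts identities; a realistic scoring scheme would plug in substitution matrix scores and gap penalties at the same places.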
