Sequence Alignment

toril

Presentation Transcript


  1. Sequence Alignment Finding similarities between sequences (Chapter 11)

  2. The Problem • you have a sequence and you want to know if it is similar to another known sequence • Is it identical to another, known sequence? • Is it similar to another, known sequence? • If so, how similar, and is the similarity restricted to a few regions, or generalized?

  3. Methods - Conceptual • one way of looking at this problem is as a big motif search, i.e. treating one sequence as a motif and scanning the other to find whether it matches. • Regular expressions • HMM • Scoring

  4. Regular Expressions • not a workable approach • a simple regex scan will only probe for a small subset of all possible matches, simplest being a perfect match. • no way to implement a quantitative evaluation of how similar • not computationally rapid enough for searching big databases

  5. HMM Methods • currently very powerful for finding subtle similarities between a number of sequences • conceptually complex, not easy to implement (although some tools are coming on line) • cannot easily be sped up, so not workable for searches of large databases, but a very good method for doing a small number of alignments

  6. Quantitative Method • want to be able to assign a score to each potential alignment between two sequences, and find the best score • therefore, this is a maximization problem • we will go through development of current best quantitative alignment methods

  7. Simple Example • have two nucleotide sequences:
     ACGGTTGAATGC
     CGATTCATGC
     • by eye, you would probably get:
     ACGGTTGAATGC
     -CGATTCA-TGC

  8. Three Important Points • maximized the number of exact matches • to do this had to add gaps to one or both of the sequences • allowed some mismatches

  9. But what about alternatives:
     ACGGTTGAATGC
     -CGATTCA-TGC
     -CGATT-CATGC
     -CGATTC-ATGC
     • Or
     ACGG-TTG-AATGC
     -CG-ATT-CA-TGC

  10. Finding the “Best” Alignment • to call something the “best” you need to have some criterion • typically this involves a scoring scheme for assigning value to any alignment and then finding the alignment (out of all possible alignments) that has the maximum score based on that scoring scheme

  11. Scoring Scheme • implicit in that simple, intuitive alignment are three general concepts: • you get points for matching • gapping has no cost • you didn’t subtract points for mismatch • however, cost-free gapping and mismatching are usually not optimal

  12. Boundary Conditions • a very simple scheme gives you 1 point for each match, a negative score for a mismatch, and zero for a gap • under this scheme the optimal score can always be obtained by adding unlimited gaps so that every possible individual match is made, or at worst by pairing a gap with every mismatch
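The effect of cost-free gaps can be checked directly on the alignments from the earlier slides. The following is a minimal sketch, not part of the original presentation: the function name `score_alignment` and the mismatch value of -1 are assumptions chosen for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=0):
    """Score two already-aligned strings column by column.

    '-' marks a gap. The default values (match=1, mismatch=-1, gap=0)
    are illustrative assumptions, not values fixed by the slides.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # column containing a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# The "by eye" alignment versus the heavily gapped alternative.
by_eye = score_alignment("ACGGTTGAATGC", "-CGATTCA-TGC")
gappy = score_alignment("ACGG-TTG-AATGC", "-CG-ATT-CA-TGC")

# The same two alignments once gaps carry a cost (gap=-2 is hypothetical).
by_eye_penalized = score_alignment("ACGGTTGAATGC", "-CGATTCA-TGC", gap=-2)
gappy_penalized = score_alignment("ACGG-TTG-AATGC", "-CG-ATT-CA-TGC", gap=-2)
```

With free gaps the heavily gapped alignment wins because it trades every mismatch for a pair of gaps; once gaps are penalized the ranking reverses.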

  13. Gap Penalties • in biological terms, a gap in one sequence in an alignment is called an indel, short for insertion/deletion • reason: a priori you cannot tell whether extra sequence was inserted into one homologue, or sequence was removed from the other

  14. in any case, a gap is a hypothesis that either an insertion or a deletion occurred, and such events are relatively uncommon • therefore, a penalty should be imposed for a gap, usually rather high to reflect the idea that some number of mismatches are more likely than an indel

  15. Affine Gap Penalties • once a gap is hypothesized, the size of the gap is not well defined, so the penalty for a run of n consecutive gap positions should be less than n·x, where x is the penalty for opening a gap • this yields a 2-parameter formula for a gap penalty, an affine gap penalty

  16. Penalty = G + (n-1)L • G is the gap opening penalty • L is the gap extension penalty • n is the length of the gap • values for these parameters are empirically determined • depend on the scoring scheme that is in use • usually G>L
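The formula can be written as a small function. This is a sketch only; the function name and the numeric values used below (G = 10, L = 1) are hypothetical, since the slides state that real values are determined empirically.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty = G + (n - 1) * L for a gap of length n.

    gap_open is G, gap_extend is L. A gap of length 0 costs nothing.
    """
    if n < 0:
        raise ValueError("gap length must be non-negative")
    if n == 0:
        return 0
    return gap_open + (n - 1) * gap_extend


# Hypothetical parameters with G > L, as the slide suggests is usual.
G, L = 10, 1
single = affine_gap_penalty(1, G, L)   # one gap position: just G
run_of_five = affine_gap_penalty(5, G, L)
linear_equivalent = 5 * G              # what five independent gaps would cost
```

With these values a run of five gap positions costs far less than five separately opened gaps, which is the behaviour the affine scheme is meant to produce.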

  17. Scoring Matrix • when we were looking at the scanning window case, we assigned a number value to each amino acid and then summed them • for an alignment, we need to assign a scoring value to every position in an alignment

  18. therefore, need a score for every pairwise combination of amino acids • gaps in either sequence are scored using a gap penalty • so, how do we get a scoring matrix? • simplest possibility is to assign 1 for a perfect match and 0 or -1 for every mismatch • not bad for nucleic acid sequences • terrible for protein sequences
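The simplest scoring matrix described above can be sketched as a lookup table keyed by residue pairs. The names `identity_matrix` and `score_ungapped` are illustrative assumptions, as is the choice of -1 for mismatches (the slide allows 0 or -1).

```python
def identity_matrix(alphabet, match=1, mismatch=-1):
    """Build a score for every pairwise combination of symbols."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix score over each aligned position (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


dna_matrix = identity_matrix("ACGT")
example_score = score_ungapped("ACGT", "ACCT", dna_matrix)  # 3 matches, 1 mismatch
```

A protein matrix such as PAM or BLOSUM would fill the same kind of table with pair-specific values instead of a single mismatch score.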

  19. Problems with Simple Identity Matrix • the model implicit in this simple matrix grossly oversimplifies the process by which two sequences diverge. • different amino acids convert to other amino acids with different frequencies • this phenomenon is based on • the chemical nature of the aa sidechain • the genetic code

  20. Amino Acid Similarity • some side chains are chemically very similar, e.g. D and E, R and K, S and T, I and L and V • this similarity means that changes within these groups tend to have a smaller effect on protein structure and function than changes between the groups

  21. therefore, we can score an S:T pairing as more of a match than an S:I pairing • conversely, an R:E pairing or an R:L pairing should get a negative score, since they involve major changes in sidechain identity, which tend to be selected against • Note, this phenomenon is based on natural selection, not on susceptibility to mutational change

  22. Genetic Code • the genetic code is a degenerate 3-letter nucleotide code that translates into the amino acid sequence of proteins. • some nucleotide changes will not alter the amino acid encoded by the codon that contains the change (degeneracy)

  23. each amino acid is related to the other 19 amino acids by one, two or three nucleotide changes • therefore, you could score mismatches at the amino acid level based on the minimum number of nucleotide changes that would be required to interconvert the two residues • e.g. F:L = UUU:UUA, 1 change • e.g. W:M = UGG:AUG, 2 changes
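The minimum-change idea can be sketched by comparing every codon of one amino acid with every codon of the other. Only the codons for F, L, W and M from the standard genetic code are included here to keep the example short; the dictionary and function names are assumptions for illustration.

```python
# Partial standard genetic code (RNA codons) for four amino acids only.
CODONS = {
    "F": ["UUU", "UUC"],
    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "W": ["UGG"],
    "M": ["AUG"],
}


def codon_distance(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def min_nucleotide_changes(aa1, aa2):
    """Smallest number of nucleotide changes that converts any codon
    of aa1 into any codon of aa2."""
    return min(codon_distance(c1, c2)
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])


f_to_l = min_nucleotide_changes("F", "L")
w_to_m = min_nucleotide_changes("W", "M")
```

A full version would need all 61 sense codons; the result could then be turned into a mismatch score, for example by penalizing pairs more heavily as the number of required changes grows.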

  24. How Do We Generate a Good Scoring Matrix? • a priori approaches are so oversimplified as to be misleading • therefore, need to extract information from real data so the scoring matrix reflects the real process underlying the comparison • therefore, extract information from “undeniable alignments”

  25. 5’ Break

  26. Protein Scoring Matrices • two major sets: • PAM - Point Accepted Mutation matrix, based on differences between closely related proteins (Dayhoff et al. [1978] in Atlas of Protein Sequence and Structure) • BLOSUM - BLOcks SUbstitution Matrix, based on the BLOCKS database of local alignments with different similarities (Henikoff and Henikoff [1992] PNAS 89:10915-10919)

  27. PAM Matrices • based on alignments of closely related sequences, mainly of antibody proteins • if there is no selection pressure, then the substitution matrix could be directly derived from the amino acid frequency (called the background frequency) • but the observed substitution frequencies (target frequencies) differ

  28. that means that there are certain transitions that are more accepted as point mutations • constructed a matrix, PAM1, of the log of the ratio (target frequency/background frequency) for comparisons in which the overall sequence difference is <15%, corrected to reflect a 1% divergence • by multiplying the underlying PAM1 substitution-probability matrix by itself, and then taking the log of the ratio again, can generate the appropriate matrix for successively more divergent sequences
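The extrapolation step can be sketched on a toy example. Everything numeric here is made up for illustration: a two-letter alphabet, a 1% substitution probability and equal background frequencies. In this sketch the repeated multiplication is applied to the substitution-probability matrix, and the log of the ratio is taken afterwards.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


def log_odds(prob, background):
    """log(target / background), where prob[i][j] is the probability
    that residue i is replaced by residue j."""
    size = len(background)
    return [[math.log(prob[i][j] / background[j]) for j in range(size)]
            for i in range(size)]


# Hypothetical toy data: each letter changes to the other with probability 0.01.
TOY_PAM1 = [[0.99, 0.01], [0.01, 0.99]]
TOY_BACKGROUND = [0.5, 0.5]

toy_pam1_scores = log_odds(TOY_PAM1, TOY_BACKGROUND)
toy_pam250_scores = log_odds(mat_power(TOY_PAM1, 250), TOY_BACKGROUND)
```

In the toy data the reward for an identity shrinks toward zero as the power increases, which reflects the idea that at high divergence a match carries less information.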

  29. typically people use the PAM250 matrix, but others are available • appropriate scoring matrix is the one that reflects the amount of divergence between sequences, so the more divergent the proteins being aligned the larger the PAM number should be

  30. BLOSUM Matrices • unlike the PAM matrices, which are extrapolated from highly similar sequences by repeated multiplication of the PAM1 matrix to get matrices compatible with higher divergence, BLOSUM matrices are derived directly from alignments in the BLOCKS database of multiple alignments

  31. blocks are chosen to have different levels of divergence; more divergent blocks yield a matrix that is appropriate for aligning more divergent proteins • hence, BLOSUM is based on observed target frequencies at different levels of divergence, rather than extrapolation from very similar sequence alignments

  32. Finding the “Best” Alignment • we have a scoring algorithm, a scoring matrix and gap penalties • we operationally define the best alignment as that which yields the highest score when the scoring scheme is applied to it • How do we find the best score?

  33. Simple Method • construct all possible alignments • determine the score for each of them • select the one(s) with the highest score

  34. Problem • how many alignments can you make with two sequences, length m and n? • number is so large (of the order of 3mn ) as to make the problem computationally intractable for most cases of interest. • so, need an algorithm that will allow you to find the maximum score without evaluating every alignment
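How quickly the number of alignments grows can be checked by counting them exactly for short sequences. This is a sketch under one stated assumption: every distinct sequence of columns (pair, gap in the top sequence, gap in the bottom sequence) counts as a separate alignment. The function name is illustrative.

```python
def count_alignments(m, n):
    """Count global alignments of sequences of length m and n.

    Each alignment ends in one of three column types, so the count for
    (i, j) is the sum of the counts for (i-1, j-1), (i-1, j) and (i, j-1).
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]  # one way to align with an empty sequence
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # last column pairs two residues
                            + counts[i - 1][j]     # last column: residue over gap
                            + counts[i][j - 1])    # last column: gap over residue
    return counts[m][n]


small_cases = [count_alignments(k, k) for k in range(1, 6)]
slide_example = count_alignments(12, 10)  # lengths of the two example sequences
```

Even for the short example sequences the count is far too large to enumerate comfortably, and it keeps growing exponentially with sequence length.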

  35. Dot Plot • place the two sequences along the two axes of a rectangular table (graph) • simple algorithm • mark each cell that corresponds to two sequence elements matching with a dot • pattern should show diagonals where regions of sequence match
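The simple dot plot can be sketched in a few lines. The function names and the text rendering with `*` and `.` are assumptions made for illustration.

```python
def dot_plot(seq_x, seq_y):
    """Return a table with one row per element of seq_y and one column
    per element of seq_x; a cell is True where the two elements match."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(matrix):
    """Draw the table as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


plot = dot_plot("ACGGTTGAATGC", "CGATTCATGC")
picture = render(plot)
```

Printing `picture` shows the matching regions as diagonal runs of `*`, along with scattered background dots that arise because the nucleotide alphabet has only four letters.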

  36. Improved Dot Plot • instead of using single site comparisons, compare equal sized windows using a scoring matrix. • graph dots with an intensity that is a function of the score between the windows. • implemented in Peptool
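A windowed version can be sketched as follows. This is not Peptool's implementation; the function names and the simple identity scoring function are assumptions, and any scoring matrix could be supplied in place of `identity_score`.

```python
def identity_score(a, b):
    """Stand-in scoring function: 1 for identical symbols, 0 otherwise."""
    return 1 if a == b else 0


def windowed_dot_plot(seq_x, seq_y, window, score=identity_score):
    """Score every pair of equal-sized windows.

    Cell [i][j] holds the summed score of seq_y[i:i+window] against
    seq_x[j:j+window]; higher values would be drawn as darker dots.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [[sum(score(seq_x[j + k], seq_y[i + k]) for k in range(window))
             for j in range(cols)]
            for i in range(rows)]


intensities = windowed_dot_plot("ACGT", "ACGT", window=2)
```

Summing over a window suppresses isolated chance matches, so true diagonals stand out more clearly than in the single-site plot.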

  37. Dot Plot Between Proteins • if you use two different sequences on the dot plot, get a graphical indication of where they align and where they do not • for very similar sequences, get a diagonal with varying intensity along its length quantitatively indicating the sequence similarity along the alignment

  38. Dot Plot Comparison • when the proteins are more different, get diagonal line segments, indicating regions of similarity • gaps between lines indicate areas of low sequence similarity • offsets between the segments indicate different lengths of the low-similarity regions

  39. Connection • dot plot is a heuristic for visualizing the quantitative relationship between two aligned sequences • implicit in the dot matrix are all possible alignments • each global alignment can be represented by tracing a path through the matrix

  40. Dynamic Programming • a class of algorithms that can rapidly find an optimal solution to a problem if that problem can be broken down into a set of sub-problems that can also be optimized • does a good job of finding optimal paths through graphs, which is one way of looking at the alignment problem

  41. General idea: • If you have a prefix alignment of length i then there are only three possibilities for lengthening that alignment: • The next elements of each sequence are aligned with each other • The next element of the top sequence is aligned with a gap • The next element of the bottom sequence is aligned with a gap
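The three possibilities listed above lead directly to a table-filling procedure, commonly known as the Needleman-Wunsch algorithm. The following is a minimal sketch, assuming a simple linear gap penalty rather than the affine one discussed earlier; the function name and the default scores (match 1, mismatch -1, gap -2) are hypothetical.

```python
def global_align(top, bottom, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned top, aligned bottom) for a global alignment."""
    rows, cols = len(top) + 1, len(bottom) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap            # prefix of top aligned only to gaps
    for j in range(1, cols):
        score[0][j] = j * gap            # prefix of bottom aligned only to gaps

    def pair(i, j):
        return match if top[i - 1] == bottom[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align the two elements
                              score[i - 1][j] + gap,             # top element against a gap
                              score[i][j - 1] + gap)             # bottom element against a gap

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_top, out_bottom = [], []
    i, j = len(top), len(bottom)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_top.append(top[i - 1])
            out_bottom.append(bottom[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_top.append(top[i - 1])
            out_bottom.append("-")
            i -= 1
        else:
            out_top.append("-")
            out_bottom.append(bottom[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_bottom))


best_score, aligned_top, aligned_bottom = global_align("ACGGTTGAATGC", "CGATTCATGC")
```

The table has (m+1) by (n+1) cells and each cell needs only three comparisons, so the work grows with the product of the sequence lengths rather than with the number of possible alignments. When several alignments tie for the best score, the traceback returns just one of them.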
