
Material taken (1) from a Lecture by Gary Benson, Departments of Computer Science and Biology,

Searching for Similarities in Genetic and Proteomic Sequences. Dr. Jaume Bacardit, jaume.bacardit@nottingham.ac.uk.





Presentation Transcript


  1. Searching for Similarities in Genetic and Proteomic Sequences. Dr. Jaume Bacardit, jaume.bacardit@nottingham.ac.uk. Material taken (1) from a lecture by Gary Benson, Departments of Computer Science and Biology, Boston University, given at the 2003 ISSCB, further modified by Prof. Natalio Krasnogor (UoN), with examples from P.A. Pevzner’s “Computational Molecular Biology”, D.A. Krane & M.L. Raymer’s “Fundamental Concepts of Bioinformatics” and C. Gibas & P. Jambeck’s “Developing Bioinformatics Computer Skills”, and (2) from Arthur Lesk, “Introduction to Bioinformatics”, 2nd edition, Oxford University Press, 2005.

  2. Outline Similarity and Alignment • Define homology, similarity by descent and similarity by convergence • Common mutations and their mathematical models • Alignments • Scoring Alignments • Gap penalty functions • Computing the best scoring alignment – the Longest Common Subsequence (LCS) problem • Sequence Alignment • Multiple Sequence Alignment

  3. Similarity and Biomolecules Similarity is expected among biomolecules that are descended from a common ancestor. Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function, while important catalytic, regulatory or structural regions remain similar.

  4. Similarity and Evolution Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways. Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules.

  5. An alignment between two or more genetic or proteomic sequences represents an explicit hypothesis about their evolutionary histories. • Thus comparison of related gene/protein sequences has been instrumental in shedding light on the information content of these sequences and their biological functions. • Hence, comparing and aligning gene/protein sequences is a cornerstone of bioinformatics.

  6. Three common similarity problems • Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. • Start with a small set of sequences and identify similarities and differences among them. • In many sequences or very long sequences, detect commonly occurring patterns.

  7. What is Similarity? How can we measure it?

  8. Morphology Morphology is the form and structure of an organism. Should shared morphology mean similarity?

  9. Hands

  10. Aquatic Shape

  11. Shared morphology Shared morphology does not necessarily imply common ancestry. The animals with hands have all evolved from a common ancestor with a hand. The ichthyosaur, shark and porpoise each evolved sea life adaptations independently. The beauty of evolution: eyes and ears, for example, have been rediscovered independently in many species, such as octopuses, flies, bees and mammals.

  12. Homology When similarity is due to common ancestry, we call it homology.

  13. Modern molecular biology seeks to understand cellular processes through the action of DNA, RNA, and protein molecules. This will ultimately lead to a biochemical understanding of: • The pathogenesis of infectious diseases like AIDS, hepatitis and SARS. • The mutagenic properties of environmental toxins and how they lead to diseases like cancer. • The etiology of human genetic disease. • Strategies to prevent and treat diseases through drug and vaccine design, gene therapy, risk reduction, etc.

  14. How homology helps Given molecular sequences X and Y: X ~ Y AND INFO(Y) ==> INFO(X) (“ ~ ” means similar) X,Y could be gene sequences, protein sequences

  15. Are the Sequences Similar?

  16. Are the Sequences Similar? • How similar? • What parts are the most similar? Remember, the common ancestor of the two sequences may have existed millions of years ago.

  17. How can we tell if the two sequences are similar? Similarity judgements should be based on: • The types of changes or mutations that occur within sequences. • Characteristics of those different types of mutations. • The frequency of those mutations.

  18. Common mutations in DNA
     Substitution: A C G T T G A C → A C G A T G A C
     Deletion:     A C G T T G A C → A C G A C
     Insertion:    A C G T T G A C → A C G C A A T T G A C

  19. Common mutations
     Duplication: A C G T T G A C → A C G T T G A T T G A C
     Inversion (double stranded DNA shown):
     A C G T T G A C      A C T C A A C C
     T G C A A C T G  →   T G A G T T G G
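
The five mutation types on the two slides above can be reproduced with simple string operations. The following is a minimal sketch; the function names and the 0-based, end-exclusive positions are illustrative assumptions, not part of the lecture. The inversion is modelled as replacing a segment of the top strand with its reverse complement, which matches the double-stranded example.

```python
# Hypothetical helper names; positions are 0-based and end-exclusive.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitute(seq, pos, base):
    """Replace the residue at `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]

def delete(seq, start, end):
    """Remove the segment seq[start:end]."""
    return seq[:start] + seq[end:]

def insert(seq, pos, fragment):
    """Insert `fragment` before position `pos`."""
    return seq[:pos] + fragment + seq[pos:]

def duplicate(seq, start, end):
    """Copy the segment seq[start:end] immediately after itself."""
    return seq[:end] + seq[start:end] + seq[end:]

def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    flipped = "".join(COMPLEMENT[b] for b in reversed(seq[start:end]))
    return seq[:start] + flipped + seq[end:]

original = "ACGTTGAC"
print(substitute(original, 3, "A"))  # ACGATGAC
print(delete(original, 3, 6))        # ACGAC
print(insert(original, 3, "CAA"))    # ACGCAATTGAC
print(duplicate(original, 3, 7))     # ACGTTGATTGAC
print(invert(original, 2, 7))        # ACTCAACC
```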

  20. Frequency of mutations Substitution > Insertion, Deletion >> Duplication > Inversion

  21. Dotplots • Dotplot is a simple picture that gives an overview of pairwise sequence similarity • A dotplot is a matrix. The rows correspond to the residues of one sequence and the columns correspond to the residues of the other sequence • A cell in the matrix is filled if the residues in the row and column match, otherwise, it is left blank • Stretches of similar residues show up as diagonals in the upper left-lower right direction

  22. Dotplots • Dotplot of the two sequences aligned in the previous lecture [Figure: dotplot; labels mark a gap in the alignment and zones of well-aligned residues.]

  23. Dotplots • It is easy to identify long regions of matching residues, but not so easy to align distantly related sequences • To improve the visibility of the dotplot, two parameters are added • Window: length of the region of consecutive residues that is compared • Threshold: minimum number of matches required within the window to mark a cell
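
The dotplot construction described above, including the window and threshold parameters, might be sketched as follows. The function names and the text rendering with `*` and `.` are assumptions made for illustration; with the default `window=1, threshold=1` the code reduces to the plain dotplot in which a cell is filled when the two residues match.

```python
def dotplot(seq_rows, seq_cols, window=1, threshold=1):
    """Return a list of rows of booleans. Cell (i, j) is True when at least
    `threshold` of the `window` residue pairs starting at i and j match."""
    n_rows = len(seq_rows) - window + 1
    n_cols = len(seq_cols) - window + 1
    plot = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(seq_rows[i + k] == seq_cols[j + k] for k in range(window))
            row.append(matches >= threshold)
        plot.append(row)
    return plot

def render(plot):
    """Draw filled cells as '*' and blank cells as '.'."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)

# Identical sequences give an unbroken diagonal from upper left to lower right.
print(render(dotplot("ACGT", "ACGT")))
```

Raising the window and threshold removes isolated chance matches, so that only diagonals supported by several neighbouring matches remain visible.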

  24. Evolutionary history of sequences

  25. Alignments There are many ways to align two sequences. We just saw one way:
     T T A C G T A C A G A T T A
     T - - G G A A C A - - - T A
     Here is another:
     T T A C G T - A C A G A T T A
     T - - - G G A A C - - A T - A
     Which is better? Remember, we cannot choose based on the evolutionary history, because that is unknown.

  26. Alignments: Definitions • Gap: a break in the alignment, in either one of the sequences. For nucleotides, a consequence of an insertion or deletion mutation; for proteins, it’s more difficult to say. • Regions of matching residues: indicate parts of a sequence that are well conserved. • Mismatched residues: for nucleotides, a consequence of a substitution mutation; indicate less conserved regions.

  27. Finding the Best Alignment: Ranking Alignments by Score Score an alignment by • Partitioning it into columns • Assigning a weight to each column • Summing the column weights

  28. Distance Scoring Distance scoring: • Alignment gets a non-negative score. • Alignment of identical sequences scores zero, all others > zero. • Best alignment has smallest score. Typical scoring functions are: • d(a,a) = 0; identity • d(a,b) = d(b,a) > 0; a ≠ b; substitution • g = d(a, – ) > 0; indel (gap)

  29. Similarity Scoring Similarity scoring: • Alignment scores may be positive, zero, or negative. • More similar means larger positive score. • The best alignment has largest score. Typical scoring functions are: • s(a,b) is { > 0 if a and b are similar in one or more characteristics or are observed to substitute frequently for each other; ≤ 0 otherwise }; substitution • g = s(a, – ) < 0; indel (gap)

  30. Gap penalty functions • Single character gap penalty: g(a, – ) = c (c a constant or a value dependent on a) • Affine (linear) gap penalty: g(k) = α + βk (α is a gap opening penalty, β is a gap extension penalty) • Concave gap penalty: g(k) = α + β·m(k), where m(k) is a function like log(k) that grows more slowly as k increases.
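
To compare how the three gap penalty functions grow with the gap length k, here is a small sketch. The parameter values (c = 4, α = 4, β = 2) are arbitrary assumptions chosen for illustration, the penalties are written as positive costs, and the single-character scheme is assumed to charge the constant c independently for every gap character.

```python
import math

def single_character_gap(k, c=4):
    """Every gap character costs the same constant c, so a gap of length k costs c * k."""
    return c * k

def affine_gap(k, alpha=4, beta=2):
    """g(k) = alpha + beta * k: opening penalty plus a per-character extension penalty."""
    return alpha + beta * k

def concave_gap(k, alpha=4, beta=2):
    """g(k) = alpha + beta * m(k) with m(k) = log(k), which flattens out for long gaps."""
    return alpha + beta * math.log(k)

for k in (1, 2, 5, 10):
    print(k, single_character_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With these values a long gap is cheapest under the concave function, which reflects the idea that one long indel event is more plausible than many independent short ones.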

  31. Distance Scoring Alignment parameters: d(a, a) = 0, d(a, b) = +2, g = +4
     A  -  G  C  C  G  T  A  T
     A  C  G  A  -  -  T  -  T
     0  4  0  2  4  4  0  4  0   = 18

  32. Similarity Scoring Scoring parameters: s(a, a) = +5, s(a, b) = -3, g = -8
     A  -  G  C  C  G  T  A  T
     A  C  G  A  -  -  T  -  T
    +5 -8 +5 -3 -8 -8 +5 -8 +5   = -15

  33. Similarity scoring with affine gap Alignment parameters: s(a, a) = +5, s(a, b) = -3, g(k) = α + β(k-1), α = -4, β = -2, that is, -4 for opening a gap and -2(k-1) for extending it (k >= 2)
     A  -  G  C  C  G  T  A  T
     A  C  G  A  -  -  T  -  T
    +5 -4 +5 -3 -4 -2 +5 -4 +5   = 3
     The gap opening penalty is counted only once per gap: in the second column of the two-character gap, the opening has already been accounted for in the previous position.
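
The three worked examples above (distance, similarity, and similarity with an affine gap) all follow the same recipe: split the alignment into columns, weight each column, and sum. The sketch below reproduces the three totals; the function names are hypothetical, and gaps are written with a plain `-` character.

```python
def column_scores(top, bottom, match, mismatch, gap):
    """Weight each column: gap if either row has '-', otherwise match or mismatch."""
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(gap)
        else:
            scores.append(match if a == b else mismatch)
    return scores

def affine_column_scores(top, bottom, match, mismatch, gap_open, gap_extend):
    """As above, but the first column of a gap costs gap_open and each
    further column of the same gap costs gap_extend."""
    scores = []
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            scores.append(gap_extend if gap_row == previous_gap_row else gap_open)
            previous_gap_row = gap_row
        else:
            scores.append(match if a == b else mismatch)
            previous_gap_row = None
    return scores

top, bottom = "A-GCCGTAT", "ACGA--T-T"
print(sum(column_scores(top, bottom, match=0, mismatch=2, gap=4)))    # 18
print(sum(column_scores(top, bottom, match=5, mismatch=-3, gap=-8)))  # -15
print(sum(affine_column_scores(top, bottom, 5, -3, gap_open=-4, gap_extend=-2)))  # 3
```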

  34. Scoring Matrices • Given that not all types of indels and mutations are equally likely and • Given that not all of the changes are equally severe • We might want to penalize differently according to which nucleotides/amino acids are mismatched • Example: • Consider two protein sequences, one of which has an alanine in a given position. A substitution to another small, hydrophobic amino acid, e.g. valine, will not be as bad as a substitution to a bulky and charged residue like lysine. • Thus we might want to score an alignment of alanine-valine more favourably than alanine-lysine.

  35. The relative scores are captured in scoring matrices • For nucleotides these are quite simple: • BLAST uses: • +5 if the two aligned nucleotides are identical • -4 if they are not • Others: the identity matrix, and the transition/transversion matrix, in which transitions (purine-purine or pyrimidine-pyrimidine) are mildly penalized while transversions (purine-pyrimidine or pyrimidine-purine) are heavily penalized
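
The two nucleotide scoring schemes above can be written as small functions. This is only a sketch: the +5/-4 values come from the slide, while the transition and transversion values (-1 and -5) and the function names are assumptions picked to show "mildly" versus "heavily" penalized.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def identity_score(a, b, match=5, mismatch=-4):
    """Same score for every mismatch, whatever the two nucleotides are."""
    return match if a == b else mismatch

def transition_transversion_score(a, b, match=1, transition=-1, transversion=-5):
    """Transitions stay within purines or within pyrimidines; transversions cross classes."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion

print(identity_score("A", "G"), transition_transversion_score("A", "G"))  # -4 -1
print(identity_score("A", "T"), transition_transversion_score("A", "T"))  # -4 -5
```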

  36. Scoring Matrices for Proteins • Designing scoring matrices (SM) for protein sequences is more complicated. Two main approaches: • SM based on physico-chemical properties • SM based on observed substitution frequencies

  37. Physico-chemical similarity scores Examples: • Pairing two amino acids with an aromatic functional group should result in a good score, while pairing amino acids where one is non-polar and the other is charged should not. • SM have been devised based on hydrophobicity, charge, electronegativity and size. • The genetic code has also been used: a pair of amino acids is scored according to the minimum number of nucleotide substitutions necessary to convert a codon for one residue into a codon for the other.
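
The genetic-code criterion in the last bullet can be computed directly from the standard codon table. The sketch below builds the table from the usual compact 64-letter encoding (bases in the order T, C, A, G, with `*` for stop codons); the function names are hypothetical, and amino acids are given as one-letter codes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]

def min_nucleotide_substitutions(aa1, aa2):
    """Smallest number of base changes turning some codon of aa1 into some codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )

print(min_nucleotide_substitutions("A", "V"))  # alanine -> valine: 1
print(min_nucleotide_substitutions("A", "K"))  # alanine -> lysine: 2
```

Under this measure the alanine-valine pair from the earlier example is again closer than alanine-lysine.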

  38. Observed substitution frequency scores • Observe actual substitution rates in nature • If a substitution between amino acids i and j is observed frequently, then positions where these two are aligned are scored favourably. • Likewise, if i and j are seldom observed to be interchanged in nature, then they are penalized in any alignment. • Example: Asp, Glu and Ser are the most mutable amino acids, while Cys and Trp are the least mutable

  39. Metrics of similarity: PAM matrices • 1 PAM = 1 Percent Accepted Mutation • Two sequences 1 PAM apart have 99% identical residues • Collecting statistics from such pairs of sequences, and correcting for different amino acid abundances, produces the 1 PAM substitution matrix • PAM matrices for larger evolutionary distances are extrapolated from the 1 PAM matrix by repeated matrix multiplication • E.g. PAM250 corresponds to roughly 20% identical residues

  40. Metrics of similarity: BLOSUM matrices • Created later in time, and from larger volumes of proteins • Designed to perform better in distant relationships • BLOSUM = BLOcks SUbstitution Matrix • Computed from regions of closely related proteins that are alignable without gaps → related to local alignments, explained later in the lecture • A parameter specifies the maximum % of sequence identity in the alignments used to compute the matrix → avoids overweighting closely related sequences • BLOSUM62 is the standard substitution matrix in most alignment programs → maximum of 62% sequence identity

  41. Global vs Local alignment: Introduction • We know how to score good or bad alignments • How do we find the optimal one? • Two classes of alignment methods • Global alignment • Finds the best alignment of one entire sequence with another entire sequence • Local alignment • Finds the best alignment of one segment of a sequence against a segment of another sequence

  42. Global vs Local alignment: Trajectories • Before explaining the alignment method, let’s illustrate the concept of an alignment trajectory, which is similar to a dotplot • A diagonal arrow means an alignment between residues (match or not) • A horizontal arrow means a gap in the sequence indexing the columns • A vertical arrow means a gap in the sequence indexing the rows [Figure: example alignment trajectories; the labels read A L I G N G A P E D.]

  43. Global alignment • There is an exact method to find an optimal alignment • Based on a methodology called Dynamic Programming • A solution will be constructed incrementally • For each step in the alignment, we will choose the action (align residues, or introduce a gap in one of the sequences, that is, the direction of the arrow in the trajectory) that will have the best local score • The exact method will always find an optimal solution, but it is computationally costly • Approximate methods will probably find a good solution and are faster. Explained later in the lecture

  44. Global alignment • The exact global alignment method is known as the Needleman and Wunsch algorithm • To find the global alignment this method constructs a matrix of n+1 rows (numbered from 0 to n) and m+1 columns (numbered from 0 to m), where n and m are the sizes of the two sequences to align • A cell (i,j) in the matrix will contain the optimal score of aligning the first i residues of the first sequence with the first j residues of the second sequence • Therefore, cell (n,m) will contain the optimal score of the global alignment between the two sequences

  45. Global alignment • The matrix will be filled incrementally • To decide the value of cell (i,j) we will look at cells (i-1,j-1), (i,j-1) and (i-1,j) • These three cells are the three possible predecessors of the trajectory passing through (i,j), and each of them will already have a score • A movement from each of these three cells to (i,j) will have a score of its own • We will choose the movement that minimizes the sum of the previous score and its own score (we are using distance scoring, so smaller is better)

  46. Global alignment • We will illustrate how the matrix is filled with an example. • We have two sequences A=ggaatgg and B=atg • We use a scoring scheme where a match has a score of 0, a mismatch a score of 20 and an insertion or deletion a score of 25 • To start constructing the matrix, we first need to initialize the cells in row 0 and column 0 • The values in these cells indicate the score of having a gap at the beginning of either sequence

  47. Global alignment • We need to decide the score for cell (1,1) • A movement from (0,0) would have a score of 0 (previous score) + 20 (mismatch score) = 20 • A movement from (1,0) would have a score of 25 (previous score) + 25 (gap score) = 50 • A movement from (0,1) would have a score of 25 (previous score) + 25 (gap score) = 50 • Therefore, we choose a movement from (0,0) to (1,1) • We place an arrow pointing to the chosen predecessor • The values for the rest of the cells are decided in the same way

  48. Global alignment • After filling all cells we obtain the completed matrix. • We can see that we have cells with two arrows. It means that the same score was obtained with two different movements → two equally good trajectories • The score of the optimal alignment is 100 • If we follow the arrows from cell (7,3), we will obtain all four optimal alignments:
     ggaatgg    ggaatgg    ggaatgg    ggaatgg
     ---atg-    ---at-g    --a-t-g    --a-tg-
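
The worked example of the last few slides (A=ggaatgg, B=atg, match 0, mismatch 20, indel 25) can be checked with a short implementation of the matrix fill and the traceback. This is a sketch rather than the lecture's own code: function names are hypothetical, the first sequence indexes the rows, and instead of storing arrows the traceback re-tests which predecessors reproduce each cell's score, which finds every tied trajectory.

```python
def needleman_wunsch(a, b, match=0, mismatch=20, indel=25):
    """Fill the (len(a)+1) x (len(b)+1) matrix of minimal distance scores."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # column 0: leading gap in b
        score[i][0] = i * indel
    for j in range(1, m + 1):          # row 0: leading gap in a
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            from_above = score[i - 1][j] + indel
            from_left = score[i][j - 1] + indel
            score[i][j] = min(diagonal, from_above, from_left)
    return score

def optimal_alignments(a, b, match=0, mismatch=20, indel=25):
    """Return the optimal score and every alignment that achieves it."""
    score = needleman_wunsch(a, b, match, mismatch, indel)
    results = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                walk(i - 1, j - 1, a[i - 1] + top, b[j - 1] + bottom)
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            walk(i - 1, j, a[i - 1] + top, "-" + bottom)
        if j > 0 and score[i][j] == score[i][j - 1] + indel:
            walk(i, j - 1, "-" + top, b[j - 1] + bottom)

    walk(len(a), len(b), "", "")
    return score[len(a)][len(b)], results

best, alignments = optimal_alignments("ggaatgg", "atg")
print(best)  # 100
for top, bottom in alignments:
    print(top, bottom)
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is why the exact method becomes costly for long sequences or database searches.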

  49. Local Sequence Alignment • Global sequence alignment (GSA) seeks similarities among entire strings • Useful when similarity is expected to extend over the entire strings, e.g., protein sequences of the same protein family • Such protein sequences are well conserved even among different species, with almost the same length in, for example, humans and fruit flies. • There are other applications where the alignment score of some substrings of the query pair can be better than the alignment score of the whole query pair. • Homeobox genes appear in many species with large variations but with a highly conserved region called the homeodomain. • How do we find the common region while ignoring the variable regions?
