A Brief Introduction to Biological Sequence Alignment



  1. A Brief Introduction to Biological Sequence Alignment Sun Kim CSE SNU For Bio Data Mining 4541.776.002 Sep 2011 Bio & Health Informatics Lab, SNU

  2. Aligning a pair of sequences
     • Problem: given a pair of sequences, find the best alignment among all possible alignments.
     • Goal: computing the best alignment requires
       • the type of alignment
       • a scoring scheme
         • a scoring matrix
         • a gap penalty scheme
     • Two types of alignment problems
       • Local sequence alignment
       • Global sequence alignment

  3. Glutamate-ammonia ligase related sequences
     Query sequence 1
     >A8XYH6 A8XYH6_CAEBR CBR-GLN-2 protein [Caenorhabditis briggsae]
     MTHLNFETRMPLGQAVIDQFLGLRPHPTKIQATYVWIDGTGENLRSKTRTFDRLPKKIED
     YPIWNYDGSSTGQAKGRDSDRYLRPVAAYPDPFLGGANKLVMCDTLDHEMQPTATNHRQA
     CAEIMNEIRDTRPWFGMEQEYLIVDRDEHPLGWPKHGFPAPQGKYYCSVGADRAFGREVV
     ETHYRACLHAGLNIFGTNAEVTPGQWEFQIGTCEGIDMGDQLWMSRYILHRVAEQFGVCV
     SLDPKPKVTMGDWNGAGCHTNFSTAEMRAPGGIAAIEAAMEGLKRTHLEAMKVYDPHGGE
     DNLRRLTGRHETSSADKFSWGVANRGCSIRIPRQVAAERKGYLEDRRPSSNCDPYQVTAM
     IAQSILL
     Query sequence 2
     >O02225 O02225_CAEEL Protein C28D4.3, confirmed by transcript evidence [Caenorhabditis elegans]
     MSHLNYETRLPLGQATIDHFMGLPAHPTKCQATYVWIDGTGEHLRAKTRTINTKPQYLSE
     YPIWNYDGSSTGQADGLNSDRYLRPVAVFPDPFLGGLNVLVMCDTLDHEMKPTATNHRQM
     CAELMKKVSDQQPWFGMEQEYLIVDRDEHPLGWPKHGYPAPQGKYYCGIGADRAFGREVV
     ETHYRACLHAGITIFGSNAEVTPGQWEFQIGTCLGIEMGDQLWMARYILHRVAEQFGVCV
     SLDPKPRVTMGDWNGAGCHTNFSTIDMRRPDGLETIIAAMEGLKKTHSEAMKVYDPNGGH
     DNLRRLTGRHETSQADQFSWGIANRACSVRIPRQVADEGRGYLEDRRPSSNCDPYLVTAM
     IVKSVLIN

  4. A Pairwise Alignment of The Two Sequences.

  5. Scoring matrix: BLOSUM62

  6. Compute a Score for a Pairwise Alignment of the Two Sequences
     • Add up the scoring-matrix entries for each aligned pair of residues: S(M,M) + S(T,S) + S(H,H) + …
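
The summation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the dictionary holds only a hand-copied excerpt of BLOSUM62 covering the first six aligned residues of the two query sequences (MTHLNF and MSHLNY), and the function names and the linear gap penalty of -4 are assumptions made for the example.

```python
# Hand-copied excerpt of BLOSUM62: only the pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("M", "M"): 5, ("S", "T"): 1, ("H", "H"): 8,
    ("L", "L"): 4, ("N", "N"): 6, ("F", "Y"): 3,
}


def pair_score(a, b, matrix):
    """Look up S(a, b); the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(row1, row2, matrix, gap_penalty=-4):
    """Sum the column scores of a given alignment; '-' marks a gap.

    gap_penalty is a hypothetical per-residue (linear) penalty.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += pair_score(a, b, matrix)
    return total


# S(M,M) + S(T,S) + S(H,H) + S(L,L) + S(N,N) + S(F,Y) = 5 + 1 + 8 + 4 + 6 + 3
print(alignment_score("MTHLNF", "MSHLNY", BLOSUM62_EXCERPT))  # 27
```

A full implementation would load the complete 20 x 20 matrix rather than an excerpt.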

  7. Gap Penalty and Scoring Matrix
     • Gap penalty
       • http://en.wikipedia.org/wiki/Gap_penalty
       • http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformaticsHM/slides/scoring_matrices.pdf
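
The two most common gap penalty schemes, linear and affine, can be compared with a short sketch. The slide itself only links to references, so everything here is an assumption for illustration: the function names, the parameter values, and the convention that a gap of length k costs `gap_open + k * gap_extend` (other texts charge the opening cost in place of the first extension).

```python
def linear_gap_penalty(length, per_residue=4):
    """Linear scheme: every gapped position costs the same amount."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Affine scheme: a one-time opening cost plus a cost per gapped position.

    Assumed convention: cost = gap_open + length * gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# Under these assumed parameters, one long gap is cheaper than many short ones
# in the affine scheme, while the linear scheme cannot tell them apart.
for k in (1, 2, 5, 10):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

The affine form reflects the idea that a single insertion or deletion event can span several residues, so extending an existing gap should cost less than opening a new one.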

  8. Computing the Best Alignment
     • Until now, we have assumed that an alignment is “given” and computed its score.
     • The pairwise sequence alignment problem is to compute “the best alignment” among all possible alignments.
       • Alignment 1 → score 1
       • Alignment 2 → score 2
       • …
       • Then select the Alignment k whose score is the best among all.
     • However, there are too many alignments to consider one by one.
     • Fortunately, dynamic programming finds the best alignment in quadratic time and space.
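
To get a feel for "too many alignments", the sketch below counts the global alignments of two sequences of lengths m and n. It assumes the usual column model (each column is an aligned pair, a gap in the first sequence, or a gap in the second) and counts alignments that differ only in the order of adjacent gaps as distinct; the function name is hypothetical. The count itself is computed with a quadratic dynamic-programming table, the same idea that makes the search tractable.

```python
def count_alignments(m, n):
    """Number of global alignments of sequences of lengths m and n.

    The last column is an aligned pair, a gap in sequence 1, or a gap in
    sequence 2, which gives
    count(m, n) = count(m-1, n-1) + count(m, n-1) + count(m-1, n).
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]
                            + counts[i][j - 1]
                            + counts[i - 1][j])
    return counts[m][n]


for length in (1, 2, 3, 10, 50, 100):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, so enumerating and scoring every alignment is not feasible even for short proteins.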

  9. Levenshtein distance (Edit distance)
     • http://en.wikipedia.org/wiki/Levenshtein_distance
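
A minimal sketch of the standard dynamic-programming computation of the Levenshtein distance, with unit cost for insertion, deletion, and substitution. The function name is an assumption; only two rows of the table are kept, which is enough when only the distance (not the edit script) is needed.

```python
def levenshtein(s, t):
    """Edit distance between s and t with unit-cost edits."""
    # prev[j] = distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Sequence alignment uses the same table layout, but maximizes a similarity score instead of minimizing an edit count.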

  10. Global Alignment Algorithm
      • Needleman-Wunsch algorithm
      • http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm
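
The global alignment recurrence can be sketched as follows. This is a simplified illustration under stated assumptions: a match/mismatch score stands in for a full scoring matrix, the gap penalty is linear, and the function name and default parameters are hypothetical. When several alignments share the best score, the traceback returns just one of them.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for a global alignment."""
    m, n = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,      # gap in t
                          F[i][j - 1] + gap)      # gap in s
    # Trace back from the bottom-right corner to recover one best alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                a.append(s[i - 1]); b.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("GCATGCG", "GATTACA")
print(score)
print(top)
print(bottom)
```

Both the table fill and the storage are proportional to the product of the two sequence lengths, which is the quadratic time and space cost of the dynamic-programming approach.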

  11. Local Alignment Algorithm
      • Smith–Waterman algorithm
      • http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
      • http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf
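
Local alignment differs from global alignment in two ways: a cell may reset to zero (start a fresh alignment), and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below shows this under simplifying assumptions: match/mismatch scores instead of a scoring matrix, a linear gap penalty, and hypothetical names and default values.

```python
def smith_waterman(s, t, match=3, mismatch=-3, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    m, n = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new alignment
                          H[i - 1][j - 1] + sub,  # extend with a pair
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = smith_waterman("TGTTACGG", "GGTTGACTA")
print(score)
print(top)
print(bottom)
```

Because negative running scores are clipped to zero, poorly matching flanks are excluded and only the best-matching region of each sequence is reported.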

  12. BLAST
      • http://en.wikipedia.org/wiki/BLAST

  13. FASTA
      • http://en.wikipedia.org/wiki/FASTA

  14. Statistical Evaluation of Search Results
      • Although the alignment algorithms look for the ‘optimal’ alignment (the best in terms of a scoring scheme), there is no guarantee that this human-defined optimum is biologically meaningful, even though the scoring scheme incorporates ‘domain knowledge’.
      • Thus the final step is to assess how likely it is that such a result would arise by chance.
      • The definition of the random model is very important; in many cases, how to define random models (negative models) is itself an important research topic.
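
One simple way to ask "how likely is this score by chance" is a shuffling test: score the real pair, then score the query against many shuffled copies of the other sequence and see how often the shuffled score is at least as high. The sketch below is one possible random model chosen for illustration, not the method used by any particular search tool; shuffling preserves residue composition but destroys order. The function names, the match/mismatch/gap values, and the number of shuffles are all assumptions.

```python
import random


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (score only), one table row at a time."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            curr.append(max(0, prev[j - 1] + sub, prev[j] + gap,
                            curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def empirical_p_value(s, t, n_shuffles=200, seed=0):
    """Fraction of shuffled versions of t that score at least as well as t.

    The +1 in numerator and denominator keeps the estimate above zero.
    """
    rng = random.Random(seed)
    observed = local_score(s, t)
    residues = list(t)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if local_score(s, "".join(residues)) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_shuffles + 1)


query = "MTHLNFETRMPLGQAVIDQFLGLRPHPTKIQATYVWIDGT"
target = "MSHLNYETRLPLGQATIDHFMGLPAHPTKCQATYVWIDGT"
print(empirical_p_value(query, target))
```

A small value suggests the observed score is unlikely under this particular random model; a different model (for example, one preserving dipeptide frequencies) could give a different answer, which is why the choice of model matters.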

  15. Multiple Sequence Alignment
      • Aligning multiple sequences is important for many applications in bioinformatics.
      • Computing the optimal multiple sequence alignment is still an open problem:
        • Defining the optimality criteria (scoring scheme? gap penalty score?).
        • Computational complexity.

  16. Local vs. Global Multiple Sequence Alignment
      • As with pairwise sequence alignment, there are two types of alignment problems, local and global.
      • Since there are many sequences, another factor needs to be considered:
        • Should the alignment cover the whole input set or only a subset of the input sequences?

  17. Scoring Scheme for the Multiple Sequence Alignment
      • Sum of pairs.
        • Used because any scoring matrix, e.g., BLOSUM62, gives a score for only a pair of amino acid or nucleotide characters.
      • Information theoretic scoring scheme.
        • A nice way to consider multiple characters together, but it is hard to utilize the domain knowledge (well-established scoring matrices, e.g., BLOSUM62).
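
The two scoring ideas can be contrasted on a single alignment column. In this sketch, assumed for illustration only, sum of pairs adds a pairwise score over every pair of characters in the column (using match/mismatch values in place of a real matrix such as BLOSUM62, and a hypothetical gap-residue penalty), while the information-theoretic score is the Shannon entropy of the column, which here treats a gap as just another symbol.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sum_of_pairs(column, match=1, mismatch=-1, gap=-1):
    """Sum a pairwise score over all pairs of characters in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue              # gap-gap pairs are ignored
        elif a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def column_entropy(column):
    """Shannon entropy (bits) of the character distribution in a column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


for col in ("AAAA", "AAAC", "ACGT", "AA-C"):
    print(col, sum_of_pairs(col), round(column_entropy(col), 3))
```

A perfectly conserved column has the highest sum-of-pairs score and zero entropy; the entropy looks at all characters at once but has no place to plug in substitution scores, which is the trade-off the slide points out.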

  18. Global Multiple Sequence Alignment
      • Progressive alignment.
      • Pattern (k-mer)-based strategy.
      • Computing the optimal alignment.

  19. Local Multiple Sequence Alignment
      • This is also known as the motif discovery problem.
      • Many machine learning techniques are used: Gibbs sampling, Expectation-Maximization, information theory.
      • It will be covered in a separate lecture.

  20. List of Multiple Sequence Alignment Methods
      • http://en.wikipedia.org/wiki/Multiple_sequence_alignment
        1 Dynamic programming and computational complexity
        2 Progressive alignment construction
        3 Iterative methods
        4 Hidden Markov models
        5 Genetic algorithms and simulated annealing
        6 Motif finding
        7 Visualization and editing tools

  21. ClustalW
      • The most widely used “progressive alignment” algorithm.
      • It starts by computing alignments of all possible pairs of input sequences.
      • It builds a guide tree from the pairwise distances (neighbor-joining in ClustalW; UPGMA in the original Clustal and as an option in ClustalW 2).
      • Following the guide tree, it constructs the multiple sequence alignment in a “greedy” fashion.

  22. MUSCLE
      • MUSCLE (multiple sequence comparison by log-expectation), Nucleic Acids Research, 2004, Vol. 32, No. 5
      • A very nice iterative, progressive alignment algorithm that uses k-mers.
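
The appeal of k-mers is that two sequences can be compared without aligning them first. The sketch below uses a deliberately simplified measure, the Jaccard distance between the sets of k-mers of two sequences; it is not MUSCLE's actual k-mer distance, and the function names and the choice of k = 3 are assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(s, t, k=3):
    """Jaccard distance between k-mer sets: 0 = same k-mers, 1 = none shared."""
    a, b = kmers(s, k), kmers(t, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


seq1 = "MTHLNFETRMPLGQAVIDQFLGLRPHPTKIQATYVWIDGT"
seq2 = "MSHLNYETRLPLGQATIDHFMGLPAHPTKCQATYVWIDGT"
print(kmer_distance(seq1, seq2))
```

Distances of this kind are cheap to compute for all pairs of input sequences, which makes them a practical starting point for building a guide tree before any alignment has been computed.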
