A Brief Introduction to Biological Sequence Alignment

Sun Kim, CSE, SNU
For Bio Data Mining 4541.776.002, Sep 2011
Bio & Health Informatics Lab, SNU

Aligning a pair of sequences
• Problem: given a pair of sequences, find the best alignment among all possible alignments.
• Computing the best alignment requires:
  • the type of alignment
  • a scoring scheme
    • a scoring matrix
    • a gap penalty scheme
• Two types of alignment problems:
  • local sequence alignment
  • global sequence alignment

Glutamate-ammonia ligase related sequences
Query sequence 1
>A8XYH6 A8XYH6_CAEBR CBR-GLN-2 protein [Caenorhabditis briggsae]
MTHLNFETRMPLGQAVIDQFLGLRPHPTKIQATYVWIDGTGENLRSKTRTFDRLPKKIED
YPIWNYDGSSTGQAKGRDSDRYLRPVAAYPDPFLGGANKLVMCDTLDHEMQPTATNHRQA
CAEIMNEIRDTRPWFGMEQEYLIVDRDEHPLGWPKHGFPAPQGKYYCSVGADRAFGREVV
ETHYRACLHAGLNIFGTNAEVTPGQWEFQIGTCEGIDMGDQLWMSRYILHRVAEQFGVCV
SLDPKPKVTMGDWNGAGCHTNFSTAEMRAPGGIAAIEAAMEGLKRTHLEAMKVYDPHGGE
DNLRRLTGRHETSSADKFSWGVANRGCSIRIPRQVAAERKGYLEDRRPSSNCDPYQVTAM
IAQSILL
Query sequence 2
>O02225 O02225_CAEEL Protein C28D4.3, confirmed by transcript evidence [Caenorhabditis elegans]
MSHLNYETRLPLGQATIDHFMGLPAHPTKCQATYVWIDGTGEHLRAKTRTINTKPQYLSE
YPIWNYDGSSTGQADGLNSDRYLRPVAVFPDPFLGGLNVLVMCDTLDHEMKPTATNHRQM
CAELMKKVSDQQPWFGMEQEYLIVDRDEHPLGWPKHGYPAPQGKYYCGIGADRAFGREVV
ETHYRACLHAGITIFGSNAEVTPGQWEFQIGTCLGIEMGDQLWMARYILHRVAEQFGVCV
SLDPKPRVTMGDWNGAGCHTNFSTIDMRRPDGLETIIAAMEGLKKTHSEAMKVYDPNGGH
DNLRRLTGRHETSQADQFSWGIANRACSVRIPRQVADEGRGYLEDRRPSSNCDPYLVTAM
IVKSVLIN

A Pairwise Alignment of the Two Sequences

Scoring matrix: BLOSUM62

Compute a Score for a Pairwise Alignment of the Two Sequences
• Add up the scoring-matrix entries for each aligned pair of residues: S(M,M) + S(T,S) + S(H,H) + …
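The column-by-column sum can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the dictionary holds only the handful of BLOSUM62 entries needed for the first five residues of the two query sequences, and the names `BLOSUM62_SUBSET`, `pair_score`, `alignment_score` and the gap penalty of -4 are assumptions made for the example.

```python
# A few BLOSUM62 entries, just enough for this example. A real program
# would load the full 20x20 matrix instead.
BLOSUM62_SUBSET = {
    ("M", "M"): 5,
    ("S", "T"): 1,
    ("H", "H"): 8,
    ("L", "L"): 4,
    ("N", "N"): 6,
}


def pair_score(x, y, matrix):
    """Look up S(x, y); the matrix is symmetric, so try both orders."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def alignment_score(top, bottom, matrix, gap=-4):
    """Score a given alignment column by column.

    `gap` is a hypothetical per-column gap penalty (an assumption).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        else:
            total += pair_score(x, y, matrix)
    return total


# First five columns of the two query sequences: 5 + 1 + 8 + 4 + 6
print(alignment_score("MTHLN", "MSHLN", BLOSUM62_SUBSET))  # 24
```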

Gap Penalty and Scoring Matrix
• Gap penalty: http://en.wikipedia.org/wiki/Gap_penalty
• Scoring matrices: http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformaticsHM/slides/scoring_matrices.pdf
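Two common gap penalty schemes, linear and affine, can be written as small functions. The sketch below is illustrative only: the function names and the default penalty values are assumptions, and conventions for the affine scheme differ between tools (here the opening penalty already covers the first gap position).

```python
def linear_gap_penalty(length, per_gap=2):
    """Linear scheme: every gap position costs the same amount."""
    return per_gap * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine scheme: opening a gap is expensive, extending it is cheap.

    Assumed convention: gap_open pays for the first position and each
    further position costs gap_extend.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for k in (1, 3, 10):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

With these example values, one long gap is cheaper than several short ones under the affine scheme, which is the usual motivation for using it.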

Computing the Best Alignment
• So far we have assumed that an alignment is “given” and have only computed its score.
• The pairwise sequence alignment problem is to compute “the best alignment” among all possible alignments.
  • Alignment 1 → score 1
  • Alignment 2 → score 2
  • …
  • Then select Alignment k, whose score is the best among all.
• However, there are too many alignments to enumerate.
• Fortunately, dynamic programming finds the best alignment in quadratic time and space.
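To see why enumeration is hopeless, one can count the alignments. The sketch below assumes that every alignment is a sequence of columns of three kinds (residue over residue, residue over gap, gap over residue) and that alignments differing only in the order of adjacent gaps are counted separately; under that assumption the count follows the Delannoy recurrence. The function name `count_alignments` is made up for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of global alignments of sequences of length m and n."""
    if m == 0 or n == 0:
        return 1  # only one way: gap out everything that is left
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two residues
        + count_alignments(m - 1, n)    # last column: residue over gap
        + count_alignments(m, n - 1)    # last column: gap over residue
    )


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, while the dynamic programming table for the same problem has only (m + 1) × (n + 1) cells.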

Levenshtein distance (edit distance)
• http://en.wikipedia.org/wiki/Levenshtein_distance
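A minimal sketch of the standard dynamic programming recurrence for the Levenshtein distance, with unit cost for insertion, deletion, and substitution. The function name is an assumption, and only two rows of the table are kept to save memory.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```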

Global Alignment Algorithm
• Needleman-Wunsch algorithm
• http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm
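A compact sketch of Needleman-Wunsch global alignment follows. To stay short it uses a simple match/mismatch score and a linear gap penalty rather than BLOSUM62 and an affine penalty; those scoring values, the function name, and the tie-breaking order in the traceback are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    m, n = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # a[i-1] against a gap
                          F[i][j - 1] + gap)    # b[j-1] against a gap

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_a)
print(row_b)
```

Filling the table takes time and space proportional to the product of the two sequence lengths, which is the quadratic cost mentioned earlier.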

Local Alignment Algorithm
• Smith–Waterman algorithm
• http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
• http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf
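The local variant differs from the global one in two ways: no cell is allowed to drop below zero, and the answer is the best cell anywhere in the table rather than the bottom-right corner. The sketch below computes only the best local score, again with an assumed match/mismatch score and linear gap penalty; the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substrings of a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh alignment here
                          H[i - 1][j - 1] + s,  # extend diagonally
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best


# The shared "ACG" is found even though the flanking regions differ.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))  # 3
```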

BLAST
• http://en.wikipedia.org/wiki/BLAST

FASTA
• http://en.wikipedia.org/wiki/FASTA

Statistical Evaluation of Search Results
• The alignment algorithms look for the “optimal” alignment (the best under a scoring scheme), but there is no guarantee that this human-defined optimum is biologically meaningful, even though the scoring scheme incorporates domain knowledge.
• Thus the final step is to assess how likely it is that a score this good arises by chance.
• The definition of the random model is very important; in many cases, how to define random models (negative models) is itself an important research topic.
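One simple random model is to shuffle one of the sequences, which keeps its composition but destroys its order, and to ask how often a shuffled sequence scores at least as well as the real one. The sketch below is only an illustration of that idea: the shuffling model, the simple scoring function, the number of trials, and all function names are assumptions, and database search tools such as BLAST use analytical score statistics instead.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman style, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def empirical_p_value(a, b, trials=200, seed=0):
    """Fraction of shuffled versions of b scoring >= the observed score."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


query = "MTHLNFETRMPLGQAVIDQF"  # first 20 residues of query sequence 1
print(empirical_p_value(query, query))
```

A small value says the observed score is unlikely under this particular random model; a different random model can give a different answer, which is why the choice of model matters.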

Multiple Sequence Alignment
• Aligning multiple sequences is important for many applications in bioinformatics.
• Computing the optimal multiple sequence alignment is still an open problem, for two reasons:
  • Defining the optimality criteria (which scoring scheme? which gap penalty?).
  • Computational complexity.

Local vs. Global Multiple Sequence Alignment
• As with pairwise sequence alignment, there are two types of alignment problems, local and global.
• Because there are many sequences, another factor needs to be considered: should the alignment cover the whole input set or only a subset of the sequences?

Scoring Scheme for the Multiple Sequence Alignment
• Sum of pairs.
  • A scoring matrix such as BLOSUM62 gives a score only for a pair of amino acid or nucleotide characters, so each column is scored by summing over all pairs of characters in it.
• Information-theoretic scoring scheme.
  • A nice way to consider multiple characters together, but it is hard to use domain knowledge (a well-established scoring matrix, e.g., BLOSUM62).
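Both scoring ideas can be sketched on the columns of a small alignment. In the example below the pair score is a simple match/mismatch/gap function standing in for a real matrix such as BLOSUM62, a gap paired with a gap scores 0, and Shannon entropy of a column is used as one possible information-theoretic measure; all of these choices and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Stand-in for a substitution matrix lookup (assumed values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows):
    """Sum the pair score over every pair of rows in every column."""
    return sum(
        pair_score(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )


def column_entropy(column):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


alignment = ["ACG", "ACG", "ATG"]
print(sum_of_pairs(alignment))  # 3 + (-1) + 3 = 5
print([column_entropy(col) for col in zip(*alignment)])
```

The sum-of-pairs score reuses a pairwise matrix directly, while the entropy looks at the whole column at once but ignores which residues are biochemically similar.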

Global Multiple Sequence Alignment
• Progressive alignment.
• Pattern (k-mer)-based strategy.
• Computing the optimal alignment.

Local Multiple Sequence Alignment
• This is also known as the motif discovery problem.
• Many machine learning techniques are used: Gibbs sampling, Expectation-Maximization, information theory.
• It will be covered in a separate lecture.

List of Multiple Sequence Alignment Methods
• http://en.wikipedia.org/wiki/Multiple_sequence_alignment
1 Dynamic programming and computational complexity
2 Progressive alignment construction
3 Iterative methods
4 Hidden Markov models
5 Genetic algorithms and simulated annealing
6 Motif finding
7 Visualization and editing tools

ClustalW
• The most widely used “progressive alignment” algorithm.
• It starts by computing alignments of all possible pairs of input sequences.
• It builds a guide tree from the pairwise distances. ClustalW uses neighbor-joining by default; the original Clustal used UPGMA.
• Following the guide tree, it constructs the multiple sequence alignment in a “greedy” fashion.
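The guide-tree step can be illustrated with UPGMA, the simpler of the tree-building methods mentioned for Clustal. The sketch below is not ClustalW's implementation: the function name, the input format (a dictionary of pairwise distances keyed by name pairs), and the nested-tuple tree are assumptions chosen to keep the example short.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples by average-linkage clustering.

    names: list of sequence names
    dist:  dict mapping (name1, name2) -> distance, one entry per pair
    """
    sizes = {name: 1 for name in names}  # cluster label -> member count
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        labels = list(sizes)
        pairs = [(x, y) for i, x in enumerate(labels) for y in labels[i + 1:]]
        # Merge the two closest clusters.
        x, y = min(pairs, key=lambda p: d[frozenset(p)])
        merged = (x, y)
        nx, ny = sizes.pop(x), sizes.pop(y)
        # Distance to the new cluster is the size-weighted average.
        for z in sizes:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        sizes[merged] = nx + ny
    return next(iter(sizes))


distances = {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], distances))
```

A progressive aligner would then walk this tree from the leaves up, aligning A with B, C with D, and finally the two resulting alignments with each other.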

MUSCLE
• MUSCLE (multiple sequence comparison by log-expectation), Nucleic Acids Research, 2004, Vol. 32, No. 5
• A very nice, iterative progressive alignment algorithm using k-mers.
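The appeal of k-mers is that two sequences can be compared without aligning them first. The sketch below shows one simple k-mer distance based on the fraction of shared k-mers; it is an illustration of the idea only, not MUSCLE's exact formula, and the function names and the default k = 3 are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 minus the fraction of k-mers shared by the two sequences."""
    shortest = min(len(a), len(b)) - k + 1  # k-mers in the shorter sequence
    if shortest <= 0:
        raise ValueError("both sequences must be at least k long")
    # Counter & Counter keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest


print(kmer_distance("ACGTACGT", "ACGTACGT"))  # 0.0
print(kmer_distance("AAAA", "CCCC"))          # 1.0
```

Distances of this kind are cheap to compute for all pairs, so they can feed a guide tree before any alignment has been done.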
