Pairwise Sequence Alignment

Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 10th, 2013 BMI/CS 576

Key concepts in today’s class • What is pairwise sequence alignment? • How to handle gaps? • Gap penalties • What are the issues in sequence alignment? • What are the different types of alignment problems • Global alignment • Local alignment • Dynamic programming solution to finding the best alignment • Given a scoring function, find the best alignment of two sequences • Global • Local

What is sequence alignment The task of locating equivalent regions of two or more sequences to maximize their overall similarity

Why sequence alignment? • Identifying similarity between two sequences • Sequence specifies the function of a protein • Similarity in sequence can imply similarity in function. • Assign function to uncharacterized sequences based on characterized sequences • Sequence from different species can be compared to estimate the evolutionary relationships between species • We will come back to this in Phylogenetic trees.

Alignment example T H I S S E Q U E N C E T H A T S E Q U E N C E

How to compare these two sequences? T H I S S E Q U E N C E T H A T I S A S E Q U E N C E

Need to incorporate gaps _ _ _T H I S S E Q U E N C E T H I S _ _ _ S E Q U E N C E T H A T I S A S E Q U E N C E T H A T I S A S E Q U E N C E Alignment 1: 3 gaps, 8 matches Alignment 2: 3 gaps, 9 matches

Issues in sequence alignment • What type of alignment? • Align the entire sequence or part of it? • How to score an alignment? • the sequences we’re comparing typically differ in length • some amino acid pairs are more substitutable than others • How to find the alignment? • Algorithms for alignment • How to tell if the alignment is biologically meaningful? • Computing significance of an alignment

Pairwise alignment: task definition Given • a pair of sequences (DNA or protein) • a scoring function Do • determine the common substrings in the sequences such that the similarity score is maximized

Sequence can change through different mutations • substitutions: ACGAAGGA • insertions: ACGAACGGA • deletions: ACGAAGA Gaps

Scoring function • Scoring function comprises • Substitution matrix s(a,b): indicates the score of aligning a with b • Gap penalty functionγ(g) where g is the gap length • DNA has a simple scoring function • Consider match and mismatch • Proteins have a more complex scoring function • Some amino acid pairs might be more substitutable versus others • Scores might capture • similar physical and chemical properties • evolutionary relationships • More next class

Gap penalty function • Different gap penalty functions require somewhat different scoring algorithms • Linear score • γ(g)=-dg, where d is a fixed cost, g is the gap length • Longer the gap higher the penalty • Affine score • γ(g)=-(d +(g-1)e), dand e are fixed penalties and e<d. • d : gap open penalty • e : gap extension penalty

Types of pairwise sequence alignment problems • Global alignment (Needleman-Wunsch algorithm) • Align the entire sequence • Local alignment (Smith-Waterman algorithm) • Align part of the sequence

Global alignment • Given two sequences u and v and a scoring function, find the alignment with the maximal score. • The number of possible alignments is very big.

Number of possible alignments Sequence 1 CAATGA Sequence 2 ATTGAT CAATGA ATTGAT CAATGA_ _ATTGAT CAATGA_ AT_TGAT Alignment 1 Alignment 2 Alignment 3

Number of possible alignments • There are • possible global alignments for 2 sequences of length n • E.g. two sequences of length 100 have possible alignments • But we can use dynamic programming to find an optimal alignment efficiently

Dynamic programming (DP) • Divide and conquer • Solve a problem using pre-computed solutions for subparts of the problem

Dynamic programming for sequence alignment • Two sequences: AAAC, AAG • x: AAAC • xi: Denotes the ithletter • y: AAG • yidenotes thejthletter in y. • Assume cost of gap is d. • Assume we have aligned x1..xi to y1..yj • There are three possibilities AAxiAAyj AAAAAG AA_AAG AAAAA_

A G C A A A C Dynamic programming idea • Letx’s length be n • Let y’s length be m • construct an (n+1)(m+1)matrix F • F (j, i) = score of the best alignment ofx1…xiwith y1…yj x score of best alignment of AAA to AG y

DP for global alignment with linear gap penalty DP recurrence relation Score of the best partial alignment between x1..xi and y1.. yj F(j-1,i-1) F(j-1,i) s(xi,yj) -d F(j,i-1) F(j,i) -d

DP algorithm sketch: global alignment • Initialize first row and column of matrix • Fill in rest of matrix from top to bottom, left to right • For each F(i, j), save pointer(s) to cell(s) that resulted in best score • F(m, n) holds the optimal alignment score; • Trace back from F(m, n) to F(0, 0) to recover alignment

Global alignment example • Suppose we choose the following scoring scheme • d = 2, where d is the gap penalty

A G C -d -2d -3d 0 A -d A -2d A -3d C -4d Initializing matrix: global alignment with linear gap penalty

Global alignment example A G C 0 -2 -4 -6 A -1 -3 -2 1 A -1 -4 0 -2 A -6 -3 -2 -1 C -8 -5 -4 -1

Trace back to retrieve the alignment A G C 0 -2 -4 -6 A -1 -3 -2 1 x: A A A C y: A G _ C A -1 -4 0 -2 one optimal alignment A -6 -3 -2 -1 C -8 -5 -4 -1

Computational complexity • initialization: O(m), O(n) where sequence lengths are m, n • filling in rest of matrix: O(mn) • traceback: O(m + n) • Since sequences have nearly same length, the computational complexity is

Summary of DP • Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1) • works for either DNA or protein sequences, although the substitution matrices used differ • finds an optimal alignment • the exact algorithm (and computational complexity) depends on gap penalty function

Local alignment • Look for local regions sequence similarity. • Aligning substrings of x and y. • Try to find the best alignment over all possible substrings. • This seems difficult computationally. • But can be solved by DP.

Local alignment motivation • Comparing protein sequences that share adomain(independently folded unit) but differ elsewhere • Comparing DNA sequences that share a similar sequence patternbut differ elsewhere • More sensitive when comparing highly diverged sequences • Selection on the genome differs between regions • Unconstrained regions are not very alignable

Local alignment DP Similar to global alignment with a few exceptions • DP recurrence is different • Top and left row are initialized to 0s. New in local: starts a new alignment

Local alignment DP algorithm • Initialization: first row and first column initialized with 0’s • Traceback: • Find maximum value of F(j, i) • can be anywhere in matrix • Stop when we get to a cell with value 0

A A G A 0 0 0 0 0 T 0 T 0 A 0 1 A 0 1 G 0 Local alignment example 0 0 0 0 0 0 0 0 1 0 1 2 0 1 0 0 3 1 x: A A G y: A A G

Pairwise Sequence Alignment