1 / 32

Pairwise Sequence Alignment

Pairwise Sequence Alignment. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 10 th , 2013. BMI/CS 576. Key concepts in today’s class. What is pairwise sequence alignment? How to handle gaps? Gap penalties

shira
Télécharger la présentation

Pairwise Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 10th, 2013 BMI/CS 576

  2. Key concepts in today’s class • What is pairwise sequence alignment? • How to handle gaps? • Gap penalties • What are the issues in sequence alignment? • What are the different types of alignment problems • Global alignment • Local alignment • Dynamic programming solution to finding the best alignment • Given a scoring function, find the best alignment of two sequences • Global • Local

  3. What is sequence alignment The task of locating equivalent regions of two or more sequences to maximize their overall similarity

  4. Why sequence alignment? • Identifying similarity between two sequences • Sequence specifies the function of a protein • Similarity in sequence can imply similarity in function. • Assign function to uncharacterized sequences based on characterized sequences • Sequence from different species can be compared to estimate the evolutionary relationships between species • We will come back to this in Phylogenetic trees.

  5. Alignment example T H I S S E Q U E N C E T H A T S E Q U E N C E

  6. How to compare these two sequences? T H I S S E Q U E N C E T H A T I S A S E Q U E N C E

  7. Need to incorporate gaps _ _ _T H I S S E Q U E N C E T H I S _ _ _ S E Q U E N C E T H A T I S A S E Q U E N C E T H A T I S A S E Q U E N C E Alignment 1: 3 gaps, 8 matches Alignment 2: 3 gaps, 9 matches

  8. Issues in sequence alignment • What type of alignment? • Align the entire sequence or part of it? • How to score an alignment? • the sequences we’re comparing typically differ in length • some amino acid pairs are more substitutable than others • How to find the alignment? • Algorithms for alignment • How to tell if the alignment is biologically meaningful? • Computing significance of an alignment

  9. Pairwise alignment: task definition Given • a pair of sequences (DNA or protein) • a scoring function Do • determine the common substrings in the sequences such that the similarity score is maximized

  10. Sequence can change through different mutations • substitutions: ACGAAGGA • insertions: ACGAACGGA • deletions: ACGAAGA Gaps

  11. Scoring function • Scoring function comprises • Substitution matrix s(a,b): indicates the score of aligning a with b • Gap penalty functionγ(g) where g is the gap length • DNA has a simple scoring function • Consider match and mismatch • Proteins have a more complex scoring function • Some amino acid pairs might be more substitutable versus others • Scores might capture • similar physical and chemical properties • evolutionary relationships • More next class

  12. Gap penalty function • Different gap penalty functions require somewhat different scoring algorithms • Linear score • γ(g)=-dg, where d is a fixed cost, g is the gap length • Longer the gap higher the penalty • Affine score • γ(g)=-(d +(g-1)e), dand e are fixed penalties and e<d. • d : gap open penalty • e : gap extension penalty

  13. Types of pairwise sequence alignment problems • Global alignment (Needleman-Wunsch algorithm) • Align the entire sequence • Local alignment (Smith-Waterman algorithm) • Align part of the sequence

  14. Global alignment • Given two sequences u and v and a scoring function, find the alignment with the maximal score. • The number of possible alignments is very big.

  15. Number of possible alignments Sequence 1 CAATGA Sequence 2 ATTGAT CAATGA ATTGAT CAATGA_ _ATTGAT CAATGA_ AT_TGAT Alignment 1 Alignment 2 Alignment 3

  16. Number of possible alignments • There are • possible global alignments for 2 sequences of length n • E.g. two sequences of length 100 have possible alignments • But we can use dynamic programming to find an optimal alignment efficiently

  17. Dynamic programming (DP) • Divide and conquer • Solve a problem using pre-computed solutions for subparts of the problem

  18. Dynamic programming for sequence alignment • Two sequences: AAAC, AAG • x: AAAC • xi: Denotes the ithletter • y: AAG • yidenotes thejthletter in y. • Assume cost of gap is d. • Assume we have aligned x1..xi to y1..yj • There are three possibilities AAxiAAyj AAAAAG AA_AAG AAAAA_

  19. A G C A A A C Dynamic programming idea • Letx’s length be n • Let y’s length be m • construct an (n+1)(m+1)matrix F • F (j, i) = score of the best alignment ofx1…xiwith y1…yj x score of best alignment of AAA to AG y

  20. DP for global alignment with linear gap penalty DP recurrence relation Score of the best partial alignment between x1..xi and y1.. yj F(j-1,i-1) F(j-1,i) s(xi,yj) -d F(j,i-1) F(j,i) -d

  21. DP algorithm sketch: global alignment • Initialize first row and column of matrix • Fill in rest of matrix from top to bottom, left to right • For each F(i, j), save pointer(s) to cell(s) that resulted in best score • F(m, n) holds the optimal alignment score; • Trace back from F(m, n) to F(0, 0) to recover alignment

  22. Global alignment example • Suppose we choose the following scoring scheme • d = 2, where d is the gap penalty

  23. A G C -d -2d -3d 0 A -d A -2d A -3d C -4d Initializing matrix: global alignment with linear gap penalty

  24. Global alignment example A G C 0 -2 -4 -6 A -1 -3 -2 1 A -1 -4 0 -2 A -6 -3 -2 -1 C -8 -5 -4 -1

  25. Trace back to retrieve the alignment A G C 0 -2 -4 -6 A -1 -3 -2 1 x: A A A C y: A G _ C A -1 -4 0 -2 one optimal alignment A -6 -3 -2 -1 C -8 -5 -4 -1

  26. Computational complexity • initialization: O(m), O(n) where sequence lengths are m, n • filling in rest of matrix: O(mn) • traceback: O(m + n) • Since sequences have nearly same length, the computational complexity is

  27. Summary of DP • Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1) • works for either DNA or protein sequences, although the substitution matrices used differ • finds an optimal alignment • the exact algorithm (and computational complexity) depends on gap penalty function

  28. Local alignment • Look for local regions sequence similarity. • Aligning substrings of x and y. • Try to find the best alignment over all possible substrings. • This seems difficult computationally. • But can be solved by DP.

  29. Local alignment motivation • Comparing protein sequences that share adomain(independently folded unit) but differ elsewhere • Comparing DNA sequences that share a similar sequence patternbut differ elsewhere • More sensitive when comparing highly diverged sequences • Selection on the genome differs between regions • Unconstrained regions are not very alignable

  30. Local alignment DP Similar to global alignment with a few exceptions • DP recurrence is different • Top and left row are initialized to 0s. New in local: starts a new alignment

  31. Local alignment DP algorithm • Initialization: first row and first column initialized with 0’s • Traceback: • Find maximum value of F(j, i) • can be anywhere in matrix • Stop when we get to a cell with value 0

  32. A A G A 0 0 0 0 0 T 0 T 0 A 0 1 A 0 1 G 0 Local alignment example 0 0 0 0 0 0 0 0 1 0 1 2 0 1 0 0 3 1 x: A A G y: A A G

More Related