Créer une présentation
Télécharger la présentation

Télécharger la présentation
## Sequence alignment

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Sequence alignment**Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 18th, 2012 BMI/CS 576**Outline of this segment**• Pairwise Sequence alignment • Local and global alignments • Gap penalities • Heuristic algorithms • Multiple sequence alignment • Phylogenetic trees • Probabilistic and parsimony methods**Sequences are related**• Darwin said that all organisms are related by a common ancestor. • Sequence changed gradually from ancestral sequence through changes • Similar DNA sequence often implies similar function**Compare sequence**• Enables us to ask how similar are two sequences • Closer in sequence -> closer in ancestry • Horizontal transfer in bacteria generates some exceptions • Infer function based on sequence similarity.**Pairwise alignment:task definition**Given • a pair of sequences (DNA or protein) • a scoring function Do • determine the common substrings in the sequences such that the similarity score is maximized**How to compute distances between two strings?**• Hamming distance • Works only with strings of equal lenght • How far are ATATATAT and TATATATA? • Edit distance between sequences s and t • Number of edits needed to transform s to t • Editing could be switches, insertions and deletions • How far are ATATATAT and TATATATA?**An alignment**• A two row matrix: one row for string s and another for string t • No column can have a “gap” in both rows. • At most m+n**Alignment example (DNA)**DNA sequence alignment of the RAP1 gene between S. cerevisiae and K. lactis**Alignment example (Protein)**Protein sequence alignment of the RAP1 gene between S. cerevisiae and K. lactis Notice fewer aligned columns?**Issues in sequence alignment**• the sequences we’re comparing typically differ in length • Insertions, deletions • there may be only a relatively small region in the sequences that matches • we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) • variable length regions may have been inserted/deleted from the common ancestral sequence**DNA sequence edits**• substitutions: ACGAAGGA • insertions: ACGAACGGA • deletions: ACGAAGA Gaps**Mismatches and gaps**CA--GATTCGAAT CGCCGATT---AT mismatch gap Gaps make sequence alignment complicated**Scoring an alignment: what is needed?**• Scoring matrix to score alignments • s(a,b) indicates score of aligning character a with character b • A kXk matrix where k is the size of our alphabet. • Algorithm to score an alignment • Statistical model to assess how likely is the alignment to occur by random chance**BLOSUM**• BLOckSUbstitution Matrix • Specifies how likely one amino acid switches into another over evolutionary time • Some AA changes are more rare than others • E.g. Serine->Phenylalanine is three times more likely than Tryptophan->Phenylalanine • AA that have more codons can tolerate more mutations. • Matrix is created from alignments themselves(!) • Consider sequences at least x% similar, exclude very similar blocks, count substitutions • BLOSUM50: sequences were no more than 50% similar of x% • Each entry is a log odds ratio of observing A and B to A alone or B alone**Example substitution scoring matrix (BLOSUM50)**A5 R -2 7 N -1 -1 7 D -2 -2 2 8 C -1 -4 -2 -4 13 Q -1 1 0 0 -3 7 E -1 0 0 2 -3 2 6 G 0 -3 0 -1 -3 -2 -3 8 H -2 0 1 -1 -3 1 0 -2 10 I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 L -2 -3 -4 -4 -2 -2 -3 -4 -3 25 K -1 3 0 -1 -3 21 -2 0 -3 -3 6 M -1 -2 -2 -4 -2 0 -2 -3 -1 23 -2 7 F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 S1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 25 W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 28 V 0 -3 -3 -4 -1 -3 -3 -4 -4 41 -3 1 -1 -3 -2 0 -3 -1 5 A R N D C Q E G H I L K M F P S T W Y V**Scoring Gaps**• Linear score • γ(d)=-dg, where g is a fixed cost, d is the gap length • Longer the gap higher the penalty • Affine score • γ(d)=-(d +(g-1)e), d and g are fixed penalties and g<d. • d : gap open penalty • g : gap extension penalty**Sequence alignment problems**• Pairwise • Compare two sequences • Global alignment (Needleman-Wunsch algorithm) • Local alignment (Smith-Waterman algorithm) • Multiple-sequence alignment • Compare >2 sequences**Global alignment**• Given two sequences u and v and a scoring matrix delta, find the alignment with the maximal score. • The number of possible alignments is very big.**Number of possible alignments**Sequence 1 CAATGA Sequence 2 ATTGAT CAATGA ATTGAT CAATGA-- --ATTGAT CAATGA-- AT--TGAT Alignment 1 Alignment 2 Alignment 3**Number of possible alignments**• there are • possible global alignments for 2 sequences of length n • e.g. two sequences of length 100 have possible alignments • but we can use dynamic programming to find an optimal alignment efficiently**Dynamic programming**• Divide and conquer • Solve a problem using pre-computed solutions for smaller subparts of the problem**Dynamic programming for sequence alignment**• Two sequences: AAAC, AAG • x: AAAC • xi: Denotes the ithletter • y: AAG • yidenotes thejthletter in y. • Assume we have aligned x1..xi to y1..yj • There are three possibilities AAxiAAyj AAAAAG AA--AAG AAAAA--**A**G C A A A C Dynamic programming idea • Letx’s length be n • Let y’s length be m • construct an (n+1)(m+1)matrix F • F (j, i) = score of the best alignment ofx1…xiwith y1…yj x score of best alignment of AAA to AG y**DP for global alignment with linear gap penalty**• DP recurrence relation:**DP algorithm sketch: global alignment**• initialize first row and column of matrix • fill in rest of matrix from top to bottom, left to right • for each F(i, j), save pointer(s) to cell(s) that resulted in best score • F(m, n) holds the optimal alignment score; • Trace back from F(m, n) to F(0, 0) to recover alignment**Global alignment example**• suppose we choose the following scoring scheme: • +1 • -1 • d(penalty for aligning with a gap) =2**A**G C -d -2d -3d 0 A -d A -2d A -3d C -4d Initializing matrix: global alignment with linear gap penalty**Global alignment example**A G C 0 -2 -4 -6 one optimal alignment A -1 -3 -2 1 x: A A A C y: A G - C A -1 -4 0 -2 A -6 -3 -2 -1 C -8 -5 -4 -1**Computational complexity**• initialization: O(m), O(n) where sequence lengths are m, n • filling in rest of matrix: O(mn) • traceback: O(m + n) • hence, if sequences have nearly same length, the computational complexity is**Summary of DP**• Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1) • works for either DNA or protein sequences, although the substitution matrices used differ • finds an optimal alignment • the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this issue)**Local alignment**• Look for local regions of sequence similarity. • Aligning substrings of x and y. • Try to find the best alignment over all possible substrings. • This seems difficult computationally. But can be solved by DP.**Local alignment motivation**• Comparing protein sequences that share a common sequence pattern or domain (independently folded unit) but differ elsewhere • Comparing DNA sequences that share a similar sequence patternbut differ elsewhere • More sensitive when comparing highly diverged sequences • Selection on the genome differs between regions • Unconstrained regions are not very alignable**Local alignment DP**• Similar to global alignment with a few exceptions • Initialize the first row and column to 0s Start a new alignment Key difference for local**Local alignment DP algorithm**• initialization: first row and first column initialized with 0’s • traceback: • find maximum value of F(j, i)can be anywhere in matrix • stop when we get to a cell with value 0**A**A G A 0 0 0 0 0 T 0 T 0 A 0 1 A 0 1 G 0 Local alignment example 0 0 0 0 0 0 0 0 1 0 1 2 0 1 0 0 3 1 x: A A G y: A A G