Sequence Alignment Tutorial #2. © Ydo Wexler & Dan Geiger.
Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: |alphabet|=4
• Protein: |alphabet|=20
Global Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
• A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Global Alignment
Example (cont):
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indel)
Symmetric view of evolution
[Figure labels: hypotheses space; biological data; best biological explanation]
Global Alignment: scoring scheme
Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2
The score of an alignment is the sum of its position scores.
Example:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
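The position-by-position scoring above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `score_alignment` and its keyword parameters are chosen here for illustration. It counts matches, mismatches and indel columns in a gapped alignment and sums the position scores.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, indel=-2):
    """Return (matches, mismatches, indels, score) for two aligned rows.

    Both rows must have the same length; '-' marks a gap.
    A column containing a gap in either row counts as one indel.
    """
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + indels * indel
    return matches, mismatches, indels, score


# The two example alignments of GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
# (13, 2, 4, 3)
print(score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--"))
# (5, 6, 12, -25)
```

Because both sequences have 17 letters, any alignment of them contains an even number of indel columns, which is a quick sanity check on hand-counted scores.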
Sequence Alignment Variants
Two basic variants of sequence alignment:
• Global alignment (the Needleman-Wunsch algorithm)
• Local alignment (the Smith-Waterman algorithm)
Today we’ll see:
• Overlap alignment
• Affine cost for gaps
We’ll use the ideas of dynamic programming presented in the lecture.
Overlap Alignment
Consider the following problem:
• Find the most significant overlap between two sequences S and T.
• Possible overlap relations: (a), (b) [figure]
Difference from local alignment: here we require the alignment to reach the endpoints of the two sequences.
Overlap Alignment
Formally: given S[1..n], T[1..m], find i,j such that
d = max{ D(S[1..i],T[j..m]), D(S[i..n],T[1..j]), D(S[1..n],T[i..j]), D(S[i..j],T[1..m]) }
is maximal.
Solution: same as global alignment, except that we do not penalize overhanging ends.
Overlap Alignment
• Initialization: V[i,0]=0, V[0,j]=0
• Recurrence: as in global alignment
• Score: the maximum value over the bottom row and the rightmost column
[Figure labels: local, overlap, global]
Overlap Alignment (Example)
S = PAWHEAE
T = HEAGAWGHEE
Scoring scheme:
• Match: +4
• Mismatch: -1
• Indel: -5
Overlap Alignment (Example)
Scoring scheme:
• Match: +4
• Mismatch: -1
• Indel: -5
The best overlap (score 11) is:
PAWHEAE------
---HEAGAWGHEE
Pay attention! A different scoring scheme could yield a different result. For example, with Indel: -2 instead of -5, the following overlap scores 12 and beats the one above:
---PAW-HEAE
HEAGAWGHEE-
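The overlap procedure (zero initialization, the global-alignment recurrence, and the maximum over the last row and last column) can be sketched as follows. This is an illustrative implementation under the slide's scoring scheme, not the authors' code; the name `overlap_score` and its parameters are assumptions made here, and only the score is returned, without traceback.

```python
def overlap_score(s, t, match=4, mismatch=-1, indel=-5):
    """Best overlap-alignment score of s and t (overhanging ends are free)."""
    n, m = len(s), len(t)
    # V[i][0] = V[0][j] = 0: leading overhangs are not penalized
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i] with t[j]
                V[i - 1][j] + indel,    # s[i] against a gap
                V[i][j - 1] + indel,    # t[j] against a gap
            )
    # Trailing overhangs are free too: take the best cell in the
    # bottom row or the rightmost column.
    return max(max(V[n]), max(row[m] for row in V))


print(overlap_score("PAWHEAE", "HEAGAWGHEE"))  # 11
```

Calling `overlap_score("PAWHEAE", "HEAGAWGHEE", indel=-2)` lets you explore how a cheaper indel changes the outcome.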
Affine gap scores
• Observation: insertions and deletions often occur in blocks longer than a single nucleotide.
• Consequence:
  • The current scoring scheme gives a constant penalty per gap unit.
  • This does not model the above phenomenon well.
Question: how do we modify the scheme to incorporate this?
Alignment with affine gap scores
• The penalty for a gap of length g is built from two parameters:
  d - penalty for introducing a gap
  e - penalty for elongating the gap by one unit
  Typically d > e.
Problem: when aligning S[i] to a gap, we do not know how much to penalize (d or e?).
Solution: we compute 3 matrices simultaneously:
• M(i,j) - the score obtained by aligning S[i] to T[j]
• IS(i,j) - the score obtained by aligning S[i] to a gap
• IT(i,j) - the score obtained by aligning T[j] to a gap
Affine gap scores
We assume that a deletion will not be followed directly by an insertion. This can be obtained by an appropriate choice of scores.
• Initialization: depends on the problem (global, local, …)
• Recurrence: uses already known values M(i’,j’), IS(i’,j’), IT(i’,j’)
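The three-matrix idea can be sketched for the global case as follows. The slides do not spell out the recurrences, so this sketch uses the standard textbook form and should be read as an assumption: a gap of length g costs d + (g-1)e, M takes the best of the three matrices at (i-1,j-1), and each gap matrix either opens a gap from M or extends its own gap. The function name and the default values d=4, e=1 are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, d=4, e=1):
    """Global alignment score with affine gaps (assumed cost: d + (g-1)*e)."""
    n, m = len(s), len(t)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # S[i] aligned to T[j]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S[i] aligned to a gap
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # T[j] aligned to a gap

    # Global initialization: leading gaps are charged in full.
    M[0][0] = 0
    for i in range(1, n + 1):
        IS[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        IT[0][j] = -d - (j - 1) * e

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1])
            # Open a gap from M (cost d) or extend an existing one (cost e).
            # A gap in one sequence never directly follows a gap in the other.
            IS[i][j] = max(M[i - 1][j] - d, IS[i - 1][j] - e)
            IT[i][j] = max(M[i][j - 1] - d, IT[i][j - 1] - e)

    return max(M[n][m], IS[n][m], IT[n][m])


print(affine_global_score("ACGT", "ACGT"))   # 4
print(affine_global_score("ACGTTT", "ACG"))  # -3: three matches, one gap of length 3
```

For local or overlap alignment only the initialization and the cell from which the final score is read would change.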
Affine gap scores
• Simplification: why are two matrices enough?