BCB 444/544


  1. BCB 444/544 Lecture 8. Finish: Dynamic Programming; Global vs Local Alignment; Scoring Matrices & Alignment Statistics; BLAST. #8_Sept7. BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST

  2. Required Reading (before lecture)
     √ Last week, for Lectures 4-7: Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
       • Xiong: Chp 3
       • Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909 http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
     √ Wed Sept 5, for Lecture 7 & Lab 3: Database Similarity Searching: BLAST (nope, more DP)
       • Chp 4 - pp 51-62
     Fri Sept 7, for Lecture 8 (will finish on Monday): BLAST variations; BLAST vs FASTA
       • Chp 4 - pp 51-62

  3. Assignments & Announcements
     √ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM. Send via email to Pete Zaback, petez@iastate.edu (For now, no late penalty - just send ASAP)
     √ Wed Sept 5 - Notes for Lecture 5 posted online - HW#2 posted online & sent via email & handed out in class
     Fri Sept 14 - HW#2 Due by 5 PM
     Fri Sept 21 - Exam #1

  4. Chp 3 - Sequence Alignment. SECTION II SEQUENCE ALIGNMENT. Xiong: Chp 3 Pairwise Sequence Alignment • √ Evolutionary Basis • √ Sequence Homology versus Sequence Similarity • √ Sequence Similarity versus Sequence Identity • Methods - cont • Scoring Matrices • Statistical Significance of Sequence Alignment. Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

  5. Methods • √ Global and Local Alignment • √ Alignment Algorithms • √ Dot Matrix Method • Dynamic Programming Method - cont • Gap penalties • DP for Global Alignment • DP for Local Alignment • Scoring Matrices • Amino acid scoring matrices • PAM • BLOSUM • Comparisons between PAM & BLOSUM • Statistical Significance of Sequence Alignment

  6. Dynamic Programming - 4 Steps:
     • Define score of optimal alignment, using recursion
     • Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
     • Calculate score of optimal alignment(s)
     • Trace back through matrix to recover optimal alignment(s) that generated optimal score

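  7. 1- Define Score of Optimal Alignment using Recursion
     Define: $\alpha$ = match reward, $\beta$ = mismatch penalty, $\gamma$ = gap penalty, and $\sigma(x_i, y_j) = \alpha$ if $x_i = y_j$, or $-\beta$ otherwise
     Initial conditions: $S(0,0) = 0$, $S(i,0) = -i\gamma$, $S(0,j) = -j\gamma$
     Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:
     $$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$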

  8. 2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
     • Construct sequence vs sequence matrix
     • Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]
     [Figure: DP matrix with columns 0..N and rows 0..M; S(0,0)=0 in the upper left corner, S(i,j) in the interior, S(N,M) in the lower right corner]

  9. How do we calculate S(i,j)? i.e., score for alignment of x[1..i] to y[1..j]? 1 of 3 cases gives the optimal score for this subproblem ($\sigma$ = match reward or mismatch penalty, $\gamma$ = gap penalty):
     • xi aligns to yj: x1 x2 ... xi-1 xi over y1 y2 ... yj-1 yj; score $S(i-1,j-1) + \sigma(x_i, y_j)$
     • xi aligns to a gap: x1 x2 ... xi-1 xi over y1 y2 ... yj -; score $S(i-1,j) - \gamma$
     • yj aligns to a gap: x1 x2 ... xi - over y1 y2 ... yj-1 yj; score $S(i,j-1) - \gamma$
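The three cases can be sketched as a single function. This is a minimal illustration, not code from the lecture: the name `cell_score` and its argument names are assumptions, and the default scores (+10 match, -2 mismatch, -5 space) are borrowed from the worked example later in the deck.

```python
def cell_score(s_diag, s_up, s_left, xi, yj, match=10, mismatch=-2, gap=-5):
    """Return S(i,j) from its three neighbours.

    s_diag = S(i-1,j-1), s_up = S(i-1,j), s_left = S(i,j-1).
    Penalties are passed as negative numbers, so they are added here
    (the slides subtract a positive penalty; the result is the same).
    """
    sub = match if xi == yj else mismatch
    return max(
        s_diag + sub,   # xi aligns to yj
        s_up + gap,     # one residue aligns to a gap
        s_left + gap,   # the other residue aligns to a gap
    )


# The cell discussed on slide 18: Diag 13, Top 8, Left 28, not a match.
example = cell_score(13, 8, 28, "G", "A")
```

With those inputs the left neighbour wins (28 - 5), which is the same arithmetic the traceback slides walk through by hand.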

  10. Specific Example: Scoring Consequence? (Note: I changed sequences on this slide to match the rest of DP example)
      Case 1: Line up xi with yj (Match Bonus)
        x: C - T C G C A
        y: C A T - T C A
      Case 2: Line up xi with space (Space Penalty)
        x: C - T C G C - A
        y: C A T - T C A -
      Case 3: Line up yj with space (Space Penalty)
        x: C - T C G C A -
        y: C A T - T C - A

  11. Ready? Fill in DP Matrix
      Initialization: $S(0,0) = 0$
      Recursion: $S(i,j) = \max\{\, S(i-1,j-1) + \sigma(x_i,y_j),\ S(i-1,j) - \gamma,\ S(i,j-1) - \gamma \,\}$, where $\sigma(x_i,y_j) = \alpha$ (match reward) or $-\beta$ (mismatch penalty) and $\gamma$ = gap penalty
      Keep track of dependencies of scores (in a pointer matrix)
      [Figure: DP matrix with columns 0..N and rows 0..M; cell S(i,j) depends on S(i-1,j-1), S(i-1,j) and S(i,j-1)]

  12. Fill in the DP matrix !! (+10 for match, -2 for mismatch, -5 for space)
             λ    C    T    C    G    C    A    G    C
        λ    0   -5  -10  -15  -20  -25  -30  -35  -40
        C   -5   10    5
        A  -10
        T  -15
        T  -20
        C  -25
        A  -30
        C  -35
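As a way to check a hand-filled matrix, the fill step can be sketched in a few lines. This is an illustrative sketch only; the function name `fill_global` is an assumption, and the scores are the ones stated on the slide (+10, -2, -5). Rows follow the left sequence and columns follow the top sequence.

```python
def fill_global(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix row by row."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # first row: leading gaps
        S[0][j] = j * gap
    for i in range(1, rows):          # first column: leading gaps
        S[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # diagonal
                          S[i - 1][j] + gap,       # vertical
                          S[i][j - 1] + gap)       # horizontal
    return S


S = fill_global("CTCGCAGC", "CATTCAC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
```

The first two entries computed in row C are 10 and 5, matching the slide, and the lower right cell holds the optimal global score.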

  13. 3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment. [Figure: filled DP matrix; rows λ C A T T C A C, columns λ C T C G C A G C] +10 for match, -2 for mismatch, -5 for space

  14. 4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score. How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix. Result? Optimal alignment(s) of sequences

  15. Traceback - for Global Alignment. Start in lower right corner & trace back to upper left. Each arrow introduces one character at end of alignment: • A horizontal move puts a gap in left sequence • A vertical move puts a gap in top sequence • A diagonal move uses one character from each sequence
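The traceback rules above can be sketched as a recursive walk that follows every move consistent with the stored scores, so all optimal alignments are returned. This is a hedged sketch rather than the lecture's implementation: the names `global_alignments` and `walk` are assumptions, and the scoring values come from the worked example (+10, -2, -5).

```python
def global_alignments(top, left, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of all optimal (top, left) alignments)."""
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, a_top, a_left):
        # Alignment strings are built back to front and reversed at the end.
        if i == 0 and j == 0:
            results.append((a_top[::-1], a_left[::-1]))
            return
        if i > 0 and j > 0:
            sub = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:      # diagonal move
                walk(i - 1, j - 1, a_top + top[j - 1], a_left + left[i - 1])
        if j > 0 and S[i][j] == S[i][j - 1] + gap:    # horizontal: gap in left
            walk(i, j - 1, a_top + top[j - 1], a_left + "-")
        if i > 0 and S[i][j] == S[i - 1][j] + gap:    # vertical: gap in top
            walk(i - 1, j, a_top + "-", a_left + left[i - 1])

    walk(n, m, "", "")
    return S[n][m], results


score, alignments = global_alignments("CTCGCAGC", "CATTCAC")
for a_top, a_left in alignments:
    print(a_top, a_left, sep="\n", end="\n\n")
```

Enumerating every path is fine for a classroom example; for long sequences the number of co-optimal alignments can grow quickly, so real tools usually report just one.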

  16. Traceback to Recover Alignment. [Figure: filled DP matrix with traceback arrows; rows λ C A T T C A C, columns λ C T C G C A G C] Can have >1 optimal alignment; this example has 2

  17. Traceback to Recover Alignment. [Figure: same DP matrix with traceback arrows in red] Where did red arrows come from?

  18. Traceback to Recover Alignment. [Figure: filled DP matrix with traceback arrows; rows λ C A T T C A C, columns λ C T C G C A G C] +10 for match, -2 for mismatch, -5 for space
      • Where did 33 come from? Match = 10, so 33 - 10 = 23. Must have come from diagonal
      • Where did 23 come from? (Not a match) Left? 28 - 5 = 23; Diag? 13 - 2 = 11; Top? 8 - 5 = 3. Only the left cell gives 23, so it came from the left

  19. Traceback to Recover Alignment. [Figure: filled DP matrix with traceback arrows; rows λ C A T T C A C, columns λ C T C G C A G C] +10 for match, -2 for mismatch, -5 for space
      • Where did 8 come from? Two possibilities: 13 - 5 = 8 (from the left) or 10 - 2 = 8 (from the diagonal)
      • Then, follow both paths

  20. Traceback to Recover Alignment, path #1 (top with left): C with C, - with A, T with T, C with -, G with T, C with C, A with A, G with -, C with C. Great - but what are the alignments?

  21. Traceback to Recover Alignment, path #2 (top with left): C with C, - with A, T with T, C with T, G with -, C with C, A with A, G with -, C with C. Great - but what are the alignments?

  22. What are the 2 Global Alignments with Optimal Score = 33?
      Top: C T C G C A G C    Left: C A T T C A C
      1: C - T C G C A G C
      2: C - T C G C A G C
      • A horizontal move puts a gap in left sequence • A vertical move puts a gap in top sequence • A diagonal move uses one character from each sequence

  23. What are the 2 Global Alignments with Optimal Score = 33?
      Top: C T C G C A G C    Left: C A T T C A C
      1: C - T C G C A G C
         C A T - T C A - C
      2: C - T C G C A G C
         C A T T - C A - C
      Check the scores: +10 for match, -2 for mismatch, -5 for space
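The score check can also be done mechanically. A minimal sketch, assuming a hypothetical helper named `alignment_score` that sums column scores for two gapped strings of equal length, using the slide's values (+10, -2, -5):

```python
def alignment_score(a, b, match=10, mismatch=-2, gap=-5):
    """Sum the per-column scores of two aligned (gapped) strings."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            total += gap        # space
        elif p == q:
            total += match      # match
        else:
            total += mismatch   # mismatch
    return total


score_1 = alignment_score("C-TCGCAGC", "CAT-TCA-C")
score_2 = alignment_score("C-TCGCAGC", "CATT-CA-C")
print(score_1, score_2)
```

Each alignment has 5 matches, 1 mismatch and 3 spaces: 50 - 2 - 15 = 33.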

  24. or, Check Traceback? [Figure: traceback matrix with each step on paths 1 and 2 labeled d, h or v; rows λ C A T T C A C, columns λ C T C G C A G C] • h = horizontal move puts a gap in left sequence • v = vertical move puts a gap in top sequence • d = diagonal move uses one character from each sequence

  25. Local Alignment: Motivation
      • To "ignore" stretches of non-coding DNA:
        • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
        • Local alignment between two protein-encoding sequences is likely to be between two exons
      • To locate protein domains or motifs:
        • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
        • Local sequence similarities may indicate "functional modules"
      Non-coding = "not encoding protein". Exons = "protein-encoding" parts of genes, vs Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons. Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

  26. Local Alignment: Example
      Sequences: G G T C T G A G and A A A C G A
      Match: +2; Mismatch or space: -1
      Best local alignment (region C T G A over C - G A):
        G G T C T G A G
        A A A C - G A -
      Score = 5

  27. Local Alignment: Algorithm (This slide has been changed!)
      1) Initialize top row & leftmost column of matrix with "0"
      2) Fill in DP matrix: in local alignment, no negative scores. Assign "0" to cells with negative scores
      3) Optimal score? In highest scoring cell(s)
      4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)
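Steps 1 to 3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `local_matrix` and `best_cells` are hypothetical, and the default scores (+1 match, -1 mismatch, -5 space) are the ones used in the deck's local alignment matrix for CTCGCAGC vs CATTCAC.

```python
def local_matrix(top, left, match=1, mismatch=-1, gap=-5):
    """Fill a local-alignment DP matrix; negative cells are reset to 0."""
    H = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]  # row 0, col 0 stay 0
    for i in range(1, len(left) + 1):
        for j in range(1, len(top) + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H


def best_cells(H):
    """Return the highest score and every (row, column) cell that holds it."""
    best = max(max(row) for row in H)
    cells = [(i, j) for i, row in enumerate(H) for j, v in enumerate(row) if v == best]
    return best, cells


best, cells = best_cells(local_matrix("CTCGCAGC", "CATTCAC"))
print(best, cells)

# The earlier example: GGTCTGAG vs AAACGA, match +2, mismatch or space -1.
example_best, _ = best_cells(local_matrix("GGTCTGAG", "AAACGA", 2, -1, -1))
print(example_best)
```

Each cell returned by `best_cells` is a separate starting point for step 4; the traceback from it stops at the first cell containing 0.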

  28. Local Alignment DP: Initialization & Recursion (New Slide)
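The equations on this slide did not survive in the transcript. As a hedged reconstruction from the algorithm steps on the previous slide (not the slide's own notation), the standard local alignment initialization and recursion can be written as follows, where $\sigma(x_i, y_j)$ is assumed to denote the match reward or mismatch penalty and $\gamma$ the gap penalty:

```latex
% Initialization: top row and leftmost column are zero
S(i,0) = 0, \qquad S(0,j) = 0

% Recursion: same three cases as global alignment, plus a floor at zero
S(i,j) = \max
\begin{cases}
  0 \\
  S(i-1,j-1) + \sigma(x_i, y_j) \\
  S(i-1,j) - \gamma \\
  S(i,j-1) - \gamma
\end{cases}
\qquad 1 \le i \le N,\ 1 \le j \le M
```

The only differences from the global case are the zero initialization and the extra 0 inside the max.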

  29. Filling in DP Matrix for Local Alignment. No negative scores - fill in "0". [Figure: filled local DP matrix; rows λ C A T T C A C, columns λ C T C G C A G C] +1 for match, -1 for mismatch, -5 for space

  30. Traceback - for Local Alignment. [Figure: local DP matrix with the four optimal-score cells labeled 1, 2, 3, 4; rows λ C A T T C A C, columns λ C T C G C A G C] +1 for match, -1 for mismatch, -5 for space

  31. What are the 4 Local Alignments (1-4) with Optimal Score = 2? Top: C T C G C A G C    Left: C A T T C A C

  32. What are the 4 Local Alignments with Optimal Score = 2? (aligned region in brackets)
      1: C T C G [C A] G C
         - - - - [C A] T T C A C
      2: C [T C] G C A G C
         T [T C] A C
      3: C [T C G C] A G C
         T [T C A C]
      4: C T C G [C A] G C
         C A T T [C A] C
      Check the scores: +1 for match, -1 for mismatch, -5 for space

  33. Some Results re: Alignment Algorithms (for ComS, CprE & Math types)
      • Most pairwise sequence alignment problems can be solved in O(mn) time
      • Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
      • Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]
      for Biologists: Big O notation
      • used when analyzing algorithms for efficiency
      • refers to time or number of steps it takes to solve a problem
      • expressed as a function of size of the problem
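To make the space claim concrete, here is a small sketch that computes only the optimal global score while keeping two rows of the matrix in memory. It is an illustration under stated assumptions, not the linear-space method cited on the slide: the function name `global_score_two_rows` is hypothetical, and recovering the alignment itself in linear space needs an additional divide-and-conquer step that is not shown here.

```python
def global_score_two_rows(top, left, match=10, mismatch=-2, gap=-5):
    """Optimal global score in O(mn) time, keeping only two rows in memory."""
    prev = [j * gap for j in range(len(top) + 1)]      # row 0
    for i, c_left in enumerate(left, start=1):
        cur = [i * gap]                                # column 0 of row i
        for j, c_top in enumerate(top, start=1):
            sub = match if c_left == c_top else mismatch
            cur.append(max(prev[j - 1] + sub,          # diagonal
                           prev[j] + gap,              # vertical
                           cur[j - 1] + gap))          # horizontal
        prev = cur                                     # discard the older row
    return prev[-1]


two_row_score = global_score_two_rows("CTCGCAGC", "CATTCAC")
print(two_row_score)
```

The time is still proportional to the number of cells (m × n); only the memory drops, because each row depends solely on the row above it.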

  34. Affine Gap Penalty Functions. Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap. Total Gap Penalty is a linear function of gap length: $W = \gamma + \delta \times (k - 1)$, where $\gamma$ = gap opening penalty, $\delta$ = gap extension penalty, k = length of gap. Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty. Can also be solved in O(nm) time using DP
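A quick numeric sketch of the affine formula. The function name and the default penalty values (opening 10, extension 1) are assumptions chosen only for illustration; they are not values given in the lecture.

```python
def affine_gap_penalty(k, gap_open=10, gap_extend=1):
    """Total penalty W for a gap of length k: opening + extension * (k - 1)."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


# Penalty for gaps of length 1 through 5 under the assumed values.
penalties = [affine_gap_penalty(k) for k in range(1, 6)]
print(penalties)
```

Under these assumed values one gap of length 4 costs 13, while four separate gaps of length 1 cost 40, which is the behaviour the affine scheme is meant to capture: a single long insertion or deletion is penalized less than many scattered ones.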

  35. Methods • √ Global and Local Alignment • √ Alignment Algorithms • √ Dot Matrix Method • √ Dynamic Programming Method - cont • Gap penalties • DP for Global Alignment • DP for Local Alignment • Scoring Matrices • Amino acid scoring matrices • PAM • BLOSUM • Comparisons between PAM & BLOSUM • Statistical Significance of Sequence Alignment

  36. "Scoring" or "Substitution" Matrices. 2 Major types for Amino Acids: PAM & BLOSUM
      • PAM = Point Accepted Mutation: relies on "evolutionary model" based on observed differences in alignments of closely related proteins
      • BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

  37. PAM Matrix. PAM = Point Accepted Mutation: relies on "evolutionary model" based on observed differences in closely related proteins
      • Model includes defined rate for each type of sequence change
      • Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
      • PAM1 - for less divergent sequences (shorter time)
      • PAM250 - for more divergent sequences (longer time)

  38. BLOSUM Matrix. BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
      • Doesn't rely on a specific evolutionary model
      • Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
      • BLOSUM45 - for more divergent sequences
      • BLOSUM62 - for less divergent sequences

  39. PAM250 vs BLOSUM62. See Text: Fig 3.5 = PAM250, Fig 3.6 = BLOSUM62. Usually only 1/2 of matrix is displayed (it is symmetric). Here: s(a,b) corresponds to score of aligning character a with character b

  40. Which is Better? PAM or BLOSUM
      • PAM matrices: derived from evolutionary model; often used in reconstructing phylogenetic trees - but not very good for highly divergent sequences
      • BLOSUM matrices: based on direct observations; more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

  41. Which Type of Matrix Should You Use? Several other types of matrices available: • Gonnet & Jones-Taylor-Thornton: very robust in tree construction • "Best" matrix depends on task: different matrices for different applications. ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result

  42. Sequence Alignment Statistics • Distribution of similarity scores in sequence alignment is not a simple "normal" distribution • "Gumbel extreme value distribution" - a highly skewed distribution with a long tail (unlike the symmetric normal distribution)

  43. How to Assess Statistical Significance of an Alignment?
      • Compare score of an alignment with distribution of scores of alignments for many "randomized" (shuffled) versions of the original sequence
      • If score is in extreme margin, then unlikely due to random chance
      • P-value = probability of obtaining a score at least as high as the original alignment's by random chance (lower P is better)
        P = $10^{-5}$ to $10^{-50}$: sequences have clear homology
        P > $10^{-1}$: no better than random
      Check out: PRSS (Probability of Random Shuffles) http://www.ch.embnet.org/software/PRSS_form.html
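The shuffling idea can be sketched in a few lines. This is a toy illustration, not the PRSS program: the function names, the scoring values (+1, -1, -2), the number of shuffles, the random seed and the test sequence are all assumptions, and the P-value is a simple empirical estimate (with the usual +1 correction) rather than a fit to an extreme value distribution.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping two rows of the DP matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            sub = match if ca == cb else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate P as the fraction of shuffles of b scoring >= the observed score."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)           # same composition, random order
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


seq = "ACGTTGCAAGCTTACGGATCCATGACGTAC"
observed, p = shuffle_p_value(seq, seq)
print(observed, p)
```

With only a few hundred shuffles the smallest P this estimate can report is about 1/(trials + 1), which is why tools that report very small P-values fit a distribution to the shuffled scores instead of counting.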

  44. Chp 4 - Database Similarity Searching. SECTION II SEQUENCE ALIGNMENT. Xiong: Chp 4 Database Similarity Searching • Unique Requirements of Database Searching • Heuristic Database Searching • Basic Local Alignment Search Tool (BLAST) • FASTA • Comparison of FASTA and BLAST • Database Searching with Smith-Waterman Method

  45. Exhaustive vs Heuristic Methods
      Exhaustive - tests every possible solution
      • guaranteed to give best answer (identifies optimal solution)
      • can be very time/space intensive!
      • e.g., Dynamic Programming as in Smith-Waterman algorithm
      Heuristic - does NOT test every possibility
      • no guarantee that answer is best (but, often can identify optimal solution)
      • sacrifices accuracy (potentially) for speed
      • uses "rules of thumb" or "shortcuts"
      • e.g., BLAST & FASTA

  46. Today's Lab: focus on BLAST (Basic Local Alignment Search Tool). STEPS:
      • Create list of every possible "word" (e.g., 3-11 letters) from query sequence
      • Search database to identify sequences that contain matching words
      • Score match of word with sequence, using a substitution matrix
      • Extend match (seed) in both directions, while calculating alignment score at each step
      • Continue extension until score drops below a threshold (due to mismatches)
      High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)
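The seed-and-extend steps can be illustrated with a toy sketch. This is not how BLAST itself is implemented: it assumes exact word matches as seeds (no substitution matrix or neighborhood words), simple +1/-1 scoring, and an extension that stops once the score falls more than `drop` below the best seen so far. The function names, the sequences and all parameter values are hypothetical.

```python
def extend(query, subject, qi, si, w, match=1, mismatch=-1, drop=2):
    """Extend an exact seed of length w without gaps; return (score, q_start, q_end)."""
    score = best = w * match
    best_right = k = 0
    # Extend to the right of the seed.
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    best_left = k = 0
    while qi - 1 - k >= 0 and si - 1 - k >= 0:
        score += match if query[qi - 1 - k] == subject[si - 1 - k] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > drop:
            break
    return best, qi - best_left, qi + w + best_right


def best_hsp(query, subject, w=4):
    """Try every query word as a seed; return (score, query segment) of the best HSP."""
    best = (0, "")
    for qi in range(len(query) - w + 1):
        word = query[qi:qi + w]
        si = subject.find(word)
        while si != -1:                      # every occurrence of the word
            score, start, end = extend(query, subject, qi, si, w)
            if score > best[0]:
                best = (score, query[start:end])
            si = subject.find(word, si + 1)
    return best


hsp_score, hsp_segment = best_hsp("GATTACAGG", "CCGATTACATT")
print(hsp_score, hsp_segment)
```

Because only seeded regions are examined, a similar region that contains no exact word match would be missed entirely; that is the speed-for-sensitivity trade-off of a heuristic method.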

  47. Lab 3: focus on BLAST (Basic Local Alignment Search Tool). BLAST Results? • Original version of BLAST? List of HSPs = Maximum Scoring Pairs • More recent, improved version of BLAST? Allows gaps: Gapped Alignment. How? Allows score to drop below threshold (but only temporarily)

  48. BLAST - a few details. Developed by Stephen Altschul at NCBI in 1990
      • Word length? Typically: 3 aa for protein sequence, 11 nt for DNA sequence
      • Substitution matrix? Default is BLOSUM62. Can change under Algorithm Parameters. Choose other BLOSUM or PAM matrices
      • Stop-Extension Threshold? Typically: 22 for proteins, 20 for DNA

  49. BLAST - Statistical Significance?
      • E-value: E = m × n × P, where m = total number of residues in database, n = number of residues in query sequence, P = probability that an HSP is result of random chance. Lower E-value: less likely to result from random chance, thus higher significance
      • Bit Score: S' = normalized score, to account for differences in sequence length & size of database
      • Low Complexity Masking: remove repeats that confound scoring
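A worked instance of the E-value formula as stated on the slide. The function name `e_value` and the numbers used (database size, query length, P) are hypothetical, chosen only to show how E scales with the size of the search.

```python
def e_value(m, n, p):
    """E = m x n x P, using the slide's definitions of m, n and P."""
    return m * n * p


# Same query (100 residues) and same P, two hypothetical database sizes.
small_db = e_value(1_000_000, 100, 1e-10)
large_db = e_value(100_000_000, 100, 1e-10)
print(small_db, large_db)
```

The same hit is 100 times less significant in the database that is 100 times larger, which is why E-values from searches against different databases are not directly comparable.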
