
BCB 444/544



Presentation Transcript


  1. BCB 444/544 Lecture 10 BLAST Details Plus some Gene Jargon #10_Sept12 BCB 444/544 F07 ISU Dobbs #10 - BLAST details + some Gene Jargon

  2. Required Reading (before lecture)
     √ Mon Sept 10 - for Lecture 9: BLAST variations; BLAST vs FASTA, SW
       • Chp 4 - pp 51-62
     √ Wed Sept 12 - for Lecture 10 & Lab 4: Multiple Sequence Alignment (MSA)
       • Chp 5 - pp 63-74
     Fri Sept 14 - for Lecture 11: Position Specific Scoring Matrices & Profiles
       • Chp 6 - pp 75-78 (but not HMMs)
     Good additional resource re: sequence alignment
       • Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

  3. Assignments & Announcements - #1
     Revised Grading Policy has been sent via email. Please review!
     √ Mon Sept 10 - Lab 3 Exercise due 5 PM to: terrible@iastate.edu
     Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
     Fri Sept 14 - HW#2 due by 5 PM (106 MBB); Study Guide for Exam 1 will be posted by 5 PM

  4. Review: Gene Jargon #1 (for HW2, 1c)
     Exons = "protein-encoding" (or "kept") parts of eukaryotic genes
     vs
     Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons
     • Introns are transcribed into pre-RNA
     • but are later removed by RNA processing
     • & do not appear in mature mRNA
     • so are not translated into protein

  5. Assignments & Announcements - #2
     Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
     Thu Sept 20 - Lab = Optional Review Session for Exam
     Fri Sept 21 - Exam 1 - Will cover:
     • Lectures 2-12 (thru Mon Sept 17)
     • Labs 1-4
     • HW2
     • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming

  6. Chp 3 - Sequence Alignment
     SECTION II SEQUENCE ALIGNMENT
     Xiong: Chp 3 Pairwise Sequence Alignment
     • √ Evolutionary Basis
     • √ Sequence Homology versus Sequence Similarity
     • √ Sequence Similarity versus Sequence Identity
     • √ Methods - (Dot Plots, DP; Global vs Local Alignment)
     • √ Scoring Matrices (PAM vs BLOSUM)
     • √ Statistical Significance of Sequence Alignment
     Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

  7. Local Alignment: Algorithm (This slide has been changed!)
     1) Initialize top row & leftmost column of matrix with "0"
     2) Fill in DP matrix: in local alignment, no negative scores; assign "0" to cells with negative scores
     3) Optimal score? In the highest-scoring cell(s)
     4) Optimal alignment(s)? Trace back from each cell containing the optimal score until a cell with "0" is reached (not just from the lower right corner)
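
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name `smith_waterman` are assumptions chosen for the example, and only one optimal alignment is traced back rather than every cell holding the best score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: top row and leftmost column are initialized with 0.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Step 2: fill the DP matrix; negative scores are replaced by 0.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            # Step 3: the optimal score is in the highest-scoring cell.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step 4: trace back from the best cell until a cell with 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAACGTAAA", "CCCACGTCCC"))
```

With these toy sequences the only shared stretch is ACGT, so the traceback starts in the middle of the matrix rather than in the lower right corner.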

  8. Local Alignment DP: Initialization & Recursion (New slide)

  9. A Few Words about Parameter Selection in Sequence Alignment
     The optimal alignment between a pair of sequences depends critically on the choice of substitution matrix & gap penalty function.
     In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)
     How do we pick parameters that give the most biologically meaningful alignments and alignment scores?

  10. Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty
     • Alignment score is the sum of all match/mismatch scores (from the substitution matrix), with an affine penalty subtracted for each gap
         a b c - - d
         a c c e f d
         9 2 7     6    (values from substitution matrix)
         => 24 - (10 + 2) = 12
       Match score = 24; gap opening + extension = 10 + 2; alignment score = 12
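
The arithmetic on this slide can be reproduced with a short Python sketch. The slide gives only the total gap penalty (10 + 2), so the convention used here is an assumption: the first column of a gap costs `gap_open` = 10 and each further column costs `gap_extend` = 2. The function name and the `pair_scores` dictionary are hypothetical helpers holding just the four matrix values shown on the slide.

```python
def alignment_score(row_a, row_b, pair_scores, gap_open=10, gap_extend=2):
    """Score a finished alignment given as two equal-length rows ('-' = gap).

    Assumed convention: first gap column costs gap_open, each additional
    consecutive gap column costs gap_extend.
    """
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_scores[(x, y)]
            in_gap = False
    return total


# Substitution values copied from the slide's example.
pair_scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}
print(alignment_score("abc--d", "accefd", pair_scores))
```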

  11. Chp 4 - Database Similarity Searching
     SECTION II SEQUENCE ALIGNMENT
     Xiong: Chp 4 Database Similarity Searching
     • Unique Requirements of Database Searching
     • Heuristic Database Searching
     • Basic Local Alignment Search Tool (BLAST)
     • FASTA
     • Comparison of FASTA and BLAST
     • Database Searching with Smith-Waterman Method

  12. Database searching
     (Figure: query sequence + sequence database -> sequence comparison algorithm -> target sequences ranked by score)

  13. Why search a database?
     • Given a newly discovered gene,
       • Does it occur in other species?
       • Is its function known in another species?
     • Given a newly sequenced genome, which regions align with genomes of other organisms?
       • Identification of potential genes
       • Identification of other functional parts of chromosomes
     • Find members of a multigene family

  14. Recall: There are 3 Basic Types of Alignment Algorithms
     SECTION II SEQUENCE ALIGNMENT
     Xiong: Chp 3
       1) Dot Matrix
       2) Dynamic Programming
     Xiong: Chp 4
       3) Word or k-tuple methods (BLAST & FASTA)
     Wikipedia: Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming.

  15. Exhaustive vs Heuristic Methods
     Exhaustive - tests every possible solution
     • guaranteed to give best answer (identifies optimal solution)
     • can be very time/space intensive!
     • e.g., Dynamic Programming (as in Smith-Waterman algorithm)
     Heuristic - does NOT test every possibility
     • no guarantee that answer is best (but often can identify optimal solution)
     • sacrifices accuracy (potentially) for speed
     • uses "rules of thumb" or "shortcuts"
     • e.g., BLAST & FASTA

  16. Why do we Need Fast Search Algorithms?
     • Your query is 200 amino acids long (N)
     • You are searching a non-redundant database, which currently contains >10^6 proteins (K)
     • If proteins in database have avg length 200 aa (M), then:
       • Must fill in 200 × 200 × 10^6 = 4 × 10^10 DP entries!!
       • 4 × 10^10 operations just to fill in the DP matrix!
     • DP for pairwise alignment is O(NM)
     • Searching in a database is O(NMK)
     • Need faster algorithms for searching in large databases!
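
As a quick check of the O(NMK) estimate, the count of DP cells can be computed directly. The function name `dp_cells` is a hypothetical helper; the three numbers are the ones quoted on the slide.

```python
def dp_cells(query_len, avg_target_len, num_targets):
    """Number of DP matrix cells to fill when aligning one query to every target."""
    return query_len * avg_target_len * num_targets


# N = 200 aa query, M = 200 aa average target, K = 10^6 database proteins.
cells = dp_cells(200, 200, 10**6)
print(cells)
```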

  17. FASTA vs BLAST
     • Both FASTA and BLAST are based on heuristics
     • Tradeoff: Sensitivity vs Speed
     • DP is slower, but more sensitive
     • FASTA
       • user defines value for k = word length
       • Slower, but more sensitive than BLAST at lower values of k (preferred for searches involving a very short query sequence)
     • BLAST family
       • Family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches
       • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

  18. Lab 3: focus on BLAST - Basic Local Alignment Search Tool
     STEPS:
     • Create list of every possible "word" (e.g., 3-11 letters) from query sequence
     • Search database to identify sequences that contain matching words
     • Score match of word with sequence, using a substitution matrix
     • Extend match (seed) in both directions, while calculating alignment score at each step
     • Continue extension until score drops below a threshold (due to mismatches)
     High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)

  19. What are the Results of a BLAST Search?
     Original version of BLAST? List of HSPs called Maximum Scoring Pairs
     More recent, improved version of BLAST? Allows gaps: Gapped Alignment
     How? Allows score to drop below threshold (but only temporarily)

  20. Why is Gapped Alignment Harder?
     • Without gaps, there are N+M-1 possible alignments between sequences of length N and M
     • Once we start allowing gaps, there are many more possible arrangements to consider:
         abcbcd    abcbcd    abcbcd
         |||  |    |  |||    ||  ||
         abc--d    a--bcd    ab--cd
     • Becomes a very large number when we also allow mismatches, because we need to look at every possible pairing between elements:
       Roughly N^M possible alignments! e.g., for N = M = 100, there are 100^100 = 10^200 possible alignments, & 100 aa is a small protein!

  21. BLAST - a few details
     Developed by Stephen Altschul at NCBI in 1990
     • Word length?
       • Typically: 3 aa for protein sequence, 11 nt for DNA sequence
     • Substitution matrix?
       • Default is BLOSUM62
       • Can change under Algorithm Parameters
       • Can choose other BLOSUM or PAM matrices
       • Change other parameters here, too
     • Stop-Extension Threshold?
       • Typically: 22 for proteins, 20 for DNA

  22. BLAST - Statistical Significance?
     1. E-value: E = m × n × P
        m = total number of residues in database
        n = number of residues in query sequence
        P = probability that an HSP is the result of random chance
        The lower the E-value, the less likely the match results from random chance, and thus the higher its significance
     2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n) - more later
     3. Low Complexity Masking: remove repeats that confound scoring - more sooner

  23. BLAST algorithms can generate both "global" and "local" alignments
     (Figure: global alignment vs local alignment)

  24. BLAST - a Family of Programs: Different BLAST "flavors"
     • BLASTP - protein sequence query against protein DB
     • BLASTN - DNA/RNA seq query against DNA DB (GenBank)
     • BLASTX - 6-frame translated DNA seq query against protein DB
     • TBLASTN - protein query against 6-frame DNA translation
     • TBLASTX - 6-frame DNA query to 6-frame DNA translation
     • PSI-BLAST - protein "profile" query against protein DB
     • PHI-BLAST - protein pattern against protein DB
     • Newest: MEGA-BLAST - optimized for highly similar sequences
     Which tool should you use? http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

  25. Review: Gene Jargon #2.1 6-Frame translated DNA Sequence? Remember GeneBoy exercise?

  26. Review: Gene Jargon #2.2 6-Frame translated DNA Sequence?
     Try NCBI tools: http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi http://www.ncbi.nlm.nih.gov/
     Or, for some Biology review re: DNA/RNA & ORFs, see next 3 slides borrowed from EMBL-EBI: http://www.ebi.ac.uk/

  27. Review: Gene Jargon #2.3 http://www.ebi.ac.uk/ DNA Strands

  28. Review: Gene Jargon #2.4 http://www.ebi.ac.uk/ RNA Strands - copied from DNA

  29. Review: Gene Jargon #2.5 http://www.ebi.ac.uk/ Reading Frames

  30. BLAST - How does it work? Main idea - based on dot plots!

  31. Dot Plots - apply in BLAST: • Perform fast, approximate local alignments to find sequences in database that are related to query sequence • Here, use 4-base "window" • 75% identity (allow mismatches)

  32. Detailed Steps in BLAST algorithm
     1) Remove low-complexity regions (LCRs)
     2) Make a list (dictionary): all words of length 3 aa or 11 nt
     3) Augment list to include similar words
     4) Store list in a search tree (data structure)
     5) Scan database for occurrences of words in search tree
     6) Connect nearby occurrences
     7) Extend matches (words) in both directions
     8) Prune list of matches using a score threshold
     9) Evaluate significance of each remaining match
     10) Perform Smith-Waterman to get alignment

  33. 1: Filter low-complexity regions (LCRs) (This slide has been changed!)
     • Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.
     • Low complexity sequences can yield false positives.
     • Screen them out of your query sequences! (When appropriate!)
     K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity):
       $K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
       N = alphabet size (4 or 20)
       L = window length (usually 12)
       n_i = frequency of the ith letter in the window
     e.g., for GGGG:
       • L! = 4! = 4 × 3 × 2 × 1 = 24
       • n_G = 4, n_T = n_A = n_C = 0
       • ∏ n_i! = 4! × 0! × 0! × 0! = 24
       • K = 1/4 log_4 (24/24) = 0
     For CGTA: K = 1/4 log_4 (24/1) = 0.57
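
The two worked values on this slide (0 for GGGG, about 0.57 for CGTA) can be checked with a small Python sketch of the complexity formula implied by the examples. The function name `complexity` is a hypothetical choice; letters that do not occur in the window contribute 0! = 1 and can be skipped.

```python
import math
from collections import Counter


def complexity(window, alphabet_size):
    """K = (1/L) * log_N( L! / product of n_i! ) for one sequence window.

    L is the window length, N the alphabet size (4 for DNA, 20 for protein),
    and n_i the count of each letter in the window.
    """
    L = len(window)
    counts = Counter(window)
    denominator = math.prod(math.factorial(n) for n in counts.values())
    return math.log(math.factorial(L) / denominator, alphabet_size) / L


print(complexity("GGGG", 4))            # lowest possible complexity
print(round(complexity("CGTA", 4), 2))  # every letter different
```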

  34. 2: List all words in query
     Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
     Words: YGG GGF GFM FMT MTS TSE SEK …
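
Listing the words amounts to sliding a window of length k along the query. The sketch below is a minimal Python version; the function name `query_words` is hypothetical, and positions are reported 1-based to match the "at position 2" answer on the search-tree slide later in the lecture.

```python
def query_words(query, k=3):
    """Return every overlapping k-letter word in the query with its 1-based position."""
    return [(i + 1, query[i:i + k]) for i in range(len(query) - k + 1)]


query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
words = query_words(query)
print([w for _, w in words[:7]])
```

A query of length n yields n - k + 1 words, so the list grows only linearly with query length.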

  35. 3: Augment word list
     Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
     Words: YGG GGF GFM FMT MTS TSE SEK …
     Candidate words: AAA AAB AAC … YYY
     20^3 = 8000 possible matches

  36. 3: Augment word list
     BLOSUM62 scores:
       GGF vs AAA: 0 + 0 + (-2) = -2    Non-match
       GGF vs GGY: 6 + 6 + 3 = 15       Match
     A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

  37. 3: Augment word list
     Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
     Words: YGG GGF GFM FMT MTS TSE SEK …
     Augmented list for GGF: GGI GGL GGM GGF GGW GGY …

  38. 3: Augment word list
     Observation: Selecting only words with score > T greatly reduces the number of possible matches; otherwise, there are 20^3 for 3-letter words from amino acid sequences!

  39. Example: Find all words that match EAM with a score greater than or equal to 11
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
     A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
     R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
     N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
     D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
     C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
     Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
     E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
     G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
     H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
     I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
     L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
     K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
     M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
     F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
     P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
     S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
     T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
     W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
     Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
     V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
     EAM: 5 + 4 + 5 = 14
     DAM: 2 + 4 + 5 = 11
     QAM: 2 + 4 + 5 = 11
     ESM: 5 + 1 + 5 = 11
     EAL: 5 + 4 + 2 = 11
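
The EAM example can be reproduced by brute force over all 20^3 candidate words. The Python sketch below is an illustration only: to stay short it embeds just the E, A and M rows of the matrix shown on this slide, so it works only for query words built from those three letters, and the names `ROWS`, `SCORE` and `neighborhood` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix on the slide

# Only the three matrix rows needed for the query word EAM, copied from the slide.
ROWS = {
    "A": "4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0",
    "E": "-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2",
    "M": "-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1",
}
SCORE = {
    (row, col): int(value)
    for row, values in ROWS.items()
    for col, value in zip(AMINO_ACIDS, values.split())
}


def neighborhood(word, threshold):
    """Return {candidate: score} for every candidate scoring >= threshold against word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORE[(w, c)] for w, c in zip(word, letters))
        if score >= threshold:
            hits["".join(letters)] = score
    return hits


print(neighborhood("EAM", 11))
```

Lowering the threshold from 11 to 10 enlarges the list from 5 words to the 12 words used in the search-tree example that follows, which shows how strongly T controls the size of the augmented list.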

  40. 4: Store words in search tree
     Augmented list of query words -> search tree
     "Does this query contain GGF?" -> "Yes, at position 2."

  41. Search tree
     (Figure: tree with path G -> G, then branches F, L, M, W, Y; leaves GGF GGL GGM GGW GGY)

  42. Example: Put this word list into a search tree
     DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV
     (Figure: search tree with first letters D, Q, E, K; second letters A, C, G, S, T, V; third letters M, I, L, V)
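
One simple way to build such a search tree is a trie made of nested dictionaries. The Python sketch below is an assumption about the data structure, not BLAST's actual implementation; `build_trie`, `contains` and the "$" end-of-word marker are hypothetical names.

```python
def build_trie(words):
    """Store words in a trie of nested dicts; '$' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word
    return root


def contains(trie, word):
    """Walk the trie one letter at a time; True only if the whole word was stored."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


WORDS = ["DAM", "QAM", "EAM", "KAM", "ECM", "EGM", "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]
trie = build_trie(WORDS)
print(sorted(trie))        # first-level branches
print(sorted(trie["E"]))   # second letters that follow E
```

Words sharing a prefix share a path, so a lookup costs one step per letter regardless of how many words are stored.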

  43. 5: Scan the database sequences
     (Figure: dot plot of word hits; axes: database sequence vs query sequence)

  44. Example: Scan this "database" for occurrences of your words
     MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA
     (Figure labels: E A M P Q L S V D A M)
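
The scan can be sketched in Python by sliding a 3-letter window along the database sequence and testing each window against the word list. For brevity this sketch uses a plain set lookup as a stand-in for the search tree; the function name `scan` is hypothetical, the word list is the 12-word list from the previous example, and positions are 0-based.

```python
WORDS = {"DAM", "QAM", "EAM", "KAM", "ECM", "EGM", "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"}

DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAV"
            "EAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA")


def scan(database, words, k=3):
    """Return (0-based position, word) for every window of the database found in words."""
    return [
        (i, database[i:i + k])
        for i in range(len(database) - k + 1)
        if database[i:i + k] in words
    ]


hits = scan(DATABASE, WORDS)
print(hits)
```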

  45. 6: Connect nearby occurrences (diagonal matches in Gapped BLAST)
     Two dots are connected if and only if they are less than A letters apart & are on the same diagonal
     (Figure: dot plot; axes: database sequence vs query sequence)
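
The connection rule can be sketched in Python by grouping hits by diagonal (database position minus query position) and pairing neighbours that lie closer than A letters. This is a simplified illustration under stated assumptions: hits are given as hypothetical `(query_pos, db_pos)` tuples, distance is measured along the query, and only consecutive hits on a diagonal are paired.

```python
from collections import defaultdict


def connect_hits(hits, max_distance):
    """Pair consecutive hits that share a diagonal and are < max_distance apart.

    hits: iterable of (query_pos, db_pos); the diagonal is db_pos - query_pos.
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    pairs = []
    for diagonal, query_positions in by_diagonal.items():
        query_positions.sort()
        for left, right in zip(query_positions, query_positions[1:]):
            if right - left < max_distance:
                pairs.append(((left, left + diagonal), (right, right + diagonal)))
    return pairs


# Toy hits: three on diagonal 10, one on diagonal 18.
example_hits = [(0, 10), (5, 15), (30, 40), (2, 20)]
print(connect_hits(example_hits, max_distance=10))
```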

  46. 7: Extend matches in both directions Scan DB

  47. 7: Extend matches, calculating score at each step
       Query sequence:     L P P Q G L L
       Database sequence:  M P P E G L L
       Word (P Q G vs P E G), BLOSUM62 scores: 7 + 2 + 6; word score = 15
       Extended in both directions, BLOSUM62 scores: 2 7 7 2 6 4 4; HSP SCORE = 32 (HSP = High Scoring Pair)
     • Each match is extended to left & right until a negative BLOSUM62 score is encountered
     • Extension step typically accounts for > 90% of execution time
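
The extension in this example can be sketched in Python using the simplified stopping rule stated on the slide (stop at the first negative pair score). The `PAIR_SCORES` table holds only the five BLOSUM62 values that appear in the example; treating every other pair as -1 is an assumption made to keep the sketch short, and `extend_seed` is a hypothetical name.

```python
# BLOSUM62 values for the pairs that occur in the slide's example.
PAIR_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4}


def pair_score(a, b):
    """Look up a pair in either order; unlisted pairs default to -1 (assumption)."""
    return PAIR_SCORES.get((a, b), PAIR_SCORES.get((b, a), -1))


def extend_seed(query, target, q_start, t_start, word_len):
    """Extend an ungapped seed left and right until a negative score or a sequence end.

    Returns (score, query_start, query_end) with a half-open query range.
    """
    score = sum(pair_score(query[q_start + i], target[t_start + i]) for i in range(word_len))

    left = 0
    while q_start - left - 1 >= 0 and t_start - left - 1 >= 0:
        s = pair_score(query[q_start - left - 1], target[t_start - left - 1])
        if s < 0:
            break
        score += s
        left += 1

    right = 0
    while q_start + word_len + right < len(query) and t_start + word_len + right < len(target):
        s = pair_score(query[q_start + word_len + right], target[t_start + word_len + right])
        if s < 0:
            break
        score += s
        right += 1

    return score, q_start - left, q_start + word_len + right


print(extend_seed("LPPQGLL", "MPPEGLL", q_start=2, t_start=2, word_len=3))
```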

  48. 8: Prune matches • Discard all matches that score below defined threshold

  49. 9: Evaluate significance (This slide has been changed!)
     • BLAST uses an analytical statistical significance calculation
     RECALL:
     • E-value: E = m × n × P
       m = total number of residues in database
       n = number of residues in query sequence
       P = probability that an HSP is the result of random chance
       The lower the E-value, the less likely the match results from random chance, and thus the higher its significance
     • Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n). Note (below) that bit score is linearly related to raw alignment score, so a higher S' means the alignment has higher significance:
       $S' = (\lambda S - \ln K) / \ln 2$
       where:
       λ = Gumbel distribution constant
       S = raw alignment score
       K = constant associated with scoring matrix
     For more details - see text & BLAST tutorial
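
Both formulas on this slide are one-liners in Python. In the sketch below the function names are hypothetical, and the numeric inputs (λ = 0.318, K = 0.134, P = 1e-12, a 2 × 10^8-residue database) are illustrative assumptions rather than values given in the lecture.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2, i.e. linear in the raw score S."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(m, n, p):
    """E = m * n * P with m database residues, n query residues, P chance probability."""
    return m * n * p


# Illustrative inputs only (assumed, not from the slides).
print(round(bit_score(32, lam=0.318, K=0.134), 2))
print(e_value(m=2 * 10**8, n=200, p=1e-12))
```

Because S' is linear in S, each extra raw-score point adds the same λ / ln 2 bits, whatever the starting score.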

  50. 10: Use Smith-Waterman algorithm (DP) to generate alignment
     • ONLY significant matches are re-analyzed using the Smith-Waterman DP algorithm
     • Alignments reported by BLAST are produced by dynamic programming
