
BCB 444/544

BCB 444/544. Lecture 13 Star Alignment & Clustal (for MSA) Perhaps: Profiles & Hidden Markov Models (HMMs) #13_Sept19. Required Reading ( before lecture). √ Mon Sept 17 - Lecture 12 Position Specific Scoring Matrices & PSI-BLAST Chp 6 - pp 75-78 (but not HMMs)

yves




Presentation Transcript


  1. BCB 444/544 Lecture 13 Star Alignment & Clustal (for MSA) Perhaps: Profiles & Hidden Markov Models (HMMs) #13_Sept19 BCB 444/544 F07 ISU Dobbs #13- Star Alignment; HMMs

  2. Required Reading (before lecture) √ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST • Chp 6 - pp 75-78 (but not HMMs) Wed Sept 19 - Lecture 13 (not covered on Exam 1): Profiles & Hidden Markov Models • Chp 6 - pp 79-84 • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315 http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html Fri Sept 21 - EXAM 1

  3. Assignments & Announcements √ Sun Sept 16 - Study Guide for Exam 1 was posted √ Mon Sept 17 - Answers to HW#2 were posted Thu Sept 20 - Lab = Optional Review Session for Exam Fri Sept 21 - Exam 1 - Will cover: • Lectures 2-12 (thru Mon Sept 17) • Labs 1-4 • HW2 • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

  4. Chp 5 - Multiple Sequence Alignment SECTION II SEQUENCE ALIGNMENT Xiong: Chp 5 Multiple Sequence Alignment • √Scoring Function • √Exhaustive Algorithms • Heuristic Algorithms • Star Alignment • Clustal • √Practical Issues • First, review MSA scoring briefly, then back to Star Alignment & ClustalW

  5. Scoring an Alignment - in Lecture 12, so will be covered on Exam 1 [Figure: example multiple alignment, annotated with "Gap penalty" and "ith column of alignment m"] In practice, simple scoring functions are used. Usually, columns are scored independently.

  6. Sum of Pairs (SP) Score [Figure: example multiple alignment with one column $m_i$ highlighted] • SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix • Compute for each column $m_i$: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^l$ is residue $l$ of column $i$ and $s$ is a PAM or BLOSUM score

  7. Example: Calculating SP Score
         m1 m2 m3
     M = F  -  G
         F  -  G
         F  Y  D
     Gap penalty = -8, s(-,-) = 0, BLOSUM 60
     $S(m) = S(m_1) + S(m_2) + S(m_3) = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D) = 15 - 16 + 0 + 4 - 6 = -3$
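The SP calculation on this slide can be reproduced with a short script. This is a minimal sketch, not the course's code: the function names are made up for illustration, and the three substitution scores (5, 4, -3) are simply read off the slide's arithmetic (15/3, 4, -6/2) rather than taken from a full BLOSUM table.

```python
from itertools import combinations

GAP = "-"

# Only the substitution scores needed for this example, inferred from the
# slide's arithmetic (assumption: not a complete BLOSUM matrix).
SCORES = {frozenset("F"): 5, frozenset("G"): 4, frozenset("GD"): -3}


def pair_score(a, b, matrix, gap_penalty):
    """Score one pair of symbols taken from the same alignment column."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0, as on the slide
    if a == GAP or b == GAP:
        return gap_penalty            # residue aligned with a gap
    return matrix[frozenset((a, b))]  # symmetric substitution score


def sp_score(alignment, matrix, gap_penalty):
    """Sum-of-pairs score: add up all pairwise scores, column by column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b, matrix, gap_penalty)
                     for a, b in combinations(column, 2))
    return total


alignment = ["F-G", "F-G", "FYD"]
print(sp_score(alignment, SCORES, gap_penalty=-8))  # -3
```

Columns are scored independently, so the loop over `zip(*alignment)` mirrors the sum $S(m_1) + S(m_2) + S(m_3)$.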

  8. Algorithms & Software for MSA? #1 Exhaustive Methods • √ Multidimensional dynamic programming (DP) • Divide-and-Conquer Alignment (DCA) - "semi-exhaustive" web-based version available - see textbook • Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!! Heuristic Methods • Progressive (Star Alignment, Clustal) • Iterative • Block-based

  9. Dynamic Programming for MSA • As with pairwise alignments, MSAs can be computed by dynamic programming* *(if you're not in a rush!) [Figure: 2D and 3D dynamic programming matrices]

  10. Generalized Needleman-Wunsch Algorithm (3D) Given 3 sequences x, y, and z, and a column score $\delta$, the main iteration loop is:
      $$S(i,j,k) = \max \begin{cases}
      S(i-1,j-1,k-1) + \delta(x_i, y_j, z_k) \\
      S(i-1,j-1,k) + \delta(x_i, y_j, -) \\
      S(i-1,j,k-1) + \delta(x_i, -, z_k) \\
      S(i-1,j,k) + \delta(x_i, -, -) \\
      S(i,j-1,k-1) + \delta(-, y_j, z_k) \\
      S(i,j-1,k) + \delta(-, y_j, -) \\
      S(i,j,k-1) + \delta(-, -, z_k)
      \end{cases}$$
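The seven-way recurrence for three sequences can be sketched directly. The code below is an illustrative sketch under stated assumptions: the column score is a toy sum-of-pairs function (match +1, mismatch -1, gap -2, gap-gap 0), and the function names are hypothetical. It returns only the optimal score, without traceback.

```python
from itertools import combinations, product


def column_score(col, match=1, mismatch=-1, gap=-2):
    """Toy SP score for one aligned column of three symbols (assumed scoring)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue                      # gap vs gap scores 0
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences via 3D DP."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    S = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    S[0][0][0] = 0
    # The 2^3 - 1 = 7 predecessor moves: each sequence either advances or gaps.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (x[pi] if di else "-",
                           y[pj] if dj else "-",
                           z[pk] if dk else "-")
                    cand = S[pi][pj][pk] + column_score(col)
                    if cand > S[i][j][k]:
                        S[i][j][k] = cand
    return S[nx][ny][nz]


print(align3_score("AC", "AC", "AC"))  # 6: two columns, three matching pairs each
```

The table has $(n+1)^3$ cells and each cell looks at 7 predecessors, which is where the exponential terms in the complexity discussion come from.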

  11. What Happens to Computational Complexity? Given k sequences of length n • Space for matrix: $O(n^k)$ • Neighbors/cell: $2^k - 1$ • Time to compute SP score: $O(k^2)$ • Overall runtime: $O(k^2 2^k n^k)$ • Wow!!!

  12. What's so bad about those exponents? Example: Running Time of DP for MSA • Overall runtime: $O(k^2 2^k n^k)$ Sequences? Globins only ≈150 aa!! But: There are fast heuristics

  13. Progressive Alignment: multiple alignment by adding sequences. Heuristic procedure: • Align most similar sequences first • Add sequences progressively Often: use guide tree to determine order of alignments. 2 Examples: Star Alignment, ClustalW

  14. Guide Trees Binary tree • Leaves correspond to sequences • Internal nodes represent alignments • Root corresponds to final MSA [Figure: guide tree over the sequences ATC, ATG, TCG, TCC; alignment shown: ATC-, ATG-, -TCG, -TCC]

  15. Star Alignment - skipped on Monday: will NOT be covered on Exam 1 • Back to 2 Examples of Progressive Alignment Heuristics for MSA: • STAR Alignment • Clustal

  16. Star Alignment • Fast heuristic to compute MSA • Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality Algorithm: • Compute pairwise similarities • Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$ • Add sequences in decreasing order of similarity to center $s_c$ • Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

  17. Step 2 - Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$ [Example sequences: FGGHL-GF, F-GHLPGF, FGGHP-FG, FGGHL-GF] Does that function look familiar? Recall: Consensus sequence = single sequence (more accurately, a "model") that represents the most common residue of each column in an MSA. Steiner consensus sequence or string: Given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$. "String" equivalent of arithmetic mean: consensus sequence is the string that minimizes the sum of edit distances to members of a family of strings (thus maximizing similarity score…)
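Center selection can be sketched as follows. This is an illustrative sketch only: the pairwise similarity S is taken to be a Needleman-Wunsch global alignment score with toy parameters (match +1, mismatch -1, gap -2), which is an assumption and not the scoring used in the lecture. The four sequences are the ones from the Step 3 example (MPE, MKE, MSKE, SKE).

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def choose_center(seqs):
    """Return (index of center, summed similarity of each sequence to the rest)."""
    totals = [sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
              for i, s in enumerate(seqs)]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals


seqs = ["MPE", "MKE", "MSKE", "SKE"]
center, totals = choose_center(seqs)
print(seqs[center], totals)  # MKE [-1, 3, 1, 1]
```

With these toy parameters the center is MKE, the sequence the Step 3 figure builds on (S2). Ties are broken in favor of the first sequence, which is an arbitrary choice.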

  18. Step 3 - Add sequences in decreasing order of similarity to center $s_c$
      Sequences: s1: MPE, s2: MKE, s3: MSKE, s4: SKE
      Pairwise alignments with s2: MPE / MKE (s1), MSKE / M-KE (s3), MKE / SKE (s4)
      S2+S3: MSKE, M-KE
      +S1:   M-PE, MSKE, M-KE
      +S4:   S-KE, M-PE, MSKE, M-KE

  19. Step 4 - Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$
      Pairwise:  Sc AA--CCTT      Sc A-ACC-TT
                 S1 AATGCC--      S2 AGACCGT-
      Merged:    S1 A-ATGCC---
                 Sc A-A--CC-TT
                 S2 AGA--CCGT-
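The merge step can be sketched with the usual "once a gap, always a gap" rule: whenever a new pairwise alignment needs a gap in the center that the growing MSA does not have yet, a gap column is inserted into every existing row. The function name and the row order (center first) are assumptions made for this sketch; the input is the pair of alignments shown on the slide.

```python
def merge_into_star(msa, center_aln, other_aln):
    """Add one sequence to a star MSA.

    msa        -- equal-length rows; row 0 is the (gapped) center sequence
    center_aln -- center as gapped in its pairwise alignment with the newcomer
    other_aln  -- the newcomer as gapped in that same pairwise alignment
    """
    rows = [[] for _ in msa]
    new_row = []
    i = j = 0
    width = len(msa[0])
    while i < width or j < len(center_aln):
        if i < width and msa[0][i] == "-":
            # Column where the MSA already gaps the center: keep it, pad newcomer.
            for r, row in enumerate(msa):
                rows[r].append(row[i])
            new_row.append("-")
            i += 1
        elif j < len(center_aln) and center_aln[j] == "-":
            # Newcomer needs a fresh center gap: open a gap column in all rows.
            for r in range(len(msa)):
                rows[r].append("-")
            new_row.append(other_aln[j])
            j += 1
        else:
            # Same center residue on both sides: copy the column through.
            for r, row in enumerate(msa):
                rows[r].append(row[i])
            new_row.append(other_aln[j])
            i += 1
            j += 1
    return ["".join(r) for r in rows] + ["".join(new_row)]


msa = ["AA--CCTT", "AATGCC--"]                  # Sc and S1, from the slide
msa = merge_into_star(msa, "A-ACC-TT", "AGACCGT-")  # add S2
for name, row in zip(("Sc", "S1", "S2"), msa):
    print(name, row)
```

Removing the columns in which both Sc and a given Si have gaps recovers the original pairwise alignment, which is the property the slide asks for.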

  20. Complexity of Star Alignment? Given k sequences of length n, and an upper bound l for alignment length. We need: • $O(k^2 n^2)$ to compute the alignments • $O(k^2)$ to compute the center • $O(k^2 l)$ to build multiple alignment Overall: $O(k^2 n^2)$ Duh - Is this really much better than $O(k^2 2^k n^k)$? YES! Remember: k = # of sequences, n = length of sequences
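To get a feel for the difference, the two growth terms can be evaluated for a few values of k. This is a rough sketch: it plugs numbers into the big-O expressions and ignores constant factors, so the results are orders of magnitude, not running times. The sequence length n = 150 follows the globin example.

```python
def dp_msa_cost(k, n):
    """Growth term for full multidimensional DP: k^2 * 2^k * n^k."""
    return k**2 * 2**k * n**k


def star_cost(k, n):
    """Growth term for star alignment: k^2 * n^2."""
    return k**2 * n**2


n = 150  # roughly the length of a globin
for k in (2, 3, 5, 10):
    print(f"k={k:2d}  full DP ~ {dp_msa_cost(k, n):.1e}  star ~ {star_cost(k, n):.1e}")
```

The ratio between the two is $2^k n^{k-2}$, so already at k = 3 the full DP term is 1200 times larger, and the gap widens exponentially with every added sequence.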

  21. CLUSTAL: Overview [Figure: pairwise alignments (1+2, 1+3, 1+4, 2+3, 2+4, 3+4) → distance matrix → guide tree → progressive alignment] • Compute pairwise alignments (DP) • Convert similarities into distances • Distance between a pair = # of mismatched positions in alignment (divided by total # of matches) • Build guide tree from distances by Neighbor Joining • Align with respect to guide tree

  22. CLUSTAL: Example [Figure: worked example with sequences 1-5]

  23. One "small" problem? Finding the Guide Tree [Figure: distance matrix and guide tree for sequences 1-5] Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves. Problem: Such a tree might not exist!
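One standard way to test whether such a tree exists is the four-point condition: a distance matrix fits a tree exactly only if, for every four leaves, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) are equal. The check below is a sketch that assumes a symmetric matrix with a zero diagonal; the function name and both example matrices are made up for illustration.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """True if every quadruple of leaves passes the four-point condition."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]))
        if abs(sums[2] - sums[1]) > tol:   # two largest sums must agree
            return False
    return True


# Distances read off a tree ((A,B),(C,D)) with all branch lengths equal to 1.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]

# Same matrix with d(A,D) changed to 5: no tree reproduces these distances.
not_tree_like = [[0, 2, 3, 5],
                 [2, 0, 3, 3],
                 [3, 3, 0, 2],
                 [5, 3, 2, 0]]

print(satisfies_four_point(tree_like), satisfies_four_point(not_tree_like))
```

Distances estimated from real alignments rarely pass this test exactly, which is why heuristics such as Neighbor Joining are used to build an approximate guide tree.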

  24. CLUSTAL W Tree Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

  25. Algorithms & Software for MSA? #2 √ Exhaustive Methods • Multidimensional dynamic programming (DP) • Divide-and-Conquer Alignment (DCA) - "semi-exhaustive" web-based version available - see textbook • Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!! Heuristic Methods • √Progressive (Star Alignment, Clustal) • Iterative • Block-based

  26. Algorithms & Software for MSA? #3 will NOT be covered on Exam 1 Heuristic Methods - continued • Progressive alignments (Star Alignment, Clustal) • Others: T-Coffee, DbClustal - see text: can be better than Clustal • Match closely-related sequences first using a guide tree • Partial order alignments (POA) • Doesn't rely on guide tree; adds sequences in order given • PRALINE • Preprocesses input sequences by building profiles for each • Iterative methods • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (e.g., PRRN) • Block-based Alignment • Multiple re-building attempts to find best alignment (e.g., DIALIGN2 & Match-Box) • Local alignments • Profiles, Blocks, Patterns - more on these soon!

  27. Chp 6 - Profiles & Hidden Markov Models SECTION II SEQUENCE ALIGNMENT Xiong: Chp 6 Profiles & HMMs • √Position Specific Scoring Matrices (PSSMs) • √PSI-BLAST First, review above briefly, then: • Profiles • Markov Models & Hidden Markov Models

  28. PSI-BLAST (Covered in Lecture 12, so will be covered on Exam 1) • Position Specific Iterated BLAST • Intuition: substitution matrices should be "sensitive" to protein context • e.g., larger penalty for Ala→Gly substitution if in a helix rather than in a loop • Basic idea: • Use BLAST with high stringency to generate a set of closely related sequences • Align those sequences to create a new substitution matrix for each position • Use this matrix (iteratively) to find additional sequences

  29. PSI-BLAST Pseudocode (PSSM = Position-Specific Scoring Matrix)
      Convert query to PSSM (or a Profile)
      do {
        BLAST database with PSSM
        Stop if no new homologs are found
        Add new homologs to PSSM
      }
      Print current set of homologs
      (One step in the loop is marked on the slide as requiring a user-defined threshold.) Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs - other authors refer to the pattern used by PSI-BLAST as a PSSM.
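The control flow of the pseudocode can be sketched as a runnable loop. This is not BLAST: `search` and `build_model` are hypothetical callables supplied by the caller, and the toy versions below (a "model" that is just a set of letters, a "search" that returns any database entry sharing a letter with it) exist only to make the iteration visible.

```python
def iterate_search(query, search, build_model, max_iterations=5):
    """Repeat search -> add new hits -> rebuild model until nothing new appears."""
    homologs = [query]
    model = build_model(homologs)
    for _ in range(max_iterations):
        new_hits = [h for h in search(model) if h not in homologs]
        if not new_hits:
            break                      # converged: no new homologs found
        homologs.extend(new_hits)
        model = build_model(homologs)  # the model grows with each iteration
    return homologs


# Toy stand-ins (assumptions for illustration only).
DATABASE = ["AB", "BC", "CD", "XY"]


def toy_build_model(seqs):
    return set("".join(seqs))


def toy_search(model):
    return [s for s in DATABASE if model & set(s)]


print(iterate_search("AB", toy_search, toy_build_model))  # ['AB', 'BC', 'CD']
```

Note that "CD" shares nothing with the query "AB" and is only reached through "BC". The same mechanism lets a single wrongly included sequence pull in further unrelated hits in later iterations.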

  30. What is a PSSM? Position-Specific Scoring Matrix [Figure: PSSM for an 8-residue sequence over the 20-letter alphabet; “K” at position 3 gets a score of 2] A PSSM is: • a representation of a motif • an n by m matrix, where n is size of alphabet & m is length of sequence • a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA. Also sometimes called: Position Weight Matrix (PWM). Note: Assumes positions are independent

  31. Assigning a "Match" Score with a PSSM: PSSM assigns sequence NMFWAFGH a score of: 0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12
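Scoring a sequence against a PSSM is a position-by-position lookup followed by a sum. In the sketch below the PSSM is stored as one dictionary per position, and only the eight entries quoted on the slide are filled in; the full 20-row matrix is not in the transcript, so the data structure and function name are assumptions for illustration.

```python
# One dict per position; only the entries quoted on the slide are known.
PSSM = [
    {"N": 0}, {"M": -2}, {"F": -3}, {"W": -2},
    {"A": -1}, {"F": 6}, {"G": 6}, {"H": 8},
]


def pssm_score(pssm, sequence):
    """Sum the per-position scores; positions are treated as independent."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must equal PSSM length")
    return sum(column[residue] for column, residue in zip(pssm, sequence))


print(pssm_score(PSSM, "NMFWAFGH"))  # 12
```

Because the positions are assumed independent, the total is a plain sum with no interaction terms between columns.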

  32. Creating a PSSM from 1 Sequence [Figure: query RNRGQFGH (length L) and the 20 by 20 BLOSUM62 matrix; the matrix row for each query residue, e.g. R, supplies one column of the resulting 20 by L PSSM]

  33. Creating a PSSM from Multiple Sequences • Discard columns that contain gaps in query sequence • Compute relative sequence weights • Compute PSSM entries, taking into account • Observed residues in column • Sequence weights • Substitution matrix

  34. 1- Discard Columns with Gaps in Query
      Alignment including columns where the query has gaps:
        EEFG----SVDGLVNNA
        QKYG----RLDVMINNA
        RRLG----TLNVLVNNA
        GGIG----PVD-LVNNA
        KALG----GFNVIVNNA
        ARFG----KID-LIPNA
        FEPEGPEKGMWGLVNNA
        AQLK----TVDVLINGA
      After discarding those columns:
        EEFGSVDGLVNNA
        QKYGRLDVMINNA
        RRLGTLNVLVNNA
        GGIGPVD-LVNNA
        KALGGFNVIVNNA
        ARFGKID-LIPNA
        FEPEGMWGLVNNA
        AQLKTVDVLINGA

  35. 2- Compute Sequence Weights • Smaller weights are assigned to redundant sequences • Larger weights are assigned to unique sequences
        EEFGSVDGLVNNA 1.2
        QKYGRLDVMINNA 1.2
        RRLGTLNVLVNNA 0.8
        GGIGPVDLLVNNA 0.8
        KALGGFNVIVNNA 1.1
        ARFGKIDTLIPNA 0.9
        FEPEGMWGLVNNA 1.1
        AQLKTVDVLINGA 1.3
      • How are weights determined? Based on branch lengths in guide tree: value for each sequence is then used to multiply raw alignment scores • Goal of weighting? To decrease matching scores of frequent characters in MSA & increase scores of infrequent characters

  36. 3- Compute PSSM Entries (simplified version) PSSM column = observed residues / background frequencies • Observed residues (one alignment column): E Q R G K A F A • Background frequencies (usually derived from large sequence database): A 0.085, C 0.019, D 0.054, E 0.065, F 0.040, G 0.072, H 0.023, I 0.058, K 0.056, L 0.096, M 0.024, P 0.053, Q 0.042, R 0.054, S 0.072, T 0.063, V 0.073, W 0.016, Y 0.034

  37. PSSM Entries = Log-Odds Scores • Estimate probability of observing each residue under the foreground model, i.e., the PSSM (probability of A given M, where M is the PSSM model; estimated from the observed frequency of residue “A”) • Divide by background probability of observing each residue (probability of A given B, where B is the background model) • Take log so that scores can be added (rather than multiplied): $\text{score}(A) = \log \frac{P(A \mid M)}{P(A \mid B)}$
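The three steps can be sketched for the alignment column E Q R G K A F A and the background frequencies listed on the previous slide. This is a simplified sketch, not PSI-BLAST's actual procedure: it uses base-2 logs, no pseudocounts (so residues absent from the column get no score at all), and optional per-sequence weights. The function name is hypothetical.

```python
import math
from collections import defaultdict

# Background frequencies as listed in the transcript (N is absent there).
BACKGROUND = {
    "A": 0.085, "C": 0.019, "D": 0.054, "E": 0.065, "F": 0.040,
    "G": 0.072, "H": 0.023, "I": 0.058, "K": 0.056, "L": 0.096,
    "M": 0.024, "P": 0.053, "Q": 0.042, "R": 0.054, "S": 0.072,
    "T": 0.063, "V": 0.073, "W": 0.016, "Y": 0.034,
}


def log_odds_column(column, background, weights=None):
    """log2(observed frequency / background frequency) per observed residue."""
    if weights is None:
        weights = [1.0] * len(column)      # unweighted: every sequence counts once
    totals = defaultdict(float)
    for residue, w in zip(column, weights):
        totals[residue] += w
    norm = sum(weights)
    return {res: math.log2((t / norm) / background[res])
            for res, t in totals.items()}


column = "EQRGKAFA"
scores = log_odds_column(column, BACKGROUND)
for residue, score in sorted(scores.items()):
    print(residue, round(score, 2))
```

Passing the sequence weights from the weighting slide (1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3) as `weights` replaces raw counts with weighted counts. Practical implementations also add pseudocounts so that unobserved residues receive a finite score.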

  38. Why (not) PSI-BLAST? • PSI-BLAST weights sequences according to observed diversity specific to family under investigation • Advantage: If sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly • Disadvantage: However, if any non-homologous sequences are included in PSSMs, they become “corrupted” and "pull in" additional non-homologous sequences, resulting in false positive hits

  39. How to Use PSI-BLAST Effectively • Set initial thresholds high • Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!) • Do several iterations (~5), or until no new sequences are found • Make initial search very broad • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM • Then use that PSSM to search in a more restricted domain, if possible • Be particularly cautious about matches to sequences with highly biased amino acid content

  40. Summary: DP, BLAST & PSI-BLAST • Dynamic programming is O(NM) for pairwise alignment • BLAST is O(M) • BLAST produces an index of words in query sequence that allows fast matching to the database • At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold • PSI-BLAST iterates BLAST, adding new homologs at each iteration

  41. Applications of MSA • Building phylogenetic trees • Finding conserved patterns: • Regulatory motifs (TF binding sites) • Splice sites • Protein domains • Identifying and characterizing protein families • Find out which protein domains have same function • Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms) • DNA fragment assembly (in genomic sequencing)

  42. Application: Discover Conserved Patterns Is there a conserved cis-acting regulatory sequence? Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent. TATA box = transcriptional promoter element [Figure: sequence logo]

  43. Sequence Motifs (Patterns) Other types of representations? • √ Consensus Sequence • √ PSSM - Position-Specific Scoring Matrix • √ Sequence Logo - "enhanced" consensus sequence, in which symbol size ∝ information entropy • Information entropy??? In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data. - Wikipedia • Check out this fun website: Tom Schneider, NCIF • http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo • Profile • HMM - Hidden Markov Model
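The quantity that sets the height of a logo column can be sketched as the maximum possible entropy minus the observed Shannon entropy of the column. This is a minimal sketch with hypothetical function names; it leaves out the small-sample correction that sequence logo software normally applies, and it takes the alphabet size as a parameter (4 for DNA, 20 for protein).

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residue distribution in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def information_content(column, alphabet_size):
    """Bits of information: log2(alphabet size) minus the observed entropy."""
    return math.log2(alphabet_size) - column_entropy(column)


print(information_content("AAAA", 4))  # 2.0 bits: perfectly conserved DNA column
print(information_content("ACGT", 4))  # 0.0 bits: all four bases equally likely
```

A fully conserved position carries the maximum information and is drawn tallest in the logo, while a position where every symbol is equally frequent contributes nothing.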
