COT 6930 HPC and Bioinformatics Multiple Sequence Alignment

COT 6930 HPC and Bioinformatics: Multiple Sequence Alignment. Xingquan Zhu, Dept. of Computer Science and Engineering. Outline: Multiple Sequence Alignment (What, Why, and How); Multiple Sequence Alignment Methods (Multidimensional dynamic programming, Star Alignment, Tree Alignment)


  1. COT 6930 HPC and Bioinformatics: Multiple Sequence Alignment. Xingquan Zhu, Dept. of Computer Science and Engineering

  2. Outline • Multiple Sequence Alignment • What, Why, and How • Multiple Sequence Alignment Methods • Multidimensional dynamic programming • Star Alignment • Tree Alignment • Progressive Alignment • ClustalW: a widely used algorithm • Iterative Alignment • Genetic Algorithm

  3. What is a Multiple Sequence Alignment? • Pairwise alignments involve two sequences • Multiple sequence alignments involve more than two sequences (often hundreds, either nucleotide or protein) • A formal definition: a multiple alignment of strings S1, …, Sk is a series of strings S1’, …, Sk’ with spaces such that |S1’| = … = |Sk’| and each Sj’ is an extension of Sj by insertion of spaces • Goal: find an optimal multiple alignment. Example:

     Hs  ---MK----- --LSLVAAML LLLSAARAEE EDKK-EDVGT VVGIDLGTTY
     Sp  ---MKKFQLF SILSYFVALF LLPMAFASGD DNST-ESYGT VIGIDLGTTY
     Tg  MTAAKKLSLF SLAALFCLLS VATLRPVAAS DAEEGKVKDV VIGIDLGTTY
     Pf  --------MN QIRPYILLLI VSLLKFISAV DSN---IEGP VIGIDLGTTY

  4. Why do we do multiple alignments? • To reveal the relationships among a group of sequences (homology) • Simultaneous alignment of similar gene sequences may • Discover the conserved regions in genes • Determine the consensus sequence of the aligned sequences • Help define a protein family that may share a common biochemical function or evolutionary origin, and thus reveal the evolutionary history of the sequences • Help predict the secondary and tertiary structures of new sequences

  5. MSA Methods • Multidimensional dynamic programming • Extension of DP to multiple (3) sequences • Star Alignment, Tree Alignment, Progressive Alignment • Start with an alignment of the most similar sequences and build up the alignment by adding more sequences • Iterative methods • Make an initial alignment of groups of sequences, then revise the alignment to achieve a more reasonable result

  7. Multiple Sequence Alignment by DP • Pairwise sequence alignment • a scoring matrix in which each position holds the score of the best alignment up to that point • Extension to 3 sequences • the lattice of a cube that is filled with the calculated dynamic programming scores • Scoring positions on the 3 faces of the cube represent the alignment of a pair of sequences

  8. Scoring of MSA: Sum of Pairs • Score of a column = sum of the scores of all possible pairs of symbols in that column • k(k-1)/2 pairs per column for k sequences • Using the BLOSUM62 matrix and a gap penalty of -8 • In column 1, we have the pairs (-,S), (-,S), and (S,S), so the column score is -8 - 8 + 4 = -12

  9. Sum of Pairs • Given 5 sequences:

     N C C E
     N N C E
     N - C N
     S C S N
     S C S E

     • How many possible combinations of pairwise alignments are there for each position?

  10. Sum of Pairs • Assume: match/mismatch/gap = 1/0/-1

     N C C E
     N N C E
     N - C N
     S C S N
     S C S E

     • The 1st position: # of N-N (3), # of S-S (1), # of N-S (6), so SP(1) = 4*1 + 0*6 + (-1)*0 = 4 • The 2nd position: # of C-C (3), # of N-C (3), # of gaps (4), so SP(2) = 3*1 + 0*3 + (-1)*4 = -1
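
The column scores on this slide can be checked with a short Python sketch. The function names are made up for illustration, and treating a gap-gap pair as scoring 0 is an assumption (the slide's example never contains such a pair).

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of aligned symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap paired with a gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column, **scores):
    """Sum-of-pairs score of one column: all k(k-1)/2 symbol pairs."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(column, 2))

def sp_scores(alignment, **scores):
    """Sum-of-pairs score of every column of an alignment (equal-length rows)."""
    return [sp_column(col, **scores) for col in zip(*alignment)]

alignment = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
print(sp_scores(alignment))  # first two entries are SP(1) = 4 and SP(2) = -1
```

Summing the list gives the score of the whole alignment under this scheme.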

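  11. Dynamic programming matrix: pairwise alignment. [Figure: DP matrix aligning Seq 1 against Seq 2 (residues shown: G T G C T T G A T G G C C T); the three moves into a cell are labeled gap in sequence 1, match/mismatch, and gap in sequence 2.]

  12. Dynamic programming matrix: multiple sequence alignment. [Figure: three-dimensional DP lattice with Seq 1, Seq 2, and Seq 3 on its axes (residues shown: S M V V M A and S M T); many possible moves lead into each cell.]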

  13. DP Alignment Examples • All three match/mismatch • Sequence 1 & 2 match/mismatch with gap in 3 • Sequence 1 & 3 match/mismatch with gap in 2 • Sequence 2 & 3 match/mismatch with gap in 1 • Sequence 1 with gaps in 2 & 3 • Sequence 2 with gaps in 1 & 3 • Sequence 3 with gaps in 1 & 2 • Choose the largest value among the above seven possibilities
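
A minimal sketch of the three-sequence recurrence, assuming a sum-of-pairs column score with match/mismatch/gap = 1/0/-1 and gap-gap pairs scoring 0 (these choices, and the names `sp3` and `align3_score`, are illustrative and not from the slides). Each cell takes the largest value over the seven moves listed above; only the optimal score is returned, not the alignment.

```python
from itertools import product

def sp3(a, b, c, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score of one column of three symbols ('-' is a gap)."""
    total = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue  # assumption: gap-gap pairs score 0
        total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total

def align3_score(s1, s2, s3):
    """Optimal sum-of-pairs score for aligning three sequences by 3-D DP."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    D = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # Seven moves: each sequence either consumes a residue (1) or takes a gap (0),
    # excluding the all-gap move (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # move would step outside the lattice
            column = (s1[pi] if di else "-",
                      s2[pj] if dj else "-",
                      s3[pk] if dk else "-")
            D[i][j][k] = max(D[i][j][k], D[pi][pj][pk] + sp3(*column))
    return D[n1][n2][n3]

print(align3_score("MPE", "MKE", "SKE"))
```

The table has (n1+1)(n2+1)(n3+1) cells, which is why this approach stops being practical beyond a handful of sequences.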

  14. Computational Complexity • For protein sequences, each 300 amino acids in length and excluding gaps, with the DP algorithm • Two sequences: 300^2 comparisons • Three sequences: 300^3 comparisons • N sequences: 300^N comparisons, i.e. O(L^N), where L is the length of the sequences and N is the number of sequences • The number of comparisons and the memory required are too large for N > 3, so the method is not practical
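
To get a feel for this growth, the following sketch counts the cells of the DP lattice and the moves examined per cell. It counts (L+1)^N cells because the lattice includes the border row and column, which is slightly more than the 300^N figure on the slide; the function names are illustrative.

```python
def dp_cells(length, n_seqs):
    """Cells in an N-dimensional DP lattice for sequences of equal length."""
    return (length + 1) ** n_seqs

def dp_moves(n_seqs):
    """Moves into each cell: every residue/gap combination except all gaps."""
    return 2 ** n_seqs - 1

for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n), dp_moves(n))
```

For three sequences `dp_moves` gives the seven possibilities of the DP slide, and every additional sequence multiplies the number of cells by roughly 300.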

  16. Star Alignments • Heuristic method for multiple sequence alignment • Select a sequence sc as the center of the star • For each sequence si among s1, …, sk with i ≠ c, perform a global alignment of si with sc (using DP) • Aggregate the alignments using the principle “once a gap, always a gap.”

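  17. Star Alignments Example • Sequences: s1: MPE, s2: MKE, s3: MSKE, s4: SKE, with s2 at the center of the star • Pairwise alignments with the center:

     s1 / s2      s3 / s2      s2 / s4
     MPE          MSKE         MKE
     | |           ||           ||
     MKE          -MKE         SKE

     • After merging s1 and s3:

     -MPE
     -MKE
     MSKE

     • After merging s4:

     -MPE
     -MKE
     MSKE
     -SKE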
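
The merging step can be sketched in Python as follows. The scoring values (match 1, mismatch -1, gap -2) and all function names are assumptions for illustration; with different scores or tie-breaking, the gap in the center may land in a different column than on the slide, but the "once a gap, always a gap" rule is applied the same way.

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)
    ga, gb, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            ga.append(a[i - 1])
            gb.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ga.append(a[i - 1])
            gb.append("-")
            i -= 1
        else:
            ga.append("-")
            gb.append(b[j - 1])
            j -= 1
    return "".join(reversed(ga)), "".join(reversed(gb))

def merge(msa, center_gapped, other_gapped):
    """Fold one pairwise alignment into the MSA; msa[0] is the gapped center."""
    old = msa[0]
    rows = [[] for _ in msa]
    new_row = []
    i = j = 0
    while i < len(old) or j < len(center_gapped):
        old_gap = i < len(old) and old[i] == "-"
        new_gap = j < len(center_gapped) and center_gapped[j] == "-"
        take_old = old_gap or not new_gap  # consume an existing MSA column
        take_new = new_gap or not old_gap  # consume a pairwise column
        for out, row in zip(rows, msa):
            out.append(row[i] if take_old else "-")  # existing gaps are never removed
        new_row.append(other_gapped[j] if take_new else "-")
        i += take_old
        j += take_new
    return ["".join(r) for r in rows] + ["".join(new_row)]

def star_align(seqs, center_idx, **scores):
    """Align every sequence to the center and merge; rows come back in input order."""
    msa, order = [seqs[center_idx]], [center_idx]
    for idx, s in enumerate(seqs):
        if idx != center_idx:
            center_gapped, s_gapped = nw_align(seqs[center_idx], s, **scores)
            msa = merge(msa, center_gapped, s_gapped)
            order.append(idx)
    result = [None] * len(seqs)
    for row, idx in zip(msa, order):
        result[idx] = row
    return result

for row in star_align(["MPE", "MKE", "MSKE", "SKE"], center_idx=1):
    print(row)
```

Every output row has the same length, and stripping the gaps from a row recovers the original sequence.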

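  18. Choosing a center • Try them all and pick the one with the best score • Calculate all O(k^2) pairwise alignments, and pick the sequence sc that maximizes the sum of its pairwise alignment scores with all the other sequences, Σ_{i≠c} sim(sc, si)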
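
A minimal sketch of this selection, assuming a simple linear-gap scoring scheme (match 1, mismatch -1, gap -2) that is not taken from the slides. `nw_score` and `choose_center` are illustrative names.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def choose_center(seqs, **scores):
    """Return (index of the center, summed pairwise scores for every sequence)."""
    totals = [
        sum(nw_score(s, t, **scores) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    best = max(range(len(seqs)), key=totals.__getitem__)
    return best, totals

center, totals = choose_center(["MPE", "MKE", "MSKE", "SKE"])
print(center, totals)
```

Under these assumed scores the four sequences MPE, MKE, MSKE, SKE select MKE as the center. Each pair is scored twice here for simplicity; caching the symmetric scores would halve the work.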

  19. Star Alignment Example • S1 = ATTGCCATT • S2 = ATGGCCATT • S3 = ATCCAATTTT • S4 = ATCTTCTT • S5 = ATTGCCGATT • Scores listed on the slide, one per sequence: 2, 1, -11, -3, -17

  20. Star Alignments Example Merging Pairwise Alignment

  21. Star Alignment Example Merging Pairwise Alignment

  22. Analysis • Assuming all sequences have length n • O(n^2) to calculate one global alignment • O(k^2) global alignments to calculate • Using a reasonable data structure for joining alignments, merging costs no worse than O(kl), where l is an upper bound on the alignment lengths • O(k^2 n^2 + kl) = O(k^2 n^2) overall cost

  24. Tree Alignment, Consensus String • Each edge of the tree joins two sequences S1 and S2 and has weight sim(S1, S2) • Compute the overall similarity from the pairwise alignments along the edges • The sum of all these weights is the score of the tree • The consensus string derived from a multiple alignment is the concatenation of the consensus characters for each column • The consensus character for a column is the character that minimizes the summed distance to it from all the characters in that column
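
The consensus definition can be sketched as below. The 0/1 distance (0 for identical characters, 1 otherwise), the restriction of candidates to characters that occur in the column, and alphabetical tie-breaking are all assumptions made for illustration.

```python
def consensus(alignment, dist=lambda a, b: 0 if a == b else 1):
    """Consensus string of an alignment given as equal-length gapped rows."""
    out = []
    for col in zip(*alignment):
        # Candidates are the characters seen in this column; sorting makes
        # ties resolve deterministically (alphabetically first wins).
        candidates = sorted(set(col))
        best = min(candidates, key=lambda ch: sum(dist(ch, x) for x in col))
        out.append(best)
    return "".join(out)

print(consensus(["CAT", "CTG", "CAG"]))
```

With the 0/1 distance the minimizing character is simply the most frequent one in the column; passing a different `dist` (for example one derived from a substitution matrix) changes the choice.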

  25. Tree Alignment Example • [Figure: tree with nodes CAT, CTG, CG, and GT, the gapped forms -GT and C-G, and edge weights 3, 3, 1, 0, 1; the scoring system appears only in the slide image.] • We have a score of 8

  26. Tree Alignment Example

  27. Example

  28. Example

  29. Example

  30. Example

  31. Example

  32. Example

  33. Example

  34. Analysis • We don’t know the correct tree • Without the tree, the tree alignment problem is NP-complete • Likely only exponential-time solutions are available (for optimal answers)

  36. Progressive Methods • A DP-based MSA program is limited to 3 sequences or to a small number of relatively short sequences • Progressive alignment uses DP to build an MSA, starting with the most closely related sequences and then progressively adding less related sequences or groups of sequences to the initial alignment • Most commonly used approach

  37. Progressive Methods • Progressive alignment is heuristic. • It does not separate the process of scoring an alignment from the optimization algorithm • It does not directly optimize any global scoring function of “alignment correctness”. • It is fast and efficient, and the results are reasonable. We will illustrate this using ClustalW.

  38. Progressive MSA occurs in 3 stages • Do a set of global pairwise alignments (Needleman and Wunsch) • Create a guide tree • Progressively align the sequences
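
As a rough sketch of stage 2, the code below builds a guide tree by repeatedly joining the two closest clusters, using the average pairwise distance between their members. This average-linkage scheme is a simplification chosen for brevity (ClustalW itself builds its guide tree by neighbor joining), and the distance values in the usage example are made up; in practice they would come from the stage 1 pairwise alignments.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Build a nested-tuple guide tree by average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def avg(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]

# Hypothetical distances: A and B are close, C and D are close.
d = {frozenset(p): 10 for p in combinations("ABCD", 2)}
d[frozenset("AB")] = 1
d[frozenset("CD")] = 2
print(guide_tree(list("ABCD"), d))
```

Reading the nested tuples from the innermost pair outward gives the order in which stage 3 would align sequences and groups.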

  39. ClustalW Procedure

  40. Progressive Methods: ClustalW • http://www.ebi.ac.uk/clustalw/ • ClustalW is a general-purpose multiple alignment program for DNA or proteins. • ClustalW: the W stands for “weighting”, representing the ability of the program to apply weights to the sequences and to program parameters. • ClustalX provides a graphical interface

  41. Use Clustal W to do a progressive MSA • File input in GCG, FASTA, EMBL, GenBank, Phylip, or several other formats • [Figure labels: operational options; input options, matrix choice, gap opening penalty; output options; gap information, output tree type]

  42. Progressive MSA stage 3 of 3 : progressive alignment • Make a MSA based on the order in the guide tree • Start with the two most closely related sequences • Then add the next closest sequence • Continue until all sequences are added to the MSA

  43. Problems w/ Progressive Alignment • Highly sensitive to the choice of the initial pair to align • The very first sequences to be aligned are the most closely related on the sequence tree. If they align well, there are few errors in the initial alignment • The more distantly related these sequences are, the more errors there are • Errors in the initial alignment are propagated to the MSA

  45. Iterative Methods • Results do NOT depend on the initial pairwise alignment (recall progressive methods) • Start with an initial alignment and repeatedly realign groups of the sequences • Repeat until the MSA no longer changes significantly from one iteration to the next • The alignment improves with successive iterations • An example is the genetic algorithm approach

  46. Genetic Algorithms • A general problem solving method modeled on evolutionary change. • Inspired by the biological evolution process • Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859) • Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations. • Use survival of the fittest, mutation, and crossover to guide evolution.

  47. Genetic Search Algorithms • Random generation (candidate solutions) • Evaluation (fitness function) • Crossover + Mutation (change some selected candidate solutions, both to converge toward the optimal solution and to avoid getting stuck at a local extreme) • Selection (candidate solutions with larger fitness values have a larger chance of being included)
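
The cycle above can be sketched for MSA as follows. Everything specific here is an assumption for illustration: a candidate is a list of gapped rows of one fixed width, fitness is the sum-of-pairs score with match/mismatch/gap = 1/0/-1, mutation re-places the gaps of one row, crossover takes each row from one of two parents, and selection simply keeps the fitter half of the population (a simplification of fitness-proportional selection).

```python
import random
from itertools import combinations

def sp_score(rows, match=1, mismatch=0, gap=-1):
    """Fitness: sum-of-pairs score over all columns (gap-gap pairs score 0)."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def random_row(seq, width, rng):
    """Place the residues of seq, in order, at random positions of a row."""
    slots = sorted(rng.sample(range(width), len(seq)))
    row = ["-"] * width
    for pos, ch in zip(slots, seq):
        row[pos] = ch
    return "".join(row)

def mutate(rows, rng):
    """Mutation: re-place the gaps of one randomly chosen row."""
    rows = list(rows)
    i = rng.randrange(len(rows))
    rows[i] = random_row(rows[i].replace("-", ""), len(rows[i]), rng)
    return rows

def crossover(p1, p2, rng):
    """Crossover: take each row from one parent or the other."""
    return [rng.choice(pair) for pair in zip(p1, p2)]

def ga_msa(seqs, width, pop_size=30, generations=100, seed=0):
    """Evolve alignments of fixed width (width >= longest sequence, pop_size >= 4)."""
    rng = random.Random(seed)
    pop = [[random_row(s, width, rng) for s in seqs] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sp_score, reverse=True)
        survivors = pop[: pop_size // 2]  # selection; the current best always survives
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            child = crossover(p1, p2, rng)
            if rng.random() < 0.5:
                child = mutate(child, rng)
            children.append(child)
        pop = survivors + children
    return max(pop, key=sp_score)

best = ga_msa(["MPE", "MKE", "MSKE", "SKE"], width=5)
print("\n".join(best), sp_score(best))
```

Because all rows share one width, a child built from rows of two parents is always a valid alignment, which keeps the crossover operator trivial. Columns consisting only of gaps can appear and would be stripped in a real implementation.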
