1 / 38

Multiple Sequence Alignment

Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment. Up until now we have only tried to align two sequences. What about more than two? And what for? A faint similarity between two sequences becomes significant if present in many

devin
Télécharger la présentation

Multiple Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multiple Sequence Alignment

  2. Multiple Alignment versus Pairwise Alignment • Up until now we have only tried to align two sequences. • What about more than two? And what for? • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

  3. Generalizing the Notion of Pairwise Alignment • Alignment of 2 sequences is represented as a 2-row matrix • In a similar way, we represent alignment of 3 sequences as a 3-row matrix A T - G C G - A _ C G T - A A T C A C - A • Score: more conserved columns, better alignment

  4. Aligning Three Sequences source • Same strategy as aligning two sequences • Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align • For global alignments, go from source to sink sink

  5. 2-D vs 3-D Alignment Grid V W 2-D edit graph 3-D edit graph

  6. 2-D cell versus 2-D Alignment Cell In 2-D, 3 edges in each unit square In 3-D, 7 edges in each unit cube

  7. Architecture of 3-D Alignment Cell (i-1,j,k-1) (i-1,j-1,k-1) (i-1,j,k) (i-1,j-1,k) (i,j,k-1) (i,j-1,k-1) (i,j,k) (i,j-1,k)

  8. si-1,j-1,k-1 + (vi, wj, uk) si-1,j-1,k + (vi, wj, _ ) si-1,j,k-1 + (vi, _, uk) si,j-1,k-1 + (_, wj, uk) si-1,j,k + (vi, _ , _) si,j-1,k + (_, wj, _) si,j,k-1 + (_, _, uk) Multiple Alignment: Dynamic Programming cube diagonal: no indels • si,j,k = max • (x, y, z) is an entry in the 3-D scoring matrix face diagonal: one indel edge diagonal: two indels

  9. Multiple Alignment: Running Time • For 3 sequences of length n, the run time is 7n3; O(n3) • For ksequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk) • Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

  10. Practically speaking • Multiple alignment is a hard problem • Yet, it is of extreme practical importance • Comparing several species is a very common task • Doing this for the entire genome is an increasingly common demand! • Several heuristic-based algorithms have been developed that employ greedy, divide-and-conquer, dynamic programming approaches in various combinations • The algorithms we will see today are actually used by current multiple aligners

  11. Multiple Alignment Induces Pairwise Alignments Every multiple alignment induces pairwise alignments x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

  12. Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments Given 3 arbitrary pairwise alignments: x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG can we construct a multiple alignment that induces them?

  13. Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments Given 3 arbitrary pairwise alignments: x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG can we construct a multiple alignment that induces them? NOT ALWAYS Pairwise alignments may be inconsistent

  14. Inferring Multiple Alignment from Pairwise Alignments • From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal • It is difficult to infer a ``good” multiple alignment from optimal pairwise alignments between all sequences

  15. Profile Representation of Multiple Alignment - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4 In the past we were aligning a sequence against a sequence Can we align a sequence against a profile? Can we align a profile against a profile?

  16. Multiple Alignment: Greedy Approach • Choose most similar pair of strings and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat • This is a heuristic greedy method u1= ACg/tTACg/tTACg/cT… u2 = TTAATTAATTAA… … uk = CCGGCCGGCCGG… u1= ACGTACGTACGT… u2 = TTAATTAATTAA… u3 = ACTACTACTACT… … uk = CCGGCCGGCCGG k-1 k

  17. Progressive Alignment • Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments. • Use profiles to compare sequences • Gaps in consensus string are permanent

  18. ClustalW • Popular multiple alignment tool today • ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently). • Three-step process 1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive Alignment guided by the tree

  19. v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 - Step 1: Pairwise Alignment • Aligns each sequence again each other giving a similarity matrix • Similarity = exact matches / sequence length (percent identity)

  20. Step 2: Guide Tree • Create Guide Tree using the similarity matrix • ClustalW uses the “neighbor-joining method” • Guide tree roughly reflects evolutionary relations

  21. v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 - Step 2: Guide Tree (cont’d) v1 v3 v4 v2 Calculate:v1,3 = alignment (v1, v3)v1,3,4 = alignment((v1,3),v4)v1,2,3,4 = alignment((v1,3,4),v2)

  22. Step 3: Progressive Alignment • Start by aligning the two most similar sequences • Following the guide tree, add in the next sequences, aligning to the existing alignment • Insert gaps as necessary

  23. Threaded Blockset Aligner (TBA) Blanchette et al. 2004 • A relatively new multiple alignment method • Proposes a new way to represent sequence similarity among multiple sequences • slightly different from usual multiple alignment • more suited for whole genome comparisons • Evaluation of alignment

  24. Drawbacks of usual alignment • To align several full genomes, one common approach is to align a “reference genome” (e.g., human) to all others • Simple approach, easy to visualize • Drawback: if some sequence is conserved in a subset of species, but not in reference, will not see that • Drawback: choice of reference makes a difference. Different references may result in different alignments

  25. New approach • Instead of producing a standard multiple alignment (one large matrix of characters and gaps)… • … Produce a set of “blocks”. Each block is a local alignment of some or all of the sequences

  26. Figure 1 (A) Blocks (alignments) of a hypothetical threaded blockset for sequences h (400 bp), m (400 bp) and r (350 bp) (B) Projection of the threaded blockset onto m Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715 Set of blocks

  27. Figure 1 (A) Blocks (alignments) of a hypothetical threaded blockset for sequences h (400 bp), m (400 bp) and r (350 bp) (B) Projection of the threaded blockset onto m Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715 Projected on “m”

  28. Definition: Blockset • Block: rectangular array (matrix) of symbols (bases and dashes). • Removing dashes from any row gives a substring of an original sequence • No column can be dashes only • A set of blocks is a blockset

  29. Definition: Ref-blockset • “S-Ref” blockset: A blockset where every block has a row for species S. • S is called the reference for this blockset • Every position of the original sequence from S is in at most one block (not repeated) • Example: “human-ref blockset”, where a segment of the human chromosome is the reference

  30. Definition: Threaded blockset • A sequence S “threads” a blockset if every position in S occurs precisely once in some block of the blockset • By definition, S threads an “S-ref blockset” • If a blockset is threaded by each of the original sequences, it is a “threaded blockset”

  31. Threaded blockset advantages • “Project” a threaded blockset onto any of the original sequences: • pick the blocks having that sequence • order them according to position in sequence • no inconsistency when projecting to different reference sequences • Threaded blockset allows for evolutionary operations like inversions and duplications • Traditional multiple alignment does not accommodate these

  32. Figure 2 (A) Alignments between the chloroplast genomes of Arabidopsis thaliana and Oenothera elata (evening primrose) Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715

  33. Threaded Blockset Aligner (TBA) • Finds threaded blocksets • Uses an alignment algorithm called MULTIZ • MULTIZ uses ideas similar to progressive alignment

  34. Figure 5 Pictorial representation of an application of MULTIZ Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715

  35. Figure 5 Pictorial representation of an application of MULTIZ Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715 • M is an HUMAN-ref blockset • N is a COW-ref blockset • G is a pairwise HUMAN-ref blockset • MULTIZ combines M and N, using G as a guide • Finds a segment “h” of HUMAN in M and • a segment “c” of COW in N, such that • “h” and “c” are aligned according to G. • Then aligns “h” and “c” using dynamic programming, using G as guide.

  36. Evaluation • Evaluating alignment programs is very difficult • What is a benchmark here ? • We haven’t witnessed the process of evolution, so we cannot say for certain what the true alignment of “extant” sequences should be • One approach: “simulate” evolution

  37. Simulating evolution • Generate a random sequence and introduce realistic evolutionary changes to it, along branches of an assumed phylogeny • Evolutionary changes made, and their rates, must be realistic • Substitutions, insertions, deletions, insertion & deletion rates, duplications, introduction of repeat elements, etc.

  38. Evaluating alignment • Once simulation done, take all the sequences at the leaf nodes of the phylogeny (started with root) • Align these sequences using software • Compare computed alignment and known (“true”) alignment

More Related