
Trees, Stars, and Multiple Biological Sequence Alignment

Trees, Stars, and Multiple Biological Sequence Alignment. Jesse Wolfgang CSE 497 February 19, 2004. Importance?. Molecular evolution (Dayhoff). RNA folding (Trifonov, Bolshoi). Gene regulation (Galas et al.). Protein structure-function relationships (Wu, Kabat). Introduction.


Presentation Transcript


  1. Trees, Stars, and Multiple Biological Sequence Alignment Jesse Wolfgang CSE 497 February 19, 2004

  2. Importance? • Molecular evolution (Dayhoff) • RNA folding (Trifonov, Bolshoi) • Gene regulation (Galas et al.) • Protein structure-function relationships (Wu, Kabat)

  3. Introduction • Original sequence unknown • Must consider all possible transformations • Including insertions, deletions, and replacements • Choose the most likely set of transformations • With a given model of protein evolution

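  4. Sequences and Alignments • K-sequence: sequence of k characters • An alignment of the sequences is written as a set of aligned sequences, one per input sequence • Each aligned sequence is obtained from its input sequence by inserting blanks • Blanks are inserted in positions where some of the other sequences have a nonblank character • At least one aligned sequence must be nonblank at each position • All aligned sequences share one length, the length of the alignment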

  5. Alignments • Ex: sequences DQLF, DNVQ, QGL • One alignment of the three sequences:
        D - - Q - L F
        D N V Q - - -
        - - - Q G L -
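
The conditions on an alignment can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_valid_alignment` and the use of `-` as the blank character are assumptions made for illustration.

```python
GAP = "-"  # assumed blank character


def is_valid_alignment(sequences, rows):
    """Check the conditions from the definition of an alignment."""
    if len(sequences) != len(rows) or not rows:
        return False
    length = len(rows[0])
    # All aligned rows share one length.
    if any(len(row) != length for row in rows):
        return False
    # Removing the blanks from a row must give back its input sequence.
    if any(row.replace(GAP, "") != seq for seq, row in zip(sequences, rows)):
        return False
    # Every column needs at least one nonblank character.
    return all(any(row[i] != GAP for row in rows) for i in range(length))


seqs = ["DQLF", "DNVQ", "QGL"]
rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(is_valid_alignment(seqs, rows))  # True
```

Appending an all-blank column to every row would make the check fail, since that column carries no character from any sequence.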

  6. Lattices and Paths • A lattice is built from the sequences and their lengths • Consists of hypercubes with one dimension per sequence • Cartesian product of strings of squares • Forms a parallelepiped of the same dimension • A path between the sequences is a set of connected line segments (connected broken line)

  7. Paths = 2^n - 1 = O(2^n) • 2 dimensions: 3 possible paths • 3 dimensions: 7 possible paths
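
The count can be reproduced by listing the ways to leave a lattice vertex: each sequence either advances by one character or stays put, and at least one must advance. This is a small illustrative sketch; `lattice_steps` is a hypothetical helper name, not something from the slides.

```python
from itertools import product


def lattice_steps(n):
    """All nonzero 0/1 step vectors in n dimensions (2**n - 1 of them)."""
    return [step for step in product((0, 1), repeat=n) if any(step)]


for n in (2, 3):
    print(n, len(lattice_steps(n)))
# 2 3
# 3 7
```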

  8. Paths • Sequences DQLF, DNVQ, QGL • [Figure: a path through the sublattice of the 3-dimensional parallelepiped; each segment of the path is one alignment column: (D,D,-), (-,N,-), (-,V,-), (Q,Q,Q), (-,-,G), (L,-,L), (F,-,-)]

  9. Paths and Sequence Length • Sequences: ABCD, ABD, BCD • Note: an alignment can be as short as the longest sequence • Ex: this alignment has length 4:
        A B C D
        A B - D
        - B C D

  10. Paths and Sequence Length • Sequences: ABCD, EFGH, IJK • Note: an alignment can be as long as the sum of the sequence lengths • Ex: this alignment has length 4 + 4 + 3 = 11:
        A B C D - - - - - - -
        - - - - E F G H - - -
        - - - - - - - - I J K

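  11. Projections • Sequences DQLF, DNVQ, QGL • A projection denotes the alignment of two of the sequences induced by the multiple alignment • Columns that are blank in both sequences are dropped • Ex: projection onto DQLF and QGL:
        D Q - L F
        - Q G L -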
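
A projection can be computed by keeping two rows of a multiple alignment and dropping every column that is blank in both. The sketch below assumes the alignment D--Q-LF / DNVQ--- / ---QGL- for the three example sequences; the helper name `project` is hypothetical.

```python
GAP = "-"  # assumed blank character


def project(rows, i, j):
    """Pairwise alignment of rows i and j induced by a multiple alignment."""
    kept = [(a, b) for a, b in zip(rows[i], rows[j])
            if not (a == GAP and b == GAP)]  # drop columns blank in both rows
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(project(rows, 0, 2))  # ('DQ-LF', '-QGL-')
print(project(rows, 0, 1))  # ('D--QLF', 'DNVQ--')
```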

  12. Optimal Paths • A measure is assigned to each path • It measures the similarity among the sequences based upon a particular metric • For each measure there is at least one path at which the measure attains its minimum value: the optimal path

  13. Calculating Optimal Paths • Each vertex in L is an end corner of a sublattice • First: compute the score of each of the possible paths for the cube that has a vertex at the original corner • Next: using this information, compute the minimum score to reach the vertices of the cubes adjacent to the original corner • [Figure: lattice with axes labeled by the sequences DQLF, DNVQ, QGL]
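
The corner-by-corner computation can be sketched as a memoized recursion over lattice vertices. This is an illustrative sketch only: it assumes a sum-of-pairs column cost with a cost of 1 for two different letters and 2 for a letter against a blank, and the names `pair_cost`, `column_cost`, and `optimal_sp_cost` are hypothetical. Running time grows exponentially with the number of sequences, so it is only practical for tiny inputs.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"  # assumed blank character


def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed costs: blank/blank 0, letter/blank 2, differing letters 1."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return 0 if a == b else mismatch


def column_cost(column):
    """Sum of pair costs over all pairs of characters in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def optimal_sp_cost(sequences):
    """Minimum cost of a path from the origin to the far corner."""
    k = len(sequences)
    steps = [s for s in product((0, 1), repeat=k) if any(s)]

    @lru_cache(maxsize=None)
    def best(vertex):
        if not any(vertex):
            return 0  # the original corner
        candidates = []
        for step in steps:
            prev = tuple(v - s for v, s in zip(vertex, step))
            if min(prev) < 0:
                continue  # this step would start outside the lattice
            # Sequences that advance contribute a letter, the rest a blank.
            column = [seq[p] if s else GAP
                      for seq, p, s in zip(sequences, prev, step)]
            candidates.append(best(prev) + column_cost(column))
        return min(candidates)

    return best(tuple(len(s) for s in sequences))


print(optimal_sp_cost(["DQLF", "DNVQ", "QGL"]))
```

Each vertex is scored once from its at most 2^k - 1 predecessors, which mirrors the cube-by-cube sweep described on the slide.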

  14. Problems with This Algorithm • Calculates a weighted sum of its projected pairwise alignments • Called “Sum-of-the-Pairs” (SP) • Other methods fit biological intuition more closely
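
The sum-of-the-pairs idea can be made concrete by scoring every projected pairwise alignment and adding the results. The sketch below is an assumption-laden illustration: the unit costs (1 for two different letters, 2 for a letter against a blank) are borrowed from the inconsistency example later in the deck, the optional `weights` mapping is hypothetical, and the rows are one possible alignment of ACC, ACC, TCT, ATCT.

```python
from itertools import combinations

GAP = "-"  # assumed blank character


def pairwise_cost(row_a, row_b, mismatch=1, gap=2):
    """Cost of the pairwise alignment projected from two aligned rows."""
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue  # column disappears in the projection
        if a == GAP or b == GAP:
            cost += gap
        elif a != b:
            cost += mismatch
    return cost


def sp_cost(rows, weights=None):
    """Weighted sum of projected pairwise costs (all weights 1 by default)."""
    total = 0
    for i, j in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(i, j)]
        total += w * pairwise_cost(rows[i], rows[j])
    return total


rows = ["-ACC", "-ACC", "-TCT", "ATCT"]
print(sp_cost(rows))  # 14
```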

  15. Tree-Alignment • Treat sequences as leaves of an evolutionary tree • Reconstruct ancestral sequences which minimize the cost of the tree • Must assign sequences to internal nodes • Align the given and reconstructed sequences • Star-alignment: only one internal node

  16. Tree-Alignment • Many different methods for calculating tree alignments • Discuss version used by ClustalX

  17. Tree-Alignment in ClustalX • Three main parts • Perform pairwise alignment on all sequences to calculate a distance matrix • Use distance matrix to calculate a guide tree • Sequences are progressively aligned using the branching order in the guide tree • Source: http://bimas.dcrt.nih.gov/clustalw/clustalw.html

  18. Calculating Distance Matrix • Use standard dynamic programming to find the best alignment • Gap penalties for opening a gap and continuing a gap (possibly different) • Divide number of matches by total number of residues compared (excluding gaps) • Convert to distances by dividing by 100 and subtracting from 1 • Gives one entry in the n by n matrix

  19. Calculating Distance Matrix • Ex: sequences ATCG, ATCC, AGGC, AGCC • ATCG vs ATCC: 3/4 = .75; .75/100 = .0075; 1 - .0075 = .9925 • ATCG vs AGGC: 1/4 = .25; .25/100 = .0025; 1 - .0025 = .9975
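
The arithmetic in this example can be reproduced with a few lines of code. The sketch below follows the slide's numbers literally (identity fraction, divided by 100, subtracted from 1); the function names are hypothetical, and the sequences are assumed to be already aligned and of equal length. If the identity were first expressed as a percentage before dividing by 100, which is an assumption about the intended procedure, the same two pairs would give distances of 0.25 and 0.75 instead.

```python
GAP = "-"  # assumed blank character


def identity_fraction(a, b):
    """Matches divided by residues compared, excluding gap positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        raise ValueError("no residues to compare")
    return sum(x == y for x, y in pairs) / len(pairs)


def slide_distance(a, b):
    """Distance exactly as computed on the slide: 1 - fraction / 100."""
    return 1 - identity_fraction(a, b) / 100


def distance_matrix(seqs):
    """n by n matrix of slide-style distances, zero on the diagonal."""
    return [[0.0 if i == j else slide_distance(a, b)
             for j, b in enumerate(seqs)]
            for i, a in enumerate(seqs)]


print(round(slide_distance("ATCG", "ATCC"), 4))  # 0.9925
print(round(slide_distance("ATCG", "AGGC"), 4))  # 0.9975
for row in distance_matrix(["ATCG", "ATCC", "AGGC", "AGCC"]):
    print([round(d, 4) for d in row])
```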

  20. Calculating Distance Matrix

  21. Calculating a Guide Tree • Using Neighbor-Joining method to group sequences • Results in an unrooted tree • Branch lengths proportional to estimated divergence • “Mid-point” method used to determine root • Means of the branch lengths to each side of the root are equal (or approximately equal)

  22. Calculating a Guide Tree • [Figure: guide tree with leaves AGAA, GCAA, AGCC, AGGC, ATCG, ATCT, ATCG; values shown: ATCG = 1.8245, ATCT = 1.8245, AGGC = 1.3308, GCAA = 1; other labels: 1/3, 1, 1.6599, .9975/2, .9975, .9925, .9925]

  23. Calculating a Guide Tree • [Figure: guide tree with leaves AGCC, ATCG, AGAA, ATCT, ATCG, AGGC, GCAA; values shown: ATCG = 1.4911, ATCT = 1.4911, AGGC = 1.4986, GCAA = 1.4986; other labels: 1.4911, 1.4986, 1, 1, .9975/2, .9975/2, .9925, .9925]

  24. Progressive Alignment • Perform a series of pairwise alignments • Slowly align larger and larger groups of sequences • Follow the branching order of the tree • From leaves to root

  25. Progressive Alignment • [Figure: guide tree with leaves ATCT, ATCG, AGGC, GCAA, AGCC, ATCG, AGAA]

  26. Alignment Costs
        Method              Input seq        Reconstructed seq   Mismatches
        Traditional (SP)    A, A, A, C, C    --                  6
        Tree-Alignment      A, A, A, C, C    A, A, C             1
        Star-Alignment      A, A, A, C, C    A                   2
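
Two of these counts are easy to reproduce for the input A, A, A, C, C. The sketch below is illustrative only: it counts differing pairs for the traditional (SP) view and, for the star view, assumes the single internal node is reconstructed as the most common character. The tree-alignment count depends on the tree topology and is not computed here. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def sp_mismatches(column):
    """Number of pairs of input characters that differ."""
    return sum(a != b for a, b in combinations(column, 2))


def star_mismatches(column):
    """Center character (most common) and mismatches against it."""
    center, count = Counter(column).most_common(1)[0]
    return center, len(column) - count


column = ["A", "A", "A", "C", "C"]
print(sp_mismatches(column))    # 6
print(star_mismatches(column))  # ('A', 2)
```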

  27. Alignment Inconsistencies • Different definitions of multiple alignments can yield different optimal alignments • Optimal tree-alignments minimize the number of mutations from theorized common ancestors • SP-alignments maximize the number of positions where aligned sequences agree • Sometimes makes more biological sense, since certain regions of proteins are more likely to mutate

  28. Alignment Inconsistencies • Ex: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null • Sequences: ACC, ACC, TCT, ATCT
        Method              Input sequences (aligned)        Reconstructed sequences
        Traditional (SP)    -ACC, -ACC, -TCT, ATCT           --
        Star-Alignment      ACC-, ACC-, TCT-, ATCT           ACC-

  29. ClustalX Demo • Multiple sequence alignment program • For more information on ClustalX • http://www.at.embnet.org/embnet/progs/clustal/clustalx.htm
