890 likes | 997 Vues
Recent Progress in Multiple Sequence Alignments: A Survey. Cédric Notredame. Our Scope. What are The existing Methods ?. How Do They Work: -Assemby Algorithms -Weighting Schemes. When Do They Work ?. Which Future ?. Outline. - Introduction. - A taxonomy of the existing Packages.
E N D
Recent Progress in Multiple Sequence Alignments:A Survey Cédric Notredame
Our Scope What are The existing Methods? How Do They Work: -Assemby Algorithms -Weighting Schemes. When Do They Work ? Which Future?
Outline -Introduction -A taxonomy of the existing Packages -A few algorithms… -Performance Comparison using BaliBase
LIKE ANYMODEL What Is A Multiple Sequence Alignment? A MSA is a MODEL It Indicates the RELATIONSHIP between residues of different sequences. It REVEALS -Similarities -Inconsistencies
How Can I Use A Multiple Sequence Alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. ::: .: .. . : . . * . *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * : .* . : Extrapolation Motifs/Patterns Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques. Profiles Phylogeny Struc. Prediction
How Can I Use A Multiple Sequence Alignment? Multiple Alignments Is the most INTEGRATIVE Method Available Today. We Need MSA to INCORPORATE existing DATA
BIOLOGY:What is A Good Alignment COMPUTATIONWhat is THE Good Alignment chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. ::: .: .. . : . . * . *: * Why Is It Difficult To Compute A multiple Sequence Alignment? A CROSSROAD PROBLEM
Why Is It Difficult To Compute A multiple Sequence Alignment ? BIOLOGY COMPUTATION CIRCULAR PROBLEM.... Good Good Alignment Sequences
Simultaneous As opposed to Progressive [Simultaneous: they simultaneously use all the information] Exact As opposed to Heursistic [Heuristics: cut corners like Blast Vs SW] [Heuristics: do not guarranty an optimal solution] StochasticAs opposed to Determinist [Stochastic: contain an element of randomness] [Stochastic: Example of a Monte Carlo Surface estimation ] Iterative As opposed to Non Iterative [Iterative: run the same algorithm many times] [Iterative: Most stochastic methods are iterative]
Simultaneous Progressive Clustal DCA T-Coffee Combalign MSA POA Dialign Iteralign GAs GA SAGA Prrp SAM HMMer Non tree based HMMs Iterative OMA PralineMAFFT
Simultaneous Progressive Clustal DCA T-Coffee Combalign MSA POA Dialign OMA PralineMAFFT Iteralign GA Prrp SAM HMMer Iterative SAGA Stochastic
NEARLY EVERY OPTIMISATIONALGORITHMHAS BEEN APPLIED TO THEMSA PROBLEM!!!
Scoring an Alignment: Evolutionary based methods BIOLOGY How many events separate my sequences? Such an evaluation relies on a biological model. COMPUTATION Every position musd be independant
A A C AA A C C REAL Tree Model: ALL the sequences evolved from the same ancestor A C Tree: Cost=1 A A C PROBLEM: We do not know the true tree
AA A C C STAR Tree Model: ALL the sequences have the same ancestor A C Star Tree: Cost=2 A A C A PROBLEM: the tree star is phylogenetically wrong
[s(a,b): matrix] AA A C C A C [i: column i] Sums of Pairs: Cost=6 [k, l: seq index] A C A Sums of Pairs Model=Every sequence is the ancestor of every sequence PROBLEM: -over-estimation of the mutation costs -Requires a weighting scheme
Cost= 5*N*(N-1)/2 [5: Leucine Vs Leucine with Blosum50] LL L L L Cost=5*N*(N-1)/2-(5)*(N-1) - (-4)*(N-1) G [glycine effect] Cost=5*N*(N-1)/2-(9)*(N-1) Sums of Pairs: Some of itslimitations (Durbin, p140)
2*(9)*(N-1) (9) Delta= = 5*N*(N-1) 5*N Delta G LL L L L N Sums of Pairs: Some of its limitations (Durbin, p140) Conclusion: The more Leucine, the less expensive it gets to add a Glycin to the column...
AA A C C Enthropy based Functions Model: Minimize the enthropy (variety) in each Column [number of Alanine (a) in column i] [Score of column i] [a: alphabet] [P can incorporate pseudocounts] S=0 if the column is conserved PROBLEM: -requires a simultaneous alignment -assumes independant sequences
AA A C C Consistency based Functions Model: Maximise the consistency (agreement) with a list of constraints (alignments) [kand l are sequences, i is a column] [the two residues are found aligned in the list of constraints] PROBLEM: -requires a list of constraints
Weighted Sums of Pairs Concistency Based Combalign MSA T-Coffee ClustalPOA Praline DCA Dialign Prrp Iteralign SAGA MAFFTOMA GIBBS SAM HMMer Enthropy
A Few Algorithms MSA and DCA POA ClustalW MAFFT Dialign II Prrp SAGA GIBBS Sampler
Simultaneous Alignments : MSA 1) Set Bounds on each pair of sequences (Carillo and Lipman) 2) Compute the Maln within the Hyperspace -Few Small Closely Related Sequence. -Memory and CPU hungry -Do Well When They Can Run.
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP MSA: the carillo and Lipman bounds ( ) S = ) ( S chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE + ) ( S … [Pairwise projection of sequences k and l]
Upper ? Lower a(k,m) â(k,m) a(k,l) â(k,l) MSA: the carillo and Lipman bounds a(k,l)=score of the projection k l in the optimal MSA S(a(x,y))=score of the complete multiple alignment â(k,l)=score of the optimal alignment of k l
â(k,l) ? LM+ â(k,l)-S(â(x,y)) a(k,l) â(k,l) MSA: the carillo and Lipman bounds LM: a lower bound for the complete MSA LM<=S(â(x,y)) - (â(k,l)-a(k,l)) a(k,l)>=LM +â(k,l)-S(â(x,y))
MSA: the carillo and Lipman bounds â(k,l) LM+ â(k,l)-S(â(x,y)) a(k,l) ä(k,l) â(k,l) LM: can be measured on ANY heuristic alignment LM = S(ä(x,y)) The better LM, the tighter the bounds…
MSA: the carillo and Lipman bounds N N 0 0 Best( M-i, N-j) Best( 0-i, 0-j) + M M Forward backward
Simultaneous Alignments : MSA 1) Set Bounds on each pair of sequences (Carillo and Lipman) 2) Compute the Maln within the Hyperspace -Few Small Closely Related Sequence. -Memory and CPU hungry -Do Well When They Can Run.
-Few Small Closely Related Sequence, but less limited than MSA -Memory and CPU hungry, but less than MSA -Do Well When Can Run. Simultaneous Alignments : DCA
Simultaneous With a New Sequence Representaion: POA-Partial Ordered Graph
POA POA makes it possible to represent complex relationships: -domain deletion -domain inversions
Progressive Alignment: ClustalW Feng and Dolittle, 1988; Taylor 198ç Clustering
Dynamic Programming Using A Substitution Matrix Progressive Alignment: ClustalW
Tree based Alignment : Recursive Algorithm Align ( Node N) { if ( N->left_child is a Node) A1=Align ( N->left_child) else if ( N->left_child is a Sequence) A1=N->left_child if (N->right_child is a node) A2=Align (N->right_child) else if ( N->right_child is a Sequence) A2=N->right_child Return dp_alignment (A1, A2) } C A D F G E B
Progressive Alignment : ClustalW -Depends on the CHOICE of the sequences. -Depends on the ORDER of the sequences (Tree). • -Depends on the PARAMETERS: • Substitution Matrix. • Penalties (Gop, Gep). • Sequence Weight. • Tree making Algorithm.
Progressive Alignment : ClustalW Weighting Weighting Within ClustalW
Progressive Alignment : ClustalW GOP Position Specific GOP
Progressive Alignment : ClustalW 2 2 3 -Scales Well: N, N L -Greedy Heuristic (No Guarranty). -Fast ClustalW is the most Popular Method