PLaSMA: A new dynamic programming algorithm for multiple sequence alignment

V. Derrien, J.M. Richer, J.K. Hao
Université d'Angers - 2, Boulevard Lavoisier, 49045 Angers Cedex 01 - France
E-mail: hao@info.univ-angers.fr
Web: www.info.univ-angers.fr/pub/hao

Plan
• Introduction
• Sequence alignment
  • Pairwise alignment
  • Multiple alignment
• Principle of PLaSMA
• Results
• Conclusion

Introduction
Sequence alignment is a fundamental tool for many important applications in biology:
• Gene discovery
• Phylogenetic analysis
• Homology modeling
• Motif discovery
• Disease diagnosis
• ...

Pairwise alignment
• Similarity between 2 sequences (according to a substitution matrix such as PAM or BLOSUM),
• Comparison of one sequence with a set of sequences,
• Efficient algorithm using dynamic programming,
• Complexity in O(m·n) for two sequences of lengths m and n,
• Sum-of-pairs scoring function: the sum of the similarity values associated with each pair of residues.
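
The dynamic programming step can be illustrated with a minimal sketch of the classical Needleman-Wunsch recurrence, which is assumed here to be the kind of algorithm the slide refers to. The function name and the match, mismatch and gap values are illustrative assumptions standing in for a PAM or BLOSUM matrix. The two nested loops fill an (m+1) x (n+1) table, which is where the O(m·n) complexity comes from.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not PAM/BLOSUM entries.
    """
    m, n = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap              # a[:i] against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap              # b[:j] against gaps only
    for i in range(1, m + 1):              # m * n cells, constant work each
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    return score[m][n]


print(global_alignment_score("AGCTAGCACTGA", "CTAGCATG"))
```

Only the score is returned here; recovering the alignment itself would require a traceback through the table.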

Multiple alignment
• Similarity of a set of K sequences (K > 2)
• Known to be NP-hard [Wang & Jiang 94]
• Example:
    AGCTAGCACTGA
    CTAGCATG
    AGCCTAGCTGC
    ATCAGCAAATGC
  aligned as:
    AG-CTAGC-ACTGA
    ---CTAGC--ATG-
    AGCCTAGC---TGC
    A-TC-AGCAAATGC

Multiple alignment
• Two main approaches:
  • Progressive: alignment of subgroups of sequences based on profiles (Clustal W [Thompson et al., 1994])
  • Iterative: simultaneous alignment of all the sequences (SAGA [Notredame et al., 1996])
• Scoring functions:
  • Sum-of-pairs,
  • Coffee,
  • T-Coffee,
  • …
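
As an illustration of the sum-of-pairs scoring function applied to a multiple alignment, the sketch below sums a pair score over every pair of rows in every column. The scoring values (match, mismatch, residue against gap, and zero for two gaps) and the function names are assumptions chosen for the example, not values taken from the slides.

```python
from itertools import combinations


def residue_pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from the same column (assumed values)."""
    if x == "-" and y == "-":
        return 0                      # two gaps carry no information
    if x == "-" or y == "-":
        return gap                    # residue facing a gap
    return match if x == y else mismatch


def sum_of_pairs(rows):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        residue_pair_score(x, y)
        for column in zip(*rows)              # walk the alignment column by column
        for x, y in combinations(column, 2)   # every pair of rows in that column
    )


example = [
    "AG-CTAGC-ACTGA",
    "---CTAGC--ATG-",
    "AGCCTAGC---TGC",
    "A-TC-AGCAAATGC",
]
print(sum_of_pairs(example))
```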

Principle of Clustal W
• Construction of a distance matrix from pairwise alignments,
• Construction of a guide tree by a clustering algorithm (Neighbour-Joining), each internal node of the tree being a profile (consensus sequence),
• Progressive alignment of the sequences by DP, following the order given by the guide tree,
• Problems:
  • « Once a gap, always a gap »,
  • Guide tree quality is critical,
  • Loss of information due to the use of profiles.
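
The loss of information mentioned in the last point can be illustrated with a small sketch. It reduces a group of aligned sequences to per-column residue counts and then to a consensus string. This is only an illustration of the idea: the function names are hypothetical and the code does not reproduce Clustal W's actual profile scoring.

```python
from collections import Counter


def profile(block):
    """Per-column symbol counts of a set of aligned sequences."""
    return [Counter(column) for column in zip(*block)]


def consensus(block):
    """Most frequent symbol of each column, joined into one string."""
    return "".join(counts.most_common(1)[0][0] for counts in profile(block))


group_1 = ["AAG", "AAG", "ACG"]
group_2 = ["AAG", "AAG", "ATG"]
# Both groups collapse to the same consensus although their rows differ:
# the minority residues C and T are no longer visible.
print(consensus(group_1), consensus(group_2))
```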

PLaSMA
• To overcome the loss of information, PLaSMA directly aligns two groups of already aligned sequences: the notion of blocks
• A block is a set of aligned sequences
• Extension of dynamic programming to align:
  • A sequence against a block,
  • Two blocks.
• Important technical differences in:
  • using the guide tree (each internal node being a block),
  • implementing the DP algorithm for aligning blocks.

PLaSMA: general algorithm
Input: E, the set of sequences (|E| = n)
Output: L, the alignment of E
• Construction of a guide tree of E by the NJ clustering algorithm
• Construction of |E| blocks, each composed of one sequence
• L = {B1, ..., B|E|}
• While |L| > 1 Do
  • Take the two most similar blocks Bi and Bj according to the guide tree
  • B = align(Bi, Bj)
  • L = (L ∪ {B}) - {Bi, Bj}
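
The merge loop above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the guide tree is replaced by a direct search for the most similar pair of blocks, and `toy_similarity` and `toy_align` are hypothetical placeholders that only make the loop runnable (a real `align` would be the block-level dynamic programming).

```python
from itertools import combinations


def progressive_align(sequences, similarity, align):
    """Merge single-sequence blocks pairwise until one block remains."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    blocks = [[s] for s in sequences]            # L = {B1, ..., B|E|}
    while len(blocks) > 1:                       # While |L| > 1
        # Assumption: the most similar pair is found by direct comparison,
        # standing in for the order given by the NJ guide tree.
        i, j = max(
            combinations(range(len(blocks)), 2),
            key=lambda p: similarity(blocks[p[0]], blocks[p[1]]),
        )
        merged = align(blocks[i], blocks[j])     # B = align(Bi, Bj)
        # L = (L ∪ {B}) - {Bi, Bj}
        blocks = [b for k, b in enumerate(blocks) if k not in (i, j)]
        blocks.append(merged)
    return blocks[0]


def toy_similarity(block_a, block_b):
    """Placeholder: blocks of closer width are considered more similar."""
    return -abs(len(block_a[0]) - len(block_b[0]))


def toy_align(block_a, block_b):
    """Placeholder: pad every row with trailing gaps to a common width."""
    width = max(len(block_a[0]), len(block_b[0]))
    return [row.ljust(width, "-") for row in block_a + block_b]


alignment = progressive_align(["ACGT", "AGT", "ACGTT"], toy_similarity, toy_align)
print(alignment)
```

Passing `similarity` and `align` as parameters keeps the loop independent of how blocks are compared and merged, so either placeholder can be swapped for a real component.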

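Principle of PLaSMA
• First block (3 sequences):
    ATCCAGCT
    A-CGAGCT
    ATCC-GC-
• Second block (2 sequences):
    ACGTCGT-T
    AC-TCGTC-
• Block obtained by aligning the two blocks:
    A--TCCAG-CT
    A---CGAG-CT
    A--TCC-G-C-
    ACGTC--GT-T
    AC-TC--GTC-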
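
One way to extend dynamic programming from sequences to blocks is to treat each column of a block as a single unit and to score two columns by summing the pair scores between their rows. The sketch below does this for the two blocks of the example. The scoring values and function names are assumptions, and the result need not match the alignment shown on the slide, since the slides do not give PLaSMA's exact scoring.

```python
def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of two symbols facing each other (assumed values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def cross_score(col_a, col_b):
    """Sum of pair scores between every row of col_a and every row of col_b."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)


def align_blocks(block_a, block_b):
    """Globally align two blocks; columns inside each block stay intact."""
    cols_a, cols_b = list(zip(*block_a)), list(zip(*block_b))
    gap_a, gap_b = ("-",) * len(block_a), ("-",) * len(block_b)
    m, n = len(cols_a), len(cols_b)
    # score[i][j]: best score for the first i columns of A and first j of B.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + cross_score(cols_a[i - 1], gap_b)
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + cross_score(gap_a, cols_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + cross_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + cross_score(gap_a, cols_b[j - 1]),
            )
    # Traceback from the bottom-right cell, rebuilding merged columns.
    merged, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + cross_score(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)    # gap column in block B
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])    # gap column in block A
            j -= 1
    merged.reverse()
    if not merged:                                  # both blocks had no columns
        return [""] * (len(block_a) + len(block_b))
    return ["".join(row) for row in zip(*merged)]


first = ["ATCCAGCT", "A-CGAGCT", "ATCC-GC-"]
second = ["ACGTCGT-T", "AC-TCGTC-"]
for row in align_blocks(first, second):
    print(row)
```

Because two facing gaps score zero in this sketch, inserting a gap column into a block leaves the score of the pairs inside that block unchanged, so only the pairs across the two blocks need to be evaluated.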

BAliBASE
• Sum-of-pairs (and other) scoring functions may be biologically meaningless: need for benchmark alignments
• BAliBASE is such a database: http://www-igbmc.u-strasbg.fr/Bioinfo/Balibase
• A set of 139 protein alignments grouped in 5 families,
• A reference alignment is given for each (obtained using structure and function information derived from public databases),
• 2 possible evaluation functions,
• Results for 10 top (progressive and iterative) alignment algorithms → Comparison.

Comparison with the reference
First evaluation function:
• Counting the pairs of residues that are aligned exactly as in the reference (1 if ok, 0 otherwise)
Result:
    GAAWQGQIVG
    EPQNDDELPM
    -A-SGDNTLS
    KKEREEDIDL
Reference:
    GAAWQGQIVG
    EPQNDDELPM
    -ASGDNTLSM
    KKEREEDIDL
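
A possible reading of this first evaluation function is sketched below. Each residue is identified by its row and its position in the ungapped sequence, and a pair counts as 1 when the two residues share a column in both alignments. Dividing by the number of pairs in the reference is an assumption made here to obtain a value between 0 and 1; the function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs that share a column.

    A residue is identified by (row index, position in the ungapped row).
    """
    pairs = set()
    position = [0] * len(rows)            # next residue index for each row
    for column in zip(*rows):
        present = []
        for r, symbol in enumerate(column):
            if symbol != "-":
                present.append((r, position[r]))
                position[r] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_match_score(result, reference):
    """Fraction of reference residue pairs also aligned in the result."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(result) & ref_pairs) / len(ref_pairs)


result = ["GAAWQGQIVG", "EPQNDDELPM", "-A-SGDNTLS", "KKEREEDIDL"]
reference = ["GAAWQGQIVG", "EPQNDDELPM", "-ASGDNTLSM", "KKEREEDIDL"]
print(pair_match_score(result, reference))
```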

Comparison with the reference
Second evaluation function:
• Counting the columns that exactly match a column of the reference (1 if ok, 0 otherwise)
Result:
    GAAWQGQIVG
    EPQNDDELPM
    -A-SGDNTLS
    KKEREEDIDL
Reference:
    GAAWQGQIVG
    EPQNDDELPM
    -ASGDNTLSM
    KKEREEDIDL
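
The column-based evaluation can be sketched in the same spirit. A column is described by the residue index it holds in each row (or `None` for a gap), and a result column counts as 1 when an identical column exists in the reference. Normalising by the number of result columns is an assumption of this sketch, and the function names are hypothetical.

```python
def column_keys(rows):
    """Describe each column by the residue index it holds in every row."""
    keys = []
    position = [0] * len(rows)            # next residue index for each row
    for column in zip(*rows):
        key = []
        for r, symbol in enumerate(column):
            if symbol == "-":
                key.append(None)          # gap in this row
            else:
                key.append(position[r])
                position[r] += 1
        keys.append(tuple(key))
    return keys


def column_match_score(result, reference):
    """Fraction of result columns that appear unchanged in the reference."""
    reference_columns = set(column_keys(reference))
    result_columns = column_keys(result)
    if not result_columns:
        return 0.0
    matches = sum(1 for key in result_columns if key in reference_columns)
    return matches / len(result_columns)


result = ["GAAWQGQIVG", "EPQNDDELPM", "-A-SGDNTLS", "KKEREEDIDL"]
reference = ["GAAWQGQIVG", "EPQNDDELPM", "-ASGDNTLSM", "KKEREEDIDL"]
print(column_match_score(result, reference))
```

This measure is stricter than the pair-based one: a single misplaced residue makes the whole column count as 0.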

Results
• PLaSMA is assessed on the 139 instances of BAliBASE using the two evaluation functions (larger values are better).
• Remark: PLaSMA, like Clustal W, is very fast and requires only a few seconds to align 20 sequences of 200 residues.

Results
• PLaSMA obtains the best results for 7 instances (better than Clustal W on 23 instances).

Example: 2mhr

Clustal W
hemt_sipcu  GFPVPDPFIW DASFKTFYDD LDNQHKQLFQ AILTQGNVGG -ATAGDNAYA CLVAHFLFEE AAMQV-AKYG
1hrb        GFPIPDPYVW DPSFRTFYSI IDDEHKTLFN GIFHLAIDDN -ADNLGELRR CTGKHFLNQE VLMEA-SQY-
mp2_nerdi   GFEIPEPYKW DESFQVFYEK LDEEHKQIFN AIFALCGGNN -AGNLKSLVD VTANHFADEE AMLKASASYG
hem1_phago  -FDIPEPYVW DESFRVFYDN LDDEHKGLFK GVFNCAADMS SAGNLKHLID VTTTHFRNEE AMMDA-AKYE
hemt_linun  --KVPEPFAW NESFATSYKN IDLEHRTLFN GLFALSEFNT -RDQLLACKE VFVMHFRDEQ GQMEK-ANYE

hemt_sipcu  GYGAHKAAHE EFLGKVKGGS A-----DAAY CKDWLTQHIK TIDFKYKGK
1hrb        FYDEHKKEHD GFINALDNWK G-----DVKW AKAWLVNHIK TIDFK-KGK
mp2_nerdi   DFDSHKKKHE DFLAVIRGLG APVPQDKINY AKEWLVNHIK GTDFGYKGK
hem1_phago  NVVPHKQMHK DFLAKLGGLK APLDQGTIDY AKDWLVQHIK TTDFKYKGK
hemt_linun  HFEEHRGIHE GFLEKMGHWK APVAQKDIKF GMEWLVNHIP TEDFKYKGK

PLaSMA
hemt_sipcu  GFPVPDPFIW DASFKTFYDD LDNQHKQLFQ AILTQGNV-G GATAGDNAYA CLVAHFLFEE AAMQV-AKYG
1hrb        GFPIPDPYVW DPSFRTFYSI IDDEHKTLFN GIFHLAID-D NADNLGELRR CTGKHFLNQE VLMEA-SQYF
mp2_nerdi   GFEIPEPYKW DESFQVFYEK LDEEHKQIFN AIFALCGG-N NAGNLKSLVD VTANHFADEE AMLKASASYG
hem1_phago  -FDIPEPYVW DESFRVFYDN LDDEHKGLFK GVFNCAADMS SAGNLKHLID VTTTHFRNEE AMMDA-AKYE
hemt_linun  --KVPEPFAW NESFATSYKN IDLEHRTLFN GLFALSEF-N TRDQLLACKE VFVMHFRDEQ GQMEK-ANYE

hemt_sipcu  GYGAHKAAHE EFLGKVKGGS AD-----AAY CKDWLTQHIK TIDFKYKGK
1hrb        -YDEHKKEHD GFINALDNWK GD-----VKW AKAWLVNHIK TIDFK-KGK
mp2_nerdi   DFDSHKKKHE DFLAVIRGLG APVPQDKINY AKEWLVNHIK GTDFGYKGK
hem1_phago  NVVPHKQMHK DFLAKLGGLK APLDQGTIDY AKDWLVQHIK TTDFKYKGK
hemt_linun  HFEEHRGIHE GFLEKMGHWK APVAQKDIKF GMEWLVNHIP TEDFKYKGK

Conclusion & perspectives
• Encouraging results,
• Ongoing improvements:
  • Redistribution of the gaps after each alignment, with local search,
  • Other scoring functions (Coffee, T-Coffee).
• Post-optimization of the final alignment, to improve the areas of weak similarity,
• Release of a first beta version on the Web in the near future.
