
PLaSMA: A new dynamic programming algorithm for multiple sequence alignment





  1. PLaSMA: A new dynamic programming algorithm for multiple sequence alignment
     V. Derrien, J.M. Richer, J.K. Hao
     Université d'Angers - 2, Boulevard Lavoisier 49045 Angers Cedex 01 - France
     E-mail: hao@info.univ-angers.fr
     Web: www.info.univ-angers.fr/pub/hao

  2. Plan
     • Introduction
     • Sequence alignment
       • pairwise alignment
       • multiple alignment
     • Principle of PLaSMA
     • Results
     • Conclusion

  3. Introduction
     Sequence alignment is a fundamental tool for many important applications in biology:
     • Gene discovery
     • Phylogenetic analysis
     • Homology modeling
     • Motif discovery
     • Disease diagnosis
     • ...

  4. Pairwise alignment
     • Similarity between 2 sequences (according to a substitution matrix: PAM, BLOSUM, ...),
     • Comparison of one sequence with a set of sequences,
     • Efficient algorithm using dynamic programming,
     • Complexity in O(m·n), where m and n are the lengths of the two sequences,
     • Sum-of-pairs scoring function: sum of the similarity values associated with each pair of residues.
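The dynamic programming scheme mentioned on this slide can be sketched as a global (Needleman-Wunsch style) alignment. This is a minimal illustration, not the presenters' code: the function name `needleman_wunsch` and the match, mismatch and gap values are assumptions standing in for a real PAM or BLOSUM matrix. The two nested loops over the table are what give the O(m·n) cost.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not PAM/BLOSUM entries.
    """
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: O(m * n) cells, constant work per cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For instance, `needleman_wunsch("ACGT", "AGT")` has to open one gap opposite the `C`, so three matches and one gap give a score of 1 under these assumed values.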

  5. Multiple alignment
     • Similarity of a set of K sequences (K > 2)
     • Known to be NP-hard [Wang & Jiang 94]
     • Example:
       Unaligned sequences:
         AGCTAGCACTGA
         CTAGCATG
         AGCCTAGCTGC
         ATCAGCAAATGC
       Aligned sequences:
         AG-CTAGC-ACTGA
         ---CTAGC--ATG-
         AGCCTAGC---TGC
         A-TC-AGCAAATGC

  6. Multiple alignment
     • Two main approaches:
       • Progressive: alignment of subgroups of sequences based on profiles (Clustal W [Thompson et al, 1994])
       • Iterative: simultaneous alignment of all the sequences (SAGA [Notredame et al, 1996])
     • Scoring functions:
       • Sum-of-pairs,
       • Coffee,
       • T-Coffee
       • …
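As an illustration of the sum-of-pairs scoring function listed above, the sketch below scores a multiple alignment column by column, summing a pairwise value over every pair of rows. The names `pair_score` and `sum_of_pairs` and the match, mismatch and gap values are assumptions chosen for readability; a real implementation would look residue pairs up in a PAM or BLOSUM matrix.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same column (assumed toy values)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )
```

With K rows, each column contributes K(K-1)/2 pair scores, so evaluating a given alignment is cheap; the hard part of the problem is searching for the alignment that maximizes this value.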

  7. Principle of Clustal W
     • Construction of a distance matrix from pairwise alignments,
     • Construction of a guide-tree by a clustering algorithm (Neighbour-Joining), each internal node of the tree being a profile (consensus sequence),
     • Progressive alignment of the sequences by DP, in the order given by the guide-tree,
     • Problems:
       • "Once a gap, always a gap",
       • Guide-tree quality is critical,
       • Loss of information due to the use of profiles.

  8. PLaSMA
     • To overcome the problem of information loss, PLaSMA directly aligns two groups of aligned sequences: the notion of blocks
     • A block is a set of aligned sequences
     • Extension of dynamic programming to align:
       • A sequence against a block,
       • Two blocks.
     • Important technical differences in:
       • using the guide-tree (each internal node being a block),
       • implementing the DP algorithm for aligning blocks.
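One way to picture block-against-block dynamic programming is to treat each column of a block as a single unit and to score two columns by summing over all residue pairs they contain. The sketch below follows that idea; it is an assumption about how such a DP could look, not PLaSMA's actual recurrence, and the names `column_pair_score` and `align_blocks` as well as the scoring values are hypothetical. A sequence against a block is simply the case where one block has a single row.

```python
def column_pair_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Sum of pairwise scores between two columns (assumed toy values)."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def align_blocks(block_a, block_b):
    """Align two blocks (lists of equal-length gapped rows) column against column."""
    cols_a = list(zip(*block_a))
    cols_b = list(zip(*block_b))
    gap_a = ("-",) * len(block_a)  # an all-gap column inserted into block A
    gap_b = ("-",) * len(block_b)  # an all-gap column inserted into block B
    m, n = len(cols_a), len(cols_b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + column_pair_score(cols_a[i - 1], gap_b)
        move[i][0] = "up"
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + column_pair_score(gap_a, cols_b[j - 1])
        move[0][j] = "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [
                (score[i - 1][j - 1] + column_pair_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (score[i - 1][j] + column_pair_score(cols_a[i - 1], gap_b), "up"),
                (score[i][j - 1] + column_pair_score(gap_a, cols_b[j - 1]), "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda option: option[0])
    # Trace back, emitting merged columns that keep every original column intact.
    merged = []
    i, j = m, n
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]
```

Because every residue of both blocks takes part in the column score, no consensus or profile is built, which is the information-loss point the slide raises. Gaps already present inside a block are kept, and new gaps are only ever inserted as whole columns.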

  9. PLaSMA: general algorithm
     Input: E, the set of sequences (|E| = n)
     Output: L, the alignment of E
     • Construct a guide-tree of E with the NJ clustering algorithm
     • Construct |E| blocks, each composed of one sequence
     • L = {B1, ..., B|E|}
     • While |L| > 1 Do
       • Take the two most similar blocks Bi and Bj according to the guide-tree
       • B = align(Bi, Bj)
       • L = (L ∪ {B}) - {Bi, Bj}
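The loop above can be sketched as follows. The guide-tree is assumed to be given as a `merge_order` list of block-id pairs (a hypothetical encoding in which the k-th merge creates block id n + k), and the block aligner is passed in as a parameter. The `pad_align` function is only a placeholder that right-pads rows with gaps so the example runs; it is not an alignment algorithm and not PLaSMA's `align`.

```python
def pad_align(block_a, block_b):
    """Placeholder aligner (assumption): stack two blocks and right-pad with gaps."""
    width = max(len(block_a[0]), len(block_b[0]))
    return [row.ljust(width, "-") for row in block_a + block_b]


def progressive_alignment(sequences, merge_order, align):
    """Merge single-sequence blocks pairwise, in guide-tree order, until one remains."""
    # L = {B1, ..., B|E|}: one block per input sequence, keyed by block id.
    blocks = {i: [seq] for i, seq in enumerate(sequences)}
    next_id = len(sequences)
    for i, j in merge_order:
        # B = align(Bi, Bj); L = (L ∪ {B}) - {Bi, Bj}
        blocks[next_id] = align(blocks.pop(i), blocks.pop(j))
        next_id += 1
    if len(blocks) != 1:
        raise ValueError("merge_order must reduce the blocks to a single one")
    return next(iter(blocks.values()))
```

A call such as `progressive_alignment(["ACGT", "AGT", "ACG", "AT"], [(0, 1), (2, 3), (4, 5)], pad_align)` first merges sequences 0 and 1, then 2 and 3, then the two resulting blocks. Swapping `pad_align` for a real block aligner changes the quality of the result but not the control flow.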

  10. Principle of PLaSMA
      Block 1:
        ATCCAGCT
        A-CGAGCT
        ATCC-GC-
      Block 2:
        ACGTCGT-T
        AC-TCGTC-
      Result of aligning the two blocks:
        A--TCCAG-CT
        A---CGAG-CT
        A--TCC-G-C-
        ACGTC--GT-T
        AC-TC--GTC-

  11. BAliBASE
      • Sum-of-pairs (and other) scoring functions may be biologically meaningless: hence the need for benchmark alignments
      • BAliBASE is such a database: http://www-igbmc.u-strasbg.fr/Bioinfo/Balibase
      • A set of 139 protein alignments grouped in 5 families,
      • A reference alignment is given for each (obtained using structure and function information derived from public databases),
      • 2 possible evaluation functions,
      • Results for 10 top (progressive and iterative) alignment algorithms → Comparison.

  12. Comparison with the reference
      First evaluation, comparing the result with the reference:
      • Count the number of residue pairs that match exactly (1 if OK, 0 otherwise)
      Result:
        GAAWQGQIVG
        EPQNDDELPM
        -A-SGDNTLS
        KKEREEDIDL
      Reference:
        GAAWQGQIVG
        EPQNDDELPM
        -ASGDNTLSM
        KKEREEDIDL
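A pair-based evaluation of this kind can be sketched by listing, for each alignment, which residues share a column, and then measuring how many of the reference pairs reappear in the result. The normalization by the number of reference pairs and the names `aligned_pairs` and `pair_match_score` are assumptions for illustration; the slide only states that each pair counts 1 if matched and 0 otherwise.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    Each residue is identified by (row index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_match_score(result, reference):
    """Fraction of reference residue pairs that are also aligned in the result."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(result) & reference_pairs) / len(reference_pairs)
```

Identifying residues by their ungapped position, rather than by column number, means a gap inserted early in one row does not automatically count every later column as wrong.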

  13. Comparison with the reference
      Second evaluation, comparing the result with the reference:
      • Count the number of columns that match exactly (1 if OK, 0 otherwise)
      Result:
        GAAWQGQIVG
        EPQNDDELPM
        -A-SGDNTLS
        KKEREEDIDL
      Reference:
        GAAWQGQIVG
        EPQNDDELPM
        -ASGDNTLSM
        KKEREEDIDL
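The column-based evaluation is stricter: a reference column counts only if the result contains a column with exactly the same residues in every row. The sketch below assumes the score is normalized by the number of reference columns; the names `column_signatures` and `column_match_score` are hypothetical.

```python
def column_signatures(alignment):
    """Describe each column by the ungapped position of its residue in every row.

    A gap is recorded as None, so a signature identifies a column's content exactly.
    """
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, symbol in enumerate(column):
            if symbol == "-":
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(signature))
    return signatures


def column_match_score(result, reference):
    """Fraction of reference columns reproduced exactly in the result."""
    reference_columns = column_signatures(reference)
    if not reference_columns:
        return 0.0
    result_columns = set(column_signatures(result))
    matched = sum(1 for column in reference_columns if column in result_columns)
    return matched / len(reference_columns)
```

A single misplaced residue invalidates its whole column here, whereas under the pair-based evaluation it would only lose the pairs involving that residue.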

  14. Results
      • PLaSMA is assessed on the 139 instances of BAliBASE using the two evaluation functions (larger values are better).
      • Remark: PLaSMA, like Clustal W, is very fast and requires only a few seconds to align 20 sequences of 200 residues.

  15. Results
      • PLaSMA obtains the best results on 7 instances (and is better than Clustal W on 23 instances).

  16. Example: 2mhr
      Clustal W
        hemt_sipcu  GFPVPDPFIW DASFKTFYDD LDNQHKQLFQ AILTQGNVGG -ATAGDNAYA CLVAHFLFEE AAMQV-AKYG
        1hrb        GFPIPDPYVW DPSFRTFYSI IDDEHKTLFN GIFHLAIDDN -ADNLGELRR CTGKHFLNQE VLMEA-SQY-
        mp2_nerdi   GFEIPEPYKW DESFQVFYEK LDEEHKQIFN AIFALCGGNN -AGNLKSLVD VTANHFADEE AMLKASASYG
        hem1_phago  -FDIPEPYVW DESFRVFYDN LDDEHKGLFK GVFNCAADMS SAGNLKHLID VTTTHFRNEE AMMDA-AKYE
        hemt_linun  --KVPEPFAW NESFATSYKN IDLEHRTLFN GLFALSEFNT -RDQLLACKE VFVMHFRDEQ GQMEK-ANYE

        hemt_sipcu  GYGAHKAAHE EFLGKVKGGS A-----DAAY CKDWLTQHIK TIDFKYKGK
        1hrb        FYDEHKKEHD GFINALDNWK G-----DVKW AKAWLVNHIK TIDFK-KGK
        mp2_nerdi   DFDSHKKKHE DFLAVIRGLG APVPQDKINY AKEWLVNHIK GTDFGYKGK
        hem1_phago  NVVPHKQMHK DFLAKLGGLK APLDQGTIDY AKDWLVQHIK TTDFKYKGK
        hemt_linun  HFEEHRGIHE GFLEKMGHWK APVAQKDIKF GMEWLVNHIP TEDFKYKGK

      PLaSMA
        hemt_sipcu  GFPVPDPFIW DASFKTFYDD LDNQHKQLFQ AILTQGNV-G GATAGDNAYA CLVAHFLFEE AAMQV-AKYG
        1hrb        GFPIPDPYVW DPSFRTFYSI IDDEHKTLFN GIFHLAID-D NADNLGELRR CTGKHFLNQE VLMEA-SQYF
        mp2_nerdi   GFEIPEPYKW DESFQVFYEK LDEEHKQIFN AIFALCGG-N NAGNLKSLVD VTANHFADEE AMLKASASYG
        hem1_phago  -FDIPEPYVW DESFRVFYDN LDDEHKGLFK GVFNCAADMS SAGNLKHLID VTTTHFRNEE AMMDA-AKYE
        hemt_linun  --KVPEPFAW NESFATSYKN IDLEHRTLFN GLFALSEF-N TRDQLLACKE VFVMHFRDEQ GQMEK-ANYE

        hemt_sipcu  GYGAHKAAHE EFLGKVKGGS AD-----AAY CKDWLTQHIK TIDFKYKGK
        1hrb        -YDEHKKEHD GFINALDNWK GD-----VKW AKAWLVNHIK TIDFK-KGK
        mp2_nerdi   DFDSHKKKHE DFLAVIRGLG APVPQDKINY AKEWLVNHIK GTDFGYKGK
        hem1_phago  NVVPHKQMHK DFLAKLGGLK APLDQGTIDY AKDWLVQHIK TTDFKYKGK
        hemt_linun  HFEEHRGIHE GFLEKMGHWK APVAQKDIKF GMEWLVNHIP TEDFKYKGK

  17. Conclusion & perspectives
      • Encouraging results,
      • Ongoing improvements:
        • Redistribution of the gaps after each alignment, with local search,
        • Other scoring functions (Coffee, T-Coffee),
        • Post-optimization of the final alignment, to improve the areas of weak similarity,
      • Release of a first beta version on the Web in the near future.
