Fast Protein Structure Database Search Using Suffix Trees

Explore efficient algorithms for approximate database searching of polypeptide structures, enabling fast and accurate retrieval of similar protein structures. Learn about suffix trees, polypeptide angles, and fault-tolerant searching.

Presentation Transcript


  1. Efficient Algorithms Group, Prof. Ernst W. Mayr, Technical University of Munich • Fast Approximate Database Searching of Polypeptide Structures • Hanjo Taeubig, Arno Buchner, Jan Griebsch • German Conference on Bioinformatics, October 4th, 2004

  2. Structure • motivation & problem definition • suffix trees • polypeptide angles suffix trees • application & future work

  3. I. Motivation • the function of a protein is largely determined by its structure and geometric shape • How can similar structures be found in a database? • related work • DALI, VAST, CE • TopScan, ProtDex2 • existing methods are mostly based on the principle “filter heuristics + exhaustive search/pairwise comparison” and scale at least linearly with the database size

  4. I. Motivation • PDB – Protein Data Bank • ca. 3.5 GB compressed, 14 GB decompressed • more than 23,000 entries • 90% proteins, 5% nucleotide sequences, 4% nucleotide-protein complexes • 85% X-ray crystallography, 15% NMR • protein structure databases grow almost exponentially • search methods with time complexity at most O(n) required

  5. I. Problem Definition • search a given polypeptide structure in a protein database • search the longest common substructure in the database • identify frequent substructures (motifs) in the database

  6. II. Suffix Trees Tries • tree with a root node • every edge is labeled with a letter • labels of all edges to the child nodes of one node are pairwise distinct

  7. II. Suffix Trees Suffix Tries • stores all suffixes of a string • the sentinel $ ensures that every suffix is represented by a leaf • figure: suffix trie for the word aaabbb$
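To make the trie and suffix trie definitions concrete, here is a minimal sketch of a naive suffix trie for the slide's example word. The nested-dictionary representation and the function names are illustrative assumptions, not taken from the talk, and the quadratic construction is only meant for small inputs.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: one edge per character.

    Takes quadratic time and space in len(text); for illustration only.
    """
    if not text.endswith("$") or "$" in text[:-1]:
        raise ValueError("text must end with a unique sentinel '$'")
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Edge labels leaving one node are pairwise distinct (dict keys).
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A leaf is a node without outgoing edges."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(root, pattern):
    """Follow one edge per pattern character."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aaabbb$")
print(count_leaves(trie))     # 7: one leaf per suffix, thanks to the sentinel
print(contains(trie, "abb"))  # True
print(contains(trie, "ba"))   # False
```

Every substring of the text is a prefix of some suffix, which is why a single walk from the root answers substring queries.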

  8. II. Suffix Trees Compressed Suffix Tries • collapse linear paths in the tree • store only start and end index of each edge label • linear number of inner nodes
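The compression step can be sketched as follows: each edge stores only a start and end index into the text, and an edge is split when a new suffix diverges in the middle of it. This is a naive quadratic-time construction chosen for readability, not the linear-time online construction mentioned in the talk; the `Node` class and function names are assumptions.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start    # the edge into this node is labeled text[start:end]
        self.end = end
        self.children = {}    # first character of edge label -> child Node


def build_suffix_tree(text):
    """Naive O(n^2) construction of a compressed suffix trie."""
    if not text or text[-1] in text[:-1]:
        raise ValueError("text must end with a unique sentinel character")
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while pos < n:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: attach a new leaf.
                node.children[text[pos]] = Node(pos, n)
                break
            # Walk along the edge label while it matches the suffix.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split the edge at index k.
            middle = Node(child.start, k)
            node.children[text[child.start]] = middle
            child.start = k
            middle.children[text[k]] = child
            middle.children[text[pos]] = Node(pos, n)
            break
    return root


def count_nodes(node):
    """Return (inner_nodes, leaves) for the subtree rooted at node."""
    if not node.children:
        return 0, 1
    inner, leaves = 1, 0
    for child in node.children.values():
        sub_inner, sub_leaves = count_nodes(child)
        inner += sub_inner
        leaves += sub_leaves
    return inner, leaves


print(count_nodes(build_suffix_tree("aaabbb$")))  # (5, 7), root counted as inner
```

Because every inner node below the root has at least two children, the number of inner nodes cannot exceed the number of leaves, which is the linear bound stated on the slide.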

  9. II. Suffix Trees Further Extensions • generalized suffix trees • store the suffixes of multiple strings in one tree • online linear-time construction Time Complexity • Finding an occurrence of the search pattern does not depend on the size of the searched database, but only linearly on the length m of the pattern • Finding all k occurrences of a pattern takes time proportional to m+k
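The query behaviour can be illustrated with a small generalized index over several strings. This sketch uses an uncompressed trie that stores an occurrence list at every node, which costs quadratic space and is an assumption made for brevity; a real generalized suffix tree reaches the same m+k query time with linear space. All names are hypothetical.

```python
def new_node():
    return {"children": {}, "occ": []}


def build_generalized_suffix_trie(strings):
    """Index all suffixes of all strings; each node records (string id, offset)."""
    root = new_node()
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["occ"].append((sid, start))
    return root


def find_all(root, pattern):
    """Walk m edges, then report the k stored occurrences."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["occ"]


index = build_generalized_suffix_trie(["abba", "babb"])
print(find_all(index, "bb"))  # [(0, 1), (1, 2)]
print(find_all(index, "ab"))  # [(0, 0), (1, 1)]
print(find_all(index, "aa"))  # []
```

The walk in `find_all` touches one node per pattern character regardless of how many strings were indexed, which is the database-size independence claimed on the slide.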

  10. III. Polypeptide Angles Suffix Tree Idea • encode the geometry of the database proteins in a translation and rotation invariant linear description (“structural text”) • torsion angle encoding of the protein backbone • adapt efficient text mining methods to the error tolerant substructure searching problem • generalized suffix trees with fault tolerant search strategies
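As a sketch of why torsion angles give a translation- and rotation-invariant description, the following computes the signed dihedral angle defined by four consecutive points using the standard atan2 formulation. Which backbone atoms are fed in and how they are read from the PDB files is not specified in the slides, so the coordinates below are made-up test points.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (-180 to 180) around the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates, not taken from any PDB entry.
points = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)]
print(round(torsion_angle(*points)))        # 90

# Translating all points leaves the angle unchanged.
shifted = [(x + 5, y - 3, z + 2) for x, y, z in points]
print(round(torsion_angle(*shifted)))       # 90

# So does rotating them, here by 90 degrees about the z axis.
rotated = [(-y, x, z) for x, y, z in points]
print(round(torsion_angle(*rotated)))       # 90
```

Since the angle depends only on the bond vectors relative to each other, the resulting sequence of angles can be compared between structures without superimposing them first.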

  11. III. Polypeptide Angles Suffix Tree • 1a1f • … (22,93), (112, 4) … ⇒ Discretization ⇒ … a b b a …

  12. III. Polypeptide Angles Suffix Tree • 1a1f • … (22,93), (112, 4) … ⇒ Discretization ⇒ … a b b a …
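The discretization step can be sketched as mapping each angle to the index of a fixed-width interval and each index to a letter. The interval origin (angles shifted into 0 to 360 degrees), the letter alphabet and the function names are assumptions; the slides only show the input pairs and the resulting letters. With a hypothetical width of 90 degrees the example pairs happen to yield a b b a, while the application described later in the talk uses 24 intervals of 15 degrees.

```python
import string


def discretize(angle_pairs, width=15):
    """Map each angle (degrees) in each pair to an interval index."""
    if 360 % width != 0:
        raise ValueError("width must divide 360")
    n_intervals = 360 // width
    indices = []
    for pair in angle_pairs:
        for angle in pair:
            # Shift into [0, 360); the final modulo guards float rounding.
            indices.append(int((angle % 360) // width) % n_intervals)
    return indices


def as_text(indices):
    """Render interval indices as letters (enough for up to 26 intervals)."""
    return "".join(string.ascii_lowercase[i] for i in indices)


pairs = [(22, 93), (112, 4)]
print(as_text(discretize(pairs, width=90)))  # abba
print(discretize(pairs, width=15))           # [1, 6, 7, 0]
print(discretize([(-60, 180)], width=15))    # [20, 12]
```

The resulting letter sequence is the "structural text" that is then inserted into the suffix tree.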

  13. III. Polypeptide Angles Suffix Tree Fault Tolerant Searching • accept a “neighborhood range” of ε intervals left and right • worst case time complexity: exponential (!) • average: O( ) • figure: branching with ε=1
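The neighborhood search can be sketched as a trie traversal that, at each pattern position, follows every edge whose interval index lies within ε of the pattern's index. The code below uses a naive uncompressed suffix trie over interval indices and ignores that angles wrap around at 360 degrees; both are simplifying assumptions, and the names and sample data are hypothetical.

```python
def new_node():
    return {"children": {}, "starts": []}


def build_suffix_trie(seq):
    """Naive trie over all suffixes of a sequence of interval indices."""
    root = new_node()
    for start in range(len(seq)):
        node = root
        for sym in seq[start:]:
            node = node["children"].setdefault(sym, new_node())
            node["starts"].append(start)
    return root


def tolerant_search(root, pattern, eps):
    """Start positions where every symbol is within eps intervals of the pattern."""
    frontier = [root]
    for sym in pattern:
        next_frontier = []
        for node in frontier:
            # Up to 2*eps + 1 edges can match, so the search may branch.
            for edge_sym, child in node["children"].items():
                if abs(edge_sym - sym) <= eps:
                    next_frontier.append(child)
        frontier = next_frontier
        if not frontier:
            return []
    return sorted(s for node in frontier for s in node["starts"])


database = [3, 7, 8, 4, 7, 9]
trie = build_suffix_trie(database)
print(tolerant_search(trie, [3, 7, 8], eps=0))  # [0]
print(tolerant_search(trie, [3, 7, 8], eps=1))  # [0, 3]
```

Each level can multiply the number of active nodes by up to 2ε+1, so a pattern of length m can reach (2ε+1)^m nodes in the worst case, which matches the exponential worst case noted on the slide.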

  14. IV. Application Example • search for occurrences of the C2H2 zinc finger in the complete PDB • discretization: 24 intervals of 15° • compare with SCOP classification, sequence-based search, SPASM

  15. IV. Application
      Sequences producing significant alignments:                     Score (bits)  E Value
      gi|37926551|pdb|1LLM|C Chain C, Crystal Structure Of A Zif2...            47  6e-07
      gi|15988358|pdb|1F2I|G Chain G, Cocrystal Structure Of Sele...            42  2e-05
      gi|3319019|pdb|1A1H|A Chain A, Qgsr (Zif268 Variant) Zinc F...            42  3e-05
      gi|3319013|pdb|1A1F|A Chain A, Dsnr (Zif268 Variant) Zinc F...            41  3e-05
      gi|3319022|pdb|1A1I|A Chain A, Radr (Zif268 Variant) Zinc F...            41  3e-05
      gi|16975178|pdb|1JK1|A Chain A, Zif268 D20a Mutant Bound To...            41  3e-05
      gi|2098365|pdb|1AAY|A Chain A, Zif268 Zinc Finger-Dna Compl...            41  4e-05
      gi|33357855|pdb|1P47|A Chain A, Crystal Structure Of Tandem...            41  5e-05
      gi|443340|pdb|1ZAA|C Chain C, Zif268 Immediate Early Gene (...            40  8e-05
      gi|15988466|pdb|1G2F|C Chain C, Structure Of A Cys2his2 Zin...            33  0.015
      gi|15988460|pdb|1G2D|C Chain C, Structure Of A Cys2his2 Zin...            32  0.025
      gi|1941952|pdb|1MEY|C Chain C, Crystal Structure Of A Desig...            28  0.44
      gi|40889293|pdb|1P7A|A Chain A, Solution Stucture Of The Th...            27  0.64
      gi|3318788|pdb|2ADR| Adr1 Dna-Binding Domain From Saccharo...             27  0.78
      gi|2094895|pdb|1SP1| Nmr Structure Of A Zinc Finger Domain...             26  1.4
      gi|1420993|pdb|1ARD| Yeast Transcription Factor Adr1 (Resi...             23  9.7
      . . .

  16. IV. Application

  17. IV. Application • minimum RMSD superposition: 1a1f vs. 1f2i • “false” positives: 1a1f vs. 1vl2 • 1a1f vs. 6 other true positives

  18. IV. Application Run Time • Pre-processing • decompression of the packed PDB files: 25 min • parsing of the PDB files and calculating the torsion angles: 55 min • discretization and building the PAST (polypeptide angles suffix tree): 2 min • Searching • searching a structure: seconds

  19. Summary • suffix-tree-based protein (sub-)structure database search method • preprocessing required • fast search • does not rely on heuristics or SSE recognition • adaptable sensitivity and error models • until gapped matching is modeled: applicable to shorter peptide chains and motifs • surprisingly simple

  20. Future Work • model matching with insertions & deletions • consensus search pattern • implementation and practical testing of further error models • φ and ψ angle encoding • identification of new motifs • testing, testing, testing: evaluating the method further with real-life problems from pharmaceutical researchers, biologists, patent offices, …

  21. Acknowledgements • Hanjo Taeubig, Arno Buchner • Volker Heun, Moritz Maass • BFAM/BMBF • ALTANA