pdbj/

PRAGMA 11 Workshop, Osaka Univ., on October 16, 2006 Integrated Web Service for Database Queries and Computations Grid Web Services for Analog Queries to Protein 3D Structures Haruki Nakamura Institute for Protein Research, Lab. Protein Informatics, Osaka University http://www.pdbj.org/

BioGrid at Osaka http://www.biogrid.jp Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years H. Nakamura K. Yamaguchi T. Takada H. Matsuda S. Shimojo S. Date Integration of Data Grid and Computing Grid T. Akiyama to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost.

Databases in Life Sciencesnearly 1,000 DBs Organism Medical Scale Large OMIM Organ Drug Data MEDLINE Tissue MDDR Cell Literature Function Sugar Chain Molecular Interaction Network Organelle Gene Ontology Gene Expression DIP Mass Analysis KEGG GEO 3D-Cordinates InterPro, Pfam Molecule Chemical Structure Swiss-Prot DDBJ/EMBL/GenBank H-Inv, FANTOM PubChem, Ligand, ChEBI PDB Sequence Fine Atom Conceptual Level Less Conceputal (Formal) Highly Conceptual

BioDataGrid by Hideo Matsuda at Osaka Univ.

BioGrid at Osaka http://www.biogrid.jp Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years H. Nakamura K. Yamaguchi T. Takada H. Matsuda S. Shimojo S. Date Integration of Data Grid and Computing Grid T. Akiyama to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost.

Nakamura et al. (2004) New Generation Computing, 22, 157-166. BioPfuga:Biosimulation Platform United on Grid Architecture A platform where individual application programs at the different levels are united to execute a hybrid computation. In particular, BioPfuga is a platform for biosimulation. http://www.biogrid.jp/

Construction of BioPfuga ・Propose and Design Communication between program components by XML description (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it. ・Divide programs into a set of many components Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism.

http://www.biogrid.jp/

Construction of BioPfuga ・Propose and Design Communication between program components by XML description (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it. ・Divide programs into a set of many components Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism.

Quantum mechanics (QM) and Molecular mechanics (MM) simulation coupled on BioPfuga Coupled simulation on Grids MM area Hamiltonian of total system Active site of an enzyme Htotal = HQM(x,x) + HQM/MM(x,y) + HMM(y,y) Calculate in QM area Calculate in MM area Ligand QM area (Yonezawa et al. at SC2004)

PKCd-C1B Simulation of Electronic structure (1) AMOSSAb initio Molecular Orbital Simulation for Supercomputer 856 atoms, 8672 AO’s developed by NEC Quantum Chemistry Group Rapid computation for huge molecular systems (20,000 basis for RHF, 1,000 basis for CASSCF and MP2), with high parallel performance Simulation of Electronic structure (2) GSO-Xgeneralized spin density function theory (DFT) developed by Shusuke Yamanaka & Kizashi Yamaguchi, at Osaka Univ. Simulation of Protein-solvent structure prestoX-basicProtein Engineering Simulator eXtended –basic versin- developed by Yoshifumi Fukunishi at JBIRC-AIST and Haruki Nakamura at Institute for Protein Research, Osaka Univ.

GT3 prestoX Service QM(HF) GT3 GT3 AMOSS Service GSO-X Service MPI 30CPU AMOSS(HF) A hybrid-QM/MM with BioPfuga on the GridService@SC2004 Client controller User@USA, Pittsburgh @IPR Osaka Site-A Viewer portal @TokyoIT MM(MD) Site-B @CMC Osaka Site-C BMSML BMSML QM(DFT) 8CPU prestoX (MPI) 8-boards MPI 50CPU BMSML MD-Grape2 GSO-X(DFT) (Special purpose computing board: max 50GFLOPS)

Configuration of a computing system on a Grid environment with several different program modules. Portal/ Client Program User Communicate using SOAP Personal user Distributed computing System Service QM(HF) Service MM Service QM (DFT) SOAP SOAP

Our final goal is Integration of Data Grid and Computing Grid. H. Nakamura K. Yamaguchi T. Takada H. Matsuda S. Shimojo S. Date Integration of Data Grid and Computing Grid T. Akiyama to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost. BioGrid at Osaka http://www.biogrid.jp Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years

Integration of computing systems and Database query systems on a Grid Portal/ Client Program User Communicate using SOAP Personal user site-A site-B site-C Distributed Computing System GT GT GT Service A Service B Service C SOAP GRID computing services H. Nakamura, BioGrid2005 (March 9, 2005)

Integration of computing systems and Database query systems on a Grid Portal/ Client Program User Communicate using SOAP Personal user site-A site-B site-C Distributed Computing and DB System GT GT GT Service A Service B Service C SOAP GRID computing services Database query services H. Nakamura, BioGrid2005 (March 9, 2005)

Proposed Portal for Multiple Databases for Protein Structures Berman, H. et al. (2006) Structure, 14, 1211-1217. Theoretical Model DB 3 Theoretical Model DB 2 Theoretical Model DB 1 Figure 1. A Schematic for the Proposed Portal: The portal is envisioned to be a modular web services architecture achieved by using an implementation of the Simple Object Access Protocol (SOAP http://www.w3.org/TR/soap/), that allows for seamless data exchange between the portal and all registered contributors. SOAP is a light weight protocol for exchange of information in a decentralized, distributed environment. It is an XML-based protocol, which defines a framework for representing remote procedure calls and responses. The XML wrappers basically map the individual contributing metadata format into an XML format that the data portal understands (Drinkwater et al., 2004). These services will be platform and language independent allowing other services (other portals or clients) to communicate with the data portal (see http://www.e-science.clrc.ac.uk/web/projects/dataportal/)

Protein Tertiary Structure PDB (Protein Data Bank): Protein Tertiary Structure Database Atom species and coordinates, amino acid residues, any other experimental results with raw data, experimental methods, and the conditions. X-ray crystallography, NMR, and Electron microscope experiments.

http://www.wwpdb.org/ Kim Henrick, Helen Berman, Haruki Nakamura

Rutgers Univ. ＵＣＳＤＮＩＳＴ RCSB PDBj EBI International collaboration in wwPDB • Curation, data processing, and • registration are made by all the • members, collaborating with each other. • 2) We have a single data archive, which is • looked after by one “archive keeper (RCSB)”. • 3) Data format and new descriptions will be discussed • among the members. • 4) Members are encouraged to develop their own • browsers, viewers, and other APIs and services. (Berman, Henrick & Nakamura (2003) Nat. Struct. Biol. 10, 980)

Protein Data Bank Japan http://www.pdbj.org/ At Institute for Protein Research, Osaka Univ. since 2001 assisted from the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).

Yearly PDBj registration number Yearly registration number Total registration number Yearly wwPDB registration number Total wwPDB registration number 1972 75 80 85 90 95 2000 2005 year Processed data numbers at PDBj We process more than 30 % deposited data of the entire world, mainly from Asian and Oceania regions

Maintain Format Standards • PDB (conventional and flat) • PDB Exchange (mmCIF) • Mechanism for extension based on new demands • PDBML • Derived from mmCIF • All entries converted to XML • Automatic translation from mmCIF data files and dictionaries • 3-styles of translation released (Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, 988-992)

PDBML: canonical XML description of PDB data, developed by the wwPDB. (Westbrook et al. (2005) Bioinformatics, 21, 988-992) http://pdbml.pdb.org/schema/pdbx.xsd, http://pdbml.pdb.org/schema/pdbx-ext.xsd, ftp://ftp.pdbj.org/ →No validation errors for more than 39,000 PDB file description.

Examples for the atom coordinates <PDBx:atom_siteCategory> <PDBx:atom_site id="1"> <PDBx:group_PDB>ATOM</PDBx:group_PDB> <PDBx:type_symbol>N</PDBx:type_symbol> <PDBx:label_atom_id>N</PDBx:label_atom_id> <PDBx:label_comp_id>THR</PDBx:label_comp_id> <PDBx:label_asym_id>A</PDBx:label_asym_id> <PDBx:label_entity_id>1</PDBx:label_entity_id> <PDBx:label_seq_id>1</PDBx:label_seq_id> <PDBx:Cartn_x>17.047</PDBx:Cartn_x> <PDBx:Cartn_y>14.099</PDBx:Cartn_y> <PDBx:Cartn_z>3.625</PDBx:Cartn_z> <PDBx:occupancy>1.00</PDBx:occupancy> <PDBx:B_iso_or_equiv>13.79</PDBx:B_iso_or_equiv> <PDBx:auth_seq_id>1</PDBx:auth_seq_id> <PDBx:auth_comp_id>THR</PDBx:auth_comp_id> <PDBx:auth_asym_id>A</PDBx:auth_asym_id> <PDBx:auth_atom_id>N</PDBx:auth_atom_id> <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num> </PDBx:atom_site> Full-tag description Separated file for coordinates <atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N 17.047 14.099 3.625 1.00 13.79</atom_record>

Applications of PDBML • Browser at PDBj with the native XML DB • Database extension and annotation • SOAP (Simple Object Access Protocol) services at PDBj • Molecular graphics viewer (jV) , which directly parses PDBML, displaying the information written in XML (Kinoshita & Nakamura (2004) Bioinformatics, 20, 1329-1330, Westbrook et al (2005) Bioinformatics, 21, 988-992)

Wild-card "*" can be used: *2as, 12*,　 *

"and" and wild-card are available.

Search proteins, which have one or more helices, which are longer than 10. /datablock[struct_confCategory/struct_conf /pdbx_PDB_helix_length>=“10"]/@datablockName

xPSSS new facility Both XQuery and XPath are available.

Search for all entries and helix of length with helix of length greater than or equal to 10 residues:

Queries by XQuery and XPath at our new xPSSS

PDBjViewer or jV • Offers interactive molecular visualization with RasMol type commands (source code free). • Can be used both as Stand- alone and as applet with Java. • Any polygons defined by XML can be displayed and manipulated. • Can parse PDBML files and display the molecules. http://www.pdbj.org/PDBjViewer/

xPSSS Database System Archive (RCSB-PDB /MSD-EBI /PDBj) Internet download (FTP) FTP server Web server downloader xPSSS PDBML AddInformation XSLT processor Native XML-DB PDBMLplus CATRES Data Function/ Source Information PDBMLplus PDBMLplusF Annotation Data PDBMLplus Filtering & Recostructing Loader PDBMLplusF Get/Input Tools DDBJ SwisProt/UniProt PIR/GenBank/KEGG/GDB/ ProTherm/EzCatDB EBI/CSA /CATRES PDF files for the primary citations Manual input from literatures

Development of Secondary Databases: data mining Protein Dynamics Database, ProMode (Wako & Endo) Protein Molecular Surface Database,eF-site (Kinoshita & Nakamura） Alignment of Structural Homologues, ASH(Toh) Sequence Navigator & Structure Navigator (Standley） Encyclopedia of Protein Structures, eProtS (Ito & Nakamura）

Superfolds of Proteins

Identify Protein Function from ・Sequence similarity ・ Fold similarity

Compare Protein folds/architectures

Distance Matrix DNA-binding domain of Bovine Papilloma Virus-1 E2 （2BOP） RNA-binding domain of the U1A spliceosomal protein（1URN）

Map of the entire protein folds 498 SCOP domains Dali (distance matrix alignment) http://www.ebi.ac.uk/dali/ Hou et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 2386-2390.

Structure Navigator http://www.pdbj.org/strucnavi/ • Rapidly generates structure neighbors for any PDB ID • Both an HTTP and SOAP interface are available • Alignments are displayed in a readable format • 3D coordinates can be viewed or downloaded Structural similarity is defined by the Number of Equivalent Residues (NER1) Standley, D.M. et al. (2005) 1Standley D, Toh H, Nakamura H. Proteins 2004;57:381-391

pdbj/

pdbj/

Presentation Transcript

Eugene Krissinel Macromolecular Structure Database keb@ebi.ac.uk