(The Encyclopedia of Life (EOL))

hope-munoz

Presentation Transcript


  1. The Open Notebook (The Encyclopedia of Life (EOL)) The Annotation and Cataloging of Proteins, Life's Building Blocks for… research education medicine

  2. A Multitude of Data Sites

  3. Current Problems Using Data Sites • Difficult to keep track of data files • Data often returned in various formats • Searches are frequently repeated in their entirety, tying up server resources

  4. Developments in Data Transfer • XML is increasingly used to encapsulate data, e.g. <?xml version="1.0"?> <notebook-data> </notebook-data> • SOAP, an XML-based method for exchanging information, is increasingly offered for access to data services, e.g. string[] getGenomeAnnotationStatus(int Format_option) • The SOAP consumer invokes a SOAP method over HTTP • The SOAP server processes the request and returns any data in an XML-formatted SOAP packet
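
The request half of this exchange can be sketched as follows. This is a minimal illustration, not the EOL service's actual wire format: the envelope namespace is the standard SOAP 1.1 one, but `SERVICE_NS` and the helper `build_soap_request` are assumptions introduced here, and no network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:eol"  # hypothetical service namespace, not from the slides


def build_soap_request(method, params):
    """Wrap a method call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        # Each parameter becomes a child element of the method element.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# The method name and parameter come from the slide; the value 1 is arbitrary.
request = build_soap_request("getGenomeAnnotationStatus", {"Format_option": 1})
```

A consumer would POST these bytes over HTTP and parse the XML-formatted SOAP packet that comes back.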

  5. Notebook Overview Notebook link Web Services Interface SOAP Server Application invoked by mime type getIncrementalUpdate(string sequence, string date) <?xml version="1.0"?> <notebook_data> <data> … Background SOAP Queries Virtual community messaging Open Notebook Annotations Metadata sharing XML/RDF store BLAST Data Keyword data Scheduler Stored queries BLAST Annotations Keyword queries Session info

  6. Open Notebook Protocol • Agreed set of protocols for invoking and then feeding with data a client-side application to enable client-side data persistence • Not tied to one programming language

  7. Invocation of Client-side Application • Experimental mime type (as per RFC2048 ) application/x-opennotebook • Application registers with web browser/OS to handle this mime type. • Data then streams to application in agreed XML schema format <?xml version="1.0"?> <notebook_data> <data> …
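
The invocation step might look roughly like this on the client side. The MIME type and the `notebook_data` root element are taken from the slide; the `HANDLERS` registry is a hypothetical stand-in for the browser/OS association, and the sample payload is invented for illustration.

```python
import xml.etree.ElementTree as ET

OPEN_NOTEBOOK_MIME = "application/x-opennotebook"  # MIME type named on the slide


def handle_notebook_stream(payload):
    """Parse a notebook download and return its root element."""
    root = ET.fromstring(payload)
    if root.tag != "notebook_data":
        raise ValueError(f"unexpected root element: {root.tag}")
    return root


# Hypothetical registry standing in for the browser/OS MIME association.
HANDLERS = {OPEN_NOTEBOOK_MIME: handle_notebook_stream}


def dispatch(content_type, payload):
    """Route a download to the handler registered for its MIME type."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    try:
        handler = HANDLERS[mime]
    except KeyError:
        raise LookupError(f"no application registered for {mime}") from None
    return handler(payload)


# Invented sample payload using the element names shown on the slide.
sample = b'<?xml version="1.0"?><notebook_data><data>example</data></notebook_data>'
root = dispatch("application/x-opennotebook; charset=utf-8", sample)
```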

  8. Data would describe required data viewers • Specialized viewers and their current availability specified in XML data download <?xml version="1.0"?> <notebook_data> <basic-viewer>blast</basic-viewer> <advanced-viewer> <availability>available</availability> <platforms>Java;win32;macosx</platforms> <download>http://www.xxx.com/…</download> </advanced-viewer>
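
A client could act on this viewer description along the following lines. The element names are those on the slide; the selection rule (prefer the advanced viewer when it is available for the current platform, otherwise fall back to the built-in one) is an assumption, as is the function name `choose_viewer`. The elided download URL is kept exactly as elided.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<notebook_data>
  <basic-viewer>blast</basic-viewer>
  <advanced-viewer>
    <availability>available</availability>
    <platforms>Java;win32;macosx</platforms>
    <download>http://www.xxx.com/…</download>
  </advanced-viewer>
</notebook_data>"""


def choose_viewer(xml_text, platform):
    """Return ("advanced", download_url) when the advanced viewer is available
    for this platform, else ("basic", name_of_built_in_viewer)."""
    root = ET.fromstring(xml_text)
    advanced = root.find("advanced-viewer")
    if advanced is not None:
        available = advanced.findtext("availability", "").strip() == "available"
        platforms = advanced.findtext("platforms", "").split(";")
        if available and platform in platforms:
            return "advanced", advanced.findtext("download")
    return "basic", root.findtext("basic-viewer")
```

For example, `choose_viewer(SAMPLE, "linux")` falls back to the basic `blast` viewer because `linux` is not in the platform list.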

  9. Data updates • Indication whether data is updatable <?xml version="1.0"?> <notebook_data> <updatable>yes</updatable> <SOAP-proxy>http://www.xxx.org/soapservice</SOAP-proxy> <update-method>getGenomes(string seq)</update-method> <incrementally-updatable>yes</incrementally-updatable> …
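
Reading these update hints on the client might look like the sketch below. The element names and values come from the slide (with the SOAP-proxy element properly closed); the `UpdateInfo` class and `parse_update_info` function are hypothetical, and the sample omits whatever the slide elides after the last element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SAMPLE = """<?xml version="1.0"?>
<notebook_data>
  <updatable>yes</updatable>
  <SOAP-proxy>http://www.xxx.org/soapservice</SOAP-proxy>
  <update-method>getGenomes(string seq)</update-method>
  <incrementally-updatable>yes</incrementally-updatable>
</notebook_data>"""


@dataclass
class UpdateInfo:
    updatable: bool
    soap_proxy: str
    update_method: str
    incremental: bool


def parse_update_info(xml_text):
    """Extract the update metadata from a notebook download."""
    root = ET.fromstring(xml_text)

    def flag(tag):
        # Treat a missing element as "no".
        return root.findtext(tag, "no").strip().lower() == "yes"

    return UpdateInfo(
        updatable=flag("updatable"),
        soap_proxy=root.findtext("SOAP-proxy", "").strip(),
        update_method=root.findtext("update-method", "").strip(),
        incremental=flag("incrementally-updatable"),
    )
```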

  10. Programming Language-Neutral • Important to specify only protocols and activation scenarios • Enables development of a variety of different and branded versions • Java is envisaged as an excellent programming language choice for starting development of an open source version

  11. Encyclopedia of Life • The Encyclopedia of Life (EOL) project is a joint development of the San Diego Supercomputer Center (SDSC) and scientists and biological resources worldwide • EOL involves SDSC staff from HPC (High Performance Computing), DAKS (Distributed Annotation and Knowledge System), Grids, Clusters and Visualization • EOL has three parts: • Putative functional and 3-D structure assignment through the largest computation ever attempted in biology • Integration of key biological resources • Making this data available to end users through an intuitive interface • Opportunity to start from the ground up

  12. integrated Genomic Annotation Pipeline - iGAP Deduced Protein sequences sequence info structure info NR, PFAM Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG) SCOP, PDB Building FOLDLIB: PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP 90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30) Create PSI-BLAST profiles for Protein sequences Structural assignment of domains by PSI-BLAST on FOLDLIB Only sequences w/out A-prediction Structural assignment of domains by 123D on FOLDLIB Only sequences w/out A-prediction Functional assignment by PFAM, NR, PSIPred assignments Domain location prediction by sequence FOLDLIB Store assigned regions in the DB

  13. ~800 genomes @ 10k-20k per genome ≈ 10^7 ORFs. integrated Genomic Annotation Pipeline - iGAP Deduced Protein sequences sequence info structure info NR, PFAM Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG) SCOP, PDB [4 CPU years] [10^4 entries] Building FOLDLIB: PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP 90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30) [228 CPU years] Create PSI-BLAST profiles for Protein sequences [3 CPU years] Structural assignment of domains by PSI-BLAST on FOLDLIB Only sequences w/out A-prediction [9 CPU years] Structural assignment of domains by 123D on FOLDLIB Only sequences w/out A-prediction [252 CPU years] Functional assignment by PFAM, NR, PSIPred assignments [3 CPU years] Domain location prediction by sequence FOLDLIB Store assigned regions in the DB
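
The scale figures on this slide can be sanity-checked with a few lines of arithmetic. The inputs are the slide's own numbers. Summing the CPU-year annotations assumes they are independent per-stage costs, which the slide does not state explicitly.

```python
import math

genomes = 800
orfs_per_genome = (10_000, 20_000)  # the slide's "10k-20k per" genome

# Lower and upper bounds on the total number of ORFs.
low, high = (genomes * n for n in orfs_per_genome)

# Both bounds round to the same power of ten.
magnitudes = [round(math.log10(n)) for n in (low, high)]

# CPU-year annotations in the order they appear on the slide.
cpu_years = [4, 228, 3, 9, 252, 3]
total_cpu_years = sum(cpu_years)

print(low, high, magnitudes, total_cpu_years)
```

The bounds come out at 8,000,000 and 16,000,000 ORFs, both of order 10^7, and the annotations add up to 499 CPU years.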

  14. EOL Data Flow Sequence data from genomic sequencing projects EOL GRID MySQL DataMart(s) Domain location prediction Structure assignment by 123D Structure assignment by PSI-BLAST Integrated Genome Annotation Pipeline (iGAP) Query databases Return data Load/update scripts Data warehouse JBOSS v3.1 Putative Functional and 3D Assignment Application Server Normalized DB2 schema Apache AXIS Web Server/ Web Services Pipeline data Integrated with Other Resources Retrieve Web pages & Invoke SOAP methods Web Services consumers Web Interface

  15. Local Data Aggregation iGAP Local lookup tables Java Application Server PHProjekt Oracle db Keyword search NLQ search BLAST EOL Registry Temporary session search data

  16. EOL Front End: Web Interface

  17. Interactive Data Rendering • Need for interactive client side graphical data rendering • Flash used in EOL prototype but… • development time high • thin client capabilities limited by player parsing capabilities • Scalable Vector Graphics (SVG) • Described by an XML-based text file • graphic description can be created server-side • standards based • Interactivity provided by embedded ECMA scripting • Negatives: • Little native support in web browsers • Must use proprietary plugin (Adobe) in practice

  18. SVG Data Rendering • SVG XML-based graphic is generated in real time on the server: <svg> <rect x="0" y="0"> … </svg> • EOL Web Server • Embedded ECMAScript makes calls to the EOL server for data • EOL Data • Data is returned to the SVG component
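
Server-side generation of such a graphic can be sketched with nothing more than an XML library. The domain coordinates, the default sizes and the function name `render_domains_svg` are illustrative assumptions. The slides do not describe how the EOL server builds its SVG.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_domains_svg(domains, sequence_length, width=600, height=40):
    """Draw each (start, end) domain as a rectangle scaled to the sequence."""
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns
    svg = ET.Element(f"{{{SVG_NS}}}svg", width=str(width), height=str(height))
    scale = width / sequence_length
    for start, end in domains:
        ET.SubElement(
            svg,
            f"{{{SVG_NS}}}rect",
            x=f"{start * scale:.1f}",
            y="0",
            width=f"{(end - start) * scale:.1f}",
            height=str(height),
        )
    return ET.tostring(svg, encoding="unicode")


# Two made-up domains on a 300-residue sequence.
svg_text = render_domains_svg([(0, 100), (150, 300)], sequence_length=300)
```

Because the output is plain XML text, the same routine can run inside any server-side request handler.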

  19. Session Data Persistence Session Object retains pointers to temp data EOL Server Temp Data

  20. EOL Front End: Web Services (cont) Package: org.eolproject.ejb JBOSS v3.1 getDomains(int id, int format_option) Application Server Apache AXIS getDomains(33499519, 1) Web Server getDomains(33499519, 0) Open Notebook Open Notebook General data access Integration into enterprise applications Flash XML rendering HTML rendering

  21. Open Notebook Software Wish List • Multi-Platform application • Easy installation and update • Local search functionality • Data annotation • Built-in basic data viewers for popular data, e.g. BLAST, sequence alignments, basic molecular rendering • Automated download of specialized data viewers • Automatic data updates via background use of web services • User notification of new data • Point-and-click interface to support new breed of PDAs and Tablets • Peer-to-peer querying of annotation data

  22. Easy Installation and Update • Foolproof installation • Java Network Launch Protocol (JNLP), as used by Java Web Start, is a good contender • JNLP can also deliver application updates

  23. Local search functionality • Whatever kind of database is used, it needs to support some kind of search functionality • For the Open Notebook project we would seek an open source XML-based database and look to the XML:DB API as a means of interacting with a native XML database • eXist is one example of an open source, native XML database
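
As a rough illustration of the kind of search the local store would need to answer, the sketch below runs a keyword match over a small XML document held in memory. It is a standard-library stand-in, not the XML:DB API or eXist. The `notebook_store` schema, the sample entries and the function `search_entries` are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample store; the slides do not define a storage schema.
STORE = """<notebook_store>
  <entry id="1"><title>BLAST hits for kinase query</title><note>check domain 2</note></entry>
  <entry id="2"><title>Keyword search: transmembrane</title><note>compare with TMHMM</note></entry>
</notebook_store>"""


def search_entries(xml_text, keyword):
    """Return ids of entries whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    root = ET.fromstring(xml_text)
    return [
        entry.get("id")
        for entry in root.iter("entry")
        if needle in "".join(entry.itertext()).lower()
    ]
```

A native XML database would typically answer the same question with an XPath or XQuery expression rather than a scan in application code.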

  24. Data annotation & Peer-to-peer querying of annotation data • Personal annotations on local data a useful and relatively easy feature to implement • Peer-to-peer access contentious and needs to be well controlled • Potentially could create a real community of online scientists • Effectively a scientific “Napster”

  25. Built-in Basic Data Viewers • Need to have minimum built-in capability • Text viewer • SVG Graphics viewer • NCBI DTD-based BLAST browser • Multiple sequence alignment viewer • Molecule renderer

  26. Automatic data updates via SOAP calls • Server-side must be set up for providing SOAP method calls • Potential to drastically reduce server load by performing incremental search getBlastData( string sequence, string last-queried )
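
The server-load argument rests on returning only what is new since the client last asked. A minimal sketch of that idea follows, assuming the server records when each hit became available and that `last-queried` is an ISO date string. Neither detail is specified on the slide, and the `HITS` table and hit names are made up.

```python
from datetime import date

# Hypothetical hit table: sequence -> [(hit name, date the hit became available)].
HITS = {
    "MKTAYIAKQR": [
        ("hit-A", date(2003, 1, 10)),
        ("hit-B", date(2003, 6, 2)),
        ("hit-C", date(2003, 9, 15)),
    ]
}


def get_blast_data(sequence, last_queried=None):
    """Return all hits, or only those added after last_queried (YYYY-MM-DD)."""
    hits = HITS.get(sequence, [])
    if last_queried is None:
        # First request: the client has nothing cached, so send everything.
        return [name for name, _ in hits]
    cutoff = date.fromisoformat(last_queried)
    return [name for name, added in hits if added > cutoff]
```

A client that last queried on 2003-06-02 would receive only `hit-C` and could merge it into the results it already stores locally.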

  27. Point-and-click interface • Intuitive interface • Constructed with an eye on developments in personal computing, e.g. PDAs and Tablet computers

  28. What Next…? • Upload a seed Java-based project onto the Bioinformatics.org site together with an RFC • Discuss online the merits of the project

  29. Summary • A genuine need for a means to: • Collate data • Update data automatically • Enable shared data annotations • Perform specialized data processing • Java provides a compelling platform to develop an open version of this client-side application

  30. EOL Team Dave Archbell Kim Baldridge Chaitanya Baru Fran Berman Philip Bourne Robert Byrnes Henri Casanova Eliot Clingman Neil Cotofana Cassie Ferguson Tony Fountain Jerry Greenberg Michael Gribskov Dana Jermanis Wilfred Li Jennifer Matthews Mark Miller Julie Mitchell Coleman Mosley Greg Quinn Vicente Reyes Jerry Rowley Peter Shin Ilya Shindyalov Chris Smith David Stoner Stella Veretnik

  31. Further information: http://www.eolproject.info http://www.bioinformatics.org/opennotebook
