Xianfeng Jeff Chen Ph.D . Research Investigator/Project Manager

Presentation Transcript


  1. Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System • Xianfeng Jeff Chen Ph.D. • Research Investigator/Project Manager

  2. Agenda Today (1) Introduction (2) Database Development (3) Strategy on Knowledgebase Development • VBI responsibility in Admin Center • PRC data types and organisms • Proteomics data submission and storage workflow • VBI computing system architecture (CPU and storage) • VBI database system prototype and functionality • VBI existing database schema and status • Example Y2H schema for design logic and case study • Proposed data integration and knowledgebase construction

  3. Introduction

  4. Proteomics Data Management: Tasks of Proteomics Data Management • Raw data (processed data) • Data storage & visualization tools (VBI) • Data QA/QC, interoperability (VBI/GU) • Analysis, annotation, & curation (GU) • SOP, LIMS, & Adm DB (SSS)

  5. PRCs Major Data Type (Organization: Major Data Type) • University of Michigan: microarray and mass spectrometry • Caprion: mass spectrometry • Harvard Proteomics Institute: genomics and protein expression array • Albert Einstein College of Medicine: mass spectrometry • PNNL: mass spectrometry • Scripps: NMR structural and X-ray crystal diffraction data • Myriad Genetics: yeast two-hybrid system

  6. PRCs Organisms • Einstein: Toxoplasma gondii, Cryptosporidium parvum • Caprion: Brucella abortus • Harvard: Bacillus anthracis (protein array), Vibrio cholerae • Myriad: Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia • PNNL: Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi • Scripps: SARS CoV • Michigan: Bacillus anthracis (TXP, MS) + host (human)

  7. Proteomics Data Flow: Data Modeling w/ Decomposition • Data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray, cryoelectron microscopy, X-ray diffraction, etc. • PRCs: converting to standard format, QA & QC • VBI: standard format for each data type, quality assurance & quality control, relational database • Public data sources • MIAME and MIAPE-like standards/SOP for data submission
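
The QA/QC gate in this data flow could be sketched roughly as below. This is a minimal Python illustration only: the required-field checklist, the field names, and the `qc_errors` helper are hypothetical and are not taken from the VBI system; the data-type names are a subset of those listed on the slide.

```python
# Hypothetical submission checklist; the real MIAME/MIAPE-like SOP is not given in the slides.
REQUIRED_FIELDS = ("prc", "organism", "data_type", "file_name")

# Subset of the data types named on the slide.
KNOWN_DATA_TYPES = {"2D gel", "protein array", "LC", "Y2H", "MS", "MS/MS", "NMR"}


def qc_errors(record):
    """Return a list of human-readable problems found in one submission record."""
    errors = []
    for field_name in REQUIRED_FIELDS:
        value = record.get(field_name)
        # Treat absent, None, and blank values as missing.
        if value is None or not str(value).strip():
            errors.append(f"missing {field_name}")
    data_type = record.get("data_type")
    if data_type and data_type not in KNOWN_DATA_TYPES:
        errors.append(f"unknown data_type: {data_type}")
    return errors
```

A record with an empty error list would pass on to loading; anything else would go back to the submitting PRC.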

  8. Database Development

  9. VBI Computing System (diagram labels) • LINUX Web Server • Gimli • PC Users: Jeff, Wei, Chaitanya, Chengdong, Ranjan, Oswald, Bruno • SUN (Solaris) • Project • Elenwe • Binary Software • Data Storage • Proteomics • Application Server • Genomics • 7 PRCs • Networked File Server • Proteomics: Chendong, Jeff, Wei, Ranjan, Chaitanya • TUOR • Relational Database Server

  10. System Development in Q3 of 2005 • Instances: Development, Test/Stage, Production • Components: Web Interface, Database

  11. Proteomics Database Project Websites • Production: http://proteinbank.vbi.vt.edu/bprc • Test: http://proteinbankdev.gepasi.org/bprc/ • Development: http://txue.bioinformatics.vt.edu:8080/bprc and http://wsun.vbi.vt.edu:8080/bprc/

  12. Production Website Instance Dynamically generated webpage Functionalities: • Account management • File and doc management • News group and news update • Textual data display • 2D gel Image data display • Table and record query • Data uploading and simple submission • HTTP data downloading • SFTP file transfer

  13. Database Query • Search By Experiment: select experiment; retrieve list of bait protein and nucleotide, prey protein and nucleotide; links to details of bait and prey • Search By Organism (example: Drosophila melanogaster): Escherichia coli, Saccharomyces cerevisiae, Homo sapiens, Drosophila melanogaster, Helicobacter pylori, Caenorhabditis elegans • Search By Data Type: Proteomics, Genomics, Microarray
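
A "search by experiment" query that returns bait and prey lists might look something like the following sketch. It uses Python's built-in `sqlite3` as a stand-in for the production database; the `experiment` and `interaction` tables, their columns, and the sample rows are hypothetical, since the actual Y2H schema is not reproduced in this transcript.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical two-table Y2H layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE experiment (
            experiment_id INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            organism      TEXT
        );
        CREATE TABLE interaction (
            interaction_id INTEGER PRIMARY KEY,
            experiment_id  INTEGER REFERENCES experiment(experiment_id),
            bait_protein   TEXT,
            prey_protein   TEXT
        );
    """)
    conn.execute(
        "INSERT INTO experiment VALUES (1, 'demo-y2h-screen', 'Saccharomyces cerevisiae')"
    )
    conn.executemany(
        "INSERT INTO interaction (experiment_id, bait_protein, prey_protein) VALUES (?, ?, ?)",
        [(1, "baitA", "preyX"), (1, "baitA", "preyY")],
    )
    return conn


def bait_prey_by_experiment(conn, experiment_name):
    """Return (bait, prey) pairs recorded for the named experiment."""
    query = """
        SELECT i.bait_protein, i.prey_protein
        FROM interaction AS i
        JOIN experiment AS e ON e.experiment_id = i.experiment_id
        WHERE e.name = ?
        ORDER BY i.bait_protein, i.prey_protein
    """
    return conn.execute(query, (experiment_name,)).fetchall()
```

The search-by-organism and search-by-data-type paths would follow the same pattern with a different `WHERE` clause.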

  14. Query for Scripps Sample Data • Search By Project/Experiment • Scripps MS testing project • Available peptide hit list • Retrieve peak information and m/z & intensity list

  15. Query for 2D Gel Data • Search By Experiment/Sample

  16. Proteomics Database Architecture: Three Phases of Database Design • Data types: 2D Gel, MS, LC, NMR, Y2H, X-Ray Diffraction, X-Ray, Cryoelectron Microscopy, Protein Array, Immunoaffinity Purification • Physical Layer: multiple schemas of disparate data, consolidated to one schema to remove redundancy • Logical Layer: views (materialized views) • Application Layer: process-oriented, stored procedures for the analysis pipeline • Production design: normalized with key-value pairs • Final views

  17. Proteomics Database Architecture: Three Database Instances (Development, Test/stage, Production) • Phase 1 (Version 1, 0.5-1 year): individual dataset modeling; disparate data with multiple schemas • Phase 2 (Version 2, 1-1.5 years): consolidation into a few schemas; a normalized data model implemented as key-value pairs, highly decomposed • Phase 3 (Version 3, 2 years): analysis pipeline procedures; logical layer with views for the user • Physical layer • Partially processed data • Data enhanced with knowledge • Interface less changeable • Curated/annotated data
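
The combination of a highly decomposed key-value physical layer and a logical layer of views can be illustrated with a small sketch. It uses Python's built-in `sqlite3`, which offers ordinary views only (not the materialized views mentioned in the slides); the `observation` table, the `gel_spot` view, and the attribute names are hypothetical.

```python
import sqlite3


def build_demo():
    """Build an in-memory key-value store plus a view that pivots it into columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Physical layer: every attribute of every record is one key-value row.
        CREATE TABLE observation (
            record_id  INTEGER NOT NULL,
            attr_key   TEXT    NOT NULL,
            attr_value TEXT,
            PRIMARY KEY (record_id, attr_key)
        );
        -- Logical layer: a view that presents the pairs as ordinary columns.
        CREATE VIEW gel_spot AS
        SELECT record_id,
               MAX(CASE WHEN attr_key = 'experiment' THEN attr_value END) AS experiment,
               MAX(CASE WHEN attr_key = 'pI'         THEN attr_value END) AS isoelectric_point,
               MAX(CASE WHEN attr_key = 'mw_kda'     THEN attr_value END) AS mw_kda
        FROM observation
        GROUP BY record_id;
    """)
    rows = [
        (1, "experiment", "demo-gel"), (1, "pI", "5.4"), (1, "mw_kda", "42.0"),
        (2, "experiment", "demo-gel"), (2, "pI", "6.1"),  # record 2 has no mw_kda
    ]
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)
    return conn
```

New attributes can be added without altering the table, and users query the view rather than the raw pairs; the cost is that every read has to pivot, which is one reason a production system might precompute such views.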

  18. Status of VBI Database Development (Schema: Development with maturity score, Test/stage, Production) • Adm: + (10/10), +, + • 2D Gel: + (10/10), +, + • MS: + (10/10), +, + • Interaction: + (9/10), +, - • Pathway: + (7/10), +, - • Data Repository: + (8/10), +, + • Y2H: + (10/10), +, + • Genomics: + (10/10) (GUS), +, + • Microarray: + (10/10) (AE), +, + • Default tablespaces: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS

  19. Generic Experiment Data Components (Example of Database Design Logic) • Who (People) • Where (Organization) • Project (Goal) • Materials and Methods (Metadata) • Results (Raw Data) • Conclusion and Hypothesis (Processed and Analyzed Data)

  20. Y2H Data Component Modeling • People • Experiment • Sample • Project • Results • Conclusion/Hypothesis • DNA/Protein Detail

  21. Experiment Component Object Model • Objects: Experiment, Experiment Design, Design Description, Experiment Factor, Ontology Entry, Factor Value • Ontology entries handle two annotation cases: 1) attributes with diverse choices, for which existing ontologies capture the information better; 2) what are essentially controlled vocabularies, limited in number of choices but liable to grow in the future or vary by technology type
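
One way to read this object model is as a small set of nested classes in which annotations are stored as ontology entries rather than free text. The sketch below is an illustration under that assumption; the field names, the `source` attribute, and the `factor_levels` helper are hypothetical and not taken from the VBI schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class OntologyEntry:
    """A term drawn from an ontology or controlled vocabulary."""
    category: str
    value: str
    source: Optional[str] = None  # hypothetical: name of the ontology or vocabulary


@dataclass
class FactorValue:
    """One level of an experimental factor, expressed as an ontology entry."""
    entry: OntologyEntry


@dataclass
class ExperimentFactor:
    name: str
    category: OntologyEntry
    values: List[FactorValue] = field(default_factory=list)


@dataclass
class ExperimentDesign:
    description: str
    design_type: OntologyEntry
    factors: List[ExperimentFactor] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    design: ExperimentDesign


def factor_levels(experiment: Experiment) -> Dict[str, List[str]]:
    """Map each factor name to the list of its level values."""
    return {
        factor.name: [fv.entry.value for fv in factor.values]
        for factor in experiment.design.factors
    }
```

Because every annotation is an `OntologyEntry`, a vocabulary can grow or differ by technology type without changing the class structure.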

  22. Y2H Partial Database Schema

  23. Proteomics DB System Architecture • Batch processing: data uploading, data validation, data analysis, data processing • Languages and connectivity: Perl, Java; JSP, CGI; Java JDBC, Perl DBI/DBD, ODBC • Private file server • Oracle relational database • Public file server

  24. System Architecture of Putative VBI Proteomics Knowledgebase (Data, Tool, Project, and Team Interoperability) • Web display and data visualization • Application layer • Service-oriented middleware with process control • Temporary data • Virtual database/warehouse • Data sources: Mass Spectrometry, Array Express, Two Component System, 2D Gel, Structure Data, Genomics Data • Security (shown at each layer)

  25. Strategy on Data Integration and Construction of Knowledge Warehouse

  26. Biological Information Workflow (diagram labels) • Diagnostics, Therapeutics & Vaccines • Target Discovery • Biological Research • Knowledge Generation • Knowledge Management • Data Management • Curation and Annotation of Data • Cleaning, Processing Algorithms • Information Storage, Queries & DB Management

  27. VBI PDC Project Phases • Phase I (first 2 years): knowledge generation • Phase II (3rd-4th years): knowledge management • Phase III (5th year): knowledge presentation • Bio-IT scope and data integration: raw data management; schema development; data visualization; data standardization; integration at interface level; integration of data at DB level; interoperability of datasets; normalization and warehousing; predefined query; materialized view; comparative analysis; statistical analysis

  28. Mapping the Proteome • (1) Yeast two-hybrid system: measures association between two proteins; allows very high throughput • (2) Mass spectrometry: allows identification of proteins within large complexes (2-100 proteins); lower throughput

  29. Infer Complex Interaction Topology • Y2H analysis: binary interactions • MS analysis: n-ary interactions • Proteins • PO4 • Complex interaction model • Knowledgebase
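
Combining binary Y2H pairs with n-ary MS complexes into one network requires some convention for turning a complex into edges. The sketch below assumes the "matrix" expansion (every pair of co-purified proteins is linked), which is one common convention and not necessarily the model intended on the slide; the function name and evidence labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(binary_pairs, complexes):
    """Merge Y2H pairs and MS complexes into one evidence-labelled edge map.

    Returns a dict mapping frozenset({protein_a, protein_b}) to the set of
    evidence labels ("Y2H", "MS") supporting that edge.
    """
    edges = defaultdict(set)
    for a, b in binary_pairs:
        if a != b:  # self-interactions are skipped in this simple sketch
            edges[frozenset((a, b))].add("Y2H")
    for members in complexes:
        # Matrix expansion (assumption): link every pair within a complex.
        for a, b in combinations(sorted(set(members)), 2):
            edges[frozenset((a, b))].add("MS")
    return dict(edges)
```

Edges supported by both labels are natural candidates for higher confidence, while MS-only edges indicate co-membership in a complex rather than direct contact.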

  30. Bacillus anthracis Data (Data: Organization) • (1) Completed genome (Ames, Ames Ancestor, A2012): NCBI, TIGR • (2) Yeast two-hybrid interaction data: Myriad Genetics • (3) Mass spectrometry: Scripps and Caprion • (4) Microarray expression profiling: Univ. of Michigan • (5) Intraspecies and interspecies clustering: NCBI (COG) and TIGR • (6) Functional category assignment: GU (PIR)

  31. Strategy for Knowledgebase Construction (1) Annotation improvement • Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighborhood, etc. • Comparative genomics with two reference genomes: E. coli and yeast (2) Identifying anchor points for data integration • Known metabolic pathways (E. coli and yeast) • Known signal transduction pathways • Known gene regulation machinery • Known protein-protein interaction map
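
Of the non-homology-based methods listed, phylogenetic profiling is the simplest to illustrate: genes whose presence/absence patterns across a set of genomes match are predicted to be functionally linked. The sketch below is a minimal version using Hamming distance; the function names, the distance threshold, and the toy profiles are hypothetical.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Count the genomes in which two presence/absence profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def linked_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes.

    profiles maps a gene name to a tuple of 0/1 flags, one per reference genome.
    """
    genes = sorted(profiles)
    return [
        (a, b)
        for a, b in combinations(genes, 2)
        if hamming(profiles[a], profiles[b]) <= max_distance
    ]
```

In practice, profiles that are present in every genome or absent from nearly all of them carry little signal and are usually filtered out before pairing.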

  32. Data Integration • Lay down microarray data to add co-expression patterns to the gene network • Lay down MS multiple-interaction data to expand the network • Lay down Y2H interaction data and expand the network • Anchor on the knowledge network of reference genomes (E. coli and yeast) • Comparative genomics • Improved annotation • Genomics data • Putative Knowledgebase: No thing http://www.Bacillus_anthracis.org

  33. Data Mining and Knowledge Augmentation • Microarray • Literature • MS analysis • Y2H analysis

  34. Acknowledgement (Name, Role, Organization) • Dr. Jeff Chen, Project Manager/Investigator, VBI • Dr. Chendong Zhang, Senior Software Engineer, VBI • Dr. Steve Cammer, Bioinformatics Scientist, VBI • Dr. Oswald Crasta, Scientist and CI-Co-director, VBI • Susan Baker, DBA, VBI • Jiang Lu, DBA, VBI • Ranjan Jha, Software Engineer, VBI • Qiang Yu, Software Engineer, VBI • Jian Li, Software Engineer, VBI • Wei Sun, Software Engineer, VBI • Chaitanya Kommidi, Software Engineer, VBI • Dr. Bruno Sobral, Co-PI, VBI • Dr. Peter MacGarvey, Senior Bioinformatics Scientist, GU • Dr. Cathy Wu, Co-PI, GU • Paula Yadvish, Web Coordinator, SSS • Margaret Moore, PI, SSS
