
Data management, curation, and display


Presentation Transcript


  1. Data management, curation, and display. Bob Sinkovits, AfCS Bioinformatics Lab, San Diego Supercomputer Center, UC San Diego

  2. AfCS quick overview • Cell/molecular biology project focusing on cellular signaling • Collaboration involving eight laboratories plus other investigators working on bridging projects and data analysis • Two main activities • High throughput generation of experimental data • Molecule page project (in collaboration with NPG) • http://www.signaling-gateway.org

  3. The data management problem • Collecting and archiving data • Tracking meta-data associated with experiments (reagents, technicians, labs, dates, machine settings, protocols, etc.) • Processing raw data • Curation • Organization and display • Data distribution

  4. Data collection Data acquisition for the AfCS involves the separate transfer of experimental data and the description of the experiment (meta-data). [Diagram: labs enter meta-data through GUIs, while experimental data (results) is pulled to SDSC via wget]

  5. Experimental data collection Experimental data files transferred on a nightly basis using the UNIX wget utility under control of a cron job. [Diagram: data flows to SDSC from the AfCS laboratories (UTSW, UCSF, Caltech, Stanford, Vanderbilt, Myriad); data types include Ca++, cAMP, phosphoprotein, cytokine, Y2H, microscopy, single-cell Ca++, microarray, and lipid MS]

  6. Experimental data collection • The UNIX wget utility was ideal for our project, where data needs to be collected from a limited number of sites on a regular basis. Filters allow control over transfers. • One drawback is that file transfers are initiated whenever the timestamps on the remote files change, even if the contents have not. It may be worthwhile to write a better wget that also compares checksums
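
A minimal sketch of the checksum comparison suggested above, written in Python rather than as a wget patch; it assumes the remote site publishes an MD5 manifest at a known URL (the manifest URL, its format, and the local directory are all hypothetical):

    import hashlib
    import os
    import urllib.request

    MANIFEST_URL = "https://lab.example.edu/data/md5sums.txt"  # hypothetical
    LOCAL_DIR = "/afcs/incoming"

    def local_md5(path):
        """MD5 of a local file, or None if the file does not exist yet."""
        if not os.path.exists(path):
            return None
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Each manifest line is assumed to read "<md5>  <filename>".
    # Fetch only files whose checksum actually changed, not every
    # file whose timestamp was touched.
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        for line in resp.read().decode().splitlines():
            remote_sum, name = line.split()
            path = os.path.join(LOCAL_DIR, name)
            if local_md5(path) != remote_sum:
                url = MANIFEST_URL.rsplit("/", 1)[0] + "/" + name
                urllib.request.urlretrieve(url, path)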

  7. Meta-data collection • Meta-data is inserted directly into the AfCS Oracle database through a set of Java Swing GUIs • Sample, experiment, cell line, etc. IDs are generated automatically based on date, laboratory code, etc. • Error checking, the use of pull-down menus, and database constraints ensure that valid data is entered into the GUIs
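
To make the automatic ID generation concrete, here is a hedged Python sketch; the slides do not give the actual AfCS ID format, so the date + laboratory code + sequence layout and the lab codes below are assumptions:

    from datetime import date

    # Hypothetical controlled vocabulary, mirroring the pull-down menus
    # and database constraints that keep invalid values out of the GUIs.
    VALID_LAB_CODES = {"SD", "SW", "CT"}

    def next_sample_id(lab_code: str, seq: int) -> str:
        """Build a sample ID from lab code, date, and a running sequence
        number, e.g. SD200401150042. The exact format is invented."""
        if lab_code not in VALID_LAB_CODES:
            raise ValueError(f"unknown laboratory code: {lab_code}")
        return f"{lab_code}{date.today():%Y%m%d}{seq:04d}"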

  8. Meta-data collection

  9. Meta-data collection • All experimental samples and materials (protein extracts, gels, cell preps, plasmids, solutions, reagents, etc.) are physically labeled using a 2-D barcode (Symbol Cyclone scanner, Zebra Z4M barcode printer)
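
For readers who want to prototype the labeling step in software, the sketch below uses the third-party Python qrcode package to render a 2-D code; the actual AfCS setup used the Symbol/Zebra hardware named above, so this is purely illustrative and the sample ID is invented:

    import qrcode  # third-party: pip install qrcode[pil]

    sample_id = "SD200401150042"   # hypothetical sample ID
    img = qrcode.make(sample_id)   # PIL image of the 2-D code
    img.save(f"{sample_id}.png")   # image could then be sent to a label printer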

  10. Data/information flow [Diagram connecting: labs, GUIs (meta-data), data files, parse.pl, postprocess.pl, Oracle 9i, curation, www display, SRB, disk/tape silo at SDSC, off-site backup at Caltech]

  11. Databasing • Each type/category of experimental data is stored in a separate database schema • Easier to work with schemas containing smaller numbers of tables • Minimizes possibility of data loss/corruption • Avoids confusion due to multiple developers working in a single schema (overlap of namespaces) • Easier recovery • Privileges granted as needed between schemas
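
A small sketch of the cross-schema grants mentioned in the last bullet, issued here through Python's cx_Oracle driver; the connection details, schema names, and table name are all invented for illustration:

    import cx_Oracle

    conn = cx_Oracle.connect("admin", "secret", "afcsdb")  # placeholder credentials
    cur = conn.cursor()

    # Give the cytokine schema read-only access to a single table in the
    # calcium schema, instead of letting both teams share one schema.
    cur.execute("GRANT SELECT ON ca_data.experiments TO cytokine")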

  12. Databasing • Strongly encourage using multiple instances • Production, test, and development for datasets modified by large numbers of users • Production and development may be suitable for datasets that are modified by one or a few users • Multiple instances • Provide test beds for new releases of the RDBMS • Allow developers to make schema modifications without impacting the production system

  13. Databasing • Oracle has worked great for us, but it's not cheap (even with the educational discount) • For large databases, you need to think carefully about performance • Every table should have a primary key • Use indexes for columns that are frequently used in searches • Run ANALYZE on a regular basis • Use bind variables
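
The bind-variable advice is worth a concrete example. The Python cx_Oracle sketch below contrasts string interpolation with binding; the table and column names are invented. With bind variables Oracle can reuse one parsed statement across calls instead of hard-parsing new SQL text for every value:

    import cx_Oracle

    conn = cx_Oracle.connect("afcs_app", "secret", "afcsdb")  # placeholders
    cur = conn.cursor()

    sid = "SD200401150042"  # hypothetical sample ID

    # Bad: embedding the value creates new SQL text per sample,
    # defeating the shared statement cache (and inviting injection).
    #   cur.execute(f"SELECT protocol FROM experiments WHERE sample_id = '{sid}'")

    # Good: one statement, the value supplied as a bind variable.
    cur.execute(
        "SELECT protocol, run_date FROM experiments WHERE sample_id = :sid",
        sid=sid,
    )
    row = cur.fetchone()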

  14. Data archives • For users who want to analyze complete data sets, downloading results one experiment at a time can be tedious and impractical • For projects that deal with large amounts of data, an ftp server is essential for distributing complete archives

  15. Data archives Archives of data sets can be downloaded at ftp://ftp.afcs.org/pub/datacenter
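
As an example, a complete data set can be pulled from the FTP site above with a few lines of Python's standard ftplib; the archive filename here is a guess:

    from ftplib import FTP

    ftp = FTP("ftp.afcs.org")
    ftp.login()                      # anonymous login
    ftp.cwd("pub/datacenter")
    fname = "calcium_data.tar.gz"    # hypothetical archive name
    with open(fname, "wb") as f:
        ftp.retrbinary(f"RETR {fname}", f.write)
    ftp.quit()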

  16. Data curation • Need to provide a convenient way for the AfCS labs to curate data • By ligand (don't release until replicated) • By experiment (flag bad experiments) • By sample (flag bad samples w/o discarding the expt) • Web interfaces for curation have been developed, with access restricted by user

  17. Data curation • Ligands, experiments, and samples can be annotated in three ways • Public – available to the public • Internal – restricted to internal use; validity of data still being investigated or experimental conditions not yet replicated • Invalid – experiment or sample flagged as bad; not available to anyone
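
A minimal sketch of how the three annotation states might gate visibility; the class and field names are invented, and the real AfCS implementation lives in the Oracle schema and the curation web layer:

    from enum import Enum

    class CurationStatus(Enum):
        PUBLIC = "public"      # released for everyone
        INTERNAL = "internal"  # under review or awaiting replication
        INVALID = "invalid"    # flagged as bad; hidden from everyone

    def visible_records(records, internal_user: bool):
        """Filter experiment/sample records by curation status."""
        allowed = {CurationStatus.PUBLIC}
        if internal_user:
            allowed.add(CurationStatus.INTERNAL)
        return [r for r in records if r["status"] in allowed]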

  18. Data curation

  19. Data curation by ligand For curation by ligand, the interface is based on the public display, with additional features

  20. Data curation by sample/expt [Screenshots: curate-by-experiment and curate-by-sample interfaces]

  21. Data curation by sample/expt [Screenshots: curate-by-experiment and curate-by-sample interfaces, continued]

  22. Data curation by sample/expt For some assays, such as cytokine and phosphoprotein, the large number of samples makes curation by sample ID impractical. Curation is limited to the experiment level

  23. Summary • Catch/prevent data entry errors early • Build business rules (constraints) into database • Limit choices through controlled vocabularies • LIMS much more scalable, shareable than traditional laboratory notebooks • wget ideal for data transfers from limited number of sites • When presenting data to the public, provide access at different granularities

  24. Summary • Use multiple database instances • Production and development at minimum • Plus test for projects where data is entered by large numbers of users • Use multiple schemas, with privileges granted as necessary • Safer • Ease of development • Compartmentalization of data

  25. Acknowledgements • Madhusudan, Ilango Vadivelu – LIMS • Stephen Lyon – web master • Brad Kroeger – systems administration • Chic Barna, Ray Bean – database administration • Sylvain Pradervand – phosphoprotein display • Ron Taussig, Gil Sambrano, Richard Scheuermann - data center design • Paul Sternweis – Ca++, cAMP display • Susie Mumby – phosphoprotein, cytokine display • Lonnie Sorrels, Keng-Mean Lin, Sangdun Choi, Nick Wong, Robert Hsueh, Heping Han, Ruth Levitz
