Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury Laboratory
Progress of a PX project • Data Collection (synchrotron, home source) • Structure Solution (CCP4 etc.) • Structure Deposition (PDB @ RCSB, EBI) • Database Queries
PROTEIN DATABANK • international repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR. • data deposited to PDB at RCSB (U.S.) and EBI (U.K.)
USES OF PDB • Retrieval of data of single structure • Global searches (e.g. for molecule name, particular cofactor, etc.) • Generating statistics (e.g. structures vs. resolution) • Derived databases (e.g. ReLiBase, scop/CATH)
Examples of deposited information • Name of source organism • Reference to sequence database entry • Temperature of diffraction expt. • No. of unique reflections • Rmerge as function of resolution • Starting model for molecular replacement • Restraints used in refinement • Identification of secondary structure elements • Atomic coordinates and structure factor amplitudes
HARVESTING CONCEPT • Pioneered by EBI deposition centre. • Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site • Why? • More reliable data • Richer database
HARVEST: Action • Action of harvesting is entirely local. • A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file. • Control of the contents of the harvest file is by the developers of the software being run and the researcher running it.
HARVEST: File Format • mmCIF has been selected as the format to represent harvest (deposition) data items • several files are generated • mmCIF relationships not necessarily maintained • ‘TRUE’ final complete mmCIF file only generated after complete processing of a submission at the deposition site
Identifying harvesting files • Each run of a harvesting program produces a single file. • Files identified by Project Name and Dataset Name.
Project Name • Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited • Equivalent to a PDB idcode or _entry.id • E.g. • A new native structure • A mutant structure • A ligand protein complex
Dataset Name • Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name. • Equivalent to _diffrn.id • E.g. • Each wavelength in a MAD experiment • Each Heavy atom derivative • Each different NMR experiment carried out in the course of a structure determination
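The way Project Name and Dataset Name map onto _entry.id and _diffrn.id can be sketched as a small writer. This is a minimal illustration, not the CCP4 harvlib code: the function names `format_value` and `harvest_text` are hypothetical, the `data_Project[Dataset]` block name and the date-stamp layout are copied from the SCALA example in these slides, and the quoting rule (single quotes around values containing whitespace) is a simplifying assumption.

```python
from datetime import datetime, timezone


def format_value(value):
    """Quote a value with single quotes if it contains whitespace (simplified rule)."""
    text = str(value)
    return f"'{text}'" if any(ch.isspace() for ch in text) else text


def harvest_text(project, dataset, items, when=None):
    """Build the text of a harvest file for one program run.

    project -> _entry.id, dataset -> _diffrn.id; 'when' defaults to now (UTC).
    """
    stamp = (when or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    lines = [
        f"data_{project}[{dataset}]",  # block name as in the SCALA example
        f"_entry.id {project}",
        f"_diffrn.id {dataset}",
        f"_audit.creation_date {stamp}",
    ]
    lines += [f"{tag} {format_value(value)}" for tag, value in items.items()]
    return "\n".join(lines) + "\n"


example = harvest_text(
    "phosphate_binding_protein",
    "A197C_chromophore_x",
    {"_software.name": "Scala", "_software.classification": "data reduction"},
    when=datetime(1997, 10, 30, 12, 43, 41, tzinfo=timezone.utc),
)
print(example)
```

Each wavelength of a MAD experiment would call this with the same project and a different dataset, giving one file per experiment under a single structure.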
Management of harvest Files • CCP4 Prototype uses a directory in $HOME to store harvest files with file names: $HOME/DepositFiles/PName/DName.ProgName_mode • Files sent to EBI at time of deposition. • Ultimately the individual research worker is responsible for the management of their own data files.
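The naming scheme above can be sketched in a few lines. This is an illustration under assumptions: `harvest_path` is a hypothetical helper, and the trailing `_mode` in the pattern is read here as a per-program mode suffix supplied by the caller, which the slide does not spell out.

```python
from pathlib import Path
from typing import Optional


def harvest_path(project: str, dataset: str, program: str, mode: str,
                 home: Optional[Path] = None) -> Path:
    """Return <home>/DepositFiles/<PName>/<DName>.<ProgName>_<mode>.

    'home' defaults to the current user's home directory ($HOME).
    """
    base = home if home is not None else Path.home()
    return base / "DepositFiles" / project / f"{dataset}.{program}_{mode}"


# Hypothetical project, dataset and mode names, for illustration only.
path = harvest_path("pbp", "native", "scala", "merge", home=Path("/home/user"))
print(path.as_posix())
```

Because the directory is keyed on Project Name, all harvest files for one structure collect in one place, which is what makes sending them together at deposition time practical.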
HARVEST: Problems • Management of harvesting files: • A structure may be solved by more than one user • A structure may be solved using different machines not NFS connected • More than one run and which run is FINAL? • Scope of harvesting: • Need to persuade software authors to adopt protocol • Still need manual addition/checking of information
Implementation in CCP4 • Harvesting files produced by: • [MOSFLM] (data processing) • SCALA / TRUNCATE (data reduction) • MLPHARE (phasing) • RESTRAIN / REFMAC (refinement) • Associated libraries: • libccif - Peter Keller’s suite of routines to read and write mmCIF files • harvlib.f - Kim Henrick’s Fortran front end to libccif • Public release: January 2000
Example: SCALA output (1)
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_audit.creation_date 1997-10-30T12:43:41+00:00
_software.classification 'data reduction'
_software.contact_author 'P.R. Evans'
_software.contact_author_email pre@mrc-lmb.cam.ac.uk
_software.description 'scale together multiple observations of reflections'
_software.name Scala
_software.version 'CCP4_2.2.3 1/7/97'
Example: SCALA output (2)
_diffrn_reflns.d_res_low 35.36
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_measured_all 17986
_diffrn_reflns.number_unique_all 6645
_diffrn_reflns.number_centric_all 363
_diffrn_reflns.number_anomalous_all 2348
_diffrn_reflns.Rmerge_I_anomalous_all 0.050
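Output of this kind is simple enough to read back with a few lines of code. The sketch below is not libccif and not a general mmCIF parser: `parse_harvest_items` is a hypothetical name, it handles only single-line tag/value pairs (no loops or multi-line text fields), and it borrows shell-style quoting from `shlex`, which differs from mmCIF quoting in edge cases.

```python
import shlex


def parse_harvest_items(text):
    """Return (block_name, {tag: value}) for simple one-line tag/value items."""
    block = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block = line[len("data_"):]
        elif line.startswith("_"):
            parts = shlex.split(line)  # handles 'quoted values with spaces'
            if len(parts) == 2:
                items[parts[0]] = parts[1]
    return block, items


# A few items taken from the SCALA example in these slides.
SAMPLE = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_software.contact_author 'P.R. Evans'
_software.version 'CCP4_2.2.3 1/7/97'
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_unique_all 6645
"""

block, items = parse_harvest_items(SAMPLE)
print(block)
print(items["_software.contact_author"], items["_diffrn_reflns.number_unique_all"])
```

All values come back as strings; converting, say, the resolution limits to floats is left to the caller, since the harvest file itself carries no type information.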
User Input • For each program run, user can specify: • Project Name • Dataset Name • USECWD - write harvest file to cwd rather than deposit directory; useful for trial runs • NOHARVEST - do not write harvest file
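How these options might steer where a harvest file ends up can be sketched as follows. This is an assumption-laden illustration rather than the CCP4 logic: `harvest_destination` and its parameters are hypothetical, the `mode` suffix follows one reading of the DName.ProgName_mode pattern, and giving NOHARVEST precedence over USECWD is a choice made here, not something the slides state.

```python
from pathlib import Path
from typing import Optional


def harvest_destination(project: str, dataset: str, program: str, mode: str,
                        usecwd: bool = False, noharvest: bool = False,
                        home: Optional[Path] = None,
                        cwd: Optional[Path] = None) -> Optional[Path]:
    """Return the harvest file path for this run, or None if harvesting is off."""
    if noharvest:
        return None  # NOHARVEST: write nothing (assumed to override USECWD)
    filename = f"{dataset}.{program}_{mode}"
    if usecwd:
        # USECWD: trial run, keep the file out of the deposit directory
        return (cwd if cwd is not None else Path.cwd()) / filename
    base = home if home is not None else Path.home()
    return base / "DepositFiles" / project / filename


# Hypothetical names, for illustration only.
home = Path("/home/user")
work = Path("/scratch/trial")
default_dest = harvest_destination("pbp", "native", "scala", "merge", home=home)
trial_dest = harvest_destination("pbp", "native", "scala", "merge",
                                 usecwd=True, home=home, cwd=work)
no_dest = harvest_destination("pbp", "native", "scala", "merge", noharvest=True)
print(default_dest, trial_dest, no_dest)
```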
Automation • All that program needs to know is Project Name and Dataset Name • This information carried between programs in header section of reflection file (MTZ file) • Information written to reflection file as soon as possible (ideally written to image files and passed on).
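The idea of carrying the two names along with the reflection data can be illustrated with a toy model. Nothing here reflects the real MTZ header layout: `ReflectionFile` and `run_program` are hypothetical stand-ins, and the harvest label is just a string built from the inherited names.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ReflectionFile:
    """Toy stand-in for a reflection file whose header carries the two names."""
    project: str
    dataset: str
    history: Tuple[str, ...] = ()


def run_program(name: str, infile: ReflectionFile):
    """Simulate one program run.

    The output file inherits project and dataset from the input header, so the
    user never has to retype them; the run also yields a harvest file label.
    """
    outfile = replace(infile, history=infile.history + (name,))
    harvest_label = f"{outfile.project}/{outfile.dataset}.{name}"
    return outfile, harvest_label


# Names are set once, as early as possible, then flow through every step.
start = ReflectionFile(project="pbp", dataset="native")
after_scala, label1 = run_program("scala", start)
after_truncate, label2 = run_program("truncate", after_scala)
print(label1, label2, after_truncate.history)
```

Setting the names at the first step is what makes the rest automatic: every later program only reads its input header.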
Current status • Harvesting software released as part of CCP4 in January 2000. • No harvesting files sent to EBI as yet (early days!) • CNS also produces harvesting files, and these have seen some use • Plans to extend the concept to data from NMR and EM
Acknowledgements • Kim Henrick, Peter Keller (EBI) • Eleanor Dodson, Phil Evans (CCP4) • BBSRC
http://www.dl.ac.uk/CCP/CCP4/newsletter35/dataharvest.html
http://www.dl.ac.uk/CCP/CCP4/newsletter37/13_harvest.html