Data Harvesting: automatic extraction of information necessary

Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography
Martyn Winn, CCP4, Daresbury Laboratory

Progress of a PX project • Data Collection (synchrotron, home source) • Structure Solution (CCP4 etc.) • Structure Deposition (PDB @ RCSB, EBI) • Database Queries

PROTEIN DATABANK • international repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR. • data deposited to PDB at RCSB (U.S.) and EBI (U.K.)

USES OF PDB • Retrieval of data of single structure • Global searches (e.g. for molecule name, particular cofactor, etc.) • Generating statistics (e.g. structures vs. resolution) • Derived databases (e.g. ReLiBase, scop/CATH)

Examples of deposited information • Name of source organism • Reference to sequence database entry • Temperature of diffraction expt. • No. of unique reflections • Rmerge as function of resolution • Starting model for molecular replacement • Restraints used in refinement • Identification of secondary structure elements • Atomic coordinates and structure factor amplitudes

HARVESTING CONCEPT • Pioneered by EBI deposition centre. • Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site • Why? • More reliable data • Richer database

HARVEST: Action • Action of harvesting is entirely local. • A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file. • Control of the contents of the harvest file is by the developers of the software being run and the researcher running it.

HARVEST: File Format • mmCIF has been selected as the format to represent harvest (deposition) data items • several files are generated • mmCIF relationships not necessarily maintained • ‘TRUE’ final complete mmCIF file only generated after complete processing of a submission at the deposition site

Identifying harvesting files • Each run of a harvesting program produces a single file. • Files identified by Project Name and Dataset Name.

Project Name • Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited • Equivalent to a PDB idcode or _entry.id • E.g. • A new native structure • A mutant structure • A ligand protein complex

Dataset Name • Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name. • Equivalent to _diffrn.id • E.g. • Each wavelength in a MAD experiment • Each Heavy atom derivative • Each different NMR experiment carried out in the course of a structure determination
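As an illustration of how the two names map onto mmCIF items, the sketch below writes the opening lines of a harvest file in the layout of the SCALA example later in these slides (data block name, then _entry.id and _diffrn.id). The function name `harvest_header` and the rule that names must not contain whitespace are assumptions made for this example, not part of the CCP4 libraries.

```python
def harvest_header(project: str, dataset: str) -> str:
    """Return the first three lines of a harvest file for one dataset.

    Hypothetical helper: the layout is copied from the SCALA example,
    the whitespace check is an assumption made for this sketch.
    """
    for name in (project, dataset):
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"name must be non-empty without whitespace: {name!r}")
    return "\n".join([
        f"data_{project}[{dataset}]",   # data block: Project Name[Dataset Name]
        f"_entry.id {project}",         # Project Name, equivalent to _entry.id
        f"_diffrn.id {dataset}",        # Dataset Name, equivalent to _diffrn.id
    ])


print(harvest_header("phosphate_binding_protein", "A197C_chromophore_x"))
```

Each wavelength of a MAD experiment would call this with the same project and a different dataset, giving one data block per experiment.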

Management of harvest Files • CCP4 Prototype uses a directory in $HOME to store harvest files with file names: $HOME/DepositFiles/PName/DName.ProgName_mode • Files sent to EBI at time of deposition. • Ultimately the individual research worker is responsible for the management of their own data files.
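The file naming scheme above can be sketched in a few lines of Python. This is only an illustration of the pattern $HOME/DepositFiles/PName/DName.ProgName_mode; the function name `harvest_path` and the sample values (including the `mode` string, whose allowed values the slide does not give) are hypothetical.

```python
import os
from pathlib import Path


def harvest_path(project, dataset, program, mode, home=None):
    """Build $HOME/DepositFiles/PName/DName.ProgName_mode.

    `home` defaults to the current user's home directory; passing it
    explicitly is a convenience for this sketch only.
    """
    base = Path(home if home is not None else os.path.expanduser("~"))
    return base / "DepositFiles" / project / f"{dataset}.{program}_{mode}"


# Hypothetical names, for illustration only.
p = harvest_path("pbp", "native1", "scala", "scale", home="/home/user")
print(p.as_posix())
```

Keeping the project as a directory and the dataset as the file stem means all files for one eventual deposition sit together, which is what makes sending them to the EBI in one step practical.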

HARVEST: Problems • Management of harvesting files: • A structure may be solved by more than one user • A structure may be solved using different machines not NFS connected • More than one run and which run is FINAL? • Scope of harvesting: • Need to persuade software authors to adopt protocol • Still need manual addition/checking of information

Implementation in CCP4 • Harvesting files produced by: • [MOSFLM] (data processing) • SCALA / TRUNCATE (data reduction) • MLPHARE (phasing) • RESTRAIN / REFMAC (refinement) • Associated libraries: • libccif - Peter Keller’s suite of routines to read and write mmCIF files • harvlib.f - Kim Henrick’s Fortran front end to libccif • Public release: January 2000

Example: SCALA output (1)
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_audit.creation_date 1997-10-30T12:43:41+00:00
_software.classification 'data reduction'
_software.contact_author 'P.R. Evans'
_software.contact_author_email pre@mrc-lmb.cam.ac.uk
_software.description 'scale together multiple observations of reflections'
_software.name Scala
_software.version 'CCP4_2.2.3 1/7/97'

Example: SCALA output (2)
_diffrn_reflns.d_res_low 35.36
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_measured_all 17986
_diffrn_reflns.number_unique_all 6645
_diffrn_reflns.number_centric_all 363
_diffrn_reflns.number_anomalous_all 2348
_diffrn_reflns.Rmerge_I_anomalous_all 0.050
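To show how simple tag/value items like those in the SCALA output can be read back, here is a minimal Python sketch. It assumes one item per line and single-quoted values without embedded apostrophes, and it borrows `shlex` for quote handling; it is not a full mmCIF parser (no loops, no multi-line text fields) and is unrelated to libccif. The function name `parse_harvest_items` is hypothetical.

```python
import shlex


def parse_harvest_items(text):
    """Collect '_category.item value' pairs from harvest-style text.

    Simplification: only lines starting with '_' are read, and each
    is assumed to hold exactly one tag followed by its value.
    """
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip the data_ block header and blank lines
        tokens = shlex.split(line)  # keeps 'data reduction' as one token
        if len(tokens) >= 2:
            items[tokens[0]] = " ".join(tokens[1:])
    return items


sample = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_software.classification 'data reduction'
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_unique_all 6645
"""

items = parse_harvest_items(sample)
print(items["_software.classification"])
print(float(items["_diffrn_reflns.d_res_high"]))
```

Values are kept as strings, so the caller decides which items to convert to numbers; that avoids guessing types for items such as dates or version strings.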

User Input • For each program run, user can specify: • Project Name • Dataset Name • USECWD - write harvest file to cwd rather than deposit directory; useful for trial runs • NOHARVEST - do not write harvest file
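The effect of the USECWD and NOHARVEST keywords can be sketched as a small decision function. This is an illustration only: the function name `harvest_destination`, the way keywords are passed in, and the choice that NOHARVEST takes precedence when both are given are assumptions, not documented CCP4 behaviour.

```python
from pathlib import Path


def harvest_destination(keywords, project, deposit_root, cwd):
    """Return the directory for the harvest file, or None if none is written.

    Assumption for this sketch: NOHARVEST wins if both keywords appear.
    """
    given = {k.upper() for k in keywords}
    if "NOHARVEST" in given:
        return None                       # do not write a harvest file
    if "USECWD" in given:
        return Path(cwd)                  # trial run: keep the file local
    return Path(deposit_root) / project   # default: the deposit directory


# Hypothetical paths and project name.
print(harvest_destination([], "pbp", "/home/user/DepositFiles", "/tmp/run1"))
print(harvest_destination(["USECWD"], "pbp", "/home/user/DepositFiles", "/tmp/run1"))
print(harvest_destination(["NOHARVEST"], "pbp", "/home/user/DepositFiles", "/tmp/run1"))
```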

Automation • All that program needs to know is Project Name and Dataset Name • This information carried between programs in header section of reflection file (MTZ file) • Information written to reflection file as soon as possible (ideally written to image files and passed on).

Current status • Harvesting software released as part of CCP4 in January 2000. No harvesting files sent to EBI as yet (early days!) • CNS also produces harvesting files, and these have seen some use • Plans to extend the concept to data from NMR and EM

Acknowledgements • Kim Henrick, Peter Keller (EBI) • Eleanor Dodson, Phil Evans (CCP4) • BBSRC
http://www.dl.ac.uk/CCP/CCP4/newsletter35/dataharvest.html
http://www.dl.ac.uk/CCP/CCP4/newsletter37/13_harvest.html
