The CLARION Project
For the “Infrastructure for Integration in Structural Sciences” (I2S2) meeting, Rutherford Labs, 11th February 2010
CLARION – Chemical Laboratory Repository In/Organic Notebooks
Principal Investigator: Peter Murray-Rust
Co-Investigator: Jim Downing
Project Team: Nick Day, Sam Adams, Brian Brooks
Unilever Centre, Department of Chemistry, University of Cambridge
CLARION overview
[Diagram labels: Internal Scientist, External Scientist, Crystallography Files (CIF), NMR files, ELN (IDBS), Publications database, EmMa Embargo Mgr, EmMa user interface, Data Loader, Data Releaser, JUMBO converters, CML, RDF, CHEM-1 repository, CHEM-0 repository, RDF triplestores, SPARQL interface, CLARION query app]
Numbered steps in the diagram:
1. Scientist collects data and stores it in a variety of locations
2. EmMa is notified about the new content
3. Scientist specifies the release conditions for the data
4. Timer waits until release conditions are met
5. Data is moved into the CHEM-1 repository...
6. ...and (at some time) into the CHEM-0 repository
7. Repository queried by scientists
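Steps 3 to 5 of the overview can be illustrated with a minimal sketch. It assumes the release condition is simply a date; the slides say only "release conditions", so the `DataRecord` class, its fields and the `release_due` function are hypothetical names for illustration, not EmMa's actual design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class DataRecord:
    """A data item EmMa has been notified about (hypothetical model)."""
    name: str
    release_after: Optional[datetime] = None  # assumed form of a release condition
    location: str = "source"                  # "source" until moved into "CHEM-1"


def release_due(records: List[DataRecord], now: datetime) -> List[DataRecord]:
    """One timer tick: move records whose embargo date has passed into CHEM-1."""
    released = []
    for record in records:
        if record.location != "source":
            continue  # already released on an earlier tick
        if record.release_after is None:
            continue  # scientist has not set release conditions yet
        if now >= record.release_after:
            record.location = "CHEM-1"
            released.append(record)
    return released


records = [
    DataRecord("crystal.cif", release_after=datetime(2010, 3, 1)),
    DataRecord("spectrum.nmr", release_after=datetime(2010, 9, 1)),
    DataRecord("notebook-entry"),  # no release conditions specified yet
]
released = release_due(records, now=datetime(2010, 6, 1))
print([r.name for r in released])
```

In this sketch a record with no conditions is never released, which matches the idea that the scientist must act (step 3) before the timer (step 4) has anything to wait for.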
CLARION architecture
(Blue boxes indicate logical machine environments)

EmMa’s role:
• Adds metadata
• Defines embargo release conditions
• Is the gatekeeper for metadata quality
• Is the gatekeeper for security (trust, authentication, authorisation)

[Diagram labels: Data Files, File Feed Adapter, Atom Feed, ELN, ELN API, ELN server, ELN Feed Adapter, Embargo Manager (EmMa), Atom Feed Reader, Data Handler, Release Manager, Lensfield Loader, CML, RDF, Triplestore, Chemical Structure index, CLARION repository, CHEM-0/1 repository, Query System, RDF GUI client]

Technology notes attached to the diagram boxes, in slide order:
• Jetty webserver; cron jobs; Java
• JUMBO converters; Ontologies: ChemAxiom, ORE, ORE Chem Expt
• Jetty webserver; Java & Clojure
• Jetty webserver; Java; H2db for metadata
• Jetty webserver; Java; SPARQL; SOAP
• Jetty webserver; cron jobs; Java; Sesame; Chemicx

Design principles used:
• Decoupling through standard web interfaces (http, Atom)
• Avoid data duplication (by using http references unless a copy is required)
• Don’t do manually that which can be done automatically
• Manual semantification as early as possible
• Automatic semantification as late as possible
• Give ability to undo an action during a grace period rather than getting confirmation
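The last design principle (undo during a grace period instead of asking for confirmation) could look roughly like the sketch below. The `PendingAction` class and the 24-hour default are assumptions made for illustration; the slides do not say how CLARION implements the grace period or how long it lasts.

```python
from datetime import datetime, timedelta


class PendingAction:
    """An action that takes effect only after a grace period, unless undone.

    Hypothetical illustration; not taken from the CLARION code base.
    """

    def __init__(self, description, requested_at, grace=timedelta(hours=24)):
        self.description = description
        self.effective_at = requested_at + grace  # end of the grace period
        self.undone = False

    def undo(self, now):
        """Cancel the action; succeeds only while still inside the grace period."""
        if self.undone or now >= self.effective_at:
            return False
        self.undone = True
        return True

    def is_effective(self, now):
        """True once the grace period has passed without an undo."""
        return not self.undone and now >= self.effective_at


requested = datetime(2010, 2, 11, 9, 0)
release = PendingAction("release dataset to CHEM-0", requested)
other = PendingAction("release dataset to CHEM-1", requested)

undo_in_time = release.undo(requested + timedelta(hours=1))   # inside the grace period
undo_too_late = other.undo(requested + timedelta(hours=48))   # grace period has expired
```

The user is never interrupted with a confirmation dialog; the cost is that nothing becomes final until the grace period ends.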
CLARION development stages & timings
[Diagram: Sources, EmMa, Data Loader and Repository, annotated with stage numbers 1 to 3. Timeline: 2010, Jan to Dec]

• Stage 1: First data-feed into EmMa
  • Atom-feeds from file stores
  • EmMa feed-readers
  • EmMa user review tool
  • EmMa output atom-feeds
• Stage 2: Basic functionality to store first data-type into repository
  • Lensfield reads EmMa feeds
  • Process data to CML
  • Process CML to RDF
  • Store triples into triple-store
  • Indexing of chemical structures
• Stage 3: Basic querying functionality
  • Authentication & authorisation
  • Pilot users loading data
  • V1 query tool

Timeline milestones:
• Scientists presented with data records to which they add metadata and then set embargo release conditions
• Data stored in RDF and chemical structures indexed
• System in use by pilot users & simple query interface for SSS & RDF queries. Querying by outside users.
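To make the Stage 2 step "Store triples into triple-store" concrete, here is a toy in-memory triple store with wildcard pattern matching. It is only a sketch of the idea: the architecture slide names Sesame and SPARQL, which this does not use, and the `ex:` identifiers are invented example data.

```python
class TripleStore:
    """Toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        """Store one triple; adding the same triple twice has no effect."""
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TripleStore()
# Invented example data; the predicates are not from ChemAxiom or ORE.
store.add("ex:sample42", "ex:hasFormula", "C6H6")
store.add("ex:sample42", "ex:measuredBy", "ex:NMR")
store.add("ex:sample43", "ex:hasFormula", "C2H6O")

formulas = store.match(p="ex:hasFormula")
about_42 = store.match(s="ex:sample42")
```

A real triplestore adds persistence, indexing and a query language, but the basic query operation is the same pattern match over triples.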
EmMa: A general tool for controlling data release between systems
[Diagram labels: ISIS, PDB, ELN, NCS, eCrystals, XRay, NMR, Etc; Pump; Atom feed; EmMa; Private Atom feed; Public Atom feed; Chem-1; Chem-0; PubChem (marked “?”)]
Legend: Original data plus basic metadata; Fully semantified data (RDF)
How EmMa could facilitate data release in collaborating institutions
[Diagram labels: Institution A, Private repository, EmMa, Public repository, Rutherford, neutron, Institution B, EmMa]
Events:
1. Scientist sends sample to Rutherford
2. Rutherford stores data locally and sends copy back to scientist
3. Institution’s EmMa is informed about new data
4. Scientist specifies data release conditions
5. Release conditions reached, data released to public repository
6. Rutherford monitors institution’s atom feed, detects data is released
7. Rutherford makes data visible in their own public-access repository
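The event in which Rutherford monitors the institution's Atom feed and detects released data can be sketched as a small feed reader that remembers which entries it has already seen. The feed content, URLs and the `new_entries` function are made up for illustration; only the Atom namespace is standard. A real monitor would fetch the feed over http on a schedule.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom 1.0 namespace

# Invented example feed; a real monitor would download this over http.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Institution A released data</title>
  <entry>
    <id>urn:example:dataset-1</id>
    <title>Neutron dataset 1</title>
    <link href="http://repository.example.org/dataset-1"/>
  </entry>
</feed>"""


def new_entries(feed_xml, seen_ids):
    """Return (id, title, href) for entries not seen before; updates seen_ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id in seen_ids:
            continue  # this release was already detected on an earlier poll
        seen_ids.add(entry_id)
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        fresh.append((entry_id, entry.findtext(ATOM + "title"), href))
    return fresh


seen = set()
first_poll = new_entries(FEED, seen)   # detects the newly released dataset
second_poll = new_entries(FEED, seen)  # nothing new on the next poll
```

Each entry carries an http link to the data instead of a copy, so the monitoring side can decide for itself whether to fetch it.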