EVS Data Curation
EVS Data Curation: The processing and publication of data for web browsing and programmatic access
Gene Ontology and Zebrafish • Downloaded as OBO from web sites • Processed with C++ program into Ontylog XML – OBO2TDE.exe • Processed with C++ program into OWL – ontyxToOWL.exe • Loaded using LoadNCIThesOWL.sh • Metadata loaded using LoadMetadata • Hierarchy and Sources manually edited
HL7 and VA_NDFRT • Retrieved from sources • Processed by Apelon into Ontylog XML • Loaded into LexBIG using LoadNCIThesOWL and manifest • Metadata loaded using LoadMetadata
MGED • OWL file downloaded from source web site • Loaded into Protégé • Classified • Inferred version exported as OWL file • Loaded into LexBIG using LoadNCIThesOWL • Metadata loaded using LoadMetadata • Hierarchy and Sources manually edited
SNOMED, MedDRA and LOINC • Extracted from the UMLS into RRF files • Loaded into LexBIG using LoadUMLSFiles • Metadata loaded using LoadMetadata
UMLS Semnet • Downloaded from UMLS Semnet web site • Loaded using LoadUMLSSemnet • Metadata loaded using LoadMetadata
Metathesaurus • Loaded from UMLS into MEME • NCI Thesaurus imported monthly • Other vocabularies added or removed • NCI-specific edits made to data and relations • Exported as RRF • Imported to LexBIG using LoadNCIMeta • Metadata loaded using LoadMetadata
Preparing TDE Thesaurus for MEME • Thesaurus Ontylog XML baseline is processed through C++ app publishMEME.exe • Current baseline compared to previous to get summary of new properties or roles • Summary used to create import configuration file • Baseline imported into MEME
NCI Thesaurus from TDE • Edited in TDE and exported to Ontylog XML by name • Run through publishTDE to remove unpublishable properties • Run through OntyxToOwl.exe to create OWL file by code • Loaded into LexBIG using LoadNCIThesOWL • Metadata loaded using LoadMetadata • History generated from TDE baseline • History loaded using LoadNCIHistory
NCI Thesaurus from Protégé • Run OWL through application to get Ontylog XML by name • Run Ontylog XML through publishTDE to remove unpublishable properties • Run through OntylogtoOWL to get OWL by code • Generate history using the Ontylog XML
NCI Thesaurus History Processing • evs_history records concept modifications made in editor • These records are extracted monthly to consolidate and to remove identifying information • Cleaned records are loaded into concept_history • Full concept_history loaded into LexBIG for NCI Thesaurus
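The cleaning step above (removing identifying information before loading into concept_history) might look roughly like the sketch below. It is only an illustration: the nine-field, pipe-delimited layout is inferred from the sample records in log.out, and every field name (`record_id`, `user`, `host`, `flag`, and so on) is an assumption, as is the choice to drop the concept name and the time of day.

```python
from typing import NamedTuple


class RawHistoryRecord(NamedTuple):
    """One raw editor history row; all field names are assumed, not documented."""
    record_id: str
    code: str
    name: str
    action: str
    timestamp: str
    user: str       # identifying: modeler login
    host: str       # identifying: workstation host name
    reference: str
    flag: str


def parse_raw(line: str) -> RawHistoryRecord:
    """Split a pipe-delimited row into its nine assumed fields."""
    fields = line.strip().split("|")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    return RawHistoryRecord(*fields)


def anonymize(rec: RawHistoryRecord) -> str:
    """Keep id, code, action, date and reference; drop name, user, host and flag."""
    date_part = rec.timestamp.split(" ")[0]  # discard the time of day
    return "|".join(
        [rec.record_id, rec.code, rec.action.lower(), date_part, rec.reference]
    )
```

Note that the sample DTS_history_out.txt rows use a `28-MAR-08` style date; this sketch does not attempt that conversion because the source does not say which date is written there.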
log.out
New concepts created through Create or Split actions:
  C72675|Feet_First
  .
Concepts merged into other concepts:
  C17841|Oncologic_Surgeon
  .
Retired concepts (including merged):
  C17841|Oncologic_Surgeon
  .
New concepts not found in BSLN2:
  C73140|Ethaverine_
  .
Retired concepts not found in BSLN2
  C73401|Maqui_Berry_Flavor
  .
Modify records correponding to Retired_Kind are discarded:
  667487|C62920|Medical_Device_Unsafe_to_Use|Modify|2008-03-05 …
  .
Modify records correponding to new codes are discarded:
  666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 …
  .
Modify records correponding to merged codes are discarded:
  668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0
  .
Records correponding to codes not found in BSLN2 are discarded:
  671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
  .
WARNING: New codes created, then retired, but still found in BSLN2: (to be edited manually)
  C72675|Feet_First
  .
List of all remaining records
  .
List of all discarded records:
  666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
  .
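The discard rules named in this log can be read as a simple filter. The sketch below is one possible reading, not the actual program: the `Record` type, the function name, and the idea that the new, retired and merged code sets are computed beforehand are all assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Record(NamedTuple):
    """Minimal history record for this sketch (hypothetical type)."""
    code: str
    action: str


def filter_history(
    records: Iterable[Record],
    new_codes: Set[str],
    retired_codes: Set[str],
    merged_codes: Set[str],
    baseline_codes: Set[str],
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, discarded) following the rules listed in log.out."""
    kept: List[Record] = []
    discarded: List[Record] = []
    special = new_codes | retired_codes | merged_codes
    for rec in records:
        if rec.code not in baseline_codes:
            # Records for codes not found in the baseline are discarded.
            discarded.append(rec)
        elif rec.action.lower() == "modify" and rec.code in special:
            # Modify records for new, retired or merged codes are discarded.
            discarded.append(rec)
        else:
            kept.append(rec)
    return kept, discarded
```

Returning the discarded records alongside the kept ones mirrors the log, which lists both the remaining and the discarded records for review.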
tde_history_report.txt
Spilanthes_oleracea (Code: C72446)
Number of modelers: 3
  Modeler: shaiu
  Modeler: thomas
  Modeler: creech
Modeler: shaiu
  Action: modify time: 2008-03-05 05:03:58.0
Modeler: thomas
  Action: modify time: 2008-03-06 02:03:05.0
  Action: modify time: 2008-03-14 10:03:06.0
Modeler: creech
  Action: modify time: 2008-03-06 02:03:06.0
------------------------------------------------------------------
.
Edited actions for the following concepts are discarded:
Concept codes requiring manual review:
DTS_history
• DTS_history_script.sql
  insert into concept_history(concept, editaction, editdate, reference) values ('C72675', 'create', '28-MAR-08', null);
  insert into concept_history(concept, editaction, editdate, reference) values ('C72676', 'create', '28-MAR-08', null);
  .
  .
• DTS_history_out.txt
  666540|C72675|create|28-MAR-08|(null)
  666541|C72676|create|28-MAR-08|(null)
  666542|C62171|modify|28-MAR-08|(null)
  .
  .
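The two files above carry the same records in different forms. As a minimal sketch of how one could be derived from the other, the functions below turn a DTS_history_out.txt row into the matching insert statement. The function names are hypothetical, and treating `(null)` as a SQL null is an assumption based on the sample rows.

```python
from typing import Optional, Tuple


def parse_out_line(line: str) -> Tuple[str, str, str, Optional[str]]:
    """Parse 'id|concept|action|date|reference' and return all but the id."""
    _record_id, concept, action, edit_date, reference = line.strip().split("|")
    return concept, action, edit_date, None if reference == "(null)" else reference


def sql_literal(value: Optional[str]) -> str:
    """Render a value as a quoted SQL string literal, or null when missing."""
    if value is None:
        return "null"
    return "'" + value.replace("'", "''") + "'"


def to_insert_statement(
    concept: str, action: str, edit_date: str, reference: Optional[str] = None
) -> str:
    """Build an insert statement in the shape shown in DTS_history_script.sql."""
    values = ", ".join(
        sql_literal(v) for v in (concept, action, edit_date, reference)
    )
    return (
        "insert into concept_history(concept, editaction, editdate, reference) "
        f"values ({values});"
    )


row = "666540|C72675|create|28-MAR-08|(null)"
print(to_insert_statement(*parse_out_line(row)))
```

Generating script text like this suits a reviewable .sql file; when inserting directly from a program, bound parameters would be the safer choice.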
DTS_history_out.out
Lists complete contents of both baselines
.
Number of codes in {baseline A} : 65265
Number of codes in {baseline B} : 66022
Concepts found in {baseline B}: but not in {baseline A}
  C72675
  C72676
  .
Concepts found in {baseline A}: but not in {baseline B} (should be empty)
.
Verify DTS_history_out.txt against baseline data.
New Concepts: 757
  (1) C72675
  (2) C72676
  .
Concepts created through Split: 0
Split Concepts: 0
Retired Concepts: 4
  (1) C20920
  (2) C62920
Concepts retired through Merge: 5
  (1) C14142
Merge Concepts: 5
  (1) C1363
Modified Concepts: 1364
Invalid actions: 0
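The baseline comparison at the top of this report amounts to two set differences. The sketch below shows the idea under the assumption that each baseline has already been reduced to a collection of concept codes; the function and key names are hypothetical. In the sample report the counts are consistent with an empty "A but not B" list: 66022 − 65265 = 757, which equals the reported number of new concepts.

```python
from typing import Dict, Iterable, List, Union


def compare_baselines(
    codes_a: Iterable[str], codes_b: Iterable[str]
) -> Dict[str, Union[int, List[str]]]:
    """Summarize which concept codes appear in only one of two baselines."""
    a, b = set(codes_a), set(codes_b)
    return {
        "count_a": len(a),
        "count_b": len(b),
        "only_in_b": sorted(b - a),  # candidates for new concepts
        "only_in_a": sorted(a - b),  # should be empty; codes are never deleted
    }


summary = compare_baselines(
    ["C1363", "C14142"],
    ["C1363", "C14142", "C72675", "C72676"],
)
if summary["only_in_a"]:
    print("WARNING: codes missing from the newer baseline:", summary["only_in_a"])
print("New codes:", summary["only_in_b"])
```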
Tiered Deployments • NCICB uses 4-tiered deployments • Dev tier – used internally by EVS team to test software and data • QA tier – used by QA and other software teams to test against new EVS software or data • Stage tier – used to test software deployments in a near-production environment • Production – available to outside users