A Virtual File System for the PubChem Chemical Structure and Bioassay Database

Wolf-D. Ihlenfeldt, Xemistry GmbH, Königstein, Germany


Presentation Transcript


  1. A Virtual File System for the PubChem Chemical Structure and Bioassay Database
     Wolf-D. Ihlenfeldt, Xemistry GmbH, Königstein, Germany

  2. PubChem on the Web

  3. PubChem Project Mission
     • Provide comprehensive public access to screening data generated by the NIH Roadmap Initiative and other public research projects
     • Link assay results, structures screened, literature references, basic computed properties, and external information sources
     • Convenient and free queries and download of filtered structure and assay data for further research
     • Wait a moment: they call that convenient?!

  4. Problems with Interactive Data Retrieval in PubChem
     • Separation between text/data (Entrez) and structure query systems, with inconsistent interfaces
     • Dumbed-down structure query interface, but overengineered text query tools
     • Obscure Entrez syntax for combining multiple subqueries
     • Quirky Entrez approaches regarding numerical queries, quoting, field names, output formats, history titles, auto query expansion…
     • History of history problems

  5. Interactive Data Retrieval in PubChem
     • Very limited customization of downloadable data content
     • Full structure data record only as ASN.1 blob, optionally with gratuitous homebrew XML wrapper
     • SD-file is incomplete, a structure approximation, and still not compatible with an exact interpretation of MDL standards
     • Nevertheless, a well-done system for browsing, but not for serious data collection

  6. Routes to Programmatic Data Retrieval from PubChem
     Some disconnected components exist:
     • Entrez e-utils: basic access to Entrez text databases; get status, retrieve ID sets, some record data or set history via simple text-based queries
     • PubChem structure display pages: can be abused for direct download of single records in ASN.1 format, bypassing the FTP wait queue
     • PubChem Power User Gateway (PUG): XML/ASN.1 specification for executing simple structure queries and getting ID sets and history handles from PubChem servers
     • No direct SQL server db access ever, that's policy!
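As an illustration of the first route, the following Python sketch builds an Entrez E-utils ESearch URL for a text query against the PubChem compound database. It only constructs the URL and performs no network access. The helper name `esearch_url` and its defaults are hypothetical and not part of the presentation; the deck itself drives these services from Cactvs/Tcl.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pccompound", use_history=True, retmax=20):
    """Build an ESearch URL (hypothetical helper, for illustration only).

    term        -- Entrez text query
    db          -- Entrez database name; "pccompound" holds PubChem compounds
    use_history -- ask the server to store the result set on its history server
    retmax      -- maximum number of IDs returned in the reply
    """
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("aspirin"))
```

Requesting history storage is what later allows result sets to be combined on the server instead of downloading and merging ID lists locally.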

  7. The Cactvs Toolkit
     • Universal scripting environment for chemical data processing
     • Framework of chemical objects (ensembles, reactions, tables, …), dynamically defined object properties with associated computation methods, and extension modules (I/O modules for different types of files, database access, data type handlers, command extensions, …)
     • Lazy computation: request some data on an object, and a way will be found to get it if possible
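The lazy computation idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Cactvs implementation: the class `LazyObject`, its `register` method, and the two example properties are all hypothetical. A property is computed only when first requested, its prerequisites are resolved recursively, and the result is cached on the object.

```python
class LazyObject:
    """Toy object whose properties are computed on first request, then cached."""

    # property name -> (compute function, names of prerequisite properties)
    _computers = {}

    @classmethod
    def register(cls, name, func, requires=()):
        """Associate a computation method with a property name."""
        cls._computers[name] = (func, tuple(requires))

    def __init__(self, **known):
        # Data that is already present, e.g. read from a file or record.
        self._data = dict(known)

    def get(self, name):
        """Return a property value, computing it (and its inputs) if needed."""
        if name not in self._data:
            if name not in self._computers:
                raise KeyError(f"no way to obtain {name!r}")
            func, requires = self._computers[name]
            args = [self.get(req) for req in requires]
            self._data[name] = func(*args)
        return self._data[name]


# Hypothetical example properties, defined only for this sketch.
LazyObject.register("atom_count", lambda formula: sum(formula.values()),
                    requires=["formula"])
LazyObject.register("is_small", lambda count: count < 10,
                    requires=["atom_count"])

if __name__ == "__main__":
    ethanol = LazyObject(formula={"C": 2, "H": 6, "O": 1})
    print(ethanol.get("is_small"))
```

Asking for `is_small` triggers the computation of `atom_count` as a side effect, which mirrors the "a way will be found to get it" behaviour described on the slide.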

  8. Cactvs and PubChem
     • Cactvs Toolkit licensed by NCBI as integral component of the PubChem software suite
     • Used for file I/O, syntax verification, property computation, structure depiction, structure identification via hash codes, interface to NIST InChI suite, fingerprints, sub/superstructure & formula search system, WWW structure sketching
     • Only externally available toolkit that understands PubChem data structures (ASN.1 specs for substances, compounds, assays, and PUG), including literature references, conformer data, etc.

  9. Basic PubChem Integration
     • Ensemble object creation via CID:
         set eh [ens create $cid]
       Direct download and parsing of binary ASN.1 record via display page. Also supported as file I/O module.
     • Computation of CID and SIDs from structure:
         set cid [ens get $eh E_CID]
         set sidlist [ens get $eh E_SIDSET]
       Parsing of Entrez E-utils output from submission of InChI string as text search

  10. Basic PubChem Integration
     • Compound name lookup:
         set iupacname [ens get $eh E_IUPAC_NAME]
       Direct download and parsing of XML CID display record, extracting OpenEye computed name
     • CAS number lookup:
         set casno [ens get $eh E_CAS]
       Direct download and parsing of XML SID set display records which contain depositor-supplied names, using pattern recognition

  11. Initial PubChem Integration
     • CAS number I/O module:
         set eh [molfile read $casfile]
       Look up CID as generic term via E-utils, download ASN.1 record via CID. Also supported as object creation command:
         set eh [ens create $cas]

  12. The PubChem Virtual File Project
     • Improved access to the PubChem database: make it indistinguishable from a local, read-only structure file in the Cactvs scripting environment
     • Input functions: transparently read structures and all their data from PubChem
     • Query functions: convenient development and archival of queries exceeding the capabilities of the Web interfaces and PUG, maintaining standard Cactvs query and retrieval syntax

  13. General Approach
     • Implement a Cactvs I/O module: I/O modules incorporate function tables with a rich set of functions that are automatically called in specific situations, capability flags, documentation fields, etc.
     • Hidden, automatic use of Entrez E-utils and PUG: run as many tasks as possible on Entrez/PubChem structure search; data download and local processing only as last resort
     • Optimize for sake of efficiency and just being nice: use caching techniques to reduce network and server load, observe NCBI script access rules
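The "caching plus being nice" point can be illustrated with a small Python wrapper that answers repeated requests from a local cache and spaces out the remaining remote requests. This is a minimal sketch, not the module's actual code: the class `PoliteFetcher` is hypothetical, and the default `min_interval` of one second is a placeholder rather than a statement of NCBI's access rules. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class PoliteFetcher:
    """Cache results and keep a minimum delay between remote requests."""

    def __init__(self, fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch              # callable: key -> value (remote call)
        self._min_interval = min_interval  # placeholder value, an assumption
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None        # time of the most recent remote call

    def get(self, key):
        # Serve repeated requests locally: no network or server load.
        if key in self._cache:
            return self._cache[key]
        # Wait if the previous remote request was too recent.
        if self._last_request is not None:
            wait = self._last_request + self._min_interval - self._clock()
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        value = self._fetch(key)
        self._cache[key] = value
        return value


if __name__ == "__main__":
    fetcher = PoliteFetcher(lambda key: key.upper(), min_interval=0.1)
    print(fetcher.get("cid"), fetcher.get("cid"))
```

Only cache misses count against the request rate, so a script that revisits the same records repeatedly contacts the server once per record.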

  14. PubChem Virtual File I/O
     Code sample (command, then its return value):
     • filex load pubchem → 19
     • molfile open <pubchem> → molfile0
     • molfile count molfile0 → 12002343
     • molfile read molfile0 → ens0
     • ens props ens0 → … E_INCHI E_IUPAC_NAME E_NCBI_COMPOUND_ID E_EXACT_MASS E_TPSA E_SMILES E_SMILES/2 …
     • ens get ens0 E_CID → 1
     • molfile read molfile0 → ens1
     • molfile set molfile0 record 999999
     Annotations on the slide:
     • Contact Entrez e-utils, get database status
     • E-utils: get 5K sector of record-CID map, then single-record ASN.1 download via display page
     • Single-record ASN.1 download via display page
     • Try to load compressed CID use bit vector from xemistry.com; the fallback is more e-utils queries for record/CID map sectors
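The "5K sector of record-CID map" annotation suggests a paged lookup: the virtual file's record numbers are mapped to CIDs in blocks of 5000 entries, and a block is fetched only when a record inside it is first touched. The Python sketch below shows that idea under stated assumptions. The class `RecordCidMap` and the `fetch_sector` callable are hypothetical; in the real module a sector would presumably come from an E-utils query.

```python
SECTOR_SIZE = 5000  # sector size named on the slide ("5K sector")


class RecordCidMap:
    """Map 1-based record numbers to CIDs, loading the map one sector at a time."""

    def __init__(self, fetch_sector, sector_size=SECTOR_SIZE):
        # fetch_sector(start, count) -> list of CIDs for map positions
        # start .. start+count-1 (0-based). Hypothetical remote call.
        self._fetch_sector = fetch_sector
        self._sector_size = sector_size
        self._sectors = {}  # sector index -> list of CIDs

    def cid_for_record(self, record):
        if record < 1:
            raise ValueError("record numbers start at 1")
        index = record - 1
        sector = index // self._sector_size
        if sector not in self._sectors:
            start = sector * self._sector_size
            self._sectors[sector] = self._fetch_sector(start, self._sector_size)
        return self._sectors[sector][index % self._sector_size]


if __name__ == "__main__":
    # Stand-in for the remote map: position p holds CID 2 * (p + 1).
    def demo_fetch(start, count):
        return [2 * (start + i + 1) for i in range(count)]

    cid_map = RecordCidMap(demo_fetch)
    print(cid_map.cid_for_record(999999))
```

Sequential reads stay within one sector and cost a single map request, while a jump such as `molfile set molfile0 record 999999` pulls in only the sector that contains the target record.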

  15. Simple PubChem Queries
     Code sample:
         set fh [molfile open <pubchem>]
         set cidlist [molfile scan $fh "structure >= $smarts" \
             {proplist E_CID}]
     Operations behind the scenes:
     • Set-up of PUG record
     • Post PUG, monitor return status
     • Cache CID result data
     • Direct access to result set, no structure download

  16. Intermediate PubChem Queries
     Code sample:
         set fh [molfile open <pubchem>]
         set enslist [molfile scan $fh \
             "or {structure = $smiles1} {structure = $smiles2} {structure = $smiles3}" \
             enslist]
     Operations behind the scenes:
     • Create and post PUG records, get history keys
     • Perform server-side e-utils result merge via history keys
     • Retrieve CID set
     • Download structures as ASN.1 blobs via CID
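The server-side merge step relies on the Entrez history mechanism, where earlier result sets are referenced as `#key` inside a new query term. The Python sketch below builds such a merge request for three history keys. It constructs the URL only and sends nothing. The helper names and the `WEBENV_PLACEHOLDER` value are hypothetical, and how the Cactvs module actually issues the request is not shown in the presentation.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def combine_history_term(query_keys, operator="OR"):
    """Build an Entrez term that combines stored result sets, e.g. '#1 OR #2'."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    if not query_keys:
        raise ValueError("at least one query key is required")
    return f" {operator} ".join(f"#{key}" for key in query_keys)


def merge_url(web_env, query_keys, db="pccompound", operator="OR"):
    """URL of an ESearch call that merges history sets on the server."""
    params = {
        "db": db,
        "term": combine_history_term(query_keys, operator),
        "WebEnv": web_env,     # history session the keys belong to
        "usehistory": "y",     # store the merged set under a new key
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(merge_url("WEBENV_PLACEHOLDER", [1, 2, 3]))
```

Merging on the server means only the final CID set crosses the network, which fits the deck's stated preference for running as much as possible on the Entrez/PubChem side.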

  17. Power PubChem Queries
     Code sample:
         set stfh [molfile open $mysdfile]
         set fh [molfile open <pubchem>]
         set th [molfile scan $fh \
             "and {structure ~>= $stfh 95} {formula >= \[M\]0} \
             {E_NMOLECULES = 1} {E_STEREO_COUNT(1) >= 1}" \
             {table E_CID score E_SMILES E_FORMULA record image} \
             {} 1000]
         table write $th similar_in_pubchem.xls
     Bioassay access is unfortunately not yet part of PUG.

  18. Summary
     • Goal: Make PubChem finally conveniently accessible as data source for local work
     • Feature: Read all data from PubChem records, and further manipulate it to your heart's content
     • Feature: Write and conserve complex queries beyond what you can do with the Web interface
     • Feature: Export data in many more formats than possible via the Web interface
     • Future: Sort out remaining problems with caching and field access in complex queries, use parallel PUG submissions, integrate assay data access

  19. Availability
     • Standard component of CACTVS toolkit releases 3.353 and later
     • Free academic downloads from www.xemistry.com for multiple platforms (Linux, MS Windows, MacOSX, Solaris, BSD)
     • Also part of the basic commercial toolkit, to be distributed with regular customer updates
