
The C2M system



  1. The C2M system

  2. The C2M system Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren ECCI CTIT, University of Twente, Netherlands p.e.vandervet@utwente.nl

  3. Setting • Scientist working with multiple, heterogeneous resources like • Databases • Knowledge bases • Programs • Task requires co-operation of resources • Whether resources are in-house or remote makes no difference

  4. SciDashboard™ • Long-term vision: scientist’s dashboard • SciDashboard™ allows scientist to visually: • Select resources • Connect resources • Identify sources and sinks • Specify data transformations along the way • C2M first step towards SciDashboard™

  5. Co-operating resources • First problem: format multiplicity • Format multiplicity is unavoidable • Standardisation social process with high stakes • No format caters for all needs • Second problem: combining resources • Merging, comparing, deduplicating

  6. Format multiplicity • Chemical example: molecular structure files

  7. Molecular structure files • About 20 formats in daily use, for example: • MDL Molfile (MOL) • Connection table (CT) • Standard Molecular Description file (SMD) • Almost all formats specify plaintext files with record-field structure • Delimiters often space and newline characters

  8. CT-file ethanol CH3CH2OH ethanol.ct
       3 2
       -0.8667 -0.2500 0.0000 C
       0.0000 0.2500 0.0000 C
       0.8667 -0.2500 0.0000 O
       1 2 1 1
       2 3 1 1

  10. CT-file ethanol CH3CH2OH ethanol.ct
       3 2
       -0.8667 -0.2500 0.0000 C (1)
       0.0000 0.2500 0.0000 C (2)
       0.8667 -0.2500 0.0000 O (3)
       1 2 1 1
       2 3 1 1
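The record-field structure of the CT-file above can be read with a few lines of code. The following is a minimal Python sketch, not part of C2M: it assumes the first record holds the atom and bond counts, that each atom record is `x y z element`, and that each bond record starts with two 1-based atom numbers. The remaining bond fields are kept uninterpreted because the slides do not explain them. The names `Atom`, `Bond` and `parse_ct` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int      # 1-based number of the first atom
    second: int     # 1-based number of the second atom
    fields: tuple   # remaining integers, meaning not given on the slides


# The ethanol connection table from the slide, one record per line.
ETHANOL_CT = """\
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
"""


def parse_ct(text):
    """Split a connection table into atom and bond records (assumed layout)."""
    records = [line.split() for line in text.splitlines() if line.strip()]
    n_atoms, n_bonds = int(records[0][0]), int(records[0][1])

    atoms = [
        Atom(float(x), float(y), float(z), element)
        for x, y, z, element in records[1:1 + n_atoms]
    ]

    bonds = []
    for record in records[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, *rest = (int(field) for field in record)
        bonds.append(Bond(first, second, tuple(rest)))
    return atoms, bonds


ct_atoms, ct_bonds = parse_ct(ETHANOL_CT)
print([atom.element for atom in ct_atoms])
print([(bond.first, bond.second) for bond in ct_bonds])
```

A hand-written reader like this is the kind of "roll your own" wrapper the later slides argue is hard to maintain once several formats are involved.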

  11. MOL-file ethanol CH3CH2OH ethanol.mol
       ChemDraw03070310372D
       3 2 0 0 0 0 0 0 0 0999 V2000
       -1.2975 -0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       0.0025 0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       1.3000 -0.3750 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
       1 2 1 0 0 0 0
       2 3 1 0 0 0 0
       M END
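The MOL-file carries the same molecule in a different layout, which a second hand-written reader makes concrete. This Python sketch is an illustration only: it assumes the usual V2000 arrangement of three header lines followed by a counts line, and it splits on whitespace for brevity even though real Molfiles use fixed-width columns (so it would misread files whose counts run together). The first header line and the name `parse_mol` are assumptions.

```python
# Ethanol Molfile reconstructed from the slide; the three header lines
# (name, program stamp, blank comment) are an assumed arrangement.
ETHANOL_MOL = "\n".join([
    "ethanol.mol",
    "  ChemDraw03070310372D",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "   -1.2975   -0.3750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0025    0.3750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.3000   -0.3750    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  2  3  1  0  0  0  0",
    "M  END",
])


def parse_mol(text):
    """Read atoms and bonds from a small V2000 Molfile (whitespace-split sketch)."""
    lines = text.splitlines()
    counts = lines[3].split()            # counts line follows three header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append((float(x), float(y), float(z), element))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, bond_type = (int(f) for f in line.split()[:3])
        bonds.append((first, second, bond_type))
    return atoms, bonds


mol_atoms, mol_bonds = parse_mol(ETHANOL_MOL)
print([atom[3] for atom in mol_atoms])
print(mol_bonds)
```

Both files describe three atoms and two single bonds, yet each needs its own reader, which is the format multiplicity problem in miniature.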

  12. Solving format multiplicity: Wrappers

  13. Wrappers • Wrapper tools exist such as • Chemistry: Babel, ChemDraw • Molecular biology: SRS • Bibliography management: EndNote, bp • Disadvantage: adding new format impossible or very difficult • “Roll your own” wrappers: awk, perl • Difficult to maintain

  14. Wrapper generators • Basic idea: produce wrapper from high-level description of formats • Often two-step process: A → R → B with R an internal representation • Obvious argument: two-step process takes fewer converters than direct conversion • Disadvantage: R fixed and dedicated
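The "obvious argument" can be made explicit with a little arithmetic. Assuming every format must be convertible to every other and that each converter works in one direction only, direct conversion between n formats needs n(n-1) converters, whereas going through an internal representation R needs one reader and one writer per format, 2n in total. The function names below are hypothetical.

```python
def direct_converters(n_formats):
    """One-directional converters needed for every ordered pair of formats."""
    return n_formats * (n_formats - 1)


def two_step_converters(n_formats):
    """One reader (A -> R) and one writer (R -> B) per format."""
    return 2 * n_formats


for n in (2, 3, 4, 20):
    print(n, direct_converters(n), two_step_converters(n))
```

The two approaches break even at three formats. For the roughly 20 molecular structure formats mentioned earlier, the count is 380 direct converters against 40 readers and writers.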

  15. Preparing for middleware • Keyword: modularisation • Stakeholders are responsible for their own specifications, for example: • Content provider offers syntactic format description • User determines internal representation • Internal representation allows combination of resources

  16. The C2M system • C2M: chemical configurable middleware • Implemented in Quintus Prolog • Current state: a wrapper generator • Wrappers produced from high-level specifications of formats and internal representation • Internal representation chosen by user, if desired per task • C2M can be extended to middleware

  17. Current C2M is … • a specification language • for specifying the format of foreign files • for specifying the internal representation • a programming language • for programming wrappers by means of specifications • for inserting copious documentation • a system • for producing wrappers and their documentation

  18. C2M system overview

  19. File conversion by C2M

  20. C2M specifications • Two kinds of specifications, each in a file of its own: • Specification of internal representation • Specification of file format • Internal representation: ontology • File format specification: read-only, write-only, or both read and write

  21. Language design principles • Adhere to well-known designs • HTML (tags and tag attributes) • context-free grammar (as in BNF) • functions • Use or mimic well-known symbols • grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF) • instantiation: lhs <- funct(arg1, arg2) (←)

  22. Ontology • Frame system • Tree structure with concepts and attributes • Three kinds of concepts: • concept1 = concept2 concept3 concept4 • concept1 = repeated(concept2) • primitive concepts (leaves) • Leaves hold information

  23. Ontology example <C2M-SPECIFICATION type="ontology" name="simple-ont"> <ONTOLOGY> sentence = repeated(word) </ONTOLOGY> </C2M-SPECIFICATION>
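The three kinds of concepts can be pictured as a small tree data structure. The Python sketch below is an analogy, not C2M's Prolog implementation: the class names `Primitive`, `Sequence` and `Repeated` and the helper `leaves` are hypothetical, and the `molecule` concept is invented purely to show the composite kind. Only `sentence = repeated(word)` comes from the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Primitive:
    """A leaf concept; leaves are where information is held."""
    name: str


@dataclass(frozen=True)
class Sequence:
    """concept1 = concept2 concept3 concept4"""
    name: str
    parts: tuple


@dataclass(frozen=True)
class Repeated:
    """concept1 = repeated(concept2)"""
    name: str
    item: "Concept"


Concept = Union[Primitive, Sequence, Repeated]


def leaves(concept):
    """Collect the names of the primitive concepts below a concept."""
    if isinstance(concept, Primitive):
        return [concept.name]
    if isinstance(concept, Repeated):
        return leaves(concept.item)
    names = []
    for part in concept.parts:
        names.extend(leaves(part))
    return names


# The ontology from the slide: sentence = repeated(word)
word = Primitive("word")
sentence = Repeated("sentence", word)

# Hypothetical composite concept, for illustration only.
molecule = Sequence("molecule", (Repeated("atoms", Primitive("atom")),
                                 Repeated("bonds", Primitive("bond"))))

print(leaves(sentence))
print(leaves(molecule))
```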

  24. File format specification • File format specification: grammar + semantic bindings • Grammar specifies structure • System uses grammar to produce parse tree • Semantic bindings map nodes in parse tree onto concepts in internal representation

  25. File format spec example <C2M-SPECIFICATION type="file-format" name="simple-form"> <READGRAM> .... </READGRAM> <SBREAD> .... </SBREAD> .... </C2M-SPECIFICATION>

  26. File format spec: readgram <READGRAM> <ULG> line -> string line -> sp-string+ sp-string -> spaces string </ULG> <LLG> spaces -> space+ string -> printable-char+ </LLG> </READGRAM>

  27. File format spec: sbread <SBREAD> sentence =^ line word <- identity(string) </SBREAD>
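How a grammar and its semantic bindings work together can be imitated in a few lines of Python. This is a rough analogy under stated assumptions, not the way C2M itself operates: the grammar is read here as "a line is a string followed by zero or more groups of spaces and a string", `printable-char` is taken to mean printable ASCII other than space, `sentence =^ line` is read as "the line node corresponds to the sentence concept", and `identity` simply copies the text. All function names are hypothetical.

```python
import re

# Lower-level grammar (assumed character classes).
STRING = re.compile(r"[!-~]+")   # string -> printable-char+
SPACES = re.compile(r" +")       # spaces -> space+


def parse_line(text):
    """Build a parse tree ("line", [children]) for one line of text."""
    first = STRING.match(text, 0)
    if first is None:
        raise ValueError("line must start with a string")
    children = [("string", first.group())]
    pos = first.end()

    while pos < len(text):
        spaces = SPACES.match(text, pos)
        if spaces is None:
            raise ValueError(f"expected spaces at position {pos}")
        string = STRING.match(text, spaces.end())
        if string is None:
            raise ValueError(f"expected string at position {spaces.end()}")
        children.append(("sp-string", [("spaces", spaces.group()),
                                       ("string", string.group())]))
        pos = string.end()
    return ("line", children)


def string_leaves(node):
    """Yield the text of every string node in the parse tree, left to right."""
    label, value = node
    if label == "string":
        yield value
    elif isinstance(value, list):
        for child in value:
            yield from string_leaves(child)


def identity(value):
    return value


def bind_sentence(tree):
    """Semantic bindings: sentence =^ line, word <- identity(string)."""
    return {"sentence": [identity(s) for s in string_leaves(tree)]}


tree = parse_line("the C2M system")
print(bind_sentence(tree))
```

The parse tree keeps the spaces, but the bindings pass only the strings on to the internal representation, so the same `sentence` concept could be filled from a differently delimited file by swapping the grammar alone.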

  28. Claims C2M is • sufficiently expressive • fully declarative • a literate programming environment (specification and documentation in one) • easy to learn • amenable to division of labour

  29. Claims (contd.) • Compared to ChemDraw and the like, C2M: • Allows for easy addition of new formats • Format specifications can be reused • Prepares for true middleware • Compared to “roll-your-own” wrappers, C2M: • Facilitates reuse and adaptation • Facilitates extensive documentation

  30. To be done (short term) • Stabilise system • Experiment • Provide extensive manual and documentation • Prepare system for others to experiment • But current version implemented in proprietary software platform

  31. To be done (long term) • Test language by means of user surveys • Develop version 2 • Version x may well be wholly visual • Embed system in larger environment • SciDashboard™ • “Habitable Interfaces”
