Markup in Atomic and Molecular Simulations: Implementation & Issues

Jon Wakelin, Dept. of Earth Sciences, University of Cambridge

Overview • Background – The problem • A solution • An Implementation • Demo • Summary

Background (1) • The computational chemistry and physics communities have more data than ever before • Advances in computer power • Access to HPC facilities • Algorithmic & scientific advances • Better exploitation of existing facilities (Grid) • High-throughput computing (Condor) • The same factors have led to qualitative changes in data • Can now attempt new kinds of calculations

Background (2) • The majority of this data is reused • Starts in a database • Passed into a program • Post-processed • Visualized, etc… • Most notably… • Structures/Coordinates • Forcefields/Interatomic potentials • Basis Sets • Pseudopotentials

Background (3) • So the nature of our data has changed but the way we deal with it has not • Still rely on bespoke text and binary formats • Issues such as interoperability, data management and data reuse are tackled in an informal or ad hoc manner • Self-describing binary formats (NetCDF, HDF)

A solution • XML • Allows the user to describe data of arbitrary structure • Or… allows the user to structure his/her data arbitrarily • Provides us with a known format (i.e. it is easy to parse) • Many free tools and standards • ~7 years old, so fairly well road tested • CML (Chemical Markup Language) • Extensions to CML core for simulations - CMLComp • CML is not tied to a particular chemistry or physics program

What will markup do for us? • Facilitate data exchange • Between chemistry and physics software, but also… • Easier to extract data to databases • Facilitates other tasks such as data-mining • Make data producers more accountable • Schemas and related technologies • Dictionaries • Reduce Software development (eventually) • No need to support multiple formats • No need to write ‘converters’ • Standard libraries for processing CML

Data Exchange (examples) • Equilibrate MD in DLPOLY then continue in SIESTA • Visualize output from Gaussian in Jmol • Compare timings between VASP and CASTEP • Take structure from ICSD and relax in SIESTA • Develop forcefield in GULP, then use it in DLPOLY • Calculate property X in Dalton and property Y in GAMESS • And so on… these examples should be familiar to us all, and they are essentially trivial; however… • Grid/Condor facilitate high-throughput computing • Often want to create complex workflow schemes • E.g. using Condor’s DAG Manager • But there is no prescription for how to handle the data as it ‘flows’

[Figure: Design 1, Design 2 and Design 3, showing how In.xml, In.txt, the Parser, the CODE, Out.txt and Out.xml are connected in each design]

Design 1 • Only option when you don’t have access to source code • Input: XSL or program using SAX, DOM • Output: JumboMarker • Programs Using this Design: MOPAC, Gaussian • Pros & Cons • Generality – it will work for any code! • Don't need access to the source code • Requires more user intervention • Parsing text to create XML! • Need to know all combinations that the code can throw at you • Is at the mercy of changes to the output by the code developers.
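The "parsing text to create XML" step of Design 1 can be sketched as a small wrapper. This is an illustration in Python rather than JumboMarker itself; the record layout ("element x y z" per line) and the sample coordinates are made up, and the CML-style element and attribute names (molecule, atomArray, atom, elementType, x3, y3, z3) are used here as assumptions about the target markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical text output from a code: one "element x y z" record per line.
# The numbers are placeholders, not results from any real calculation.
SAMPLE_OUTPUT = """\
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""


def text_to_cml(text):
    """Wrap whitespace-separated coordinate records in CML-style elements."""
    molecule = ET.Element("molecule")
    atom_array = ET.SubElement(molecule, "atomArray")
    for index, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if len(fields) != 4:
            # A real wrapper has to know every layout the code can emit;
            # this sketch simply refuses anything it does not recognise.
            raise ValueError(f"unexpected record on line {index}: {line!r}")
        symbol, x, y, z = fields
        ET.SubElement(
            atom_array, "atom",
            id=f"a{index}", elementType=symbol, x3=x, y3=y, z3=z,
        )
    return molecule


cml = text_to_cml(SAMPLE_OUTPUT)
print(ET.tostring(cml, encoding="unicode"))
```

The explicit ValueError reflects the cons listed above: any change to the text layout by the code developers breaks the wrapper, so it is safer to fail loudly than to emit wrong XML.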

Design 2 • When you have access to the source code • When you are using Fortran • Input: XSL, program with SAX, DOM • Output: Jumbo90, WXML • Examples: SIESTA, GULP, DLPOLY • Pros and Cons • Avoid Tricky text => XML conversion • Only have to maintain a single program • Simpler from point of view of end user • End user still has to convert CML => text
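The remaining input step in Design 2 (CML => text) might look like the following sketch, which reads atoms through a DOM and writes plain "element x y z" records. It uses Python's xml.dom.minidom purely for illustration; the CML-style attribute names, the sample coordinates and the text layout are assumptions, not the input format of any particular code.

```python
from xml.dom.minidom import parseString

# Illustrative CML-style input; coordinates are placeholders.
CML_INPUT = """<molecule>
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""


def cml_to_text(cml_string):
    """Return one 'element x y z' line per <atom> element."""
    document = parseString(cml_string)
    lines = []
    for atom in document.getElementsByTagName("atom"):
        record = [atom.getAttribute(name)
                  for name in ("elementType", "x3", "y3", "z3")]
        lines.append(" ".join(record))
    return "\n".join(lines)


print(cml_to_text(CML_INPUT))
```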

Design 3 • When you have access to the source code • Input/Output: DOM • Examples: Jmol, JChemEdit, openBabel • Pros and Cons • Simplest for end user • Most Chem/Phys programs still written in Fortran • Limited XML support for Fortran • CML is not the file format of your program • A CML file is not guaranteed to contain all the info you need • Alternatively it may contain too much • “Towards a common data and command …”

Implementation - Output • An F90 library for creating well-formed XML • WXML (A. Garcia) • An F90 library for formatting CML • Jumbo90 • Provides convenience routines for creating CML elements • Has been used in SIESTA, GULP, DLPOLY • We should look to auto-generate these libraries • But output is the easy part...
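The two-layer idea above (a writer that guarantees well-formed XML, plus convenience routines for CML elements on top) can be sketched as follows. This is a Python illustration only, not the WXML or Jumbo90 API; the class, the add_scalar helper, the element and attribute names, and the energy value are all hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


class XMLWriter:
    """Minimal writer that tracks open elements so tags always nest correctly."""

    def __init__(self):
        self._parts = []
        self._open = []  # stack of currently open element names

    def start(self, name, **attributes):
        attrs = "".join(f" {key}={quoteattr(str(value))}"
                        for key, value in attributes.items())
        self._parts.append(f"<{name}{attrs}>")
        self._open.append(name)

    def text(self, content):
        if not self._open:
            raise ValueError("text outside the root element")
        self._parts.append(escape(str(content)))

    def end(self, name):
        # Refuse to close anything but the innermost open element.
        if not self._open or self._open[-1] != name:
            raise ValueError(f"cannot close <{name}> here")
        self._open.pop()
        self._parts.append(f"</{name}>")

    def result(self):
        if self._open:
            raise ValueError(f"unclosed elements: {self._open}")
        return "".join(self._parts)


def add_scalar(writer, title, value, units):
    """Hypothetical convenience routine layered on the generic writer."""
    writer.start("scalar", title=title, units=units)
    writer.text(value)
    writer.end("scalar")


writer = XMLWriter()
writer.start("module")
add_scalar(writer, "energy", -1.5, "eV")  # placeholder value
writer.end("module")
document = writer.result()
print(document)
```

The sketch does not check for a single root element or valid element names; a production library would need both.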

Implementation - Input • Could link to libxml2 (C Library) • Could implement SAX or DOM in Fortran • Several groups have tried this • A. Garcia has an F90 SAX parser • We have built an F95 DOM parser on top of this • Currently supports DOM 1.0 • Could we go one step further? • Could we implement a CML-DOM in Fortran? • Generic W3C DOM vs. language-specific DOM • E.g. MathML-DOM, SVG-DOM, CML-DOM
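Building a tree on top of a SAX parser, as described above, amounts to keeping a stack of open elements while the events arrive. The sketch below shows the idea with Python's xml.sax; it is not the F90/F95 code, and the Node and TreeBuilder classes and the sample document are hypothetical.

```python
import xml.sax


class Node:
    """Very small tree node: name, attributes, children and text content."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.children = []
        self.text = ""


class TreeBuilder(xml.sax.ContentHandler):
    """Turn the SAX event stream into a tree using a stack of open nodes."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def characters(self, content):
        if self._stack:
            self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()


def build_tree(xml_bytes):
    builder = TreeBuilder()
    xml.sax.parseString(xml_bytes, builder)
    return builder.root


root = build_tree(
    b'<molecule id="h2o"><atomArray>'
    b'<atom id="a1" elementType="O"/></atomArray></molecule>'
)
print(root.name, root.children[0].children[0].attributes)
```

A full DOM additionally needs separate text nodes, parent links and the standard accessor functions; the stack-driven construction stays the same.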

XML as a tree • Tree diagram: Person has children Name (Fst: Jon, Sec: Smith) and Stats (Height: 20, Weight: 60) • <person> <name> <fst>Jon</fst> <sec>Smith</sec> </name> <stats> <height>20</height> <weight>60</weight> </stats> </person>
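To make the tree view concrete, the person example can be parsed and printed as an indented outline. This sketch uses Python's xml.dom.minidom for illustration; the outline helper is hypothetical, and the XML is written with every closing tag in place.

```python
from xml.dom.minidom import parseString

PERSON_XML = (
    "<person><name><fst>Jon</fst><sec>Smith</sec></name>"
    "<stats><height>20</height><weight>60</weight></stats></person>"
)


def outline(node, depth=0):
    """Return one indented line per element or non-empty text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data))
    return lines


document = parseString(PERSON_XML)
tree_lines = outline(document.documentElement)
print("\n".join(tree_lines))
```

Each element becomes an interior node and each piece of character data becomes a text leaf beneath the element that contains it.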

Generic DOM Tree • [Figure: the person document drawn as a generic DOM tree of Element nodes (e.g. person, stats, weight) with Text nodes as leaves (e.g. 60)]

Generic DOM • Implementation in F95 • Inheritance vs. flattened view • Similarities with C’s libxml implementation • Using linked lists/pointers • Functions return pointers to data structures • Remember to use pointer syntax! • Things to do • No validation • No XPath • No 16-bit strings • Benefits • Portable • Live nodes
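The "flattened view" with linked lists and pointers can be pictured as a single node type that carries a name, an optional value and links to its parent, first child, last child and next sibling. The sketch below is a Python illustration of that data structure under those assumptions; it is not the F95 implementation, and the Node and find_all names are hypothetical.

```python
class Node:
    """One node type for every kind of node (the 'flattened' view)."""

    def __init__(self, name, value=None):
        self.name = name          # element name, or "#text" for text nodes
        self.value = value        # character data for text nodes
        self.parent = None
        self.first_child = None
        self.last_child = None
        self.next_sibling = None

    def append_child(self, child):
        """Link child at the end of this node's sibling list."""
        child.parent = self
        if self.last_child is None:
            self.first_child = child
        else:
            self.last_child.next_sibling = child
        self.last_child = child
        return child

    def children(self):
        """Walk the sibling links at call time, so later edits are seen."""
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


def find_all(node, name):
    """Return references to matching nodes in the tree, not copies."""
    matches = [node] if node.name == name else []
    for child in node.children():
        matches.extend(find_all(child, name))
    return matches


person = Node("person")
stats = person.append_child(Node("stats"))
weight = stats.append_child(Node("weight"))
weight.append_child(Node("#text", "60"))

found = find_all(person, "weight")[0]
found.first_child.value = "61"  # changes the tree itself, via the reference
print(weight.first_child.value)
```

Because lookups hand back references into the tree, an edit made through a returned node is visible everywhere, which is the behaviour the "live nodes" and "pointer syntax" bullets point at.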

Demo • siesta.xml – H2O • siesta.html – H2O • siesta.html – Pyrophyllite • gulp.html – Al/Cu cluster

Summary • Began with three observations: • Quantitative and qualitative changes in our data • Data exchange is essential (even in the simplest calculation) • Bespoke data formats and ad hoc solutions for data exchange • Changing the way we deal with data will: • Facilitate data exchange and interoperability • Make data and data producers more accountable • Reduce code development (but not yet) • Implementation • Design depends on: access to source, programming language • Output – Jumbo90/WXML • Input – F90 implementations of SAX/DOM/CML-DOM

Acknowledgments • P. Murray-Rust & A. Garcia • NERC
