LT PyXML: A fast validating XML parser embedded in Python

LT PyXML: A fast validating XML parser embedded in Python Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Acknowledgements • This work was carried out in the Language Technology Group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council • The UK Engineering and Physical Sciences Research Council funded project NSCOPE, which stimulated some of the work discussed here today • This work was also helped by grants to our group from Sun Microsystems and Microsoft

How we use SGML/XML • We use SGML and XML in the context of collecting, standardising, distributing, annotating and using large text collections (corpora) for computational linguistics research and development • These corpora are: • Large: 10-100 million words • Densely annotated: often every word has associated markup • DTDs and validation are very important to us

An aside about validation • A DTD or schema is a contract between producers and consumers • It provides a guaranteed interface • Producers validate to ensure they are providing what they promised • Consumers validate to check up on producers • and to protect their applications • Application authors validate to simplify their task • Leave error detection and analysis to the validating parser

How we use XML (2) • Like any other SME, we produce documents • Being a university-embedded SME, we produce lots of documents • Lots of those documents are trivial variations on one another, based on target medium and/or audience • Overhead slides for teaching • Web pages for publicity/teaching backup • Presentation slides for conferences • Research papers for monographs and journals

Our application needs • Batch applications to automatically add linguistic annotation • Modular, pipelined programs supporting data parallelism • Specialised interactive editors to hand-correct markup • Authoring tools and publication tools which make content-sharing easy

We built software: RXP & LT XML • because of the following issues: • Price • Efficiency • C-language interface • Documentation • Contrast with EXPAT • 50 to 100% slower • but still 90% faster than Java implementations • Thoroughly documented • Validates • Coverage nine nines identical

LT XML: Basic Architecture • Pipelines of ‘fat’ streams • cf. Unix ‘thin’ streams • API provides primitives for XML-appropriate input and output • Two alternative views: • micro-sequence: start-tag, comment, char-data, end-tag, proc. inst. • tree-structure: sequence of sub-trees, level ad lib.

Flat view • provides GetNextBit which reads the next bit of XML: • Start/empty tags (including attributes and all values) • Text==PCDATA • End tags • Processing instructions • PrintBit will write one of these to an output stream
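
A rough feel for the flat view can be had with Python's standard library alone. The sketch below is not LT PyXML: it uses xml.dom.pulldom as a stand-in, and the bits helper and its tuple shapes are invented here for illustration. It yields one item per start tag, text run, processing instruction, comment or end tag, which is roughly the granularity described for GetNextBit. Note that pulldom reports an empty tag as a start item followed by an end item.

```python
from xml.dom import pulldom


def bits(xml_text):
    """Yield the document as a flat sequence of (kind, ...) tuples.

    Hypothetical helper: a stdlib analogue of the flat view,
    not the LT PyXML GetNextBit function itself.
    """
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            # Start tags carry their attributes with them.
            yield ("start", node.tagName, dict(node.attributes.items()))
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName)
        elif event == pulldom.CHARACTERS:
            yield ("text", node.data)
        elif event == pulldom.PROCESSING_INSTRUCTION:
            yield ("pi", node.target, node.data)
        elif event == pulldom.COMMENT:
            yield ("comment", node.data)
        # Document start/end events are ignored here.


if __name__ == "__main__":
    for bit in bits('<s id="1">Hi<?x y?></s>'):
        print(bit)
```

A filter in a pipeline would loop over these items, pass most of them through unchanged, and rewrite only the ones it cares about.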

Tree-structured view • Items are subtrees of the SGML structure • Reading • GetNextItem • GetNextQueryItem • Writing • PrintItem • The two views (flat or tree-structured) can be mixed to suit the needs of the application
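
Mixing the two views can also be sketched with the standard library. Again this is a stand-in, not LT PyXML: the code reads a flat event stream from xml.dom.pulldom and, when it meets a start tag of interest, asks for the whole subtree at that point. The items function and its tag parameter are hypothetical names chosen for the example.

```python
from xml.dom import pulldom


def items(xml_text, tag):
    """Yield each <tag> element as a complete subtree.

    Hypothetical helper: walks the flat stream and switches to the
    tree view only for elements named `tag`.
    """
    stream = pulldom.parseString(xml_text)
    for event, node in stream:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull the rest of this element into a DOM subtree;
            # the stream resumes after its end tag.
            stream.expandNode(node)
            yield node


if __name__ == "__main__":
    doc = "<doc><p>a<b>b</b></p><q/><p>c</p></doc>"
    for item in items(doc, "p"):
        print(item.toxml())
```

Everything outside the selected elements is seen only as flat events and never built into a tree, which is the attraction of mixing the views on large corpora.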

Query language • LT XML defines a query language which allows the specification of elements from an XML document • Queries are tree based, using element names, attribute values and textual data • Similar path-style syntax to XPath • Regular expressions are allowed for attribute values.

Query language, continued • The LT XML query language is not a complete relational query language, although that can be built on top • For efficiency reasons, LT XML doesn't allow queries which require back-tracking or an unbounded amount of left context • The query language allows programmers to quickly find the sub-structure they are interested in, while ignoring the rest

Query example .*/TEXT/./P[TYPE=STD]/S[1]
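
One plausible reading of the query above: at any depth, a TEXT element; then any single element below it; then a P whose TYPE attribute is STD; then an S selected by position. The sketch below approximates that reading with the XPath subset in xml.etree.ElementTree. The sample document, the translation into XPath, and the treatment of [1] as "first S" (XPath counts from 1) are all assumptions made for illustration; LT XML's own indexing convention may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample corpus fragment, for illustration only.
SAMPLE = """
<CORPUS>
  <TEXT>
    <BODY>
      <P TYPE="STD"><S>First sentence.</S><S>Second sentence.</S></P>
      <P TYPE="HEAD"><S>A heading.</S></P>
      <P TYPE="STD"><S>Third paragraph opener.</S></P>
    </BODY>
  </TEXT>
</CORPUS>
"""

# Assumed XPath approximation of  .*/TEXT/./P[TYPE=STD]/S[1]
XPATH_APPROX = ".//TEXT/*/P[@TYPE='STD']/S[1]"


def select(xml_text, path=XPATH_APPROX):
    """Return the text of every element matched by `path`."""
    root = ET.fromstring(xml_text)
    return [s.text for s in root.findall(path)]


if __name__ == "__main__":
    for sentence in select(SAMPLE):
        print(sentence)
```

Unlike LT XML, ElementTree builds the whole tree before matching, so this shows only the shape of the query, not the bounded-context evaluation described earlier.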

Simple Tools are Simple to Build • Less than one page of C code to produce simple application • Pipelines mean you can compose simple tools for complex applications

Pre-constructed Tools • Extract text content: textonly • Select fragments based on tags, attributes and text content: sggrep • Count tags: sgcount • Production-system style transformation: sgmltrans • Simple pattern-based information extraction: sgrpg • Indexing for fast access: mkindex
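
To give a sense of how small such tools can be, here are stdlib Python analogues of two of them. These are not the LT XML programs: count_tags and text_only are hypothetical names, and unlike the stream-based originals they load the whole document into memory via xml.etree.ElementTree.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(xml_text):
    """Count how often each element type occurs (cf. sgcount)."""
    return Counter(elem.tag for elem in ET.fromstring(xml_text).iter())


def text_only(xml_text):
    """Return the text content with all markup removed (cf. textonly)."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    doc = "<doc><p>One <b>two</b></p><p>three</p></doc>"
    print(count_tags(doc))
    print(text_only(doc))
```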

Availability • Free to all for research use • Executables and libraries for Unix (Solaris, SunOS, Linux, FreeBSD) and Win32 • Sources for Unix • Packaged executable for Mac • http://www.ltg.ed.ac.uk/software/xml/

What about user interaction? • C is not the world's easiest or most portable GUI-building environment • We have in-house clients who are happy with scripting languages • So we've embedded LT XML inside a number of other contexts • Common Lisp • Perl • Python • It's the Python embedding that's the main topic for today

LT PyXML Basics • A C-implemented Python module • Integrates the LT XML API into Python • Architecture • Both views (bits and tree fragments) • Objects • including garbage collection • Functions • A modest subset • We've used the Tkinter module for all our GUI work, but Python has other GUI options

LT PyXML functions • Files • Open, OpenString, Fopen, Close • Bits • GetNextBit, ItemParse • Attributes • GetAttrVal, ItemActualAttributes, PutAttrVal • Queries • ParseQuery, GetNextQueryItem • Printing • Print, PrintEndTag, PrintStartTag, PrintTextLiteral

LT PyXML Objects • Use native Python lists and dictionaries where we can • New primitive Objects, often lazy wrt pullthrough • Files • NSL_File • Doctypes • NSL_Doctype, NSL_ElementType, NSL_AttrDefn, NSL_ContentParticle • Instances • NSL_Bit, NSL_Item, NSL_ERef, NSL_OOB • Queries • NSL_Query

LT PyXML limitations • 8-bit character inventory (Python/Tk limitation) • I haven't delivered on the promise in the abstract, but • The binary is in the XED distributions • A proper release will appear shortly

Three applications • XED • instance access minimal • doctype access minimal • Schema workbench • instance access paradigmatic • depends heavily on validation • XML DTD Normaliser • instance access non-existent • doctype access paradigmatic

XED • A text editor for XML document instances • Implemented in Python using LT PyXML and Tkinter • Optimised for hand-authoring small- to medium-sized documents • Cross-platform • Free of charge • Sources not yet available

XED features • Single-window WYSIWYG presentation • Add, remove and rename balanced start/end tag pairs and empty elements • Add, remove and rename attribute name/value pairs • Add or remove comments, CDATA sections and processing instructions • Context-sensitive tag and attribute menus

XED features, cont'd • Filling of text content, indenting of element-only content • Structure-sensitive point-and-sweep selection paradigm • Structure-preserving cut and paste • Multiple undo • Key bindings based on xxxPad under Win32; based on Emacs under Unix

XED demo • See http://www.ltg.ed.ac.uk/ht/xed.html • The vast bulk of XED is Python/Tk, but it's made possible by LT PyXML • Control of text segments • Control of OOB processing • Context-sensitive menus are initialised from the DTD • Really helps newcomers to XML get started • Cannot produce ill-formed XML

Schema Workbench demo • Not publicly available yet • Built to facilitate development of the XML Schema spec • When I started writing large schemata which exploited the refinement aspects of the public WD • I needed to see the type hierarchy • I needed to produce a normalised DTD to compare with the originals

Schema Workbench features • The schema document to schema structures part of this took less than a day to write • Two main reasons • Validation on the way in meant • I could depend on the presence of required components • I didn't need to check for misplaced bits • Python's object-creation and evaluation facilities • Turned most NSL_Items directly into Python objects with object type == GI • Once I had the structures, implementing refinement was easy
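
The idea of turning each item into a Python object whose type is its GI can be sketched as follows. This is an illustration under assumptions, not the Schema Workbench code: it uses xml.etree.ElementTree in place of NSL_Items, and Component, class_for and build are hypothetical names, as are the element names in the sample. ElementTree does not validate, so unlike the workbench this sketch cannot rely on required components being present.

```python
import xml.etree.ElementTree as ET


class Component:
    """Base class for every object built from an element."""

    def __init__(self, attrs, children):
        self.attrs = attrs
        self.children = children


_classes = {}  # cache: element type name (GI) -> generated class


def class_for(gi):
    """Return, creating on first use, a class named after the GI."""
    if gi not in _classes:
        _classes[gi] = type(gi, (Component,), {})
    return _classes[gi]


def build(elem):
    """Recursively turn an element into an object of type == its GI."""
    cls = class_for(elem.tag)
    return cls(dict(elem.attrib), [build(child) for child in elem])


if __name__ == "__main__":
    # Invented sample input, for illustration only.
    src = '<schema><element name="a"/><element name="b"/></schema>'
    root = build(ET.fromstring(src))
    print(type(root).__name__, [type(c).__name__ for c in root.children])
```

With one class per element type, behaviour such as refinement can then be attached as methods on the relevant classes.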

DTD normaliser • This was a two hour, 1.5 page job: • Find the DTD • Construct a string file which uses it • Open that string • Sort the doctype • Print the declarations, sorting disjunctions
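
The steps above can be approximated with the standard library's expat bindings. This is a sketch under assumptions, not the LT PyXML normaliser: it handles element declarations only, places the DTD in an internal subset rather than referencing an external file, and takes "sort the doctype" to mean ordering declarations by element name. The names render and normalise are hypothetical.

```python
from xml.parsers import expat

M = expat.model
QUANT = {
    M.XML_CQUANT_NONE: "",
    M.XML_CQUANT_OPT: "?",
    M.XML_CQUANT_REP: "*",
    M.XML_CQUANT_PLUS: "+",
}


def render(model):
    """Render an expat content model, sorting the members of choices."""
    ctype, quant, name, children = model
    q = QUANT[quant]
    if ctype == M.XML_CTYPE_EMPTY:
        return "EMPTY"
    if ctype == M.XML_CTYPE_ANY:
        return "ANY"
    if ctype == M.XML_CTYPE_NAME:
        return name + q
    parts = [render(child) for child in children]
    if ctype == M.XML_CTYPE_MIXED:
        return "(" + "|".join(["#PCDATA"] + sorted(parts)) + ")" + q
    if ctype == M.XML_CTYPE_CHOICE:
        return "(" + "|".join(sorted(parts)) + ")" + q
    return "(" + ",".join(parts) + ")" + q  # sequence: order is significant


def normalise(dtd_text, root):
    """Return sorted <!ELEMENT ...> declarations for the given DTD text."""
    decls = {}

    def on_element_decl(name, model):
        decls[name] = render(model)

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    # Construct a string "file" that uses the DTD, then parse it.
    doc = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, dtd_text, root)
    parser.Parse(doc, True)
    return ["<!ELEMENT %s %s>" % (n, decls[n]) for n in sorted(decls)]


if __name__ == "__main__":
    dtd = "<!ELEMENT p (#PCDATA|i|b)*>\n<!ELEMENT doc (title,(sec|p)+)>"
    print("\n".join(normalise(dtd, "doc")))
```

Two DTDs normalised this way can be compared with an ordinary line-based diff, since declaration order and the order of alternatives no longer matter.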

I can't resist :-) • Once I got the tools built, I could diff the normalised XHTML draft DTD and the DTD produced from my XHTML schema • I found one error • in the DTD!

When it's time to railroad, everybody railroads • The next big challenge for XML, and for Schemas in particular, is • Managing the mapping between document infoset and application infoset • LT PyXML has proved to be a useful laboratory for exploring this issue
