LT PyXML is a fast validating XML parser embedded in Python, developed by the HCRC Language Technology Group at the University of Edinburgh. It supports work on large, densely annotated text collections for computational linguistics research. The parser offers two views of a document, flat and tree-structured, and includes a query language for specifying elements efficiently. LT PyXML is designed to make XML documents easy to process while still providing validation. It is freely available for research use on several platforms.
LT PyXML: A fast validating XML parser embedded in Python Henry S. Thompson HCRC Language Technology Group University of Edinburgh
Acknowledgements • This work was carried out in the Language Technology Group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council • The UK Engineering and Physical Sciences Research Council funded project NSCOPE, which stimulated some of the work discussed here today • This work was also helped by grants to our group from Sun Microsystems and Microsoft
How we use SGML/XML • We use SGML and XML in the context of collecting, standardising, distributing, annotating and using large text collections (corpora) for computational linguistics research and development • These corpora are: • Large: 10-100 million words • Densely annotated: often every word has associated markup • DTDs and validation are very important to us
An aside about validation • A DTD or schema is a contract between producers and consumers • It provides a guaranteed interface • Producers validate to ensure they are providing what they promised • Consumers validate to check up on producers • and to protect their applications • Application authors validate to simplify their task • Leave error detection and analysis to the validating parser
How we use XML (2) • Like any other SME, we produce documents • Being a university-embedded SME, we produce lots of documents • Lots of those documents are trivial variations on one another, based on target medium and/or audience • Overhead slides for teaching • Web pages for publicity/teaching backup • Presentation slides for conferences • Research papers for monographs and journals
Our application needs • Batch applications to automatically add linguistic annotation • Modular, pipelined programs supporting data parallelism • Specialised interactive editors to hand-correct markup • Authoring tools and publication tools which make content-sharing easy
We built software: RXP & LT XML • because of the following issues: • Price • Efficiency • C-language interface • Documentation • Contrast with EXPAT • 50 to 100% slower • but still 90% faster than Java implementations • Thoroughly documented • Validates • Coverage nine nines identical
LT XML: Basic Architecture • Pipelines of ‘fat’ streams • cf. Unix ‘thin’ streams • API provides primitives for XML-appropriate input and output • Two alternative views: • micro-sequence: start-tag, comment, char-data, end-tag, proc. inst. • tree-structure: sequence of sub-trees, level ad lib.
Flat view • provides GetNextBit which reads the next bit of XML: • Start/empty tags (including attributes and all values) • Text==PCDATA • End tags • Processing instructions • PrintBit will write one of these to an output stream
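The flat view can be illustrated without LT XML itself. The following is a minimal sketch that uses Python's standard `xml.dom.pulldom` module as a stand-in: it is not the LT XML or LT PyXML API, and the `bits` generator and its tuple layout are hypothetical names chosen for this illustration. Each yielded tuple plays the role of one "bit" as described above.

```python
from xml.dom import pulldom


def bits(xml_text):
    """Yield one tuple per 'bit' of the document (hypothetical layout)."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            # A start tag carries all of its attributes and their values.
            attrs = {name: value for name, value in node.attributes.items()}
            yield ("start", node.tagName, attrs)
        elif event == pulldom.CHARACTERS:
            yield ("text", node.data)
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName)
        elif event == pulldom.PROCESSING_INSTRUCTION:
            yield ("pi", node.target, node.data)


if __name__ == "__main__":
    for bit in bits('<S ID="s1">Hi <W>there</W></S>'):
        print(bit)
```

A consumer written this way never holds more than one bit at a time, which is what makes the flat view suitable for very large corpora.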
Tree-structured view • Items are subtrees of the SGML structure • Reading • GetNextItem • GetNextQueryItem • Writing • PrintItem • The two views (flat or tree-structured) can be mixed to suit the needs of the application
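Mixing the two views can be sketched in the same spirit. The example below again uses the standard `xml.dom.pulldom` module as a stand-in, not the LT XML API: the stream is read flat until a start tag of interest appears, and only then is the whole subtree pulled in as a single item. The function name `paragraphs` and the element names are assumptions made for this illustration.

```python
from xml.dom import pulldom


def paragraphs(xml_text, tag="P"):
    """Scan the document flat, but deliver each <tag> element as a subtree."""
    stream = pulldom.parseString(xml_text)
    for event, node in stream:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Switch to the tree view for this element only: expandNode
            # reads ahead to the matching end tag and builds the subtree.
            stream.expandNode(node)
            yield node


if __name__ == "__main__":
    doc = "<TEXT><HEAD>t</HEAD><P><S>a</S><S>b</S></P><P><S>c</S></P></TEXT>"
    for p in paragraphs(doc):
        print(p.toxml())
```

Everything outside the selected elements is skipped without ever being built into a tree, so memory use is bounded by the largest selected subtree rather than by the document.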
Query language • LT XML defines a query language which allows the specification of elements from an XML document • Queries are tree based, using element names, attribute values and textual data • Similar path-style syntax to XPath • Regular expressions are allowed for attribute values.
Query language, continued • The LT XML query language is not a complete relational query language, although that can be built on top • For efficiency reasons, LT XML doesn't allow queries which require back-tracking or an unbounded amount of left context • The query language allows programmers to quickly find the sub-structure they are interested in, while ignoring the rest
Query example .*/TEXT/./P[TYPE=STD]/S[1]
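For readers without LT XML to hand, a roughly comparable selection can be written with the limited XPath subset in Python's standard `xml.etree.ElementTree`. This is only an approximation in a different query language. It assumes the query above reads as: at any depth, a TEXT element; then any single element; then a P whose TYPE attribute is STD; then an S chosen by position. How LT XML counts the index in `S[1]` is not stated here, whereas ElementTree's `[1]` means the first S child. The sample document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample corpus fragment.
SAMPLE = """
<CORPUS>
  <TEXT>
    <BODY>
      <P TYPE="STD"><S>First std.</S><S>Second std.</S></P>
      <P TYPE="HEAD"><S>Heading.</S></P>
      <P TYPE="STD"><S>Another first.</S></P>
    </BODY>
  </TEXT>
</CORPUS>
"""

# ElementTree path that approximates the LT XML query (assumed reading).
PATH = ".//TEXT/*/P[@TYPE='STD']/S[1]"


def first_sentences(xml_text):
    """Return the text of each selected S element, in document order."""
    root = ET.fromstring(xml_text)
    return [s.text for s in root.findall(PATH)]


if __name__ == "__main__":
    print(first_sentences(SAMPLE))
```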
Simple Tools are Simple to Build • Less than one page of C code to produce simple application • Pipelines mean you can compose simple tools for complex applications
Pre-constructed Tools • Extract text content: textonly • Select fragments based on tags, attributes and text content: sggrep • Count tags: sgcount • Production-system style transformation: sgmltrans • Simple pattern-based information extraction: sgrpg • Indexing for fast access: mkindex
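To give a feel for how small such tools can be, here is a sketch of two of them in Python using only the standard library. These are not the LT XML tools themselves: `count_tags` and `text_only` are hypothetical functions that only mimic what sgcount and textonly are described as doing, and they do no validation.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(source):
    """Count start tags per element name, reading the input as a stream."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("start",)):
        counts[elem.tag] += 1
    return counts


def text_only(source):
    """Return the concatenated text content with all markup removed."""
    return "".join(ET.parse(source).getroot().itertext())


if __name__ == "__main__":
    doc = b"<TEXT><P><S>One.</S><S>Two.</S></P><P><S>Three.</S></P></TEXT>"
    print(count_tags(io.BytesIO(doc)))
    print(text_only(io.BytesIO(doc)))
```

Both functions accept a file name or a binary file object, so they could be wrapped as command-line filters and chained in a pipeline.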
Availability • Free to all for research use • Executables and libraries for Unix (Solaris, SunOS, Linux, FreeBSD) and Win32 • Sources for Unix • Packaged executable for Mac • http://www.ltg.ed.ac.uk/software/xml/
What about user interaction? • C is not the world's easiest or most portable GUI-building environment • We have in-house clients who are happy with scripting languages • So we've embedded LT XML inside a number of other contexts • Common Lisp • Perl • Python • It's the Python embedding that's the main topic for today
LT PyXML Basics • A C-implemented Python module • Integrates the LT XML API into Python • Architecture • Both views (bits and tree fragments) • Objects • including garbage collection • Functions • A modest subset • We've used the Tkinter module for all our GUI work, but Python has other GUI options
LT PyXML functions • Files • Open, OpenString, Fopen, Close • Bits • GetNextBit, ItemParse • Attributes • GetAttrVal, ItemActualAttributes, PutAttrVal • Queries • ParseQuery, GetNextQueryItem • Printing • Print, PrintEndTag, PrintStartTag, PrintTextLiteral
LT PyXML Objects • Use native Python lists and dictionaries where we can • New primitive Objects, often lazy wrt pullthrough • Files • NSL_File • Doctypes • NSL_Doctype, NSL_ElementType, NSL_AttrDefn, NSL_ContentParticle • Instances • NSL_Bit, NSL_Item, NSL_ERef, NSL_OOB • Queries • NSL_Query
LT PyXML limitations • 8-bit character inventory (Python/Tk limitation) • I haven't delivered on the promise in the abstract, but • The binary is in the XED distributions • A proper release will appear shortly
Three applications • XED • instance access minimal • doctype access minimal • Schema workbench • instance access paradigmatic • depends heavily on validation • XML DTD Normaliser • instance access non-existent • doctype access paradigmatic
XED • A text editor for XML document instances • Implemented in Python using LT PyXML and Tkinter • Optimised for hand-authoring small- to medium-sized documents • Cross-platform • Free of charge • Sources not yet available
XED features • Single-window WYSIWYG presentation • Add, remove and rename balanced start/end tag pairs and empty elements • Add, remove and rename attribute name/value pairs • Add or remove comments, CDATA sections and processing instructions • Context-sensitive tag and attribute menus
XED features, cont'd • Filling of text content, indenting of element-only content • Structure-sensitive point-and-sweep selection paradigm • Structure-preserving cut and paste • Multiple undo • Key bindings based on xxxPad under Win32; based on Emacs under Unix
XED demo • See http://www.ltg.ed.ac.uk/ht/xed.html • The vast bulk of XED is Python/Tk, but it's made possible by LT PyXML • Control of text segments • Control of OOB processing • Context-sensitive menus are initialised from the DTD • Really helps newcomers to XML get started • Cannot produce ill-formed XML
Schema Workbench demo • Not publicly available yet • Built to facilitate development of the XML Schema spec • When I started writing large schemata which exploited the refinement aspects of the public WD • I needed to see the type hierarchy • I needed to produce a normalised DTD to compare with the originals
Schema Workbench features • The schema document to schema structures part of this took less than a day to write • Two main reasons • Validation on the way in meant • I could depend on the presence of required components • I didn't need to check for misplaced bits • Python's object-creation and evaluation facilities • Turned most NSL_Items directly into Python objects with object type == GI • Once I had the structures, implementing refinement was easy
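The idea of turning each item into a Python object whose type is its GI (element name) can be sketched with Python's built-in `type()` class factory. This is an illustration only, not the Schema Workbench code: it uses the standard `xml.etree.ElementTree` in place of NSL_Items, and `Component`, `class_for` and `build` are hypothetical names.

```python
import xml.etree.ElementTree as ET


class Component:
    """Hypothetical base class shared by every generated element class."""

    def __init__(self, attrs, children, text):
        self.attrs = attrs
        self.children = children
        self.text = text


_classes = {}


def class_for(gi):
    """Return the class named after this GI, creating it on first use."""
    if gi not in _classes:
        _classes[gi] = type(gi, (Component,), {})
    return _classes[gi]


def build(elem):
    """Recursively convert an element into an instance of its GI's class."""
    cls = class_for(elem.tag)
    children = [build(child) for child in elem]
    return cls(dict(elem.attrib), children, (elem.text or "").strip())


if __name__ == "__main__":
    root = build(ET.fromstring('<schema><element name="p"/></schema>'))
    print(type(root).__name__, type(root.children[0]).__name__)
```

Because the input is assumed to have been validated already, `build` contains no checks for missing or misplaced components, which is the simplification the slide describes.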
DTD normaliser • This was a two-hour, 1.5 page job: • Find the DTD • Construct a string file which uses it • Open that string • Sort the doctype • Print the declarations, sorting disjunctions
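The last two steps, sorting the declarations and sorting the disjunctions inside them, can be sketched on plain strings. This is a simplified stand-in rather than the author's program, which works on the parsed doctype: the `normalise` function is hypothetical, it only handles choice groups that contain no nested parentheses, and it keeps `#PCDATA` first because XML requires that position in mixed content.

```python
import re

# A parenthesised group containing at least one '|' and no nested groups.
_CHOICE = re.compile(r"\(([^()]*\|[^()]*)\)")


def _sort_choice(match):
    alternatives = [alt.strip() for alt in match.group(1).split("|")]
    # #PCDATA must stay first in a mixed-content model; sort the rest.
    alternatives.sort(key=lambda alt: (alt != "#PCDATA", alt))
    return "(" + "|".join(alternatives) + ")"


def normalise(declarations):
    """Collapse whitespace, sort each simple choice group, sort the list."""
    cleaned = []
    for decl in declarations:
        decl = " ".join(decl.split())
        cleaned.append(_CHOICE.sub(_sort_choice, decl))
    return sorted(cleaned)


if __name__ == "__main__":
    for line in normalise([
        "<!ELEMENT p (em | #PCDATA | b)*>",
        "<!ELEMENT body  (p|h1)+>",
        "<!ELEMENT b (#PCDATA)>",
    ]):
        print(line)
```

Two DTDs normalised this way can be compared with an ordinary line-based diff, since differences in declaration order and in the order of alternatives no longer show up.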
I can't resist :-) • Once I got the tools built, I could diff the normalised XHTML draft DTD and the DTD produced from my XHTML schema • I found one error • in the DTD!
When it's time to railroad, everybody railroads • The next big challenge for XML, and for XML Schema in particular, is • Managing the mapping between document infoset and application infoset • LT PyXML has proved to be a useful laboratory for exploring this issue