The Joy of SAX (and DOM, and JDOM…)

The Joy of SAX(and DOM, and JDOM…) Bill MacCartney 11 October 2004

Roadmap • What are XML APIs good for? • Overview of JAXP • JAXP XML Parsers • SAX vs. DOM • SAX • SAX Architecture • Using SAX • SAXExample.java • DOM • DOM Architecture • Using DOM • DOMExample.java • JDOM

What are XML APIs for? • You want to read/write data from/to XML files, and you don't want to write an XML parser. • Applications: • processing an XML-tagged corpus • saving configs, prefs, parameters, etc. as XML files • sharing results with outside users in portable format • example: typed dependency relations • alternative to serialization for persistent stores • doesn't break with changes to class definition • human-readable

Overview of JAXP • JAXP = Java API for XML Processing • Provides a common interface for creating and using the standard SAX, DOM, and XSLT APIs in Java. • All JAXP packages are included standard in JDK 1.4+.The key packages are:

JAXP XML Parsers • javax.xml.parsers defines abstract classes DocumentBuilder (for DOM) and SAXParser (for SAX). • It also defines factory classes DocumentBuilderFactory and SAXParserFactory. By default, these give you the “reference implementation” of DocumentBuilder and SAXParser, but they are intended to be vendor-neutral factory classes, so that you could swap in a different implementation if you preferred. • The JDK includes three XML parser implementations from Apache: • Crimson: The original. Small and fast. Based on code donated to Apache by Sun. Standard implementation for J2SE 1.4. • Xerces: More features. Supports XML Schema. Based on code donated to Apache by IBM. • Xerces 2: The future. Standard implementation for J2SE 5.0.

Java-specific interprets XML as a stream of events you supply event-handling callbacks SAX parser invokes your event-handlers as it parses doesn't build data model in memory serial access very fast, lightweight good choice when no data model is needed, or natural structure for data model is list, matrix, etc. W3C standard for representing structured documents platform and language neutral(not Java-specific!) interprets XML as a tree of nodes builds data model in memory enables random access to data therefore good for interactive apps more CPU- and memory-intensive good choice when data model has natural tree structure SAX vs. DOM SAX = Simple API for XML DOM = Document Object Model There is also JDOM … more later

SAX Architecture

Using SAX importjavax.xml.parsers.*; importorg.xml.sax.*; importorg.xml.sax.helpers.*; // get a SAXParser object SAXParserFactory factory = SAXParserFactory.newInstance(); SAXParser saxParser = factory.newSAXParser(); // invoke parser using your custom content handler saxParser.parse(inputStream, myContentHandler); saxParser.parse(file, myContentHandler); saxParser.parse(url, myContentHandler); Here’s the standard recipe for starting with SAX: (This reflects SAX 1, which you can still use, but SAX 2 prefers a new incantation…)

Using SAX 2 // tell SAX which XML parser you want (here, it’s Crimson) System.setProperty("org.xml.sax.driver", "org.apache.crimson.parser.XMLReaderImpl"); // get an XMLReader object XMLReaderreader = XMLReaderFactory.createXMLReader(); // tell the XMLReader to use your custom content handler reader.setContentHandler(myContentHandler); // Have the XMLReader parse input from Reader myReader: reader.parse(new InputSource(myReader)); In SAX 2, the following usage is preferred: But where does myContentHandler come from?

Defining a ContentHandler • Easiest route: define a new class which extends org.xml.sax.helpers.DefaultHandler. • Override event-handling methods from DefaultHandler: (All are no-ops in DefaultHandler.) startDocument() // receive notice of start of document endDocument() // receive notice of end of document startElement() // receive notice of start of each element endElement() // receive notice of end of each element characters() // receive a chunk of character data error() // receive notice of recoverable parser error // ...plus more...

startElement()and endElement() The SAXParser invokes your callbacks to notify you of events: • For simple usage, ignore namespaceURI and localName, and just use qName (the “qualified” name). • XML namespaces are described in an appendix, below. • startElement() and endElement() events always come in pairs: • “<foo/>” will generate calls: startElement("", "", "foo", null) endElement("", "", "foo") startElement(String namespaceURI, // for use w/ namespaces String localName, // for use w/ namespaces String qName, // "qualified" name -- use this one! Attributes atts) endElement(String namespaceURI, String localName, String qName)

SAX Attributes • Every call to startElement() includes an Attributes object which represents all the XML attributes for that element. • Methods in the Attributes interface: getLength() // return number of attributes getIndex(StringqName) // look up attribute's index by qName getValue(StringqName) // look up attribute's value by qName getValue(intindex) // look up attribute's value by index // ... and others ...

SAX characters() • May be called multiple times within each block of character data—for example, once per line. • So, you may want to use calls to characters() to accumulate characters in a StringBuffer, and stop accumulating at the next call to startElement(). The characters() event handler receives notification of character data (i.e. content that is not part of an XML element): publicvoidcharacters(char[]ch, // buffer containing chars intstart, // start position in buffer intlength) // num of chars to read

SAXExample: Input XML <?xml version="1.0" encoding="UTF-8"?> <dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra>  </dots>

SAXExample: Code Please see SAXExample.java

SAXExample: Input  Output startDocument startElement: dots (0 attributes) characters: this is before the first dot and it continues on multiple lines startElement: dot (2 attributes) endElement: dot startElement: dot (2 attributes) endElement: dot startElement: flip (0 attributes) characters: flip is on startElement: dot (2 attributes) endElement: dot startElement: dot (2 attributes) endElement: dot endElement: flip characters: flip is off startElement: dot (2 attributes) endElement: dot startElement: extra (0 attributes) characters: stuff endElement: extra endElement: dots endDocument Finished parsing input. Got the following dots: [(9, 81), (11, 121), (14, 196), (13, 169), (12, 144)] <?xml version="1.0" encoding="UTF-8"?> <dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra>  </dots>

DOM Architecture

DOM Document Structure Document structure: XML Input: Document +---Element <dots> +---Text "this is before the first dot | and it continues on multiple lines" +---Element <dot> +---Text "" +---Element <dot> +---Text "" +---Element <flip> |+---Text "flip is on" |+---Element <dot> |+---Text "" |+---Element <dot> |+---Text "" +---Text "flip is off" +---Element <dot> +---Text "" +---Element <extra> |+---Text "stuff" +---Text "" +---Comment "a final comment" +---Text "" <?xml version="1.0" encoding="UTF-8"?> <dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra>  </dots> • There’s a text node between every pair of element nodes, even if the text is empty. • XML comments appear in special comment nodes. • Element attributes do not appear in tree—available through Element object.

Using DOM Here’s the basic recipe for getting started with DOM: importjavax.xml.parsers.*; importorg.w3c.dom.*; // get a DocumentBuilder object DocumentBuilderFactorydbf = DocumentBuilderFactory.newInstance(); DocumentBuilderdb = null; try { db = dbf.newDocumentBuilder(); } catch (ParserConfigurationExceptione) { e.printStackTrace(); } // invoke parser to get a Document Documentdoc = db.parse(inputStream); Documentdoc = db.parse(file); Documentdoc = db.parse(url);

DOM Document access idioms OK, say we have a Document. How do we get at the pieces of it? Here are some common idioms: // get the root of the Document tree Elementroot = doc.getDocumentElement(); // get nodes in subtree by tag name NodeListdots = root.getElementsByTagName("dot"); // get first dot element ElementfirstDot = (Element) dots.item(0); // get x attribute of first dot Stringx = firstDot.getAttribute("x");

More Document accessors Node access methods: StringgetNodeName() shortgetNodeType() DocumentgetOwnerDocument() boolean hasChildNodes() NodeListgetChildNodes() NodegetFirstChild() NodegetLastChild() NodegetParentNode() NodegetNextSibling() NodegetPreviousSibling() booleanhasAttributes() ... and more ... Element extends Node and adds these access methods: StringgetTagName() booleanhasAttribute(Stringname) StringgetAttribute(Stringname) NodeListgetElementsByTagName(Stringname) … and more … Document extends Node and adds these access methods: ElementgetDocumentElement() DocumentTypegetDoctype() ... plus the Element methods just mentioned ... ... and more ... e.g. DOCUMENT_NODE, ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, etc.

Creating & manipulating Documents The DOM API also includes lots of methods for creating and manipulating Document objects: // get new empty Document from DocumentBuilder Documentdoc = db.newDocument(); // create a new <dots> Element and add to Document as root Elementroot = doc.createElement("dots"); doc.appendChild(root); // create a new <dot> Element and add as child of root Elementdot = doc.createElement("dot"); dot.setAttribute("x", "9"); dot.setAttribute("y", "81"); root.appendChild(dot);

More Document manipulators Node manipulation methods: voidsetNodeValue(StringnodeValue) NodeappendChild(NodenewChild) NodeinsertBefore(NodenewChild, NoderefChild) NoderemoveChild(NodeoldChild) ... and more ... Element manipulation methods: voidsetAttribute(Stringname, Stringvalue) voidremoveAttribute(Stringname) … and more … Document manipulation methods: TextcreateTextNode(Stringdata) CommentcreateCommentNode(Stringdata) ... and more ...

Writing a Document as XML • Strangely, since JAXP 1.1, there is no simple, documented way to write out a Document object as XML. • Instead, you can exploit an undocumented trick: cast the Document to a Crimson XmlDocument, which knows how to write itself out: • There is a supported way to write Documents as XML via the XSLT library, but it is far more clumsy than this two-line trick. • Of course, one could just walk the Document tree and write XML using printlns. • JDOM remedies this with easy XML output! importorg.apache.crimson.tree.XmlDocument; XmlDocumentx = (XmlDocument) doc; x.write(out, "UTF-8");

DOMExample: Code Please see DOMExample.java

JDOM Overview • DOM can be awkward for Java programmers • Language-neutral  does not use Java features • Example: getChildNodes() returns a NodeList, which is not a List. (NodeList.iterator() is not defined.) • JDOM looks like a good alternative: • open source project, Apache license, late beta • builds on top of JAXP, integrates with SAX and DOM • similar to DOM model, but no shared code • API designed to be easy & obvious for Java programmers • exploits power of Java language: collections, method overloading • rumored to become integrated in future JDKs • XML output is easy! • Key packages:org.jdom, org.jdom.transform, org.jdom.input, org.jdom.output.

schweet! DOM vs. JDOM The DOM way: DocumentBuilderFactoryfactory = DocumentBuilderFactory.newInstance(); DocumentBuilderbuilder = factory.newDocumentBuilder(); Documentdoc = builder.newDocument(); Elementroot = doc.createElement("root"); Texttext = doc.createText("This is the root"); root.appendChild(text); doc.appendChild(root); The JDOM way: Documentdoc = newDocument(); Elemente = newElement("root"); e.setText("This is the root"); doc.addContent(e);

Pointers • Everything in this tutorial (slides, example code, example data) will be archived at http://nlp.stanford.edu/local/for your future reference. • There’s a good JAXP/SAX/DOM tutorial at:http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/ • You can learn more about JDOM athttp://www.jdom.org/docs/faq.html

The Joy of SAX (and DOM, and JDOM…)

The Joy of SAX (and DOM, and JDOM…)

Presentation Transcript

Schema validation with java

JDOM: How It Works, and How It Opened the Java Process

J2EE —— 第 5 章 Simple API for XML(SAX)

XML Programming in Java