Introduction to SAX: a standard interface for event-based XML parsing
Cheng-Chia Chen
What is an Event-Based Interface? Two major types of XML APIs: • Tree-based APIs ==> DOM • compiles an XML document into an internal tree structure, then allows an application to navigate that tree. • Event-based APIs. ==> SAX • reports parsing events (such as the start and end of elements) directly to the application through callbacks, • usually does not build an internal tree. • The application implements handlers to deal with the different events, much like handling events in a graphical user interface. • Comparison: For tree-based APIs • useful for many applications • require more system resources, especially if the document is large.
How an event-based API works
• Consider the following sample document:
  <?xml version="1.0"?>
  <doc>
    <para>Hello, world!</para>
  </doc>
• An event-based interface breaks the structure of this document down into a sequence of SAX events:
  start document
  start element: doc
  start element: para
  characters: Hello, world!
  end element: para
  end element: doc
  end document
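The event sequence above can be reproduced with a small handler that records every callback it receives. This is a minimal sketch, not part of the original slides: it assumes the SAX2 DefaultHandler and the JAXP SAXParserFactory bundled with the JDK, and the class name EventTrace is hypothetical. Whitespace-only character runs between elements are skipped so that the trace matches the list on the slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records each SAX callback as one line of text. */
class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("start document"); }
    @Override public void endDocument() { events.add("end document"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start element: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end element: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip the whitespace-only runs between elements.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("characters: " + text);
        }
    }

    /** Parses the XML text and returns the recorded events in order. */
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n<para>Hello, world!</para>\n</doc>";
        trace(xml).forEach(System.out::println);
    }
}
```

Note that no tree is ever built: the handler sees each event once, in document order, and keeps only what it chooses to store.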
Quick Start for SAX Application Writers
1. Download and install at least two Java libraries, making certain that you add all of them to your CLASSPATH:
   1. the SAX interfaces and classes; and
   2. an XML parser that supports SAX.
2. Make a note of the full class name of the SAX driver for the parser:
   • Xerces => "org.apache.xerces.parsers.SAXParser"
   • Sun's => "com.sun.xml.parser.Parser" or "com.sun.xml.parser.ValidatingParser"
3. Create event handlers to receive information about the document.
   • The most important type of handler is the DocumentHandler (SAX1) or ContentHandler (SAX2), which receives events for the start and end of elements, character data, processing instructions, and other basic XML structure.
   • Rather than implementing the entire interface, you can create a class that extends HandlerBase (or DefaultHandler for SAX2), and then implement only the methods that you need.
Example: (MyHandler.java)
• Prints a message each time an element starts or ends:

import org.xml.sax.HandlerBase;
import org.xml.sax.AttributeList;

public class MyHandler extends HandlerBase {
    public void startElement (String name, AttributeList atts)
    {
        System.out.println("Start element: " + name);
    }
    public void endElement (String name)
    {
        System.out.println("End element: " + name);
    }
}
The main program (SAXApp.java)

import org.xml.sax.Parser;
import org.xml.sax.DocumentHandler;
import org.xml.sax.helpers.ParserFactory;

public class SAXApp {
    static final String parserClass = "com.microstar.xml.SAXDriver";
    // or org.apache.xerces.parsers.SAXParser for Xerces

    public static void main (String args[]) throws Exception
    {
        Parser parser = ParserFactory.makeParser(parserClass);
        DocumentHandler handler = new MyHandler();
        parser.setDocumentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            parser.parse(args[i]);
        }
    }
}
The input
• The input XML document (roses.xml):
  <?xml version="1.0"?>
  <poem>
    <line>Roses are red,</line>
    <line>Violets are blue.</line>
    <line>Sugar is sweet,</line>
    <line>and I love you.</line>
  </poem>
• To parse this with your SAXApp application, supply the absolute URL of the document on the command line:
  java SAXApp file://localhost/tmp/roses.xml
  or
  java SAXApp file:///tmp/roses.xml
The output
• The output should be as follows:
  Start element: poem
  Start element: line
  End element: line
  Start element: line
  End element: line
  Start element: line
  End element: line
  Start element: line
  End element: line
  End element: poem
• Congratulations: you're parsing XML! You can now go and figure out something more interesting to do with your event handlers.
[Figure: the application writer supplies the SAX driver's parser class name; the driver writer supplies the implementations of Parser, AttributeList, and Locator.]
SAX 1.0: Java Road Map • The SAX Java distribution contains • 11 core classes and interfaces together with • 3 optional helper classes and • 4 demonstration classes. • there are only three interfaces that SAX parser writers need to implement. • While there are five interfaces available for application writers, simple XML applications will need only one or two of them.
SAX classes and interfaces • Falling into five groups: 1. interfaces implemented by the parser: • Parser, AttributeList (required), and Locator (optional) 2.interfaces implemented by the application: • DocumentHandler, ErrorHandler, DTDHandler, and • EntityResolver • (all optional: DocumentHandler will be the most important one for typical XML applications) 3.standard SAX classes: • InputSource, SAXException, • SAXParseException, HandlerBase • (all fully implemented by SAX)
SAX classes and interfaces (cont'd)
4. Optional Java-specific helper classes in the org.xml.sax.helpers package:
   • ParserFactory, AttributeListImpl, and LocatorImpl
   • (all fully implemented by the SAX Java distribution)
5. Java demonstration classes in the default (unnamed) package:
   • SystemIdDemo, ByteStreamDemo, and CharacterStreamDemo
   • all can be run as Java applications;
   • all three share a DemoHandler class.
Interfaces for Parser Writers (org.xml.sax package) • A SAX-conformant XML parser needs to implement only two or three simple interfaces; 1. Parser • the main interface to a SAX parser: • allow the user to register handlers for callbacks, to set the locale for error reporting, and to start an XML parse. 2. AttributeList • allow users to iterate through an attribute list. • a convenience implementation available in the AttributeListImpl. 3. Locator • allows users to find the location of current event in the XML source document.
Interfaces for Application Writers (org.xml.sax package)
• A SAX application may implement any or none of the following interfaces, as required.
  • Simple applications may need only DocumentHandler and possibly ErrorHandler.
  • An application can implement all of these interfaces in a single class.
1. DocumentHandler
   • the interface that applications will probably use the most;
   • in many cases, it is the only one that needs to be implemented.
   • If an application provides an implementation of this interface, it will receive notification of basic document-related events like the start and end of elements.
2. ErrorHandler
   • used for special error handling.
Interfaces for Application Writers (cont’d) 3. DTDHandler • If an application needs to work with notations and unparsed (binary) entities, it must implement this interface to receive notification of the NOTATION and unparsed ENTITY declarations. 4. EntityResolver • If an application needs to do redirection of URIs in documents (or other types of custom handling), it must provide an implementation of this interface.
Standard SAX Classes (org.xml.sax package)
1. InputSource
   • contains all of the necessary information for a single input source, including a public identifier, system identifier, byte stream, and character stream (as appropriate).
   • The application must instantiate at least one InputSource for the Parser, and the EntityResolver may instantiate others.
2. SAXException: represents a general SAX exception.
3. SAXParseException: represents a SAX exception tied to a specific point in an XML source document.
4. HandlerBase
   • provides default implementations for DocumentHandler, ErrorHandler, DTDHandler, and EntityResolver.
   • Application writers can subclass this to simplify handler writing.
Java-Specific Helper Classes (org.xml.sax.helpers package)
• not part of the core SAX distribution,
• may not be available in SAX implementations in other languages:
• provided simply as a convenience for Java programmers.
1. ParserFactory
   • used to load SAX parsers dynamically at run time, based on the class name.
2. AttributeListImpl
   • used to make a persistent copy of an AttributeList, or
   • used to provide a default implementation of AttributeList to the application.
3. LocatorImpl
   • used to make a persistent snapshot of a Locator's values at a specific point in the parse.
Interfaces: AttributeList DTDHandler DocumentHandler EntityResolver ErrorHandler Locator Parser Classes: HandlerBase InputSource Exceptions: SAXException SAXParseException Package: org.xml.sax
methods: getLength() Return the number of attributes in this list. getName(int) Return the name of an attribute in this list (by position). 0-based getType( int | String ) Return the type of an attribute in the list (by position or by name ). getValue(int | String ) Return the value of an attribute in the list (by position or by name). Interface org.xml.sax.AttributeList
Interface org.xml.sax.DTDHandler
Method Index
• notationDecl(String name, String pubID, String sysID) throws SAXException
  Receive notification of a notation declaration event.
  Ex: <!NOTATION GIF PUBLIC "abc"> => notationDecl("GIF", "abc", null)
• unparsedEntityDecl(String, String, String, String)
  Receive notification of an unparsed entity declaration event.
  Ex: <!ENTITY aPic SYSTEM "here" NDATA GIF> =>
      unparsedEntityDecl("aPic",  // name
                         null,    // publicId (none was declared)
                         "here",  // systemId
                         "GIF")   // notationName
• An identifier that is not declared is passed as null, not as an empty string.
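The two DTDHandler callbacks can be observed with a handler that simply logs them. The sketch below is an illustration under assumptions not found in the slides: it uses the SAX2 DefaultHandler (which implements DTDHandler) and the JDK's JAXP parser, and the class name DtdDeclarations is hypothetical. The system identifier is left out of the log because a SAX2 parser may resolve it to an absolute URL before reporting it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: logs NOTATION and unparsed ENTITY declarations. */
class DtdDeclarations extends DefaultHandler {
    static final String SAMPLE =
            "<!DOCTYPE doc [\n"
            + "  <!NOTATION GIF PUBLIC \"abc\">\n"
            + "  <!ENTITY aPic SYSTEM \"here\" NDATA GIF>\n"
            + "]>\n"
            + "<doc/>";

    final List<String> declarations = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        declarations.add("notation " + name + " public=" + publicId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        declarations.add("unparsed entity " + name + " notation=" + notationName);
    }

    /** Parses the XML text and returns the logged declarations. */
    static List<String> collect(String xml) {
        DtdDeclarations handler = new DtdDeclarations();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
        return handler.declarations;
    }

    public static void main(String[] args) {
        collect(SAMPLE).forEach(System.out::println);
    }
}
```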
Method index parse(InputSource) Parse an XML document. parse(String) Parse an XML document from a system identifier (URI). setDocumentHandler(DocumentHandler) Allow an application to register a document event handler. setDTDHandler(DTDHandler) Allow an application to register a DTD event handler. setEntityResolver(EntityResolver) Allow an application to register a custom entity resolver. setErrorHandler(ErrorHandler) Allow an application to register an error event handler. setLocale(Locale) Allow an application to request a locale for errors and warnings. Note: all return types are void. Interface org.xml.sax.Parser
Method Index (implemented optionally in SaxDriver ) getColumnNumber() Return the column number where the current document event ends. getLineNumber() Return the line number where the current document event ends. getPublicId() Return the public identifier for the current document event. getSystemId() Return the system identifier for the current document event. Interface org.xml.sax.Locator
Method Index characters(char[], int, int) Receive notification of character data. endDocument() Receive notification of the end of a document. endElement(String) Receive notification of the end of an element. ignorableWhitespace(char[], int, int) Receive notification of ignorable whitespace in element content. processingInstruction(String, String) Receive notification of a processing instruction. setDocumentLocator(Locator) Receive an object for locating the origin of SAX document events. startDocument() Receive notification of the beginning of a document. startElement(String, AttributeList) Receive notification of the beginning of an element. Interface org.xml.sax.DocumentHandler
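One detail of characters(char[], int, int) is easy to miss: a parser may deliver the text of a single element in several chunks, so a handler should buffer the text and use it only when the element ends. The following sketch shows that pattern with the SAX1 HandlerBase described in these slides. The class name LineCollector is hypothetical, and obtaining the parser through JAXP's SAXParserFactory is an assumption made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: collects the text of every <line> element. */
@SuppressWarnings("deprecation") // HandlerBase and AttributeList are SAX1 types
class LineCollector extends HandlerBase {
    final List<String> lines = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        buffer.setLength(0); // start a fresh buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String name) {
        if (name.equals("line")) {
            lines.add(buffer.toString());
        }
    }

    /** Parses the XML text and returns the text of its <line> elements. */
    static List<String> collect(String xml) {
        LineCollector handler = new LineCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.lines;
    }
}
```

Calling LineCollector.collect on the roses.xml text from the earlier slide should yield the four lines of the poem, in order.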
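Example: print the end location of an endElement event

import org.xml.sax.HandlerBase;
import org.xml.sax.Locator;

public class MyHandler extends HandlerBase {
    // Keep the Locator itself rather than copies of its values:
    // the values are only meaningful while an event callback is running.
    private Locator locator;
    …
    public void setDocumentLocator(Locator l) {
        locator = l;   // called once, before any other document event
    }
    …
    public void endElement(String tag) {
        …
        if (locator != null) {   // a parser is not required to supply a Locator
            System.out.println("end of " + tag + " element at column: "
                + locator.getColumnNumber() + " line: " + locator.getLineNumber());
            // getPublicId() and getSystemId() can be queried in the same way
        }
        …
    }
}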
Interface org.xml.sax.EntityResolver
InputSource resolveEntity(String pubId, String sysId)
• Allows the application to resolve external entities.
• The Parser calls this method before opening any external entity except the top-level document entity, including:
  • the external DTD subset,
  • external entities referenced within the DTD (parameter entities), and
  • external entities referenced within the document element (general entities).
• The parser uses the returned InputSource to continue entity substitution; if the method returns null, the parser opens the system identifier itself.
Special entity processing for the XHTML DTD

import java.io.FileReader;
import java.io.IOException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class MyResolver implements EntityResolver {
    public InputSource resolveEntity (String publicId, String systemId) throws IOException {
        // Constant-first equals() is null-safe: publicId is null when none was declared.
        if ("-//W3C//DTD XHTML 1.0 Strict//EN".equals(publicId)
                || "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".equals(systemId)) {
            // return my local XHTML 1.0 DTD
            return new InputSource(new FileReader("myXhtmlDtdFile.dtd"));
        }
        // use the default behaviour
        return null;
    }
}
Method Index error(SAXParseException) Receive notification of a recoverable error. fatalError(SAXParseException) Receive notification of a non-recoverable error. warning(SAXParseException) Receive notification of a warning. Interface org.xml.sax.ErrorHandler
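A common use of ErrorHandler is to collect problems together with their positions instead of letting the parse fail silently. The sketch below is one possible arrangement, not the method of the slides: it assumes the SAX2 DefaultHandler (which implements ErrorHandler) and the JDK's JAXP parser, and the class name ErrorLog is hypothetical. It rethrows fatal errors, since a parser cannot be relied on to continue after one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records warnings, errors and fatal errors. */
class ErrorLog extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    private void log(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override public void warning(SAXParseException e) { log("warning", e); }

    @Override public void error(SAXParseException e) { log("error", e); }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        log("fatal", e);
        throw e; // stop the parse: the document is not well-formed
    }

    /** Parses the XML text and returns every problem that was reported. */
    static List<String> check(String xml) {
        ErrorLog handler = new ErrorLog();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // Already recorded by fatalError above.
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parser could not be run", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        check("<a><b></a>").forEach(System.out::println);
    }
}
```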
Constructors: InputSource() Zero-argument default constructor. InputSource(InputStream) Create a new input source with a byte stream. InputSource(Reader) Create a new input source with a character stream. InputSource(String) Create a new input source with a system identifier. the input is usually a URL eg: http://localhost/mydoc file://localhost/c:/mydir/mydoc Methods getByteStream() Get the byte stream for this input source. getCharacterStream() Get the character stream for this input source. getEncoding() Get the character encoding for a byte stream or URI. getPublicId() Get the public identifier for this input source. getSystemId() Get the system identifier for this input source. Class org.xml.sax.InputSource
Class org.xml.sax.InputSource (cont'd)
• setByteStream(InputStream) Set the byte stream for this input source.
• setCharacterStream(Reader) Set the character stream for this input source.
• setEncoding(String) Set the character encoding, if known.
• setPublicId(String) Set the public identifier for this input source.
• setSystemId(String) Set the system identifier for this input source.
• Precedence for determining the location of the input source:
  1. getCharacterStream()
  2. getByteStream()
  3. new URL(getSystemId())
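The precedence rule can be checked by giving one InputSource both a character stream and a byte stream that contain different documents. This is a small sketch under assumptions not in the slides: it uses the JDK's JAXP parser and a SAX2 DefaultHandler, and the class name SourcePrecedence and the element names fromReader and fromBytes are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: shows which stream of an InputSource gets parsed. */
class SourcePrecedence {
    /** Returns the name of the root element read from the given source. */
    static String rootName(InputSource source) {
        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input source", e);
        }
        return root[0];
    }

    /** An input source carrying only a byte stream. */
    static InputSource bytesOnly() {
        return new InputSource(new ByteArrayInputStream(
                "<fromBytes/>".getBytes(StandardCharsets.UTF_8)));
    }

    /** An input source carrying a character stream and a byte stream. */
    static InputSource bothStreams() {
        InputSource source = bytesOnly();
        source.setCharacterStream(new StringReader("<fromReader/>"));
        return source;
    }

    public static void main(String[] args) {
        System.out.println(rootName(bothStreams())); // character stream is preferred
        System.out.println(rootName(bytesOnly()));
    }
}
```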
Constructor: HandlerBase() Methods: characters(char[], int, int) Receive notification of character data inside an element. endDocument() Receive notification of the end of the document. endElement(String) Receive notification of the end of an element. error(SAXParseException) Receive notification of a recoverable parser error. fatalError(SAXParseException) Report a fatal XML parsing error. ignorableWhitespace(char[], int, int) Receive notification of ignorable whitespace in element content. notationDecl(String, String, String) Receive notification of a notation declaration. processingInstruction(String, String) Receive notification of a processing instruction. Class org.xml.sax.HandlerBase
resolveEntity(String, String) Resolve an external entity. setDocumentLocator(Locator) Receive a Locator object for document events. startDocument() Receive notification of the beginning of the document. startElement(String, AttributeList) Receive notification of the start of an element. unparsedEntityDecl(String, String, String, String) Receive notification of an unparsed entity declaration. warning(SAXParseException) Receive notification of a parser warning. Class org.xml.sax.HandlerBase (cont’d)
Class org.xml.sax.SAXException • Constructors: • SAXException(Exception) • Create a new SAXException wrapping an existing exception. • SAXException(String) • Create a new SAXException. • SAXException(String, Exception) • Create a new SAXException from an existing exception.
Class org.xml.sax.SAXException Methods: • getException() • Return the embedded exception, if any. • getMessage() • Return a detail message for this exception. • toString() • Convert this exception to a string.
Constructors: SAXParseException(String, Locator) Create a new SAXParseException from a message and a Locator. SAXParseException(String, Locator, Exception) Wrap an existing exception in a SAXParseException. SAXParseException(String, String, String, int, int) Create a new SAXParseException. SAXParseException(String, String, String, int, int, Exception) Create a new SAXParseException with an embedded exception. Class org.xml.sax.SAXParseException
Class org.xml.sax.SAXParseException • Methods: • getColumnNumber() • The column number of the end of the text where the exception occurred. • getLineNumber() • The line number of the end of the text where the exception occurred. • getPublicId() • Get the public identifier of the entity where the exception occurred. • getSystemId() • Get the system identifier of the entity where the exception occurred.
SAX 2.0 • a new Java-based release of SAX, the Simple API for XML. • SAX2 • introduces configurable features and properties • adds support for XML Namespaces; • includes adapters so that SAX1 parsers and applications can interoperate with SAX2.
Changes from SAX 1.0
• Deprecated interfaces and classes, each shown with its SAX2 replacement:
  • should be used only for interaction with SAX1 drivers or applications.
• org.xml.sax.*
  • Parser => XMLReader
  • DocumentHandler => ContentHandler
  • AttributeList => Attributes
  • HandlerBase => DefaultHandler
• org.xml.sax.helpers.*
  • ParserFactory => XMLReaderFactory
  • AttributeListImpl => AttributesImpl
Changes from SAX 1.0 • new interfaces and classes added to SAX2: • org.xml.sax.* • XMLReader (replaces Parser) • ContentHandler (replaces DocumentHandler) • Attributes (replaces AttributeList) • XMLFilter • SAXNotSupportedException • SAXNotRecognizedException • org.xml.sax.helpers.* • AttributesImpl (replaces AttributeListImpl) • DefaultHandler (replaces HandlerBase) • NamespaceSupport • XMLFilterImpl • ParserAdapter (implements XMLReader) • XMLReaderAdapter (implements Parser) • org.xml.sax.ext.* • LexicalHandler • DeclHandler
SAX2: Features and Properties
• Adds standard methods to query and set features and properties in an XMLReader.
• An application can request an XMLReader
  • to validate (or not to validate) a document, or
  • to internalize (or not to internalize) all names,
• using the getFeature, setFeature, getProperty, and setProperty methods.
• Ex:
  try {
      if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
          System.out.println("Parser is validating.");
      } else {
          System.out.println("Parser is not validating.");
      }
  } catch (SAXException e) {
      System.out.println("Parser may or may not be validating.");
  }
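Setting a feature follows the same pattern as querying one, and the two SAX2 exception types tell the application why a request failed. The sketch below is illustrative only: it assumes an XMLReader obtained from the JDK's JAXP SAXParserFactory, the class name FeatureProbe is hypothetical, and the example.org feature URI is deliberately made up to show the unrecognized case.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: sets and queries SAX2 features on an XMLReader. */
class FeatureProbe {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Creates an XMLReader through JAXP (an assumption of this sketch). */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX2 parser available", e);
        }
    }

    /** Tries to set a feature; returns false if the reader refuses. */
    static boolean trySet(XMLReader reader, String featureUri, boolean value) {
        try {
            reader.setFeature(featureUri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    /** Describes the current state of a feature in one line. */
    static String describe(XMLReader reader, String featureUri) {
        try {
            return featureUri + " = " + reader.getFeature(featureUri);
        } catch (SAXNotRecognizedException e) {
            return featureUri + " is not recognized";
        } catch (SAXNotSupportedException e) {
            return featureUri + " cannot be queried right now";
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        trySet(reader, NAMESPACES, true);
        System.out.println(describe(reader, NAMESPACES));
        // A made-up URI, used only to show the "not recognized" branch.
        System.out.println(describe(reader, "http://example.org/no-such-feature"));
    }
}
```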
Core Features • Anyone is free to define new SAX2 features. • Note that features may be read-only or read/write, and that they may be modifiable only when parsing, or only when not parsing. • http://xml.org/sax/features/namespaces • true => Perform Namespace processing. • false: Optionally do not perform Namespace processing (implies namespace-prefixes). • access: (parsing) read-only; (not parsing) read/write • http://xml.org/sax/features/namespace-prefixes • true: Report the original prefixed names and attributes used for Namespace declarations. • false: Do not report attributes used for Namespace declarations, and optionally do not report original prefixed names. • access: (parsing) read-only; (not parsing) read/write
Core Features supplied by SAX2 • http://xml.org/sax/features/string-interning • true => All element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String.intern. • access: (parsing) read-only; (not parsing) read/write • http://xml.org/sax/features/validation • true => Report all validation errors (implies external-general-entities and external-parameter-entities). • access: (parsing) read-only; (not parsing) read/write • http://xml.org/sax/features/external-general-entities • true => Include all external general (text) entities. • access: (parsing) read-only; (not parsing) read/write • http://xml.org/sax/features/external-parameter-entities • true: Include all external parameter entities, including the external DTD subset. • false: Do not include any external parameter entities, even the external DTD subset. • access: (parsing) read-only; (not parsing) read/write
Core Properties • http://xml.org/sax/properties/lexical-handler • data type: org.xml.sax.ext.LexicalHandler • description: An optional extension handler for lexical events like comments. access: read/write • http://xml.org/sax/properties/declaration-handler • data type: org.xml.sax.ext.DeclHandler • description: An optional extension handler for DTD-related events other than notations and unparsed entities. access: read/write • http://xml.org/sax/properties/dom-node • data type: org.w3c.dom.Node • description: When parsing, the current DOM node being visited if this is a DOM iterator; when not parsing, the root DOM node for iteration. • access: (parsing) read-only; (not parsing) read/write • http://xml.org/sax/properties/xml-string • data type: java.lang.String • description: The literal string of characters that was the source for the current event. access: read-only
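Properties are set with setProperty in the same way that features are set with setFeature. As a sketch of the lexical-handler property, the class below registers itself to receive comment events. It rests on assumptions not found in the slides: the reader comes from the JDK's JAXP SAXParserFactory, and the handler extends org.xml.sax.ext.DefaultHandler2 (a later SAX2 extension class that implements LexicalHandler) to avoid writing out every LexicalHandler method. The class name CommentCollector is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper: collects the text of XML comments. */
class CommentCollector extends DefaultHandler2 {
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Parses the XML text and returns the comments it contains. */
    static List<String> collect(String xml) {
        CommentCollector handler = new CommentCollector();
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Comments are lexical events, so they need the extension handler.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.comments;
    }

    public static void main(String[] args) {
        System.out.println(collect("<doc><!-- hi --><p/></doc>"));
    }
}
```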
SAX2 Namespace Support
• Standardized Namespace support
  • essential for higher-level standards like XSL, XML Schemas, RDF, and XLink.
• Namespace processing affects only element and attribute names.
  • Without Namespace processing: name = qName (qualified name; may contain ":").
  • With Namespace processing: name = [URI] + localName (must not contain ":").
• SAX2 supports either of these views, or both simultaneously.
SAX2 Namespace Support (cont'd)
• Affects the ContentHandler and Attributes interfaces.
• In SAX2, the startElement and endElement callbacks in a content handler look like this:
  public void startElement (String uri, String localName, String qName, Attributes atts) throws SAXException;
  public void endElement (String uri, String localName, String qName) throws SAXException;
• By default, an XML reader reports a Namespace URI and a local name for every element, in both the start and end handlers.
• Example: <html:hr xmlns:html="http://www.w3.org/1999/xhtml"/>
  • uri = "http://www.w3.org/1999/xhtml"
  • localName = "hr"
  • qName = "html:hr", or possibly "" when the namespace-prefixes feature is false
startPrefixMapping, endPrefixMapping • SAX2 also reports the scope of Namespace declarations, so that applications can resolve prefixes in attribute values or character data if necessary. public void startPrefixMapping (String prefix, String uri) throws SAXException; public void endPrefixMapping (String prefix) throws SAXException; Ex: Before the start-element event, the XML reader would call : startPrefixMapping("html","http://www.w3.org/1999/xhtml") After the end-element event ,the XML reader would call : endPrefixMapping("html")
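The ordering of the prefix-mapping events relative to the element events can be made visible with a handler that logs all four callbacks. This is a sketch under stated assumptions: the parser is the JDK's JAXP parser with namespace awareness switched on, and the class name PrefixScopes is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: logs prefix-mapping and element events in order. */
class PrefixScopes extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add("startPrefixMapping " + prefix + " -> " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add("endPrefixMapping " + prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement {" + uri + "}" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement {" + uri + "}" + localName);
    }

    /** Parses the XML text with Namespace processing and returns the log. */
    static List<String> trace(String xml) {
        PrefixScopes handler = new PrefixScopes();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // turns the namespaces feature on
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        trace("<html:hr xmlns:html=\"http://www.w3.org/1999/xhtml\"/>")
                .forEach(System.out::println);
    }
}
```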
Configuring Namespace Support
• "http://xml.org/sax/features/namespaces" feature
  • true [default] => Namespace URIs and local names must be available, and start/endPrefixMapping events must be reported.
• "http://xml.org/sax/features/namespace-prefixes" feature
  • controls the reporting of prefixed names and Namespace declarations (xmlns* attributes):
  • true => the original prefixed names (qName) and the xmlns* attributes are reported.
  • false [default] => prefixed names (qName) may optionally be reported, but xmlns* attributes must not be reported.
• Note: at least one of the two features must be true.
Configuration Example
• Consider the following simple sample document:
  <h:hello xmlns:h="http://www.greeting.com/ns/" id="a1" h:person="David"/>
• namespaces true, namespace-prefixes false (the default) => reported as URI + local name:
  • h:hello => "http://www.greeting.com/ns/" + "hello"
  • xmlns:h => not reported (does not appear in the attributes)
  • id => "" (empty string) + "id"
  • h:person => "http://www.greeting.com/ns/" + "person"
• namespaces and namespace-prefixes both true => reported as URI + local name + qName:
  • h:hello => "http://www.greeting.com/ns/" + "hello" + "h:hello"
  • xmlns:h => "" + "" + "xmlns:h"
  • id => "" (empty string) + "id" + "id"
  • h:person => "http://www.greeting.com/ns/" + "person" + "h:person"
• namespaces false, namespace-prefixes true => only the qName is reported:
  • "" + "" + "h:hello"; "" + "" + "xmlns:h";
  • "" + "" + "id"; and "" + "" + "h:person"
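The three configurations can be compared by parsing the sample document under each feature combination and listing the attributes that are reported. The sketch below assumes an XMLReader from the JDK's JAXP SAXParserFactory; the class name NamespaceModes is hypothetical. Exact values for the optional fields (for example, whether a qName is still supplied when namespace-prefixes is false) may differ between parsers, so the example focuses on which attributes appear at all.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: lists attributes under a given feature setting. */
class NamespaceModes extends DefaultHandler {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NS_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";
    static final String SAMPLE =
            "<h:hello xmlns:h=\"http://www.greeting.com/ns/\" id=\"a1\" h:person=\"David\"/>";

    final List<String> rows = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            rows.add("uri=" + atts.getURI(i)
                    + " local=" + atts.getLocalName(i)
                    + " qName=" + atts.getQName(i));
        }
    }

    /** Parses xml with the two features set as given; returns one row per attribute. */
    static List<String> attributes(String xml, boolean namespaces, boolean prefixes) {
        NamespaceModes handler = new NamespaceModes();
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, namespaces);
            reader.setFeature(NS_PREFIXES, prefixes);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        System.out.println(attributes(SAMPLE, true, false)); // xmlns:h is hidden
        System.out.println(attributes(SAMPLE, true, true));  // xmlns:h is reported
        System.out.println(attributes(SAMPLE, false, true)); // qNames only
    }
}
```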
Interfaces: AttributeList Attributes ContentHandler DeclHandler DocumentHandler DTDHandler EntityResolver ErrorHandler Locator Parser XMLReader XMLFilter Classes: HandlerBase InputSource Exceptions: SAXException SAXParseException SAXNotRecognizedException SAXNotSupportedException Package: org.xml.sax for SAX2
Methods index: getLength() Return the number of attributes in this list. getName(int index) Return the name of an attribute in this list (by position). getType(int index) Return the type of an attribute in the list (by position). getType(String name) Return the type of an attribute in the list (by name). getValue(int index) Return the value of an attribute in the list (by position). getValue(String name) Return the value of an attribute in the list (by name). Interface org.xml.sax.AttributeList
Interface org.xml.sax.Attributes
• int getLength()
• int getIndex(String qName) Look up the index of an attribute by XML 1.0 qualified name.
• int getIndex(String uri, String localName) Look up the index of an attribute by Namespace name.
• String getLocalName(int index)
• String getQName(int index)
• String getURI(int index)
• String getType(int index)
• String getType(String qName)
• String getType(String uri, String localName)
• String getValue(int index)
• String getValue(String qName)
• String getValue(String uri, String localName)
• Note: a lookup that the current namespace settings do not support finds nothing: getType and getValue return null, and getIndex returns -1. E.g. if the namespaces feature is false, getValue(uri, localName) returns null.
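The different lookup styles of Attributes can be tried side by side. The sketch below enables both namespace features so that lookups by Namespace name and by qualified name are both available. It assumes an XMLReader from the JDK's JAXP SAXParserFactory, and the class name AttributeLookup and its field names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: looks up attributes of the root element by name. */
class AttributeLookup extends DefaultHandler {
    static final String GREETING_NS = "http://www.greeting.com/ns/";

    String byNamespaceName; // value of {GREETING_NS}person
    String byQName;         // value of id
    int missingIndex;       // index of an attribute that does not exist
    String missingValue;    // value of an attribute that does not exist
    private boolean seenRoot;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (seenRoot) {
            return; // only the root element is inspected
        }
        seenRoot = true;
        byNamespaceName = atts.getValue(GREETING_NS, "person");
        byQName = atts.getValue("id");
        missingIndex = atts.getIndex("no-such-attribute");
        missingValue = atts.getValue(GREETING_NS, "no-such-attribute");
    }

    /** Parses the XML text and returns the handler holding the lookup results. */
    static AttributeLookup lookup(String xml) {
        AttributeLookup handler = new AttributeLookup();
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/namespaces", true);
            reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        AttributeLookup result = lookup(
                "<h:hello xmlns:h=\"" + GREETING_NS + "\" id=\"a1\" h:person=\"David\"/>");
        System.out.println(result.byNamespaceName + " " + result.byQName
                + " " + result.missingIndex + " " + result.missingValue);
    }
}
```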