SAX and more…



  1. SAX and more… CS 236607, Spring 2008/9

  2. SAX Parser • SAX = Simple API for XML • XML is read sequentially • When a parsing event happens, the parser invokes the corresponding method of the corresponding handler • The handlers are the programmer's implementations of standard Java API interfaces and classes

  3. <?xml version="1.0"?>
     <!DOCTYPE countries SYSTEM "world.dtd">
     <countries>
       <country continent="&as;">
         <!--israel-->
         <name>Israel</name>
         <population year="2001">6,199,008</population>
         <city capital="yes"><name>Jerusalem</name></city>
         <city><name>Ashdod</name></city>
       </country>
       <country continent="&eu;">
         <name>France</name>
         <population year="2004">60,424,213</population>
       </country>
     </countries>

  4. (Same document as slide 3.) Event: Start Document

  5. (Same document as slide 3.) Event: Start Element

  6. (Same document as slide 3.) Event: Start Element

  7. (Same document as slide 3.) Event: Comment

  8. (Same document as slide 3.) Event: Start Element

  9. (Same document as slide 3.) Event: Characters

  10. (Same document as slide 3.) Event: End Element

  11. (Same document as slide 3.) Event: End Element

  12. (Same document as slide 3.) Event: End Document

  13. SAX Parsers • Input: an XML document (<?xml version="1.0"?> . . .) • The SAX parser is given instructions of the form: • When you see the start of the document do … • When you see the start of an element do … • When you see the end of an element do …

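  14. [Diagram of the SAX components; component names inferred from the later slides] • XMLReaderFactory: used to create a SAX parser • ContentHandler: handles document events (start tag, end tag, etc.) • ErrorHandler: handles parser errors • DTDHandler: handles the DTD • EntityResolver: handles entities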

  15. Creating a Parser • The SAX interface is an accepted standard • There are many implementations from many vendors • The standard API does not include an actual implementation, but Sun provides one with the JDK • We would like to be able to change the implementation used without changing any code in the program • How is this done?

  16. System Properties • Properties are configuration values managed as key/value pairs. In each pair, the key and value are both String values. • The System class maintains a Properties object that describes the configuration of the current working environment. • System properties include information about the current user, the current version of the Java runtime, and the character used to separate components of a file path name. • See: Java standard System Properties.
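
To make the key/value idea concrete, here is a minimal sketch of reading and setting system properties. The class name and the key `demo.greeting` are made up for illustration; `user.name`, `java.version` and `file.separator` are standard keys.

```java
import java.util.Properties;

class SystemPropertiesDemo {
    /** Returns the value of a system property, or the fallback if the key is unset. */
    static String lookup(String key, String fallback) {
        return System.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // Standard keys defined by the Java platform.
        System.out.println("user.name      = " + lookup("user.name", "?"));
        System.out.println("java.version   = " + lookup("java.version", "?"));
        System.out.println("file.separator = " + lookup("file.separator", "?"));

        // Properties can be set at run time, or with -Dkey=value on the command line.
        System.setProperty("demo.greeting", "hello"); // hypothetical key
        Properties all = System.getProperties();
        System.out.println("demo.greeting  = " + all.getProperty("demo.greeting"));
    }
}
```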

  17. Factory Design Pattern • Have a “factory” class that creates the actual parsers • org.xml.sax.helpers.XMLReaderFactory • The factory checks the configuration, mainly the value of a system property that specifies the implementation • In order to change the implementation, simply change the system property Read more about XMLReaderFactory Class

  18. Creating a SAX Parser
      import org.xml.sax.*;
      import org.xml.sax.helpers.*;

      public class EchoWithSax {
          public static void main(String[] args) throws Exception {
              // Tell the factory which implementation to instantiate.
              System.setProperty("org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser");
              XMLReader reader = XMLReaderFactory.createXMLReader();
              reader.parse("world.xml");
          }
      }
      Callouts: "The factory will look for this…" (the org.xml.sax.driver property); "Implements XMLReader (not a must)" (the Xerces driver class).
      Read more about XMLReader Interface, Xerces SAXParser class
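
The same factory idea also appears in JAXP's `SAXParserFactory`, which the slides do not use. The sketch below asks that factory for a parser without naming any implementation class and reports which one it got. The class name `JaxpFactoryDemo` is illustrative.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class JaxpFactoryDemo {
    /** Asks the JAXP factory for whatever SAX implementation is configured. */
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers are not namespace-aware by default
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no usable SAX implementation found", e);
        }
    }

    public static void main(String[] args) {
        // The program names no concrete parser class; the factory picks one.
        System.out.println("Implementation in use: " + newReader().getClass().getName());
    }
}
```

The later sketches in this transcript obtain their parsers the same way, so they run on a plain JDK without Xerces on the class path.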

  19. Implementing the Content Handler • A SAX parser invokes methods such as startDocument, startElement and endElement of its content handler as it runs • In order to react to parsing events we must: • implement the ContentHandler interface • set the parser’s content handler with an instance of our ContentHandler implementation

  20. ContentHandler Methods • startDocument - parsing begins • endDocument - parsing ends • startElement - an opening tag is encountered • endElement - a closing tag is encountered • characters - character data (text) is encountered • ignorableWhitespace - white space that should be ignored (according to the DTD) • and more ... Read more about ContentHandler Interface

  21. The Default Handler • The class DefaultHandler implements all handler interfaces (mostly with empty method bodies) • i.e., ContentHandler, EntityResolver, DTDHandler, ErrorHandler • An easy way to implement the ContentHandler interface is to extend DefaultHandler Read more about DefaultHandler Class

  22. A Content Handler Example
      import org.xml.sax.helpers.DefaultHandler;
      import org.xml.sax.*;

      public class EchoHandler extends DefaultHandler {
          int depth = 0;

          // Prints a line indented according to the current nesting depth.
          public void print(String line) {
              for (int i = 0; i < depth; ++i) System.out.print(" ");
              System.out.println(line);
          }

  23. A Content Handler Example
          public void startDocument() throws SAXException {
              print("BEGIN");
          }

          public void endDocument() throws SAXException {
              print("END");
          }

          // Prints the element name, then its attributes one level deeper.
          public void startElement(String ns, String localName, String qName,
                                   Attributes attrs) throws SAXException {
              print("Element " + qName + "{");
              ++depth;
              for (int i = 0; i < attrs.getLength(); ++i)
                  print(attrs.getLocalName(i) + "=" + attrs.getValue(i));
          }
      (Attributes: we will discuss this interface later…)

  24. A Content Handler Example
          public void endElement(String ns, String lName, String qName) throws SAXException {
              --depth;
              print("}");
          }

          // Prints the trimmed text one level deeper than the enclosing element.
          public void characters(char buf[], int offset, int len) throws SAXException {
              String s = new String(buf, offset, len).trim();
              ++depth;
              print(s);
              --depth;
          }
      }

  25. Fixing The Parser
      public class EchoWithSax {
          public static void main(String[] args) throws Exception {
              XMLReader reader = XMLReaderFactory.createXMLReader();
              reader.setContentHandler(new EchoHandler()); // What would happen without this line?
              reader.parse("world.xml");
          }
      }
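
One way to explore the question on this slide is to parse the same document with and without registering a handler. The sketch below is not the slides' code: `CountingHandler` is a hypothetical stand-in for EchoHandler, the parser comes from JAXP, and the document is an in-memory string so that no world.xml file is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerRegistrationDemo {
    /** Counts start-tag events; a stand-in for the slides' EchoHandler. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String ns, String localName, String qName, Attributes attrs) {
            ++elements;
        }
    }

    /** Parses xml, registering the handler only if register is true; returns its count. */
    static int parse(String xml, boolean register) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CountingHandler handler = new CountingHandler();
            if (register) {
                reader.setContentHandler(handler);
            }
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.elements;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<countries><country><name>Israel</name></country></countries>";
        System.out.println("with handler:    " + parse(xml, true));
        System.out.println("without handler: " + parse(xml, false));
    }
}
```

Without a registered handler the parse still succeeds, but no callback of ours is ever invoked, so the count stays at zero.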

  26. Empty Elements • What do you think happens when the parser parses an empty element? <rating stars="five"/>
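
A small experiment answers the question: the sketch below logs the content events reported for an empty-element tag and for an explicit start/end pair. The class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyElementDemo {
    /** Returns the sequence of content events reported for the given document. */
    static String events(String xml) {
        final StringBuilder log = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String ns, String localName, String qName, Attributes attrs) {
                log.append("start:").append(qName).append(' ');
            }

            @Override
            public void endElement(String ns, String localName, String qName) {
                log.append("end:").append(qName).append(' ');
            }

            @Override
            public void characters(char[] buf, int offset, int len) {
                log.append("chars ");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(events("<rating stars=\"five\"/>"));
        System.out.println(events("<rating stars=\"five\"></rating>"));
    }
}
```

Both forms produce a startElement immediately followed by an endElement, with no characters event in between, so a handler cannot tell them apart.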

  27. Attributes Interface • The Attributes interface provides access to all attributes of an element • getLength(), getQName(i), getValue(i), getType(i), getValue(qname), etc. • The following are possible types for attributes: CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION Read more about Attributes Interface
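
The accessor methods listed above can be tried on a single element. This sketch collects each attribute of the root element as "value [type]"; the class name and the sample `source` attribute are made up. Without a DTD declaring the attributes, the reported type is CDATA.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributesDemo {
    /** Maps each attribute name of the root element to "value [type]". */
    static Map<String, String> rootAttributes(String xml) {
        final Map<String, String> result = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean seenRoot = false;

            @Override
            public void startElement(String ns, String localName, String qName, Attributes attrs) {
                if (seenRoot) return; // only the root element is of interest here
                seenRoot = true;
                for (int i = 0; i < attrs.getLength(); ++i) {
                    result.put(attrs.getQName(i),
                               attrs.getValue(i) + " [" + attrs.getType(i) + "]");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<population year=\"2001\" source=\"census\">6,199,008</population>";
        System.out.println(rootAttributes(xml));
    }
}
```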

  28. ErrorHandler Interface • We implement ErrorHandler to receive error events (similar to implementing ContentHandler) • DefaultHandler implements ErrorHandler (warning and error do nothing; fatalError throws the exception), so we can extend it (as before) • An ErrorHandler is registered with • reader.setErrorHandler(handler); • Three methods: • void error(SAXParseException ex); • void fatalError(SAXParseException ex); • void warning(SAXParseException ex);

  29. Parsing Errors • Fatal errors prevent the parser from continuing • For example, the document is not well formed, an unknown XML version is declared, etc. • Errors (that is, recoverable ones) occur, for example, when the parser is validating and validity constraints are violated • Warnings occur when abnormal (yet legal) conditions are encountered • For example, an entity is declared twice in the DTD
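
As a rough illustration of the three callbacks, the sketch below registers a handler that records what it is told and then feeds the parser a document that is not well formed. `CollectingErrorHandler` and the message format are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorHandlerDemo {
    /** Records every reported problem instead of printing or throwing. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void warning(SAXParseException ex) {
            messages.add("warning at line " + ex.getLineNumber());
        }

        @Override
        public void error(SAXParseException ex) {
            messages.add("error at line " + ex.getLineNumber());
        }

        @Override
        public void fatalError(SAXParseException ex) {
            messages.add("fatalError at line " + ex.getLineNumber());
        }
    }

    /** Parses xml and returns the problems reported to the error handler. */
    static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser stops after a fatal error; the handler has already recorded it.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b></a>")); // mismatched tags: not well formed
        System.out.println(check("<a><b/></a>")); // well formed: nothing to report
    }
}
```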

  30. EntityResolver and DTDHandler • The interface EntityResolver enables the programmer to specify a new source for resolving external entities (e.g., an external DTD) • The interface DTDHandler enables the programmer to react to declarations of notations and unparsed entities inside the DTD (Notation Example); these usually appear with external non-XML resources and describe their type Read more about EntityResolver Interface, DTDHandler Interface
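
A common use of EntityResolver is to intercept the request for an external DTD and supply a replacement. The sketch below records the requested system identifier and answers with an empty source, so nothing is fetched. The URL http://example.org/world.dtd and the class name are hypothetical; the slides' document refers to a local world.dtd instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ResolverDemo {
    /** Parses xml and returns the system IDs of the external entities that were requested. */
    static List<String> requestedEntities(String xml) {
        final List<String> requested = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(new EntityResolver() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    requested.add(systemId);
                    // Supply an empty replacement instead of fetching the real resource.
                    return new InputSource(new StringReader(""));
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return requested;
    }

    public static void main(String[] args) {
        // Hypothetical absolute URL; the resolver answers, so no network access happens.
        String xml = "<!DOCTYPE countries SYSTEM \"http://example.org/world.dtd\"><countries/>";
        System.out.println(requestedEntities(xml));
    }
}
```

Returning null from resolveEntity instead would tell the parser to open the system identifier itself.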

  31. Features and Properties • SAX parsers can be configured by setting their features and properties • Syntax: • reader.setFeature("feature-url", boolean) • reader.setProperty("property-url", Object) • Standard feature URLs have the form: http://xml.org/sax/features/feature-name • Standard property URLs have the form http://xml.org/sax/properties/prop-name

  32. Feature/Property Examples • Features: • namespaces - are namespaces supported? • validation - does the parser validate (against the declared DTD)? • http://apache.org/xml/features/nonvalidating/load-external-dtd • whether to load the external DTD when not validating (specific to the Xerces implementation) • Properties: • lexical-handler - see the next slide... Read more about Features Read more about Properties
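
Setting a feature can fail, because a parser may not recognize or support a given URL. The following sketch sets a feature, reads it back, and reports the outcome as a string. The helper name `setAndReport` and the URL http://example.org/no-such-feature are invented for the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Sets a feature on a fresh reader and reports "on", "off", "unrecognized" or "unsupported". */
    static String setAndReport(String featureUrl, boolean value) {
        XMLReader reader;
        try {
            reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        try {
            reader.setFeature(featureUrl, value);
            return reader.getFeature(featureUrl) ? "on" : "off";
        } catch (SAXNotRecognizedException e) {
            return "unrecognized"; // the parser has never heard of this URL
        } catch (SAXNotSupportedException e) {
            return "unsupported";  // known URL, but this value cannot be set
        }
    }

    public static void main(String[] args) {
        System.out.println("namespaces: " + setAndReport(NAMESPACES, true));
        System.out.println("validation: " + setAndReport(VALIDATION, false));
        // Hypothetical URL that no parser is expected to know.
        System.out.println("made-up:    " + setAndReport("http://example.org/no-such-feature", true));
    }
}
```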

  33. Lexical Events • Lexical events have to do with the way that a document was written and not with its content • Examples: • A comment is a lexical event (<!-- comment -->) • The use of an entity is a lexical event (&gt;) • These can be dealt with by implementing the LexicalHandler interface, and setting the lexical-handler property to an instance of the handler

  34. LexicalHandler Methods • comment(char[] ch, int start, int length) • startCDATA() • endCDATA() • startEntity(java.lang.String name) • endEntity(java.lang.String name) • and more... Read more about LexicalHandler Interface
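
Putting the last two slides together, a comment collector might look like the sketch below. It extends `org.xml.sax.ext.DefaultHandler2`, which provides empty implementations of the LexicalHandler methods, and registers itself through the lexical-handler property. The class name `LexicalDemo` is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class LexicalDemo {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    /** Returns the text of every comment in the document, in document order. */
    static List<String> comments(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Lexical events are delivered only if the handler is set as a property.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(comments("<country><!--israel--><name>Israel</name></country>"));
    }
}
```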

  35. Different Approaches

  36. The DOM Approach • Tree-based API: maps an XML document into an internal tree structure, then allows an application to navigate that tree. • The application is active. • Provides random access. • Remember that it is possible to construct a parse tree using an event-based API, and it is possible to use an event-based API to traverse an in-memory tree. (Actually DOM is a level above SAX)

  37. <?xml version="1.0"?>
      <!DOCTYPE countries SYSTEM "world.dtd">
      <countries>
        <country continent="&as;">
          <name>Israel</name>
          <population year="2001">6,199,008</population>
          <city capital="yes"><name>Jerusalem</name></city>
          <city><name>Ashdod</name></city>
        </country>
        <country continent="&eu;">
          <name>France</name>
          <population year="2004">60,424,213</population>
        </country>
      </countries>

  38. The DOM Tree
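
To give a feel for random access on a DOM tree, here is a minimal sketch using the JDK's `DocumentBuilder`. It parses a simplified version of the countries document (no DTD and no entity references, which is an assumption made to keep the example self-contained) and jumps directly to the n-th country. The class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomDemo {
    /** Returns the name of the country at the given position (0-based). */
    static String nameOfCountry(String xml, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The whole tree is in memory, so any country can be reached directly.
            NodeList countries = doc.getElementsByTagName("country");
            Element country = (Element) countries.item(index);
            // The first <name> in document order below a country is the country's own name.
            return country.getElementsByTagName("name").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<countries>"
                + "<country><name>Israel</name><city><name>Jerusalem</name></city></country>"
                + "<country><name>France</name></country>"
                + "</countries>";
        System.out.println(nameOfCountry(xml, 1));
        System.out.println(nameOfCountry(xml, 0)); // going back is no problem
    }
}
```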

  39. The SAX Approach • Event-based API. • The application is passive. • Provides serial access. Given the following XML document: [document not preserved in this transcript] This XML document, when passed through a SAX parser, will generate the following sequence of events (pushing)…

  40. XML Processing Instruction, named xml, with attributes version equal to "1.0" and encoding equal to "UTF-8" • XML Element start, named RootElement, with an attribute param equal to "value" • XML Element start, named FirstElement • XML Text node, with data equal to "Some Text" (note: text processing, with regard to spaces, can be changed) • XML Element end, named FirstElement • XML Element start, named SecondElement, with an attribute param2 equal to "something" • XML Text node, with data equal to "Pre-Text" • XML Element start, named Inline • XML Text node, with data equal to "Inlined text" • XML Element end, named Inline • XML Text node, with data equal to "Post-text." • XML Element end, named SecondElement • XML Element end, named RootElement
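
The event list above can be reproduced with a small logging handler. The document in this sketch is a reconstruction from the listed events, not the original slide content, and the class name is illustrative. Text is buffered because a parser may deliver one run of text in several characters calls. Note that Java's SAX does not report the XML declaration as a ContentHandler event, so the log starts with the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLogDemo {
    // Reconstructed from the event list; an assumption, not the original document.
    static final String DOCUMENT =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<RootElement param=\"value\">"
            + "<FirstElement>Some Text</FirstElement>"
            + "<SecondElement param2=\"something\">"
            + "Pre-Text <Inline>Inlined text</Inline> Post-text."
            + "</SecondElement>"
            + "</RootElement>";

    /** Returns the element and text events pushed by the parser, in order. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            // Reports buffered text (trimmed) once the next tag is reached.
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) log.add("text " + s);
                text.setLength(0);
            }

            @Override
            public void startElement(String ns, String localName, String qName, Attributes attrs) {
                flushText();
                log.add("start " + qName);
            }

            @Override
            public void endElement(String ns, String localName, String qName) {
                flushText();
                log.add("end " + qName);
            }

            @Override
            public void characters(char[] buf, int offset, int len) {
                text.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(DOCUMENT)) System.out.println(event);
    }
}
```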

  41. Pull vs. Push • SAX is known as a push framework • the parser has the initiative • the programmer must react to events • An alternative is a pull framework • the programmer has the initiative • the parser must react to requests

  42. StAX - Streaming API for XML • An API to read and write XML documents in the Java programming language (unlike DOM and SAX, which were designed for parsing). • The StAX approach: the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward, 'pulling' the information from the parser as it needs it. • Thus, the application is active. • Part of the standard Java API since Java SE 6 (package javax.xml.stream).

  43. For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters: [code not preserved in this transcript] Here's the beginning of the output when I ran this across a simple well-formed HTML file: htmlheadtitlemetametalinkmeta...
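
The slide's code did not survive, so the following is only a sketch of what such a loop might look like with the cursor API in `javax.xml.stream`. The class name and the sample input are assumptions.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxElementNamesDemo {
    /** Concatenates the names of all elements, in the order their start tags appear. */
    static String elementNames(String xml) {
        StringBuilder names = new StringBuilder();
        try {
            XMLStreamReader cursor =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // The application drives the loop: it pulls the next event when it is ready.
            while (cursor.hasNext()) {
                if (cursor.next() == XMLStreamConstants.START_ELEMENT) {
                    names.append(cursor.getLocalName());
                }
            }
            cursor.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names.toString();
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>t</title><meta/><meta/><link/><meta/></head></html>";
        System.out.println(elementNames(xml));
    }
}
```

Compared with the SAX handlers earlier in the deck, there are no callbacks here: the loop itself decides when to advance and which events to ignore.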

  44. JDOM • An implementation of generic XML trees in Java. • JDOM provides a way to represent XML documents for easy and efficient reading, manipulation, and writing. • One can construct a tree of elements, then generate an XML file from it. [example not preserved in this transcript]

  45. SAX vs. DOM A more detailed comparison

  46. Parser Efficiency • The DOM object built by DOM parsers is usually complicated and requires more memory storage than the XML file itself • A lot of time is spent on construction before use • For some very large documents, this may be impractical • SAX parsers store only local information that is encountered during the serial traversal • Hence, programming with SAX parsers is, in general, more efficient

  47. Programming using SAX is Difficult • In some cases, programming with SAX is difficult: • How can we find, using a SAX parser, elements e1 with ancestor e2? • How can we find, using a SAX parser, elements e1 that have a descendant element e2? • How can we find the element e1 referenced by the IDREF attribute of e2?
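
The first question can be answered with SAX, but only by keeping extra state. The sketch below maintains a stack of the currently open elements and counts the e1 elements that have an e2 among their ancestors. The class and method names are made up, and the other two questions need more bookkeeping than this.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AncestorSearchDemo {
    /** Counts the elements named e1 that have an ancestor named e2. */
    static int count(String xml, final String e1, final String e2) {
        final int[] matches = {0};
        DefaultHandler handler = new DefaultHandler() {
            // Names of the elements that are open at the current point of the traversal.
            private final Deque<String> open = new ArrayDeque<>();

            @Override
            public void startElement(String ns, String localName, String qName, Attributes attrs) {
                if (qName.equals(e1) && open.contains(e2)) {
                    matches[0]++;
                }
                open.push(qName);
            }

            @Override
            public void endElement(String ns, String localName, String qName) {
                open.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return matches[0];
    }

    public static void main(String[] args) {
        String xml = "<countries><country><name>Israel</name>"
                + "<city><name>Jerusalem</name></city></country></countries>";
        System.out.println(count(xml, "name", "city"));
        System.out.println(count(xml, "name", "country"));
    }
}
```

The descendant question is harder because the answer for an e1 is known only when its end tag arrives, which is one reason a tree API can be more convenient for such queries.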

  48. Node Navigation • SAX parsers do not provide access to elements other than the one currently visited in the serial (DFS) traversal of the document • In particular, • They do not read backwards • They do not enable access to elements by ID or name • DOM parsers enable any traversal method • Hence, using DOM parsers is usually more comfortable

  49. More DOM Advantages • You can save time and effort if you send and receive DOM objects instead of XML files • But DOM objects are generally larger than the source • DOM parsers provide a natural integration of XML reading and manipulation • e.g., “cut and paste” of XML fragments

  50. Which should we use? DOM vs. SAX • If your document is very large and you only need a few elements – use SAX • If you need to manipulate (i.e., change) the XML – use DOM • If you need to access the XML many times – use DOM (assuming the file is not too large)
