1 / 45

Perl/XML::DOM - reading and writing XML from Perl

Perl/XML::DOM - reading and writing XML from Perl. Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/. Aims and objectives. Refresh the structure of a XML (or XHTML) document Know problems in reading and writing XML

xerxes
Télécharger la présentation

Perl/XML::DOM - reading and writing XML from Perl

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/

  2. Aims and objectives • Refresh the structure of a XML (or XHTML) document • Know problems in reading and writing XML • Understand the requirements of XML parsers and the two main types • Know how to write code using the DOM parser • PRACTICAL: write a script to read XML

  3. Tags: paired opening and closing tags May contain data and/or other (nested) tags Attributes: optional – contained within the opening tag un-paired tags use special syntax An XML refresher! <mutants> <mutant_group native='1abc01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure> <mutant domid='2bcd01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure> <mutation res='L24' native='ALA' subs='ARG' /> </mutant> </mutant_group> </mutants>

  4. Writing XML Writing XML is straightforward • Generate XML from a Perl script using print() statements. • However: • tags correctly nested • quote marks correctly paired • international character sets

  5. Reading XML As simple or complex as you wish! • Full control over XML: • simple pattern may suffice • Otherwise, may be dangerous • may rely on un-guaranteed formatting <mutants><mutant_group native='1abc01'><structure><method> x-ray</method><resolution>1.8</resolution><rfactor>0.20 </rfactor></structure><mutant domid='2bcd01'><structure> <method>x-ray</method><resolution>1.8</resolution> <rfactor>0.20</rfactor></structure><mutation res='L24’ native='ALA’ subs='ARG'/></mutant></mutant_group></mutants>

  6. XML Parsers • Clear rules for data boundaries and hierarchy • Predictable; unambiguous • Parser translates XML into • stream of events • complex data object

  7. XML Parsers • Different data sources of data • files • character strings • remote references • different character encodings • standard Latin • Japanese • checking for well-formedness errors Good parser will handle:

  8. XML Parsers • Read stream of characters • differentiate markup and data • Optionally replace entity references • (e.g. &lt; with <) • Assemble complete document • disparate (perhaps remote) sources • Report syntax and validation errors • Pass data to client program

  9. XML Parsers • If XML has no syntax errors it is 'well formed' • With a DTD, a validating parser will check it matches:'valid'

  10. XML Parsers • Writing a good parser is a lot of work! • A lot of testing needed • Fortunately, many parsers available

  11. Getting data to your program • Parser can generate 'events' • Tags are converted into events • Events triggered in your program as the document is read • Parser acts as a pipeline converting XML into processed chunks of data sent to your program: • an 'event stream'

  12. Getting data to your program OR… • XML converted into a tree structure • Reflects organization of the XML • Whole document read into memory before your program gets access

  13. Pros and cons • In the parser, everything is likely to be event-driven • tree-based parsers create a data structure from the event stream • Data structure • More convenient • Can access data in any order • Code usually simpler • May be impossible to handle very large files • Need more processor time • Need more memory • Event stream • Faster to access limited data • Use less memory • Parser loses data at the next event • More complex code

  14. SAX and DOM de facto standard APIs for XML parsing • SAX (Simple API for XML) • event-stream API • originally for Java, but now for several programming languages (including Perl) • development promoted by Peter Murray Rust, 1997-8 • DOM (Document Object Model) • W3C standard tree-based parser • platform- and language-neutral • allows update of document content and structure as well as reading

  15. Perl XML parsers Many parsers available • Differ in three major ways: • parsing style (event driven or data structure) • 'standards-completeness’ • speed (implementation in C or pure Perl)

  16. Perl XML parsers XML::Simple • Very easy to use • Designed for simple applications • Can’t handle 'mixed content' • tags containing both data and other tags <p>This is <b>mixed</b> content</p>

  17. Perl XML parsers XML::Parser • Oldest Perl XML parser • Reasonably fast and flexibile • Not very standards-compliant. • Is a wrapper to 'expat’ • probably the first C XML parser written by James Clark

  18. Remember 2 lines before an error $@ stores the error Eval used so parser doesn’t cause program to exit XML::Parser use XML::Parser; my $xmlfile = shift @ARGV; # the file to parse # initialize parser object and parse the string my $parser = XML::Parser->new( ErrorContext => 2 ); eval { $parser->parsefile( $xmlfile ); }; # report error or success if( $@ ) { $@ =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in '$xmlfile':\n$@\n"; } else { print STDERR "'$xmlfile' is well-formed\n"; } Simple example - check well-formedness

  19. Perl XML parsers XML::DOM • Implements W3C DOM Level 1 • Built on top of XML::Parser • Very good fast, stable and complete • Limited extended functionality XML::SAX • Implements SAX2 wrapper to Expat • Fast, stable and complete

  20. Perl XML parsers XML::LibXML • Wrapper around GNOME libxml2 • Very fast, complete and stable • Validating/non-validating • DOM and SAX support

  21. Perl XML parsers XML::Twig • DOM-like parser, BUT • Allows you to define elements which can be parsed as discrete units • 'twigs' (small branches of a tree)

  22. Perl XML parsers Several others • More specialized, adding… • XPath (to select data from XML document) • re-formatting (XSLT or other methods) • ...

  23. XML::DOM • DOM is a standard API • once learned moving to a different language is straightforward • moving between implementations also easy • Suppose we want to extract some data from an XML file...

  24. XML::DOM <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species> </data> We want: cat (Felix domesticus) not endangered fruit fly (Drosophila melanogaster) not endangered

  25. Import XML::DOM module obtain filename initialize a parser parse the file Can now treat $species in the same way as $doc Returns each element matching ‘species’ Nested loops correspond to nested tags Look at each species element in turn $species contains content and descendant elements #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose();

  26. $common_name contains content and any descendant elements Returns each element matching ‘common-name’ Here there is only one <common-name> element per <species> element, but parser can't know that. Have to specify that we want the first (and only) <common-name> element Extract the text from the element object using ->getNodeValue The <common-name> element contains only one child which is data. Obtain first (and only) child element using ->getFirstChild #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose();

  27. Here we obtain an object and use the ->item() method to obtain the first item: $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Alternative is to access an individual array element: @common_names = $species->getElementsByTagName('common-name'); $cname = $common_names[0]->getFirstChild->getNodeValue; ->getElementsByTagName returns an array Here we work through the array: foreach $species ($doc->getElementsByTagName('species')) { }

  28. There could have been more than one <common-name> element within this <species> element Obtain the actual text rather than an element object <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue;

  29. Attributes much simpler! Can only contain text (no nested elements) Simply specify the attribute <species name='Felix domesticus'> Attributes #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose();

  30. DOM parser can't know there is only one <conservation> element per species. Extract the first (and only) one. Extract the <conservation> elements from this <species> element Contains only a ‘status’ attribute, extract its value with ->getAttribute() <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> This is an empty element, there are no child elements so we don’t need ->getFirstChild $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status');

  31. Print the extracted information Loop back to next <species> element Clean up and free memory #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose();

  32. XML::DOM Note • Not necessary to use variable names that match the tags, but it is a very good idea! • There are many many more functions, but this set covers most needs

  33. Writing XML with XML::DOM

  34. Import XML::DOM Initialize data Create an XML document object Utility method to print XML header <?xml version=“1.0” ?> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString;

  35. Create a ‘root’ element for our data Loop through each of the species Create a <species> element Set its ‘name’ attribute Join the <species> element to the parent <data> element #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString;

  36. <?xml version=“1.0” ?> <data> <species name='Felix domesticus’ /> </data>

  37. Create a <common-name> element Create a text node containing the specified text Join this text node as a child of the <name> element Join the <name> element as a child of <species> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString;

  38. <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> </species> </data>

  39. Create a <conservation> element Set the ‘status’ attribute Join the <conservation> element as a child of <species> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString;

  40. <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> </data>

  41. Loop back to handle the next species Finally print the resulting data structure #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString;

  42. <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species> </data>

  43. Summary - reading XML Create a parser $parser = XML::DOM::Parser->new(); Parse a file $doc = $parser->parsefile('filename'); Extract all elements matching tag-name $element_set = $doc->getElementsByTagName('tag-name') Extract first element of a set $element = $element_set->item(0); Extract first child of an element $child_element = $element->getFirstChild; Extract text from an element $text = $element->getNodeValue; Get the value of a tag’s attribute $text = $element->getAttribute('attribute-name');

  44. Summary - writing XML Create an XML document structure $doc = XML::DOM::Document->new; Utility to create an XML header $header = $doc->createXMLDecl('1.0'); Create a tagged element $element = $doc->createElement('tag-name'); Set an attribute for an element $element->setAttribute('attrib-name', 'value'); Append a child element to an element $parent_element->appendChild($child_element); Create a text node element $element = $doc->createTextNode('text'); Print a document structure as a string print $root_element->toString;

  45. Summary • Two types of parser • Event-driven • Data structure • Writing a good parser is difficult! • Many parsers available • XML::DOM for reading and writing data

More Related