Parsing HTML Topic 3, Chapter 7

Network Programming, Kansas State University at Salina. Parsing HTML: Topic 3, Chapter 7. Picking information from an HTML page: a difficult problem; HTML defines page layout, not content (advantage XML); very useful because of the volume of data available.


Presentation Transcript


  1. Network Programming, Kansas State University at Salina. Parsing HTML: Topic 3, Chapter 7

  2. Picking information from an HTML page
     • A difficult problem
     • HTML defines page layout, not content – advantage XML
     • Very useful because of volume of data available
     • If the format of the page changes, your program is broken.

  3. HTML
     • Definition: Token – one piece of information in an HTML formatted page
       • HTML tag – usually only relates to formatting
       • URL or image reference
       • Textual information
     • Must look at several tokens to determine context of the data
     • Start-tag, End-tag structure leads parsing code to use finite state machines and stacks. ( <TABLE> … </TABLE> )
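The stack idea from this slide can be sketched with the Python 3 standard library. This is only an illustration, not the course's code: the class name `ContextTracker`, the helper `text_with_context`, the `VOID_TAGS` set, and the sample page (modeled on the slide 4 source, with closing tags added) are all assumptions made for the example. Each start-tag is pushed, each end-tag pops back to its match, and every piece of text is reported together with the tags that enclose it.

```python
from html.parser import HTMLParser

# Tags that never have an end-tag, so they must not be pushed on the stack.
# (Illustrative subset, not a complete list.)
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ContextTracker(HTMLParser):
    """Record each text token together with the stack of open tags around it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only tokens
            self.found.append(("/".join(self.stack), text))


def text_with_context(html):
    """Return a list of (tag path, text) pairs for the given HTML string."""
    tracker = ContextTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.found


# Hypothetical sample page, modeled on the source shown on slide 4.
page = """<HTML><HEAD><TITLE> Tim Bower </TITLE></HEAD>
<BODY BGCOLOR="lightyellow"><TABLE><TR><TD><H1>Tim Bower</H1></TD></TR></TABLE>
</BODY></HTML>"""

for path, text in text_with_context(page):
    print(path, "->", text)
```

The same text ("Tim Bower") appears twice in the sample, and only the surrounding tag path distinguishes the page title from the heading, which is why several tokens must be examined to determine context.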

  4. Tokens

     HTML source:

         <HTML>
         <HEAD>
          <TITLE> Tim Bower </TITLE>
         </HEAD>

         <BODY BGCOLOR="lightyellow">

         <TABLE> <TR>
         <TD>
         <H1>Tim Bower</H1>

     Resulting tokens:

         {'data': [], 'type': 'StartTag', 'name': u'html'}
         {'data': [], 'type': 'StartTag', 'name': u'head'}
         {'data': u'\n ', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'StartTag', 'name': u'title'}
         {'data': u' ', 'type': 'SpaceCharacters'}
         {'data': u'Tim Bower', 'type': 'Characters'}
         {'data': u' ', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'EndTag', 'name': u'title'}
         {'data': u'\n', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'EndTag', 'name': u'head'}
         {'data': u'\n\n', 'type': 'SpaceCharacters'}
         {'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
         {'data': u' \n\n', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'StartTag', 'name': u'table'}
         {'data': u' ', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'StartTag', 'name': u'tbody'}
         {'data': [], 'type': 'StartTag', 'name': u'tr'}
         {'data': u'\n', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'StartTag', 'name': u'td'}
         {'data': u'\n', 'type': 'SpaceCharacters'}
         {'data': [], 'type': 'StartTag', 'name': u'h1'}
         {'data': u'Tim Bower', 'type': 'Characters'}
         {'data': [], 'type': 'EndTag', 'name': u'h1'}

  5. Two main programming strategies
     • The call-back approach (HTMLParser shown in text book)
       • Define your own class that extends the HTMLParser class
       • Nice use of inheritance and polymorphism
       • Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
     • The document tree approach
       • Parser builds a tree (data structure object) based on the page contents
       • You iterate through the tree or a list of tokens taken from the tree looking for desired data.
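As a small illustration of the call-back approach, the sketch below uses the Python 3 standard-library `html.parser.HTMLParser`. The class `LinkCollector`, the helper `collect_links`, and the sample markup are hypothetical names chosen for this example. The subclass overrides only `handle_starttag`; the parser decides when to call it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Call-back style: the parser calls handle_starttag for every start-tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "a.html")]
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(html):
    """Feed the page to the parser and return the href values it collected."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup.
sample = '<p>See <a href="forecast.html">forecast</a> and <a href="radar.html">radar</a>.</p>'
print(collect_links(sample))
```

All state lives on the parser object, because each call-back sees only one token at a time.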

  6. HTMLParser

         # Python 2 (in Python 3 this class is html.parser.HTMLParser)
         import sys
         from HTMLParser import HTMLParser

         class TitleParser(HTMLParser):
             def __init__(self):
                 self.title = ''
                 self.readingtitle = 0
                 HTMLParser.__init__(self)

             def handle_starttag(self, tag, attrs):
                 # Called by the parser for every start-tag
                 if tag == 'title':
                     self.readingtitle = 1

             def handle_data(self, data):
                 # Called for text; keep it only while inside <title>
                 if self.readingtitle:
                     self.title += data

             def handle_endtag(self, tag):
                 if tag == 'title':
                     print "*** %s ***" % self.title
                     self.readingtitle = 0

         fd = open(sys.argv[1])
         tp = TitleParser()
         tp.feed(fd.read())

  7. Argh!, HTMLParser is fragile and hard to debug.

         Traceback (most recent call last):
           File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
             parser.feed(data)
           File "C:\Python25\lib\HTMLParser.py", line 108, in feed
             self.goahead(0)
           File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
             k = self.parse_starttag(i)
           File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
             endpos = self.check_for_whole_start_tag(i)
           File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
             self.error("malformed start tag")
           File "C:\Python25\lib\HTMLParser.py", line 115, in error
             raise HTMLParseError(message, self.getpos())
         HTMLParseError: malformed start tag, at line 120, column 477

  8. html5lib
     • Found on the Python Package Index
     • Install setuptools, then use Python to install html5lib (see the README file). Both are on K-State Online.
     • Advantages:
       • Robust, standards-based parser
       • Filtering data after the page is parsed is easier to follow and debug than the call-back approach
     • Disadvantage:
       • Limited documentation of the API for traversing the tree

  9. html5lib Usage

     Build the tree:

         import html5lib
         from html5lib import treebuilders, treewalkers

         p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
         f = open("weather.html", "r")
         dom_tree = p.parse(f)
         f.close()

     Loop through tokens:

         walker = treewalkers.getTreeWalker("dom")
         stream = walker(dom_tree)
         passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
                     u'strong', u'br', u'img',
                     u'dl', u'dt', u'dd']
         for token in stream:
             # Don't show non interesting stuff
             if token.has_key('name'):
                 if token['name'] in passtags:
                     continue
             print token

  10. The DOM tree alternative
      • The DOM tree may be used directly.
      • Not documented with html5lib, but xml.dom package is standard with Python.
      • DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.
      • Walk through the tree by examining children nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
      • See chapter 8 and DOMtry.py posted file.
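Walking a DOM tree through its child nodes might look like the following minimal sketch. It uses the standard `xml.dom.minidom` parser on a small, well-formed sample instead of html5lib, so it is an assumption-laden stand-in: the sample page and the helpers `text_of`, `element_children`, and `follow` are invented for this example. A tree built by html5lib can also contain nodes that are not in the source, such as the `tbody` visible in the slide 4 token dump, so the path would need to account for them.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample page (minidom is an XML parser).
PAGE = """<html><head><title>Tim Bower</title></head>
<body><table><tr><td><h1>Tim Bower</h1></td><td>Salina</td></tr></table></body></html>"""


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def element_children(node, tag):
    """Return the direct child elements of node that are named tag."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]


def follow(node, path):
    """Step down the tree, taking the first child with each tag name in path."""
    for tag in path:
        node = element_children(node, tag)[0]
    return node


dom_tree = minidom.parseString(PAGE)
root = dom_tree.documentElement  # the <html> element

# Knowing the page structure lets us go almost directly to the data.
heading = follow(root, ["body", "table", "tr", "td", "h1"])
print(text_of(heading))

cells = element_children(follow(root, ["body", "table", "tr"]), "td")
print([text_of(c) for c in cells])
```

The hard-coded path is exactly what breaks when the page format changes, so it is worth keeping in one place.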

  11. html5lib tokens
      • Stream of tokens is a list
      • Each token is a dictionary
        • token['data']
          • String (unicode encoding)
          • Empty list
          • List of tuples for formatting attributes
        • token['type'] – (StartTag, EndTag, Characters, SpaceCharacters)
        • token['name'] – tag name for start and end tags. (table, tr, td, h1, br, ul, li, … )
      • See the example tokens on slide 4
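To make the token layout concrete, here is a short Python 3 sketch that works on a hand-copied sample in the dictionary shape shown in the slide 4 dump. It does not call html5lib; the `tokens` list and the `describe` helper are assumptions made for illustration.

```python
# A few tokens in the dictionary shape shown on slide 4 (hand-copied sample).
tokens = [
    {"data": [("bgcolor", "lightyellow")], "type": "StartTag", "name": "body"},
    {"data": " \n\n", "type": "SpaceCharacters"},
    {"data": [], "type": "StartTag", "name": "h1"},
    {"data": "Tim Bower", "type": "Characters"},
    {"data": [], "type": "EndTag", "name": "h1"},
]


def describe(token):
    """Return a one-line summary of a token, or None for tokens we ignore."""
    kind = token["type"]
    if kind == "StartTag":
        attrs = dict(token["data"])  # list of (name, value) tuples -> dict
        return "start %s %r" % (token["name"], attrs)
    if kind == "EndTag":
        return "end %s" % token["name"]
    if kind == "Characters":
        return "text %r" % token["data"]
    return None  # SpaceCharacters and anything else


for token in tokens:
    summary = describe(token)
    if summary is not None:
        print(summary)
```

Only start and end tags carry a 'name' key, so code that reads it should check the token type (or the key's presence) first.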

  12. html5lib token parsing

         # Uses stream and passtags from the html5lib Usage slide (Python 2)
         doingTitle = False
         for token in stream:
             if token.has_key('name'):
                 if token['name'] in passtags:
                     continue
                 else:
                     tName = token['name']
             tType = token['type']
             if tType == 'StartTag':
                 if tName == u'title':
                     title = ''
                     doingTitle = True
             if tType == 'EndTag':
                 if tName == u'title':
                     print "*** %s ***\n" % title
                     doingTitle = False
             if tType == 'Characters':
                 if doingTitle:
                     title += token['data']
