Doc/library/htmlparser.rst

   1
   2 :mod:`HTMLParser` --- Simple HTML and XHTML parser
   3 ==================================================
   4
   5 .. module:: HTMLParser
   6    :synopsis: A simple parser that can handle HTML and XHTML.
   7
   8 .. note::
   9
  10    The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
  11    3.  The :term:`2to3` tool will automatically adapt imports when converting
  12    your sources to Python 3.
  13
  14
  15 .. versionadded:: 2.2
  16
  17 .. index::
  18    single: HTML
  19    single: XHTML
  20
  21 **Source code:** :source:`Lib/HTMLParser.py`
  22
  23 --------------
  24
  25 This module defines a class :class:`.HTMLParser` which serves as the basis for
  26 parsing text files formatted in HTML (HyperText Markup Language) and XHTML.
  27 Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
  28 in :mod:`sgmllib`.
  29
  30
  31 .. class:: HTMLParser()
  32
  33    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
  34    when start tags, end tags, text, comments, and other markup elements are
  35    encountered.  The user should subclass :class:`.HTMLParser` and override its
  36    methods to implement the desired behavior.
  37
  38    The :class:`.HTMLParser` class is instantiated without arguments.
  39
  40    Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
  41    match start tags or call the end-tag handler for elements which are closed
  42    implicitly by closing an outer element.
  43
  44 An exception is defined as well:
  45
  46 .. exception:: HTMLParseError
  47
  48    :class:`.HTMLParser` is able to handle broken markup, but in some cases it
  49    might raise this exception when it encounters an error while parsing.
  50    This exception provides three attributes: :attr:`msg` is a brief
  51    message explaining the error, :attr:`lineno` is the number of the line on
  52    which the broken construct was detected, and :attr:`offset` is the number of
  53    characters into the line at which the construct starts.
  54
  55
  56 Example HTML Parser Application
  57 -------------------------------
  58
  59 As a basic example, below is a simple HTML parser that uses the
  60 :class:`.HTMLParser` class to print out start tags, end tags and data
  61 as they are encountered::
  62
  63    from HTMLParser import HTMLParser
  64
  65    # create a subclass and override the handler methods
  66    class MyHTMLParser(HTMLParser):
  67        def handle_starttag(self, tag, attrs):
  68            print "Encountered a start tag:", tag
  69        def handle_endtag(self, tag):
  70            print "Encountered an end tag :", tag
  71        def handle_data(self, data):
  72            print "Encountered some data  :", data
  73
  74    # instantiate the parser and feed it some HTML
  75    parser = MyHTMLParser()
  76    parser.feed('<html><head><title>Test</title></head>'
  77                '<body><h1>Parse me!</h1></body></html>')
  78
  79 The output will then be::
  80
  81    Encountered a start tag: html
  82    Encountered a start tag: head
  83    Encountered a start tag: title
  84    Encountered some data  : Test
  85    Encountered an end tag : title
  86    Encountered an end tag : head
  87    Encountered a start tag: body
  88    Encountered a start tag: h1
  89    Encountered some data  : Parse me!
  90    Encountered an end tag : h1
  91    Encountered an end tag : body
  92    Encountered an end tag : html
  93
  94
  95 :class:`.HTMLParser` Methods
  96 ----------------------------
  97
  98 :class:`.HTMLParser` instances have the following methods:
  99
 100
 101 .. method:: HTMLParser.feed(data)
 102
 103    Feed some text to the parser.  It is processed insofar as it consists of
 104    complete elements; incomplete data is buffered until more data is fed or
 105    :meth:`close` is called.  *data* can be either :class:`unicode` or
 106    :class:`str`, but passing :class:`unicode` is advised.
 107
 108
 109 .. method:: HTMLParser.close()
 110
 111    Force processing of all buffered data as if it were followed by an end-of-file
 112    mark.  This method may be redefined by a derived class to define additional
 113    processing at the end of the input, but the redefined version should always call
 114    the :class:`.HTMLParser` base class method :meth:`close`.
 115
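The buffering done by :meth:`feed` and the rule for overriding :meth:`close`
can be illustrated with a small sketch.  ``TagCollector`` and its ``tags`` and
``finished`` attributes are hypothetical names introduced only for this
example; the ``try``/``except`` import is an assumption made so the sketch
also runs where the module is named :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names as they are seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def close(self):
        # Delegate to the base class first, then do the extra work.
        HTMLParser.close(self)
        self.finished = True


collector = TagCollector()
collector.feed('<di')                 # incomplete start tag: buffered
seen_after_first_feed = list(collector.tags)    # []
collector.feed('v><p>')               # completes <div>, then parses <p>
seen_after_second_feed = list(collector.tags)   # ['div', 'p']
collector.close()
```

No handler runs for ``'<di'`` on its own; the start tag is reported only once
the rest of it has been fed.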
 116
 117 .. method:: HTMLParser.reset()
 118
 119    Reset the instance.  All unprocessed data is lost.  This method is called
 120    implicitly at instantiation time.
 121
 122
 123 .. method:: HTMLParser.getpos()
 124
 125    Return the current line number and offset as a ``(lineno, offset)`` tuple.
 126
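As a rough illustration, :meth:`getpos` can be called from inside a handler to
find out where the construct being handled starts.  ``PositionRecorder`` is a
hypothetical class written for this sketch, and the import fallback is an
assumption so that it also runs under the :mod:`html.parser` name.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class PositionRecorder(HTMLParser):
    """Hypothetical helper: remembers where each start tag begins."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, offset into that line)
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>\n')
recorder.close()
# recorder.positions == [('html', (1, 0)), ('body', (2, 2))]
```

Line numbers start at 1 and offsets at 0, so ``<body>``, which is indented by
two spaces on the second line, is reported at ``(2, 2)``.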
 127
 128 .. method:: HTMLParser.get_starttag_text()
 129
 130    Return the text of the most recently opened start tag.  This should not normally
 131    be needed for structured processing, but may be useful in dealing with HTML "as
 132    deployed" or for re-generating input with minimal changes (whitespace between
 133    attributes can be preserved, etc.).
 134
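A minimal sketch of :meth:`get_starttag_text`, assuming it is called from
within :meth:`handle_starttag`.  The ``RawTagRecorder`` class and its
``raw_tags`` list are hypothetical names used only here; the import fallback
is likewise an assumption for portability.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class RawTagRecorder(HTMLParser):
    """Hypothetical helper: keeps the original text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased, but the raw text keeps case and spacing
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A  HREF="x.html" >link</A>')
recorder.close()
# recorder.raw_tags == [('a', '<A  HREF="x.html" >')]
```

The recorded text preserves the upper-case names and the extra whitespace of
the input, which is what makes it usable for re-generating markup with minimal
changes.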
 135
 136 The following methods are called when data or markup elements are encountered
 137 and they are meant to be overridden in a subclass.  The base class
 138 implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 139
 140
 141 .. method:: HTMLParser.handle_starttag(tag, attrs)
 142
 143    This method is called to handle the start tag of an element (e.g. ``<div id="main">``).
 144
 145    The *tag* argument is the name of the tag converted to lower case. The *attrs*
 146    argument is a list of ``(name, value)`` pairs containing the attributes found
 147    inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
 148    quotes around the *value* are removed, and character and entity references
 149    in the value are replaced.
 150
 151    For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
 152    would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
 153
 154    .. versionchanged:: 2.6
 155       All entity references from :mod:`htmlentitydefs` are now replaced in the
 156       attribute values.
 157
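One common use of the *attrs* list is pulling out a single attribute.  The
following sketch collects ``href`` values from ``<a>`` tags; ``LinkCollector``
and ``links`` are hypothetical names, and the import fallback is an assumption
so the code also runs under the :mod:`html.parser` name.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; names are lower case
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
collector.close()
# collector.links == ['http://www.cwi.nl/']
```

Because both the tag name and the attribute names arrive in lower case, the
comparisons against ``'a'`` and ``'href'`` match regardless of how the input
was capitalized.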
 158
 159 .. method:: HTMLParser.handle_endtag(tag)
 160
 161    This method is called to handle the end tag of an element (e.g. ``</div>``).
 162
 163    The *tag* argument is the name of the tag converted to lower case.
 164
 165
 166 .. method:: HTMLParser.handle_startendtag(tag, attrs)
 167
 168    Similar to :meth:`handle_starttag`, but called when the parser encounters an
 169    XHTML-style empty tag (``<img ... />``).  This method may be overridden by
 170    subclasses which require this particular lexical information; the default
 171    implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
 172
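The difference between the default :meth:`handle_startendtag` and an
overriding version can be sketched as follows.  ``EventLog`` and
``EmptyTagLog`` are hypothetical classes written for this illustration, and
the import fallback is an assumption for portability.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class EventLog(HTMLParser):
    """Hypothetical helper: logs start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagLog(EventLog):
    """Same as EventLog, but reports XHTML-style empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default_log = EventLog()
default_log.feed('<img src="logo.png" />')
default_log.close()
# default_log.events == [('start', 'img'), ('end', 'img')]

custom_log = EmptyTagLog()
custom_log.feed('<img src="logo.png" />')
custom_log.close()
# custom_log.events == [('startend', 'img')]
```

Without the override, an empty tag is indistinguishable from a start tag
immediately followed by an end tag.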
 173
 174 .. method:: HTMLParser.handle_data(data)
 175
 176    This method is called to process arbitrary data (e.g. text nodes and the
 177    content of ``<script>...</script>`` and ``<style>...</style>``).
 178
 179
 180 .. method:: HTMLParser.handle_entityref(name)
 181
 182    This method is called to process a named character reference of the form
 183    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
 184    (e.g. ``'gt'``).
 185
 186
 187 .. method:: HTMLParser.handle_charref(name)
 188
 189    This method is called to process decimal and hexadecimal numeric character
 190    references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
 191    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
 192    in this case the method will receive ``'62'`` or ``'x3E'``.
 193
 194
 195 .. method:: HTMLParser.handle_comment(data)
 196
 197    This method is called when a comment is encountered (e.g. ``<!--comment-->``).
 198
 199    For example, the comment ``<!-- comment -->`` will cause this method to be
 200    called with the argument ``' comment '``.
 201
 202    The content of Internet Explorer conditional comments (condcoms) will also be
 203    sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
 204    this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
 205
 206
 207 .. method:: HTMLParser.handle_decl(decl)
 208
 209    This method is called to handle an HTML doctype declaration (e.g.
 210    ``<!DOCTYPE html>``).
 211
 212    The *decl* parameter will be the entire contents of the declaration inside
 213    the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
 214
 215
 216 .. method:: HTMLParser.handle_pi(data)
 217
 218    This method is called when a processing instruction is encountered.  The *data*
 219    parameter will contain the entire processing instruction.  For example, for the
 220    processing instruction ``<?proc color='red'>``, this method would be called as
 221    ``handle_pi("proc color='red'")``.
 222
 223    .. note::
 224
 225       The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
 226       instructions.  An XHTML processing instruction using the trailing ``'?'`` will
 227       cause the ``'?'`` to be included in *data*.
 228
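The effect described in the note can be seen in a short sketch that feeds one
SGML-style and one XHTML-style processing instruction.  ``PICollector`` and
``instructions`` are hypothetical names, and the import fallback is an
assumption so the code also runs under the :mod:`html.parser` name.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # name after the Python 3 rename


class PICollector(HTMLParser):
    """Hypothetical helper: stores the data of each processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
collector.feed("<?proc color='red'>")       # SGML style, closed by '>'
collector.feed('<?xml version="1.0"?>')     # XHTML style, closed by '?>'
collector.close()
# collector.instructions == ["proc color='red'", 'xml version="1.0"?']
```

A subclass that cares about XHTML input could strip a trailing ``'?'`` from
*data* itself.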
 229
 230 .. method:: HTMLParser.unknown_decl(data)
 231
 232    This method is called when an unrecognized declaration is read by the parser.
 233
 234    The *data* parameter will be the entire contents of the declaration inside
 235    the ``<![...]>`` markup.  Overriding this method in a derived class is
 236    sometimes useful.
 237
 238
 239 .. _htmlparser-examples:
 240
 241 Examples
 242 --------
 243
 244 The following class implements a parser that is used in the examples that
 245 follow::
 246
 247    from HTMLParser import HTMLParser
 248    from htmlentitydefs import name2codepoint
 249
 250    class MyHTMLParser(HTMLParser):
 251        def handle_starttag(self, tag, attrs):
 252            print "Start tag:", tag
 253            for attr in attrs:
 254                print "     attr:", attr
 255        def handle_endtag(self, tag):
 256            print "End tag  :", tag
 257        def handle_data(self, data):
 258            print "Data     :", data
 259        def handle_comment(self, data):
 260            print "Comment  :", data
 261        def handle_entityref(self, name):
 262            c = unichr(name2codepoint[name])
 263            print "Named ent:", c
 264        def handle_charref(self, name):
 265            if name.startswith('x'):
 266                c = unichr(int(name[1:], 16))
 267            else:
 268                c = unichr(int(name))
 269            print "Num ent  :", c
 270        def handle_decl(self, data):
 271            print "Decl     :", data
 272
 273    parser = MyHTMLParser()
 274
 275 Parsing a doctype::
 276
 277    >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
 278    ...             '"http://www.w3.org/TR/html4/strict.dtd">')
 279    Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
 280
 281 Parsing an element with a few attributes and a title::
 282
 283    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
 284    Start tag: img
 285         attr: ('src', 'python-logo.png')
 286         attr: ('alt', 'The Python logo')
 287    >>>
 288    >>> parser.feed('<h1>Python</h1>')
 289    Start tag: h1
 290    Data     : Python
 291    End tag  : h1
 292
 293 The content of ``script`` and ``style`` elements is returned as is, without
 294 further parsing::
 295
 296    >>> parser.feed('<style type="text/css">#python { color: green }</style>')
 297    Start tag: style
 298         attr: ('type', 'text/css')
 299    Data     : #python { color: green }
 300    End tag  : style
 301    >>>
 302    >>> parser.feed('<script type="text/javascript">'
 303    ...             'alert("<strong>hello!</strong>");</script>')
 304    Start tag: script
 305         attr: ('type', 'text/javascript')
 306    Data     : alert("<strong>hello!</strong>");
 307    End tag  : script
 308
 309 Parsing comments::
 310
 311    >>> parser.feed('<!-- a comment -->'
 312    ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
 313    Comment  :  a comment
 314    Comment  : [if IE 9]>IE-specific content<![endif]
 315
 316 Parsing named and numeric character references and converting them to the
 317 correct char (note: these 3 references are all equivalent to ``'>'``)::
 318
 319    >>> parser.feed('&gt;&#62;&#x3E;')
 320    Named ent: >
 321    Num ent  : >
 322    Num ent  : >
 323
 324 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
 325 :meth:`~HTMLParser.handle_data` might be called more than once::
 326
 327    >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
 328    ...     parser.feed(chunk)
 329    ...
 330    Start tag: span
 331    Data     : buff
 332    Data     : ered
 333    Data     : text
 334    End tag  : span
 335
 336 Parsing invalid HTML (e.g. unquoted attributes) also works::
 337
 338    >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
 339    Start tag: p
 340    Start tag: a
 341         attr: ('class', 'link')
 342         attr: ('href', '#main')
 343    Data     : tag soup
 344    End tag  : p
 345    End tag  : a