:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
3.  The :term:`2to3` tool will automatically adapt imports when converting
your sources to Python 3.

**Source code:** :source:`Lib/HTMLParser.py`

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.

.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data  :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html

:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:

.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.

.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.

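As a rough illustration of redefining :meth:`close`, the sketch below gathers
text in a hypothetical ``TextCollector`` subclass and joins it once the input
ends.  The class name and its ``chunks`` and ``text`` attributes are
assumptions made for this example only; the import falls back to
:mod:`html.parser` so that the snippet also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class TextCollector(HTMLParser):
    """Hypothetical subclass: collects text and finalizes it in close()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always call the base class method first so that any buffered
        # input is processed before the result is assembled.
        HTMLParser.close(self)
        self.text = ''.join(self.chunks)


text_collector = TextCollector()
text_collector.feed('<p>Hello, <b>world</b>!</p>')
text_collector.close()
```

After :meth:`close` returns, ``text_collector.text`` holds the concatenated
character data of the document.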
.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.

.. method:: HTMLParser.getpos()

   Return current line number and offset.

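One possible use of :meth:`getpos` is to record where each start tag begins.
The following is a minimal sketch; ``PositionRecorder`` and its ``positions``
attribute are hypothetical names introduced for the example.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class PositionRecorder(HTMLParser):
    """Hypothetical subclass: remembers the position of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


position_recorder = PositionRecorder()
position_recorder.feed('<html>\n  <body>\n</body></html>')
# position_recorder.positions is now
# [('html', (1, 0)), ('body', (2, 2))]
```

In this run the line number starts at 1 and the offset at 0, and the position
reported inside the handler is that of the tag's opening ``<``.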
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

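To see the difference between the normalized arguments and the raw tag text,
the sketch below stores both.  ``RawTagRecorder`` is a hypothetical name used
only for this illustration.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass: keeps the original text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is lower-cased; get_starttag_text() returns the tag as written.
        self.raw_tags.append((tag, self.get_starttag_text()))


raw_recorder = RawTagRecorder()
raw_recorder.feed('<A   HREF="index.html">home</A>')
# raw_recorder.raw_tags is now [('a', '<A   HREF="index.html">')]
```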
The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):

.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.

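A common way to use :meth:`handle_starttag` is to pick out one attribute from
the ``(name, value)`` pairs.  The sketch below does this for ``href``;
``LinkExtractor`` and the sample URLs are hypothetical and serve only as
illustration.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class LinkExtractor(HTMLParser):
    """Hypothetical subclass: collects the href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                     # tag names arrive lower-cased
            for name, value in attrs:      # attribute names do as well
                if name == 'href':
                    self.links.append(value)


link_extractor = LinkExtractor()
link_extractor.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
                    '<a href="search?q=html&amp;page=2">next</a>')
# link_extractor.links is now
# ['http://www.cwi.nl/', 'search?q=html&page=2']
```

Note that the ``&amp;`` reference inside the second attribute value arrives
already replaced by ``&``.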
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.

.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

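The effect of the default implementation can be compared with an override in
a short sketch.  Both recorder classes below are hypothetical names introduced
for the example.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class EventRecorder(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Hypothetical subclass that treats empty tags as their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default_recorder = EventRecorder()
default_recorder.feed('<p><br/></p>')

empty_tag_recorder = EmptyTagRecorder()
empty_tag_recorder.feed('<p><br/></p>')
```

With the default implementation, ``<br/>`` shows up as a start event followed
by an end event; with the override it is reported once as ``'startend'``.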
.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).

.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).

.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.

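The *name* strings passed to :meth:`handle_entityref` and
:meth:`handle_charref` can be turned into characters with a couple of small
helpers.  This is only a sketch: ``entityref_to_char`` and ``charref_to_char``
are hypothetical functions, not part of the module, and the fallbacks to
:mod:`html.entities` and ``chr`` are there so that it also runs on Python 3.

```python
try:
    from htmlentitydefs import name2codepoint    # Python 2 module name
except ImportError:
    from html.entities import name2codepoint     # renamed module in Python 3

try:
    unichr                                        # built in on Python 2
except NameError:
    unichr = chr                                  # Python 3 equivalent


def entityref_to_char(name):
    """Convert the name given to handle_entityref(), e.g. 'gt'."""
    return unichr(name2codepoint[name])


def charref_to_char(name):
    """Convert the name given to handle_charref(), e.g. '62' or 'x3E'."""
    if name.startswith(('x', 'X')):
        return unichr(int(name[1:], 16))          # hexadecimal form
    return unichr(int(name))                      # decimal form


# All three references from the text above decode to the same character.
decoded = [entityref_to_char('gt'),
           charref_to_char('62'),
           charref_to_char('x3E')]
```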
.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.

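A minimal sketch of collecting comments, including a conditional comment, is
shown below.  ``CommentCollector`` is a hypothetical name chosen for the
example.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class CommentCollector(HTMLParser):
    """Hypothetical subclass: stores the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # 'data' excludes the surrounding <!-- and --> delimiters.
        self.comments.append(data)


comment_collector = CommentCollector()
comment_collector.feed('<!-- a comment -->'
                       '<!--[if IE 9]>IE9-specific content<![endif]-->')
```

Whitespace inside the delimiters is preserved, so the first entry is
``' a comment '`` with its leading and trailing spaces.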
.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).

.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
   instructions.  An XHTML processing instruction using the trailing ``'?'`` will
   cause the ``'?'`` to be included in *data*.

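The trailing ``'?'`` behavior can be observed with a short sketch;
``PIRecorder`` is a hypothetical name used only here.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed module in Python 3


class PIRecorder(HTMLParser):
    """Hypothetical subclass: stores each processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_recorder = PIRecorder()
pi_recorder.feed("<?proc color='red'>")      # SGML-style instruction
pi_recorder.feed('<?xml version="1.0"?>')    # XHTML-style, trailing '?'
```

The second entry keeps its final ``'?'``, so a subclass that needs the bare
instruction text may have to strip it.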
.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  It is sometimes useful to be overridden by a
   derived class.

.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr
       def handle_endtag(self, tag):
           print "End tag  :", tag
       def handle_data(self, data):
           print "Data     :", data
       def handle_comment(self, data):
           print "Comment  :", data
       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c
       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c
       def handle_decl(self, data):
           print "Decl     :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a