8 Since version 2.0, lxml comes with a dedicated package for dealing
9 with HTML: ``lxml.html``. It provides a special Element API for HTML
10 elements, as well as a number of utilities for common tasks.
15 1.1 Parsing HTML fragments
16 1.2 Really broken pages
17 2 HTML Element Methods
18 3 Running HTML doctests
19 4 Creating HTML with the E-factory
24 6.1 Form Filling Example
31 9.1 Microformat Example
The main API is based on the `lxml.etree`_ API, and thus, on the ElementTree_
API.
36 .. _`lxml.etree`: tutorial.html
37 .. _ElementTree: http://effbot.org/zone/element-index.htm
43 Parsing HTML fragments
44 ----------------------
46 There are several functions available to parse HTML:
48 ``parse(filename_url_or_file)``:
49 Parses the named file or url, or if the object has a ``.read()``
50 method, parses from that.
52 If you give a URL, or if the object has a ``.geturl()`` method (as
53 file-like objects from ``urllib.urlopen()`` have), then that URL
54 is used as the base URL. You can also provide an explicit
55 ``base_url`` keyword argument.
57 ``document_fromstring(string)``:
58 Parses a document from the given string. This always creates a
59 correct HTML document, which means the parent node is ``<html>``,
60 and there is a body and possibly a head.
62 ``fragment_fromstring(string, create_parent=False)``:
63 Returns an HTML fragment from a string. The fragment must contain
64 just a single element, unless ``create_parent`` is given;
e.g., ``fragment_fromstring(string, create_parent='div')`` will
66 wrap the element in a ``<div>``.
68 ``fragments_fromstring(string)``:
69 Returns a list of the elements found in the fragment.
``fromstring(string)``:
Returns ``document_fromstring`` or ``fragment_fromstring``, based
on whether the string looks like a full document, or just a
fragment.
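As a rough illustration of how the string-parsing functions above differ, the following sketch parses a few small inputs. It assumes Python 3 and an installed ``lxml``; the sample markup and variable names are made up for the example.

```python
from lxml import html

# document_fromstring always yields a full document rooted at <html>.
doc = html.document_fromstring("<p>Hello</p>")

# fragment_fromstring expects exactly one top-level element ...
para = html.fragment_fromstring("<p>Hello <b>world</b></p>")

# ... unless create_parent names a wrapper element to put around them.
wrapped = html.fragment_fromstring("<p>one</p><p>two</p>", create_parent="div")

# fragments_fromstring returns every top-level element as a list.
parts = html.fragments_fromstring("<p>one</p><p>two</p>")

# fromstring decides between the document and fragment parsers itself.
guessed = html.fromstring("<html><body><p>Hi</p></body></html>")

print(doc.tag, para.tag, wrapped.tag, [el.tag for el in parts], guessed.tag)
```

Choosing ``create_parent`` is mostly useful when the input comes from users or templates and may contain several sibling elements.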
79 The normal HTML parser is capable of handling broken HTML, but for
80 pages that are far enough from HTML to call them 'tag soup', it may
81 still fail to parse the page. A way to deal with this is
82 ElementSoup_, which deploys the well-known BeautifulSoup_ parser to
83 build an lxml HTML tree.
85 .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
86 .. _ElementSoup: elementsoup.html
92 HTML elements have all the methods that come with ElementTree, but
93 also include some extra methods:
``.drop_tree()``:
    Drops the element and all its children.  Unlike
    ``el.getparent().remove(el)`` this does *not* remove the tail
    text; with ``drop_tree`` the tail text is merged with the previous
    element.

``.drop_tag()``:
    Drops the tag, but keeps its children and text.
104 ``.find_class(class_name)``:
105 Returns a list of all the elements with the given CSS class name.
106 Note that class names are space separated in HTML, so
``doc.find_class('highlight')`` will find an element like
``<div class="sidebar highlight">``. Class names *are* case
sensitive.
111 ``.find_rel_links(rel)``:
112 Returns a list of all the ``<a rel="{rel}">`` elements. E.g.,
113 ``doc.find_rel_links('tag')`` returns all the links `marked as
114 tags <http://microformats.org/wiki/rel-tag>`_.
116 ``.get_element_by_id(id, default=None)``:
117 Return the element with the given ``id``, or the ``default`` if
none is found. If there are multiple elements with the same id
(which there shouldn't be, but there often are), this returns only
the first.

``.text_content()``:
    Returns the text content of the element, including the text
    content of its children, with no markup.
126 ``.cssselect(expr)``:
127 Select elements from this element and its children, using a CSS
128 selector expression. (Note that ``.xpath(expr)`` is also
129 available as on all lxml elements.)
``.label``:
    Returns the corresponding ``<label>`` element for this element, if
    any exists (None if there is none).  Label elements have a
    ``label.for_element`` attribute that points back to the element.

``.base_url``:
    The base URL for this element, if one was saved from the parsing.
    This attribute is not settable.  It is None when no base URL was
    saved.
141 Running HTML doctests
142 =====================
144 One of the interesting modules in the ``lxml.html`` package deals with
145 doctests. It can be hard to compare two HTML pages for equality, as
146 whitespace differences aren't meaningful and the structural formatting
147 can differ. This is even more a problem in doctests, where output is
148 tested for equality and small differences in whitespace or the order
149 of attributes can let a test fail. And given the verbosity of
150 tag-based languages, it may take more than a quick look to find the
151 actual differences in the doctest output.
153 Luckily, lxml provides the ``lxml.doctestcompare`` module that
154 supports relaxed comparison of XML and HTML pages and provides a
155 readable diff in the output when a test fails. The HTML comparison is
156 most easily used by importing the ``usedoctest`` module in a doctest:
158 .. sourcecode:: pycon
160 >>> import lxml.html.usedoctest
Now, if you have an HTML document and want to compare it to an expected result
163 document in a doctest, you can do the following:
165 .. sourcecode:: pycon

    >>> import lxml.html
    >>> html = lxml.html.fromstring('''\
    ...    <html><body onload="" color="white">
    ...      <p>Hi !</p>
    ...    </body></html>
    ... ''')

    >>> print lxml.html.tostring(html)
    <html><body onload="" color="white"><p>Hi !</p></body></html>

    >>> print lxml.html.tostring(html)
    <html> <body color="white" onload=""> <p>Hi !</p> </body> </html>

    >>> print lxml.html.tostring(html)
    <html>
      <body color="white" onload="">
        <p>Hi !</p>
      </body>
    </html>

187 In documentation, you would likely prefer the pretty printed HTML output, as
188 it is the most readable. However, the three documents are equivalent from the
189 point of view of an HTML tool, so the doctest will silently accept any of the
190 above. This allows you to concentrate on readability in your doctests, even
191 if the real output is a straight ugly HTML one-liner.
193 Note that there is also an ``lxml.usedoctest`` module which you can
194 import for XML comparisons. The HTML parser notably ignores
195 namespaces and some other XMLisms.
198 Creating HTML with the E-factory
199 ================================
201 .. _`E-factory`: http://online.effbot.org/2006_11_01_archive.htm#et-builder
203 lxml.html comes with a predefined HTML vocabulary for the `E-factory`_,
originally written by Fredrik Lundh. This allows you to quickly generate HTML
pages and fragments:
207 .. sourcecode:: pycon

    >>> import lxml.html
    >>> from lxml.html import builder as E
    >>> from lxml.html import usedoctest
    >>> html = E.HTML(
    ...   E.HEAD(
    ...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
    ...     E.TITLE("Best Page Ever")
    ...   ),
    ...   E.BODY(
    ...     E.H1(E.CLASS("heading"), "Top News"),
    ...     E.P("World News only on this page", style="font-size: 200%"),
    ...     "Ah, and here's some more text, by the way.",
    ...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
    ...   )
    ... )

    >>> print lxml.html.tostring(html)
    <html>
      <head>
        <link href="great.css" rel="stylesheet" type="text/css">
        <title>Best Page Ever</title>
      </head>
      <body>
        <h1 class="heading">Top News</h1>
        <p style="font-size: 200%">World News only on this page</p>
        Ah, and here's some more text, by the way.
        <p>... and this is a parsed fragment ...</p>
      </body>
    </html>

Note that you should use ``lxml.html.tostring`` and **not**
``lxml.etree.tostring``. ``lxml.etree.tostring(doc)`` will return the XML
representation of the document, which is not valid HTML. In
particular, things like ``<script src="..."></script>`` will be
serialized as ``<script src="..." />``, which completely confuses
browsers.
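The difference between the two serializers can be seen on a single empty ``<script>`` element. This is a minimal sketch assuming Python 3, where both functions return bytes by default; the file name ``app.js`` is a placeholder.

```python
from lxml import etree, html
from lxml.html import builder as E

# An empty <script> element built with the E-factory.
script = E.SCRIPT(src="app.js")

# The XML serializer collapses empty elements into a self-closing tag.
as_xml = etree.tostring(script)

# The HTML serializer keeps the explicit closing tag that browsers expect.
as_html = html.tostring(script)

print(as_xml)
print(as_html)
```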
248 A handy method for viewing your HTML:
249 ``lxml.html.open_in_browser(lxml_doc)`` will write the document to
250 disk and open it in a browser (with the `webbrowser module
251 <http://python.org/doc/current/lib/module-webbrowser.html>`_).
256 There are several methods on elements that allow you to see and modify
257 the links in a document.
``.iterlinks()``:
    This yields ``(element, attribute, link, pos)`` for every link in
    the document.  ``attribute`` may be None if the link is in the
    text (as will be the case with a ``<style>`` tag with
    ``@import``).
265 This finds any link in an ``action``, ``archive``, ``background``,
266 ``cite``, ``classid``, ``codebase``, ``data``, ``href``,
267 ``longdesc``, ``profile``, ``src``, ``usemap``, ``dynsrc``, or
268 ``lowsrc`` attribute. It also searches ``style`` attributes for
269 ``url(link)``, and ``<style>`` tags for ``@import`` and ``url()``.
271 This function does *not* pay attention to ``<base href>``.
273 ``.resolve_base_href()``:
274 This function will modify the document in-place to take account of
275 ``<base href>`` if the document contains that tag. In the process
276 it will also remove that tag from the document.
278 ``.make_links_absolute(base_href, resolve_base_href=True)``:
279 This makes all links in the document absolute, assuming that
280 ``base_href`` is the URL of the document. So if you pass
281 ``base_href="http://localhost/foo/bar.html"`` and there is a link
282 to ``baz.html`` that will be rewritten as
283 ``http://localhost/foo/baz.html``.
285 If ``resolve_base_href`` is true, then any ``<base href>`` tag
286 will be taken into account (just calling
287 ``self.resolve_base_href()``).
289 ``.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)``:
290 This rewrites all the links in the document using your given link
291 replacement function. If you give a ``base_href`` value, all
292 links will be passed in after they are joined with this URL.
294 For each link ``link_repl_func(link)`` is called. That function
295 then returns the new link, or None to remove the attribute or tag
296 that contains the link. Note that all links will be passed in,
297 including links like ``"#anchor"`` (which is purely internal), and
298 things like ``"mailto:bob@example.com"`` (or ``javascript:...``).
300 If you want access to the context of the link, you should use
301 ``.iterlinks()`` instead.
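A short sketch of the link methods working together follows. It assumes Python 3; the fragment, the base URL, and the ``to_https`` helper are hypothetical and only serve the illustration.

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="baz.html">relative</a> '
    '<a href="#top">anchor</a> '
    '<img src="/logo.png"></div>'
)

# iterlinks yields (element, attribute, link, pos) in document order.
links_before = [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]

# Resolve every link against the URL the document was (notionally) loaded from.
doc.make_links_absolute("http://localhost/foo/bar.html")


def to_https(link):
    # Hypothetical replacement function: upgrade the scheme, keep the rest.
    return link.replace("http://", "https://", 1)


# rewrite_links calls the function once per link and stores the result.
doc.rewrite_links(to_https)
links_after = [link for el, attr, link, pos in doc.iterlinks()]
print(links_after)
```

Note that the anchor-only link ``#top`` is passed through both steps as well, so a replacement function may want to special-case such links.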
306 In addition to these methods, there are corresponding functions:
308 * ``iterlinks(html)``
309 * ``make_links_absolute(html, base_href, ...)``
310 * ``rewrite_links(html, link_repl_func, ...)``
311 * ``resolve_base_href(html)``
These functions will parse ``html`` if it is a string, then return the new
HTML as a string. If you pass in a document, the document will be copied
(except for ``iterlinks()``), the method performed, and the new document
returned.
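For string input, the module-level functions can be used without parsing first. A minimal sketch, assuming Python 3 and a made-up snippet and base URL:

```python
from lxml.html import make_links_absolute

snippet = '<a href="baz.html">baz</a>'

# A string goes in, so a string comes back; no element handling is needed.
absolute = make_links_absolute(snippet, "http://localhost/foo/bar.html")
print(absolute)
```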
321 Any ``<form>`` elements in a document are available through
322 the list ``doc.forms`` (e.g., ``doc.forms[0]``). Form, input, select,
323 and textarea elements each have special methods.
Input elements (including ``<select>`` and ``<textarea>``) have these
attributes:

``.name``:
    The name of the element.

``.value``:
    The value of an input, the content of a textarea, the selected
    option(s) of a select.  This attribute can be set.

    In the case of a select that takes multiple options (``<select
    multiple>``) this will be a set of the selected options; you can
    add or remove items to select and unselect the options.

Select attributes:

``.value_options``:
    For select elements, this is all the *possible* values (the values
    of all the options).

``.multiple``:
    For select elements, true if this is a ``<select multiple>``
    element.

Input attributes:

``.type``:
    The type attribute in ``<input>`` elements.

``.checkable``:
    True if this can be checked (i.e., true for type=radio and
    type=checkbox).

``.checked``:
    If this element is checkable, the checked state.  Raises
    AttributeError on non-checkable inputs.

362 The form itself has these attributes:

``.inputs``:
    A dictionary-like object that can be used to access input elements
    by name.  When there are multiple input elements with the same
    name, this returns list-like structures that can also be used to
    access the options and their values as a group.

``.fields``:
    A dictionary-like object used to access *values* by their name.
    ``form.inputs`` returns elements, this only returns values.
    Setting values in this dictionary will affect the form inputs.
    Basically ``form.fields[x]`` is equivalent to
    ``form.inputs[x].value`` and ``form.fields[x] = y`` is equivalent
    to ``form.inputs[x].value = y``.  (Note that sometimes
    ``form.inputs[x]`` returns a compound object, but these objects
    also have ``.value`` attributes.)

    If you set this attribute, it is equivalent to
    ``form.fields.clear(); form.fields.update(new_value)``

``.form_values()``:
    Returns a list of ``[(name, value), ...]``, suitable to be passed
    to ``urllib.urlencode()`` for form submission.

``.action``:
    The ``action`` attribute.  This is resolved to an absolute URL if
    possible.

``.method``:
    The ``method`` attribute, which defaults to ``GET``.

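The following sketch reads and changes form state through the attributes listed above. It assumes Python 3; the form markup and field names (``q``, ``safe``, ``lang``) are invented for the example.

```python
from lxml import html

page = html.fromstring(
    '<html><body><form action="/search">'
    '<input type="text" name="q" value="lxml">'
    '<input type="checkbox" name="safe" value="on">'
    '<select name="lang">'
    '<option value="en" selected>English</option>'
    '<option value="de">German</option>'
    '</select>'
    '</form></body></html>'
)
form = page.forms[0]

# form.fields works on values, form.inputs on the elements themselves.
form.fields["q"] = "html parsing"
form.inputs["safe"].checked = True
form.fields["lang"] = "de"

# All values a <select> could take, regardless of the current selection.
lang_options = form.inputs["lang"].value_options

# The (name, value) pairs that would be submitted.
submitted = dict(form.form_values())
print(form.method, submitted)
```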
397 Note that you can change any of these attributes (values, method,
398 action, etc) and then serialize the form to see the updated values.
399 You can, for instance, do:
401 .. sourcecode:: pycon

    >>> from lxml.html import fromstring, tostring
    >>> form_page = fromstring('''<html><body><form>
    ...   Your name: <input type="text" name="name"> <br>
    ...   Your phone: <input type="text" name="phone"> <br>
    ...   Your favorite pets: <br>
    ...   Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
    ...   Cats: <input type="checkbox" name="interest" value="cats"> <br>
    ...   Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
    ...   <input type="submit"></form></body></html>''')
    >>> form = form_page.forms[0]
    >>> form.fields = dict(
    ...     name='John Smith',
    ...     phone='555-555-3949',
    ...     interest=set(['cats', 'llamas']))
    >>> print tostring(form)
    <html>
      <body>
        <form>
        Your name:
          <input name="name" type="text" value="John Smith">
          <br>Your phone:
          <input name="phone" type="text" value="555-555-3949">
          <br>Your favorite pets:
          <br>Dogs:
          <input name="interest" type="checkbox" value="dogs">
          <br>Cats:
          <input checked name="interest" type="checkbox" value="cats">
          <br>Llamas:
          <input checked name="interest" type="checkbox" value="llamas">
          <br>
          <input type="submit">
        </form>
      </body>
    </html>

442 You can submit a form with ``lxml.html.submit_form(form_element)``.
443 This will return a file-like object (the result of
444 ``urllib.urlopen()``).
446 If you have extra input values you want to pass you can use the
447 keyword argument ``extra_values``, like ``extra_values={'submit':
448 'Yes!'}``. This is the only way to get submit values into the form,
449 as there is no state of "submitted" for these elements.
You can pass in an alternate opener with the ``open_http`` keyword
argument, which is a function with the signature ``open_http(method,
url, values)``.
457 .. sourcecode:: pycon
459 >>> from lxml.html import parse, submit_form
460 >>> page = parse('http://tinyurl.com').getroot()
461 >>> page.forms[1].fields['url'] = 'http://codespeak.net/lxml/'
462 >>> result = parse(submit_form(page.forms[1])).getroot()
463 >>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
464 ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
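To see what ``submit_form`` would send without touching the network, a recording opener can be passed as ``open_http``. This is a sketch assuming Python 3; ``recording_open_http``, the form markup, and the localhost URL are hypothetical stand-ins.

```python
from lxml import html

captured = []


def recording_open_http(method, url, values):
    # Hypothetical opener: records the request instead of sending it.
    captured.append((method, url, list(values)))
    return None


page = html.fromstring(
    '<html><body>'
    '<form method="post" action="http://localhost/subscribe">'
    '<input type="text" name="email" value="">'
    '<input type="submit" name="submit" value="Go">'
    '</form></body></html>'
)
form = page.forms[0]
form.fields["email"] = "bob@example.com"

# Submit buttons carry no "submitted" state, so their value is passed explicitly.
html.submit_form(form, extra_values={"submit": "Go"},
                 open_http=recording_open_http)
print(captured)
```

The same hook is a natural place to plug in an HTTP client of your choice, as long as it accepts the three arguments shown.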
469 The module ``lxml.html.clean`` provides a ``Cleaner`` class for cleaning up
470 HTML pages. It supports removing embedded or script content, special tags,
471 CSS style annotations and much more.
473 Say, you have an evil web page from an untrusted source that contains lots of
474 content that upsets browsers and tries to run evil code on the client side:
476 .. sourcecode:: pycon

    >>> html = '''\
    ... <html>
    ...  <head>
    ...    <script type="text/javascript" src="evil-site"></script>
    ...    <link rel="alternate" type="text/rss" src="evil-rss">
    ...    <style>
    ...      body {background-image: url(javascript:do_evil)};
    ...      div {color: expression(evil)};
    ...    </style>
    ...  </head>
    ...  <body onload="evil_function()">
    ...    <!-- I am interpreted for EVIL! -->
    ...    <a href="javascript:evil_function()">a link</a>
    ...    <a href="#" onclick="evil_function()">another link</a>
    ...    <p onclick="evil_function()">a paragraph</p>
    ...    <div style="display: none">secret EVIL!</div>
    ...    <object> of EVIL! </object>
    ...    <iframe src="evil-site"></iframe>
    ...    <form action="evil-site">
    ...      Password: <input type="password" name="password">
    ...    </form>
    ...    <blink>annoying EVIL!</blink>
    ...    <a href="evil-site">spam spam SPAM!</a>
    ...    <image src="evil!">
    ...  </body>
    ... </html>'''

To remove all the suspicious content from this unparsed document, use the
``clean_html`` function:
508 .. sourcecode:: pycon

    >>> from lxml.html.clean import clean_html

    >>> print clean_html(html)
    <html>
      <body>
        <div>
          <style>/* deleted */</style>
          <a href="">a link</a>
          <a href="#">another link</a>
          <p>a paragraph</p>
          <div>secret EVIL!</div>
          of EVIL!
          Password:
          annoying EVIL!
          <a href="evil-site">spam spam SPAM!</a>
          <img src="evil!">
        </div>
      </body>
    </html>

530 The ``Cleaner`` class supports several keyword arguments to control exactly
531 which content is removed:
533 .. sourcecode:: pycon

    >>> from lxml.html.clean import Cleaner

    >>> cleaner = Cleaner(page_structure=False, links=False)
    >>> print cleaner.clean_html(html)
    <html>
      <head>
        <link rel="alternate" src="evil-rss" type="text/rss">
        <style>/* deleted */</style>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
    ...                   page_structure=False, safe_attrs_only=False)

    >>> print cleaner.clean_html(html)
    <html>
      <head>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.

See the docstring of ``Cleaner`` for the details of what can be
cleaned.
In addition to cleaning up malicious HTML, ``lxml.html.clean``
contains functions to do other things to your HTML.  This includes
autolinking::

    autolink(doc, ...)

    autolink_html(html, ...)

This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor.  It avoids making bad links.

Links are not created inside the elements ``<textarea>``, ``<pre>``,
``<code>``, or anything in the head of the document.  You can pass in a
list of elements to avoid in ``avoid_elements=['textarea', ...]``.

605 Links to some hosts can be avoided. By default links to
606 ``localhost*``, ``example.*`` and ``127.0.0.1`` are not
autolinked. Pass in ``avoid_hosts=[list_of_regexes]`` to control
this.
610 Elements with the ``nolink`` CSS class are not autolinked. Pass
611 in ``avoid_classes=['code', ...]`` to control this.
613 The ``autolink_html()`` version of the function parses the HTML
614 string first, and returns a string.
620 You can also wrap long words in your html::
622 word_break(doc, max_width=40, ...)
624 word_break_html(html, ...)
626 This finds any long words in the text of the document and inserts
``&#8203;`` in the document (which is the Unicode zero-width space).
629 This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
630 You can control this with ``avoid_elements=['textarea', ...]``.
632 It also avoids elements with the CSS class ``nobreak``. You can
633 control this with ``avoid_classes=['code', ...]``.
Lastly you can control the character that is inserted with
``break_character=u'\u200b'``. However, you cannot insert markup,
only text.

``word_break_html(html)`` parses the HTML document and returns a
string.
645 The module ``lxml.html.diff`` offers some ways to visualize
646 differences in HTML documents. These differences are *content*
647 oriented. That is, changes in markup are largely ignored; only
648 changes in the content itself are highlighted.
650 There are two ways to view differences: ``htmldiff`` and
651 ``html_annotate``. One shows differences with ``<ins>`` and
652 ``<del>``, while the other annotates a set of changes similar to ``svn
653 blame``. Both these functions operate on text, and work best with
content fragments (only what goes in ``<body>``), not complete
documents.
657 Example of ``htmldiff``:
659 .. sourcecode:: pycon
661 >>> from lxml.html.diff import htmldiff, html_annotate
662 >>> doc1 = '''<p>Here is some text.</p>'''
663 >>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
664 >>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
665 >>> print htmldiff(doc1, doc2)
666 <p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
667 >>> print html_annotate([(doc1, 'author1'), (doc2, 'author2'),
668 ... (doc3, 'author3')])
669 <p><span title="author1">Here is</span>
670 <b><span title="author2">a</span>
671 <span title="author3">little</span></b>
672 <i><span title="author2">text</span></i>
673 <span title="author2">.</span></p>
675 As you can see, it is imperfect as such things tend to be. On larger
676 tracts of text with larger edits it will generally do better.
678 The ``html_annotate`` function can also take an optional second
679 argument, ``markup``. This is a function like ``markup(text,
680 version)`` that returns the given text marked up with the given
version. The default markup function, the output of which you see in the
example, looks like:

.. sourcecode:: python

    import cgi

    def default_markup(text, version):
        # Escape the version label so it is safe inside the title attribute.
        return '<span title="%s">%s</span>' % (
            cgi.escape(unicode(version), 1), text)
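A custom ``markup`` function can replace the default one, for instance to emit CSS classes instead of ``title`` attributes. The sketch below assumes Python 3 (hence ``html.escape`` rather than ``cgi.escape``); ``class_markup`` and the ``by-`` class prefix are hypothetical names chosen for the example.

```python
from html import escape

from lxml.html.diff import html_annotate


def class_markup(text, version):
    # Hypothetical markup function: tag each run of text with a CSS class.
    return '<span class="by-%s">%s</span>' % (
        escape(str(version), quote=True), text)


history = [
    ("<p>Here is some text.</p>", "author1"),
    ("<p>Here is <b>a lot</b> of <i>text</i>.</p>", "author2"),
]

annotated = html_annotate(history, markup=class_markup)
print(annotated)
```

A stylesheet can then colour each author's contribution without changing the annotation step.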
This example parses the `hCard <http://microformats.org/wiki/hcard>`_
microformat.

699 First we get the page:
701 .. sourcecode:: pycon

    >>> import urllib
    >>> from lxml.html import fromstring
    >>> url = 'http://microformats.org/'
    >>> content = urllib.urlopen(url).read()
    >>> doc = fromstring(content)
    >>> doc.make_links_absolute(url)

710 Then we create some objects to put the information in:
712 .. sourcecode:: pycon

    >>> class Card(object):
    ...     def __init__(self, **kw):
    ...         # Iterate over (name, value) pairs, not just the keys.
    ...         for name, value in kw.items():
    ...             setattr(self, name, value)
    >>> class Phone(object):
    ...     def __init__(self, phone, types=()):
    ...         self.phone, self.types = phone, types

722 And some generally handy functions for microformats:
724 .. sourcecode:: pycon

    >>> def get_text(el, class_name):
    ...     els = el.find_class(class_name)
    ...     if els:
    ...         return els[0].text_content()
    ...     else:
    ...         return ''
    >>> def get_value(el):
    ...     return get_text(el, 'value') or el.text_content()
    >>> def get_all_texts(el, class_name):
    ...     return [e.text_content() for e in el.find_class(class_name)]
    >>> def parse_addresses(el):
    ...     # Ideally this would parse street, etc.
    ...     return el.find_class('adr')

Then the parsing:

.. sourcecode:: pycon

    >>> for el in doc.find_class('hcard'):
    ...     card = Card()
    ...     card.el = el
    ...     card.fn = get_text(el, 'fn')
    ...     card.tels = []
    ...     # Search the hCard element itself; Card has no find_class().
    ...     for tel_el in el.find_class('tel'):
    ...         card.tels.append(Phone(get_value(tel_el),
    ...                                get_all_texts(tel_el, 'type')))
    ...     card.addresses = parse_addresses(el)