=========
lxml.html
=========

:Author:
  Ian Bicking

Since version 2.0, lxml comes with a dedicated package for dealing
with HTML: ``lxml.html``.  It provides a special Element API for HTML
elements, as well as a number of utilities for common tasks.

.. contents::
..
   1  Parsing HTML
     1.1  Parsing HTML fragments
     1.2  Really broken pages
   2  HTML Element Methods
   3  Running HTML doctests
   4  Creating HTML with the E-factory
     4.1  Viewing your HTML
   5  Working with links
     5.1  Functions
   6  Forms
     6.1  Form Filling Example
     6.2  Form Submission
   7  Cleaning up HTML
     7.1  autolink
     7.2  wordwrap
   8  HTML Diff
   9  Examples
     9.1  Microformat Example

The main API is based on the `lxml.etree`_ API, and thus, on the ElementTree_
API.

.. _`lxml.etree`: tutorial.html
.. _ElementTree:  http://effbot.org/zone/element-index.htm


Parsing HTML
============

Parsing HTML fragments
----------------------

There are several functions available to parse HTML:

``parse(filename_url_or_file)``:
    Parses the named file or URL, or if the object has a ``.read()``
    method, parses from that.

    If you give a URL, or if the object has a ``.geturl()`` method (as
    file-like objects from ``urllib.urlopen()`` have), then that URL
    is used as the base URL.  You can also provide an explicit
    ``base_url`` keyword argument.

``document_fromstring(string)``:
    Parses a document from the given string.  This always creates a
    correct HTML document, which means the root node is ``<html>``,
    and there is a body and possibly a head.

``fragment_fromstring(string, create_parent=False)``:
    Returns an HTML fragment from a string.  The fragment must contain
    just a single element, unless ``create_parent`` is given;
    e.g., ``fragment_fromstring(string, create_parent='div')`` will
    wrap the content in a ``<div>``.

``fragments_fromstring(string)``:
    Returns a list of the elements found in the fragment.

``fromstring(string)``:
    Returns the result of ``document_fromstring`` or
    ``fragment_fromstring``, depending on whether the string looks
    like a full document or just a fragment.
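
As a rough illustration of how these functions differ, the following
sketch parses a few small strings.  The input strings are made up for
this example, and the variable names are only there so the results can
be inspected afterwards.

.. sourcecode:: python

    from lxml import etree
    from lxml.html import (document_fromstring, fragment_fromstring,
                           fragments_fromstring, fromstring)

    # document_fromstring() always builds a full document around the input.
    doc = document_fromstring('<p>Hello</p>')
    doc_tags = (doc.tag, doc.find('body') is not None)

    # fragment_fromstring() expects exactly one element ...
    single = fragment_fromstring('<p>Hello</p>')

    # ... and rejects several top-level elements unless a parent is created.
    try:
        fragment_fromstring('<b>bold</b> and <i>italic</i>')
        multiple_rejected = False
    except etree.ParserError:
        multiple_rejected = True
    wrapped = fragment_fromstring('<b>bold</b> and <i>italic</i>',
                                  create_parent='div')

    # fragments_fromstring() returns the top-level elements as a list.
    parts = fragments_fromstring('<b>bold</b> and <i>italic</i>')

    # fromstring() picks document or fragment parsing based on the input.
    guessed_fragment = fromstring('<p>Hello</p>')
    guessed_document = fromstring('<html><body><p>Hello</p></body></html>')

Here ``single`` and ``guessed_fragment`` are ``<p>`` elements, while
``doc`` and ``guessed_document`` are rooted at ``<html>``.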

Really broken pages
-------------------

The normal HTML parser is capable of handling broken HTML, but for
pages that are far enough from HTML to call them 'tag soup', it may
still fail to parse the page.  A way to deal with this is
ElementSoup_, which deploys the well-known BeautifulSoup_ parser to
build an lxml HTML tree.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _ElementSoup: elementsoup.html


HTML Element Methods
====================

HTML elements have all the methods that come with ElementTree, but
also include some extra methods:

``.drop_tree()``:
    Drops the element and all its children.  Unlike
    ``el.getparent().remove(el)`` this does *not* remove the tail
    text; with ``drop_tree`` the tail text is merged with the text of
    the previous element (or of the parent, if there is no previous
    sibling).

``.drop_tag()``:
    Drops the tag, but keeps its children and text.

``.find_class(class_name)``:
    Returns a list of all the elements with the given CSS class name.
    Note that class names are space separated in HTML, so
    ``doc.find_class('highlight')`` will find an element like
    ``<div class="sidebar highlight">``.  Class names *are* case
    sensitive.

``.find_rel_links(rel)``:
    Returns a list of all the ``<a rel="{rel}">`` elements.  E.g.,
    ``doc.find_rel_links('tag')`` returns all the links `marked as
    tags <http://microformats.org/wiki/rel-tag>`_.

``.get_element_by_id(id, default=None)``:
    Returns the element with the given ``id``, or the ``default`` if
    none is found.  If there are multiple elements with the same id
    (which there shouldn't be, but there often are), this returns only
    the first.

``.text_content()``:
    Returns the text content of the element, including the text
    content of its children, with no markup.

``.cssselect(expr)``:
    Selects elements from this element and its children, using a CSS
    selector expression.  (Note that ``.xpath(expr)`` is also
    available, as on all lxml elements.)

``.label``:
    Returns the corresponding ``<label>`` element for this element, if
    any exists (``None`` if there is none).  Label elements have a
    ``label.for_element`` attribute that points back to the element.

``.base_url``:
    The base URL for this element, if one was saved from the parsing.
    This attribute is not settable.  It is ``None`` when no base URL
    was saved.
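
The difference between ``drop_tree()``, ``drop_tag()`` and a plain
``remove()`` is easiest to see on a tiny fragment.  The following is a
minimal sketch; the markup in ``SOURCE`` is invented for illustration,
and the default for ``get_element_by_id`` is passed explicitly.

.. sourcecode:: python

    from lxml.html import fragment_fromstring, tostring

    SOURCE = '<p id="intro">one <b class="key big">two</b> three</p>'

    # drop_tree() removes the element but keeps its tail text (" three").
    dropped = fragment_fromstring(SOURCE)
    dropped.find('b').drop_tree()
    after_drop_tree = tostring(dropped)

    # A plain remove() takes the tail text away together with the element.
    removed = fragment_fromstring(SOURCE)
    removed.remove(removed.find('b'))
    after_remove = tostring(removed)

    # drop_tag() removes only the tag and keeps the text inside it.
    unwrapped = fragment_fromstring(SOURCE)
    unwrapped.find('b').drop_tag()
    after_drop_tag = tostring(unwrapped)

    # Lookups by class and id, plus markup-free text.
    para = fragment_fromstring(SOURCE)
    key_tags = [el.tag for el in para.find_class('key')]
    found = para.get_element_by_id('intro', None)
    missing = para.get_element_by_id('no-such-id', None)
    plain_text = para.text_content()

Note that ``find_class('key')`` matches the ``<b>`` element even though
its class attribute also contains ``big``.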

Running HTML doctests
=====================

One of the interesting modules in the ``lxml.html`` package deals with
doctests.  It can be hard to compare two HTML pages for equality, as
whitespace differences aren't meaningful and the structural formatting
can differ.  This is even more of a problem in doctests, where output
is tested for equality and small differences in whitespace or the
order of attributes can make a test fail.  And given the verbosity of
tag-based languages, it may take more than a quick look to find the
actual differences in the doctest output.

Luckily, lxml provides the ``lxml.doctestcompare`` module that
supports relaxed comparison of XML and HTML pages and provides a
readable diff in the output when a test fails.  The HTML comparison is
most easily used by importing the ``usedoctest`` module in a doctest:

.. sourcecode:: pycon

    >>> import lxml.html.usedoctest

Now, if you have an HTML document and want to compare it to an expected result
document in a doctest, you can do the following:

.. sourcecode:: pycon

    >>> import lxml.html
    >>> html = lxml.html.fromstring('''\
    ...    <html><body onload="" color="white">
    ...      <p>Hi  !</p>
    ...    </body></html>
    ... ''')

    >>> print lxml.html.tostring(html)
    <html><body onload="" color="white"><p>Hi !</p></body></html>

    >>> print lxml.html.tostring(html)
    <html> <body color="white" onload=""> <p>Hi    !</p> </body> </html>

    >>> print lxml.html.tostring(html)
    <html>
      <body color="white" onload="">
        <p>Hi !</p>
      </body>
    </html>

In documentation, you would likely prefer the pretty printed HTML output, as
it is the most readable.  However, the three documents are equivalent from the
point of view of an HTML tool, so the doctest will silently accept any of the
above.  This allows you to concentrate on readability in your doctests, even
if the real output is a straight ugly HTML one-liner.

Note that there is also an ``lxml.usedoctest`` module which you can
import for XML comparisons.  The HTML parser notably ignores
namespaces and some other XMLisms.


Creating HTML with the E-factory
================================

.. _`E-factory`: http://online.effbot.org/2006_11_01_archive.htm#et-builder

lxml.html comes with a predefined HTML vocabulary for the `E-factory`_,
originally written by Fredrik Lundh.  This allows you to quickly generate HTML
pages and fragments:

.. sourcecode:: pycon

    >>> import lxml.html
    >>> from lxml.html import builder as E
    >>> from lxml.html import usedoctest
    >>> html = E.HTML(
    ...   E.HEAD(
    ...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
    ...     E.TITLE("Best Page Ever")
    ...   ),
    ...   E.BODY(
    ...     E.H1(E.CLASS("heading"), "Top News"),
    ...     E.P("World News only on this page", style="font-size: 200%"),
    ...     "Ah, and here's some more text, by the way.",
    ...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
    ...   )
    ... )

    >>> print lxml.html.tostring(html)
    <html>
      <head>
        <link href="great.css" rel="stylesheet" type="text/css">
        <title>Best Page Ever</title>
      </head>
      <body>
        <h1 class="heading">Top News</h1>
        <p style="font-size: 200%">World News only on this page</p>
        Ah, and here's some more text, by the way.
        <p>... and this is a parsed fragment ...</p>
      </body>
    </html>

Note that you should use ``lxml.html.tostring`` and **not**
``lxml.etree.tostring``.  ``lxml.etree.tostring(doc)`` will return the
XML representation of the document, which is not valid HTML.  In
particular, things like ``<script src="..."></script>`` will be
serialized as ``<script src="..." />``, which completely confuses
browsers.
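
The effect of choosing the wrong serializer can be checked on a single
element.  This is a small sketch using a ``<script>`` element built
with the E-factory; the file name ``app.js`` is just a placeholder.

.. sourcecode:: python

    from lxml import etree
    import lxml.html
    from lxml.html import builder as E

    script = E.SCRIPT(src="app.js")

    # The HTML serializer keeps the explicit end tag ...
    as_html = lxml.html.tostring(script)

    # ... while the XML serializer collapses the empty element.
    as_xml = etree.tostring(script)

With these inputs, ``as_html`` ends in ``</script>`` and ``as_xml`` ends
in ``/>``.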

Viewing your HTML
-----------------

A handy method for viewing your HTML:
``lxml.html.open_in_browser(lxml_doc)`` will write the document to
disk and open it in a browser (with the `webbrowser module
<http://python.org/doc/current/lib/module-webbrowser.html>`_).

Working with links
==================

There are several methods on elements that allow you to see and modify
the links in a document.

``.iterlinks()``:
    This yields ``(element, attribute, link, pos)`` for every link in
    the document.  ``attribute`` may be None if the link is in the
    text (as will be the case with a ``<style>`` tag with
    ``@import``).  ``pos`` is the position of the link within the
    attribute value or text; it is usually 0.

    This finds any link in an ``action``, ``archive``, ``background``,
    ``cite``, ``classid``, ``codebase``, ``data``, ``href``,
    ``longdesc``, ``profile``, ``src``, ``usemap``, ``dynsrc``, or
    ``lowsrc`` attribute.  It also searches ``style`` attributes for
    ``url(link)``, and ``<style>`` tags for ``@import`` and ``url()``.

    This function does *not* pay attention to ``<base href>``.

``.resolve_base_href()``:
    This function will modify the document in-place to take account of
    ``<base href>`` if the document contains that tag.  In the process
    it will also remove that tag from the document.

``.make_links_absolute(base_href, resolve_base_href=True)``:
    This makes all links in the document absolute, assuming that
    ``base_href`` is the URL of the document.  So if you pass
    ``base_href="http://localhost/foo/bar.html"`` and there is a link
    to ``baz.html``, that will be rewritten as
    ``http://localhost/foo/baz.html``.

    If ``resolve_base_href`` is true, then any ``<base href>`` tag
    will be taken into account (by calling
    ``self.resolve_base_href()``).

``.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)``:
    This rewrites all the links in the document using your given link
    replacement function.  If you give a ``base_href`` value, all
    links will be passed in after they are joined with this URL.

    For each link ``link_repl_func(link)`` is called.  That function
    then returns the new link, or None to remove the attribute or tag
    that contains the link.  Note that all links will be passed in,
    including links like ``"#anchor"`` (which is purely internal), and
    things like ``"mailto:bob@example.com"`` (or ``javascript:...``).

    If you want access to the context of the link, you should use
    ``.iterlinks()`` instead.
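
The three link methods can be combined as in the following sketch.
The page content and the helpers ``current_links`` and ``to_https`` are
made up for this example; the base URL is the one used in the
description of ``make_links_absolute`` above.

.. sourcecode:: python

    from lxml.html import fromstring

    doc = fromstring('''<html><body>
      <a href="baz.html">relative</a>
      <img src="/img/logo.png">
      <a href="#top">anchor</a>
    </body></html>''')

    def current_links(document):
        # Collect (tag, attribute, link) for every link iterlinks() reports.
        return [(el.tag, attr, link)
                for el, attr, link, pos in document.iterlinks()]

    before = current_links(doc)

    # Resolve every link against the URL the page was (supposedly) loaded from.
    doc.make_links_absolute('http://localhost/foo/bar.html')
    absolute = [link for tag, attr, link in current_links(doc)]

    def to_https(link):
        # Replacement function: called once per link, returns the new link.
        return link.replace('http://', 'https://', 1)

    doc.rewrite_links(to_https)
    rewritten = [link for tag, attr, link in current_links(doc)]

After ``make_links_absolute``, even the purely internal ``#top`` link
has been joined with the base URL, which is worth keeping in mind when
rewriting links.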

Functions
---------

In addition to these methods, there are corresponding functions:

* ``iterlinks(html)``
* ``make_links_absolute(html, base_href, ...)``
* ``rewrite_links(html, link_repl_func, ...)``
* ``resolve_base_href(html)``

These functions will parse ``html`` if it is a string, then return the new
HTML as a string.  If you pass in a document, the document will be copied
(except for ``iterlinks()``), the method is performed on the copy, and the
new document is returned.
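
A short sketch of both calling styles, using ``make_links_absolute`` on
an invented one-line snippet:

.. sourcecode:: python

    import lxml.html

    snippet = '<a href="baz.html">a link</a>'
    base = 'http://localhost/foo/bar.html'

    # String in, string out: the input is parsed, changed and serialized.
    result = lxml.html.make_links_absolute(snippet, base)

    # Passing an element works on a copy, so the original stays untouched.
    element = lxml.html.fromstring(snippet)
    changed = lxml.html.make_links_absolute(element, base)

In this example ``element`` still points to ``baz.html`` afterwards,
while ``changed`` carries the absolute URL.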

Forms
=====

Any ``<form>`` elements in a document are available through
the list ``doc.forms`` (e.g., ``doc.forms[0]``).  Form, input, select,
and textarea elements each have special methods.

Input elements (including ``<select>`` and ``<textarea>``) have these
attributes:

``.name``:
    The name of the element.

``.value``:
    The value of an input, the content of a textarea, the selected
    option(s) of a select.  This attribute can be set.

    In the case of a select that takes multiple options (``<select
    multiple>``) this will be a set of the selected options; you can
    add or remove items to select and unselect the options.

Select attributes:

``.value_options``:
    For select elements, this is all the *possible* values (the values
    of all the options).

``.multiple``:
    For select elements, true if this is a ``<select multiple>``
    element.

Input attributes:

``.type``:
    The type attribute in ``<input>`` elements.

``.checkable``:
    True if this can be checked (i.e., true for type=radio and
    type=checkbox).

``.checked``:
    If this element is checkable, the checked state.  Raises
    AttributeError on non-checkable inputs.
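
The attributes above can be tried out on a small form.  This is only a
sketch: the form markup and the field names ``nick``, ``agree`` and
``color`` are invented for the example.

.. sourcecode:: python

    from lxml.html import fromstring

    page = fromstring('''<html><body><form>
      <input type="text" name="nick" value="bob">
      <input type="checkbox" name="agree" value="yes">
      <select name="color" multiple>
        <option value="red">Red</option>
        <option value="green" selected>Green</option>
        <option value="blue">Blue</option>
      </select>
    </form></body></html>''')
    form = page.forms[0]

    nick = form.inputs['nick']
    agree = form.inputs['agree']
    color = form.inputs['color']

    # Text inputs are not checkable; checkboxes are.
    checkable_flags = (nick.checkable, agree.checkable)
    agree.checked = True

    # A multiple select exposes its selection as a set-like object.
    all_colors = color.value_options
    color.value.add('blue')
    selected_colors = set(color.value)

Adding ``'blue'`` to ``color.value`` marks the matching ``<option>`` as
selected in the tree, so a later serialization reflects the change.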

The form itself has these attributes:

``.inputs``:
    A dictionary-like object that can be used to access input elements
    by name.  When there are multiple input elements with the same
    name, this returns list-like structures that can also be used to
    access the options and their values as a group.

``.fields``:
    A dictionary-like object used to access *values* by their name.
    While ``form.inputs`` returns elements, this only returns values.
    Setting values in this dictionary will affect the form inputs.
    Basically ``form.fields[x]`` is equivalent to
    ``form.inputs[x].value`` and ``form.fields[x] = y`` is equivalent
    to ``form.inputs[x].value = y``.  (Note that sometimes
    ``form.inputs[x]`` returns a compound object, but these objects
    also have ``.value`` attributes.)

    If you set this attribute, it is equivalent to
    ``form.fields.clear(); form.fields.update(new_value)``

``.form_values()``:
    Returns a list of ``[(name, value), ...]``, suitable to be passed
    to ``urllib.urlencode()`` for form submission.

``.action``:
    The ``action`` attribute.  This is resolved to an absolute URL if
    possible.

``.method``:
    The ``method`` attribute, which defaults to ``GET``.
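
The following sketch shows how these form attributes fit together.
The form markup and the ``example.com`` base URL are invented, and the
code is written for Python 3, where ``urlencode()`` lives in
``urllib.parse`` rather than in ``urllib``.

.. sourcecode:: python

    from urllib.parse import urlencode  # Python 3 home of urlencode()

    from lxml.html import fromstring

    page = fromstring('''<html><body>
    <form action="/search" method="post">
      <input type="text" name="q" value="lxml">
      <input type="checkbox" name="lang" value="py" checked>
      <input type="checkbox" name="lang" value="c">
      <input type="submit" name="go" value="Search">
    </form></body></html>''', base_url='http://example.com/index.html')
    form = page.forms[0]

    # Setting a field changes the value of the underlying input element.
    form.fields['q'] = 'lxml html'
    same_value = form.inputs['q'].value

    # Unchecked checkboxes and submit buttons are left out of form_values().
    pairs = form.form_values()
    query = urlencode(pairs)

Because a base URL was passed to the parser, ``form.action`` is resolved
against it instead of being returned as the relative ``/search``.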

Form Filling Example
--------------------

Note that you can change any of these attributes (values, method,
action, etc.) and then serialize the form to see the updated values.
You can, for instance, do:

.. sourcecode:: pycon

    >>> from lxml.html import fromstring, tostring
    >>> form_page = fromstring('''<html><body><form>
    ...   Your name: <input type="text" name="name"> <br>
    ...   Your phone: <input type="text" name="phone"> <br>
    ...   Your favorite pets: <br>
    ...   Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
    ...   Cats: <input type="checkbox" name="interest" value="cats"> <br>
    ...   Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
    ...   <input type="submit"></form></body></html>''')
    >>> form = form_page.forms[0]
    >>> form.fields = dict(
    ...     name='John Smith',
    ...     phone='555-555-3949',
    ...     interest=set(['cats', 'llamas']))
    >>> print tostring(form)
    <html>
      <body>
        <form>
        Your name:
          <input name="name" type="text" value="John Smith">
          <br>Your phone:
          <input name="phone" type="text" value="555-555-3949">
          <br>Your favorite pets:
          <br>Dogs:
          <input name="interest" type="checkbox" value="dogs">
          <br>Cats:
          <input checked name="interest" type="checkbox" value="cats">
          <br>Llamas:
          <input checked name="interest" type="checkbox" value="llamas">
          <br>
          <input type="submit">
        </form>
      </body>
    </html>


Form Submission
---------------

You can submit a form with ``lxml.html.submit_form(form_element)``.
This will return a file-like object (the result of
``urllib.urlopen()``).

If you have extra input values you want to pass, you can use the
keyword argument ``extra_values``, like ``extra_values={'submit':
'Yes!'}``.  This is the only way to get submit values into the form,
as there is no state of "submitted" for these elements.

You can pass in an alternate opener with the ``open_http`` keyword
argument, which is a function with the signature ``open_http(method,
url, values)``.

Example:

.. sourcecode:: pycon

    >>> from lxml.html import parse, submit_form
    >>> page = parse('http://tinyurl.com').getroot()
    >>> page.forms[1].fields['url'] = 'http://codespeak.net/lxml/'
    >>> result = parse(submit_form(page.forms[1])).getroot()
    >>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
    ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
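
The ``open_http`` hook also makes it possible to try out a submission
without any network access.  In this sketch, ``recording_open_http`` is
a hypothetical opener that only records what it was asked to send; the
form markup and the ``example.com`` URL are invented.

.. sourcecode:: python

    from lxml.html import fromstring, submit_form

    page = fromstring('''<html><body>
    <form action="/login" method="post">
      <input type="text" name="user" value="bob">
      <input type="submit" name="submit" value="Yes!">
    </form></body></html>''', base_url='http://example.com/')

    calls = []

    def recording_open_http(method, url, values):
        # Stand-in for a real opener: remember the request, send nothing.
        calls.append((method, url, list(values)))

    submit_form(page.forms[0],
                extra_values={'submit': 'Yes!'},
                open_http=recording_open_http)

The recorded values contain the ``submit`` entry only because it was
passed through ``extra_values``; the submit button itself contributes
nothing.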

Cleaning up HTML
================

The module ``lxml.html.clean`` provides a ``Cleaner`` class for cleaning up
HTML pages.  It supports removing embedded or script content, special tags,
CSS style annotations and much more.

Say you have an evil web page from an untrusted source that contains lots of
content that upsets browsers and tries to run evil code on the client side:

.. sourcecode:: pycon

    >>> html = '''\
    ... <html>
    ...  <head>
    ...    <script type="text/javascript" src="evil-site"></script>
    ...    <link rel="alternate" type="text/rss" src="evil-rss">
    ...    <style>
    ...      body {background-image: url(javascript:do_evil)};
    ...      div {color: expression(evil)};
    ...    </style>
    ...  </head>
    ...  <body onload="evil_function()">
    ...    <!-- I am interpreted for EVIL! -->
    ...    <a href="javascript:evil_function()">a link</a>
    ...    <a href="#" onclick="evil_function()">another link</a>
    ...    <p onclick="evil_function()">a paragraph</p>
    ...    <div style="display: none">secret EVIL!</div>
    ...    <object> of EVIL! </object>
    ...    <iframe src="evil-site"></iframe>
    ...    <form action="evil-site">
    ...      Password: <input type="password" name="password">
    ...    </form>
    ...    <blink>annoying EVIL!</blink>
    ...    <a href="evil-site">spam spam SPAM!</a>
    ...    <image src="evil!">
    ...  </body>
    ... </html>'''

To remove all the suspicious content from this unparsed document, use the
``clean_html`` function:

.. sourcecode:: pycon

    >>> from lxml.html.clean import clean_html

    >>> print clean_html(html)
    <html>
      <body>
        <div>
          <style>/* deleted */</style>
          <a href="">a link</a>
          <a href="#">another link</a>
          <p>a paragraph</p>
          <div>secret EVIL!</div>
          of EVIL!
          Password:
          annoying EVIL!
          <a href="evil-site">spam spam SPAM!</a>
          <img src="evil!">
        </div>
      </body>
    </html>

The ``Cleaner`` class supports several keyword arguments to control exactly
which content is removed:

.. sourcecode:: pycon

    >>> from lxml.html.clean import Cleaner

    >>> cleaner = Cleaner(page_structure=False, links=False)
    >>> print cleaner.clean_html(html)
    <html>
      <head>
        <link rel="alternate" src="evil-rss" type="text/rss">
        <style>/* deleted */</style>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
    ...                   page_structure=False, safe_attrs_only=False)

    >>> print cleaner.clean_html(html)
    <html>
      <head>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.

See the docstring of ``Cleaner`` for the details of what can be
cleaned.


autolink
--------

In addition to cleaning up malicious HTML, ``lxml.html.clean``
contains functions to do other things to your HTML.  This includes
autolinking::

   autolink(doc, ...)

   autolink_html(html, ...)

This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor.  It avoids making bad links.

Links in the elements ``<textarea>``, ``<pre>``, ``<code>``, and
anything in the head of the document are not autolinked.  You can pass
in a list of elements to avoid in ``avoid_elements=['textarea', ...]``.

Links to some hosts can be avoided.  By default links to
``localhost*``, ``example.*`` and ``127.0.0.1`` are not
autolinked.  Pass in ``avoid_hosts=[list_of_regexes]`` to control
this.

Elements with the ``nolink`` CSS class are not autolinked.  Pass
in ``avoid_classes=['code', ...]`` to control this.

The ``autolink_html()`` version of the function parses the HTML
string first, and returns a string.


wordwrap
--------

You can also wrap long words in your HTML::

   word_break(doc, max_width=40, ...)

   word_break_html(html, ...)

This finds any long words in the text of the document and inserts
``&#8203;`` in the document (which is the Unicode zero-width space).

This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
You can control this with ``avoid_elements=['textarea', ...]``.

It also avoids elements with the CSS class ``nobreak``.  You can
control this with ``avoid_classes=['code', ...]``.

Lastly, you can control the character that is inserted with
``break_character=u'\u200b'``.  However, you cannot insert markup,
only text.

``word_break_html(html)`` parses the HTML document and returns a
string.
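
The idea behind word breaking can be illustrated on plain text without
lxml.  The sketch below is a simplified stand-in, not lxml's
implementation: ``break_long_words`` is a hypothetical helper that only
handles a text string and knows nothing about elements or CSS classes.

.. sourcecode:: python

    import re

    def break_long_words(text, max_width=40, break_character=u'\u200b'):
        """Insert break_character into every run of more than max_width
        non-whitespace characters, so that no piece exceeds max_width."""
        def split(match):
            word = match.group(0)
            pieces = [word[i:i + max_width]
                      for i in range(0, len(word), max_width)]
            return break_character.join(pieces)

        long_word = re.compile(r'\S{%d,}' % (max_width + 1))
        return long_word.sub(split, text)

    # A visible break character makes the effect easy to see.
    example = break_long_words('see aaaaaaaaaa now',
                               max_width=4, break_character='|')

With the default zero-width space the result looks unchanged when
rendered, but a browser is then free to wrap the long word at the
inserted positions.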

HTML Diff
=========

The module ``lxml.html.diff`` offers some ways to visualize
differences in HTML documents.  These differences are *content*
oriented.  That is, changes in markup are largely ignored; only
changes in the content itself are highlighted.

There are two ways to view differences: ``htmldiff`` and
``html_annotate``.  The first shows differences with ``<ins>`` and
``<del>``, while the second annotates a set of changes similar to
``svn blame``.  Both these functions operate on text, and work best
with content fragments (only what goes in ``<body>``), not complete
documents.

Example of ``htmldiff``:

.. sourcecode:: pycon

    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> doc1 = '''<p>Here is some text.</p>'''
    >>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
    >>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
    >>> print htmldiff(doc1, doc2)
    <p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
    >>> print html_annotate([(doc1, 'author1'), (doc2, 'author2'),
    ...                      (doc3, 'author3')])
    <p><span title="author1">Here is</span>
       <b><span title="author2">a</span>
       <span title="author3">little</span></b>
       <i><span title="author2">text</span></i>
       <span title="author2">.</span></p>

As you can see, it is imperfect, as such things tend to be.  On larger
tracts of text with larger edits it will generally do better.

The ``html_annotate`` function can also take an optional second
argument, ``markup``.  This is a function like ``markup(text,
version)`` that returns the given text marked up with the given
version.  The default markup function, the output of which you see in
the example, looks like:

.. sourcecode:: python

    import cgi

    def default_markup(text, version):
        return '<span title="%s">%s</span>' % (
            cgi.escape(unicode(version), 1), text)
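
A custom markup function can be passed in the same way.  In this
sketch, ``class_markup`` is a hypothetical alternative that marks each
run of text with a CSS class instead of a ``title`` attribute; the two
document versions and the author names are invented.

.. sourcecode:: python

    from lxml.html.diff import html_annotate

    def class_markup(text, version):
        # Hypothetical replacement for the default markup function.
        # Unlike the default, it does not escape the version string.
        return '<span class="by-%s">%s</span>' % (version, text)

    versions = [
        ('<p>Hello world</p>', 'alice'),
        ('<p>Hello big world</p>', 'bob'),
    ]
    annotated = html_annotate(versions, markup=class_markup)

The resulting string attributes the text carried over from the first
version to ``alice`` and the inserted word to ``bob``.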

Examples
========

Microformat Example
-------------------

This example parses the `hCard <http://microformats.org/wiki/hcard>`_
microformat.

First we get the page:

.. sourcecode:: pycon

    >>> import urllib
    >>> from lxml.html import fromstring
    >>> url = 'http://microformats.org/'
    >>> content = urllib.urlopen(url).read()
    >>> doc = fromstring(content)
    >>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

.. sourcecode:: pycon

    >>> class Card(object):
    ...     def __init__(self, **kw):
    ...         for name, value in kw.items():
    ...             setattr(self, name, value)
    >>> class Phone(object):
    ...     def __init__(self, phone, types=()):
    ...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

.. sourcecode:: pycon

    >>> def get_text(el, class_name):
    ...     els = el.find_class(class_name)
    ...     if els:
    ...         return els[0].text_content()
    ...     else:
    ...         return ''
    >>> def get_value(el):
    ...     return get_text(el, 'value') or el.text_content()
    >>> def get_all_texts(el, class_name):
    ...     return [e.text_content() for e in el.find_class(class_name)]
    >>> def parse_addresses(el):
    ...     # Ideally this would parse street, etc.
    ...     return el.find_class('adr')

Then the parsing:

.. sourcecode:: pycon

    >>> for el in doc.find_class('hcard'):
    ...     card = Card()
    ...     card.el = el
    ...     card.fn = get_text(el, 'fn')
    ...     card.tels = []
    ...     for tel_el in el.find_class('tel'):
    ...         card.tels.append(Phone(get_value(tel_el),
    ...                                get_all_texts(tel_el, 'type')))
    ...     card.addresses = parse_addresses(el)
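
Since the example above depends on a live web page, here is a
self-contained variant that runs on an inline string.  The ``SAMPLE``
markup is invented for illustration and uses the ``hcard`` class name
that the loop above looks for; plain dictionaries and tuples stand in
for the ``Card`` and ``Phone`` classes.

.. sourcecode:: python

    from lxml.html import fromstring

    SAMPLE = '''
    <div>
      <div class="hcard">
        <span class="fn">Jane Doe</span>
        <span class="tel"><span class="type">work</span>
          <span class="value">555-0100</span></span>
        <span class="tel">555-0199</span>
      </div>
    </div>'''

    def get_text(el, class_name):
        # Text of the first descendant with the given class, or ''.
        els = el.find_class(class_name)
        return els[0].text_content() if els else ''

    def get_value(el):
        # Prefer an explicit "value" child, fall back to the whole text.
        return get_text(el, 'value') or el.text_content()

    cards = []
    for el in fromstring(SAMPLE).find_class('hcard'):
        tels = []
        for tel_el in el.find_class('tel'):
            types = [t.text_content() for t in tel_el.find_class('type')]
            tels.append((get_value(tel_el).strip(), types))
        cards.append({'fn': get_text(el, 'fn'), 'tels': tels})

The second phone number has no ``value`` child, so ``get_value`` falls
back to the full text of the ``tel`` element.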