====================
BeautifulSoup Parser
====================

BeautifulSoup_ is a Python package that parses broken HTML, as lxml
does with the parser of libxml2.  BeautifulSoup uses a different
parsing approach.  It is not a real HTML parser but uses regular
expressions to dive through tag soup.  It is therefore more forgiving
in some cases and less good in others.  It is not uncommon that
lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup
has superior `support for encoding detection`_.  Which parser works
better depends very much on the input.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode%2C%20Dammit
.. _ElementSoup: http://effbot.org/zone/element-soup.htm

To prevent users from having to choose their parser library in
advance, lxml can interface to the parsing capabilities of
BeautifulSoup through the ``lxml.html.soupparser`` module.  It
provides three main functions: ``fromstring()`` and ``parse()`` to
parse a string or file using BeautifulSoup into an ``lxml.html``
document, and ``convert_tree()`` to convert an existing BeautifulSoup
tree into a list of top-level Elements.
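
As a rough sketch of how the three entry points relate, the following
assumes Python 3, the ``bs4`` package (BeautifulSoup 4) and an lxml
release whose ``soupparser`` accepts it.  The sample markup is made up
for illustration.

.. sourcecode:: python

    import io

    from bs4 import BeautifulSoup
    from lxml.html import soupparser

    markup = '<p>one</p><p>two'  # made-up sample input

    # fromstring() parses a string and returns the root Element.
    root = soupparser.fromstring(markup)

    # parse() takes a file name or file-like object and returns an
    # ElementTree.
    tree = soupparser.parse(io.StringIO(markup))

    # convert_tree() takes a tree that BeautifulSoup has already built
    # and returns a list of its top-level Elements.
    soup = BeautifulSoup(markup, 'html.parser')
    elements = soupparser.convert_tree(soup)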

.. contents::
..
   1  Parsing with the soupparser
   2  Entity handling
   3  Using soupparser as a fallback
   4  Using only the encoding detection


Parsing with the soupparser
===========================

The functions ``fromstring()`` and ``parse()`` behave as they do in
ElementTree.  The former returns a root Element, the latter returns an
ElementTree.

There is also a legacy module called ``lxml.html.ElementSoup``, which
mimics the interface provided by ElementTree's own ElementSoup_
module.  Note that the ``soupparser`` module was added in lxml 2.0.3.
Previous versions of lxml 2.0.x only have the ``ElementSoup`` module.

Here is a document full of tag soup, similar to, but not quite like, HTML:

.. sourcecode:: pycon

    >>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

All you need to do is pass it to the ``fromstring()`` function:

.. sourcecode:: pycon

    >>> from lxml.html.soupparser import fromstring
    >>> root = fromstring(tag_soup)

To see what we have here, you can serialise it:

.. sourcecode:: pycon

    >>> from lxml.etree import tostring
    >>> print tostring(root, pretty_print=True),
    <html>
      <meta/>
      <head>
        <title>Hello</title>
      </head>
      <body onload="crash()">Hi all<p/></body>
    </html>

Not quite what you'd expect from an HTML page, but, well, it was broken
already, right?  BeautifulSoup did its best, and so now it's a tree.

To control which Element implementation is used, you can pass a
``makeelement`` factory function to ``parse()`` and ``fromstring()``.
By default, this is based on the HTML parser defined in ``lxml.html``.

For a quick comparison, libxml2 2.6.32 parses the same tag soup as
follows.  The main difference is that libxml2 tries harder to adhere
to the structure of an HTML document and moves misplaced tags where
they (likely) belong.  Note, however, that the result can vary between
parser versions.

.. sourcecode:: html

    <html>
      <head>
        <meta/>
        <title>Hello</title>
      </head>
      <body>
        <p>Hi all</p>
        <p/>
      </body>
    </html>

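To reproduce this comparison with whatever libxml2 version is
installed, the same tag soup can be passed to lxml's own HTML parser.
This is a minimal sketch, assuming Python 3.  No expected output is
shown because, as noted above, it depends on the parser version.

.. sourcecode:: python

    import lxml.html
    from lxml.etree import tostring

    tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

    # document_fromstring() returns the <html> root built by libxml2.
    root = lxml.html.document_fromstring(tag_soup)

    # Serialise to text; the exact structure depends on the libxml2
    # version in use.
    print(tostring(root, pretty_print=True, encoding='unicode'))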

Entity handling
===============

By default, the BeautifulSoup parser also replaces the entities it
finds with their character equivalents.

.. sourcecode:: pycon

    >>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;<p>'
    >>> body = fromstring(tag_soup).find('.//body')
    >>> body.text
    u'\xa9\u20ac-\xf5\u01bd'

If you want them back on the way out, you can just serialise with the
default encoding, which is 'US-ASCII'.

.. sourcecode:: pycon

    >>> tostring(body)
    '<body>&#169;&#8364;-&#245;&#445;<p/></body>'

    >>> tostring(body, method="html")
    '<body>&#169;&#8364;-&#245;&#445;<p></p></body>'

Any other encoding will output the respective byte sequences.

.. sourcecode:: pycon

    >>> tostring(body, encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd<p/></body>'

    >>> tostring(body, method="html", encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd<p></p></body>'

    >>> tostring(body, encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd<p/></body>'

    >>> tostring(body, method="html", encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd<p></p></body>'

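The same round trip can be mimicked without any parser, which may help
to see what happens to the characters.  This is a minimal sketch,
assuming Python 3 and using only the standard library
(``html.unescape()`` and ``str.encode()``).  It imitates the behaviour
shown above rather than calling lxml.

.. sourcecode:: python

    import html

    # Replace named and numeric entities with the characters they
    # stand for.
    text = html.unescape('&copy;&euro;&#45;&#245;&#445;')

    # Encoding to ASCII with "xmlcharrefreplace" turns every non-ASCII
    # character back into a numeric character reference.
    ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')

    # An encoding that covers the characters emits their byte
    # sequences instead.
    utf8_bytes = text.encode('utf-8')

Here ``ascii_bytes`` holds the same character references as the
``tostring(body)`` output, and ``utf8_bytes`` holds the same byte
sequence as the ``encoding="utf-8"`` output.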

Using soupparser as a fallback
==============================

The downside of using this parser is that it is `much slower`_ than
the HTML parser of lxml.  So if performance matters, you might want to
consider using ``soupparser`` only as a fallback for certain cases.

.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

One common problem with lxml's parser is that it might not get the
encoding right when the document contains a ``<meta>`` tag in the
wrong place.  In this case, you can exploit the fact that lxml
serialises much faster than most other HTML libraries for Python.
Just serialise the document to unicode and if that gives you an
exception, re-parse it with BeautifulSoup to see if that works
better.

.. sourcecode:: pycon

    >>> tag_soup = '''\
    ... <meta http-equiv="Content-Type"
    ...       content="text/html;charset=utf-8" />
    ... <html>
    ...   <head>
    ...     <title>Hello W\xc3\xb6rld!</title>
    ...   </head>
    ...   <body>Hi all</body>
    ... </html>'''

    >>> import lxml.html
    >>> import lxml.html.soupparser

    >>> root = lxml.html.fromstring(tag_soup)
    >>> try:
    ...     ignore = tostring(root, encoding=unicode)
    ... except UnicodeDecodeError:
    ...     root = lxml.html.soupparser.fromstring(tag_soup)

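The same pattern can be wrapped in a small helper.  This is a sketch
assuming Python 3; the name ``parse_with_fallback`` is hypothetical,
and whether serialisation actually raises for a given document depends
on the lxml and libxml2 versions in use.

.. sourcecode:: python

    import lxml.html
    from lxml.etree import tostring


    def parse_with_fallback(data):
        """Parse with lxml; re-parse with BeautifulSoup if decoding fails.

        Hypothetical helper, not part of lxml.
        """
        root = lxml.html.fromstring(data)
        try:
            # Serialising to text exposes content that was decoded wrongly.
            tostring(root, encoding='unicode')
        except UnicodeDecodeError:
            # Import the slow parser only when it is actually needed.
            from lxml.html import soupparser
            root = soupparser.fromstring(data)
        return root

Importing ``soupparser`` inside the ``except`` branch keeps
BeautifulSoup out of the fast path, at the cost of a late import error
if it is not installed.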

Using only the encoding detection
=================================

If you prefer a 'real' (and fast) HTML parser instead of the regular
expression based one in BeautifulSoup, you can still benefit from
BeautifulSoup's `support for encoding detection`_ in the
``UnicodeDammit`` class.

.. sourcecode:: pycon

    >>> from BeautifulSoup import UnicodeDammit

    >>> def decode_html(html_string):
    ...     converted = UnicodeDammit(html_string, isHTML=True)
    ...     if not converted.unicode:
    ...         # UnicodeDecodeError requires five arguments, so raise
    ...         # a plain ValueError (its base class) with the message.
    ...         raise ValueError(
    ...             "Failed to detect encoding, tried [%s]"
    ...             % ', '.join(converted.triedEncodings))
    ...     # print converted.originalEncoding
    ...     return converted.unicode

    >>> root = lxml.html.fromstring(decode_html(tag_soup))
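
With BeautifulSoup 4, the class lives in the ``bs4`` package and its
attribute names differ.  This is a minimal sketch, assuming Python 3
and ``bs4``; the sample bytes are made up for illustration.

.. sourcecode:: python

    from bs4 import UnicodeDammit


    def decode_html(html_bytes):
        """Return the decoded text of an HTML byte string."""
        converted = UnicodeDammit(html_bytes, is_html=True)
        if not converted.unicode_markup:
            # tried_encodings may hold tuples, so format each entry.
            tried = ', '.join(str(enc) for enc in converted.tried_encodings)
            raise ValueError(
                'Failed to detect encoding, tried [%s]' % tried)
        return converted.unicode_markup


    # Made-up sample: UTF-8 bytes with a matching <meta> declaration.
    html_text = decode_html(
        b'<meta charset="utf-8"><title>Hello W\xc3\xb6rld!</title>')

The returned text can then be passed to ``lxml.html.fromstring()`` as
in the example above.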