doc/api.txt

   1 ===========================
   2 APIs specific to lxml.etree
   3 ===========================
   4
   5 lxml.etree tries to follow established APIs wherever possible.  Sometimes,
   6 however, the need to expose a feature in an easy way led to the invention of a
   7 new API.  This page describes the major differences and a few additions to the
   8 main ElementTree API.
   9
  10 For a complete reference of the API, see the `generated API
  11 documentation`_.
  12
  13 Separate pages describe the support for `parsing XML`_, executing `XPath and
  14 XSLT`_, `validating XML`_ and interfacing with other XML tools through the
  15 `SAX-API`_.
  16
  17 lxml is extremely extensible through `XPath functions in Python`_, custom
  18 `Python element classes`_, custom `URL resolvers`_ and even `at the C-level`_.
  19
  20 .. _`parsing XML`: parsing.html
  21 .. _`XPath and XSLT`: xpathxslt.html
  22 .. _`validating XML`: validation.html
  23 .. _`SAX-API`: sax.html
  24 .. _`XPath functions in Python`: extensions.html
  25 .. _`Python element classes`: element_classes.html
  26 .. _`at the C-level`: capi.html
  27 .. _`URL resolvers`: resolvers.html
  28 .. _`generated API documentation`: api/index.html
  29
  30
  31 .. contents::
  32 ..
  33    1   lxml.etree
  34    2   Other Element APIs
  35    3   Trees and Documents
  36    4   Iteration
  37    5   Error handling on exceptions
  38    6   Error logging
  39    7   Serialisation
  40    8   CDATA
  41    9   XInclude and ElementInclude
  42    10  write_c14n on ElementTree
  43
  44 ..
  45   >>> try: from StringIO import StringIO
  46   ... except ImportError:
  47   ...    from io import BytesIO
  48   ...    def StringIO(s=None):
  49   ...        if isinstance(s, str): s = s.encode("UTF-8")
  50   ...        return BytesIO(s)
  51
  52   >>> try: from collections import deque
  53   ... except ImportError:
  54   ...    class deque(list):
  55   ...        def popleft(self): return self.pop(0)
  56
  57   >>> try: unicode = __builtins__["unicode"]
  58   ... except (NameError, KeyError): unicode = str
  59
  60
  61 lxml.etree
  62 ----------
  63
  64 lxml.etree tries to follow the `ElementTree API`_ wherever it can.  There are,
  65 however, some incompatibilities (see `compatibility`_).  The extensions are
  66 documented here.
  67
  68 .. _`ElementTree API`: http://effbot.org/zone/element-index.htm
  69 .. _`compatibility`:   compatibility.html
  70
  71 If you need to know which version of lxml is installed, you can access the
  72 ``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple.  Note,
  73 however, that it did not exist before version 1.0, so you will get an
  74 AttributeError in older versions.  The versions of libxml2 and libxslt are
  75 available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
  76
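As a minimal sketch of such a version check (the fallback tuple ``(0, 0, 0)``
and the ``(2, 0)`` threshold are illustrative assumptions, not part of lxml):

```python
from lxml import etree

# LXML_VERSION is missing before lxml 1.0, so fall back to a placeholder tuple.
lxml_version = getattr(etree, "LXML_VERSION", (0, 0, 0))

# Version tuples compare element-wise, which keeps feature checks short.
is_lxml_2_or_later = lxml_version >= (2, 0)

print("lxml:   ", lxml_version)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```

Comparing tuples rather than version strings avoids pitfalls such as
``"10" < "9"`` in string comparisons.
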
  77 The following examples usually assume this to be executed first:
  78
  79 .. sourcecode:: pycon
  80
  81   >>> from lxml import etree
  82
  83 ..
  84   >>> import sys
  85   >>> from lxml import etree as _etree
  86   >>> if sys.version_info[0] >= 3:
  87   ...   class etree_mock(object):
  88   ...     def __getattr__(self, name): return getattr(_etree, name)
  89   ...     def tostring(self, *args, **kwargs):
  90   ...       s = _etree.tostring(*args, **kwargs)
  91   ...       if isinstance(s, bytes) and bytes([10]) in s: s = s.decode("utf-8") # CR
  92   ...       if s[-1] == '\n': s = s[:-1]
  93   ...       return s
  94   ... else:
  95   ...   class etree_mock(object):
  96   ...     def __getattr__(self, name): return getattr(_etree, name)
  97   ...     def tostring(self, *args, **kwargs):
  98   ...       s = _etree.tostring(*args, **kwargs)
  99   ...       if s[-1] == '\n': s = s[:-1]
 100   ...       return s
 101   >>> etree = etree_mock()
 102
 103
 104 Other Element APIs
 105 ------------------
 106
 107 While lxml.etree itself uses the ElementTree API, it is possible to replace
 108 the Element implementation with `custom element subclasses`_.  This has been
 109 used to implement well-known XML APIs on top of lxml.  For example, lxml ships
 110 with a data-binding implementation called `objectify`_, which is similar to
 111 the `Amara bindery`_ tool.
 112
 113 lxml.etree comes with a number of `different lookup schemes`_ to customize the
 114 mapping between libxml2 nodes and the Element classes used by lxml.etree.
 115
 116 .. _`custom element subclasses`: element_classes.html
 117 .. _`objectify`: objectify.html
 118 .. _`different lookup schemes`: element_classes.html#setting-up-a-class-lookup-scheme
 119 .. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/
 120
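The following sketch shows one of these lookup schemes, the default class
lookup, replacing the Element class for a single parser.  The class name
``TitledElement`` and its ``title`` property are hypothetical and only serve
the illustration:

```python
from lxml import etree

class TitledElement(etree.ElementBase):
    """Hypothetical element class that exposes the 'title' attribute."""
    @property
    def title(self):
        return self.get("title", "")

# Use TitledElement for every element node created by this parser.
lookup = etree.ElementDefaultClassLookup(element=TitledElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.XML('<doc title="Intro"><p/></doc>', parser)
print(type(root).__name__, repr(root.title), repr(root[0].title))
```

The lookup is configured per parser, so documents parsed with other parsers
keep using the default Element class.
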
 121
 122 Trees and Documents
 123 -------------------
 124
 125 Compared to the original ElementTree API, lxml.etree has an extended tree
 126 model.  It knows about parents and siblings of elements:
 127
 128 .. sourcecode:: pycon
 129
 130   >>> root = etree.Element("root")
 131   >>> a = etree.SubElement(root, "a")
 132   >>> b = etree.SubElement(root, "b")
 133   >>> c = etree.SubElement(root, "c")
 134   >>> d = etree.SubElement(root, "d")
 135   >>> e = etree.SubElement(d,    "e")
 136   >>> b.getparent() == root
 137   True
 138   >>> print(b.getnext().tag)
 139   c
 140   >>> print(c.getprevious().tag)
 141   b
 142
 143 Elements always live within a document context in lxml.  This implies that
 144 there is also a notion of an absolute document root.  You can retrieve an
 145 ElementTree for the root node of a document from any of its elements.
 146
 147 .. sourcecode:: pycon
 148
 149   >>> tree = d.getroottree()
 150   >>> print(tree.getroot().tag)
 151   root
 152
 153 Note that this is different from wrapping an Element in an ElementTree.  You
 154 can use ElementTrees to create XML trees with an explicit root node:
 155
 156 .. sourcecode:: pycon
 157
 158   >>> tree = etree.ElementTree(d)
 159   >>> print(tree.getroot().tag)
 160   d
 161   >>> etree.tostring(tree)
 162   b'<d><e/></d>'
 163
 164 ElementTree objects are serialised as complete documents, including
 165 preceding or trailing processing instructions and comments.
 166
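A short sketch of the difference, assuming a document with comments on both
sides of the root element:

```python
from lxml import etree

root = etree.XML("<!-- before --><root><child/></root><!-- after -->")

# Serialising the Element covers only the element and its subtree.
element_bytes = etree.tostring(root)

# Serialising its ElementTree covers the whole document, including the
# comments that sit outside the root element.
document_bytes = etree.tostring(root.getroottree())

print(element_bytes)
print(document_bytes)
```
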
 167 All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
 168 will understand the explicitly chosen root as root node of a document.  They
 169 will not see any elements outside the ElementTree.  However, ElementTrees do
 170 not modify their Elements:
 171
 172 .. sourcecode:: pycon
 173
 174   >>> element = tree.getroot()
 175   >>> print(element.tag)
 176   d
 177   >>> print(element.getparent().tag)
 178   root
 179   >>> print(element.getroottree().getroot().tag)
 180   root
 181
 182 The rule is that all operations that are applied to Elements use either the
 183 Element itself as reference point, or the absolute root of the document that
 184 contains this Element (e.g. for absolute XPath expressions).  All operations
 185 on an ElementTree use its explicit root node as reference.
 186
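The rule can be sketched with XPath.  This example builds its own small tree
and is meant as an illustration of the reference points described above, not
as a complete description of the XPath support:

```python
from lxml import etree

root = etree.XML("<root><a/><d><e/></d></root>")
d = root.find("d")
e = d.find("e")

# On an Element, an absolute path starts at the real document root ...
from_element = e.xpath("/root/d/e")

# ... and a relative path starts at the Element itself.
relative = d.xpath("e")

# On an ElementTree, absolute paths start at the tree's explicit root.
subtree = etree.ElementTree(d)
inside = subtree.xpath("/d/e")
outside = subtree.xpath("/root")

print(len(from_element), len(relative), len(inside), len(outside))
```
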
 187
 188 Iteration
 189 ---------
 190
 191 The ElementTree API makes Elements iterable to support iteration over their
 192 children.  Using the tree defined above, we get:
 193
 194 .. sourcecode:: pycon
 195
 196   >>> [ child.tag for child in root ]
 197   ['a', 'b', 'c', 'd']
 198
 199 To iterate in the opposite direction, use the built-in ``reversed()`` function
 200 (available since Python 2.4).
 201
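A minimal sketch of reverse iteration, using a small tree built just for this
example:

```python
from lxml import etree

root = etree.XML("<root><a/><b/><c/><d/></root>")

# reversed() walks the children from last to first.
backwards = [child.tag for child in reversed(root)]
print(backwards)
```
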
 202 Tree traversal should use the ``element.iter()`` method:
 203
 204 .. sourcecode:: pycon
 205
 206   >>> [ el.tag for el in root.iter() ]
 207   ['root', 'a', 'b', 'c', 'd', 'e']
 208
 209 lxml.etree also supports this, but additionally features an extended API for
 210 iteration over the children, following/preceding siblings, ancestors and
 211 descendants of an element, as defined by the respective XPath axis:
 212
 213 .. sourcecode:: pycon
 214
 215   >>> [ child.tag for child in root.iterchildren() ]
 216   ['a', 'b', 'c', 'd']
 217   >>> [ child.tag for child in root.iterchildren(reversed=True) ]
 218   ['d', 'c', 'b', 'a']
 219   >>> [ sibling.tag for sibling in b.itersiblings() ]
 220   ['c', 'd']
 221   >>> [ sibling.tag for sibling in c.itersiblings(preceding=True) ]
 222   ['b', 'a']
 223   >>> [ ancestor.tag for ancestor in e.iterancestors() ]
 224   ['d', 'root']
 225   >>> [ el.tag for el in root.iterdescendants() ]
 226   ['a', 'b', 'c', 'd', 'e']
 227
 228 Note how ``element.iterdescendants()`` does not include the element
 229 itself, as opposed to ``element.iter()``.  The latter effectively
 230 implements the 'descendant-or-self' axis in XPath.
 231
 232 All of these iterators support an additional ``tag`` keyword argument that
 233 filters the generated elements by tag name:
 234
 235 .. sourcecode:: pycon
 236
 237   >>> [ child.tag for child in root.iterchildren(tag='a') ]
 238   ['a']
 239   >>> [ child.tag for child in d.iterchildren(tag='a') ]
 240   []
 241   >>> [ el.tag for el in root.iterdescendants(tag='d') ]
 242   ['d']
 243   >>> [ el.tag for el in root.iter(tag='d') ]
 244   ['d']
 245
 246 The most common way to traverse an XML tree is depth-first, which
 247 traverses the tree in document order.  This is implemented by the
 248 ``.iter()`` method.  While there is no dedicated method for
 249 breadth-first traversal, it is almost as simple if you use the
 250 ``collections.deque`` type from Python 2.4.
 251
 252 .. sourcecode:: pycon
 253
 254     >>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
 255     >>> print(etree.tostring(root, pretty_print=True, encoding=unicode))
 256     <root>
 257       <a>
 258         <b/>
 259         <c/>
 260       </a>
 261       <d>
 262         <e/>
 263       </d>
 264     </root>
 265
 266     >>> queue = deque([root])
 267     >>> while queue:
 268     ...    el = queue.popleft()  # pop next element
 269     ...    queue.extend(el)      # append its children
 270     ...    print(el.tag)
 271     root
 272     a
 273     d
 274     b
 275     c
 276     e
 277
 278 See also the section on the utility functions ``iterparse()`` and
 279 ``iterwalk()`` in the `parser documentation`_.
 280
 281 .. _`parser documentation`: parsing.html#iterparse-and-iterwalk
 282
 283
 284 Error handling on exceptions
 285 ----------------------------
 286
 287 Libxml2 provides error messages for failures, be it during parsing, XPath
 288 evaluation or schema validation.  The preferred way of accessing them is
 289 through the local ``error_log`` property of the respective evaluator or
 290 transformer object.  See their documentation for details.
 291
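As a sketch of the local ``error_log`` property, here is a validator used on
an invalid document.  The schema and the element name ``count`` are made up
for the example:

```python
from lxml import etree

schema = etree.XMLSchema(etree.XML("""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="count" type="xsd:integer"/>
</xsd:schema>"""))

doc = etree.XML("<count>many</count>")

# validate() returns a bool instead of raising, so the details live in the
# validator's own error log.
is_valid = schema.validate(doc)
for entry in schema.error_log:
    print(entry.line, entry.domain_name, entry.type_name, entry.message)
```
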
 292 However, lxml also keeps a global error log of all errors that occurred at the
 293 application level.  Whenever an exception is raised, you can retrieve the
 294 errors that occurred and "might have" led to the problem from the error log
 295 copy attached to the exception:
 296
 297 .. sourcecode:: pycon
 298
 299   >>> etree.clear_error_log()
 300   >>> broken_xml = '''
 301   ... <root>
 302   ...   <a>
 303   ... </root>
 304   ... '''
 305   >>> try:
 306   ...   etree.parse(StringIO(broken_xml))
 307   ... except etree.XMLSyntaxError as exc:
 308   ...   e = exc  # keep a reference; Python 3 unbinds "exc" after the except block
 309
 310 ..
 311   >>> etree.clear_error_log()
 312   >>> try:
 313   ...   etree.parse(StringIO(broken_xml))
 314   ... except etree.XMLSyntaxError:
 315   ...   import sys; e = sys.exc_info()[1]
 316
 317 Once you have caught this exception, you can access its ``error_log`` property
 318 to retrieve the log entries or filter them by a specific type, error domain or
 319 error level:
 320
 321 .. sourcecode:: pycon
 322
 323   >>> log = e.error_log.filter_from_level(etree.ErrorLevels.FATAL)
 324   >>> print(log)
 325   <string>:4:8:FATAL:PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: a line 3 and root
 326   <string>:5:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag root line 2
 327
 328 This might look a little cryptic at first, but it is the information that
 329 libxml2 gives you.  At least the message at the end should give you a hint
 330 about what went wrong, and you can see that the fatal errors (FATAL) happened
 331 during parsing (PARSER) at line 4, column 8 and line 5, column 1 of a string
 332 (<string>, or the filename if available).  Here, PARSER is the so-called error
 333 domain; see ``lxml.etree.ErrorDomains`` for the available domains.  You can get
 334 it from a log entry like this:
 335
 336 .. sourcecode:: pycon
 337
 338   >>> entry = log[0]
 339   >>> print(entry.domain_name)
 340   PARSER
 341   >>> print(entry.type_name)
 342   ERR_TAG_NAME_MISMATCH
 343   >>> print(entry.filename)
 344   <string>
 345
 346 There is also a convenience attribute ``last_error`` that returns the last
 347 error or fatal error that occurred:
 348
 349 .. sourcecode:: pycon
 350
 351   >>> entry = e.error_log.last_error
 352   >>> print(entry.domain_name)
 353   PARSER
 354   >>> print(entry.type_name)
 355   ERR_TAG_NOT_FINISHED
 356   >>> print(entry.filename)
 357   <string>
 358
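Putting the pieces together, the log entries can be turned into a compact
report.  This is only a sketch; the helper name ``describe`` and the report
layout are assumptions made for the example:

```python
from lxml import etree

broken_xml = b"<root>\n  <a>\n</root>\n"

try:
    etree.fromstring(broken_xml)
except etree.XMLSyntaxError as exc:
    error_log = exc.error_log  # copy of the log attached to the exception

def describe(entry):
    """Format one log entry as 'line:column LEVEL message'."""
    return "%d:%d %s %s" % (entry.line, entry.column,
                            entry.level_name, entry.message)

report = [describe(entry) for entry in error_log]
print("\n".join(report))
```
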
 359
 360 Error logging
 361 -------------
 362
 363 lxml.etree supports logging libxml2 messages to the Python stdlib logging
 364 module.  This is done through the ``etree.PyErrorLog`` class.  It disables the
 365 error reporting from exceptions and forwards log messages to a Python logger.
 366 To use it, see the descriptions of the function ``etree.useGlobalPythonLog``
 367 and the class ``etree.PyErrorLog`` for help.  Note that this does not affect
 368 the local error logs of XSLT, XMLSchema, etc.
 369
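A minimal sketch of such a setup follows.  It assumes the function is spelled
``use_global_python_log`` (the name used by current lxml releases); the logger
name ``lxml.demo`` and the ``ListHandler`` class are hypothetical.  Note that
the switch is global for the process:

```python
import logging
from lxml import etree

class ListHandler(logging.Handler):
    """Hypothetical handler that simply collects formatted messages."""
    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))

handler = ListHandler()
logger = logging.getLogger("lxml.demo")  # illustrative logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

# Forward libxml2 messages to the "lxml.demo" logger from now on.
etree.use_global_python_log(etree.PyErrorLog("lxml.demo"))

try:
    etree.fromstring(b"<root><a></root>")
except etree.XMLSyntaxError:
    pass  # the details are now expected in the Python log instead

print(handler.messages)
```
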
 370
 371 Serialisation
 372 -------------
 373
 374 lxml.etree has direct support for pretty printing XML output.  Functions like
 375 ``ElementTree.write()`` and ``tostring()`` support it through a keyword
 376 argument:
 377
 378 .. sourcecode:: pycon
 379
 380   >>> root = etree.XML("<root><test/></root>")
 381   >>> etree.tostring(root)
 382   b'<root><test/></root>'
 383
 384   >>> print(etree.tostring(root, pretty_print=True))
 385   <root>
 386     <test/>
 387   </root>
 388
 389 Note the newline that is appended at the end when pretty printing the
 390 output.  It was added in lxml 2.0.
 391
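The same keyword works for ``ElementTree.write()``.  A small sketch that
writes into an in-memory buffer instead of a real file:

```python
from io import BytesIO
from lxml import etree

tree = etree.ElementTree(etree.XML("<root><test/></root>"))

# write() accepts the same pretty_print keyword as tostring().
out = BytesIO()
tree.write(out, pretty_print=True, xml_declaration=True, encoding="UTF-8")
data = out.getvalue()
print(data.decode("UTF-8"))
```
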
 392 By default, lxml (just as ElementTree) outputs the XML declaration only if it
 393 is required by the standard:
 394
 395 .. sourcecode:: pycon
 396
 397   >>> unicode_root = etree.Element( u"t\u3120st" )
 398   >>> unicode_root.text = u"t\u0A0Ast"
 399   >>> etree.tostring(unicode_root, encoding="utf-8")
 400   b'<t\xe3\x84\xa0st>t\xe0\xa8\x8ast</t\xe3\x84\xa0st>'
 401
 402   >>> print(etree.tostring(unicode_root, encoding="iso-8859-1"))
 403   <?xml version='1.0' encoding='iso-8859-1'?>
 404   <t&#12576;st>t&#2570;st</t&#12576;st>
 405
 406 Also see the general remarks on `Unicode support`_.
 407
 408 .. _`Unicode support`: parsing.html#python-unicode-strings
 409
 410 You can enable or disable the declaration explicitly by passing another
 411 keyword argument for the serialisation:
 412
 413 .. sourcecode:: pycon
 414
 415   >>> print(etree.tostring(root, xml_declaration=True))
 416   <?xml version='1.0' encoding='ASCII'?>
 417   <root><test/></root>
 418
 419   >>> unicode_root.clear()
 420   >>> etree.tostring(unicode_root, encoding="UTF-16LE",
 421   ...                              xml_declaration=False)
 422   b'<\x00t\x00 1s\x00t\x00/\x00>\x00'
 423
 424 Note that a standards-compliant XML parser will not consider the last line
 425 well-formed XML if the encoding is not explicitly provided somehow, e.g. in an
 426 underlying transport protocol:
 427
 428 .. sourcecode:: pycon
 429
 430   >>> notxml = etree.tostring(unicode_root, encoding="UTF-16LE",
 431   ...                                       xml_declaration=False)
 432   >>> root = etree.XML(notxml)        #doctest: +ELLIPSIS
 433   Traceback (most recent call last):
 434     ...
 435   lxml.etree.XMLSyntaxError: ...
 436
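If the encoding is known from elsewhere, the receiving side can apply it
before parsing.  The sketch below simply decodes the bytes itself, which is
one possible way of supplying that out-of-band information:

```python
from lxml import etree

element = etree.Element("t\u3120st")
raw = etree.tostring(element, encoding="UTF-16LE", xml_declaration=False)

# The bytes alone do not say which encoding they use, so the receiver applies
# its out-of-band knowledge by decoding them before parsing.
text = raw.decode("UTF-16LE")
parsed = etree.XML(text)
print(parsed.tag == element.tag)
```
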
 437
 438 CDATA
 439 -----
 440
 441 By default, lxml's parser will strip CDATA sections from the tree and
 442 replace them with their plain text content.  As real applications for
 443 CDATA are rare, this is the best way to deal with this issue.
 444
 445 However, in some cases, keeping CDATA sections or creating them in a
 446 document is required to adhere to existing XML language definitions.
 447 For these special cases, you can instruct the parser to leave CDATA
 448 sections in the document:
 449
 450 .. sourcecode:: pycon
 451
 452   >>> parser = etree.XMLParser(strip_cdata=False)
 453   >>> root = etree.XML('<root><![CDATA[test]]></root>', parser)
 454   >>> root.text
 455   'test'
 456
 457   >>> etree.tostring(root)
 458   b'<root><![CDATA[test]]></root>'
 459
 460 Note how the ``.text`` property does not give any indication that the
 461 text content is wrapped by a CDATA section.  If you want to make sure
 462 your data is wrapped by a CDATA block, you can use the ``CDATA()``
 463 text wrapper:
 464
 465 .. sourcecode:: pycon
 466
 467   >>> root.text = 'test'
 468
 469   >>> root.text
 470   'test'
 471   >>> etree.tostring(root)
 472   b'<root>test</root>'
 473
 474   >>> root.text = etree.CDATA(root.text)
 475
 476   >>> root.text
 477   'test'
 478   >>> etree.tostring(root)
 479   b'<root><![CDATA[test]]></root>'
 480
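The difference becomes visible with text that contains markup characters.  In
this sketch, the element name ``script`` and the snippet are arbitrary
examples:

```python
from lxml import etree

snippet = "if (a < b && b < c) { run(); }"

escaped = etree.Element("script")
escaped.text = snippet                 # special characters get escaped

wrapped = etree.Element("script")
wrapped.text = etree.CDATA(snippet)    # content is written verbatim

escaped_bytes = etree.tostring(escaped)
wrapped_bytes = etree.tostring(wrapped)
print(escaped_bytes)
print(wrapped_bytes)
```

Both elements report the same ``.text`` value; only the serialised form
differs.
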
 481
 482 XInclude and ElementInclude
 483 ---------------------------
 484
 485 You can let lxml process XInclude statements in a document by calling the
 486 ``xinclude()`` method on a tree:
 487
 488 .. sourcecode:: pycon
 489
 490   >>> data = StringIO('''\
 491   ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
 492   ... <foo/>
 493   ... <xi:include href="doc/test.xml" />
 494   ... </doc>''')
 495
 496   >>> tree = etree.parse(data)
 497   >>> tree.xinclude()
 498   >>> print(etree.tostring(tree.getroot()))
 499   <doc xmlns:xi="http://www.w3.org/2001/XInclude">
 500   <foo/>
 501   <a xml:base="doc/test.xml"/>
 502   </doc>
 503
 504 Note that the ElementTree compatible ElementInclude_ module is also supported
 505 as ``lxml.ElementInclude``.  It has the additional advantage of supporting
 506 custom `URL resolvers`_ at the Python level.  The normal XInclude mechanism
 507 cannot deploy these.  If you need ElementTree compatibility or custom
 508 resolvers, you have to stick to the external Python module.
 509
 510 .. _ElementInclude: http://effbot.org/zone/element-xinclude.htm
 511
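A sketch of the ``lxml.ElementInclude`` route with a custom loader follows.
It assumes the ElementTree-style loader signature ``loader(href, parse,
encoding=None)``; the ``FRAGMENTS`` dictionary and the ``memory_loader``
function are hypothetical stand-ins for a real resolver:

```python
from lxml import etree, ElementInclude

# Hypothetical in-memory "files" standing in for real resources.
FRAGMENTS = {"chapter1.xml": "<chapter>First</chapter>"}

def memory_loader(href, parse, encoding=None):
    """Resolve an include from FRAGMENTS instead of the file system."""
    data = FRAGMENTS[href]
    if parse == "xml":
        return etree.fromstring(data)
    return data  # parse == "text"

doc = etree.XML(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.xml"/>'
    '</book>')

# Replaces the xi:include element in place with the loaded element.
ElementInclude.include(doc, loader=memory_loader)
print(etree.tostring(doc))
```
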
 512
 513 write_c14n on ElementTree
 514 -------------------------
 515
 516 The lxml.etree.ElementTree class has a method ``write_c14n()``, which takes a
 517 file object as argument.  This file object will receive a UTF-8 representation
 518 of the canonicalized form of the XML, following the W3C C14N recommendation.
 519 For example:
 520
 521 .. sourcecode:: pycon
 522
 523   >>> f = StringIO('<a><b/></a>')
 524   >>> tree = etree.parse(f)
 525   >>> f2 = StringIO()
 526   >>> tree.write_c14n(f2)
 527   >>> print(f2.getvalue().decode("utf-8"))
 528   <a><b></b></a>
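
Canonicalisation also normalises details such as attribute order.  A small
sketch with an in-memory buffer, using a made-up input document:

```python
from io import BytesIO
from lxml import etree

tree = etree.parse(BytesIO(b'<a z="1" b="2"><b/></a>'))

# C14N sorts attributes and writes empty elements as start/end tag pairs.
out = BytesIO()
tree.write_c14n(out)
canonical = out.getvalue()
print(canonical.decode("UTF-8"))
```

Two documents that differ only in such details produce identical canonical
output, which is useful for comparing or signing XML.
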