===========================
APIs specific to lxml.etree
===========================

lxml.etree tries to follow established APIs wherever possible.  Sometimes,
however, the need to expose a feature in an easy way led to the invention of a
new API.  This page describes the major differences and a few additions to the
ElementTree API.

For a complete reference of the API, see the `generated API
documentation`_.

Separate pages describe the support for `parsing XML`_, executing `XPath and
XSLT`_, `validating XML`_ and interfacing with other XML tools through the
`SAX-API`_.

lxml is highly extensible through `XPath functions in Python`_, custom
`Python element classes`_, custom `URL resolvers`_ and even `at the C-level`_.

.. _`parsing XML`: parsing.html
.. _`XPath and XSLT`: xpathxslt.html
.. _`validating XML`: validation.html
.. _`SAX-API`: sax.html
.. _`XPath functions in Python`: extensions.html
.. _`Python element classes`: element_classes.html
.. _`at the C-level`: capi.html
.. _`URL resolvers`: resolvers.html
.. _`generated API documentation`: api/index.html


..
  >>> try: from StringIO import StringIO
  ... except ImportError:
  ...    from io import BytesIO
  ...    def StringIO(s=None):
  ...        if isinstance(s, str): s = s.encode("UTF-8")
  ...        return BytesIO(s)

  >>> try: from collections import deque
  ... except ImportError:
  ...    class deque(list):
  ...        def popleft(self): return self.pop(0)

  >>> try: unicode = __builtins__["unicode"]
  ... except (NameError, KeyError, TypeError): unicode = str


lxml.etree
----------

lxml.etree tries to follow the `ElementTree API`_ wherever it can.  There are,
however, some incompatibilities (see `compatibility`_).  The extensions are
documented here.

.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
.. _`compatibility`: compatibility.html

If you need to know which version of lxml is installed, you can access the
``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple.  Note,
however, that it did not exist before version 1.0, so you will get an
``AttributeError`` in older versions.  The versions of libxml2 and libxslt are
available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.

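As a minimal sketch of the version check described above, the following reads
the three version tuples and falls back to ``None`` when ``LXML_VERSION`` is
missing.  The variable names are illustrative only.

.. sourcecode:: python

  from lxml import etree

  # LXML_VERSION is missing in releases before 1.0, so guard the lookup.
  lxml_version = getattr(etree, "LXML_VERSION", None)

  # Versions of the underlying C libraries, also given as tuples.
  libxml_version = etree.LIBXML_VERSION
  libxslt_version = etree.LIBXSLT_VERSION

  print("lxml:", lxml_version)
  print("libxml2:", libxml_version)
  print("libxslt:", libxslt_version)

Because the values are tuples, they can be compared directly against a
minimum requirement such as ``(2, 9, 0)``.
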
The following examples usually assume this to be executed first:

.. sourcecode:: pycon

  >>> from lxml import etree

..
  >>> import sys
  >>> from lxml import etree as _etree
  >>> if sys.version_info[0] >= 3:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if isinstance(s, bytes) and bytes([10]) in s: s = s.decode("utf-8") # contains LF
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  ... else:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  >>> etree = etree_mock()


Other Element APIs
------------------

While lxml.etree itself uses the ElementTree API, it is possible to replace
the Element implementation with `custom element subclasses`_.  This has been
used to implement well-known XML APIs on top of lxml.  For example, lxml ships
with a data-binding implementation called `objectify`_, which is similar to
the `Amara bindery`_ tool.

lxml.etree comes with a number of `different lookup schemes`_ to customize the
mapping between libxml2 nodes and the Element classes used by lxml.etree.

.. _`custom element subclasses`: element_classes.html
.. _`objectify`: objectify.html
.. _`different lookup schemes`: element_classes.html#setting-up-a-class-lookup-scheme
.. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/

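The simplest lookup scheme can be sketched as follows.  The class
``CountingElement`` and its ``child_count`` property are hypothetical names
chosen for this illustration.  The default class lookup then makes the parser
use the subclass for every element it creates.

.. sourcecode:: python

  from lxml import etree

  class CountingElement(etree.ElementBase):
      """Hypothetical Element subclass that adds one convenience property."""

      @property
      def child_count(self):
          return len(self)

  # Use the subclass for all elements created by this parser.
  lookup = etree.ElementDefaultClassLookup(element=CountingElement)
  parser = etree.XMLParser()
  parser.set_element_class_lookup(lookup)

  root = etree.XML("<root><a/><b/></root>", parser)
  print(type(root).__name__, root.child_count)

Only trees built by this parser are affected.  Other parsers keep returning
plain lxml elements.
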

Trees and Documents
-------------------

Compared to the original ElementTree API, lxml.etree has an extended tree
model.  It knows about parents and siblings of elements:

.. sourcecode:: pycon

  >>> root = etree.Element("root")
  >>> a = etree.SubElement(root, "a")
  >>> b = etree.SubElement(root, "b")
  >>> c = etree.SubElement(root, "c")
  >>> d = etree.SubElement(root, "d")
  >>> e = etree.SubElement(d,    "e")
  >>> b.getparent() == root
  True
  >>> print(b.getnext().tag)
  c
  >>> print(c.getprevious().tag)
  b

Elements always live within a document context in lxml.  This implies that
there is also a notion of an absolute document root.  You can retrieve an
ElementTree for the root node of a document from any of its elements.

.. sourcecode:: pycon

  >>> tree = d.getroottree()
  >>> print(tree.getroot().tag)
  root

Note that this is different from wrapping an Element in an ElementTree.  You
can use ElementTrees to create XML trees with an explicit root node:

.. sourcecode:: pycon

  >>> tree = etree.ElementTree(d)
  >>> print(tree.getroot().tag)
  d
  >>> etree.tostring(tree)
  b'<d><e/></d>'

ElementTree objects are serialised as complete documents, including
preceding or trailing processing instructions and comments.

All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
will treat the explicitly chosen root as the root node of a document.  They
will not see any elements outside the ElementTree.  However, ElementTrees do
not modify their Elements:

.. sourcecode:: pycon

  >>> element = tree.getroot()
  >>> print(element.tag)
  d
  >>> print(element.getparent().tag)
  root
  >>> print(element.getroottree().getroot().tag)
  root

The rule is that all operations that are applied to Elements use either the
Element itself as reference point, or the absolute root of the document that
contains this Element (e.g. for absolute XPath expressions).  All operations
on an ElementTree use its explicit root node as reference.

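The rule can be illustrated with absolute XPath expressions.  This is a small
sketch on a throwaway document.  The variable names are illustrative and not
part of the examples above.

.. sourcecode:: python

  from lxml import etree

  root = etree.XML("<root><a><b/></a></root>")
  a = root[0]

  # On an Element, an absolute expression starts at the document root.
  from_element = [el.tag for el in a.xpath("/root")]

  # On an ElementTree, the explicitly chosen root plays that role ...
  subtree = etree.ElementTree(a)
  from_tree = [el.tag for el in subtree.xpath("/a")]

  # ... and elements outside of the ElementTree are not visible.
  outside = subtree.xpath("/root")

  print(from_element, from_tree, outside)
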

Iteration
---------

The ElementTree API makes Elements iterable to support iteration over their
children.  Using the tree defined above, we get:

.. sourcecode:: pycon

  >>> [ child.tag for child in root ]
  ['a', 'b', 'c', 'd']

To iterate in the opposite direction, use the built-in ``reversed()``
function.

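A short sketch of reverse iteration, using a fresh document so that it runs
on its own:

.. sourcecode:: python

  from lxml import etree

  root = etree.XML("<root><a/><b/><c/><d/></root>")

  # reversed() walks the children from last to first.
  backwards = [child.tag for child in reversed(root)]
  print(backwards)
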
Tree traversal should use the ``element.iter()`` method:

.. sourcecode:: pycon

  >>> [ el.tag for el in root.iter() ]
  ['root', 'a', 'b', 'c', 'd', 'e']

lxml.etree also supports this, but additionally features an extended API for
iteration over the children, following/preceding siblings, ancestors and
descendants of an element, as defined by the respective XPath axis:

.. sourcecode:: pycon

  >>> [ child.tag for child in root.iterchildren() ]
  ['a', 'b', 'c', 'd']
  >>> [ child.tag for child in root.iterchildren(reversed=True) ]
  ['d', 'c', 'b', 'a']
  >>> [ sibling.tag for sibling in b.itersiblings() ]
  ['c', 'd']
  >>> [ sibling.tag for sibling in c.itersiblings(preceding=True) ]
  ['b', 'a']
  >>> [ ancestor.tag for ancestor in e.iterancestors() ]
  ['d', 'root']
  >>> [ el.tag for el in root.iterdescendants() ]
  ['a', 'b', 'c', 'd', 'e']

Note how ``element.iterdescendants()`` does not include the element
itself, as opposed to ``element.iter()``.  The latter effectively
implements the 'descendant-or-self' axis in XPath.

All of these iterators support an additional ``tag`` keyword argument that
filters the generated elements by tag name:

.. sourcecode:: pycon

  >>> [ child.tag for child in root.iterchildren(tag='a') ]
  ['a']
  >>> [ child.tag for child in d.iterchildren(tag='a') ]
  []
  >>> [ el.tag for el in root.iterdescendants(tag='d') ]
  ['d']
  >>> [ el.tag for el in root.iter(tag='d') ]
  ['d']

The most common way to traverse an XML tree is depth-first, which
traverses the tree in document order.  This is implemented by the
``.iter()`` method.  While there is no dedicated method for
breadth-first traversal, it is almost as simple if you use the
``collections.deque`` type from the standard library.

.. sourcecode:: pycon

  >>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
  >>> print(etree.tostring(root, pretty_print=True, encoding=unicode))
  <root>
    <a>
      <b/>
      <c/>
    </a>
    <d>
      <e/>
    </d>
  </root>

  >>> queue = deque([root])
  >>> while queue:
  ...    el = queue.popleft()  # pop next element
  ...    queue.extend(el)      # append its children
  ...    print(el.tag)
  root
  a
  d
  b
  c
  e

See also the section on the utility functions ``iterparse()`` and
``iterwalk()`` in the `parser documentation`_.

.. _`parser documentation`: parsing.html#iterparse-and-iterwalk


Error handling on exceptions
----------------------------

Libxml2 provides error messages for failures, be it during parsing, XPath
evaluation or schema validation.  The preferred way of accessing them is
through the local ``error_log`` property of the respective evaluator or
transformer object.  See their documentation for details.

However, lxml also keeps a global error log of all errors that occurred at the
application level.  Whenever an exception is raised, you can retrieve the
errors that occurred and "might have" led to the problem from the error log
copy attached to the exception:

.. sourcecode:: pycon

  >>> etree.clear_error_log()
  >>> broken_xml = '''
  ... <root>
  ...   <a>
  ... </root>
  ... '''
  >>> try:
  ...   etree.parse(StringIO(broken_xml))
  ... except etree.XMLSyntaxError as exc:
  ...   e = exc  # keep a reference, Python 3 unbinds "exc" after the block

Once you have caught this exception, you can access its ``error_log`` property
to retrieve the log entries or filter them by a specific type, error domain or
error level:

.. sourcecode:: pycon

  >>> log = e.error_log.filter_from_level(etree.ErrorLevels.FATAL)
  >>> print(log)
  <string>:4:8:FATAL:PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: a line 3 and root
  <string>:5:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag root line 2

This might look a little cryptic at first, but it is the information that
libxml2 gives you.  At least the message at the end should give you a hint
what went wrong, and you can see that the fatal errors (FATAL) happened during
parsing (PARSER) at line 4, column 8 and line 5, column 1 of a string
(<string>, or the filename if available).  Here, PARSER is the so-called error
domain, see ``lxml.etree.ErrorDomains`` for that.  You can get it from a log
entry like this:

.. sourcecode:: pycon

  >>> entry = log[0]
  >>> print(entry.domain_name)
  PARSER
  >>> print(entry.type_name)
  ERR_TAG_NAME_MISMATCH
  >>> print(entry.filename)
  <string>

There is also a convenience attribute ``last_error`` that returns the last
error or fatal error that occurred:

.. sourcecode:: pycon

  >>> entry = e.error_log.last_error
  >>> print(entry.domain_name)
  PARSER
  >>> print(entry.type_name)
  ERR_TAG_NOT_FINISHED
  >>> print(entry.filename)
  <string>

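Log entries can also be processed one by one.  The following sketch copies a
few fields from every entry of the exception's ``error_log``.  The tuple
layout and variable names are illustrative, and the exact messages depend on
the installed libxml2 version.

.. sourcecode:: python

  from io import BytesIO
  from lxml import etree

  broken_xml = b"<root>\n  <a>\n</root>\n"

  entries = []
  try:
      etree.parse(BytesIO(broken_xml))
  except etree.XMLSyntaxError as exc:
      # Copy the fields of interest while the exception is still bound.
      entries = [
          (entry.line, entry.column, entry.level_name,
           entry.type_name, entry.message)
          for entry in exc.error_log
      ]

  for line, column, level, type_name, message in entries:
      print("%d:%d %s %s: %s" % (line, column, level, type_name, message))
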

Error logging
-------------

lxml.etree supports logging libxml2 messages to the Python stdlib logging
module.  This is done through the ``etree.PyErrorLog`` class.  It disables the
error reporting from exceptions and forwards log messages to a Python logger.
To use it, see the descriptions of the function
``etree.use_global_python_log`` (called ``etree.useGlobalPythonLog`` in older
releases) and the class ``etree.PyErrorLog`` for help.  Note that this does
not affect the local error logs of XSLT, XMLSchema, etc.

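A hedged sketch of the setup follows.  The logger name ``lxml.demo`` and the
``ListHandler`` class are hypothetical and only serve to make the forwarded
messages visible.  The sketch assumes the ``use_global_python_log()``
spelling of the function.  Which messages arrive, and at which level, depends
on the lxml and libxml2 versions in use.

.. sourcecode:: python

  import logging
  from io import BytesIO
  from lxml import etree

  class ListHandler(logging.Handler):
      """Hypothetical handler that keeps the received records in a list."""

      def __init__(self):
          super().__init__()
          self.records = []

      def emit(self, record):
          self.records.append(record)

  logger = logging.getLogger("lxml.demo")  # hypothetical logger name
  logger.setLevel(logging.DEBUG)
  handler = ListHandler()
  logger.addHandler(handler)

  # Route libxml2 messages to the Python logger from now on.
  py_log = etree.PyErrorLog(logger_name="lxml.demo")
  etree.use_global_python_log(py_log)

  try:
      etree.parse(BytesIO(b"<root><a></root>"))
  except etree.XMLSyntaxError:
      pass  # forwarded messages, if any, are now in handler.records

  for record in handler.records:
      print(record.levelname, record.getMessage())

Note that the call changes global state, so it is best made once at
application start-up.
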

Serialisation
-------------

lxml.etree has direct support for pretty printing XML output.  Functions like
``ElementTree.write()`` and ``tostring()`` support it through a keyword
argument:

.. sourcecode:: pycon

  >>> root = etree.XML("<root><test/></root>")
  >>> etree.tostring(root)
  b'<root><test/></root>'

  >>> print(etree.tostring(root, pretty_print=True))
  <root>
    <test/>
  </root>

Note the newline that is appended at the end when pretty printing the
output.  It was added in lxml 2.0.

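The same keyword works when writing to a file-like object.  This sketch uses
an in-memory ``BytesIO`` buffer as a stand-in for a real file and also shows
the trailing newline explicitly:

.. sourcecode:: python

  from io import BytesIO
  from lxml import etree

  root = etree.XML("<root><test/></root>")

  # ElementTree.write() accepts the same pretty_print keyword.
  buffer = BytesIO()
  etree.ElementTree(root).write(buffer, pretty_print=True)
  pretty_bytes = buffer.getvalue()

  # encoding="unicode" returns a text string instead of bytes.
  pretty_text = etree.tostring(root, pretty_print=True, encoding="unicode")

  print(repr(pretty_bytes))
  print(repr(pretty_text))
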
By default, lxml (just as ElementTree) outputs the XML declaration only if it
is required by the standard:

.. sourcecode:: pycon

  >>> unicode_root = etree.Element( u"t\u3120st" )
  >>> unicode_root.text = u"t\u0A0Ast"
  >>> etree.tostring(unicode_root, encoding="utf-8")
  b'<t\xe3\x84\xa0st>t\xe0\xa8\x8ast</t\xe3\x84\xa0st>'

  >>> print(etree.tostring(unicode_root, encoding="iso-8859-1"))
  <?xml version='1.0' encoding='iso-8859-1'?>
  <t&#12576;st>t&#2570;st</t&#12576;st>

Also see the general remarks on `Unicode support`_.

.. _`Unicode support`: parsing.html#python-unicode-strings

You can enable or disable the declaration explicitly by passing another
keyword argument for the serialisation:

.. sourcecode:: pycon

  >>> print(etree.tostring(root, xml_declaration=True))
  <?xml version='1.0' encoding='ASCII'?>
  <root><test/></root>

  >>> unicode_root.clear()
  >>> etree.tostring(unicode_root, encoding="UTF-16LE",
  ...                xml_declaration=False)
  b'<\x00t\x00 1s\x00t\x00/\x00>\x00'

Note that a standards-compliant XML parser will not consider the last line
well-formed XML if the encoding is not explicitly provided somehow, e.g. in an
underlying transport protocol:

.. sourcecode:: pycon

  >>> notxml = etree.tostring(unicode_root, encoding="UTF-16LE",
  ...                         xml_declaration=False)
  >>> root = etree.XML(notxml)        #doctest: +ELLIPSIS
  Traceback (most recent call last):
    ...
  lxml.etree.XMLSyntaxError: ...


CDATA
-----

By default, lxml's parser will strip CDATA sections from the tree and
replace them by their plain text content.  Since applications rarely depend
on CDATA sections, this is usually the most convenient behaviour.

However, in some cases, keeping CDATA sections or creating them in a
document is required to adhere to existing XML language definitions.
For these special cases, you can instruct the parser to leave CDATA
sections in the document:

.. sourcecode:: pycon

  >>> parser = etree.XMLParser(strip_cdata=False)
  >>> root = etree.XML('<root><![CDATA[test]]></root>', parser)
  >>> root.text
  'test'

  >>> etree.tostring(root)
  b'<root><![CDATA[test]]></root>'

Note how the ``.text`` property does not give any indication that the
text content is wrapped by a CDATA section.  If you want to make sure
your data is wrapped by a CDATA block, you can use the ``CDATA()``
text wrapper:

.. sourcecode:: pycon

  >>> root.text = 'test'

  >>> root.text
  'test'
  >>> etree.tostring(root)
  b'<root>test</root>'

  >>> root.text = etree.CDATA(root.text)

  >>> root.text
  'test'
  >>> etree.tostring(root)
  b'<root><![CDATA[test]]></root>'

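The difference becomes more visible with text that contains markup
characters.  In this small sketch, the element name ``code`` and the sample
text are arbitrary:

.. sourcecode:: python

  from lxml import etree

  el = etree.Element("code")

  # Plain text is escaped on serialisation.
  el.text = "a < b && c"
  escaped = etree.tostring(el)

  # Wrapped in CDATA(), the same text is written verbatim.
  el.text = etree.CDATA("a < b && c")
  wrapped = etree.tostring(el)

  print(escaped)
  print(wrapped)
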

XInclude and ElementInclude
---------------------------

You can let lxml process XInclude statements in a document by calling the
``xinclude()`` method on a tree:

.. sourcecode:: pycon

  >>> data = StringIO('''\
  ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
  ... <xi:include href="doc/test.xml" />
  ... </doc>''')

  >>> tree = etree.parse(data)
  >>> tree.xinclude()
  >>> print(etree.tostring(tree.getroot()))
  <doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <a xml:base="doc/test.xml"/>
  </doc>

Note that the ElementTree compatible ElementInclude_ module is also supported
as ``lxml.ElementInclude``.  It has the additional advantage of supporting
custom `URL resolvers`_ at the Python level.  The normal XInclude mechanism
cannot deploy these.  If you need ElementTree compatibility or custom
resolvers, you have to stick to the external Python module.

.. _ElementInclude: http://effbot.org/zone/element-xinclude.htm

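As a sketch of Python-level resolution, the following passes a custom loader
function to ``ElementInclude.include()``.  The ``INCLUDED`` dictionary and the
``memory_loader`` function are hypothetical and stand in for whatever storage
an application uses.

.. sourcecode:: python

  from lxml import etree, ElementInclude

  # Hypothetical in-memory "files", keyed by the href used in the document.
  INCLUDED = {"doc/test.xml": "<a/>"}

  def memory_loader(href, parse, encoding=None):
      """Return an Element for XML includes and plain text otherwise."""
      if parse == "xml":
          return etree.fromstring(INCLUDED[href])
      return INCLUDED[href]

  root = etree.XML(
      '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
      '<xi:include href="doc/test.xml"/>'
      '</doc>')

  # Replaces each xi:include element with the loaded content, in place.
  ElementInclude.include(root, loader=memory_loader)

  print([child.tag for child in root])
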

write_c14n on ElementTree
-------------------------

The lxml.etree.ElementTree class has a method ``write_c14n``, which takes a
file object as argument.  This file object will receive a UTF-8 representation
of the canonicalized form of the XML, following the W3C C14N recommendation.
For example:

.. sourcecode:: pycon

  >>> f = StringIO('<a><b/></a>')
  >>> tree = etree.parse(f)
  >>> f2 = StringIO()
  >>> tree.write_c14n(f2)
  >>> print(f2.getvalue().decode("utf-8"))
  <a><b></b></a>

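Canonicalization also normalises the attribute order.  This small sketch
uses an arbitrary sample document and an in-memory buffer:

.. sourcecode:: python

  from io import BytesIO
  from lxml import etree

  tree = etree.parse(BytesIO(b'<a z="1" b="2"><b/></a>'))

  out = BytesIO()
  tree.write_c14n(out)
  canonical = out.getvalue()

  # Attributes are sorted and empty elements get explicit end tags.
  print(canonical.decode("utf-8"))
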