=======================
The lxml.etree Tutorial
=======================

.. meta::
  :description: The lxml tutorial on XML that feels like Python
  :keywords: lxml, etree, tutorial, ElementTree, Python, XML, HTML
This tutorial gives a brief overview of the main concepts of the
`ElementTree API`_ as implemented by ``lxml.etree``, and of some simple
enhancements that make your life as a programmer easier.

For a complete reference of the API, see the `generated API
documentation`_.

.. _`ElementTree API`: http://effbot.org/zone/element-index.htm#documentation
.. _`generated API documentation`: api/index.html
25 1.1 Elements are lists
26 1.2 Elements carry attributes
27 1.3 Elements contain text
28 1.4 Using XPath to find text
31 2 The ElementTree class
32 3 Parsing from strings and files
33 3.1 The fromstring() function
34 3.2 The XML() function
35 3.3 The parse() function
37 3.5 Incremental parsing
38 3.6 Event-driven parsing
45 >>> try: from StringIO import StringIO
46 ... except ImportError:
47 ... from io import BytesIO
49 ... if isinstance(s, str): s = s.encode("UTF-8")
52 >>> try: unicode = __builtins__["unicode"]
53 ... except (NameError, KeyError): unicode = str
55 >>> try: basestring = __builtins__["basestring"]
56 ... except (NameError, KeyError): basestring = str
59 A common way to import ``lxml.etree`` is as follows:
63 >>> from lxml import etree
65 If your code only uses the ElementTree API and does not rely on any
66 functionality that is specific to ``lxml.etree``, you can also use (any part
67 of) the following import chain as a fall-back to the original ElementTree:
.. sourcecode:: python

  try:
      from lxml import etree
      print("running with lxml.etree")
  except ImportError:
      try:
          # Python 2.5+, C implementation
          import xml.etree.cElementTree as etree
          print("running with cElementTree on Python 2.5+")
      except ImportError:
          try:
              # Python 2.5+, pure Python implementation
              import xml.etree.ElementTree as etree
              print("running with ElementTree on Python 2.5+")
          except ImportError:
              try:
                  # normal cElementTree install
                  import cElementTree as etree
                  print("running with cElementTree")
              except ImportError:
                  try:
                      # normal ElementTree install
                      import elementtree.ElementTree as etree
                      print("running with ElementTree")
                  except ImportError:
                      print("Failed to import ElementTree from any known place")
To aid in writing portable code, this tutorial makes it clear in the examples
which part of the presented API is an extension of lxml.etree over the
original `ElementTree API`_, as defined by Fredrik Lundh's `ElementTree
library`_.

.. _`ElementTree library`: http://effbot.org/zone/element-index.htm
106 >>> from lxml import etree as _etree
107 >>> if sys.version_info[0] >= 3:
108 ... class etree_mock(object):
109 ... def __getattr__(self, name): return getattr(_etree, name)
110 ... def tostring(self, *args, **kwargs):
111 ... s = _etree.tostring(*args, **kwargs)
112 ... if isinstance(s, bytes) and bytes([10]) in s: s = s.decode("utf-8") # CR
113 ... if s[-1] == '\n': s = s[:-1]
116 ... class etree_mock(object):
117 ... def __getattr__(self, name): return getattr(_etree, name)
118 ... def tostring(self, *args, **kwargs):
119 ... s = _etree.tostring(*args, **kwargs)
120 ... if s[-1] == '\n': s = s[:-1]
122 >>> etree = etree_mock()
128 An ``Element`` is the main container object for the ElementTree API. Most of
129 the XML tree functionality is accessed through this class. Elements are
130 easily created through the ``Element`` factory:
132 .. sourcecode:: pycon
134 >>> root = etree.Element("root")
The XML tag name of elements is accessed through the ``tag`` property:

.. sourcecode:: pycon

  >>> print(root.tag)
  root
143 Elements are organised in an XML tree structure. To create child elements and
144 add them to a parent element, you can use the ``append()`` method:
146 .. sourcecode:: pycon
148 >>> root.append( etree.Element("child1") )
150 However, this is so common that there is a shorter and much more efficient way
151 to do this: the ``SubElement`` factory. It accepts the same arguments as the
152 ``Element`` factory, but additionally requires the parent as first argument:
154 .. sourcecode:: pycon
156 >>> child2 = etree.SubElement(root, "child2")
157 >>> child3 = etree.SubElement(root, "child3")
159 To see that this is really XML, you can serialise the tree you have created:
161 .. sourcecode:: pycon
163 >>> print(etree.tostring(root, pretty_print=True))
To make the access to these subelements as easy and straightforward as
possible, elements behave like normal Python lists:

.. sourcecode:: pycon

  >>> root.index(root[1]) # lxml.etree only!
  1

  >>> children = list(root)

  >>> for child in root:
  ...     print(child.tag)
  child1
  child2
  child3

  >>> root.insert(0, etree.Element("child0"))
  >>> start = root[:1]
  >>> end = root[-1:]

  >>> print(start[0].tag)
  child0
  >>> print(end[0].tag)
  child3

  >>> root[0] = root[-1] # this moves the element!
  >>> for child in root:
  ...     print(child.tag)
  child3
  child1
  child2
213 Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
214 an Element to see if it has children, i.e. if the list of children is empty.
215 This is no longer supported as people tend to find it surprising that a
216 non-None reference to an existing Element can evaluate to False. Instead, use
217 ``len(element)``, which is both more explicit and less error prone.
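The difference between "no element" and "element without children" can be sketched as follows. The helper name ``describe`` is made up for this illustration, and the import falls back to the standard library's ElementTree (an assumption made here so that the sketch also runs without lxml):

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def describe(element):
    """Hypothetical helper: classify the result of an element lookup."""
    if element is None:        # the lookup found nothing at all
        return "missing"
    if len(element) == 0:      # an existing Element without children
        return "leaf"
    return "parent of %d" % len(element)


root = etree.Element("root")
etree.SubElement(root, "child")

print(describe(root))               # parent of 1
print(describe(root[0]))            # leaf
print(describe(root.find("nope")))  # missing
```

Testing ``element is None`` and ``len(element)`` separately keeps the two questions apart, which a plain truth test cannot do.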
Note that in the last example, the last element was *moved* to a different
position. This is a difference from the original ElementTree (and
from lists), where elements can sit in multiple positions of any number of
trees. In lxml.etree, elements can only sit in one position of one tree at a
time.
If you want to *copy* an element to a different position, consider creating an
independent *deep copy* using the ``copy`` module from Python's standard
library:
229 .. sourcecode:: pycon
231 >>> from copy import deepcopy
233 >>> element = etree.Element("neu")
234 >>> element.append( deepcopy(root[1]) )
236 >>> print(element[0].tag)
238 >>> print([ c.tag for c in root ])
239 ['child3', 'child1', 'child2']
241 The way up in the tree is provided through the ``getparent()`` method:
243 .. sourcecode:: pycon
245 >>> root is root[0].getparent() # lxml.etree only!
The siblings (or neighbours) of an element are accessed as next and previous
elements:
251 .. sourcecode:: pycon
253 >>> root[0] is root[1].getprevious() # lxml.etree only!
255 >>> root[1] is root[0].getnext() # lxml.etree only!
Elements carry attributes
-------------------------

XML elements support attributes. You can create them directly in the Element
factory:
265 .. sourcecode:: pycon
267 >>> root = etree.Element("root", interesting="totally")
268 >>> etree.tostring(root)
269 b'<root interesting="totally"/>'
271 Fast and direct access to these attributes is provided by the ``set()`` and
272 ``get()`` methods of elements:
274 .. sourcecode:: pycon
276 >>> print(root.get("interesting"))
279 >>> root.set("interesting", "somewhat")
280 >>> print(root.get("interesting"))
283 However, a very convenient way of dealing with them is through the dictionary
284 interface of the ``attrib`` property:
286 .. sourcecode:: pycon
288 >>> attributes = root.attrib
290 >>> print(attributes["interesting"])
293 >>> print(attributes.get("hello"))
296 >>> attributes["hello"] = "Guten Tag"
297 >>> print(attributes.get("hello"))
299 >>> print(root.get("hello"))
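Because ``attrib`` stays connected to its Element, it helps to take an independent copy when you need a snapshot that later changes should not affect. The following is a minimal sketch of that idea; the attribute values are made up, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.Element("root", interesting="totally")
root.set("hello", "Guten Tag")

# items() returns (name, value) pairs; sorting gives a stable order
for name, value in sorted(root.items()):
    print("%s = %r" % (name, value))

# dict() creates a snapshot that is independent of the Element
snapshot = dict(root.attrib)
root.set("hello", "Huhu")

print(snapshot["hello"])  # Guten Tag
print(root.get("hello"))  # Huhu
```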
Elements contain text
---------------------
306 Elements can contain text:
308 .. sourcecode:: pycon
310 >>> root = etree.Element("root")
311 >>> root.text = "TEXT"
316 >>> etree.tostring(root)
In many XML documents (*data-centric* documents), this is the only place where
text can be found. It is encapsulated by a leaf tag at the very bottom of the
tree hierarchy.
323 However, if XML is used for tagged text documents such as (X)HTML, text can
324 also appear between different elements, right in the middle of the tree:
328 <html><body>Hello<br/>World</body></html>
330 Here, the ``<br/>`` tag is surrounded by text. This is often referred to as
331 *document-style* or *mixed-content* XML. Elements support this through their
332 ``tail`` property. It contains the text that directly follows the element, up
333 to the next element in the XML tree:
.. sourcecode:: pycon

  >>> html = etree.Element("html")
  >>> body = etree.SubElement(html, "body")
  >>> body.text = "TEXT"

  >>> etree.tostring(html)
  b'<html><body>TEXT</body></html>'

  >>> br = etree.SubElement(body, "br")
  >>> etree.tostring(html)
  b'<html><body>TEXT<br/></body></html>'

  >>> br.tail = "TAIL"
  >>> etree.tostring(html)
  b'<html><body>TEXT<br/>TAIL</body></html>'
The two properties ``.text`` and ``.tail`` are enough to represent any
text content in an XML document. This way, the ElementTree API does
not require any `special text nodes`_ in addition to the Element
class, which tend to get in the way fairly often (as you might know
from classic DOM_ APIs).
358 However, there are cases where the tail text also gets in the way.
359 For example, when you serialise an Element from within the tree, you
360 do not always want its tail text in the result (although you would
361 still want the tail text of its children). For this purpose, the
362 ``tostring()`` function accepts the keyword argument ``with_tail``:
.. sourcecode:: pycon

  >>> etree.tostring(br)
  b'<br/>TAIL'
  >>> etree.tostring(br, with_tail=False) # lxml.etree only!
  b'<br/>'
371 .. _`special text nodes`: http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-1312295772
372 .. _DOM: http://www.w3.org/TR/DOM-Level-3-Core/core.html
374 If you want to read *only* the text, i.e. without any intermediate
375 tags, you have to recursively concatenate all ``text`` and ``tail``
376 attributes in the correct order. Again, the ``tostring()`` function
377 comes to the rescue, this time using the ``method`` keyword:
.. sourcecode:: pycon

  >>> etree.tostring(html, method="text")
  b'TEXTTAIL'
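To illustrate what "recursively concatenate all ``text`` and ``tail`` attributes in the correct order" means, here is a minimal sketch of the manual approach. The function name ``collect_text`` is hypothetical, the sketch assumes a tree that contains only plain elements (no comments or processing instructions), and the fallback import is an assumption so that it also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def collect_text(element):
    """Hypothetical helper: join text and tail in document order."""
    parts = [element.text or ""]          # text before the first child
    for child in element:
        parts.append(collect_text(child))  # everything inside the child
        parts.append(child.tail or "")     # text that follows the child
    return "".join(parts)


html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
br = etree.SubElement(body, "br")
br.tail = "TAIL"

print(collect_text(html))         # TEXTTAIL
print("".join(html.itertext()))   # TEXTTAIL
```

The ``itertext()`` method walks the tree in the same order, so in practice there is rarely a need to write the recursion by hand.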
Using XPath to find text
------------------------
388 .. _XPath: xpathxslt.html#xpath
390 Another way to extract the text content of a tree is XPath_, which
391 also allows you to extract the separate text chunks into a list:
393 .. sourcecode:: pycon
395 >>> print(html.xpath("string()")) # lxml.etree only!
397 >>> print(html.xpath("//text()")) # lxml.etree only!
400 If you want to use this more often, you can wrap it in a function:
402 .. sourcecode:: pycon
404 >>> build_text_list = etree.XPath("//text()") # lxml.etree only!
405 >>> print(build_text_list(html))
Note that a string result returned by XPath is a special 'smart'
object that knows about its origins. You can ask it where it came
from through its ``getparent()`` method, just as you would with
Elements:
413 .. sourcecode:: pycon
415 >>> texts = build_text_list(html)
418 >>> parent = texts[0].getparent()
419 >>> print(parent.tag)
424 >>> print(texts[1].getparent().tag)
427 You can also find out if it's normal text content or tail text:
429 .. sourcecode:: pycon
431 >>> print(texts[0].is_text)
433 >>> print(texts[1].is_text)
435 >>> print(texts[1].is_tail)
438 While this works for the results of the ``text()`` function, lxml will
439 not tell you the origin of a string value that was constructed by the
440 XPath functions ``string()`` or ``concat()``:
442 .. sourcecode:: pycon
444 >>> stringify = etree.XPath("string()")
445 >>> print(stringify(html))
447 >>> print(stringify(html).getparent())
454 For problems like the above, where you want to recursively traverse the tree
455 and do something with its elements, tree iteration is a very convenient
456 solution. Elements provide a tree iterator for this purpose. It yields
457 elements in *document order*, i.e. in the order their tags would appear if you
458 serialised the tree to XML:
460 .. sourcecode:: pycon
462 >>> root = etree.Element("root")
463 >>> etree.SubElement(root, "child").text = "Child 1"
464 >>> etree.SubElement(root, "child").text = "Child 2"
465 >>> etree.SubElement(root, "another").text = "Child 3"
467 >>> print(etree.tostring(root, pretty_print=True))
469 <child>Child 1</child>
470 <child>Child 2</child>
471 <another>Child 3</another>
474 >>> for element in root.iter():
475 ... print("%s - %s" % (element.tag, element.text))
481 If you know you are only interested in a single tag, you can pass its name to
482 ``iter()`` to have it filter for you:
484 .. sourcecode:: pycon
486 >>> for element in root.iter("child"):
487 ... print("%s - %s" % (element.tag, element.text))
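As a small usage sketch of tree iteration, the following counts how often each tag occurs and collects the text of one tag only. The variable names are made up for this example, and the fallback import is an assumption so that it also runs with the standard library's ElementTree:

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.XML(
    "<root><child>Child 1</child><child>Child 2</child>"
    "<another>Child 3</another></root>")

# iter() without arguments visits the element itself and all descendants
tag_counts = Counter(element.tag for element in root.iter())
print(sorted(tag_counts.items()))
# [('another', 1), ('child', 2), ('root', 1)]

# iter("child") filters by tag name while walking the tree
child_texts = [element.text for element in root.iter("child")]
print(child_texts)  # ['Child 1', 'Child 2']
```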
491 By default, iteration yields all nodes in the tree, including
492 ProcessingInstructions, Comments and Entity instances. If you want to
493 make sure only Element objects are returned, you can pass the
494 ``Element`` factory as tag parameter:
496 .. sourcecode:: pycon
498 >>> root.append(etree.Entity("#234"))
499 >>> root.append(etree.Comment("some comment"))
501 >>> for element in root.iter():
502 ... if isinstance(element.tag, basestring):
503 ... print("%s - %s" % (element.tag, element.text))
505 ... print("SPECIAL: %s - %s" % (element, element.text))
  SPECIAL: &#234; - &#234;
511 SPECIAL: <!--some comment--> - some comment
513 >>> for element in root.iter(tag=etree.Element):
514 ... print("%s - %s" % (element.tag, element.text))
520 >>> for element in root.iter(tag=etree.Entity):
521 ... print(element.text)
524 In lxml.etree, elements provide `further iterators`_ for all directions in the
525 tree: children, parents (or rather ancestors) and siblings.
527 .. _`further iterators`: api.html#iteration
Serialisation commonly uses the ``tostring()`` function that returns a
string, or the ``ElementTree.write()`` method that writes to a file, a
file-like object, or a URL (via FTP PUT or HTTP POST). Both calls accept
the same keyword arguments like ``pretty_print`` for formatted output
or ``encoding`` to select a specific output encoding other than plain
ASCII:
540 .. sourcecode:: pycon
542 >>> root = etree.XML('<root><a><b/></a></root>')
544 >>> etree.tostring(root)
545 b'<root><a><b/></a></root>'
547 >>> print(etree.tostring(root, xml_declaration=True))
548 <?xml version='1.0' encoding='ASCII'?>
549 <root><a><b/></a></root>
551 >>> print(etree.tostring(root, encoding='iso-8859-1'))
552 <?xml version='1.0' encoding='iso-8859-1'?>
553 <root><a><b/></a></root>
555 >>> print(etree.tostring(root, pretty_print=True))
562 Note that pretty printing appends a newline at the end.
564 Since lxml 2.0 (and ElementTree 1.3), the serialisation functions can
565 do more than XML serialisation. You can serialise to HTML or extract
566 the text content by passing the ``method`` keyword:
568 .. sourcecode:: pycon
570 >>> root = etree.XML(
571 ... '<html><head/><body><p>Hello<br/>World</p></body></html>')
573 >>> etree.tostring(root) # default: method = 'xml'
574 b'<html><head/><body><p>Hello<br/>World</p></body></html>'
576 >>> etree.tostring(root, method='xml') # same as above
577 b'<html><head/><body><p>Hello<br/>World</p></body></html>'
579 >>> etree.tostring(root, method='html')
580 b'<html><head></head><body><p>Hello<br>World</p></body></html>'
582 >>> print(etree.tostring(root, method='html', pretty_print=True))
585 <body><p>Hello<br>World</p></body>
588 >>> etree.tostring(root, method='text')
591 As for XML serialisation, the default encoding for plain text
592 serialisation is ASCII:
594 .. sourcecode:: pycon
596 >>> br = root.find('.//br')
597 >>> br.tail = u'W\xf6rld'
599 >>> etree.tostring(root, method='text') # doctest: +ELLIPSIS
600 Traceback (most recent call last):
602 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' ...
604 >>> etree.tostring(root, method='text', encoding="UTF-8")
Here, serialising to a Python unicode string instead of a byte string
can come in handy. Just pass the ``unicode`` type as encoding:
610 .. sourcecode:: pycon
612 >>> etree.tostring(root, encoding=unicode, method='text')
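On Python 3 there is no separate ``unicode`` type, so a sketch of the same idea passes the string ``"unicode"`` as encoding name instead, which both lxml.etree and the standard library's ElementTree accept. The fallback import below is an assumption so that the example runs without lxml:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.XML(
    "<html><head/><body><p>Hello<br/>W\u00f6rld</p></body></html>")

# an explicit byte encoding returns a byte string
as_bytes = etree.tostring(root, method="text", encoding="UTF-8")

# the special name "unicode" returns a text string
as_text = etree.tostring(root, method="text", encoding="unicode")

print(type(as_bytes).__name__)  # bytes
print(type(as_text).__name__)   # str
print(as_text == as_bytes.decode("UTF-8"))  # True
```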
615 The W3C has a good `article about the Unicode character set and
616 character encodings`_.
618 .. _`article about the Unicode character set and character encodings`: http://www.w3.org/International/tutorials/tutorial-char-enc/
The ElementTree class
=====================
An ``ElementTree`` is mainly a document wrapper around a tree with a
root node. It provides a couple of methods for parsing, serialisation
and general document handling. One of the bigger differences from a
single ``Element`` is that it serialises as a complete document. This
includes top-level processing instructions and comments, as well as a
DOCTYPE and other DTD content in the document:
631 .. sourcecode:: pycon
633 >>> tree = etree.parse(StringIO('''\
634 ... <?xml version="1.0"?>
635 ... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
641 >>> print(tree.docinfo.doctype)
642 <!DOCTYPE root SYSTEM "test">
644 >>> # lxml 1.3.4 and later
645 >>> print(etree.tostring(tree))
646 <!DOCTYPE root SYSTEM "test" [
647 <!ENTITY tasty "eggs">
653 >>> # lxml 1.3.4 and later
654 >>> print(etree.tostring(etree.ElementTree(tree.getroot())))
655 <!DOCTYPE root SYSTEM "test" [
656 <!ENTITY tasty "eggs">
662 >>> # ElementTree and lxml <= 1.3.3
663 >>> print(etree.tostring(tree.getroot()))
Note that this has changed in lxml 1.3.4 to match the behaviour of
lxml 2.0. Before, the examples were serialised without DTD content,
which made lxml lose DTD information in an input-output cycle.
Parsing from strings and files
==============================
676 ``lxml.etree`` supports parsing XML in a number of ways and from all
677 important sources, namely strings, files, URLs (http/ftp) and
678 file-like objects. The main parse functions are ``fromstring()`` and
679 ``parse()``, both called with the source as first argument. By
680 default, they use the standard parser, but you can always pass a
681 different parser as second argument.
The fromstring() function
-------------------------
687 The ``fromstring()`` function is the easiest way to parse a string:
689 .. sourcecode:: pycon
691 >>> some_xml_data = "<root>data</root>"
693 >>> root = etree.fromstring(some_xml_data)
696 >>> etree.tostring(root)
703 The ``XML()`` function behaves like the ``fromstring()`` function, but is
704 commonly used to write XML literals right into the source:
706 .. sourcecode:: pycon
708 >>> root = etree.XML("<root>data</root>")
711 >>> etree.tostring(root)
718 The ``parse()`` function is used to parse from files and file-like objects:
720 .. sourcecode:: pycon
722 >>> some_file_like = StringIO("<root>data</root>")
724 >>> tree = etree.parse(some_file_like)
726 >>> etree.tostring(tree)
Note that ``parse()`` returns an ElementTree object, not an Element object as
the string parser functions do:
732 .. sourcecode:: pycon
734 >>> root = tree.getroot()
737 >>> etree.tostring(root)
740 The reasoning behind this difference is that ``parse()`` returns a
741 complete document from a file, while the string parsing functions are
742 commonly used to parse XML fragments.
744 The ``parse()`` function supports any of the following sources:
746 * an open file object
748 * a file-like object that has a ``.read(byte_count)`` method returning
749 a byte string on each call
753 * an HTTP or FTP URL string
Note that passing a filename or URL is usually faster than passing an
open file or file-like object.
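The different source types can be tried out with a short sketch like the following. The file name ``example.xml`` and the temporary directory are made up for the illustration, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
import io
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

xml_bytes = b"<root>data</root>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.xml")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(xml_bytes)

    from_name = etree.parse(path)        # a filename string
    with open(path, "rb") as f:
        from_file = etree.parse(f)       # an open file object

from_memory = etree.parse(io.BytesIO(xml_bytes))  # a file-like object

texts = [tree.getroot().text
         for tree in (from_name, from_file, from_memory)]
print(texts)  # ['data', 'data', 'data']
```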
By default, ``lxml.etree`` uses a standard parser with a default setup. If
you want to configure the parser, you can create a new instance:
765 .. sourcecode:: pycon
767 >>> parser = etree.XMLParser(remove_blank_text=True) # lxml.etree only!
769 This creates a parser that removes empty text between tags while parsing,
770 which can reduce the size of the tree and avoid dangling tail text if you know
771 that whitespace-only content is not meaningful for your data. An example:
773 .. sourcecode:: pycon
775 >>> root = etree.XML("<root> <a/> <b> </b> </root>", parser)
777 >>> etree.tostring(root)
778 b'<root><a/><b> </b></root>'
780 Note that the whitespace content inside the ``<b>`` tag was not removed, as
781 content at leaf elements tends to be data content (even if blank). You can
782 easily remove it in an additional step by traversing the tree:
784 .. sourcecode:: pycon
786 >>> for element in root.iter("*"):
787 ... if element.text is not None and not element.text.strip():
788 ... element.text = None
790 >>> etree.tostring(root)
791 b'<root><a/><b/></root>'
793 See ``help(etree.XMLParser)`` to find out about the available parser options.
799 ``lxml.etree`` provides two ways for incremental step-by-step parsing. One is
800 through file-like objects, where it calls the ``read()`` method repeatedly.
801 This is best used where the data arrives from a source like ``urllib`` or any
802 other file-like object that can provide data on request. Note that the parser
803 will block and wait until data becomes available in this case:
.. sourcecode:: pycon

  >>> class DataSource:
  ...     data = [ b"<roo", b"t><", b"a/", b"><", b"/root>" ]
  ...     def read(self, requested_size):
  ...         try:
  ...             return self.data.pop(0)
  ...         except IndexError:
  ...             return b''

  >>> tree = etree.parse(DataSource())

  >>> etree.tostring(tree)
  b'<root><a/></root>'
820 The second way is through a feed parser interface, given by the ``feed(data)``
821 and ``close()`` methods:
823 .. sourcecode:: pycon
825 >>> parser = etree.XMLParser()
827 >>> parser.feed("<roo")
828 >>> parser.feed("t><")
829 >>> parser.feed("a/")
830 >>> parser.feed("><")
831 >>> parser.feed("/root>")
833 >>> root = parser.close()
835 >>> etree.tostring(root)
838 Here, you can interrupt the parsing process at any time and continue it later
839 on with another call to the ``feed()`` method. This comes in handy if you
840 want to avoid blocking calls to the parser, e.g. in frameworks like Twisted,
841 or whenever data comes in slowly or in chunks and you want to do other things
842 while waiting for the next chunk.
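A minimal sketch of this pattern wraps the feed interface in a function that consumes chunks from any iterable, for example data arriving from a network callback. The function name ``parse_chunks`` is hypothetical, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def parse_chunks(chunks):
    """Hypothetical helper: feed byte chunks to a parser one by one."""
    parser = etree.XMLParser()
    for chunk in chunks:
        # other work could happen between any two feed() calls
        parser.feed(chunk)
    return parser.close()  # returns the root Element


root = parse_chunks([b"<roo", b"t><", b"a/", b"><", b"/root>"])
print(root.tag)                        # root
print([child.tag for child in root])   # ['a']
```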
After calling the ``close()`` method (or when an exception was raised
by the parser), you can reuse the parser by calling its ``feed()``
method again:
848 .. sourcecode:: pycon
850 >>> parser.feed("<root/>")
851 >>> root = parser.close()
852 >>> etree.tostring(root)
Sometimes, all you need from a document is a small fraction somewhere deep
inside the tree, so parsing the whole tree into memory, traversing it and
dropping it can be too much overhead. ``lxml.etree`` supports this use case
with two event-driven parser interfaces, one that generates parser events
while building the tree (``iterparse``), and one that does not build the tree
at all, and instead calls feedback methods on a target object in a SAX-like
fashion.
867 Here is a simple ``iterparse()`` example:
869 .. sourcecode:: pycon
871 >>> some_file_like = StringIO("<root><a>data</a></root>")
873 >>> for event, element in etree.iterparse(some_file_like):
874 ... print("%s, %4s, %s" % (event, element.tag, element.text))
878 By default, ``iterparse()`` only generates events when it is done parsing an
879 element, but you can control this through the ``events`` keyword argument:
881 .. sourcecode:: pycon
883 >>> some_file_like = StringIO("<root><a>data</a></root>")
885 >>> for event, element in etree.iterparse(some_file_like,
886 ... events=("start", "end")):
887 ... print("%5s, %4s, %s" % (event, element.tag, element.text))
893 Note that the text, tail and children of an Element are not necessarily there
894 yet when receiving the ``start`` event. Only the ``end`` event guarantees
895 that the Element has been parsed completely.
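One use of paired ``start`` and ``end`` events is tracking the nesting depth while the document is still being parsed. The following is a sketch under that assumption; the function name ``max_depth`` is made up, and the fallback import is only there so that the code also runs with the standard library's ElementTree:

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def max_depth(source):
    """Hypothetical helper: deepest element nesting level in a document."""
    depth = deepest = 0
    for event, element in etree.iterparse(source, events=("start", "end")):
        if event == "start":   # entering an element
            depth += 1
            deepest = max(deepest, depth)
        else:                  # "end": leaving the element again
            depth -= 1
    return deepest


document = io.BytesIO(b"<root><a><b>data</b></a><a/></root>")
print(max_depth(document))  # 3
```

Only the tag structure is inspected here, so it does not matter that ``text`` and children may be incomplete at ``start`` time.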
``iterparse()`` also allows you to ``.clear()`` or modify the content of an
Element to save memory. So if you parse a large tree and you want to keep
memory usage small, you should clean up parts of the tree that you no longer
need:

.. sourcecode:: pycon

  >>> some_file_like = StringIO(
  ...     "<root><a><b>data</b></a><a><b/></a></root>")

  >>> for event, element in etree.iterparse(some_file_like):
  ...     if element.tag == 'b':
  ...         print(element.text)
  ...     elif element.tag == 'a':
  ...         print("** cleaning up the subtree")
  ...         element.clear()
  data
  ** cleaning up the subtree
  None
  ** cleaning up the subtree
918 If memory is a real bottleneck, or if building the tree is not desired at all,
919 the target parser interface of ``lxml.etree`` can be used. It creates
920 SAX-like events by calling the methods of a target object. By implementing
921 some or all of these methods, you can control which events are generated:
.. sourcecode:: pycon

  >>> class ParserTarget:
  ...     events = []
  ...     close_count = 0
  ...     def start(self, tag, attrib):
  ...         self.events.append(("start", tag, attrib))
  ...     def close(self):
  ...         events, self.events = self.events, []
  ...         self.close_count += 1
  ...         return events

  >>> parser_target = ParserTarget()

  >>> parser = etree.XMLParser(target=parser_target)
  >>> events = etree.fromstring('<root test="true"/>', parser)

  >>> print(parser_target.close_count)
  1

  >>> for event in events:
  ...     print('event: %s - tag: %s' % (event[0], event[1]))
  ...     for attr, value in event[2].items():
  ...         print(' * %s = %s' % (attr, value))
  event: start - tag: root
   * test = true
You can reuse the parser and its target as often as you like, so you
should take care that the ``.close()`` method really resets the
target to a usable state (also in the case of an error!).
954 .. sourcecode:: pycon
956 >>> events = etree.fromstring('<root test="true"/>', parser)
957 >>> print(parser_target.close_count)
959 >>> events = etree.fromstring('<root test="true"/>', parser)
960 >>> print(parser_target.close_count)
962 >>> events = etree.fromstring('<root test="true"/>', parser)
963 >>> print(parser_target.close_count)
966 >>> for event in events:
967 ... print('event: %s - tag: %s' % (event[0], event[1]))
968 ... for attr, value in event[2].items():
969 ... print(' * %s = %s' % (attr, value))
970 event: start - tag: root
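A target does not have to handle every kind of event. As a further sketch, the following target implements only ``data()`` and ``close()`` to collect the text content without building a tree. The class name ``TextCollector`` is hypothetical, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


class TextCollector:
    """Hypothetical parser target that only gathers character data."""

    def __init__(self):
        self.chunks = []

    def data(self, text):
        # may be called several times for one stretch of text
        self.chunks.append(text)

    def close(self):
        # reset the state so that the target is left in a clean state
        text, self.chunks = "".join(self.chunks), []
        return text


parser = etree.XMLParser(target=TextCollector())
result = etree.fromstring("<root>Hello <b>World</b>!</root>", parser)
print(result)  # Hello World!
```

The value returned by ``close()`` becomes the result of the parse call, which is why ``result`` is a string here rather than an Element.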
977 The ElementTree API avoids `namespace prefixes`_ wherever possible and deploys
978 the real namespaces instead:
980 .. sourcecode:: pycon
982 >>> xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
983 >>> body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
984 >>> body.text = "Hello World"
986 >>> print(etree.tostring(xhtml, pretty_print=True))
987 <html:html xmlns:html="http://www.w3.org/1999/xhtml">
988 <html:body>Hello World</html:body>
991 .. _`namespace prefixes`: http://www.w3.org/TR/xml-names/#ns-qualnames
993 As you can see, prefixes only become important when you serialise the result.
994 However, the above code becomes somewhat verbose due to the lengthy namespace
995 names. And retyping or copying a string over and over again is error prone.
996 It is therefore common practice to store a namespace URI in a global variable.
997 To adapt the namespace prefixes for serialisation, you can also pass a mapping
998 to the Element factory, e.g. to define the default namespace:
1000 .. sourcecode:: pycon
1002 >>> XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
1003 >>> XHTML = "{%s}" % XHTML_NAMESPACE
1005 >>> NSMAP = {None : XHTML_NAMESPACE} # the default namespace (no prefix)
1007 >>> xhtml = etree.Element(XHTML + "html", nsmap=NSMAP) # lxml only!
1008 >>> body = etree.SubElement(xhtml, XHTML + "body")
1009 >>> body.text = "Hello World"
1011 >>> print(etree.tostring(xhtml, pretty_print=True))
1012 <html xmlns="http://www.w3.org/1999/xhtml">
1013 <body>Hello World</body>
1016 Namespaces on attributes work alike:
1018 .. sourcecode:: pycon
1020 >>> body.set(XHTML + "bgcolor", "#CCFFAA")
1022 >>> print(etree.tostring(xhtml, pretty_print=True))
1023 <html xmlns="http://www.w3.org/1999/xhtml">
1024 <body bgcolor="#CCFFAA">Hello World</body>
1027 >>> print(body.get("bgcolor"))
1029 >>> body.get(XHTML + "bgcolor")
1032 You can also use XPath in this way:
1034 .. sourcecode:: pycon
1036 >>> find_xhtml_body = etree.ETXPath( # lxml only !
1037 ... "//{%s}body" % XHTML_NAMESPACE)
1038 >>> results = find_xhtml_body(xhtml)
1040 >>> print(results[0].tag)
1041 {http://www.w3.org/1999/xhtml}body
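For simple lookups, the same ``{namespace}tag`` notation also works with the ``find()`` family of methods, optionally with a prefix mapping that is local to the call. This is a sketch; the prefix ``x`` is an arbitrary choice for this example, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
XHTML = "{%s}" % XHTML_NAMESPACE

xhtml = etree.Element(XHTML + "html")
body = etree.SubElement(xhtml, XHTML + "body")
body.text = "Hello World"

# fully qualified tag name, no prefix involved
by_uri = xhtml.find(XHTML + "body")

# prefix "x" exists only in this mapping, not in the document
by_prefix = xhtml.find("x:body", namespaces={"x": XHTML_NAMESPACE})

print(by_uri.text)          # Hello World
print(by_prefix is by_uri)  # True
```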
The ``E-factory`` provides a simple and compact syntax for generating XML and
HTML:

.. sourcecode:: pycon

  >>> from lxml.builder import E

  >>> def CLASS(*args): # class is a reserved word in Python
  ...     return {"class":' '.join(args)}

  >>> html = page = (
  ...   E.html(       # create an Element called "html"
  ...     E.head(
  ...       E.title("This is a sample document")
  ...     ),
  ...     E.body(
  ...       E.h1("Hello!", CLASS("title")),
  ...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
  ...       E.p("This is another paragraph, with a", "\n      ",
  ...         E.a("link", href="http://www.python.org"), "."),
  ...       E.p("Here are some reserved characters: <spam&egg>."),
  ...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
  ...     )
  ...   )
  ... )

  >>> print(etree.tostring(page, pretty_print=True))
  <html>
    <head>
      <title>This is a sample document</title>
    </head>
    <body>
      <h1 class="title">Hello!</h1>
      <p>This is a paragraph with <b>bold</b> text in it!</p>
      <p>This is another paragraph, with a
        <a href="http://www.python.org">link</a>.</p>
      <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
      <p>And finally an embedded XHTML fragment.</p>
    </body>
  </html>
1088 The Element creation based on attribute access makes it easy to build up a
1089 simple vocabulary for an XML language:
.. sourcecode:: pycon

  >>> from lxml.builder import ElementMaker # lxml only !

  >>> E = ElementMaker(namespace="http://my.de/fault/namespace",
  ...                  nsmap={'p' : "http://my.de/fault/namespace"})

  >>> DOC = E.doc
  >>> TITLE = E.title
  >>> SECTION = E.section
  >>> PAR = E.par

  >>> my_doc = DOC(
  ...   TITLE("The dog and the hog"),
  ...   SECTION(
  ...     TITLE("The dog"),
  ...     PAR("Once upon a time, ..."),
  ...     PAR("And then ...")
  ...   ),
  ...   SECTION(
  ...     TITLE("The hog"),
  ...     PAR("Sooner or later ...")
  ...   )
  ... )

  >>> print(etree.tostring(my_doc, pretty_print=True))
  <p:doc xmlns:p="http://my.de/fault/namespace">
    <p:title>The dog and the hog</p:title>
    <p:section>
      <p:title>The dog</p:title>
      <p:par>Once upon a time, ...</p:par>
      <p:par>And then ...</p:par>
    </p:section>
    <p:section>
      <p:title>The hog</p:title>
      <p:par>Sooner or later ...</p:par>
    </p:section>
  </p:doc>
1130 One such example is the module ``lxml.html.builder``, which provides a
1131 vocabulary for HTML.
The ElementTree library comes with a simple XPath-like path language
called ElementPath_. The main difference is that you can use the
``{namespace}tag`` notation in ElementPath expressions. However,
advanced features like value comparison and functions are not
available.
1143 .. _ElementPath: http://effbot.org/zone/element-xpath.htm
1144 .. _`full XPath implementation`: xpathxslt.html#xpath
1146 In addition to a `full XPath implementation`_, lxml.etree supports the
1147 ElementPath language in the same way ElementTree does, even using
1148 (almost) the same implementation. The API provides four methods here
1149 that you can find on Elements and ElementTrees:
1151 * ``iterfind()`` iterates over all Elements that match the path
1154 * ``findall()`` returns a list of matching Elements
1156 * ``find()`` efficiently returns only the first match
1158 * ``findtext()`` returns the ``.text`` content of the first match
1160 Here are some examples:
1162 .. sourcecode:: pycon
1164 >>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
1166 Find a child of an Element:
1168 .. sourcecode:: pycon
1170 >>> print(root.find("b"))
1172 >>> print(root.find("a").tag)
1175 Find an Element anywhere in the tree:
1177 .. sourcecode:: pycon
1179 >>> print(root.find(".//b").tag)
1181 >>> [ b.tag for b in root.iterfind(".//b") ]
1184 Find Elements with a certain attribute:
1186 .. sourcecode:: pycon
1188 >>> print(root.findall(".//a[@x]")[0].tag)
1190 >>> print(root.findall(".//a[@y]"))
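The remaining methods can be sketched in the same way. The example below reuses the document from above; the default value ``"n/a"`` is an arbitrary choice, and the fallback import is an assumption so that the code also runs with the standard library's ElementTree:

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# findtext() returns the .text of the first match ...
a_text = root.findtext("a")
# ... an empty string if the match has no text ...
b_text = root.findtext(".//b")
# ... and the given default if nothing matches
missing_text = root.findtext("missing", default="n/a")

# attribute predicates can also compare against a literal value
matching_tags = [el.tag for el in root.iterfind(".//a[@x='123']")]
b_count = len(root.findall(".//b"))

print(a_text)         # aText
print(repr(b_text))   # ''
print(missing_text)   # n/a
print(matching_tags)  # ['a']
print(b_count)        # 2
```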