   1 =======================
   2 The lxml.etree Tutorial
   3 =======================
   4
   5 .. meta::
   6   :description: The lxml tutorial on XML that feels like Python
   7   :keywords: lxml, etree, tutorial, ElementTree, Python, XML, HTML
   8
   9 :Author:
  10   Stefan Behnel
  11
This tutorial gives a brief overview of the main concepts of the
`ElementTree API`_ as implemented by ``lxml.etree``, and of some simple
enhancements that make your life as a programmer easier.
  15
  16 For a complete reference of the API, see the `generated API
  17 documentation`_.
  18
  19 .. _`ElementTree API`: http://effbot.org/zone/element-index.htm#documentation
  20 .. _`generated API documentation`: api/index.html
  21
  22 .. contents::
  23 ..
  24    1  The Element class
  25      1.1  Elements are lists
  26      1.2  Elements carry attributes
  27      1.3  Elements contain text
  28      1.4  Using XPath to find text
  29      1.5  Tree iteration
  30      1.6  Serialisation
  31    2  The ElementTree class
  32    3  Parsing from strings and files
  33      3.1  The fromstring() function
  34      3.2  The XML() function
  35      3.3  The parse() function
  36      3.4  Parser objects
  37      3.5  Incremental parsing
  38      3.6  Event-driven parsing
  39    4  Namespaces
  40    5  The E-factory
  41    6  ElementPath
  42
  43
  44 ..
  45   >>> try: from StringIO import StringIO
  46   ... except ImportError:
  47   ...    from io import BytesIO
  48   ...    def StringIO(s):
  49   ...        if isinstance(s, str): s = s.encode("UTF-8")
  50   ...        return BytesIO(s)
  51
  52   >>> try: unicode = __builtins__["unicode"]
  53   ... except (NameError, KeyError): unicode = str
  54
  55   >>> try: basestring = __builtins__["basestring"]
  56   ... except (NameError, KeyError): basestring = str
  57
  58
  59 A common way to import ``lxml.etree`` is as follows:
  60
  61 .. sourcecode:: pycon
  62
  63     >>> from lxml import etree
  64
  65 If your code only uses the ElementTree API and does not rely on any
  66 functionality that is specific to ``lxml.etree``, you can also use (any part
  67 of) the following import chain as a fall-back to the original ElementTree:
  68
  69 .. sourcecode:: python
  70
  71     try:
  72       from lxml import etree
  73       print("running with lxml.etree")
  74     except ImportError:
  75       try:
  76         # Python 2.5
  77         import xml.etree.cElementTree as etree
  78         print("running with cElementTree on Python 2.5+")
  79       except ImportError:
  80         try:
  81           # Python 2.5
  82           import xml.etree.ElementTree as etree
  83           print("running with ElementTree on Python 2.5+")
  84         except ImportError:
  85           try:
  86             # normal cElementTree install
  87             import cElementTree as etree
  88             print("running with cElementTree")
  89           except ImportError:
  90             try:
  91               # normal ElementTree install
  92               import elementtree.ElementTree as etree
  93               print("running with ElementTree")
  94             except ImportError:
  95               print("Failed to import ElementTree from any known place")
  96
  97 To aid in writing portable code, this tutorial makes it clear in the examples
  98 which part of the presented API is an extension of lxml.etree over the
  99 original `ElementTree API`_, as defined by Fredrik Lundh's `ElementTree
 100 library`_.
 101
 102 .. _`ElementTree library`: http://effbot.org/zone/element-index.htm
 103
 104 ..
 105   >>> import sys
 106   >>> from lxml import etree as _etree
 107   >>> if sys.version_info[0] >= 3:
 108   ...   class etree_mock(object):
 109   ...     def __getattr__(self, name): return getattr(_etree, name)
 110   ...     def tostring(self, *args, **kwargs):
 111   ...       s = _etree.tostring(*args, **kwargs)
 112   ...       if isinstance(s, bytes) and bytes([10]) in s: s = s.decode("utf-8") # CR
 113   ...       if s[-1] == '\n': s = s[:-1]
 114   ...       return s
 115   ... else:
 116   ...   class etree_mock(object):
 117   ...     def __getattr__(self, name): return getattr(_etree, name)
 118   ...     def tostring(self, *args, **kwargs):
 119   ...       s = _etree.tostring(*args, **kwargs)
 120   ...       if s[-1] == '\n': s = s[:-1]
 121   ...       return s
 122   >>> etree = etree_mock()
 123
 124
 125 The Element class
 126 =================
 127
 128 An ``Element`` is the main container object for the ElementTree API.  Most of
 129 the XML tree functionality is accessed through this class.  Elements are
 130 easily created through the ``Element`` factory:
 131
 132 .. sourcecode:: pycon
 133
 134     >>> root = etree.Element("root")
 135
 136 The XML tag name of elements is accessed through the ``tag`` property:
 137
 138 .. sourcecode:: pycon
 139
 140     >>> print(root.tag)
 141     root
 142
 143 Elements are organised in an XML tree structure.  To create child elements and
 144 add them to a parent element, you can use the ``append()`` method:
 145
 146 .. sourcecode:: pycon
 147
 148     >>> root.append( etree.Element("child1") )
 149
 150 However, this is so common that there is a shorter and much more efficient way
 151 to do this: the ``SubElement`` factory.  It accepts the same arguments as the
 152 ``Element`` factory, but additionally requires the parent as first argument:
 153
 154 .. sourcecode:: pycon
 155
 156     >>> child2 = etree.SubElement(root, "child2")
 157     >>> child3 = etree.SubElement(root, "child3")
 158
 159 To see that this is really XML, you can serialise the tree you have created:
 160
 161 .. sourcecode:: pycon
 162
 163     >>> print(etree.tostring(root, pretty_print=True))
 164     <root>
 165       <child1/>
 166       <child2/>
 167       <child3/>
 168     </root>
 169
 170
 171 Elements are lists
 172 ------------------
 173
To make access to these subelements as easy and straightforward as
possible, elements behave like normal Python lists:
 176
 177 .. sourcecode:: pycon
 178
 179     >>> child = root[0]
 180     >>> print(child.tag)
 181     child1
 182
 183     >>> print(len(root))
 184     3
 185
 186     >>> root.index(root[1]) # lxml.etree only!
 187     1
 188
 189     >>> children = list(root)
 190
 191     >>> for child in root:
 192     ...     print(child.tag)
 193     child1
 194     child2
 195     child3
 196
 197     >>> root.insert(0, etree.Element("child0"))
 198     >>> start = root[:1]
 199     >>> end   = root[-1:]
 200
 201     >>> print(start[0].tag)
 202     child0
 203     >>> print(end[0].tag)
 204     child3
 205
 206     >>> root[0] = root[-1] # this moves the element!
 207     >>> for child in root:
 208     ...     print(child.tag)
 209     child3
 210     child1
 211     child2
 212
Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
an Element to see whether it has children, i.e. whether the list of children
is non-empty.  This is no longer supported, as people tend to find it
surprising that a non-None reference to an existing Element can evaluate to
False.  Instead, use ``len(element)``, which is both more explicit and less
error-prone.
 218
 219 Note in the examples that the last element was *moved* to a different position
 220 in the last example.  This is a difference from the original ElementTree (and
 221 from lists), where elements can sit in multiple positions of any number of
 222 trees.  In lxml.etree, elements can only sit in one position of one tree at a
 223 time.
 224
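The move semantics described above are specific to lxml.etree.  As a minimal
sketch for code that should behave the same under lxml.etree and the original
ElementTree, you can move an element explicitly by removing it before
re-inserting it.  The helper name ``move_child`` is hypothetical and not part
of either library, and the import falls back to the standard library in the
same way as the import chain at the top of this tutorial.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def move_child(parent, child, position):
    """Move ``child`` (a direct child of ``parent``) to ``position``.

    Hypothetical helper, not part of lxml or ElementTree.  Removing the
    element first keeps it from appearing twice under plain ElementTree.
    """
    parent.remove(child)
    parent.insert(position, child)


root = etree.Element("root")
for name in ("child0", "child1", "child2", "child3"):
    etree.SubElement(root, name)

move_child(root, root[-1], 0)  # move the last child to the front
tags = [child.tag for child in root]
print(tags)
```

The number of children stays the same in both libraries, because the element
is taken out of its old position before it is put into the new one.
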
 225 If you want to *copy* an element to a different position, consider creating an
 226 independent *deep copy* using the ``copy`` module from Python's standard
 227 library:
 228
 229 .. sourcecode:: pycon
 230
 231     >>> from copy import deepcopy
 232
 233     >>> element = etree.Element("neu")
 234     >>> element.append( deepcopy(root[1]) )
 235
 236     >>> print(element[0].tag)
 237     child1
 238     >>> print([ c.tag for c in root ])
 239     ['child3', 'child1', 'child2']
 240
 241 The way up in the tree is provided through the ``getparent()`` method:
 242
 243 .. sourcecode:: pycon
 244
 245     >>> root is root[0].getparent()  # lxml.etree only!
 246     True
 247
 248 The siblings (or neighbours) of an element are accessed as next and previous
 249 elements:
 250
 251 .. sourcecode:: pycon
 252
 253     >>> root[0] is root[1].getprevious() # lxml.etree only!
 254     True
 255     >>> root[1] is root[0].getnext() # lxml.etree only!
 256     True
 257
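The navigation methods shown above are lxml.etree extensions.  If your code
also has to run on the original ElementTree, one possible substitute is a
dictionary that maps each element to its parent.  The following is only a
sketch: ``parent_map`` and ``previous_sibling`` are hypothetical names, and
the map is a snapshot that must be rebuilt after the tree changes.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

root = etree.Element("root")
first = etree.SubElement(root, "child1")
second = etree.SubElement(root, "child2")
grandchild = etree.SubElement(second, "grandchild")

# Map every element to its parent; the root element gets no entry.
parent_map = {child: parent for parent in root.iter() for child in parent}


def previous_sibling(element):
    """Return the sibling before ``element``, or None (hypothetical helper)."""
    parent = parent_map.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index - 1] if index > 0 else None


print(parent_map[grandchild].tag)
print(previous_sibling(second).tag)
```

With lxml.etree, ``getparent()`` and ``getprevious()`` remain the more direct
choice, since they always reflect the current state of the tree.
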
 258
 259 Elements carry attributes
 260 -------------------------
 261
 262 XML elements support attributes.  You can create them directly in the Element
 263 factory:
 264
 265 .. sourcecode:: pycon
 266
 267     >>> root = etree.Element("root", interesting="totally")
 268     >>> etree.tostring(root)
 269     b'<root interesting="totally"/>'
 270
 271 Fast and direct access to these attributes is provided by the ``set()`` and
 272 ``get()`` methods of elements:
 273
 274 .. sourcecode:: pycon
 275
 276     >>> print(root.get("interesting"))
 277     totally
 278
 279     >>> root.set("interesting", "somewhat")
 280     >>> print(root.get("interesting"))
 281     somewhat
 282
 283 However, a very convenient way of dealing with them is through the dictionary
 284 interface of the ``attrib`` property:
 285
 286 .. sourcecode:: pycon
 287
 288     >>> attributes = root.attrib
 289
 290     >>> print(attributes["interesting"])
 291     somewhat
 292
 293     >>> print(attributes.get("hello"))
 294     None
 295
 296     >>> attributes["hello"] = "Guten Tag"
 297     >>> print(attributes.get("hello"))
 298     Guten Tag
 299     >>> print(root.get("hello"))
 300     Guten Tag
 301
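Because ``attrib`` is a live view of the element's attributes, it can help to
take an independent copy when you want to keep the old values around.  The
sketch below assumes nothing beyond the ElementTree API and uses ``dict()``
for the copy; the variable names are only illustrative.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

root = etree.Element("root", interesting="totally")
root.set("hello", "Guten Tag")

# dict() takes a snapshot that is independent of the element.
snapshot = dict(root.attrib)
root.set("hello", "Huhu")

# keys() and items() are also available directly on the element.
names = sorted(root.keys())
pairs = sorted(root.items())

print(snapshot["hello"])
print(root.get("hello"))
```

The snapshot still holds the old value after the element has been changed.
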
 302
 303 Elements contain text
 304 ---------------------
 305
 306 Elements can contain text:
 307
 308 .. sourcecode:: pycon
 309
 310     >>> root = etree.Element("root")
 311     >>> root.text = "TEXT"
 312
 313     >>> print(root.text)
 314     TEXT
 315
 316     >>> etree.tostring(root)
 317     b'<root>TEXT</root>'
 318
 319 In many XML documents (*data-centric* documents), this is the only place where
 320 text can be found.  It is encapsulated by a leaf tag at the very bottom of the
 321 tree hierarchy.
 322
 323 However, if XML is used for tagged text documents such as (X)HTML, text can
 324 also appear between different elements, right in the middle of the tree:
 325
 326 .. sourcecode:: html
 327
 328     <html><body>Hello<br/>World</body></html>
 329
 330 Here, the ``<br/>`` tag is surrounded by text.  This is often referred to as
 331 *document-style* or *mixed-content* XML.  Elements support this through their
 332 ``tail`` property.  It contains the text that directly follows the element, up
 333 to the next element in the XML tree:
 334
 335 .. sourcecode:: pycon
 336
 337     >>> html = etree.Element("html")
 338     >>> body = etree.SubElement(html, "body")
 339     >>> body.text = "TEXT"
 340
 341     >>> etree.tostring(html)
 342     b'<html><body>TEXT</body></html>'
 343
 344     >>> br = etree.SubElement(body, "br")
 345     >>> etree.tostring(html)
 346     b'<html><body>TEXT<br/></body></html>'
 347
 348     >>> br.tail = "TAIL"
 349     >>> etree.tostring(html)
 350     b'<html><body>TEXT<br/>TAIL</body></html>'
 351
The two properties ``.text`` and ``.tail`` are enough to represent any
text content in an XML document.  This way, the ElementTree API does
not require any `special text nodes`_ in addition to the Element
class, which tend to get in the way fairly often (as you might know
from classic DOM_ APIs).
 357
 358 However, there are cases where the tail text also gets in the way.
 359 For example, when you serialise an Element from within the tree, you
 360 do not always want its tail text in the result (although you would
 361 still want the tail text of its children).  For this purpose, the
 362 ``tostring()`` function accepts the keyword argument ``with_tail``:
 363
 364 .. sourcecode:: pycon
 365
 366     >>> etree.tostring(br)
 367     b'<br/>TAIL'
 368     >>> etree.tostring(br, with_tail=False) # lxml.etree only!
 369     b'<br/>'
 370
 371 .. _`special text nodes`: http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-1312295772
 372 .. _DOM: http://www.w3.org/TR/DOM-Level-3-Core/core.html
 373
 374 If you want to read *only* the text, i.e. without any intermediate
 375 tags, you have to recursively concatenate all ``text`` and ``tail``
 376 attributes in the correct order.  Again, the ``tostring()`` function
 377 comes to the rescue, this time using the ``method`` keyword:
 378
 379 .. sourcecode:: pycon
 380
 381     >>> etree.tostring(html, method="text")
 382     b'TEXTTAIL'
 383
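To make the "correct order" mentioned above concrete, here is a minimal sketch
of the recursive concatenation written by hand.  The function name
``collect_text`` is hypothetical; in practice, ``tostring(method="text")`` or
the ``itertext()`` method do the same job.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def collect_text(element):
    """Concatenate all text below ``element`` in document order.

    Hypothetical helper for illustration.  Comments and processing
    instructions (whose ``tag`` is not a string) contribute only their
    tail text.
    """
    parts = []
    if isinstance(element.tag, str):
        parts.append(element.text or "")
    for child in element:
        parts.append(collect_text(child))  # text inside the child first
        parts.append(child.tail or "")     # then the text that follows it
    return "".join(parts)


html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
br = etree.SubElement(body, "br")
br.tail = "TAIL"

print(collect_text(html))
```

Note that the tail of the element you start from is deliberately left out,
since it belongs to the content of its parent.
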
 384
 385 Using XPath to find text
 386 ------------------------
 387
 388 .. _XPath: xpathxslt.html#xpath
 389
 390 Another way to extract the text content of a tree is XPath_, which
 391 also allows you to extract the separate text chunks into a list:
 392
 393 .. sourcecode:: pycon
 394
 395     >>> print(html.xpath("string()")) # lxml.etree only!
 396     TEXTTAIL
 397     >>> print(html.xpath("//text()")) # lxml.etree only!
 398     ['TEXT', 'TAIL']
 399
 400 If you want to use this more often, you can wrap it in a function:
 401
 402 .. sourcecode:: pycon
 403
 404     >>> build_text_list = etree.XPath("//text()") # lxml.etree only!
 405     >>> print(build_text_list(html))
 406     ['TEXT', 'TAIL']
 407
 408 Note that a string result returned by XPath is a special 'smart'
 409 object that knows about its origins.  You can ask it where it came
 410 from through its ``getparent()`` method, just as you would with
 411 Elements:
 412
 413 .. sourcecode:: pycon
 414
 415     >>> texts = build_text_list(html)
 416     >>> print(texts[0])
 417     TEXT
 418     >>> parent = texts[0].getparent()
 419     >>> print(parent.tag)
 420     body
 421
 422     >>> print(texts[1])
 423     TAIL
 424     >>> print(texts[1].getparent().tag)
 425     br
 426
 427 You can also find out if it's normal text content or tail text:
 428
 429 .. sourcecode:: pycon
 430
 431     >>> print(texts[0].is_text)
 432     True
 433     >>> print(texts[1].is_text)
 434     False
 435     >>> print(texts[1].is_tail)
 436     True
 437
 438 While this works for the results of the ``text()`` function, lxml will
 439 not tell you the origin of a string value that was constructed by the
 440 XPath functions ``string()`` or ``concat()``:
 441
 442 .. sourcecode:: pycon
 443
 444     >>> stringify = etree.XPath("string()")
 445     >>> print(stringify(html))
 446     TEXTTAIL
 447     >>> print(stringify(html).getparent())
 448     None
 449
 450
 451 Tree iteration
 452 --------------
 453
 454 For problems like the above, where you want to recursively traverse the tree
 455 and do something with its elements, tree iteration is a very convenient
 456 solution.  Elements provide a tree iterator for this purpose.  It yields
 457 elements in *document order*, i.e. in the order their tags would appear if you
 458 serialised the tree to XML:
 459
 460 .. sourcecode:: pycon
 461
 462     >>> root = etree.Element("root")
 463     >>> etree.SubElement(root, "child").text = "Child 1"
 464     >>> etree.SubElement(root, "child").text = "Child 2"
 465     >>> etree.SubElement(root, "another").text = "Child 3"
 466
 467     >>> print(etree.tostring(root, pretty_print=True))
 468     <root>
 469       <child>Child 1</child>
 470       <child>Child 2</child>
 471       <another>Child 3</another>
 472     </root>
 473
 474     >>> for element in root.iter():
 475     ...     print("%s - %s" % (element.tag, element.text))
 476     root - None
 477     child - Child 1
 478     child - Child 2
 479     another - Child 3
 480
 481 If you know you are only interested in a single tag, you can pass its name to
 482 ``iter()`` to have it filter for you:
 483
 484 .. sourcecode:: pycon
 485
 486     >>> for element in root.iter("child"):
 487     ...     print("%s - %s" % (element.tag, element.text))
 488     child - Child 1
 489     child - Child 2
 490
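As a small illustration of what tree iteration is good for, the following
sketch counts how often each tag occurs in a tree and collects the text of
all ``child`` elements.  It only relies on ``iter()`` as described above; the
variable names are illustrative.

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"

# iter() walks the whole tree, including the element it is called on.
tag_counts = Counter(element.tag for element in root.iter())

# Passing a tag name restricts the walk to matching elements.
child_texts = [element.text for element in root.iter("child")]

print(tag_counts["child"])
print(child_texts)
```
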
 491 By default, iteration yields all nodes in the tree, including
 492 ProcessingInstructions, Comments and Entity instances.  If you want to
 493 make sure only Element objects are returned, you can pass the
 494 ``Element`` factory as tag parameter:
 495
 496 .. sourcecode:: pycon
 497
 498     >>> root.append(etree.Entity("#234"))
 499     >>> root.append(etree.Comment("some comment"))
 500
 501     >>> for element in root.iter():
 502     ...     if isinstance(element.tag, basestring):
 503     ...         print("%s - %s" % (element.tag, element.text))
 504     ...     else:
 505     ...         print("SPECIAL: %s - %s" % (element, element.text))
 506     root - None
 507     child - Child 1
 508     child - Child 2
 509     another - Child 3
 510     SPECIAL: &#234; - &#234;
 511     SPECIAL: <!--some comment--> - some comment
 512
 513     >>> for element in root.iter(tag=etree.Element):
 514     ...     print("%s - %s" % (element.tag, element.text))
 515     root - None
 516     child - Child 1
 517     child - Child 2
 518     another - Child 3
 519
 520     >>> for element in root.iter(tag=etree.Entity):
 521     ...     print(element.text)
 522     &#234;
 523
 524 In lxml.etree, elements provide `further iterators`_ for all directions in the
 525 tree: children, parents (or rather ancestors) and siblings.
 526
 527 .. _`further iterators`: api.html#iteration
 528
 529
 530 Serialisation
 531 -------------
 532
Serialisation commonly uses the ``tostring()`` function, which returns a
string, or the ``ElementTree.write()`` method, which writes to a file, a
file-like object, or a URL (via FTP PUT or HTTP POST).  Both calls accept
the same keyword arguments, such as ``pretty_print`` for formatted output
or ``encoding`` to select a specific output encoding other than plain
ASCII:
 539
 540 .. sourcecode:: pycon
 541
 542    >>> root = etree.XML('<root><a><b/></a></root>')
 543
 544    >>> etree.tostring(root)
 545    b'<root><a><b/></a></root>'
 546
 547    >>> print(etree.tostring(root, xml_declaration=True))
 548    <?xml version='1.0' encoding='ASCII'?>
 549    <root><a><b/></a></root>
 550
 551    >>> print(etree.tostring(root, encoding='iso-8859-1'))
 552    <?xml version='1.0' encoding='iso-8859-1'?>
 553    <root><a><b/></a></root>
 554
 555    >>> print(etree.tostring(root, pretty_print=True))
 556    <root>
 557      <a>
 558        <b/>
 559      </a>
 560    </root>
 561
 562 Note that pretty printing appends a newline at the end.
 563
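The examples above all use ``tostring()``.  As a sketch of the
``ElementTree.write()`` side, the following writes a small document to an
in-memory file-like object and parses it back.  The use of ``BytesIO`` as the
target and the variable names are assumptions made for this illustration.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

root = etree.Element("root")
root.text = "W\xf6rld"  # contains a non-ASCII character (o-umlaut)

buffer = BytesIO()
etree.ElementTree(root).write(buffer, encoding="UTF-8", xml_declaration=True)
data = buffer.getvalue()

# Parsing the bytes again restores the original text.
restored = etree.fromstring(data)
print(restored.text == root.text)
```

Asking for an XML declaration records the encoding in the output, so that a
parser reading the bytes later knows how to decode them.
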
 564 Since lxml 2.0 (and ElementTree 1.3), the serialisation functions can
 565 do more than XML serialisation.  You can serialise to HTML or extract
 566 the text content by passing the ``method`` keyword:
 567
 568 .. sourcecode:: pycon
 569
 570    >>> root = etree.XML(
 571    ...    '<html><head/><body><p>Hello<br/>World</p></body></html>')
 572
 573    >>> etree.tostring(root) # default: method = 'xml'
 574    b'<html><head/><body><p>Hello<br/>World</p></body></html>'
 575
 576    >>> etree.tostring(root, method='xml') # same as above
 577    b'<html><head/><body><p>Hello<br/>World</p></body></html>'
 578
 579    >>> etree.tostring(root, method='html')
 580    b'<html><head></head><body><p>Hello<br>World</p></body></html>'
 581
 582    >>> print(etree.tostring(root, method='html', pretty_print=True))
 583    <html>
 584    <head></head>
 585    <body><p>Hello<br>World</p></body>
 586    </html>
 587
 588    >>> etree.tostring(root, method='text')
 589    b'HelloWorld'
 590
 591 As for XML serialisation, the default encoding for plain text
 592 serialisation is ASCII:
 593
 594 .. sourcecode:: pycon
 595
 596    >>> br = root.find('.//br')
 597    >>> br.tail = u'W\xf6rld'
 598
 599    >>> etree.tostring(root, method='text')  # doctest: +ELLIPSIS
 600    Traceback (most recent call last):
 601      ...
 602    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' ...
 603
 604    >>> etree.tostring(root, method='text', encoding="UTF-8")
 605    b'HelloW\xc3\xb6rld'
 606
Here, serialising to a Python unicode string instead of a byte string
can come in handy.  Just pass the ``unicode`` type as encoding:
 609
 610 .. sourcecode:: pycon
 611
 612    >>> etree.tostring(root, encoding=unicode, method='text')
 613    u'HelloW\xf6rld'
 614
 615 The W3C has a good `article about the Unicode character set and
 616 character encodings`_.
 617
 618 .. _`article about the Unicode character set and character encodings`: http://www.w3.org/International/tutorials/tutorial-char-enc/
 619
 620
 621 The ElementTree class
 622 =====================
 623
 624 An ``ElementTree`` is mainly a document wrapper around a tree with a
 625 root node.  It provides a couple of methods for parsing, serialisation
 626 and general document handling.  One of the bigger differences is that
 627 it serialises as a complete document, as opposed to a single
 628 ``Element``.  This includes top-level processing instructions and
 629 comments, as well as a DOCTYPE and other DTD content in the document:
 630
 631 .. sourcecode:: pycon
 632
 633     >>> tree = etree.parse(StringIO('''\
 634     ... <?xml version="1.0"?>
 635     ... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
 636     ... <root>
 637     ...   <a>&tasty;</a>
 638     ... </root>
 639     ... '''))
 640
 641     >>> print(tree.docinfo.doctype)
 642     <!DOCTYPE root SYSTEM "test">
 643
 644     >>> # lxml 1.3.4 and later
 645     >>> print(etree.tostring(tree))
 646     <!DOCTYPE root SYSTEM "test" [
 647     <!ENTITY tasty "eggs">
 648     ]>
 649     <root>
 650       <a>eggs</a>
 651     </root>
 652
 653     >>> # lxml 1.3.4 and later
 654     >>> print(etree.tostring(etree.ElementTree(tree.getroot())))
 655     <!DOCTYPE root SYSTEM "test" [
 656     <!ENTITY tasty "eggs">
 657     ]>
 658     <root>
 659       <a>eggs</a>
 660     </root>
 661
 662     >>> # ElementTree and lxml <= 1.3.3
 663     >>> print(etree.tostring(tree.getroot()))
 664     <root>
 665       <a>eggs</a>
 666     </root>
 667
Note that this has changed in lxml 1.3.4 to match the behaviour of
lxml 2.0.  Before, the examples were serialised without DTD content,
which made lxml lose DTD information in an input-output cycle.
 671
 672
 673 Parsing from strings and files
 674 ==============================
 675
 676 ``lxml.etree`` supports parsing XML in a number of ways and from all
 677 important sources, namely strings, files, URLs (http/ftp) and
 678 file-like objects.  The main parse functions are ``fromstring()`` and
 679 ``parse()``, both called with the source as first argument.  By
 680 default, they use the standard parser, but you can always pass a
 681 different parser as second argument.
 682
 683
 684 The fromstring() function
 685 -------------------------
 686
 687 The ``fromstring()`` function is the easiest way to parse a string:
 688
 689 .. sourcecode:: pycon
 690
 691     >>> some_xml_data = "<root>data</root>"
 692
 693     >>> root = etree.fromstring(some_xml_data)
 694     >>> print(root.tag)
 695     root
 696     >>> etree.tostring(root)
 697     b'<root>data</root>'
 698
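Data that is not well-formed makes ``fromstring()`` raise an exception.  The
sketch below wraps the call in a hypothetical helper, ``parse_or_none``, and
assumes that the ``ParseError`` exception class is available in the ``etree``
module you imported, which should hold for current versions of lxml.etree and
of the standard library's ElementTree.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def parse_or_none(xml_data):
    """Return the root element, or None if the data is not well-formed.

    Hypothetical helper for illustration, not part of either library.
    """
    try:
        return etree.fromstring(xml_data)
    except etree.ParseError:
        return None


good = parse_or_none("<root>data</root>")
bad = parse_or_none("<root>data</wrong>")  # mismatched end tag

print(good.tag)
print(bad)
```

Whether returning ``None`` is appropriate depends on your application; often
it is better to let the exception propagate so that the error is not hidden.
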
 699
 700 The XML() function
 701 ------------------
 702
 703 The ``XML()`` function behaves like the ``fromstring()`` function, but is
 704 commonly used to write XML literals right into the source:
 705
 706 .. sourcecode:: pycon
 707
 708     >>> root = etree.XML("<root>data</root>")
 709     >>> print(root.tag)
 710     root
 711     >>> etree.tostring(root)
 712     b'<root>data</root>'
 713
 714
 715 The parse() function
 716 --------------------
 717
 718 The ``parse()`` function is used to parse from files and file-like objects:
 719
 720 .. sourcecode:: pycon
 721
 722     >>> some_file_like = StringIO("<root>data</root>")
 723
 724     >>> tree = etree.parse(some_file_like)
 725
 726     >>> etree.tostring(tree)
 727     b'<root>data</root>'
 728
Note that ``parse()`` returns an ElementTree object, not an Element object as
the string parser functions do:
 731
 732 .. sourcecode:: pycon
 733
 734     >>> root = tree.getroot()
 735     >>> print(root.tag)
 736     root
 737     >>> etree.tostring(root)
 738     b'<root>data</root>'
 739
 740 The reasoning behind this difference is that ``parse()`` returns a
 741 complete document from a file, while the string parsing functions are
 742 commonly used to parse XML fragments.
 743
 744 The ``parse()`` function supports any of the following sources:
 745
 746 * an open file object
 747
 748 * a file-like object that has a ``.read(byte_count)`` method returning
 749   a byte string on each call
 750
 751 * a filename string
 752
 753 * an HTTP or FTP URL string
 754
 755 Note that passing a filename or URL is usually faster than passing an
 756 open file.
 757
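The following sketch runs ``parse()`` on three of the source types listed
above: a file-like object, a filename string and an open file object.  The
temporary directory and the file name ``example.xml`` are assumptions made
only to keep the example self-contained.

```python
import os
import tempfile
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

xml_bytes = b"<root>data</root>"

# A file-like object whose read() method returns byte strings.
tree_from_object = etree.parse(BytesIO(xml_bytes))

with tempfile.TemporaryDirectory() as directory:
    filename = os.path.join(directory, "example.xml")
    with open(filename, "wb") as f:
        f.write(xml_bytes)

    # A filename string.
    tree_from_name = etree.parse(filename)

    # An open file object, opened in binary mode.
    with open(filename, "rb") as f:
        tree_from_file = etree.parse(f)

texts = [tree.getroot().text
         for tree in (tree_from_object, tree_from_name, tree_from_file)]
print(texts)
```
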
 758
 759 Parser objects
 760 --------------
 761
By default, ``lxml.etree`` uses a standard parser with a default setup.  If
you want to configure the parser, you can create a new instance:
 764
 765 .. sourcecode:: pycon
 766
 767     >>> parser = etree.XMLParser(remove_blank_text=True) # lxml.etree only!
 768
 769 This creates a parser that removes empty text between tags while parsing,
 770 which can reduce the size of the tree and avoid dangling tail text if you know
 771 that whitespace-only content is not meaningful for your data.  An example:
 772
 773 .. sourcecode:: pycon
 774
 775     >>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
 776
 777     >>> etree.tostring(root)
 778     b'<root><a/><b>  </b></root>'
 779
 780 Note that the whitespace content inside the ``<b>`` tag was not removed, as
 781 content at leaf elements tends to be data content (even if blank).  You can
 782 easily remove it in an additional step by traversing the tree:
 783
 784 .. sourcecode:: pycon
 785
 786     >>> for element in root.iter("*"):
 787     ...     if element.text is not None and not element.text.strip():
 788     ...         element.text = None
 789
 790     >>> etree.tostring(root)
 791     b'<root><a/><b/></root>'
 792
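If the same clean-up is needed in code that cannot rely on the
``remove_blank_text`` parser option, the traversal can be extended to cover
tail text as well.  This is only a sketch; ``strip_blank_text`` is a
hypothetical helper, and it is more aggressive than the parser option because
it also blanks leaf elements.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def strip_blank_text(root):
    """Drop whitespace-only text and tail content below ``root``.

    Hypothetical helper.  Only use it where whitespace-only content
    carries no meaning for your data.
    """
    for node in root.iter():
        # Only real elements carry text content that needs checking;
        # comments and processing instructions keep theirs.
        if isinstance(node.tag, str):
            if node.text is not None and not node.text.strip():
                node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


root = etree.XML("<root>  <a>x</a>   <b>  </b>     </root>")
strip_blank_text(root)

print(root.text, root[0].text, root[0].tail, root[1].text)
```
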
 793 See ``help(etree.XMLParser)`` to find out about the available parser options.
 794
 795
 796 Incremental parsing
 797 -------------------
 798
 799 ``lxml.etree`` provides two ways for incremental step-by-step parsing.  One is
 800 through file-like objects, where it calls the ``read()`` method repeatedly.
 801 This is best used where the data arrives from a source like ``urllib`` or any
 802 other file-like object that can provide data on request.  Note that the parser
 803 will block and wait until data becomes available in this case:
 804
 805 .. sourcecode:: pycon
 806
 807     >>> class DataSource:
 808     ...     data = [ b"<roo", b"t><", b"a/", b"><", b"/root>" ]
 809     ...     def read(self, requested_size):
 810     ...         try:
 811     ...             return self.data.pop(0)
 812     ...         except IndexError:
 813     ...             return b''
 814
 815     >>> tree = etree.parse(DataSource())
 816
 817     >>> etree.tostring(tree)
 818     b'<root><a/></root>'
 819
 820 The second way is through a feed parser interface, given by the ``feed(data)``
 821 and ``close()`` methods:
 822
 823 .. sourcecode:: pycon
 824
 825     >>> parser = etree.XMLParser()
 826
 827     >>> parser.feed("<roo")
 828     >>> parser.feed("t><")
 829     >>> parser.feed("a/")
 830     >>> parser.feed("><")
 831     >>> parser.feed("/root>")
 832
 833     >>> root = parser.close()
 834
 835     >>> etree.tostring(root)
 836     b'<root><a/></root>'
 837
 838 Here, you can interrupt the parsing process at any time and continue it later
 839 on with another call to the ``feed()`` method.  This comes in handy if you
 840 want to avoid blocking calls to the parser, e.g. in frameworks like Twisted,
 841 or whenever data comes in slowly or in chunks and you want to do other things
 842 while waiting for the next chunk.
 843
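A minimal sketch of how the feed interface might be wrapped: the hypothetical
helper ``parse_chunks`` takes any iterable of data chunks, for example a
generator that yields data as it arrives, and returns the root element once
the input is exhausted.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def parse_chunks(chunks):
    """Feed an iterable of data chunks to a parser and return the root.

    Hypothetical helper.  Between two ``feed()`` calls the caller is
    free to do other work, which is the point of this interface.
    """
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


root = parse_chunks([b"<roo", b"t><", b"a>text</a", b"><", b"/root>"])
print(root.tag, root[0].text)
```

The chunk boundaries do not have to line up with tag boundaries, as the
deliberately awkward split in the example shows.
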
 844 After calling the ``close()`` method (or when an exception was raised
 845 by the parser), you can reuse the parser by calling its ``feed()``
 846 method again:
 847
 848 .. sourcecode:: pycon
 849
 850     >>> parser.feed("<root/>")
 851     >>> root = parser.close()
 852     >>> etree.tostring(root)
 853     b'<root/>'
 854
 855
 856 Event-driven parsing
 857 --------------------
 858
 859 Sometimes, all you need from a document is a small fraction somewhere deep
 860 inside the tree, so parsing the whole tree into memory, traversing it and
 861 dropping it can be too much overhead.  ``lxml.etree`` supports this use case
 862 with two event-driven parser interfaces, one that generates parser events
 863 while building the tree (``iterparse``), and one that does not build the tree
 864 at all, and instead calls feedback methods on a target object in a SAX-like
 865 fashion.
 866
 867 Here is a simple ``iterparse()`` example:
 868
 869 .. sourcecode:: pycon
 870
 871     >>> some_file_like = StringIO("<root><a>data</a></root>")
 872
 873     >>> for event, element in etree.iterparse(some_file_like):
 874     ...     print("%s, %4s, %s" % (event, element.tag, element.text))
 875     end,    a, data
 876     end, root, None
 877
 878 By default, ``iterparse()`` only generates events when it is done parsing an
 879 element, but you can control this through the ``events`` keyword argument:
 880
 881 .. sourcecode:: pycon
 882
 883     >>> some_file_like = StringIO("<root><a>data</a></root>")
 884
 885     >>> for event, element in etree.iterparse(some_file_like,
 886     ...                                       events=("start", "end")):
 887     ...     print("%5s, %4s, %s" % (event, element.tag, element.text))
 888     start, root, None
 889     start,    a, data
 890       end,    a, data
 891       end, root, None
 892
 893 Note that the text, tail and children of an Element are not necessarily there
 894 yet when receiving the ``start`` event.  Only the ``end`` event guarantees
 895 that the Element has been parsed completely.
 896
It also allows you to ``.clear()`` or modify the content of an Element to
save memory.  So if you parse a large tree and you want to keep memory
usage small, you should clean up parts of the tree that you no longer
need:
 901
 902 .. sourcecode:: pycon
 903
 904     >>> some_file_like = StringIO(
 905     ...     "<root><a><b>data</b></a><a><b/></a></root>")
 906
 907     >>> for event, element in etree.iterparse(some_file_like):
 908     ...     if element.tag == 'b':
 909     ...         print(element.text)
 910     ...     elif element.tag == 'a':
 911     ...         print("** cleaning up the subtree")
 912     ...         element.clear()
 913     data
 914     ** cleaning up the subtree
 915     None
 916     ** cleaning up the subtree
 917
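The clean-up pattern can be packaged as a generator that hands out only the
data you are interested in.  The following is a sketch under the assumption
that the text of each matching element is all you need; ``iter_texts`` is a
hypothetical name.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def iter_texts(source, tag):
    """Yield the text of each ``tag`` element, clearing it afterwards.

    Hypothetical helper for illustration.
    """
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag == tag:
            yield element.text
            element.clear()  # free content that is no longer needed


source = BytesIO(b"<root><item>one</item><item>two</item><other/></root>")
texts = list(iter_texts(source, "item"))
print(texts)
```

Keep in mind that a cleared element stays in the tree as an empty child of
its parent, so very long documents still accumulate empty elements unless you
remove those as well.
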
 918 If memory is a real bottleneck, or if building the tree is not desired at all,
 919 the target parser interface of ``lxml.etree`` can be used.  It creates
 920 SAX-like events by calling the methods of a target object.  By implementing
 921 some or all of these methods, you can control which events are generated:
 922
 923 .. sourcecode:: pycon
 924
 925     >>> class ParserTarget:
 926     ...     events = []
 927     ...     close_count = 0
 928     ...     def start(self, tag, attrib):
 929     ...         self.events.append(("start", tag, attrib))
 930     ...     def close(self):
 931     ...         events, self.events = self.events, []
 932     ...         self.close_count += 1
 933     ...         return events
 934
 935     >>> parser_target = ParserTarget()
 936
 937     >>> parser = etree.XMLParser(target=parser_target)
 938     >>> events = etree.fromstring('<root test="true"/>', parser)
 939
 940     >>> print(parser_target.close_count)
 941     1
 942
 943     >>> for event in events:
 944     ...     print('event: %s - tag: %s' % (event[0], event[1]))
 945     ...     for attr, value in event[2].items():
 946     ...         print(' * %s = %s' % (attr, value))
 947     event: start - tag: root
 948      * test = true
 949
You can reuse the parser and its target as often as you like, so you
should take care that the ``.close()`` method really resets the
target to a usable state (also in the case of an error!).
 953
 954 .. sourcecode:: pycon
 955
 956     >>> events = etree.fromstring('<root test="true"/>', parser)
 957     >>> print(parser_target.close_count)
 958     2
 959     >>> events = etree.fromstring('<root test="true"/>', parser)
 960     >>> print(parser_target.close_count)
 961     3
 962     >>> events = etree.fromstring('<root test="true"/>', parser)
 963     >>> print(parser_target.close_count)
 964     4
 965
 966     >>> for event in events:
 967     ...     print('event: %s - tag: %s' % (event[0], event[1]))
 968     ...     for attr, value in event[2].items():
 969     ...         print(' * %s = %s' % (attr, value))
 970     event: start - tag: root
 971      * test = true
 972
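One possible way to keep a target reusable is to hold all per-document state
in instance attributes and to rebuild them in ``.close()``.  The following is
only a sketch of that idea.  The ``TagCollector`` class and its ``_reset()``
helper are hypothetical names, not part of lxml.

```python
from lxml import etree


class TagCollector:
    """Parser target that records tag names and the maximum nesting depth."""

    def __init__(self):
        self._reset()

    def _reset(self):
        # All per-document state lives here, so one call restores a clean target.
        self.tags = []
        self.depth = 0
        self.max_depth = 0

    def start(self, tag, attrib):
        self.tags.append(tag)
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def end(self, tag):
        self.depth -= 1

    def close(self):
        # Hand out the collected result, then start over for the next document.
        result = (self.tags, self.max_depth)
        self._reset()
        return result


collector = TagCollector()
parser = etree.XMLParser(target=collector)

first = etree.fromstring("<a><b><c/></b></a>", parser)
second = etree.fromstring("<x/>", parser)
print(first)
print(second)
```

Because ``.close()`` rebinds the attributes instead of mutating them, the
result of the first run is not affected by the second one.
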
 973
 974 Namespaces
 975 ==========
 976
The ElementTree API avoids `namespace prefixes`_ wherever possible and uses
the real namespaces instead:
 979
 980 .. sourcecode:: pycon
 981
 982     >>> xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
 983     >>> body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
 984     >>> body.text = "Hello World"
 985
 986     >>> print(etree.tostring(xhtml, pretty_print=True))
 987     <html:html xmlns:html="http://www.w3.org/1999/xhtml">
 988       <html:body>Hello World</html:body>
 989     </html:html>
 990
 991 .. _`namespace prefixes`: http://www.w3.org/TR/xml-names/#ns-qualnames
 992
As you can see, prefixes only become important when you serialise the result.
However, the above code becomes somewhat verbose due to the lengthy namespace
names, and retyping or copying a string over and over again is error-prone.
It is therefore common practice to store a namespace URI in a global variable.
 997 To adapt the namespace prefixes for serialisation, you can also pass a mapping
 998 to the Element factory, e.g. to define the default namespace:
 999
1000 .. sourcecode:: pycon
1001
1002     >>> XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
1003     >>> XHTML = "{%s}" % XHTML_NAMESPACE
1004
1005     >>> NSMAP = {None : XHTML_NAMESPACE} # the default namespace (no prefix)
1006
1007     >>> xhtml = etree.Element(XHTML + "html", nsmap=NSMAP) # lxml only!
1008     >>> body = etree.SubElement(xhtml, XHTML + "body")
1009     >>> body.text = "Hello World"
1010
1011     >>> print(etree.tostring(xhtml, pretty_print=True))
1012     <html xmlns="http://www.w3.org/1999/xhtml">
1013       <body>Hello World</body>
1014     </html>
1015
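The mapping can also declare prefixed namespaces next to the default one.  The
sketch below assumes a second namespace (SVG) purely for illustration and
inspects the result through the ``.prefix`` and ``.nsmap`` properties and the
``etree.QName`` helper, all of which are specific to ``lxml.etree``.

```python
from lxml import etree

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
SVG_NAMESPACE = "http://www.w3.org/2000/svg"  # second namespace, for illustration

# None maps the default namespace, "svg" maps a prefixed one.
NSMAP = {None: XHTML_NAMESPACE, "svg": SVG_NAMESPACE}

html = etree.Element("{%s}html" % XHTML_NAMESPACE, nsmap=NSMAP)
body = etree.SubElement(html, "{%s}body" % XHTML_NAMESPACE)
svg = etree.SubElement(body, "{%s}svg" % SVG_NAMESPACE)

# Children pick up the declarations made on their ancestors.
print(body.prefix)   # no prefix, the default namespace applies
print(svg.prefix)    # the prefix declared in NSMAP

# QName splits a qualified tag into namespace URI and local name.
qname = etree.QName(svg)
print(qname.namespace, qname.localname)

serialised = etree.tostring(html).decode("ascii")
print(serialised)
```
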
Namespaces on attributes work the same way:
1017
1018 .. sourcecode:: pycon
1019
1020     >>> body.set(XHTML + "bgcolor", "#CCFFAA")
1021
1022     >>> print(etree.tostring(xhtml, pretty_print=True))
1023     <html xmlns="http://www.w3.org/1999/xhtml">
1024       <body bgcolor="#CCFFAA">Hello World</body>
1025     </html>
1026
1027     >>> print(body.get("bgcolor"))
1028     None
1029     >>> body.get(XHTML + "bgcolor")
1030     '#CCFFAA'
1031
1032 You can also use XPath in this way:
1033
1034 .. sourcecode:: pycon
1035
1036     >>> find_xhtml_body = etree.ETXPath(      # lxml only !
1037     ...     "//{%s}body" % XHTML_NAMESPACE)
1038     >>> results = find_xhtml_body(xhtml)
1039
1040     >>> print(results[0].tag)
1041     {http://www.w3.org/1999/xhtml}body
1042
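For comparison, here is a small sketch that runs the same query once with
``ETXPath`` and once with the regular ``.xpath()`` method, which expects a
prefix mapping instead of the ``{namespace}tag`` notation.  The sample document
and the prefix ``h`` are arbitrary choices made for this example.

```python
from lxml import etree

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"

root = etree.XML(
    '<html xmlns="%s"><body><p>Hello</p><p>World</p></body></html>'
    % XHTML_NAMESPACE
)

# ETXPath understands the {namespace}tag notation directly.
find_paragraphs = etree.ETXPath("//{%s}p" % XHTML_NAMESPACE)
via_etxpath = [p.text for p in find_paragraphs(root)]

# Plain XPath needs a prefix that is mapped to the namespace URI.
via_xpath = [
    p.text for p in root.xpath("//h:p", namespaces={"h": XHTML_NAMESPACE})
]

# Without any namespace the expression matches nothing in this document.
unqualified = root.xpath("//p")

print(via_etxpath, via_xpath, unqualified)
```
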
1043
1044 The E-factory
1045 =============
1046
1047 The ``E-factory`` provides a simple and compact syntax for generating XML and
1048 HTML:
1049
1050 .. sourcecode:: pycon
1051
1052     >>> from lxml.builder import E
1053
1054     >>> def CLASS(*args): # class is a reserved word in Python
1055     ...     return {"class":' '.join(args)}
1056
1057     >>> html = page = (
1058     ...   E.html(       # create an Element called "html"
1059     ...     E.head(
1060     ...       E.title("This is a sample document")
1061     ...     ),
1062     ...     E.body(
1063     ...       E.h1("Hello!", CLASS("title")),
1064     ...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
1065     ...       E.p("This is another paragraph, with a", "\n      ",
1066     ...         E.a("link", href="http://www.python.org"), "."),
    ...       E.p("Here are some reserved characters: <spam&egg>."),
    ...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
    ...     )
    ...   )
    ... )

    >>> print(etree.tostring(page, pretty_print=True))
    <html>
      <head>
        <title>This is a sample document</title>
      </head>
      <body>
        <h1 class="title">Hello!</h1>
        <p>This is a paragraph with <b>bold</b> text in it!</p>
        <p>This is another paragraph, with a
          <a href="http://www.python.org">link</a>.</p>
        <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
1084         <p>And finally an embedded XHTML fragment.</p>
1085       </body>
1086     </html>
1087
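Since the factory calls are ordinary Python expressions, they combine well
with loops and argument unpacking when the content comes from data.  The
following sketch builds a list from a sequence of strings.  The ``make_list()``
helper and the ``item-N`` id scheme are made up for this example.

```python
from lxml import etree
from lxml.builder import E


def make_list(items):
    """Build a <ul> with one <li> per entry (hypothetical helper)."""
    return E.ul(
        # Positional strings become text, keyword arguments become attributes.
        *[E.li(text, id="item-%d" % i) for i, text in enumerate(items, 1)]
    )


ul = make_list(["spam", "eggs & ham"])
serialised = etree.tostring(ul).decode("ascii")
print(serialised)
```

Reserved characters in the input are escaped on serialisation, so the data can
be passed in as plain text.
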
1088 The Element creation based on attribute access makes it easy to build up a
1089 simple vocabulary for an XML language:
1090
1091 .. sourcecode:: pycon
1092
1093     >>> from lxml.builder import ElementMaker # lxml only !
1094
1095     >>> E = ElementMaker(namespace="http://my.de/fault/namespace",
1096     ...                  nsmap={'p' : "http://my.de/fault/namespace"})
1097
1098     >>> DOC = E.doc
1099     >>> TITLE = E.title
1100     >>> SECTION = E.section
1101     >>> PAR = E.par
1102
1103     >>> my_doc = DOC(
1104     ...   TITLE("The dog and the hog"),
1105     ...   SECTION(
1106     ...     TITLE("The dog"),
1107     ...     PAR("Once upon a time, ..."),
1108     ...     PAR("And then ...")
1109     ...   ),
1110     ...   SECTION(
1111     ...     TITLE("The hog"),
1112     ...     PAR("Sooner or later ...")
1113     ...   )
1114     ... )
1115
1116     >>> print(etree.tostring(my_doc, pretty_print=True))
1117     <p:doc xmlns:p="http://my.de/fault/namespace">
1118       <p:title>The dog and the hog</p:title>
1119       <p:section>
1120         <p:title>The dog</p:title>
1121         <p:par>Once upon a time, ...</p:par>
1122         <p:par>And then ...</p:par>
1123       </p:section>
1124       <p:section>
1125         <p:title>The hog</p:title>
1126         <p:par>Sooner or later ...</p:par>
1127       </p:section>
1128     </p:doc>
1129
1130 One such example is the module ``lxml.html.builder``, which provides a
1131 vocabulary for HTML.
1132
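A short sketch of what using that module might look like is given below.  It
imports the builder under the arbitrary alias ``H`` and only uses a handful of
its upper-case factory functions.  The page content is invented for the
example.

```python
from lxml.html import builder as H

page = H.HTML(
    H.HEAD(
        H.TITLE("Demo")
    ),
    H.BODY(
        H.H1("Hello!", H.CLASS("title")),   # CLASS() builds the class attribute
        H.P("Some ", H.B("bold"), " text.")
    )
)

# The result is a normal Element tree and supports the usual API.
title = page.findtext("head/title")
h1_class = page.find("body/h1").get("class")
paragraph_text = page.find("body/p").text_content()
print(title, h1_class, paragraph_text)
```
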
1133
1134 ElementPath
1135 ===========
1136
1137 The ElementTree library comes with a simple XPath-like path language
1138 called ElementPath_.  The main difference is that you can use the
1139 ``{namespace}tag`` notation in ElementPath expressions.  However,
1140 advanced features like value comparison and functions are not
1141 available.
1142
1143 .. _ElementPath: http://effbot.org/zone/element-xpath.htm
1144 .. _`full XPath implementation`: xpathxslt.html#xpath
1145
In addition to a `full XPath implementation`_, ``lxml.etree`` supports the
ElementPath language in the same way ElementTree does, even using
(almost) the same implementation.  The API provides four methods here
that you can find on Elements and ElementTrees:
1150
1151 * ``iterfind()`` iterates over all Elements that match the path
1152   expression
1153
1154 * ``findall()`` returns a list of matching Elements
1155
1156 * ``find()`` efficiently returns only the first match
1157
1158 * ``findtext()`` returns the ``.text`` content of the first match
1159
1160 Here are some examples:
1161
1162 .. sourcecode:: pycon
1163
1164     >>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
1165
1166 Find a child of an Element:
1167
1168 .. sourcecode:: pycon
1169
1170     >>> print(root.find("b"))
1171     None
1172     >>> print(root.find("a").tag)
1173     a
1174
1175 Find an Element anywhere in the tree:
1176
1177 .. sourcecode:: pycon
1178
1179     >>> print(root.find(".//b").tag)
1180     b
1181     >>> [ b.tag for b in root.iterfind(".//b") ]
1182     ['b', 'b']
1183
1184 Find Elements with a certain attribute:
1185
1186 .. sourcecode:: pycon
1187
1188     >>> print(root.findall(".//a[@x]")[0].tag)
1189     a
1190     >>> print(root.findall(".//a[@y]"))
1191     []
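
The examples above leave out ``findtext()`` and the ``{namespace}tag``
notation.  As a rough sketch, both can be combined as follows.  The namespace
URI and the document are placeholders chosen for this example.

```python
from lxml import etree

NS = "http://example.com/ns"  # placeholder namespace URI

root = etree.XML(
    '<root xmlns:n="%s">'
    '<n:a x="1">first<n:b>inner</n:b></n:a>'
    '<n:a>second</n:a>'
    '</root>' % NS
)

# iterfind() yields the matches lazily, here the direct <n:a> children.
a_texts = [a.text for a in root.iterfind("{%s}a" % NS)]

# findtext() returns the .text of the first match ...
b_text = root.findtext(".//{%s}b" % NS)

# ... or the given default if nothing matches.
missing_text = root.findtext("missing", default="n/a")

# Attribute predicates work with qualified tag names as well.
with_x = root.findall(".//{%s}a[@x]" % NS)

print(a_texts, b_text, missing_text, len(with_x))
```
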