   1 =====================================
   2 lxml FAQ - Frequently Asked Questions
   3 =====================================
   4
   5 .. meta::
   6   :description: Frequently Asked Questions about lxml (FAQ)
   7   :keywords: lxml, lxml.etree, FAQ, frequently asked questions
   8
   9 Frequently asked questions on lxml.  See also the notes on compatibility_ to
  10 ElementTree_.
  11
  12 .. _compatibility: compatibility.html
  13 .. _ElementTree:   http://effbot.org/zone/element-index.htm
  14 .. _`build instructions`: build.html
  15 .. _`MacOS-X` : build.html#building-lxml-on-macos-x
  16
  17 .. contents::
  18 ..
  19    1  General Questions
  20      1.1  Is there a tutorial?
  21      1.2  Where can I find more documentation about lxml?
  22      1.3  What standards does lxml implement?
  23      1.4  Who uses lxml?
  24      1.5  What is the difference between lxml.etree and lxml.objectify?
  25      1.6  How can I make my application run faster?
  26      1.7  What about that trailing text on serialised Elements?
  27      1.8  How can I find out if an Element is a comment or PI?
  28      1.9  How can I map an XML tree into a dict of dicts?
  29    2  Installation
  30      2.1  Which version of libxml2 and libxslt should I use or require?
  31      2.2  Where are the binary builds?
  32      2.3  Why do I get errors about missing UCS4 symbols when installing lxml?
  33    3  Contributing
  34      3.1  Why is lxml not written in Python?
  35      3.2  How can I contribute?
  36    4  Bugs
  37      4.1  My application crashes!
  38      4.2  My application crashes on MacOS-X!
  39      4.3  I think I have found a bug in lxml. What should I do?
  40      4.4  How do I know a bug is really in lxml and not in libxml2?
  41    5  Threading
  42      5.1  Can I use threads to concurrently access the lxml API?
  43      5.2  Does my program run faster if I use threads?
  44      5.3  Would my single-threaded program run faster if I turned off threading?
  45      5.4  Why can't I reuse XSLT stylesheets in other threads?
  46      5.5  My program crashes when run with mod_python/Pyro/Zope/Plone/...
  47    6  Parsing and Serialisation
  48      6.1  Why doesn't the ``pretty_print`` option reformat my XML output?
  49      6.2  Why can't lxml parse my XML from unicode strings?
  50      6.3  What is the difference between str(xslt(doc)) and xslt(doc).write() ?
  51      6.4  Why can't I just delete parents or clear the root node in iterparse()?
  52      6.5  How do I output null characters in XML text?
  53    7  XPath and Document Traversal
  54      7.1  What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
  55      7.2  Why doesn't ``findall()`` support full XPath expressions?
  56      7.3  How can I find out which namespace prefixes are used in a document?
  57      7.4  How can I specify a default namespace for XPath expressions?
  58
  59 ..
  60   >>> import sys
  61   >>> from lxml import etree as _etree
  62   >>> if sys.version_info[0] >= 3:
  63   ...   class etree_mock(object):
  64   ...     def __getattr__(self, name): return getattr(_etree, name)
  65   ...     def tostring(self, *args, **kwargs):
  66   ...       s = _etree.tostring(*args, **kwargs)
  67   ...       if isinstance(s, bytes): s = s.decode("utf-8") # CR
  68   ...       if s[-1] == '\n': s = s[:-1]
  69   ...       return s
  70   ... else:
  71   ...   class etree_mock(object):
  72   ...     def __getattr__(self, name): return getattr(_etree, name)
  73   ...     def tostring(self, *args, **kwargs):
  74   ...       s = _etree.tostring(*args, **kwargs)
  75   ...       if s[-1] == '\n': s = s[:-1]
  76   ...       return s
  77   >>> etree = etree_mock()
  78
  79
  80 General Questions
  81 =================
  82
  83 Is there a tutorial?
  84 --------------------
  85
  86 Read the `lxml.etree Tutorial`_.  While this is still work in progress
  87 (just as any good documentation), it provides an overview of the most
  88 important concepts in ``lxml.etree``.  If you want to help out,
  89 improving the tutorial is a very good place to start.
  90
  91 There is also a `tutorial for ElementTree`_ which works for
  92 ``lxml.etree``.  The documentation of the `extended etree API`_ also
  93 contains many examples for ``lxml.etree``.  Fredrik Lundh's `element
  94 library`_ contains a lot of nice recipes that show how to solve common
  95 tasks in ElementTree and lxml.etree.  To learn how to use
  96 ``lxml.objectify``, read the `objectify documentation`_.
  97
  98 John Shipman has written another tutorial called `Python XML
  99 processing with lxml`_ that contains lots of examples.  Liza Daly
 100 wrote a nice article about high-performance aspects when `parsing
 101 large files with lxml`_.
 102
 103 .. _`lxml.etree Tutorial`:      tutorial.html
 104 .. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
 105 .. _`extended etree API`:        api.html
 106 .. _`objectify documentation`:  objectify.html
 107 .. _`Python XML processing with lxml`: http://www.nmt.edu/tcc/help/pubs/pylxml/
 108 .. _`element library`:          http://effbot.org/zone/element-lib.htm
 109 .. _`parsing large files with lxml`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
 110
 111
 112 Where can I find more documentation about lxml?
 113 -----------------------------------------------
 114
 115 There is a lot of documentation on the web and also in the Python
 116 standard library documentation, as lxml implements the well-known
 117 `ElementTree API`_ and tries to follow its documentation as closely as
 118 possible.  The recipes in Fredrik Lundh's `element library`_ are
 119 generally worth taking a look at.  There are a couple of issues where
 120 lxml cannot keep up compatibility.  They are described in the
 121 compatibility_ documentation.
 122
 123 The lxml specific extensions to the API are described by individual
 124 files in the ``doc`` directory of the source distribution and on `the
 125 web page`_.
 126
 127 The `generated API documentation`_ is a comprehensive API reference
 128 for the lxml package.
 129
 130 .. _`ElementTree API`: http://effbot.org/zone/element-index.htm
 131 .. _`the web page`:    http://codespeak.net/lxml/#documentation
 132 .. _`generated API documentation`: api/index.html
 133
 134
 135 What standards does lxml implement?
 136 -----------------------------------
 137
 138 Compliance with XML standards depends on the support in libxml2 and libxslt.
 139 Here is a quote from `http://xmlsoft.org/ <http://xmlsoft.org/>`_:
 140
 141   In most cases libxml2 tries to implement the specifications in a relatively
 142   strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests
 143   from the OASIS XML Tests Suite.
 144
 145 lxml currently supports libxml2 2.6.20 or later, which has even better
 146 support for various XML standards.  The important ones are:
 147
 148 * XML 1.0
 149 * HTML 4
 150 * XML namespaces
 151 * XML Schema 1.0
 152 * XPath 1.0
 153 * XInclude 1.0
 154 * XSLT 1.0
 155 * EXSLT
 156 * XML catalogs
 157 * canonical XML
 158 * RelaxNG
 159 * xml:id
 160 * xml:base
 161
 162 Support for XML Schema is currently not 100% complete in libxml2, but
 163 is definitely very close to compliance.  Schematron is supported,
 164 although not necessarily complete.  libxml2 also supports loading
 165 documents through HTTP and FTP.
 166
 167
 168 Who uses lxml?
 169 --------------
 170
 171 As an XML library, lxml is often used under the hood of in-house
 172 server applications, such as web servers or applications that
 173 facilitate some kind of document management.  Many people who deploy
 174 Zope_ or Plone_ use it together with lxml.  Therefore, it is hard to
 175 get an idea of who uses it, and the following list of 'users and
 176 projects we know of' is definitely not a complete list of lxml's
 177 users.
 178
 179 Also note that compatibility with the ElementTree library means that
 180 projects do not need a hard dependency on lxml, as long as they do
 181 not take advantage of lxml's enhanced feature set.
 182
 183 * cssutils_, a CSS parser and toolkit, can be used with ``lxml.cssselect``
 184 * Deliverance_, a content theming tool
 185 * `Enfold Proxy 4`_, a web server accelerator with on-the-fly XSLT processing
 186 * Inteproxy_, a secure HTTP proxy
 187 * lwebstring_, an XML template engine
 188 * OpenXMLlib_, a library for handling OpenXML document meta data
 189 * Pycoon_, a WSGI web development framework based on XML pipelines
 190 * PyQuery_, a query framework for XML/HTML, similar to jQuery for JavaScript
 191 * Rambler_, a meta search engine that aggregates different data sources
 192 * rdfadict_, an RDFa parser with a simple dictionary-like interface.
 193
 194 Zope3 and some of its extensions have good support for lxml:
 195
 196 * gocept.lxml_, Zope3 interface bindings for lxml
 197 * z3c.rml_, an implementation of ReportLab's RML format
 198 * zif.sedna_, an XQuery based interface to the Sedna OpenSource XML database
 199
 200 And don't miss the quotes by our generally happy_ users_, and other
 201 `sites that link to lxml`_.  As `Liza Daly`_ puts it: "Many software
 202 products come with the pick-two caveat, meaning that you must choose
 203 only two: speed, flexibility, or readability.  When used carefully,
 204 lxml can provide all three."
 205
 206 .. _Zope: http://www.zope.org/
 207 .. _Plone: http://www.plone.org/
 208 .. _cssutils: http://code.google.com/p/cssutils/source/browse/trunk/examples/style.py?r=917
 209 .. _Deliverance: http://www.openplans.org/projects/deliverance/project-home
 210 .. _`Enfold Proxy 4`: http://www.enfoldsystems.com/Products/Proxy/4
 211 .. _gocept.lxml: http://pypi.python.org/pypi/gocept.lxml
 212 .. _Inteproxy: http://lists.wald.intevation.org/pipermail/inteproxy-devel/2007-February/000000.html
 213 .. _lwebstring: http://pypi.python.org/pypi/lwebstring
 214 .. _OpenXMLlib: http://permalink.gmane.org/gmane.comp.python.lxml.devel/3250
 215 .. _Pycoon: http://pypi.python.org/pypi/pycoon
 216 .. _PyQuery: http://pypi.python.org/pypi/pyquery
 217 .. _Rambler: http://beta.rambler.ru/srch?query=python+lxml&searchtype=web
 218 .. _rdfadict: http://pypi.python.org/pypi/rdfadict
 219 .. _z3c.rml: http://pypi.python.org/pypi/z3c.rml
 220 .. _zif.sedna: http://pypi.python.org/pypi/zif.sedna
 221
 222 .. _happy: http://thread.gmane.org/gmane.comp.python.lxml.devel/3244/focus=3244
 223 .. _users: http://article.gmane.org/gmane.comp.python.lxml.devel/3246
 224 .. _`sites that link to lxml`: http://www.google.com/search?as_lq=http:%2F%2Fcodespeak.net%2Flxml
 225 .. _`Liza Daly`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
 226
 227
 228 What is the difference between lxml.etree and lxml.objectify?
 229 -------------------------------------------------------------
 230
 231 The two modules provide different ways of handling XML.  However, objectify
 232 builds on top of lxml.etree and therefore inherits most of its capabilities
 233 and a large portion of its API.
 234
 235 * lxml.etree is a generic API for XML and HTML handling.  It aims for
 236   ElementTree compatibility_ and supports the entire XML infoset.  It is well
 237   suited for both mixed content and data centric XML.  Its generality makes it
 238   the best choice for most applications.
 239
 240 * lxml.objectify is a specialized API for XML data handling in a Python object
 241   syntax.  It provides a very natural way to deal with data fields stored in a
 242   structurally well defined XML format.  Data is automatically converted to
 243   Python data types and can be manipulated with normal Python operators.  Look
 244   at the examples in the `objectify documentation`_ to see what it feels like
 245   to use it.
 246
 247   Objectify is not well suited for mixed contents or HTML documents.  As it is
 248   built on top of lxml.etree, however, it inherits the normal support for
 249   XPath, XSLT or validation.
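
As a rough illustration of the difference, the following sketch reads the same
document through both APIs.  The ``<order>`` document and its field names are
made up for this example; they are not part of lxml.

.. sourcecode:: python

    from lxml import etree, objectify

    # Hypothetical data-centric document, used only for illustration.
    ORDER_XML = b"<order><quantity>3</quantity><price>9.5</price></order>"

    # lxml.etree: children are generic Elements and their text is a string,
    # so the conversion to numbers is explicit.
    order = etree.fromstring(ORDER_XML)
    etree_total = int(order.findtext("quantity")) * float(order.findtext("price"))

    # lxml.objectify: children are reached as attributes and their values
    # behave like Python numbers.
    order_obj = objectify.fromstring(ORDER_XML)
    objectify_total = order_obj.quantity * order_obj.price

    print(etree_total, objectify_total)

Both computations arrive at the same total; the objectify version simply leaves
the type conversion to the library.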
 250
 251
 252 How can I make my application run faster?
 253 -----------------------------------------
 254
 255 lxml.etree is a very fast library for processing XML.  There are, however, `a
 256 few caveats`_ involved in the mapping of the powerful libxml2 library to the
 257 simple and convenient ElementTree API.  Not all operations are as fast as the
 258 simplicity of the API might suggest, while some use cases can heavily benefit
 259 from finding the right way of doing them.  The `benchmark page`_ has a
 260 comparison to other ElementTree implementations and a number of tips for
 261 performance tweaking.  As with any Python application, the rule of thumb is:
 262 the more of your processing runs in C, the faster your application gets.  See
 263 also the section on threading_.
 264
 265 .. _`a few caveats`:  performance.html#the-elementtree-api
 266 .. _`benchmark page`: performance.html
 267 .. _threading:        #threading
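
To make the rule of thumb a bit more concrete, here is a small sketch that
answers the same question twice: once with a loop in Python and once with a
single XPath expression that is evaluated inside libxml2.  The ``<catalog>``
document and both function names are assumptions made for this example, and
whether the second variant is actually faster for your data is something only
a benchmark can tell.

.. sourcecode:: python

    from lxml import etree

    # Hypothetical input, used only for illustration.
    root = etree.XML(
        "<catalog>"
        "<item price='5'/><item price='12'/><item price='30'/>"
        "</catalog>"
    )

    def count_expensive_python(root, limit=10):
        # The comparison runs in the Python interpreter for every element.
        return sum(1 for item in root.iter("item")
                   if float(item.get("price")) > limit)

    def count_expensive_xpath(root, limit=10):
        # The whole selection is evaluated by the XPath engine.  XPath's
        # count() returns a float, hence the int() conversion.
        return int(root.xpath("count(//item[@price > $limit])", limit=limit))

    print(count_expensive_python(root), count_expensive_xpath(root))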
 268
 269
 270 What about that trailing text on serialised Elements?
 271 -----------------------------------------------------
 272
 273 The ElementTree tree model defines an Element as a container with a tag name,
 274 contained text, child Elements and a tail text.  This means that whenever you
 275 serialise an Element, you will get all parts of that Element:
 276
 277 .. sourcecode:: pycon
 278
 279     >>> root = etree.XML("<root><tag>text<child/></tag>tail</root>")
 280     >>> print(etree.tostring(root[0]))
 281     <tag>text<child/></tag>tail
 282
 283 Here is an example that shows why not serialising the tail would be
 284 even more surprising from an object point of view:
 285
 286 .. sourcecode:: pycon
 287
 288     >>> root = etree.Element("test")
 289
 290     >>> root.text = "TEXT"
 291     >>> print(etree.tostring(root))
 292     <test>TEXT</test>
 293
 294     >>> root.tail = "TAIL"
 295     >>> print(etree.tostring(root))
 296     <test>TEXT</test>TAIL
 297
 298     >>> root.tail = None
 299     >>> print(etree.tostring(root))
 300     <test>TEXT</test>
 301
 302 Just imagine a Python list where you append an item and it doesn't
 303 show up when you look at the list.
 304
 305 The ``.tail`` property is a huge simplification for the tree model as
 306 it keeps text nodes out of the list of children and makes
 307 access to them quick and simple.  So this is a benefit in most
 308 applications and simplifies many, many XML tree algorithms.
 309
 310 However, in document-like XML (and especially HTML), the above result can be
 311 unexpected to new users and can sometimes require a bit more overhead.  A good
 312 way to deal with this is to use helper functions that copy the Element without
 313 its tail.  The ``lxml.html`` package also deals with this in a couple of
 314 places, as most HTML algorithms benefit from a tail-free behaviour.
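
A helper of the kind mentioned above could look like the following minimal
sketch.  The function name ``copy_without_tail`` is made up for this example;
it is not part of lxml.

.. sourcecode:: python

    import copy
    from lxml import etree

    def copy_without_tail(element):
        """Return a deep copy of ``element`` with its tail text dropped."""
        clone = copy.deepcopy(element)
        clone.tail = None  # only the copy is changed, not the original
        return clone

    root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    serialised = etree.tostring(copy_without_tail(root[0]))
    print(serialised)

If only the serialised output matters, ``etree.tostring()`` in current lxml
releases also accepts a ``with_tail=False`` argument, which avoids the copy.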
 315
 316
 317 How can I find out if an Element is a comment or PI?
 318 ----------------------------------------------------
 319
 320 .. sourcecode:: pycon
 321
 322     >>> root = etree.XML("<?my PI?><root><!-- empty --></root>")
 323
 324     >>> root.tag
 325     'root'
 326     >>> root.getprevious().tag is etree.PI
 327     True
 328     >>> root[0].tag is etree.Comment
 329     True
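
The same identity test can be used while walking a tree, for example to skip
everything that is not a normal Element.  This is only a sketch; the helper
name ``node_kind`` is an assumption made for this example.

.. sourcecode:: python

    from lxml import etree

    def node_kind(node):
        # Comments and processing instructions carry the factory function
        # as their tag, normal Elements carry a string.
        if node.tag is etree.Comment:
            return "comment"
        if node.tag is etree.PI:
            return "processing instruction"
        return "element"

    root = etree.XML("<root><!-- note --><?target data?><child/></root>")
    kinds = [node_kind(node) for node in root]
    print(kinds)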
 330
 331
 332 How can I map an XML tree into a dict of dicts?
 333 -----------------------------------------------
 334
 335 I'm glad you asked.
 336
 337 .. sourcecode:: python
 338
 339     def recursive_dict(element):
 340          return element.tag, \
 341                 dict(map(recursive_dict, element)) or element.text
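
A short usage sketch for the function above, repeated here so that the example
is self-contained.  The input document is made up.  Note that this mapping
keeps only tag names and text: attributes and tail text are dropped, and
siblings that share a tag name overwrite each other in the dict.

.. sourcecode:: python

    from lxml import etree

    def recursive_dict(element):
        # A leaf maps to its text, anything else to a dict of its children.
        return element.tag, \
               dict(map(recursive_dict, element)) or element.text

    root = etree.XML("<root><a>1</a><b><c>2</c></b></root>")
    tag, mapping = recursive_dict(root)
    print(tag, mapping)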
 342
 343
 344 Installation
 345 ============
 346
 347 Which version of libxml2 and libxslt should I use or require?
 348 -------------------------------------------------------------
 349
 350 It really depends on your application, but the rule of thumb is: more recent
 351 versions contain fewer bugs and provide more features.
 352
 353 * Do not use libxml2 2.6.27 if you want to use XPath (including XSLT).  You
 354   will get crashes when XPath errors occur during the evaluation (e.g. for
 355   unknown functions).  This happens inside the evaluation call to libxml2, so
 356   there is nothing that lxml can do about it.
 357
 358 * Try to use versions of both libraries that were released together.  At least
 359   the libxml2 version should not be older than the libxslt version.
 360
 361 * If you use XML Schema or Schematron, which are still under development, the
 362   most recent version of libxml2 is usually a good bet.
 363
 364 * The same applies to XPath, where a substantial number of bugs and memory
 365   leaks were fixed over time.  If you encounter crashes or memory leaks in
 366   XPath applications, try a more recent version of libxml2.
 367
 368 * For parsing and fixing broken HTML, lxml requires at least libxml2 2.6.21.
 369
 370 * For the normal tree handling, however, any libxml2 version starting with
 371   2.6.20 should do.
 372
 373 Read the `release notes of libxml2`_ and the `release notes of libxslt`_ to
 374 see when (or if) a specific bug has been fixed.
 375
 376 .. _`release notes of libxml2`: http://xmlsoft.org/news.html
 377 .. _`release notes of libxslt`: http://xmlsoft.org/XSLT/news.html
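
If your application depends on one of the version limits listed above, it can
check the library version at startup.  The sketch below relies on
``etree.LIBXML_VERSION`` being a tuple of integers; the constant and function
names are assumptions made for this example.

.. sourcecode:: python

    from lxml import etree

    # Minimum libxml2 version for parsing and fixing broken HTML (see above).
    MIN_LIBXML_FOR_BROKEN_HTML = (2, 6, 21)

    def supports_broken_html(libxml_version=etree.LIBXML_VERSION):
        # Version tuples compare element by element, like (2, 6, 21).
        return tuple(libxml_version) >= MIN_LIBXML_FOR_BROKEN_HTML

    if not supports_broken_html():
        print("libxml2 is too old for HTML parsing: %s" % (etree.LIBXML_VERSION,))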
 378
 379
 380 Where are the binary builds?
 381 ----------------------------
 382
 383 Sidnei da Silva regularly contributes Windows binaries for new
 384 releases.  This is because two of the major problems of Microsoft
 385 Windows make it non-trivial for users to build lxml on this platform:
 386 the lack of a pre-installed standard compiler and the missing package
 387 management.
 388
 389 If there is not currently a binary distribution of the most recent
 390 lxml release for this platform available from the Python Package Index
 391 (PyPI), please look through the older versions to see if they provide
 392 a binary build.  This is done by appending the version number to the
 393 PyPI URL, e.g.::
 394
 395           http://pypi.python.org/pypi/lxml/2.1.5
 396
 397 Apart from that, we generally do not provide binary builds of lxml, as
 398 most of the other operating systems out there can build lxml without
 399 problems (with the exception of `MacOS-X`_), and the sheer mass of
 400 variations between platforms makes it futile to provide builds for
 401 everyone.
 402
 403
 404 Why do I get errors about missing UCS4 symbols when installing lxml?
 405 --------------------------------------------------------------------
 406
 407 Most likely, you use a Python installation that was configured for internal
 408 use of UCS2 unicode, meaning 16-bit unicode.  The lxml egg distributions are
 409 generally compiled on platforms that use UCS4, a 32-bit unicode encoding, as
 410 this is used on the majority of platforms.  Sadly, the two are not compatible,
 411 so the eggs can only support the one they were compiled with.
 412
 413 This means that you have to compile lxml from sources for your system.  Note
 414 that you do not need Cython for this: the lxml source distribution is directly
 415 compilable on both platform types.  See the `build instructions`_ on how to do
 416 this.
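
To find out which kind of Python installation you have, you can look at
``sys.maxunicode``.  This is a minimal sketch; the function name
``unicode_build`` is made up for this example.

.. sourcecode:: python

    import sys

    def unicode_build(maxunicode=sys.maxunicode):
        # Narrow (UCS2) builds cannot represent code points above 0xFFFF
        # in a single unit, wide (UCS4) builds report 0x10FFFF.
        return "UCS2" if maxunicode == 0xFFFF else "UCS4"

    print(unicode_build())

Python 3.3 and later always report the full range here, so the distinction
only matters for older interpreters.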
 417
 418
 419 Contributing
 420 ============
 421
 422 Why is lxml not written in Python?
 423 ----------------------------------
 424
 425 It *almost* is.
 426
 427 lxml is not written in plain Python, because it interfaces with two C
 428 libraries: libxml2 and libxslt.  Accessing them at the C-level is
 429 required for performance reasons.
 430
 431 However, to avoid writing plain C-code and caring too much about the
 432 details of built-in types and reference counting, lxml is written in
 433 Cython_, a Python-like language that is translated into C-code.
 434 Chances are that if you know Python, you can write `code that Cython
 435 accepts`_.  Again, the C-ish style used in the lxml code is just for
 436 performance optimisations.  If you want to contribute, don't bother
 437 with the details: a Python implementation of your contribution is
 438 better than none.  And keep in mind that lxml's flexible API often
 439 favours an implementation of features in pure Python, without
 440 bothering with C-code at all.  For example, the ``lxml.html`` package
 441 is entirely written in Python.
 442
 443 Please contact the `mailing list`_ if you need any help.
 444
 445 .. _Cython: http://www.cython.org/
 446 .. _`code that Cython accepts`: http://docs.cython.org/docs/tutorial.html
 447
 448
 449 How can I contribute?
 450 ---------------------
 451
 452 If you find something that you would like lxml to do (or do better),
 453 then please tell us about it on the `mailing list`_.  Patches are
 454 always appreciated, especially when accompanied by unit tests and
 455 documentation (doctests would be great).  See the ``tests``
 456 subdirectories in the lxml source tree (below the ``src`` directory)
 457 and the ReST_ `text files`_ in the ``doc`` directory.
 458
 459 We also have a `list of missing features`_ that we would like to
 460 implement but didn't due to lack of time.  If *you* find the time,
 461 patches are very welcome.
 462
 463 .. _ReST: http://docutils.sourceforge.net/rst.html
 464 .. _`text files`: http://codespeak.net/svn/lxml/trunk/doc/
 465 .. _`list of missing features`: http://codespeak.net/svn/lxml/trunk/IDEAS.txt
 466
 467 Besides enhancing the code, there are a lot of places where you can help the
 468 project and its user base.  You can
 469
 470 * spread the word and write about lxml.  Many users (especially new Python
 471   users) have not yet heard about lxml, although our user base is constantly
 472   growing.  If you write your own blog and feel like saying something about
 473   lxml, go ahead and do so.  If we think your contribution or criticism is
 474   valuable to other users, we may even put a link or a quote on the project
 475   page.
 476
 477 * provide code examples for the general usage of lxml or specific problems
 478   solved with lxml.  Readable code is a very good way of showing how a library
 479   can be used and what great things you can do with it.  Again, if we hear
 480   about it, we can set a link on the project page.
 481
 482 * work on the documentation.  The web page is generated from a set of ReST_
 483   `text files`_.  It is meant both as a representative project page for lxml
 484   and as a site for documenting lxml's API and usage.  If you have questions
 485   or an idea how to make it more readable and accessible while you are reading
 486   it, please send a comment to the `mailing list`_.
 487
 488 * enhance the web site. We put some work into making the web site
 489   usable, understandable and also easy to find, but there are always
 490   things that can be done better.  You may notice that we are not
 491   top-ranked when searching the web for "Python and XML", so maybe you
 492   have an idea how to improve that.
 493
 494 * help with the tutorial.  A tutorial is the most important starting point for
 495   new users, so it is important for us to provide an easy to understand guide
 496   into lxml.  As with all documentation, the tutorial is work in progress, so we
 497   appreciate every helping hand.
 498
 499 * improve the docstrings.  lxml uses docstrings to support Python's integrated
 500   online ``help()`` function.  However, sometimes these are not sufficient to
 501   grasp the details of the function in question.  If you find such a place,
 502   you can try to write up a better description and send it to the `mailing
 503   list`_.
 504
 505
 506 Bugs
 507 ====
 508
 509 My application crashes!
 510 -----------------------
 511
 512 One of the goals of lxml is "no segfaults", so if there is no clear
 513 warning in the documentation that you were doing something potentially
 514 harmful, you have found a bug and we would like to hear about it.
 515 Please report this bug to the `mailing list`_.  See the section on bug
 516 reporting to learn how to do that.
 517
 518 If your application (or e.g. your web container) uses threads, please
 519 see the FAQ section on threading_ to check if you touch on one of the
 520 potential pitfalls.
 521
 522 In any case, try to reproduce the problem with the latest versions of
 523 libxml2 and libxslt.  From time to time, bugs and race conditions are found
 524 in these libraries, so a more recent version might already contain a fix for
 525 your problem.
 526
 527 Remember: even if you see lxml appear in a crash stack trace, it is
 528 not necessarily lxml that *caused* the crash.
 529
 530
 531 My application crashes on MacOS-X!
 532 ----------------------------------
 533
 534 This was a common problem up to lxml 2.1.x.  Since lxml 2.2, the only
 535 officially supported way to use it on this platform is through a
 536 static build against freshly downloaded versions of libxml2 and
 537 libxslt.  See the build instructions for `MacOS-X`_.
 538
 539
 540 I think I have found a bug in lxml. What should I do?
 541 -----------------------------------------------------
 542
 543 First, you should look at the `current developer changelog`_ to see if this
 544 is a known problem that has already been fixed in the SVN trunk since the
 545 release you are using.
 546
 547 .. _`current developer changelog`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt
 548
 549 Also, the 'crash' section above has some good advice on what to try to see if
 550 the problem is really in lxml - and not in your setup.  Believe it or not,
 551 that happens more often than you might think, especially when old libraries
 552 or even multiple library versions are installed.
 553
 554 You should always try to reproduce the problem with the latest
 555 versions of libxml2 and libxslt - and make sure they are used.
 556 ``lxml.etree`` can tell you what it runs with:
 557
 558 .. sourcecode:: python
 559
 560    from lxml import etree
 561    print("lxml.etree:        %s" % (etree.LXML_VERSION,))
 562    print("libxml used:       %s" % (etree.LIBXML_VERSION,))
 563    print("libxml compiled:   %s" % (etree.LIBXML_COMPILED_VERSION,))
 564    print("libxslt used:      %s" % (etree.LIBXSLT_VERSION,))
 565    print("libxslt compiled:  %s" % (etree.LIBXSLT_COMPILED_VERSION,))
 566
 567 If you can figure out that the problem is not in lxml but in the
 568 underlying libxml2 or libxslt, you can ask right on the respective
 569 mailing lists, which may considerably reduce the time to find a fix or
 570 work-around.  See the next question for some hints on how to do that.
 571
 572 Otherwise, we would really like to hear about it.  Please report it to
 573 the `mailing list`_ so that we can fix it.  It is very helpful in this
 574 case if you can come up with a short code snippet that demonstrates
 575 your problem.  If others can reproduce and see the problem, it is much
 576 easier for them to fix it - and maybe even easier for you to describe
 577 it and get people convinced that it really is a problem to fix.
 578
 579 It is important that you always report the version of lxml, libxml2
 580 and libxslt that you get from the code snippet above.  If we do not
 581 know the library versions you are using, we will have to ask, so it will
 582 take longer for you to get a helpful answer.
 583
 584 Since as a user of lxml you are likely a programmer, you might find
 585 `this article on bug reports`_ an interesting read.
 586
 587 .. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev
 588 .. _`this article on bug reports`: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
 589
 590
 591 How do I know a bug is really in lxml and not in libxml2?
 592 ---------------------------------------------------------
 593
 594 A large part of lxml's functionality is implemented by libxml2 and
 595 libxslt, so problems that you encounter may be in one or the other.
 596 Knowing the right place to ask will reduce the time it takes to fix
 597 the problem, or to find a work-around.
 598
 599 Both libxml2 and libxslt come with their own command line frontends,
 600 namely ``xmllint`` and ``xsltproc``.  If you encounter problems with
 601 XSLT processing for specific stylesheets or with validation for
 602 specific schemas, try to run the XSLT with ``xsltproc`` or the
 603 validation with ``xmllint`` respectively to find out if it fails there
 604 as well.  If it does, please report directly to the mailing lists of
 605 the respective project, namely:
 606
 607 * `libxml2 mailing list <http://mail.gnome.org/mailman/listinfo/xml>`_
 608 * `libxslt mailing list <http://mail.gnome.org/mailman/listinfo/xslt>`_
 609
 610 On the other hand, everything that seems to be related to Python code,
 611 including custom resolvers, custom XPath functions, etc. is likely
 612 outside of the scope of libxml2/libxslt.  If you encounter problems
 613 here or you are not sure where the problem may come from, please
 614 ask on the lxml mailing list first.
 615
 616 In any case, a good explanation of the problem including some simple
 617 test code and some input data will help us (or the libxml2 developers)
 618 see and understand the problem, which largely increases your chance of
 619 getting help.  See the question above for a few hints on what is
 620 helpful here.
 621
 622
 623 Threading
 624 =========
 625
 626 Can I use threads to concurrently access the lxml API?
 627 ------------------------------------------------------
 628
 629 Short answer: yes, if you use lxml 2.2 or later.
 630
 631 Since version 1.1, lxml frees the GIL (Python's global interpreter
 632 lock) internally when parsing from disk and memory, as long as you use
 633 either the default parser (which is replicated for each thread) or
 634 create a parser for each thread yourself.  lxml also allows
 635 concurrency during validation (RelaxNG and XMLSchema) and XSL
 636 transformation.  You can share RelaxNG, XMLSchema and XSLT objects
 637 between threads.
 638
 639 While you can also share parsers between threads, this will serialize
 640 the access to each of them, so it is better to ``.copy()`` parsers or
 641 to just use the default parser if you do not need any special
 642 configuration.  The same applies to the XPath evaluators, which use an
 643 internal lock to protect their prepared evaluation contexts.  It is
 644 therefore best to use separate evaluator instances in threads.
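
The simplest setup described above, where each thread parses with the default
parser, can be sketched as follows.  The input documents and the function name
``count_items`` are made up for this example, and ``concurrent.futures`` is
used only as a convenient way to start the threads.

.. sourcecode:: python

    from concurrent.futures import ThreadPoolExecutor
    from lxml import etree

    def count_items(xml_bytes):
        # etree.fromstring() uses the default parser, so no parser object
        # is shared between the worker threads.
        root = etree.fromstring(xml_bytes)
        return len(root.findall("item"))

    # Hypothetical input documents.
    documents = [b"<r><item/></r>", b"<r><item/><item/></r>", b"<r/>"]

    with ThreadPoolExecutor(max_workers=2) as pool:
        counts = list(pool.map(count_items, documents))

    print(counts)

Each tree is created and used inside a single thread here, which also avoids
the pitfalls of passing trees between threads.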
 645
 646 Warning: Before lxml 2.2, and especially before 2.1, there were
 647 various issues when moving subtrees between different threads, or when
 648 applying XSLT objects from one thread to trees parsed or modified in
 649 another.  If you need code to run with older versions, you should
 650 generally avoid modifying trees in threads other than the one they were
 651 generated in.  Although this should work in many cases, there are
 652 certain scenarios where the termination of a thread that parsed a tree
 653 can crash the application if subtrees of this tree were moved to other
 654 documents.  You should be on the safe side when passing trees between
 655 threads if you either
 656
 657 - do not modify these trees and do not move their elements to other
 658   trees, or
 659
 660 - do not terminate threads while the trees they parsed are still in
 661   use (e.g. by using a fixed size thread-pool or long-running threads
 662   in processing chains)
 663
 664 Since lxml 2.2, even multi-thread pipelines are supported. However,
 665 note that it is more efficient to do all tree work inside one thread,
 666 than to let multiple threads work on a tree one after the other. This
 667 is because trees inherit state from the thread that created them,
 668 which must be maintained when the tree is modified inside another
 669 thread.
 670
 671
 672 Does my program run faster if I use threads?
 673 --------------------------------------------
 674
 675 Depends.  The best way to answer this is timing and profiling.
 676
 677 The global interpreter lock (GIL) in Python serializes access to the
 678 interpreter, so if the majority of your processing is done in Python
 679 code (walking trees, modifying elements, etc.), your gain will be
 680 close to zero.  The more of your XML processing moves into lxml,
 681 however, the higher your gain.  If your application is bound by XML
 682 parsing and serialisation, or by very selective XPath expressions and
 683 complex XSLTs, your speedup on multi-processor machines can be
 684 substantial.
 685
 686 See the question above to learn which operations free the GIL to support
 687 multi-threading.
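
      A rough way to time this for a parsing-bound workload might look like
      the following sketch.  The document size, the number of documents and
      the pool size are arbitrary assumptions, and ``count_children()`` and
      ``timed()`` are helper names invented here.  The numbers it prints will
      differ between machines, so treat it as a starting point for your own
      measurements only:

      .. sourcecode:: python

         import time
         from concurrent.futures import ThreadPoolExecutor
         from lxml import etree

         DOCUMENT = b"<root>" + b"<item>text</item>" * 5000 + b"</root>"
         DOCUMENTS = [DOCUMENT] * 16

         def count_children(data):
             # Parse inside the calling thread; only an int is returned.
             return len(etree.fromstring(data))

         def timed(func):
             # Return the function result together with the elapsed seconds.
             start = time.perf_counter()
             result = func()
             return result, time.perf_counter() - start

         sequential, t_seq = timed(
             lambda: [count_children(d) for d in DOCUMENTS])

         # A fixed size pool keeps the worker threads alive between tasks.
         with ThreadPoolExecutor(max_workers=4) as pool:
             threaded, t_thr = timed(
                 lambda: list(pool.map(count_children, DOCUMENTS)))

         print("sequential: %.3fs  threaded: %.3fs" % (t_seq, t_thr))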
 688
 689
 690 Would my single-threaded program run faster if I turned off threading?
 691 ----------------------------------------------------------------------
 692
 693 Possibly, yes.  You can see for yourself by compiling lxml entirely
 694 without threading support.  Pass the ``--without-threading`` option to
 695 setup.py when building lxml from source.  You can also build libxml2
 696 without pthread support (``--without-pthreads`` option), which may add
 697 another bit of performance.  Note that this will leave internal data
 698 structures entirely without thread protection, so make sure you really
 699 do not use lxml outside of the main application thread in this case.
 700
 701
 702 Why can't I reuse XSLT stylesheets in other threads?
 703 ----------------------------------------------------
 704
 705 Since later lxml 2.0 versions, you can do this.  There is some
 706 overhead involved as the result document needs an additional cleanup
 707 traversal when the input document and/or the stylesheet were created
 708 in other threads.  However, on a multi-processor machine, the gain of
 709 freeing the GIL easily covers this drawback.
 710
 711 If you need even the last bit of performance, consider keeping (a copy
 712 of) the stylesheet in thread-local storage, and try creating the input
 713 document(s) in the same thread.  And do not forget to benchmark your
 714 code to see if the increased code complexity is really worth it.
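
      One possible shape for the thread-local approach is sketched below.
      The stylesheet and the names ``get_transform()`` and ``render()`` are
      assumptions made for illustration.  The input document is parsed in the
      same thread that runs the transformation:

      .. sourcecode:: python

         import threading
         from lxml import etree

         STYLESHEET = (
             b'<xsl:stylesheet version="1.0"'
             b' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
             b'<xsl:template match="/root">'
             b'<out><xsl:value-of select="@name"/></out>'
             b'</xsl:template>'
             b'</xsl:stylesheet>'
         )

         _local = threading.local()

         def get_transform():
             """Return an XSLT object that belongs to the current thread."""
             transform = getattr(_local, "transform", None)
             if transform is None:
                 transform = etree.XSLT(etree.fromstring(STYLESHEET))
                 _local.transform = transform
             return transform

         def render(xml_bytes):
             # Input document and stylesheet live in the same thread.
             doc = etree.fromstring(xml_bytes)
             return get_transform()(doc).getroot().text

         outputs = []

         def worker():
             outputs.append(render(b'<root name="hello"/>'))

         workers = [threading.Thread(target=worker) for _ in range(3)]
         for thread in workers:
             thread.start()
         for thread in workers:
             thread.join()
         print(outputs)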
 715
 716
 717 My program crashes when run with mod_python/Pyro/Zope/Plone/...
 718 ---------------------------------------------------------------
 719
 720 These environments can use threads in a way that may not make it obvious when
 721 threads are created and what happens in which thread.  This makes it hard to
 722 ensure lxml's threading support is used in a reliable way.  Sadly, if problems
 723 arise, they are as diverse as the applications, so it is difficult to provide
 724 any generally applicable solution.  Also, these environments are so complex
 725 that problems become hard to debug and even harder to reproduce in a
 726 predictable way.  If you encounter crashes in one of these systems, but your
 727 code runs perfectly when started by hand, the following gives you a few hints
 728 for possible approaches to solve your specific problem:
 729
 730 * make sure you use recent versions of libxml2, libxslt and lxml.  The
 731   libxml2 developers keep fixing bugs in each release, and lxml also
 732   tries to become more robust against possible pitfalls.  So newer
 733   versions might already fix your problem in a reliable way.  Version
 734   2.2 of lxml contains many improvements.
 735
 736 * make sure the library versions you installed are really used.  Do
 737   not rely on what your operating system tells you!  Print the version
 738   constants in ``lxml.etree`` from within your runtime environment to
 739   make sure it is the case.  This is especially a problem under
 740   MacOS-X when newer library versions were installed in addition to
 741   the outdated system libraries.  Please read the bugs section
 742   regarding MacOS-X in this FAQ.
 743
 744 * if you use ``mod_python``, try setting this option:
 745
 746       PythonInterpreter main_interpreter
 747
 748   There was a discussion on the mailing list about this problem:
 749
 750       http://comments.gmane.org/gmane.comp.python.lxml.devel/2942
 751
 752 * compile lxml without threading support by running ``setup.py`` with the
 753   ``--without-threading`` option.  While this might be slower in certain
 754   scenarios on multi-processor systems, it *might* also keep your application
  755   from crashing, which should be worth more to you than peak performance.
 756   Remember that lxml is fast anyway, so concurrency may not even be worth it.
 757
 758 * look out for fancy XSLT stuff like foreign document access or
  759   passing in subtrees through XSLT variables.  This might or might not
 760   work, depending on your specific usage.  Again, later versions of
 761   lxml and libxslt provide safer support here.
 762
 763 * try copying trees at suspicious places in your code and working with
 764   those instead of a tree shared between threads.  Note that the
 765   copying must happen inside the target thread to be effective, not in
 766   the thread that created the tree.  Serialising in one thread and
 767   parsing in another is also a simple (and fast) way of separating
 768   thread contexts.
 769
 770 * try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
 771   instead of sharing one.  Also see the question above.
 772
 773 * you can try to serialise suspicious parts of your code with explicit thread
 774   locks, thus disabling the concurrency of the runtime system.
 775
 776 * report back on the mailing list to see if there are other ways to work
 777   around your specific problems.  Do not forget to report the version numbers
 778   of lxml, libxml2 and libxslt you are using (see the question on reporting
 779   a bug).
 780
 781 Note that most of these options will degrade performance and/or your
 782 code quality.  If you are unsure what to do, please ask on the mailing
 783 list.
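
      The "serialise in one thread, parse in another" hint from the list
      above could be sketched as follows.  The ``producer()`` and
      ``consumer()`` functions and the use of ``queue.Queue`` are choices made
      for this example, not something lxml prescribes.  Only a byte string
      crosses the thread boundary, so no tree is ever shared:

      .. sourcecode:: python

         import queue
         import threading
         from lxml import etree

         def producer(outbox):
             root = etree.Element("root")
             etree.SubElement(root, "child").text = "data"
             # Hand over serialised bytes instead of the tree itself.
             outbox.put(etree.tostring(root))

         def consumer(inbox, results):
             # Re-parse in the receiving thread to get a private tree.
             root = etree.fromstring(inbox.get())
             results.append(root[0].text)

         channel = queue.Queue()
         results = []
         threads = [
             threading.Thread(target=producer, args=(channel,)),
             threading.Thread(target=consumer, args=(channel, results)),
         ]
         for thread in threads:
             thread.start()
         for thread in threads:
             thread.join()
         print(results)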
 784
 785
 786 Parsing and Serialisation
 787 =========================
 788
 789 ..
 790     making doctest happy:
 791
 792     >>> try: from StringIO import StringIO
 793     ... except ImportError: from io import StringIO # Py3
 794     >>> filename = StringIO("<root/>")
 795
 796
 797 Why doesn't the ``pretty_print`` option reformat my XML output?
 798 ---------------------------------------------------------------
 799
 800 Pretty printing (or formatting) an XML document means adding white space to
 801 the content.  These modifications are harmless if they only impact elements in
 802 the document that do not carry (text) data.  They corrupt your data if they
 803 impact elements that contain data.  If lxml cannot distinguish between
 804 whitespace and data, it will not alter your data.  Whitespace is therefore
 805 only added between nodes that do not contain data.  This is always the case
 806 for trees constructed element-by-element, so no problems should be expected
  807 here.  For parsed trees, a good way to ensure that no conflicting whitespace
 808 is left in the tree is the ``remove_blank_text`` option:
 809
 810 .. sourcecode:: pycon
 811
 812    >>> parser = etree.XMLParser(remove_blank_text=True)
 813    >>> tree = etree.parse(filename, parser)
 814
 815 This will allow the parser to drop blank text nodes when constructing the
 816 tree.  If you now call a serialization function to pretty print this tree,
 817 lxml can add fresh whitespace to the XML tree to indent it.
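
      Put together, a complete round trip might look like this sketch.  The
      input document is invented for the example.  Without the
      ``remove_blank_text`` option, the newlines between its tags would be
      kept as text and would prevent re-indentation:

      .. sourcecode:: python

         from io import BytesIO
         from lxml import etree

         source = BytesIO(b"<root>\n<a><b>text</b>\n</a>\n</root>")

         # Drop whitespace-only text nodes while building the tree.
         parser = etree.XMLParser(remove_blank_text=True)
         tree = etree.parse(source, parser)

         # The serialiser is now free to add its own indentation.
         pretty = etree.tostring(tree, pretty_print=True).decode("ascii")
         print(pretty)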
 818
 819 Fredrik Lundh also has a Python-level function for indenting XML by
 820 appending whitespace to tags.  It can be found on his `element
 821 library`_ recipe page.
 822
 823
 824 Why can't lxml parse my XML from unicode strings?
 825 -------------------------------------------------
 826
 827 lxml can read Python unicode strings and even tries to support them if libxml2
 828 does not.  However, if the unicode string declares an XML encoding internally
 829 (``<?xml encoding="..."?>``), parsing is bound to fail, as this encoding is
 830 most likely not the real encoding used in Python unicode.  The same is true
 831 for HTML unicode strings that contain charset meta tags, although the problems
 832 may be more subtle here.  The libxml2 HTML parser may not be able to parse the
 833 meta tags in broken HTML and may end up ignoring them, so even if parsing
 834 succeeds, later handling may still fail with character encoding errors.
 835
 836 Note that Python uses different encodings for unicode on different platforms,
 837 so even specifying the real internal unicode encoding is not portable between
 838 Python interpreters.  Don't do it.
 839
 840 Python unicode strings with XML data or HTML data that carry encoding
 841 information are broken.  lxml will not parse them.  You must provide parsable
 842 data in a valid encoding.
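
      The usual way out is to hand the parser a byte string in the declared
      encoding, as in the sketch below.  The sample document is made up.  The
      exception type caught here is what current lxml releases are believed to
      raise for such input, so treat it as an assumption and check it against
      your version:

      .. sourcecode:: python

         from lxml import etree

         text = (u'<?xml version="1.0" encoding="ISO-8859-1"?>'
                 u'<root>caf\xe9</root>')

         try:
             etree.fromstring(text)
             rejected = False
         except (ValueError, etree.XMLSyntaxError):
             # Unicode input that declares an encoding is refused.
             rejected = True

         # Encode to the declared encoding and parse the bytes instead.
         root = etree.fromstring(text.encode("iso-8859-1"))
         print(rejected, repr(root.text))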
 843
 844
 845 What is the difference between str(xslt(doc)) and xslt(doc).write() ?
 846 ---------------------------------------------------------------------
 847
 848 The str() implementation of the XSLTResultTree class (a subclass of the
 849 ElementTree class) knows about the output method chosen in the stylesheet
 850 (xsl:output), write() doesn't.  If you call write(), the result will be a
 851 normal XML tree serialization in the requested encoding.  Calling this method
 852 may also fail for XSLT results that are not XML trees (e.g. string results).
 853
 854 If you call str(), it will return the serialized result as specified by the
 855 XSL transform.  This correctly serializes string results to encoded Python
 856 strings and honours ``xsl:output`` options like ``indent``.  This almost
 857 certainly does what you want, so you should only use ``write()`` if you are
 858 sure that the XSLT result is an XML tree and you want to override the encoding
 859 and indentation options requested by the stylesheet.
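
      A small sketch of the ``str()`` case, using a made-up stylesheet that
      selects the ``text`` output method.  The result is not an XML tree here,
      which is exactly the situation where ``write()`` is the wrong tool:

      .. sourcecode:: python

         from lxml import etree

         stylesheet = etree.fromstring(
             b'<xsl:stylesheet version="1.0"'
             b' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
             b'<xsl:output method="text"/>'
             b'<xsl:template match="/root">'
             b'Hello, <xsl:value-of select="@name"/>!'
             b'</xsl:template>'
             b'</xsl:stylesheet>'
         )
         transform = etree.XSLT(stylesheet)
         result = transform(etree.fromstring(b'<root name="world"/>'))

         # str() serialises the way xsl:output asks for: plain text.
         text = str(result)
         print(text)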
 860
 861
 862 Why can't I just delete parents or clear the root node in iterparse()?
 863 ----------------------------------------------------------------------
 864
 865 The ``iterparse()`` implementation is based on the libxml2 parser.  It
 866 requires the tree to be intact to finish parsing.  If you delete or modify
 867 parents of the current node, chances are you modify the structure in a way
 868 that breaks the parser.  Normally, this will result in a segfault.  Please
 869 refer to the `iterparse section`_ of the lxml API documentation to find out
 870 what you can do and what you can't do.
 871
 872 .. _`iterparse section`: parsing.html#iterparse-and-iterwalk
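
      A conservative pattern is to clear only the element that has just been
      completed and to leave its parents and the root alone.  The sketch below
      assumes a flat list of ``item`` elements and sums their content.  The
      document and the variable names are invented for this example:

      .. sourcecode:: python

         from io import BytesIO
         from lxml import etree

         source = BytesIO(
             b"<root><item>1</item><item>2</item><item>3</item></root>")

         total = 0
         for event, elem in etree.iterparse(source, events=("end",)):
             if elem.tag != "item":
                 continue        # leave the root and other parents untouched
             total += int(elem.text)
             elem.clear()        # free the content of the finished element
         print(total)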
 873
 874
 875 How do I output null characters in XML text?
 876 --------------------------------------------
 877
 878 Don't.  What you would produce is not well-formed XML.  XML parsers
 879 will refuse to parse a document that contains null characters.  The
 880 right way to embed binary data in XML is using a text encoding such as
 881 uuencode or base64.
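
      For example, base64 from the standard library can be used like this.
      The element name ``blob`` and the payload are arbitrary choices for the
      sketch:

      .. sourcecode:: python

         import base64
         from lxml import etree

         payload = b"\x00\x01binary\x00data"

         # Encode the bytes as ASCII text before storing them in the tree.
         elem = etree.Element("blob")
         elem.text = base64.b64encode(payload).decode("ascii")
         xml = etree.tostring(elem)

         # Reading it back reverses both steps.
         restored = base64.b64decode(etree.fromstring(xml).text)
         print(restored == payload)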
 882
 883
 884 XPath and Document Traversal
 885 ============================
 886
 887 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
 888 --------------------------------------------------------------------
 889
 890 ``findall()`` is part of the original `ElementTree API`_.  It supports a
 891 `simple subset of the XPath language`_, without predicates, conditions and
 892 other advanced features.  It is very handy for finding specific tags in a
 893 tree.  Another important difference is namespace handling, which uses the
 894 ``{namespace}tagname`` notation.  This is not supported by XPath.  The
 895 findall, find and findtext methods are compatible with other ElementTree
 896 implementations and allow writing portable code that runs on ElementTree,
 897 cElementTree and lxml.etree.
 898
 899 ``xpath()``, on the other hand, supports the complete power of the XPath
 900 language, including predicates, XPath functions and Python extension
 901 functions.  The syntax is defined by the `XPath specification`_.  If you need
 902 the expressiveness and selectivity of XPath, the ``xpath()`` method, the
 903 ``XPath`` class and the ``XPathEvaluator`` are the best choice_.
 904
 905 .. _`simple subset of the XPath language`: http://effbot.org/zone/element-xpath.htm
 906 .. _`XPath specification`:                 http://www.w3.org/TR/xpath
 907 .. _choice:                                performance.html#xpath
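
      The difference in namespace handling can be seen in a short sketch.
      The namespace URI, the prefix ``x`` and the sample document are
      assumptions made for this example:

      .. sourcecode:: python

         from lxml import etree

         NS = "http://example.com/ns"
         root = etree.fromstring(
             b'<root xmlns:x="http://example.com/ns">'
             b'<x:item id="1">a</x:item><x:item id="2">b</x:item>'
             b'</root>'
         )

         # ElementTree style: the namespace URI is part of the tag name.
         found = root.findall("{%s}item" % NS)

         # XPath style: a prefix is mapped to the URI, predicates are allowed.
         selected = root.xpath("x:item[@id='2']/text()", namespaces={"x": NS})
         print(len(found), selected)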
 908
 909
 910 Why doesn't ``findall()`` support full XPath expressions?
 911 ---------------------------------------------------------
 912
 913 It was decided that it is more important to keep compatibility with
 914 ElementTree_ to simplify code migration between the libraries.  The main
 915 difference compared to XPath is the ``{namespace}tagname`` notation used in
 916 ``findall()``, which is not valid XPath.
 917
  918 ElementTree and lxml.etree use the same implementation, which ensures 100%
 919 compatibility.  Note that ``findall()`` is `so fast`_ in lxml that a native
 920 implementation would not bring any performance benefits.
 921
 922 .. _`so fast`: performance.html#tree-traversal
 923
 924
 925 How can I find out which namespace prefixes are used in a document?
 926 -------------------------------------------------------------------
 927
 928 You can traverse the document (``root.iter()``) and collect the prefix
 929 attributes from all Elements into a set.  However, it is unlikely that you
 930 really want to do that.  You do not need these prefixes, honestly.  You only
 931 need the namespace URIs.  All namespace comparisons use these, so feel free to
 932 make up your own prefixes when you use XPath expressions or extension
 933 functions.
 934
 935 The only place where you might consider specifying prefixes is the
 936 serialization of Elements that were created through the API.  Here, you can
 937 specify a prefix mapping through the ``nsmap`` argument when creating the root
 938 Element.  Its children will then inherit this prefix for serialization.
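
      Both points can be sketched in a few lines.  The namespace URI and the
      prefix ``ex`` are invented for the example:

      .. sourcecode:: python

         from lxml import etree

         NS = "http://example.com/ns"

         # The prefix mapping is given once, when creating the root Element.
         root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})
         etree.SubElement(root, "{%s}child" % NS)
         xml = etree.tostring(root)

         # Collect the prefixes in use, should you ever really need them.
         prefixes = {el.prefix for el in root.iter() if el.prefix is not None}
         print(xml, prefixes)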
 939
 940
 941 How can I specify a default namespace for XPath expressions?
 942 ------------------------------------------------------------
 943
 944 You can't.  In XPath, there is no such thing as a default namespace.  Just use
 945 an arbitrary prefix and let the namespace dictionary of the XPath evaluators
 946 map it to your namespace.  See also the question above.
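
      A short sketch of this, assuming a document that uses a default
      namespace.  The prefix ``d`` and the URI are arbitrary and exist only in
      the XPath namespace dictionary, not in the document:

      .. sourcecode:: python

         from lxml import etree

         root = etree.fromstring(
             b'<root xmlns="http://example.com/default">'
             b'<item>a</item>'
             b'</root>'
         )
         NSMAP = {"d": "http://example.com/default"}

         # Unprefixed names in XPath mean "no namespace" and match nothing.
         unprefixed = root.xpath("/root/item")

         # The made-up prefix is resolved through the namespace dictionary.
         items = root.xpath("/d:root/d:item", namespaces=NSMAP)
         print(len(unprefixed), [item.text for item in items])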