From: Hyunjee Kim Date: Thu, 31 Jan 2019 01:52:21 +0000 (+0900) Subject: Imported Upstream version 3.5.0 X-Git-Tag: upstream/4.3.0~20 X-Git-Url: http://review.tizen.org/git/?a=commitdiff_plain;h=374a69acca916dc70251ba5976b2826f8886e843;p=platform%2Fupstream%2Fpython-lxml.git Imported Upstream version 3.5.0 Change-Id: Ia246f09dba2e968b4e917b734a12085f37599e04 Signed-off-by: Hyunjee Kim --- diff --git a/CHANGES.txt b/CHANGES.txt index a5d013c6..75410300 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -2,6 +2,99 @@ lxml changelog ============== +3.5.0 (2015-11-13) +================== + +Bugs fixed +---------- + +* Unicode string results failed XPath queries in PyPy. + +* LP#1497051: HTML target parser failed to terminate on exceptions + and continued parsing instead. + +* Deprecated API usage in doctestcompare. + + +3.5.0b1 (2015-09-18) +==================== + +Features added +-------------- + +* ``cleanup_namespaces()`` accepts a new argument ``keep_ns_prefixes`` + that does not remove definitions of the provided prefix-namespace + mapping from the tree. + +* ``cleanup_namespaces()`` accepts a new argument ``top_nsmap`` that + moves definitions of the provided prefix-namespace mapping to the + top of the tree. + +* LP#1490451: ``Element`` objects gained a ``cssselect()`` method as + known from ``lxml.html``. Patch by Simon Sapin. + +* API functions and methods behave and look more like Python functions, + which allows introspection on them etc. One side effect to be aware of + is that the functions now bind as methods when assigned to a class + variable. A quick fix is to wrap them in ``staticmethod()`` (as for + normal Python functions). + +* ISO-Schematron support gained an option ``error_finder`` that allows + passing a filter function for picking validation errors from reports. + +* LP#1243600: Elements in ``lxml.html`` gained a ``classes`` property + that provides a set-like interface to the ``class`` attribute. + Original patch by masklinn. 
+ +* LP#1341964: The soupparser now handles DOCTYPE declarations, comments + and processing instructions outside of the root element. + Patch by Olli Pottonen. + +* LP#1421512: The ``docinfo`` of a tree was made editable to allow + setting and removing the public ID and system ID of the DOCTYPE. + Patch by Olli Pottonen. + +* LP#1442427: More work-arounds for quirks and bugs in pypy and pypy3. + +* ``lxml.html.soupparser`` now uses BeautifulSoup version 4 instead + of version 3 if available. + +Bugs fixed +---------- + +* Memory errors that occur during tree adaptations (e.g. moving subtrees + to foreign documents) could leave the tree in a crash prone state. + +* Calling ``process_children()`` in an XSLT extension element without + an ``output_parent`` argument failed with a ``TypeError``. + Fix by Jens Tröger. + +* GH#162: Image data in HTML ``data`` URLs is considered safe and + no longer removed by ``lxml.html.clean`` JavaScript cleaner. + +* GH#166: Static build could link libraries in wrong order. + +* GH#172: Rely a bit more on libxml2 for encoding detection rather than + rolling our own in some cases. Patch by Olli Pottonen. + +* GH#159: Validity checks for names and string content were tightened + to detect the use of illegal characters early. Patch by Olli Pottonen. + +* LP#1421921: Comments/PIs before the DOCTYPE declaration were not + serialised. Patch by Olli Pottonen. + +* LP#659367: Some HTML DOCTYPE declarations were not serialised. + Patch by Olli Pottonen. + +* LP#1238503: lxml.doctestcompare is now consistent with stdlib's doctest + in how it uses ``+`` and ``-`` to refer to unexpected and missing output. + +* Empty prefixes are explicitly rejected when a namespace mapping is used + with ElementPath to avoid hiding bugs in user code. + +* Several problems with PyPy were fixed by switching to Cython 0.23. + + 3.4.4 (2015-04-25) ================== @@ -1061,7 +1154,7 @@ Bugs fixed file. 
* Work-around for libxml2 bug that can leave the HTML parser in a - non-functional state after parsing a severly broken document (fixed + non-functional state after parsing a severely broken document (fixed in libxml2 2.7.8). * ``marque`` tag in HTML cleanup code is correctly named ``marquee``. @@ -1366,7 +1459,7 @@ Bugs fixed * The ``ElementMaker`` in lxml.objectify no longer defines the default namespaces when annotation is disabled. -* Feed parser failed to honout the 'recover' option on parse errors. +* Feed parser failed to honour the 'recover' option on parse errors. * Diverting the error logging to Python's logging system was broken. @@ -2474,7 +2567,7 @@ Other changes (which can crash on certain XPath errors) * Type annotation in objectify now preserves the already annotated type by - default to prevent loosing type information that is already there. + default to prevent losing type information that is already there. * ``element.getiterator()`` returns a list, use ``element.iter()`` to retrieve an iterator (ElementTree 1.3 compatible behaviour) @@ -2677,7 +2770,7 @@ Features added Bugs fixed ---------- -* Removing Elements from a tree could make them loose their namespace +* Removing Elements from a tree could make them lose their namespace declarations * ``ElementInclude`` didn't honour base URL of original document @@ -2772,7 +2865,7 @@ Other changes * code cleanup: redundant _NodeBase super class merged into _Element class Note: although the impact should be zero in most cases, this change breaks - the compatibiliy of the public C-API + the compatibility of the public C-API 1.1.2 (2006-10-30) @@ -2852,7 +2945,7 @@ Bugs fixed Features added -------------- -* Comments and processing instructions return '' and +* Comments and processing instructions return '' and '' for repr() * Parsers are now the preferred (and default) place where element class lookup @@ -3032,7 +3125,7 @@ Bugs fixed * Extension function calls in XSLT variable declarations could break 
the stylesheet and crash on repeated calls -* Deep copying Elements could loose namespaces declared in parents +* Deep copying Elements could lose namespaces declared in parents * Deep copying Elements did not copy tail diff --git a/INSTALL.txt b/INSTALL.txt index ca616c9b..84059d9e 100644 --- a/INSTALL.txt +++ b/INSTALL.txt @@ -49,7 +49,7 @@ be installed, in particular: * `libxml2 `_ version 2.7.0 or later. - * We recommend libxml2 2.9.0 or a later version. + * We recommend libxml2 2.9.2 or a later version. * If you want to use the feed parser interface, especially when parsing from unicode strings, do not use libxml2 2.7.4 through @@ -57,7 +57,7 @@ be installed, in particular: * `libxslt `_ version 1.1.23 or later. - * We recommend libxslt 1.1.26 or later. Version 1.1.25 will not + * We recommend libxslt 1.1.28 or later. Version 1.1.25 will not work due to a missing library symbol. Newer versions generally contain fewer bugs and are therefore diff --git a/MANIFEST.in b/MANIFEST.in index 3be87311..82a16c90 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,6 +1,6 @@ exclude *.py include setup.py ez_setup.py setupinfo.py versioninfo.py buildlibxml.py -include test.py selftest.py selftest2.py +include test.py include update-error-constants.py include MANIFEST.in Makefile version.txt requirements.txt include CHANGES.txt CREDITS.txt INSTALL.txt LICENSES.txt README.rst TODO.txt diff --git a/Makefile b/Makefile index e51155c1..59fc76de 100644 --- a/Makefile +++ b/Makefile @@ -5,14 +5,16 @@ TESTOPTS= SETUPFLAGS= LXMLVERSION=`cat version.txt` -PYTHON_WITH_CYTHON=$(shell $(PYTHON) -c 'import Cython.Compiler' >/dev/null 2>/dev/null && echo " --with-cython" || true) -PY3_WITH_CYTHON=$(shell $(PYTHON3) -c 'import Cython.Compiler' >/dev/null 2>/dev/null && echo " --with-cython" || true) +PYTHON_WITH_CYTHON=$(shell $(PYTHON) -c 'import Cython.Build.Dependencies' >/dev/null 2>/dev/null && echo " --with-cython" || true) +PY3_WITH_CYTHON=$(shell $(PYTHON3) -c 'import 
Cython.Build.Dependencies' >/dev/null 2>/dev/null && echo " --with-cython" || true) +CYTHON_WITH_COVERAGE=$(shell $(PYTHON) -c 'import Cython.Coverage; import sys; assert not hasattr(sys, "pypy_version_info")' >/dev/null 2>/dev/null && echo " --coverage" || true) +CYTHON3_WITH_COVERAGE=$(shell $(PYTHON3) -c 'import Cython.Coverage; import sys; assert not hasattr(sys, "pypy_version_info")' >/dev/null 2>/dev/null && echo " --coverage" || true) all: inplace # Build in-place inplace: - $(PYTHON) setup.py $(SETUPFLAGS) build_ext -i $(PYTHON_WITH_CYTHON) --warnings + $(PYTHON) setup.py $(SETUPFLAGS) build_ext -i $(PYTHON_WITH_CYTHON) --warnings --with-coverage sdist: $(PYTHON) setup.py $(SETUPFLAGS) sdist $(PYTHON_WITH_CYTHON) @@ -30,15 +32,11 @@ test_build: build $(PYTHON) test.py $(TESTFLAGS) $(TESTOPTS) test_inplace: inplace - $(PYTHON) test.py $(TESTFLAGS) $(TESTOPTS) - PYTHONPATH=src:$(PYTHONPATH) $(PYTHON) selftest.py - PYTHONPATH=src:$(PYTHONPATH) $(PYTHON) selftest2.py + $(PYTHON) test.py $(TESTFLAGS) $(TESTOPTS) $(CYTHON_WITH_COVERAGE) test_inplace3: inplace $(PYTHON3) setup.py $(SETUPFLAGS) build_ext -i $(PY3_WITH_CYTHON) - $(PYTHON3) test.py $(TESTFLAGS) $(TESTOPTS) - PYTHONPATH=src:$(PYTHONPATH) $(PYTHON3) selftest.py - PYTHONPATH=src:$(PYTHONPATH) $(PYTHON3) selftest2.py + $(PYTHON3) test.py $(TESTFLAGS) $(TESTOPTS) $(CYTHON3_WITH_COVERAGE) valgrind_test_inplace: inplace valgrind --tool=memcheck --leak-check=full --num-callers=30 --suppressions=valgrind-python.supp \ diff --git a/PKG-INFO b/PKG-INFO index 499c4a83..2b8d3866 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,12 +1,11 @@ Metadata-Version: 1.1 Name: lxml -Version: 3.4.4 +Version: 3.5.0 Summary: Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API. 
Home-page: http://lxml.de/ Author: lxml dev team Author-email: lxml-dev@lxml.de License: UNKNOWN -Download-URL: http://pypi.python.org/packages/source/l/lxml/lxml-3.4.4.tar.gz Description: lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. @@ -30,21 +29,25 @@ Description: lxml is a Pythonic, mature binding for the libxml2 and libxslt libr After an official release of a new stable series, bug fixes may become available at - https://github.com/lxml/lxml/tree/lxml-3.4 . - Running ``easy_install lxml==3.4bugfix`` will install + https://github.com/lxml/lxml/tree/lxml-3.5 . + Running ``easy_install lxml==3.5bugfix`` will install the unreleased branch state from - https://github.com/lxml/lxml/tarball/lxml-3.4#egg=lxml-3.4bugfix + https://github.com/lxml/lxml/tarball/lxml-3.5#egg=lxml-3.5bugfix as soon as a maintenance branch has been established. Note that this requires Cython to be installed at an appropriate version for the build. - 3.4.4 (2015-04-25) + 3.5.0 (2015-11-13) ================== Bugs fixed ---------- - * An ElementTree compatibility test added in lxml 3.4.3 that failed in - Python 3.4+ was removed again. + * Unicode string results failed XPath queries in PyPy. + + * LP#1497051: HTML target parser failed to terminate on exceptions + and continued parsing instead. + + * Deprecated API usage in doctestcompare. 
@@ -61,6 +64,7 @@ Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.2 Classifier: Programming Language :: Python :: 3.3 Classifier: Programming Language :: Python :: 3.4 +Classifier: Programming Language :: Python :: 3.5 Classifier: Programming Language :: C Classifier: Operating System :: OS Independent Classifier: Topic :: Text Processing :: Markup :: HTML diff --git a/buildlibxml.py b/buildlibxml.py index 8b883652..0faf05d3 100644 --- a/buildlibxml.py +++ b/buildlibxml.py @@ -182,54 +182,6 @@ def download_library(dest_dir, location, name, version_re, filename, urlretrieve(full_url, dest_filename) return dest_filename -## Backported method of tarfile.TarFile.extractall (doesn't exist in 2.4): -def _extractall(self, path=".", members=None): - """Extract all members from the archive to the current working - directory and set owner, modification time and permissions on - directories afterwards. `path' specifies a different directory - to extract to. `members' is optional and must be a subset of the - list returned by getmembers(). - """ - import copy - is_ignored_file = re.compile( - r'''[\\/]((test|results?)[\\/] - |doc[\\/].*(Log|[.](out|imp|err|ent|gif|tif|pdf))$ - |tests[\\/](.*[\\/])?(?!Makefile)[^\\/]*$ - |python[\\/].*[.]py$ - ) - ''', re.X).search - - directories = [] - - if members is None: - members = self - - for tarinfo in members: - if is_ignored_file(tarinfo.name): - continue - if tarinfo.isdir(): - # Extract directories with a safe mode. - directories.append((tarinfo.name, tarinfo)) - tarinfo = copy.copy(tarinfo) - tarinfo.mode = 448 # 0700 - self.extract(tarinfo, path) - - # Reverse sort directories. - directories.sort() - directories.reverse() - - # Set correct owner, mtime and filemode on directories. 
- for name, tarinfo in directories: - dirpath = os.path.join(path, name) - try: - self.chown(tarinfo, dirpath) - self.utime(tarinfo, dirpath) - self.chmod(tarinfo, dirpath) - except tarfile.ExtractError: - if self.errorlevel > 1: - raise - else: - self._dbg(1, "tarfile: %s" % sys.exc_info()[1]) def unpack_tarball(tar_filename, dest): print('Unpacking %s into %s' % (os.path.basename(tar_filename), dest)) @@ -242,10 +194,11 @@ def unpack_tarball(tar_filename, dest): else: if base_dir != base_name: print('Unexpected path in %s: %s' % (tar_filename, base_name)) - _extractall(tar, dest) + tar.extractall(dest) tar.close() return os.path.join(dest, base_dir) + def call_subprocess(cmd, **kw): try: from subprocess import proc_call @@ -381,9 +334,10 @@ def build_libxml2xslt(download_dir, build_dir, os.path.join(prefix, 'include', 'libexslt')]) static_library_dirs.append(lib_dir) - for filename in os.listdir(lib_dir): - if [l for l in ['iconv', 'libxml2', 'libxslt', 'libexslt'] if l in filename]: - if [ext for ext in ['.a'] if filename.endswith(ext)]: - static_binaries.append(os.path.join(lib_dir,filename)) + listdir = os.listdir(lib_dir) + static_binaries += [os.path.join(lib_dir, filename) + for lib in ['libxml2', 'libexslt', 'libxslt', 'iconv'] + for filename in listdir + if lib in filename and filename.endswith('.a')] return (xml2_config, xslt_config) diff --git a/doc/FAQ.txt b/doc/FAQ.txt index 2b4c9ef3..0dc620fa 100644 --- a/doc/FAQ.txt +++ b/doc/FAQ.txt @@ -201,12 +201,16 @@ not take advantage of lxml's enhanced feature set. 
a secure HTTP proxy * `lwebstring `_, an XML template engine +* `openpyxl `_, + a library to read/write MS Excel 2007 files * `OpenXMLlib `_, a library for handling OpenXML document meta data * `PsychoPy `_, psychology software in Python * `Pycoon `_, a WSGI web development framework based on XML pipelines +* `pycsw `_, + an `OGC CSW `_ server implementation written in Python * `PyQuery `_, a query framework for XML/HTML, similar to jQuery for JavaScript * `python-docx `_, @@ -499,7 +503,7 @@ Besides enhancing the code, there are a lot of places where you can help the project and its user base. You can * spread the word and write about lxml. Many users (especially new Python - users) have not yet heared about lxml, although our user base is constantly + users) have not yet heard about lxml, although our user base is constantly growing. If you write your own blog and feel like saying something about lxml, go ahead and do so. If we think your contribution or criticism is valuable to other users, we may even put a link or a quote on the project @@ -524,7 +528,7 @@ project and its user base. You can * help with the tutorial. A tutorial is the most important stating point for new users, so it is important for us to provide an easy to understand guide - into lxml. As allo documentation, the tutorial is work in progress, so we + into lxml. As all documentation, the tutorial is work in progress, so we appreciate every helping hand. * improve the docstrings. lxml uses docstrings to support Python's integrated @@ -862,7 +866,7 @@ lxml can add fresh whitespace to the XML tree to indent it. Note that the ``remove_blank_text`` option also uses a heuristic if it has no definite knowledge about the document's ignorable whitespace. It will keep blank text nodes that appear after non-blank text nodes -at the same level. This is to prevent document-style XML from loosing +at the same level. This is to prevent document-style XML from losing content. 
The HTMLParser has this structural knowledge built-in, which means that diff --git a/doc/api.txt b/doc/api.txt index efa1888c..1238cea5 100644 --- a/doc/api.txt +++ b/doc/api.txt @@ -295,7 +295,7 @@ transformer object. See their documentation for details. However, lxml also keeps a global error log of all errors that occurred at the application level. Whenever an exception is raised, you can retrieve the -errors that occured and "might have" lead to the problem from the error log +errors that occurred and "might have" lead to the problem from the error log copy attached to the exception: .. sourcecode:: pycon @@ -455,7 +455,7 @@ of the document with a user provided DOCTYPE: -The content will be encoded, but otherwise copied verbatimly into the +The content will be encoded, but otherwise copied verbatim into the output stream. It is therefore left to the user to take care for a correct doctype format, including the name of the root node. diff --git a/doc/element_classes.txt b/doc/element_classes.txt index a26bec05..e3476633 100644 --- a/doc/element_classes.txt +++ b/doc/element_classes.txt @@ -120,7 +120,7 @@ The semantics of ``_init()`` are as follows: value or by removing or adding a specific child node and then verifying this before running through the init process. -* Any exceptions raised in ``_init()`` will be propagated throught the API +* Any exceptions raised in ``_init()`` will be propagated through the API call that lead to the creation of the Element. So be careful with the code you write here as its exceptions may turn up in various unexpected places. diff --git a/doc/elementsoup.txt b/doc/elementsoup.txt index 417ab849..9317f654 100644 --- a/doc/elementsoup.txt +++ b/doc/elementsoup.txt @@ -2,24 +2,32 @@ BeautifulSoup Parser ==================== -BeautifulSoup_ is a Python package that parses broken HTML, just like -lxml supports it based on the parser of libxml2. BeautifulSoup uses a -different parsing approach. 
It is not a real HTML parser but uses -regular expressions to dive through tag soup. It is therefore more -forgiving in some cases and less good in others. It is not uncommon -that lxml/libxml2 parses and fixes broken HTML better, but -BeautifulSoup has superiour `support for encoding detection`_. It -very much depends on the input which parser works better. +BeautifulSoup_ is a Python package for working with real-world and broken HTML, +just like `lxml.html `_. As of version 4.x, it can use +`different HTML parsers +`_, +each of which has its advantages and disadvantages (see the link). + +lxml can make use of BeautifulSoup as a parser backend, just like BeautifulSoup +can employ lxml as a parser. When using BeautifulSoup from lxml, however, the +default is to use Python's integrated HTML parser in the +`html.parser `_ module. +In order to make use of the HTML5 parser of +`html5lib `_ instead, it is better +to go directly through the `html5parser module `_ in +``lxml.html``. + +A very nice feature of BeautifulSoup is its excellent `support for encoding +detection`_ which can provide better results for real-world HTML pages that +do not (correctly) declare their encoding. .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ -.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode%2C%20Dammit +.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit .. _ElementSoup: http://effbot.org/zone/element-soup.htm -To prevent users from having to choose their parser library in -advance, lxml can interface to the parsing capabilities of -BeautifulSoup through the ``lxml.html.soupparser`` module. It -provides three main functions: ``fromstring()`` and ``parse()`` to -parse a string or file using BeautifulSoup into an ``lxml.html`` +lxml interfaces with BeautifulSoup through the ``lxml.html.soupparser`` +module. 
It provides three main functions: ``fromstring()`` and ``parse()`` +to parse a string or file using BeautifulSoup into an ``lxml.html`` document, and ``convert_tree()`` to convert an existing BeautifulSoup tree into a list of top-level Elements. @@ -35,11 +43,11 @@ Parsing with the soupparser =========================== The functions ``fromstring()`` and ``parse()`` behave as known from -ElementTree. The first returns a root Element, the latter returns an +lxml. The first returns a root Element, the latter returns an ElementTree. There is also a legacy module called ``lxml.html.ElementSoup``, which -mimics the interface provided by ElementTree's own ElementSoup_ +mimics the interface provided by Fredrik Lundh's ElementSoup_ module. Note that the ``soupparser`` module was added in lxml 2.0.3. Previous versions of lxml 2.0.x only have the ``ElementSoup`` module. @@ -47,9 +55,10 @@ Here is a document full of tag soup, similar to, but not quite like, HTML: .. sourcecode:: pycon - >>> tag_soup = 'Hello</head><body onload=crash()>Hi all<p>' + >>> tag_soup = ''' + ... <meta/><head><title>Hello</head><body onload=crash()>Hi all<p>''' -all you need to do is pass it to the ``fromstring()`` function: +All you need to do is pass it to the ``fromstring()`` function: .. sourcecode:: pycon @@ -71,14 +80,15 @@ To see what we have here, you can serialise it: </html> Not quite what you'd expect from an HTML page, but, well, it was broken -already, right? BeautifulSoup did its best, and so now it's a tree. +already, right? The parser did its best, and so now it's a tree. -To control which Element implementation is used, you can pass a -``makeelement`` factory function to ``parse()`` and ``fromstring()``. -By default, this is based on the HTML parser defined in ``lxml.html``. +To control how Element objects are created during the conversion +of the tree, you can pass a ``makeelement`` factory function to +``parse()`` and ``fromstring()``. 
By default, this is based on the +HTML parser defined in ``lxml.html``. -For a quick comparison, libxml2 2.6.32 parses the same tag soup as -follows. The main difference is that libxml2 tries harder to adhere +For a quick comparison, libxml2 2.9.1 parses the same tag soup as +follows. The only difference is that libxml2 tries harder to adhere to the structure of an HTML document and moves misplaced tags where they (likely) belong. Note, however, that the result can vary between parser versions. @@ -90,10 +100,7 @@ parser versions. <meta/> <title>Hello - -

Hi all

-

- + Hi all

@@ -142,8 +149,9 @@ Using soupparser as a fallback ============================== The downside of using this parser is that it is `much slower`_ than -the HTML parser of lxml. So if performance matters, you might want to -consider using ``soupparser`` only as a fallback for certain cases. +the C implemented HTML parser of libxml2 that lxml uses. So if +performance matters, you might want to consider using ``soupparser`` +only as a fallback for certain cases. .. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ @@ -180,22 +188,35 @@ better. Using only the encoding detection ================================= -If you prefer a 'real' (and fast) HTML parser instead of the regular -expression based one in BeautifulSoup, you can still benefit from -BeautifulSoup's `support for encoding detection`_ in the -``UnicodeDammit`` class. +Even if you prefer lxml's fast HTML parser, you can still benefit +from BeautifulSoup's `support for encoding detection`_ in the +``UnicodeDammit`` class. Once it succeeds in decoding the data, +you can simply pass the resulting Unicode string into lxml's parser. .. sourcecode:: pycon - >>> from BeautifulSoup import UnicodeDammit - - >>> def decode_html(html_string): - ... converted = UnicodeDammit(html_string, isHTML=True) - ... if not converted.unicode: - ... raise UnicodeDecodeError( - ... "Failed to detect encoding, tried [%s]", - ... ', '.join(converted.triedEncodings)) - ... # print converted.originalEncoding - ... return converted.unicode + >>> try: + ... from bs4 import UnicodeDammit # BeautifulSoup 4 + ... + ... def decode_html(html_string): + ... converted = UnicodeDammit(html_string) + ... if not converted.unicode_markup: + ... raise UnicodeDecodeError( + ... "Failed to detect encoding, tried [%s]", + ... ', '.join(converted.tried_encodings)) + ... # print converted.original_encoding + ... return converted.unicode_markup + ... + ... except ImportError: + ... 
from BeautifulSoup import UnicodeDammit # BeautifulSoup 3 + ... + ... def decode_html(html_string): + ... converted = UnicodeDammit(html_string, isHTML=True) + ... if not converted.unicode: + ... raise UnicodeDecodeError( + ... "Failed to detect encoding, tried [%s]", + ... ', '.join(converted.triedEncodings)) + ... # print converted.originalEncoding + ... return converted.unicode >>> root = lxml.html.fromstring(decode_html(tag_soup)) diff --git a/doc/html/FAQ.html b/doc/html/FAQ.html index 811f64fc..f6fc3c2f 100644 --- a/doc/html/FAQ.html +++ b/doc/html/FAQ.html @@ -10,7 +10,7 @@

-

lxml FAQ - Frequently Asked Questions

+

lxml FAQ - Frequently Asked Questions

Frequently asked questions on lxml. See also the notes on compatibility to ElementTree.

@@ -169,12 +169,16 @@ a web server accelerator with on-the-fly XSLT processing a secure HTTP proxy
  • lwebstring, an XML template engine
  • +
  • openpyxl, +a library to read/write MS Excel 2007 files
  • OpenXMLlib, a library for handling OpenXML document meta data
  • PsychoPy, psychology software in Python
  • Pycoon, a WSGI web development framework based on XML pipelines
  • +
  • pycsw, +an OGC CSW server implementation written in Python
  • PyQuery, a query framework for XML/HTML, similar to jQuery for JavaScript
  • python-docx, @@ -399,7 +403,7 @@ patches are very welcome.

    project and its user base. You can

    • spread the word and write about lxml. Many users (especially new Python -users) have not yet heared about lxml, although our user base is constantly +users) have not yet heard about lxml, although our user base is constantly growing. If you write your own blog and feel like saying something about lxml, go ahead and do so. If we think your contribution or criticism is valuable to other users, we may even put a link or a quote on the project @@ -420,7 +424,7 @@ top-ranked when searching the web for "Python and XML", so maybe you have an idea how to improve that.
    • help with the tutorial. A tutorial is the most important stating point for new users, so it is important for us to provide an easy to understand guide -into lxml. As allo documentation, the tutorial is work in progress, so we +into lxml. As all documentation, the tutorial is work in progress, so we appreciate every helping hand.
    • improve the docstrings. lxml uses docstrings to support Python's integrated online help() function. However, sometimes these are not sufficient to @@ -702,7 +706,7 @@ lxml can add fresh whitespace to the XML tree to indent it.

      Note that the remove_blank_text option also uses a heuristic if it has no definite knowledge about the document's ignorable whitespace. It will keep blank text nodes that appear after non-blank text nodes -at the same level. This is to prevent document-style XML from loosing +at the same level. This is to prevent document-style XML from losing content.

      The HTMLParser has this structural knowledge built-in, which means that most whitespace that appears between tags in HTML documents will not @@ -935,7 +939,7 @@ map it to your namespace. See also the question above.

  • diff --git a/doc/html/api.html b/doc/html/api.html index 5d9675fc..69a655d8 100644 --- a/doc/html/api.html +++ b/doc/html/api.html @@ -8,7 +8,7 @@
    -

    APIs specific to lxml.etree

    +

    APIs specific to lxml.etree

    lxml.etree tries to follow established APIs wherever possible. Sometimes, however, the need to expose a feature in an easy way led to the invention of a @@ -202,7 +202,7 @@ through the local error_log property of the re transformer object. See their documentation for details.

    However, lxml also keeps a global error log of all errors that occurred at the application level. Whenever an exception is raised, you can retrieve the -errors that occured and "might have" lead to the problem from the error log +errors that occurred and "might have" lead to the problem from the error log copy attached to the exception:

    >>> etree.clear_error_log()
     >>> broken_xml = '''
    @@ -321,7 +321,7 @@ of the document with a user provided DOCTYPE:

    <!DOCTYPE root SYSTEM "/tmp/test.dtd"> <root/>
    -

    The content will be encoded, but otherwise copied verbatimly into the +

    The content will be encoded, but otherwise copied verbatim into the output stream. It is therefore left to the user to take care for a correct doctype format, including the name of the root node.

    @@ -496,7 +496,7 @@ example:

    diff --git a/doc/html/api/abc.ABCMeta-class.html b/doc/html/api/abc.ABCMeta-class.html index 029f202d..48c86283 100644 --- a/doc/html/api/abc.ABCMeta-class.html +++ b/doc/html/api/abc.ABCMeta-class.html @@ -426,7 +426,7 @@ even via super()).

    @@ -693,7 +703,7 @@ - - + + + + + + - + - + - @@ -136,152 +144,152 @@ (in HtmlMixin) - + - + - + - + - - + + - - + - - + - - + - - + + - - + + - - + - - + - - + - - + - - + - - + - - + + - - + - - + @@ -319,7 +327,7 @@
    - + + + + + - + - - - + - - + - - + - + - - - + + - + - + - + - + - - + + - - + + - - + - - + + - - + + - - + + - - + + - - + + - - + + - - + - - + + @@ -299,7 +307,7 @@
    - @@ -109,31 +109,31 @@ (in ErrorDomains) - + - + - + - @@ -141,31 +141,31 @@ (in ErrorTypes) - + - + - + - @@ -173,39 +173,39 @@ (in lxml.ElementInclude) - + - + - + - + - @@ -213,7 +213,7 @@ (in _Element) - @@ -221,69 +221,69 @@ (in lxml.html.builder) - + - + - + - + - + - + - + - + - + - + - + - + - + - @@ -374,7 +374,7 @@ (in _Attrib) - @@ -382,7 +382,7 @@ (in _IDDict) - @@ -390,7 +390,7 @@ (in lxml.html) - @@ -398,7 +398,7 @@ (in HtmlMixin) - @@ -406,17 +406,25 @@ (in lxml.etree) - + + + + + - @@ -424,7 +432,7 @@ - @@ -432,7 +440,7 @@ - @@ -440,7 +448,7 @@ - @@ -479,7 +487,7 @@
    + + + + + - + - + - + - + - + - + - + - - - - -
    @@ -199,7 +200,7 @@ - + - + - + - + + + + + + @@ -281,7 +295,7 @@
    + + + + - + + -
    @@ -138,7 +144,7 @@ - - + + + + + - + - + - + - @@ -144,119 +152,119 @@ (in ErrorTypes) - + - + - + - + - + - + - + - + - + - + - + - + - + - + - @@ -264,7 +272,7 @@ (in ErrorTypes) - @@ -272,7 +280,7 @@ (in ErrorTypes) - @@ -280,7 +288,7 @@ (in ErrorTypes) - @@ -288,7 +296,7 @@ (in ErrorTypes) - @@ -296,7 +304,7 @@ (in ErrorTypes) - @@ -304,7 +312,7 @@ (in ErrorTypes) - @@ -312,7 +320,7 @@ (in ErrorTypes) - @@ -320,7 +328,7 @@ (in ErrorTypes) - @@ -328,7 +336,7 @@ (in ErrorTypes) - @@ -336,7 +344,7 @@ (in ErrorTypes) - @@ -344,7 +352,7 @@ (in ErrorTypes) - @@ -352,7 +360,7 @@ (in ErrorTypes) - @@ -360,7 +368,7 @@ (in ErrorTypes) - @@ -368,7 +376,7 @@ (in ErrorTypes) - @@ -376,7 +384,7 @@ (in ErrorTypes) - @@ -384,7 +392,7 @@ (in ErrorTypes) - @@ -392,7 +400,7 @@ (in ErrorTypes) - @@ -400,7 +408,7 @@ (in ErrorTypes) - @@ -408,7 +416,7 @@ (in ErrorTypes) - @@ -416,7 +424,7 @@ (in ErrorTypes) - @@ -424,7 +432,7 @@ (in ErrorTypes) - @@ -432,7 +440,7 @@ (in ErrorTypes) - @@ -440,7 +448,7 @@ (in ErrorTypes) - @@ -448,7 +456,7 @@ (in ErrorTypes) - @@ -456,7 +464,7 @@ (in ErrorTypes) - @@ -464,7 +472,7 @@ (in ErrorTypes) - @@ -472,7 +480,7 @@ (in ErrorTypes) - @@ -480,7 +488,7 @@ (in ErrorTypes) - @@ -488,7 +496,7 @@ (in ErrorTypes) - @@ -496,7 +504,7 @@ (in ErrorTypes) - @@ -504,7 +512,7 @@ (in ErrorTypes) - @@ -512,7 +520,7 @@ (in ErrorTypes) - @@ -520,7 +528,7 @@ (in ErrorTypes) - @@ -528,7 +536,7 @@ (in ErrorTypes) - @@ -536,7 +544,7 @@ (in ErrorTypes) - @@ -544,7 +552,7 @@ (in ErrorTypes) - @@ -552,7 +560,7 @@ (in ErrorTypes) - @@ -560,31 +568,31 @@ (in ErrorTypes) - + - + - + - @@ -592,56 +600,56 @@ (in ErrorTypes) - + - + - + - + - + - + - + - - - - -
    @@ -720,7 +721,7 @@
    @@ -766,7 +750,7 @@ (in lxml.tests.common_imports)
    @@ -774,7 +758,7 @@ (in lxml.html.builder)
    @@ -782,7 +766,7 @@ (in lxml.html)
    @@ -790,7 +774,7 @@ (in _Element)
    @@ -798,7 +782,7 @@ (in lxml.html.builder)
    @@ -806,7 +790,7 @@ (in lxml.html.defs)
    @@ -814,7 +798,7 @@ (in lxml.html.diff)
    @@ -822,7 +806,7 @@ (in lxml.html.diff)
    @@ -830,7 +814,7 @@ (in lxml.html.diff)
    @@ -838,7 +822,7 @@ (in lxml.html.diff)
    @@ -846,7 +830,7 @@ (in lxml.html.diff)
    @@ -854,7 +838,7 @@ (in DocInfo)
    @@ -862,7 +846,7 @@ (in TreeBuilder)
    @@ -870,7 +854,7 @@ (in lxml.html.diff)
    @@ -878,7 +862,7 @@ (in lxml.html.diff)
    @@ -886,7 +870,7 @@ (in ElementTreeContentHandler)
    @@ -894,7 +878,7 @@ (in ElementTreeContentHandler)
    @@ -902,7 +886,7 @@ (in ElementTreeContentHandler)
    @@ -910,192 +894,247 @@ (in ElementTreeContentHandler)
    @@ -1133,7 +1172,7 @@
    @@ -437,7 +445,7 @@ (in ETreeOnlyTestCase)
    @@ -445,7 +453,7 @@ (in ETreeOnlyTestCase)
    @@ -453,3075 +461,3171 @@ (in ETreeOnlyTestCase)
    @@ -3529,11 +3633,18 @@
    @@ -3568,7 +3679,7 @@
    @@ -185,7 +193,7 @@
    @@ -184,7 +185,7 @@
    @@ -185,7 +193,7 @@
    @@ -270,7 +294,7 @@ (in ErrorTypes)
    @@ -278,7 +302,7 @@ (in ErrorTypes)
    @@ -286,7 +310,7 @@ (in ErrorTypes)
    @@ -294,7 +318,7 @@ (in lxml.objectify)
    @@ -302,7 +326,7 @@ (in lxml.tests.test_objectify)
    @@ -310,7 +334,7 @@ (in ErrorDomains)
    @@ -318,7 +342,7 @@ (in lxml.etree)
    @@ -326,7 +350,7 @@ (in _ElementTree)
    @@ -334,75 +358,61 @@ (in _XSLTResultTree)
    @@ -437,7 +447,7 @@
    @@ -1998,7 +1950,7 @@
    @@ -177,47 +185,48 @@
    @@ -254,7 +263,7 @@