Document loading and URL resolving
==================================

.. contents::
..
   1  URI Resolvers
   2  Document loading in context
   3  I/O access control in XSLT


lxml supports custom document loaders in both the parsers and XSL
transformations.  These so-called resolvers are subclasses of the
etree.Resolver class.

..
  >>> try: from StringIO import StringIO
  ... except ImportError:
  ...    from io import BytesIO
  ...    def StringIO(s):
  ...        if isinstance(s, str): s = s.encode("UTF-8")
  ...        return BytesIO(s)

URI Resolvers
-------------

Here is an example of a custom resolver:

.. sourcecode:: pycon

  >>> from lxml import etree

  >>> class DTDResolver(etree.Resolver):
  ...     def resolve(self, url, id, context):
  ...         print("Resolving URL '%s'" % url)
  ...         return self.resolve_string(
  ...             '<!ENTITY myentity "[resolved text: %s]">' % url, context)

This defines a resolver that always returns a dynamically generated DTD
fragment defining an entity.  The ``url`` argument passes the system URL of
the requested document, the ``id`` argument is the public ID.  Note that
either of these may be None.  The context object is not normally used by
client code, other than passing it on to the ``resolve_*()`` methods.
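
As a small illustration of the public ID argument, the following sketch looks
up the DTD by its public ID rather than by its system URL.  The class name
``PublicIdResolver``, the public identifier and the entity name are made up for
this example:

```python
from io import BytesIO
from lxml import etree

# Made-up public identifier mapped to a DTD fragment (illustration only).
PUBLIC_DTDS = {"-//EXAMPLE//DTD Doc//EN": '<!ENTITY source "by public id">'}

class PublicIdResolver(etree.Resolver):
    """Hypothetical resolver that selects the DTD by public ID."""

    def __init__(self):
        self.seen = []  # (url, pubid) pairs this resolver was asked for

    def resolve(self, url, pubid, context):
        self.seen.append((url, pubid))
        # Either argument may be None, so check before looking it up.
        if pubid is None or pubid not in PUBLIC_DTDS:
            return None
        return self.resolve_string(PUBLIC_DTDS[pubid], context)

resolver = PublicIdResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

xml = (b'<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc//EN" "doc.dtd">'
       b'<doc>&source;</doc>')
root = etree.parse(BytesIO(xml), parser).getroot()
print(root.text)
print(resolver.seen)
```

The resolver ignores the system URL entirely here, which can be handy when
documents reference the same DTD under different locations.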

Resolving is based on four methods of the Resolver object that build internal
representations of the result document:

* ``resolve_string`` takes a parsable string as result document
* ``resolve_filename`` takes a filename
* ``resolve_file`` takes an open file-like object that has at least a read() method
* ``resolve_empty`` resolves into an empty document

The ``resolve()`` method may choose to return None, in which case the next
registered resolver (or the default resolver) is consulted.  Resolving always
terminates if ``resolve()`` returns the result of any of the above
``resolve_*()`` methods.
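
A minimal sketch of how these pieces can fit together: the resolver below
serves DTD fragments from an in-memory table through ``resolve_file()`` and
returns None for anything it does not know, so that other resolvers get a
chance.  The class name ``InMemoryDTDResolver``, the ``greeting.dtd`` URL and
the entity name are assumptions made for this illustration:

```python
from io import BytesIO
from lxml import etree

class InMemoryDTDResolver(etree.Resolver):
    """Hypothetical resolver that serves DTDs from a dict, else defers."""

    def __init__(self, documents):
        self.documents = documents  # maps system URL -> DTD bytes
        self.deferred = []          # URLs passed on to other resolvers

    def resolve(self, url, pubid, context):
        data = self.documents.get(url)
        if data is None:
            self.deferred.append(url)
            return None  # not ours: let the next resolver try
        # Any object with a read() method is accepted here.
        return self.resolve_file(BytesIO(data), context)

resolver = InMemoryDTDResolver(
    {"greeting.dtd": b'<!ENTITY greeting "hello">'})
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

xml = b'<!DOCTYPE doc SYSTEM "greeting.dtd"><doc>&greeting;</doc>'
root = etree.parse(BytesIO(xml), parser).getroot()
print(root.text)
```

Because each resolver only answers for the URLs it knows, several such
resolvers can be registered on one parser without relying on the order in
which they are consulted.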

Resolvers are registered local to a parser:

.. sourcecode:: pycon

  >>> parser = etree.XMLParser(load_dtd=True)
  >>> parser.resolvers.add( DTDResolver() )

Note that we instantiate a parser that loads the DTD.  This is not done by the
default parser, which does no validation.  When we use this parser to parse a
document that requires resolving a URL, it will call our custom resolver:

.. sourcecode:: pycon

  >>> xml = '<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
  >>> tree = etree.parse(StringIO(xml), parser)
  Resolving URL 'MissingDTD.dtd'
  >>> root = tree.getroot()
  >>> print(root.text)
  [resolved text: MissingDTD.dtd]

The entity in the document was correctly resolved by the generated DTD
fragment.
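
The effect of ``load_dtd=True`` can be sketched by registering one recording
resolver on two parsers.  Only the parser that loads the DTD is expected to
consult it.  The class name ``RecordingResolver`` and the ``doc.dtd`` URL are
hypothetical:

```python
from io import BytesIO
from lxml import etree

class RecordingResolver(etree.Resolver):
    """Hypothetical resolver that records each URL it is asked for."""

    def __init__(self):
        self.calls = []

    def resolve(self, url, pubid, context):
        self.calls.append(url)
        return self.resolve_string('<!ELEMENT doc ANY>', context)

xml = b'<!DOCTYPE doc SYSTEM "doc.dtd"><doc/>'
resolver = RecordingResolver()

# This parser loads the DTD and therefore asks the resolver for it.
dtd_parser = etree.XMLParser(load_dtd=True)
dtd_parser.resolvers.add(resolver)
etree.parse(BytesIO(xml), dtd_parser)
calls_with_dtd_loading = list(resolver.calls)

# A parser with default options never loads the DTD, so the resolver
# stays idle even though it is registered.
plain_parser = etree.XMLParser()
plain_parser.resolvers.add(resolver)
etree.parse(BytesIO(xml), plain_parser)
calls_after_plain_parse = list(resolver.calls)

print(calls_with_dtd_loading, calls_after_plain_parse)
```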


Document loading in context
---------------------------

XML documents memorise their initial parser (and its resolvers) during their
life-time.  This means that a lookup process related to a document will use
the resolvers of the document's parser.  We can demonstrate this with a
resolver that only responds to a specific prefix:

.. sourcecode:: pycon

  >>> class PrefixResolver(etree.Resolver):
  ...     def __init__(self, prefix):
  ...         self.prefix = prefix
  ...         self.result_xml = '''\
  ...              <xsl:stylesheet
  ...                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ...                <test xmlns="testNS">%s-TEST</test>
  ...              </xsl:stylesheet>
  ...              ''' % prefix
  ...     def resolve(self, url, pubid, context):
  ...         if url.startswith(self.prefix):
  ...             print("Resolved url %s as prefix %s" % (url, self.prefix))
  ...             return self.resolve_string(self.result_xml, context)

We demonstrate this in XSLT and use the following stylesheet as an example:

.. sourcecode:: pycon

  >>> xml_text = """\
  ... <xsl:stylesheet version="1.0"
  ...    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ...   <xsl:include href="honk:test"/>
  ...   <xsl:template match="/">
  ...     <test>
  ...       <xsl:value-of select="document('hoi:test')/*/*/text()"/>
  ...     </test>
  ...   </xsl:template>
  ... </xsl:stylesheet>
  ... """

Note that it needs to resolve two URIs: ``honk:test`` when compiling the XSLT
document (i.e. when resolving ``xsl:import`` and ``xsl:include`` elements) and
``hoi:test`` at transformation time, when calls to the ``document`` function
are resolved.  If we now register different resolvers with two different
parsers, we can parse our document twice in different resolver contexts:

.. sourcecode:: pycon

  >>> hoi_parser = etree.XMLParser()
  >>> normal_doc = etree.parse(StringIO(xml_text), hoi_parser)

  >>> hoi_parser.resolvers.add( PrefixResolver("hoi") )
  >>> hoi_doc = etree.parse(StringIO(xml_text), hoi_parser)

  >>> honk_parser = etree.XMLParser()
  >>> honk_parser.resolvers.add( PrefixResolver("honk") )
  >>> honk_doc = etree.parse(StringIO(xml_text), honk_parser)

These contexts are important for the further behaviour of the documents.  They
memorise their original parser so that the correct set of resolvers is used in
subsequent lookups.  To compile the stylesheet, XSLT must resolve the
``honk:test`` URI in the ``xsl:include`` element.  The ``hoi`` resolver cannot
do that:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTParseError: Cannot resolve URI honk:test

  >>> transform = etree.XSLT(hoi_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTParseError: Cannot resolve URI honk:test

However, if we use the ``honk`` resolver associated with the respective
document, everything works fine:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(honk_doc)
  Resolved url honk:test as prefix honk
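
The same compile-time lookup can supply a real stylesheet module.  The
following sketch, using a made-up ``lib:templates`` URI and a hypothetical
``LibraryResolver`` class, includes a template through a resolver registered
on the stylesheet's parser and then applies it:

```python
from io import BytesIO
from lxml import etree

# Hypothetical stylesheet module served for the made-up URI "lib:templates".
LIBRARY_XSL = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="item">
    <name><xsl:value-of select="."/></name>
  </xsl:template>
</xsl:stylesheet>
"""

MAIN_XSL = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:include href="lib:templates"/>
  <xsl:template match="/">
    <names><xsl:apply-templates select="//item"/></names>
  </xsl:template>
</xsl:stylesheet>
"""

class LibraryResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        if url == "lib:templates":
            return self.resolve_string(LIBRARY_XSL, context)
        return None  # leave everything else to other resolvers

# The resolver belongs to the parser that parses the stylesheet document.
style_parser = etree.XMLParser()
style_parser.resolvers.add(LibraryResolver())
style_doc = etree.parse(BytesIO(MAIN_XSL), style_parser)

transform = etree.XSLT(style_doc)  # xsl:include is resolved here
source = etree.XML(b"<list><item>a</item><item>b</item></list>")
result = transform(source)
names = [el.text for el in result.getroot()]
print(names)
```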

Running the transform accesses the same parser context again, but since it now
needs to resolve the ``hoi`` URI in the call to the document function, its
``honk`` resolver will fail to do so:

.. sourcecode:: pycon

  >>> result = transform(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

  >>> result = transform(hoi_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

  >>> result = transform(honk_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

This can only be solved by adding a ``hoi`` resolver to the original parser:

.. sourcecode:: pycon

  >>> honk_parser.resolvers.add( PrefixResolver("hoi") )
  >>> result = transform(honk_doc)
  Resolved url hoi:test as prefix hoi
  >>> print(str(result)[:-1])
  <?xml version="1.0"?>
  <test>hoi-TEST</test>

We can see that the ``hoi`` resolver was called to generate a document that
was then inserted into the result document by the XSLT transformation.  Note
that this is completely independent of the XML file you transform, as the URI
is resolved from within the stylesheet context:

.. sourcecode:: pycon

  >>> result = transform(normal_doc)
  Resolved url hoi:test as prefix hoi
  >>> print(str(result)[:-1])
  <?xml version="1.0"?>
  <test>hoi-TEST</test>

It may be seen as a matter of taste what resolvers the generated document
inherits.  For XSLT, the output document inherits the resolvers of the input
document and not those of the stylesheet.  Therefore, the last result does not
inherit any resolvers at all.


I/O access control in XSLT
--------------------------

By default, XSLT supports all extension functions from libxslt and libexslt as
well as Python regular expressions through EXSLT.  Some extensions enable
stylesheets to read and write files on the local file system.

XSLT has a mechanism to control the access to certain I/O operations during
the transformation process.  This is most interesting where XSL scripts come
from potentially insecure sources and must be prevented from modifying the
local file system.  Note, however, that there is no way to keep them from
eating up your precious CPU time, so this should not stop you from thinking
about what XSLT you execute.

Access control is configured using the ``XSLTAccessControl`` class.  It can be
called with a number of keyword arguments that allow or deny specific
operations:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(honk_doc)
  Resolved url honk:test as prefix honk
  >>> result = transform(normal_doc)
  Resolved url hoi:test as prefix hoi

  >>> ac = etree.XSLTAccessControl(read_network=False)
  >>> transform = etree.XSLT(honk_doc, access_control=ac)
  Resolved url honk:test as prefix honk
  >>> result = transform(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: xsltLoadDocument: read rights for hoi:test denied
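
To check what a given configuration allows, the settings can be inspected
directly.  This sketch assumes the ``options`` property and the predefined
``DENY_WRITE`` and ``DENY_ALL`` instances that current lxml releases provide on
``XSLTAccessControl``; none of them is described in the text above:

```python
from lxml import etree

# Deny network reads only; every other operation keeps its default (allowed).
no_network_reads = etree.XSLTAccessControl(read_network=False)
print(no_network_reads.options)

# Predefined configurations (assumed to be available in your lxml version).
deny_write = etree.XSLTAccessControl.DENY_WRITE
deny_all = etree.XSLTAccessControl.DENY_ALL
print(deny_write.options)
print(deny_all.options)
```

Each key of the returned mapping corresponds to one keyword argument of the
constructor, which makes it easy to see that the controls are set
independently of each other.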

There are a few things to keep in mind:

* XSL parsing (``xsl:import``, etc.) is not affected by this mechanism.
* ``read_file=False`` does not imply ``write_file=False``; all controls are
  independent.
* ``read_file`` only applies to files in the file system.  Any other scheme
  for URLs is controlled by the ``*_network`` keywords.
* If you need more fine-grained control than switching access on and off, you
  should consider writing a custom document loader that returns empty
  documents or raises exceptions if access is denied.
 256   should consider writing a custom document loader that returns empty
 257   documents or raises exceptions if access is denied.