Document loading and URL resolving
==================================

* Document loading in context
* I/O access control in XSLT

lxml supports custom document loaders in both the parsers and XSL
transformations.  These so-called resolvers are subclasses of the
``etree.Resolver`` class.

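The examples in this document use a small ``StringIO`` helper that accepts
text strings on both Python 2 and Python 3:

.. sourcecode:: pycon

  >>> try: from StringIO import StringIO
  ... except ImportError:
  ...    from io import BytesIO
  ...    def StringIO(s):
  ...        if isinstance(s, str): s = s.encode("UTF-8")
  ...        return BytesIO(s)
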
Here is an example of a custom resolver:

.. sourcecode:: pycon

  >>> from lxml import etree

  >>> class DTDResolver(etree.Resolver):
  ...     def resolve(self, url, id, context):
  ...         print("Resolving URL '%s'" % url)
  ...         return self.resolve_string(
  ...             '<!ENTITY myentity "[resolved text: %s]">' % url, context)

This defines a resolver that always returns a dynamically generated DTD
fragment defining an entity.  The ``url`` argument passes the system URL of
the requested document, the ``id`` argument is the public ID.  Note that
either of these may be ``None``.  The context object is not normally used by
client code.

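
To illustrate the two arguments, the following sketch records what a resolver
receives for a ``PUBLIC`` document type declaration.  The class name
``RecordingResolver``, the public ID and the file name ``doc.dtd`` are made up
for this example; the DTD content is generated on the fly, as above.

```python
from io import BytesIO
from lxml import etree


class RecordingResolver(etree.Resolver):
    """Hypothetical resolver that remembers every (url, public ID) pair."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def resolve(self, url, pubid, context):
        self.calls.append((url, pubid))
        # Answer with a generated DTD so that the entity used below is defined.
        return self.resolve_string('<!ENTITY myentity "resolved">', context)


resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

xml = (b'<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc//EN" "doc.dtd">'
       b'<doc>&myentity;</doc>')
root = etree.parse(BytesIO(xml), parser).getroot()

# Show what the parser handed to the resolver.
for url, pubid in resolver.calls:
    print("system URL: %r, public ID: %r" % (url, pubid))
```

With a ``SYSTEM`` declaration, as in the other examples, there is no public
ID, so a resolver should be prepared to receive ``None`` for it.
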
Resolving is based on four methods of the Resolver object that build internal
representations of the result document:

* ``resolve_string`` takes a parsable string as result document
* ``resolve_filename`` takes a filename
* ``resolve_file`` takes an open file-like object that has at least a read() method
* ``resolve_empty`` resolves into an empty document

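
As a rough sketch of how two of these methods might be used, the following
resolvers deliver the same DTD fragment once from an in-memory file object and
once from a file on disk.  The class names, the ``parse_with`` helper and the
``greeting.dtd`` file are assumptions made for this illustration only.

```python
import os
import tempfile
from io import BytesIO
from lxml import etree

DTD = b'<!ENTITY greeting "hello">'
XML = b'<!DOCTYPE doc SYSTEM "greeting.dtd"><doc>&greeting;</doc>'


class FileObjectResolver(etree.Resolver):
    """Hypothetical resolver that answers with an open file-like object."""

    def resolve(self, url, pubid, context):
        return self.resolve_file(BytesIO(DTD), context)


class FilenameResolver(etree.Resolver):
    """Hypothetical resolver that redirects every lookup to a local file."""

    def __init__(self, path):
        super().__init__()
        self.path = path

    def resolve(self, url, pubid, context):
        return self.resolve_filename(self.path, context)


def parse_with(resolver):
    # Each call uses a fresh DTD-loading parser with exactly one resolver.
    parser = etree.XMLParser(load_dtd=True)
    parser.resolvers.add(resolver)
    return etree.parse(BytesIO(XML), parser).getroot().text


from_file_object = parse_with(FileObjectResolver())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.dtd")
    with open(path, "wb") as f:
        f.write(DTD)
    from_filename = parse_with(FilenameResolver(path))

print(from_file_object, from_filename)
```
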
The ``resolve()`` method may choose to return None, in which case the next
registered resolver (or the default resolver) is consulted.  Resolving always
terminates if ``resolve()`` returns the result of any of the above
``resolve_*()`` methods.

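
The fall-through behaviour can be sketched with two resolvers on one parser,
one that always declines and one that always answers.  Both class names are
invented for this example, and the sketch deliberately makes no assumption
about the order in which the registered resolvers are consulted.

```python
from io import BytesIO
from lxml import etree


class DecliningResolver(etree.Resolver):
    """Hypothetical resolver that never handles a request."""

    def resolve(self, url, pubid, context):
        # Returning None passes the request on to another resolver.
        return None


class AnsweringResolver(etree.Resolver):
    """Hypothetical resolver that handles every request."""

    def resolve(self, url, pubid, context):
        # Returning a resolve_*() result ends the lookup.
        return self.resolve_string('<!ENTITY myentity "fallback">', context)


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(DecliningResolver())
parser.resolvers.add(AnsweringResolver())

xml = b'<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
root = etree.parse(BytesIO(xml), parser).getroot()
print(root.text)
```

Whichever resolver is asked first, the declining one cannot block the lookup,
so the entity ends up defined by the answering resolver.
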
Resolvers are registered local to a parser:

.. sourcecode:: pycon

  >>> parser = etree.XMLParser(load_dtd=True)
  >>> parser.resolvers.add( DTDResolver() )

Note that we instantiate a parser that loads the DTD.  This is not done by the
default parser, which does no validation.  When we use this parser to parse a
document that requires resolving a URL, it will call our custom resolver:

.. sourcecode:: pycon

  >>> xml = '<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
  >>> tree = etree.parse(StringIO(xml), parser)
  Resolving URL 'MissingDTD.dtd'
  >>> root = tree.getroot()
  >>> print(root.text)
  [resolved text: MissingDTD.dtd]

The entity in the document was correctly resolved by the generated DTD
fragment.

Document loading in context
---------------------------

XML documents memorise their initial parser (and its resolvers) during their
lifetime.  This means that a lookup process related to a document will use
the resolvers of the document's parser.  We can demonstrate this with a
resolver that only responds to a specific prefix:

.. sourcecode:: pycon

  >>> class PrefixResolver(etree.Resolver):
  ...     def __init__(self, prefix):
  ...         self.prefix = prefix
  ...         self.result_xml = '''\
  ...              <xsl:stylesheet
  ...                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ...                <test xmlns="testNS">%s-TEST</test>
  ...              </xsl:stylesheet>
  ...              ''' % prefix
  ...     def resolve(self, url, pubid, context):
  ...         if url.startswith(self.prefix):
  ...             print("Resolved url %s as prefix %s" % (url, self.prefix))
  ...             return self.resolve_string(self.result_xml, context)

We demonstrate this in XSLT and use the following stylesheet as an example:

.. sourcecode:: pycon

  >>> xml_text = """\
  ... <xsl:stylesheet version="1.0"
  ...    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ...   <xsl:include href="honk:test"/>
  ...   <xsl:template match="/">
  ...     <test>
  ...       <xsl:value-of select="document('hoi:test')/*/*/text()"/>
  ...     </test>
  ...   </xsl:template>
  ... </xsl:stylesheet>
  ... """

Note that it needs to resolve two URIs: ``honk:test`` when compiling the XSLT
document (i.e. when resolving ``xsl:import`` and ``xsl:include`` elements) and
``hoi:test`` at transformation time, when calls to the ``document`` function
are resolved.  If we now register different resolvers with two different
parsers, we can parse our document twice in different resolver contexts:

.. sourcecode:: pycon

  >>> hoi_parser = etree.XMLParser()
  >>> normal_doc = etree.parse(StringIO(xml_text), hoi_parser)

  >>> hoi_parser.resolvers.add( PrefixResolver("hoi") )
  >>> hoi_doc = etree.parse(StringIO(xml_text), hoi_parser)

  >>> honk_parser = etree.XMLParser()
  >>> honk_parser.resolvers.add( PrefixResolver("honk") )
  >>> honk_doc = etree.parse(StringIO(xml_text), honk_parser)

These contexts are important for the further behaviour of the documents.  They
memorise their original parser so that the correct set of resolvers is used in
subsequent lookups.  To compile the stylesheet, XSLT must resolve the
``honk:test`` URI in the ``xsl:include`` element.  The ``hoi`` resolver cannot
do that:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTParseError: Cannot resolve URI honk:test

  >>> transform = etree.XSLT(hoi_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTParseError: Cannot resolve URI honk:test

However, if we use the ``honk`` resolver associated with the respective
document, everything works fine:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(honk_doc)
  Resolved url honk:test as prefix honk

Running the transform accesses the same parser context again, but since it now
needs to resolve the ``hoi`` URI in the call to the document function, its
``honk`` resolver will fail to do so:

.. sourcecode:: pycon

  >>> result = transform(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

  >>> result = transform(hoi_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

  >>> result = transform(honk_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

This can only be solved by adding a ``hoi`` resolver to the original parser:

.. sourcecode:: pycon

  >>> honk_parser.resolvers.add( PrefixResolver("hoi") )
  >>> result = transform(honk_doc)
  Resolved url hoi:test as prefix hoi
  >>> print(str(result)[:-1])
  <?xml version="1.0"?>
  <test>hoi-TEST</test>

We can see that the ``hoi`` resolver was called to generate a document that
was then inserted into the result document by the XSLT transformation.  Note
that this is completely independent of the XML file you transform, as the URI
is resolved from within the stylesheet context:

.. sourcecode:: pycon

  >>> result = transform(normal_doc)
  Resolved url hoi:test as prefix hoi
  >>> print(str(result)[:-1])
  <?xml version="1.0"?>
  <test>hoi-TEST</test>

It may be seen as a matter of taste what resolvers the generated document
inherits.  For XSLT, the output document inherits the resolvers of the input
document and not those of the stylesheet.  Therefore, the last result does not
inherit any resolvers at all.

I/O access control in XSLT
--------------------------

By default, XSLT supports all extension functions from libxslt and libexslt as
well as Python regular expressions through EXSLT.  Some extensions enable
style sheets to read and write files on the local file system.

XSLT has a mechanism to control the access to certain I/O operations during
the transformation process.  This is most interesting where XSL scripts come
from potentially insecure sources and must be prevented from modifying the
local file system.  Note, however, that there is no way to keep them from
eating up your precious CPU time, so access control is no substitute for
thinking about what XSLT you execute.

Access control is configured using the ``XSLTAccessControl`` class.  It can be
called with a number of keyword arguments that allow or deny specific
operations:

.. sourcecode:: pycon

  >>> transform = etree.XSLT(honk_doc)
  Resolved url honk:test as prefix honk
  >>> result = transform(normal_doc)
  Resolved url hoi:test as prefix hoi

  >>> ac = etree.XSLTAccessControl(read_network=False)
  >>> transform = etree.XSLT(honk_doc, access_control=ac)
  Resolved url honk:test as prefix honk
  >>> result = transform(normal_doc)
  Traceback (most recent call last):
    ...
  lxml.etree.XSLTApplyError: xsltLoadDocument: read rights for hoi:test denied

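
As a small illustration of the keyword arguments, the following sketch
inspects an access control object.  It assumes that the installed lxml version
provides the ``options`` property and the predefined ``DENY_ALL`` instance on
``XSLTAccessControl``; the variable names are chosen for this example.

```python
from lxml import etree

# Deny only network reads; the remaining permissions keep their defaults.
ac = etree.XSLTAccessControl(read_network=False)
for name, allowed in sorted(ac.options.items()):
    print("%-13s %s" % (name, allowed))

# DENY_ALL is a ready-made configuration that forbids every operation.
deny_all = etree.XSLTAccessControl.DENY_ALL
print(sorted(deny_all.options.items()))
```

Printing the options before handing the object to ``etree.XSLT()`` is a cheap
way to check that the intended permissions are the ones actually configured.
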
There are a few things to keep in mind:

* XSL parsing (``xsl:import``, etc.) is not affected by this mechanism
* ``read_file=False`` does not imply ``write_file=False``; all controls are
  independent.
* ``read_file`` only applies to files in the file system.  Any other scheme
  for URLs is controlled by the ``*_network`` keywords.
* If you need more fine-grained control than switching access on and off, you
  should consider writing a custom document loader that returns empty
  documents or raises exceptions if access is denied.

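
The last point could be approached along the following lines.  This is only a
sketch: the ``AllowListResolver`` class, the ``app:`` URLs and the document
content are invented for the example, and it answers refused lookups with an
empty document rather than raising an exception.

```python
from io import BytesIO
from lxml import etree


class AllowListResolver(etree.Resolver):
    """Hypothetical loader that serves known URLs only and answers every
    other request with an empty document."""

    def __init__(self, documents):
        super().__init__()
        self.documents = documents   # maps URL -> XML string
        self.denied = []             # URLs that were refused

    def resolve(self, url, pubid, context):
        if url in self.documents:
            return self.resolve_string(self.documents[url], context)
        self.denied.append(url)
        return self.resolve_empty(context)


resolver = AllowListResolver({"app:greeting": "<msg>hello</msg>"})
parser = etree.XMLParser()
parser.resolvers.add(resolver)

# The stylesheet is parsed with this parser, so its document() calls
# go through the resolver above.
stylesheet = etree.parse(BytesIO(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out>
      <ok><xsl:value-of select="document('app:greeting')/*"/></ok>
      <no><xsl:value-of select="document('app:secret')/*"/></no>
    </out>
  </xsl:template>
</xsl:stylesheet>
"""), parser)

transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<doc/>"))

print(result.getroot().findtext("ok"))
print(resolver.denied)
```

Keeping a list of refused URLs is optional, but it makes it easy to log or
audit what a stylesheet from an untrusted source tried to load.
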