====================================
Using custom Element classes in lxml
====================================

lxml has very sophisticated support for custom Element classes.  You
can provide your own classes for Elements and have lxml use them for
all elements generated by a specific parser, only for a specific tag
name in a specific namespace, or even for one exact element at a
specific position in the tree.

Custom Elements must inherit from the ``lxml.etree.ElementBase`` class, which
provides the Element interface for subclasses:

.. sourcecode:: pycon

  >>> from lxml import etree

  >>> class honk(etree.ElementBase):
  ...    def honking(self):
  ...       return self.get('honking') == 'true'
  ...    honking = property(honking)

This defines a new Element class ``honk`` with a property ``honking``.

The rest of this document describes how you can make lxml.etree use these
custom Element classes.

.. contents::
..
   1  Background on Element proxies
   2  Element initialization
   3  Setting up a class lookup scheme
     3.1  Default class lookup
     3.2  Namespace class lookup
     3.3  Attribute based lookup
     3.4  Custom element class lookup
     3.5  Tree based element class lookup in Python
   4  Generating XML with custom classes
   5  Implementing namespaces

Background on Element proxies
=============================

Being based on libxml2, lxml.etree holds the entire XML tree in a C
structure.  To communicate with Python code, it creates Python proxy
objects for the XML elements on demand.

.. image:: proxies.png

The mapping between C elements and Python Element classes is
completely configurable.  When you ask lxml.etree for an Element by
using its API, it will instantiate your classes for you.  All you have
to do is tell lxml which class to use for which kind of Element.  This
is done through a class lookup scheme, as described in the sections
below.

Element initialization
======================

There is one thing to know up front.  Element classes *must not* have
an ``__init__`` or ``__new__`` method.  There should not be any
internal state either, except for the data stored in the underlying
XML tree.  Element instances are created and garbage collected as
needed, so there is no way to predict when and how often a proxy is
created for them.  Even worse, when the ``__init__`` method is called,
the object is not even initialized yet to represent the XML tag, so
there is not much use in providing an ``__init__`` method in
subclasses.

Most use cases will not require any class initialisation, so you can
safely skip to the next section for now.  However, if you really
need to set up your element class on instantiation, there is one possible way
to do so.  ElementBase classes have an ``_init()`` method that can be
overridden.  It can be used to modify the XML tree, e.g. to construct special
children or verify and update attributes.

The semantics of ``_init()`` are as follows:

* It is called once at Element class instantiation time.  That is,
  when a Python representation of the element is created by lxml.  At
  that time, the element object is completely initialized to represent
  a specific XML element within the tree.

* The method has complete access to the XML tree.  Modifications can be done
  in exactly the same way as anywhere else in the program.

* Python representations of elements may be created multiple times during the
  lifetime of an XML element in the underlying C tree.  The ``_init()`` code
  provided by subclasses must itself ensure that multiple
  executions either are harmless or are prevented by some kind of
  flag in the XML tree.  The latter can be achieved by modifying an attribute
  value or by removing or adding a specific child node and then verifying this
  before running through the init process.

* Any exceptions raised in ``_init()`` will be propagated through the API
  call that led to the creation of the Element.  So be careful with the code
  you write here, as its exceptions may turn up in various unexpected places.

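
The "flag in the XML tree" approach from the list above can be sketched as
follows.  This is only a minimal illustration: the ``Counter`` class and its
``count`` attribute are hypothetical names chosen for this example, and the
attribute itself serves as the flag that makes repeated ``_init()`` calls
harmless.

.. sourcecode:: python

  from lxml import etree


  class Counter(etree.ElementBase):
      """Hypothetical element class that keeps all of its state in the XML tree."""

      def _init(self):
          # May run again for every new proxy of the same XML element,
          # so only set the default while the attribute is still missing.
          if self.get("count") is None:
              self.set("count", "0")

      def increment(self):
          self.set("count", str(int(self.get("count")) + 1))


  parser = etree.XMLParser()
  parser.set_element_class_lookup(
      etree.ElementDefaultClassLookup(element=Counter))

  root = etree.fromstring("<root><item/></root>", parser)
  root[0].increment()  # the proxy for <item> may be discarded after this call
  root[0].increment()  # a fresh proxy may run _init() again; the guard keeps the count

Because the counter lives in an attribute rather than on the Python object,
it survives even if the proxy is garbage collected between the two calls.
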
Setting up a class lookup scheme
================================

The first thing to do when deploying custom element classes is to register a
class lookup scheme on a parser.  lxml.etree provides quite a number of
different schemes that also support class lookup based on namespaces or
attribute values.  Most lookups support fallback chaining, which allows the
next lookup mechanism to take over when the previous one fails to find a
class.

For example, setting the ``honk`` Element as a default element class
for a parser works as follows:

.. sourcecode:: pycon

  >>> parser_lookup = etree.ElementDefaultClassLookup(element=honk)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(parser_lookup)

There is one drawback of the parser based scheme: the ``Element()`` factory
does not know about your specialised parser and creates a new document that
uses the default parser:

.. sourcecode:: pycon

  >>> el = etree.Element("root")
  >>> print(isinstance(el, honk))
  False

You should therefore avoid using this factory function in code that
uses custom classes.  The ``makeelement()`` method of parsers provides
a simple replacement:

.. sourcecode:: pycon

  >>> el = parser.makeelement("root")
  >>> print(isinstance(el, honk))
  True

If you use a parser at the module level, you can easily redirect a module
level ``Element()`` factory to the parser method by adding code like this:

.. sourcecode:: pycon

  >>> module_level_parser = etree.XMLParser()
  >>> Element = module_level_parser.makeelement

While the ``XML()`` and ``HTML()`` factories also depend on the default
parser, you can pass them a different parser as second argument:

.. sourcecode:: pycon

  >>> element = etree.XML("<test/>")
  >>> print(isinstance(element, honk))
  False

  >>> element = etree.XML("<test/>", parser)
  >>> print(isinstance(element, honk))
  True

Whenever you create a document with a parser, it will inherit the lookup
scheme and all subsequent element instantiations for this document will use
it:

.. sourcecode:: pycon

  >>> element = etree.fromstring("<test/>", parser)
  >>> print(isinstance(element, honk))
  True
  >>> el = etree.SubElement(element, "subel")
  >>> print(isinstance(el, honk))
  True

For testing code in the Python interpreter and for small projects, you
may also consider setting a lookup scheme on the default parser.  To
avoid interfering with other modules, however, it is usually a better
idea to use a dedicated parser for each module (or a parser pool when
using threads) and then register the required lookup scheme only for
this parser.

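
A dedicated parser per module could be organised roughly like this.  The
sketch assumes that one private parser per thread (kept in a
``threading.local``) is an acceptable stand-in for a parser pool; ``Honk``,
``get_parser()`` and ``fromstring()`` are hypothetical names, not part of
lxml.

.. sourcecode:: python

  import threading

  from lxml import etree


  class Honk(etree.ElementBase):
      """Hypothetical custom element class for this module."""


  _local = threading.local()


  def get_parser():
      """Return the current thread's private parser, creating it on first use."""
      parser = getattr(_local, "parser", None)
      if parser is None:
          parser = etree.XMLParser()
          parser.set_element_class_lookup(
              etree.ElementDefaultClassLookup(element=Honk))
          _local.parser = parser
      return parser


  def fromstring(text):
      """Parse ``text`` with the module's own parser, not the default one."""
      return etree.fromstring(text, get_parser())


  root = fromstring("<root><child/></root>")

Code in other modules that calls ``etree.fromstring()`` without a parser
argument keeps using the default parser and is not affected by this lookup.
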
Default class lookup
--------------------

This is the simplest lookup mechanism.  It always returns the default
element class.  Consequently, no further fallbacks are supported, but this
scheme is a nice fallback for other custom lookup mechanisms.  It is set up
as follows:

.. sourcecode:: pycon

  >>> lookup = etree.ElementDefaultClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

Note that new parsers use the global fallback by default, which is
also the default lookup (if not configured otherwise).

To change the default element implementation, you can pass your new class to
the constructor.  While it accepts classes for ``element``, ``comment`` and
``pi`` nodes, most use cases will only override the element class:

.. sourcecode:: pycon

  >>> el = parser.makeelement("myelement")
  >>> print(isinstance(el, honk))
  False

  >>> lookup = etree.ElementDefaultClassLookup(element=honk)
  >>> parser.set_element_class_lookup(lookup)

  >>> el = parser.makeelement("myelement")
  >>> print(isinstance(el, honk))
  True

  >>> el = parser.makeelement("myelement", honking='true')
  >>> etree.tostring(el)
  b'<myelement honking="true"/>'

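
The ``comment`` argument works along the same lines.  The following sketch
assumes that comment classes inherit from ``lxml.etree.CommentBase``;
``TodoComment`` and its ``is_todo`` property are hypothetical names used
only for illustration.

.. sourcecode:: python

  from lxml import etree


  class TodoComment(etree.CommentBase):
      """Hypothetical comment class that recognises TODO notes."""

      @property
      def is_todo(self):
          return (self.text or "").strip().startswith("TODO")


  # Only comments are overridden; elements keep the default class.
  lookup = etree.ElementDefaultClassLookup(comment=TodoComment)
  parser = etree.XMLParser()
  parser.set_element_class_lookup(lookup)

  root = etree.fromstring("<root><!-- TODO: fix this --><child/></root>", parser)
  comment = root[0]
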
Namespace class lookup
----------------------

This is an advanced lookup mechanism that supports namespace/tag-name specific
element classes.  You can select it by calling:

.. sourcecode:: pycon

  >>> lookup = etree.ElementNamespaceClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

See the separate section on `implementing namespaces`_ below to learn how to
use it.

.. _`implementing namespaces`: #implementing-namespaces

This scheme supports a fallback mechanism that is used in the case where the
namespace is not found or no class was registered for the element name.
Normally, the default class lookup is used here.  To change it, pass the
desired fallback lookup scheme to the constructor:

.. sourcecode:: pycon

  >>> fallback = etree.ElementDefaultClassLookup(element=honk)
  >>> lookup = etree.ElementNamespaceClassLookup(fallback)
  >>> parser.set_element_class_lookup(lookup)

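
To see the fallback in action, consider the following self-contained
sketch.  ``Honk`` and ``Generic`` are hypothetical classes; only the
``honk`` tag in the ``http://hui.de/honk`` namespace is registered, so every
other element should end up with the fallback class.

.. sourcecode:: python

  from lxml import etree


  class Honk(etree.ElementBase):
      @property
      def honking(self):
          return self.get("honking") == "true"


  class Generic(etree.ElementBase):
      """Hypothetical fallback class for all unregistered elements."""


  fallback = etree.ElementDefaultClassLookup(element=Generic)
  lookup = etree.ElementNamespaceClassLookup(fallback)
  parser = etree.XMLParser()
  parser.set_element_class_lookup(lookup)

  # Register a class for one tag name in one namespace.
  namespace = lookup.get_namespace("http://hui.de/honk")
  namespace["honk"] = Honk

  root = etree.fromstring(
      '<root><h:honk xmlns:h="http://hui.de/honk" honking="true"/><other/></root>',
      parser)
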
Attribute based lookup
----------------------

This scheme uses a mapping from attribute values to classes.  An attribute
name is set at initialisation time and is then used to find the corresponding
value in a dictionary.  It is set up as follows:

.. sourcecode:: pycon

  >>> id_class_mapping = {'1234' : honk} # maps attribute values to classes

  >>> lookup = etree.AttributeBasedElementClassLookup(
  ...                                      'id', id_class_mapping)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

This class uses its fallback if the attribute is not found or its value is not
in the mapping.  Normally, the default class lookup is used here.  If you want
to use the namespace lookup, for example, you can use this code:

.. sourcecode:: pycon

  >>> fallback = etree.ElementNamespaceClassLookup()
  >>> lookup = etree.AttributeBasedElementClassLookup(
  ...                       'id', id_class_mapping, fallback)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

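
As a small illustration of how the mapping is applied while parsing, the
sketch below uses a hypothetical ``Honk`` class and an invented sample
document.  Only the element whose ``id`` value appears in the mapping is
expected to get the custom class.

.. sourcecode:: python

  from lxml import etree


  class Honk(etree.ElementBase):
      """Hypothetical class for the element with id="1234"."""


  id_class_mapping = {"1234": Honk}  # maps attribute values to classes
  lookup = etree.AttributeBasedElementClassLookup("id", id_class_mapping)
  parser = etree.XMLParser()
  parser.set_element_class_lookup(lookup)

  root = etree.fromstring(
      '<root><a id="1234"/><a id="5678"/><a/></root>', parser)

  # Which children were mapped to the custom class?
  matches = [isinstance(child, Honk) for child in root]
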
Custom element class lookup
---------------------------

This is the most customisable way of finding element classes on a per-element
basis.  It allows you to implement a custom lookup scheme in a subclass:

.. sourcecode:: pycon

  >>> class MyLookup(etree.CustomElementClassLookup):
  ...     def lookup(self, node_type, document, namespace, name):
  ...         return honk  # be a bit more selective here ...

  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(MyLookup())

The ``.lookup()`` method must return either ``None`` (which triggers the
fallback mechanism) or a subclass of ``lxml.etree.ElementBase``.  It
can make any decision it wants based on the node type (one of
"element", "comment", "PI", "entity"), the XML document of the
element, or its namespace or tag name.

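
A somewhat more selective ``lookup()`` might look like the sketch below.
The ``Honk`` class, the ``SelectiveLookup`` name and the rule itself (only
un-namespaced ``honk`` elements get the custom class) are assumptions made
for this example.

.. sourcecode:: python

  from lxml import etree


  class Honk(etree.ElementBase):
      """Hypothetical class for <honk> elements without a namespace."""


  class SelectiveLookup(etree.CustomElementClassLookup):
      def lookup(self, node_type, document, namespace, name):
          if node_type == "element" and namespace is None and name == "honk":
              return Honk
          return None  # hand everything else to the fallback lookup


  parser = etree.XMLParser()
  parser.set_element_class_lookup(SelectiveLookup())

  root = etree.fromstring("<root><honk/><!-- a comment --></root>", parser)

Returning ``None`` for comments and other node types matters here, because
an Element class is not a valid class for a comment node.
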
Tree based element class lookup in Python
-----------------------------------------

Making more elaborate decisions than the custom scheme allows is
difficult to achieve in pure Python, as it results in a
chicken-and-egg problem.  It would require access to the tree before
the elements in the tree have been instantiated as Python Element
proxies.

Luckily, there is a way to do this.  The ``PythonElementClassLookup``
works similarly to the custom lookup scheme:

.. sourcecode:: pycon

  >>> class MyLookup(etree.PythonElementClassLookup):
  ...     def lookup(self, document, element):
  ...         return MyElementClass # defined elsewhere

  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(MyLookup())

As in the custom scheme, the ``document`` argument of the ``lookup()`` method
is the opaque document instance that contains the Element.  The second
argument is a lightweight Element proxy implementation that is only valid
during the lookup.  Do not try to keep a reference to it.  Once the lookup is
finished, the proxy will become invalid.  You will get an ``AssertionError``
if you access any of the properties or methods outside the scope of the
lookup call where they were instantiated.

During the lookup, the element object behaves mostly like a normal Element
instance.  It provides the properties ``tag``, ``text``, ``tail`` etc. and
supports indexing, slicing and the ``getchildren()``, ``getparent()``
etc. methods.  It does *not* support iteration, nor does it support any kind
of modification.  All of its properties are read-only and it cannot be removed
or inserted into other trees.  You can use it as a starting point to freely
traverse the tree and collect any kind of information that its elements
provide.  Once you have decided which class to use for this
element, you can simply return it and have lxml take care of cleaning up the
instantiated proxy classes.

Sidenote: this lookup scheme originally lived in a separate module called
``lxml.pyclasslookup``.

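
As a sketch of a decision that depends on the surrounding tree, the lookup
below picks a class based on the parent of the element.  ``ListItem``,
``TreeLookup`` and the ``list`` tag name are hypothetical; the code only
relies on the ``tag`` property and ``getparent()`` of the read-only proxy
described above.

.. sourcecode:: python

  from lxml import etree


  class ListItem(etree.ElementBase):
      """Hypothetical class for any element directly inside a <list>."""


  class TreeLookup(etree.PythonElementClassLookup):
      def lookup(self, document, element):
          # 'element' is a read-only proxy that is only valid inside this call.
          if not isinstance(element.tag, str):
              return None  # comments etc. go to the fallback
          parent = element.getparent()
          if parent is not None and parent.tag == "list":
              return ListItem
          return None


  parser = etree.XMLParser()
  parser.set_element_class_lookup(TreeLookup())

  root = etree.fromstring("<doc><list><x/><y/></list><x/></doc>", parser)

Here the two ``<x/>`` elements share a tag name but get different classes,
which none of the name-based schemes could express.
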
Generating XML with custom classes
==================================

Up to lxml 2.1, you could not instantiate proxy classes yourself.
Only lxml.etree could do that when creating an object representation
of an existing XML element.  Since lxml 2.2, however, instantiating
a custom Element class simply creates a new Element:

.. sourcecode:: pycon

  >>> el = honk(honking='true')

Note, however, that the proxy you create here will be garbage
collected just like any other proxy.  You therefore cannot count on
lxml.etree using the same class that you instantiated when you access
this Element a second time after letting its reference go.  You should
always use a corresponding class lookup scheme that returns
your Element proxy classes for the elements that they create.  The
``ElementNamespaceClassLookup`` is generally a good match.

You can use custom Element classes to quickly create XML fragments:

.. sourcecode:: pycon

  >>> class hale(etree.ElementBase): pass
  >>> class bopp(etree.ElementBase): pass

  >>> el = hale("some ", honk(honking='true'), bopp, " text")

  >>> print(etree.tostring(el, encoding='unicode'))
  <hale>some <honk honking="true"/><bopp/> text</hale>

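
A slightly larger fragment can be built the same way.  In this sketch the
class names ``article``, ``para`` and ``Heading`` are hypothetical, and the
``TAG`` class attribute is assumed to be available for giving a class a tag
name that differs from its class name.

.. sourcecode:: python

  from lxml import etree


  class article(etree.ElementBase):
      pass


  class para(etree.ElementBase):
      pass


  class Heading(etree.ElementBase):
      TAG = "title"  # tag name differs from the class name


  # Positional arguments become text and children, keywords become attributes.
  doc = article(
      Heading("Honking"),
      para("Loud.", lang="en"),
  )
  xml = etree.tostring(doc, encoding="unicode")
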
Implementing namespaces
=======================

lxml allows you to implement namespaces, in a rather literal sense.  After
setting up the namespace class lookup mechanism as described above, you can
build a new element namespace (or retrieve an existing one) by calling the
``get_namespace(uri)`` method of the lookup:

.. sourcecode:: pycon

  >>> lookup = etree.ElementNamespaceClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

  >>> namespace = lookup.get_namespace('http://hui.de/honk')

You can then register the new element type with that namespace, say, under
the tag name ``honk``:

.. sourcecode:: pycon

  >>> namespace['honk'] = honk

If you have many Element classes declared in one module, and they are
all named like the elements they create, you can simply use
``namespace.update(vars())`` at the end of your module to declare them
automatically.  The implementation is smart enough to ignore
everything that is not an Element class.

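
Bulk registration can be sketched as follows.  To keep the example
self-contained, it passes an explicit mapping of tag names to classes
instead of ``vars()``; the ``honk`` and ``bla`` classes are hypothetical.

.. sourcecode:: python

  from lxml import etree


  class honk(etree.ElementBase):
      pass


  class bla(etree.ElementBase):
      pass


  lookup = etree.ElementNamespaceClassLookup()
  parser = etree.XMLParser()
  parser.set_element_class_lookup(lookup)

  namespace = lookup.get_namespace("http://hui.de/honk")
  # Register several classes at once, keyed by the tag name they implement.
  namespace.update({"honk": honk, "bla": bla})

  root = etree.fromstring(
      '<honk xmlns="http://hui.de/honk"><bla/></honk>', parser)
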
After this, you create and use your XML elements through the normal API of
lxml:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"/>'
  >>> honk_element = etree.XML(xml, parser)
  >>> print(honk_element.honking)
  True

The same works when creating elements by hand:

.. sourcecode:: pycon

  >>> honk_element = parser.makeelement('{http://hui.de/honk}honk',
  ...                                   honking='true')
  >>> print(honk_element.honking)
  True

Essentially, this allows you to give Elements a custom API
based on their namespace and tag name.

A somewhat related topic is `extension functions`_, which use a similar
mechanism for registering extension functions in XPath and XSLT.

.. _`extension functions`: extensions.html

In the setup example above, we associated the ``honk`` Element class
only with the 'honk' element.  If an XML tree contains different
elements in the same namespace, they do not pick up the same
implementation:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"><bla/></honk>'
  >>> honk_element = etree.XML(xml, parser)
  >>> print(honk_element.honking)
  True
  >>> print(honk_element[0].honking)
  Traceback (most recent call last):
  ...
  AttributeError: 'lxml.etree._Element' object has no attribute 'honking'

You can therefore provide one implementation per element name in each
namespace and have lxml select the right one on the fly.  If you want one
element implementation per namespace (ignoring the element name) or prefer
having a common class for most elements except a few, you can specify a
default implementation for an entire namespace by registering that class with
the empty element name (``None``).

You may consider following an object oriented approach here.  If you build a
class hierarchy of element classes, you can also implement a base class for a
namespace that is used if no specific element class is provided.  Again, you
can just pass ``None`` as an element name:

.. sourcecode:: pycon

  >>> class HonkNSElement(etree.ElementBase):
  ...    def honk(self):
  ...       return "HONK"
  >>> namespace[None] = HonkNSElement  # default Element for namespace

  >>> class HonkElement(HonkNSElement):
  ...    def honking(self):
  ...       return self.get('honking') == 'true'
  ...    honking = property(honking)
  >>> namespace['honk'] = HonkElement  # Element for specific tag

Now you can rely on lxml to always return objects of type HonkNSElement or its
subclasses for elements of this namespace:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"><bla/></honk>'
  >>> honk_element = etree.XML(xml, parser)

  >>> print(type(honk_element))
  <class 'HonkElement'>
  >>> print(type(honk_element[0]))
  <class 'HonkNSElement'>

  >>> print(honk_element.honking)
  True
  >>> print(honk_element.honk())
  HONK

  >>> print(honk_element[0].honk())
  HONK
  >>> print(honk_element[0].honking)
  Traceback (most recent call last):
  ...
  AttributeError: 'HonkNSElement' object has no attribute 'honking'