====================================
Using custom Element classes in lxml
====================================

lxml has very sophisticated support for custom Element classes.  You
can provide your own classes for Elements and have lxml use them by
default for all elements generated by a specific parser, only for a
specific tag name in a specific namespace or even for an exact element
at a specific position in the tree.

Custom Elements must inherit from the ``lxml.etree.ElementBase`` class, which
provides the Element interface for subclasses:

.. sourcecode:: pycon

  >>> from lxml import etree

  >>> class honk(etree.ElementBase):
  ...    @property
  ...    def honking(self):
  ...       return self.get('honking') == 'true'

This defines a new Element class ``honk`` with a property ``honking``.

The rest of this document describes how you can make lxml.etree use these
custom Element classes.

.. contents::
..
   1  Background on Element proxies
   2  Element initialization
   3  Setting up a class lookup scheme
     3.1  Default class lookup
     3.2  Namespace class lookup
     3.3  Attribute based lookup
     3.4  Custom element class lookup
     3.5  Tree based element class lookup in Python
   4  Generating XML with custom classes
   5  Implementing namespaces

..
  >>> try: _ = unicode
  ... except NameError: unicode = str


Background on Element proxies
=============================

Being based on libxml2, lxml.etree holds the entire XML tree in a C
structure.  To communicate with Python code, it creates Python proxy
objects for the XML elements on demand.

   .. image:: proxies.png

The mapping between C elements and Python Element classes is
completely configurable.  When you ask lxml.etree for an Element by
using its API, it will instantiate your classes for you.  All you have
to do is tell lxml which class to use for which kind of Element.  This
is done through a class lookup scheme, as described in the sections
below.


Element initialization
======================

There is one thing to know up front.  Element classes *must not* have
an ``__init__`` or ``__new__`` method.  There should not be any
internal state either, except for the data stored in the underlying
XML tree.  Element instances are created and garbage collected as
needed, so there is no way to predict when and how often a proxy is
created for them.  Even worse, when the ``__init__`` method is called,
the object is not even initialized yet to represent the XML tag, so
there is not much use in providing an ``__init__`` method in
subclasses.

Most use cases will not require any class initialisation, so you can skip
to the next section for now.  However, if you really need to set up your
element class on instantiation, there is one possible way to do so.
ElementBase classes have an ``_init()`` method that can be overridden.  It
can be used to modify the XML tree, e.g. to construct special children or
verify and update attributes.

The semantics of ``_init()`` are as follows:

* It is called once each time an Element class is instantiated, that is,
  whenever lxml creates a Python representation of the element.  At
  that time, the element object is completely initialized to represent
  a specific XML element within the tree.

* The method has complete access to the XML tree.  Modifications can be done
  in exactly the same way as anywhere else in the program.

* Python representations of elements may be created multiple times during the
  lifetime of an XML element in the underlying C tree.  The ``_init()`` code
  provided by subclasses must itself ensure that multiple executions are
  either harmless or prevented by some kind of flag in the XML tree.  The
  latter can be achieved by modifying an attribute value or by removing or
  adding a specific child node and then checking for this before running
  through the init process.

* Any exceptions raised in ``_init()`` will be propagated through the API
  call that led to the creation of the Element.  So be careful with the code
  you write here as its exceptions may turn up in various unexpected places.


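As a minimal sketch of the flag approach described in the list above, the
following class uses an attribute in the XML tree itself as the marker, so
that running ``_init()`` a second time is harmless.  The class name
``Versioned`` and the attribute name ``schema-version`` are made up for
illustration, and calling ``_init()`` by hand merely stands in for lxml
re-creating a proxy for the same element.

```python
from lxml import etree


class Versioned(etree.ElementBase):
    """Hypothetical element class that stamps a default attribute."""

    def _init(self):
        # Runs whenever lxml creates a proxy for the element, so the
        # attribute stored in the XML tree doubles as the "done" flag.
        if self.get("schema-version") is None:
            self.set("schema-version", "1")


parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Versioned))

root = etree.XML("<root><child/></root>", parser)
first_pass = dict(root.attrib)

# Stand-in for a proxy being re-created: a second run must change nothing.
root._init()
second_pass = dict(root.attrib)
```

Because the state lives in the tree rather than on the Python object, it
survives the proxy being garbage collected and created again.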
Setting up a class lookup scheme
================================

The first thing to do when deploying custom element classes is to register a
class lookup scheme on a parser.  lxml.etree provides several different
schemes that also support class lookup based on namespaces or attribute
values.  Most lookups support fallback chaining, which allows the next lookup
mechanism to take over when the previous one fails to find a class.

For example, setting the ``honk`` Element as a default element class
for a parser works as follows:

.. sourcecode:: pycon

  >>> parser_lookup = etree.ElementDefaultClassLookup(element=honk)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(parser_lookup)

There is one drawback of the parser based scheme: the ``Element()`` factory
does not know about your specialised parser and creates a new document that
uses the default parser:

.. sourcecode:: pycon

  >>> el = etree.Element("root")
  >>> print(isinstance(el, honk))
  False

You should therefore avoid using this factory function in code that
uses custom classes.  The ``makeelement()`` method of parsers provides
a simple replacement:

.. sourcecode:: pycon

  >>> el = parser.makeelement("root")
  >>> print(isinstance(el, honk))
  True

If you use a parser at the module level, you can easily redirect a module
level ``Element()`` factory to the parser method by adding code like this:

.. sourcecode:: pycon

  >>> module_level_parser = etree.XMLParser()
  >>> Element = module_level_parser.makeelement

While the ``XML()`` and ``HTML()`` factories also depend on the default
parser, you can pass them a different parser as second argument:

.. sourcecode:: pycon

  >>> element = etree.XML("<test/>")
  >>> print(isinstance(element, honk))
  False

  >>> element = etree.XML("<test/>", parser)
  >>> print(isinstance(element, honk))
  True

Whenever you create a document with a parser, it will inherit the lookup
scheme and all subsequent element instantiations for this document will use
it:

.. sourcecode:: pycon

  >>> element = etree.fromstring("<test/>", parser)
  >>> print(isinstance(element, honk))
  True
  >>> el = etree.SubElement(element, "subel")
  >>> print(isinstance(el, honk))
  True

For testing code in the Python interpreter and for small projects, you
may also consider setting a lookup scheme on the default parser.  To
avoid interfering with other modules, however, it is usually a better
idea to use a dedicated parser for each module (or a parser pool when
using threads) and then register the required lookup scheme only for
this parser.


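One possible way to organise the dedicated parser recommended above is
sketched below.  The names ``_parser`` and ``fromstring`` are illustrative
choices for a hypothetical module, not something lxml prescribes.

```python
from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


# One dedicated parser for this module; the default parser stays untouched.
_parser = etree.XMLParser()
_parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=honk))

# Module-level factories that route through the dedicated parser.
Element = _parser.makeelement


def fromstring(text):
    return etree.fromstring(text, _parser)


root = fromstring('<root honking="true"><child/></root>')
extra = Element("extra")
plain = etree.fromstring("<root/>")  # default parser, default class
```

Code in the module then calls ``Element()`` and ``fromstring()`` instead of
the ``lxml.etree`` functions of the same name, while other modules that use
the default parser keep getting plain elements.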
Default class lookup
--------------------

This is the simplest lookup mechanism.  It always returns the default
element class.  Consequently, no further fallbacks are supported, but this
scheme is a nice fallback for other custom lookup mechanisms.

Usage:

.. sourcecode:: pycon

  >>> lookup = etree.ElementDefaultClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

Note that the default for new parsers is to use the global fallback, which is
also the default lookup (if not configured otherwise).

To change the default element implementation, you can pass your new class to
the constructor.  While it accepts classes for ``element``, ``comment`` and
``pi`` nodes, most use cases will only override the element class:

.. sourcecode:: pycon

  >>> el = parser.makeelement("myelement")
  >>> print(isinstance(el, honk))
  False

  >>> lookup = etree.ElementDefaultClassLookup(element=honk)
  >>> parser.set_element_class_lookup(lookup)

  >>> el = parser.makeelement("myelement")
  >>> print(isinstance(el, honk))
  True
  >>> el.honking
  False
  >>> el = parser.makeelement("myelement", honking='true')
  >>> etree.tostring(el)
  b'<myelement honking="true"/>'
  >>> el.honking
  True


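For the less common case of overriding the ``comment`` class, a sketch
could look like the following.  It assumes that comment classes inherit
from ``lxml.etree.CommentBase``, in the same way that element classes
inherit from ``ElementBase``; the class ``Note`` and its ``stripped``
property are invented for this example.

```python
from lxml import etree


class Note(etree.CommentBase):
    """Hypothetical comment class with one convenience property."""

    @property
    def stripped(self):
        return self.text.strip()


# Only the comment class is overridden; elements keep the default class.
lookup = etree.ElementDefaultClassLookup(comment=Note)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.XML("<root><!-- remember me --><child/></root>", parser)
comment, child = root[0], root[1]
```

Here ``comment`` is a ``Note`` instance, while ``child`` is still a plain
element.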
Namespace class lookup
----------------------

This is an advanced lookup mechanism that supports namespace/tag-name specific
element classes.  You can select it by calling:

.. sourcecode:: pycon

  >>> lookup = etree.ElementNamespaceClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

See the separate section on `implementing namespaces`_ below to learn how to
make use of it.

.. _`implementing namespaces`: #implementing-namespaces

This scheme supports a fallback mechanism that is used in the case where the
namespace is not found or no class was registered for the element name.
Normally, the default class lookup is used here.  To change it, pass the
desired fallback lookup scheme to the constructor:

.. sourcecode:: pycon

  >>> fallback = etree.ElementDefaultClassLookup(element=honk)
  >>> lookup = etree.ElementNamespaceClassLookup(fallback)
  >>> parser.set_element_class_lookup(lookup)


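The effect of the fallback can be seen in a small sketch.  The namespace
URI ``http://example.com/ns`` and the class ``special`` are placeholders,
and the ``get_namespace()`` call used for registration is the one described
in the section on implementing namespaces.

```python
from lxml import etree


class honk(etree.ElementBase):
    pass


class special(etree.ElementBase):
    pass


NS = "http://example.com/ns"  # hypothetical namespace URI

fallback = etree.ElementDefaultClassLookup(element=honk)
lookup = etree.ElementNamespaceClassLookup(fallback)
lookup.get_namespace(NS)["special"] = special

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.XML(
    '<root xmlns:n="%s"><n:special/><n:other/></root>' % NS, parser)
```

Only ``n:special`` has a registered class.  The un-namespaced ``root`` and
the unregistered ``n:other`` both end up with the fallback class ``honk``.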
Attribute based lookup
----------------------

This scheme uses a mapping from attribute values to classes.  An attribute
name is set at initialisation time.  For each element, the value of that
attribute is then looked up in a dictionary to find the corresponding class.
It is set up as follows:

.. sourcecode:: pycon

  >>> id_class_mapping = {'1234': honk}  # maps attribute values to classes

  >>> lookup = etree.AttributeBasedElementClassLookup(
  ...                                      'id', id_class_mapping)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

This class uses its fallback if the attribute is not found or its value is not
in the mapping.  Normally, the default class lookup is used here.  If you want
to use the namespace lookup, for example, you can use this code:

.. sourcecode:: pycon

  >>> fallback = etree.ElementNamespaceClassLookup()
  >>> lookup = etree.AttributeBasedElementClassLookup(
  ...                       'id', id_class_mapping, fallback)
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)


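To illustrate which elements are matched, here is a short sketch that parses
a made-up document with the ``id`` based mapping from above and records the
result for each child.

```python
from lxml import etree


class honk(etree.ElementBase):
    pass


lookup = etree.AttributeBasedElementClassLookup('id', {'1234': honk})
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.XML('<root><a id="1234"/><b id="5678"/><c/></root>', parser)

# One flag per child: does it use the custom class?
matches = [isinstance(child, honk) for child in root]
```

Only the first child carries an ``id`` value that is in the mapping, so the
other two children (and the root) fall through to the default class.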
Custom element class lookup
---------------------------

This is the most customisable way of finding element classes on a per-element
basis.  It allows you to implement a custom lookup scheme in a subclass:

.. sourcecode:: pycon

  >>> class MyLookup(etree.CustomElementClassLookup):
  ...     def lookup(self, node_type, document, namespace, name):
  ...         return honk # be a bit more selective here ...

  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(MyLookup())

The ``.lookup()`` method must return either None (which triggers the
fallback mechanism) or a subclass of ``lxml.etree.ElementBase``.  It
can base its decision on the node type (one of "element", "comment",
"PI", "entity"), the XML document of the element, or its namespace or
tag name.


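A slightly more selective variant of the lookup above might look like this.
It is only a sketch: the rule "un-namespaced elements called ``honk``" and
the class name ``TagLookup`` are arbitrary choices for the example.

```python
from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


class TagLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # Only claim un-namespaced <honk> elements.
        if node_type == "element" and namespace is None and name == "honk":
            return honk
        return None  # hand everything else to the fallback


parser = etree.XMLParser()
parser.set_element_class_lookup(TagLookup())

root = etree.XML('<root><honk honking="true"/><other/></root>', parser)
```

Returning ``None`` for everything else keeps comments, processing
instructions and unrelated elements on the fallback lookup.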
Tree based element class lookup in Python
-----------------------------------------

Making more elaborate decisions than the custom scheme allows is
difficult to achieve in pure Python, as it results in a
chicken-and-egg problem.  It would require access to the tree before
the elements in the tree have been instantiated as Python Element
proxies.

Luckily, there is a way to do this.  The ``PythonElementClassLookup``
works similarly to the custom lookup scheme:

.. sourcecode:: pycon

  >>> class MyLookup(etree.PythonElementClassLookup):
  ...     def lookup(self, document, element):
  ...         return MyElementClass # defined elsewhere

  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(MyLookup())

The first argument to the ``lookup()`` method is the opaque document
instance that contains the Element, as in the custom lookup scheme.  The
second argument is a lightweight Element proxy implementation that is only
valid during the lookup.  Do not try to keep a reference to it.  Once the
lookup is finished, the proxy will become invalid.  You will get an
``AssertionError`` if you access any of its properties or methods outside the
scope of the lookup call in which it was instantiated.

During the lookup, the element object behaves mostly like a normal Element
instance.  It provides the properties ``tag``, ``text``, ``tail`` etc. and
supports indexing, slicing and the ``getchildren()``, ``getparent()``
etc. methods.  It does *not* support iteration, nor does it support any kind
of modification.  All of its properties are read-only and it cannot be removed
or inserted into other trees.  You can use it as a starting point to freely
traverse the tree and collect any kind of information that its elements
provide.  Once you have decided which class to use for this element, you can
simply return it and have lxml take care of cleaning up the instantiated proxy
classes.

Sidenote: this lookup scheme originally lived in a separate module called
``lxml.pyclasslookup``.


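As an illustration of a decision that needs the surrounding tree, the
following sketch picks a class based on the parent element.  The class
``Item``, the lookup name ``ListItemLookup`` and the document layout are
assumptions made for the example.  Only ``getparent()`` and ``tag`` are used
on the read-only proxy.

```python
from lxml import etree


class Item(etree.ElementBase):
    pass


class ListItemLookup(etree.PythonElementClassLookup):
    def lookup(self, document, element):
        # 'element' is the read-only proxy, valid only inside this call.
        parent = element.getparent()
        if parent is not None and parent.tag == "list":
            return Item
        return None  # use the fallback for everything else


parser = etree.XMLParser()
parser.set_element_class_lookup(ListItemLookup())

root = etree.XML(
    "<doc><list><entry/><entry/></list><entry/></doc>", parser)
in_list = [isinstance(child, Item) for child in root[0]]
```

The two ``entry`` elements inside ``list`` become ``Item`` instances, while
the ``entry`` directly below ``doc`` keeps the default class.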
Generating XML with custom classes
==================================

Up to lxml 2.1, you could not instantiate proxy classes yourself.
Only lxml.etree could do that when creating an object representation
of an existing XML element.  Since lxml 2.2, however, instantiating
a custom Element class will simply create a new Element:

.. sourcecode:: pycon

  >>> el = honk(honking='true')
  >>> el.tag
  'honk'
  >>> el.honking
  True

Note, however, that the proxy you create here will be garbage
collected just like any other proxy.  You therefore cannot count on
lxml.etree using the same class that you instantiated when you access
this Element a second time after letting its reference go.  You should
therefore always use a corresponding class lookup scheme that returns
your Element proxy classes for the elements that they create.  The
``ElementNamespaceClassLookup`` is generally a good match.

You can use custom Element classes to quickly create XML fragments:

.. sourcecode:: pycon

  >>> class hale(etree.ElementBase): pass
  >>> class bopp(etree.ElementBase): pass

  >>> el = hale("some ", honk(honking='true'), bopp, " text")

  >>> print(etree.tostring(el, encoding=unicode))
  <hale>some <honk honking="true"/><bopp/> text</hale>


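The advice above about pairing instantiated classes with a matching lookup
can be sketched as follows.  This example relies on the ``NAMESPACE`` class
attribute, which the ``ElementBase`` API documentation describes for setting
the namespace of instantiated elements; treat that detail as an assumption to
check against your lxml version.  Registration uses ``get_namespace()`` as
described in the section on implementing namespaces.

```python
from lxml import etree

NS = "http://hui.de/honk"


class honk(etree.ElementBase):
    # Assumed ElementBase feature: namespace used when instantiating.
    NAMESPACE = NS

    @property
    def honking(self):
        return self.get('honking') == 'true'


lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(NS)['honk'] = honk
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

# Create the element by instantiating the class ...
el = honk(honking='true')

# ... and get the same class back when the XML is parsed again.
reparsed = etree.XML(etree.tostring(el), parser)
```

The class that generated the element and the class that the lookup returns
for it are the same, so the custom API stays available after a round trip.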
Implementing namespaces
=======================

lxml allows you to implement namespaces, in a rather literal sense.  After
setting up the namespace class lookup mechanism as described above, you can
build a new element namespace (or retrieve an existing one) by calling the
``get_namespace(uri)`` method of the lookup:

.. sourcecode:: pycon

  >>> lookup = etree.ElementNamespaceClassLookup()
  >>> parser = etree.XMLParser()
  >>> parser.set_element_class_lookup(lookup)

  >>> namespace = lookup.get_namespace('http://hui.de/honk')

You can then register the new element type with that namespace, say, under
the tag name ``honk``:

.. sourcecode:: pycon

  >>> namespace['honk'] = honk

If you have many Element classes declared in one module, and they are
all named like the elements they create, you can simply use
``namespace.update(vars())`` at the end of your module to declare them
automatically.  The implementation is smart enough to ignore
everything that is not an Element class.

After this, you create and use your XML elements through the normal API of
lxml:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"/>'
  >>> honk_element = etree.XML(xml, parser)
  >>> print(honk_element.honking)
  True

The same works when creating elements by hand:

.. sourcecode:: pycon

  >>> honk_element = parser.makeelement('{http://hui.de/honk}honk',
  ...                                   honking='true')
  >>> print(honk_element.honking)
  True

Essentially, this allows you to give Elements a custom API based on their
namespace and tag name.

A somewhat related topic is `extension functions`_, which use a similar
mechanism for registering extension functions in XPath and XSLT.

.. _`extension functions`: extensions.html

In the setup example above, we associated the ``honk`` Element class
only with the 'honk' element.  If an XML tree contains different
elements in the same namespace, they do not pick up the same
implementation:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"><bla/></honk>'
  >>> honk_element = etree.XML(xml, parser)
  >>> print(honk_element.honking)
  True
  >>> print(honk_element[0].honking)
  Traceback (most recent call last):
  ...
  AttributeError: 'lxml.etree._Element' object has no attribute 'honking'

You can therefore provide one implementation per element name in each
namespace and have lxml select the right one on the fly.  If you want one
element implementation per namespace (ignoring the element name) or prefer
having a common class for most elements except a few, you can specify a
default implementation for an entire namespace by registering that class with
the empty element name (None).

You may consider following an object oriented approach here.  If you build a
class hierarchy of element classes, you can also implement a base class for a
namespace that is used if no specific element class is provided.  Again, you
can just pass None as an element name:

.. sourcecode:: pycon

  >>> class HonkNSElement(etree.ElementBase):
  ...    def honk(self):
  ...       return "HONK"
  >>> namespace[None] = HonkNSElement # default Element for namespace

  >>> class HonkElement(HonkNSElement):
  ...    @property
  ...    def honking(self):
  ...       return self.get('honking') == 'true'
  >>> namespace['honk'] = HonkElement # Element for specific tag

Now you can rely on lxml to always return objects of type HonkNSElement or its
subclasses for elements of this namespace:

.. sourcecode:: pycon

  >>> xml = '<honk xmlns="http://hui.de/honk" honking="true"><bla/></honk>'
  >>> honk_element = etree.XML(xml, parser)

  >>> print(type(honk_element))
  <class 'HonkElement'>
  >>> print(type(honk_element[0]))
  <class 'HonkNSElement'>

  >>> print(honk_element.honking)
  True
  >>> print(honk_element.honk())
  HONK

  >>> print(honk_element[0].honk())
  HONK
  >>> print(honk_element[0].honking)
  Traceback (most recent call last):
  ...
  AttributeError: 'HonkNSElement' object has no attribute 'honking'