+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
<html>
<head>
<title>The XSLT C library for Gnome explained</title>
<p>Version: $Revision$</p>
-<h2>Table of content</h2>
+<h2>Table of contents</h2>
<ul>
<li><a href="#Introducti">Introduction</a></li>
<li><a href="#Basics">Basics</a></li>
<p>This document describes the processing of <a
href="http://xmlsoft.org/XSLT/">libxslt</a>, the <a
-href="http://www.w3.org/TR/xslt">XSLT</a> C library developped for the <a
+href="http://www.w3.org/TR/xslt">XSLT</a> C library developed for the <a
href="http://www.gnome.org/">Gnome</a> project.</p>
<p>Note: this documentation is by definition incomplete and I am not good at
<h2><a name="Basics">Basics</a></h2>
-<p>XSLT is a transformation language, taking an input document and a
-stylesheet document, it generates an ouput document:</p>
+<p>XSLT is a transformation language. It takes an input document and a
+stylesheet document and generates an output document:</p>
<p align="center"><img src="processing.gif"
alt="the XSLT processing model"></p>
-<p>Libxslt is written in C. It relies on libxml for the following
-operations:</p>
+<p>Libxslt is written in C. It relies on <a href="http://www.xmlsoft.org/">libxml</a>,
+ the XML C library for Gnome, for the following operations:</p>
<ul>
<li>parsing files</li>
- <li>building the in-memory DOM strucure associated to the documents
+ <li>building the in-memory DOM structure associated with the documents
handled</li>
<li>the XPath implementation</li>
- <li>serializing back the result document to XML, HTML (text is handled
- directly)</li>
+ <li>serializing back the result document to XML and HTML. (Text is handled
+ directly.)</li>
</ul>
<h2><a name="Keep">Keep it simple stupid</a></h2>
-<p>Libxslt is not very specialized, it is build under the assumption that all
+<p>Libxslt is not very specialized. It is built under the assumption that all
nodes from the source and output document can fit in the virtual memory of the
-system. There is a big trade-off there, it is fine for reasonably sized
-documents but may not be suitable for large sets of data, the gain is that it
-can be used in a relatively versatile way, the input or output may never be
+system. There is a big trade-off there. It is fine for reasonably sized
+documents but may not be suitable for large sets of data. The gain is that it
+can be used in a relatively versatile way. The input or output may never be
serialized, but the size of documents it can handle are limited by the size of
the memory available.</p>
<p>The result is not that bad, clearly one can do a better job but more
specialized too. Most optimization like building the tree on-demand would need
-serious changes to the libxml XPath framework, an easy step would be to
-serialize the output directly (or call a set of SAX-like ouptut handler to
+serious changes to the libxml XPath framework. An easy step would be to
+serialize the output directly (or call a set of SAX-like output handlers to
keep this a flexible interface) and hence avoid the memory consumption of the
result.</p>
<h2><a name="libxml">The libxml nodes</a></h2>
-<p>DOM like trees as used and generated by libxml and libxslt are relatively
+<p>DOM-like trees, as used and generated by libxml and libxslt, are relatively
complex. Most node types follow the given structure except a few variations
depending on the node type:</p>
<p>For the XSLT processing, entity nodes should not be generated (i.e. they
should be replaced by their content). Most nodes also contains the following
-"naviagtion" informations:</p>
+"navigation" information:</p>
<ul>
<li>the containing <strong>doc</strong>ument</li>
<li>the <strong>parent</strong> node</li>
<h2><a name="XSLT">The XSLT processing steps</a></h2>
-<p>Basically there is a few steps which are clearly decoupled at the interface
+<p>There are a few steps which are clearly decoupled at the interface
level:</p>
<ol>
- <li>parse the stylesheet and generate an DOM tree</li>
- <li>take the stylesheet tree and build a compiled version of it it's the
- compilation phase</li>
- <li>the input and generate a DOM tree</li>
+ <li>parse the stylesheet and generate a DOM tree</li>
+ <li>take the stylesheet tree and build a compiled version of it (the
+ compilation phase)</li>
+ <li>take the input and generate a DOM tree</li>
<li>process the stylesheet against the input tree and generate an output
tree</li>
<li>serialize the output tree</li>
<p>A few things should be noted here:</p>
<ul>
<li>the steps 1/ 3/ and 5/ are optional</li>
- <li>the stylesheet optained at 2/ can be reused by multiple processing 4/
+ <li>the stylesheet obtained at 2/ can be reused by multiple processing 4/
(and this should also work in threaded programs)</li>
<li>the tree provided in 2/ should never be freed using xmlFreeDoc, but by
freeing the stylesheet.</li>
<h2><a name="XSLT1">The XSLT stylesheet compilation</a></h2>
<p>This is the second step described. It takes a stylesheet tree, and
-"compiles" it, basically it associates to each node a structure stored in the
-_private field and containing informations computed in the stylesheet:</p>
+"compiles" it. This associates to each node a structure stored in the
+_private field and containing information computed in the stylesheet:</p>
<p align="center"><img src="stylesheet.gif"
alt="a compiled XSLT stylesheet"></p>
<p>One xsltStylesheet structure is generated per document parsed for the
-stylesheet. XSLT documents allows includes and imports of other documents,
+stylesheet. XSLT documents allow includes and imports of other documents,
imports are stored in the <strong>imports</strong> list (hence keeping the
tree hierarchy of includes which is very important for a proper XSLT
processing model) and includes are stored in the <strong>doclist</strong>
-list. An inported stylesheet has a parent link to allow to browse the
+list. An imported stylesheet has a parent link to allow browsing of the
tree.</p>
-<p>The DOM tree associated to the document is stored in <strong>doc</strong>,
-it is preprocessed to remove ignorable empty nodes and all the nodes in the
-XSLT namespace are subject to a precomputing. This usually consist of
-extrating all the context informations from the context tree (attributes,
-namespaces, XPath expressions), and store them in an xsltStylePreComp
+<p>The DOM tree associated to the document is stored in <strong>doc</strong>.
+It is preprocessed to remove ignorable empty nodes and all the nodes in the
+XSLT namespace are subject to precomputing. This usually consists of
+extracting all the context information from the context tree (attributes,
+namespaces, XPath expressions), and storing them in an xsltStylePreComp
structure associated to the <strong>_private</strong> field of the node.</p>
<p>A couple of notable exceptions to this are XSLT template nodes (more on
-this later) and attribute value templates, if they are actually templates, the
-value cannot be computed at compilation time (some preprocessing could be done
+this later) and attribute value templates. If they are actually templates, the
+value cannot be computed at compilation time. (Some preprocessing could be done
like isolation and preparsing of the XPath subexpressions but it's not done,
-yet).</p>
+yet.)</p>
-<p>The xsltStylePreComp structure also allow to store the precompiled form of
-an XPath expression which can be associated to an XSLT element (more on this
+<p>The xsltStylePreComp structure also allows storing of the precompiled form of
+an XPath expression that can be associated to an XSLT element (more on this
later).</p>
<h2><a name="XSLT2">The XSLT template compilation</a></h2>
-<p>A proper handling of templates lookup is one of the key of fast XSLT
-processing (given a node in the source document this is the processof finding
-which templates should be applied to this node). Libxslt follows the hint
+<p>Proper handling of template lookup is one of the keys to fast XSLT
+processing. (Given a node in the source document this is the process of finding
+which templates should be applied to this node.) Libxslt follows the hint
suggested in the <a href="http://www.w3.org/TR/xslt#patterns">5.2 Patterns</a>
-section of the XSLT Recommendation, i.e. it doesn't evaluates it as an XPath
-expression but tokenize it and compile it as a set of rules to be evaluated on
-a candidate node. There is usually an indication of the node name in the last
+section of the XSLT Recommendation, i.e. it doesn't evaluate it as an XPath
+expression but tokenizes it and compiles it as a set of rules to be evaluated on
+a candidate node. There usually is an indication of the node name in the last
step of this evaluation and this is used as a key check for the match. As a
-result libxslt build a relatively more complex set of structures for the
+result libxslt builds a relatively more complex set of structures for the
templates:</p>
<p align="center"><img src="templates.gif"
<p>Let's describe a bit more closely what is built. First the xsltStylesheet
structure holds a pointer to the template hash table. All the XSLT patterns
compiled in this stylesheet are indexed by the value of the the target element
-(or attribute, pi ...) name, so when a element or an attribute "foo" need to
+(or attribute, pi ...) name, so when an element or an attribute "foo" needs to
be processed the lookup is done using the name as a key.</p>
-<p>Each of the patterns are compiled into an xsltCompMatch structure, it holds
-the set of rules based on the tokenization of the pattern basically stored in
+<p>Each of the patterns is compiled into an xsltCompMatch structure. It holds
+the set of rules based on the tokenization of the pattern stored in
reverse order (matching is easier this way). It also holds some information
about the previous matches used to speed up the process when one iterates over
-a set of siblings (this optimization may be defeated by trashing when running
-threaded computation, it's unclear taht this si a big deal in practice).
-Predicates expression are not compiled at this stage, they may be at run-time
+a set of siblings. (This optimization may be defeated by thrashing when running
+threaded computation; it's unclear that this is a big deal in practice.)
+Predicate expressions are not compiled at this stage, they may be at run-time
if needed, but in this case they are compiled as full XPath expressions (the
use of some fixed predicate can probably be optimized, they are not yet).</p>
priority rules.</p>
<p>Associated to the compiled pattern is the xsltTemplate itself containing
-the informations actually required for the processing of the pattern including
-of course a pointer to the list of elements used for building the pattern
+the information required for the processing of the pattern including,
+of course, a pointer to the list of elements used for building the pattern
result.</p>
<p>Last but not least a number of patterns do not fit in the hash table
because they are not associated to a name, this is the case for patterns
applying to the root, any element, any attributes, text nodes, pi nodes, keys
-etc. Those are stored independantly in the stylesheet structure as separate
+etc. Those are stored independently in the stylesheet structure as separate
linked lists of xsltCompMatch.</p>
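<p>The lookup described in this section can be sketched in a few lines of C.
This is only an illustration under simplifying assumptions, not the libxslt
code: every <code>Toy*</code> type and <code>toy_*</code> function below is
hypothetical, patterns are limited to child steps on element names, and
priorities and the separate lists for unnamed patterns are ignored.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_STEPS 8
#define TOY_BUCKETS 16

/* Toy node: only a name and a parent link (hypothetical, not xmlNode). */
typedef struct ToyNode {
    const char *name;
    struct ToyNode *parent;
} ToyNode;

/* Toy compiled pattern: child steps only, stored last step first. */
typedef struct ToyCompMatch {
    const char *steps[TOY_MAX_STEPS]; /* steps[0] names the matched node */
    int nsteps;
    const char *template_name;        /* stands in for the template itself */
    struct ToyCompMatch *next;        /* next pattern in the same bucket */
} ToyCompMatch;

typedef struct {
    ToyCompMatch *buckets[TOY_BUCKETS];
} ToyTemplateHash;

/* Simple string hash reduced to a bucket index. */
unsigned toy_hash(const char *name) {
    unsigned h = 0;
    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % TOY_BUCKETS;
}

/* Index a compiled pattern by the name tested in its last step. */
void toy_add_pattern(ToyTemplateHash *table, ToyCompMatch *comp) {
    unsigned slot;
    assert(comp->nsteps > 0);
    slot = toy_hash(comp->steps[0]);
    comp->next = table->buckets[slot];
    table->buckets[slot] = comp;
}

/* Check the steps from the candidate node upwards through its ancestors. */
int toy_match(const ToyCompMatch *comp, const ToyNode *node) {
    int i;
    for (i = 0; i < comp->nsteps; i++) {
        if (node == NULL || strcmp(node->name, comp->steps[i]) != 0)
            return 0;
        node = node->parent;
    }
    return 1;
}

/* Walk the bucket for the node name; return the first matching pattern. */
const ToyCompMatch *toy_lookup(const ToyTemplateHash *table,
                               const ToyNode *node) {
    const ToyCompMatch *comp = table->buckets[toy_hash(node->name)];
    for (; comp != NULL; comp = comp->next)
        if (toy_match(comp, node))
            return comp;
    return NULL;
}
```

<p>In this sketch the pattern <code>chapter/title</code> would be stored as
the steps <code>title</code>, <code>chapter</code>. Storing the last step
first means the cheap name test on the candidate node runs before any
ancestor is visited, which is why the reverse order makes matching easier.</p>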
<h2><a name="processing">The processing itself</a></h2>
-<p>Well the processing is actually defined by the XSLT specification (the
-basis of the algorithm are explained in <a
+<p>The processing is defined by the XSLT specification (the
+basis of the algorithm is explained in <a
href="http://www.w3.org/TR/xslt#section-Introduction">the Introduction</a>
section). Basically it works by taking the root of the input document and
applying the following algorithm:</p>
<ol>
- <li>finding the template applying to it, basically this is a lookup in the
+ <li>Finding the template applying to it. This is a lookup in the
template hash table, walking the hash list until the node satisfies all
the steps of the pattern, then checking the appropriate(s) global
templates to see if there isn't a higher priority rule to apply</li>
<li>If there is no template, apply the default rule (recurse on the
children)</li>
- <li>else walk the content list of the selected templates, for each of them:
+ <li>else walk the content list of the selected templates, for each of them:
<ul>
- <li>if the node are in the XSLT namespace then the node has a _private
- field pointing to the preprocessed values, jump to the specific
+ <li>if the node is in the XSLT namespace then the node has a _private
+ field pointing to the preprocessed values, jump to the specific
code</li>
- <li>if the node is in an extension namespace, lookup the associated
- behaviour</li>
+ <li>if the node is in an extension namespace, look up the associated
+ behavior</li>
<li>otherwise copy the node.</li>
</ul>
- <p>the closure is usualy done through the XSLT
+ <p>The closure is usually done through the XSLT
<strong>apply-templates</strong> construct recursing by applying the
adequate template on the input node children or on the result of an
- associated XPath selection lookup</p>
+ associated XPath selection lookup.</p>
</li>
</ol>
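<p>The three steps above can be sketched as a small recursive function. This
is a toy model under stated assumptions, not the code in
<code>transform.c</code>: all <code>Toy*</code> types and <code>toy_*</code>
functions are hypothetical, a template matches on the element name only, and
its content is reduced to one literal string plus an optional
apply-templates flag.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy input element (hypothetical, not the libxml node structure). */
typedef struct ToyElem {
    const char *name;
    struct ToyElem *children; /* first child */
    struct ToyElem *next;     /* next sibling */
} ToyElem;

/* Toy template: matches one element name, copies a literal to the output,
   and optionally recurses on the children like apply-templates. */
typedef struct {
    const char *match;
    const char *literal;
    int apply_templates;
} ToyTemplate;

typedef struct {
    const ToyTemplate *templates;
    int ntemplates;
    char out[128];            /* stands in for the output tree */
} ToyTransformContext;

/* Step 1: find the template applying to the node (linear scan here). */
const ToyTemplate *toy_find_template(const ToyTransformContext *ctxt,
                                     const ToyElem *node) {
    int i;
    for (i = 0; i < ctxt->ntemplates; i++)
        if (strcmp(ctxt->templates[i].match, node->name) == 0)
            return &ctxt->templates[i];
    return NULL;
}

void toy_process_one_node(ToyTransformContext *ctxt, const ToyElem *node);

/* Recurse on every child of the node, in document order. */
void toy_process_children(ToyTransformContext *ctxt, const ToyElem *node) {
    const ToyElem *child;
    for (child = node->children; child != NULL; child = child->next)
        toy_process_one_node(ctxt, child);
}

void toy_process_one_node(ToyTransformContext *ctxt, const ToyElem *node) {
    const ToyTemplate *tmpl = toy_find_template(ctxt, node);
    if (tmpl == NULL) {
        /* Step 2: no template, apply the default rule. */
        toy_process_children(ctxt, node);
        return;
    }
    /* Step 3: walk the template content (a single literal in this toy). */
    assert(strlen(ctxt->out) + strlen(tmpl->literal) < sizeof ctxt->out);
    strcat(ctxt->out, tmpl->literal);
    if (tmpl->apply_templates)
        toy_process_children(ctxt, node);
}
```

<p>With templates for <code>chapter</code> (recursing) and <code>title</code>
(not recursing), an element without a template is skipped by the default rule
while its children are still visited.</p>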
<p>Note that large parts of the input tree may not be processed by a given
-stylesheet and that on the opposite some may be processed multiple times
-(often the case when a Table of Content is built).</p>
+stylesheet and that, conversely, some may be processed multiple times. (This
+often is the case when a Table of Contents is built.)</p>
<p>The module <code>transform.c</code> is the one implementing most of this
-logic, <strong>xsltApplyStylesheet()</strong> is the entry point, it allocates
+logic. <strong>xsltApplyStylesheet()</strong> is the entry point; it allocates
an xsltTransformContext containing the following:</p>
<ul>
<li>a pointer to the stylesheet being processed</li>
<li>current input node</li>
<li>current selected node list</li>
<li>the current insertion points in the output document</li>
- <li>a couple of hash table for extensions element and functions</li>
+ <li>a couple of hash tables for extension elements and functions</li>
</ul>
-<p>then a new document get allocated (HTML or XML depending on the type of
+<p>Then a new document gets allocated (HTML or XML depending on the type of
output), the user parameters and global variables and parameters are
evaluated. Then <strong>xsltProcessOneNode()</strong> which implements the
1-2-3 algorithm is called on the root element of the input. Step 1/ is
<p>The XPath support is actually implemented in the libxml module (where it is
reused by the XPointer implementation). XPath is a relatively classic
-expression language, the only uncommon feature is that it is working on XML
+expression language. The only uncommon feature is that it is working on XML
trees and hence has specific syntax and types to handle them.</p>
-<p>XPath expressions are compiled using <strong>xmlXPathCompile()</strong> it
+<p>XPath expressions are compiled using <strong>xmlXPathCompile()</strong>. It
will take an expression string in input and generate a structure containing
the parsed expression tree, for example the expression:</p>
<pre>/doc/chapter[title='Introduction']</pre>
NODE</pre>
<p>This can be tested using the <code>testXPath</code> command (in the
-libxml codebase) using the <code>--tree</code> option</p>
+libxml codebase) using the <code>--tree</code> option.</p>
-<p>Again, the KISS approach is used, no optimization is done, this could be an
-interesting thing to add (<a
+<p>Again, the KISS approach is used. No optimization is done. This could be an
+interesting thing to add. <a
href="http://www-106.ibm.com/developerworks/library/x-xslt2/?dwzone=x?open&l=132%2ct=gr%2c+p=saxon">Michael
Kay describes</a> a lot of possible and interesting optimizations done in
-Saxon which would be possible at this level), I'm unsure they would provide
+Saxon which would be possible at this level. I'm unsure they would provide
much gain since the expressions tends to be relatively simple in general and
stylesheets are still hand generated. Optimizations at the interpretation
sounds likely to be more efficient.</p>
<p>The interpreter is implemented by <strong>xmlXPathCompiledEval()</strong>
which is the front-end to <strong>xmlXPathCompOpEval()</strong> the function
implementing the evaluation of the expression tree. This evaluation follows
-the KISS aproach again, it's recursive and call
+the KISS approach again. It's recursive and calls
<strong>xmlXPathNodeCollectAndTest()</strong> to collect nodes set when
-evaluating a <code>COLECT</code> node.</p>
+evaluating a <code>COLLECT</code> node.</p>
<p>An evaluation is done within the framework of an XPath context stored in an
<strong>xmlXPathContext</strong> structure, in the framework of a
<ul>
<li>the current document</li>
<li>the current node</li>
- <li>an hash table of defined variables (but not used by XSLT)</li>
- <li>an hash table of defined functions</li>
+ <li>a hash table of defined variables (but not used by XSLT)</li>
+ <li>a hash table of defined functions</li>
<li>the proximity position (the place of the node in the current node
list)</li>
<li>the context size (the size of the current node list)</li>
- <li>the arry of namespaces declaration in scope (there is also a namespace
+ <li>the array of namespace declarations in scope (there also is a namespace
hash table but it is not used in the XSLT transformation).</li>
</ul>
<h2><a name="Descriptio">Description of XPath Objects</a></h2>
<p>An XPath expression manipulates XPath objects. XPath defines the default
-types boolean, numbers, strings and node sets. XSLT adds the result tree
+types boolean, numbers, strings and node sets. XSLT adds the result tree
fragment type which is basically an unmodifiable node set.</p>
-<p>Implementation wise, libxml follows again a KISS approach, the
+<p>Implementation-wise, libxml again follows a KISS approach: the
xmlXPathObject is a structure containing a type description and the various
-possibilities (using an enum could have gained some bytes). In the case of
+possibilities. (Using an enum could have gained some bytes.) In the case of
node sets (or result tree fragments), it points to a separate xmlNodeSet
-object which contains the list of pointers to the docuemnt nodes:</p>
+object which contains the list of pointers to the document nodes:</p>
<p align="center"><img src="object.gif"
alt="An Node set object pointing to "></p>
signature:</p>
<pre>void xmlXPathFunc (xmlXPathParserContextPtr ctxt, int nargs);</pre>
-<p>The first argument is the XPath interprestation context, holding the
+<p>The first argument is the XPath interpretation context, holding the
interpretation stack. The second argument defines the number of objects passed
on the stack for the function to consume (last argument is on top of the
stack).</p>
<li>return</li>
</ul>
-<p>sometime the work can be done directly by modifying in-situ the top object
+<p>Sometimes the work can be done directly by modifying in-situ the top object
on the stack <code>ctxt->value</code>.</p>
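<p>The calling convention can be illustrated with a toy stack of numbers.
This is a sketch under stated assumptions and does not use the libxml types:
<code>ToyParserContext</code> and the <code>toy_*</code> functions are
hypothetical stand-ins that only mirror the shape of the signature (context
first, argument count second).</p>

```c
#include <assert.h>
#include <stddef.h>

#define TOY_STACK_MAX 16

/* Toy interpretation context holding a stack of numbers (hypothetical). */
typedef struct {
    double values[TOY_STACK_MAX]; /* top of stack is values[depth - 1] */
    int depth;                    /* number of objects on the stack */
    int error;                    /* non-zero after an arity or stack error */
} ToyParserContext;

/* Push one object; flag an error instead of overflowing. */
void toy_push(ToyParserContext *ctxt, double v) {
    if (ctxt->depth >= TOY_STACK_MAX) {
        ctxt->error = 1;
        return;
    }
    ctxt->values[ctxt->depth++] = v;
}

/* Pop the top object; the caller must have checked the depth. */
double toy_pop(ToyParserContext *ctxt) {
    assert(ctxt->depth > 0);
    return ctxt->values[--ctxt->depth];
}

/* Check the arity, pop the arguments, compute, push the result. */
void toy_subtract(ToyParserContext *ctxt, int nargs) {
    double left, right;
    if (nargs != 2 || ctxt->depth < 2) {
        ctxt->error = 1;
        return;
    }
    right = toy_pop(ctxt); /* the last argument is on top of the stack */
    left = toy_pop(ctxt);
    toy_push(ctxt, left - right);
}

/* In-situ variant: rewrite the top object without popping it. */
void toy_negate(ToyParserContext *ctxt, int nargs) {
    if (nargs != 1 || ctxt->depth < 1) {
        ctxt->error = 1;
        return;
    }
    ctxt->values[ctxt->depth - 1] = -ctxt->values[ctxt->depth - 1];
}
```

<p>Pushing 10 and then 4 and calling <code>toy_subtract</code> with
<code>nargs</code> set to 2 leaves 6 on the stack, because the last argument
is popped first. <code>toy_negate</code> shows the in-situ shortcut.</p>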
<h2><a name="stack">The XSLT variables stack frame</a></h2>
the scope of variables being called.</p>
<p>This part seems to be the most urgent attention right now, first it is done
-in a very ineficient way since the location of the variables and
-parameterswithin the stylesheet tree is still done at run time (it really
+in a very inefficient way since the location of the variables and
+parameters within the stylesheet tree is still done at run time (it really
should be done statically at compile time), and I am still unsure that my
-understanding of the template variables and parameter scope is actually
+understanding of the template variables and parameter scope is actually
right.</p>
<p>This part of the documentation is still to be written once this part of the
<h2><a name="TODOs">TODOs</a></h2>
-<p>redesign the XSLT stack frame handling far too much work is done at
-execution time, similary for the attribute value templates handling, at least
+<p>Redesign the XSLT stack frame handling. Far too much work is done at
+execution time. Similarly for the attribute value templates handling, at least
the embedded subexpressions ought to be precompiled.</p>
<p>Allow output to be saved to a SAX like output (this notion of SAX like API
for output should be added directly to libxml).</p>
-<p>Implement and test some of the optimisation explained by Michael Kay
+<p>Implement and test some of the optimizations explained by Michael Kay
especially:</p>
<ul>
<li>static slot allocation on the stack frame</li>
<li>specific boolean interpretation of an XPath expression</li>
<li>some of the sorting optimization</li>
- <li>Lazy evaluation of location path, this may require more changes but
- sounds really interesting, XT does this too</li>
- <li>Optimization of an expression tree (this could be done as a completely
- independant module)</li>
+ <li>Lazy evaluation of location path. (This may require more changes but
+ sounds really interesting. XT does this too.)</li>
+ <li>Optimization of an expression tree. (This could be done as a completely
+ independent module.)</li>
</ul>
<p></p>
Error reporting, there is a lot of case where the XSLT specification specify
-that a given construct is an error are not checked adequately by libxslt,
-basically one should do a complete pass on the XSLT spec again and add all
-tests to the stylesheet compilation. Using the DTD provided in appendix and
+that a given construct is an error are not checked adequately by libxslt.
+Basically one should do a complete pass on the XSLT spec again and add all
+tests to the stylesheet compilation. Using the DTD provided in the appendix and
making direct checks using the libxml validation API sounds a good idea too
(though one should take care of not raising errors for elements/attributes in
different namespaces).