sgmlspl: a simple post-processor for SGMLS and NSGMLS (for use with &sgmls.pm; version 1.03)

David Megginson, University of Ottawa, Department of English

dmeggins@aix1.uottawa.ca

[unpublished]

Welcome to &sgmlspl;, a simple sample &perl5; application which uses the &sgmls.pm; class library.

Terms
This program, along with its documentation, is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

What is &sgmlspl;?
&sgmlspl; is a sample application distributed with the &sgmls.pm; &perl5; class library. You can use it to convert &sgml; documents to other formats by providing a specification file detailing exactly how you want to handle each element, external data entity, subdocument entity, CDATA string, record end, SDATA string, and processing instruction. &sgmlspl; also uses the &output.pm; library (included in this distribution) to allow you to redirect or capture output.

To use &sgmlspl;, you simply prepare a specification file containing regular &perl5; code. If your &sgml; document were named doc.sgml, your &sgmlspl; specification file were named spec.pl, and the name of your new file were doc.latex, then you could use the following command in a Unix shell to convert your &sgml; document:

    sgmls doc.sgml | sgmlspl spec.pl > doc.latex

&sgmlspl; will pass any additional arguments on to the specification file, which can process them in the regular &perl5; fashion. The specification files used to convert this manual (tolatex.pl and tohtml.pl) are available with the &sgmls.pm; distribution.

How do I install &sgmlspl; on my system?
To use &sgmlspl;, you need to install &sgmls.pm; on your system, by copying the &sgmls.pm; file to a directory searched by &perl5;. You also need to install &output.pm; in the same directory, and &sgmlspl; (with execute permission) somewhere on your PATH. The easiest way to do all of this on a Unix system is to change to the root directory of this distribution (SGMLSpm), edit the Makefile appropriately, and type

    make install

Is &sgmlspl; the best way to convert &sgml; documents?
Not necessarily. While &sgmlspl; is fully functional, it is not always particularly intuitive or pleasant to use. There is a new proposed standard, Document Style Semantics and Specification Language (DSSSL), based on the Scheme programming language, and implementations should soon be available. To read more about the DSSSL standard, see http://www.jclark.com/dsssl/ on the Internet.

That said, DSSSL is a declarative, side-effect-free programming language, while &sgmlspl; allows you to use any programming constructions available in &perl5;, including those with side-effects. This means that if you want to do more than simply format the document or convert it from one Document Type Definition (DTD) to another, &sgmlspl; might be a good choice.

How does the specification file tell &sgmlspl; what to do?
&sgmlspl; uses an event model rather than a procedural model: instead of saying "do A, then B, then C", you say "whenever X happens, do A; whenever Y happens, do B; whenever Z happens, do C". In other words, while you design the code, &sgmlspl; decides when and how often to run it.

The specification file, which contains your instructions, is regular &perl5; code, and you can define packages and subroutines, display information, read files, create variables, etc.
For processing the &sgml; document, however, &sgmlspl; exports a single subroutine, sgml(event, handler), into the 'main' package. Each time you call sgml, you declare a handler for a specific type of &sgmls; event, and &sgmlspl; will then execute that handler every time the event occurs.

You may use sgml to declare a handler for a generic event, like 'start_element', or a specific event, like '<DOC>'. A specific event will always take precedence over a generic event, so when the DOC element begins, &sgmlspl; will execute the '<DOC>' handler rather than the 'start_element' handler.

What about the <parameter>handler</parameter> argument?
The second argument to the sgml subroutine is the actual code or data associated with each event. If it is a string, it will be printed literally using the output subroutine from the &output.pm; library; if it is a reference to a &perl5; subroutine, the subroutine will be called whenever the event occurs. The following three sgml commands will have identical results:

    # Example 1
    sgml('<DOC>', "\\begin{document}\n");

    # Example 2
    sgml('<DOC>', sub {
        output "\\begin{document}\n";
    });

    # Example 3
    sub do_begin_document {
        output "\\begin{document}\n";
    }
    sgml('<DOC>', \&do_begin_document);

For simply printing a string, of course, it does not make sense to use a subroutine; however, subroutines can be useful when you need to check the value of an attribute, perform different actions in different contexts, or perform other, more complicated kinds of post-processing.

If your handler is a subroutine, then it will receive two arguments: the &sgmls.pm; event's data, and the &sgmls.pm; event itself (see the &sgmls.pm; documentation for a description of event and data types).
The following example will print '\begin{enumerate}' if the value of the attribute TYPE is 'ORDERED', and '\begin{itemize}' if the value of the attribute TYPE is 'UNORDERED':

    sgml('<LIST>', sub {
        my ($element,$event) = @_;
        my $type = $element->attribute('TYPE')->value;

        if ($type eq 'ORDERED') {
            output "\\begin{enumerate}\n";
        } elsif ($type eq 'UNORDERED') {
            output "\\begin{itemize}\n";
        } else {
            die "Bad TYPE '$type' for element LIST at line " .
                $event->line . " in " . $event->file . "\n";
        }
    });

You will not always need to use the event argument, but it can be useful if you want to report line numbers or file names for errors (presuming that you called &sgmls; or &nsgmls; with the -l option). If you have a new version of &nsgmls; which accepts the -h option, you can also use the event argument to look up arbitrary entities declared by the program. See the SGMLS_Event documentation for more information.

What are the generic events?
&sgmlspl; recognises the twelve generic events listed in the table below. You may provide any one of these as the first argument to sgml to declare a handler (string or subroutine) for that event.

&sgmlspl; generic events

    Event            Description
    'start'          Execute handler (with no arguments) at the beginning of the parse.
    'end'            Execute handler (with no arguments) at the end of the parse.
    'start_element'  Execute handler at the beginning of every element without a specific start handler.
    'end_element'    Execute handler at the end of every element without a specific end handler.
    'cdata'          Execute handler for every character-data string.
    'sdata'          Execute handler for every special-data string without a specific handler.
    're'             Execute handler for every record end.
    'pi'             Execute handler for every processing instruction.
    'entity'         Execute handler for every external data entity without a specific handler.
    'start_subdoc'   Execute handler at the beginning of every subdocument entity without a specific handler.
    'end_subdoc'     Execute handler at the end of every subdocument entity without a specific handler.
    'conforming'     Execute handler once, at the end of the document parse, if and only if the document was conforming.
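
As a rough illustration of how several generic handlers fit together, here is a minimal sketch of a specification that copies character data and counts elements. So that it runs on its own, it defines small stand-ins for sgml and output and a hypothetical helper, fire, which imitates the delivery of a single event; none of these stand-ins is part of &sgmlspl;, and under &sgmlspl; itself you would delete them and keep only the five sgml calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins so that the sketch runs on its own. Under sgmlspl, delete
# them: sgmlspl and its output library supply sgml() and output().
my %handlers;         # event name => handler (string or code reference)
my $result = '';      # collects everything passed to output()
sub sgml   { my ($event, $h) = @_; $handlers{$event} = $h; }
sub output { $result .= join('', @_); }

# Hypothetical helper imitating the delivery of one event to its handler:
# a string handler is printed, a subroutine handler is called.
sub fire {
    my ($event, @args) = @_;
    my $h = $handlers{$event};
    return unless defined $h;
    if (ref($h) eq 'CODE') { $h->(@args); }
    else                   { output($h); }
}

# The specification itself: five generic handlers.
my $elements = 0;
sgml('start', "% converted by sgmlspl\n");
sgml('start_element', sub { $elements++; });
sgml('cdata', sub { my ($data) = @_; output($data); });
sgml('re', "\n");
sgml('end', sub { output("% $elements element(s) seen\n"); });

# A simulated parse of a one-element document containing "Hello".
fire('start');
fire('start_element');
fire('cdata', 'Hello');
fire('re');
fire('end');
print $result;
```

The 'start' and 're' handlers are plain strings because they only print fixed text; the others are subroutines because they need to count or to look at the event data.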

The handlers for all of these except the document events 'start' and 'end' will receive two arguments whenever they are called: the first will be the data associated with the event (if any), and the second will be the SGMLS_Event object itself (see the documentation for &sgmls.pm;). Note the following example, which allows processing instructions for including the date or the hostname in the document at parse time:

    sgml('pi', sub {
        my ($instruction) = @_;
        if ($instruction eq 'date') {
            output `date`;
        } elsif ($instruction eq 'hostname') {
            output `hostname`;
        } else {
            print STDERR "Warning: unknown processing instruction: $instruction\n";
        }
    });

With this handler, any occurrence of <?date> in the original &sgml; document would be replaced by the current date and time, and any occurrence of <?hostname> would be replaced by the name of the host.

What are the specific events?
In addition to the generic events listed in the previous section, &sgmlspl; allows special, specific handlers for the beginning and end of elements and subdocument entities, for SDATA strings, and for external data entities. The table below lists the different specific event types available.

Specific event types

    Event        Description
    '<GI>'       Execute handler at the beginning of every element named 'GI'.
    '</GI>'      Execute handler at the end of every element named 'GI'.
    '|SDATA|'    Execute handler for every special-data string 'SDATA'.
    '&ENTITY;'   Execute handler for every external data entity named 'ENTITY'.
    '{ENTITY}'   Execute handler at the beginning of every subdocument entity named 'ENTITY'.
    '{/ENTITY}'  Execute handler at the end of every subdocument entity named 'ENTITY'.
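
The specific event names in the table above might be used as in the following sketch, which declares handlers for one element and one SDATA string alongside a generic 'sdata' handler. The sgml and output definitions and the helper fire are stand-ins written for this illustration only (they are not part of &sgmlspl;), and the SDATA strings '[alpha ]' and '[beta  ]' are assumed examples of what a DTD's entity declarations might supply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins so that the sketch runs on its own (delete under sgmlspl).
my %handlers;
my $result = '';
sub sgml   { my ($event, $h) = @_; $handlers{$event} = $h; }
sub output { $result .= join('', @_); }

# Hypothetical helper: deliver an event to its specific handler if one
# was declared, and to the generic handler otherwise.
sub fire {
    my ($specific, $generic, @args) = @_;
    my $h = exists $handlers{$specific} ? $handlers{$specific}
                                        : $handlers{$generic};
    return unless defined $h;
    if (ref($h) eq 'CODE') { $h->(@args); }
    else                   { output($h); }
}

sgml('<EM>',       "\\emph{");          # start of every EM element
sgml('</EM>',      "}");                # end of every EM element
sgml('|[alpha ]|', "\\(\\alpha\\)");    # one particular SDATA string
sgml('sdata', sub {                     # every other SDATA string
    my ($data) = @_;
    output("?$data?");
});

# Simulated events for <EM>&alpha;&beta;</EM>
fire('<EM>', 'start_element');
fire('|[alpha ]|', 'sdata', '[alpha ]');
fire('|[beta  ]|', 'sdata', '[beta  ]');
fire('</EM>', 'end_element');
print "$result\n";
```

Only the SDATA string without a handler of its own reaches the generic 'sdata' subroutine, which here marks it with question marks so that unhandled strings are easy to spot in the output.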

Note that these override the generic-event handlers. For example, if you were to type

    sgml('&FOO;', sub { output "Found a \"foo\" entity!\n"; });
    sgml('entity', sub { output "Found an entity!\n"; });

and the external data entity &FOO; appeared in your &sgml; document, &sgmlspl; would call the first handler rather than the second.

Note also that start and end handlers are entirely separate things: if an element has a specific start handler but no specific end handler, the generic end handler will still be called at the end of the element. To prevent this, declare a handler with an empty string:

    sgml('</HACK>', '');

Why does &sgmlspl; use <command>output</command> instead of <command>print</command>?
&sgmlspl; uses a special &perl5; library &output.pm; for printing text. &output.pm; exports the subroutines output(string…), push_output(type[,data]), and pop_output. The subroutine output works much like the regular &perl5; function print, except that you are not able to specify a file handle, and you may include multiple strings as arguments. When you want to write data to somewhere other than STDOUT (the default), you use the subroutines push_output and pop_output to set a new destination or to restore an old one.

You can use the &output.pm; package in other programs by adding the following line:

    use SGMLS::Output;

How do I use <command>push_output</command>?
The subroutine push_output(type[,data]) takes two arguments: the type, which is always required, and the data, which is needed for certain types of output. The table below lists the different types which you can push onto the output stack.

Types for <command>push_output</command>

    Type      Data                 Description
    'handle'  a filehandle         Send all output to the supplied filehandle.
    'file'    a filename           Open the supplied file for writing, erasing its current contents (if any), and send all output to it.
    'append'  a filename           Open the supplied file for writing and append all output to its current contents.
    'pipe'    a shell command      Pipe all output to the supplied shell command.
    'string'  a string [optional]  Append all output to the supplied string, which will be returned by pop_output.
    'nul'     [none]               Ignore all output.
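
A common use of the 'string' type is to capture the text of an element and emit it in a changed form once the element ends. The sketch below does this for a TITLE element (an assumed element name). To keep it runnable on its own, it defines simplified stand-ins for sgml, output, push_output, and pop_output; these are not the real &output.pm; subroutines, and this push_output understands only the 'string' type.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins so that the sketch runs on its own (delete under sgmlspl).
my %handlers;
my @stack;            # references to the strings currently capturing output
my $stdout = '';      # what would have been written to STDOUT
sub sgml   { my ($event, $h) = @_; $handlers{$event} = $h; }
sub output {
    my $text = join('', @_);
    if (@stack) { ${ $stack[-1] } .= $text; }   # innermost capture wins
    else        { $stdout .= $text; }
}
sub push_output {
    my ($type, $data) = @_;
    die "stand-in supports only 'string'\n" unless $type eq 'string';
    my $buffer = defined $data ? $data : '';
    push @stack, \$buffer;
}
sub pop_output { my $buffer = pop @stack; return $$buffer; }

# The specification: capture the text of TITLE, then emit it in upper case.
sgml('<TITLE>',  sub { push_output('string'); });
sgml('</TITLE>', sub {
    my $title = pop_output();
    output("\\section{" . uc($title) . "}\n");
});
sgml('cdata', sub { my ($data) = @_; output($data); });

# Simulated events for <TITLE>Getting started</TITLE>
$handlers{'<TITLE>'}->();
$handlers{'cdata'}->('Getting started');
$handlers{'</TITLE>'}->();
print $stdout;
```

The 'cdata' handler does not need to know whether a capture is in progress; it simply calls output, and the text goes to whatever destination is on top of the stack.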

Because the output is stack-based, you do not lose the previous output destination when you push a new one. This is especially convenient for dealing with data in tree-structures, like &sgml; data; for example, you can capture the contents of sub-elements as strings, ignore certain types of elements, and split the output from one &sgml; parse into a series of sub-files. Here are some examples:

    push_output('string');                 # append output to an empty string
    push_output('file','/tmp/foo');        # send output to this file
    push_output('pipe','mail webmaster');  # mail output to 'webmaster' (!!)
    push_output('nul');                    # just ignore all output

How do I use <command>pop_output</command>?
When you want to restore the previous output after using push_output, simply call the subroutine pop_output. If the output type was a string, pop_output will return the string (containing all of the output); otherwise, the return value is not useful.

Usually, you will want to use push_output in the start handler for an element or subdocument entity, and pop_output in the end handler.

How about an example for <command>output</command>?
Here is a simple example to demonstrate how output, push_output, and pop_output work:

    output "Hello, world!\n";        # (Written to STDOUT by default)
    push_output('nul');              # Push 'nul' ahead of STDOUT
    output "Hello, again!\n";        # (Discarded)
    push_output('file','foo.out');   # Push file 'foo.out' ahead of 'nul'
    output "Hello, again!\n";        # (Written to the file 'foo.out')
    pop_output;                      # Pop 'foo.out' and revert to 'nul'
    output "Hello, again!\n";        # (Discarded)
    push_output('string');           # Push 'string' ahead of 'nul'
    output "Hello, ";                # (Written to the string)
    output "again!\n";               # (Also written to the string)
                                     # Pop the string "Hello, again!\n"
    $foo = pop_output;               # and revert to 'nul'
    output "Hello, again!\n";        # (Discarded)
    pop_output;                      # Pop 'nul' and revert to STDOUT
    output "Hello, at last!\n";      # (Written to STDOUT)

Is there an easier way to make specification files?
Yes.
The script skel.pl, included in this package, is an &sgmlspl; specification which writes a specification (!!!). To use it under Unix, try something like

    sgmls foo.sgml | sgmlspl skel.pl > foo-spec.pl

(presuming that there is a copy of skel.pl in the current directory or in a directory searched by &perl5;) to generate a new, blank template named foo-spec.pl.

How should I handle forward references?
Because &sgmlspl; processes the document as a linear data stream, from beginning to end, it is easy to refer back to information, but relatively difficult to refer forward, since you do not know what will be coming later in the parse. Here are a few suggestions.

First, you could use push_output and pop_output to save up output in a large string. When you have found the information which you need, you can make any necessary modifications to the string and print it then. This will work for relatively small chunks of a document, but you would not want to try it for anything larger.

Next, you could use the ext method to add extra pointers, and build a parse tree of the whole document before processing any of it. This method will work well for small documents, but large documents will place some serious stress on your system's memory and/or swapping.

A more sophisticated solution, however, involves the Refs.pm module, included in this distribution. In your &sgmlspl; script, include the line

    use SGMLS::Refs;

to activate the library. The library will create a database file to keep track of references between passes, and to tell you if any references have changed. For example, you might want to try something like this:

    my $Refs;    # declared outside the handlers so that both can see it

    sgml('start', sub {
        $Refs = new SGMLS::Refs('references.refs');
    });
    sgml('end', sub {
        $Refs->warn;
        destroy $Refs;
    });

This code will create an object, $Refs, linked to a file of references called references.refs.
The SGMLS::Refs class understands the methods listed in the table below.

The SGMLS::Refs class

    Method                          Return Type  Description
    new(filename,[logfile_handle])  SGMLS::Refs  Create a new SGMLS::Refs object. Arguments are the name of the hashfile and (optionally) a writable filehandle for logging changes.
    get(key)                        string       Look up a reference key in the hash file and return its value.
    put(key,value)                  [none]       Set a new value for the key in the hashfile.
    count                           number       Return the number of references whose values have changed (thus far).
    warn                            1 or 0       Print a warning mentioning the number of references which have changed, and return 1 if a warning was printed.
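
To show why a count of changed references is useful, the following sketch runs the same toy "document" twice: the first run cannot resolve a forward reference, and the second can. It does not use SGMLS::Refs itself. DemoRefs is a hypothetical in-memory stand-in that offers only get, put, and count; it takes a hash reference where the real class takes a filename, and it is assumed here that looking up an unknown key yields an undefined value. The key 'concl' and the section number are likewise made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package DemoRefs;
# Hypothetical in-memory stand-in for SGMLS::Refs. The shared hash plays
# the role of the reference file that survives from one run to the next.
sub new {
    my ($class, $store) = @_;
    return bless { store => $store, changed => 0 }, $class;
}
sub get { my ($self, $key) = @_; return $self->{store}{$key}; }
sub put {
    my ($self, $key, $value) = @_;
    my $old = $self->{store}{$key};
    # Count the reference as changed if it is new or has a new value.
    $self->{changed}++ unless defined $old && $old eq $value;
    $self->{store}{$key} = $value;
    return;
}
sub count { my ($self) = @_; return $self->{changed}; }

package main;

# One pass over a toy document in which a cross-reference to 'concl'
# appears before the section that defines it.
sub run_pass {
    my ($refs) = @_;
    my $number = $refs->get('concl');        # forward reference
    $number = '??' unless defined $number;
    my $text = "See section $number.\n";
    $refs->put('concl', '3');                # the definition, found later
    return $text;
}

my %store;                                   # stands in for the file
my $run1   = DemoRefs->new(\%store);
my $first  = run_pass($run1);
my $run2   = DemoRefs->new(\%store);
my $second = run_pass($run2);

print $first, $second;
print "changed: ", $run1->count, " then ", $run2->count, "\n";
```

A non-zero count after a run suggests that the output still contains stale or missing references and that another pass is worthwhile; a count of zero suggests that the references have settled.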

Are there any bugs? Any bugs in &sgmls.pm; will be here too, since &sgmlspl; relies heavily on that &perl5; library.