SGMLS.pm: a perl5 class library for handling output from the SGMLS and NSGMLS parsers (version 1.03)

David Megginson
University of Ottawa, Department of English
dmeggins@aix1.uottawa.ca
[unpublished]
Welcome to SGMLS.pm, an extensible perl5 class library for processing the output from the sgmls and nsgmls parsers. SGMLS.pm is free, copyrighted software available by anonymous ftp in the directory ftp://aix1.uottawa.ca/pub/dmeggins/. You might also want to look at the documentation for sgmlspl, a simple sample script which uses SGMLS.pm to convert documents from SGML to other formats.

Terms

This program, along with its documentation, is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

What is SGMLS.pm?

SGMLS.pm is an extensible perl5 class library for parsing the output from James Clark's popular sgmls and nsgmls parsers, available on the Internet at ftp://jclark.com. This is not a complete system for translating documents written in the Standard Generalised Markup Language (SGML) into other formats, but it can easily form the basis of such a system (for a simple example, see the sgmlspl program included in this package).

The library recognises four basic types of SGML objects: the element, the attribute, the notation, and the entity; each of these is a fully-developed class with methods for accessing important information.

How do I produce SGML documents?

I am presuming here that you are already experienced with SGML and the sgmls or nsgmls parser.
For help with the parsers see the manual pages accompanying each one; for help with SGML see Robin Cover's SGML Web Page at http://www.sil.org/sgml/sgml.html on the Internet.

How do I program in perl5?

If you have to ask this question, you probably should not be trying to use this library right now, since it is intended only for experienced perl5 programmers. That said, however, you can find the perl5 documentation with the perl5 source distribution or on the World-Wide Web at http://www.metronet.com/0/perlinfo/perl5/manual/perl.html.

Please do not write to me for help on using perl5.

How do I use SGMLS.pm?

First, you need to copy the file SGMLS.pm to a directory where perl can find it (on a Unix system, it might be /usr/lib/perl5 or /usr/local/lib/perl5, or whatever the environment variable PERL5LIB is set to) and make certain that it is world-readable.

Next, near the top of your perl5 program, type the following line:

    use SGMLS;

You must then open up a file handle from which SGMLS.pm can read the data from an sgmls or nsgmls process, unless you are reading from a standard handle like STDIN. For example, if you are piping the output from sgmls to a perl5 script, using something like

    sgmls foo.sgml | perl myscript.pl

then the predefined filehandle STDIN will be sufficient. In DOS, you might want to dump the sgmls output to a file and use it as standard input (or open it explicitly in perl), and in Unix, you might actually want to open a pipe or socket for the input. SGMLS.pm doesn't need to seek, so any input stream should work.

To parse the sgmls or nsgmls output from the handle, create a new object instance of the SGMLS class with the handle as an argument, i.e.

    $parse = new SGMLS(STDIN);

(You may create more than one SGMLS object at once, but each object must have a unique handle pointing to a unique stream, or chaos will result.)
Now, you can retrieve and process events using the next_event method:

    while ($event = $parse->next_event) {
        # do something with each event
    }

So what do I do with an event?

The next_event method for the SGMLS class returns an object belonging to the class SGMLS_Event. This class has several methods available, as listed in the following table.

The SGMLS_Event class

| Method | Return Type | Description |
| --- | --- | --- |
| type | string | Return the type of the event. |
| data | string, SGMLS_Element, or SGMLS_Entity | Return any data associated with the event. |
| file | string | Return the name of the SGML source file which generated the event, if available. |
| line | string | Return the line number of the SGML source file which generated the event, if available. |
| element | SGMLS_Element | Return the element in force when the event was generated. |
| parse | SGMLS | Return the SGMLS object for the current parse. |
| entity(ename) | SGMLS_Entity | Look up an entity from those currently known to the parse: an alias for ->parse->entity($ename). |
| notation(nname) | SGMLS_Notation | Look up the notation from those currently known to the parse: an alias for ->parse->notation($nname). |
The file and line methods will return useful information only if you called sgmls or nsgmls with the -l flag to include file and line-number information.
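As a rough illustration of the file and line methods, the following sketch builds a "file:line" prefix for diagnostic messages. The helper name event_location is invented here, and the code assumes (without confirmation from the library) that both methods return an empty or undefined value when the parser was run without -l.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SGMLS.pm: build a "file:line" prefix
# from an SGMLS_Event object, or return '' when no location is available.
sub event_location {
    my ($event) = @_;
    my ($file, $line) = ($event->file, $event->line);

    # Assumption: without the -l flag, these come back empty or undefined.
    return '' unless defined $file && length $file;
    return '' unless defined $line && length $line;

    return "$file:$line";
}
```

Inside the event loop, something like warn event_location($event), ": unexpected element\n" would then point back at the SGML source.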
What are the different event types and data?

The following table lists the ten different event types returned by the next_event method of an SGMLS object and the different types of data associated with each of these (note that these do not correspond to the standard ESIS events).

The SGMLS_Event types

| Event Type | Event Data | Description |
| --- | --- | --- |
| 'start_element' | SGMLS_Element | The beginning of an element. |
| 'end_element' | SGMLS_Element | The end of an element. |
| 'cdata' | string | Regular character data. |
| 'sdata' | string | Special system data. |
| 're' | [none] | A record-end (i.e., a newline). |
| 'pi' | string | A processing instruction. |
| 'entity' | SGMLS_Entity | A non-SGML external entity. |
| 'start_subdoc' | SGMLS_Entity | The beginning of an SGML subdocument. |
| 'end_subdoc' | SGMLS_Entity | The end of an SGML subdocument. |
| 'conforming' | [none] | The document was valid. |
For example, if $event->type returns 'start_element', then $event->data will return an object belonging to the SGMLS_Element class (which will contain a list of attributes, etc.; see below), $event->file and $event->line will return the file and line-number in which the element appeared (if you called sgmls or nsgmls with the -l flag), and $event->element will return the element currently in force (in this case, the same as $event->data).
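The event types above suggest a simple dispatch on $event->type. The following is only a sketch: the helper name describe_event and its output format are invented for illustration, and it relies solely on the type and data methods of the event and the name method of elements and entities.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SGMLS.pm: turn one SGMLS_Event object
# into a one-line description, branching on the event type.
sub describe_event {
    my ($event) = @_;
    my $type = $event->type;

    if ($type eq 'start_element' || $type eq 'end_element') {
        # For element events, data is an SGMLS_Element object.
        return "$type: " . $event->data->name;
    }
    if ($type eq 'entity' || $type eq 'start_subdoc' || $type eq 'end_subdoc') {
        # For these events, data is an SGMLS_Entity object.
        return "$type: " . $event->data->name;
    }
    if ($type eq 'cdata' || $type eq 'sdata' || $type eq 'pi') {
        # For these events, data is a plain string.
        return "$type: " . $event->data;
    }

    # 're' and 'conforming' carry no data.
    return $type;
}
```

In a real script this would be called from the event loop, for example print describe_event($event), "\n" for each event returned by $parse->next_event.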
What do I do with an SGMLS_Element?

Altogether, there are six classes in SGMLS.pm, each with its own methods: in addition to SGMLS (for the parse) and SGMLS_Event (for a specific event), the classes are SGMLS_Element, SGMLS_Attribute, SGMLS_Entity, and SGMLS_Notation. Like all of these, SGMLS_Element has a number of methods available for obtaining different types of information. For example, if you were to use

    my $element = $event->data

to retrieve the data for a 'start_element' or 'end_element' event, then you could use the methods listed in the following table to find more information about the element.

The SGMLS_Element class

| Method | Return Type | Description |
| --- | --- | --- |
| name | string | The name (or GI), in upper-case. |
| parent | SGMLS_Element | The parent element, or '' if this is the top element. |
| attributes | HASH | Return a reference to a hash table of SGMLS_Attribute objects, keyed by the attribute names (in upper-case). |
| attribute_names | ARRAY | A list of all attribute names for the current element (in upper-case). |
| attribute(aname) | SGMLS_Attribute | Return the attribute named ANAME. |
| set_attribute(attribute) | [none] | The attribute argument should be an object belonging to the SGMLS_Attribute class. Add it to the element, replacing any previous attribute with the same name. |
| in(name) | SGMLS_Element | If the current element's parent is named name, return the parent; otherwise, return ''. |
| within(name) | SGMLS_Element | If any ancestor of the current element is named name, return it; otherwise, return ''. |
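As one possible use of the name and parent methods, the sketch below builds a slash-separated path from the top element down to the current one. The helper name element_path and the path format are invented for illustration; the loop relies on parent returning '' (a false value) at the top element, as stated in the table.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SGMLS.pm: walk up the parent chain of
# an SGMLS_Element and return a path such as "DOC/SECTION/PARA".
sub element_path {
    my ($element) = @_;
    my @names;

    # parent returns '' for the top element, which ends the loop.
    while ($element) {
        unshift @names, $element->name;
        $element = $element->parent;
    }

    return join('/', @names);
}
```

Calling element_path($event->element) on a 'cdata' event, for instance, would show where in the document the character data occurred.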
What do I do with an SGMLS_Attribute?

Note that objects of the SGMLS_Attribute class do not have events in their own right, and are available only through the attributes or attribute(aname) methods for SGMLS_Element objects. An object belonging to the SGMLS_Attribute class will recognise the methods listed in the following table.

The SGMLS_Attribute class

| Method | Return Type | Description |
| --- | --- | --- |
| name | string | The name of the attribute (in upper-case). |
| type | string | The type of the attribute: 'IMPLIED', 'CDATA', 'NOTATION', 'ENTITY', or 'TOKEN'. |
| value | string, SGMLS_Entity, or SGMLS_Notation | The value of the attribute. If the type is 'CDATA' or 'TOKEN', it will be a simple string; if it is 'NOTATION' it will be an object belonging to the SGMLS_Notation class, and if it is 'ENTITY' it will be an object belonging to the SGMLS_Entity class. |
| is_implied | boolean | Return true if the value of the attribute is implied, or false if it has an explicit value. |
| set_type(type) | [none] | Provide a new type for the current attribute. No sanity checking will be performed, so be careful. |
| set_value(value) | [none] | Provide a new value for the current attribute. No sanity checking will be performed, so be careful. |
Note that the type 'TOKEN' includes both individual tokens and lists of tokens (i.e. 'TOKENS', 'IDS', or 'IDREFS' in the original SGML document), so you might need to use the perl function 'split' to break the value string into a list.
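A minimal sketch of that splitting step follows. The helper name attribute_tokens is invented here, and returning an empty list for implied or non-'TOKEN' attributes is a choice made for this example rather than anything the library prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SGMLS.pm: return the individual tokens
# held in an SGMLS_Attribute of type 'TOKEN'.
sub attribute_tokens {
    my ($attribute) = @_;

    return () if $attribute->is_implied;            # no explicit value
    return () unless $attribute->type eq 'TOKEN';   # only token values are split

    # split ' ' breaks on runs of whitespace and skips leading whitespace.
    return split ' ', $attribute->value;
}
```

An 'IDREFS' value such as "sec1 sec2 fig3" would come back as a three-item list, while a single token comes back as a one-item list.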
What do I do with an SGMLS_Entity?

An SGMLS_Entity object can come in an 'entity' event (in which case it is always external), in a 'start_subdoc' or 'end_subdoc' event (in which case it always has the type 'SUBDOC'), or as the value of an attribute (in which case it may be internal or external). An object belonging to the SGMLS_Entity class may use the methods listed in the following table.

The SGMLS_Entity class

| Method | Return Type | Description |
| --- | --- | --- |
| name | string | The entity name. |
| type | string | The entity type: 'CDATA', 'SDATA', 'NDATA', or 'SUBDOC'. |
| value | string | The entity replacement text (internal entities only). |
| sysid | string | The system identifier (external entities only). |
| pubid | string | The public identifier (external entities only). |
| filenames | ARRAY | A list of file names generated from the sysid and pubid (external entities only). |
| notation | SGMLS_Notation | The associated notation (external data entities only). |
An entity of type 'SUBDOC' will have a sysid and pubid, an external data entity will have a sysid, pubid, filenames, and a notation, and an internal data entity will have a value.
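The three cases just described can be told apart roughly as follows. This is a sketch only: the helper name describe_entity and its output format are invented, and it assumes that notation returns a false value for entities that have no notation, which the text above implies but does not state outright.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SGMLS.pm: describe an SGMLS_Entity
# using only the methods that are meaningful for its kind.
sub describe_entity {
    my ($entity) = @_;
    my $label = $entity->name . ' (' . $entity->type . ')';

    if ($entity->type eq 'SUBDOC') {
        return "$label: subdocument, sysid " . $entity->sysid;
    }

    # Assumption: notation is false unless this is an external data entity.
    if (my $notation = $entity->notation) {
        return "$label: external data in notation " . $notation->name
             . ', sysid ' . $entity->sysid;
    }

    return "$label: internal, value " . $entity->value;
}
```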
What do I do with an SGMLS_Notation?

The fourth data class is the notation, which is available only as a return value from the notation method of an SGMLS_Entity or the value method of an SGMLS_Attribute with type 'NOTATION'. You can use the notation to decide how to treat non-SGML data (such as graphics). An object belonging to the SGMLS_Notation class will have access to the methods listed in the following table.

The SGMLS_Notation class

| Method | Return Type | Description |
| --- | --- | --- |
| name | string | The notation's name. |
| sysid | string | The notation's system identifier. |
| pubid | string | The notation's public identifier. |
What you do with this information is entirely up to you.
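One way to act on a notation is a lookup table keyed by the notation's name. In the sketch below the notation names 'GIF' and 'EPS', the action strings, and the helper name action_for_entity are all invented for illustration; your own DTD defines which notations actually exist. It also assumes that notation returns a false value when an entity has none.

```perl
use strict;
use warnings;

# Hypothetical mapping from notation names to output actions. The names
# 'GIF' and 'EPS' are examples only, not anything defined by SGMLS.pm.
my %notation_action = (
    GIF => 'inline-image',
    EPS => 'include-postscript',
);

# Hypothetical helper: choose an action for an SGMLS_Entity based on the
# name of its notation, falling back to 'skip' for anything unknown.
sub action_for_entity {
    my ($entity) = @_;

    my $notation = $entity->notation;
    return 'skip' unless $notation;   # assumption: false when there is none

    my $action = $notation_action{ $notation->name };
    return defined $action ? $action : 'skip';
}
```

A handler for 'entity' events could call action_for_entity($event->data) and branch on the result.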
Is there any extra information available from the SGML document?

The SGMLS object which you created at the beginning of the parse has several methods available in addition to next_event; you will find them all listed in the following table. There should normally be no need to use the notation and entity methods, since SGMLS.pm will look up entities and notations for you automatically as needed.

Additional methods for the SGMLS class

| Method | Return Type | Description |
| --- | --- | --- |
| next_event | SGMLS_Event | Return the next event. |
| appinfo | string | Return the APPINFO parameter from the SGML declaration, if any. |
| notation(nname) | SGMLS_Notation | Look up a notation by name. |
| entity(ename) | SGMLS_Entity | Look up an entity by name. |
How about a simple example?

OK. The following script simply reports its events:

&sample.program;

To use it under Unix, try something like

    sgmls document.sgml | perl sample.pl

and watch the output scroll down.

How do I design my own classes?

In addition to the methods listed above, all of the classes used in SGMLS.pm have an ext method which returns a reference to an initially-empty hash table. You are free to use this hash table to store anything you want. It should be especially useful if you are building your own derived classes from the ones provided here. The following example builds a derived class My_Element from the SGMLS_Element class, adding methods to set and get the current font:

    use SGMLS;

    package My_Element;
    @ISA = qw(SGMLS_Element);

    sub new {
        my ($class,$element,$font) = @_;
        $element->ext->{'font'} = $font;
        # Rebless the existing element into the requested class.
        return bless $element, $class;
    }

    sub get_font {
        my ($self) = @_;
        return $self->ext->{'font'};
    }

    sub set_font {
        my ($self,$font) = @_;
        $self->ext->{'font'} = $font;
    }

Note that the derived class does not need to have any knowledge about the underlying structure of the SGMLS_Element class, and need only avoid shadowing any of the methods currently existing there.

If you decide to create a derived class from the SGMLS class, please note that in addition to the methods listed above, that class uses internal methods named element, line, and file, similar to the same methods in SGMLS_Event. It is essential that you not shadow these method names.

Are there any bugs?

Of course! Right now, SGMLS.pm silently ignores link attributes (nsgmls only) and data attributes, and there may be many other bugs which I have not yet found.