1 <!DOCTYPE article PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [
2 <!ENTITY % ISOpub PUBLIC "ISO 8879:1986//ENTITIES Publishing//EN">
4 <!ENTITY % ISOnum PUBLIC
5 "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN">
8 <!ENTITY sample.program SYSTEM "sample.pl" CDATA linespecific>
10 <!ENTITY sgml "<ulink url='http://www.sil.org/sgml/sgml.html'><acronym>SGML</acronym></ulink>">
11 <!ENTITY esis "<acronym>ESIS</acronym>">
12 <!ENTITY sgmls.pm "<link linkend=SGMLSpm><application>SGMLS.pm</application></link>">
13 <!ENTITY perl5 "<ulink url='http://www.metronet.com/0/perlinfo/perl5/manual/perl.html'><application>perl5</application></ulink>">
14 <!ENTITY perl5 "<application>perl5</application>">
15 <!ENTITY sgmls "<application>sgmls</application>">
16 <!ENTITY nsgmls "<ulink url='http://www.jclark.com/sp.html'><application>nsgmls</application></ulink>">
23 <title>SGMLS.pm: a perl5 class library for handling output from the
24 SGMLS and NSGMLS parsers (version 1.03)</title>
28 <firstname>David</firstname>
29 <surname>Megginson</surname>
31 <orgname>University of Ottawa</orgname>
32 <orgdiv>Department of English</orgdiv>
33 <address><email>dmeggins@aix1.uottawa.ca</email></address>
38 <artpagenums>[unpublished]</artpagenums>
42 <para>Welcome to &sgmls.pm;, an extensible &perl5; class library for
43 processing the output from the &sgmls; and &nsgmls; parsers.
44 &sgmls.pm; is free, copyrighted software available by anonymous ftp in
46 url="ftp://aix1.uottawa.ca/pub/dmeggins/">ftp://aix1.uottawa.ca/pub/dmeggins/</ulink>.
47 You might also want to look at the documentation for <ulink
48 url="../sgmlspl/sgmlspl.html"><application>sgmlspl</application></ulink>,
49 a simple sample script which uses &sgmls.pm; to convert documents from
50 &sgml; to other formats.</para>
55 <para>This program, along with its documentation, is free software;
56 you can redistribute it and/or modify it under the terms of the GNU
57 General Public License as published by the Free Software Foundation;
58 either version 2 of the License, or (at your option) any later
61 <para>This program is distributed in the hope that it will be useful,
62 but WITHOUT ANY WARRANTY; without even the implied warranty of
63 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
64 General Public License for more details.</para>
66 <para>You should have received a copy of the GNU General Public
67 License along with this program; if not, write to the Free Software
68 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.</para>
74 <title>What is &sgmls.pm;?</title>
76 <para>&sgmls.pm; is an <link linkend=extend>extensible</link> &perl5;
77 class library for parsing the output from James Clark's popular
78 &sgmls; and &nsgmls; parsers, available on the Internet at <ulink
79 url="ftp://jclark.com/"><filename>ftp://jclark.com</filename></ulink>.
80 This is <emphasis>not</emphasis> a complete system for translating
documents written in the <glossterm>Standard Generalised Markup
82 Language</glossterm> (&sgml;) into other formats, but it can easily
83 form the basis of such a system (for a simple example, see the <ulink
84 url="../sgmlspl/sgmlspl.html"><application>sgmlspl</application></ulink>
85 program included in this package).</para>
87 <para>The library recognises four basic types of &sgml; objects: the
88 <link linkend=sgmlselement><glossterm>element</glossterm></link>, the
89 <link linkend=sgmlsattribute><glossterm>attribute</glossterm></link>,
91 linkend=sgmlsnotation><glossterm>notation</glossterm></link>, and the
92 <link linkend=sgmlsentity><glossterm>entity</glossterm></link>; each
93 of these is a fully-developed class with methods for accessing
94 important information.</para>
100 <title>How do I produce &sgml; documents?</title>
102 <para>I am presuming here that you are already experienced with &sgml;
103 and the &sgmls; or &nsgmls; parser. For help with the parsers see the
104 manual pages accompanying each one; for help with &sgml; see Robin
105 Cover's SGML Web Page at <ulink
106 url="http://www.sil.org/sgml/sgml.html"><filename>http://www.sil.org/sgml/sgml.html</filename></ulink>
107 on the Internet.</para>
113 <title>How do I program in &perl5;?</title>
115 <para>If you have to ask this question, you probably should not be
116 trying to use this library right now, since it is intended only for
117 experienced &perl5; programmers. That said, however, you can find the
118 &perl5; documentation with the &perl5; source distribution or on the
119 World-Wide Web at <ulink
120 url="http://www.metronet.com/0/perlinfo/perl5/manual/perl.html"><filename>http://www.metronet.com/0/perlinfo/perl5/manual/perl.html</filename></ulink>.</para>
122 <para><emphasis>Please</emphasis> do not write to me for help on using
129 <title>How do I use &sgmls.pm;?</title>
131 <para>First, you need to copy the file &sgmls.pm; to a directory where
132 perl can find it (on a Unix system, it might be
133 <filename>/usr/lib/perl5</filename> or
134 <filename>/usr/local/lib/perl5</filename>, or whatever the environment
135 variable <symbol>PERL5LIB</symbol> is set to) and make certain that it
136 is world-readable.</para>
138 <para>Next, near the top of your &perl5; program, type the following
145 <para>You must then open up a file handle from which &sgmls.pm; can read the
146 data from an &sgmls; or &nsgmls; process, unless you are reading from
147 a standard handle like <symbol>STDIN</symbol> — for example,
148 if you are piping the output from &sgmls; to a &perl5; script, using
149 something like</para>
sgmls foo.sgml | perl myscript.pl
155 <para>then the predefined filehandle <symbol>STDIN</symbol> will be
156 sufficient. In DOS, you might want to dump the sgmls output to a file
157 and use it as standard input (or open it explicitly in perl), and in
158 Unix, you might actually want to open a pipe or socket for the input.
159 &sgmls.pm; doesn't need to seek, so any input stream should
162 <para>To parse the &sgmls; or &nsgmls; output from the handle, create
163 a new object instance of the <classname>SGMLS</classname> class with
164 the handle as an argument, i.e.</para>
$parse = new SGMLS(STDIN);
170 <para>(You may create more than one <classname>SGMLS</classname>
171 object at once, but each object <emphasis>must</emphasis> have a
172 unique handle pointing to a unique stream, or
173 <emphasis>chaos</emphasis> will result.) Now, you can retrieve and
174 process events using the <command>next_event</command> method:</para>
while ($event = $parse->next_event) {
  # do something with each event
}
185 <sect1 id=sgmlsevent>
186 <title>So what do I do with an event?</title>
188 <para>The <command>next_event</command> method for the <link
189 linkend=sgmls><classname>SGMLS</classname></link> class returns an
190 object belonging to the class <classname>SGMLS_Event</classname>.
191 This class has several methods available, as listed in table <xref
192 linkend=table.class.sgmls>.</para>
194 <table id=table.class.sgmls>
195 <title>The <classname>SGMLS_Event</classname> class</title>
202 <entry>Method</entry>
203 <entry>Return Type</entry>
204 <entry>Description</entry>
212 <entry><command>type</command></entry>
213 <entry>string</entry>
214 <entry>Return the type of the event.</entry>
218 <entry><command>data</command></entry>
219 <entry>string, <classname>SGMLS_Element</classname>, or
220 <classname>SGMLS_Entity</classname></entry>
221 <entry>Return any data associated with the event.</entry>
225 <entry><command>file</command></entry>
226 <entry>string</entry>
227 <entry>Return the name of the &sgml; source file which generated the
228 event, if available.</entry>
232 <entry><command>line</command></entry>
233 <entry>string</entry>
234 <entry>Return the line number of the &sgml; source file which
235 generated the event, if available.</entry>
239 <entry><command>element</command></entry>
240 <entry><classname>SGMLS_Element</classname></entry>
241 <entry>Return the element in force when the event was
<entry><command>parse</command></entry>
<entry><classname>SGMLS</classname></entry>
247 <entry>Return the <classname>SGMLS</classname> object for the current
<entry><command>entity(<parameter>ename</parameter>)</command></entry>
<entry><classname>SGMLS_Entity</classname></entry>
<entry>Look up an entity from those currently known to the parse: an
alias for <literal>->parse->entity($ename)</literal>.</entry>
<entry><command>notation(<parameter>nname</parameter>)</command></entry>
<entry><classname>SGMLS_Notation</classname></entry>
<entry>Look up a notation from those currently known to the parse: an
alias for <literal>->parse->notation($nname)</literal>.</entry>
267 <para>The <command>file</command> and <command>line</command> methods
268 will return useful information only if you called &sgmls; or &nsgmls;
269 with the <parameter>-l</parameter> flag to include file and
270 line-number information.</para>
276 <title>What are the different event types and data?</title>
278 <para>Table <xref linkend=table.class.sgmls.event> lists the ten
279 different event types returned by the <command>next_event</command>
280 method of an <link linkend=sgmls><classname>SGMLS</classname></link>
281 object and the different types of data associated with each of these
282 (note that these do <emphasis>not</emphasis> correspond to the
283 standard &esis; events).</para>
286 <table id=table.class.sgmls.event>
287 <title>The <classname>SGMLS_Event</classname> types</title>
294 <entry>Event Type</entry>
295 <entry>Event Data</entry>
296 <entry>Description</entry>
305 <entry><returnvalue>'start_element'</returnvalue></entry>
306 <entry><classname>SGMLS_Element</classname></entry>
307 <entry>The beginning of an element.</entry>
311 <entry><returnvalue>'end_element'</returnvalue></entry>
312 <entry><classname>SGMLS_Element</classname></entry>
313 <entry>The end of an element.</entry>
317 <entry><returnvalue>'cdata'</returnvalue></entry>
318 <entry>string</entry>
319 <entry>Regular character data.</entry>
323 <entry><returnvalue>'sdata'</returnvalue></entry>
324 <entry>string</entry>
325 <entry>Special system data.</entry>
329 <entry><returnvalue>'re'</returnvalue></entry>
330 <entry>[none]</entry>
331 <entry>A record-end (i.e., a newline).</entry>
335 <entry><returnvalue>'pi'</returnvalue></entry>
336 <entry>string</entry>
<entry>A processing instruction.</entry>
341 <entry><returnvalue>'entity'</returnvalue></entry>
342 <entry><classname>SGMLS_Entity</classname></entry>
343 <entry>A non-SGML external entity.</entry>
347 <entry><returnvalue>'start_subdoc'</returnvalue></entry>
348 <entry><classname>SGMLS_Entity</classname></entry>
349 <entry>The beginning of an SGML subdocument.</entry>
353 <entry><returnvalue>'end_subdoc'</returnvalue></entry>
354 <entry><classname>SGMLS_Entity</classname></entry>
355 <entry>The end of an SGML subdocument.</entry>
359 <entry><returnvalue>'conforming'</returnvalue></entry>
360 <entry>[none]</entry>
361 <entry>The document was valid.</entry>
368 <para>For example, if <literal>$event->type</literal> returns
369 <returnvalue>'start_element'</returnvalue>, then
370 <literal>$event->data</literal> will return an object belonging to the
371 <link linkend=sgmlselement><classname>SGMLS_Element</classname></link>
372 class (which will contain a list of attributes, etc. — see
373 below), <literal>$event->file</literal> and
374 <literal>$event->line</literal> will return the file and line-number
375 in which the element appeared (if you called &sgmls; or &nsgmls; with
376 the <parameter>-l</parameter> flag), and
377 <literal>$event->element</literal> will return the element currently
378 in force (in this case, the same as
379 <literal>$event->data</literal>).</para>
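<para>One way to act on the event types is a table of handlers keyed by
the string that <literal>$event->type</literal> returns. The following
is only a sketch: <classname>Demo_Object</classname> is a hypothetical
stand-in for the real event and element objects, so that the example
runs without a parser, and the output format is invented for
illustration. With &sgmls.pm; loaded, the events would come from
<literal>$parse->next_event</literal> instead.</para>

```perl
use strict;
use warnings;

# Hypothetical stand-in for SGMLS_Event and SGMLS_Element objects, so the
# sketch runs without the parser. Real objects come from $parse->next_event.
package Demo_Object;
sub new  { my ($class, %fields) = @_; return bless {%fields}, $class; }
sub type { return $_[0]{type}; }
sub data { return $_[0]{data}; }
sub name { return $_[0]{name}; }

package main;

# One handler per event type; each handler receives the event object.
my %handlers = (
    start_element => sub { return '(' . $_[0]->data->name; },
    end_element   => sub { return ')' . $_[0]->data->name; },
    cdata         => sub { return '-' . $_[0]->data; },
);

# Return a one-line description of an event, or undef for types
# that have no handler in the table.
sub describe_event {
    my ($event) = @_;
    my $handler = $handlers{ $event->type };
    return defined $handler ? $handler->($event) : undef;
}

# Three hand-made events standing in for a parsed paragraph.
my @events = (
    Demo_Object->new(type => 'start_element',
                     data => Demo_Object->new(name => 'PARA')),
    Demo_Object->new(type => 'cdata', data => 'Hello'),
    Demo_Object->new(type => 'end_element',
                     data => Demo_Object->new(name => 'PARA')),
);

for my $event (@events) {
    my $line = describe_event($event);
    print "$line\n" if defined $line;
}
```

<para>Event types without an entry in the table are skipped here; a
real script would probably want a handler for each of the types listed
above.</para>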
384 <sect1 id=sgmlselement>
385 <title>What do I do with an <classname>SGMLS_Element</classname>?</title>
387 <para>Altogether, there are six classes in &sgmls.pm;, each with its
388 own methods: in addition to <link
389 linkend=sgmls><classname>SGMLS</classname></link> (for the parse) and
390 <link linkend=sgmlsevent><classname>SGMLS_Event</classname></link>
391 (for a specific event), the classes are
392 <classname>SGMLS_Element</classname>, <link
393 linkend=sgmlsattribute><classname>SGMLS_Attribute</classname></link>,
394 <link linkend=sgmlsentity><classname>SGMLS_Entity</classname></link>,
396 linkend=sgmlsnotation><classname>SGMLS_Notation</classname></link>.
397 Like all of these, <classname>SGMLS_Element</classname> has a number
398 of methods available for obtaining different types of information.
399 For example, if you were to use</para>
my $element = $event->data;
405 <para>to retrieve the data for a <literal>'start_element'</literal> or
406 <literal>'end_element'</literal> event, then you could use the methods
407 listed in table <xref linkend=table.class.sgmls.element> to find more
408 information about the element.</para>
410 <table id=table.class.sgmls.element>
411 <title>The <classname>SGMLS_Element</classname> class</title>
418 <entry>Method</entry>
419 <entry>Return Type</entry>
420 <entry>Description</entry>
428 <entry><command>name</command></entry>
429 <entry>string</entry>
430 <entry>The name (or GI), in upper-case.</entry>
434 <entry><command>parent</command></entry>
435 <entry><classname>SGMLS_Element</classname></entry>
436 <entry>The parent element, or <literal>''</literal> if this is the top
441 <entry><command>attributes</command></entry>
443 <entry>Return a reference to a hash table of
444 <classname>SGMLS_Attribute</classname> objects, keyed by the attribute
445 names (in upper-case).</entry>
449 <entry><command>attribute_names</command></entry>
451 <entry>A list of all attribute names for the current element (in
456 <entry><command>attribute(<parameter>aname</parameter>)</command></entry>
457 <entry><classname>SGMLS_Attribute</classname></entry>
<entry>Return the attribute named <parameter>aname</parameter>.</entry>
462 <entry><command>set_attribute(<parameter>attribute</parameter>)</command></entry>
463 <entry>[none]</entry>
<entry>The <parameter>attribute</parameter> argument should be an
object belonging to the <link
linkend=sgmlsattribute><classname>SGMLS_Attribute</classname></link>
class. Add it to the element, replacing any previous attribute with
the same name.</entry>
472 <entry><command>in(<parameter>name</parameter>)</command></entry>
473 <entry><classname>SGMLS_Element</classname></entry>
474 <entry>If the current element's parent is named
475 <parameter>name</parameter>, return the parent; otherwise, return
476 <literal>''</literal>.</entry>
480 <entry><command>within(<parameter>name</parameter>)</command></entry>
481 <entry><classname>SGMLS_Element</classname></entry>
482 <entry>If any ancestor of the current element is named
483 <parameter>name</parameter>, return it; otherwise, return
484 <literal>''</literal>.</entry>
494 <sect1 id=sgmlsattribute>
495 <title>What do I do with an
496 <classname>SGMLS_Attribute</classname>?</title>
498 <para>Note that objects of the <classname>SGMLS_Attribute</classname>
499 class do not have events in their own right, and are available only
500 through the <command>attributes</command> or
501 <command>attribute(<parameter>aname</parameter>)</command> methods for
502 <link linkend=sgmlselement><classname>SGMLS_Element</classname></link>
503 objects. An object belonging to the
504 <classname>SGMLS_Attribute</classname> class will recognise the
505 methods listed in table <xref
506 linkend=table.class.sgmls.attribute>.</para>
508 <table id=table.class.sgmls.attribute>
509 <title>The <classname>SGMLS_Attribute</classname> class</title>
516 <entry>Method</entry>
517 <entry>Return Type</entry>
518 <entry>Description</entry>
526 <entry><command>name</command></entry>
527 <entry>string</entry>
528 <entry>The name of the attribute (in upper-case).</entry>
532 <entry><command>type</command></entry>
533 <entry>string</entry>
534 <entry>The type of the attribute: <literal>'IMPLIED'</literal>,
535 <literal>'CDATA'</literal>, <literal>'NOTATION'</literal>,
536 <literal>'ENTITY'</literal>, or <literal>'TOKEN'</literal>.</entry>
540 <entry><command>value</command></entry>
541 <entry>string, <classname>SGMLS_Entity</classname>, or
<classname>SGMLS_Notation</classname></entry>
543 <entry>The value of the attribute. If the type is
544 <literal>'CDATA'</literal> or <literal>'TOKEN'</literal>, it will be a
545 simple string; if it is <literal>'NOTATION'</literal> it will be an
546 object belonging to the <classname>SGMLS_Notation</classname> class,
and if it is <literal>'ENTITY'</literal> it will be an object
548 belonging to the <classname>SGMLS_Entity</classname> class.</entry>
552 <entry><command>is_implied</command></entry>
553 <entry>boolean</entry>
554 <entry>Return true if the value of the attribute is implied, or false if
555 it has an explicit value.</entry>
559 <entry><command>set_type(<parameter>type</parameter>)</command></entry>
560 <entry>[none]</entry>
561 <entry>Provide a new type for the current attribute -- no sanity
562 checking will be performed, so be careful.</entry>
566 <entry><command>set_value(<parameter>value</parameter>)</command></entry>
567 <entry>[none]</entry>
568 <entry>Provide a new value for the current attribute -- no sanity
569 checking will be performed, so be careful.</entry>
576 <para>Note that the type <literal>'TOKEN'</literal> includes both
individual tokens and lists of tokens (i.e., <literal>'TOKENS'</literal>,
578 <literal>'IDS'</literal>, or <literal>'IDREFS'</literal> in the
579 original &sgml; document), so you might need to use the perl function
<literal>split</literal> to break the value string into a list.</para>
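<para>A minimal sketch of that splitting step follows. The helper name
<literal>token_list</literal> and the sample value are invented for
illustration; the only assumption about the attribute is that its
tokens are separated by whitespace.</para>

```perl
use strict;
use warnings;

# Break the value of a 'TOKEN' attribute into its individual tokens.
# split ' ' splits on runs of whitespace and ignores leading whitespace.
sub token_list {
    my ($value) = @_;
    return split ' ', $value;
}

# A made-up value such as a list of ID references might produce.
my @targets = token_list('FIG1 FIG2  TAB3');
print scalar(@targets), " tokens: @targets\n";
```

<para>With a real attribute object, the argument would be the string
returned by its <command>value</command> method.</para>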
585 <sect1 id=sgmlsentity>
586 <title>What do I do with an <classname>SGMLS_Entity</classname>?</title>
588 <para>An <classname>SGMLS_Entity</classname> object can come in an
589 <literal>'entity'</literal> <link linkend=events>event</link> (in
590 which case it is always external), in a
591 <literal>'start_subdoc'</literal> or <literal>'end_subdoc'</literal>
592 event (in which case it always has the type
593 <literal>'SUBDOC'</literal>), or as the value of an attribute (in
594 which case it may be internal or external). An object belonging to
595 the <classname>SGMLS_Entity</classname> class may use the methods
596 listed in table <xref linkend=table.class.sgmls.entity>.</para>
598 <table id=table.class.sgmls.entity>
599 <title>The <classname>SGMLS_Entity</classname> class</title>
606 <entry>Method</entry>
607 <entry>Return Type</entry>
608 <entry>Description</entry>
616 <entry><command>name</command></entry>
617 <entry>string</entry>
618 <entry>The entity name.</entry>
622 <entry><command>type</command></entry>
623 <entry>string</entry>
624 <entry>The entity type: <literal>'CDATA'</literal>,
625 <literal>'SDATA'</literal>, <literal>'NDATA'</literal>, or
626 <literal>'SUBDOC'</literal>.</entry>
630 <entry><command>value</command></entry>
631 <entry>string</entry>
632 <entry>The entity replacement text (internal entities
637 <entry><command>sysid</command></entry>
638 <entry>string</entry>
639 <entry>The system identifier (external entities only).</entry>
643 <entry><command>pubid</command></entry>
644 <entry>string</entry>
645 <entry>The public identifier (external entities only).</entry>
649 <entry><command>filenames</command></entry>
651 <entry>A list of file names generated from the sysid and pubid
652 (external entities only).</entry>
656 <entry><command>notation</command></entry>
657 <entry><classname>SGMLS_Notation</classname></entry>
658 <entry>The associated notation (external data entities only).</entry>
665 <para>An entity of type <literal>'SUBDOC'</literal> will have a sysid
and pubid, an external data entity will have a sysid, pubid,
667 filenames, and a notation, and an internal data entity will have a
673 <sect1 id=sgmlsnotation>
674 <title>What do I do with an <classname>SGMLS_Notation</classname>?</title>
676 <para>The fourth data class is the notation, which is available only
677 as a return value from the <command>notation</command> method of an
678 <link linkend=sgmlsentity><classname>SGMLS_Entity</classname></link>
679 or the <command>value</command> method of an <link
680 linkend=sgmlsattribute><classname>SGMLS_Attribute</classname></link>
681 with type <literal>'NOTATION'</literal>. You can use the notation to
682 decide how to treat non-SGML data (such as graphics). An object
683 belonging to the <classname>SGMLS_Notation</classname> class will have
684 access to the methods listed in table <xref
685 linkend=table.class.sgmls.notation>.</para>
687 <table id=table.class.sgmls.notation>
<title>The <classname>SGMLS_Notation</classname> class</title>
695 <entry>Method</entry>
696 <entry>Return Type</entry>
697 <entry>Description</entry>
705 <entry><command>name</command></entry>
706 <entry>string</entry>
707 <entry>The notation's name.</entry>
711 <entry><command>sysid</command></entry>
712 <entry>string</entry>
713 <entry>The notation's system identifier.</entry>
717 <entry><command>pubid</command></entry>
718 <entry>string</entry>
719 <entry>The notation's public identifier.</entry>
726 <para>What you do with this information is
727 <emphasis>entirely</emphasis> up to you.</para>
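<para>As one possibility, an application might choose an action from
the notation's name. In the sketch below,
<classname>Demo_Notation</classname> is a hypothetical stand-in for an
<classname>SGMLS_Notation</classname> object, and the notation names
and actions are made up; only the <command>name</command>,
<command>sysid</command>, and <command>pubid</command> methods are
taken from the table above.</para>

```perl
use strict;
use warnings;

# Hypothetical stand-in for an SGMLS_Notation object, offering only the
# three documented methods.
package Demo_Notation;
sub new   { my ($class, %fields) = @_; return bless {%fields}, $class; }
sub name  { return $_[0]{name}; }
sub sysid { return $_[0]{sysid}; }
sub pubid { return $_[0]{pubid}; }

package main;

# Map notation names to the action an application might take.
# Both the names and the actions are invented for illustration.
my %action_for = (
    GIF => 'inline-image',
    EPS => 'convert-then-include',
);

# Pick an action for a notation, falling back to 'skip' when the
# notation is not one the application knows about.
sub action_for_notation {
    my ($notation) = @_;
    my $action = $action_for{ $notation->name };
    return defined $action ? $action : 'skip';
}

my $notation = Demo_Notation->new(name => 'GIF', sysid => 'gifview');
print action_for_notation($notation), "\n";
```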
733 <title>Is there any extra information available from the &sgml;
736 <para>The <link linkend=sgmls><classname>SGMLS</classname></link>
737 object which you created at the beginning of the parse has several
738 methods available in addition to <command>next_event</command> —
739 you will find them all listed in table <xref
740 linkend=table.class.sgmls.extra>. There should normally be no need to
741 use the <command>notation</command> and <command>entity</command>
742 methods, since &sgmls.pm; will look up entities and notations for you
743 automatically as needed.</para>
745 <table id=table.class.sgmls.extra>
746 <title>Additional methods for the <classname>SGMLS</classname>
753 <entry>Method</entry>
754 <entry>Return Type</entry>
755 <entry>Description</entry>
763 <entry><command>next_event</command></entry>
764 <entry><classname>SGMLS_Event</classname></entry>
765 <entry>Return the next event.</entry>
769 <entry><command>appinfo</command></entry>
770 <entry>string</entry>
771 <entry>Return the APPINFO parameter from the &sgml; declaration, if
776 <entry><command>notation(<parameter>nname</parameter>)</command></entry>
777 <entry><classname>SGMLS_Notation</classname></entry>
778 <entry>Look up a notation by name.</entry>
782 <entry><command>entity(<parameter>ename</parameter>)</command></entry>
783 <entry><classname>SGMLS_Entity</classname></entry>
784 <entry>Look up an entity by name.</entry>
795 <title>How about a simple example?</title>
797 <para>OK. The following script simply reports its events:</para>
803 <para>To use it under Unix, try something like</para>
sgmls document.sgml | perl sample.pl
809 <para>and watch the output scroll down.</para>
815 <title>How do I design my <emphasis>own</emphasis> classes?</title>
817 <para>In addition to the methods listed above, all of the classes used
818 in &sgmls.pm; have an <command>ext</command> method which returns a
819 reference to an initially-empty hash table. You are free to use this
820 hash table to store <emphasis>anything</emphasis> you want — it
821 should be especially useful if you are building your own, derived
822 classes from the ones provided here. The following example builds a
823 derived class <classname>My_Element</classname> from the <link
824 linkend=sgmlselement><classname>SGMLS_Element</classname></link>
825 class, adding methods to set and get the current font:</para>
@ISA = qw(SGMLS_Element);
  my ($class,$element,$font) = @_;
  $element->ext->{'font'} = $font;
  return bless $element, $class;
  return $self->ext->{'font'};
  my ($self,$font) = @_;
  $self->ext->{'font'} = $font;
850 <para>Note that the derived class does not need to have any knowledge
851 about the underlying structure of the <link
852 linkend=sgmlselement><classname>SGMLS_Element</classname></link>
853 class, and need only avoid shadowing any of the methods currently
854 existing there.</para>
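<para>The same idea can be tried without the library installed. The
sketch below supplies its own minimal
<classname>SGMLS_Element</classname> stand-in, which offers only an
<command>ext</command> method and says nothing about how the real class
is built; the accessor names <literal>font</literal> and
<literal>set_font</literal> are likewise chosen for this sketch.</para>

```perl
use strict;
use warnings;

# Minimal stand-in for SGMLS_Element so the sketch runs on its own.
# With the real library, drop this package and load SGMLS.pm instead.
package SGMLS_Element;
sub new { my ($class) = @_; return bless { ext => {} }, $class; }
sub ext { return $_[0]{ext}; }

package My_Element;
our @ISA = ('SGMLS_Element');

# Re-bless an existing element into the derived class and record a font
# in the element's extension hash.
sub new {
    my ($class, $element, $font) = @_;
    $element->ext->{font} = $font;
    return bless $element, $class;
}

# Get and set the font through ext, never through the object's internals.
sub font     { my ($self) = @_; return $self->ext->{font}; }
sub set_font { my ($self, $font) = @_; $self->ext->{font} = $font; return; }

package main;

my $element = My_Element->new(SGMLS_Element->new, 'roman');
$element->set_font('italic');
print $element->font, "\n";
```

<para>Because the derived class touches only the hash returned by
<command>ext</command>, it keeps working however the base class stores
its own data.</para>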
856 <para>If you decide to create a derived class from the <link
linkend=sgmls><classname>SGMLS</classname></link> class, please note that in
858 addition to the methods listed above, that class uses internal methods
859 named <command>element</command>, <command>line</command>, and
860 <command>file</command>, similar to the same methods in <link
861 linkend=sgmlsevent><classname>SGMLS_Event</classname></link> —
862 it is essential that you not shadow these method names.</para>
868 <title>Are there any bugs?</title>
870 <para>Of course! Right now, &sgmls.pm; silently ignores link attributes
871 (&nsgmls; only) and data attributes, and there may be many other bugs
872 which I have not yet found.</para>
878 <!-- Keep this comment at the end of the file
881 sgml-declaration:"/usr/local/lib/sgml/sgmldecl/docbook.dcl"