Other machines may be instantiated and control passed to them by use of
\verb|fcall|, \verb|fgoto| or \verb|fnext| statements.
-\begin{comment}
-\subsection{Write Statement}
-
-\begin{verbatim}
-write <component> [options];
-\end{verbatim}
-\verbspace
-
-The write statement is used to generate parts of the machine. There are four
-components that can be generated: the state machine's static data, the
-initialization code, the execution code and the EOF action execution code. The
-write statement is described in detail in Section \ref{write-statement}.
-\end{comment}
-
\section{Lexical Analysis of an FSM Specification}
\label{lexing}
not matched by the given machine. Character-Level Negation is equivalent to
\verb|(any - expr)|.
-\section{State Charts}
+\section{State Machine Minimization}
-It is not uncommon for programmers to implement
-parsers as manually-coded state machines, either using a switch statement or a
-state map compiler which takes a list of states, transitions and actions, and
-generates code.
-
-This method can be a very effective programming technique for producing robust
-code. The key disadvantage becomes clear when one attempts to comprehend such a
-parser. Machines coded in this way usually require many lines, causing logic to
-be spread out over large distances in the source file. Remembering the function
-of a large number of states can be difficult and organizing the parser in a
-sensible way requires discipline because branches and repetition present many
-file layout options. This kind of programming takes a specification with
-inherent structure such as looping, alternation and concatenation and expresses
-it in a flat form.
+State machine minimization is the process of finding the minimal equivalent FSM accepting
+the language. Minimization reduces the number of states in a machine by merging
+states that are functionally equivalent. It does not change the behaviour of the
+machine in any way. State minimization is on by default. It can be turned
+off with the \verb|-n| option.
-If we could take an isolated component of a manually programmed state chart,
-that is, a subset of states that has only one entry point, and implement it
-using regular language operators then we could eliminate all the explicit
-naming of the states contained in it. By eliminating explicitly named states
-and replacing them with higher-level specifications we simplify a parser
-specification.
+The algorithm implemented is similar to Hopcroft's state minimization
+algorithm. Hopcroft's algorithm assumes a finite alphabet that can be listed in
+memory, whereas Ragel supports arbitrary integer alphabets that cannot be
+listed in memory. Though exact analysis is very difficult, Ragel minimization
+runs close to $O(n \times \log(n))$ and requires $O(n)$ temporary storage where
+$n$ is the number of states.
-For example, sometimes chains of states are needed, with only a small number of
-possible characters appearing along the chain. These can easily be replaced
-with a concatenation of characters. Sometimes a group of common states
-implement a loop back to another single portion of the machine. Rather than
-manually duplicate all the transitions that loop back, we may be able to
-express the loop using a kleene star operator.
+\section{Visualization}
-Ragel allows one to take this state map simplification approach. We can build
-state machines using a state map model and implement portions of the state map
-using regular languages. In place of any transition in the state machine,
-entire sub-state machines can be given. These can encapsulate functionality
-defined elsewhere. An important aspect of the Ragel approach is that when we
-wrap up a collection of states using a regular expression we do not loose
-access to the states and transitions. We can still execute code on the
-transitions that we have encapsulated.
+Ragel is able to emit compiled state machines in Graphviz's Dot file format.
+Graphviz support allows users to perform
+incremental visualization of their parsers. User actions are displayed on
+transition labels of the graph. If the final graph is too large to be
+meaningful, or even drawn, the user is able to inspect portions of the parser
+by naming particular regular expression definitions with the \verb|-S| and
+\verb|-M| options to the \verb|ragel| program. Use of Graphviz greatly
+improves the Ragel programming experience. It allows users to learn Ragel by
+experimentation and also to track down bugs caused by unintended
+nondeterminism.
-\subsection{Join}
+\chapter{User Actions}
-\verb|expr , expr , ...|
-\verbspace
+Ragel permits the user to embed actions into the transitions of a regular
+expression's corresponding state machine. These actions are executed when the
+generated code moves over a transition. Like the regular expression operators,
+the action embedding operators are fully compositional. They take a state
+machine and an action as input, embed the action, and yield a new state machine
+which can be used in the construction of other machines. Due to the
+compositional nature of embeddings, the user has complete freedom in the
+placement of actions.
-Join a list of machines together without
-drawing any transitions, without setting up a start state, and without
-designating any final states. Transitions between the machines may be specified
-using labels and epsilon transitions. The start state must be explicity
-specified with the ``start'' label. Final states may be specified with the an
-epsilon transition to the implicitly created ``final'' state. The join
-operation allows one to build machines using a state chart model.
+A machine's transitions are categorized into four classes. The action embedding
+operators access the transitions defined by these classes. The {\em entering
+transition} operator \verb|>| isolates the start state, then embeds an action
+into all transitions leaving it. The {\em finishing transition} operator
+\verb|@| embeds an action into all transitions going into a final state. The
+{\em all transition} operator \verb|$| embeds an action into all transitions of
+an expression. The {\em pending out transition} operator \verb|%| provides
+access to yet-unmade leaving transitions.
-\subsection{Label}
+\section{Embedding Actions}
-\verb|label: expr|
+\begin{verbatim}
+action ActionName {
+ /* Code an action here. */
+ count += 1;
+}
+\end{verbatim}
\verbspace
-Attaches a label to an expression. Labels can be
-used as the target of epsilon transitions and explicit control transfer
-statements such \verb|fgoto| and \verb|fnext| in action
-code.
+The action statement defines a block of code that can be embedded into an FSM.
+Action names can be referenced by the action embedding operators in
+expressions. Though actions need not be named in this way (literal blocks
+of code can be embedded directly when building machines), defining reusable
+blocks of code whenever possible is good practice because it potentially increases the
+degree to which the machine can be minimized. Within an action some Ragel expressions
+and statements are parsed and translated. These allow the user to interact with the machine
+from action code. See Section \ref{vals} for a complete list of statements and
+values available in code blocks.
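+
+As a minimal sketch of the two styles (the \verb|count| variable and the
+machine names are assumptions made for illustration; \verb|count| is taken to
+be declared by the host program), the same code can be embedded by name or as
+a literal block:
+
+\begin{verbatim}
+# count is assumed to be declared by the host program.
+action count_char { count += 1; }
+
+# Named action, embedded on every transition of lower+.
+word = lower+ $count_char;
+
+# The same code embedded as a literal block.
+num = digit+ ${ count += 1; };
+
+main := ( word | num ) . '\n';
+\end{verbatim}
+\verbspace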
-\subsection{Epsilon}
+\subsection{Entering Action}
-\verb|expr -> label|
+\verb|expr > action|
\verbspace
-Draws an epsilon transition to the state defined
-by \verb|label|. Epsilon transitions are made deterministic when join
-operators are evaluated. Epsilon transitions that are not in a join operation
-are made deterministic when the machine definition that contains the epsilon is
-complete. See Section \ref{labels} for information on referencing labels.
+The entering operator embeds an action into the starting transitions. The
+action is executed on all transitions that enter into the machine from the
+start state. If the start state is a final state then it is possible for the
+machine to never be entered and the starting transitions bypassed. In the
+following example, the action is executed on the first transition of the
+machine. If the repetition machine is bypassed the action is not executed.
+\verbspace
-\section{Scanners}
-\label{generating-scanners}
+% GENERATE: exstact
+% OPT: -p
+% %%{
+% machine exstact;
+\begin{inline_code}
+\begin{verbatim}
+# Execute A at the beginning of a string of alpha.
+action A {}
+main := ( lower* >A ) . ' ';
+\end{verbatim}
+\end{inline_code}
+% }%%
+% END GENERATE
-The longest-match operator can be used to construct scanners. The generated
-machine repeatedly attempts to match one of the given patterns, first favouring
-longer pattern matches over shorter ones. If there is a choice between equal
-length matches, the match of the pattern which appears first is chosen.
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exstact}
+\end{center}
+\graphspace
+\subsection{Finishing Action}
+
+\verb|expr @ action|
\verbspace
+
+The finishing action operator embeds an action into any transitions that go into a
+final state. Whether or not the machine accepts is not determined at the point
+the action is executed. Further input may move the machine out of the accepting
+state, but keep it in the machine. As in the following example, the
+finishing action operator is most often used when no lookahead is necessary.
+
+% GENERATE: exdoneact
+% OPT: -p
+% %%{
+% machine exdoneact;
+% action A {}
+\begin{inline_code}
\begin{verbatim}
-<machine_name> := |*
- pattern1 => action1;
- pattern2 => action2;
- ...
- *|;
+# Execute A when the trailing space is seen.
+main := ( lower* ' ' ) @A;
\end{verbatim}
-\verbspace
+\end{inline_code}
+% }%%
+% END GENERATE
-The longest-match construction operator is not a pure state machine operator.
-It relies on the \verb|tokstart|, \verb|tokend| and \verb|act| variables to be
-present so that it can backtrack and make pointers to the matched text
-available to the user. If input is processed using multiple calls to the
-execute code then the user must ensure that when a token is only partially
-matched that the prefix is preserved on the subsequent invocation of the
-execute code.
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exdoneact}
+\end{center}
+\graphspace
-The \verb|tokstart| variable must be defined as a pointer to the input data.
-It is used for recording where the current token match begins. This variable
-may be used in action code for retrieving the text of the current match. Ragel
-ensures that in between tokens and outside of the longest-match machines that
-this pointer is set to null. In between calls to the execute code the user must
-check if \verb|tokstart| is set and if so, ensure that the data it points to is
-preserved ahead of the next buffer block. This is described in more detail
-below.
-The \verb|tokend| variable must also be defined as a pointer to the input data.
-It is used for recording where a match ends and where scanning of the next
-token should begin. This can also be used in action code for retrieving the
-text of the current match.
+\subsection{All Transition Action}
-The \verb|act| variable must be defined as an integer type. It is used for
-recording the identity of the last pattern matched when the scanner must go
-past a matched pattern in an attempt to make a longer match. If the longer
-match fails it may need to consult the act variable. In some cases use of the act
-variable can be avoided because the value of the current state is enough
-information to determine which token to accept, however in other cases this is
-not enough and so the \verb|act| variable is used.
+\verb|expr $ action|
+\verbspace
-When the longest-match operator is in use, the user's driver code must take on
-some buffer management functions. The following algorithm gives an overview of
-the steps that should be taken to properly use the longest-match operator.
+The all transition operator embeds an action into all transitions of a machine.
+The action is executed whenever a transition of the machine is taken. In the
+following example, A is executed on every character matched.
-\begin{itemize}
-\setlength{\parskip}{0pt}
-\item Read a block of input data.
-\item Run the execute code.
-\item If \verb|tokstart| is set, the execute code will expect the incomplete
-token to be preserved ahead of the buffer on the next invocation of the execute
-code.
-\begin{itemize}
-\item Shift the data beginning at \verb|tokstart| and ending at \verb|pe| to the
-beginning of the input buffer.
-\item Reset \verb|tokstart| to the beginning of the buffer.
-\item Shift \verb|tokend| by the distance from the old value of \verb|tokstart|
-to the new value. The \verb|tokend| variable may or may not be valid. There is
-no way to know if it holds a meaningful value because it is not kept at null
-when it is not in use. It can be shifted regardless.
-\end{itemize}
-\item Read another block of data into the buffer, immediately following any
-preserved data.
-\item Run the scanner on the new data.
-\end{itemize}
+% GENERATE: exallact
+% OPT: -p
+% %%{
+% machine exallact;
+% action A {}
+\begin{inline_code}
+\begin{verbatim}
+# Execute A on any characters of machine one or two.
+main := ( 'm1' | 'm2' ) $A;
+\end{verbatim}
+\end{inline_code}
+% }%%
+% END GENERATE
-Figure \ref{preserve_example} shows the required handling of an input stream in
-which a token is broken by the input block boundaries. After processing up to
-and including the ``t'' of ``characters'', the prefix of the string token must be
-retained and processing should resume at the ``e'' on the next iteration of
-the execute code.
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exallact}
+\end{center}
+\graphspace
-If one uses a large input buffer for collecting input then the number of times
-the shifting must be done will be small. Furthermore, if one takes care not to
-define tokens that are allowed to be very long and instead processes these
-items using pure state machines or sub-scanners, then only a small amount of
-data will ever need to be shifted.
-\begin{figure}
-\begin{verbatim}
- a) A stream "of characters" to be scanned.
- | | |
- p tokstart pe
+\subsection{Pending Out (Leaving) Actions}
+\label{out-actions}
- b) "of characters" to be scanned.
- | | |
- tokstart p pe
+\verb|expr % action|
+\verbspace
+
+The pending out action operator embeds an action into the pending out
+transitions of a machine. The action is first embedded into the final states of
+the machine and later transferred to any transitions made going out of the
+machine. The transfer can be caused by either a concatenation or a Kleene star
+operation. This mechanism allows one to associate an action with the
+termination of a sequence, without being concerned about what particular
+character terminates the sequence. In the following example, A is executed
+when leaving the alpha machine by the newline character.
+
+% GENERATE: exoutact1
+% OPT: -p
+% %%{
+% machine exoutact1;
+% action A {}
+\begin{inline_code}
+\begin{verbatim}
+# Match a word followed by a newline. Execute A when
+# finishing the word.
+main := ( lower+ %A ) . '\n';
\end{verbatim}
-\caption{Following an invocation of the execute code there may be a partially
-matched token (a). The data of the partially matched token
-must be preserved ahead of the new data on the next invocation (b).}
-\label{preserve_example}
-\end{figure}
+\end{inline_code}
+% }%%
+% END GENERATE
-Since scanners attempt to make the longest possible match of input, in some
-cases they are not able to identify a token upon parsing its final character,
-they must wait for a lookahead character. For example if trying to match words,
-the token match must be triggered on following whitespace in case more
-characters of the word have yet to come. The user must therefore arrange for an
-EOF character to be sent to the scanner to flush out any token that has not yet
-been matched. The user can exclude a single character from the entire scanner
-and use this character as the EOF character, possibly specifying an EOF action.
-For most scanners, zero is a suitable choice for the EOF character.
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exoutact1}
+\end{center}
+\graphspace
-Alternatively, if whitespace is not significant and ignored by the scanner, the
-final real token can be flushed out by simply sending an additional whitespace
-character on the end of the stream. If the real stream ends with whitespace
-then it will simply be extended and ignored. If it does not, then the last real token is
-guaranteed to be flushed and the dummy EOF whitespace ignored.
-An example scanner processing loop is given in Figure \ref{scanner-loop}.
+In the following example, the \verb|term_word| action could be used to register
+the appearance of a word and to clear the buffer that the \verb|lower| action used
+to store its text.
-\begin{figure}
-\small
+% GENERATE: exoutact2
+% OPT: -p
+% %%{
+% machine exoutact2;
+% action lower {}
+% action space {}
+% action term_word {}
+% action newline {}
+\begin{inline_code}
\begin{verbatim}
- int have = 0;
- bool done = false;
- while ( !done ) {
- /* How much space is in the buffer? */
- int space = BUFSIZE - have;
- if ( space == 0 ) {
- /* Buffer is full. */
- cerr << "TOKEN TOO BIG" << endl;
- exit(1);
- }
-
- /* Read in a block after any data we already have. */
- char *p = inbuf + have;
- cin.read( p, space );
- int len = cin.gcount();
+word = ( [a-z] @lower )+ %term_word;
+main := word ( ' ' @space word )* '\n' @newline;
+\end{verbatim}
+\end{inline_code}
+% }%%
+% END GENERATE
- /* If no data was read, send the EOF character.
- if ( len == 0 ) {
- p[0] = 0, len++;
- done = true;
- }
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exoutact2}
+\end{center}
+\graphspace
- char *pe = p + len;
- %% write exec;
- if ( cs == RagelScan_error ) {
- /* Machine failed before finding a token. */
- cerr << "PARSE ERROR" << endl;
- exit(1);
- }
+In this final example of the action embedding operators, A is executed upon
+entering the alpha machine, B is executed on all transitions of the alpha
+machine, C is executed when the alpha machine accepts by moving into the
+newline machine and N is executed when the newline machine moves into a final
+state.
- if ( tokstart == 0 )
- have = 0;
- else {
- /* There is a prefix to preserve, shift it over. */
- have = pe - tokstart;
- memmove( inbuf, tokstart, have );
- tokend = inbuf + (tokend-tokstart);
- tokstart = inbuf;
- }
- }
+% GENERATE: exaction
+% OPT: -p
+% %%{
+% machine exaction;
+% action A {}
+% action B {}
+% action C {}
+% action N {}
+\begin{inline_code}
+\begin{verbatim}
+# Execute A on starting the alpha machine, B on every transition
+# moving through it and C upon finishing. Execute N on the newline.
+main := ( lower* >A $B %C ) . '\n' @N;
\end{verbatim}
-\caption{A processing loop for a scanner.}
-\label{scanner-loop}
-\end{figure}
+\end{inline_code}
+% }%%
+% END GENERATE
+\graphspace
+\begin{center}
+\includegraphics[scale=0.45]{exaction}
+\end{center}
+\graphspace
-\section{Write Statement}
-\label{write-statement}
-\begin{verbatim}
-write <component> [options];
-\end{verbatim}
-\verbspace
+\section{State Action Embedding Operators}
+The state embedding operators allow one to embed actions into states. Like the
+transition embedding operators, there are several different classes of states
+that the operators access. The meanings of the symbols are partially related to
+the meanings of the symbols used by the transition embedding operators.
-The write statement is used to generate parts of the machine.
-There are four
-components that can be generated by a write statement. These components are the
-state machine's data, initialization code, execution code and EOF action
-execution code. A write statement may appear before a machine is fully defined.
-This allows one to write out the data first then later define the machine where
-it is used. An example of this is show in Figure \ref{fbreak-example}.
+The state embedding operators differ from the transition embedding
+operators in that embedded actions can be associated with several kinds of
+events, so each operator must also identify the type of event. The state
+embedding operators therefore have two components. The first, which
+is the first one or two characters, specifies the class of states that the
+action will be embedded into. The second component specifies the type of event
+the action will be executed on.
-\subsection{Write Data}
-\begin{verbatim}
-write data [options];
-\end{verbatim}
-\verbspace
+\def\fakeitem{\hspace*{12pt}$\bullet$\hspace*{10pt}}
-The write data statement causes Ragel to emit the constant static data needed
-by the machine. In table-driven output styles (see Section \ref{genout}) this
-is a collection of arrays that represent the states and transitions of the
-machine. In goto-driven machines much less data is emitted. At the very
-minimum a start state \verb|name_start| is generated. All variables written
-out in machine data have both the \verb|static| and \verb|const| properties and
-are prefixed with the name of the machine and an
-underscore. The data can be placed inside a class, inside a function, or it can
-be defined as global data.
+\begin{minipage}{\textwidth}
+\begin{multicols}{2}
+\raggedcolumns
+\noindent The different classes of states are:\\
+\fakeitem \verb|> | -- the start state \\
+\fakeitem \verb|$ | -- all states\\
+\fakeitem \verb|% | -- final states\\
+\fakeitem \verb|< | -- any state except the start state\\
+\fakeitem \verb|@ | -- any state except final states\\
+\fakeitem \verb|<>| -- any except start and final (middle)
-Two variables are written that may be used to test the state of the machine
-after a buffer block has been processed. The \verb|name_error| variable gives
-the id of the state that the machine moves into when it cannot find a valid
-transition to take. The machine immediately breaks out of the processing loop when
-it finds itself in the error state. The error variable can be compared to the
-current state to determine if the machine has failed to parse the input. If the
-machine is complete, that is from every state there is a transition to a proper
-state on every possible character of the alphabet, then no error state is required
-and this variable will be set to -1.
+\columnbreak
-The \verb|name_first_final| variable stores the id of the first final state. All of the
-machine's states are sorted by their final state status before having their ids
-assigned. Checking if the machine has accepted its input can then be done by
-checking if the current state is greater-than or equal to the first final
-state.
+\noindent The different kinds of embeddings are:\\
+\fakeitem \verb|~| -- to-state actions\\
+\fakeitem \verb|*| -- from-state actions\\
+\fakeitem \verb|/| -- EOF actions\\
+\fakeitem \verb|!| -- error actions\\
+\fakeitem \verb|^| -- local error actions
+\end{multicols}
+\end{minipage}
+%\label{state-act-embed}
+%\caption{The two components of state embedding operators. The class of states
+%to select comes first, followed by the type of embedding.}
+%
+%\begin{figure}[t]
+%\centering
+%\includegraphics{stembed}
+%\caption{Summary of state manipulation operators}
+%\label{state-act-embed-chart}
+%\end{figure}
-Data generation has several options:
+%\noindent Putting these two components together we get a matrix of state
+%embedding operators. The entire set is given in Figure \ref{state-act-embed-chart}.
-\begin{itemize}
-\item \verb|noerror| - Do not generate the integer variable that gives the
-id of the error state.
-\item \verb|nofinal| - Do not generate the integer variable that gives the
-id of the first final state.
-\item \verb|noprefix| - Do not prefix the variable names with the name of the
-machine.
-\end{itemize}
-\subsection{Write Init}
-\begin{verbatim}
-write init;
-\end{verbatim}
-\verbspace
+\subsection{To-State and From-State Actions}
-The write init statement causes Ragel to emit initialization code. This should
-be executed once before the machine is started. At a very minimum this sets the
-current state to the start state. If other variables are needed by the
-generated code, such as call
-stack variables or longest-match management variables, they are also
-initialized here.
+\subsubsection{To-State Actions}
-\subsection{Write Exec}
-\begin{verbatim}
-write exec [options];
-\end{verbatim}
+\verb| >~ $~ %~ <~ @~ <>~ |
\verbspace
-The write exec statement causes Ragel to emit the state machine's execution code.
-Ragel expects several variables to be available to this code. At a very minimum, the
-generated code needs access to the current character position \verb|p|, the ending
-position \verb|pe| and the current state \verb|cs|, though \verb|pe|
-can be excluded by specifying the \verb|noend| write option.
-The \verb|p| variable is the cursor that the execute code will
-used to traverse the input. The \verb|pe| variable should be set up to point to one
-position past the last valid character in the buffer.
+To-state actions are executed whenever the state machine moves into the
+specified state, either by a natural movement over a transition or by an
+action-based transfer of control such as \verb|fgoto|. They are executed after the
+in-transition's actions but before the current character is advanced and
+tested against the end of the input block. To-state embeddings stay with the
+state. They are independent of the state's current set of transitions and any
+future transitions that may be added in or out of the state.
-Other variables are needed when certain features are used. For example using
-the \verb|fcall| or \verb|fret| statements requires \verb|stack| and
-\verb|top| variables to be defined. If a longest-match construction is used,
-variables for managing backtracking are required.
+Note that the setting of the current state variable \verb|cs| outside of the
+execute code is not considered by Ragel as moving into a state and consequently
+the to-state actions of the new current state are not executed. This includes
+the initialization of the current state when the machine begins. This is
+because the entry point into the machine execution code is after the execution
+of to-state actions.
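+
+For illustration only, the following sketch embeds a hypothetical
+\verb|entered| action as a to-state action in every state of a small machine.
+In line with the note above, the action would not run for the initial setting
+of \verb|cs|, only for states reached while the execute code is running.
+
+\begin{verbatim}
+# entered is a placeholder action; its body is left empty here.
+action entered {}
+main := ( lower+ '\n' ) $~entered;
+\end{verbatim}
+\verbspace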
-The write exec statement has one option. The \verb|noend| option tells Ragel
-to generate code that ignores the end position \verb|pe|. In this
-case the user must explicitly break out of the processing loop using
-\verb|fbreak|, otherwise the machine will continue to process characters until
-it moves into the error state. This option is useful if one wishes to process a
-null terminated string. Rather than traverse the string to discover then length
-before processing the input, the user can break out when the null character is
-seen. The example in Figure \ref{fbreak-example} shows the use of the
-\verb|noend| write option and the \verb|fbreak| statement for processing a string.
+\subsubsection{From-State Actions}
-\begin{figure}
-\small
-\begin{verbatim}
-#include <stdio.h>
-%% machine foo;
-int main( int argc, char **argv )
-{
- %% write data noerror nofinal;
- int cs, res = 0;
- if ( argc > 1 ) {
- char *p = argv[1];
- %%{
- main :=
- [a-z]+
- 0 @{ res = 1; fbreak; };
- write init;
- write exec noend;
- }%%
- }
- printf("execute = %i\n", res );
- return 0;
-}
-\end{verbatim}
-\caption{Use of {\tt noend} write option and the {\tt fbreak} statement for
-processing a string.}
-\label{fbreak-example}
-\end{figure}
+\verb| >* $* %* <* @* <>* |
+\verbspace
+From-state actions are executed whenever the state machine takes a transition from a
+state, either to itself or to some other state. These actions are executed
+immediately after the current character is tested against the input block end
+marker and before the transition to take is sought based on the current
+character. From-state actions are therefore executed even if a transition
+cannot be found and the machine moves into the error state. Like to-state
+embeddings, from-state embeddings stay with the state.
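+
+A minimal sketch, using a hypothetical \verb|leaving| action: the from-state
+action below is attached to the start state only, so it would be executed each
+time a transition out of the start state is sought, including the case where
+no transition matches the current character.
+
+\begin{verbatim}
+# leaving is a placeholder action; its body is left empty here.
+action leaving {}
+main := ( lower+ '\n' ) >*leaving;
+\end{verbatim}
+\verbspace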
-\subsection{Write EOF Actions}
-\begin{verbatim}
-write eof;
-\end{verbatim}
-\verbspace
+\subsection{EOF Actions}
-The write EOF statement causes Ragel to emit code that executes EOF actions.
-This write statement is only relevant if EOF actions have been embedded,
-otherwise it does not generate anything. The EOF action code requires access to
-the current state.
+\verb| >/ $/ %/ </ @/ <>/ |
+\verbspace
-\section{Referencing Names}
-\label{labels}
+The EOF action embedding operators enable the user to embed EOF actions into
+different classes of
+states. EOF actions are stored in states and generated with the \verb|write eof|
+statement. The generated EOF code switches on the current state and executes the EOF
+actions associated with it.
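+
+As a sketch (the \verb|unterminated| action name is an assumption made for
+this example), an EOF action embedded into the non-final states can flag
+input that ends in the middle of a quoted string:
+
+\begin{verbatim}
+# Executed by the write eof code when the input ends before the
+# closing quote has been seen.
+action unterminated {}
+main := ( '"' [^"]* '"' ) @/unterminated;
+\end{verbatim}
+\verbspace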
-This section describes how to reference names in epsilon transitions and
-action-based control-flow statements such as \verb|fgoto|. There is a hierarchy
-of names implied in a Ragel specification. At the top level are the machine
-instantiations. Beneath the instantiations are labels and references to machine
-definitions. Beneath those are more labels and references to definitions, and
-so on.
+\subsection{Handling Errors}
-Any name reference may contain multiple components separated with the \verb|::|
-compound symbol. The search for the first component of a name reference is
-rooted at the join expression that the epsilon transition or action embedding
-is contained in. If the name reference is not not contained in a join,
-the search is rooted at the machine definition that that the epsilon transition or
-action embedding is contained in. Each component after the first is searched
-for beginning at the location in the name tree that the previous reference
-component refers to.
+\subsubsection{Global Error Actions}
-In the case of action-based references, if the action is embedded more than
-once, the local search is performed for each embedding and the result is the
-union of all the searches. If no result is found for action-based references then
-the search is repeated at the root of the name tree. Any action-based name
-search may be forced into a strictly global search by prefixing the name
-reference with \verb|::|.
+\verb| >! $! %! <! @! <>! |
+\verbspace
-The final component of the name reference must resolve to a unique entry point.
-If a name is unique in the entire name tree it can be referenced as is. If it
-is not unique it can be specified by qualifying it with names above it in the
-name tree. However, it can always be renamed.
+Error actions are stored in states until the final state machine has been fully
+constructed. They are then transferred to the transitions that move into the
+error state. This transfer entails the creation of a transition from the state
+to the error state that is taken on all input characters which are not already
+covered by the state's transitions. In other words it provides a default
+action. Error actions can induce a recovery by altering \verb|p| and then jumping back
+into the machine with \verb|fgoto|.
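+
+One possible shape for such a recovery is sketched below. The action and
+machine names are hypothetical, and the sketch does not adjust \verb|p|; a
+real parser may need to do so in order to rescan the offending character.
+
+\begin{verbatim}
+# On any error, jump to a machine that discards the rest of the
+# line and then resumes normal parsing.
+action bad_line { fgoto skip_line; }
+skip_line := [^\n]* '\n' @{ fgoto main; };
+main := ( lower+ '\n' )* $!bad_line;
+\end{verbatim}
+\verbspace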
-% FIXME: Should fit this in somewhere.
-% Some kinds of name references are illegal. Cannot call into longest-match
-% machine, can only call its start state. Cannot make a call to anywhere from
-% any part of a longest-match machine except a rule's action. This would result
-% in an eventual return to some point inside a longest-match other than the
-% start state. This is banned for the same reason a call into the LM machine is
-% banned.
+\subsubsection{Local Error Actions}
-\section{State Machine Minimization}
+\verb| >^ $^ %^ <^ @^ <>^ |
+\verbspace
-State machine minimization is the process of finding the minimal equivalent FSM accepting
-the language. Minimization reduces the number of states in machines
-by merging equivalent states. It does not change the behaviour of the machine
-in any way. It will cause some states to be merged into one because they are
-functionally equivalent. State minimization is on by default. It can be turned
-off with the \verb|-n| option.
+Like global error actions, local error actions are stored in states until
+a transfer point. The transfer point is different, however. Each local error action
+embedding is associated with a name. When a machine definition has been fully
+constructed, all local error action embeddings associated with the same name as the
+machine are transferred to error transitions. Local error actions can be used
+to specify an action to take when a particular section of a larger state
+machine fails to make a match. A particular machine definition's ``thread'' may
+die and its local error actions be executed; however, the machine as a whole may
+continue to match input.
-The algorithm implemented is similar to Hopcroft's state minimization
-algorithm. Hopcroft's algorithm assumes a finite alphabet that can be listed in
-memory, whereas Ragel supports arbitrary integer alphabets that cannot be
-listed in memory. Though exact analysis is very difficult, Ragel minimization
-runs close to $O(n \times log(n))$ and requires $O(n)$ temporary storage where
-$n$ is the number of states.
+There are two forms of local error action embeddings. In the first form the name defaults
+to the current machine. In the second form the machine name can be specified. This
+is useful when it is more convenient to specify the local error action in a
+sub-definition that is used to construct the machine definition where the
+transfer should happen. To embed local error actions and explicitly state the
+machine on which the transfer is to happen use \verb|(name, action)| as the
+action.
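+
+The second form might be used as in the following sketch, where the names
+\verb|bad_value|, \verb|value| and \verb|pair| are hypothetical. The embedding
+is written in the sub-definition, but under the rule described above the
+transfer to error transitions should happen only once \verb|pair| has been
+fully constructed.
+\verbspace
+
+\begin{verbatim}
+# Hypothetical names: bad_value, value, pair.
+action bad_value { /* Report a malformed value here. */ }
+
+# Embedded here, but associated with the definition named pair.
+value = digit+ $^(pair, bad_value);
+pair = lower+ '=' value '\n';
+main := pair*;
+\end{verbatim}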
-\chapter{User Actions}
+\begin{comment}
+\begin{itemize}
+\setlength{\parskip}{0in}
+\item \verb|expr >^ (name, action) | -- Start state.
+\item \verb|expr $^ (name, action) | -- All states.
+\item \verb|expr %^ (name, action) | -- Final states.
+\item \verb|expr <^ (name, action) | -- Not start state.
+\item \verb|expr <>^ (name, action)| -- Not start and not final states.
+\end{itemize}
+\end{comment}
-Ragel permits the user to embed actions into the transitions of a regular
-expression's corresponding state machine. These actions are executed when the
-generated code moves over a transition. Like the regular expression operators,
-the action embedding operators are fully compositional. They take a state
-machine and an action as input, embed the action, and yield a new state machine
-which can be used in the construction of other machines. Due to the
-compositional nature of embeddings, the user has complete freedom in the
-placement of actions.
+\section{Action Ordering and Duplicates}
-A machine's transitions are categorized into four classes, The action embedding
-operators access the transitions defined by these classes. The {\em entering
-transition} operator \verb|>| isolates the start state, then embeds an action
-into all transitions leaving it. The {\em finishing transition} operator
-\verb|@| embeds an action into all transitions going into a final state. The
-{\em all transition} operator \verb|$| embeds an action into all transitions of
-an expression. The {\em pending out transition} operator \verb|%| provides
-access to yet-unmade leaving transitions.
+When building a parser by combining smaller expressions which themselves have
+embedded actions, it is often the case that transitions are made which need to
+execute a number of actions on one input character. For example when we leave
+an expression, we may execute the expression's pending out action and the
+subsequent expression's starting action on the same input character. We must
+therefore devise a method for ordering actions that is both intuitive and
+predictable for the user and repeatable by the state machine compiler. The
+determinization process cannot simply order actions by the time at which they
+are introduced into a transition -- otherwise the programmer will be at the
+mercy of luck.
-\section{Embedding Actions}
+We associate with the embedding of each action a distinct timestamp which is
+used to order actions that appear together on a single transition in the final
+compiled state machine. To accomplish this we traverse the parse tree of
+regular expressions and assign timestamps to action embeddings. This algorithm
+is recursive in nature and quite simple. When it visits a parse tree node it
+assigns timestamps to all {\em starting} action embeddings, recurses on the
+parse tree, then assigns timestamps to the remaining {\em all}, {\em
+finishing}, and {\em leaving} embeddings in the order in which they appear.
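+
+As an illustration, consider the following sketch with placeholder actions
+\verb|A| and \verb|B|. By the rule above, \verb|A| should run before \verb|B|
+on the first character of the word even though \verb|B| is written first,
+because starting embeddings receive their timestamps before the other
+embeddings of the same expression.
+\verbspace
+
+\begin{verbatim}
+# Placeholder actions, for illustration only.
+action A {}
+action B {}
+main := ( lower+ $B >A ) . ' ';
+\end{verbatim}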
+
+Ragel does not permit actions (defined or unnamed) to appear multiple times in
+an action list. When the final machine has been created, actions which appear
+more than once in a single transition or EOF action list have their duplicates
+removed. The first appearance of the action is preserved. This is useful in a
+number of scenarios. First, it allows us to union machines with common
+prefixes without worrying about the action embeddings in the prefix being
+duplicated. Second, it prevents pending out actions from being transferred multiple times
+when a concatenation follows a Kleene star and the two machines begin with a common
+character.
+\verbspace
\begin{verbatim}
-action ActionName {
- /* Code an action here. */
- count += 1;
-}
+word = [a-z]+ %act;
+main := word ( '\n' word )* '\n\n';
\end{verbatim}
-\verbspace
-The action statement defines a block of code that can be embedded into an FSM.
-Action names can be referenced by the action embedding operators in
-expressions. Though actions need not be named in this way (literal blocks
-of code can be embedded directly when building machines), defining reusable
-blocks of code whenever possible is good practice because it potentially increases the
-degree to which the machine can be minimized. Within an action some Ragel expressions
-and statements are parsed and translated. These allow the user to interact with the machine
-from action code. See Section \ref{vals} for a complete list of statements and
-values available in code blocks.
+\section{Values and Statements Available in Code Blocks}
+\label{vals}
-\subsection{Entering Action}
+\noindent The following values are available in code blocks:
-\verb|expr > action|
-\verbspace
+\begin{itemize}
+\item \verb|fpc| -- A pointer to the current character. This is equivalent to
+accessing the \verb|p| variable.
-The entering operator embeds an action into the starting transitions. The
-action is executed on all transitions that enter into the machine from the
-start state. If the start state is a final state then it is possible for the
-machine to never be entered and the starting transitions bypassed. In the
-following example, the action is executed on the first transition of the
-machine. If the repetition machine is bypassed the action is not executed.
+\item \verb|fc| -- The current character. This is equivalent to the expression \verb|(*p)|.
-\verbspace
+\item \verb|fcurs| -- An integer value representing the current state. This
+value should only be read from. To move to a different place in the machine
+from action code use the \verb|fgoto|, \verb|fnext| or \verb|fcall| statements.
+Outside of the machine execution code the \verb|cs| variable may be modified.
-% GENERATE: exstact
-% OPT: -p
-% %%{
-% machine exstact;
-\begin{inline_code}
-\begin{verbatim}
-# Execute A at the beginning of a string of alpha.
-action A {}
-main := ( lower* >A ) . ' ';
-\end{verbatim}
-\end{inline_code}
-% }%%
-% END GENERATE
+\item \verb|ftargs| -- An integer value representing the target state. This
+value should only be read from. Again, \verb|fgoto|, \verb|fnext| and
+\verb|fcall| can be used to move to a specific entry point.
-\graphspace
-\begin{center}
-\includegraphics[scale=0.45]{exstact}
-\end{center}
-\graphspace
+\item \verb|fentry(<label>)| -- Retrieve an integer value representing the
+entry point \verb|label|. The integer value returned will be a compile-time
+constant. This number is suitable for later use in control flow transfer
+statements that take an expression. This value should not be compared against
+the current state because any given label can have multiple states representing
+it. The value returned by \verb|fentry| will be one of the possibly multiple states the
+label represents.
+\end{itemize}
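+
+As a brief sketch, the following action reads several of these values. It
+assumes a C host program with \verb|printf| available, and the name
+\verb|data| is a hypothetical pointer to the start of the input buffer.
+\verbspace
+
+\begin{verbatim}
+action trace {
+    /* data is a hypothetical pointer to the start of the buffer. */
+    printf( "state %d, char '%c', offset %ld\n",
+        fcurs, fc, (long)( fpc - data ) );
+}
+\end{verbatim}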
-\subsection{Finishing Action}
+\noindent The following statements are available in code blocks:
-\verb|expr @ action|
-\verbspace
+\begin{itemize}
-The finishing action operator embeds an action into any transitions that go into a
-final state. Whether or not the machine accepts is not determined at the point
-the action is executed. Further input may move the machine out of the accepting
-state, but keep it in the machine. As in the following example, the
-into-final-state operator is most often used when no lookahead is necessary.
+\item \verb|fhold;| -- Do not advance over the current character. If processing
+data in multiple buffer blocks, the \verb|fhold| statement should only be used
+once in the set of actions executed on a character. Multiple calls may result
+in backing up over the beginning of the buffer block. The \verb|fhold|
+statement does not imply any transfer of control. In actions embedded into
+transitions, it is equivalent to the \verb|p--;| statement. In scanner pattern
+actions any changes made to \verb|p| are lost. In this context, \verb|fhold| is
+equivalent to \verb|tokend--;|.
-% GENERATE: exdoneact
-% OPT: -p
-% %%{
-% machine exdoneact;
-% action A {}
-\begin{inline_code}
-\begin{verbatim}
-# Execute A when the trailing space is seen.
-main := ( lower* ' ' ) @A;
-\end{verbatim}
-\end{inline_code}
-% }%%
-% END GENERATE
+\item \verb|fexec <expr>;| -- Set the next character to process. This can be
+used to backtrack to previous input or advance ahead.
+Unlike \verb|fhold|, which can be used
+anywhere, \verb|fexec| requires the user to ensure that the target of the
+backtrack is in the current buffer block or is known to be somewhere ahead of
+it. The machine will continue iterating forward until \verb|pe| is reached,
+\verb|fbreak| is called or the machine moves into the error state. In actions
+embedded into transitions, the \verb|fexec| statement is equivalent to setting
+\verb|p| to the position immediately before the next character to process. If the user
+also modifies \verb|pe|, it is possible to change the buffer block entirely.
+In scanner pattern actions any changes made to \verb|p| are lost. In this
+context, \verb|fexec| is equivalent to setting \verb|tokend| to the next
+character to process.
-\graphspace
-\begin{center}
-\includegraphics[scale=0.45]{exdoneact}
-\end{center}
-\graphspace
+\item \verb|fgoto <label>;| -- Jump to an entry point defined by
+\verb|<label>|. The \verb|fgoto| statement immediately transfers control to
+the destination state.
+\item \verb|fgoto *<expr>;| -- Jump to an entry point given by \verb|<expr>|.
+The expression must evaluate to an integer value representing a state.
-\subsection{All Transition Action}
+\item \verb|fnext <label>;| -- Set the next state to be the entry point defined
+by \verb|label|. The \verb|fnext| statement does not immediately jump to the
+specified state. Any action code following the statement is executed.
-\verb|expr $ action|
-\verbspace
+\item \verb|fnext *<expr>;| -- Set the next state to be the entry point given
+by \verb|<expr>|. The expression must evaluate to an integer value representing
+a state.
-The all transition operator embeds an action into all transitions of a machine.
-The action is executed whenever a transition of the machine is taken. In the
-following example, A is executed on every character matched.
+\item \verb|fcall <label>;| -- Push the target state and jump to the entry
+point defined by \verb|<label>|. The next \verb|fret| will jump to the target
+of the transition on which the call was made. Use of \verb|fcall| requires
+the declaration of a call stack. An array of integers named \verb|stack| and a
+single integer named \verb|top| must be declared. With the \verb|fcall|
+construct, control is immediately transferred to the destination state.
-% GENERATE: exallact
-% OPT: -p
-% %%{
-% machine exallact;
-% action A {}
-\begin{inline_code}
-\begin{verbatim}
-# Execute A on any characters of machine one or two.
-main := ( 'm1' | 'm2' ) $A;
-\end{verbatim}
-\end{inline_code}
-% }%%
-% END GENERATE
+\item \verb|fcall *<expr>;| -- Push the current state and jump to the entry
+point given by \verb|<expr>|. The expression must evaluate to an integer value
+representing a state.
-\graphspace
-\begin{center}
-\includegraphics[scale=0.45]{exallact}
-\end{center}
-\graphspace
+\item \verb|fret;| -- Return to the target state of the transition on which the
+last \verb|fcall| was made. Use of \verb|fret| requires the declaration of a
+call stack with \verb|fstack| in the struct block. Control is immediately
+transferred to the destination state.
+
+\item \verb|fbreak;| -- Save the current state and immediately break out of the
+execute loop. This statement is useful in conjunction with the \verb|noend|
+write option. Rather than process input until the end marker of the input
+buffer is reached, the \verb|fbreak| statement can be used to stop processing input
+upon seeing some end-of-string marker. It can also be used for handling
+exceptional circumstances. The \verb|fbreak| statement does not change the pointer to
+the current character. After an \verb|fbreak| call the \verb|p| variable will point to
+the character that was being traversed over when the action was
+executed. The current state will be the target of the current transition.
+
+\end{itemize}
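+
+The call and return statements can be sketched as follows, using the
+\verb|stack| and \verb|top| declarations described for \verb|fcall| above. The
+stack depth of 32 and the names \verb|calls| and \verb|quoted| are
+illustrative assumptions, and the sketch does not guard against overflowing
+the stack.
+\verbspace
+
+\begin{verbatim}
+/* Host code: the call stack used by fcall and fret. */
+int stack[32];
+int top;
+
+%%{
+    machine calls;
+
+    # Consume up to the closing quote, then return to the caller.
+    quoted := [^"]* '"' @{ fret; };
+
+    # On an opening quote, call into the quoted machine.
+    main := ( lower | '"' @{ fcall quoted; } )*;
+}%%
+\end{verbatim}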
+
+\noindent {\bf Note:} Once actions with control-flow commands are embedded into a
+machine, the user must exercise caution when using the machine as the operand
+to other machine construction operators. If an action jumps to another state
+then unioning any transition that executes that action with another transition
+that follows some other path will cause that other path to be lost. Using
+commands that manually jump around a machine takes us out of the domain of
+regular languages because they introduce transitions that may be conditional
+and that the machine construction operators are not aware of. These
+commands should therefore be used with caution.
-\subsection{Pending Out (Leaving) Actions}
-\label{out-actions}
+\chapter{Controlling Nondeterminism}
+\label{controlling-nondeterminism}
-\verb|expr % action|
-\verbspace
+Along with the flexibility of arbitrary action embeddings comes a need to
+control nondeterminism in regular expressions. If a regular expression is
+ambiguous, then sub-components of a parser other than the intended parts may become
+active. This means that actions which are irrelevant to the
+current subset of the parser may be executed, causing problems for the
+programmer.
-The pending out action operator embeds an action into the pending out
-transitions of a machine. The action is first embedded into the final states of
-the machine and later transferred to any transitions made going out of the
-machine. The transfer can be caused either by a concatenation or kleene star
-operation. This mechanism allows one to associate an action with the
-termination of a sequence, without being concerned about what particular
-character terminates the sequence. In the following example, A is executed
-when leaving the alpha machine by the newline character.
+Tools which are based on regular expression engines and which are used for
+recognition tasks will usually function as intended regardless of the presence
+of ambiguities. It is quite common for users of scripting languages to write
+regular expressions that are heavily ambiguous and it generally does not
+matter. As long as one of the potential matches is recognized, there can be any
+number of other matches present. In some parsing systems the run-time engine
+can employ a strategy for resolving ambiguities, for example always pursuing
+the longest possible match and discarding others.
-% GENERATE: exoutact1
+In Ragel, there is no regular expression run-time engine, just a simple state
+machine execution model. When we begin to embed actions and face the
+possibility of spurious action execution, it becomes clear that controlling
+nondeterminism at the machine construction level is very important. Consider
+the following example.
+
+% GENERATE: lines1
% OPT: -p
% %%{
-% machine exoutact1;
-% action A {}
+% machine lines1;
+% action first {}
+% action tail {}
+% word = [a-z]+;
\begin{inline_code}
\begin{verbatim}
-# Match a word followed by an newline. Execute A when
-# finishing the word.
-main := ( lower+ %A ) . '\n';
+ws = [\n\t ];
+line = word $first ( ws word $tail )* '\n';
+lines = line*;
\end{verbatim}
\end{inline_code}
+% main := lines;
% }%%
% END GENERATE
-\graphspace
\begin{center}
-\includegraphics[scale=0.45]{exoutact1}
+\includegraphics[scale=0.45]{lines1}
\end{center}
-\graphspace
-In the following example, the \verb|term_word| action could be used to register
-the appearance of a word and to clear the buffer that the \verb|lower| action used
-to store the text of it.
+Since the \verb|ws| expression includes the newline character, we will
+not finish the \verb|line| expression when a newline character is seen. We will
+simultaneously pursue the possibility of matching further words on the same
+line and the possibility of matching a second line. Evidence of this fact is
+in the state tables. On several transitions both the \verb|first| and
+\verb|tail| actions are executed. The solution here is simple: exclude
+the newline character from the \verb|ws| expression.
-% GENERATE: exoutact2
+% GENERATE: lines2
% OPT: -p
% %%{
-% machine exoutact2;
-% action lower {}
-% action space {}
-% action term_word {}
-% action newline {}
+% machine lines2;
+% action first {}
+% action tail {}
+% word = [a-z]+;
\begin{inline_code}
\begin{verbatim}
-word = ( [a-z] @lower )+ %term_word;
-main := word ( ' ' @space word )* '\n' @newline;
+ws = [\t ];
+line = word $first ( ws word $tail )* '\n';
+lines = line*;
\end{verbatim}
\end{inline_code}
+% main := lines;
% }%%
% END GENERATE
-\graphspace
\begin{center}
-\includegraphics[scale=0.45]{exoutact2}
+\includegraphics[scale=0.45]{lines2}
\end{center}
-\graphspace
-
-In this final example of the action embedding operators, A is executed upon
-entering the alpha machine, B is executed on all transitions of the alpha
-machine, C is executed when the alpha machine accepts by moving into the
-newline machine and N is executed when the newline machine moves into a final
-state.
+Solving this kind of problem is straightforward when the ambiguity is created
+by strings which are a single character long. When the ambiguity is created by
+strings which are multiple characters long we have a more difficult problem.
+The following example is an incorrect attempt at a regular expression for C
+language comments.
-% GENERATE: exaction
+% GENERATE: comments1
% OPT: -p
% %%{
-% machine exaction;
-% action A {}
-% action B {}
-% action C {}
-% action N {}
+% machine comments1;
+% action comm {}
\begin{inline_code}
\begin{verbatim}
-# Execute A on starting the alpha machine, B on every transition
-# moving through it and C upon finishing. Execute N on the newline.
-main := ( lower* >A $B %C ) . '\n' @N;
+comment = '/*' ( any @comm )* '*/';
+main := comment ' ';
\end{verbatim}
\end{inline_code}
% }%%
% END GENERATE
-\graphspace
\begin{center}
-\includegraphics[scale=0.45]{exaction}
+\includegraphics[scale=0.45]{comments1}
\end{center}
-\graphspace
-
-
-\section{State Action Embedding Operators}
-
-The state embedding operators allow one to embed actions into states. Like the
-transition embedding operators, there are several different classes of states
-that the operators access. The meanings of the symbols are partially related to
-the meanings of the symbols used by the transition embedding operators.
-
-The state embedding operators are different from the transition embedding
-operators in that there are various kinds of events that embedded actions can
-be associated with, requiring them to be distinguished by these different types
-of events. The state embedding operators have two components. The first, which
-is the first one or two characters, specifies the class of states that the
-action will be embedded into. The second component specifies the type of event
-the action will be executed on.
-
-\def\fakeitem{\hspace*{12pt}$\bullet$\hspace*{10pt}}
-
-\begin{minipage}{\textwidth}
-\begin{multicols}{2}
-\raggedcolumns
-\noindent The different classes of states are:\\
-\fakeitem \verb|> | -- the start state \\
-\fakeitem \verb|$ | -- all states\\
-\fakeitem \verb|% | -- final states\\
-\fakeitem \verb|< | -- any state except the start state\\
-\fakeitem \verb|@ | -- any state except final states\\
-\fakeitem \verb|<>| -- any except start and final (middle)
-
-\columnbreak
-
-\noindent The different kinds of embeddings are:\\
-\fakeitem \verb|~| -- to-state actions\\
-\fakeitem \verb|*| -- from-state actions\\
-\fakeitem \verb|/| -- EOF actions\\
-\fakeitem \verb|!| -- error actions\\
-\fakeitem \verb|^| -- local error actions\\
-\end{multicols}
-\end{minipage}
-%\label{state-act-embed}
-%\caption{The two components of state embedding operators. The class of states
-%to select comes first, followed by the type of embedding.}
-%
-%\begin{figure}[t]
-%\centering
-%\includegraphics{stembed}
-%\caption{Summary of state manipulation operators}
-%\label{state-act-embed-chart}
-%\end{figure}
-
-%\noindent Putting these two components together we get a matrix of state
-%embedding operators. The entire set is given in Figure \ref{state-act-embed-chart}.
-
-
-\subsection{To-State and From-State Actions}
-
-\subsubsection{To-State Actions}
-
-\verb| >~ $~ %~ <~ @~ <>~ |
-\verbspace
-
-To-state actions are executed whenever the state machine moves into the
-specified state, either by a natural movement over a transition or by an
-action-based transfer of control such as \verb|fgoto|. They are executed after the
-in-transition's actions but before the current character is advanced and
-tested against the end of the input block. To-state embeddings stay with the
-state. They are irrespective of the state's current set of transitions and any
-future transitions that may be added in or out of the state.
-
-Note that the setting of the current state variable \verb|cs| outside of the
-execute code is not considered by Ragel as moving into a state and consequently
-the to-state actions of the new current state are not executed. This includes
-the initialization of the current state when the machine begins. This is
-because the entry point into the machine execution code is after the execution
-of to-state actions.
-\subsubsection{From-State Actions}
-
-\verb| >* $* %* <* @* <>* |
-\verbspace
-
-From-state actions are executed whenever the state machine takes a transition from a
-state, either to itself or to some other state. These actions are executed
-immediately after the current character is tested against the input block end
-marker and before the transition to take is sought based on the current
-character. From-state actions are therefore executed even if a transition
-cannot be found and the machine moves into the error state. Like to-state
-embeddings, from-state embeddings stay with the state.
-
-\subsection{EOF Actions}
-
-\verb| >/ $/ %/ </ @/ <>/ |
-\verbspace
-
-The EOF action embedding operators enable the user to embed EOF actions into
-different classes of
-states. EOF actions are stored in states and generated with the \verb|write eof|
-statement. The generated EOF code switches on the current state and executes the EOF
-actions associated with it.
-
-\subsection{Handling Errors}
-
-\subsubsection{Global Error Actions}
-
-\verb| >! $! %! <! @! <>! |
-\verbspace
-
-Error actions are stored in states until the final state machine has been fully
-constructed. They are then transferred to the transitions that move into the
-error state. This transfer entails the creation of a transition from the state
-to the error state that is taken on all input characters which are not already
-covered by the state's transitions. In other words it provides a default
-action. Error actions can induce a recovery by altering \verb|p| and then jumping back
-into the machine with \verb|fgoto|.
-
-\subsubsection{Local Error Actions}
-
-\verb| >^ $^ %^ <^ @^ <>^ |
-\verbspace
-
-Like global error actions, local error actions are also stored in states until
-a transfer point. The transfer point is different however. Each local error action
-embedding is associated with a name. When a machine definition has been fully
-constructed, all local error actions embeddings associated the same name as the
-machine are transferred to error transitions. Local error actions can be used
-to specify an action to take when a particular section of a larger state
-machine fails to make a match. A particular machine definition's ``thread'' may
-die and the local error actions executed, however the machine as a whole may
-continue to match input.
-
-There are two forms of local error action embeddings. In the first form the name defaults
-to the current machine. In the second form the machine name can be specified. This
-is useful when it is more convenient to specify the local error action in a
-sub-definition that is used to construct the machine definition where the
-transfer should happen. To embed local error actions and explicitly state the
-machine on which the transfer is to happen use \verb|(name, action)| as the
-action.
+Using standard concatenation, we will never leave the \verb|any*| expression.
+We will forever entertain the possibility that a \verb|'*/'| string that we see
+is contained in a longer comment and that, simultaneously, the comment has
+ended. The concatenation of the \verb|comment| machine with a space character is done
+to show this. When we match the space, we are also still matching the comment body.
-\begin{comment}
-\begin{itemize}
-\setlength{\parskip}{0in}
-\item \verb|expr >^ (name, action) | -- Start state.
-\item \verb|expr $^ (name, action) | -- All states.
-\item \verb|expr %^ (name, action) | -- Final states.
-\item \verb|expr <^ (name, action) | -- Not start state.
-\item \verb|expr <>^ (name, action)| -- Not start and not final states.
-\end{itemize}
-\end{comment}
+One way to approach the problem is to exclude the terminating string
+from the \verb|any*| expression using set difference. We must be careful to
+exclude not just the terminating string, but any string that contains it as a
+substring. A verbose, but proper specification of a C comment parser is given
+by the following regular expression.
-\section{Action Ordering and Duplicates}
+% GENERATE: comments2
+% OPT: -p
+% %%{
+% machine comments2;
+% action comm {}
+\begin{inline_code}
+\begin{verbatim}
+comment = '/*' ( ( any @comm )* - ( any* '*/' any* ) ) '*/';
+\end{verbatim}
+\end{inline_code}
+% main := comment;
+% }%%
+% END GENERATE
-When building a parser by combining smaller expressions which themselves have
-embedded actions, it is often the case that transitions are made which need to
-execute a number of actions on one input character. For example when we leave
-an expression, we may execute the expression's pending out action and the
-subsequent expression's starting action on the same input character. We must
-therefore devise a method for ordering actions that is both intuitive and
-predictable for the user and repeatable by the state machine compiler. The
-determinization processes cannot simply order actions by the time at which they
-are introduced into a transition -- otherwise the programmer will be at the
-mercy of luck.
+\begin{center}
+\includegraphics[scale=0.45]{comments2}
+\end{center}
-We associate with the embedding of each action a distinct timestamp which is
-used to order actions that appear together on a single transition in the final
-compiled state machine. To accomplish this we traverse the parse tree of
-regular expressions and assign timestamps to action embeddings. This algorithm
-is recursive in nature and quite simple. When it visits a parse tree node it
-assigns timestamps to all {\em starting} action embeddings, recurses on the
-parse tree, then assigns timestamps to the remaining {\em all}, {\em
-finishing}, and {\em leaving} embeddings in the order in which they appear.
-Ragel does not permit actions (defined or unnamed) to appear multiple times in
-an action list. When the final machine has been created, actions which appear
-more than once in single transition or EOF action list have their duplicates
-removed. The first appearance of the action is preserved. This is useful in a
-number of scenarios. First, it allows us to union machines with common
-prefixes without worrying about the action embeddings in the prefix being
-duplicated. Second, it prevents pending out actions from being transferred multiple times
-when a concatenation follows a kleene star and the two machines begin with a common
-character.
+We have phrased the problem of controlling nondeterminism in terms of
+excluding strings common to two expressions which interact when combined.
+We can also phrase the problem in terms of the transitions of the state
+machines that implement these expressions. During the concatenation of
+\verb|any*| and \verb|'*/'| we will be making transitions that are composed of
+both the loop of the first expression and the final character of the second.
+At this time we want the transition on the \verb|'/'| character to take precedence
+over and disallow the transition that originated in the \verb|any*| loop.
-\verbspace
+In another parsing problem, we wish to implement a lightweight tokenizer that we can
+utilize in the composition of a larger machine. For example, some HTTP headers
+have a token stream as a sub-language. The following example is an attempt
+at a regular expression-based tokenizer that does not function correctly due to
+unintended nondeterminism.
+
+% GENERATE: smallscanner
+% OPT: -p
+% %%{
+% machine smallscanner;
+% action start_str {}
+% action on_char {}
+% action finish_str {}
+\begin{inline_code}
\begin{verbatim}
-word = [a-z]+ %act;
-main := word ( '\n' word )* '\n\n';
+header_contents = (
+ lower+ >start_str $on_char %finish_str |
+ ' '
+)*;
\end{verbatim}
+\end{inline_code}
+% main := header_contents;
+% }%%
+% END GENERATE
-\section{Values and Statements Available in Code Blocks}
-\label{vals}
+\begin{center}
+\includegraphics[scale=0.45]{smallscanner}
+\end{center}
-\noindent The following values are available in code blocks:
+In this case, the problem with using a standard Kleene star operation is that
+there is an ambiguity between extending a token and wrapping around the machine
+to begin a new token. Using the standard operator, we get undesirable
+nondeterministic behaviour. Evidence of this can be seen on the transition out
+of state one to itself. The transition extends the string and, simultaneously,
+finishes the string only to immediately begin a new one. What is required is
+for the transitions that represent an extension of a token to take precedence over the
+transitions that represent the beginning of a new token. For this problem
+there is no simple solution that uses standard regular expression operators.
-\begin{itemize}
-\item \verb|fpc| -- A pointer to the current character. This is equivalent to
-accessing the \verb|p| variable.
+\section{Priorities}
-\item \verb|fc| -- The current character. This is equivalent to the expression \verb|(*p)|.
+A priority mechanism was devised and built into the determinization
+process, specifically for the purpose of allowing the user to control
+nondeterminism. Priorities are integer values embedded into transitions. When
+the determinization process is combining transitions that have different
+priorities, the transition with the higher priority is preserved and the
+transition with the lower priority is dropped.
-\item \verb|fcurs| -- An integer value representing the current state. This
-value should only be read from. To move to a different place in the machine
-from action code use the \verb|fgoto|, \verb|fnext| or \verb|fcall| statements.
-Outside of the machine execution code the \verb|cs| variable may be modified.
+Unfortunately, priorities can have unintended side effects because their
+operation requires that they linger in transitions indefinitely. They must linger
+because the Ragel program cannot know when the user is finished with a priority
+embedding. A solution whereby they are explicitly deleted after use is
+conceivable; however this is not very user-friendly. Priorities were therefore
+made into named entities. Only priorities with the same name are allowed to
+interact. This allows any number of priorities to coexist in one machine for
+the purpose of controlling various different regular expression operations and
+eliminates the need to ever delete them. Such a scheme allows the user to
+choose a unique name, embed two different priority values using that name
+and be confident that the priority embedding will be free of any side effects.
-\item \verb|ftargs| -- An integer value representing the target state. This
-value should only be read from. Again, \verb|fgoto|, \verb|fnext| and
-\verb|fcall| can be used to move to a specific entry point.
+\section{Priority Assignment}
-\item \verb|fentry(<label>)| -- Retrieve an integer value representing the
-entry point \verb|label|. The integer value returned will be a compile time
-constant. This number is suitable for later use in control flow transfer
-statements that take an expression. This value should not be compared against
-the current state because any given label can have multiple states representing
-it. The value returned by \verb|fentry| will be one of the possibly multiple states the
-label represents.
+Priorities are integer values assigned to names within transitions.
+Only priorities with the same name are allowed to interact. When the machine
+construction process is combining transitions that have different priorities
+assigned to the same name, the transition with the higher priority is preserved
+and the lower priority is dropped.
+
+In the first form of priority embedding the name defaults to the name of the machine
+definition that the priority is assigned in. In this sense priorities are by
+default local to the current machine definition or instantiation. Beware of
+using this form in a longest-match machine, since there is only one name for
+the entire set of longest match patterns. In the second form the priority's
+name can be specified, allowing priority interaction across machine definition
+boundaries.
+
+\begin{itemize}
+\setlength{\parskip}{0in}
+\item \verb|expr > int| -- Sets starting transitions to have priority int.
+\item \verb|expr @ int| -- Sets transitions that go into a final state to have priority int.
+\item \verb|expr $ int| -- Sets all transitions to have priority int.
+\item \verb|expr % int| -- Sets pending out transitions from final states to
+have priority int.\\ When a transition is made going out of the machine (either
+by concatenation or kleene star) its priority is immediately set to the pending
+out priority.
\end{itemize}
-\noindent The following statements are available in code blocks:
+The second form of priority assignment allows the programmer to specify the name
+to which the priority is assigned.
\begin{itemize}
+\setlength{\parskip}{0in}
+\item \verb|expr > (name, int)| -- Entering transitions.
+\item \verb|expr @ (name, int)| -- Transitions into final state.
+\item \verb|expr $ (name, int)| -- All transitions.
+\item \verb|expr % (name, int)| -- Pending out transitions.
+\end{itemize}
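+
+As a rough sketch of the named form, the C language comment problem from the
+discussion of nondeterminism could be written with explicit priorities. The
+name \verb|comm_pri| is a hypothetical choice made for this illustration; any
+name not used elsewhere in the specification would serve.
+
+\begin{verbatim}
+# comm_pri is an illustrative name, not something Ragel defines.
+# A low priority goes on all transitions of the opening and the any* loop,
+# and a higher priority on the transition that completes '*/'.
+comment = ( '/*' any* ) $(comm_pri,0) . '*/' @(comm_pri,1);
+\end{verbatim}
+\verbspace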
-\item \verb|fhold;| -- Do not advance over the current character. If processing
-data in multiple buffer blocks, the \verb|fhold| statement should only be used
-once in the set of actions executed on a character. Multiple calls may result
-in backing up over the beginning of the buffer block. The \verb|fhold|
-statement does not imply any transfer of control. In actions embedded into
-transitions, it is equivalent to the \verb|p--;| statement. In scanner pattern
-actions any changes made to \verb|p| are lost. In this context, \verb|fhold| is
-equivalent to \verb|tokend--;|.
+\section{Guarded Operators that Encapsulate Priorities}
-\item \verb|fexec <expr>;| -- Set the next character to process. This can be
-used to backtrack to previous input or advance ahead.
-Unlike \verb|fhold|, which can be used
-anywhere, \verb|fexec| requires the user to ensure that the target of the
-backtrack is in the current buffer block or is known to be somewhere ahead of
-it. The machine will continue iterating forward until \verb|pe| is arrived,
-\verb|fbreak| is called or the machine moves into the error state. In actions
-embedded into transitions, the \verb|fexec| statement is equivalent to setting
-\verb|p| to one position ahead of the next character to process. If the user
-also modifies \verb|pe|, it is possible to change the buffer block entirely.
-In scanner pattern actions any changes made to \verb|p| are lost. In this
-context, \verb|fexec| is equivalent to setting \verb|tokend| to the next
-character to process.
+Priority embeddings are a very expressive mechanism. At the same time they
+can be very confusing for the user. They force the user to imagine
+the transitions inside two interacting expressions and work out the precise
+effects of the operations between them. When we consider
+that this problem is worsened by the
+potential for side effects caused by unintended priority name collisions, we
+see that exposing the user to priorities is rather undesirable.
-\item \verb|fgoto <label>;| -- Jump to an entry point defined by
-\verb|<label>|. The \verb|fgoto| statement immediately transfers control to
-the destination state.
+Fortunately, in practice the use of priorities has been necessary only in a
+small number of scenarios. This allows us to encapsulate their functionality
+into a small set of operators and fully hide them from the user. This is
+advantageous from a language design point of view because it greatly simplifies
+the design.
-\item \verb|fgoto *<expr>;| -- Jump to an entry point given by \verb|<expr>|.
-The expression must evaluate to an integer value representing a state.
+Going back to the C comment example, we can now properly specify
+it using a guarded concatenation operator which we call {\em finish-guarded
+concatenation}. From the user's point of view, this operator terminates the
+first machine when the second machine moves into a final state. It chooses a
+unique name and uses it to embed a low priority into all
+transitions of the first machine. A higher priority is then embedded into the
+transitions of the second machine which enter into a final state. The following
+example yields a machine identical to the example in Section \ref{priorities}.
-\item \verb|fnext <label>;| -- Set the next state to be the entry point defined
-by \verb|label|. The \verb|fnext| statement does not immediately jump to the
-specified state. Any action code following the statement is executed.
+\begin{inline_code}
+\begin{verbatim}
+comment = '/*' ( any @comm )* :>> '*/';
+\end{verbatim}
+\end{inline_code}
-\item \verb|fnext *<expr>;| -- Set the next state to be the entry point given
-by \verb|<expr>|. The expression must evaluate to an integer value representing
-a state.
+Another guarded operator is {\em left-guarded concatenation}, given by the
+\verb|<:| compound symbol. This operator places a higher priority on all
+transitions of the first machine. This is useful if one must forcibly separate
+two lists that contain common elements. For example, one may need to tokenize a
+stream, but first consume leading whitespace.
+
+Ragel also includes a {\em longest-match kleene star} operator, given by the
+\verb|**| compound symbol. This
+guarded operator embeds a high
+priority into all transitions of the machine.
+A lower priority is then embedded into pending out transitions
+(in a manner similar to pending out action embeddings, described in Section
+\ref{out-actions}). When the kleene star operator makes the epsilon transitions from
+the final states into the start state, the lower priority will be transferred
+to the epsilon transitions. In cases where following an epsilon transition
+out of a final state conflicts with an existing transition out of a final
+state, the epsilon transition will be dropped.
+
+Other guarded operators are conceivable, such as guards on union that cause one
+alternative to take precedence over another. These may be implemented when it
+is clear they constitute a frequently used operation.
+In the next section we discuss the explicit specification of state machines
+using state charts.
-\item \verb|fcall <label>;| -- Push the target state and jump to the entry
-point defined by \verb|<label>|. The next \verb|fret| will jump to the target
-of the transition on which the call was made. Use of \verb|fcall| requires
-the declaration of a call stack. An array of integers named \verb|stack| and a
-single integer named \verb|top| must be declared. With the \verb|fcall|
-construct, control is immediately transferred to the destination state.
+\subsection{Entry-Guarded Concatenation}
-\item \verb|fcall *<expr>;| -- Push the current state and jump to the entry
-point given by \verb|<expr>|. The expression must evaluate to an integer value
-representing a state.
+\verb|expr :> expr|
+\verbspace
-\item \verb|fret;| -- Return to the target state of the transition on which the
-last \verb|fcall| was made. Use of \verb|fret| requires the declaration of a
-call stack with \verb|fstack| in the struct block. Control is immediately
-transferred to the destination state.
+This operator concatenates two machines, but first assigns a low
+priority to all transitions
+of the first machine and a high priority to the entering transitions of the
+second machine. This operator is useful if from the final states of the first
+machine, it is possible to accept the characters in the start transitions of
+the second machine. This operator effectively terminates the first machine
+immediately upon entering the second machine, where otherwise they would be
+pursued concurrently. In the following example, entry-guarded concatenation is
+used to move out of a machine that matches everything at the first sign of an
+end-of-input marker.
-\item \verb|fbreak;| -- Save the current state and immediately break out of the
-execute loop. This statement is useful in conjunction with the \verb|noend|
-write option. Rather than process input until the end marker of the input
-buffer is arrived at, the fbreak statement can be used to stop processing input
-upon seeing some end-of-string marker. It can also be used for handling
-exceptional circumstances. The fbreak statement does not change the pointer to
-the current character. After an \verb|fbreak| call the \verb|p| variable will point to
-the character that was being traversed over when the action was
-executed. The current state will be the target of the current transition.
+% GENERATE: entryguard
+% OPT: -p
+% %%{
+% machine entryguard;
+\begin{inline_code}
+\begin{verbatim}
+# Leave the catch-all machine on the first character of FIN.
+main := any* :> 'FIN';
+\end{verbatim}
+\end{inline_code}
+% }%%
+% END GENERATE
-\end{itemize}
+\begin{center}
+\includegraphics[scale=0.45]{entryguard}
+\end{center}
-\noindent {\bf Note:} Once actions with control-flow commands are embedded into a
-machine, the user must exercise caution when using the machine as the operand
-to other machine construction operators. If an action jumps to another state
-then unioning any transition that executes that action with another transition
-that follows some other path will cause that other path to be lost. Using
-commands that manually jump around a machine takes us out of the domain of
-regular languages because transitions that may be conditional and that the
-machine construction operators are not aware of are introduced. These
-commands should therefore be used with caution.
+Entry-guarded concatenation is equivalent to the following:
-\chapter{Controlling Nondeterminism}
-\label{controlling-nondeterminism}
+\verbspace
+\begin{verbatim}
+expr $(unique_name,0) . expr >(unique_name,1)
+\end{verbatim}
-Along with the flexibility of arbitrary action embeddings comes a need to
-control nondeterminism in regular expressions. If a regular expression is
-ambiguous, then sup-components of a parser other than the intended parts may become
-active. This means that actions which are irrelevant to the
-current subset of the parser may be executed, causing problems for the
-programmer.
+\subsection{Finish-Guarded Concatenation}
-Tools which are based on regular expression engines and which are used for
-recognition tasks will usually function as intended regardless of the presence
-of ambiguities. It is quite common for users of scripting languages to write
-regular expressions that are heavily ambiguous and it generally does not
-matter. As long as one of the potential matches is recognized, there can be any
-number of other matches present. In some parsing systems the run-time engine
-can employ a strategy for resolving ambiguities, for example always pursuing
-the longest possible match and discarding others.
+\verb|expr :>> expr|
+\verbspace
-In Ragel, there is no regular expression run-time engine, just a simple state
-machine execution model. When we begin to embed actions and face the
-possibility of spurious action execution, it becomes clear that controlling
-nondeterminism at the machine construction level is very important. Consider
-the following example.
+This operator is
+like the previous operator, except the higher priority is placed on the final
+transitions of the second machine. This is useful if one wishes to entertain
+the possibility of continuing to match the first machine right up until the
+second machine enters a final state. In other words it terminates the first
+machine only when the second accepts. In the following example, finish-guarded
+concatenation causes the move out of the machine that matches everything to be
+delayed until the full end-of-input marker has been matched.
-% GENERATE: lines1
+% GENERATE: finguard
% OPT: -p
% %%{
-% machine lines1;
-% action first {}
-% action tail {}
-% word = [a-z]+;
+% machine finguard;
\begin{inline_code}
\begin{verbatim}
-ws = [\n\t ];
-line = word $first ( ws word $tail )* '\n';
-lines = line*;
+# Leave the catch-all machine on the last character of FIN.
+main := any* :>> 'FIN';
\end{verbatim}
\end{inline_code}
-% main := lines;
% }%%
% END GENERATE
\begin{center}
-\includegraphics[scale=0.45]{lines1}
+\includegraphics[scale=0.45]{finguard}
\end{center}
-Since the \verb|ws| expression includes the newline character, we will
-not finish the \verb|line| expression when a newline character is seen. We will
-simultaneously pursue the possibility of matching further words on the same
-line and the possibility of matching a second line. Evidence of this fact is
-in the state tables. On several transitions both the \verb|first| and
-\verb|tail| actions are executed. The solution here is simple: exclude
-the newline character from the \verb|ws| expression.
+Finish-guarded concatenation is equivalent to the following:
-% GENERATE: lines2
+\verbspace
+\begin{verbatim}
+expr $(unique_name,0) . expr @(unique_name,1)
+\end{verbatim}
+
+\subsection{Left-Guarded Concatenation}
+
+\verb|expr <: expr|
+\verbspace
+
+This operator places
+a higher priority on the left expression. It is useful if you want to prefix a
+sequence with another sequence composed of some of the same characters. For
+example, one can consume leading whitespace before tokenizing a sequence of
+whitespace-separated words as in:
+
+% GENERATE: leftguard
% OPT: -p
% %%{
-% machine lines2;
-% action first {}
-% action tail {}
-% word = [a-z]+;
+% machine leftguard;
+% action alpha {}
+% action ws {}
+% action start {}
+% action fin {}
\begin{inline_code}
\begin{verbatim}
-ws = [\t ];
-line = word $first ( ws word $tail )* '\n';
-lines = line*;
+main := ( ' '* >start %fin ) <: ( ' ' $ws | [a-z] $alpha )*;
\end{verbatim}
\end{inline_code}
-% main := lines;
% }%%
% END GENERATE
\begin{center}
-\includegraphics[scale=0.45]{lines2}
+\includegraphics[scale=0.45]{leftguard}
\end{center}
-Solving this kind of problem is straightforward when the ambiguity is created
-by strings which are a single character long. When the ambiguity is created by
-strings which are multiple characters long we have a more difficult problem.
-The following example is an incorrect attempt at a regular expression for C
-language comments.
+Left-guarded concatenation is equivalent to the following:
-% GENERATE: comments1
+\verbspace
+\begin{verbatim}
+expr $(unique_name,1) . expr >(unique_name,0)
+\end{verbatim}
+\verbspace
+
+\subsection{Longest-Match Kleene Star}
+\label{longest_match_kleene_star}
+
+\verb|expr**|
+\verbspace
+
+This version of kleene star puts a higher priority on staying in the
+machine versus wrapping around and starting over. The LM kleene star is useful
+when writing simple tokenizers. These machines are built by applying the
+longest-match kleene star to an alternation of token patterns, as in the
+following.
+
+\verbspace
+
+% GENERATE: lmkleene
% OPT: -p
% %%{
-% machine comments1;
-% action comm {}
+% machine exfinpri;
+% action A {}
+% action B {}
\begin{inline_code}
\begin{verbatim}
-comment = '/*' ( any @comm )* '*/';
-main := comment ' ';
+# Repeat tokens, but make sure to get the longest match.
+main := (
+ lower ( lower | digit )* %A |
+ digit+ %B |
+ ' '
+)**;
\end{verbatim}
\end{inline_code}
% }%%
% END GENERATE
\begin{center}
-\includegraphics[scale=0.45]{comments1}
+\includegraphics[scale=0.45]{lmkleene}
\end{center}
-Using standard concatenation, we will never leave the \verb|any*| expression.
-We will forever entertain the possibility that a \verb|'*/'| string that we see
-is contained in a longer comment and that, simultaneously, the comment has
-ended. The concatenation of the \verb|comment| machine with \verb|SP| is done
-to show this. When we match space, we are also still matching the comment body.
+If a regular kleene star were used the machine above would not be able to
+distinguish between extending a word and beginning a new one. This operator is
+equivalent to:
-One way to approach the problem is to exclude the terminating string
-from the \verb|any*| expression using set difference. We must be careful to
-exclude not just the terminating string, but any string that contains it as a
-substring. A verbose, but proper specification of a C comment parser is given
-by the following regular expression.
+\verbspace
+\begin{verbatim}
+( expr $(unique_name,1) %(unique_name,0) )*
+\end{verbatim}
+\verbspace
+
+When the kleene star is applied, transitions are made out of the machine which
+go back into it. These are assigned a priority of zero by the pending out
+transition mechanism. This is less than the priority of the transitions out of
+the final states that do not leave the machine. When two transitions clash on
+the same character, the differing priorities cause the transition which
+stays in the machine to take precedence. The transition that wraps around is
+dropped.
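+
+Applied to the tokenizer above, the expansion might look like the following
+sketch. The priority name \verb|tok_pri| is an assumption made for
+illustration; the operator itself picks a unique name internally.
+
+\begin{verbatim}
+# Hand-expanded form of the ** example; tok_pri is a hypothetical name.
+main := (
+    (
+        lower ( lower | digit )* %A |
+        digit+ %B |
+        ' '
+    ) $(tok_pri,1) %(tok_pri,0)
+)*;
+\end{verbatim}
+\verbspace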
+
+Note that this operator does not build a scanner in the traditional sense
+because there is never any backtracking. To build a scanner in the traditional
+sense use the Longest-Match machine construction described in Section
+\ref{generating-scanners}.
+
+\chapter{Interface to Host Program}
+
+\section{Alphtype Statement}
+
+\begin{verbatim}
+alphtype unsigned int;
+\end{verbatim}
+\verbspace
+
+The alphtype statement specifies the alphabet data type that the machine
+operates on. During the compilation of the machine, integer literals are expected to
+be in the range of possible values of the alphtype. Supported alphabet types
+are \verb|char|, \verb|unsigned char|, \verb|short|, \verb|unsigned short|,
+\verb|int|, \verb|unsigned int|, \verb|long|, and \verb|unsigned long|.
+The default is \verb|char|.
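+
+A minimal sketch of how the alphabet type bounds integer literals follows. The
+machine name \verb|bytes| and the definitions in it are hypothetical.
+
+\begin{verbatim}
+%%{
+    machine bytes;
+    alphtype unsigned char;
+
+    # 0x80 and 0xff both fit in an unsigned char, so this range lies
+    # within the alphabet. With a signed char alphabet these literals
+    # would likely fall outside the permitted range.
+    high = 0x80..0xff;
+    main := ( ascii | high )*;
+}%%
+\end{verbatim}
+\verbspace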
+
+\section{Getkey Statement}
-% GENERATE: comments2
-% OPT: -p
-% %%{
-% machine comments2;
-% action comm {}
-\begin{inline_code}
\begin{verbatim}
-comment = '/*' ( ( any @comm )* - ( any* '*/' any* ) ) '*/';
+getkey fpc->id;
\end{verbatim}
-\end{inline_code}
-% main := comment;
-% }%%
-% END GENERATE
+\verbspace
-\begin{center}
-\includegraphics[scale=0.45]{comments2}
-\end{center}
+Specify to Ragel how to retrieve the character that the machine operates on
+from the pointer to the current element (\verb|p|). Any expression that returns
+a value of the alphabet type
+may be used. The getkey statement may be used for looking into element
+structures or for translating the character to process. The getkey expression
+defaults to \verb|(*p)|. In goto-driven machines the getkey expression may be
+evaluated more than once per element processed, so it should neither incur a
+large cost nor preclude optimization.
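+
+For instance, a machine that runs over an array of structures rather than
+characters could be set up roughly as follows. The \verb|struct element| type,
+its fields and the machine name \verb|elems| are assumptions made for this
+sketch.
+
+\begin{verbatim}
+/* Hypothetical element type; the machine runs on the id field. */
+struct element { int id; const char *text; };
+
+%%{
+    machine elems;
+    alphtype int;
+    # p is a struct element*, so fpc->id yields the value to match on.
+    getkey fpc->id;
+    main := ( 1 2 3 )*;
+}%%
+\end{verbatim}
+\verbspace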
+\section{Access Statement}
-We have phrased the problem of controlling non-determinism in terms of
-excluding strings common to two expressions which interact when combined.
-We can also phrase the problem in terms of the transitions of the state
-machines that implement these expressions. During the concatenation of
-\verb|any*| and \verb|'*/'| we will be making transitions that are composed of
-both the loop of the first expression and the final character of the second.
-At this time we want the transition on the \verb|'/'| character to take precedence
-over and disallow the transition that originated in the \verb|any*| loop.
+\begin{verbatim}
+access fsm->;
+\end{verbatim}
+\verbspace
-In another parsing problem, we wish to implement a lightweight tokenizer that we can
-utilize in the composition of a larger machine. For example, some HTTP headers
-have a token stream as a sub-language. The following example is an attempt
-at a regular expression-based tokenizer that does not function correctly due to
-unintended nondeterminism.
+The access statement allows one to tell Ragel how the generated code should
+access the machine data that is persistent across processing buffer blocks.
+This covers all variables except \verb|p| and \verb|pe|, namely
+\verb|cs|, \verb|top|, \verb|stack|, \verb|tokstart|, \verb|tokend| and \verb|act|.
+This is useful if a machine is to be encapsulated inside a
+structure in C code. The access statement can be used to give the name of
+a pointer to the structure.
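+
+A small sketch of this arrangement is given below. The structure
+\verb|struct parser|, the function \verb|parser_init| and the machine name
+\verb|foo| are hypothetical names chosen for illustration.
+
+\begin{verbatim}
+/* Hypothetical structure holding the persistent machine state. */
+struct parser { int cs; };
+
+%%{
+    machine foo;
+    # Generated code refers to fsm->cs rather than cs.
+    access fsm->;
+    main := [a-z]+;
+}%%
+
+%% write data;
+
+void parser_init( struct parser *fsm )
+{
+    %% write init;
+}
+\end{verbatim}
+\verbspace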
+
+\section{Write Statement}
+\label{write-statement}
-% GENERATE: smallscanner
-% OPT: -p
-% %%{
-% machine smallscanner;
-% action start_str {}
-% action on_char {}
-% action finish_str {}
-\begin{inline_code}
\begin{verbatim}
-header_contents = (
- lower+ >start_str $on_char %finish_str |
- ' '
-)*;
+write <component> [options];
\end{verbatim}
-\end{inline_code}
-% main := header_contents;
-% }%%
-% END GENERATE
-
-\begin{center}
-\includegraphics[scale=0.45]{smallscanner}
-\end{center}
+\verbspace
-In this case, the problem with using a standard kleene star operation is that
-there is an ambiguity between extending a token and wrapping around the machine
-to begin a new token. Using the standard operator, we get an undesirable
-nondeterministic behaviour. Evidence of this can be seen on the transition out
-of state one to itself. The transition extends the string, and simultaneously,
-finishes the string only to immediately begin a new one. What is required is
-for the
-transitions that represent an extension of a token to take precedence over the
-transitions that represent the beginning of a new token. For this problem
-there is no simple solution that uses standard regular expression operators.
-\section{Priorities}
+The write statement is used to generate parts of the machine.
+There are four
+components that can be generated by a write statement. These components are the
+state machine's data, initialization code, execution code and EOF action
+execution code. A write statement may appear before a machine is fully defined.
+This allows one to write out the data first then later define the machine where
+it is used. An example of this is shown in Figure \ref{fbreak-example}.
-A priority mechanism was devised and built into the determinization
-process, specifically for the purpose of allowing the user to control
-nondeterminism. Priorities are integer values embedded into transitions. When
-the determinization process is combining transitions that have different
-priorities, the transition with the higher priority is preserved and the
-transition with the lower priority is dropped.
+\subsection{Write Data}
+\begin{verbatim}
+write data [options];
+\end{verbatim}
+\verbspace
-Unfortunately, priorities can have unintended side effects because their
-operation requires that they linger in transitions indefinitely. They must linger
-because the Ragel program cannot know when the user is finished with a priority
-embedding. A solution whereby they are explicitly deleted after use is
-conceivable; however this is not very user-friendly. Priorities were therefore
-made into named entities. Only priorities with the same name are allowed to
-interact. This allows any number of priorities to coexist in one machine for
-the purpose of controlling various different regular expression operations and
-eliminates the need to ever delete them. Such a scheme allows the user to
-choose a unique name, embed two different priority values using that name
-and be confident that the priority embedding will be free of any side effects.
+The write data statement causes Ragel to emit the constant static data needed
+by the machine. In table-driven output styles (see Section \ref{genout}) this
+is a collection of arrays that represent the states and transitions of the
+machine. In goto-driven machines much less data is emitted. At the very
+minimum a start state \verb|name_start| is generated. All variables written
+out in machine data have both the \verb|static| and \verb|const| properties and
+are prefixed with the name of the machine and an
+underscore. The data can be placed inside a class, inside a function, or it can
+be defined as global data.
-\section{Priority Assignment}
+Two variables are written that may be used to test the state of the machine
+after a buffer block has been processed. The \verb|name_error| variable gives
+the id of the state that the machine moves into when it cannot find a valid
+transition to take. The machine immediately breaks out of the processing loop when
+it finds itself in the error state. The error variable can be compared to the
+current state to determine if the machine has failed to parse the input. If the
+machine is complete, that is from every state there is a transition to a proper
+state on every possible character of the alphabet, then no error state is required
+and this variable will be set to -1.
-Priorities are integer values assigned to names within transitions.
-Only priorities with the same name are allowed to interact. When the machine
-construction process is combining transitions that have different priorities
-assiged to the same name, the transition with the higher priority is preserved
-and the lower priority is dropped.
+The \verb|name_first_final| variable stores the id of the first final state. All of the
+machine's states are sorted by their final state status before having their ids
+assigned. Checking if the machine has accepted its input can then be done by
+checking if the current state is greater than or equal to the first final
+state.
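+
+Assuming a machine named \verb|foo| and a current state variable \verb|cs|,
+the two tests described above might be written as in the following sketch.
+The return values are arbitrary.
+
+\begin{verbatim}
+/* Sketch only: foo is a hypothetical machine name. */
+if ( cs == foo_error )
+    return -1;   /* no valid transition was found */
+else if ( cs >= foo_first_final )
+    return 1;    /* stopped in a final state: input accepted */
+else
+    return 0;    /* not failed, but not (yet) accepted */
+\end{verbatim}
+\verbspace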
-In the first form of priority embedding the name defaults to the name of the machine
-definition that the priority is assigned in. In this sense priorities are by
-default local to the current machine definition or instantiation. Beware of
-using this form in a longest-match machine, since there is only one name for
-the entire set of longest match patterns. In the second form the priority's
-name can be specified, allowing priority interaction across machine definition
-boundaries.
+Data generation has several options:
\begin{itemize}
-\setlength{\parskip}{0in}
-\item \verb|expr > int| -- Sets starting transitions to have priority int.
-\item \verb|expr @ int| -- Sets transitions that go into a final state to have priority int.
-\item \verb|expr $ int| -- Sets all transitions to have priority int.
-\item \verb|expr % int| -- Sets pending out transitions from final states to
-have priority int.\\ When a transition is made going out of the machine (either
-by concatenation or kleene star) its priority is immediately set to the pending
-out priority.
+\item \verb|noerror| -- Do not generate the integer variable that gives the
+id of the error state.
+\item \verb|nofinal| -- Do not generate the integer variable that gives the
+id of the first final state.
+\item \verb|noprefix| -- Do not prefix the variable names with the name of the
+machine.
\end{itemize}
-The second form of priority assignment allows the programmer to specify the name
-to which the priority is assigned.
+\subsection{Write Init}
+\begin{verbatim}
+write init;
+\end{verbatim}
+\verbspace
-\begin{itemize}
-\setlength{\parskip}{0in}
-\item \verb|expr > (name, int)| -- Entering transitions.
-\item \verb|expr @ (name, int)| -- Transitions into final state.
-\item \verb|expr $ (name, int)| -- All transitions.
-\item \verb|expr % (name, int)| -- Pending out transitions.
-\end{itemize}
+The write init statement causes Ragel to emit initialization code. This should
+be executed once before the machine is started. At a very minimum this sets the
+current state to the start state. If other variables are needed by the
+generated code, such as call
+stack variables or longest-match management variables, they are also
+initialized here.
-\section{Guarded Operators that Encapsulate Priorities}
+\subsection{Write Exec}
+\begin{verbatim}
+write exec [options];
+\end{verbatim}
+\verbspace
-Priorities embeddings are a very expressive mechanism. At the same time they
-can be very confusing for the user. They force the user to imagine
-the transitions inside two interacting expressions and work out the precise
-effects of the operations between them. When we consider
-that this problem is worsened by the
-potential for side effects caused by unintended priority name collisions, we
-see that exposing the user to priorities is rather undesirable.
+The write exec statement causes Ragel to emit the state machine's execution code.
+Ragel expects several variables to be available to this code. At a very minimum, the
+generated code needs access to the current character position \verb|p|, the ending
+position \verb|pe| and the current state \verb|cs|, though \verb|pe|
+can be excluded by specifying the \verb|noend| write option.
+The \verb|p| variable is the cursor that the execute code will
+use to traverse the input. The \verb|pe| variable should be set up to point to one
+position past the last valid character in the buffer.
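+
+A sketch of the minimum setup, assuming a machine named \verb|foo| has been
+defined elsewhere and that \verb|buf| and \verb|len| (hypothetical names)
+describe the input buffer:
+
+\begin{verbatim}
+int cs;
+char *p = buf;         /* cursor: first character to process */
+char *pe = buf + len;  /* one past the last valid character */
+
+%% write init;
+%% write exec;
+\end{verbatim}
+\verbspace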
-Fortunately, in practice the use of priorities has been necessary only in a
-small number of scenarios. This allows us to encapsulate their functionality
-into a small set of operators and fully hide them from the user. This is
-advantageous from a language design point of view because it greatly simplifies
-the design.
+Other variables are needed when certain features are used. For example using
+the \verb|fcall| or \verb|fret| statements requires \verb|stack| and
+\verb|top| variables to be defined. If a longest-match construction is used,
+variables for managing backtracking are required.
-Going back to the C comment example, we can now properly specify
-it using a guarded concatenation operator which we call {\em finish-guarded
-concatenation}. From the user's point of view, this operator terminates the
-first machine when the second machine moves into a final state. It chooses a
-unique name and uses it to embed a low priority into all
-transitions of the first machine. A higher priority is then embedded into the
-transitions of the second machine which enter into a final state. The following
-example yields a machine identical to the example in Section \ref{priorities}
+The write exec statement has one option. The \verb|noend| option tells Ragel
+to generate code that ignores the end position \verb|pe|. In this
+case the user must explicitly break out of the processing loop using
+\verb|fbreak|, otherwise the machine will continue to process characters until
+it moves into the error state. This option is useful if one wishes to process a
+null-terminated string. Rather than traverse the string to discover the length
+before processing the input, the user can break out when the null character is
+seen. The example in Figure \ref{fbreak-example} shows the use of the
+\verb|noend| write option and the \verb|fbreak| statement for processing a string.
-\begin{inline_code}
+\begin{figure}
+\small
+\begin{verbatim}
+#include <stdio.h>
+%% machine foo;
+int main( int argc, char **argv )
+{
+ %% write data noerror nofinal;
+ int cs, res = 0;
+ if ( argc > 1 ) {
+ char *p = argv[1];
+ %%{
+ main :=
+ [a-z]+
+ 0 @{ res = 1; fbreak; };
+ write init;
+ write exec noend;
+ }%%
+ }
+ printf("execute = %i\n", res );
+ return 0;
+}
+\end{verbatim}
+\caption{Use of {\tt noend} write option and the {\tt fbreak} statement for
+processing a string.}
+\label{fbreak-example}
+\end{figure}
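For comparison with Figure \ref{fbreak-example}, the following hand-written C
function sketches the control flow that the \verb|noend| option makes possible.
It is not Ragel output: the function name \verb|match_lower_nul| and the state
numbering are assumptions made for illustration. The loop has no end pointer
and stops only at the null character or at a character the machine cannot
accept.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-in for the machine  [a-z]+ 0  (not generated by Ragel).
 * Returns 1 if the string is one or more lowercase letters followed by the
 * terminating null character, 0 otherwise. No length is computed up front. */
int match_lower_nul( const char *p )
{
    int cs = 0;                  /* 0: start, 1: at least one letter seen */
    assert( p != NULL );
    for ( ;; p++ ) {
        if ( *p >= 'a' && *p <= 'z' )
            cs = 1;              /* stay in, or move to, the letter state */
        else if ( *p == 0 && cs == 1 )
            return 1;            /* plays the role of fbreak on the 0 */
        else
            return 0;            /* plays the role of the error state */
    }
}
```

The string is traversed exactly once, which is the saving the \verb|noend|
option is meant to offer over measuring the string first.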
+
+
+\subsection{Write EOF Actions}
\begin{verbatim}
-comment = '/*' ( any @comm )* :>> '*/';
+write eof;
\end{verbatim}
-\end{inline_code}
+\verbspace
-Another guarded operator is {\em left-guarded concatenation}, given by the
-\verb|<:| compound symbol. This operator places a higher priority on all
-transitions of the first machine. This is useful if one must forcibly separate
-two lists that contain common elements. For example, one may need to tokenize a
-stream, but first consume leading whitespace.
+The write EOF statement causes Ragel to emit code that executes EOF actions.
+This write statement is only relevant if EOF actions have been embedded,
+otherwise it does not generate anything. The EOF action code requires access to
+the current state.
-Ragel also includes a {\em longest-match kleene star} operator, given by the
-\verb|**| compound symbol. This
-guarded operator embeds a high
-priority into all transitions of the machine.
-A lower priority is then embedded into pending out transitions
-(in a manner similar to pending out action embeddings, described in Section
-\ref{out-actions}). When the kleene star operator makes the epsilon transitions from
-the final states into the start state, the lower priority will be transferred
-to the epsilon transitions. In cases where following an epsilon transition
-out of a final state conflicts with an existing transition out of a final
-state, the epsilon transition will be dropped.
+\section{Maintaining Pointers to Input Data}
-Other guarded operators are conceivable, such as guards on union that cause one
-alternative to take precedence over another. These may be implemented when it
-is clear they constitute a frequently used operation.
-In the next section we discuss the explicit specification of state machines
-using state charts.
+In the creation of any parser it is not uncommon to require the collection of
+the data being parsed. It is always possible to collect data into a growable
+buffer as the machine moves over it, however the copying of data is a somewhat
+wasteful use of processor cycles. The most efficient way to collect data
+from the parser is to set pointers into the input. This poses a problem for
+uses of Ragel where the input data arrives in blocks, such as over a socket or
+from a file. The program will error if a pointer is set in one buffer block but
+must be used while parsing a following buffer block.
-\subsection{Entry-Guarded Contatenation}
+The scanner constructions exhibit this problem, requiring the maintenance
+code described in Section \ref{generating-scanners}. If a longest-match
+construction has been used somewhere in the machine then it is possible to
+take advantage of the required prefix maintenance code in the driver program to
+ensure pointers to the input are always valid. If laying down a pointer one can
+set \verb|tokstart| at the same spot or ahead of it. When data is shifted in
+between loops the user must also shift the pointer. In this way it is possible
+to maintain pointers to the input that will always be consistent.
-\verb|expr :> expr|
-\verbspace
+\begin{figure}
+\small
+\begin{verbatim}
+ int have = 0;
+ while ( 1 ) {
+ char *p, *pe, *data = buf + have;
+ int len, space = BUFSIZE - have;
-This operator concatenates two machines, but first assigns a low
-priority to all transitions
-of the first machine and a high priority to the entering transitions of the
-second machine. This operator is useful if from the final states of the first
-machine, it is possible to accept the characters in the start transitions of
-the second machine. This operator effectively terminates the first machine
-immediately upon entering the second machine, where otherwise they would be
-pursued concurrently. In the following example, entry-guarded concatenation is
-used to move out of a machine that matches everything at the first sign of an
-end-of-input marker.
+ if ( space == 0 ) {
+ fprintf(stderr, "BUFFER OUT OF SPACE\n");
+ exit(1);
+ }
-% GENERATE: entryguard
-% OPT: -p
-% %%{
-% machine entryguard;
-\begin{inline_code}
-\begin{verbatim}
-# Leave the catch-all machine on the first character of FIN.
-main := any* :> 'FIN';
-\end{verbatim}
-\end{inline_code}
-% }%%
-% END GENERATE
+ len = fread( data, 1, space, stdin );
+ if ( len == 0 )
+ break;
-\begin{center}
-\includegraphics[scale=0.45]{entryguard}
-\end{center}
+ /* Find the last newline by searching backwards. */
+ p = buf;
+ pe = data + len - 1;
+ while ( pe >= buf && *pe != '\n' )
+ pe--;
+ pe += 1;
+ %% write exec;
-Entry-guarded concatenation is equivalent to the following:
+ /* How much is still in the buffer? */
+ have = data + len - pe;
+ if ( have > 0 )
+ memmove( buf, pe, have );
-\verbspace
-\begin{verbatim}
-expr $(unique_name,0) . expr >(unique_name,1)
+ if ( len < space )
+ break;
+ }
\end{verbatim}
+\caption{An example of line-oriented processing.}
+\label{line-oriented}
+\end{figure}
-\subsection{Finish-Guarded Contatenation}
-
-\verb|expr :>> expr|
-\verbspace
+In general, there are two approaches for guaranteeing the consistency of
+pointers to input data. The first approach is the one just described:
+lay down a marker from an action,
+then later ensure that the data the marker points to is preserved ahead of
+the buffer on the next execute invocation. This approach is good because it
+allows the parser to decide on the pointer-use boundaries, which can be
+arbitrarily complex parsing conditions. A downside is that it requires any
+pointers that are set to be corrected in between execute invocations.
-This operator is
-like the previous operator, except the higher priority is placed on the final
-transitions of the second machine. This is useful if one wishes to entertain
-the possibility of continuing to match the first machine right up until the
-second machine enters a final state. In other words it terminates the first
-machine only when the second accepts. In the following example, finish-guarded
-concatenation causes the move out of the machine that matches everything to be
-delayed until the full end-of-input marker has been matched.
+The alternative is to find the pointer-use boundaries before invoking the execute
+routine, then pass in the data using these boundaries. For example, if the
+program must perform line-oriented processing, the user can scan backwards from
+the end of an input block that has just been read in and process only up to the
+first found newline. On the next input read, the new data is placed after the
+partially read line and processing continues from the beginning of the line.
+An example of line-oriented processing is given in Figure \ref{line-oriented}.
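The boundary search and the carry-over of the partial line can be isolated
into two small helpers. This is a minimal sketch under the assumption that the
whole block sits in one contiguous buffer; the names
\verb|complete_lines_len| and \verb|keep_partial_line| are hypothetical and do
not come from Ragel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the number of bytes at the start of buf[0..len) that form complete
 * lines, that is, the offset one past the last newline, or 0 if none. */
size_t complete_lines_len( const char *buf, size_t len )
{
    assert( buf != NULL || len == 0 );
    while ( len > 0 && buf[len - 1] != '\n' )
        len--;
    return len;
}

/* Moves the unprocessed tail buf[used..total) to the front of the buffer and
 * returns how many bytes were kept. New data is then read in after them. */
size_t keep_partial_line( char *buf, size_t used, size_t total )
{
    size_t have;
    assert( buf != NULL && used <= total );
    have = total - used;
    if ( have > 0 )
        memmove( buf, buf + used, have );
    return have;
}
```

The first helper yields the offset at which \verb|pe| should be placed before
running the execute code; the second yields the offset at which the next read
should begin.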
-% GENERATE: finguard
-% OPT: -p
-% %%{
-% machine finguard;
-\begin{inline_code}
-\begin{verbatim}
-# Leave the catch-all machine on the last character of FIN.
-main := any* :>> 'FIN';
-\end{verbatim}
-\end{inline_code}
-% }%%
-% END GENERATE
-\begin{center}
-\includegraphics[scale=0.45]{finguard}
-\end{center}
+\section{Running the Executables}
-Finish-guarded concatenation is equivalent to the following:
+Ragel is broken down into two executables: a frontend which compiles machines
+and emits them in an XML format, and a backend which generates code or a
+Graphviz Dot file from the XML data. The purpose of the XML-based intermediate
+format is to allow users to inspect their compiled state machines and to
+interface Ragel to other tools such as custom visualizers, code generators or
+analysis tools. The intermediate format will provide a better platform for
+extending Ragel to support new host languages. The split also serves to reduce
+complexity of the Ragel program by strictly separating the data structures and
+algorithms that are used to compile machines from those that are used to
+generate code.
\verbspace
\begin{verbatim}
-expr $(unique_name,0) . expr @(unique_name,1)
+[user@host] myproj: ragel file.rl | rlcodegen -G2 -o file.c
\end{verbatim}
-\subsection{Left-Guarded Concatenation}
+\section{Choosing a Generated Code Style}
+\label{genout}
-\verb|expr <: expr|
-\verbspace
+The Ragel code generator is very flexible. Following the lead of Re2C, the
+generated code has no dependencies and can be inserted in any function, perhaps
+inside a loop if so desired. The user is responsible for declaring and
+initializing a number of required variables, including the current state and
+the pointer to the input stream. The user may break out of the processing loop
+and return to it at any time.
+
+Ragel is able to generate very fast-running code that implements state machines
+as directly executable code. Since very large files strain the host language
+compiler, table-based code generation is also supported. In the future we hope
+to provide a partitioned, directly executable format which is able to reduce the
+burden on the host compiler by splitting large machines across multiple functions.
+
+Ragel can be used to parse input in one block, or it can be used to parse input
+in a sequence of blocks as it arrives from a file or socket. Parsing the
+input in a sequence of blocks brings with it a few responsibilities. If the parser
+utilizes a scanner, care must be taken to not break the input stream anywhere
+but token boundaries. If pointers to the input stream are taken during parsing,
+care must be taken to not use a pointer which has been invalidated by movement
+to a subsequent block.
+If the current input data pointer is moved backwards it must not be moved
+past the beginning of the current block.
+Strategies for handling these scenarios are given in Ragel's manual.
-This operator places
-a higher priority on the left expression. It is useful if you want to prefix a
-sequence with another sequence composed of some of the same characters. For
-example, one can consume leading whitespace before tokenizing a sequence of
-whitespace-separated words as in:
+There are three styles of code output to choose from. Code style affects the
+size and speed of the compiled binary. Changing code style does not require any
+change to the Ragel program. There are two table-driven formats and a goto
+driven format.
-% GENERATE: leftguard
-% OPT: -p
-% %%{
-% machine leftguard;
-% action alpha {}
-% action ws {}
-% action start {}
-% action fin {}
+In addition to choosing a style to emit, there are various levels of action
+code reuse to choose from. The maximum reuse levels (\verb|-T0|, \verb|-F0|
+and \verb|-G0|) ensure that no FSM action code is ever duplicated by encoding
+each transition's action list as static data and iterating
+through the lists on every transition. This will normally result in a smaller
+binary. The lower-reuse options (\verb|-T1|, \verb|-F1| and \verb|-G1|)
+will usually produce faster running code by expanding each transition's action
+list into a single block of code, eliminating the need to iterate through the
+lists. This duplicates action code instead of generating the logic necessary
+for reuse. Consequently the binary will be larger. However, this tradeoff applies to
+machines with moderate to dense action lists only. If a machine's transitions
+frequently have fewer than two actions then the lower-reuse options will actually
+produce both a smaller and a faster running binary due to less action sharing
+overhead. The best way to choose the appropriate code style for your
+application is to perform your own tests.
+
+The table-driven FSM represents the state machine as constant static data. There are
+tables of states, transitions, indices and actions. The current state is
+stored in a variable. The execution is simply a loop that looks up the current
+state, looks up the transition to take, executes any actions and moves to the
+target state. In general, the table-driven FSM can handle any machine, produces
+a smaller binary and requires a less expensive host language compile, but
+results in slower running code. Since the table-driven format is the most
+flexible it is the default code style.
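The lookup loop described above can be sketched by hand for a very small
machine. The tables below are written manually for the pattern
\verb|'a' ( 'b' | 'c' )* 'd'| and only imitate the general shape of a
binary-search table machine. The names \verb|keys|, \verb|targs| and
\verb|key_offsets| are assumptions, not the identifiers Ragel emits, and
actions are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { START = 0, FINAL = 2, ERROR = -1 };

/* State s owns the sorted keys in keys[key_offsets[s] .. key_offsets[s+1]).
 * targs[i] is the target state of the transition on keys[i]. */
static const char keys[]        = { 'a', 'b', 'c', 'd' };
static const int  targs[]       = {  1,   1,   1,   2  };
static const int  key_offsets[] = { 0, 1, 4, 4 };

/* Binary search of the current state's keys for the character c. */
static int next_state( int cs, char c )
{
    int lo = key_offsets[cs], hi = key_offsets[cs + 1] - 1;
    while ( lo <= hi ) {
        int mid = lo + ( hi - lo ) / 2;
        if ( c < keys[mid] )
            hi = mid - 1;
        else if ( c > keys[mid] )
            lo = mid + 1;
        else
            return targs[mid];
    }
    return ERROR;
}

/* Returns 1 if p[0..len) is accepted by the machine, 0 otherwise. */
int table_exec( const char *p, size_t len )
{
    const char *pe = p + len;
    int cs = START;
    assert( p != NULL );
    for ( ; p < pe && cs != ERROR; p++ )
        cs = next_state( cs, *p );
    return cs == FINAL;
}
```

All knowledge of the machine lives in the constant arrays, so the loop itself
stays the same size no matter how large the machine becomes.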
+
+The flat table-driven machine is a table-based machine that is optimized for
+small alphabets. Where the regular table machine uses the current character as
+the key in a binary search for the transition to take, the flat table machine
+uses the current character as an index into an array of transitions. This is
+faster in general, but it is only suitable if the span of possible characters
+is small.
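The difference from the binary-search form can be seen in a hand-written flat
table for the pattern \verb|'a' ( 'b' | 'c' )* 'd'|. This is an illustrative
sketch only: the array name \verb|flat| and the constants are assumptions, and
the real generated tables are laid out differently. Each state has one slot
per character in the span, so the table grows with the width of the span even
when most slots lead to the error state.

```c
#include <assert.h>
#include <stddef.h>

enum { START = 0, FINAL = 2, ERR = -1, LOW = 'a', HIGH = 'd' };

/* One row per state, one column per character in the span LOW..HIGH. */
static const int flat[3][HIGH - LOW + 1] = {
    /*           a    b    c    d  */
    /* 0 */ {    1, ERR, ERR, ERR },
    /* 1 */ {  ERR,   1,   1,   2 },
    /* 2 */ {  ERR, ERR, ERR, ERR },
};

/* Returns 1 if p[0..len) is accepted by the machine, 0 otherwise. */
int flat_exec( const char *p, size_t len )
{
    const char *pe = p + len;
    int cs = START;
    assert( p != NULL );
    for ( ; p < pe && cs != ERR; p++ ) {
        if ( *p < LOW || *p > HIGH )
            cs = ERR;                    /* outside the span */
        else
            cs = flat[cs][*p - LOW];     /* direct index, no search */
    }
    return cs == FINAL;
}
```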
+
+The goto-driven FSM represents the state machine using goto and switch
+statements. The execution is a flat code block where the transition to take is
+computed using switch statements and directly executable binary searches. In
+general, the goto FSM produces faster code but results in a larger binary and a
+more expensive host language compile.
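A hand-written approximation of the goto-driven form for the pattern
\verb|'a' ( 'b' | 'c' )* 'd'| is sketched below. It is not Ragel output and
omits actions; the label names are assumptions. There is no state variable:
the position in the code records the current state.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if p[0..len) is accepted by the machine, 0 otherwise. */
int goto_exec( const char *p, size_t len )
{
    const char *pe = p + len;
    assert( p != NULL );

    /* Start state: only 'a' is allowed. */
    if ( p == pe )
        return 0;
    if ( *p++ != 'a' )
        return 0;

st1:
    /* Loop on 'b' and 'c', leave on 'd'. */
    if ( p == pe )
        return 0;
    switch ( *p++ ) {
        case 'b': case 'c': goto st1;
        case 'd': goto st2;
        default:  return 0;
    }

st2:
    /* Final state with no outgoing transitions. */
    return p == pe;
}
```

Every state contributes its own block of code, which is consistent with the
larger binary and the heavier host language compile noted above.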
+
+The goto-driven format has an additional action reuse level (\verb|-G2|) that
+writes actions directly into the state transitioning logic rather than putting
+all the actions together into a single switch. Generally this produces faster
+running code because it allows the machine to encode the current state using
+the processor's instruction pointer. Again, sparse machines may actually
+compile to smaller binaries when \verb|-G2| is used due to less state and
+action management overhead. For many parsing applications \verb|-G2| is the
+preferred output format.
+
+\verbspace
+\begin{center}
+\begin{tabular}{|c|c|}
+\hline
+\multicolumn{2}{|c|}{\bf Code Output Style Options} \\
+\hline
+\verb|-T0|&binary search table-driven\\
+\hline
+\verb|-T1|&binary search, expanded actions\\
+\hline
+\verb|-F0|&flat table-driven\\
+\hline
+\verb|-F1|&flat table, expanded actions\\
+\hline
+\verb|-G0|&goto-driven\\
+\hline
+\verb|-G1|&goto, expanded actions\\
+\hline
+\verb|-G2|&goto, in-place actions\\
+\hline
+\end{tabular}
+\end{center}
+
+\chapter{Beyond the Basic Model}
+
+\section{Parser Modularization}
+
+It is possible to use Ragel's machine construction and action embedding
+operators to specify an entire parser using a single regular expression. An
+example is given in Section \ref{examples}. In many cases this is the desired
+way to specify a parser in Ragel. However, in some scenarios, the language to
+parse may be so large that it is difficult to think about it as a single
+regular expression. It may shift between distinct parsing strategies,
+in which case modularization into several coherent blocks of the language may
+be appropriate.
+
+It may also be the case that patterns which compile to a large number of states
+must be used in a number of different contexts and referencing them in each
+context results in a very large state machine. In this case, an ability to reuse
+parsers would reduce code size.
+
+To address this, distinct regular expressions may be instantiated and linked
+together by means of a jumping and calling mechanism. This mechanism is
+analogous to the jumping to and calling of processor instructions. A jump
+command, given in action code, causes control to be immediately passed to
+another portion of the machine by way of setting the current state variable. A
+call command causes the target state of the current transition to be pushed to
+a state stack before control is transferred. Later on, the original location
+may be returned to with a return statement. In the following example, distinct
+state machines are used to handle the parsing of two types of headers.
+
+% GENERATE: call
+% %%{
+% machine call;
\begin{inline_code}
\begin{verbatim}
-main := ( ' '* >start %fin ) <: ( ' ' $ws | [a-z] $alpha )*;
+action return { fret; }
+action call_date { fcall date; }
+action call_name { fcall name; }
+
+# A parser for date strings.
+date := [0-9][0-9] '/'
+ [0-9][0-9] '/'
+ [0-9][0-9][0-9][0-9] '\n' @return;
+
+# A parser for name strings.
+name := ( [a-zA-Z]+ | ' ' )** '\n' @return;
+
+# The main parser.
+headers =
+ ( 'from' | 'to' ) ':' @call_name |
+ ( 'departed' | 'arrived' ) ':' @call_date;
+
+main := headers*;
\end{verbatim}
\end{inline_code}
% }%%
+% %% write data;
+% void f()
+% {
+% %% write init;
+% %% write exec;
+% }
% END GENERATE
-\begin{center}
-\includegraphics[scale=0.45]{leftguard}
-\end{center}
+Calling and jumping should be used carefully as they are operations which take
+one out of the domain
+of regular languages. A machine that contains a call or jump statement in one
+of its actions should be used as an argument to a machine construction operator
+only with considerable care. Since DFA transitions may actually
+represent several NFA transitions, a call or jump embedded in one machine can
+inadvertently terminate another machine that it shares prefixes with. Despite
+this danger, these statements have proven useful for tying together
+sub-parsers of a language into a parser for the full language, especially for
+the purpose of modularization and reducing the number of states when the
+machine contains frequently recurring patterns.
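The bookkeeping behind jumping, calling and returning can be sketched in plain
C. This is a simplified model rather than the code Ragel generates: the
structure \verb|struct machine|, the fixed stack size and the three helper
functions are assumptions made for illustration, with \verb|cs|,
\verb|stack| and \verb|top| named after the variables the generated code
expects.

```c
#include <assert.h>

enum { STACK_SIZE = 8 };

struct machine {
    int cs;                  /* current state */
    int top;                 /* number of saved return states */
    int stack[STACK_SIZE];   /* saved return states */
};

/* Call: save the state to come back to, then enter the callee. */
void do_call( struct machine *m, int return_state, int entry_state )
{
    assert( m->top < STACK_SIZE );
    m->stack[m->top++] = return_state;
    m->cs = entry_state;
}

/* Jump: transfer control without saving anything. */
void do_goto( struct machine *m, int entry_state )
{
    m->cs = entry_state;
}

/* Return: resume at the most recently saved state. */
void do_return( struct machine *m )
{
    assert( m->top > 0 );
    m->cs = m->stack[--m->top];
}
```

Because the saved value is a state number, a return lands wherever the caller's
transition was headed, regardless of which machine the callee was.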
+\section{Referencing Names}
+\label{labels}
-Left-guarded concatenation is equivalent to the following:
+This section describes how to reference names in epsilon transitions and
+action-based control-flow statements such as \verb|fgoto|. There is a hierarchy
+of names implied in a Ragel specification. At the top level are the machine
+instantiations. Beneath the instantiations are labels and references to machine
+definitions. Beneath those are more labels and references to definitions, and
+so on.
-\verbspace
-\begin{verbatim}
-expr $(unique_name,1) . expr >(unique_name,0)
-\end{verbatim}
-\verbspace
+Any name reference may contain multiple components separated with the \verb|::|
+compound symbol. The search for the first component of a name reference is
+rooted at the join expression that the epsilon transition or action embedding
+is contained in. If the name reference is not contained in a join,
+the search is rooted at the machine definition that the epsilon transition or
+action embedding is contained in. Each component after the first is searched
+for beginning at the location in the name tree that the previous reference
+component refers to.
-\subsection{Longest-Match Kleene Star}
-\label{longest_match_kleene_star}
+In the case of action-based references, if the action is embedded more than
+once, the local search is performed for each embedding and the result is the
+union of all the searches. If no result is found for action-based references then
+the search is repeated at the root of the name tree. Any action-based name
+search may be forced into a strictly global search by prefixing the name
+reference with \verb|::|.
-\verb|expr**|
-\verbspace
+The final component of the name reference must resolve to a unique entry point.
+If a name is unique in the entire name tree it can be referenced as is. If it
+is not unique it can be specified by qualifying it with names above it in the
+name tree. However, it can always be renamed.
-This version of kleene star puts a higher priority on staying in the
-machine versus wrapping around and starting over. The LM kleene star is useful
-when writing simple tokenizers. These machines are built by applying the
-longest-match kleene star to an alternation of token patterns, as in the
-following.
+% FIXME: Should fit this in somewhere.
+% Some kinds of name references are illegal. Cannot call into longest-match
+% machine, can only call its start state. Cannot make a call to anywhere from
+% any part of a longest-match machine except a rule's action. This would result
+% in an eventual return to some point inside a longest-match other than the
+% start state. This is banned for the same reason a call into the LM machine is
+% banned.
-\verbspace
-% GENERATE: lmkleene
-% OPT: -p
+
+\section{Scanners}
+
+Though the overall language may be represented by regular expressions, it may
+be the case that a language
+contains sub-regions where the input is best represented as a sequence
+of tokens. To support the scanning of sub-regions of a language, Ragel allows
+the definition of longest-match machines, also known as scanners. The generated
+code will repeatedly attempt to match patterns from a list, favouring longer
+patterns over shorter patterns. In the case of equal length matches, the
+generated code will favour patterns that appear ahead of others. When a scanner
+makes a match it executes the user code associated with the match, consumes the
+input then resumes scanning.
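The selection rule, longest match first and earliest pattern on a tie, can be
illustrated with a brute-force C function. This sketch is only a model of the
rule: it assumes the patterns are literal strings and tries each one in turn,
whereas a Ragel scanner runs all patterns in a single state machine. The name
\verb|longest_match| is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the pattern chosen for the text at p, or -1 if no
 * pattern matches. The length of the match is stored in *match_len. */
int longest_match( const char *p, const char *const *patterns,
        size_t npatterns, size_t *match_len )
{
    int best = -1;
    size_t best_len = 0, i;
    assert( p != NULL && patterns != NULL && match_len != NULL );
    for ( i = 0; i < npatterns; i++ ) {
        size_t n = strlen( patterns[i] );
        /* A strictly longer match wins; on a tie the earlier one is kept. */
        if ( n > best_len && strncmp( p, patterns[i], n ) == 0 ) {
            best = (int)i;
            best_len = n;
        }
    }
    *match_len = best_len;
    return best;
}
```

After a match the caller would run the user code for the chosen pattern,
advance by \verb|*match_len| characters and scan again.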
+
+On the surface, Ragel scanners are similar to those defined by Lex, though
+there is a key distinguishing feature: patterns may be arbitrary Ragel
+expressions and can therefore contain embedded code. With a Ragel-based scanner
+the user need not wait until the end of a pattern before user code can be
+executed.
+
+The longest-match construction is not a pure state machine construction. It
+relies on several variables which enable it to backtrack and make pointers to the
+matched input text available to the user.
+For this reason scanners must be immediately instantiated. They cannot be defined inline or
+referenced by another expression. Scanners must be jumped to or called.
+
+% GENERATE: scanner
% %%{
-% machine exfinpri;
-% action A {}
-% action B {}
+% machine scanner;
+% word = 'foo';
+% head_name = 'bar';
\begin{inline_code}
\begin{verbatim}
-# Repeat tokens, but make sure to get the longest match.
-main := (
- lower ( lower | digit )* %A |
- digit+ %B |
- ' '
-)**;
+header := |*
+ word;
+ ' ';
+ '\n' => { fret; };
+*|;
+
+main := ( head_name ':' @{ fcall header; } )*;
\end{verbatim}
\end{inline_code}
% }%%
+% %% write data;
+% void f()
+% {
+% %% write init;
+% %% write exec;
+% }
% END GENERATE
-\begin{center}
-\includegraphics[scale=0.45]{lmkleene}
-\end{center}
+The scanner construction has a purpose similar to the longest-match kleene star
+operator \verb|**| when used in conjunction with the union operator. The key
+difference is that a scanner is able to backtrack to match a previously
+matched shorter string when the pursuit of a longer string fails.
-If a regular kleene star were used the machine above would not be able to
-distinguish between extending a word and beginning a new one. This operator is
-equivalent to:
+The longest-match operator can be used to construct scanners. The generated
+machine repeatedly attempts to match one of the given patterns, first favouring
+longer pattern matches over shorter ones. If there is a choice between equal
+length matches, the match of the pattern which appears first is chosen.
\verbspace
\begin{verbatim}
-( expr $(unique_name,1) %(unique_name,0) )*
+<machine_name> := |*
+ pattern1 => action1;
+ pattern2 => action2;
+ ...
+ *|;
\end{verbatim}
\verbspace
-When the kleene star is applied, transitions are made out of the machine which
-go back into it. These are assigned a priority of zero by the pending out
-transition mechanism. This is less than the priority of the transitions out of
-the final states that do not leave the machine. When two transitions clash on
-the same character, the differing priorities causes the transition which
-stays in the machine to take precedence. The transition that wraps around is
-dropped.
+The longest-match construction operator is not a pure state machine operator.
+It relies on the \verb|tokstart|, \verb|tokend| and \verb|act| variables to be
+present so that it can backtrack and make pointers to the matched text
+available to the user. If input is processed using multiple calls to the
+execute code then the user must ensure that, when a token is only partially
+matched, the prefix is preserved on the subsequent invocation of the
+execute code.
-Note that this operator does not build a scanner in the traditional sense
-because there is never any backtracking. To build a scanner in the traditional
-sense use the Longest-Match machine construction described Section
-\ref{generating-scanners}.
+The \verb|tokstart| variable must be defined as a pointer to the input data.
+It is used for recording where the current token match begins. This variable
+may be used in action code for retrieving the text of the current match. Ragel
+ensures that, in between tokens and outside of the longest-match machines,
+this pointer is set to null. In between calls to the execute code the user must
+check if \verb|tokstart| is set and if so, ensure that the data it points to is
+preserved ahead of the next buffer block. This is described in more detail
+below.
-\chapter{Interface to Host Program}
+The \verb|tokend| variable must also be defined as a pointer to the input data.
+It is used for recording where a match ends and where scanning of the next
+token should begin. This can also be used in action code for retrieving the
+text of the current match.
-\section{Alphtype Statement}
+The \verb|act| variable must be defined as an integer type. It is used for
+recording the identity of the last pattern matched when the scanner must go
+past a matched pattern in an attempt to make a longer match. If the longer
+match fails it may need to consult the \verb|act| variable. In some cases use of the \verb|act|
+variable can be avoided because the value of the current state is enough
+information to determine which token to accept, however in other cases this is
+not enough and so the \verb|act| variable is used.
+
+When the longest-match operator is in use, the user's driver code must take on
+some buffer management functions. The following algorithm gives an overview of
+the steps that should be taken to properly use the longest-match operator.
+
+\begin{itemize}
+\setlength{\parskip}{0pt}
+\item Read a block of input data.
+\item Run the execute code.
+\item If \verb|tokstart| is set, the execute code will expect the incomplete
+token to be preserved ahead of the buffer on the next invocation of the execute
+code.
+\begin{itemize}
+\item Shift the data beginning at \verb|tokstart| and ending at \verb|pe| to the
+beginning of the input buffer.
+\item Reset \verb|tokstart| to the beginning of the buffer.
+\item Shift \verb|tokend| by the distance from the old value of \verb|tokstart|
+to the new value. The \verb|tokend| variable may or may not be valid. There is
+no way to know if it holds a meaningful value because it is not kept at null
+when it is not in use. It can be shifted regardless.
+\end{itemize}
+\item Read another block of data into the buffer, immediately following any
+preserved data.
+\item Run the scanner on the new data.
+\end{itemize}
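The shifting step in the list above can be written as one helper. This is a
sketch under two assumptions: that \verb|tokstart| and \verb|tokend| both
point into the same buffer as \verb|pe|, and that the helper name
\verb|preserve_prefix| is free to use, since it is not part of Ragel. The
return value is the number of preserved bytes, which is also the offset at
which the next block should be read in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Moves the partial token [*tokstart, pe) to the front of buf and rebases
 * *tokstart and *tokend. If *tokstart is null there is nothing to keep. */
size_t preserve_prefix( char *buf, char **tokstart, char **tokend, char *pe )
{
    size_t have;
    assert( buf != NULL && tokstart != NULL && tokend != NULL );
    if ( *tokstart == NULL )
        return 0;
    assert( buf <= *tokstart && *tokstart <= pe );
    have = (size_t)( pe - *tokstart );
    memmove( buf, *tokstart, have );
    /* tokend may not be meaningful, but it is shifted by the same distance. */
    *tokend = buf + ( *tokend - *tokstart );
    *tokstart = buf;
    return have;
}
```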
+
+Figure \ref{preserve_example} shows the required handling of an input stream in
+which a token is broken by the input block boundaries. After processing up to
+and including the ``t'' of ``characters'', the prefix of the string token must be
+retained and processing should resume at the ``e'' on the next iteration of
+the execute code.
+If one uses a large input buffer for collecting input then the number of times
+the shifting must be done will be small. Furthermore, if one takes care not to
+define tokens that are allowed to be very long and instead processes these
+items using pure state machines or sub-scanners, then only a small amount of
+data will ever need to be shifted.
+
+\begin{figure}
\begin{verbatim}
-alphtype unsigned int;
+ a) A stream "of characters" to be scanned.
+ | | |
+ p tokstart pe
+
+ b) "of characters" to be scanned.
+ | | |
+ tokstart p pe
\end{verbatim}
-\verbspace
+\caption{Following an invocation of the execute code there may be a partially
+matched token (a). The data of the partially matched token
+must be preserved ahead of the new data on the next invocation (b).}
+\label{preserve_example}
+\end{figure}
-The alphtype statement specifies the alphabet data type that the machine
-operates on. During the compilation of the machine, integer literals are expected to
-be in the range of possible values of the alphtype. Supported alphabet types
-are \verb|char|, \verb|unsigned char|, \verb|short|, \verb|unsigned short|,
-\verb|int|, \verb|unsigned int|, \verb|long|, and \verb|unsigned long|.
-The default is \verb|char|.
+Since scanners attempt to make the longest possible match of input, in some
+cases they are not able to identify a token upon parsing its final character;
+they must wait for a lookahead character. For example, if trying to match words,
+the token match must be triggered on following whitespace in case more
+characters of the word have yet to come. The user must therefore arrange for an
+EOF character to be sent to the scanner to flush out any token that has not yet
+been matched. The user can exclude a single character from the entire scanner
+and use this character as the EOF character, possibly specifying an EOF action.
+For most scanners, zero is a suitable choice for the EOF character.
-\section{Getkey Statement}
+Alternatively, if whitespace is not significant and ignored by the scanner, the
+final real token can be flushed out by simply sending an additional whitespace
+character on the end of the stream. If the real stream ends with whitespace
+then it will simply be extended and ignored. If it does not, then the last real token is
+guaranteed to be flushed and the dummy EOF whitespace ignored.
+An example scanner processing loop is given in Figure \ref{scanner-loop}.
+\begin{figure}
+\small
\begin{verbatim}
-getkey fpc->id;
-\end{verbatim}
-\verbspace
+ int have = 0;
+ bool done = false;
+ while ( !done ) {
+ /* How much space is in the buffer? */
+ int space = BUFSIZE - have;
+ if ( space == 0 ) {
+ /* Buffer is full. */
+ cerr << "TOKEN TOO BIG" << endl;
+ exit(1);
+ }
-Specify to Ragel how to retrieve the character that the machine operates on
-from the pointer to the current element (\verb|p|). Any expression that returns
-a value of the alphabet type
-may be used. The getkey statement may be used for looking into element
-structures or for translating the character to process. The getkey expression
-defaults to \verb|(*p)|. In goto-driven machines the getkey expression may be
-evaluated more than once per element processed, therefore it should not incur a
-large cost and preclude optimization.
+ /* Read in a block after any data we already have. */
+ char *p = inbuf + have;
+ cin.read( p, space );
+ int len = cin.gcount();
+
+ /* If no data was read, send the EOF character. */
+ if ( len == 0 ) {
+ p[0] = 0, len++;
+ done = true;
+ }
+
+ char *pe = p + len;
+ %% write exec;
+
+ if ( cs == RagelScan_error ) {
+ /* Machine failed before finding a token. */
+ cerr << "PARSE ERROR" << endl;
+ exit(1);
+ }
+
+ if ( tokstart == 0 )
+ have = 0;
+ else {
+ /* There is a prefix to preserve, shift it over. */
+ have = pe - tokstart;
+ memmove( inbuf, tokstart, have );
+ tokend = inbuf + (tokend-tokstart);
+ tokstart = inbuf;
+ }
+ }
+\end{verbatim}
+\caption{A processing loop for a scanner.}
+\label{scanner-loop}
+\end{figure}
-\section{Access Statement}
+\section{State Charts}
-\begin{verbatim}
-access fsm->;
-\end{verbatim}
-\verbspace
+In addition to supporting the construction of state machines using regular
+languages, Ragel also provides a way to manually specify state machines using
+state charts. The comma operator combines machines together without any
+implied transitions. The user can then manually link machines by specifying
+epsilon transitions with the \verb|->| operator. Epsilon transitions are drawn
+between the final states of a machine and entry points defined by labels. This
+makes it possible to build machines using the explicit state-chart method while
+making minimal changes to the Ragel language.
-The access statement allows one to tell Ragel how the generated code should
-access the machine data that is persistent across processing buffer blocks.
-This includes all variables except \verb|p| and \verb|pe|. This includes
-\verb|cs|, \verb|top|, \verb|stack|, \verb|tokstart|, \verb|tokend| and \verb|act|.
-This is useful if a machine is to be encapsulated inside a
-structure in C code. The access statement can be used to give the name of
-a pointer to the structure.
+An interesting feature of Ragel's state chart construction method is that it
+can be mixed freely with regular expression constructions. A state chart may be
+referenced from within a regular expression, or a regular expression may be
+used in the definition of a state chart transition.
-\section{Maintaining Pointers to Input Data}
+\subsection{Join}
-In the creation of any parser it is not uncommon to require the collection of
-the data being parsed. It is always possible to collect data into a growable
-buffer as the machine moves over it, however the copying of data is a somewhat
-wasteful use of processor cycles. The most efficient way to collect data
-from the parser is to set pointers into the input. This poses a problem for
-uses of Ragel where the input data arrives in blocks, such as over a socket or
-from a file. The program will error if a pointer is set in one buffer block but
-must be used while parsing a following buffer block.
+\verb|expr , expr , ...|
+\verbspace
-The longest-match constructions exhibit this problem, requiring the maintenance
-code described in Section \ref{generating-scanners}. If a longest-match
-construction has been used somewhere in the machine then it is possible to
-take advantage of the required prefix maintenance code in the driver program to
-ensure pointers to the input are always valid. If laying down a pointer one can
-set \verb|tokstart| at the same spot or ahead of it. When data is shifted in
-between loops the user must also shift the pointer. In this way it is possible
-to maintain pointers to the input that will always be consistent.
+Join a list of machines together without
+drawing any transitions, without setting up a start state, and without
+designating any final states. Transitions between the machines may be specified
+using labels and epsilon transitions. The start state must be explicitly
+specified with the ``start'' label. Final states may be specified with an
+epsilon transition to the implicitly created ``final'' state. The join
+operation allows one to build machines using a state chart model.
-\begin{figure}
-\small
-\begin{verbatim}
- int have = 0;
- while ( 1 ) {
- char *p, *pe, *data = buf + have;
- int len, space = BUFSIZE - have;
+\subsection{Label}
- if ( space == 0 ) {
- fprintf(stderr, "BUFFER OUT OF SPACE\n");
- exit(1);
- }
+\verb|label: expr|
+\verbspace
- len = fread( data, 1, space, stdin );
- if ( len == 0 )
- break;
+Attaches a label to an expression. Labels can be
+used as the target of epsilon transitions and explicit control transfer
+statements such as \verb|fgoto| and \verb|fnext| in action
+code.
- /* Find the last newline by searching backwards. */
- p = buf;
- pe = data + len - 1;
- while ( *pe != '\n' && pe >= buf )
- pe--;
- pe += 1;
+\subsection{Epsilon}
- %% write exec;
+\verb|expr -> label|
+\verbspace
- /* How much is still in the buffer? */
- have = data + len - pe;
- if ( have > 0 )
- memmove( buf, pe, have );
+Draws an epsilon transition to the state defined
+by \verb|label|. Epsilon transitions are made deterministic when join
+operators are evaluated. Epsilon transitions that are not in a join operation
+are made deterministic when the machine definition that contains the epsilon is
+complete. See Section \ref{labels} for information on referencing labels.
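+
+As a minimal sketch of how join, labels and epsilon transitions fit together,
+the following hypothetical definition accepts an \verb|a|, then any number of
+\verb|b| characters, then a \verb|c|. The names \verb|abc| and \verb|loop| are
+chosen for illustration only; \verb|start| and \verb|final| are the special
+labels described for the join operator.
+
+\begin{inline_code}
+\begin{verbatim}
+# Illustrative only: intended to match the same strings
+# as the expression 'a' 'b'* 'c'.
+abc =
+start: (
+    'a' -> loop
+),
+loop: (
+    'b' -> loop |
+    'c' -> final
+);
+\end{verbatim}
+\end{inline_code}
+
+Each epsilon transition leaves the final state of the expression on its left,
+so the transition to \verb|loop| is taken only once an \verb|a| has been
+consumed.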
- if ( len < space )
- break;
- }
-\end{verbatim}
-\caption{An example of line-oriented processing.}
-\label{line-oriented}
-\end{figure}
+\subsection{Simplifying State Charts}
-In general, there are two approaches for guaranteeing the consistency of
-pointers to input data. The first approach is the one just described;
-lay down a marker from an action,
-then later ensure that the data the marker points to is preserved ahead of
-the buffer on the next execute invocation. This approach is good because it
-allows the parser to decide on the pointer-use boundaries, which can be
-arbitrarily complex parsing conditions. A downside is that it requires any
-pointers that are set to be corrected in between execute invocations.
+There are two benefits to providing state charts in Ragel. The first is that it
+allows us to take a state chart with a full listing of states and transitions
+and simplify it in selective places using regular expressions.
-The alternative is to find the pointer-use boundaries before invoking the execute
-routine, then pass in the data using these boundaries. For example, if the
-program must perform line-oriented processing, the user can scan backwards from
-the end of an input block that has just been read in and process only up to the
-first found newline. On the next input read, the new data is placed after the
-partially read line and processing continues from the beginning of the line.
-An example of line-oriented processing is given in Figure \ref{line-oriented}.
+The state chart method of specifying parsers is very common. It is an
+effective programming technique for producing robust code. The key disadvantage
+becomes clear when one attempts to comprehend a large parser specified in this
+way. These programs usually require many lines, causing logic to be spread out
+over large distances in the source file. Remembering the function of a large
+number of states can be difficult and organizing the parser in a sensible way
+requires discipline because branches and repetition present many file layout
+options. This kind of programming takes a specification with inherent
+structure such as looping, alternation and concatenation and expresses it in a
+flat form.
+If we could take an isolated component of a manually programmed state chart,
+that is, a subset of states that has only one entry point, and implement it
+using regular language operators then we could eliminate all the explicit
+naming of the states contained in it. By eliminating explicitly named states
+and replacing them with higher-level specifications we simplify a state machine
+specification.
-\section{Running the Executables}
+For example, sometimes chains of states are needed, with only a small number of
+possible characters appearing along the chain. These can easily be replaced
+with a concatenation of characters. Sometimes a group of common states
+implement a loop back to another single portion of the machine. Rather than
+manually duplicate all the transitions that loop back, we may be able to
+express the loop using a Kleene star operator.
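+
+The two simplifications just described can be sketched as follows. The
+definitions are hypothetical and are shown only to contrast the two forms:
+each state chart names its states explicitly, and the regular expression
+beneath it is intended to describe the same language.
+
+\begin{inline_code}
+\begin{verbatim}
+# A chain of states becomes a concatenation of characters.
+for_chart =
+start: ( 'f' -> s1 ),
+s1:    ( 'o' -> s2 ),
+s2:    ( 'r' -> final );
+for_regex = 'f' 'o' 'r';
+
+# A state that loops back on itself becomes a Kleene star.
+num_chart =
+start: ( [0-9] -> more ),
+more:  ( [0-9] -> more | ';' -> final );
+num_regex = [0-9] [0-9]* ';';
+\end{verbatim}
+\end{inline_code}
+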
-Ragel is broken down into two executables: a frontend which compiles machines
-and emits them in an XML format, and a backend which generates code or a
-Graphviz Dot file from the XML data. The purpose of the XML-based intermediate
-format is to allow users to inspect their compiled state machines and to
-interface Ragel to other tools such as custom visualizers, code generators or
-analysis tools. The intermediate format will provide a better platform for
-extending Ragel to support new host languages. The split also serves to reduce
-complexity of the Ragel program by strictly separating the data structures and
-algorithms that are used to compile machines from those that are used to
-generate code.
+Ragel allows one to take this state map simplification approach. We can build
+state machines using a state map model and implement portions of the state map
+using regular languages. In place of any transition in the state machine,
+entire sub-state machines can be given. These can encapsulate functionality
+defined elsewhere. An important aspect of the Ragel approach is that when we
+wrap up a collection of states using a regular expression we do not lose
+access to the states and transitions. We can still execute code on the
+transitions that we have encapsulated.
-\verbspace
+\subsection{Down One Level of Abstraction}
+\label{down}
+
+The second benefit of incorporating state charts into Ragel is that it permits
+us to bypass the regular language abstraction if we need to. Ragel's action
+embedding operators are sometimes insufficient for expressing certain parsing
+tasks. In the same way that it is useful for C language programmers to drop down
+to assembly language programming using embedded assembler, it is sometimes
+useful for the Ragel programmer to drop down to programming with state charts.
+
+In the following example, we wish to buffer the characters of an XML CDATA
+sequence. The sequence is terminated by the string \verb|]]>|. The challenge
+in our application is that we do not wish the terminating characters to be
+buffered. An expression of the form \verb|any* @buffer :>> ']]>'| will not work
+because the buffer will always contain the characters \verb|]]| on the end.
+Instead, what we need is to delay the buffering of \hspace{0.25mm} \verb|]|
+characters until a time when we
+abandon the terminating sequence and go back into the main loop. There is no
+easy way to express this using Ragel's regular expression and action embedding
+operators, and so an ability to drop down to the state chart method is useful.
+
+% GENERATE: dropdown
+% OPT: -p
+% %%{
+% machine dropdown;
+\begin{inline_code}
\begin{verbatim}
-[user@host] myproj: ragel file.rl | rlcodegen -G2 -o file.c
+action bchar { buff( fpc ); } # Buffer the current character.
+action bbrack1 { buff( "]" ); }
+action bbrack2 { buff( "]]" ); }
+
+CDATA_body =
+start: (
+ ']' -> one |
+ (any-']') @bchar ->start
+),
+one: (
+ ']' -> two |
+ [^\]] @bbrack1 @bchar ->start
+),
+two: (
+ '>' -> final |
+ ']' @bbrack1 -> two |
+ [^>\]] @bbrack2 @bchar ->start
+);
\end{verbatim}
+\end{inline_code}
+% main := CDATA_body;
+% }%%
+% END GENERATE
-\section{Choosing a Generated Code Style}
-\label{genout}
+\begin{center}
+\includegraphics[scale=0.45]{dropdown}
+\end{center}
-There are three styles of code output to choose from. Code style affects the
-size and speed of the compiled binary. Changing code style does not require any
-change to the Ragel program. There are two table-driven formats and a goto
-driven format.
-In addition to choosing a style to emit, there are various levels of action
-code reuse to choose from. The maximum reuse levels (\verb|-T0|, \verb|-F0|
-and \verb|-G0|) ensure that no FSM action code is ever duplicated by encoding
-each transition's action list as static data and iterating
-through the lists on every transition. This will normally result in a smaller
-binary. The less action reuse options (\verb|-T1|, \verb|-F1| and \verb|-G1|)
-will usually produce faster running code by expanding each transition's action
-list into a single block of code, eliminating the need to iterate through the
-lists. This duplicates action code instead of generating the logic necessary
-for reuse. Consequently the binary will be larger. However, this tradeoff applies to
-machines with moderate to dense action lists only. If a machine's transitions
-frequently have less than two actions then the less reuse options will actually
-produce both a smaller and a faster running binary due to less action sharing
-overhead. The best way to choose the appropriate code style for your
-application is to perform your own tests.
+\section{Semantic Conditions}
+\label{semantic}
-The table-driven FSM represents the state machine as constant static data. There are
-tables of states, transitions, indices and actions. The current state is
-stored in a variable. The execution is simply a loop that looks up the current
-state, looks up the transition to take, executes any actions and moves to the
-target state. In general, the table-driven FSM can handle any machine, produces
-a smaller binary and requires a less expensive host language compile, but
-results in slower running code. Since the table-driven format is the most
-flexible it is the default code style.
+Many communication protocols contain variable-length fields, where the length
+of the field is given ahead of the field as a value. This
+problem cannot be expressed using regular languages because of its
+context-dependent nature. The prevalence of variable-length fields in
+communication protocols motivated us to introduce semantic conditions into
+the Ragel language.
-The flat table-driven machine is a table-based machine that is optimized for
-small alphabets. Where the regular table machine uses the current character as
-the key in a binary search for the transition to take, the flat table machine
-uses the current character as an index into an array of transitions. This is
-faster in general, however is only suitable if the span of possible characters
-is small.
+A semantic condition is a block of user code which is executed immediately
+before a transition is taken. If the code returns a value of true, the
+transition may be taken. We can now embed code which extracts the length of a
+field, then proceed to match $n$ data values.
-The goto-driven FSM represents the state machine using goto and switch
-statements. The execution is a flat code block where the transition to take is
-computed using switch statements and directly executable binary searches. In
-general, the goto FSM produces faster code but results in a larger binary and a
-more expensive host language compile.
+% GENERATE: conds1
+% OPT: -p
+% %%{
+% machine conds1;
+% number = digit+;
+\begin{inline_code}
+\begin{verbatim}
+action rec_num { i = 0; n = getnumber(); }
+action test_len { i++ < n }
+data_fields = (
+ 'd'
+ [0-9]+ %rec_num
+ ':'
+ ( [a-z] when test_len )*
+)**;
+\end{verbatim}
+\end{inline_code}
+% main := data_fields;
+% }%%
+% END GENERATE
-The goto-driven format has an additional action reuse level (\verb|-G2|) that
-writes actions directly into the state transitioning logic rather than putting
-all the actions together into a single switch. Generally this produces faster
-running code because it allows the machine to encode the current state using
-the processor's instruction pointer. Again, sparse machines may actually
-compile to smaller binaries when \verb|-G2| is used due to less state and
-action management overhead. For many parsing applications \verb|-G2| is the
-preferred output format.
+\begin{center}
+\includegraphics[scale=0.45]{conds1}
+\end{center}
+
+The Ragel implementation of semantic conditions does not force us to give up the
+compositional property of Ragel definitions. For example, a machine which tests
+the length of a field using conditions can be unioned with another machine
+which accepts some of the same strings, without the two machines interfering with
+one another. The user need not be concerned about whether or not the result of the
+semantic condition will affect the matching of the second machine.
+
+To see this, first consider that when a user associates a condition with an
+existing transition, the transition's label is translated from the base character
+to its corresponding value in the space which represents ``condition $c$ true''. Should
+the determinization process combine a state that has a conditional transition
+with another state that has a transition on the same input character but
+without a condition, then the condition-less transition first has its label
+translated into two values, one to its corresponding value in the space which
+represents ``condition $c$ true'' and another to its corresponding value in the
+space which represents ``condition $c$ false''. It
+is then safe to combine the two transitions. This is shown in the following
+example. Two intersecting patterns are unioned, one with a condition and one
+without. The condition embedded in the first pattern does not affect the second
+pattern.
+
+\newpage
+
+% GENERATE: conds2
+% OPT: -p
+% %%{
+% machine conds2;
+% number = digit+;
+\begin{inline_code}
+\begin{verbatim}
+action test_len { i++ < n }
+action one { /* accept pattern one */ }
+action two { /* accept pattern two */ }
+patterns =
+ ( [a-z] when test_len )+ %one |
+ [a-z][a-z0-9]* %two;
+main := patterns '\n';
+\end{verbatim}
+\end{inline_code}
+% }%%
+% END GENERATE
-\verbspace
\begin{center}
-\begin{tabular}{|c|c|}
-\hline
-\multicolumn{2}{|c|}{\bf Code Output Style Options} \\
-\hline
-\verb|-T0|&binary search table-driven\\
-\hline
-\verb|-T1|&binary search, expanded actions\\
-\hline
-\verb|-F0|&flat table-driven\\
-\hline
-\verb|-F1|&flat table, expanded actions\\
-\hline
-\verb|-G0|&goto-driven\\
-\hline
-\verb|-G1|&goto, expanded actions\\
-\hline
-\verb|-G2|&goto, in-place actions\\
-\hline
-\end{tabular}
+\includegraphics[scale=0.45]{conds2}
\end{center}
-\section{Graphviz}
+There are many more potential uses for semantic conditions. The user is free to
+use arbitrary code and may therefore perform actions such as looking up names
+in dictionaries, validating input using external parsing mechanisms or
+performing checks on the semantic structure of input seen so far. In the
+next section we describe how Ragel accommodates several common parser
+engineering problems.
+
+\section{Implementing Lookahead}
+
+There are a few strategies for implementing lookahead in Ragel programs.
+Pending out actions, which were described in Section \ref{out-actions}, can be
+used as a form of lookahead. Ragel also provides the \verb|fhold| directive
+which can be used in actions to prevent the machine from advancing over the
+current character. It is also possible to manually adjust the current
+character position by shifting it backwards.
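+
+As a hedged sketch of the \verb|fhold| strategy, the following hypothetical
+fragment recognizes a word only once the first non-letter has been seen, then
+holds that character so that it is processed again as ordinary input. The
+names \verb|end_word| and \verb|word| are assumptions made for this example.
+
+\begin{inline_code}
+\begin{verbatim}
+# The trailing [^a-z] acts as one character of lookahead. The
+# action runs on that character and fhold stops the machine
+# from advancing over it.
+action end_word { /* the word is complete */ fhold; }
+word = [a-z]+ [^a-z] @end_word;
+main := ( word | space )*;
+\end{verbatim}
+\end{inline_code}
+
+In this sketch the held character must be whitespace for the machine to
+continue, since \verb|main| accepts only words and whitespace.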
+
+\section{Handling Errors}
+
+In many applications it is useful to be able to react to parsing errors. The
+user may wish to print an error message which depends on the context. It
+may also be desirable to consume input in an attempt to return the input stream
+to some known state and resume parsing.
+
+To support error handling and recovery, Ragel provides error action embedding
+operators. Error actions are embedded into an expression's states. When the
+final machine has been constructed and it is being made complete, error actions
+are transferred from their place of embedding within a state to the transitions
+which go to the error
+state. When the machine fails and is about to move into the error state, the
+current state's error actions get executed.
+
+Error actions can be used simply to report errors or, by jumping to a machine
+instantiation which consumes input, to attempt recovery from errors. Like
+the action embedding operators, there are several classes of states which
+error action embedding operators can access. For example, the \verb|@err|
+operator embeds an error action into non-final states. The \verb|$err| operator
+embeds an error action into all states. Other operators access the start state,
+final states, and states which are neither the start state nor are final. The
+design of the state selections was driven by a need to cover the states of an
+expression with a single error action.
+
+The following example uses error actions to report an error and jump to a
+machine which consumes the remainder of the line when parsing fails. After
+consuming the line, the error recovery machine returns to the main loop.
+
+% GENERATE: erract
+% %%{
+% machine erract;
+% ws = ' ';
+% address = 'foo@bar.com';
+% date = 'Monday May 12';
+\begin{inline_code}
+\begin{verbatim}
+action cmd_err {
+ printf( "command error\n" );
+ fhold; fgoto line;
+}
+action from_err {
+ printf( "from error\n" );
+ fhold; fgoto line;
+}
+action to_err {
+ printf( "to error\n" );
+ fhold; fgoto line;
+}
-Ragel is able to emit compiled state machines in Graphviz's Dot file format.
-Graphviz support allows users to perform
-incremental visualization of their parsers. User actions are displayed on
-transition labels of the graph. If the final graph is too large to be
-meaningful, or even drawn, the user is able to inspect portions of the parser
-by naming particular regular expression definitions with the \verb|-S| and
-\verb|-M| options to the \verb|ragel| program. Use of Graphviz greatly
-improves the Ragel programming experience. It allows users to learn Ragel by
-experimentation and also to track down bugs caused by unintended
-nondeterminism.
+line := [^\n]* '\n' @{ fgoto main; };
+
+main := (
+ (
+ 'from' @err cmd_err
+ ( ws+ address ws+ date '\n' ) $err from_err |
+ 'to' @err cmd_err
+ ( ws+ address '\n' ) $err to_err
+ )
+)*;
+\end{verbatim}
+\end{inline_code}
+% }%%
+% %% write data;
+% void f()
+% {
+% %% write init;
+% %% write exec;
+% }
+% END GENERATE
\end{document}