2 % Copyright 2001-2006 Adrian Thurston <thurston@cs.queensu.ca>
5 % This file is part of Ragel.
7 % Ragel is free software; you can redistribute it and/or modify
8 % it under the terms of the GNU General Public License as published by
9 % the Free Software Foundation; either version 2 of the License, or
10 % (at your option) any later version.
12 % Ragel is distributed in the hope that it will be useful,
13 % but WITHOUT ANY WARRANTY; without even the implied warranty of
14 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 % GNU General Public License for more details.
17 % You should have received a copy of the GNU General Public License
18 % along with Ragel; if not, write to the Free Software
19 % Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
21 \documentclass[letterpaper,11pt,oneside]{book}
31 \setlength{\parskip}{0pt}
32 \setlength{\topsep}{0pt}
33 \setlength{\partopsep}{0pt}
34 \setlength{\itemsep}{0pt}
38 \newcommand{\verbspace}{\vspace{10pt}}
39 \newcommand{\graphspace}{\vspace{10pt}}
41 \renewcommand\floatpagefraction{.99}
42 \renewcommand\topfraction{.99}
43 \renewcommand\bottomfraction{.99}
44 \renewcommand\textfraction{.01}
45 \setcounter{totalnumber}{50}
46 \setcounter{topnumber}{50}
47 \setcounter{bottomnumber}{50}
49 \newenvironment{inline_code}{\def\baselinestretch{1}\vspace{12pt}\small}{}
59 {\huge Ragel State Machine Compiler}\\
65 {\large Adrian Thurston}\\
75 Ragel version \version, \pubdate\\
76 Copyright \copyright\ 2003, 2004, 2005, 2006 Adrian Thurston
79 {\bf\it\noindent This document is part of Ragel, and as such, this document is
80 released under the terms of the GNU General Public License as published by the
81 Free Software Foundation; either version 2 of the License, or (at your option)
86 {\bf\it\noindent Ragel is distributed in the hope that it will be useful, but
87 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
88 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
93 {\bf\it\noindent You should have received a copy of the GNU General Public
94 License along with Ragel; if not, write to the Free Software Foundation, Inc.,
95 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA}
108 \pagenumbering{arabic}
110 \chapter{Introduction}
114 Regular expressions are used heavily in practice for the purpose of specifying
115 parsers. However, they are normally used as black boxes linked together with
116 program logic. User actions are associated with entire expressions and matched
117 text is extracted from input. With these facilities it is not possible to
118 specify an entire parser with a single regular expression because practical
119 parsing tasks invariably involve the execution of arbitrary user code
120 throughout the course of parsing.
122 Ragel is a software development tool which allows the user to embed actions into
123 regular expressions without disrupting the regular expression syntax.
124 Consequently, one can specify an entire parser using a single regular
125 experssion. The single-expression model affords concise
126 and elegant descriptions of languages and the generation of very simple,
127 fast and robust code. Ragel compiles finite state machines from a high level
128 regular language notation to executable C, C++, Objective-C or D.
130 In addition to building state machines from regular expressions, Ragel allows
131 the programmer to directly specify state machines with state charts. These two
132 notations may also be freely combined. There are facilities for controlling
133 nondeterminism in the resulting machines and building scanners using the
134 longest-match paradigm. Ragel can produce code that runs as fast as manually
135 constructed machines. Ragel can handle integer-sized alphabets and can compile
136 very large state machines.
140 When a programmer is faced with the task of producing a parser for a
141 context-free language there are many tools to choose from. It is quite common
142 to generate useful and efficient parsers for programming languages from a
143 formal grammar. It is also quite common for programmers to avoid such tools
144 when making parsers for simple computer languages, such as file formats and
145 communication protocols. Such languages often meet the criteria for the
146 regular languages. Tools for processing the context-free languages are simply
147 too heavyweight for the purpose of parsing regular languages because the extra
148 run-time effort required for supporting the recursive nature of context-free
151 Regular expressions are more appropriate than context-free grammars for a large
152 number of parsing probelems. Parsers based on them have many advantages over
153 hand written parsers. Regular expression syntax is convenient,
154 concise and easy to maintain. Existing
155 parsing tools based on regular expressions, such as Lex, Re2C, Sed, Awk and
156 Perl, are normally split into two levels: a regular expression matching engine
157 and some kind of program logic for linking patterns together and executing user
160 As an example, Lex requires the user to consider a language as a sequence
161 of independent patterns.
162 Unfortunately, there are many computer languages that are considered regular,
163 which do not fit this model. This model also places restrictions on when action
164 code may be executed. Since action code can only be associated with complete
165 patterns, if action code must be executed before an entire pattern is matched
166 then the pattern must be broken into smaller units. Instead of being forced to
167 disrupt the regular expression syntax, it is desirable to retain a single
168 expression and embed code for performing actions directly into the transitions
169 which move over the characters. After all we know the transitions are there.
171 Perl allows one to link patterns together using arbitrary program code. This
172 is very flexible and powerful, however we can be more concise, clear and robust
173 if we avoid gluing together regular expressions with if statements and while
174 loops, and instead only compose parsers with regular expression operators. To
175 achieve this we require an action execution model for associating code with the
176 sub-expressions of a regular expression in a way that does not disrupt its
179 The primary goal of Ragel is therefore to provide developers with an ability to embed
180 actions into the transitions and states of a regular expression in support the
181 definition of entire parsers or large sections of parsers using a single
182 regular expression that is compiled to a simple state machine. From the
183 regular expression we gain a clear and concise statement of our language. From
184 the state machine we obtain a very fast and robust executable that lends itself
185 to many kinds of analysis and visualization.
189 Ragel is a language for specifying state machines. The Ragel program is a
190 compiler that assembles a state machine definition to executable code. Ragel
191 is based on the principle that any regular language can be converted to a
192 deterministic finite state automaton. Since every regular language has a state
193 machine representation and vice versa, the terms regular language and state
194 machine (or just machine) will be used interchangeably in this document.
196 Ragel outputs machines to C, C++, Objective-C, or D code. The output is
197 designed to be generic and is not bound to any particular input or processing
198 method. A Ragel machine expects to have data passed to it in buffer blocks.
199 When there is no more input, the machine can be queried for acceptance. In
200 this way, a Ragel machine can be used to simply recognize a regular language
201 like a regular expression library. By embedding code into the regular language,
202 a Ragel machine can also be used to parse input.
204 The Ragel input language has many operators for constructing and manipulating
205 machines. Machines are built up from smaller machines, to bigger ones, to the
206 final machine representing the language that needs to be recognized or parsed.
208 The core state machine construction operators are those found in most ``Theory
209 of Computation'' textbooks. They date back to the 1950s and are widely studied.
210 They are based on set operations and permit one to think of languages as a set
211 of strings. They are Union, Intersection, Subtraction, Concatenation and Kleene
212 Star. Put together, these operators make up what most people know as regular
213 expressions. Ragel also provides a longest-match construction for easily
214 building scanners and provides operators for explicitly constructing machines
215 using a state chart method. In the state chart method one joins machines
216 together without any implied transitions and then explicitly specifies where
217 epsilon transitions should be drawn.
219 The state machine manipulation operators are specific to Ragel. They allow the
220 programmer to access the states and transitions of regular languages. There are
221 two uses of the manipulation operators. The first and primary use is to embed
222 code into transitions and states, allowing the programmer to specify the
223 actions of the state machine.
225 Following a number of action embeddings, a single transition can have a number
226 of actions embedded in it. When making a nondeterministic specification into a
227 DFA using machines that have embedded actions, new transitions are often made
228 that have the combined actions of several source transitions. Ragel ensures
229 that multiple actions associated with a single transition are ordered
230 consistently with respect to the order of reference and the natural ordering
231 implied by the construction operators.
233 The second use of the manipulation operators is to assign priorities in
234 transitions. Priorities provide a convenient way of controlling any
235 nondeterminism introduced by the construction operators. Suppose two
236 transitions leave from the same state and go to distinct target states on the
237 same character. If these transitions are assigned conflicting priorities, then
238 during the determinization process the transition with the higher priority will
239 take precedence over the transition with the lower priority. The lower priority
240 transition gets abandoned. The transitions would otherwise be combined to a new
241 transition that goes to a new state which is a combination of the original
242 target states. Priorities are often required for segmenting machines. The most
243 common uses of priorities have been encoded into a set of simple operators
244 which should be used instead of priority embeddings whenever possible.
246 There are four operators for embedding actions and priorities into the
247 transitions of a state machine, these correspond to the different
248 classes of transitions in a machine. It is possible to embed into start
249 transitions, finishing transitions, all transitions or pending out
250 transitions. The embedding of pending out transitions is a special case.
251 These transition embeddings gets stored in the final states of a machine. They
252 are transferred to any transitions that may be made going out of the machine by
253 a concatenation or kleene star operator.
255 There are several more operators for embedding actions into states. Like the
256 transition embeddings, there are various different classes of states that the
257 embedding operators access. For example, one can access start states, final
258 states or all states, among others. Unlike the transition
260 are several different types of state action embeddings. These are executed at various
261 different times during the processing of input. It is possible to embed
262 actions which are exectued on all transitions into a state, all transitions out of a state,
263 transitions taken on the error event or on the EOF event.
265 Within actions, it is possible to influence the behaviour of the state machine.
266 The user can write action code that jumps or calls to another portion of the
267 machine, changes the current character being processed, or breaks out of the
268 processing loop. With the state machine calling feature Ragel can be used to
269 parse languages which are not regular. For example, one can parse balanced
270 parentheses by calling into a parser when an open bracket character is seen and
271 returning to the state on the top of the stack when the corresponding closing
272 bracket character is seen. More complicated context-free languages such as
273 expressions in C, are out of the scope of Ragel.
275 Ragel provides a longest-match construction operator which eases the task of
276 building scanners. This construction behaves much like the primary processing
277 model of Lex. The generated code, which relies on user-defined variables for
278 backtracking, repeatedly tries to match patterns to the input, favouring longer
279 patterns over shorter ones and patterns that appear ahead of others when the
280 lengths of the possible matches are identical. When a pattern is matched the
281 associated action is executed. Longest-match machines take Ragel out of the
282 domain of pure state machines and require the user to maintain the backtracking
283 related variables. However, longest-match machines integrate well with regular
284 state machine instantiations. They can be called to or jumped to only when
285 needed, or they can be called out of or jumped out of when a simpler, pure
286 state machine model is needed.
288 Two types of output code style are available. Ragel can produce a table-driven
289 machine or a directly executable machine. The directly executable machine is much
290 faster than the table-driven. On the other hand, the table-driven machine is
291 more compact and less demanding on the host language compiler. It is better
292 suited to compiling large state machines and in the future will be used for
293 coverage statistics gathering and debugging.
295 \section{Related Work}
297 Lex is perhaps the best-known tool for constructing parsers from regular
298 expressions. In the Lex processing model, generated code attempts to match one
299 of the user's regular expression patterns, favouring longer matches over
300 shorter ones. Once a match is made it then executes the code associated with
301 the pattern and consumes the matching string. This process is repeated until
302 the input is fully consumed.
304 Through the use of start conditions, related sets of patterns may be defined.
305 The active set may be changed at any time. This allows the user to define
306 different lexical regions. It also allows the user to link patterns together by
307 requiring that some patterns come before others. This is quite like a
308 concatenation operation. However, use of Lex for languages that require a
309 considerable amount of pattern concatenation is inappropriate. In such cases a
310 Lex program deteriorates into a manually specified state machine, where start
311 conditions define the states and pattern actions define the transitions. Lex
312 is therefore best suited to parsing tasks where the language to be parsed can
313 be described in terms of regions of tokens.
315 Lex is useful in many scenarios and has undoubtedly stood the test of time.
316 There are, however, several drawbacks to using Lex. Lex can impose too much
317 overhead for parsing applications where buffering is not required because all
318 the characters are available in a single string. In these cases there is
319 structure to the language to be parsed and a parser specification tool can
320 help, but employing a heavyweight processing loop that imposes a stream
321 ``pull'' model and dynamic input buffer allocation is inappropriate. An
322 example of this kind of scenario is the conversion of floating point numbers
323 contained in a string to their corresponding numerical values.
325 Another drawback is that
326 Lex patterns are black boxes. It is not possbile to execute a user action while
327 matching a character contained inside a pattern. For example, if scanning a
328 programming language and string literals can contain newlines which must be
329 counted, a Lex user must break up a string literal pattern so as to associate
330 an action with newlines. This forces the definition of a new start condition.
331 Alternatively the user can reprocess the text of the matched string literal to
335 How ragel is different from Lex.
337 %Like Re2c, Ragel provides a simple execution model that does not make any
338 %assumptions as to how the input is collected. Also, Ragel does not do any
339 %buffering in the generated code. Consequently there are no dependencies on
340 %external functions such as \verb|malloc|.
342 %If buffering is required it can be manually implemented by embedding actions
343 %that copy the current character to a buffer, or data can be passed to the
344 %parser using known block boundaries. If the longest-match operator is used,
345 %Ragel requires the user to ensure that the ending portion of the input buffer
346 %is preserved when the buffer is exhaused before a token is fully matched. The
347 %user should move the token prefix to a new memory location, such as back to the
348 %beginning of the input buffer, then place the subsequently read input
349 %immediately after the prefix.
351 %These properties of Ragel make it more work to write a program that requires
352 %the longest-match operator or buffering of input, however they make Ragel a
353 %more flexible tool that can produce very simple and fast-running programs under
354 %a variety of input acquisition arrangements.
356 %In Ragel, it is not necessary
357 %to introduce start conditions to concatenate tokens and retain action
358 %execution. Ragel allows one to structure a parser as a series of tokens, but
359 %does not require it.
361 %Like Lex and Re2C, Ragel is able to process input using a longest-match
362 %execution model, however the core of the Ragel language specifies parsers at a
363 %much lower level. This core is built around a pure state machine model. When
364 %building basic machines there is no implied algorithm for processing input
365 %other than to move from state to state on the transitions of the machine. This
366 %core of pure state machine operations makes Ragel well suited to handling
367 %parsing problems not based on token scanning. Should one need to use a
368 %longest-match model, the functionality is available and the lower level state
369 %machine construction facilities can be used to specify the patterns of a
370 %longest-match machine.
372 %This is not possible in Ragel. One can only program
373 %a longest-match instantiation with a fixed set of rules. One can jump to
374 %another longest-match machine that employs the same machine definitions in the
375 %construction of its rules, however no states will be shared.
377 %In Ragel, input may be re-parsed using a
378 %different machine, but since the action to be executed is associated with
379 %transitions of the compiled state machine, the longest-match construction does
380 %not permit a single rule to be excluded from the active set. It cannot be done
381 %ahead of time nor in the excluded rule's action.
384 The Re2C program defines an input processing model similar to that of Lex.
385 Unlike Lex, Re2C focuses on making generated state machines run very fast and
386 integrate easily into any program, free of dependencies. Re2C generates
387 directly executable code and is able to claim that generated parsers run nearly
388 as fast as their hand-coded equivalents. This is very important for user
389 adoption, as programmers are reluctant to use a tool when a faster alternative
390 exists. A consideration to ease of use is also important because developers
391 need the freedom to integrate the generated code as they see fit.
393 Many scripting languages provide ways of composing parsers by linking regular
394 expressions using program logic. For example, Sed and Awk are two established
395 Unix scripting tools that allow the programmer to exploit regular expressions
396 for the purpose of locating and extracting text of interest. High-level
397 programming languages such as Perl, Python, PHP and Ruby all provide regular
398 expression libraries that allow the user to combine regular expressions with
401 In addition to supporting the linking of regular expressions with arbitrary
402 program logic, the Perl programming language permits the embedding of code into
403 regular expressions. Perl embeddings do not translate into the embedding of
404 code into deterministic state machines. Perl regular expressions are in fact
405 not fully compiled to deterministic machines when embedded code is involved.
406 They are instead interpreted and involve backtracking. This is shown by the
407 following Perl program. When it is fed the input \verb|abcd| the interpretor
408 attempts to match the first alternative, printing \verb|a1 b1|. When this
409 possibility fails it backtracks and tries the second possibility, printing
410 \verb|a2 b2|, at which point it succeeds. A similar parser expressed in Ragel
411 will attempt both of the alternatives concurrently, printing
416 print "YES\n" if ( <STDIN> =~
417 /( a (?{ print "a1 "; }) b (?{ print "b1 "; }) cX ) |
418 ( a (?{ print "a2 "; }) b (?{ print "b2 "; }) cd )/x )
421 \section{Development Status}
423 Ragel is a relatively new tool and is under continuous development. As a rough
424 release guide, minor revision number changes are for implementation
425 improvements and feature additions. Major revision number changes are for
426 implementation and language changes that do not preserve backwards
427 compatibility. Though in the past this has not always held true: changes that
428 break code have crept into minor version number changes. Typically, the
429 documentation lags behind the development in the interest of documenting only
430 the lasting features. The latest changes are always documented in the ChangeLog
431 file. As Ragel stabilizes, which is expected in the 5.x line, the version
432 numbering rules will become more strict and the documentation will become more
436 \chapter{Constructing State Machines}
438 \section{Ragel State Machine Specifications}
440 A Ragel input file consists of a host language code file with embedded machine
441 specifications. Ragel normally passes input straight to output. When it sees
442 a machine specification it stops to read the Ragel statements and possibly generate
443 code in place of the specification.
444 Afterwards it continues to pass input through. There
445 can be any number of FSM specifications in an input file. A multi-line FSM spec
446 starts with \verb|%%{| and ends with \verb|}%%|. A single-line FSM spec starts
447 with \verb|%%| and ends at the first newline.
449 While Ragel is looking for FSM specifications it does basic lexical analysis on
450 the surrounding input. It interprets literal strings and comments so a
451 \verb|%%| sequence in either of those will not trigger the parsing of an FSM
452 specification. Ragel does not pass the input through any preprocessor nor does it
453 interpret preprocessor directives itself so includes, defines and ifdef logic
454 cannot be used to alter the parse of a Ragel input file. It is therefore not
455 possible to use an \verb|#if 0| directive to comment out a machine as is
456 commonly done in C code. As an alternative, a machine can be prevented from
457 causing any generated output by commenting out the write statements.
459 In Figure \ref{cmd-line-parsing}, a multi-line machine is used to define the
460 machine and single line machines are used to trigger the writing of the machine
461 data and execution code.
477 %% write data noerror nofinal;
481 int main( int argc, char **argv )
486 char *pe = p + strlen(p) + 1;
490 printf("result = %i\n", res );
495 \caption{Parsing a command line argument.}
496 \label{cmd-line-parsing}
500 \subsection{Naming Ragel Blocks}
507 The \verb|machine| statement gives the name of the FSM. If present in a
508 specification, this statement must appear first. If a machine specification
509 does not have a name then Ragel uses the previous specification name. If no
510 previous specification name exists then this is an error. Because FSM
511 specifications persist in memory, a machine's statements can be spread across
512 multiple machine specifications. This allows one to break up a machine across
513 several files or draw in statements that are common to multiple machines using
514 the include statement.
516 \subsection{Including Ragel Code}
519 include FsmName "inputfile.rl";
523 The \verb|include| statement can be used to draw in the statements of another FSM
524 specification. Both the name and input file are optional, however at least one
525 must be given. Without an FSM name, the given input file is searched for an FSM
526 of the same name as the current specification. Without an input file the
527 current file is searched for a machine of the given name. If both are present,
528 the given input file is searched for a machine of the given name.
530 \subsection{Machine Definition}
534 <name> = <expression>;
538 The machine definition statement associates an FSM expression with a name. Machine
539 expressions assigned to names can later be referenced by other expressions. A
540 definition statement on its own does not cause any states to be generated. It is simply a
541 description of a machine to be used later. States are generated only when a definition is
542 instantiated, which happens when a definition is referenced in an instantiated
545 \subsection{Machine Instantiation}
546 \label{instantiation}
549 <name> := <expression>;
553 The machine instantiation statement generates a set of states representing an expression and
554 associates a name with the entry point. Each instantiation generates a distinct
555 set of states. At a very minimum the \verb|main| machine must be instantiated.
556 Other machines may be instantiated and control passed to them by use of
557 \verb|fcall|, \verb|fgoto| or \verb|fnext| statements.
560 \subsection{Write Statement}
563 write <component> [options];
567 The write statement is used to generate parts of the machine. There are four
568 components that can be generated: the state machine's static data, the
569 initialization code, the execution code and the EOF action execution code. The
570 write statement is described in detail in Section \ref{write-statement}.
573 \section{Lexical Analysis of an FSM Specification}
576 Within a machine specification the following lexical rules apply to the parse
581 \item The \verb|#| symbol begins a comment that terminates at the next newline.
583 \item The symbols \verb|""|, \verb|''|, \verb|//|, \verb|[]| behave as the
584 delimiters of literal strings. With them, the following escape sequences are interpreted:
586 \verb| \0 \a \b \t \n \v \f \r|
588 A backslash at the end of a line joins the following line onto the current. A
589 backslash preceding any other character removes special meaning. This applies
590 to terminating characters and to special characters in regular expression
591 literals. As an exception, regular expression literals do not support escape
592 sequences as the operands of a range within a list. See the bullet on regular
593 expressions in Section \ref{basic}.
595 \item The symbols \verb|{}| delimit a block of host language code that will be
596 embedded into the machine as an action. Within the block of host language
597 code, basic lexical analysis of C/C++ comments and strings is done in order to
598 correctly find the closing brace of the block. With the exception of FSM
599 commands embedded in code blocks, the entire block is preserved as is for
600 identical reproduction in the output code.
602 \item The pattern \verb|[+-]?[0-9]+| denotes an integer in decimal format.
603 Integers used for specifying machines may be negative only if the alphabet type
604 is signed. Integers used for specifying priorities may be positive or negative.
606 \item The pattern \verb|0x[0-9a-fA-f]+| denotes an integer in hexadecimal
609 \item The keywords are \verb|access|, \verb|action|, \verb|alphtype|,
610 \verb|getkey|, \verb|write|, \verb|machine| and \verb|include|.
612 \item The pattern \verb|[a-zA-Z_][a-zA-Z_0-9]*| denotes an identifier.
614 %\item The allowable symbols are:
616 %\verb/ ( ) ! ^ * ? + : -> - | & . , := = ; > @ $ % /\\
617 %\verb| >/ $/ %/ </ @/ <>/ >! $! %! <! @! <>!|\\
618 %\verb| >^ $^ %^ <^ @^ <>^ >~ $~ %~ <~ @~ <>~|\\
619 %\verb| >* $* %* <* @* <>*|
621 \item Any amount of whitespace may separate tokens.
625 %\section{Parse of an FSM Specification}
627 %The following statements are possible within an FSM specification. The
628 %requirements for trailing semicolons loosely follow that of C.
630 %specifying code does not require a trailing semicolon. An expression
631 %statement does require a trailing semicolon.
634 \section{Basic Machines}
637 The basic machines are the base operands of regular language expressions. They
638 are the smallest unit to which machine construction and manipulation operators
641 In the diagrams that follow the symbol \verb|df| represents
642 the default transition, which is taken if no other transition can be taken. The
643 symbol \verb|cr| represents the carriage return character, \verb|nl| represents the newline character (aka line feed) and the symbol
644 \verb|sp| represents the space character.
648 \item \verb|'hello'| -- Concatenation Literal. Produces a machine that matches
649 the sequence of characters in the quoted string. If there are 5 characters
650 there will be 6 states chained together with the characters in the string. See
651 Section \ref{lexing} for information on valid escape sequences.
666 \includegraphics[scale=0.45]{bmconcat}
670 to make a concatenation literal case-insensitive by appending an \verb|i| to
671 the string, for example \verb|'cmd'i|.
673 \item \verb|"hello"| -- Identical to the single quoted version.
675 \item \verb|[hello]| -- Or Expression. Produces a union of characters. There
676 will be two states with a transition for each unique character between the two states.
677 The \verb|[]| delimiters behave like the quotes of a literal string. For example,
678 \verb|[ \t]| means tab or space. The or expression supports character ranges
679 with the \verb|-| symbol as a separator. The meaning of the union can be negated
680 using an initial \verb|^| character as in standard regular expressions.
681 See Section \ref{lexing} for information on valid escape sequences
697 \includegraphics[scale=0.45]{bmor}
700 \item \verb|''|, \verb|""|, and \verb|[]| -- Zero Length Machine. Produces a machine
701 that matches the zero length string. Zero length machines have one state that is both
702 a start state and a final state.
717 \includegraphics[scale=0.45]{bmnull}
720 % FIXME: More on the range of values here.
721 \item \verb|42| -- Numerical Literal. Produces a two state machine with one
722 transition on the given number. The number may be in decimal or hexadecimal
723 format and should be in the range allowed by the alphabet type. The minimum and
724 maximum values permitted are defined by the host machine that Ragel is compiled
725 on. For example, numbers in a \verb|short| alphabet on an i386 machine should
726 be in the range \verb|-32768| to \verb|32767|.
740 \includegraphics[scale=0.45]{bmnum}
743 \item \verb|/simple_regex/| -- Regular Expression. Regular expressions are
744 parsed as a series of expressions that will be concatenated together. Each
745 concatenated expression
746 may be a literal character, the any character specified by the \verb|.|
747 symbol, or a union of characters specified by the \verb|[]| delimiters. If the
748 first character of a union is \verb|^| then it matches any character not in the
749 list. Within a union, a range of characters can be given by separating the first
750 and last characters of the range with the \verb|-| symbol. Each
751 concatenated machine may have repetition specified by following it with the
752 \verb|*| symbol. The standard escape sequences described in Section
753 \ref{lexing} are supported everywhere in regular expressions except as the
754 operands of a range within in a list. This notation also supports the \verb|i|
755 trailing option. Use it to produce case-insensitive machines, as in \verb|/GET/i|.
757 Ragel does not support very complex regular expressions because the desired
758 results can always be achieved using the more general machine construction
759 operators listed in Section \ref{machconst}. The following diagram shows the
760 result of compiling \verb|/ab*[c-z].*[123]/|.
768 main := /ab*[c-z].*[123]/;
775 \includegraphics[scale=0.45]{bmregex}
778 \item \verb|'a' .. 'z'| -- Range. Produces a machine that matches any
779 characters in the specified range. Allowable upper and lower bounds of the
780 range are concatenation literals of length one and numerical literals. For
781 example, \verb|0x10..0x20|, \verb|0..63|, and \verb|'a'..'z'| are valid ranges.
782 The bounds should be in the range allowed by the alphabet type.
797 \includegraphics[scale=0.45]{bmrange}
801 \item \verb|variable_name| -- Lookup the machine definition assigned to the
802 variable name given and use an instance of it. See Section \ref{definition} for
803 an important note on what it means to reference a variable name.
805 \item \verb|builtin_machine| -- There are several built-in machines available
806 for use. They are all two state machines for the purpose of matching common
807 classes of characters. They are:
811 \item \verb|any | -- Any character in the alphabet.
813 \item \verb|ascii | -- Ascii characters. \verb|0..127|
815 \item \verb|extend| -- Ascii extended characters. This is the range
816 \verb|-128..127| for signed alphabets and the range \verb|0..255| for unsigned
819 \item \verb|alpha | -- Alphabetic characters. \verb|[A-Za-z]|
821 \item \verb|digit | -- Digits. \verb|[0-9]|
823 \item \verb|alnum | -- Alpha numerics. \verb|[0-9A-Za-z]|
825 \item \verb|lower | -- Lowercase characters. \verb|[a-z]|
827 \item \verb|upper | -- Uppercase characters. \verb|[A-Z]|
829 \item \verb|xdigit| -- Hexadecimal digits. \verb|[0-9A-Fa-f]|
831 \item \verb|cntrl | -- Control characters. \verb|0..31|
833 \item \verb|graph | -- Graphical characters. \verb|[!-~]|
835 \item \verb|print | -- Printable characters. \verb|[ -~]|
837 \item \verb|punct | -- Punctuation. Graphical characters that are not alphanumerics.
838 \verb|[!-/:-@[-`{-~]|
840 \item \verb|space | -- Whitespace. \verb|[\t\v\f\n\r ]|
842 \item \verb|zlen | -- Zero length string. \verb|""|
844 \item \verb|empty | -- Empty set. Matches nothing. \verb|^any|
849 \section{Operator Precedence}
850 The following table shows operator precedence from lowest to highest. Operators
851 in the same precedence group are evaluated from left to right.
854 \begin{tabular}{|c|c|c|}
858 2&\verb/ | & - --/&Union, Intersection and Subtraction\\
860 3&\verb| . <: :> :>> |&Concatenation\\
864 5&\verb| -> |&Epsilon Transition\\
866 &\verb| > @ $ % |&Transitions Actions and Priorities\\
868 &\verb| >/ $/ %/ </ @/ <>/ |&EOF Actions\\
870 6&\verb| >! $! %! <! @! <>! |&Global Error Actions\\
872 &\verb| >^ $^ %^ <^ @^ <>^ |&Local Error Actions\\
874 &\verb| >~ $~ %~ <~ @~ <>~ |&To-State Actions\\
876 &\verb| >* $* %* <* @* <>* |&From-State Action\\
878 7&\verb| * ** ? + {n} {,n} {n,} {n,m} |&Repetition\\
880 8&\verb| ! ^ |&Negation and Character-Level Negation\\
882 9&\verb| ( <expr> ) |&Grouping\\
886 \section{Regular Language Operators}
889 When using Ragel it is helpful to have a sense of how it constructs machines.
890 Sometimes this the determinization process can cause results that appear unusual to someone
891 unfamiliar with it. Ragel does not make use of any nondeterministic
892 intermediate state machines. All operators accept and return deterministic
893 machines. However, to ease the discussion, the operations are defined in terms
896 To draw an epsilon transition between two states \verb|x| and \verb|y|, is to
897 copy all of the properties of \verb|y| into \verb|x|. This involves drawing in
898 all of \verb|y|'s to-state actions, EOF actions, etc., as well as its
899 transitions. If \verb|x| and \verb|y| both have a transition out on the same
900 character, then the transitions must be combined. During transition
901 combination a new transition is made which goes to a new state that is the
902 combination of both target states. The new combination state is created using
903 the same epsilon transition method. The new state has an epsilon transition
904 drawn to all the states that compose it. Since every time an epsilon transition
905 is drawn the creation of new epsilon transitions may be triggered, the process
906 of drawing epsilon transitions is repeated until there are no more epsilon
907 transitions to be made.
909 A very common error that is made when using Ragel is to make machines that do
910 too much at once. That is, to create machines that have unintentional
911 nondeterminism. This usually results from being unaware of the common strings
912 between machines that are combined together using the regular language
913 operators. This can involve never leaving a machine, causing its actions to be
914 propagated through all the following states. Or it can involve an alternation
915 where both branches are unintentionally taken simultaneously.
917 This problem forces one to think hard about the language that needs to be
918 matched. To guard against this kind of problem one must ensure that the machine
919 specification is divided up using boundaries that do not allow ambiguities from
920 one portion of the machine to the next. See Chapter
921 \ref{controlling-nondeterminism} for more on this problem and how to solve it.
923 The Graphviz tool is an immense help when debugging improperly compiled
924 machines or otherwise learning how to use Ragel. In many cases, practical
925 parsing programs will be too large to completely visualize with Graphviz. The
926 proper approach is to reduce the language to the smallest subset possible that
927 still exhibits the characteristics that one wishes to learn about or to fix.
928 This can be done without modifying the source code using the \verb|-M| and
929 \verb|-S| options at the frontend. If a machine cannot be easily reduced,
930 embeddings of unique actions can be very useful for tracing a
931 particular component of a larger machine specification, since action names are
932 written out on transition labels.
939 The union operation produces a machine that matches any string in machine one
940 or machine two. The operation first creates a new start state. Epsilon
941 transitions are drawn from the new start state to the start states of both
942 input machines. The resulting machine has a final state set equivalent to the
943 union of the final state sets of both input machines. In this operation, there
944 is the opportunity for nondeterminism among both branches. If there are
945 strings, or prefixes of strings that are matched by both machines then the new
946 machine will follow both parts of the alternation at once. The union operation is
951 \includegraphics{opor}
955 The following example demonstrates the union of three machines representing
964 # Hex digits, decimal digits, or identifiers
965 main := '0x' xdigit+ | digit+ | alpha alnum*;
973 \includegraphics[scale=0.45]{exor}
976 \subsection{Intersection}
981 Intersection produces a machine that matches any
982 string which is in both machine one and machine two. To achieve intersection, a
983 union is performed on the two machines. After the result has been made
984 deterministic, any final state that is not a combination of final states from
985 both machines has its final state status revoked. To complete the operation,
986 paths that do not lead to a final state are pruned from the machine. Therefore,
987 if there are any such paths in either of the expressions they will be removed
988 by the intersection operator. Intersection can be used to require that two
989 independent patterns be simultaneously satisfied as in the following example.
997 # Match lines four characters wide that contain
998 # words separated by whitespace.
1000 /[^\n][^\n][^\n][^\n]\n/* &
1001 (/[a-z][a-z]*/ | [ \n])**;
1009 \includegraphics[scale=0.45]{exinter}
1012 \subsection{Difference}
1017 The difference operation produces a machine that matches
1018 strings which are in machine one but which are not in machine two. To achieve subtraction,
1019 a union is performed on the two machines. After the result has been made
1020 deterministic, any final state that came from machine two or is a combination
1021 of states involving a final state from machine two has its final state status
1022 revoked. As with intersection, the operation is completed by pruning any path
1023 that does not lead to a final state. The following example demonstrates the
1024 use of subtraction to exclude specific cases from a set.
1034 # Subtract keywords from identifiers.
1035 main := /[a-z][a-z]*/ - ( 'for' | 'int' );
1043 \includegraphics[scale=0.45]{exsubtr}
1048 \subsection{Strong Difference}
1049 \label{strong_difference}
1054 Strong difference produces a machine that matches any string of the first
1055 machine which does not have any string of the second machine as a substring. In
1056 the following example, strong subtraction is used to excluded \verb|CRLF| from
1059 % GENERATE: exstrongsubtr
1062 % machine exstrongsubtr;
1066 main := [a-z]+ ':' ( any* -- crlf ) crlf;
1074 \includegraphics[scale=0.45]{exstrongsubtr}
1078 This operator is equivalent to the following.
1082 expr - ( any* expr any* )
1085 \subsection{Concatenation}
1090 Concatenation produces a machine that matches all the strings in machine one followed by all
1091 the strings in machine two. Concatenation draws epsilon transitions from the
1092 final states of the first machine to the start state of the second machine. The
1093 final states of the first machine loose their final state status, unless the
1094 start state of the second machine is final as well.
1095 Concatenation is the default operator. Two machines next to each other with no
1096 operator between them results in the machines being concatenated together.
1100 \includegraphics{opconcat}
1104 The opportunity for nondeterministic behaviour results from the possibility of
1105 the final states of the first machine accepting a string which is also accepted
1106 by the start state of the second machine.
1107 The most common scenario that this happens in is the
1108 concatenation of a machine that repeats some pattern with a machine that gives
1109 a termination string, but the repetition machine does not exclude the
1110 termination string. The example in Section \ref{strong_difference}
1111 guards against this. Another example is the expression \verb|("'" any* "'")|.
1112 When exectued the thread of control will
1113 never leave the \verb|any*| machine. This is a problem especially if actions
1114 are embedded to processes the characters of the \verb|any*| component.
1116 In the following example, the first machine is always active due to the
1117 nondeterministic nature of concatenation. This particular nondeterminism is intended
1118 however because we wish to permit EOF strings before the end of the input.
1120 % GENERATE: exconcat
1126 # Require an eof marker on the last line.
1127 main := /[^\n]*\n/* . 'EOF\n';
1135 \includegraphics[scale=0.45]{exconcat}
1139 \noindent {\bf Note:} There is a language
1140 ambiguity involving concatenation and subtraction. Because concatenation is the
1141 default operator for two
1142 adjacent machines there is an ambiguity between subtraction of
1143 a positive numerical literal and concatenation of a negative numerical literal.
1144 For example, \verb|(x-7)| could be interpreted as \verb|(x . -7)| or
1145 \verb|(x - 7)|. In the Ragel language, the subtraction operator always takes precedence
1146 over concatenation of a negative literal. Precedence was given to the
1147 subtraction-based interpretation so as to adhere to the rule that the default
1148 concatenation operator takes effect only when there are no other operators between
1149 two machines. Beware of writing machines such as \verb|(any -1)| when what is
1150 desired is a concatenation of \verb|any| and -1. Instead write
1151 \verb|(any . -1)| or \verb|(any (-1))|. If in doubt of the meaning of your program do not
1152 rely on the default concatenation operator, always use the \verb|.| symbol.
1155 \subsection{Kleene Star}
1160 The machine resulting from the Kleene Star operator will match zero or more
1161 repetitions of the machine it is applied to.
1162 It creates a new start state and an additional final
1163 state. Epsilon transitions are drawn between the new start state and the old start
1164 state, between the new start state and the new final state, and
1165 between the final states of the machine and the new start state. After the
1166 machine is made deterministic the effect is of the final states getting all the
1167 transitions of the start state.
1171 \includegraphics{opstar}
1175 The possibility for nondeterministic behaviour arises if the final states have
1176 transitions on any of the same characters as the start state. This is common
1177 when applying kleene star to an alternation of tokens. Like the other problems
1178 arising from nondeterministic behavior, this is discussed in more detail in Chapter
1179 \ref{controlling-nondeterminism}. This particular problem can also be solved
1180 by using the longest-match construction discussed in Section
1181 \ref{generating-scanners} on scanners.
1184 example, there is no nondeterminism introduced by the exterior kleene star due
1185 the newline at the end of the regular expression. Without the newline the
1186 exterior kleene star would be redundant and there would be ambiguity between
1187 repeating the inner range of the regular expression and the entire regular
1188 expression. Though it would not cause a problem in this case, unnecessary
1189 nondeterminism in the kleene star operator often causes undesired results for
1190 new Ragel users and must be guarded against.
1198 # Match any number of lines with only lowercase letters.
1199 main := /[a-z]*\n/*;
1207 \includegraphics[scale=0.45]{exstar}
1211 \subsection{One Or More Repetition}
1216 This operator produces the concatenation of the machine with the kleene star of
1217 itself. The result will match one or more repetitions of the machine. The plus
1218 operator is equivalent to \verb|(expr . expr*)|. The plus operator makes
1219 repetitions that cannot be zero length.
1227 # Match alpha-numeric words.
1236 \includegraphics[scale=0.45]{explus}
1240 \subsection{Optional}
1245 The {\em optional} operator produces a machine that accepts the machine
1246 given or the zero length string. The optional operator is equivalent to
1247 \verb/(expr | '' )/. In the following example the optional operator is used to
1250 % GENERATE: exoption
1256 # Match integers or floats.
1257 main := digit+ ('.' digit+)?;
1265 \includegraphics[scale=0.45]{exoption}
1270 \subsection{Repetition}
1273 \noindent \verb|expr {n}| \hspace{16pt}\=-- Exactly N copies of expr.\\
1275 \noindent \verb|expr {,n}| \>-- Zero to N copies of expr.\\
1277 \noindent \verb|expr {n,}| \>-- N or more copies of expr.\\
1279 \noindent \verb|expr {n,m}| \>-- N to M copies of expr.
1282 \subsection{Negation}
1287 Negation produces a machine that matches any string not matched by the given
1288 machine. Negation is equivalent to \verb|(any* - expr)|.
1290 % GENERATE: exnegate
1296 # Accept anything but a string beginning with a digit.
1297 main := ! ( digit any* );
1305 \includegraphics[scale=0.45]{exnegate}
1310 \subsection{Character-Level Negation}
1315 Character-level negation produces a machine that matches any single character
1316 not matched by the given machine. Character-Level Negation is equivalent to
1317 \verb|(any - expr)|.
1319 \section{State Charts}
1321 It is not uncommon for programmers to implement
1322 parsers as manually-coded state machines, either using a switch statement or a
1323 state map compiler which takes a list of states, transitions and actions, and
1326 This method can be a very effective programming technique for producing robust
1327 code. The key disadvantage becomes clear when one attempts to comprehend such a
1328 parser. Machines coded in this way usually require many lines, causing logic to
1329 be spread out over large distances in the source file. Remembering the function
1330 of a large number of states can be difficult and organizing the parser in a
1331 sensible way requires discipline because branches and repetition present many
1332 file layout options. This kind of programming takes a specification with
1333 inherent structure such as looping, alternation and concatenation and expresses
1336 If we could take an isolated component of a manually programmed state chart,
1337 that is, a subset of states that has only one entry point, and implement it
1338 using regular language operators then we could eliminate all the explicit
1339 naming of the states contained in it. By eliminating explicitly named states
1340 and replacing them with higher-level specifications we simplify a parser
1343 For example, sometimes chains of states are needed, with only a small number of
1344 possible characters appearing along the chain. These can easily be replaced
1345 with a concatenation of characters. Sometimes a group of common states
1346 implement a loop back to another single portion of the machine. Rather than
1347 manually duplicate all the transitions that loop back, we may be able to
1348 express the loop using a kleene star operator.
1350 Ragel allows one to take this state map simplification approach. We can build
1351 state machines using a state map model and implement portions of the state map
1352 using regular languages. In place of any transition in the state machine,
1353 entire sub-state machines can be given. These can encapsulate functionality
1354 defined elsewhere. An important aspect of the Ragel approach is that when we
1355 wrap up a collection of states using a regular expression we do not loose
1356 access to the states and transitions. We can still execute code on the
1357 transitions that we have encapsulated.
1361 \verb|expr , expr , ...|
1364 Join a list of machines together without
1365 drawing any transitions, without setting up a start state, and without
1366 designating any final states. Transitions between the machines may be specified
1367 using labels and epsilon transitions. The start state must be explicity
1368 specified with the ``start'' label. Final states may be specified with the an
1369 epsilon transition to the implicitly created ``final'' state. The join
1370 operation allows one to build machines using a state chart model.
1377 Attaches a label to an expression. Labels can be
1378 used as the target of epsilon transitions and explicit control transfer
1379 statements such \verb|fgoto| and \verb|fnext| in action
1382 \subsection{Epsilon}
1384 \verb|expr -> label|
1387 Draws an epsilon transition to the state defined
1388 by \verb|label|. Epsilon transitions are made deterministic when join
1389 operators are evaluated. Epsilon transitions that are not in a join operation
1390 are made deterministic when the machine definition that contains the epsilon is
1391 complete. See Section \ref{labels} for information on referencing labels.
1395 \label{generating-scanners}
1397 The longest-match operator can be used to construct scanners. The generated
1398 machine repeatedly attempts to match one of the given patterns, first favouring
1399 longer pattern matches over shorter ones. If there is a choice between equal
1400 length matches, the match of the pattern which appears first is chosen.
1404 <machine_name> := |*
1405 pattern1 => action1;
1406 pattern2 => action2;
1412 The longest-match construction operator is not a pure state machine operator.
1413 It relies on the \verb|tokstart|, \verb|tokend| and \verb|act| variables to be
1414 present so that it can backtrack and make pointers to the matched text
1415 available to the user. If input is processed using multiple calls to the
1416 execute code then the user must ensure that when a token is only partially
1417 matched that the prefix is preserved on the subsequent invocation of the
1420 The \verb|tokstart| variable must be defined as a pointer to the input data.
1421 It is used for recording where the current token match begins. This variable
1422 may be used in action code for retrieving the text of the current match. Ragel
1423 ensures that in between tokens and outside of the longest-match machines that
1424 this pointer is set to null. In between calls to the execute code the user must
1425 check if \verb|tokstart| is set and if so, ensure that the data it points to is
1426 preserved ahead of the next buffer block. This is described in more detail
1429 The \verb|tokend| variable must also be defined as a pointer to the input data.
1430 It is used for recording where a match ends and where scanning of the next
1431 token should begin. This can also be used in action code for retrieving the
1432 text of the current match.
1434 The \verb|act| variable must be defined as an integer type. It is used for
1435 recording the identity of the last pattern matched when the scanner must go
1436 past a matched pattern in an attempt to make a longer match. If the longer
1437 match fails it may need to consult the act variable. In some cases use of the act
1438 variable can be avoided because the value of the current state is enough
1439 information to determine which token to accept, however in other cases this is
1440 not enough and so the \verb|act| variable is used.
1442 When the longest-match operator is in use, the user's driver code must take on
1443 some buffer management functions. The following algorithm gives an overview of
1444 the steps that should be taken to properly use the longest-match operator.
1447 \setlength{\parskip}{0pt}
1448 \item Read a block of input data.
1449 \item Run the execute code.
1450 \item If \verb|tokstart| is set, the execute code will expect the incomplete
1451 token to be preserved ahead of the buffer on the next invocation of the execute
1454 \item Shift the data beginning at \verb|tokstart| and ending at \verb|pe| to the
1455 beginning of the input buffer.
1456 \item Reset \verb|tokstart| to the beginning of the buffer.
1457 \item Shift \verb|tokend| by the distance from the old value of \verb|tokstart|
1458 to the new value. The \verb|tokend| variable may or may not be valid. There is
1459 no way to know if it holds a meaningful value because it is not kept at null
1460 when it is not in use. It can be shifted regardless.
1462 \item Read another block of data into the buffer, immediately following any
1464 \item Run the scanner on the new data.
1467 Figure \ref{preserve_example} shows the required handling of an input stream in
1468 which a token is broken by the input block boundaries. After processing up to
1469 and including the ``t'' of ``characters'', the prefix of the string token must be
1470 retained and processing should resume at the ``e'' on the next iteration of
1473 If one uses a large input buffer for collecting input then the number of times
1474 the shifting must be done will be small. Furthermore, if one takes care not to
1475 define tokens that are allowed to be very long and instead processes these
1476 items using pure state machines or sub-scanners, then only a small amount of
1477 data will ever need to be shifted.
1481 a) A stream "of characters" to be scanned.
1485 b) "of characters" to be scanned.
1489 \caption{Following an invocation of the execute code there may be a partially
1490 matched token (a). The data of the partially matched token
1491 must be preserved ahead of the new data on the next invocation (b).}
1492 \label{preserve_example}
1495 Since scanners attempt to make the longest possible match of input, in some
1496 cases they are not able to identify a token upon parsing its final character,
1497 they must wait for a lookahead character. For example if trying to match words,
1498 the token match must be triggered on following whitespace in case more
1499 characters of the word have yet to come. The user must therefore arrange for an
1500 EOF character to be sent to the scanner to flush out any token that has not yet
1501 been matched. The user can exclude a single character from the entire scanner
1502 and use this character as the EOF character, possibly specifying an EOF action.
1503 For most scanners, zero is a suitable choice for the EOF character.
1505 Alternatively, if whitespace is not significant and ignored by the scanner, the
1506 final real token can be flushed out by simply sending an additional whitespace
1507 character on the end of the stream. If the real stream ends with whitespace
1508 then it will simply be extended and ignored. If it does not, then the last real token is
1509 guaranteed to be flushed and the dummy EOF whitespace ignored.
1510 An example scanner processing loop is given in Figure \ref{scanner-loop}.
1518 /* How much space is in the buffer? */
1519 int space = BUFSIZE - have;
1521 /* Buffer is full. */
1522 cerr << "TOKEN TOO BIG" << endl;
1526 /* Read in a block after any data we already have. */
1527 char *p = inbuf + have;
1528 cin.read( p, space );
1529 int len = cin.gcount();
1531 /* If no data was read, send the EOF character.
1540 if ( cs == RagelScan_error ) {
1541 /* Machine failed before finding a token. */
1542 cerr << "PARSE ERROR" << endl;
1546 if ( tokstart == 0 )
1549 /* There is a prefix to preserve, shift it over. */
1550 have = pe - tokstart;
1551 memmove( inbuf, tokstart, have );
1552 tokend = inbuf + (tokend-tokstart);
1557 \caption{A processing loop for a scanner.}
1558 \label{scanner-loop}
1562 \section{Write Statement}
1563 \label{write-statement}
1566 write <component> [options];
1571 The write statement is used to generate parts of the machine.
1573 components that can be generated by a write statement. These components are the
1574 state machine's data, initialization code, execution code and EOF action
1575 execution code. A write statement may appear before a machine is fully defined.
1576 This allows one to write out the data first then later define the machine where
1577 it is used. An example of this is show in Figure \ref{fbreak-example}.
1579 \subsection{Write Data}
1581 write data [options];
1585 The write data statement causes Ragel to emit the constant static data needed
1586 by the machine. In table-driven output styles (see Section \ref{genout}) this
1587 is a collection of arrays that represent the states and transitions of the
1588 machine. In goto-driven machines much less data is emitted. At the very
1589 minimum a start state \verb|name_start| is generated. All variables written
1590 out in machine data have both the \verb|static| and \verb|const| properties and
1591 are prefixed with the name of the machine and an
1592 underscore. The data can be placed inside a class, inside a function, or it can
1593 be defined as global data.
1595 Two variables are written that may be used to test the state of the machine
1596 after a buffer block has been processed. The \verb|name_error| variable gives
1597 the id of the state that the machine moves into when it cannot find a valid
1598 transition to take. The machine immediately breaks out of the processing loop when
1599 it finds itself in the error state. The error variable can be compared to the
1600 current state to determine if the machine has failed to parse the input. If the
1601 machine is complete, that is from every state there is a transition to a proper
1602 state on every possible character of the alphabet, then no error state is required
1603 and this variable will be set to -1.
1605 The \verb|name_first_final| variable stores the id of the first final state. All of the
1606 machine's states are sorted by their final state status before having their ids
1607 assigned. Checking if the machine has accepted its input can then be done by
1608 checking if the current state is greater-than or equal to the first final
1611 Data generation has several options:
1614 \item \verb|noerror| - Do not generate the integer variable that gives the
1615 id of the error state.
1616 \item \verb|nofinal| - Do not generate the integer variable that gives the
1617 id of the first final state.
1618 \item \verb|noprefix| - Do not prefix the variable names with the name of the
1622 \subsection{Write Init}
1628 The write init statement causes Ragel to emit initialization code. This should
1629 be executed once before the machine is started. At a very minimum this sets the
1630 current state to the start state. If other variables are needed by the
1631 generated code, such as call
1632 stack variables or longest-match management variables, they are also
1635 \subsection{Write Exec}
1637 write exec [options];
1641 The write exec statement causes Ragel to emit the state machine's execution code.
1642 Ragel expects several variables to be available to this code. At a very minimum, the
1643 generated code needs access to the current character position \verb|p|, the ending
1644 position \verb|pe| and the current state \verb|cs|, though \verb|pe|
1645 can be excluded by specifying the \verb|noend| write option.
1646 The \verb|p| variable is the cursor that the execute code will
1647 used to traverse the input. The \verb|pe| variable should be set up to point to one
1648 position past the last valid character in the buffer.
1650 Other variables are needed when certain features are used. For example using
1651 the \verb|fcall| or \verb|fret| statements requires \verb|stack| and
1652 \verb|top| variables to be defined. If a longest-match construction is used,
1653 variables for managing backtracking are required.
1655 The write exec statement has one option. The \verb|noend| option tells Ragel
1656 to generate code that ignores the end position \verb|pe|. In this
1657 case the user must explicitly break out of the processing loop using
1658 \verb|fbreak|, otherwise the machine will continue to process characters until
1659 it moves into the error state. This option is useful if one wishes to process a
1660 null terminated string. Rather than traverse the string to discover then length
1661 before processing the input, the user can break out when the null character is
1662 seen. The example in Figure \ref{fbreak-example} shows the use of the
1663 \verb|noend| write option and the \verb|fbreak| statement for processing a string.
1670 int main( int argc, char **argv )
1672 %% write data noerror nofinal;
1679 0 @{ res = 1; fbreak; };
1684 printf("execute = %i\n", res );
1688 \caption{Use of {\tt noend} write option and the {\tt fbreak} statement for
1689 processing a string.}
1690 \label{fbreak-example}
1694 \subsection{Write EOF Actions}
1700 The write EOF statement causes Ragel to emit code that executes EOF actions.
1701 This write statement is only relevant if EOF actions have been embedded,
1702 otherwise it does not generate anything. The EOF action code requires access to
1705 \section{Referencing Names}
1708 This section describes how to reference names in epsilon transitions and
1709 action-based control-flow statements such as \verb|fgoto|. There is a hierarchy
1710 of names implied in a Ragel specification. At the top level are the machine
1711 instantiations. Beneath the instantiations are labels and references to machine
1712 definitions. Beneath those are more labels and references to definitions, and
1715 Any name reference may contain multiple components separated with the \verb|::|
1716 compound symbol. The search for the first component of a name reference is
1717 rooted at the join expression that the epsilon transition or action embedding
1718 is contained in. If the name reference is not not contained in a join,
1719 the search is rooted at the machine definition that that the epsilon transition or
1720 action embedding is contained in. Each component after the first is searched
1721 for beginning at the location in the name tree that the previous reference
1722 component refers to.
1724 In the case of action-based references, if the action is embedded more than
1725 once, the local search is performed for each embedding and the result is the
1726 union of all the searches. If no result is found for action-based references then
1727 the search is repeated at the root of the name tree. Any action-based name
1728 search may be forced into a strictly global search by prefixing the name
1729 reference with \verb|::|.
1731 The final component of the name reference must resolve to a unique entry point.
1732 If a name is unique in the entire name tree it can be referenced as is. If it
1733 is not unique it can be specified by qualifying it with names above it in the
1734 name tree. However, it can always be renamed.
1736 % FIXME: Should fit this in somewhere.
1737 % Some kinds of name references are illegal. Cannot call into longest-match
1738 % machine, can only call its start state. Cannot make a call to anywhere from
1739 % any part of a longest-match machine except a rule's action. This would result
1740 % in an eventual return to some point inside a longest-match other than the
1741 % start state. This is banned for the same reason a call into the LM machine is
1744 \section{State Machine Minimization}
1746 State machine minimization is the process of finding the minimal equivalent FSM accepting
1747 the language. Minimization reduces the number of states in machines
1748 by merging equivalent states. It does not change the behaviour of the machine
1749 in any way. It will cause some states to be merged into one because they are
1750 functionally equivalent. State minimization is on by default. It can be turned
1751 off with the \verb|-n| option.
1753 The algorithm implemented is similar to Hopcroft's state minimization
1754 algorithm. Hopcroft's algorithm assumes a finite alphabet that can be listed in
1755 memory, whereas Ragel supports arbitrary integer alphabets that cannot be
1756 listed in memory. Though exact analysis is very difficult, Ragel minimization
1757 runs close to $O(n \times log(n))$ and requires $O(n)$ temporary storage where
1758 $n$ is the number of states.
1760 \chapter{User Actions}
1762 \section{Embedding Actions}
1766 /* Code an action here. */
1772 The action statement defines a block of code that can be embedded into an FSM.
1773 Action names can be referenced by the action embedding operators in
1774 expressions. Though actions need not be named in this way (literal blocks
1775 of code can be embedded directly when building machines), defining reusable
1776 blocks of code whenever possible is good practice because it potentially increases the
1777 degree to which the machine can be minimized. Within an action some Ragel expressions
1778 and statements are parsed and translated. These allow the user to interact with the machine
1779 from action code. See Section \ref{vals} for a complete list of statements and
1780 values available in code blocks.
1782 \subsection{Entering Action}
1784 \verb|expr > action|
1787 The entering operator embeds an action into the starting transitions. The
1788 action is executed on all transitions that enter into the machine from the
1789 start state. If the start state is a final state then it is possible for the
1790 machine to never be entered and the starting transitions bypassed. In the
1791 following example, the action is executed on the first transition of the
1792 machine. If the repetition machine is bypassed the action is not executed.
1802 # Execute A at the beginning of a string of alpha.
1804 main := ( lower* >A ) . ' ';
1812 \includegraphics[scale=0.45]{exstact}
1816 \subsection{Finishing Action}
1818 \verb|expr @ action|
1821 The finishing action operator embeds an action into any transitions that go into a
1822 final state. Whether or not the machine accepts is not determined at the point
1823 the action is executed. Further input may move the machine out of the accepting
1824 state, but keep it in the machine. As in the following example, the
1825 into-final-state operator is most often used when no lookahead is necessary.
1827 % GENERATE: exdoneact
1830 % machine exdoneact;
1834 # Execute A when the trailing space is seen.
1835 main := ( lower* ' ' ) @A;
1843 \includegraphics[scale=0.45]{exdoneact}
1848 \subsection{All Transition Action}
1850 \verb|expr $ action|
1853 The all transition operator embeds an action into all transitions of a machine.
1854 The action is executed whenever a transition of the machine is taken. In the
1855 following example, A is executed on every character matched.
1857 % GENERATE: exallact
1864 # Execute A on any characters of machine one or two.
1865 main := ( 'm1' | 'm2' ) $A;
1873 \includegraphics[scale=0.45]{exallact}
1878 \subsection{Pending Out (Leaving) Actions}
1881 \verb|expr % action|
1884 The pending out action operator embeds an action into the pending out
1885 transitions of a machine. The action is first embedded into the final states of
1886 the machine and later transferred to any transitions made going out of the
1887 machine. The transfer can be caused either by a concatenation or kleene star
1888 operation. This mechanism allows one to associate an action with the
1889 termination of a sequence, without being concerned about what particular
1890 character terminates the sequence. In the following example, A is executed
1891 when leaving the alpha machine by the newline character.
1893 % GENERATE: exoutact1
1896 % machine exoutact1;
1900 # Match a word followed by an newline. Execute A when
1901 # finishing the word.
1902 main := ( lower+ %A ) . '\n';
1910 \includegraphics[scale=0.45]{exoutact1}
1914 In the following example, the \verb|term_word| action could be used to register
1915 the appearance of a word and to clear the buffer that the \verb|lower| action used
1916 to store the text of it.
1918 % GENERATE: exoutact2
1921 % machine exoutact2;
1924 % action term_word {}
1928 word = ( [a-z] @lower )+ %term_word;
1929 main := word ( ' ' @space word )* '\n' @newline;
1937 \includegraphics[scale=0.45]{exoutact2}
1942 In this final example of the action embedding operators, A is executed upon
1943 entering the alpha machine, B is executed on all transitions of the alpha
1944 machine, C is executed when the alpha machine accepts by moving into the
1945 newline machine and N is executed when the newline machine moves into a final
1948 % GENERATE: exaction
1958 # Execute A on starting the alpha machine, B on every transition
1959 # moving through it and C upon finishing. Execute N on the newline.
1960 main := ( lower* >A $B %C ) . '\n' @N;
1968 \includegraphics[scale=0.45]{exaction}
1973 \section{State Action Embedding Operators}
1975 The state embedding operators allow one to embed actions into states. Like the
1976 transition embedding operators, there are several different classes of states
1977 that the operators access. The meanings of the symbols are partially related to
1978 the meanings of the symbols used by the transition embedding operators.
1980 The state embedding operators are different from the transition embedding
1981 operators in that there are various kinds of events that embedded actions can
1982 be associated with, requiring them to be distinguished by these different types
1983 of events. The state embedding operators have two components. The first, which
1984 is the first one or two characters, specifies the class of states that the
1985 action will be embedded into. The second component specifies the type of event
1986 the action will be executed on.
1988 \def\fakeitem{\hspace*{12pt}$\bullet$\hspace*{10pt}}
1990 \begin{minipage}{\textwidth}
1991 \begin{multicols}{2}
1993 \noindent The different classes of states are:\\
1994 \fakeitem \verb|> | -- the start state \\
1995 \fakeitem \verb|$ | -- all states\\
1996 \fakeitem \verb|% | -- final states\\
1997 \fakeitem \verb|< | -- any state except the start state\\
1998 \fakeitem \verb|@ | -- any state except final states\\
1999 \fakeitem \verb|<>| -- any except start and final (middle)
2003 \noindent The different kinds of embeddings are:\\
2004 \fakeitem \verb|~| -- to-state actions\\
2005 \fakeitem \verb|*| -- from-state actions\\
2006 \fakeitem \verb|/| -- EOF actions\\
2007 \fakeitem \verb|!| -- error actions\\
2008 \fakeitem \verb|^| -- local error actions\\
2011 %\label{state-act-embed}
2012 %\caption{The two components of state embedding operators. The class of states
2013 %to select comes first, followed by the type of embedding.}
2017 %\includegraphics{stembed}
2018 %\caption{Summary of state manipulation operators}
2019 %\label{state-act-embed-chart}
2022 %\noindent Putting these two components together we get a matrix of state
2023 %embedding operators. The entire set is given in Figure \ref{state-act-embed-chart}.
2026 \subsection{To-State and From-State Actions}
2028 \subsubsection{To-State Actions}
2030 \verb| >~ $~ %~ <~ @~ <>~ |
2033 To-state actions are executed whenever the state machine moves into the
2034 specified state, either by a natural movement over a transition or by an
2035 action-based transfer of control such as \verb|fgoto|. They are executed after the
2036 in-transition's actions but before the current character is advanced and
2037 tested against the end of the input block. To-state embeddings stay with the
2038 state. They are irrespective of the state's current set of transitions and any
2039 future transitions that may be added in or out of the state.
2041 Note that the setting of the current state variable \verb|cs| outside of the
2042 execute code is not considered by Ragel as moving into a state and consequently
2043 the to-state actions of the new current state are not executed. This includes
2044 the initialization of the current state when the machine begins. This is
2045 because the entry point into the machine execution code is after the execution
2046 of to-state actions.
2048 \subsubsection{From-State Actions}
2050 \verb| >* $* %* <* @* <>* |
2053 From-state actions are executed whenever the state machine takes a transition from a
2054 state, either to itself or to some other state. These actions are executed
2055 immediately after the current character is tested against the input block end
2056 marker and before the transition to take is sought based on the current
2057 character. From-state actions are therefore executed even if a transition
2058 cannot be found and the machine moves into the error state. Like to-state
2059 embeddings, from-state embeddings stay with the state.
2061 \subsection{EOF Actions}
2063 \verb| >/ $/ %/ </ @/ <>/ |
2066 The EOF action embedding operators enable the user to embed EOF actions into
2067 different classes of
2068 states. EOF actions are stored in states and generated with the \verb|write eof|
2069 statement. The generated EOF code switches on the current state and executes the EOF
2070 actions associated with it.
2072 \subsection{Handling Errors}
2074 \subsubsection{Global Error Actions}
2076 \verb| >! $! %! <! @! <>! |
2079 Error actions are stored in states until the final state machine has been fully
2080 constructed. They are then transferred to the transitions that move into the
2081 error state. This transfer entails the creation of a transition from the state
2082 to the error state that is taken on all input characters which are not already
2083 covered by the state's transitions. In other words it provides a default
2084 action. Error actions can induce a recovery by altering \verb|p| and then jumping back
2085 into the machine with \verb|fgoto|.
2087 \subsubsection{Local Error Actions}
2089 \verb| >^ $^ %^ <^ @^ <>^ |
2092 Like global error actions, local error actions are also stored in states until
2093 a transfer point. The transfer point is different however. Each local error action
2094 embedding is associated with a name. When a machine definition has been fully
2095 constructed, all local error actions embeddings associated the same name as the
2096 machine are transferred to error transitions. Local error actions can be used
2097 to specify an action to take when a particular section of a larger state
2098 machine fails to make a match. A particular machine definition's ``thread'' may
2099 die and the local error actions executed, however the machine as a whole may
2100 continue to match input.
2102 There are two forms of local error action embeddings. In the first form the name defaults
2103 to the current machine. In the second form the machine name can be specified. This
2104 is useful when it is more convenient to specify the local error action in a
2105 sub-definition that is used to construct the machine definition where the
2106 transfer should happen. To embed local error actions and explicitly state the
2107 machine on which the transfer is to happen use \verb|(name, action)| as the
2112 \setlength{\parskip}{0in}
2113 \item \verb|expr >^ (name, action) | -- Start state.
2114 \item \verb|expr $^ (name, action) | -- All states.
2115 \item \verb|expr %^ (name, action) | -- Final states.
2116 \item \verb|expr <^ (name, action) | -- Not start state.
2117 \item \verb|expr <>^ (name, action)| -- Not start and not final states.
2121 \section{Action Ordering and Duplicates}
2123 When building a parser by combining smaller expressions which themselves have
2124 embedded actions, it is often the case that transitions are made which need to
2125 execute a number of actions on one input character. For example when we leave
2126 an expression, we may execute the expression's pending out action and the
2127 subsequent expression's starting action on the same input character. We must
2128 therefore devise a method for ordering actions that is both intuitive and
2129 predictable for the user and repeatable by the state machine compiler. The
2130 determinization processes cannot simply order actions by the time at which they
2131 are introduced into a transition -- otherwise the programmer will be at the
2134 We associate with the embedding of each action a distinct timestamp which is
2135 used to order actions that appear together on a single transition in the final
2136 compiled state machine. To accomplish this we traverse the parse tree of
2137 regular expressions and assign timestamps to action embeddings. This algorithm
2138 is recursive in nature and quite simple. When it visits a parse tree node it
2139 assigns timestamps to all {\em starting} action embeddings, recurses on the
2140 parse tree, then assigns timestamps to the remaining {\em all}, {\em
2141 finishing}, and {\em leaving} embeddings in the order in which they appear.
2143 Ragel does not permit actions (defined or unnamed) to appear multiple times in
2144 an action list. When the final machine has been created, actions which appear
2145 more than once in single transition or EOF action list have their duplicates
2146 removed. The first appearance of the action is preserved. This is useful in a
2147 number of scenarios. First, it allows us to union machines with common
2148 prefixes without worrying about the action embeddings in the prefix being
2149 duplicated. Second, it prevents pending out actions from being transferred multiple times
2150 when a concatenation follows a kleene star and the two machines begin with a common
2156 main := word ( '\n' word )* '\n\n';
2159 \section{Values and Statements Available in Code Blocks}
2162 \noindent The following values are available in code blocks:
2165 \item \verb|fpc| -- A pointer to the current character. This is equivalent to
2166 accessing the \verb|p| variable.
2168 \item \verb|fc| -- The current character. This is equivalent to the expression \verb|(*p)|.
2170 \item \verb|fcurs| -- An integer value representing the current state. This
2171 value should only be read from. To move to a different place in the machine
2172 from action code use the \verb|fgoto|, \verb|fnext| or \verb|fcall| statements.
2173 Outside of the machine execution code the \verb|cs| variable may be modified.
2175 \item \verb|ftargs| -- An integer value representing the target state. This
2176 value should only be read from. Again, \verb|fgoto|, \verb|fnext| and
2177 \verb|fcall| can be used to move to a specific entry point.
2179 \item \verb|fentry(<label>)| -- Retrieve an integer value representing the
2180 entry point \verb|label|. The integer value returned will be a compile time
2181 constant. This number is suitable for later use in control flow transfer
2182 statements that take an expression. This value should not be compared against
2183 the current state because any given label can have multiple states representing
2184 it. The value returned by \verb|fentry| will be one of the possibly multiple states the
2188 \noindent The following statements are available in code blocks:
2192 \item \verb|fhold;| -- Do not advance over the current character. If processing
2193 data in multiple buffer blocks, the \verb|fhold| statement should only be used
2194 once in the set of actions executed on a character. Multiple calls may result
2195 in backing up over the beginning of the buffer block. The \verb|fhold|
2196 statement does not imply any transfer of control. In actions embedded into
2197 transitions, it is equivalent to the \verb|p--;| statement. In scanner pattern
2198 actions any changes made to \verb|p| are lost. In this context, \verb|fhold| is
2199 equivalent to \verb|tokend--;|.
2201 \item \verb|fexec <expr>;| -- Set the next character to process. This can be
2202 used to backtrack to previous input or advance ahead.
2203 Unlike \verb|fhold|, which can be used
2204 anywhere, \verb|fexec| requires the user to ensure that the target of the
2205 backtrack is in the current buffer block or is known to be somewhere ahead of
2206 it. The machine will continue iterating forward until \verb|pe| is arrived,
2207 \verb|fbreak| is called or the machine moves into the error state. In actions
2208 embedded into transitions, the \verb|fexec| statement is equivalent to setting
2209 \verb|p| to one position ahead of the next character to process. If the user
2210 also modifies \verb|pe|, it is possible to change the buffer block entirely.
2211 In scanner pattern actions any changes made to \verb|p| are lost. In this
2212 context, \verb|fexec| is equivalent to setting \verb|tokend| to the next
2213 character to process.
2215 \item \verb|fgoto <label>;| -- Jump to an entry point defined by
2216 \verb|<label>|. The \verb|fgoto| statement immediately transfers control to
2217 the destination state.
2219 \item \verb|fgoto *<expr>;| -- Jump to an entry point given by \verb|<expr>|.
2220 The expression must evaluate to an integer value representing a state.
2222 \item \verb|fnext <label>;| -- Set the next state to be the entry point defined
2223 by \verb|label|. The \verb|fnext| statement does not immediately jump to the
2224 specified state. Any action code following the statement is executed.
2226 \item \verb|fnext *<expr>;| -- Set the next state to be the entry point given
2227 by \verb|<expr>|. The expression must evaluate to an integer value representing
2230 \item \verb|fcall <label>;| -- Push the target state and jump to the entry
2231 point defined by \verb|<label>|. The next \verb|fret| will jump to the target
2232 of the transition on which the call was made. Use of \verb|fcall| requires
2233 the declaration of a call stack. An array of integers named \verb|stack| and a
2234 single integer named \verb|top| must be declared. With the \verb|fcall|
2235 construct, control is immediately transferred to the destination state.
2237 \item \verb|fcall *<expr>;| -- Push the current state and jump to the entry
2238 point given by \verb|<expr>|. The expression must evaluate to an integer value
2239 representing a state.
2241 \item \verb|fret;| -- Return to the target state of the transition on which the
2242 last \verb|fcall| was made. Use of \verb|fret| requires the declaration of a
2243 call stack with \verb|fstack| in the struct block. Control is immediately
2244 transferred to the destination state.
2246 \item \verb|fbreak;| -- Save the current state and immediately break out of the
2247 execute loop. This statement is useful in conjunction with the \verb|noend|
2248 write option. Rather than process input until the end marker of the input
2249 buffer is arrived at, the fbreak statement can be used to stop processing input
2250 upon seeing some end-of-string marker. It can also be used for handling
2251 exceptional circumstances. The fbreak statement does not change the pointer to
2252 the current character. After an \verb|fbreak| call the \verb|p| variable will point to
2253 the character that was being traversed over when the action was
2254 executed. The current state will be the target of the current transition.
2258 \noindent {\bf Note:} Once actions with control-flow commands are embedded into a
2259 machine, the user must exercise caution when using the machine as the operand
2260 to other machine construction operators. If an action jumps to another state
2261 then unioning any transition that executes that action with another transition
2262 that follows some other path will cause that other path to be lost. Using
2263 commands that manually jump around a machine takes us out of the domain of
2264 regular languages because transitions that may be conditional and that the
2265 machine construction operators are not aware of are introduced. These
2266 commands should therefore be used with caution.
2269 \chapter{Controlling Nondeterminism}
2270 \label{controlling-nondeterminism}
2272 Along with the flexibility of arbitrary action embeddings comes a need to
2273 control nondeterminism in regular expressions. If a regular expression is
2274 ambiguous, then sup-components of a parser other than the intended parts may be
2275 active at any given time. This means that actions which are irrelevant to the
2276 current subset of the parser may be executed, causing problems for the
2279 Tools which are based on regular expression engines and which are used for
2280 recognition tasks will usually function as intended regardless of the presence
2281 of ambiguities. It is quite common for users of scripting languages to write
2282 regular expressions that are heavily ambiguous and it generally does not
2283 matter. As long as one of the potential matches is recognized, there can be any
2284 number of other matches present.
2286 In some systems, matched text is attributed to a portion of a regular
2287 expression. Armed with the knowledge that the regular expression engine always
2288 pursues the longest match or the shortest match, the user is able to compose
2289 their patterns accordingly.
2291 In Ragel, there is no regular expression run-time engine, just a simple state
2292 machine execution model. When we begin to embed actions and face the
2293 possibility of spurious action execution, it becomes clear that controlling
2294 nondeterminism at the machine construction level is very important. Consider
2295 the following example.
2299 lines = ( word ( space word )* '\n' )*;
2303 Since the \verb|space| built-in expression includes the newline character, we will
2304 not leave the line expression when a newline character is seen. We will
2305 simultaneously pursue the possibility of matching further words on the same
2306 line and the possibility of matching a second line. The solution here is easy:
2307 simply exclude the newline character from the \verb|space| expression. Solving
2308 this kind of problem is straightforward because the string that terminates the
2309 sequence is a single character long. When it is multiple characters long we
2310 have a more difficult problem, as shown by the following example.
2314 comment = '/*' any* '*/';
2318 Using standard concatenation, we will never leave the \verb|any*| expression.
2319 We will forever entertain the possibility that a \verb|'*/'| string that we see
2320 is contained in a longer comment and that, simultaneously, the comment has
2321 ended. One way to approach the problem is to exclude the terminating string
2322 from the \verb|any*| expression using set difference. We must be careful to
2323 exclude not just the terminating string, but any string that contains it as a
2324 substring. A verbose, but proper specification of a C comment parser is given
2325 by the following regular expression. Note that this operation is the basis of the
2326 strong subtraction operator.
2330 comment = '/*' ( any* - ( any* '*/' any* ) ) '*/';
2334 We can also phrase the problem in terms of the transitions of the state
2335 machines that implement these expressions. During the concatenation of
2336 \verb|any*| and \verb|'*/'| we will be making transitions that are composed of
2337 both the loop of the first expression and the characters of the second.
2338 At this time we want the transition on the \verb|'/'| character to take precedence
2339 over and disallow the transition that originated in the \verb|any*| loop.
2341 In another scenario, we wish to implement a lightweight tokenizer that we can
2342 utilize in the composition of a larger machine. For example, some HTTP headers
2343 have a token stream as a sub-language.
2347 header_contents = ( lower+ | digit+ | ' ' )*;
2351 In this case, the problem with using a standard kleene star operation is that
2352 there is an ambiguity between extending a token and wrapping around the
2353 machine to begin a new token. Using the standard operator, we get
2354 an undesirable nondeterministic behaviour. What is required is for the
2355 transitions that represent an extension of a token to take precedence over the
2356 transitions that represent the beginning of a new token. For this problem,
2357 there is no simple solution that uses standard regular expressions.
2359 \section{Priorities}
2361 A priority mechanism was devised and built into the determinization
2362 process, specifically for the purpose of allowing the user to control
2363 nondeterminism. Priorities are integer values embedded into transitions. When
2364 the determinization process is combining transitions that have different
2365 priorities, the transition with the higher priority is preserved and the
2366 transition with the lower priority is dropped.
2368 Unfortunately, priorities can have unintended side effects because their
2369 operation requires that they linger in transitions indefinitely. They must linger
2370 because the Ragel program cannot know when the user is finished with a priority
2371 embedding. A solution whereby they are explicitly deleted after use is
2372 conceivable; however this is not very user-friendly. Priorities were therefore
2373 made into named entities. Only priorities with the same name are allowed to
2374 interact. This allows any number of priorities to coexist in one machine for
2375 the purpose of controlling various different regular expression operations and
2376 eliminates the need to ever delete them. Such a scheme allows the user to
2377 choose a unique name, embed two different priority values using that name
2378 and be confident that the priority embedding will be free of any side effects.
2380 \section{Priority Assignment}
2382 Priorities are integer values assigned to names within transitions.
2383 Only priorities with the same name are allowed to interact. When the machine
2384 construction process is combining transitions that have different priorities
2385 assiged to the same name, the transition with the higher priority is preserved
2386 and the lower priority is dropped.
2388 In the first form of priority embedding the name defaults to the name of the machine
2389 definition that the priority is assigned in. In this sense priorities are by
2390 default local to the current machine definition or instantiation. Beware of
2391 using this form in a longest-match machine, since there is only one name for
2392 the entire set of longest match patterns. In the second form the priority's
2393 name can be specified, allowing priority interaction across machine definition
2397 \setlength{\parskip}{0in}
2398 \item \verb|expr > int| -- Sets starting transitions to have priority int.
2399 \item \verb|expr @ int| -- Sets transitions that go into a final state to have priority int.
2400 \item \verb|expr $ int| -- Sets all transitions to have priority int.
2401 \item \verb|expr % int| -- Sets pending out transitions from final states to
2402 have priority int.\\ When a transition is made going out of the machine (either
2403 by concatenation or kleene star) its priority is immediately set to the pending
2407 The second form of priority assignment allows the programmer to specify the name
2408 to which the priority is assigned.
2411 \setlength{\parskip}{0in}
2412 \item \verb|expr > (name, int)| -- Entering transitions.
2413 \item \verb|expr @ (name, int)| -- Transitions into final state.
2414 \item \verb|expr $ (name, int)| -- All transitions.
2415 \item \verb|expr % (name, int)| -- Pending out transitions.
2418 \section{Guarded Operators that Encapsulate Priorities}
2420 Priorities can be very confusing for the user. They force the user to imagine
2421 the transitions inside machines and work out the precise effects of regular
2422 expression operations. When we consider that this problem is worsened by the
2423 potential for side effects caused by unintended priority name collisions, we
2424 see that exposing the user to priorities is rather undesirable.
2426 Fortunately, in practice the use of priorities has been necessary only in a
2427 small number of scenarios. This allows us to encapsulate their functionality
2428 into a small set of operators and fully hide them from the user. This is
2429 advantageous from a language design point of view because it greatly simplifies
2434 Example from 2 page poster paper.
2442 main := ( lower+ ':' ' '* <: (
2443 ( lower ( lower | digit )* ) >mark %id |
2444 digit+ >mark %number |
2452 \includegraphics[scale=0.4]{lmkleene}
2456 \subsection{Entry-Guarded Contatenation}
2461 This operator concatenates two machines, but first assigns a low
2462 priority to all transitions
2463 of the first machine and a high priority to the entering transitions of the
2464 second machine. This operator is useful if from the final states of the first
2465 machine, it is possible to accept the characters in the start transitions of
2466 the second machine. This operator effectively terminates the first machine
2467 immediately upon entering the second machine, where otherwise they would be
2468 pursued concurrently. In the following example, entry-guarded concatenation is
2469 used to move out of a machine that matches everything at the first sign of an
2470 end-of-input marker.
2478 # Leave the catch-all machine on the first character of FIN.
2479 main := any* :> 'FIN';
2486 \includegraphics[scale=0.45]{exstpri}
2490 Entry-guarded concatenation is equivalent to the following:
2494 expr $(unique_name,0) . expr >(unique_name,1)
2497 \subsection{Finish-Guarded Contatenation}
2499 \verb|expr :>> expr|
2503 like the previous operator, except the higher priority is placed on the final
2504 transitions of the second machine. This is useful if one wishes to entertain
2505 the possibility of continuing to match the first machine right up until the
2506 second machine enters a final state. In other words it terminates the first
2507 machine only when the second accepts. In the following example, finish-guarded
2508 concatenation causes the move out of the machine that matches everything to be
2509 delayed until the full end-of-input marker has been matched.
2511 % GENERATE: exdonepri
2514 % machine exdonepri;
2517 # Leave the catch-all machine on the last character of FIN.
2518 main := any* :>> 'FIN';
2525 \includegraphics[scale=0.45]{exdonepri}
2528 Finish-guarded concatenation is equivalent to the following:
2532 expr $(unique_name,0) . expr @(unique_name,1)
2535 \subsection{Left-Guarded Concatenation}
2540 This operator places
2541 a higher priority on the left expression. It is useful if you want to prefix a
2542 sequence with another sequence composed of some of the same characters. For
2543 example, one can consume leading whitespace before tokenizing a sequence of
2544 whitespace-separated words as in:
2548 ( ' '* <: ( ' '+ | [a-z]+ )** )
2552 Left-guarded concatenation is equivalent to the following:
2556 expr $(unique_name,1) . expr >(unique_name,0)
2560 \subsection{Longest-Match Kleene Star}
2561 \label{longest_match_kleene_star}
2566 This version of kleene star puts a higher priority on staying in the
2567 machine versus wrapping around and starting over. The LM kleene star is useful
2568 when writing simple tokenizers. These machines are built by applying the
2569 longest-match kleene star to an alternation of token patterns, as in the
2574 % GENERATE: exfinpri
2582 # Repeat tokens, but make sure to get the longest match.
2584 lower ( lower | digit )* %A |
2594 \includegraphics[scale=0.45]{exfinpri}
2597 If a regular kleene star were used the machine above would not be able to
2598 distinguish between extending a word and beginning a new one. This operator is
2603 ( expr $(unique_name,1) %(unique_name,0) )*
2607 When the kleene star is applied, transitions are made out of the machine which
2608 go back into it. These are assigned a priority of zero by the pending out
2609 transition mechanism. This is less than the priority of the transitions out of
2610 the final states that do not leave the machine. When two transitions clash on
2611 the same character, the differing priorities causes the transition which
2612 stays in the machine to take precedence. The transition that wraps around is
2615 Note that this operator does not build a scanner in the traditional sense because
2616 there is never any backtracking. To build a scanner in the traditional sense
2617 use the Longest-Match machine construction described Section \ref{generating-scanners}.
2619 \chapter{Interface to Host Program}
2621 \section{Alphtype Statement}
2624 alphtype unsigned int;
2628 The alphtype statement specifies the alphabet data type that the machine
2629 operates on. During the compilation of the machine, integer literals are expected to
2630 be in the range of possible values of the alphtype. Supported alphabet types
2631 are \verb|char|, \verb|unsigned char|, \verb|short|, \verb|unsigned short|,
2632 \verb|int|, \verb|unsigned int|, \verb|long|, and \verb|unsigned long|.
2633 The default is \verb|char|.
2635 \section{Getkey Statement}
2642 Specify to Ragel how to retrieve the character that the machine operates on
2643 from the pointer to the current element (\verb|p|). Any expression that returns
2644 a value of the alphabet type
2645 may be used. The getkey statement may be used for looking into element
2646 structures or for translating the character to process. The getkey expression
2647 defaults to \verb|(*p)|. In goto-driven machines the getkey expression may be
2648 evaluated more than once per element processed, therefore it should not incur a
2649 large cost and preclude optimization.
2651 \section{Access Statement}
2658 The access statement allows one to tell Ragel how the generated code should
2659 access the machine data that is persistent across processing buffer blocks.
2660 This includes all variables except \verb|p| and \verb|pe|. This includes
2661 \verb|cs|, \verb|top|, \verb|stack|, \verb|tokstart|, \verb|tokend| and \verb|act|.
2662 This is useful if a machine is to be encapsulated inside a
2663 structure in C code. The access statement can be used to give the name of
2664 a pointer to the structure.
2666 \section{Maintaining Pointers to Input Data}
2668 In the creation of any parser it is not uncommon to require the collection of
2669 the data being parsed. It is always possible to collect data into a growable
2670 buffer as the machine moves over it, however the copying of data is a somewhat
2671 wasteful use of processor cycles. The most efficient way to collect data
2672 from the parser is to set pointers into the input. This poses a problem for
2673 uses of Ragel where the input data arrives in blocks, such as over a socket or
2674 from a file. The program will error if a pointer is set in one buffer block but
2675 must be used while parsing a following buffer block.
2677 The longest-match constructions exhibit this problem, requiring the maintenance
2678 code described in Section \ref{generating-scanners}. If a longest-match
2679 construction has been used somewhere in the machine then it is possible to
2680 take advantage of the required prefix maintenance code in the driver program to
2681 ensure pointers to the input are always valid. If laying down a pointer one can
2682 set \verb|tokstart| at the same spot or ahead of it. When data is shifted in
2683 between loops the user must also shift the pointer. In this way it is possible
2684 to maintain pointers to the input that will always be consistent.
2691 char *p, *pe, *data = buf + have;
2692 int len, space = BUFSIZE - have;
2695 fprintf(stderr, "BUFFER OUT OF SPACE\n");
2699 len = fread( data, 1, space, stdin );
2703 /* Find the last newline by searching backwards. */
2705 pe = data + len - 1;
2706 while ( *pe != '\n' && pe >= buf )
2712 /* How much is still in the buffer? */
2713 have = data + len - pe;
2715 memmove( buf, pe, have );
2721 \caption{An example of line-oriented processing.}
2722 \label{line-oriented}
2725 In general, there are two approaches for guaranteeing the consistency of
2726 pointers to input data. The first approach is the one just described;
2727 lay down a marker from an action,
2728 then later ensure that the data the marker points to is preserved ahead of
2729 the buffer on the next execute invocation. This approach is good because it
2730 allows the parser to decide on the pointer-use boundaries, which can be
2731 arbitrarily complex parsing conditions. A downside is that it requires any
2732 pointers that are set to be corrected in between execute invocations.
2734 The alternative is to find the pointer-use boundaries before invoking the execute
2735 routine, then pass in the data using these boundaries. For example, if the
2736 program must perform line-oriented processing, the user can scan backwards from
2737 the end of an input block that has just been read in and process only up to the
2738 first found newline. On the next input read, the new data is placed after the
2739 partially read line and processing continues from the beginning of the line.
2740 An example of line-oriented processing is given in Figure \ref{line-oriented}.
2743 \section{Running the Executables}
2745 Ragel is broken down into two executables: a frontend which compiles machines
2746 and emits them in an XML format, and a backend which generates code or a
2747 Graphviz Dot file from the XML data. The purpose of the XML-based intermediate
2748 format is to allow users to inspect their compiled state machines and to
2749 interface Ragel to other tools such as custom visualizers, code generators or
2750 analysis tools. The intermediate format will provide a better platform for
2751 extending Ragel to support new host languages. The split also serves to reduce
2752 complexity of the Ragel program by strictly separating the data structures and
2753 algorithms that are used to compile machines from those that are used to
2758 [user@host] myproj: ragel file.rl | rlcodegen -G2 -o file.c
2761 \section{Choosing a Generated Code Style}
2764 There are three styles of code output to choose from. Code style affects the
2765 size and speed of the compiled binary. Changing code style does not require any
2766 change to the Ragel program. There are two table-driven formats and a goto
2769 In addition to choosing a style to emit, there are various levels of action
2770 code reuse to choose from. The maximum reuse levels (\verb|-T0|, \verb|-F0|
2771 and \verb|-G0|) ensure that no FSM action code is ever duplicated by encoding
2772 each transition's action list as static data and iterating
2773 through the lists on every transition. This will normally result in a smaller
2774 binary. The less action reuse options (\verb|-T1|, \verb|-F1| and \verb|-G1|)
2775 will usually produce faster running code by expanding each transition's action
2776 list into a single block of code, eliminating the need to iterate through the
2777 lists. This duplicates action code instead of generating the logic necessary
2778 for reuse. Consequently the binary will be larger. However, this tradeoff applies to
2779 machines with moderate to dense action lists only. If a machine's transitions
2780 frequently have less than two actions then the less reuse options will actually
2781 produce both a smaller and a faster running binary due to less action sharing
2782 overhead. The best way to choose the appropriate code style for your
2783 application is to perform your own tests.
2785 The table-driven FSM represents the state machine as constant static data. There are
2786 tables of states, transitions, indices and actions. The current state is
2787 stored in a variable. The execution is simply a loop that looks up the current
2788 state, looks up the transition to take, executes any actions and moves to the
2789 target state. In general, the table-driven FSM can handle any machine, produces
2790 a smaller binary and requires a less expensive host language compile, but
2791 results in slower running code. Since the table-driven format is the most
2792 flexible it is the default code style.
2794 The flat table-driven machine is a table-based machine that is optimized for
2795 small alphabets. Where the regular table machine uses the current character as
2796 the key in a binary search for the transition to take, the flat table machine
2797 uses the current character as an index into an array of transitions. This is
2798 faster in general, however is only suitable if the span of possible characters
2801 The goto-driven FSM represents the state machine using goto and switch
2802 statements. The execution is a flat code block where the transition to take is
2803 computed using switch statements and directly executable binary searches. In
2804 general, the goto FSM produces faster code but results in a larger binary and a
2805 more expensive host language compile.
2807 The goto-driven format has an additional action reuse level (\verb|-G2|) that
2808 writes actions directly into the state transitioning logic rather than putting
2809 all the actions together into a single switch. Generally this produces faster
2810 running code because it allows the machine to encode the current state using
2811 the processor's instruction pointer. Again, sparse machines may actually
2812 compile to smaller binaries when \verb|-G2| is used due to less state and
2813 action management overhead. For many parsing applications \verb|-G2| is the
2814 preferred output format.
2818 \begin{tabular}{|c|c|}
2820 \multicolumn{2}{|c|}{\bf Code Output Style Options} \\
2822 \verb|-T0|&binary search table-driven\\
2824 \verb|-T1|&binary search, expanded actions\\
2826 \verb|-F0|&flat table-driven\\
2828 \verb|-F1|&flat table, expanded actions\\
2830 \verb|-G0|&goto-driven\\
2832 \verb|-G1|&goto, expanded actions\\
2834 \verb|-G2|&goto, in-place actions\\
2841 Ragel is able to emit compiled state machines in Graphviz's Dot file format.
2842 Graphviz support allows users to perform
2843 incremental visualization of their parsers. User actions are displayed on
2844 transition labels of the graph. If the final graph is too large to be
2845 meaningful, or even drawn, the user is able to inspect portions of the parser
2846 by naming particular regular expression definitions with the \verb|-S| and
2847 \verb|-M| options to the \verb|ragel| program. Use of Graphviz greatly
2848 improves the Ragel programming experience. It allows users to learn Ragel by
2849 experimentation and also to track down bugs caused by unintended