From 22a66d93d30cc8004e999bc86c4ad694be378302 Mon Sep 17 00:00:00 2001
From: Craig Burley
Date: Fri, 28 May 1999 19:17:04 -0400
Subject: [PATCH] put development docs on mainline for now

From-SVN: r27233
---
 gcc/f/ffe.texi | 523 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 521 insertions(+), 2 deletions(-)

diff --git a/gcc/f/ffe.texi b/gcc/f/ffe.texi
index 4108bb8..3081b02 100644
--- a/gcc/f/ffe.texi
+++ b/gcc/f/ffe.texi
@@ -12,7 +12,16 @@
 This chapter describes some aspects of the design and implementation
 of the @code{g77} front end.
 
+To find out about things that are ``To Be Determined'' or ``To Be Done'',
+search for the string TBD.
+If you want to help by working on one or more of these items,
+email me at @email{@value{email-burley}}.
+If you're planning to do more than just research issues and offer comments,
+see @uref{http://egcs.cygnus.com/contribute.html} for steps you might
+need to take first.
+
 @menu
+* Overview of Translation Process::
 * Philosophy of Code Generation::
 * Two-pass Design::
 * Challenges Posed::
@@ -20,6 +29,518 @@ of the @code{g77} front end.
 * Transforming Expressions::
 @end menu
 
+@node Overview of Translation Process
+@section Overview of Translation Process
+
+The order of phases translating source code to the form accepted
+by the GBE is:
+
+@enumerate
+@item
+Stripping punched-card sources (@file{g77stripcard.c})
+
+@item
+Lexing (@file{lex.c})
+
+@item
+Stand-alone statement identification (@file{sta.c})
+
+@item
+Parsing (@file{stb.c} and @file{expr.c})
+
+@item
+Constructing (@file{stc.c})
+
+@item
+Collecting (@file{std.c})
+
+@item
+Expanding (@file{ste.c})
+@end enumerate
+
+To get a rough idea of how a particularly twisted Fortran statement
+gets treated by the passes, consider:
+
+@smallexample
+      FORMAT(I2 4H)=(J/
+     & I3)
+@end smallexample
+
+The job of @file{lex.c} is to know enough about Fortran syntax rules
+to break the statement up into distinct lexemes without requiring
+any feedback from subsequent phases:
+
+@smallexample
+`FORMAT'
+`('
+`I24H'
+`)'
+`='
+`('
+`J'
+`/'
+`I3'
+`)'
+@end smallexample
+
+The job of @file{sta.c} is to figure out the kind of statement,
+or, at least, statement form, that sequence of lexemes represents.
+
+The sooner it can do this (in terms of using the smallest number of
+lexemes, starting with the first for each statement), the better,
+because that leaves diagnostics for problems beyond the recognition
+of the statement form to subsequent phases,
+which can usually better describe the nature of the problem.
+
+In this case, the @samp{=} at ``level zero''
+(not nested within parentheses)
+tells @file{sta.c} that this is an @emph{assignment-form},
+not @code{FORMAT}, statement.
+
+An assignment-form statement might be a statement-function
+definition or an executable assignment statement.
+
+To make that determination,
+@file{sta.c} looks at the first two lexemes.
+
+Since the second lexeme is @samp{(},
+the first must represent an array for this to be an assignment statement,
+else it's a statement function.
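The level-zero test and the two-lexeme lookahead just described can be sketched in a few lines.
This is only an illustration in Python, not the C code of @file{sta.c}:
the names @code{has_level_zero_equals}, @code{classify}, and @code{known_arrays} are hypothetical,
lexemes are assumed to be plain strings as in the list above,
and only the distinction discussed here is handled
(other forms that contain a level-zero @samp{=}, such as @code{DO}, are ignored).

```python
def has_level_zero_equals(lexemes):
    """Return True if an '=' lexeme appears outside all parentheses."""
    depth = 0
    for lexeme in lexemes:
        if lexeme == "(":
            depth += 1
        elif lexeme == ")":
            depth -= 1
        elif lexeme == "=" and depth == 0:
            return True
    return False


def classify(lexemes, known_arrays=()):
    """Guess the statement form from its lexemes (hypothetical helper)."""
    if not has_level_zero_equals(lexemes):
        return "not assignment-form"
    # Assignment-form: only the first two lexemes are examined.
    if len(lexemes) > 1 and lexemes[1] == "(" and lexemes[0] not in known_arrays:
        return "statement-function definition"
    return "assignment statement"


lexemes = ["FORMAT", "(", "I24H", ")", "=", "(", "J", "/", "I3", ")"]
print(classify(lexemes))                            # statement-function definition
print(classify(lexemes, known_arrays={"FORMAT"}))   # assignment statement
```

Whether the first lexeme names an array depends on earlier statements,
which is why the sketch takes that knowledge as an argument instead of deriving it.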
+
+Either way, @file{sta.c} hands off the statement to @file{stb.c}
+(either its statement-function parser or its assignment-statement parser).
+
+@file{stb.c} forms a
+statement-specific record containing the pertinent information.
+That information includes a source expression and,
+for an assignment statement, a destination expression.
+Expressions are parsed by @file{expr.c}.
+
+This record is passed to @file{stc.c},
+which copes with the implications of the statement
+within the context established by previous statements.
+
+For example, if it's the first statement in the file
+or after an @code{END} statement,
+@file{stc.c} recognizes that, first of all,
+a main program unit is now being lexed
+(and tells that to @file{std.c}
+before telling it about the current statement).
+
+@file{stc.c} attaches whatever information it can,
+usually derived from the context established by the preceding statements,
+and passes the information to @file{std.c}.
+
+@file{std.c} saves this information away,
+since the GBE cannot cope with information
+that might be incomplete at this stage.
+
+For example, @samp{I3} might later be determined
+to be an argument to an alternate @code{ENTRY} point.
+
+When @file{std.c} is told about the end of an external (top-level)
+program unit,
+it passes all the information it has saved away
+on statements in that program unit
+to @file{ste.c}.
+
+@file{ste.c} ``expands'' each statement, in sequence, by
+constructing the appropriate GBE information and calling
+the appropriate GBE routines.
+
+Details on the transformational phases follow.
+Keep in mind that Fortran numbering is used,
+so the first character on a line is column 1,
+decimal numbering is used, and so on.
+
+@menu
+* g77stripcard::
+* lex.c::
+* sta.c::
+* stb.c::
+* expr.c::
+* stc.c::
+* std.c::
+* ste.c::
+
+* Gotchas (Transforming)::
+* TBD (Transforming)::
+@end menu
+
+@node g77stripcard
+@subsection g77stripcard
+
+The @code{g77stripcard} program handles removing content beyond
+column 72 (adjustable via a command-line option),
+optionally warning about that content being something other
+than trailing whitespace or Fortran commentary.
+
+This program is needed because @code{lex.c} doesn't pay attention
+to maximum line lengths at all, to make it easier to maintain,
+as well as faster (for sources that don't depend on the maximum
+column length vis-a-vis trailing non-blank non-commentary content).
+
+Just how this program will be run---whether automatically for
+old source (perhaps as the default for @file{.f} files?)---is not
+yet determined.
+
+In the meantime, it might as well be implemented as a typical UNIX pipe.
+
+It should accept a @samp{-fline-length-@var{n}} option,
+with the default line length set to 72.
+
+When the text it strips off the end of a line is not blank
+(not spaces and tabs),
+it should insert an additional comment line
+(beginning with @samp{!},
+so it works for both fixed-form and free-form files)
+containing the text,
+following the stripped line.
+The inserted comment should have a prefix of some kind,
+TBD, that distinguishes the comment as representing stripped text.
+Users could use that to @code{sed} out such lines, if they wished---it
+seems silly to provide a command-line option to delete information
+when it can be so easily filtered out by another program.
+
+(This inserted comment should be designed to ``fit in'' well
+with whatever the Fortran community is using these days for
+preprocessor, translator, and other such products, like OpenMP.
+What that's all about, and how @code{g77} can elegantly fit its
+special comment conventions into it all, is TBD as well.
+We don't want to reinvent the wheel here, but if there turn out
+to be too many conflicting conventions, we might have to invent
+one that looks nothing like the others, but which offers their
+host products a better infrastructure in which to fit and coexist
+peacefully.)
+
+@node lex.c
+@subsection lex.c
+
+To help make the lexer simple, fast, and easy to maintain,
+while also having @code{g77} generally encourage Fortran programmers
+to write simple, maintainable, portable code by maximizing the
+performance of compiling that kind of code:
+
+@itemize @bullet
+@item
+There'll be just one lexer, for both fixed-form and free-form source.
+
+@item
+It'll care about the form only when handling the first 7 columns of
+text, stuff like spaces between strings of alphanumerics, and
+how lines are continued.
+
+Some other distinctions will be handled by subsequent phases,
+so at least one of them will have to know which form is involved.
+
+For example, @samp{I = 2 . 4} is acceptable in fixed form,
+and works in free form as well given the implementation @code{g77}
+presently uses.
+But the standard requires a diagnostic for it in free form,
+so the parser has to be able to recognize that
+the lexemes aren't contiguous
+(information the lexer @emph{does} have to provide)
+and that free-form source is being parsed,
+so it can provide the diagnostic.
+
+The @code{g77} lexer doesn't try to gather @samp{2 . 4} into a single lexeme.
+Otherwise, it'd have to know a whole lot more about how to parse Fortran,
+or subsequent phases (mainly parsing) would have two paths through
+lots of critical code---one to handle the lexemes @samp{2}, @samp{.},
+and @samp{4} in sequence, another to handle the lexeme @samp{2.4}.
+
+@item
+It won't worry about line lengths
+(beyond the first 7 columns for fixed-form source).
+
+That is, once it starts parsing the ``statement'' part of a line
+(column 7 for fixed-form, column 1 for free-form),
+it'll keep going until it finds a newline,
+rather than ignoring everything past a particular column
+(72 or 132).
+
+The implication here is that there shouldn't @emph{be}
+anything past that last column, other than whitespace or
+commentary, because users using typical editors
+(or viewing output as typically printed)
+won't necessarily know just where the last column is.
+
+Code that has ``garbage'' beyond the last column
+(almost certainly only fixed-form code with a punched-card legacy,
+such as code using columns 73-80 for ``sequence numbers'')
+will have to be run through @code{g77stripcard} first.
+
+Also, keeping track of the maximum column position while also watching out
+for the end of a line @emph{and} while reading from a file
+just makes things slower.
+Since a file must be read, and watching for the end of the line
+is necessary (unless the typical input file was preprocessed to
+include the necessary number of trailing spaces),
+dropping the tracking of the maximum column position
+is the only way to reduce the complexity of the pertinent code
+while maintaining high performance.
+
+@item
+ASCII encoding is assumed for the input file.
+
+Code written in other character sets will have to be converted first.
+
+@item
+Tabs (ASCII code 9)
+will be converted to spaces via the straightforward
+approach.
+
+Specifically, a tab is converted to between one and eight spaces
+as necessary to reach column @var{n},
+where dividing @samp{(@var{n} - 1)} by eight
+results in a remainder of zero.
+
+@item
+Linefeeds (ASCII code 10)
+mark the ends of lines.
+
+@item
+A carriage return (ASCII code 13)
+is accepted if it immediately precedes a linefeed,
+in which case it is ignored.
+
+Otherwise, it is rejected (with a diagnostic).
+
+@item
+Any characters other than the above
+that are not part of the GNU Fortran Character Set
+(@pxref{Character Set})
+are rejected with a diagnostic.
+
+This includes backspaces, form feeds, and the like.
+
+(It might make sense to allow a form feed in column 1
+as long as that's the only character on a line.
+It certainly wouldn't seem to cost much in terms of performance.)
+
+@item
+The end of the input stream (EOF)
+ends the current line.
+
+@item
+The distinction between uppercase and lowercase letters
+will be preserved.
+
+It will be up to subsequent phases to decide to fold case.
+
+Current plans are to permit any casing for Fortran (reserved) keywords
+while preserving casing for user-defined names.
+(This might not be made the default for @file{.f} files, though.)
+
+Preserving case seems necessary to provide more direct access
+to facilities outside of @code{g77}, such as to C or Pascal code.
+
+Names of intrinsics will probably be matchable in any case.
+However, there probably won't be any option to require
+a particular mixed-case appearance of intrinsics
+(as there was for @code{g77} prior to version 0.6),
+because that's painful to maintain,
+and probably nobody uses it.
+
+(How @samp{external SiN; r = sin(x)} would be handled is TBD.
+I think old @code{g77} might already handle that pretty elegantly,
+but whether we can cope with allowing the same fragment to reference
+a @emph{different} procedure, even with the same interface,
+via @samp{s = SiN(r)}, needs to be determined.
+If we can't, we need to make sure that when code introduces
+a user-defined name, any intrinsic matching that name
+using a case-insensitive comparison
+is ``turned off''.)
+
+@item
+Backslashes in @code{CHARACTER} and Hollerith constants
+are not allowed.
+
+This avoids the confusion introduced by some Fortran compiler vendors
+providing C-like interpretation of backslashes,
+while others provide straight-through interpretation.
+
+Some kind of lexical construct (TBD) will be provided to allow
+flagging of a @code{CHARACTER}
+(but probably not a Hollerith)
+constant that permits backslashes.
+It'll necessarily be a prefix, such as:
+
+@smallexample
+PRINT *, C'This line has a backspace \b here.'
+PRINT *, F'This line has a straight backslash \ here.'
+@end smallexample
+
+Further, command-line options might be provided to specify that
+one prefix or the other is to be assumed as the default
+for @code{CHARACTER} constants.
+
+However, it seems more helpful for @code{g77} to provide a program
+that prefixes all constants
+(or just those containing backslashes)
+with the desired designation,
+so printouts of code can be read
+without knowing the compile-time options used when compiling it.
+
+If such a program is provided
+(let's name it @code{g77slash} for now),
+then a command-line option to @code{g77} should not be provided.
+(Though, given that it'll be easy to implement, it might be hard
+to resist user requests for it ``to compile faster than if we
+have to invoke another filter''.)
+
+This program would take a command-line option to specify the
+default interpretation of backslashes,
+affecting which prefix it uses for constants.
+
+@code{g77slash} probably should automatically convert Hollerith
+constants that contain backslashes
+to the appropriate @code{CHARACTER} constants.
+Then @code{g77} wouldn't have to define a prefix syntax for Hollerith
+constants specifying whether they want C-style or straight-through
+backslashes.
+@end itemize
+
+The above implements nearly exactly what is specified by
+@ref{Character Set},
+and
+@ref{Lines},
+except it also provides automatic conversion of tabs
+and ignoring of newline-related carriage returns.
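The tab and line-ending rules in the list above can be sketched as follows.
This is an illustration in Python, not the code of @file{lex.c}:
the names @code{expand_tabs} and @code{split_lines} are hypothetical,
and a @code{ValueError} stands in for whatever diagnostic the lexer would issue.

```python
def expand_tabs(line):
    """Replace each tab with 1 to 8 spaces so that the next character
    lands in a column n (counting from 1) where (n - 1) % 8 == 0."""
    out = []
    width = 0  # characters emitted so far, i.e. current column minus 1
    for ch in line:
        if ch == "\t":
            pad = 8 - width % 8
            out.append(" " * pad)
            width += pad
        else:
            out.append(ch)
            width += 1
    return "".join(out)


def split_lines(text):
    """Split text at linefeeds.  A carriage return is dropped only when
    it immediately precedes a linefeed; anywhere else it is rejected.
    The end of input ends the current line."""
    pieces = text.split("\n")
    lines = []
    for i, raw in enumerate(pieces):
        is_last = i == len(pieces) - 1
        if not is_last and raw.endswith("\r"):
            raw = raw[:-1]          # CR directly before LF: ignore it
        if "\r" in raw:
            raise ValueError("carriage return not followed by linefeed")
        if is_last and raw == "":
            continue                # nothing after the final linefeed
        lines.append(raw)
    return lines


print(split_lines("A\r\nB\nC"))       # ['A', 'B', 'C']
print(repr(expand_tabs("1234567\tX")))  # '1234567 X'
```

For a single line, Python's built-in @code{str.expandtabs(8)} gives the same result as @code{expand_tabs};
the loop is spelled out only to show the column arithmetic.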
+
+It also effects the ``pure visual'' model,
+by which is meant that a user viewing his code
+in a typical text editor
+(assuming it's not preprocessed via @code{g77stripcard} or similar)
+doesn't need any special knowledge
+of whether spaces on the screen are really tabs,
+whether lines end immediately after the last visible non-space character
+or after a number of spaces and tabs that follow it,
+or whether the last line in the file is ended by a newline.
+
+Most editors don't make these distinctions,
+the ANSI FORTRAN 77 standard doesn't require them to,
+and it permits a standard-conforming compiler
+to define a method for transforming source code to
+``standard form'' however it wants.
+
+So, GNU Fortran defines it such that users have the best chance
+of having the code be interpreted the way it looks on the screen
+of the typical editor.
+
+(Fancy editors should @emph{never} be required to correctly read code
+written in classic two-dimensional-plaintext form.
+By correct reading I mean ability to read it, book-like, without
+mistaking text ignored by the compiler for program code and vice versa,
+and without having to count beyond the first several columns.
+The vague meaning of ASCII TAB, among other things, complicates
+this somewhat, but as long as ``everyone'', including the editor,
+other tools, and printer, agrees about the every-eighth-column convention,
+the GNU Fortran ``pure visual'' model meets these requirements.
+Any language or user-visible source form
+requiring special tagging of tabs,
+the ends of lines after spaces/tabs,
+and so on, is broken by this definition.
+Fortunately, Fortran @emph{itself} is not broken,
+even if most vendor-supplied defaults for their Fortran compilers @emph{are}
+in this regard.)
+
+Further, this model provides a clean interface
+to whatever preprocessors or code-generators are used
+to produce input to this phase of @code{g77}.
+Mainly, they need not worry about long lines.
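For sources that do carry text beyond the last column,
the @code{g77stripcard} filter described earlier might look roughly like this.
It is a sketch in Python under stated assumptions, not a proposed implementation:
the function name @code{strip_card} is hypothetical,
the @samp{!g77stripcard: } tag is only a placeholder since the real prefix is TBD,
tabs are assumed to have been expanded already,
and the optional warning is left out.

```python
def strip_card(lines, line_length=72, tag="!g77stripcard: "):
    """Yield each line cut to line_length columns.  If the text cut off
    contains anything other than spaces and tabs, yield it again as a
    '!' comment line right after the shortened line.
    The tag is a placeholder; the actual prefix is still TBD."""
    for line in lines:
        kept, extra = line[:line_length], line[line_length:]
        yield kept
        if extra.strip(" \t"):
            yield tag + extra


# A card with a sequence number in columns 73-80.
card = "      X = 1".ljust(72) + "00000010"
for out in strip_card([card]):
    print(out.rstrip())
```

Because every inserted comment starts with the same tag,
the stripped text can be filtered back out with a one-line @code{sed} or @code{grep -v} command,
as the description of the program suggests.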
+
+@node sta.c
+@subsection sta.c
+
+@node stb.c
+@subsection stb.c
+
+@node expr.c
+@subsection expr.c
+
+@node stc.c
+@subsection stc.c
+
+@node std.c
+@subsection std.c
+
+@node ste.c
+@subsection ste.c
+
+@node Gotchas (Transforming)
+@subsection Gotchas (Transforming)
+
+This section is not about transforming ``gotchas'' into something else.
+It is about the weirder aspects of transforming Fortran,
+however that's defined,
+into a more modern, canonical form.
+
+@node TBD (Transforming)
+@subsection TBD (Transforming)
+
+Continue researching gotchas, designing the transformational process,
+and implementing it.
+
+Specific issues to resolve:
+
+@itemize @bullet
+@item
+Just where should @code{INCLUDE} processing take place?
+
+Clearly before (or part of) statement identification (@file{sta.c}),
+since determining whether @samp{I(J)=K} is a statement-function
+definition or an assignment statement requires knowing the context,
+which in turn requires having processed @code{INCLUDE} files.
+
+@item
+Just where should @code{USE} processing (if it were implemented) take place?
+
+This gets into the whole issue of how @code{g77} should handle the concept
+of modules.
+I think GNAT already takes on this issue, but don't know more than that.
+Jim Giles has written extensively on @code{comp.lang.fortran}
+about his opinions on module handling, as have others.
+Jim's views should be taken into account.
+
+Actually, Richard M. Stallman (RMS) also has written up
+some guidelines for implementing such things,
+but I'm not sure where I read them.
+Perhaps the old @email{gcc2@@cygnus.com} list.
+
+If someone could dig references to these up and get them to me,
+that would be much appreciated!
+Even though modules are not on the short-term list for implementation,
+it'd be helpful to know @emph{now} how to avoid making them harder to
+implement @emph{later}.
+
+@item
+Should the @code{g77} command become just a script that invokes
+all the various preprocessing that might be needed,
+thus making it seem slower than necessary for legacy code
+that people are unwilling to convert,
+or should we provide a separate script for that,
+thus encouraging people to convert their code once and for all?
+
+At least, a separate script to behave as old @code{g77} did,
+perhaps named @code{g77old}, might ease the transition,
+as might a corresponding script that converts source code,
+perhaps named @code{g77oldnew}.
+
+These scripts would take all the pertinent options @code{g77} used
+to take and run the appropriate filters,
+passing the results to @code{g77} or just making new sources out of them
+(in a subdirectory, leaving the user to do the dirty deed of
+moving or copying them over the old sources).
+
+@item
+Do other Fortran compilers provide a prefix syntax
+to govern the treatment of backslashes in @code{CHARACTER}
+(or Hollerith) constants?
+
+Knowing what other compilers provide would help.
+@end itemize
+
 @node Philosophy of Code Generation
 @section Philosophy of Code Generation
 
@@ -882,6 +1403,4 @@ to hold the value of the expression.
 
 @item
 Other stuff???
-
-
 @end itemize
-- 
2.7.4