[flang][RFC] Adding higher level FIR ops to ease expression lowering

author Jean Perier <jperier@nvidia.com>

Thu, 13 Oct 2022 12:24:56 +0000 (14:24 +0200)

committer Jean Perier <jperier@nvidia.com>

Thu, 13 Oct 2022 12:25:51 +0000 (14:25 +0200)
diff --git a/flang/docs/HighLevelFIR.md b/flang/docs/HighLevelFIR.md
new file mode 100644
index 0000000..7119ec0
--- /dev/null
+++ b/flang/docs/HighLevelFIR.md
@@ -0,0 +1,1410 @@
+The approach of FIR and lowering design so far was to start with the minimal set
+of IR operations that could allow implementing the core aspects of Fortran (like
+memory allocations, array addressing, runtime descriptors, and structured
+control flow operations). One notable aspect of the current FIR is that array
+and character operations are buffered (some storage is allocated for the result,
+and the storage is addressed to implement the operation).  While this proved
+functional so far, the code lowering expressions and assignments from the
+front-end representations (the evaluate::Expr and parser nodes) to FIR has
+significantly grown in complexity while it still lacks some F95 features around
+character array expressions or FORALL. This is mainly explained by the fact that
+the representation level gap is big, and a lot is happening in lowering.  It
+appears more and more that some intermediate steps would help to split concerns
+between translating the front-end representation to MLIR, implementing some
+Fortran concepts at a lower-level (like character or derived type assignments),
+and how bufferizations of character and array expressions should be done.
+
+This document proposes the addition of two concepts and a set of related
+operations in a new dialect HLFIR to allow a simpler lowering to a higher-level
+FIR representation that would later be lowered to the current FIR representation
+via MLIR translation passes.  As a result of these additions, it is likely that
+the fir.array_load/fir.array_merge_store and related array operations could be
+removed from FIR since array assignment analysis could directly happen on the
+higher-level FIR representation.
+
+
+The main principles of the new lowering design are:
+-   Make expression lowering context independent and rather naive
+-   Do not materialize temporaries while lowering to FIR
+-   Preserve Fortran semantics/information for high-level optimizations
+
+The core impact on lowering will be:
+-   Lowering expressions and assignments in the exact same way, regardless of
+    whether it is an array assignment context and/or an expression inside a
+    forall.
+-   Lowering transformational intrinsics in a verbatim way (no runtime calls and
+    memory aspects yet).
+-   Lowering character expressions in a verbatim way (no memcpy/runtime calls
+    and memory aspects yet).
+-   Argument association side effects will be delayed (copy-in/copy-out) to help
+    inlining/function specialization to get rid of them when they are not
+    relevant.
+
+
+## Variable and Expression value concepts in HLFIR
+
+### Strengthening the variable concept
+
+Fortran variables are currently represented in FIR as mlir::Value with reference
+or box type coming from special operations or block arguments. They are either
+the result of a fir.alloca, fir.allocmem, or fir.address_of operations with the
+mangled name of the variable as attribute, or they are function block arguments
+with the mangled name of the variable as attribute.
+
+Fortran variables are defined with a Fortran type (both dynamic and static) that
+may have type parameters, a rank and shape (including lower bounds), and some
+attributes (like TARGET, OPTIONAL, VOLATILE...). All this information is
+currently not represented in FIR. Instead, lowering keeps track of all this
+information in the fir::ExtendedValue lowering data structure and uses it when
+needed. If unused in lowering, some information about variables is lost (like
+non-constant array bound expressions). In the IR, only the static type, the
+compile time constant extents, and compile time character lengths can be
+retrieved from the mlir::Value of a variable in the general case (more can be
+retrieved if the variable is tracked via a fir.box, but not if it is a bare
+memory reference).
+
+This makes reasoning about Fortran variables in FIR harder, and in general
+forces lowering to apply all decisions related to the information that is lost
+in FIR. A more problematic point is that it does not allow generating debug
+information for the variables from FIR, since the bounds and type parameters
+information is not tightly linked to the base mlir::Value.
+
+The proposal is to add a fir.declare operation that would anchor the
+fir::ExtendedValue information in the IR regardless of the mlir::Value used for
+the variable (bare memory reference, or fir.box). This operation will have a
+"fir.def = uniq_mangled_variable_name" attribute that will allow linking it to
+the Fortran source variable, and will take all the bounds and type parameters as
+operands. All the high-level operations referring to variables will have a
+"fir.ref = uniq_mangled_variable_name" attribute that will allow retrieving the
+related dominating fir.declare and all the variable information. In most cases,
+the fir.declare should simply be the defining operation of the operand
+mlir::Value.
+
+The fir.declare operation will allow:
+- Pushing higher-level Fortran concepts into FIR operations (like array
+  assignments or transformational intrinsics).
+- Generating debug information for the variables based on the fir.declare
+  operation.
+- Generic Fortran aliasing analysis (currently implemented only around array
+  assignments with the fir.array_load concept).
+
+The fir.declare op is the only operation described by this change that will be
+added to FIR and not HLFIR. The rationale for this is that it is intended to
+survive until LLVM dialect code generation so that debug info generation can
+use fir.declare operations and alias analysis can take advantage of them even
+on FIR.
+
+Note that Fortran variables are not necessarily named objects; they can also be
+the result of function references returning POINTERs. fir.declare will also
+accept such variables to be described in the IR (a unique name will be built
+from the caller scope name and the function name). In general, fir.declare
+will allow viewing every memory storage as a variable, and this will be used to
+describe and use compiler created array temporaries.
+
+### Adding an expression value concept in HLFIR
+
+Currently, Fortran expressions can be represented as SSA values for scalar
+logical, integer, real, and complex expressions. Scalar character or
+derived-type expressions and all array expressions are buffered in lowering:
+their results are directly given a memory storage in lowering and are
+manipulated as variables.
+
+While this keeps FIR simple, it increases the amount of IR generated for these
+expressions, and in general makes the job of later optimization passes harder
+since they present non-trivial patterns (with memory operations) and cannot be
+eliminated by naive dead code elimination when the result is unused. This also
+forces lowering to combine elemental array expressions into single loop nests to
+avoid bufferizing all array sub-expressions (which would yield terrible
+performance). These combinations, which are implemented using C++ lambdas in
+lowering, make lowering code harder to understand. It also makes the expression
+lowering code context dependent (especially designators lowering). The lowering
+code paths may be different when lowering a syntactically similar expression in
+an elemental expression context, in a forall context, or in a normal context.
+
+Some of the combinations described in [Array Composition](ArrayComposition.md)
+are currently not implemented in lowering because they are less trivial
+optimizations, and do not really belong in lowering. However, deploying such
+combinations on the generated FIR with bufferizations requires the usage of
+non-trivial pattern matching and rewrites (recognizing temporary allocation,
+usage, and related runtime calls). Note that the goal of such combination is not
+only about inlining transformational runtime calls, it is mainly about never
+generating a temporary for an array expression sub-operand that is a
+transformational intrinsic call matching certain criteria. So the optimization
+pass will not only need to recognize the intrinsic call, it must understand the
+context it is being called in.
+
+The usage of memory manipulations also makes some of the alias analysis more
+complex, especially when dealing with foralls (the alias analysis cannot simply
+follow an operand tree, it must understand indirect dependencies from operations
+stored in memory).
+
+The proposal is to add a !hlfir.expr<T> SSA value type concept, a set of
+character operations (concatenation, TRIM, MAX, MIN, comparisons...), a set of
+array transformational operations (SUM, MATMUL, TRANSPOSE, ...), and a generic
+hlfir.elemental operation. The hlfir.expr<T> type is not intended to be used
+with scalar types that already have SSA value types (e.g., integer or real
+scalars).  Instead, these existing SSA types will implicitly be considered as
+being expressions when used in high-level FIR operations, which will simplify
+interfacing with other dialects that define operations with these types (e.g.,
+the arith dialect).
+
+These hlfir.expr values could then be placed in memory when needed (assigned to
+a variable, passed as a procedure argument, or an IO output item...) via
+hlfir.assign or hlfir.associate operations that will later be described.
+
+When no special optimization pass is run, a translation pass would lower the
+operations producing hlfir.expr to buffer allocations and memory operations just
+as in the currently generated FIR.
+
+However, these high-level operations should allow the writing of optimization
+passes combining chains of operations producing hlfir.expr into optimized forms
+via pattern matching on the operand tree.
+
+The hlfir.elemental operation will be discussed in more detail below. It allows
+simplifying lowering while keeping the ability to combine elemental
+sub-expressions into a single loop nest. It should also allow rewriting some of
+the transformational intrinsic operations to functions of the indices as
+described in [Array Composition](ArrayComposition.md).
+
+## Proposed design for HLFIR (High-Level Fortran IR)
+
+### HLFIR Operations and Types
+
+#### Introduce a hlfir.expr<T> type
+
+Motivation: avoid the need to materialize expressions in temporaries while
+lowering.
+
+Syntax: ``` !hlfir.expr<[extent x]* T [, class]> ```
+
+- `[extent x]*` represents the shape for arrays similarly to !fir.array<> type,
+  except that the shape cannot be assumed rank (!hlfir.expr<..xT> is invalid).
+  This restriction can be added because it is impossible to create an assumed
+  rank expression in Fortran that is not a variable.
+- `T` is the element type of the static type
+- `class` flag can be set to denote that this is a polymorphic expression (that
+  the dynamic type should not be assumed to be the static type).
+
+
+examples: !hlfir.expr<fir.char<?>>, !hlfir.expr<10xi32>,
+!hlfir.expr<?x10x?xfir.complex<4>>
+
+T in scalar hlfir.expr<T> can be:
+-   A character type (fir.char<10, kind>, fir.char<?, kind>)
+-   A derived type: (fir.type<t{...}>)
+
+T in an array hlfir.expr<e1 x e2 x ... x T> can be:
+-   A character or derived type
+-   A logical type (fir.logical<kind>)
+-   An integer type (i1, i32, ...)
+-   A floating point type (f32, f16...)
+-   A complex type (fir.complex<4> or mlir::complex<f32>...)
+
+Some expressions may be polymorphic (for instance, MERGE can be used on
+polymorphic entities). The hlfir.expr type has an optional "class" flag to
+denote this: hlfir.expr<T, class>.
+
+Note that the ALLOCATABLE, POINTER, TARGET, VOLATILE, ASYNCHRONOUS, OPTIONAL
+aspects do not apply to expressions, they apply to variables.
+
+It is possible to query the following about an expression:
+-   What is the extent : via hlfir.get_extent %expr, dim
+-   What are the length parameters: via hlfir.get_typeparam %expr [, param_name]
+-   What is the dynamic type: via hlfir.get_dynamic_type %expr
+
+It is possible to get the value of an array expression element:
+- %element = hlfir.apply %expr, %i, %j : (!hlfir.expr<T>, index, index) ->
+  hlfir.expr<ScalarType> | AnyConstantSizeScalarType
+
+It is not directly possible to take an address for the expression, but an
+expression value can be associated to a new variable whose address can be used
+(required when passing the expression in a user call, or to concepts that are
+kept low level in FIR, like IO runtime calls).  The variable created may be a
+compiler created temporary, or may relate to a Fortran source variable if this
+mechanism is used to implement ASSOCIATE.
+
+-   %var = hlfir.associate %expr [attributes about the association]->
+    AnyMemoryOrBoxType
+-   hlfir.end_association %var
+
+The intention is that the hlfir.expr<T> is the result of an operation, and
+should most often not be a block argument. This is because the hlfir.expr is
+mostly intended to allow combining chains of operations into more optimal
+forms. But it is possible to represent any expression result via a Fortran
+runtime descriptor (fir.box<T>), implying that if a hlfir.expr<T> is passed as
+a block argument, the expression bufferization pass will evaluate the operation
+producing the expression in a temporary, and transform the block operand into a
+fir.box describing the temporary. Clean-up for the temporary will be inserted
+after the last use of the hlfir.expr. Note that, at least at first, lowering
+may help FIR to find the last use of a hlfir.expr by explicitly inserting a
+hlfir.finalize %expr operation that may turn into a no-op if the expression is
+not later materialized in memory.
+
+It is nonetheless not intended that such abstract types be used as block
+arguments to avoid introducing allocations and descriptor manipulations.
+
+#### fir.declare operation
+
+Motivation: represent variables, linking together a memory storage, shape,
+length parameters, attributes and the variable name.
+
+Syntax:
+```
+%var = fir.declare %base [shape %extent1, %extent2, ...] [lbs %lb1, %lb2, ...] [typeparams %l1, ...] {fir.def = mangled_variable_name, attributes} : [(....) ->] T
+```
+
+%var will have the same type as %base. When no debug info is generated, the
+operation can be replaced by %base when lowering to LLVM.
+
+- Extents should only be provided if %base is not a fir.box and the entity is an
+  array.
+- lower bounds should only be provided if the entity is an array and the lower
+  bounds are not default (all ones). It should also not be provided for POINTERs
+  and ALLOCATABLES since the lower bounds may change.
+- type parameters should be provided for entities with length parameters, unless
+  the entity is a CHARACTER where the length is constant in %base type.
+- The attributes will include the Fortran attributes: TARGET (fir.target),
+  POINTER (fir.ptr), ALLOCATABLE (fir.alloc), CONTIGUOUS (fir.contiguous),
+  OPTIONAL (fir.optional), VOLATILE (fir.volatile), ASYNCHRONOUS (fir.async).
+  They will also indicate when an entity is part of an equivalence by giving the
+  equivalence name (fir.equiv = mangled_equivalence_name).
+
+fir.declare will be used for all Fortran variables, except the ones created via
+the ASSOCIATE construct that will use hlfir.associate described below.
+
+fir.declare will also be used when creating compiler created temporaries, in
+which case the fir.tmp attribute will be given.
+
+Examples:
+
+| FORTRAN                                 | FIR                                                                                                     |
+| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| REAL :: X                                 | %mem = fir.alloca f32 <br> %x = fir.declare %mem {fir.def = "\_QPfooEx"} : fir.ref<f32>                                                                                                     |
+| REAL, TARGET :: X(10)                     | %mem = fir.alloca fir.array<10xf32> <br> %x = fir.declare %mem {fir.def = "\_QPfooEx", fir.target} : fir.ref<fir.array<10xf32>>                                                            |
+| REAL :: X(N)                              | %mem = // … alloc or dummy argument <br> %nval = fir.load %n : i64 <br> %x = fir.declare %mem shape %nval {fir.def = "\_QPfooEx"} : (i64) -> fir.ref<fir.array<?xf32>>                      |
+| REAL :: X(0:)                             | %mem = // … dummy argument <br> %c0 = arith.constant 0 : index <br> %x = fir.declare %mem lbs %c0 {fir.def = "\_QPfooEx"} : (index) -> fir.box<fir.array<?xf32>>                            |
+| REAL, POINTER :: X(:)                     | %mem = // … dummy argument, or local, or global <br> %x = fir.declare %mem {fir.def = "\_QPfooEx", fir.ptr} :  fir.ref<fir.box<fir.ptr<fir.array<?xf32>>>>                                  |
+| REAL, ALLOCATABLE :: X(:)                 | %mem = // … dummy argument, or local, or global <br> %x = fir.declare %mem {fir.def = "\_QPfooEx", fir.alloc} :  fir.ref<fir.box<fir.heap<fir.array<?xf32>>>>                               |
+| CHARACTER(10) :: C                        | %mem = //  … dummy argument, or local, or global <br> %c = fir.declare %mem {fir.def = "\_QPfooEc"} :  fir.ref<fir.char<10>>                                                                |
+| CHARACTER(\*) :: C                        | %unbox = fir.unbox %boxchar (fir.boxchar<1>) -> (fir.ref<fir.char<?>>, index) <br> %c = fir.declare %unbox#0 typeparams %unbox#1 {fir.def = "\_QPfooEc"} : (index) ->  fir.ref<fir.char<?>> |
+| CHARACTER(\*), OPTIONAL, ALLOCATABLE :: C | %mem = // … dummy argument <br> %c = fir.declare %mem {fir.def = "\_QPfooEc", fir.alloc, fir.optional, fir.assumed\_len\_alloc} :  fir.ref<fir.box<fir.heap<fir.char<?>>>>                  |
+| TYPE(T) :: X                              | %mem = //  … dummy argument, or local, or global <br> %x = fir.declare %mem {fir.def = "\_QPfooEx"} : fir.ref<fir.type<t{...}>>                                                             |
+| TYPE(T(L)) :: X                           | %mem = //  … dummy argument, or local, or global <br> %lval = fir.load %l <br> %x = fir.declare %mem typeparams %lval {fir.def = "\_QPfooEx"} : fir.box<fir.type<t{...}>>                   |
+| CLASS(\*), POINTER :: X                   | %mem = //  … dummy argument, or local, or global <br> %x = fir.declare %mem {fir.def = "\_QPfooEx", fir.ptr} : fir.class<fir.ptr<None>>                                                     |
+| REAL :: X(..)                             | %mem = //  … dummy argument <br> %x = fir.declare %mem {fir.def = "\_QPfooEx"} : fir.box<fir.array<..xf32>>                                                                                 |
+
+#### hlfir.associate operation
+
+Motivation: represent Fortran associations (both from variables and expressions)
+and allow keeping actual/dummy argument association information after inlining.
+
+Syntax:
+```
+%var = hlfir.associate %expr_or_var {fir.def = mangled_uniq_name, attributes} (AnyExprOrVarType) -> AnyVarType
+```
+
+hlfir.associate is used to represent the following associations:
+- Dummy/Actual association on the caller side (the callee side uses
+  fir.declare).
+- Host association in block constructs when VOLATILE/ASYNC attributes are added
+  locally
+- ASSOCIATE construct (both from variable and expressions).
+
+When the operand is a variable, hlfir.associate allows changing the attributes
+of the variable locally, and to encode certain side-effects (like
+copy-in/copy-out when going from a non-contiguous variable to a contiguous
+variable, with the help of the related hlfir.end_association operation).
+
+When the operand is an expression, hlfir.associate allows associating a storage
+location to an expression value.
+
+A hlfir.associate must be followed by a related hlfir.end_association that will
+allow inserting any necessary finalization or copy-out later.
+
+#### hlfir.end_association operation
+
+Motivation: mark the place where some association should end and some side
+effects might need to occur.
+
+The hlfir.end_association is a placeholder to later insert
+deallocation/finalization if the variable was associated with an expression,
+and to insert copy-out/deallocation if the variable was associated with another
+variable with a copy-in.
+
+Syntax:
+```
+hlfir.end_association %var [%original_variable] {fir.ref = var_mangled_name, attributes}
+```
+
+
+The attributes can be:
+-   copy_out (copy out the associated variable back into the original variable
+    if a copy-in occurred)
+-   finalize_copy_in (deallocate the temporary storage for the associated
+    variable if a copy-in occurred but the associated variable was not modified
+    (e.g., it is intent(in))).
+-   finalize: indicate that a finalizer should be run on the entity associated
+    with the variable (There is currently no way to deduce this only from the
+    variable type in FIR). It will give the finalizer mangled name so that it
+    can be later called.
+
+If the copy_out or finalize_copy_in attribute is set, “original_variable” (the
+argument of the hlfir.associate that produced %var) must be provided. The
+rationale is that the original variable address is needed to verify if a
+temporary was created, and if needed, to copy the data back to it.
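+
+As a minimal sketch of how hlfir.associate and hlfir.end_association could pair
+up on the caller side, following the syntax given above (the mangled names, the
+callees, and the elided operations are hypothetical and only serve the
+illustration):
+
+```
+// Fortran: call bar(b + c), where bar has an assumed-shape dummy argument
+%expr = // … operation producing the value of b + c as !hlfir.expr<?xf32>
+%tmp = hlfir.associate %expr {fir.def = _QPfooEtmp001} (!hlfir.expr<?xf32>) -> !fir.box<!fir.array<?xf32>>
+fir.call @_QPbar(%tmp) : (!fir.box<!fir.array<?xf32>>) -> ()
+// Placeholder for the later deallocation of the storage given to %expr
+hlfir.end_association %tmp {fir.ref = _QPfooEtmp001}
+
+// Fortran: call baz(a(1:n:2)), where baz has a CONTIGUOUS dummy argument
+%sec = // … hlfir.designate describing a(1:n:2)
+%arg = hlfir.associate %sec {fir.def = _QPfooEtmp002, fir.contiguous} (!fir.box<!fir.array<?xf32>>) -> !fir.box<!fir.array<?xf32>>
+fir.call @_QPbaz(%arg) : (!fir.box<!fir.array<?xf32>>) -> ()
+// %sec is repeated so that a copy-out, if a copy-in occurred, knows its target
+hlfir.end_association %arg %sec {fir.ref = _QPfooEtmp002, copy_out}
+```
+
+In the first case the operand is an expression, so only clean-up of the
+temporary is expected. In the second case the operand is a variable, and the
+copy_out attribute requires the original variable to be passed again.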
+
+#### hlfir.finalize
+
+Motivation: mark end of life of local variables
+
+Mark the place where a local variable will go out of scope. The main goal is to
+retain this information even after local variables are inlined.
+
+Syntax:
+```
+hlfir.finalize %var {fir.ref = var_mangled_name, attributes}
+```
+
+The attributes can be:
+-   finalize: indicate that a finalizer should be run on the entity associated
+    with the variable (There is currently no way to deduce this only from the
+    variable type in FIR).
+
+Note that finalization will not free the local variable storage if it was
+allocated on the heap. If lowering created the storage passed to fir.declare via
+a fir.allocmem, lowering should insert a fir.freemem after the hlfir.finalize.
+Keeping deallocation separate could make fir.allocmem to fir.alloca promotion
+simpler. It is also needed because finalization may be run without the intent to
+deallocate the variable storage (like on INTENT(OUT) dummies).
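+
+A possible sequence for a local array whose storage lowering placed on the
+heap, sketched with the syntax above (the variable name and the elided uses are
+hypothetical; the finalize attribute is omitted on the assumption that a REAL
+array has no finalizer to run):
+
+```
+// Fortran: real :: x(n), local to subroutine foo
+%mem = fir.allocmem !fir.array<?xf32>, %n
+%x = fir.declare %mem shape %n {fir.def = _QPfooEx} : (index) -> !fir.heap<!fir.array<?xf32>>
+// … uses of %x
+// End of life of x, still visible if foo is inlined into a caller
+hlfir.finalize %x {fir.ref = _QPfooEx}
+// Storage release is explicit and separate from finalization
+fir.freemem %mem : !fir.heap<!fir.array<?xf32>>
+```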
+
+
+#### hlfir.designate
+
+Motivation: Represent designators at a high-level and allow representing some
+information about derived type components that would otherwise be lost, like
+component lower bounds.
+
+Represent Fortran designators in a verbatim way: both triplet, and component
+parts.
+
+Syntax:
+```
+%var = hlfir.designate %base ["component",] [(%i, %k:%l:%m)] [substr lb, ub] [imag|real] [shape extent1, extent2, ....] [lbs lb1, lb2, .....] [typeparams %l1, ...] {fir.ref = base_mangled_name, fir.def = mangled_name, attributes}
+```
+
+hlfir.designate is intended to encode a single part-ref (as defined by the
+Fortran standard). That means that a(:)%x(i, j, k) must be split into two
+hlfir.designate: one for a(:), and one for x(i, j, k).  If the base is ranked,
+and the component is an array, the subscripts are mandatory and must not
+contain triplets. This ensures that the result of a hlfir.designate cannot be a
+"super-array".
+
+The subscripts passed to hlfir.designate must be based on the base lower bounds
+(one by default).
+
+A substring is built by providing the lower and upper character indices after
+`substr`. Implicit substring bounds must be made explicit by lowering.  It is
+not possible to provide substr if a component is already provided. Instead the
+related Fortran designator must be split into two hlfir.designate operations.
+This is because the component character length will be needed to compute the
+right stride, and it might be lost if not placed on the first designator
+typeparams.
+
+Real and Imaginary complex parts are represented by an optional imag or real
+tag. It can be added even if there is already a component.
+
+The shape, lower bound, and type parameter operands represent the output entity
+properties. The point of having those made explicit is to allow early folding
+and hoisting of array section shape and length parameters (which especially in
+FORALL contexts, can simplify later assignment temporary insertion a lot). Also,
+if lower bounds of a derived type component array could not be added here, they
+would be lost since they are not represented by other means in FIR (the fir.type
+does not include this information).
+
+hlfir.designate is not intended to describe vector subscripted variables.
+Instead, lowering will have to introduce loops to do element by element
+addressing. See the Examples section. This helps keep hlfir.designate simple,
+and since the contexts where a vector subscripted entity is considered to be a
+variable (in the sense that it can be modified) are very limited, it seems
+reasonable to have lowering deal with this aspect. For instance, a vector
+subscripted entity cannot be passed as a variable, it cannot be a pointer
+assignment target, and when it appears as an associated entity in an ASSOCIATE,
+the related variable cannot be modified.
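+
+The part-ref splitting rule described above can be sketched as follows for
+a(:)%x(i, j, k). This is only an illustration in the proposed syntax: the
+mangled names, %c1, and %n (the extent of a) are hypothetical, and the
+operation types are left out for brevity:
+
+```
+// Fortran: a(:)%x(i, j, k), with a an array of derived type and x an array
+// component of that type
+// First part-ref: a(:), written here with explicit triplet bounds
+%a_sec = hlfir.designate %a (%c1:%n:%c1) shape %n {fir.ref = _QPfooEa, fir.def = _QPfooEa.des001}
+// Second part-ref: x(i, j, k). The base is ranked and the component is an
+// array, so only scalar subscripts are allowed and the result keeps rank one
+%res = hlfir.designate %a_sec "x", (%i, %j, %k) shape %n {fir.ref = _QPfooEa.des001, fir.def = _QPfooEa.des002}
+```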
+
+#### hlfir.assign
+
+Motivation: represent assignment at a high level (mainly a change for array and
+character assignment) so that optimization passes can clearly reason about it
+(value propagation, inserting temporary for right-hand side evaluation only when
+needed), and so that lowering does not have to implement it all.
+
+Syntax:
+```
+hlfir.assign %expr_or_var to %var [attributes]
+```
+
+The attributes can be:
+
+-   realloc: mark that assignment has F2003 semantics and that the left-hand
+    side may have to be deallocated/reallocated…
+-   use_assign=@function: mark a user defined assignment
+-   no_overlap: mark that an assignment does not need a temporary (added by an
+    analysis pass).
+-   unordered : mark that an assignment can happen in any element order (not
+    true if there is an impure elemental function being called).
+
+This will replace the current array_load/array_access/array_merge semantics.
+Instead, a more generic alias analysis will be performed on the LHS and RHS to
+detect aliasing, and a temporary inserted if needed. The alias analysis will
+look at all the memory references in the RHS operand tree and base overlap
+decisions on the related variable declaration operations. This same analysis
+should later allow moving/merging some expression evaluation between different
+statements.
+
+Note about user defined assignments: semantics is resolving them and building
+the related subroutine call. So a fir.call could directly be made in lowering if
+the right-hand side was always evaluated in a temporary. The motivation to use
+hlfir.assign is to help the temporary removal, and also to deal with two edge
+cases: user assignment in a FORALL (the forall pass will need to understand that
+this is an assignment), and allocatable assignment mixed with user assignment
+(implementing this as a call would require lowering to generate the whole
+reallocation logic itself, duplicating what hlfir.assign should already deal
+with).
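+
+As a rough illustration in the proposed syntax (the brace form used for the
+attribute and the elided right-hand side operation are assumptions made for
+this sketch):
+
+```
+// Fortran: a = b + c
+%rhs = // … operation producing b + c as !hlfir.expr<?x?xf32>
+hlfir.assign %rhs to %a
+
+// Fortran: x = y, with x an ALLOCATABLE array (F2003 reallocation semantics)
+hlfir.assign %y to %x {realloc}
+```
+
+In both cases lowering emits no temporary and no loop. Whether a temporary is
+needed is decided later by the alias analysis on the operands.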
+
+#### hlfir.ptr_assign
+
+Motivation: represent pointer assignment without lowering the exact pointer
+implementation (descriptor address, fir.ref<fir.box> or simple pointer scalar
+fir.llvm_ptr<fir.ptr>).
+
+Syntax:
+```
+hlfir.ptr_assign %var [[reshape %reshape] | [lbounds %lb1, …., %lbn]] to %ptr
+```
+
+It is important to keep pointer assignments at a high level so that they can
+later correctly be processed in hlfir.forall.
+
+#### hlfir.allocate
+
+Motivation: keep POINTER and ALLOCATABLE allocation explicit in HLFIR, while
+allowing later lowering to either inlined fir.allocmem or Fortran runtime
+calls. Generating runtime calls allows the runtime to do Fortran specific
+bookkeeping or flagging and to provide better runtime error reports.
+
+The main difference with the ALLOCATE statement is that one distinct
+hlfir.allocate has to be created for each element of the allocation-list.
+Otherwise, it is a naive lowering of the ALLOCATE statement.
+
+Syntax:
+```
+%stat = hlfir.allocate %var [%shape] [%type_params] [[src=%source] | [mold=%mold]] [errmsg =%errmsg]
+```
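+
+For instance, a statement with two allocation-list elements could be split as
+sketched below (the %shape_a and %shape_b operands and the result names are
+hypothetical):
+
+```
+// Fortran: allocate(a(n), b(m))
+%stat_a = hlfir.allocate %a %shape_a
+%stat_b = hlfir.allocate %b %shape_b
+
+// Fortran: allocate(c, source=d)
+%stat_c = hlfir.allocate %c src=%d
+```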
+
+#### hlfir.deallocate
+
+Motivation: keep deallocation explicit in HLFIR, while allowing later lowering
+to Fortran runtime calls to allow the runtime to do Fortran specific
+bookkeeping or flagging of allocations.
+
+Similarly to hlfir.allocate, one operation must be created for each
+allocate-object-list object.
+
+Syntax:
+```
+%stat = hlfir.deallocate %var [errmsg=%errmsg]
+```
+
+####  hlfir.elemental
+
+Motivation: represent elemental operations without defining array level
+operations for each of them, and allow the representation of array expressions
+as function of the indices.
+
+The hlfir.elemental operation can be seen as a closure: it is defining a
+function of the indices that returns the value of the element of the
+represented array expression at the given indices. This is an operation with an
+MLIR region. It allows detailing how an elemental expression is implemented at
+the element level, without yet requiring materializing the operands and result
+in memory.  The hlfir.expr<T> element values can be obtained using hlfir.apply.
+
+The element result is built with a fir.result op, whose result type can be a
+scalar hlfir.expr<T> or any scalar constant size type (e.g. i32, or f32).
+
+Syntax:
+```
+%op = hlfir.elemental (%indices) %shape [%type_params] [%dynamic_type] {
+  ….
+  fir.result %result_element
+}
+```
+
+
+Note that %indices are not operands, they are the elemental region block
+arguments, representing the array iteration space in a one based fashion.
+The choice of using one based indices is to match the Fortran default for
+array variables, so that there is no need to generate bound adjustments
+when working with one based array variables in an expression.
+
+Illustration: “B + C” represented with a hlfir.elemental.
+
+```
+%add = hlfir.elemental (%i:index, %j:index) shape %shape (!fir.shape<2>) -> !hlfir.expr<?x?xf32> {
+  %belt = hlfir.designate %b, %i, %j {fir.ref = _QPfooEb, fir.def = _QPfooEb.des001} : (!fir.ref<!fir.array<?x?xf32>>, index, index) -> !fir.ref<f32>
+  %celt = hlfir.designate %c, %i, %j {fir.ref = _QPfooEc, fir.def = _QPfooEc.des002} : (!fir.ref<!fir.array<?x?xf32>>, index, index) -> !fir.ref<f32>
+  %bval = fir.load %belt : (!fir.ref<f32>) -> f32
+  %cval = fir.load %celt : (!fir.ref<f32>) -> f32
+  %res = arith.addf %bval, %cval : f32
+  fir.result %res : f32
+}
+```
+
+In contexts where it can be proved that the array operands were not modified
+between the hlfir.elemental and the hlfir.apply, the region of the
+hlfir.elemental can be inlined at the hlfir.apply. Otherwise, if there is no
+such guarantee, or if the hlfir.elemental is not “visible” (because its result
+is passed as a block argument), the hlfir.elemental will be lowered to an array
+temporary. This will be done as an HLFIR-to-HLFIR optimization pass. Note that
+MLIR inlining could be used if hlfir.elemental implemented the
+CallableOpInterface and hlfir.apply the CallOpInterface. However, MLIR generic
+inlining is probably too generic for this case: no recursion is possible here,
+the call graphs are trivial, and using MLIR inlining here could later introduce
+conflicts or make normal function inlining more complex because the FIR
+inlining hooks would already be used.
+
+hlfir.elemental allows delaying the buffering and combination of elemental
+array expressions. Its generic aspect has two advantages:
+- It avoids defining one operation per elemental operation or intrinsic;
+  instead, the related arith dialect operations can be used directly in the
+  elemental regions. This avoids growing HLFIR and having to maintain about a
+  hundred operations.
+- It allows representing transformational intrinsics as functions of the indices
+  while doing optimization as described in
+  [Array Composition](ArrayComposition.md). This is because the indices can be
+  transformed inside the region before being applied to array variables
+  according to any kind of transformation (semi-affine or not).
+
+
+#### Introducing the hlfir.apply operation
+
+Motivation: provide a way to get an element of an array expression
+(hlfir.expr<?x…xT>).
+
+This is the addressing equivalent for expressions. A notable difference is that
+it can only take simple scalar indices (no triplets) because it is not clear
+why supporting triplets would be needed, and keeping the indexing simple makes
+inlining of hlfir.elemental much easier.
+
+If hlfir.elemental inlining is not performed, or if the hlfir.expr<T> array
+expression is produced by another operation (like fir.intrinsic) that is not
+rewritten, hlfir.apply will be lowered to an actual addressing operation that
+will address the temporary that was created for the hlfir.expr<T> value that
+was materialized in memory.
+
+hlfir.apply indices will be one based to make further lowering simpler.
+
+Syntax:
+```
+%element = hlfir.apply %array_expr, %i, %j : (hlfir.expr<?x?xi32>) -> i32
+```
+
+
+#### Introducing operations for transformational intrinsic functions
+
+Motivation: represent transformational intrinsic functions at a high level so
+that they can be manipulated easily by the optimizer, and do not require
+materializing the result as a temporary in lowering.
+
+An operation will be added for each Fortran transformational function (SUM,
+MATMUL, TRANSPOSE, ...). It translates the Fortran expression verbatim: it takes
+the same number of arguments as the Fortran intrinsic and returns a
+hlfir.expr<T>. The arguments may be hlfir.expr<T>, simple scalar types (e.g.,
+i32, f32), or variables.
+
+The one exception is that statically absent arguments are still passed to the
+operation (as results of fir.absent operations), so that the arguments can be
+identified by their positions.
+
+These operations are meant for the transformational intrinsics only. Elemental
+intrinsics will be implemented using hlfir.elemental and MLIR math dialect
+operations, and intrinsic subroutines (like random_seed or system_clock) will
+be lowered directly in lowering.
+
+Syntax:
+```
+%res = hlfir."intrinsic_name" %expr_or_var, ...
+```
+
+These operations will all inherit from the same base operation in tablegen to
+make their definition and identification easy.
+
+Without any optimization, codegen would then translate the operations to
+exactly the same FIR as currently generated by IntrinsicCall.cpp (runtime calls
+or inlined code with temporary allocation for array results). Because the
+operations are verbatim translations of the Fortran, it should be possible to
+move the lowering code to a translation pass without massive changes.
+
+An operation will at least be created for each of the following transformational
+intrinsics: all, any, count, cshift, dot_product, eoshift, findloc, iall, iany,
+iparity, matmul, maxloc, maxval, minloc, minval, norm2, pack, parity, product,
+reduce, repeat, reshape, spread, sum, transfer, transpose, trim, unpack.
+
+For the following transformational intrinsics, the current lowering to runtime
+calls will probably be kept since there is little point in keeping them high
+level:
+- command_argument_count, get_team, null, num_images, team_number, this_image,
+  which are more program related (and cannot appear, for instance, in constant
+  expressions)
+- selected_char_kind, selected_int_kind, selected_real_kind, which return
+  scalar integers
+
+#### Introducing operations for character operations and elemental intrinsic functions
+
+
+Motivation: represent character operations without requiring the operands and
+results to be materialized in memory.
+
+hlfir.char_op is intended to represent:
+-  Character concatenation (//)
+-  Character MIN/MAX
+-  Character MERGE
+-  “SET_LENGTH”
+-  Character conversions
+-  REPEAT
+-  INDEX
+-  CHAR
+-  Character comparisons
+-  LEN_TRIM
+
+The arguments must be scalars; the elemental aspect should be handled by a
+hlfir.elemental operation.
+
+Syntax:
+```
+%res = hlfir."char_op" %expr_or_var
+```
+
+Just like for the transformational intrinsics, if no optimization occurs, these
+operations will be lowered to memory operations with temporary results (if the
+result is a character), using the same generation code as the one currently used
+in lowering.
+
+#### hlfir.array_ctor
+
+Motivation: represent array constructors without creating a temporary.
+
+Many array constructors have a limited number of elements (fewer than 10), but
+the current lowering of array constructors is rather complex because it must
+deal with the generic cases.
+
+Having a dedicated representation for array constructors will allow an easier
+lowering, and make array constructors a lot easier to manipulate.
+For instance, for small array constructors, loops could be unrolled with
+the array constructor elements without ever creating a dynamically allocated
+array temporary and a loop nest using it.
+
+Syntax:
+```
+%array_ctor = hlfir.array_ctor %expr1, %expr2 ….
+```
+
+Note that hlfir.elemental could be used to implement some ac-implied-do,
+although this is not yet clarified since ac-implied-do may contain more than
+one scalar element (they may contain a list of scalar and array values, which
+would render the representation in a hlfir.elemental tricky, but maybe not
+impossible using if/then/else and hlfir.elemental nests using the index value).
+One big issue though is that hlfir.elemental requires the result shape to be
+pre-computed (it is an operand), and with an ac-implied-do containing user
+transformational calls returning allocatable or pointer arrays, it is
+impossible to pre-evaluate the shape without evaluating all the function calls
+entirely (and therefore all the array constructor elements).
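+
+As a hypothetical illustration of the two difficulties above (the names
+`ctor_examples`, `f`, `demo`, `x` and `y` are not part of this proposal), the
+first constructor below contributes two scalars per iteration, and the extent of
+the second is only known once every call to `f` has been evaluated:
+
+```Fortran
+module ctor_examples
+  implicit none
+contains
+  ! Hypothetical user function returning an allocatable array whose
+  ! extent depends on the argument value.
+  function f(i) result(r)
+    integer, intent(in) :: i
+    integer, allocatable :: r(:)
+    integer :: k
+    r = [(k, k = 1, i)]
+  end function
+
+  subroutine demo(n)
+    integer, intent(in) :: n
+    integer :: i
+    integer, allocatable :: x(:), y(:)
+    ! Two scalar elements per iteration: x has 2*n elements.
+    x = [(i, 2*i, i = 1, n)]
+    ! Array-valued elements: the extent of y is the sum of the extents
+    ! of f(1) ... f(n), so it cannot be pre-computed without the calls.
+    y = [(f(i), i = 1, n)]
+    print *, size(x), size(y)
+  end subroutine
+end module
+```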
+
+#### hlfir.get_extent
+
+Motivation: inquire about the extent of a hlfir.expr, variable, or fir.shape
+
+Syntax:
+```
+%extent = hlfir.get_extent %shape_expr_or_var, dim
+```
+
+dim is a constant integer attribute.
+
+This allows inquiring about the extents of expressions whose shape may not yet
+be computable without generating detailed, low-level operations (e.g., for some
+transformational intrinsics), or avoiding low-level details for
+pointer and allocatable variables (where the descriptor needs to be read and
+loaded).
+
+#### hlfir.get_typeparam
+
+Motivation: inquire about the type parameters of a hlfir.expr, or variable.
+
+Syntax:
+```
+%param = hlfir.get_typeparam %expr_or_var [, param_name]
+```
+- param_name is an optional string attribute that must contain the length
+  parameter name if %expr_or_var is a derived type.
+
+####  hlfir.get_dynamic_type
+
+Motivation: inquire about the dynamic type of a polymorphic hlfir.expr or
+variable.
+
+Syntax:
+```
+%dynamic_type = hlfir.get_dynamic_type %expr_or_var
+```
+
+#### hlfir.get_lbound
+
+Motivation: inquire about the lower bounds of variables without digging into
+the implementation details of pointers and allocatables.
+
+Syntax:
+```
+%lb = hlfir.get_lbound %var, n
+```
+
+Note: n is an integer constant attribute for the (zero based) dimension.
+
+####  hlfir.shape_meet
+
+Motivation: represent conformity requirement/information between two array
+operands so that later optimization can choose the best shape information
+source, or insert conformity runtime checks.
+
+Syntax:
+```
+%shape = hlfir.shape_meet %shape1, %shape2
+```
+
+Suppose A(n), B(m) are two explicit shape arrays. Currently, when A+B is
+lowered, lowering chooses which operand shape gives the result shape
+information, and it is not possible to retrieve later that both n and m can be
+used. If lowering chooses n, but m later gets folded thanks to inlining or
+constant propagation, the optimization passes have no way to use this constant
+information to optimize the result storage allocation or vectorization of A+B.
+hlfir.shape_meet is intended to delay this choice until constant propagation or
+inlining can provide better information about n and m.
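+
+A minimal Fortran sketch of this situation, assuming a hypothetical subroutine
+`add` with a result array `r` (neither name comes from this proposal):
+
+```Fortran
+subroutine add(a, b, r, n, m)
+  implicit none
+  integer, intent(in) :: n, m
+  real, intent(in) :: a(n), b(m)
+  real, intent(out) :: r(n)
+  ! Conformance requires n == m, so either extent describes the shape
+  ! of a + b. If m becomes a known constant after inlining while n does
+  ! not, the extent taken from b is the more useful one.
+  r = a + b
+end subroutine
+```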
+
+#### hlfir.forall
+
+Motivation: segregate the Forall lowering complexity in its own unit.
+
+Forall is tough to lower because:
+-   Lowering it in an optimal way requires analyzing several assignments/mask
+    expressions.
+-   The shape of the temporary needed to store intermediate evaluation values is
+    not a Fortran array in the general case, and cannot in the general case be
+    maximized/pre-computed without executing the forall to compute the bounds of
+    inner forall, and the shape of the assignment operands that may depend on
+    the bound values.
+-   Mask expressions evaluation should be affected by previous assignment
+    statements, but not by the following ones. Array temporaries may be
+    required for the masks to cover this.
+-   On top of the above points, Forall can contain user assignments, pointer
+    assignments, and assignments to whole allocatables.
+
+
+The hlfir.forall syntax would be exactly that of a fir.do_loop. The
+difference would be that hlfir.assign and hlfir.ptr_assign inside hlfir.forall
+have specific semantics (the same as in Fortran):
+-   Given one hlfir.assign, all the iteration values of the LHS/RHS must be
+    evaluated before the assignment of any value is done.
+-   Given two hlfir.assign, the first hlfir.assign must be fully performed
+    before any evaluation of the operands of the second assignment is done.
+-   Masks (fir.if arguments), if any, should be evaluated before any nested
+    assignments. Any assignment syntactically before the where mask occurrence
+    must be performed before the mask evaluation.
+
+Note that forall forbids impure function calls. Hence, no call can modify
+the evaluation of any other expression, and calls can be removed if unused.
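+
+The first rule above (all LHS/RHS values are evaluated before any value is
+assigned) is what distinguishes a forall from a sequential loop. The small,
+self-contained program below is only an illustration; its names are
+hypothetical:
+
+```Fortran
+program forall_semantics
+  implicit none
+  integer :: i
+  integer :: a(5), b(5)
+  a = [1, 2, 3, 4, 5]
+  b = a
+  ! FORALL: every a(i-1) is read before any a(i) is written,
+  ! so a becomes [1, 1, 2, 3, 4].
+  forall (i = 2:5) a(i) = a(i-1)
+  ! A sequential DO loop reads values written by earlier iterations,
+  ! so b becomes [1, 1, 1, 1, 1].
+  do i = 2, 5
+    b(i) = b(i-1)
+  end do
+  print *, a
+  print *, b
+end program
+```
+
+A naive translation of the forall into a single loop nest would produce the
+second result, which is why a temporary may be required.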
+
+The translation of hlfir.forall will happen by:
+-   1. Determining if the where mask values may be modified by any assignment:
+    - Yes, pre-compute all masks in a pre-run of the forall loop, creating
+      “forall temps” (we may need a FIR concept to help here).
+    - No, do nothing (or indicate it is safe to evaluate masks while evaluating
+      the rest).
+-   2. Determining if a hlfir.assign operand expression depends on the
+       left-hand side base value of a previous hlfir.assign:
+    - Yes, split the hlfir.assign operations into their own nests of
+      hlfir.forall loops.
+    - No, do nothing (or indicate it is safe to evaluate the assignment while
+      evaluating previous assignments).
+-   3. For each assignment, checking if the RHS/LHS operand values may depend
+     on the LHS base:
+    - Yes, split the forall loops. Insert a “forall temp” before the loops for
+      the “smallest” part that may overlap (which may be the whole RHS, or some
+      RHS sub-part, or some LHS indices). In the first nest, evaluate this
+      overlapping part into the temp. In the next forall loop nest, modify the
+      assignment to use the temporary, and add the [no_overlap] flag to indicate
+      no further temporary is needed. Insert code to finalize the temp after its
+      usage.
+
+### Tagging variable uses in high-level operations (fir.ref attribute)
+
+All operations defined above that accept "variables" (i.e., memory addresses or
+box values that were produced by fir.declare, hlfir.associate, or
+hlfir.designate) must have a fir.ref = mangled_name_attribute that matches the
+fir.def on the operation that created them (it will be added automatically by
+the operation builder). This ensures that optimization passes do not merge
+seemingly identical operations using variables with different properties, and
+that the matching defining operation can always be retrieved to
+get all the variable properties (shape, bounds, type parameters and
+attributes).
+
+Two other alternatives have been considered and rejected:
+- Using MLIR symbols. This has been rejected because MLIR symbols are mainly
+  intended to deal with globals and functions that may refer to each other
+  before being defined. Their processing is not as light as normal values, and
+  would require turning every FIR operation with a region into an MLIR symbol
+  table. This would be especially annoying given that hlfir.designate also
+  produces variables with their own properties, which would imply creating a
+  lot of MLIR symbols. All the operations that accept both variable and
+  expression operands would also need to be more complex in order to accept
+  both SSA values and MLIR symbol operands (or some fir.as_expr %var operation
+  would have to be added to turn a variable into an expression). Given that all
+  variable definitions will dominate their uses, it seems more adequate to use
+  an SSA model with named attributes. Using SSA values also makes the
+  transition and mix with lower-level FIR operations smoother: a variable SSA
+  usage can simply be replaced by lower-level FIR operations using the same SSA
+  value.
+- Another alternative could be making all operations defining variables return
+  fir.box, and repeating the variable attributes (fir.target...) on all
+  operations using the variable. This would allow the link between the variable
+  definition and usage to become broken (variable could travel as block
+  arguments). But this would risk littering the codegen with fir.box
+  manipulations (creating, writing and reading to descriptors) that may lead to
+  poor performance. Maintaining all the attributes on the operations would also
+  be more cumbersome than only maintaining the variable name in the fir.ref
+  attribute.
+
+Lower-level operations (the current FIR operations) do not require this strong
+link between a memory address and the variable definition, and it will not be
+necessary to add fir.ref attributes to those. During alias analysis on FIR using
+lower-level operations (like loads and stores), any memory reference that cannot
+be resolved to a Fortran variable or some unrelated temporary allocation is
+considered as potentially overlapping.
+
+The variable definition will be guaranteed to have a unique name after lowering,
+but some care might have to be taken when later duplicating regions that define
+variables in a way that could lead a variable usage to have two dominating
+definitions with the same name (this could for instance happen after inlining
+two calls to the same procedure inside the same region). Inlining will need to
+take care of those conflicts. This could be done by making the inlined
+variable name attributes unique (for instance by appending a counter index that
+is incremented after each call inlining).
+
+## New HLFIR Transformation Passes
+
+### Mandatory Passes (translation towards lower-level representation)
+
+Note that these passes could be implemented as a single MLIR pass, or successive
+passes.
+
+-   Forall rewrites (getting rid of hlfir.forall)
+-   Array assignment rewrites (getting rid of array hlfir.assign)
+-   Bufferization: expression temporary materialization (getting rid of
+    hlfir.expr, and all the operations that may produce it like transformational
+    intrinsics and hlfir.elemental, hlfir.apply).
+-   Call interface argument association lowering (getting rid of hlfir.associate
+    and hlfir.end_associate)
+-   Lowering high level operations using variables into FIR operations
+    operating on memory (translating hlfir.designate, scalar hlfir.assign,
+    hlfir.finalize into fir.array_coor, fir.do_loop, fir.store, fir.load, and
+    fir.embox/fir.rebox operations).
+
+Note that these passes do not have to be the first ones run after lowering. It
+is intended that CSE, DCE, algebraic simplification, inlining and some other new
+high-level optimization passes discussed below be run before doing any of these
+translations.
+
+After that, the current FIR pipeline could be used to continue lowering towards
+LLVM.
+
+### Optimization Passes
+
+-   Elemental expression inlining (inlining of hlfir.elemental in hlfir.apply)
+-   User function Inlining
+-   Transformational intrinsic rewrites as hlfir.elemental expressions
+-   Assignments propagation
+-   Shape/Rank/dynamic type propagation
+
+These high level optimization passes can be run any number of times in any
+order.
+
+## Transition Plan
+
+The new higher-level steps proposed in this document will require significant
+refactoring of lowering. Codegen should not be impacted since the current FIR
+will remain untouched.
+
+A lot of the code in lowering generating Fortran features (like an intrinsic or
+how to do assignments) is based on the fir::ExtendedValue concept. This
+currently is a collection of mlir::Value that allows describing a Fortran object
+(either a variable or an evaluated expression result). The variable and
+expression concepts described above should allow keeping an interface very
+similar to the fir::ExtendedValue, but having the fir::ExtendedValue wrap a
+single value or mlir::Operation* from which all of the object entity
+information can be inferred.
+
+That way, all the helpers currently generating FIR from fir::ExtendedValue could
+be kept and used with the new variable and expression concepts with as little
+modification as possible.
+
+The proposed plan is to:
+- 1. Introduce the new HLFIR operations.
+- 2. Refactor fir::ExtendedValue so that it can work with the new variable and
+     expression concepts (requires part of 1.).
+- 3. Introduce the new translation passes, using the fir::ExtendedValue helpers
+     (requires 1.).
+- 3.b Introduce the new optimization passes (requires 1.).
+- 4. Introduce the fir.declare and hlfir.finalize usage in lowering (requires 1.
+     and 2. and part of 3.).
+
+The following steps might have to be done in parallel with the current lowering,
+to avoid disturbing the work on performance until the new lowering is complete
+and on par.
+
+- 5. Introduce hlfir.designate and hlfir.associate usage in lowering.
+- 6. Introduce lowering to hlfir.assign (with RHS that is not a hlfir.expr),
+     hlfir.ptr_assign.
+- 7. Introduce lowering to hlfir.expr and related operations.
+- 8. Introduce lowering to hlfir.forall.
+
+At that point, lowering using the high-level FIR should be in place, allowing
+extensive testing.
+- 9. Debugging correctness.
+- 10. Debugging execution performance.
+
+The plan is to do these steps incrementally upstream. For lowering, however, it
+will most likely be safer to implement the new expression lowering in parallel
+upstream, behind an option that selects it, rather than to directly modify the
+current expression lowering and keep it functionally and performance-wise
+equivalent at every step.
+
+## Examples
+
+### Example 1: simple array assignment
+
+```Fortran
+subroutine foo(a, b)
+  real :: a(:), b(:)
+  a = b
+end subroutine
+```
+
+Lowering output:
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa"} : !fir.box<!fir.array<?xf32>>
+  %b = fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  hlfir.assign %b to %a {fir.ref = "_QPfooEb,_QPfooEa"}: !fir.box<!fir.array<?xf32>>
+  return
+}
+```
+
+HLFIR array assignment lowering pass:
+-   Query: can %b value depend on %a? No, they are two different argument
+    associated variables that are neither targets nor pointers.
+-   Lower the assignment to a loop:
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa"} : !fir.box<!fir.array<?xf32>>
+  %b = fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+
+  %ashape = hlfir.shape_of %a {fir.ref = "_QPfooEa"}
+  %bshape = hlfir.shape_of %b {fir.ref = "_QPfooEb"}
+  %shape = hlfir.shape_meet %ashape, %bshape
+  %extent = hlfir.get_extent %shape, 0
+
+  %c1 = arith.constant 1 : index
+
+  fir.do_loop %i = %c1 to %extent step %c1 unordered {
+    %belt = hlfir.designate %b, %i {fir.ref = "_QPfooEb", fir.def = "_QPfooEb.des001"}
+    %aelt = hlfir.designate %a, %i {fir.ref = "_QPfooEa", fir.def = "_QPfooEa.des002"}
+    hlfir.assign %belt to %aelt {fir.ref = "_QPfooEb.des001,_QPfooEa.des002"}: fir.ref<f32>, fir.ref<f32>
+  }
+  return
+}
+```
+
+HLFIR variable operations to memory translation pass:
+-   hlfir.designate is rewritten into a fir.array_coor operation on the variable
+    associated memory buffer, and returns the element address.
+-   For numerical scalars, hlfir.assign is rewritten to fir.store (and fir.load
+    of the operand if needed). For derived types and characters, a memory copy
+    (and padding for characters) is done.
+-   hlfir.shape_of operations are lowered to fir.box_dims. Here, no constant
+    information was obtained from any of the source shapes, so hlfir.shape_meet
+    is a no-op that selects the first shape (a conformity runtime check could be
+    inserted under debug options).
+-   fir.declare operations are kept (they are no-ops) so that it will be
+    possible to generate debug information for LLVM.
+
+This pass would wrap operations defining variables (fir.declare/hlfir.designate)
+as fir::ExtendedValue, and use all the current helpers operating on it
+(e.g.: fir::factory::genScalarAssignment).
+
+```
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1:
+  !fir.box<!fir.array<?xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa"} : !fir.box<!fir.array<?xf32>>
+  %b = fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  %c1 = arith.constant 1 : index
+  %dims = fir.box_dims %a, 1
+  fir.do_loop %i = %c1 to %dims#1 step %c1 unordered {
+    %belt = fir.array_coor %b, %i : (!fir.box<!fir.array<?xf32>>, index) -> fir.ref<f32>
+    %aelt = fir.array_coor %a, %i : (!fir.box<!fir.array<?xf32>>, index) -> fir.ref<f32>
+    %bval = fir.load %belt : f32
+    fir.store %bval to %aelt : fir.ref<f32>
+  }
+  return
+}
+```
+
+This reaches the current FIR level (except for fir.declare, which can be kept
+until LLVM codegen and dropped if no debug information is generated).
+
+### Example 2: array assignment with elemental expression
+
+```Fortran
+subroutine foo(a, b, p, c)
+  real, target :: a(:)
+  real :: b(:), c(100)
+  real, pointer :: p(:)
+  a = b*p + c
+end subroutine
+```
+
+Lowering output:
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>, %arg2: !fir.box<!fir.ptr<!fir.array<?xf32>>>, %arg3: !fir.ref<!fir.array<100xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa", fir.target} : !fir.box<!fir.array<?xf32>>
+  %b =  fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  %p = fir.declare %arg2 {fir.def = "_QPfooEp", fir.ptr} : !fir.box<!fir.ptr<!fir.array<?xf32>>>
+  %c =  fir.declare %arg3 {fir.def = "_QPfooEc"} : !fir.ref<!fir.array<100xf32>>
+  %bshape = hlfir.shape_of %b {fir.ref = "_QPfooEb"}
+  %pshape = hlfir.shape_of %p {fir.ref = "_QPfooEp"}
+  %shape1 = hlfir.shape_meet %bshape, %pshape
+  %mul = hlfir.elemental(%i:index) %shape1 {
+    %belt = hlfir.designate %b, %i {fir.ref = "_QPfooEb", fir.def= "_QPfooEb.des001"}
+    %p_lb = hlfir.get_lbound %p, 1 {fir.ref = "_QPfooEp"}
+    %i_zero = arith.subi %i, %c1
+    %i_p = arith.addi %i_zero,  %p_lb
+    %pelt = hlfir.designate %p, %i_p {fir.ref = "_QPfooEp", fir.def= "_QPfooEp.des002"}
+    %bval = fir.load %belt : f32
+    %pval = fir.load %pelt : f32
+    %mulres = arith.mulf %bval, %pval : f32
+     fir.result %mulres : f32
+  }
+  %cshape = hlfir.shape_of %c
+  %shape2 = hlfir.shape_meet %cshape, %shape1
+  %add =  hlfir.elemental(%i:index) %shape2 {
+    %mulval = hlfir.apply %mul, %i : f32
+    %celt = hlfir.designate %c, %i {fir.ref = "_QPfooEc", fir.def= "_QPfooEc.des003"}
+    %cval = fir.load %celt
+    %add_res = arith.addf %mulval, %cval
+    fir.result %add_res
+  }
+  hlfir.assign %add to %a {fir.ref = "_QPfooEa"} : hlfir.expr<?xf32>, !fir.box<!fir.array<?xf32>>
+  return
+}
+```
+
+Step 1: hlfir.elemental inlining: inline the first hlfir.elemental into the
+second one at the hlfir.apply.
+
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>, %arg2: !fir.box<!fir.ptr<!fir.array<?xf32>>>, %arg3: !fir.ref<!fir.array<100xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa", fir.target} : !fir.box<!fir.array<?xf32>>
+  %b =  fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  %p = fir.declare %arg2 {fir.def = "_QPfooEp", fir.ptr} : !fir.box<!fir.ptr<!fir.array<?xf32>>>
+  %c =  fir.declare %arg3 {fir.def = "_QPfooEc"} : !fir.ref<!fir.array<100xf32>>
+  %bshape = hlfir.shape_of %b {fir.ref = "_QPfooEb"}
+  %pshape = hlfir.shape_of %p {fir.ref = "_QPfooEp"}
+  %shape1 = hlfir.shape_meet %bshape, %pshape
+  %cshape = hlfir.shape_of %c
+  %shape2 = hlfir.shape_meet %cshape, %shape1
+  %add =  hlfir.elemental(%i:index) %shape2 {
+    %belt = hlfir.designate %b, %i {fir.ref = "_QPfooEb", fir.def= "_QPfooEb.des001"}
+    %p_lb = hlfir.get_lbound %p, 1 {fir.ref = "_QPfooEp"}
+    %i_zero = arith.subi %i, %c1
+    %i_p = arith.addi %i_zero,  %p_lb
+    %pelt = hlfir.designate %p, %i_p {fir.ref = "_QPfooEp", fir.def= "_QPfooEp.des002"}
+    %bval = fir.load %belt : f32
+    %pval = fir.load %pelt : f32
+    %mulval = arith.mulf %bval, %pval : f32
+    %celt = hlfir.designate %c, %i {fir.ref = "_QPfooEc", fir.def= "_QPfooEc.des003"}
+    %cval = fir.load %celt
+    %add_res = arith.addf %mulval, %cval
+    fir.result %add_res
+  }
+  hlfir.assign %add to %a {fir.ref = "_QPfooEa"} : hlfir.expr<?xf32>, !fir.box<!fir.array<?xf32>>
+  return
+}
+```
+
+Step 2: alias analysis around the array assignment:
+
+-   May %add value depend on %a variable?
+-   Gather variable and function calls in %add operand tree (visiting
+    hlfir.elemental regions)
+-   Gather references to %b, %p, and %c. %p is a pointer variable according to
+    its defining operations. It may alias with %a that is a target. -> answer
+    yes.
+-   Insert temporary, and duplicate array assignments, that can be lowered to
+    loops at that point
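+
+To see why the conservative answer is needed, consider a hypothetical caller
+(the module and program names, `x`, `y`, `z` and `q` are assumptions made for
+this illustration). The pointer passed as `p` designates the elements of the
+target passed as `a` in reverse order, so evaluating `b*p + c` element by
+element while storing into `a` would read elements that were already
+overwritten:
+
+```Fortran
+module foo_mod
+  implicit none
+contains
+  subroutine foo(a, b, p, c)
+    real, target :: a(:)
+    real :: b(:), c(100)
+    real, pointer :: p(:)
+    a = b*p + c
+  end subroutine
+end module
+
+program caller
+  use foo_mod
+  implicit none
+  integer :: i
+  real, target :: x(100)
+  real :: y(100), z(100)
+  real, pointer :: q(:)
+  x = [(real(i), i = 1, 100)]
+  y = 2.0
+  z = 3.0
+  ! q views x backwards: inside foo, p(i) is the same storage as a(101-i).
+  q => x(100:1:-1)
+  call foo(x, y, q, z)
+  print *, x(1), x(100)
+end program
+```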
+
+Note that the alias analysis could have already occurred without inlining the
+%add hlfir.elemental.
+
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>, %arg2: !fir.box<!fir.ptr<!fir.array<?xf32>>>, %arg3: !fir.ref<!fir.array<100xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa", fir.target} : !fir.box<!fir.array<?xf32>>
+  %b =  fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  %p = fir.declare %arg2 {fir.def = "_QPfooEp", fir.ptr} : !fir.box<!fir.ptr<!fir.array<?xf32>>>
+  %c =  fir.declare %arg3 {fir.def = "_QPfooEc"} : !fir.ref<!fir.array<100xf32>>
+  %bshape = hlfir.shape_of %b {fir.ref = "_QPfooEb"}
+  %pshape = hlfir.shape_of %p {fir.ref = "_QPfooEp"}
+  %shape1 = hlfir.shape_meet %bshape, %pshape
+  %cshape = hlfir.shape_of %c
+  %shape2 = hlfir.shape_meet %cshape, %shape1
+  %add =  hlfir.elemental(%i:index) %shape2 {
+    %belt = hlfir.designate %b, %i {fir.ref = "_QPfooEb", fir.def= "_QPfooEb.des001"}
+    %p_lb = hlfir.get_lbound %p, 1 {fir.ref = "_QPfooEp"}
+    %i_zero = arith.subi %i, %c1
+    %i_p = arith.addi %i_zero,  %p_lb
+    %pelt = hlfir.designate %p, %i_p {fir.ref = "_QPfooEp", fir.def= "_QPfooEp.des002"}
+    %bval = fir.load %belt : f32
+    %pval = fir.load %pelt : f32
+    %mulval = arith.mulf %bval, %pval : f32
+    %celt = hlfir.designate %c, %i {fir.ref = "_QPfooEc", fir.def= "_QPfooEc.des003"}
+    %cval = fir.load %celt
+    %add_res = arith.addf %mulval, %cval
+    fir.result %add_res
+  }
+  %extent = hlfir.get_extent %shape2, 0: (fir.shape<1>) -> index
+  %tempstorage = fir.allocmem %extent : fir.heap<fir.array<?xf32>>
+  %temp = fir.declare %tempstorage, shape %extent {fir.def = "_QPfoo.temp001"} : (index) -> fir.heap<fir.array<?xf32>>
+  hlfir.assign %add to %temp : no_overlap {fir.ref = "_QPfoo.temp001"} : hlfir.expr<?xf32>, fir.heap<fir.array<?xf32>>
+  hlfir.assign %temp to %a : no_overlap {fir.ref = "_QPfoo.temp001,_QPfooEa"} : fir.heap<fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>
+  hlfir.finalize %temp {fir.ref = "_QPfoo.temp001"}
+  fir.freemem %tempstorage
+  return
+}
+```
+
+Step 4: Lower assignments to regular loops since they have the no_overlap
+attribute, and inline the hlfir.elemental into the first loop nest.
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>, %arg2: !fir.box<!fir.ptr<!fir.array<?xf32>>>, %arg3: !fir.ref<!fir.array<100xf32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa", fir.target} : !fir.box<!fir.array<?xf32>>
+  %b =  fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.box<!fir.array<?xf32>>
+  %p = fir.declare %arg2 {fir.def = "_QPfooEp", fir.ptr} : !fir.box<!fir.ptr<!fir.array<?xf32>>>
+  %c =  fir.declare %arg3 {fir.def = "_QPfooEc"} : !fir.ref<!fir.array<100xf32>>
+  %bshape = hlfir.shape_of %b {fir.ref = "_QPfooEb"}
+  %pshape = hlfir.shape_of %p {fir.ref = "_QPfooEp"}
+  %shape1 = hlfir.shape_meet %bshape, %pshape
+  %cshape = hlfir.shape_of %c
+  %shape2 = hlfir.shape_meet %cshape, %shape1
+  %extent = hlfir.get_extent %shape2, 0: (fir.shape<1>) -> index
+  %tempstorage = fir.allocmem %extent : fir.heap<fir.array<?xf32>>
+  %temp = fir.declare %tempstorage, shape %extent {fir.def = "_QPfoo.temp001"} : (index) -> fir.heap<fir.array<?xf32>>
+  fir.do_loop %i = %c1 to %extent step %c1 unordered {
+    %belt = hlfir.designate %b, %i {fir.ref = "_QPfooEb", fir.def= "_QPfooEb.des001"}
+    %p_lb = hlfir.get_lbound %p, 1 {fir.ref = "_QPfooEp"}
+    %i_zero = arith.subi %i, %c1
+    %i_p = arith.addi %i_zero,  %p_lb
+    %pelt = hlfir.designate %p, %i_p {fir.ref = "_QPfooEp", fir.def= "_QPfooEp.des002"}
+    %bval = fir.load %belt : f32
+    %pval = fir.load %pelt : f32
+    %mulval = arith.mulf %bval, %pval : f32
+    %celt = hlfir.designate %c, %i {fir.ref = "_QPfooEc", fir.def= "_QPfooEc.des003"}
+    %cval = fir.load %celt
+    %add_res = arith.addf %mulval, %cval
+    %tempelt = hlfir.designate %temp, %i {fir.ref = "_QPfoo.temp001", fir.def="_QPfoo.temp001.des004"}
+    hlfir.assign %add_res to %tempelt {fir.ref = "_QPfoo.temp001.des004"}: f32, fir.ref<f32>
+  }
+  fir.do_loop %i = %c1 to %extent step %c1 unordered {
+    %aelt = hlfir.designate %a, %i {fir.ref = "_QPfooEa", fir.def= "_QPfooEa.des005"}
+    %tempelt = hlfir.designate %temp, %i {fir.ref = "_QPfoo.temp001", fir.def="_QPfoo.temp001.des006"}
+    hlfir.assign %tempelt to %aelt {fir.ref = "_QPfoo.temp001.des006,_QPfooEa.des005"}: fir.ref<f32>, fir.ref<f32>
+  }
+  hlfir.finalize %temp {fir.ref = "_QPfoo.temp001"}
+  fir.freemem %tempstorage
+  return
+}
+```
+
+Step 5 (may also occur earlier or several times): shape propagation.
+-   %shape2 can be inferred from %cshape that has constant shape: the
+    hlfir.shape_meet results can be replaced by it, and if the option is set,
+    conformance checks can be added for %a, %b and %p.
+-   %temp is small, and its fir.allocmem can be promoted to a stack allocation
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.box<!fir.array<?xf32>>, %arg1: !fir.box<!fir.array<?xf32>>, %arg2: !fir.box<!fir.ptr<!fir.array<?xf32>>>, %arg3: !fir.ref<!fir.array<100xf32>>) {
+  // .....
+  %cshape = fir.shape %c100
+  %extent = %c100
+  // updated fir.alloca
+  %tempstorage = fir.alloca %extent : fir.ref<fir.array<100xf32>>
+  %temp = fir.declare %tempstorage {fir.def = "_QPfoo.temp001"} : fir.ref<fir.array<100xf32>>
+  fir.do_loop %i = %c1 to %c100 step %c1 unordered {
+    // ...
+  }
+  fir.do_loop %i = %c1 to %c100 step %c1 unordered {
+    // ...
+  }
+  hlfir.finalize %temp {fir.ref = "_QPfoo.temp001"}
+  // deleted fir.freemem %tempstorage
+  return
+}
+```
+
+Step 6: lower hlfir.designate/hlfir.assign in a translation pass.
+
+At this point, the representation is similar to the current representation after
+the array value copy pass, and the existing FIR flow is used (lowering
+fir.do_loop to cfg and doing codegen to LLVM).
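+
+As an illustration only, the end of the first loop of Step 5 might look as
+follows once hlfir.designate and hlfir.assign have been translated. This is a
+sketch that assumes the temporary is addressed with %cshape (it has the same
+constant extent as %c) and that the elided part of the loop body still
+produces %add_res; the exact output of the pass may differ.
+
+```HLFIR
+fir.do_loop %i = %c1 to %c100 step %c1 unordered {
+  // ... lowered addressing and loads of %b, %p and %c producing %add_res
+  // hlfir.designate %temp, %i becomes an element address computation.
+  // Assumption: %cshape is reused as the shape of the temporary.
+  %tempelt = fir.array_coor %temp(%cshape) %i : (!fir.ref<!fir.array<100xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+  // A scalar f32 hlfir.assign becomes a plain store.
+  fir.store %add_res to %tempelt : !fir.ref<f32>
+}
+```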
+
+### Example 3: assignments with vector subscript
+
+```Fortran
+subroutine foo(a, b, v)
+  real :: a(*), b(*)
+  integer :: v(:)
+  a(v) = b(v)
+end subroutine
+```
+
+Lowering of vector subscripted entities would happen as follows:
+- vector subscripted entities would be lowered as a hlfir.elemental implementing
+  the vector subscript addressing.
+- If the vector appears in a context where it can be modified (which can only
+  be an assignment LHS, or in input IO), lowering could transform the
+  hlfir.elemental into hlfir.forall (for assignments), or a fir.iter_while (for
+  input IO) by inlining the elemental body into the created loops, and
+  identifying the hlfir.designate producing the result.
+
+```HLFIR
+func.func @_QPfoo(%arg0: !fir.ref<!fir.array<?xf32>>, %arg1: !fir.ref<!fir.array<?xf32>>, %arg2: !fir.box<!fir.array<?xi32>>) {
+  %a = fir.declare %arg0 {fir.def = "_QPfooEa"} : !fir.ref<!fir.array<?xf32>>
+  %b = fir.declare %arg1 {fir.def = "_QPfooEb"} : !fir.ref<!fir.array<?xf32>>
+  %v = fir.declare %arg2 {fir.def = "_QPfooEv"} : !fir.box<!fir.array<?xi32>>
+  %vshape = hlfir.shape_of %v : fir.shape<1>
+  %bsection =  hlfir.elemental(%i:index) %vshape : (fir.shape<1>) -> hlfir.expr<?xf32> {
+    %v_elt = hlfir.designate %v, %i {fir.ref = "_QPfooEv", fir.def="_QPfooEv.des001"} : (!fir.box<!fir.array<?xi32>>, index) -> fir.ref<i32>
+    %v_val = fir.load %v_elt : fir.ref<i32>
+    %cast = fir.convert %v_val : (i32) -> index
+    %b_elt = hlfir.designate %b, %cast {fir.ref = "_QPfooEb", fir.def="_QPfooEb.des002"} : (!fir.ref<!fir.array<?xf32>>, index) -> fir.ref<f32>
+    %b_val = fir.load %b_elt : fir.ref<f32>
+    fir.result %b_val
+  }
+  %extent = hlfir.get_extent %vshape, 0 : (fir.shape<1>) -> index
+  %c1 = arith.constant 1 : index
+  hlfir.forall (%i from %c1 to %extent step %c1) {
+    %b_section_val = hlfir.apply %bsection, %i : (hlfir.expr<?xf32>, index) -> f32
+    %v_elt = hlfir.designate %v, %i {fir.ref = "_QPfooEv", fir.def="_QPfooEv.des003"} : (!fir.box<!fir.array<?xi32>>, index) -> fir.ref<i32>
+    %v_val = fir.load %v_elt : fir.ref<i32>
+    %cast = fir.convert %v_val : (i32) -> index
+    %a_elt = hlfir.designate %a, %cast {fir.ref = "_QPfooEa", fir.def="_QPfooEa.des004"} : (!fir.ref<!fir.array<?xf32>>, index) -> fir.ref<f32>
+    hlfir.assign %b_section_val to %a_elt {fir.ref="_QPfooEa.des004"} : f32, fir.ref<f32>
+  }
+  return
+}
+```
+
+This would then be lowered as described in the examples above (hlfir.elemental
+will be inlined, hlfir.forall will be rewritten into normal loops taking into
+account the alias analysis, and hlfir.assign/hlfir.designate operations will be
+lowered to fir.array_coor and fir.store operations).
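+
+A possible end result for this example is sketched below. It is hypothetical:
+it assumes that alias analysis concluded that no temporary is needed, that the
+two loads of v(i) were merged, that %extent still stands for the extent of %v,
+and that %ashape and %bshape (names introduced here) are fir.shape values
+describing the assumed-size arrays.
+
+```HLFIR
+fir.do_loop %i = %c1 to %extent step %c1 {
+  %v_elt = fir.array_coor %v %i : (!fir.box<!fir.array<?xi32>>, index) -> !fir.ref<i32>
+  %v_val = fir.load %v_elt : !fir.ref<i32>
+  %idx = fir.convert %v_val : (i32) -> index
+  // Inlined hlfir.elemental body: read b(v(i)).
+  %b_elt = fir.array_coor %b(%bshape) %idx : (!fir.ref<!fir.array<?xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+  %b_val = fir.load %b_elt : !fir.ref<f32>
+  // Lowered hlfir.designate/hlfir.assign: write a(v(i)).
+  %a_elt = fir.array_coor %a(%ashape) %idx : (!fir.ref<!fir.array<?xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+  fir.store %b_val to %a_elt : !fir.ref<f32>
+}
+```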
+
+# Alternatives that were not retained
+
+## Using a non-MLIR based mutable CFG representation
+
+An option would have been to extend the PFT to describe expressions in a way
+that can be annotated and modified with the ability to introduce temporaries.
+This has been rejected because this would imply a whole new set of
+infrastructure and data structures while FIR is already using MLIR
+infrastructure, so enriching FIR seems a smoother approach and will benefit from
+the MLIR infrastructure experience that was gained.
+
+## Using some existing MLIR dialects for the high-level Fortran.
+
+### Why not use the Linalg dialect?
+
+The linalg dialect offers a powerful way to represent array operations: the
+linalg.generic operation takes a set of input and output arrays, a related set
+of affine maps to represent how these inputs/outputs are to be addressed, and a
+region detailing what operation should happen at each iteration point, given the
+input and output array elements. It seems mainly intended to optimize matmul,
+dot, and sum.
+
+Issues:
+
+-   The linalg dialect is tightly linked to the tensor/memref concepts that
+    cannot represent byte stride based discontinuity and would most likely
+    require FIR to use MLIR memref descriptor format to take advantage of it.
+-   It is not clear whether all Fortran array expression addressing can be
+    represented as semi-affine maps. For instance, vector subscripted entities
+    can probably not, which may force creating temporaries for the related
+    designator expressions to fit in this framework. Fortran has a lot more
+    transformational intrinsics than matmul, dot, and sum that can and should
+    still be optimized.
+
+So while there may be benefits to using linalg at the optimization level (like
+rewriting fir.sum/fir.matmul to a linalg sum, with dialect types plumbing
+around the operands and results, to get tiling done by linalg), using it as a
+lowering target would not cover all Fortran needs (especially for the non
+semi-affine cases).
+So using linalg is for now left as an optimization pass opportunity that could
+be experimented with in some cases.
+
+### Why not use the Shape dialect?
+
+The MLIR shape dialect provides a set of operations to manipulate shapes. The
+shape.meet operation closely matches hlfir.shape_meet, except that it
+returns a tensor or a shape.shape.
+
+The main issue with using the shape dialect is that it is dependent on tensors.
+Bringing the tensor toolchain in flang for the sole purpose of manipulating
+shapes is not seen as beneficial given that the only thing Fortran needs is
+shape.meet. The shape dialect is a lot more complex because it is intended to
+deal with computations involving dynamically ranked entities, which is not the
+case in Fortran (assumed rank usage in Fortran is greatly limited).
+
+## Using embox/rebox and box as an alternative to fir.declare/hlfir.designate and hlfir.expr/variable concepts
+
+All Fortran entities (*) can be described at runtime by a fir.box, except for
+some attributes that are not part of the runtime descriptors (like TARGET,
+OPTIONAL or VOLATILE).  In that sense, it would be possible to have
+fir.declare, hlfir.designate, and hlfir.associate be replaced by embox/rebox,
+and also to have all operations creating hlfir.expr create a fir.box instead.
+
+This was rejected because this would lack clarity and make embox/rebox
+semantics way too complex (their codegen is already non-trivial). It would
+also no longer be possible to know whether a fir.box operand is an expression
+or a variable, which would make reasoning harder: a fir.box implies that the
+expression has already been buffered, and it is not clear when looking at a
+fir.box whether the value it describes may change, while a hlfir.expr value
+cannot change, which allows moving its usages more easily.
+
+This would also risk generating too many runtime descriptor reads and writes
+that could make later optimizations harder.
+
+Hence, while this would be functionally possible, this makes the reasoning about
+the IR harder and would not benefit high-level optimizations.
+
+(*) This is not true for vector subscripted variables, but the proposed plan will
+also not allow creating vector subscripted variables as the result of a
+hlfir.designate. Lowering will deal with the assignment and input IO special
+case using hlfir.elemental.