[Arm64] Add arm64-intrinsics.md (#15343)

author Steve MacLean <sdmaclea.qdt@qualcommdatacenter.com>

Wed, 17 Jan 2018 00:00:52 +0000 (19:00 -0500)

committer Jan Kotas <jkotas@microsoft.com>

Wed, 17 Jan 2018 00:00:52 +0000 (16:00 -0800)
author Steve MacLean <sdmaclea.qdt@qualcommdatacenter.com>
Wed, 17 Jan 2018 00:00:52 +0000 (19:00 -0500)
committer Jan Kotas <jkotas@microsoft.com>
Wed, 17 Jan 2018 00:00:52 +0000 (16:00 -0800)
diff --git a/Documentation/design-docs/arm64-intrinsics.md b/Documentation/design-docs/arm64-intrinsics.md

new file mode 100644 (file)

index 0000000..67aff02
--- /dev/null
+++ b/Documentation/design-docs/arm64-intrinsics.md
@@ -0,0 +1,476 @@
+# Arm64 Intrinsics
+
+This document is intended to document proposed design decisions related to the introduction
+of Arm64 Intrinsics
+
+## Document Goals
+
++ Discuss design options
+  + Document existing design pattern
+  + Draft initial design decisions which are least likely to cause extensive rework
++ Decouple `X86`, `X64`, `ARM32` and `ARM64` development
+  + Make some minimal decisions which encourage API similarity between platforms
+  + Make some additional minimal decisions which allow `ARM32` and `ARM64` API's to be similar
++ Decouple CoreCLR implementation and testing from API design
++ Allow for best API design
++ Keep implementation simple
+
+## Intrinsics in general
+
+Use of intrinsics in general is a CoreCLR design decision to allow low level platform
+specific optimizations.
+
+At first glance, such a decision seems to violate the fundamental principles of .NET
+code running on any platform.  However, the intent is not for the vast majority of
+apps to use such optimizations.  The intended usage model is to allow library
+developers access to low level functions which enable optimization of key
+functions.  As such the use is expected to be limited, but performance critical.
+
+## Intrinsic granularity
+
+In general individual intrinsic will be chosen to be fine grained.  These will generally
+correspond to a single assembly instruction.
+
+## Logical Sets of Intrinsics
+
+For various reasons, an individual CPU will have a specific set of supported instructions.  For `ARM64` the
+set of supported instructions is identified by various `ID_* System registers`.
+While these feature registers are only available for the OS to access, they provide
+a logical grouping of instructions which are enabled/disabled together.
+
+### API Logical Set grouping & `IsSupported`
+
+The C# API must provide a mechanism to determine which sets of instructions are supported.
+Existing design uses a separate `static class` to group the methods which correspond to each
+logical set of instructions.  A single `IsSupported` property is included in each `static class`
+to allow client code to alter control flow.  The `IsSupported` properties are designed so that JIT
+can remove code on unused paths.  `ARM64` will use an identical approach.
+
+### API `PlatformNotSupported` Exception
+
+If client code calls an intrinsic which is not supported by the platform a `PlatformNotSupported`
+exception must be thrown.
+
+### JIT, VM, PAL & OS requirements
+
+The JIT must use a set of flags corresponding to logical sets of instructions to alter code
+generation.
+
+The VM must query the OS to populate the set of JIT flags.  For the special altJit case, a
+means must provide for setting the flags.
+
+PAL must provide an OS abstraction layer.
+
+Each OS must provide a mechanism for determining which sets of instructions are supported.
+
++ Linux provides the HWCAP detection mechanism which is able to detect current set of exposed
+features
++ Arm64 MAC OS and Arm64 Windows OS must provide an equally capable detection mechanism.
+
+In the event the OS fails to provides a means to detect a support for an instruction set extension
+it must be treated as unsupported.
+
+NOTE: Exceptions might be where:
+
++ CoreCLR is distributed as source and CMake build configuration test is used to detect these features
++ Installer detects features and sets appropriate configuration knobs
++ VM runs code inside safe try/catch blocks to test for instruction support
++ Platform requires a specific minimum set of instructions
+
+### Intrinsics & Crossgen
+
+For any intrinsic which may not be supported on all variants of a platform, crossgen method
+compilation should be designed to allow optimal code generation.
+
+Initial implementation will simply trap so that the JIT is forced to generate optimal platform dependent code at
+runtime.  Subsequent implementations may use different approaches.
+
+## Choice of Arm64 naming conventions
+
+`x86`, `x64`, `ARM32` and `ARM64` will follow similar naming conventions.
+
+### Namespaces
+
++ `System.Runtime.Intrinsics` is used for type definitions useful across multiple platforms
++ `System.Runtime.Intrinsics.Arm` is used type definitions shared across `ARM32` and `ARM64` platforms
++ `System.Runtime.Intrinsics.Arm.Arm64` is used for type definitions for the `ARM64` platform
+  + The primary implementation of `ARM64` intrinsics will occur within this namespace
+  + While `x86` and `x64` share a common namespace, this document is recommending a separate namespace
+  for `ARM32` and `ARM64`.  This is because `AARCH64` is a separate `ISA` from the `AARCH32` `Arm` & `Thumb`
+  instruction sets.  It is not an `ISA` extension, but rather a new `ISA`.  This is different from `x64`
+  which could be viewed as a superset of `x86`.
+  + The logical grouping of `ARM64` and `ARM32` instruction sets is different.  It is controlled by
+  different sets of `System Registers`.
+
+For the convenience of the end user, it may be useful to add convenience API's which expose functionality
+which is common across platforms and sets of platforms.  These could be implemented in terms of the
+platform specific functionality.  These API's are currently out of scope of this initial design document.
+
+### Logical Set Class Names
+
+Within the `System.Runtime.Intrinsics.Arm.Arm64` namespace there will be a separate `static class` for each
+logical set of instructions
+
+The sets will be chosen to match the granularity of the `ARM64` `ID_*` register fields.
+
+#### Specific Class Names
+
+The table below documents the set of known extensions, their identification, and their recommended intrinsic
+class names.
+
+| ID Register      | Field   | Values   | Intrinsic `static class` name |
+| ---------------- | ------- | -------- | ----------------------------- |
+| N/A              | N/A     | N/A      | Base                          |
+| ID_AA64ISAR0_EL1 | AES     | (1b, 10b)| Aes                           |
+| ID_AA64ISAR0_EL1 | Atomic  | (10b)    | Atomics                       |
+| ID_AA64ISAR0_EL1 | CRC32   | (1b)     | Crc32                         |
+| ID_AA64ISAR1_EL1 | DPB     | (1b)     | Dcpop                         |
+| ID_AA64ISAR0_EL1 | DP      | (1b)     | Dp                            |
+| ID_AA64ISAR1_EL1 | FCMA    | (1b)     | Fcma                          |
+| ID_AA64PFR0_EL1  | FP      | (0b, 1b) | Fp                            |
+| ID_AA64PFR0_EL1  | FP      | (1b)     | Fp16                          |
+| ID_AA64ISAR1_EL1 | JSCVT   | (1b)     | Jscvt                         |
+| ID_AA64ISAR1_EL1 | LRCPC   | (1b)     | Lrcpc                         |
+| ID_AA64ISAR0_EL1 | AES     | (10b)    | Pmull                         |
+| ID_AA64PFR0_EL1  | RAS     | (1b)     | Ras                           |
+| ID_AA64ISAR0_EL1 | SHA1    | (1b)     | Sha1                          |
+| ID_AA64ISAR0_EL1 | SHA2    | (1b, 10b)| Sha2                          |
+| ID_AA64ISAR0_EL1 | SHA3    | (1b)     | Sha3                          |
+| ID_AA64ISAR0_EL1 | SHA2    | (10b)    | Sha512                        |
+| ID_AA64PFR0_EL1  | AdvSIMD | (0b, 1b) | Simd                          |
+| ID_AA64PFR0_EL1  | AdvSIMD | (1b)     | SimdFp16                      |
+| ID_AA64ISAR0_EL1 | RDM     | (1b)     | SimdV81                       |
+| ID_AA64ISAR0_EL1 | SM3     | (1b)     | Sm3                           |
+| ID_AA64ISAR0_EL1 | SM4     | (1b)     | Sm4                           |
+| ID_AA64PFR0_EL1  | SVE     | (1b)     | Sve                           |
+
+The `All`, `Simd`, and `Fp` classes will together contain the bulk of the `ARM64` intrinsics.  Most other extensions
+will only add a few instruction so they should be simpler to review.
+
+The `Base` `static class` is used to represent any intrinsic which is guaranteed to be implemented on all
+`ARM64` platforms.  This set will include general purpose instructions.  For example, this would include intrinsics
+such as `LeadingZeroCount` and `LeadingSignCount`.
+
+As further extensions are released, this set of intrinsics will grow.
+
+### Intrinsic Method Names
+
+Intrinsics will be named to describe functionality.  Names will not correspond to specific named
+assembly instructions.
+
+Where precedent exists for common operations within the `System.Runtime.Intrinsics.X86` namespace, identical method
+names will be chosen: `Add`, `Multiply`, `Load`, `Store` ...
+
+Where `ARM` naming convention differs substantially from `XARCH`, `ARM` naming conventions will sometimes be preferred.
+For instance
+
++ `ARM` uses `Replicate` or `Duplicate` rather than X86 `Broadcast`.
++ `ARM` uses `Across` rather than `X86` `Horizontal`.
+
+These will need to reviewed on a case by case basis.
+
+It is also worth noting `System.Runtime.Intrinsics.X86` naming conventions will include the suffix `Scalar` for
+operations which take vector argument(s), but contain an implicit cast(s) to the base type and therefore operate only
+on the first item of the argument vector(s).
+
+### Intinsic Method Argument and Return Types
+
+Intrinsic methods will typically use a standard set of argument and return types:
+
++ Integer type: `byte`, `sbyte`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`
++ Floating types: `double`, `single`, `System.Half`
++ Vector types: `Vector128<T>`, `Vector64<T>`
++ SVE will add new vector types: TBD
++ `ValueTuple<>` for return types returning multiple values
+
+It is proposed to add the `Vector64<T>` type.  Most `ARM64` instructions support 8 byte and 16 byte forms.  8 byte
+operations can execute faster with less power on some platforms. So adding `Vector64<T>` will allow exposing the full
+flexibility of the instruction set and allow for optimal usage.
+
+Some intrinsics will need to produce multiple results.  The most notable are the structured load operations `LD2`,
+`LD3`, `LD4` ...  For these operations it is proposed that the intrinsic API return a `ValueTuple<>` of `Vector64<T>` or
+`Vector128<T>`
+
+#### Literal immediates
+
+Some assembly instructions require an immediate encoded directly in the assembly instruction.  These need to be
+constant at JIT time.
+
+While the discussion is still on-going, consensus seems to be that any intrinsic must function correctly even when its
+arguments are not constant.
+
+## Intrinsic Interface Documentation
+
++ Namespace
++ Each `static class` will
+  + Briefly document corresponding `System Register Field and Value` from ARM specification.
+  + Document use of IsSupported property
+  + Optionally summarize set of methods enabled by the extension
++ Each intrinsic method will
+  + Document underlying `ARM64` assembly instruction
+  + Optionally, briefly summarize operation performed
+    + In many cases this may be unnecessary: `Add`, `Multiply`, `Load`, `Store`
+    + In some cases this may be difficult to do correctly. (Crypto instructions)
+  + Optionally mention corresponding compiler gcc, clang, and/or MSVC intrinsics
+    + Review of existing documentation shows `ARM64` intrinsics are mostly absent or undocumented so
+    initially this will not be necessary for `ARM64`
+    + See gcc manual "AArch64 Built-in Functions"
+    + MSVC ARM64 documentation has not been publically released
+
+## Phased Implementation
+
+### Implementation Priorities
+
+As rough guidelines for order of implementation:
+
++ Baseline functionality will be prioritized over architectural extensions
++ Architectural extensions will typically be prioritized in age order.  Earlier extensions will be added first
+  + This is primarily driven by availability of hardware.  Features released in earlier will be prevalent in
+  more hardware.
++ Priorities will be driven by optimization efforts and requests
+  + Priority will be given to intrinsics which are equivalent/similar to those actively used in libraries for other
+  platforms
+  + Priority will be given to intrinsics which have already been implemented for other platforms
+
+### API review
+
+Intrinsics will extend the API of CoreCLR.  They will need to follow standard API review practices.
+
+Initial XArch intrinsics are proposed to be added to the `netcoreapp2.1` Target Framework.  ARM64 intrinsics will
+be in similar Target Frameworks as the XArch intrinsics.
+
+Each review will identify the Target Framework API version where the API will be extended and released.
+
+#### API review of an intrinsic `static class`
+
+Given the need to add hundreds or thousands of intrinsics, it will be helpful to review incrementally.
+
+A separate GitHub Issue will typically created for the review of each intrinsic `static class`.
+
+When the `static class` exceeds a few dozen methods, it is desirable to break the review into smaller more manageable
+pieces.
+
+The extensive set of ARM64 assembly instructions make reviewing and implementing an exhaustive set a long process.
+To facilitate incremental progress, initial intrinsic API for a given `static class` need not be exhaustive.
+
+### Partial implementation of intrinsic `static class`
+
++ `IsSupported` must represent the state of an entire intrinsic `static class` for a given Target Framework.
++ Once API review is complete and approved, it is acceptable to implement approved methods in any order.
++ The approved API must be completed before the intrinsic `static class` is included in its Target Framework release
+
+## Test coverage
+
+As intrinsic support is added test coverage must be extended to provide basic testing.
+
+Tests should be added as soon as practical.  CoreCLR Implementation and CoreFX API will need to be merged before tests
+can be merged.
+
+## LSRA changes to allocate contiguous register ranges
+
+Some ARM64 instructions will require allocation of contiguous blocks of registers.  These are likely limited to load and
+store multiple instructions.
+
+It is not clear if this is a new LSRA feature and if it is how much complexity this will introduce into the LSRA.
+
+## ARM ABI Vector64<T> and Vector128<T>
+
+For intrinsic method calls, these vector types will implicitly be treated as pass by vector register.
+
+For other calls, ARM64 ABI conventions must be followed.  For purposes of the ABI calling conventions, these vector
+types will treated as composite struct type containing a contiguous array of `T`.  They will need to follow standard
+struct argument and return passing rules.
+
+## Half precision floating point
+
+This document will refer to half precision floating point as `Half`.
+
++ Machine learning and Artificial intelligence often use `Half` type to simplify storage and improve processing time.
++ CoreCLR and `CIL` in general do not have general support for a `Half` type
++ There is an open request to expose `Half` intrinsics
++ There is an outstanding proposal to add `System.Half` to support this request
+https://github.com/dotnet/corefx/issues/25702
++ Implementation of `Half` features will be adjusted based on
+  + Implementation of the `System.Half` proposal
+  + Availability of supporting hardware (extensions)
+  + General language extensions supporting `Half`
+
+**`Half` support is currently outside the scope of the initial design proposal.  It is discussed below only for
+introductory purposes.**
+
+### ARM64 Half precision support
+
+ARM64 supports two half precision floating point formats
+
++ IEEE-754 compliant.
++ ARM alternative format
+
+The two formats are similar.  IEEE-754 has support for Inifinity and NAN and therefore has a somewhat smaller range.
+IEEE-754 should be preferred.
+
+ARM64 baseline support for `Half` is limited.  The following types of operations are supported
+
++ Loads and Stores
++ Conversion to/from `Float`
++ Widening from `Vector128<Half>` to two `Vector128<Float>`
++ Narrowing from two `Vector128<Float>` to `Vector128<Half>`
+
+The optional ARMv8.2-FP16 extension adds support for
+
++ General operations on IEEE-754 `Half` types
++ Vector operations on IEEE-754 `Half` types
+
+These correspond to the proposed `static class`es `Fp16` and `SimdFp16`
+
+### `Half` and ARM64 ABI
+
+Any complete `Half` implementation must conform to the `ARM64 ABI`.
+
+The proposed `System.Half` type must be treated as a floating point type for purposes of the ARM64 ABI
+
+As an argument it must be passed in a floating point register.
+
+As a structure member, it must be treated as a floating point type and enter into the HFA determination logic.
+
+Test cases must be written and conformance must be demonstrated.
+
+## Scalable Vector Extension Support
+
+`SVE`, the Scalable Vector Extension introduces its own complexity.
+
+The extension
+
++ Creates a set of `Z0-Z31` scalable vector registers.  These overlay existing vector registers.  Each scalar vector
+register has a platform specific length
+  + Any multiple of 128 bits up to 2048 bits
++ Creates a new set of `P0-P15` predicate registers.  Each predicate register has a platform specific length which is
+1/8th of the scalar vector length.
++ Add an extensive set of instructions including complex load and store operations.
++ Modifies the ARM64 ABI.
+
+Therefore implementation will not be trivial.
+
++ Register allocator will need changes to support predicate allocation
++ SIMD support will face similar issues
++ Open issue: Should we use `Vector<T>`, `Vector128<t>, Vector256<t>, ... Vector2048<T>`, `SVE<T>` ... in user interface
+design?
+  + Use of `Vector128<t>, Vector256<t>, ... Vector2048<T>` is current default proposal.
+Having 16 forms of every API may create issues for framework and client developers.
+However generics may provide some/sufficient relief to make this acceptable.
+  + Use of `Vector<T>` may be preferred if SVE will also be used for `FEATURE_SIMD`
+  + Use of `SVE<T>` may be preferred if SVE will not be used for `FEATURE_SIMD`
+
+
+Given lack of available hardware and a lack of thorough understanding of the specification:
+
++ SVE will require a separate design
++ **SVE is considered out of scope for this document.  It is discussed above only for
+introductory purposes.**
+
+## Miscellaneous
+### Handling Instruction Deprecation
+
+Deprecation of instructions should be relatively rare
+
++ Do not introduce an intrinsic for a feature that is currently deprecated
++ In event an assembly instruction is deprecated
+  1. Prefer emulation using alternate instructions if practical
+  2. Add `SetThrowOnDeprecated()` interface to allow developers to find these issues
+
+## Approved APIs
+
+The following sections document APIs which have completed the API review process.
+
+Until each API is approved it shall be marked "TBD Not Approved"
+
+### `All`
+
+TBD Not approved
+
+### `Aes`
+
+TBD Not approved
+
+### `Atomics`
+
+TBD Not approved
+
+### `Crc32`
+
+TBD Not approved
+
+### `Dcpop`
+
+TBD Not approved
+
+### `Dp`
+
+TBD Not approved
+
+### `Fcma`
+
+TBD Not approved
+
+### `Fp`
+
+TBD Not approved
+
+### `Fp16`
+
+TBD Not approved
+
+### `Jscvt`
+
+TBD Not approved
+
+### `Lrcpc`
+
+TBD Not approved
+
+### `Pmull`
+
+TBD Not approved
+
+### `Ras`
+
+TBD Not approved
+
+### `Sha1`
+
+TBD Not approved
+
+### `Sha2`
+
+TBD Not approved
+
+### `Sha3`
+
+TBD Not approved
+
+### `Sha512`
+
+TBD Not approved
+
+### `Simd`
+
+TBD Not approved
+
+### `SimdFp16`
+
+TBD Not approved
+
+### `SimdV81`
+
+TBD Not approved
+
+### `Sm3`
+
+TBD Not approved
+
+### `Sm4`
+
+TBD Not approved
+
+### `Sve`
+
+TBD Not approved
author	Steve MacLean <sdmaclea.qdt@qualcommdatacenter.com>
	Wed, 17 Jan 2018 00:00:52 +0000 (19:00 -0500)
committer	Jan Kotas <jkotas@microsoft.com>
	Wed, 17 Jan 2018 00:00:52 +0000 (16:00 -0800)