--- /dev/null
+# Arm64 Intrinsics
+
+This document is intended to document proposed design decisions related to the introduction
+of Arm64 Intrinsics
+
+## Document Goals
+
++ Discuss design options
+ + Document existing design pattern
+ + Draft initial design decisions which are least likely to cause extensive rework
++ Decouple `X86`, `X64`, `ARM32` and `ARM64` development
+ + Make some minimal decisions which encourage API similarity between platforms
+ + Make some additional minimal decisions which allow `ARM32` and `ARM64` API's to be similar
++ Decouple CoreCLR implementation and testing from API design
++ Allow for best API design
++ Keep implementation simple
+
+## Intrinsics in general
+
+Use of intrinsics in general is a CoreCLR design decision to allow low level platform
+specific optimizations.
+
+At first glance, such a decision seems to violate the fundamental principles of .NET
+code running on any platform. However, the intent is not for the vast majority of
+apps to use such optimizations. The intended usage model is to allow library
+developers access to low level functions which enable optimization of key
+functions. As such the use is expected to be limited, but performance critical.
+
+## Intrinsic granularity
+
+In general individual intrinsic will be chosen to be fine grained. These will generally
+correspond to a single assembly instruction.
+
+## Logical Sets of Intrinsics
+
+For various reasons, an individual CPU will have a specific set of supported instructions. For `ARM64` the
+set of supported instructions is identified by various `ID_* System registers`.
+While these feature registers are only available for the OS to access, they provide
+a logical grouping of instructions which are enabled/disabled together.
+
+### API Logical Set grouping & `IsSupported`
+
+The C# API must provide a mechanism to determine which sets of instructions are supported.
+Existing design uses a separate `static class` to group the methods which correspond to each
+logical set of instructions. A single `IsSupported` property is included in each `static class`
+to allow client code to alter control flow. The `IsSupported` properties are designed so that JIT
+can remove code on unused paths. `ARM64` will use an identical approach.
+
+### API `PlatformNotSupported` Exception
+
+If client code calls an intrinsic which is not supported by the platform a `PlatformNotSupported`
+exception must be thrown.
+
+### JIT, VM, PAL & OS requirements
+
+The JIT must use a set of flags corresponding to logical sets of instructions to alter code
+generation.
+
+The VM must query the OS to populate the set of JIT flags. For the special altJit case, a
+means must provide for setting the flags.
+
+PAL must provide an OS abstraction layer.
+
+Each OS must provide a mechanism for determining which sets of instructions are supported.
+
++ Linux provides the HWCAP detection mechanism which is able to detect current set of exposed
+features
++ Arm64 MAC OS and Arm64 Windows OS must provide an equally capable detection mechanism.
+
+In the event the OS fails to provides a means to detect a support for an instruction set extension
+it must be treated as unsupported.
+
+NOTE: Exceptions might be where:
+
++ CoreCLR is distributed as source and CMake build configuration test is used to detect these features
++ Installer detects features and sets appropriate configuration knobs
++ VM runs code inside safe try/catch blocks to test for instruction support
++ Platform requires a specific minimum set of instructions
+
+### Intrinsics & Crossgen
+
+For any intrinsic which may not be supported on all variants of a platform, crossgen method
+compilation should be designed to allow optimal code generation.
+
+Initial implementation will simply trap so that the JIT is forced to generate optimal platform dependent code at
+runtime. Subsequent implementations may use different approaches.
+
+## Choice of Arm64 naming conventions
+
+`x86`, `x64`, `ARM32` and `ARM64` will follow similar naming conventions.
+
+### Namespaces
+
++ `System.Runtime.Intrinsics` is used for type definitions useful across multiple platforms
++ `System.Runtime.Intrinsics.Arm` is used type definitions shared across `ARM32` and `ARM64` platforms
++ `System.Runtime.Intrinsics.Arm.Arm64` is used for type definitions for the `ARM64` platform
+ + The primary implementation of `ARM64` intrinsics will occur within this namespace
+ + While `x86` and `x64` share a common namespace, this document is recommending a separate namespace
+ for `ARM32` and `ARM64`. This is because `AARCH64` is a separate `ISA` from the `AARCH32` `Arm` & `Thumb`
+ instruction sets. It is not an `ISA` extension, but rather a new `ISA`. This is different from `x64`
+ which could be viewed as a superset of `x86`.
+ + The logical grouping of `ARM64` and `ARM32` instruction sets is different. It is controlled by
+ different sets of `System Registers`.
+
+For the convenience of the end user, it may be useful to add convenience API's which expose functionality
+which is common across platforms and sets of platforms. These could be implemented in terms of the
+platform specific functionality. These API's are currently out of scope of this initial design document.
+
+### Logical Set Class Names
+
+Within the `System.Runtime.Intrinsics.Arm.Arm64` namespace there will be a separate `static class` for each
+logical set of instructions
+
+The sets will be chosen to match the granularity of the `ARM64` `ID_*` register fields.
+
+#### Specific Class Names
+
+The table below documents the set of known extensions, their identification, and their recommended intrinsic
+class names.
+
+| ID Register | Field | Values | Intrinsic `static class` name |
+| ---------------- | ------- | -------- | ----------------------------- |
+| N/A | N/A | N/A | Base |
+| ID_AA64ISAR0_EL1 | AES | (1b, 10b)| Aes |
+| ID_AA64ISAR0_EL1 | Atomic | (10b) | Atomics |
+| ID_AA64ISAR0_EL1 | CRC32 | (1b) | Crc32 |
+| ID_AA64ISAR1_EL1 | DPB | (1b) | Dcpop |
+| ID_AA64ISAR0_EL1 | DP | (1b) | Dp |
+| ID_AA64ISAR1_EL1 | FCMA | (1b) | Fcma |
+| ID_AA64PFR0_EL1 | FP | (0b, 1b) | Fp |
+| ID_AA64PFR0_EL1 | FP | (1b) | Fp16 |
+| ID_AA64ISAR1_EL1 | JSCVT | (1b) | Jscvt |
+| ID_AA64ISAR1_EL1 | LRCPC | (1b) | Lrcpc |
+| ID_AA64ISAR0_EL1 | AES | (10b) | Pmull |
+| ID_AA64PFR0_EL1 | RAS | (1b) | Ras |
+| ID_AA64ISAR0_EL1 | SHA1 | (1b) | Sha1 |
+| ID_AA64ISAR0_EL1 | SHA2 | (1b, 10b)| Sha2 |
+| ID_AA64ISAR0_EL1 | SHA3 | (1b) | Sha3 |
+| ID_AA64ISAR0_EL1 | SHA2 | (10b) | Sha512 |
+| ID_AA64PFR0_EL1 | AdvSIMD | (0b, 1b) | Simd |
+| ID_AA64PFR0_EL1 | AdvSIMD | (1b) | SimdFp16 |
+| ID_AA64ISAR0_EL1 | RDM | (1b) | SimdV81 |
+| ID_AA64ISAR0_EL1 | SM3 | (1b) | Sm3 |
+| ID_AA64ISAR0_EL1 | SM4 | (1b) | Sm4 |
+| ID_AA64PFR0_EL1 | SVE | (1b) | Sve |
+
+The `All`, `Simd`, and `Fp` classes will together contain the bulk of the `ARM64` intrinsics. Most other extensions
+will only add a few instruction so they should be simpler to review.
+
+The `Base` `static class` is used to represent any intrinsic which is guaranteed to be implemented on all
+`ARM64` platforms. This set will include general purpose instructions. For example, this would include intrinsics
+such as `LeadingZeroCount` and `LeadingSignCount`.
+
+As further extensions are released, this set of intrinsics will grow.
+
+### Intrinsic Method Names
+
+Intrinsics will be named to describe functionality. Names will not correspond to specific named
+assembly instructions.
+
+Where precedent exists for common operations within the `System.Runtime.Intrinsics.X86` namespace, identical method
+names will be chosen: `Add`, `Multiply`, `Load`, `Store` ...
+
+Where `ARM` naming convention differs substantially from `XARCH`, `ARM` naming conventions will sometimes be preferred.
+For instance
+
++ `ARM` uses `Replicate` or `Duplicate` rather than X86 `Broadcast`.
++ `ARM` uses `Across` rather than `X86` `Horizontal`.
+
+These will need to reviewed on a case by case basis.
+
+It is also worth noting `System.Runtime.Intrinsics.X86` naming conventions will include the suffix `Scalar` for
+operations which take vector argument(s), but contain an implicit cast(s) to the base type and therefore operate only
+on the first item of the argument vector(s).
+
+### Intinsic Method Argument and Return Types
+
+Intrinsic methods will typically use a standard set of argument and return types:
+
++ Integer type: `byte`, `sbyte`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`
++ Floating types: `double`, `single`, `System.Half`
++ Vector types: `Vector128<T>`, `Vector64<T>`
++ SVE will add new vector types: TBD
++ `ValueTuple<>` for return types returning multiple values
+
+It is proposed to add the `Vector64<T>` type. Most `ARM64` instructions support 8 byte and 16 byte forms. 8 byte
+operations can execute faster with less power on some platforms. So adding `Vector64<T>` will allow exposing the full
+flexibility of the instruction set and allow for optimal usage.
+
+Some intrinsics will need to produce multiple results. The most notable are the structured load operations `LD2`,
+`LD3`, `LD4` ... For these operations it is proposed that the intrinsic API return a `ValueTuple<>` of `Vector64<T>` or
+`Vector128<T>`
+
+#### Literal immediates
+
+Some assembly instructions require an immediate encoded directly in the assembly instruction. These need to be
+constant at JIT time.
+
+While the discussion is still on-going, consensus seems to be that any intrinsic must function correctly even when its
+arguments are not constant.
+
+## Intrinsic Interface Documentation
+
++ Namespace
++ Each `static class` will
+ + Briefly document corresponding `System Register Field and Value` from ARM specification.
+ + Document use of IsSupported property
+ + Optionally summarize set of methods enabled by the extension
++ Each intrinsic method will
+ + Document underlying `ARM64` assembly instruction
+ + Optionally, briefly summarize operation performed
+ + In many cases this may be unnecessary: `Add`, `Multiply`, `Load`, `Store`
+ + In some cases this may be difficult to do correctly. (Crypto instructions)
+ + Optionally mention corresponding compiler gcc, clang, and/or MSVC intrinsics
+ + Review of existing documentation shows `ARM64` intrinsics are mostly absent or undocumented so
+ initially this will not be necessary for `ARM64`
+ + See gcc manual "AArch64 Built-in Functions"
+ + MSVC ARM64 documentation has not been publically released
+
+## Phased Implementation
+
+### Implementation Priorities
+
+As rough guidelines for order of implementation:
+
++ Baseline functionality will be prioritized over architectural extensions
++ Architectural extensions will typically be prioritized in age order. Earlier extensions will be added first
+ + This is primarily driven by availability of hardware. Features released in earlier will be prevalent in
+ more hardware.
++ Priorities will be driven by optimization efforts and requests
+ + Priority will be given to intrinsics which are equivalent/similar to those actively used in libraries for other
+ platforms
+ + Priority will be given to intrinsics which have already been implemented for other platforms
+
+### API review
+
+Intrinsics will extend the API of CoreCLR. They will need to follow standard API review practices.
+
+Initial XArch intrinsics are proposed to be added to the `netcoreapp2.1` Target Framework. ARM64 intrinsics will
+be in similar Target Frameworks as the XArch intrinsics.
+
+Each review will identify the Target Framework API version where the API will be extended and released.
+
+#### API review of an intrinsic `static class`
+
+Given the need to add hundreds or thousands of intrinsics, it will be helpful to review incrementally.
+
+A separate GitHub Issue will typically created for the review of each intrinsic `static class`.
+
+When the `static class` exceeds a few dozen methods, it is desirable to break the review into smaller more manageable
+pieces.
+
+The extensive set of ARM64 assembly instructions make reviewing and implementing an exhaustive set a long process.
+To facilitate incremental progress, initial intrinsic API for a given `static class` need not be exhaustive.
+
+### Partial implementation of intrinsic `static class`
+
++ `IsSupported` must represent the state of an entire intrinsic `static class` for a given Target Framework.
++ Once API review is complete and approved, it is acceptable to implement approved methods in any order.
++ The approved API must be completed before the intrinsic `static class` is included in its Target Framework release
+
+## Test coverage
+
+As intrinsic support is added test coverage must be extended to provide basic testing.
+
+Tests should be added as soon as practical. CoreCLR Implementation and CoreFX API will need to be merged before tests
+can be merged.
+
+## LSRA changes to allocate contiguous register ranges
+
+Some ARM64 instructions will require allocation of contiguous blocks of registers. These are likely limited to load and
+store multiple instructions.
+
+It is not clear if this is a new LSRA feature and if it is how much complexity this will introduce into the LSRA.
+
+## ARM ABI Vector64<T> and Vector128<T>
+
+For intrinsic method calls, these vector types will implicitly be treated as pass by vector register.
+
+For other calls, ARM64 ABI conventions must be followed. For purposes of the ABI calling conventions, these vector
+types will treated as composite struct type containing a contiguous array of `T`. They will need to follow standard
+struct argument and return passing rules.
+
+## Half precision floating point
+
+This document will refer to half precision floating point as `Half`.
+
++ Machine learning and Artificial intelligence often use `Half` type to simplify storage and improve processing time.
++ CoreCLR and `CIL` in general do not have general support for a `Half` type
++ There is an open request to expose `Half` intrinsics
++ There is an outstanding proposal to add `System.Half` to support this request
+https://github.com/dotnet/corefx/issues/25702
++ Implementation of `Half` features will be adjusted based on
+ + Implementation of the `System.Half` proposal
+ + Availability of supporting hardware (extensions)
+ + General language extensions supporting `Half`
+
+**`Half` support is currently outside the scope of the initial design proposal. It is discussed below only for
+introductory purposes.**
+
+### ARM64 Half precision support
+
+ARM64 supports two half precision floating point formats
+
++ IEEE-754 compliant.
++ ARM alternative format
+
+The two formats are similar. IEEE-754 has support for Inifinity and NAN and therefore has a somewhat smaller range.
+IEEE-754 should be preferred.
+
+ARM64 baseline support for `Half` is limited. The following types of operations are supported
+
++ Loads and Stores
++ Conversion to/from `Float`
++ Widening from `Vector128<Half>` to two `Vector128<Float>`
++ Narrowing from two `Vector128<Float>` to `Vector128<Half>`
+
+The optional ARMv8.2-FP16 extension adds support for
+
++ General operations on IEEE-754 `Half` types
++ Vector operations on IEEE-754 `Half` types
+
+These correspond to the proposed `static class`es `Fp16` and `SimdFp16`
+
+### `Half` and ARM64 ABI
+
+Any complete `Half` implementation must conform to the `ARM64 ABI`.
+
+The proposed `System.Half` type must be treated as a floating point type for purposes of the ARM64 ABI
+
+As an argument it must be passed in a floating point register.
+
+As a structure member, it must be treated as a floating point type and enter into the HFA determination logic.
+
+Test cases must be written and conformance must be demonstrated.
+
+## Scalable Vector Extension Support
+
+`SVE`, the Scalable Vector Extension introduces its own complexity.
+
+The extension
+
++ Creates a set of `Z0-Z31` scalable vector registers. These overlay existing vector registers. Each scalar vector
+register has a platform specific length
+ + Any multiple of 128 bits up to 2048 bits
++ Creates a new set of `P0-P15` predicate registers. Each predicate register has a platform specific length which is
+1/8th of the scalar vector length.
++ Add an extensive set of instructions including complex load and store operations.
++ Modifies the ARM64 ABI.
+
+Therefore implementation will not be trivial.
+
++ Register allocator will need changes to support predicate allocation
++ SIMD support will face similar issues
++ Open issue: Should we use `Vector<T>`, `Vector128<t>, Vector256<t>, ... Vector2048<T>`, `SVE<T>` ... in user interface
+design?
+ + Use of `Vector128<t>, Vector256<t>, ... Vector2048<T>` is current default proposal.
+Having 16 forms of every API may create issues for framework and client developers.
+However generics may provide some/sufficient relief to make this acceptable.
+ + Use of `Vector<T>` may be preferred if SVE will also be used for `FEATURE_SIMD`
+ + Use of `SVE<T>` may be preferred if SVE will not be used for `FEATURE_SIMD`
+
+
+Given lack of available hardware and a lack of thorough understanding of the specification:
+
++ SVE will require a separate design
++ **SVE is considered out of scope for this document. It is discussed above only for
+introductory purposes.**
+
+## Miscellaneous
+### Handling Instruction Deprecation
+
+Deprecation of instructions should be relatively rare
+
++ Do not introduce an intrinsic for a feature that is currently deprecated
++ In event an assembly instruction is deprecated
+ 1. Prefer emulation using alternate instructions if practical
+ 2. Add `SetThrowOnDeprecated()` interface to allow developers to find these issues
+
+## Approved APIs
+
+The following sections document APIs which have completed the API review process.
+
+Until each API is approved it shall be marked "TBD Not Approved"
+
+### `All`
+
+TBD Not approved
+
+### `Aes`
+
+TBD Not approved
+
+### `Atomics`
+
+TBD Not approved
+
+### `Crc32`
+
+TBD Not approved
+
+### `Dcpop`
+
+TBD Not approved
+
+### `Dp`
+
+TBD Not approved
+
+### `Fcma`
+
+TBD Not approved
+
+### `Fp`
+
+TBD Not approved
+
+### `Fp16`
+
+TBD Not approved
+
+### `Jscvt`
+
+TBD Not approved
+
+### `Lrcpc`
+
+TBD Not approved
+
+### `Pmull`
+
+TBD Not approved
+
+### `Ras`
+
+TBD Not approved
+
+### `Sha1`
+
+TBD Not approved
+
+### `Sha2`
+
+TBD Not approved
+
+### `Sha3`
+
+TBD Not approved
+
+### `Sha512`
+
+TBD Not approved
+
+### `Simd`
+
+TBD Not approved
+
+### `SimdFp16`
+
+TBD Not approved
+
+### `SimdV81`
+
+TBD Not approved
+
+### `Sm3`
+
+TBD Not approved
+
+### `Sm4`
+
+TBD Not approved
+
+### `Sve`
+
+TBD Not approved