This document is intended to document proposed design decisions related to the introduction

+ Discuss design options
+ Document existing design pattern
+ Draft initial design decisions which are least likely to cause extensive rework
+ Decouple `X86`, `X64`, `ARM32` and `ARM64` development
+ Make some minimal decisions which encourage API similarity between platforms
+ Make some additional minimal decisions which allow `ARM32` and `ARM64` APIs to be similar
+ Decouple CoreCLR implementation and testing from API design
+ Allow for best API design
+ Keep implementation simple

## Intrinsics in general

Use of intrinsics in general is a CoreCLR design decision to allow low-level platform
specific optimizations.

At first glance, such a decision seems to violate the fundamental principles of .NET
code running on any platform. However, the intent is not for the vast majority of
apps to use such optimizations. The intended usage model is to allow library
developers access to low-level functions which enable optimization of key
functions. As such, the use is expected to be limited, but performance critical.

## Intrinsic granularity

In general, individual intrinsics will be chosen to be fine grained. Each will generally
correspond to a single assembly instruction.

## Logical Sets of Intrinsics

For various reasons, an individual CPU will have a specific set of supported instructions. For `ARM64` the
set of supported instructions is identified by various `ID_* System registers`.
While these feature registers are only available for the OS to access, they provide
a logical grouping of instructions which are enabled/disabled together.

### API Logical Set grouping & `IsSupported`

The C# API must provide a mechanism to determine which sets of instructions are supported.
The existing design uses a separate `static class` to group the methods which correspond to each
logical set of instructions. A single `IsSupported` property is included in each `static class`
to allow client code to alter control flow. The `IsSupported` properties are designed so that the JIT
can remove code on unused paths. `ARM64` will use an identical approach.

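
As a hedged illustration of this pattern, the sketch below guards an intrinsic call with `IsSupported` and falls back to portable code. It assumes a .NET runtime that ships `System.Runtime.Intrinsics.Arm.ArmBase`, whose name differs from the `Base` class proposed in this document; `BitHelpers` and its methods are hypothetical names chosen for the example.

```csharp
using System.Runtime.Intrinsics.Arm;

public static class BitHelpers
{
    // Counts leading zero bits, preferring the hardware instruction when available.
    public static int LeadingZeroCount(uint value)
    {
        if (ArmBase.IsSupported)
        {
            // IsSupported is intended to be a JIT-time constant, so the
            // branch that cannot be taken on this platform can be removed.
            return ArmBase.LeadingZeroCount(value);
        }

        return LeadingZeroCountPortable(value);
    }

    // Portable fallback used when the intrinsic set is not supported.
    public static int LeadingZeroCountPortable(uint value)
    {
        int count = 0;
        for (uint mask = 0x80000000u; mask != 0 && (value & mask) == 0; mask >>= 1)
        {
            count++;
        }
        return count;
    }
}
```

Client code calls `BitHelpers.LeadingZeroCount(x)` and never needs to know which path was taken.
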
### API `PlatformNotSupported` Exception

If client code calls an intrinsic which is not supported by the platform, a `PlatformNotSupportedException`
must be thrown.

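
A minimal sketch of this behavior, assuming a platform on which the extension is absent. `UnsupportedSetSketch` is a hypothetical stand-in for one logical-set class and is not part of any proposed API; only `System.PlatformNotSupportedException` is a real framework type.

```csharp
using System;

// Hypothetical stand-in for a logical-set class on a platform lacking the extension.
public static class UnsupportedSetSketch
{
    public static bool IsSupported => false;

    // Every intrinsic in an unsupported set throws instead of executing.
    public static uint Crc32(uint crc, byte data)
    {
        throw new PlatformNotSupportedException();
    }
}

public static class UnguardedCaller
{
    // Returns true when a call made without checking IsSupported surfaces the exception.
    public static bool CallThrows()
    {
        try
        {
            UnsupportedSetSketch.Crc32(0, 1);
            return false;
        }
        catch (PlatformNotSupportedException)
        {
            return true;
        }
    }
}
```

Code that checks `IsSupported` first never reaches the throwing path.
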
### JIT, VM, PAL & OS requirements

The JIT must use a set of flags corresponding to logical sets of instructions to alter code

The VM must query the OS to populate the set of JIT flags. For the special altJit case, a
means must be provided for setting the flags.

PAL must provide an OS abstraction layer.

Each OS must provide a mechanism for determining which sets of instructions are supported.

+ Linux provides the HWCAP detection mechanism which is able to detect the current set of exposed
+ Arm64 macOS and Arm64 Windows must provide an equally capable detection mechanism.

In the event the OS fails to provide a means to detect support for an instruction set extension,
it must be treated as unsupported.

NOTE: Exceptions might be where:

+ CoreCLR is distributed as source and a CMake build configuration test is used to detect these features
+ Installer detects features and sets appropriate configuration knobs
+ VM runs code inside safe try/catch blocks to test for instruction support
+ Platform requires a specific minimum set of instructions

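
The detection rule above can be sketched as a mapping from OS-reported capability bits to JIT flags. This is only an illustration: `Arm64Sets` and `HwcapSketch` are hypothetical names, the bit positions are what I understand the Linux arm64 `AT_HWCAP` layout to be and should be verified against the kernel headers, and a `null` input models an OS with no detection mechanism.

```csharp
using System;

// Hypothetical flag set; the names mirror some class names proposed in this document.
[Flags]
public enum Arm64Sets
{
    None = 0,
    Fp = 1 << 0,
    Simd = 1 << 1,
    Aes = 1 << 2,
    Crc32 = 1 << 3,
    Atomics = 1 << 4,
}

public static class HwcapSketch
{
    // Assumed Linux arm64 AT_HWCAP bit positions; verify against the kernel headers.
    private const ulong HwcapFp = 1UL << 0;
    private const ulong HwcapAsimd = 1UL << 1;
    private const ulong HwcapAes = 1UL << 3;
    private const ulong HwcapCrc32 = 1UL << 7;
    private const ulong HwcapAtomics = 1UL << 8;

    // A null argument models an OS that offers no detection mechanism.
    public static Arm64Sets FromHwcap(ulong? hwcap)
    {
        if (!hwcap.HasValue)
        {
            // Undetectable features are treated as unsupported.
            return Arm64Sets.None;
        }

        ulong bits = hwcap.Value;
        Arm64Sets sets = Arm64Sets.None;
        if ((bits & HwcapFp) != 0) sets |= Arm64Sets.Fp;
        if ((bits & HwcapAsimd) != 0) sets |= Arm64Sets.Simd;
        if ((bits & HwcapAes) != 0) sets |= Arm64Sets.Aes;
        if ((bits & HwcapCrc32) != 0) sets |= Arm64Sets.Crc32;
        if ((bits & HwcapAtomics) != 0) sets |= Arm64Sets.Atomics;
        return sets;
    }
}
```
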
### Intrinsics & Crossgen

For any intrinsic which may not be supported on all variants of a platform, crossgen method
compilation should be designed to allow optimal code generation.

Initial implementation will simply trap so that the JIT is forced to generate optimal platform dependent code at
runtime. Subsequent implementations may use different approaches.

## Choice of Arm64 naming conventions

`x86`, `x64`, `ARM32` and `ARM64` will follow similar naming conventions.

+ `System.Runtime.Intrinsics` is used for type definitions useful across multiple platforms
+ `System.Runtime.Intrinsics.Arm` is used for type definitions shared across `ARM32` and `ARM64` platforms
+ `System.Runtime.Intrinsics.Arm.Arm64` is used for type definitions for the `ARM64` platform
  + The primary implementation of `ARM64` intrinsics will occur within this namespace
  + While `x86` and `x64` share a common namespace, this document is recommending a separate namespace
    for `ARM32` and `ARM64`. This is because `AARCH64` is a separate `ISA` from the `AARCH32` `Arm` & `Thumb`
    instruction sets. It is not an `ISA` extension, but rather a new `ISA`. This is different from `x64`
    which could be viewed as a superset of `x86`.
  + The logical grouping of `ARM64` and `ARM32` instruction sets is different. It is controlled by
    different sets of `System Registers`.

For the convenience of the end user, it may be useful to add APIs which expose functionality
common across platforms and sets of platforms. These could be implemented in terms of the
platform specific functionality. These APIs are currently out of scope of this initial design document.

### Logical Set Class Names

Within the `System.Runtime.Intrinsics.Arm.Arm64` namespace there will be a separate `static class` for each
logical set of instructions.

The sets will be chosen to match the granularity of the `ARM64` `ID_*` register fields.

#### Specific Class Names

The table below documents the set of known extensions, their identification, and their recommended intrinsic

| ID Register      | Field   | Values    | Intrinsic `static class` name |
| ---------------- | ------- | --------- | ----------------------------- |
| N/A              | N/A     | N/A       | Base                          |
| ID_AA64ISAR0_EL1 | AES     | (1b, 10b) | Aes                           |
| ID_AA64ISAR0_EL1 | Atomic  | (10b)     | Atomics                       |
| ID_AA64ISAR0_EL1 | CRC32   | (1b)      | Crc32                         |
| ID_AA64ISAR1_EL1 | DPB     | (1b)      | Dcpop                         |
| ID_AA64ISAR0_EL1 | DP      | (1b)      | Dp                            |
| ID_AA64ISAR1_EL1 | FCMA    | (1b)      | Fcma                          |
| ID_AA64PFR0_EL1  | FP      | (0b, 1b)  | Fp                            |
| ID_AA64PFR0_EL1  | FP      | (1b)      | Fp16                          |
| ID_AA64ISAR1_EL1 | JSCVT   | (1b)      | Jscvt                         |
| ID_AA64ISAR1_EL1 | LRCPC   | (1b)      | Lrcpc                         |
| ID_AA64ISAR0_EL1 | AES     | (10b)     | Pmull                         |
| ID_AA64PFR0_EL1  | RAS     | (1b)      | Ras                           |
| ID_AA64ISAR0_EL1 | SHA1    | (1b)      | Sha1                          |
| ID_AA64ISAR0_EL1 | SHA2    | (1b, 10b) | Sha2                          |
| ID_AA64ISAR0_EL1 | SHA3    | (1b)      | Sha3                          |
| ID_AA64ISAR0_EL1 | SHA2    | (10b)     | Sha512                        |
| ID_AA64PFR0_EL1  | AdvSIMD | (0b, 1b)  | Simd                          |
| ID_AA64PFR0_EL1  | AdvSIMD | (1b)      | SimdFp16                      |
| ID_AA64ISAR0_EL1 | RDM     | (1b)      | SimdV81                       |
| ID_AA64ISAR0_EL1 | SM3     | (1b)      | Sm3                           |
| ID_AA64ISAR0_EL1 | SM4     | (1b)      | Sm4                           |
| ID_AA64PFR0_EL1  | SVE     | (1b)      | Sve                           |

The `Base`, `Simd`, and `Fp` classes will together contain the bulk of the `ARM64` intrinsics. Most other extensions
will only add a few instructions, so they should be simpler to review.

The `Base` `static class` is used to represent any intrinsic which is guaranteed to be implemented on all
`ARM64` platforms. This set will include general purpose instructions. For example, this would include intrinsics
such as `LeadingZeroCount` and `LeadingSignCount`.

As further extensions are released, this set of intrinsics will grow.

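
To make the `Values` column of the table concrete, the sketch below decodes a few `ID_AA64ISAR0_EL1` fields into the class names listed there. The 4-bit field positions are an assumption taken from the ARM architecture documentation and should be checked against it. `IdRegisterSketch` and its members are hypothetical, and a real runtime would obtain this information from the OS rather than read the register itself.

```csharp
using System.Collections.Generic;

public static class IdRegisterSketch
{
    // Extracts the 4-bit field whose least significant bit is lowBit.
    private static int Field(ulong register, int lowBit) => (int)((register >> lowBit) & 0xF);

    // Maps selected ID_AA64ISAR0_EL1 fields to class names using the table's values.
    public static List<string> ClassesFromIsar0(ulong isar0)
    {
        var classes = new List<string>();

        int aes = Field(isar0, 4);       // AES, assumed bits [7:4]
        if (aes == 1 || aes == 2) classes.Add("Aes");
        if (aes == 2) classes.Add("Pmull");

        if (Field(isar0, 8) == 1) classes.Add("Sha1");      // SHA1, assumed bits [11:8]

        int sha2 = Field(isar0, 12);     // SHA2, assumed bits [15:12]
        if (sha2 == 1 || sha2 == 2) classes.Add("Sha2");
        if (sha2 == 2) classes.Add("Sha512");

        if (Field(isar0, 16) == 1) classes.Add("Crc32");    // CRC32, assumed bits [19:16]
        if (Field(isar0, 20) == 2) classes.Add("Atomics");  // Atomic, assumed bits [23:20]

        return classes;
    }
}
```

For instance, a register value of `0x211120` (AES = 10b, SHA1 = 1b, SHA2 = 1b, CRC32 = 1b, Atomic = 10b) yields `Aes`, `Pmull`, `Sha1`, `Sha2`, `Crc32` and `Atomics`. The sketch matches the table's values exactly; whether larger future field values should also enable a class is left open here.
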
### Intrinsic Method Names

Intrinsics will be named to describe functionality. Names will not correspond to specific named
assembly instructions.

Where precedent exists for common operations within the `System.Runtime.Intrinsics.X86` namespace, identical method
names will be chosen: `Add`, `Multiply`, `Load`, `Store` ...

Where `ARM` naming convention differs substantially from `XARCH`, `ARM` naming conventions will sometimes be preferred.

+ `ARM` uses `Replicate` or `Duplicate` rather than `X86` `Broadcast`.
+ `ARM` uses `Across` rather than `X86` `Horizontal`.

These will need to be reviewed on a case by case basis.

It is also worth noting that `System.Runtime.Intrinsics.X86` naming conventions will include the suffix `Scalar` for
operations which take vector argument(s), but contain implicit cast(s) to the base type and therefore operate only
on the first item of the argument vector(s).

### Intrinsic Method Argument and Return Types

Intrinsic methods will typically use a standard set of argument and return types:

+ Integer types: `byte`, `sbyte`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`
+ Floating types: `double`, `float`, `System.Half`
+ Vector types: `Vector128<T>`, `Vector64<T>`
  + SVE will add new vector types: TBD
+ `ValueTuple<>` for return types returning multiple values

It is proposed to add the `Vector64<T>` type. Most `ARM64` instructions support 8 byte and 16 byte forms. 8 byte
operations can execute faster with less power on some platforms. So adding `Vector64<T>` will allow exposing the full
flexibility of the instruction set and allow for optimal usage.

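
As a sketch of how a multi-result operation could surface through `ValueTuple<>` and `Vector64<T>`, the code below models a de-interleaving two-register load in software. It assumes a runtime where `System.Runtime.Intrinsics.Vector64<T>` is available; `StructuredLoadSketch.Load2` is a hypothetical name, not a proposed or existing intrinsic, and a real intrinsic would map to a single instruction rather than element-wise construction.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class StructuredLoadSketch
{
    // Software model of a two-way structured load: even-indexed bytes go to the
    // first vector and odd-indexed bytes go to the second.
    public static (Vector64<byte> Even, Vector64<byte> Odd) Load2(byte[] source)
    {
        if (source == null || source.Length < 16)
        {
            throw new ArgumentException("Expected at least 16 bytes.", nameof(source));
        }

        Vector64<byte> even = Vector64.Create(
            source[0], source[2], source[4], source[6],
            source[8], source[10], source[12], source[14]);
        Vector64<byte> odd = Vector64.Create(
            source[1], source[3], source[5], source[7],
            source[9], source[11], source[13], source[15]);

        return (even, odd);
    }
}
```

A caller can deconstruct the result directly, for example `var (even, odd) = StructuredLoadSketch.Load2(data);`.
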

Some intrinsics will need to produce multiple results. The most notable are the structured load operations `LD2`,
`LD3`, `LD4` ... For these operations it is proposed that the intrinsic API return a `ValueTuple<>` of `Vector64<T>` or

#### Literal immediates

Some assembly instructions require an immediate encoded directly in the assembly instruction. These need to be
constant at JIT time.

While the discussion is still ongoing, consensus seems to be that any intrinsic must function correctly even when its
arguments are not constant.

## Intrinsic Interface Documentation

+ Each `static class` will
  + Briefly document corresponding `System Register Field and Value` from ARM specification.
  + Document use of `IsSupported` property
  + Optionally summarize set of methods enabled by the extension
+ Each intrinsic method will
  + Document underlying `ARM64` assembly instruction
  + Optionally, briefly summarize operation performed
    + In many cases this may be unnecessary: `Add`, `Multiply`, `Load`, `Store`
    + In some cases this may be difficult to do correctly. (Crypto instructions)
  + Optionally mention corresponding compiler gcc, clang, and/or MSVC intrinsics
    + Review of existing documentation shows `ARM64` intrinsics are mostly absent or undocumented so
      initially this will not be necessary for `ARM64`
    + See gcc manual "AArch64 Built-in Functions"
    + MSVC ARM64 documentation has not been publicly released

## Phased Implementation

### Implementation Priorities

As rough guidelines for order of implementation:

+ Baseline functionality will be prioritized over architectural extensions
+ Architectural extensions will typically be prioritized in age order. Earlier extensions will be added first
  + This is primarily driven by availability of hardware. Features released earlier will be prevalent in
+ Priorities will be driven by optimization efforts and requests
  + Priority will be given to intrinsics which are equivalent/similar to those actively used in libraries for other
  + Priority will be given to intrinsics which have already been implemented for other platforms


Intrinsics will extend the API of CoreCLR. They will need to follow standard API review practices.

Initial XArch intrinsics are proposed to be added to the `netcoreapp2.1` Target Framework. ARM64 intrinsics will
be in similar Target Frameworks as the XArch intrinsics.

Each review will identify the Target Framework API version where the API will be extended and released.

#### API review of an intrinsic `static class`

Given the need to add hundreds or thousands of intrinsics, it will be helpful to review incrementally.

A separate GitHub Issue will typically be created for the review of each intrinsic `static class`.

When the `static class` exceeds a few dozen methods, it is desirable to break the review into smaller more manageable

The extensive set of ARM64 assembly instructions makes reviewing and implementing an exhaustive set a long process.
To facilitate incremental progress, the initial intrinsic API for a given `static class` need not be exhaustive.

### Partial implementation of intrinsic `static class`

+ `IsSupported` must represent the state of an entire intrinsic `static class` for a given Target Framework.
+ Once API review is complete and approved, it is acceptable to implement approved methods in any order.
+ The approved API must be completed before the intrinsic `static class` is included in its Target Framework release

As intrinsic support is added, test coverage must be extended to provide basic testing.

Tests should be added as soon as practical. CoreCLR Implementation and CoreFX API will need to be merged before tests

## LSRA changes to allocate contiguous register ranges

Some ARM64 instructions will require allocation of contiguous blocks of registers. These are likely limited to load and
store multiple instructions.

It is not clear whether this is a new LSRA feature and, if it is, how much complexity it will introduce into the LSRA.

## ARM ABI Vector64<T> and Vector128<T>

For intrinsic method calls, these vector types will implicitly be treated as pass by vector register.

For other calls, ARM64 ABI conventions must be followed. For purposes of the ABI calling conventions, these vector
types will be treated as a composite struct type containing a contiguous array of `T`. They will need to follow standard
struct argument and return passing rules.

## Half precision floating point

This document will refer to half precision floating point as `Half`.

+ Machine learning and artificial intelligence often use the `Half` type to simplify storage and improve processing time.
+ CoreCLR and `CIL` in general do not have general support for a `Half` type
+ There is an open request to expose `Half` intrinsics
+ There is an outstanding proposal to add `System.Half` to support this request
  https://github.com/dotnet/corefx/issues/25702
+ Implementation of `Half` features will be adjusted based on
  + Implementation of the `System.Half` proposal
  + Availability of supporting hardware (extensions)
  + General language extensions supporting `Half`

**`Half` support is currently outside the scope of the initial design proposal. It is discussed below only for
introductory purposes.**

### ARM64 Half precision support

ARM64 supports two half precision floating point formats:

+ IEEE-754 compliant
+ ARM alternative format

The two formats are similar. IEEE-754 has support for Infinity and NaN and therefore has a somewhat smaller range.
IEEE-754 should be preferred.

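
The range difference can be illustrated with a small decoder. This is a sketch under the assumption that both formats use 1 sign bit, 5 exponent bits (bias 15) and 10 fraction bits, and that the ARM alternative format treats the all-ones exponent as an ordinary exponent instead of reserving it for Infinity and NaN. `HalfFormatSketch` is a hypothetical helper, not part of any proposed API.

```csharp
using System;

public static class HalfFormatSketch
{
    // Decodes 16 raw bits as a half precision value in either format.
    public static double Decode(ushort bits, bool armAlternative)
    {
        int sign = (bits >> 15) & 0x1;
        int exponent = (bits >> 10) & 0x1F;
        int fraction = bits & 0x3FF;

        double magnitude;
        if (exponent == 0)
        {
            // Zero and subnormals: fraction * 2^-10 * 2^-14.
            magnitude = fraction * Math.Pow(2.0, -24);
        }
        else if (exponent == 31 && !armAlternative)
        {
            // IEEE-754 reserves the top exponent for Infinity and NaN.
            magnitude = fraction == 0 ? double.PositiveInfinity : double.NaN;
        }
        else
        {
            // Normal numbers; the alternative format also lands here for exponent 31.
            magnitude = (1.0 + fraction / 1024.0) * Math.Pow(2.0, exponent - 15);
        }

        return sign == 1 ? -magnitude : magnitude;
    }
}
```

With this model, `0x7BFF` decodes to 65504 in both formats, whereas `0x7FFF` is a NaN under IEEE-754 but 131008 under the alternative format.
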

ARM64 baseline support for `Half` is limited. The following types of operations are supported

+ Conversion to/from `Float`
+ Widening from `Vector128<Half>` to two `Vector128<Float>`
+ Narrowing from two `Vector128<Float>` to `Vector128<Half>`

The optional ARMv8.2-FP16 extension adds support for

+ General operations on IEEE-754 `Half` types
+ Vector operations on IEEE-754 `Half` types

These correspond to the proposed `static class`es `Fp16` and `SimdFp16`.

### `Half` and ARM64 ABI

Any complete `Half` implementation must conform to the `ARM64 ABI`.

The proposed `System.Half` type must be treated as a floating point type for purposes of the ARM64 ABI.

As an argument it must be passed in a floating point register.

As a structure member, it must be treated as a floating point type and enter into the HFA determination logic.

Test cases must be written and conformance must be demonstrated.

## Scalable Vector Extension Support

`SVE`, the Scalable Vector Extension, introduces its own complexity.

+ Creates a set of `Z0-Z31` scalable vector registers. These overlay existing vector registers. Each scalable vector
  register has a platform specific length
  + Any multiple of 128 bits up to 2048 bits
+ Creates a new set of `P0-P15` predicate registers. Each predicate register has a platform specific length which is
  1/8th of the scalable vector length.
+ Adds an extensive set of instructions including complex load and store operations.
+ Modifies the ARM64 ABI.

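
As a small worked example of the register sizes described above, the sketch below enumerates the permitted vector lengths and the matching predicate lengths. `SveLengthSketch` is a hypothetical helper used only for the arithmetic.

```csharp
using System.Collections.Generic;

public static class SveLengthSketch
{
    // Lists every multiple of 128 bits up to 2048 bits together with the
    // predicate register length, which is 1/8th of the vector length.
    public static List<(int VectorBits, int PredicateBits)> Lengths()
    {
        var result = new List<(int VectorBits, int PredicateBits)>();
        for (int bits = 128; bits <= 2048; bits += 128)
        {
            result.Add((bits, bits / 8));
        }
        return result;
    }
}
```

The list has 16 entries, from a 128-bit vector with a 16-bit predicate up to a 2048-bit vector with a 256-bit predicate.
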

Therefore implementation will not be trivial.

+ Register allocator will need changes to support predicate allocation
+ SIMD support will face similar issues
+ Open issue: Should we use `Vector<T>`, `Vector128<T>, Vector256<T>, ... Vector2048<T>`, `SVE<T>` ... in user interface
  + Use of `Vector128<T>, Vector256<T>, ... Vector2048<T>` is the current default proposal.
    Having 16 forms of every API may create issues for framework and client developers.
    However, generics may provide some/sufficient relief to make this acceptable.
  + Use of `Vector<T>` may be preferred if SVE will also be used for `FEATURE_SIMD`
  + Use of `SVE<T>` may be preferred if SVE will not be used for `FEATURE_SIMD`

Given lack of available hardware and a lack of thorough understanding of the specification:

+ SVE will require a separate design
+ **SVE is considered out of scope for this document. It is discussed above only for
  introductory purposes.**

### Handling Instruction Deprecation

Deprecation of instructions should be relatively rare.

+ Do not introduce an intrinsic for a feature that is currently deprecated
+ In the event an assembly instruction is deprecated
  1. Prefer emulation using alternate instructions if practical
  2. Add `SetThrowOnDeprecated()` interface to allow developers to find these issues


The following sections document APIs which have completed the API review process.

Until each API is approved it shall be marked "TBD Not Approved".