Documentation/design-docs/arm64-intrinsics.md

   1 # Arm64 Intrinsics
   2
    3 This document records proposed design decisions related to the introduction
    4 of Arm64 intrinsics.
   5
   6 ## Document Goals
   7
   8 + Discuss design options
   9   + Document existing design pattern
  10   + Draft initial design decisions which are least likely to cause extensive rework
  11 + Decouple `X86`, `X64`, `ARM32` and `ARM64` development
  12   + Make some minimal decisions which encourage API similarity between platforms
   13   + Make some additional minimal decisions which allow `ARM32` and `ARM64` APIs to be similar
  14 + Decouple CoreCLR implementation and testing from API design
  15 + Allow for best API design
  16 + Keep implementation simple
  17
  18 ## Intrinsics in general
  19
  20 Use of intrinsics in general is a CoreCLR design decision to allow low level platform
  21 specific optimizations.
  22
  23 At first glance, such a decision seems to violate the fundamental principles of .NET
  24 code running on any platform.  However, the intent is not for the vast majority of
  25 apps to use such optimizations.  The intended usage model is to allow library
  26 developers access to low level functions which enable optimization of key
  27 functions.  As such the use is expected to be limited, but performance critical.
  28
  29 ## Intrinsic granularity
  30
   31 In general, individual intrinsics will be fine grained.  Each will generally
   32 correspond to a single assembly instruction.
  33
  34 ## Logical Sets of Intrinsics
  35
  36 For various reasons, an individual CPU will have a specific set of supported instructions.  For `ARM64` the
  37 set of supported instructions is identified by various `ID_* System registers`.
  38 While these feature registers are only available for the OS to access, they provide
  39 a logical grouping of instructions which are enabled/disabled together.
  40
  41 ### API Logical Set grouping & `IsSupported`
  42
  43 The C# API must provide a mechanism to determine which sets of instructions are supported.
   44 The existing design uses a separate `static class` to group the methods which correspond to each
   45 logical set of instructions.  A single `IsSupported` property is included in each `static class`
   46 to allow client code to alter control flow.  The `IsSupported` properties are designed so that the JIT
   47 can remove code on unused paths.  `ARM64` will use an identical approach.
  48
  49 ### API `PlatformNotSupported` Exception
  50
   51 If client code calls an intrinsic which is not supported by the platform, a `PlatformNotSupportedException`
   52 must be thrown.
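
As a rough illustration of the `IsSupported` and exception rules above, the sketch below shows how client code might branch on `IsSupported` and fall back to portable code. The `Sketch.Arm64` namespace, the `Base` stub, and `BitHelpers` are hypothetical stand-ins written for this example, not the proposed API. In a real runtime `IsSupported` would be a JIT-time constant, and the intrinsic would presumably map to a single `CLZ` instruction.

```csharp
using System;

namespace Sketch.Arm64 // hypothetical stand-in for the proposed namespace
{
    public static class Base
    {
        // Stand-in: always false here. The proposal expects the JIT to treat this
        // as a constant so the unused branch in callers can be removed.
        public static bool IsSupported => false;

        // Stand-in for an unsupported platform: calling the intrinsic throws.
        public static int LeadingZeroCount(uint value) =>
            throw new PlatformNotSupportedException();
    }
}

public static class BitHelpers // hypothetical client library code
{
    public static int LeadingZeroCount(uint value)
    {
        if (Sketch.Arm64.Base.IsSupported)
        {
            // Fast path: only reached when the instruction set is available.
            return Sketch.Arm64.Base.LeadingZeroCount(value);
        }

        // Portable fallback: count zero bits from the most significant bit down.
        int count = 0;
        for (uint mask = 0x80000000u; mask != 0 && (value & mask) == 0; mask >>= 1)
        {
            count++;
        }
        return count;
    }
}
```

Code that skips the `IsSupported` check and calls `Base.LeadingZeroCount` directly would, in this sketch, fail with `PlatformNotSupportedException`.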
  53
  54 ### JIT, VM, PAL & OS requirements
  55
  56 The JIT must use a set of flags corresponding to logical sets of instructions to alter code
  57 generation.
  58
   59 The VM must query the OS to populate the set of JIT flags.  For the special altJit case, a
   60 means must be provided for setting the flags.
  61
  62 PAL must provide an OS abstraction layer.
  63
  64 Each OS must provide a mechanism for determining which sets of instructions are supported.
  65
   66 + Linux provides the HWCAP detection mechanism, which is able to detect the current set of exposed
   67 features.
   68 + Arm64 macOS and Arm64 Windows must provide an equally capable detection mechanism.
  69
   70 In the event the OS fails to provide a means to detect support for an instruction set extension,
   71 it must be treated as unsupported.
  72
  73 NOTE: Exceptions might be where:
  74
  75 + CoreCLR is distributed as source and CMake build configuration test is used to detect these features
  76 + Installer detects features and sets appropriate configuration knobs
  77 + VM runs code inside safe try/catch blocks to test for instruction support
  78 + Platform requires a specific minimum set of instructions
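
The detection rule above can be sketched as a mapping from an OS-reported capability word to a set of JIT flags. The bit positions follow the Linux arm64 `HWCAP_*` definitions. The `Arm64Features` enum and the `FromLinuxHwcap` helper are hypothetical names for this example, and the raw value is assumed to have been read elsewhere (for instance by the PAL through `getauxval(AT_HWCAP)`).

```csharp
using System;

[Flags]
public enum Arm64Features : ulong // hypothetical JIT flag set
{
    None    = 0,
    Fp      = 1UL << 0,  // HWCAP_FP
    Simd    = 1UL << 1,  // HWCAP_ASIMD
    Aes     = 1UL << 3,  // HWCAP_AES
    Pmull   = 1UL << 4,  // HWCAP_PMULL
    Sha1    = 1UL << 5,  // HWCAP_SHA1
    Sha2    = 1UL << 6,  // HWCAP_SHA2
    Crc32   = 1UL << 7,  // HWCAP_CRC32
    Atomics = 1UL << 8,  // HWCAP_ATOMICS
}

public static class FeatureDetection // hypothetical helper
{
    // Bits this sketch understands; any other bit is ignored.
    private const Arm64Features Known =
        Arm64Features.Fp | Arm64Features.Simd | Arm64Features.Aes | Arm64Features.Pmull |
        Arm64Features.Sha1 | Arm64Features.Sha2 | Arm64Features.Crc32 | Arm64Features.Atomics;

    // A null hwcap models an OS that offers no detection mechanism:
    // every instruction set extension is then treated as unsupported.
    public static Arm64Features FromLinuxHwcap(ulong? hwcap) =>
        hwcap.HasValue ? (Arm64Features)hwcap.Value & Known : Arm64Features.None;
}
```

Masking with `Known` is a design choice of the sketch: a capability bit the runtime does not recognize never enables an instruction set.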
  79
  80 ### Intrinsics & Crossgen
  81
  82 For any intrinsic which may not be supported on all variants of a platform, crossgen method
  83 compilation should be designed to allow optimal code generation.
  84
  85 Initial implementation will simply trap so that the JIT is forced to generate optimal platform dependent code at
  86 runtime.  Subsequent implementations may use different approaches.
  87
  88 ## Choice of Arm64 naming conventions
  89
  90 `x86`, `x64`, `ARM32` and `ARM64` will follow similar naming conventions.
  91
  92 ### Namespaces
  93
  94 + `System.Runtime.Intrinsics` is used for type definitions useful across multiple platforms
   95 + `System.Runtime.Intrinsics.Arm` is used for type definitions shared across `ARM32` and `ARM64` platforms
  96 + `System.Runtime.Intrinsics.Arm.Arm64` is used for type definitions for the `ARM64` platform
  97   + The primary implementation of `ARM64` intrinsics will occur within this namespace
  98   + While `x86` and `x64` share a common namespace, this document is recommending a separate namespace
  99   for `ARM32` and `ARM64`.  This is because `AARCH64` is a separate `ISA` from the `AARCH32` `Arm` & `Thumb`
 100   instruction sets.  It is not an `ISA` extension, but rather a new `ISA`.  This is different from `x64`
 101   which could be viewed as a superset of `x86`.
 102   + The logical grouping of `ARM64` and `ARM32` instruction sets is different.  It is controlled by
 103   different sets of `System Registers`.
 104
  105 For the end user, it may be useful to add convenience APIs which expose functionality
  106 that is common across platforms and sets of platforms.  These could be implemented in terms of the
  107 platform specific functionality.  These APIs are currently out of scope of this initial design document.
 108
 109 ### Logical Set Class Names
 110
  111 Within the `System.Runtime.Intrinsics.Arm.Arm64` namespace there will be a separate `static class` for each
  112 logical set of instructions.
 113
 114 The sets will be chosen to match the granularity of the `ARM64` `ID_*` register fields.
 115
 116 #### Specific Class Names
 117
 118 The table below documents the set of known extensions, their identification, and their recommended intrinsic
 119 class names.
 120
 121 | ID Register      | Field   | Values   | Intrinsic `static class` name |
 122 | ---------------- | ------- | -------- | ----------------------------- |
 123 | N/A              | N/A     | N/A      | Base                          |
 124 | ID_AA64ISAR0_EL1 | AES     | (1b, 10b)| Aes                           |
 125 | ID_AA64ISAR0_EL1 | Atomic  | (10b)    | Atomics                       |
 126 | ID_AA64ISAR0_EL1 | CRC32   | (1b)     | Crc32                         |
 127 | ID_AA64ISAR1_EL1 | DPB     | (1b)     | Dcpop                         |
 128 | ID_AA64ISAR0_EL1 | DP      | (1b)     | Dp                            |
 129 | ID_AA64ISAR1_EL1 | FCMA    | (1b)     | Fcma                          |
 130 | ID_AA64PFR0_EL1  | FP      | (0b, 1b) | Fp                            |
 131 | ID_AA64PFR0_EL1  | FP      | (1b)     | Fp16                          |
 132 | ID_AA64ISAR1_EL1 | JSCVT   | (1b)     | Jscvt                         |
 133 | ID_AA64ISAR1_EL1 | LRCPC   | (1b)     | Lrcpc                         |
 134 | ID_AA64ISAR0_EL1 | AES     | (10b)    | Pmull                         |
 135 | ID_AA64PFR0_EL1  | RAS     | (1b)     | Ras                           |
 136 | ID_AA64ISAR0_EL1 | SHA1    | (1b)     | Sha1                          |
 137 | ID_AA64ISAR0_EL1 | SHA2    | (1b, 10b)| Sha2                          |
 138 | ID_AA64ISAR0_EL1 | SHA3    | (1b)     | Sha3                          |
 139 | ID_AA64ISAR0_EL1 | SHA2    | (10b)    | Sha512                        |
 140 | ID_AA64PFR0_EL1  | AdvSIMD | (0b, 1b) | Simd                          |
 141 | ID_AA64PFR0_EL1  | AdvSIMD | (1b)     | SimdFp16                      |
 142 | ID_AA64ISAR0_EL1 | RDM     | (1b)     | SimdV81                       |
 143 | ID_AA64ISAR0_EL1 | SM3     | (1b)     | Sm3                           |
 144 | ID_AA64ISAR0_EL1 | SM4     | (1b)     | Sm4                           |
 145 | ID_AA64PFR0_EL1  | SVE     | (1b)     | Sve                           |
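
To make the table concrete, the sketch below decodes a few 4-bit fields from a raw `ID_AA64ISAR0_EL1` value and applies the field values listed above. `IdRegisterSketch` and its members are hypothetical helpers, not part of the proposed API. The `>=` comparisons assume that larger field values keep the features of smaller ones, which goes beyond the values enumerated in the table.

```csharp
using System;

public static class IdRegisterSketch // hypothetical helper, not a proposed API
{
    // Bit offsets of some 4-bit fields within ID_AA64ISAR0_EL1.
    public const int AesShift = 4;
    public const int Sha2Shift = 12;
    public const int Crc32Shift = 16;

    // Extracts the 4-bit field that starts at the given bit offset.
    public static int Field(ulong register, int shift) =>
        (int)((register >> shift) & 0xFUL);

    // Aes is enabled by AES field values 1b and 10b; Pmull needs 10b.
    public static bool HasAes(ulong isar0) => Field(isar0, AesShift) >= 1;
    public static bool HasPmull(ulong isar0) => Field(isar0, AesShift) >= 2;

    // Sha2 is enabled by SHA2 field values 1b and 10b; Sha512 needs 10b.
    public static bool HasSha2(ulong isar0) => Field(isar0, Sha2Shift) >= 1;
    public static bool HasSha512(ulong isar0) => Field(isar0, Sha2Shift) >= 2;

    // Crc32 is enabled by CRC32 field value 1b.
    public static bool HasCrc32(ulong isar0) => Field(isar0, Crc32Shift) >= 1;
}
```

Since these registers are only readable by the OS, a runtime would normally obtain the equivalent information through an OS interface rather than decode the registers itself.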
 146
  147 The `Base`, `Simd`, and `Fp` classes will together contain the bulk of the `ARM64` intrinsics.  Most other extensions
  148 will only add a few instructions, so they should be simpler to review.
 149
 150 The `Base` `static class` is used to represent any intrinsic which is guaranteed to be implemented on all
 151 `ARM64` platforms.  This set will include general purpose instructions.  For example, this would include intrinsics
 152 such as `LeadingZeroCount` and `LeadingSignCount`.
 153
 154 As further extensions are released, this set of intrinsics will grow.
 155
 156 ### Intrinsic Method Names
 157
 158 Intrinsics will be named to describe functionality.  Names will not correspond to specific named
 159 assembly instructions.
 160
 161 Where precedent exists for common operations within the `System.Runtime.Intrinsics.X86` namespace, identical method
 162 names will be chosen: `Add`, `Multiply`, `Load`, `Store` ...
 163
  164 Where the `ARM` naming convention differs substantially from `XARCH`, `ARM` naming conventions will sometimes be preferred.
  165 For instance:
 166
 167 + `ARM` uses `Replicate` or `Duplicate` rather than X86 `Broadcast`.
 168 + `ARM` uses `Across` rather than `X86` `Horizontal`.
 169
  170 These will need to be reviewed on a case by case basis.
 171
  172 It is also worth noting that `System.Runtime.Intrinsics.X86` naming conventions include the suffix `Scalar` for
  173 operations which take vector argument(s) but contain implicit cast(s) to the base type and therefore operate only
  174 on the first item of the argument vector(s).
 175
  176 ### Intrinsic Method Argument and Return Types
 177
 178 Intrinsic methods will typically use a standard set of argument and return types:
 179
  180 + Integer types: `byte`, `sbyte`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`
  181 + Floating types: `double`, `float`, `System.Half`
 182 + Vector types: `Vector128<T>`, `Vector64<T>`
 183 + SVE will add new vector types: TBD
 184 + `ValueTuple<>` for return types returning multiple values
 185
 186 It is proposed to add the `Vector64<T>` type.  Most `ARM64` instructions support 8 byte and 16 byte forms.  8 byte
 187 operations can execute faster with less power on some platforms. So adding `Vector64<T>` will allow exposing the full
 188 flexibility of the instruction set and allow for optimal usage.
 189
 190 Some intrinsics will need to produce multiple results.  The most notable are the structured load operations `LD2`,
 191 `LD3`, `LD4` ...  For these operations it is proposed that the intrinsic API return a `ValueTuple<>` of `Vector64<T>` or
  192 `Vector128<T>`.
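
A minimal reference model of such a multi-result intrinsic is sketched below. It uses the `Vector64<T>` type from `System.Runtime.Intrinsics` as it exists in current .NET releases. The `StructuredLoadSketch` class and the `LoadPair` name and signature are hypothetical; the body emulates in software the de-interleaving that an 8-byte `LD2` performs rather than invoking the instruction.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class StructuredLoadSketch // hypothetical reference model
{
    // Element i of the first result comes from source[2 * i],
    // element i of the second result comes from source[2 * i + 1].
    public static (Vector64<byte> First, Vector64<byte> Second) LoadPair(ReadOnlySpan<byte> source)
    {
        if (source.Length < 16)
        {
            throw new ArgumentException("Two 8-byte vectors require 16 source bytes.", nameof(source));
        }

        return (
            Vector64.Create(source[0], source[2], source[4], source[6],
                            source[8], source[10], source[12], source[14]),
            Vector64.Create(source[1], source[3], source[5], source[7],
                            source[9], source[11], source[13], source[15]));
    }
}
```

A caller could deconstruct the result directly, for example `var (first, second) = StructuredLoadSketch.LoadPair(buffer);`.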
 193
 194 #### Literal immediates
 195
 196 Some assembly instructions require an immediate encoded directly in the assembly instruction.  These need to be
 197 constant at JIT time.
 198
 199 While the discussion is still on-going, consensus seems to be that any intrinsic must function correctly even when its
 200 arguments are not constant.
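
One possible way to meet this requirement, shown only as a sketch, is to dispatch over every legal immediate value when the argument is not a JIT-time constant. `ImmediateSketch` and `ExtractLane` are hypothetical names. The vector is modeled as a `ulong` holding eight byte lanes, and the out-of-range behavior is an assumption, since this document does not specify it.

```csharp
using System;

public static class ImmediateSketch // hypothetical, for illustration only
{
    // Stand-in for an instruction that encodes the byte lane as an immediate.
    public static byte ExtractLane(ulong vector, byte lane)
    {
        // With a constant lane, a JIT could emit the single matching instruction.
        // With a non-constant lane, the same result can be produced by
        // selecting among all legal immediate values at run time.
        switch (lane)
        {
            case 0: return (byte)(vector & 0xFFUL);
            case 1: return (byte)((vector >> 8) & 0xFFUL);
            case 2: return (byte)((vector >> 16) & 0xFFUL);
            case 3: return (byte)((vector >> 24) & 0xFFUL);
            case 4: return (byte)((vector >> 32) & 0xFFUL);
            case 5: return (byte)((vector >> 40) & 0xFFUL);
            case 6: return (byte)((vector >> 48) & 0xFFUL);
            case 7: return (byte)((vector >> 56) & 0xFFUL);
            default:
                // Assumption: out-of-range lanes are rejected.
                throw new ArgumentOutOfRangeException(nameof(lane));
        }
    }
}
```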
 201
 202 ## Intrinsic Interface Documentation
 203
 204 + Namespace
 205 + Each `static class` will
 206   + Briefly document corresponding `System Register Field and Value` from ARM specification.
 207   + Document use of IsSupported property
 208   + Optionally summarize set of methods enabled by the extension
 209 + Each intrinsic method will
 210   + Document underlying `ARM64` assembly instruction
 211   + Optionally, briefly summarize operation performed
 212     + In many cases this may be unnecessary: `Add`, `Multiply`, `Load`, `Store`
 213     + In some cases this may be difficult to do correctly. (Crypto instructions)
 214   + Optionally mention corresponding compiler gcc, clang, and/or MSVC intrinsics
 215     + Review of existing documentation shows `ARM64` intrinsics are mostly absent or undocumented so
 216     initially this will not be necessary for `ARM64`
 217     + See gcc manual "AArch64 Built-in Functions"
  218     + MSVC ARM64 documentation has not been publicly released
 219
 220 ## Phased Implementation
 221
 222 ### Implementation Priorities
 223
 224 As rough guidelines for order of implementation:
 225
 226 + Baseline functionality will be prioritized over architectural extensions
 227 + Architectural extensions will typically be prioritized in age order.  Earlier extensions will be added first
  228   + This is primarily driven by availability of hardware.  Features released earlier will be prevalent in
  229   more hardware.
 230 + Priorities will be driven by optimization efforts and requests
 231   + Priority will be given to intrinsics which are equivalent/similar to those actively used in libraries for other
 232   platforms
 233   + Priority will be given to intrinsics which have already been implemented for other platforms
 234
 235 ### API review
 236
 237 Intrinsics will extend the API of CoreCLR.  They will need to follow standard API review practices.
 238
 239 Initial XArch intrinsics are proposed to be added to the `netcoreapp2.1` Target Framework.  ARM64 intrinsics will
 240 be in similar Target Frameworks as the XArch intrinsics.
 241
 242 Each review will identify the Target Framework API version where the API will be extended and released.
 243
 244 #### API review of an intrinsic `static class`
 245
 246 Given the need to add hundreds or thousands of intrinsics, it will be helpful to review incrementally.
 247
  248 A separate GitHub Issue will typically be created for the review of each intrinsic `static class`.
 249
 250 When the `static class` exceeds a few dozen methods, it is desirable to break the review into smaller more manageable
 251 pieces.
 252
  253 The extensive set of ARM64 assembly instructions makes reviewing and implementing an exhaustive set a long process.
  254 To facilitate incremental progress, the initial intrinsic API for a given `static class` need not be exhaustive.
 255
 256 ### Partial implementation of intrinsic `static class`
 257
 258 + `IsSupported` must represent the state of an entire intrinsic `static class` for a given Target Framework.
 259 + Once API review is complete and approved, it is acceptable to implement approved methods in any order.
  260 + The approved API must be completed before the intrinsic `static class` is included in its Target Framework release.
 261
 262 ## Test coverage
 263
  264 As intrinsic support is added, test coverage must be extended to provide basic testing.
 265
 266 Tests should be added as soon as practical.  CoreCLR Implementation and CoreFX API will need to be merged before tests
 267 can be merged.
 268
 269 ## LSRA changes to allocate contiguous register ranges
 270
 271 Some ARM64 instructions will require allocation of contiguous blocks of registers.  These are likely limited to load and
 272 store multiple instructions.
 273
  274 It is not clear whether this is a new LSRA feature and, if it is, how much complexity it will introduce into the LSRA.
 275
 276 ## ARM ABI Vector64<T> and Vector128<T>
 277
 278 For intrinsic method calls, these vector types will implicitly be treated as pass by vector register.
 279
  280 For other calls, ARM64 ABI conventions must be followed.  For purposes of the ABI calling conventions, these vector
  281 types will be treated as a composite struct type containing a contiguous array of `T`.  They will need to follow standard
  282 struct argument and return passing rules.
 283
 284 ## Half precision floating point
 285
 286 This document will refer to half precision floating point as `Half`.
 287
  288 + Machine learning and artificial intelligence often use the `Half` type to simplify storage and improve processing time.
 289 + CoreCLR and `CIL` in general do not have general support for a `Half` type
 290 + There is an open request to expose `Half` intrinsics
 291 + There is an outstanding proposal to add `System.Half` to support this request
 292 https://github.com/dotnet/corefx/issues/25702
 293 + Implementation of `Half` features will be adjusted based on
 294   + Implementation of the `System.Half` proposal
 295   + Availability of supporting hardware (extensions)
 296   + General language extensions supporting `Half`
 297
 298 **`Half` support is currently outside the scope of the initial design proposal.  It is discussed below only for
 299 introductory purposes.**
 300
 301 ### ARM64 Half precision support
 302
 303 ARM64 supports two half precision floating point formats
 304
 305 + IEEE-754 compliant.
 306 + ARM alternative format
 307
  308 The two formats are similar.  IEEE-754 has support for Infinity and NaN and therefore has a somewhat smaller range.
  309 IEEE-754 should be preferred.
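
The difference in range can be seen by decoding the same bit patterns under both formats. The sketch below assumes the usual half precision layout (1 sign bit, 5 exponent bits, 10 fraction bits) and that the alternative format treats the top exponent value as an ordinary exponent. `HalfFormatSketch` is a hypothetical helper and is unrelated to the proposed `System.Half` type.

```csharp
using System;

public static class HalfFormatSketch // hypothetical helper, not the proposed System.Half
{
    public static float ToSingle(ushort bits, bool armAlternativeFormat)
    {
        int sign = (bits >> 15) & 1;
        int exponent = (bits >> 10) & 0x1F;
        int fraction = bits & 0x3FF;
        double magnitude;

        if (exponent == 0)
        {
            // Zero and subnormals: fraction * 2^-24.
            magnitude = fraction * Math.Pow(2, -24);
        }
        else if (exponent == 31 && !armAlternativeFormat)
        {
            // IEEE-754 reserves the top exponent for Infinity and NaN.
            magnitude = fraction == 0 ? double.PositiveInfinity : double.NaN;
        }
        else
        {
            // Normal numbers: (1 + fraction / 1024) * 2^(exponent - 15).
            // The alternative format also decodes exponent 31 this way.
            magnitude = (1024 + fraction) * Math.Pow(2, exponent - 25);
        }

        return (float)(sign == 0 ? magnitude : -magnitude);
    }
}
```

Under these assumptions the largest finite IEEE-754 value (`0x7BFF`) decodes to 65504, while the alternative format reaches 131008 at `0x7FFF`.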
 310
 311 ARM64 baseline support for `Half` is limited.  The following types of operations are supported
 312
 313 + Loads and Stores
 314 + Conversion to/from `Float`
 315 + Widening from `Vector128<Half>` to two `Vector128<Float>`
 316 + Narrowing from two `Vector128<Float>` to `Vector128<Half>`
 317
 318 The optional ARMv8.2-FP16 extension adds support for
 319
 320 + General operations on IEEE-754 `Half` types
 321 + Vector operations on IEEE-754 `Half` types
 322
 323 These correspond to the proposed `static class`es `Fp16` and `SimdFp16`
 324
 325 ### `Half` and ARM64 ABI
 326
 327 Any complete `Half` implementation must conform to the `ARM64 ABI`.
 328
  329 The proposed `System.Half` type must be treated as a floating point type for purposes of the ARM64 ABI.
 330
 331 As an argument it must be passed in a floating point register.
 332
 333 As a structure member, it must be treated as a floating point type and enter into the HFA determination logic.
 334
 335 Test cases must be written and conformance must be demonstrated.
 336
 337 ## Scalable Vector Extension Support
 338
  339 `SVE`, the Scalable Vector Extension, introduces its own complexity.
 340
 341 The extension
 342
  343 + Creates a set of `Z0-Z31` scalable vector registers.  These overlay existing vector registers.  Each scalable vector
  344 register has a platform specific length
  345   + Any multiple of 128 bits up to 2048 bits
  346 + Creates a new set of `P0-P15` predicate registers.  Each predicate register has a platform specific length which is
  347 1/8th of the scalable vector length.
  348 + Adds an extensive set of instructions including complex load and store operations.
 349 + Modifies the ARM64 ABI.
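
The length rules above reduce to simple arithmetic, sketched here with hypothetical helper names. The sketch enumerates every legal vector length and derives the matching predicate length as one predicate bit per vector byte.

```csharp
using System;
using System.Collections.Generic;

public static class SveLengthSketch // hypothetical helper
{
    // Legal vector lengths: every multiple of 128 bits up to 2048 bits.
    public static IEnumerable<int> VectorLengthsInBits()
    {
        for (int bits = 128; bits <= 2048; bits += 128)
        {
            yield return bits;
        }
    }

    // One predicate bit per vector byte, i.e. 1/8th of the vector length.
    public static int PredicateLengthInBits(int vectorLengthInBits) =>
        vectorLengthInBits / 8;
}
```

The enumeration yields 16 distinct lengths, with predicate lengths ranging from 16 bits (for 128-bit vectors) to 256 bits (for 2048-bit vectors).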
 350
 351 Therefore implementation will not be trivial.
 352
 353 + Register allocator will need changes to support predicate allocation
 354 + SIMD support will face similar issues
  355 + Open issue: Should we use `Vector<T>`, `Vector128<T>, Vector256<T>, ... Vector2048<T>`, `SVE<T>` ... in user interface
  356 design?
  357   + Use of `Vector128<T>, Vector256<T>, ... Vector2048<T>` is the current default proposal.
  358   Having 16 forms of every API may create issues for framework and client developers.
  359   However generics may provide some/sufficient relief to make this acceptable.
 360   + Use of `Vector<T>` may be preferred if SVE will also be used for `FEATURE_SIMD`
 361   + Use of `SVE<T>` may be preferred if SVE will not be used for `FEATURE_SIMD`
 362
 363
 364 Given lack of available hardware and a lack of thorough understanding of the specification:
 365
 366 + SVE will require a separate design
 367 + **SVE is considered out of scope for this document.  It is discussed above only for
 368 introductory purposes.**
 369
 370 ## Miscellaneous
 371 ### Handling Instruction Deprecation
 372
  373 Deprecation of instructions should be relatively rare.
 374
 375 + Do not introduce an intrinsic for a feature that is currently deprecated
  376 + In the event an assembly instruction is deprecated
 377   1. Prefer emulation using alternate instructions if practical
 378   2. Add `SetThrowOnDeprecated()` interface to allow developers to find these issues
 379
 380 ## Approved APIs
 381
 382 The following sections document APIs which have completed the API review process.
 383
  384 Until each API is approved it shall be marked "TBD Not approved".
 385
  386 ### `Base`
 387
 388 TBD Not approved
 389
 390 ### `Aes`
 391
 392 TBD Not approved
 393
 394 ### `Atomics`
 395
 396 TBD Not approved
 397
 398 ### `Crc32`
 399
 400 TBD Not approved
 401
 402 ### `Dcpop`
 403
 404 TBD Not approved
 405
 406 ### `Dp`
 407
 408 TBD Not approved
 409
 410 ### `Fcma`
 411
 412 TBD Not approved
 413
 414 ### `Fp`
 415
 416 TBD Not approved
 417
 418 ### `Fp16`
 419
 420 TBD Not approved
 421
 422 ### `Jscvt`
 423
 424 TBD Not approved
 425
 426 ### `Lrcpc`
 427
 428 TBD Not approved
 429
 430 ### `Pmull`
 431
 432 TBD Not approved
 433
 434 ### `Ras`
 435
 436 TBD Not approved
 437
 438 ### `Sha1`
 439
 440 TBD Not approved
 441
 442 ### `Sha2`
 443
 444 TBD Not approved
 445
 446 ### `Sha3`
 447
 448 TBD Not approved
 449
 450 ### `Sha512`
 451
 452 TBD Not approved
 453
 454 ### `Simd`
 455
 456 TBD Not approved
 457
 458 ### `SimdFp16`
 459
 460 TBD Not approved
 461
 462 ### `SimdV81`
 463
 464 TBD Not approved
 465
 466 ### `Sm3`
 467
 468 TBD Not approved
 469
 470 ### `Sm4`
 471
 472 TBD Not approved
 473
 474 ### `Sve`
 475
 476 TBD Not approved