From 316486dd9f6bbd03e7e13655674f1fa91e533b9a Mon Sep 17 00:00:00 2001 From: Alyssa Rosenzweig Date: Fri, 16 Jul 2021 10:40:58 -0400 Subject: [PATCH] pan/va: Add initial ISA.xml for Valhall This handwritten file is the product of over a hundred hours of reverse-engineering and represents the sum of what I've learned about the Valhall architecture. It will be used in the next commits as the backbone of a Valhall toolchain. Signed-off-by: Alyssa Rosenzweig Part-of: --- src/panfrost/bifrost/valhall/ISA.xml | 1683 ++++++++++++++++++++++++++++++++++ 1 file changed, 1683 insertions(+) create mode 100644 src/panfrost/bifrost/valhall/ISA.xml diff --git a/src/panfrost/bifrost/valhall/ISA.xml b/src/panfrost/bifrost/valhall/ISA.xml new file mode 100644 index 0000000..902d1a0 --- /dev/null +++ b/src/panfrost/bifrost/valhall/ISA.xml @@ -0,0 +1,1683 @@ + + + + + + These immediates are accessible in (almost) any instruction, provided the + immediate mode is kept to the default. They optimize for the most common + immediate values; any immediate listed here may be used without taking up + a uniform slot or a register. Most integer instructions can access + separate half-words and individual bytes via swizzles on the source. + + 0x00000000 + 0xFFFFFFFF + 0x7FFFFFFF + 0xFAFCFDFE + 0x01000000 + 0x80002000 + 0x70605030 + 0xC0B0A090 + 0x03020100 + 0x07060504 + 0x0B0A0908 + 0x0F0E0D0C + 0x13121110 + 0x17161514 + 0x1B1A1918 + 0x1F1E1D1C + 0x3F800000 + 0x3DCCCCCD + 0x3EA2F983 + 0x3F317218 + 0x40490FDB + 0x00000000 + 0x477FFF00 + 0x5C005BF8 + 0x2E660000 + 0x34000000 + 0x38000000 + 0x3C000000 + 0x40000000 + 0x44000000 + 0x48000000 + 0x42480000 + + + + + Every Valhall instruction can perform an action, such as waiting on dependency + slots. A few special actions are available, specified in the instruction + metadata from this enum. The `wait0126` action is required to wait on + dependency slot #6 and should be set on the instruction immediately + preceding `ATEST`. 
The `barrier` action may be set on any instruction for + subgroup barriers, and should particularly be set with the `BARRIER` + instruction for global barriers. The `td` action only applies to fragment + shaders and is used to terminate helper invocations; it should be set as + early as possible after helper invocations are no longer needed as + determined by data flow analysis. The `return` action is used to terminate + the shader, although it may be overloaded by the `BLEND` instruction. + + The `reconverge` action is required on any instruction immediately + preceding a possible change to the mask of active threads in a subgroup. + This includes all divergent branches, but it also includes the final + instruction at the end of any basic block where the immediate successor + (fallthrough) is the target of a divergent branch. + + wait0126 + barrier + reconverge + + + td + + return + + + + Selects how immediate sources are interpreted. + none + ts + + id + + + + + Situated between the immediates hard-coded in the hardware and the + uniforms defined purely in software, Valhall has some special + "constants" passing through data structures. These are encoded like the + table of immediates, as if special constant $i$ were lookup table entry + $32 + i$. These special values are selected with the `.ts` modifier. + + + + tls_ptr + tls_ptr_hi + + + wls_ptr + wls_ptr_hi + + + + + Situated between the immediates hard-coded in the hardware and the + uniforms defined purely in software, Valhall has some special + "constants" passing through data structures. These are encoded like the + table of immediates, as if special constant $i$ were lookup table entry + $32 + i$. These special values are selected with the `.id` modifier. 
+ + + + lane_id + + + + core_id + + + + + + + + + + + + + + + + + + + + + + + + program_counter + + + + + b0123 + b3210 + b0101 + b2323 + b0000 + b1111 + b2222 + b3333 + b2301 + b1032 + b0011 + b2233 + + + + + + + + Used to select the 2 bytes for shifts of 16-bit vectors + b02 + + + + b00 + b11 + b22 + b33 + + + b01 + b23 + + + + + + + + h00 + h10 + h01 + h11 + b00 + b20 + b02 + b22 + b11 + b31 + b13 + b33 + b01 + b23 + + + + + + none + + h0 + h1 + b0 + b1 + b2 + b3 + + + + none + + h0 + h1 + b0 + b1 + b2 + b3 + w0 + + + + + + + + + + + b0 + b1 + b2 + b3 + + + + + Used for the lane select of `BRANCHZ`. To use an 8-bit condition, a + separate `ICMP` is required to cast to 16-bit. + + none + h0 + h1 + + + + + h0 + h1 + + + + b0 + b1 + b2 + b3 + h0 + h1 + w0 + d0 + + + + h0 + h1 + w0 + d0 + + + + + + + + identity + + + + + + + + + + w0 + d0 + + + + + + + + + + + + + + identity + + + + + + + + + + + + + + identity + + + + + + + + + + identity + + + + + + + + + + + + identity + + + + Corresponds to IEEE 754 rounding modes + rte + rtp + rtn + rtz + + + + + Comparison instructions like `FCMP` return a boolean but may encode this + boolean in a variety of ways. `i1` gives a OpenGL style `0/1` boolean. + `m1` gives a Direct3D style `0/~0` boolean. `f1` gives a floating-point + `0.0f / 1.0f` boolean. Switching between these modes is useful to fold a + boolean type convert into a comparison. `u1` is used internally to + implement 64-bit comparisons. + + i1 + f1 + m1 + u1 + + + + none + h0 + h1 + + + + + + + + + + Clamp applied to the destination of a floating-point instruction. Note the + clamps may be decomposed as two independent bits for `clamp_0_inf` and + `clamp_m1_1`, with `clamp_0_1` arising as the composition of `clamp_0_inf` + and `clamp_m1_1` in either order. + + none + clamp_0_inf + clamp_m1_1 + clamp_0_1 + + + + + Condition code. Type must be inferred from the instruction. IEEE 754 total + ordering only applies to floating point compares. 
"Not equal" and "greater + than or less than" are distinguished by NaN behaviour conforming to + the IEEE 754 specification. + + eq + gt + ge + ne + lt + le + gtlt + total + + + + Texture dimension. + 1d + 2d + 3d + cube + + + + Level-of-detail selection mode in a texture instruction. + zero + computed + + + explicit + computed_bias + grdesc + + + + + Format of data loaded to / stored from registers for general memory access. + + + f32 + f16 + u32 + + + + + + + sr0 + sr1 + sr2 + sr3 + sr4 + sr5 + sr6 + sr7 + + + + Number of channels loaded/stored for general memory access. + none + v2 + v3 + v4 + + + + + Dependency slot set on a message-passing instruction that writes to + registers. Before reading the destination, a future instruction must wait + on the specified slot. Slot #7 is for `BARRIER` instructions only. + + slot0 + slot1 + slot2 + + + + + slot7 + + + + Memory segment written to by a `STORE` instruction. + global + pos + vary + tl + + + + + Selects the effective subgroup size from subgroup operations. The hardware + warps are sixteen threads on Valhall, but subdividing a warp may be useful + for API requirements. In particular, derivatives may be calculated with + quads (four threads). + + subgroup2 + subgroup4 + subgroup8 + subgroup16 + + + + + Acts as a modifier on the lane specificier for a `CLPER` instruction. The + `accumulate` mode is required for efficient subgroup reductions. + + none + xor + accumulate + shift + + + + + Accesses to inactive lanes (due to divergence) in a subgroup is generally + undefined in APIs. However, the results of permuting with an inactive lane + with `CLPER.i32` are well-defined in Valhall: they return one of the + following values, as specified in the `CLPER.i32` instructions. Sometimes + certain values enable small optimizations. + + zero + umax + i1 + v2i1 + smin + smax + v2smin + v2smax + v4smin + v4smax + f1 + v2f1 + infn + inf + v2infn + v2inf + + + + + Do nothing. 
Useful at the start of a block for waiting on slots required + by the first actual instruction of the block, to reconcile dependencies + after a branch. Also useful as the sole instruction of an empty shader. + + + + + + Branches to a specified relative offset if its source is nonzero (default) + or if its source is zero (if `.eq` is set). The offset is 27-bits and + sign-extended, giving an effective range of ±26-bits. The offset is + specified in units of instructions, relative to the *next* instruction. + Positive offsets may be interpreted as "number of instructions to skip". + Since Valhall instructions are 8 bytes, this operates as: + + $$PC := \begin{cases} PC + 8 \cdot (\text{offset} \; + 1) & \text{if} \; + \text{src} \stackrel{?}{=} 0 \\ PC + 8 & \text{otherwise} \end{cases}$$ + + Used with comparison instructions to implement control flow. Tie the + source to a nonzero constant to implement a jump. May introduce + divergence, so generally requires `.reconverge` flow control. + + Value to compare against zero + + + + + + + Evaluates the given condition, and if it passes, discards the current + fragment and terminates the thread. The destination should be set to R60. + Only valid in a **fragment** shader. + + + Updated coverage mask (set to R60) + Left value to compare + Right value to compare + + + + + Jump to an indirectly specified address. Used to jump to blend shaders at + the end of a fragment shader. + + Value to compare against zero + Branch target + + + + + + General-purpose barrier. Must use slot #7. Must be paired with a + `.barrier` action on the instruction. + + + + + + + + + Evaluates the given condition and outputs either the true source or the + false source. + + + Left value to compare + Right value to compare + Return value if true + Return value if false + + + + + + + + + Evaluates the given condition and outputs either the true source or the + false source. + + Valhall lacks integer minimum/maximum instructions. 
`CSEL` instructions + with tied operands form the canonical implementations of these + instructions. Similarly, the integer $\text{sign}$ function is canonically + implemented with a pair of `CSEL` instructions. + + + Left value to compare + Right value to compare + Return value if true + Return value if false + + + + + + + + + + + + + + Interpolates a given varying + + + + + + + + + + + + + + + + + + + Vertex ID + Instance ID + + + + + The index must not diverge within a warp. + + + + + + Vertex ID + Instance ID + Index + + + + + Loads the effective address of the position buffer (in a position shader) + or the varying buffer (in a varying shader). That is, the base pointer + plus the vertex's linear ID (the first source) times the buffer's + per-vertex stride. `LEA_ATTR` should be executed once in a + position/varying shader, with the linear ID preloaded as `r59`. Each + position/varying store can then be constructed as `STORE` with the base + address sourced from the 64-bit destination of `LEA_ATTR` and an + appropriately computed offset. Varying stores bypass the usual conversion + hardware for attributes; this diverges from earlier Mali hardware. 
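Read literally, the `LEA_ATTR` description amounts to the following address computation. The names $\text{base}$ and $\text{stride}$ are assumptions introduced here for the buffer's base pointer and per-vertex stride, and $\text{offset}$ stands for whatever per-attribute offset the compiler computes for a given store; none of these names come from the ISA itself.

$$\text{LEA\_ATTR}(\text{linear\_id}) = \text{base} + \text{linear\_id} \cdot \text{stride}$$

$$\text{store address} = \text{LEA\_ATTR}(\text{linear\_id}) + \text{offset}$$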
+ + + + + + Linear ID + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Loads from main memory + + + + + + Address to load from after adding offset + + + + + Stores to main memory + + + + + + + + + + + + + Address to store to after adding offset + + + + + Stores to images + + + + Address to store to after adding offset + + + + + Loads a given render target, specified in the pixel indices descriptor, at + a given location and sample, and convert to the format specified in the + internal conversion descriptor. Used to implement EXT_framebuffer_fetch + and internally in blend shaders. + + + + + Pixel indices descriptor + Coverage mask + Conversion descriptor + + + + + Blends a given render target. This loads the API-specified blend state for + the render target from the first source. Blend descriptors are available + as special immediates. It then reads the colour to be blended from the + first staging register, with the specified vector size and register format + as desired. The resulting coverage mask is stored to the second set of + staging registers. + + In the fixed-function path, `BLEND` sends the colour to the blender to be + written to the tilebuffer. Then, if the instruction's flow control + specifies termination, the fragment program is ended. If it does not + specify termination, `BLEND` acts as a relative branch, branching with the + offset specified as `target`. 
This allows the subsequent instructions to + be skipped when fixed-function blending is used. Note this implicit branch + can never introduce divergence, so `.reconverge` is not required. + + In the blend shader path, `BLEND` ignores the specified flow control and + does not branch to the specified offset. Instead, execution continues + normally with the next instruction. The compiler should insert code for + calling a blend shader after the `BLEND` instruction unless it is known + that a blend shader will never be required. + + The indirection is required to support both fixed-function and blend + shaders efficiently and without shader variants. + + + + Blend descriptor + + + + + + + + + + Does alpha-to-coverage testing, updating the sample coverage mask. ATEST + does not do an implicit discard. It should be executed before the first + ZS_EMIT or BLEND instruction. + + Updated coverage mask + Input coverage mask + Alpha value (render target 0) + + + + + + + Programmatically writes out depth, stencil, or both, depending on which + modifiers are set. Used to implement gl_FragDepth and gl_FragStencil. + + + + Updated coverage mask + Depth value + Stencil value + Input coverage mask + + + + + Performs the given data conversion. Note that floating-point rounding is + handled via the same hardware and therefore shares an encoding. Round mode + is specified where it makes sense. + + + + + + + + + + + + + + + + Value to convert + + + + Performs the given data conversion. + + + + Value to convert + + + + Performs the given data conversion. + + + + + + Value to convert + + + + Converts up with the specified round mode. + + Value to convert + + + + + Performs the given data conversion. + + + + + + + + + + + + + Value to convert + + + + + Performs the given rounding, using the convert unit. + + + + + + + Value to convert + + + + Canonical register-to-register move. + + + + + + Used as a primitive for various bitwise operations. 
+ + + + + + + Used as a primitive for various bitwise operations. + + + + + + + Used as a primitive for various bitwise operations. + + + + + + + 64-bit abs may be constructed in 4 instructions (5 clocks) by checking the + sign with `ICMP.s32.lt.m1 hi, 0` and negating based on the result with + `IADD.s64` and `LSHIFT_XOR.i32` on each half. + + + + + + + + + + + + + + + Only available as 32-bit. Smaller bitsizes require explicit conversions. + 64-bit popcount may be constructed in 3 clocks by separate 32-bit + popcounts of each half and a 32-bit add, which is guaranteed not to + overflow. + + + + + + + Only available as 32-bit. Other bitsizes may be derived with swizzles. + + + + + + + For fully featured bitwise operation, see the shift opcodes. + + + + + + + For fully featured bitwise operation, see the shift opcodes. + + + + + + + Returns the mask of lanes ever active within the warp (subgroup), such + that the source is nonzero. The number of work-items in a subgroup is + given as the popcount of this value with a nonzero input. + + An `all()` subgroup operation may be constructed as `WMASK` of the input + compared for equality with `WMASK` of an nonzero value. + + An `any()` subgroup operation may be constructed as `WMASK` of the input + compared against zero. + + + + + + + + + + + + Breaks up the floating-point input into its fractional (mantissa) and + exponent parts. By default, this is compatible with the `frexp()` function + in APIs. With the log modifier, the floating point format is adjusted to + be compatible with Valhall's argument reduction for logarithm computation. + + + + + + + + + + + + + Performs a given special function. The floating-point reciprocal (`FRCP`) + and reciprocal square root (`FRSQ`) instructions may be freely used as-is. + The logarithm instruction (`FLOGD.f32`) requires an argument reduction. See the + transcendentals section for more information. 
+ + + + + + + + + Performs a given special function. The trigonometric tables (`FSIN_TABLE.u6` and `FCOS_TABLE.u6`) are crude, + requiring both an argument reduction and postprocessing. + + + + + + + + $A + B$ + + A + B + + + + + + $\min \{ A, B \}$ + + A + B + + + + + + $\max \{ A, B \}$ + + A + B + + + + + + Given a pair of 32-bit floats, output a pair of 16-bit floats packed into + a 32-bit destination. + + A + B + + + + + + + Computes $A \cdot 2^B$ by adding B to the exponent of A. Used to calculate + various special functions, particularly base-2 exponents. Special case + handling differs from an actual floating-point multiply, so this should + not be used outside fixed instruction sequences. + + + A + B + + + + + Calculates the base-2 exponent of an argument specified as an 8:24 + fixed-point value. The original argument is passed as well for correct handling + of special cases. + + + Input as 8:24 fixed-point + Input as 32-bit float + + + + + Performs a floating-point addition specialized for logarithm computation. + + + A + B + + + + + $A + B$ with optional saturation. + + As Valhall lacks swizzle instructions, `IADD.v2i16` with zero is the + canonical lowering for swizzles. + + + + + + + + + + + A + B + + + + + Calculates $A | (B \ll 16)$. Used to implement `(ushort2)(A, B)` + A + B + + + + + + + + + + + + $A - B$ with optional saturation + A + B + + + + + + Sign or zero extend B to 64 bits, left-shift by `shift`, and add the + 64-bit value A. These instructions accelerate address arithmetic, but may + be used in full generality for 64-bit integer arithmetic. + + + + + A + B + + + + + + + + + + + + + $A \cdot B$ with optional saturation. Note the multipliers can only handle up to + 32-bit by 32-bit multiplies. The 64-bit "multiply" acts like IMUL.u32 but + additionally writes the high half of the product to the high half of the + 64-bit destination. Along with IADD.u32 and IADD.u64, this allows the + construction of a 64-bit multiply in 5 instructions (6 clocks). 
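One plausible way to arrive at the five-instruction count is sketched below in Python, modelling each hardware operation as a small function. The helper names (`imul_u32`, `imul_wide`, `iadd_u32`, `iadd_u64`, `mul64`) are hypothetical stand-ins for the instructions named in the description, and the exact sequence a compiler would emit is an assumption; in particular, placing the cross terms in the high word is modelled here as a shift, whereas hardware would presumably do it by addressing the high register of a 64-bit pair.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def imul_u32(a, b):
    """Low 32 bits of a 32x32 product (stand-in for IMUL.u32)."""
    return (a * b) & MASK32


def imul_wide(a, b):
    """Full 64-bit product of two 32-bit inputs (stand-in for the 64-bit "multiply")."""
    return (a & MASK32) * (b & MASK32)


def iadd_u32(a, b):
    """32-bit wrapping add (stand-in for IADD.u32)."""
    return (a + b) & MASK32


def iadd_u64(a, b):
    """64-bit wrapping add (stand-in for IADD.u64)."""
    return (a + b) & MASK64


def mul64(a, b):
    """Low 64 bits of a 64x64 product, built from five modelled instructions."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    wide = imul_wide(a_lo, b_lo)        # 1: lo*lo, full 64-bit result
    cross0 = imul_u32(a_lo, b_hi)       # 2: only the low 32 bits matter
    cross1 = imul_u32(a_hi, b_lo)       # 3: only the low 32 bits matter
    cross = iadd_u32(cross0, cross1)    # 4: sum of cross terms, mod 2^32
    # 5: add the cross terms into the high word of the wide product.
    return iadd_u64(wide, cross << 32)
```

The `a_hi * b_hi` term never appears because it only contributes to bits 64 and above, which are discarded.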
+ + A + B + + + + + + + + + + + + A + B + + $(A + B) \gg 1$ without intermediate overflow, corresponding to `hadd()` in + OpenCL. With the `.rhadd` modifier set, it instead calculates + $(A + B + 1) \gg 1$ corresponding to `rhadd()` in OpenCL. + + + + + + + + + + + + + + + Selects the value of A in the subgroup lane given by B. This implements + subgroup broadcasts. It may be used as a primitive for screen space + derivatives in fragment shaders. + + A + B + + + + + + + + + $A \cdot B + C$ + + A + B + C + + + + + + + + + + Left shifts its first source by a specified amount and bitwise ANDs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + + + + + + Right shifts its first source by a specified amount and bitwise ANDs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + + + + + + Left shifts its first source by a specified amount and bitwise ORs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + + + + + + Right shifts its first source by a specified amount and bitwise ORs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + + + + + + Left shifts its first source by a specified amount and bitwise XORs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + + + + + + Right shifts its first source by a specified amount and bitwise XORs it with the + second source, optionally inverting the second source or the result. + + + A + shift + B + + + + + Mux between A and B based on the provided mask. Equivalent to + `bitselect()` in OpenCL. `(A & mask) | (B & ~mask)` + + A + B + Mask + + + + During a cube map transform, select the S coordinate given a selected face. 
+ Z coordinate as 32-bit floating point + X coordinate as 32-bit floating point + Cube face index + + + + During a cube map transform, select the T coordinate given a selected face. + Y coordinate as 32-bit floating point + Z coordinate as 32-bit floating point + Cube face index + + + + + Calculates $A | (B \ll 8) | (CD \ll 16)$ for 8-bit A and B and 16-bit CD. + + To implement `(uchar4) (A, B, C, D)` in full generality, use the sequence + `MKVEC.v4i8 CD, C, D, #0; MKVEC.v4i8 out, A, B, CD` + + `MKVEC.v4i8` also allows zero extending arbitrary 8-bit lanes. For + example, to extend `r0.b3` to `r1`, use `MKVEC.v4i8 r1, r0.b3, 0x0.b0, 0x0`. + + A + B + CD + + + + Select the maximum absolute value of its arguments. + X coordinate as 32-bit floating point + Y coordinate as 32-bit floating point + Z coordinate as 32-bit floating point + + + + Select the cube face index corresponding to the arguments. + X coordinate as 32-bit floating point + Y coordinate as 32-bit floating point + Z coordinate as 32-bit floating point + + + + + 8-bit integer dot product between 4-channel vectors, intended for machine + learning. Available in both unsigned and signed variants, controlling + sign-extension/zero-extension behaviour to the final 32-bit destination. + Saturation is available. Corresponds to the `cl_arm_integer_dot_product_*` + family of OpenCL extensions. Not for actual use, just for completeness. + Instead, use your platform's neural accelerator. + + For $A, B \in \{ 0, \ldots, 255 \}^4$ and $\text{Accumulator} \in + \mathbb{Z}$, calculates $(A \cdot B) + \text{Accumulator}$ and optionally + saturates. + + + + A + B + Accumulator + + + + + + Evaluates the given condition, performs a logical and/or with the condition in + the result source, and returns the result in the given result type (integer + one, integer minus one, or floating-point one). 
The third source is useful + for chaining together conditions without intermediate bitwise arithmetic; + when this is not desired, tie it to zero and use the OR combine mode (do + not set the `.and` modifier). + + The sequence modifier `.seq` is used to construct 64-bit compares in 2 + `ICMP.u32` instructions, in conjunction with the `u1` result type on the + low half, the `m1` result type on the high half, and the result of the low + half comparison passed as the third source. For comparisons other than + 64-bit, do not set the `.seq` modifier and do not use the `u1` result + type. + + + + + + + + + A + B + C + + + + + Evaluates the given condition, performs a logical and/or with the condition in + the result source, and returns the result in the given result type (integer + one, integer minus one, or floating-point one). The third source is useful + for chaining together conditions without intermediate bitwise arithmetic; + when this is not desired, tie it to zero and use the OR combine mode (do + not set the `.and` modifier). + + + + + + + A + B + C + + + + + Evaluates the given condition, performs a logical and/or with the condition in + the result source, and returns the result in the given result type (integer + one, integer minus one, or floating-point one). The third source is useful + for chaining together conditions without intermediate bitwise arithmetic; + when this is not desired, tie it to zero and use the OR combine mode (do + not set the `.and` modifier). + + The sequence modifier `.seq` is used to construct signed 64-bit compares + in 1 `ICMP.u32` and 1 `ICMP.s32` instruction, in conjunction with the `u1` + result type on the low half, the `m1` result type on the high half, and + the result of the low half comparison passed as the third source. For + comparisons other than 64-bit, do not set the `.seq` modifier and do not + use the `u1` result type. + + + + + + + + + A + B + C + + + + + Adds an arbitrary 32-bit immediate embedded within the instruction stream. 
+ If no modifiers are required, this is preferred to `IADD.i32` with a + constant accessed as a uniform. However, if the constant is available + inline, `IADD.i32` is preferred. + + `IADD_IMM.i32` with the source tied to zero is the canonical immediate move. + + A + + + + + + Adds an arbitrary pair of 16-bit immediates embedded within the + instruction stream. If no modifiers are required, this is preferred to + `IADD.v2i16` with a constant accessed as a uniform. However, if the + constant is available inline, `IADD.v2i16` is preferred. Adding only a + single 16-bit constant requires replication of the constant. + + A + + + + + + Adds an arbitrary quad of 8-bit immediates embedded within the + instruction stream. If no modifiers are required, this is preferred to + `IADD.v4i8` with a constant accessed as a uniform. However, if the + constant is available inline, `IADD.v4i8` is preferred. Adding only a + single 8-bit constant requires replication of the constant. + + A + + + + + + Adds an arbitrary 32-bit immediate embedded within the instruction stream. + If no modifiers are required, this is preferred to `FADD.f32` with a + constant accessed as a uniform. However, if the constant is available + inline, `FADD.f32` is preferred. + + A + + + + + + Adds an arbitrary pair of 16-bit immediates embedded within the + instruction stream. If no modifiers are required, this is preferred to + `FADD.v2f16` with a constant accessed as a uniform. However, if the + constant is available inline, `FADD.v2f16` is preferred. Adding only a + single 16-bit constant requires replication of the constant. + + A + + + + + + + + + + + + + + + + + + + + + + + + + + + + Unfiltered texturing instruction. + + + + + + + + Image to read from + + + + Ordinary texturing instruction using a sampler. + + + Image to read from + + + + + + + + + + + Only works for FP32 varyings. + + + + + Image to read from + + + + + First calculates $A \cdot B + C$ and then biases the exponent by D. 
Used in + special transcendental function sequences. It should not be used for + general code as its special case handling differs from two back-to-back + `FMA.f32` operations. Equivalent to `FMA.f32` back-to-back with + `RSCALE.f32` + + + A + B + C + D + + + -- 2.7.4