From e0752673becc9d6263f1e982289111f6d7aa7c43 Mon Sep 17 00:00:00 2001
From: Alyssa Rosenzweig <alyssa@collabora.com>
Date: Wed, 28 Dec 2022 15:26:45 -0500
Subject: [PATCH] docs/panfrost: Move description of instancing

Connor Abbott wrote a nice explanation of how instance divisors work on Mali.
Let's add it to the driver docs instead of letting it languish in a forgotten
header file.

This is mostly pasted from the existing header in tree, with a few local changes
applied.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20445>
---
 docs/drivers/panfrost.rst           | 111 ++++++++++++++++++++++++++++++++
 src/panfrost/include/panfrost-job.h | 123 ------------------------------------
 2 files changed, 111 insertions(+), 123 deletions(-)

diff --git a/docs/drivers/panfrost.rst b/docs/drivers/panfrost.rst
index 0c51b20..1adc0d7 100644
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -175,3 +175,114 @@ Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
 should be used instead where possible. However, not all formats are
 compressible, so u-interleaved tiling remains an important fallback on Panfrost.
 
+Instancing
+----------
+
+The attribute descriptor lets the attribute unit compute the address of an
+attribute given the vertex and instance ID. Unfortunately, the way this works is
+rather complicated when instancing is enabled.
+
+To explain this, first we need to explain how compute and vertex threads are
+dispatched.  When a quad is dispatched, it receives a single, linear index.
+However, we need to translate that index into a (vertex id, instance id) pair.
+One option would be to do:
+
+.. math::
+   \text{vertex id} = \text{linear id} \% \text{num vertices}
+
+   \text{instance id} = \text{linear id} / \text{num vertices}
+
+but this involves a costly division and modulus by an arbitrary number.
+Instead, we could pad num_vertices. We dispatch padded_num_vertices *
+num_instances threads instead of num_vertices * num_instances, which results
+in some "extra" threads with vertex_id >= num_vertices, which we have to
+discard.  The more we pad num_vertices, the more "wasted" threads we
+dispatch, but the division is potentially easier.
+
+One straightforward choice is to pad num_vertices to the next power of two,
+which means that the division and modulus are just simple bit shifts and
+masking. But the actual algorithm is a bit more complicated. The thread
+dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
+to dividing by a power of two. As a result, padded_num_vertices can be
+1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
+since we need less padding.
+
+padded_num_vertices is picked by the hardware. The driver just specifies the
+actual number of vertices. Note that padded_num_vertices is a multiple of four
+(presumably because threads are dispatched in groups of 4). Also,
+padded_num_vertices is always at least one more than num_vertices, which seems
+like a quirk of the hardware. For larger num_vertices, the hardware uses the
+following algorithm: using the binary representation of num_vertices, we look at
+the most significant set bit as well as the following 3 bits. Let n be the
+number of bits after those 4 bits. Then we set padded_num_vertices according to
+the following table:
+
+==========  =======================
+high bits   padded_num_vertices
+==========  =======================
+1000        :math:`9 \cdot 2^n`
+1001        :math:`5 \cdot 2^{n+1}`
+101x        :math:`3 \cdot 2^{n+2}`
+110x        :math:`7 \cdot 2^{n+1}`
+111x        :math:`2^{n+4}`
+==========  =======================
+
+For example, if a draw call specifies num_vertices = 70, its binary
+representation is 1000110, so n = 3 and the high bits are 1000, and
+therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
+
+The attribute unit works in terms of the original linear_id. If
+num_instances = 1, then linear_id equals the vertex ID, and everything is
+simple. However, with instancing things get more complicated. There are four
+possible modes, two of which we can group together:
+
+1. Use the linear_id directly. Only used when there is no instancing.
+
+2. Use the linear_id modulo a constant. This is used for per-vertex
+   attributes with instancing enabled by making the constant equal
+   padded_num_vertices. Because the modulus is always padded_num_vertices, this
+   mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
+   The shift field specifies the power of two, while the extra_flags field
+   specifies the odd number. If shift = n and extra_flags = m, then the modulus
+   is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
+   computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
+   extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
+   algorithm used to get padded_num_vertices in order to correctly implement
+   per-vertex attributes.
+
+3. Divide the linear_id by a constant. In order to correctly implement
+   instance divisors, we have to divide linear_id by padded_num_vertices times
+   the user-specified divisor. So first we compute padded_num_vertices, again
+   following the exact same algorithm that the hardware uses, then multiply it
+   by the GL-level divisor to get the hardware-level divisor. This case is
+   further divided into two more cases. If the hardware-level divisor is a
+   power of two, then we just need to shift. The shift amount is specified by
+   the shift field, so that the hardware-level divisor is just 2^shift.
+
+If it isn't a power of two, then we have to divide by an arbitrary integer.
+For that, we use the well-known technique of multiplying by an approximation
+of the inverse. The driver must compute the magic multiplier and shift
+amount, and then the hardware does the multiplication and shift. The
+hardware and driver also use the "round-down" optimization as described in
+http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
+The hardware further assumes the multiplier is between 2^31 and 2^32, so the
+high bit is implicitly set to 1 even though it is set to 0 by the driver --
+presumably this simplifies the hardware multiplier a little. The hardware
+first multiplies linear_id by the multiplier and takes the high 32 bits,
+then applies the round-down correction if extra_flags = 1, then finally
+shifts right by the shift field.
+
+There are some differences between ridiculousfish's algorithm and the Mali
+hardware algorithm, which means that the reference code from ridiculousfish
+doesn't always produce the right constants. Mali does not use the pre-shift
+optimization, since that would make a hardware implementation slower (it
+would have to always do the pre-shift, multiply, and post-shift operations).
+It also forces the multiplier to be at least 2^31, which means that the
+exponent is entirely fixed, so there is no trial-and-error. Altogether,
+given the divisor d, the algorithm the driver must follow is:
+
+1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
+2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and :math:`e = 2^{\text{shift} + 32} \% d`.
+3. If :math:`e \le 2^{\text{shift}}`, then we need to use the round-down algorithm. Set
+   magic_divisor = m - 1 and extra_flags = 1.
+4. Otherwise, set magic_divisor = m and extra_flags = 0.
diff --git a/src/panfrost/include/panfrost-job.h b/src/panfrost/include/panfrost-job.h
index efdf26a..6138ca7 100644
--- a/src/panfrost/include/panfrost-job.h
+++ b/src/panfrost/include/panfrost-job.h
@@ -42,129 +42,6 @@ typedef uint64_t mali_ptr;
 #define MALI_EXTRACT_TYPE(fmt) ((fmt)&0xe0)
 #define MALI_EXTRACT_INDEX(pixfmt) (((pixfmt) >> 12) & 0xFF)
 
-/*
- * Mali Attributes
- *
- * This structure lets the attribute unit compute the address of an attribute
- * given the vertex and instance ID. Unfortunately, the way this works is
- * rather complicated when instancing is enabled.
- *
- * To explain this, first we need to explain how compute and vertex threads are
- * dispatched. This is a guess (although a pretty firm guess!) since the
- * details are mostly hidden from the driver, except for attribute instancing.
- * When a quad is dispatched, it receives a single, linear index. However, we
- * need to translate that index into a (vertex id, instance id) pair, or a
- * (local id x, local id y, local id z) triple for compute shaders (although
- * vertex shaders and compute shaders are handled almost identically).
- * Focusing on vertex shaders, one option would be to do:
- *
- * vertex_id = linear_id % num_vertices
- * instance_id = linear_id / num_vertices
- *
- * but this involves a costly division and modulus by an arbitrary number.
- * Instead, we could pad num_vertices. We dispatch padded_num_vertices *
- * num_instances threads instead of num_vertices * num_instances, which results
- * in some "extra" threads with vertex_id >= num_vertices, which we have to
- * discard.  The more we pad num_vertices, the more "wasted" threads we
- * dispatch, but the division is potentially easier.
- *
- * One straightforward choice is to pad num_vertices to the next power of two,
- * which means that the division and modulus are just simple bit shifts and
- * masking. But the actual algorithm is a bit more complicated. The thread
- * dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
- * to dividing by a power of two. This is possibly using the technique
- * described in patent US20170010862A1. As a result, padded_num_vertices can be
- * 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
- * since we need less padding.
- *
- * padded_num_vertices is picked by the hardware. The driver just specifies the
- * actual number of vertices. At least for Mali G71, the first few cases are
- * given by:
- *
- * num_vertices	| padded_num_vertices
- * 3		| 4
- * 4-7		| 8
- * 8-11		| 12 (3 * 4)
- * 12-15	| 16
- * 16-19	| 20 (5 * 4)
- *
- * Note that padded_num_vertices is a multiple of four (presumably because
- * threads are dispatched in groups of 4). Also, padded_num_vertices is always
- * at least one more than num_vertices, which seems like a quirk of the
- * hardware. For larger num_vertices, the hardware uses the following
- * algorithm: using the binary representation of num_vertices, we look at the
- * most significant set bit as well as the following 3 bits. Let n be the
- * number of bits after those 4 bits. Then we set padded_num_vertices according
- * to the following table:
- *
- * high bits	| padded_num_vertices
- * 1000		| 9 * 2^n
- * 1001		| 5 * 2^(n+1)
- * 101x		| 3 * 2^(n+2)
- * 110x		| 7 * 2^(n+1)
- * 111x		| 2^(n+4)
- *
- * For example, if num_vertices = 70 is passed to glDraw(), its binary
- * representation is 1000110, so n = 3 and the high bits are 1000, and
- * therefore padded_num_vertices = 9 * 2^3 = 72.
- *
- * The attribute unit works in terms of the original linear_id. if
- * num_instances = 1, then they are the same, and everything is simple.
- * However, with instancing things get more complicated. There are four
- * possible modes, two of them we can group together:
- *
- * 1. Use the linear_id directly. Only used when there is no instancing.
- *
- * 2. Use the linear_id modulo a constant. This is used for per-vertex
- * attributes with instancing enabled by making the constant equal
- * padded_num_vertices. Because the modulus is always padded_num_vertices, this
- * mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
- * The shift field specifies the power of two, while the extra_flags field
- * specifies the odd number. If shift = n and extra_flags = m, then the modulus
- * is (2m + 1) * 2^n. As an example, if num_vertices = 70, then as computed
- * above, padded_num_vertices = 9 * 2^3, so we should set extra_flags = 4 and
- * shift = 3. Note that we must exactly follow the hardware algorithm used to
- * get padded_num_vertices in order to correctly implement per-vertex
- * attributes.
- *
- * 3. Divide the linear_id by a constant. In order to correctly implement
- * instance divisors, we have to divide linear_id by padded_num_vertices times
- * to user-specified divisor. So first we compute padded_num_vertices, again
- * following the exact same algorithm that the hardware uses, then multiply it
- * by the GL-level divisor to get the hardware-level divisor. This case is
- * further divided into two more cases. If the hardware-level divisor is a
- * power of two, then we just need to shift. The shift amount is specified by
- * the shift field, so that the hardware-level divisor is just 2^shift.
- *
- * If it isn't a power of two, then we have to divide by an arbitrary integer.
- * For that, we use the well-known technique of multiplying by an approximation
- * of the inverse. The driver must compute the magic multiplier and shift
- * amount, and then the hardware does the multiplication and shift. The
- * hardware and driver also use the "round-down" optimization as described in
- * http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
- * The hardware further assumes the multiplier is between 2^31 and 2^32, so the
- * high bit is implicitly set to 1 even though it is set to 0 by the driver --
- * presumably this simplifies the hardware multiplier a little. The hardware
- * first multiplies linear_id by the multiplier and takes the high 32 bits,
- * then applies the round-down correction if extra_flags = 1, then finally
- * shifts right by the shift field.
- *
- * There are some differences between ridiculousfish's algorithm and the Mali
- * hardware algorithm, which means that the reference code from ridiculousfish
- * doesn't always produce the right constants. Mali does not use the pre-shift
- * optimization, since that would make a hardware implementation slower (it
- * would have to always do the pre-shift, multiply, and post-shift operations).
- * It also forces the multplier to be at least 2^31, which means that the
- * exponent is entirely fixed, so there is no trial-and-error. Altogether,
- * given the divisor d, the algorithm the driver must follow is:
- *
- * 1. Set shift = floor(log2(d)).
- * 2. Compute m = ceil(2^(shift + 32) / d) and e = 2^(shift + 32) % d.
- * 3. If e <= 2^shift, then we need to use the round-down algorithm. Set
- * magic_divisor = m - 1 and extra_flags = 1.
- * 4. Otherwise, set magic_divisor = m and extra_flags = 0.
- */
-
 /* Purposeful off-by-one in width, height fields. For example, a (64, 64)
  * texture is stored as (63, 63) in these fields. This adjusts for that.
  * There's an identical pattern in the framebuffer descriptor. Even vertex
-- 
2.7.4