From 365d342ee6e89973e94bb3bef594859bccd1c808 Mon Sep 17 00:00:00 2001 From: Vitaliy Triang3l Kuzmin Date: Sun, 23 Apr 2023 23:12:58 +0300 Subject: [PATCH] docs/amd: Document Primitive Ordered Pixel Shading MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Acked-by: Timur Kristóf Signed-off-by: Vitaliy Triang3l Kuzmin Part-of: --- docs/drivers/amd/hw/pops.rst | 476 +++++++++++++++++++++++++++++++++++++++++++ docs/drivers/radv.rst | 7 + 2 files changed, 483 insertions(+) create mode 100644 docs/drivers/amd/hw/pops.rst diff --git a/docs/drivers/amd/hw/pops.rst b/docs/drivers/amd/hw/pops.rst new file mode 100644 index 0000000..b50f301 --- /dev/null +++ b/docs/drivers/amd/hw/pops.rst @@ -0,0 +1,476 @@ +Primitive Ordered Pixel Shading +=============================== + +Primitive Ordered Pixel Shading (POPS) is the feature available starting from +GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering +functionality. + +It allows a part of a fragment shader — an ordered section (or a critical +section) — to be executed sequentially in rasterization order for different +invocations covering the same pixel position. + +This article describes how POPS is set up in shader code and the registers. The +information here is currently provided for architecture generations up to GFX11. + +Note that the information in this article is **not official** and may contain +inaccuracies, as well as incomplete or incorrect assumptions. It is based on the +shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage +in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references, +and experimentation with the hardware. + +Shader code +----------- + +With POPS, a wave can dynamically execute up to one ordered section. It is fine +for a wave not to enter an ordered section at all if it doesn't need ordering on +its execution path, however. + +The setup of the ordered section consists of three parts: + +1. Entering the ordered section in the current wave — awaiting the completion of + ordered sections in overlapped waves. +2. Resolving overlap within the current wave — intrawave collisions (optional + and GFX9–10.3 only). +3. Exiting the ordered section — resuming overlapping waves trying to enter + their ordered sections. + +GFX9–10.3: Entering the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Awaiting the completion of ordered sections in overlapped waves is performed by +setting the POPS packer hardware register, and then polling the volatile +``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest +overlapped wave ID for the current wave. + +The information needed for the wave to perform the waiting is provided to it via +the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the +``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that +the POPS arguments specifically need to be enabled not only in ``RSRC`` unlike +various other arguments, but in ``PA_SC_SHADER_CONTROL`` as well). + +The collision wave ID argument contains the following unsigned values: + +* [31]: Whether overlap has occurred. +* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated + with. +* [25:16]: Newest overlapped wave ID. +* [9:0]: Current wave ID. 
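For illustration only, the layout above can be unpacked with plain shifts and masks. The following host-side C sketch is not taken from the patch or from any driver; the struct and function names are made up, and only the packer ID field differs in width between GFX9 and GFX10+::

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical decoding of the COLLISION_WAVEID layout listed above. */
    struct pops_collision_wave_id {
       bool did_overlap;                   /* bit [31] */
       uint32_t packer_id;                 /* bits [29:28] on GFX10+, bit [28] on GFX9 */
       uint32_t newest_overlapped_wave_id; /* bits [25:16], wrapping 10-bit ID */
       uint32_t current_wave_id;           /* bits [9:0], wrapping 10-bit ID */
    };

    static struct pops_collision_wave_id
    decode_collision_wave_id(uint32_t arg, bool gfx10_plus)
    {
       struct pops_collision_wave_id id;
       id.did_overlap = (arg >> 31) & 0x1;
       id.packer_id = (arg >> 28) & (gfx10_plus ? 0x3 : 0x1);
       id.newest_overlapped_wave_id = (arg >> 16) & 0x3ff;
       id.current_wave_id = arg & 0x3ff;
       return id;
    }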
The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of the fields, possibly from an early development iteration, but their meanings are accurate there.

The wait must not be performed if the "did overlap" bit 31 is set to 0; otherwise, it will result in a hang. Also, the bit being set to 0 indicates that there are *both* no wave overlap *and no intrawave collisions* for the current wave — so if the bit is 0, it's safe for the wave to skip all of the POPS logic completely and execute the contents of the ordered section simply as usual with unordered access, as a potential additional optimization. The packer hardware register, however, may safely be set even without overlap — it's the wait loop itself that must not be executed if it was reported that there was no overlap.

The packer ID needs to be passed to the packer hardware register using ``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for the wave. If the wave needs to enter the ordered section, it must set bit 24 to 1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead, containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section, bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs to be set to 1.

The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are 10-bit values wrapping around on overflow — consecutive waves are numbered 1022, 1023, 0, 1… This wraparound needs to be taken into account when comparing the exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be smaller than the newest overlapped wave ID or the exiting wave ID. So ``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to monotonically increasing unsigned values. In this case, the largest value, 0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from before the last wraparound will be near 0 increasing away from it. Subtracting ``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit newest overlapped wave ID is greater than the 10-bit current wave ID (meaning that it's behind the last wraparound point), 1 needs to be added to the newest overlapped wave ID before using it in the comparison. This was corrected in GFX10.

The exiting wave ID (not to be confused with "exited" — the exiting wave ID is the wave that will exit the ordered section next) is queried via the ``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave ID to a monotonically increasing one.
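To make the wraparound handling above more concrete, here is a small host-side C sketch, not taken from the driver, of the remapping and of the resulting comparison; the function names are made up for the example::

    #include <stdbool.h>
    #include <stdint.h>

    /* Remap a wrapping 10-bit wave ID to a monotonically increasing 32-bit value.
     * Adding ~current_wave_id is the same as subtracting (current_wave_id + 1):
     * the current wave maps to 0xFFFFFFFF, and older waves map to smaller values. */
    static uint32_t remap_wave_id(uint32_t wave_id_10bit, uint32_t current_wave_id_10bit)
    {
       return wave_id_10bit + ~current_wave_id_10bit;
    }

    /* The condition that the wait checks against a reading of the exiting wave ID. */
    static bool overlapped_waves_have_exited(uint32_t exiting_wave_id_10bit,
                                             uint32_t newest_overlapped_wave_id_10bit,
                                             uint32_t current_wave_id_10bit,
                                             bool gfx9_off_by_one)
    {
       /* GFX9 off-by-one: an overlapped wave ID from behind the wraparound point
        * needs 1 added to it before the comparison. */
       if (gfx9_off_by_one &&
           newest_overlapped_wave_id_10bit > current_wave_id_10bit)
          ++newest_overlapped_wave_id_10bit;
       return remap_wave_id(exiting_wave_id_10bit, current_wave_id_10bit) >
              remap_wave_id(newest_overlapped_wave_id_10bit, current_wave_id_10bit);
    }

In the real shader code, this comparison is performed against fresh readings of ``pops_exiting_wave_id``.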
+ +It's a volatile operand, and it needs to be read in a loop until its value +becomes greater than the newest overlapped wave ID (after remapping both to +monotonic). However, if it's too early for the current wave to enter the ordered +section, it needs to yield execution to other waves that may potentially be +overlapped — via ``s_sleep``. GFX9 requires a finite amount of delay to be +specified, AMD uses 3. Starting from GFX10, exiting the ordered section wakes up +the waiting waves, so the maximum delay of 0xFFFF can be used. + +In pseudocode, the entering logic would look like this:: + + bool did_overlap = collision_wave_id[31]; + if (did_overlap) { + if (gfx_level >= GFX10) { + uint packer_id = collision_wave_id[29:28]; + s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1)); + } else { + uint packer_id = collision_wave_id[28]; + s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01); + } + + uint current_10bit_wave_id = collision_wave_id[9:0]; + // Or -(current_10bit_wave_id + 1). + uint wave_id_remap_offset = ~current_10bit_wave_id; + + uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16]; + if (gfx_level < GFX10 && + newest_overlapped_10bit_wave_id > current_10bit_wave_id) { + ++newest_overlapped_10bit_wave_id; + } + uint newest_overlapped_wave_id = + newest_overlapped_10bit_wave_id + wave_id_remap_offset; + + while (!(src_pops_exiting_wave_id + wave_id_remap_offset > + newest_overlapped_wave_id)) { + s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3); + } + } + +The SPIR-V fragment shader interlock specification requires an invocation — an +individual invocation, not the whole subgroup — to execute +``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple +begin instructions, or even multiple begin/end pairs, under divergent +conditions, a wave may end up waiting for the overlapped waves multiple times. +Thankfully, it's safe to set the POPS packer hardware register to the same +value, or to run the wait loop, multiple times during the wave's execution, as +long as the ordered section isn't exited in between by the wave. + +GFX11: Entering the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave +status flag to report that the wave may enter the ordered section. It's awaited +by the ``s_wait_event`` instruction, with the bit 0 ("don't wait for +``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD +passes 0 as the whole immediate operand. + +The "export ready" wait can be done multiple times safely. + +GFX9–10.3: Resolving intrawave collisions +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +On GFX9–10.3, it's possible for overlapping fragment shader invocations to be +placed not only in different waves, but also in the same wave, with the shader +code making sure that the ordered section is executed for overlapping +invocations in order. + +This functionality is optional — it can be activated by enabling loading of the +``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and +``PA_SC_SHADER_CONTROL``. + +The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION`` +contain the mask of whether each quad in the wave starts a new layer of +overlapping invocations, and thus the ordered section code for them needs to be +executed after running it for all lanes with indices preceding that quad index +multiplied by 4. 
The rest of the bits in the argument need to be ignored — AMD explicitly masks them out in shader code (although this is not necessary if the shader uses "find first 1" to obtain the start of the next set of overlapping quads or expands this quad mask into a lane mask).

For example, if the intrawave collision mask is 0b0000001110000100, or ``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32), and then for the remaining quads 15:9 (lanes 63:36).

This effectively causes the ordered section to be executed as smaller "sub-subgroups" within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of the shader in a loop may be unsafe in some cases. One particular example is when the shader uses subgroup operations influenced by lanes outside the current quad. In this case, the code outside and inside the ordered section may be executed with different sets of active invocations, affecting the results of subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not supposed to modify the set of active invocations in any way. So the intrawave collision loop may break the results of subgroup operations in unpredictable ways, even outside the driver's compiler infrastructure. Even if the driver splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application and the compilers that created the source shader are still not aware of that happening — the input SPIR-V or GLSL shader might have already gone through various optimizations, such as common subexpression elimination, which might have considered a subgroup operation before ``OpBeginInvocationInterlockEXT`` and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the impact on the parallelism of the part of the shader that doesn't depend on the ordering: to avoid wasting lanes in the wave, and to allow the code outside the ordered section in different invocations to run in parallel lanes as usual. This may be especially helpful if the ordered section is small compared to the rest of the shader — for instance, a custom blending equation at the end of the usual fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferable is not a question with one universal answer. Intrawave collisions are pretty uncommon without multisampling, or when using sample interlock with multisampling, although they're highly frequent with pixel interlock with multisampling, when adjacent primitives cover the same pixels along the shared edge (though that's an extremely expensive situation in general). But resolving intrawave collisions adds some overhead to the shader. If intrawave overlap is unlikely to happen often, or, even more importantly, if the majority of the shader is inside the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are always placed in different waves.
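Returning to GFX9–10.3: purely as an illustration of the quad mask handling described above, the following self-contained C sketch expands the ``INTRAWAVE_COLLISION`` quad mask into the ordered "sub-subgroup" lane ranges and reproduces the worked example; it is not driver code, and the function name is made up::

    #include <stdint.h>
    #include <stdio.h>

    /* Print the lane ranges for which the ordered section needs to be executed
     * sequentially, oldest to newest, for the given intrawave collision quad mask. */
    static void print_ordered_lane_ranges(uint32_t intrawave_collision, unsigned wave_size)
    {
       unsigned num_quads = wave_size / 4;
       /* Only the lower 8 or 16 bits are meaningful; ignore the rest. */
       uint32_t quad_mask = intrawave_collision & ((1u << num_quads) - 1);
       unsigned first_quad = 0;
       while (first_quad < num_quads) {
          /* The next set bit, if any, starts the next layer of overlapping quads. */
          unsigned next_quad = first_quad + 1;
          while (next_quad < num_quads && !((quad_mask >> next_quad) & 1))
             ++next_quad;
          printf("ordered section for lanes %u:%u\n",
                 next_quad * 4 - 1, first_quad * 4);
          first_quad = next_quad;
       }
    }

    int main(void)
    {
       /* The mask from the example above, for a wave64: prints lanes 7:0, 27:8,
        * 31:28, 35:32 and 63:36. */
       print_ordered_lane_ranges(0x384, 64);
       return 0;
    }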
+ +GFX9–10.3: Exiting the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To exit the ordered section and let overlapping waves resume execution and enter +their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message +(7) using ``s_sendmsg``. + +If the wave has enabled POPS by setting the packer hardware register, it *must +not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the +message must be sent on all execution paths after the packer register setup. +However, if the wave exits before having configured the packer register, sending +the message is not required, though it's still fine to send it regardless of +that. + +Note that if the shader has multiple ``OpEndInvocationInterlockEXT`` +instructions executed in the same wave (depending on a divergent condition, for +example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave +only once, and especially not before any awaiting of overlapped waves. + +Before the message is sent, all counters for memory accesses that need to be +primitive-ordered, both writes and (in case something after the ordered section +depends on the per-pixel data, for instance, the tail blending fallback in +order-independent transparency) reads, must be awaited. Those may include +``vm``, ``vs``, and in some cases ``lgkm`` (though normally primitive-ordered +memory accesses will be done through VMEM with divergent addresses, not SMEM, as +there's no synchronization between fragments at different pixel coordinates, but +it's still technically possible for a shader, even though pointless and +nonoptimal, to explicitly perform them in a waterfall loop, for instance, and +that must work correctly too). Without that, a race condition will occur when +the newly resumed waves start accessing the memory locations to which there +still are outstanding accesses in the current wave. + +Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction, +which combines waiting for all the counters, sending the ``ORDERED_PS_DONE`` +message, and ending the program. Generally, however, it's desirable to resume +overlapping waves as early as possible, including before the export, as it may +stall the wave for some time too. + +GFX11: Exiting the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The overlapping waves are resumed when the wave performs the last export (with +the ``done`` flag). + +The same requirements for awaiting the memory access counters as on GFX9–10.3 +still apply. + +Memory access requirements +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The compiler needs to ensure that entering the ordered section implements +acquire semantics, and exiting it implements release semantics, in the fragment +interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage +classes. + +A fragment interlock memory scope instance includes overlapping fragment shader +invocations executed by commands inside a single subpass. It may be considered a +subset of a queue family memory scope instance from the perspective of memory +barriers. + +Fragment shader interlock doesn't perform implicit memory availability or +visibility operations. Shaders must do them by themselves for accesses requiring +primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL +or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope +in SPIR-V. 
+ +On AMD hardware, this means that the accessed memory locations must be made +available or visible between waves that may be executed on any compute unit — so +accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag +and L1$ via DLC. + +However, it should be noted that memory accesses in the ordered section may be +expected by the application to be done in primitive order even if they don't +have the GLC and DLC flags. Coherent access not only bypasses, but also +invalidates the lower-level caches for the accessed memory locations. Thus, +considering that normally per-pixel data is accessed exclusively by the +invocation executing the ordered section, it's not necessary to make all reads +or writes in the ordered section for one memory location to be GLC/DLC — just +the first read and the last write: it doesn't matter if per-pixel data is cached +in L0/L1 in the middle of a dependency chain in the ordered section, as long as +it's invalidated in them in the beginning and flushed to L2 in the end. +Therefore, optimizations in the compiler must not simply assume that only +coherent accesses need primitive ordering — and moreover, the compiler must also +take into account that the same data may be accessed through different bindings. + +Export requirements +^^^^^^^^^^^^^^^^^^^ + +With POPS, on all hardware generations, the shader must have at least one +export, though it can be a null or an ``off, off, off, off`` one. + +Also, even if the shader doesn't need to export any real data, the export +skipping that was added in GFX10 must not be used, and some space must be +allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for +some color output to ``SPI_SHADER_32_R``. + +Without this, the shader will be executed without the needed synchronization on +GFX10, and will hang on GFX11. + +Drawing context setup +--------------------- + +Configuring POPS +^^^^^^^^^^^^^^^^ + +Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register. + +To enable POPS for the draw, +``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1. + +On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which +fragment shader invocations are considered overlapping: + +* For pixel interlock, it must be set to 0 (1 sample). +* If sample interlock is sufficient (only synchronizing between invocations that + have any common sample mask bits), it may be set to + ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` — the number of sample coverage mask + bits passed to the shader which is expected to use the sample mask to + determine whether it's allowed to access the data for each of the samples. As + of April 2023, PAL for some reason doesn't use non-1x + ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer + Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading + (those APIs tie the interlock granularity to the shading frequency — Vulkan + and OpenGL fragment shader interlock, however, allows specifying the interlock + granularity independently of it, making it possible both to ask for finer + synchronization guarantees and to require stronger ones than Direct3D ROVs can + provide). 
However, with MSAA, on AMD hardware, pixel interlock generally + performs *massively*, sometimes prohibitively, slower than sample interlock, + because it causes fragment shader invocations along the common edge of + adjacent primitives to be ordered as they cover the same pixels (even though + they don't cover any common samples). So it's highly desirable for the driver + to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES`` + accordingly, if the shader declares that it's enough for it via the execution + mode. + +On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is +used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier +architecture generations (and has a different bit offset in the register), and +``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11 +blending performance workaround overriding the intrinsic rate must not be +applied if POPS is used in the draw — the intrinsic rate override must be used +solely to control the interlock granularity in this case. + +No explicit flushes/synchronization are needed when changing the pipeline state +variables that may be involved in POPS, such as the rasterization sample count. +POPS automatically keeps synchronizing invocations even between draws with +different sample counts (invocations with common coverage mask bits are +considered overlapping by the hardware, regardless of what those samples +actually are — only the indices are important). + +Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage +sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` +even if there's no depth/stencil target. + +Hardware bug workarounds +^^^^^^^^^^^^^^^^^^^^^^^^ + +Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a +hardware bug that may result in a hang, and need a workaround to be enabled. +Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or +more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` +must be set to 1 for draws that satisfy this condition. In PAL, this is the +``waMiscPopsMissedOverlap`` workaround. It results in slightly lower performance +in those cases, increasing the frame time by around 1.5 to 2 times in +`nvpro-samples/vk_order_independent_transparency `_ +on the RX Vega 10, but it's required in a pretty rare case (8x+ MSAA) and is +mandatory to ensure stability. + +Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required +on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if +it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``, +``CHIP_NAVI14``), and the draw uses POPS, +``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to +``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL). + +Out-of-order rasterization interaction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This is a largely unresearched topic currently. However, considering that POPS +is primarily the functionality of the Depth Block, similarity to the behavior of +out-of-order rasterization in depth/stencil testing may possibly be expected. + +If the shader specifies an ordered interlock execution mode, out-of-order +rasterization likely must not be enabled implicitly. + +As of April 2023, PAL doesn't have any rules specifically for POPS in the logic +determining whether out-of-order rasterization can be enabled automatically. 
+Some of the POPS usage cases may possibly be covered by the rule that always +disables out-of-order rasterization if the shader writes to Unordered Access +Views (storage resources), though fragment shader interlock can be used for +read-only purposes too (for ordering between draws that only read per-pixel data +and draws that may write it), so that may be an oversight. + +Explicitly enabled relaxed rasterization order modifies the concept of +rasterization order itself in Vulkan, so from the point of view of the +specification of fragment shader interlock, relaxed rasterization order should +still be applicable regardless of whether the shader requests ordered interlock. +PAL also doesn't make any POPS-specific exceptions here as of April 2023. + +Variable-rate shading interaction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces +the shading rate to be 1x1, thus the +``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must +be false. + +On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the +``fragmentShadingRateWithFragmentShaderInterlock`` property must be true. +However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set, +enabling POPS will force 1x1 shading rate. + +The widest interlock granularity available on GFX11 — with the lowest possible +Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no +synchronization between coarse fragment shader invocations if they don't cover +common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device +feature is not available. + +Additional configuration +^^^^^^^^^^^^^^^^^^^^^^^^ + +These are some largely unresearched options found in the register declarations. +PAL doesn't use them, so it's unknown if they make any significant difference. +No effect was found in `nvpro-samples/vk_order_independent_transparency `_ +during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_GFX1100``. + +* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3. +* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+. diff --git a/docs/drivers/radv.rst b/docs/drivers/radv.rst index 5c37b95..5368efb 100644 --- a/docs/drivers/radv.rst +++ b/docs/drivers/radv.rst @@ -16,6 +16,13 @@ You can find a list of documentation for the various generations of AMD hardware on the `X.Org wiki `__. +Additional community-written documentation is also available in Mesa: + +.. toctree:: + :glob: + + amd/hw/* + ACO --- -- 2.7.4