From 365d342ee6e89973e94bb3bef594859bccd1c808 Mon Sep 17 00:00:00 2001 From: Vitaliy Triang3l Kuzmin Date: Sun, 23 Apr 2023 23:12:58 +0300 Subject: [PATCH] docs/amd: Document Primitive Ordered Pixel Shading MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Acked-by: Timur Kristóf Signed-off-by: Vitaliy Triang3l Kuzmin Part-of: --- docs/drivers/amd/hw/pops.rst | 476 +++++++++++++++++++++++++++++++++++++++++++ docs/drivers/radv.rst | 7 + 2 files changed, 483 insertions(+) create mode 100644 docs/drivers/amd/hw/pops.rst diff --git a/docs/drivers/amd/hw/pops.rst b/docs/drivers/amd/hw/pops.rst new file mode 100644 index 0000000..b50f301 --- /dev/null +++ b/docs/drivers/amd/hw/pops.rst @@ -0,0 +1,476 @@ +Primitive Ordered Pixel Shading +=============================== + +Primitive Ordered Pixel Shading (POPS) is the feature available starting from +GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering +functionality. + +It allows a part of a fragment shader — an ordered section (or a critical +section) — to be executed sequentially in rasterization order for different +invocations covering the same pixel position. + +This article describes how POPS is set up in shader code and the registers. The +information here is currently provided for architecture generations up to GFX11. + +Note that the information in this article is **not official** and may contain +inaccuracies, as well as incomplete or incorrect assumptions. It is based on the +shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage +in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references, +and experimentation with the hardware. + +Shader code +----------- + +With POPS, a wave can dynamically execute up to one ordered section. It is fine +for a wave not to enter an ordered section at all if it doesn't need ordering on +its execution path, however. + +The setup of the ordered section consists of three parts: + +1. Entering the ordered section in the current wave — awaiting the completion of + ordered sections in overlapped waves. +2. Resolving overlap within the current wave — intrawave collisions (optional + and GFX9–10.3 only). +3. Exiting the ordered section — resuming overlapping waves trying to enter + their ordered sections. + +GFX9–10.3: Entering the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Awaiting the completion of ordered sections in overlapped waves is performed by +setting the POPS packer hardware register, and then polling the volatile +``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest +overlapped wave ID for the current wave. + +The information needed for the wave to perform the waiting is provided to it via +the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the +``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that +the POPS arguments specifically need to be enabled not only in ``RSRC`` unlike +various other arguments, but in ``PA_SC_SHADER_CONTROL`` as well). + +The collision wave ID argument contains the following unsigned values: + +* [31]: Whether overlap has occurred. +* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated + with. +* [25:16]: Newest overlapped wave ID. +* [9:0]: Current wave ID. 
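For illustration only, the layout above can be unpacked with plain shifts and masks. The following host-side C sketch is not taken from the patch or from any driver; the struct and function names are made up, and only the packer ID field differs in width between GFX9 and GFX10+::

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical decoding of the COLLISION_WAVEID layout listed above. */
    struct pops_collision_wave_id {
       bool did_overlap;                   /* bit [31] */
       uint32_t packer_id;                 /* bits [29:28] on GFX10+, bit [28] on GFX9 */
       uint32_t newest_overlapped_wave_id; /* bits [25:16], wrapping 10-bit ID */
       uint32_t current_wave_id;           /* bits [9:0], wrapping 10-bit ID */
    };

    static struct pops_collision_wave_id
    decode_collision_wave_id(uint32_t arg, bool gfx10_plus)
    {
       struct pops_collision_wave_id id;
       id.did_overlap = (arg >> 31) & 0x1;
       id.packer_id = (arg >> 28) & (gfx10_plus ? 0x3 : 0x1);
       id.newest_overlapped_wave_id = (arg >> 16) & 0x3ff;
       id.current_wave_id = arg & 0x3ff;
       return id;
    }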
The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of the fields, possibly from an early development iteration, but their meanings are accurate there.

The wait must not be performed if the "did overlap" bit 31 is set to 0; otherwise, it will result in a hang. Also, the bit being set to 0 indicates that there are *both* no wave overlap *and no intrawave collisions* for the current wave — so if the bit is 0, it's safe for the wave to skip all of the POPS logic completely and execute the contents of the ordered section simply as usual with unordered access, as a potential additional optimization. The packer hardware register, however, may safely be set even without overlap — it's the wait loop itself that must not be executed if it was reported that there was no overlap.

The packer ID needs to be passed to the packer hardware register using ``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for the wave. If the wave needs to enter the ordered section, it must set bit 24 to 1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead, containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section, bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs to be set to 1.

The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are 10-bit values wrapping around on overflow — consecutive waves are numbered 1022, 1023, 0, 1… This wraparound needs to be taken into account when comparing the exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be smaller than the newest overlapped wave ID or the exiting wave ID. So ``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to monotonically increasing unsigned values. In this case, the largest value, 0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from before the last wraparound will be near 0 increasing away from it. Subtracting ``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit newest overlapped wave ID is greater than the 10-bit current wave ID (meaning that it's behind the last wraparound point), 1 needs to be added to the newest overlapped wave ID before using it in the comparison. This was corrected in GFX10.

The exiting wave ID (not to be confused with "exited" — the exiting wave ID is the wave that will exit the ordered section next) is queried via the ``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave ID to a monotonically increasing one.
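To make the wraparound handling above more concrete, here is a small host-side C sketch, not taken from the driver, of the remapping and of the resulting comparison; the function names are made up for the example::

    #include <stdbool.h>
    #include <stdint.h>

    /* Remap a wrapping 10-bit wave ID to a monotonically increasing 32-bit value.
     * Adding ~current_wave_id is the same as subtracting (current_wave_id + 1):
     * the current wave maps to 0xFFFFFFFF, and older waves map to smaller values. */
    static uint32_t remap_wave_id(uint32_t wave_id_10bit, uint32_t current_wave_id_10bit)
    {
       return wave_id_10bit + ~current_wave_id_10bit;
    }

    /* The condition that the wait checks against a reading of the exiting wave ID. */
    static bool overlapped_waves_have_exited(uint32_t exiting_wave_id_10bit,
                                             uint32_t newest_overlapped_wave_id_10bit,
                                             uint32_t current_wave_id_10bit,
                                             bool gfx9_off_by_one)
    {
       /* GFX9 off-by-one: an overlapped wave ID from behind the wraparound point
        * needs 1 added to it before the comparison. */
       if (gfx9_off_by_one &&
           newest_overlapped_wave_id_10bit > current_wave_id_10bit)
          ++newest_overlapped_wave_id_10bit;
       return remap_wave_id(exiting_wave_id_10bit, current_wave_id_10bit) >
              remap_wave_id(newest_overlapped_wave_id_10bit, current_wave_id_10bit);
    }

In the real shader code, this comparison is performed against fresh readings of ``pops_exiting_wave_id``.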
+ +It's a volatile operand, and it needs to be read in a loop until its value +becomes greater than the newest overlapped wave ID (after remapping both to +monotonic). However, if it's too early for the current wave to enter the ordered +section, it needs to yield execution to other waves that may potentially be +overlapped — via ``s_sleep``. GFX9 requires a finite amount of delay to be +specified, AMD uses 3. Starting from GFX10, exiting the ordered section wakes up +the waiting waves, so the maximum delay of 0xFFFF can be used. + +In pseudocode, the entering logic would look like this:: + + bool did_overlap = collision_wave_id[31]; + if (did_overlap) { + if (gfx_level >= GFX10) { + uint packer_id = collision_wave_id[29:28]; + s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1)); + } else { + uint packer_id = collision_wave_id[28]; + s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01); + } + + uint current_10bit_wave_id = collision_wave_id[9:0]; + // Or -(current_10bit_wave_id + 1). + uint wave_id_remap_offset = ~current_10bit_wave_id; + + uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16]; + if (gfx_level < GFX10 && + newest_overlapped_10bit_wave_id > current_10bit_wave_id) { + ++newest_overlapped_10bit_wave_id; + } + uint newest_overlapped_wave_id = + newest_overlapped_10bit_wave_id + wave_id_remap_offset; + + while (!(src_pops_exiting_wave_id + wave_id_remap_offset > + newest_overlapped_wave_id)) { + s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3); + } + } + +The SPIR-V fragment shader interlock specification requires an invocation — an +individual invocation, not the whole subgroup — to execute +``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple +begin instructions, or even multiple begin/end pairs, under divergent +conditions, a wave may end up waiting for the overlapped waves multiple times. +Thankfully, it's safe to set the POPS packer hardware register to the same +value, or to run the wait loop, multiple times during the wave's execution, as +long as the ordered section isn't exited in between by the wave. + +GFX11: Entering the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave +status flag to report that the wave may enter the ordered section. It's awaited +by the ``s_wait_event`` instruction, with the bit 0 ("don't wait for +``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD +passes 0 as the whole immediate operand. + +The "export ready" wait can be done multiple times safely. + +GFX9–10.3: Resolving intrawave collisions +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +On GFX9–10.3, it's possible for overlapping fragment shader invocations to be +placed not only in different waves, but also in the same wave, with the shader +code making sure that the ordered section is executed for overlapping +invocations in order. + +This functionality is optional — it can be activated by enabling loading of the +``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and +``PA_SC_SHADER_CONTROL``. + +The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION`` +contain the mask of whether each quad in the wave starts a new layer of +overlapping invocations, and thus the ordered section code for them needs to be +executed after running it for all lanes with indices preceding that quad index +multiplied by 4. 
The rest of the bits in the argument need to be ignored — AMD explicitly masks them out in shader code (although this is not necessary if the shader uses "find first 1" to obtain the start of the next set of overlapping quads or expands this quad mask into a lane mask).

For example, if the intrawave collision mask is 0b0000001110000100, or ``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32), and then for the remaining quads 15:9 (lanes 63:36).

This effectively causes the ordered section to be executed as smaller "sub-subgroups" within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of the shader in a loop may be unsafe in some cases. One particular example is when the shader uses subgroup operations influenced by lanes outside the current quad. In this case, the code outside and inside the ordered section may be executed with different sets of active invocations, affecting the results of subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not supposed to modify the set of active invocations in any way. So the intrawave collision loop may break the results of subgroup operations in unpredictable ways, even outside the driver's compiler infrastructure. Even if the driver splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application and the compilers that created the source shader are still not aware of that happening — the input SPIR-V or GLSL shader might have already gone through various optimizations, such as common subexpression elimination, which might have considered a subgroup operation before ``OpBeginInvocationInterlockEXT`` and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the impact on the parallelism of the part of the shader that doesn't depend on the ordering: to avoid wasting lanes in the wave, and to allow the code outside the ordered section in different invocations to run in parallel lanes as usual. This may be especially helpful if the ordered section is small compared to the rest of the shader — for instance, a custom blending equation at the end of the usual fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferable is not a question with one universal answer. Intrawave collisions are pretty uncommon without multisampling, or when using sample interlock with multisampling, although they're highly frequent with pixel interlock with multisampling, when adjacent primitives cover the same pixels along the shared edge (though that's an extremely expensive situation in general). But resolving intrawave collisions adds some overhead to the shader. If intrawave overlap is unlikely to happen often, or, even more importantly, if the majority of the shader is inside the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are always placed in different waves.
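Returning to GFX9–10.3: purely as an illustration of the quad mask handling described above, the following self-contained C sketch expands the ``INTRAWAVE_COLLISION`` quad mask into the ordered "sub-subgroup" lane ranges and reproduces the worked example; it is not driver code, and the function name is made up::

    #include <stdint.h>
    #include <stdio.h>

    /* Print the lane ranges for which the ordered section needs to be executed
     * sequentially, oldest to newest, for the given intrawave collision quad mask. */
    static void print_ordered_lane_ranges(uint32_t intrawave_collision, unsigned wave_size)
    {
       unsigned num_quads = wave_size / 4;
       /* Only the lower 8 or 16 bits are meaningful; ignore the rest. */
       uint32_t quad_mask = intrawave_collision & ((1u << num_quads) - 1);
       unsigned first_quad = 0;
       while (first_quad < num_quads) {
          /* The next set bit, if any, starts the next layer of overlapping quads. */
          unsigned next_quad = first_quad + 1;
          while (next_quad < num_quads && !((quad_mask >> next_quad) & 1))
             ++next_quad;
          printf("ordered section for lanes %u:%u\n",
                 next_quad * 4 - 1, first_quad * 4);
          first_quad = next_quad;
       }
    }

    int main(void)
    {
       /* The mask from the example above, for a wave64: prints lanes 7:0, 27:8,
        * 31:28, 35:32 and 63:36. */
       print_ordered_lane_ranges(0x384, 64);
       return 0;
    }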
+ +GFX9–10.3: Exiting the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To exit the ordered section and let overlapping waves resume execution and enter +their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message +(7) using ``s_sendmsg``. + +If the wave has enabled POPS by setting the packer hardware register, it *must +not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the +message must be sent on all execution paths after the packer register setup. +However, if the wave exits before having configured the packer register, sending +the message is not required, though it's still fine to send it regardless of +that. + +Note that if the shader has multiple ``OpEndInvocationInterlockEXT`` +instructions executed in the same wave (depending on a divergent condition, for +example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave +only once, and especially not before any awaiting of overlapped waves. + +Before the message is sent, all counters for memory accesses that need to be +primitive-ordered, both writes and (in case something after the ordered section +depends on the per-pixel data, for instance, the tail blending fallback in +order-independent transparency) reads, must be awaited. Those may include +``vm``, ``vs``, and in some cases ``lgkm`` (though normally primitive-ordered +memory accesses will be done through VMEM with divergent addresses, not SMEM, as +there's no synchronization between fragments at different pixel coordinates, but +it's still technically possible for a shader, even though pointless and +nonoptimal, to explicitly perform them in a waterfall loop, for instance, and +that must work correctly too). Without that, a race condition will occur when +the newly resumed waves start accessing the memory locations to which there +still are outstanding accesses in the current wave. + +Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction, +which combines waiting for all the counters, sending the ``ORDERED_PS_DONE`` +message, and ending the program. Generally, however, it's desirable to resume +overlapping waves as early as possible, including before the export, as it may +stall the wave for some time too. + +GFX11: Exiting the ordered section in the wave +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The overlapping waves are resumed when the wave performs the last export (with +the ``done`` flag). + +The same requirements for awaiting the memory access counters as on GFX9–10.3 +still apply. + +Memory access requirements +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The compiler needs to ensure that entering the ordered section implements +acquire semantics, and exiting it implements release semantics, in the fragment +interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage +classes. + +A fragment interlock memory scope instance includes overlapping fragment shader +invocations executed by commands inside a single subpass. It may be considered a +subset of a queue family memory scope instance from the perspective of memory +barriers. + +Fragment shader interlock doesn't perform implicit memory availability or +visibility operations. Shaders must do them by themselves for accesses requiring +primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL +or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope +in SPIR-V. 
+ +On AMD hardware, this means that the accessed memory locations must be made +available or visible between waves that may be executed on any compute unit — so +accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag +and L1$ via DLC. + +However, it should be noted that memory accesses in the ordered section may be +expected by the application to be done in primitive order even if they don't +have the GLC and DLC flags. Coherent access not only bypasses, but also +invalidates the lower-level caches for the accessed memory locations. Thus, +considering that normally per-pixel data is accessed exclusively by the +invocation executing the ordered section, it's not necessary to make all reads +or writes in the ordered section for one memory location to be GLC/DLC — just +the first read and the last write: it doesn't matter if per-pixel data is cached +in L0/L1 in the middle of a dependency chain in the ordered section, as long as +it's invalidated in them in the beginning and flushed to L2 in the end. +Therefore, optimizations in the compiler must not simply assume that only +coherent accesses need primitive ordering — and moreover, the compiler must also +take into account that the same data may be accessed through different bindings. + +Export requirements +^^^^^^^^^^^^^^^^^^^ + +With POPS, on all hardware generations, the shader must have at least one +export, though it can be a null or an ``off, off, off, off`` one. + +Also, even if the shader doesn't need to export any real data, the export +skipping that was added in GFX10 must not be used, and some space must be +allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for +some color output to ``SPI_SHADER_32_R``. + +Without this, the shader will be executed without the needed synchronization on +GFX10, and will hang on GFX11. + +Drawing context setup +--------------------- + +Configuring POPS +^^^^^^^^^^^^^^^^ + +Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register. + +To enable POPS for the draw, +``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1. + +On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which +fragment shader invocations are considered overlapping: + +* For pixel interlock, it must be set to 0 (1 sample). +* If sample interlock is sufficient (only synchronizing between invocations that + have any common sample mask bits), it may be set to + ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` — the number of sample coverage mask + bits passed to the shader which is expected to use the sample mask to + determine whether it's allowed to access the data for each of the samples. As + of April 2023, PAL for some reason doesn't use non-1x + ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer + Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading + (those APIs tie the interlock granularity to the shading frequency — Vulkan + and OpenGL fragment shader interlock, however, allows specifying the interlock + granularity independently of it, making it possible both to ask for finer + synchronization guarantees and to require stronger ones than Direct3D ROVs can + provide). 
However, with MSAA, on AMD hardware, pixel interlock generally + performs *massively*, sometimes prohibitively, slower than sample interlock, + because it causes fragment shader invocations along the common edge of + adjacent primitives to be ordered as they cover the same pixels (even though + they don't cover any common samples). So it's highly desirable for the driver + to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES`` + accordingly, if the shader declares that it's enough for it via the execution + mode. + +On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is +used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier +architecture generations (and has a different bit offset in the register), and +``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11 +blending performance workaround overriding the intrinsic rate must not be +applied if POPS is used in the draw — the intrinsic rate override must be used +solely to control the interlock granularity in this case. + +No explicit flushes/synchronization are needed when changing the pipeline state +variables that may be involved in POPS, such as the rasterization sample count. +POPS automatically keeps synchronizing invocations even between draws with +different sample counts (invocations with common coverage mask bits are +considered overlapping by the hardware, regardless of what those samples +actually are — only the indices are important). + +Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage +sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` +even if there's no depth/stencil target. + +Hardware bug workarounds +^^^^^^^^^^^^^^^^^^^^^^^^ + +Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a +hardware bug that may result in a hang, and need a workaround to be enabled. +Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or +more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` +must be set to 1 for draws that satisfy this condition. In PAL, this is the +``waMiscPopsMissedOverlap`` workaround. It results in slightly lower performance +in those cases, increasing the frame time by around 1.5 to 2 times in +`nvpro-samples/vk_order_independent_transparency `_ +on the RX Vega 10, but it's required in a pretty rare case (8x+ MSAA) and is +mandatory to ensure stability. + +Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required +on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if +it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``, +``CHIP_NAVI14``), and the draw uses POPS, +``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to +``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL). + +Out-of-order rasterization interaction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This is a largely unresearched topic currently. However, considering that POPS +is primarily the functionality of the Depth Block, similarity to the behavior of +out-of-order rasterization in depth/stencil testing may possibly be expected. + +If the shader specifies an ordered interlock execution mode, out-of-order +rasterization likely must not be enabled implicitly. + +As of April 2023, PAL doesn't have any rules specifically for POPS in the logic +determining whether out-of-order rasterization can be enabled automatically. 
+Some of the POPS usage cases may possibly be covered by the rule that always +disables out-of-order rasterization if the shader writes to Unordered Access +Views (storage resources), though fragment shader interlock can be used for +read-only purposes too (for ordering between draws that only read per-pixel data +and draws that may write it), so that may be an oversight. + +Explicitly enabled relaxed rasterization order modifies the concept of +rasterization order itself in Vulkan, so from the point of view of the +specification of fragment shader interlock, relaxed rasterization order should +still be applicable regardless of whether the shader requests ordered interlock. +PAL also doesn't make any POPS-specific exceptions here as of April 2023. + +Variable-rate shading interaction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces +the shading rate to be 1x1, thus the +``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must +be false. + +On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the +``fragmentShadingRateWithFragmentShaderInterlock`` property must be true. +However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set, +enabling POPS will force 1x1 shading rate. + +The widest interlock granularity available on GFX11 — with the lowest possible +Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no +synchronization between coarse fragment shader invocations if they don't cover +common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device +feature is not available. + +Additional configuration +^^^^^^^^^^^^^^^^^^^^^^^^ + +These are some largely unresearched options found in the register declarations. +PAL doesn't use them, so it's unknown if they make any significant difference. +No effect was found in `nvpro-samples/vk_order_independent_transparency `_ +during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_GFX1100``. + +* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3. +* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+. diff --git a/docs/drivers/radv.rst b/docs/drivers/radv.rst index 5c37b95..5368efb 100644 --- a/docs/drivers/radv.rst +++ b/docs/drivers/radv.rst @@ -16,6 +16,13 @@ You can find a list of documentation for the various generations of AMD hardware on the `X.Org wiki `__. +Additional community-written documentation is also available in Mesa: + +.. toctree:: + :glob: + + amd/hw/* + ACO --- -- 2.7.4