From 378f83917c01430a24a55699182653a6fab165fc Mon Sep 17 00:00:00 2001
From: Emma Anholt
Date: Thu, 10 Nov 2022 11:35:46 -0800
Subject: [PATCH] doc/freedreno: Add a bunch of docs of the hardware and drivers.

Part-of:
---
 docs/drivers/freedreno.rst | 280 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 277 insertions(+), 3 deletions(-)

diff --git a/docs/drivers/freedreno.rst b/docs/drivers/freedreno.rst
index 4435a21..bc1543d 100644
--- a/docs/drivers/freedreno.rst
+++ b/docs/drivers/freedreno.rst
@@ -1,12 +1,286 @@
 Freedreno
 =========
 
-Freedreno driver specific docs.
+Freedreno GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
+OpenGL ES 3.2 and desktop OpenGL 4.5.
+
+See the `Freedreno Wiki
+`__ for more
+details.
+
+Turnip
+======
+
+Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.
+
+The current set of specific chip versions supported can be found in
+:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
+supported can be found rendered at `Mesa Matrix `__.
+There are no plans to port to a5xx or earlier GPUs.
+
+Hardware architecture
+---------------------
+
+Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
+("gmem") and render directly to system memory ("sysmem"). It is UMA, using
+mostly write-combined memory but with the ability to map some buffers as cache
+coherent with the CPU.
+
+Hardware acronyms
+^^^^^^^^^^^^^^^^^
+
+.. glossary::
+
+   Cluster
+      A group of hardware registers, often with multiple copies to allow
+      pipelining. There is an M:N relationship between hardware blocks that do
+      work and the clusters of registers for the state that hardware blocks use.
+
+   CP
+      Command Processor. Reads the stream of state changes and draw commands
+      generated by the driver.
+
+   PFP
+      Prefetch Parser. Adreno 2xx-4xx CP component.
+
+   ME
+      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands.
+
+   SQE
+      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
+      microcode (loaded from Linux) which actually processes the command stream
+      and writes to the hardware registers. See `afuc
+      `__.
+
+   ROQ
+      DMA engine used by the SQE for reading memory, with some prefetch buffering.
+      Mostly reads in the command stream, but also serves for
+      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.
+
+   SP
+      Shader Processor. Unified, scalar shader engine. One or more, depending on
+      GPU and tier.
+
+   TP
+      Texture Processor.
+
+   UCHE
+      Unified L2 Cache. 32KB on A330, unclear how big now.
+
+   CCU
+      Color Cache Unit.
+
+   VSC
+      Visibility Stream Compressor.
+
+   PVS
+      Primitive Visibility Stream.
+
+   FE
+      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
+      VFD, VPC.
+
+   VFD
+      Vertex Fetch and Decode.
+
+   VPC
+      Varying/Position Cache? Hardware block that stores shaded vertex data for
+      primitive assembly.
+
+   HLSQ
+      High Level Sequencer. Manages state for the SPs, batches up PS invocations
+      between primitives, and is involved in preemption.
+
+   PC_VS
+      Cluster where varyings are read from VPC and assembled into primitives to
+      feed GRAS.
+
+   VS
+      Vertex Shader. Responsible for generating VS/GS/tess invocations.
+
+   GRAS
+      Rasterizer. Responsible for generating PS invocations from primitives;
+      also does LRZ.
+
+   PS
+      Pixel Shader.
+
+   RB
+      Render Backend. Performs both early and late Z testing, blending, and
+      attachment stores of the output of the PS.
+
+   GMEM
+      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
+      attachments during tiled rendering.
+
+   LRZ
+      Low Resolution Z. A low resolution area of the depth buffer that can be
+      initialized during the binning pass to contain the worst-case (farthest) Z
+      values in a block, and then used to early reject fragments during
+      rasterization.
+
+Cache hierarchy
+^^^^^^^^^^^^^^^
+
+The a6xx GPUs have two main caches: CCU and UCHE.
+
+UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
+texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
+flushes access system memory.
+
+The CCU is the separate cache used by 2D blits and sysmem render target access
+(and also for resolves to system memory when in GMEM mode). Its memory comes
+from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
+reserved based on whether we're in a render pass using GMEM for attachment
+storage, or we're doing sysmem rendering. Cache entries have the attachment
+number and layer mixed into the cache tag in some way, likely so that a
+fragment's access is spread through the cache even if the attachments are the
+same size and alignment in address space. This means that the cache must be
+flushed and invalidated between memory being used for one attachment and another
+(notably depth vs color, but also MRT color).
+
+The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
+unclear how big now) before accessing UCHE. This cache is used for normal
+sampling like ``sam`` and ``isam`` (and the compiler will make read-only
+storage image access through it as well). It is not coherent with UCHE (may get
+stale results when you ``sam`` after ``stib``), but must get flushed per draw or
+something because you don't need a manual invalidate between draws storing to an
+image and draws sampling from a texture.
+
+The command processor (CP) does not read from either of these caches, and
+instead uses FIFOs in the ROQ to avoid stalls reading from system memory.
+
+Draw states
+^^^^^^^^^^^
+
+Since the SQE is not a fast processor, and tiled rendering means that many
+draws won't even be used in many bins, state updates on a5xx and later can be
+batched up into "draw states" that point to a fragment of CP packets. At draw
+time, if the draw call is going to actually execute (some primitive is visible
+in the current tile), the SQE goes through the ``GROUP_ID``\s and, for any with
+an update since the last time they were executed, executes the corresponding
+fragment.
+
+Starting with a6xx, states can be tagged with whether they should be executed
+at draw time for any of sysmem, binning, or tile rendering. This allows a
+single command stream to be generated which can be executed in any of the
+modes, unlike pre-a6xx where we had to generate separate command lists for the
+binning and rendering phases.
+
+Note that this means that the generated draw state has to always update all of
+the state you have chosen to pack into that ``GROUP_ID``, since any of your
+previous state changes in a previous draw state command may have been skipped.
+
+Pipelining (a6xx+)
+^^^^^^^^^^^^^^^^^^
+
+Most CP commands write to registers. In a6xx+, the registers are located in
+clusters corresponding to the stage of the pipeline they are used from (see
+``enum tu_stage`` for a list).
+To pipeline state updates and drawing, registers generally have two copies
+("contexts") in their cluster, so previous draws can be working on the previous
+set of register state while the next draw's state is being set up. You can find
+what registers go into which clusters by looking at :command:`crashdec` output
+in the ``regs-name: CP_MEMPOOL`` section.
+
+As SQE processes register writes in the command stream, it sends them into a
+per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to
+process their stream of register updates and events independently of each other
+(so even with just 2 contexts in a stage, earlier stages can proceed on to later
+draws before later stages have caught up).
+
+Each cluster has a per-context bit indicating that the context is done/free.
+Register writes will stall until the context is done.
+
+During a 3D draw command, SQE generates several internal events that flow
+through the pipeline:
+
+- ``CP_EVENT_START`` clears the done bit for the context when written to the
+  cluster.
+- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
+  the actual event/drawing.
+- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
+  done flag.
+- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
+  the registers that were dirtied in this context to that one.
+
+The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``, and
+``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
+rollover.
+
+Because the clusters proceed independently of each other even across draws, if
+you need to synchronize an earlier cluster to the output of a later one, then
+you will need a ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
+necessary caches.
+
+Also, note that some registers are not banked at all, and will require a
+``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.
+
+In a2xx-a4xx, there weren't per-stage clusters; instead there were two register
+banks that the hardware flipped between per draw.
+
+Software Architecture
+---------------------
+
+Freedreno and Turnip use a shared core for shader compiler, image layout, and
+register and command stream definitions. They implement separate state
+management and command stream generation.
 
 .. toctree::
    :glob:
 
    freedreno/*
 
-See the `Freedreno Wiki `__
-for more details.
+GPU hang debugging
+^^^^^^^^^^^^^^^^^^
+
+A kernel message from DRM of "gpu fault" can mean any sort of error reported by
+the GPU (including its internal hang detection). If a fault in GPU address
+space happened, you should expect to find a message from the iommu, with the
+faulting address and the hardware unit involved:
+
+.. code-block:: console
+
+   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)
+
+On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
+``/sys/devices/virtual/devcoredump/**/data``. You can cp that file to a
+:file:`crash.devcore` to save it; otherwise the kernel will expire it
+eventually. Echo 1 to the file to free the core early, as another core won't be
+taken until then.
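+
+For example, the save-and-free sequence might look like this (a sketch that
+assumes the core showed up as ``devcd1``; the actual devcoredump instance name
+varies):
+
+.. code-block:: console
+
+   # Save a copy of the core before the kernel expires it.
+   cp /sys/devices/virtual/devcoredump/devcd1/data crash.devcore
+   # Free the core so that the next fault can be captured.
+   echo 1 > /sys/devices/virtual/devcoredump/devcd1/data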
+
+Once you have your core file, you can use :command:`crashdec -f crash.devcore`
+to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
+estimate the CP to have stopped. Note that it is expected that this will be
+some distance past whatever state triggered the fault, given GPU pipelining, and
+will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
+``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
+event. You can try running the workload with ``TU_DEBUG=flushall`` or
+``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.
+
+You can also find what commands were queued up to each cluster in the
+``regs-name: CP_MEMPOOL`` section.
+
+Command Stream Capture
+^^^^^^^^^^^^^^^^^^^^^^
+
+During Mesa development, it's often useful to look at the command streams we
+send to the kernel. Mesa itself doesn't implement a way to stream them out
+(though it maybe should!). Instead, we have an interface for the kernel to
+capture all submitted command streams:
+
+.. code-block:: console
+
+   cat /sys/kernel/debug/dri/0/rd > cmdstream &
+
+By default, command stream capture does not capture texture/vertex/etc. data.
+You can enable capturing all the BOs with:
+
+.. code-block:: console
+
+   echo Y > /sys/module/msm/parameters/rd_full
+
+Note that, since all command streams get captured, it is easy to run the system
+out of memory doing this, so you probably don't want to enable it during play of
+a heavyweight game. Instead, to capture a command stream within a game, you
+probably want to cause a crash in the GPU during a frame of interest so that a
+single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
+enough to cause a fault.
-- 
2.7.4