From 378f83917c01430a24a55699182653a6fab165fc Mon Sep 17 00:00:00 2001
From: Emma Anholt
Date: Thu, 10 Nov 2022 11:35:46 -0800
Subject: [PATCH] doc/freedreno: Add a bunch of docs of the hardware and drivers.

Part-of:
---
 docs/drivers/freedreno.rst | 280 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 277 insertions(+), 3 deletions(-)

diff --git a/docs/drivers/freedreno.rst b/docs/drivers/freedreno.rst
index 4435a21..bc1543d 100644
--- a/docs/drivers/freedreno.rst
+++ b/docs/drivers/freedreno.rst
@@ -1,12 +1,286 @@
 Freedreno
 =========
 
-Freedreno driver specific docs.
+Freedreno GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
+OpenGL ES 3.2 and desktop OpenGL 4.5.
+
+See the `Freedreno Wiki
+`__ for more
+details.
+
+Turnip
+======
+
+Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.
+
+The current set of specific chip versions supported can be found in
+:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
+supported can be found rendered at `Mesa Matrix `__.
+There are no plans to port to a5xx or earlier GPUs.
+
+Hardware architecture
+---------------------
+
+Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
+("gmem") and render directly to system memory ("sysmem"). It is UMA, using
+mostly write-combined memory but with the ability to map some buffers as cache
+coherent with the CPU.
+
+Hardware acronyms
+^^^^^^^^^^^^^^^^^
+
+.. glossary::
+
+   Cluster
+      A group of hardware registers, often with multiple copies to allow
+      pipelining. There is an M:N relationship between hardware blocks that do
+      work and the clusters of registers for the state that hardware blocks use.
+
+   CP
+      Command Processor. Reads the stream of state changes and draw commands
+      generated by the driver.
+
+   PFP
+      Prefetch Parser. Adreno 2xx-4xx CP component.
+
+   ME
+      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands.
+
+   SQE
+      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
+      microcode (loaded from Linux) which actually processes the command stream
+      and writes to the hardware registers. See `afuc
+      `__.
+
+   ROQ
+      DMA engine used by the SQE for reading memory, with some prefetch buffering.
+      Mostly reads in the command stream, but also serves for
+      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.
+
+   SP
+      Shader Processor. Unified, scalar shader engine. One or more, depending on
+      GPU and tier.
+
+   TP
+      Texture Processor.
+
+   UCHE
+      Unified L2 Cache. 32KB on A330, unclear how big now.
+
+   CCU
+      Color Cache Unit.
+
+   VSC
+      Visibility Stream Compressor.
+
+   PVS
+      Primitive Visibility Stream.
+
+   FE
+      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
+      VFD, VPC.
+
+   VFD
+      Vertex Fetch and Decode.
+
+   VPC
+      Varying/Position Cache? Hardware block that stores shaded vertex data for
+      primitive assembly.
+
+   HLSQ
+      High Level Sequencer. Manages state for the SPs, batches up PS invocations
+      between primitives, and is involved in preemption.
+
+   PC_VS
+      Cluster where varyings are read from VPC and assembled into primitives to
+      feed GRAS.
+
+   VS
+      Vertex Shader. Responsible for generating VS/GS/tess invocations.
+
+   GRAS
+      Rasterizer. Responsible for generating PS invocations from primitives;
+      also does LRZ.
+
+   PS
+      Pixel Shader.
+
+   RB
+      Render Backend. Performs both early and late Z testing, blending, and
+      attachment stores of the output of the PS.
+
+   GMEM
+      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
+      attachments during tiled rendering.
+
+   LRZ
+      Low Resolution Z. A low resolution area of the depth buffer that can be
+      initialized during the binning pass to contain the worst-case (farthest) Z
+      values in a block, and then used to early reject fragments during
+      rasterization.
+
+Cache hierarchy
+^^^^^^^^^^^^^^^
+
+The a6xx GPUs have two main caches: CCU and UCHE.
+
+UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
+texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
+flushes access system memory.
+
+The CCU is the separate cache used by 2D blits and sysmem render target access
+(and also for resolves to system memory when in GMEM mode). Its memory comes
+from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
+reserved based on whether we're in a render pass using GMEM for attachment
+storage, or we're doing sysmem rendering. Cache entries have the attachment
+number and layer mixed into the cache tag in some way, likely so that a
+fragment's access is spread through the cache even if the attachments are the
+same size and alignment in address space. This means that the cache must be
+flushed and invalidated between memory being used for one attachment and another
+(notably depth vs color, but also MRT color).
+
+The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
+unclear how big now) before accessing UCHE. This cache is used for normal
+sampling like ``sam`` and ``isam`` (and the compiler will make read-only
+storage image access through it as well). It is not coherent with UCHE (may get
+stale results when you ``sam`` after ``stib``), but must get flushed per draw or
+something because you don't need a manual invalidate between draws storing to an
+image and draws sampling from a texture.
+
+The command processor (CP) does not read from either of these caches, and
+instead uses FIFOs in the ROQ to avoid stalls reading from system memory.
+
+Draw states
+^^^^^^^^^^^
+
+Since the SQE is not a fast processor, and tiled rendering means that many
+draws won't even be used in many bins, state updates on a5xx and later can be
+batched up into "draw states" that point to a fragment of CP packets. At draw
+time, if the draw call is going to actually execute (some primitive is visible
+in the current tile), the SQE goes through the ``GROUP_ID``\s and, for any with
+an update since the last time they were executed, executes the corresponding
+fragment.
+
+Starting with a6xx, states can be tagged with whether they should be executed
+at draw time for any of sysmem, binning, or tile rendering. This allows a
+single command stream to be generated which can be executed in any of the
+modes, unlike pre-a6xx where we had to generate separate command lists for the
+binning and rendering phases.
+
+Note that this means that the generated draw state has to always update all of
+the state you have chosen to pack into that ``GROUP_ID``, since any of your
+previous state changes in a previous draw state command may have been skipped.
+
+Pipelining (a6xx+)
+^^^^^^^^^^^^^^^^^^
+
+Most CP commands write to registers. In a6xx+, the registers are located in
+clusters corresponding to the stage of the pipeline they are used from (see
+``enum tu_stage`` for a list).
+To pipeline state updates and drawing, registers generally have two copies
+("contexts") in their cluster, so previous draws can be working on the previous
+set of register state while the next draw's state is being set up. You can find
+what registers go into which clusters by looking at :command:`crashdec` output
+in the ``regs-name: CP_MEMPOOL`` section.
+
+As SQE processes register writes in the command stream, it sends them into a
+per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to
+process their stream of register updates and events independently of each other
+(so even with just 2 contexts in a stage, earlier stages can proceed on to later
+draws before later stages have caught up).
+
+Each cluster has a per-context bit indicating that the context is done/free.
+Register writes will stall until the context is done.
+
+During a 3D draw command, SQE generates several internal events that flow
+through the pipeline:
+
+- ``CP_EVENT_START`` clears the done bit for the context when written to the
+  cluster.
+- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
+  the actual event/drawing.
+- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
+  done flag.
+- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
+  the registers that were dirtied in this context to that one.
+
+The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``, and
+``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
+rollover.
+
+Because the clusters proceed independently of each other even across draws, if
+you need to synchronize an earlier cluster to the output of a later one, then
+you will need a ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
+necessary caches.
+
+Also, note that some registers are not banked at all, and will require a
+``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.
+
+In a2xx-a4xx, there weren't per-stage clusters; instead there were two register
+banks that the hardware flipped between per draw.
+
+Software Architecture
+---------------------
+
+Freedreno and Turnip use a shared core for shader compiler, image layout, and
+register and command stream definitions. They implement separate state
+management and command stream generation.
 
 .. toctree::
    :glob:
 
    freedreno/*
 
-See the `Freedreno Wiki `__
-for more details.
+GPU hang debugging
+^^^^^^^^^^^^^^^^^^
+
+A kernel message from DRM of "gpu fault" can mean any sort of error reported by
+the GPU (including its internal hang detection). If a fault in GPU address
+space happened, you should expect to find a message from the iommu, with the
+faulting address and the hardware unit involved:
+
+.. code-block:: console
+
+   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)
+
+On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
+``/sys/devices/virtual/devcoredump/**/data``. You can cp that file to a
+:file:`crash.devcore` to save it; otherwise the kernel will expire it
+eventually. Echo 1 to the file to free the core early, as another core won't be
+taken until then.
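+
+For example, the save-and-free sequence might look like this (a sketch that
+assumes the core showed up as ``devcd1``; the actual devcoredump instance name
+varies):
+
+.. code-block:: console
+
+   # Save a copy of the core before the kernel expires it.
+   cp /sys/devices/virtual/devcoredump/devcd1/data crash.devcore
+   # Free the core so that the next fault can be captured.
+   echo 1 > /sys/devices/virtual/devcoredump/devcd1/data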
+
+Once you have your core file, you can use :command:`crashdec -f crash.devcore`
+to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
+estimate the CP to have stopped. Note that it is expected that this will be
+some distance past whatever state triggered the fault, given GPU pipelining, and
+will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
+``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
+event. You can try running the workload with ``TU_DEBUG=flushall`` or
+``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.
+
+You can also find what commands were queued up to each cluster in the
+``regs-name: CP_MEMPOOL`` section.
+
+Command Stream Capture
+^^^^^^^^^^^^^^^^^^^^^^
+
+During Mesa development, it's often useful to look at the command streams we
+send to the kernel. Mesa itself doesn't implement a way to stream them out
+(though it maybe should!). Instead, we have an interface for the kernel to
+capture all submitted command streams:
+
+.. code-block:: console
+
+   cat /sys/kernel/debug/dri/0/rd > cmdstream &
+
+By default, command stream capture does not capture texture/vertex/etc. data.
+You can enable capturing all the BOs with:
+
+.. code-block:: console
+
+   echo Y > /sys/module/msm/parameters/rd_full
+
+Note that, since all command streams get captured, it is easy to run the system
+out of memory doing this, so you probably don't want to enable it during play of
+a heavyweight game. Instead, to capture a command stream within a game, you
+probably want to cause a crash in the GPU during a frame of interest so that a
+single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
+enough to cause a fault.
-- 
2.7.4