drm/etnaviv: consider completed fence seqno in hang check
author Lucas Stach <l.stach@pengutronix.de>
Wed, 22 Dec 2021 00:17:28 +0000 (01:17 +0100)
committer Lucas Stach <l.stach@pengutronix.de>
Thu, 23 Dec 2021 19:21:33 +0000 (20:21 +0100)
Some GPU-heavy test programs manage to trigger the hangcheck quite often.
If there are no other GPU users in the system and the test program
exhibits a very regular structure in the command streams being
submitted, two distinct submits can each trigger the hangcheck with
the FE (front end) in a very similar address range. The hangcheck then
believes that the GPU is stuck, while in reality the GPU has moved on
and is busy working on a different job. To avoid those spurious
GPU resets, also remember and consider the last completed fence seqno
in the hang check.

Reported-by: Joerg Albert <joerg.albert@iav.de>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
drivers/gpu/drm/etnaviv/etnaviv_gpu.h
drivers/gpu/drm/etnaviv/etnaviv_sched.c

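diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
index 1c75c8e..85eddd4 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h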
@@ -130,6 +130,7 @@ struct etnaviv_gpu {
 
        /* hang detection */
        u32 hangcheck_dma_addr;
+       u32 hangcheck_fence;
 
        void __iomem *mmio;
        int irq;
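diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 180bb63..58f593b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c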
@@ -107,8 +107,10 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
         */
        dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
        change = dma_addr - gpu->hangcheck_dma_addr;
-       if (change < 0 || change > 16) {
+       if (gpu->completed_fence != gpu->hangcheck_fence ||
+           change < 0 || change > 16) {
                gpu->hangcheck_dma_addr = dma_addr;
+               gpu->hangcheck_fence = gpu->completed_fence;
                goto out_no_timeout;
        }
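
For illustration only, the progress check introduced by this change can be modelled in userspace roughly as follows. This is a sketch, not driver code: `struct sim_gpu` and `sim_hangcheck_progress()` are hypothetical names, the read of `VIVS_FE_DMA_ADDRESS` is replaced by a plain function argument, and `change` is assumed to be a signed 32-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hang detection state in struct etnaviv_gpu. */
struct sim_gpu {
	uint32_t hangcheck_dma_addr; /* FE address seen at the last check */
	uint32_t hangcheck_fence;    /* completed fence seen at the last check */
	uint32_t completed_fence;    /* last fence seqno the GPU has completed */
};

/*
 * Returns true if the GPU appears to have made progress since the previous
 * check, which corresponds to the "goto out_no_timeout" path in the patch.
 * dma_addr stands in for the value read from the FE DMA address register.
 */
static bool sim_hangcheck_progress(struct sim_gpu *gpu, uint32_t dma_addr)
{
	/* Assumption: signed difference, so a backwards move is negative. */
	int32_t change = (int32_t)(dma_addr - gpu->hangcheck_dma_addr);

	if (gpu->completed_fence != gpu->hangcheck_fence ||
	    change < 0 || change > 16) {
		/* Remember both values for the next timeout check. */
		gpu->hangcheck_dma_addr = dma_addr;
		gpu->hangcheck_fence = gpu->completed_fence;
		return true;
	}

	/* Same fence and the FE barely moved: treat the GPU as stuck. */
	return false;
}
```

In this model, two timeouts that find the FE at nearly the same address no longer count as a hang when `completed_fence` advanced between them, which is the situation the commit message describes.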