From c3701e90707805d622cf51a85b02c53b141f945c Mon Sep 17 00:00:00 2001
From: Kenneth Graunke <kenneth@whitecape.org>
Date: Tue, 7 May 2019 23:03:46 -0700
Subject: [PATCH] iris: Mark render batches as non-recoverable.

Adapted from Chris Wilson's patch.  The comment is largely his.

Currently, when iris hangs the GPU, it will continue sending batches
which incrementally update the state, assuming it's preserved across
batches.  However, the kernel's GPU reset support reinitializes the
guilty context to the default GPU state (reasonably not wanting to
trust the current state).  This ends up resetting critical things
like STATE_BASE_ADDRESS, causing memory accesses in all subsequent
batches to be garbage, and almost certainly result in more hangs
until we're banned or we kill the machine.

We now ask the kernel to ban our render context immediately, so we
notice we've gone off the rails as fast as possible.  Eventually, we'll
attempt to recover and continue.  For now, we just avoid torching the
GPU over and over.
---
 src/gallium/drivers/iris/iris_bufmgr.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/src/gallium/drivers/iris/iris_bufmgr.c b/src/gallium/drivers/iris/iris_bufmgr.c
index 5b807e0..808f20d 100644
--- a/src/gallium/drivers/iris/iris_bufmgr.c
+++ b/src/gallium/drivers/iris/iris_bufmgr.c
@@ -1412,6 +1412,28 @@ iris_create_hw_context(struct iris_bufmgr *bufmgr)
       return 0;
    }
 
+   /* Upon declaring a GPU hang, the kernel will zap the guilty context
+    * back to the default logical HW state and attempt to continue on to
+    * our next submitted batchbuffer.  However, our render batches assume
+    * the previous GPU state is preserved, and only emit commands needed
+    * to incrementally change that state.  In particular, we inherit the
+    * STATE_BASE_ADDRESS and PIPELINE_SELECT settings, which are critical.
+    * With default base addresses, our next batches will almost certainly
+    * cause more GPU hangs, leading to repeated hangs until we're banned
+    * or the machine is dead.
+    *
+    * Here we tell the kernel not to attempt to recover our context but
+    * immediately (on the next batchbuffer submission) report that the
+    * context is lost, and we will do the recovery ourselves.  Ideally,
+    * we'll have two lost batches instead of a continual stream of hangs.
+    */
+   struct drm_i915_gem_context_param p = {
+      .ctx_id = create.ctx_id,
+      .param = I915_CONTEXT_PARAM_RECOVERABLE,
+      .value = false,
+   };
+   drmIoctl(bufmgr->fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
+
    return create.ctx_id;
 }
 
-- 
2.7.4