drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
author YuBiao Wang <YuBiao.Wang@amd.com>
Thu, 16 Mar 2023 03:30:32 +0000 (11:30 +0800)
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thu, 20 Apr 2023 10:35:11 +0000 (12:35 +0200)
[ Upstream commit 033c56474acf567a450f8bafca50e0b610f2b716 ]

[Why]
For engines that do not support soft reset, e.g. VCN, an ib test fails
before the mode 1 reset during asic reset. The fences in this case are
never signaled, and the next time we try to free the sa_bo the kernel
hangs.

[How]
During pre_asic_reset, the driver clears the job fences, after which each
fence's refcount drops to 1. For drm_sched_jobs the remaining reference is
released in job_free_cb; for non-sched jobs such as ib_test it is meant to
be released in sa_bo_free, but only once the fence is signaled. So we have
to force-signal the non-sched bad job's fence during pre_asic_reset, or
the clear is incomplete.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 6fdb679321d0d368b40926cc7b4c3dc0215559b5..3cc1929285fc097e928013bf9fe82076dac37428 100644
@@ -624,6 +624,15 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
                ptr = &ring->fence_drv.fences[i];
                old = rcu_dereference_protected(*ptr, 1);
                if (old && old->ops == &amdgpu_job_fence_ops) {
+                       struct amdgpu_job *job;
+
+                       /* For non-scheduler bad job, i.e. failed ib test, we need to signal
+                        * it right here or we won't be able to track them in fence_drv
+                        * and they will remain unsignaled during sa_bo free.
+                        */
+                       job = container_of(old, struct amdgpu_job, hw_fence);
+                       if (!job->base.s_fence && !dma_fence_is_signaled(old))
+                               dma_fence_signal(old);
                        RCU_INIT_POINTER(*ptr, NULL);
                        dma_fence_put(old);
                }
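
The reasoning in [How] can be illustrated with a small userspace model. This
is only a sketch, not kernel code: every toy_* name is hypothetical, the
refcount is a plain int rather than a kref, and the "hang" in sa_bo free is
modeled as a function that returns false when the fence is still unsignaled.
It assumes the ib test fence starts with two references (one held by the
fence_drv slot, one by the suballocator), matching the commit message's
statement that the refcount drops to 1 after the clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for dma_fence and amdgpu_job; not the kernel's types. */
struct toy_fence {
	int refcount;
	bool signaled;
};

struct toy_job {
	bool has_sched_fence;      /* models job->base.s_fence != NULL */
	struct toy_fence hw_fence; /* embedded in the job, as in the patch */
};

#define TOY_SLOTS 4

struct toy_ring {
	struct toy_fence *fences[TOY_SLOTS]; /* models ring->fence_drv.fences */
};

/* Recover the enclosing job from a pointer to its embedded fence. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/*
 * Models the clear loop: drop the slot's reference to each job fence.
 * With force_signal set, a job without a scheduler fence is signaled first,
 * which is the step the patch adds.
 */
static void toy_clear_job_fences(struct toy_ring *ring, bool force_signal)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		struct toy_fence *old = ring->fences[i];
		struct toy_job *job;

		if (!old)
			continue;
		job = toy_container_of(old, struct toy_job, hw_fence);
		if (force_signal && !job->has_sched_fence && !old->signaled)
			old->signaled = true;
		ring->fences[i] = NULL;
		toy_fence_put(old);
	}
}

/*
 * Models sa_bo free: the last reference is only dropped once the fence is
 * signaled. Returning false stands in for waiting forever.
 */
static bool toy_sa_bo_try_free(struct toy_fence *f)
{
	if (!f->signaled)
		return false;
	toy_fence_put(f);
	return true;
}

/* Runs a failed ib test through reset; returns the refcount left over. */
static int toy_run_scenario(bool force_signal)
{
	struct toy_job ib_test = {
		.has_sched_fence = false,
		.hw_fence = { .refcount = 2, .signaled = false },
	};
	struct toy_ring ring = { .fences = { &ib_test.hw_fence } };

	toy_clear_job_fences(&ring, force_signal);
	toy_sa_bo_try_free(&ib_test.hw_fence);
	return ib_test.hw_fence.refcount;
}
```

In this model, toy_run_scenario(false) leaves one reference behind because
the fence is never signaled, while toy_run_scenario(true) lets the last
reference be dropped.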