drm/amdgpu: race issue when jobs on 2 ring timeout
author Horace Chen <horace.chen@amd.com>
Wed, 20 Jan 2021 14:03:28 +0000 (22:03 +0800)
committer Alex Deucher <alexander.deucher@amd.com>
Mon, 25 Jan 2021 22:45:16 +0000 (17:45 -0500)
commit 91fb309d8294be5ab03746638e10bb3e5680f348
tree 74e197780fc7941ae409f12a61540c7bae762153
parent eda1068dc995fbc87eee04496a3414372a8ef63d

Fix a race condition when jobs on two rings time out simultaneously.

If two rings time out at the same time, amdgpu_device_gpu_recover
is entered concurrently by both timeout handlers. adev->gmc.xgmi.head
is then grabbed by two local linked lists, which may cause wild
pointer accesses while iterating.

Lock the device early to prevent the node from being added to two
different lists.

Also increase karma for the skipped job, since that job also
timed out and should be marked guilty.
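
The two changes described above can be modelled in a few lines of
userspace C. This is a minimal sketch and not the code from this patch:
model_device, model_job, model_recover_begin and model_recover_end are
hypothetical names, and a C11 atomic_flag stands in for the driver's
device lock. It only shows the ordering: take the lock before touching
any list, and still charge the job when recovery is skipped.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the amdgpu structures. */
struct model_device {
    atomic_flag in_reset;    /* set while one recovery owns the device */
    int recoveries_started;  /* how many callers got past the lock */
};

struct model_job {
    int karma;               /* guilt counter for a timed-out job */
};

/*
 * Try to start a recovery for a timed-out job.
 * The device is locked first, before any list would be built, so a
 * second caller never links the same node into its own local list.
 * Returns true if this caller owns the recovery, false if it was skipped.
 */
static bool model_recover_begin(struct model_device *dev,
                                struct model_job *job)
{
    if (atomic_flag_test_and_set(&dev->in_reset)) {
        /* Another recovery is in progress: skip, but the job
         * still timed out, so it is still counted as guilty. */
        if (job)
            job->karma++;
        return false;
    }

    /* Only the lock holder reaches this point; this is where the
     * local device list would be assembled and walked. */
    dev->recoveries_started++;
    if (job)
        job->karma++;
    return true;
}

/* Release the device once the recovery has finished. */
static void model_recover_end(struct model_device *dev)
{
    atomic_flag_clear(&dev->in_reset);
}
```

In this model, a second call to model_recover_begin made while the
first caller still holds the device returns false and leaves the
device untouched, yet the skipped job's karma is incremented.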

Signed-off-by: Horace Chen <horace.chen@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c