PCI/PM: Extend D3hot delay for NVIDIA HDA controllers
authorAlex Williamson <alex.williamson@redhat.com>
Thu, 13 Apr 2023 19:40:42 +0000 (13:40 -0600)
committerGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Thu, 11 May 2023 14:03:29 +0000 (23:03 +0900)
commit08e9653bb9e88f4df1219857765b69a0874935b6
treecc0a9aa007636baf295dd0534eb5362cfbbb704d
parent402299cca89273b62384b5f9645ea49cd5fc4a57
PCI/PM: Extend D3hot delay for NVIDIA HDA controllers

[ Upstream commit a5a6dd2624698b6e3045c3a1450874d8c790d5d9 ]

Assignment of NVIDIA Ampere-based GPUs have seen a regression since the
below referenced commit, where the reduced D3hot transition delay appears
to introduce a small window where a D3hot->D0 transition followed by a bus
reset can wedge the device.  The entire device is subsequently unavailable,
returning -1 on config space read and is unrecoverable without a host
reset.

This has been observed with RTX A2000 and A5000 GPU and audio functions
assigned to a Windows VM, where shutdown of the VM places the devices in
D3hot prior to vfio-pci performing a bus reset when userspace releases the
devices.  The issue has roughly a 2-3% chance of occurring per shutdown.

Restoring the HDA controller d3hot_delay to the effective value before the
below commit has been shown to resolve the issue.  NVIDIA confirms this
change should be safe for all of their HDA controllers.

Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
Link: https://lore.kernel.org/r/20230413194042.605768-1-alex.williamson@redhat.com
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Tarun Gupta <targupta@nvidia.com>
Cc: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/pci/quirks.c