From 205da24343013e0bd62475800df79cd053f22326 Mon Sep 17 00:00:00 2001
From: Sagi Grimberg
Date: Fri, 30 Aug 2019 11:00:59 -0700
Subject: [PATCH] nvme: fix ns removal hang when failing to revalidate due to a transient error

If a controller reset is racing with a namespace revalidation, the
revalidation (admin) I/O will surely fail, but we should not remove the
namespace as we will execute the I/O when the controller is back up.
Same for spurious allocation errors (return -ENOMEM).

Fix this by checking the specific error code in nvme_revalidate_disk
and if it is a transient error (for example non-DNR nvme statuses or a
negative ENOMEM as allocation failure), do not remove the namespace as
it will either recover when the controller is back up and schedule a
subsequent scan, or the controller is going away and the namespaces
will be removed anyways.

This fixes a hang when namespace scanning races with a controller reset
and also spurious I/O errors in path failover conditions where the
controller reset is racing with the namespace scan work with multipath
enabled.

Reported-by: Hannes Reinecke
Reviewed-by: Hannes Reinecke
Reviewed-by: James Smart
Reviewed-by: Christoph Hellwig
Signed-off-by: Sagi Grimberg
---
 drivers/nvme/host/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f15a77d..fad0428 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1765,7 +1765,13 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 free_id:
 	kfree(id);
 out:
-	if (ret > 0)
+	/*
+	 * Only fail the function if we got a fatal error back from the
+	 * device, otherwise ignore the error and just move on.
+	 */
+	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
+		ret = 0;
+	else if (ret > 0)
 		ret = blk_status_to_errno(nvme_error_status(ret));
 	return ret;
 }
-- 
2.7.4
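
For reviewers who want to check which return values the new condition swallows, here is a minimal userspace sketch of the "out:" logic. It is not kernel code: revalidate_result() and fatal_status_to_errno() are hypothetical names made up for this illustration, and the latter is only a stand-in for blk_status_to_errno(nvme_error_status(ret)), collapsing every fatal status to -EIO. NVME_SC_DNR is given the value it has in include/linux/nvme.h.

```c
#include <errno.h>

/* "Do not retry" bit of an NVMe status, as in include/linux/nvme.h. */
#define NVME_SC_DNR 0x4000

/*
 * Hypothetical stand-in for blk_status_to_errno(nvme_error_status(ret)).
 * The real helpers map each NVMe status to a specific errno; this sketch
 * assumes a single -EIO for every fatal status.
 */
static int fatal_status_to_errno(int status)
{
	(void)status;
	return -EIO;
}

/*
 * Hypothetical userspace mirror of the patched "out:" label.
 * ret > 0 is an NVMe status, ret < 0 is a negative errno, 0 is success.
 */
static int revalidate_result(int ret)
{
	/* Transient: allocation failure, or an NVMe status without DNR. */
	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
		return 0;
	/* Fatal NVMe status (DNR set): convert to a negative errno. */
	if (ret > 0)
		return fatal_status_to_errno(ret);
	/* Success or any other negative errno is passed through unchanged. */
	return ret;
}
```

Under these assumptions, a status with DNR set (for example 0x4002) still fails the revalidation, while the same status without DNR (0x2) and -ENOMEM are reported as success so the namespace is kept. Negative errors other than -ENOMEM are passed through untouched, as in the patch.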