NFSv4 client live hangs after live data migration recovery
authorBill Baker <Bill.Baker@Oracle.com>
Tue, 19 Jun 2018 21:24:58 +0000 (16:24 -0500)
committerGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Sun, 9 Sep 2018 18:01:23 +0000 (20:01 +0200)
commitb5bc39d4ffb53b0fb335d96d3abf1a4585019c0a
tree15d71caf3c052c05b78df21a54f67e19fdaae11b
parent9ba1a9ea7cb781fa293831ad77b4632b6db90f8c
NFSv4 client live hangs after live data migration recovery

commit 0f90be132cbf1537d87a6a8b9e80867adac892f6 upstream.

After a live data migration event at the NFS server, the client may send
I/O requests to the wrong server, causing a live hang due to repeated
recovery events.  On the wire, this will appear as an I/O request failing
with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
NFS4ERR_BADSSESSION is returned because the session ID being used was
issued by the other server and is not valid at the old server.

The failure is caused by async worker threads having cached the transport
(xprt) in the rpc_task structure.  After the migration recovery completes,
the task is redispatched and the task resends the request to the wrong
server based on the old value still present in tk_xprt.

The solution is to recompute the tk_xprt field of the rpc_task structure
so that the request goes to the correct server.

Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Fixes: fb43d17210ba ("SUNRPC: Use the multipath iterator to assign a ...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fs/nfs/nfs4proc.c
include/linux/sunrpc/clnt.h
net/sunrpc/clnt.c