xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()
author Chuck Lever <chuck.lever@oracle.com>
Sat, 27 Jun 2020 16:35:09 +0000 (12:35 -0400)
committer Anna Schumaker <Anna.Schumaker@Netapp.com>
Mon, 13 Jul 2020 14:50:41 +0000 (10:50 -0400)
commit 4cf44be6f1e86da302085bf3e1dc2c86f3cdaaaa
tree 272261fd0bd0b64b7468338f9cb512656f71f172
parent 85bfd71bc34e20d9fadb745131f6314c36d0f75b

Both Dan and I have observed two processes invoking
rpcrdma_xprt_disconnect() concurrently. In my case:

1. The connect worker invokes rpcrdma_xprt_disconnect(), which
   drains the QP and waits for the final completion
2. This causes the newly posted Receive to flush and invoke
   xprt_force_disconnect()
3. xprt_force_disconnect() sets CLOSE_WAIT and wakes up the RPC task
   that is holding the transport lock
4. The RPC task invokes xprt_connect(), which calls ->ops->close
5. xprt_rdma_close() invokes rpcrdma_xprt_disconnect(), which tries
   to destroy the QP.

Deadlock.
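
The sequence above can be sketched as a small user-space model. This is
only an illustration, not the kernel code: every name here (model_xprt,
model_disconnect(), model_receive_flushed(), model_run()) is
hypothetical, and the two processes are collapsed into one nested call
so that the re-entry can be counted instead of actually hanging.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_xprt {
	bool disconnecting;	/* a teardown is already in progress */
	bool wake_on_flush;	/* true: a flushed Receive wakes the lock holder */
	int reentries;		/* entries made while a teardown was running */
	int teardowns;		/* teardowns that ran to completion */
};

static void model_disconnect(struct model_xprt *x);

/*
 * Stands in for steps 2-5: the flushed Receive forces a disconnect,
 * the woken task calls close, and close calls disconnect again.
 */
static void model_receive_flushed(struct model_xprt *x)
{
	if (x->wake_on_flush)
		model_disconnect(x);
}

/* Stands in for step 1: drain, wait for the final completion, tear down. */
static void model_disconnect(struct model_xprt *x)
{
	if (x->disconnecting) {
		/* In the real transport the second caller does not return
		 * like this; the model only records that it got here. */
		x->reentries++;
		return;
	}
	x->disconnecting = true;
	model_receive_flushed(x);	/* draining flushes the posted Receive */
	x->teardowns++;
	x->disconnecting = false;
}

/* Run one disconnect and report how many times it was re-entered. */
static int model_run(bool wake_on_flush)
{
	struct model_xprt x = { .wake_on_flush = wake_on_flush };

	model_disconnect(&x);
	return x.reentries;
}
```

In this model, model_run(true) reports one re-entry, corresponding to
the behavior described in the steps above, while model_run(false)
reports none, which is the property the fix is after: the flush path
no longer wakes anything that can call back into the disconnect.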

To prevent xprt_force_disconnect() from waking anything, handle the
cleanup after a failed connection attempt in the xprt's sndtask.

The retry loop is removed from rpcrdma_xprt_connect() to ensure
that the newly allocated ep and id are properly released before
a REJECTED connection attempt can be retried.
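
A rough user-space sketch of that ordering follows. It is an assumption
about the shape of the change, not the patch itself: model_ep,
model_connect_once(), model_connect_with_retry() and the status codes
are all hypothetical names. Each attempt releases what it allocated
before it returns a failure, and the retry is driven by the caller
rather than by a loop inside the attempt.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes and endpoint type, for illustration only. */
enum model_status { MODEL_OK = 0, MODEL_REJECTED = -1, MODEL_NOMEM = -2 };

struct model_ep {
	int id;			/* which attempt created this endpoint */
};

static int model_live_eps;	/* endpoints currently allocated */

/*
 * One connection attempt with no internal retry. On failure the newly
 * allocated endpoint is released before the error is returned.
 */
static enum model_status model_connect_once(int attempt, int succeed_on,
					    struct model_ep **out)
{
	struct model_ep *ep = malloc(sizeof(*ep));

	*out = NULL;
	if (!ep)
		return MODEL_NOMEM;
	model_live_eps++;
	ep->id = attempt;

	if (attempt < succeed_on) {	/* models a REJECTED attempt */
		free(ep);
		model_live_eps--;
		return MODEL_REJECTED;
	}
	*out = ep;
	return MODEL_OK;
}

/*
 * The caller retries. Returns the attempt number that succeeded,
 * or 0 if every attempt failed.
 */
static int model_connect_with_retry(int succeed_on, int max_attempts,
				    struct model_ep **out)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (model_connect_once(attempt, succeed_on, out) == MODEL_OK)
			return attempt;
		/* Nothing from the failed attempt is still allocated here. */
	}
	return 0;
}
```

With succeed_on set to 3, the first two attempts fail and free their
endpoints, so only the endpoint from the successful third attempt is
live when model_connect_with_retry() returns.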

Reported-by: Dan Aloni <dan@kernelim.com>
Fixes: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
net/sunrpc/xprtrdma/transport.c
net/sunrpc/xprtrdma/verbs.c