platform/kernel/linux-exynos.git
Trond Myklebust [Mon, 14 Nov 2016 16:19:55 +0000 (11:19 -0500)]
NFSv4: Fix CLOSE races with OPEN

If the reply to a successful CLOSE call races with an OPEN to the same
file, we can end up scribbling over the stateid that represents the
new open state.
The race looks like:

  Client                                Server
  ======                                ======

  CLOSE stateid A on file "foo"
                                        CLOSE stateid A, return stateid C
  OPEN file "foo"
                                        OPEN "foo", return stateid B
  Receive reply to OPEN
  Reset open state for "foo"
  Associate stateid B to "foo"

  Receive CLOSE for A
  Reset open state for "foo"
  Replace stateid B with C

The fix is to examine the argument of the CLOSE, and check for a match
with the current stateid "other" field. If the two do not match, then
the above race occurred, and we should just ignore the CLOSE.
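
A minimal user-space sketch of the check described above, assuming a
simplified stateid layout (a 32-bit seqid followed by a 12-byte "other"
field). The names demo_stateid, demo_match_other and demo_close_done are
hypothetical and are not the kernel's functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an NFSv4 stateid: seqid plus 12-byte "other". */
struct demo_stateid {
	uint32_t seqid;
	unsigned char other[12];
};

/* True when both stateids carry the same "other" field. */
static bool demo_match_other(const struct demo_stateid *a,
			     const struct demo_stateid *b)
{
	return memcmp(a->other, b->other, sizeof(a->other)) == 0;
}

/*
 * Apply a CLOSE reply. 'arg' is the stateid that was sent in the CLOSE,
 * 'res' is the stateid the server returned and 'cur' is the open state.
 * Returns true if the open state was updated, false if the reply was
 * ignored because an OPEN replaced the stateid in the meantime.
 */
static bool demo_close_done(struct demo_stateid *cur,
			    const struct demo_stateid *arg,
			    const struct demo_stateid *res)
{
	if (!demo_match_other(cur, arg))
		return false;	/* race detected: keep the newer stateid */
	*cur = *res;
	return true;
}
```

With the state from the diagram (current stateid B, CLOSE sent for A,
reply carrying C), demo_close_done() leaves B in place.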

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 10 Nov 2016 21:06:28 +0000 (16:06 -0500)]
NFSv4.1: Fix a regression in DELEGRETURN

We don't want to call nfs4_free_revoked_stateid() in the case where
the delegreturn was successful.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Mon, 7 Nov 2016 21:16:24 +0000 (16:16 -0500)]
xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect

When a LOCALINV WR is flushed, the frmr is marked STALE, then
frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs
are then recovered when frwr_op_map hunts for an INVALID frmr to
use.

All other cases that need frmr recovery leave that SGL DMA-mapped.
The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL.

To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs,
alter the recovery logic (rather than the hot frwr_op_unmap_sync
path) to distinguish among these cases. This solution also takes
care of the case where multiple LOCAL_INV WRs are issued for the
same rpcrdma_req, some complete successfully, but some are flushed.

Reported-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Shuah Khan [Mon, 7 Nov 2016 17:48:16 +0000 (10:48 -0700)]
fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()

Fix the following warn:

fs/nfs/nfs4session.c: In function ‘nfs4_slot_seqid_in_use’:
fs/nfs/nfs4session.c:203:54: warning: ‘cur_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (nfs4_slot_get_seqid(tbl, slotid, &cur_seq) == 0 &&
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      cur_seq == seq_nr && test_bit(slotid, tbl->used_slots))
      ~~~~~~~~~~~~~~~~~

Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Anna Schumaker [Wed, 26 Oct 2016 19:54:31 +0000 (15:54 -0400)]
NFS: Don't print a pNFS error if we aren't using pNFS

We used to check for a valid layout type id before verifying pNFS flags
as an indicator of whether we are using pNFS.  This changed in 3132e49ece
with the introduction of multiple layout types, since now we are passing
an array of ids instead of just one.  Since then, users have been seeing
a KERN_ERR printk show up whenever mounting NFS v4 without pNFS.  This
patch restores the original behavior of exiting set_pnfs_layoutdriver()
early if we aren't using pNFS.

Fixes: 3132e49ece ("pnfs: track multiple layout types in fsinfo
structure")
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Petr Vandrovec [Mon, 7 Nov 2016 20:11:29 +0000 (12:11 -0800)]
NFS: Ignore connections that have cl_rpcclient uninitialized

cl_rpcclient starts as ERR_PTR(-EINVAL), and connections like that
are floating freely through the system.  Most places check whether the
pointer is valid before dereferencing it, but newly added code
in nfs_match_client does not.

This causes crashes when more than one NFS mount point is present.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Anna Schumaker [Wed, 26 Oct 2016 14:33:31 +0000 (10:33 -0400)]
SUNRPC: Fix suspicious RCU usage

We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.

Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Arnd Bergmann [Mon, 17 Oct 2016 22:05:35 +0000 (00:05 +0200)]
NFSv4.1: work around -Wmaybe-uninitialized warning

A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use
if we enable -Wmaybe-uninitialized again:

fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in this function [-Werror=maybe-uninitialized]

gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair
results in a nonzero return value here. Using PTR_ERR_OR_ZERO()
instead makes this clear to the compiler.
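
The idea can be sketched in user space as follows. This is an
illustration only: the demo_ helpers re-create the error-pointer
convention under the assumption that the top 4095 address values encode
errno codes, and demo_slot_get_seqid is a hypothetical stand-in for the
function named in the warning.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value carried by an error pointer. */
static long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when the pointer value lies in the reserved error range. */
static bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Single expression: 0 for a valid pointer, negative errno otherwise. */
static int demo_ptr_err_or_zero(const void *ptr)
{
	return demo_is_err(ptr) ? (int)demo_ptr_err(ptr) : 0;
}

struct demo_slot {
	uint32_t seq_nr;
};

/*
 * *seq_nr is written exactly when the return value is 0, and the
 * compiler can see that from the single 'ret' variable.
 */
static int demo_slot_get_seqid(const struct demo_slot *slot, uint32_t *seq_nr)
{
	int ret = demo_ptr_err_or_zero(slot);

	if (!ret)
		*seq_nr = slot->seq_nr;
	return ret;
}
```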

The warning originally did not appear in v4.8 as it was globally
disabled, but the bugfix that introduced the warning got backported
to stable kernels which again enable it, and this is now the only
warning in the v4.7 builds.

Fixes: e09c978aae5b ("NFSv4.1: Fix Oopsable condition in server callback races")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Benjamin Coddington [Wed, 15 Jun 2016 19:02:55 +0000 (15:02 -0400)]
NFS: Trim extra slash in v4 nfs_path

An NFSv4 mount of a subdirectory will show an extra slash (as in
'server://path') in proc's mountinfo, which will not match the device name
and path.  This can cause problems for programs searching for the mount.
Fix this by checking for a leading slash in the dentry path; if there is
one, trim away any trailing slashes in the device name.
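
A rough user-space sketch of that rule, assuming the device name and the
dentry path are plain C strings. demo_join_devname is a hypothetical
helper and not the kernel's nfs_path().

```c
#include <stdio.h>
#include <string.h>

/*
 * Join a device name and a dentry path into buf.
 * If the path starts with '/', trailing slashes are dropped from the
 * device name so that the join point never contains "//".
 * Returns 0 on success, -1 if buf is too small.
 */
static int demo_join_devname(char *buf, size_t buflen,
			     const char *devname, const char *path)
{
	size_t devlen = strlen(devname);
	int n;

	if (path[0] == '/')
		while (devlen > 0 && devname[devlen - 1] == '/')
			devlen--;

	n = snprintf(buf, buflen, "%.*s%s", (int)devlen, devname, path);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

For example, joining "server:/" and "/path" gives "server:/path" rather
than "server://path".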

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Arnd Bergmann [Tue, 18 Oct 2016 15:21:30 +0000 (17:21 +0200)]
nfs4: fix missing-braces warning

A bugfix introduced a harmless warning for update_open_stateid:

fs/nfs/nfs4proc.c:1548:2: error: missing braces around initializer [-Werror=missing-braces]

Removing the zero in the initializer will do the right thing here
and initialize the entire structure to zero.
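
As an illustration, assuming a structure whose first member is itself an
aggregate (demo_stateid below is a hypothetical layout, not the kernel's
definition): an empty initializer zeroes every member without naming the
first one. Note that empty braces are a GNU C extension and only became
standard C in C23.

```c
#include <stdint.h>

/* Hypothetical layout: the first member is a union holding an array. */
struct demo_stateid {
	union {
		char data[16];
		struct {
			uint32_t seqid;
			char other[12];
		};
	};
	int type;
};

static struct demo_stateid demo_zero_stateid(void)
{
	/*
	 * "= { 0 }" would name only data[0] explicitly, which is what
	 * -Wmissing-braces complained about in the report above.
	 * "= { }" zero-initializes all members.
	 */
	struct demo_stateid s = { };

	return s;
}
```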

Fixes: 1393d9612ba0 ("NFSv4: Fix a race when updating an open_stateid")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Benjamin Coddington [Tue, 11 Oct 2016 19:53:21 +0000 (15:53 -0400)]
pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary

Commit 41963c10c47a35185e68cb9049f7a3493c94d2d7 sets the block layout's
last written byte to the offset of the end of the extent rather than the
end of the write, which incorrectly updates the inode's size for
partial-page writes.

Fixes: 41963c10c47a ("pnfs/blocklayout: update last_write_offset atomically with extents")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Jeff Layton [Tue, 4 Oct 2016 04:07:43 +0000 (00:07 -0400)]
NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

The caller of rpc_run_task also gets a reference that must be put.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Deepa Dinamani [Sat, 1 Oct 2016 23:46:26 +0000 (16:46 -0700)]
fs: nfs: Make nfs boot time y2038 safe

boot_time is represented as a struct timespec.
struct timespec and CURRENT_TIME are not y2038 safe.
Overall, the plan is to use timespec64 and ktime_t for
all internal kernel representation of timestamps.
CURRENT_TIME will also be removed.

boot_time is used to construct the nfs client boot verifier.

Use ktime_t to represent boot_time and ktime_get_real() for
the boot_time value.

Following Trond's request https://lkml.org/lkml/2016/6/9/22 ,
use ktime_t instead of converting to struct timespec64.

Use higher and lower 32 bit parts of ktime_t for the boot
verifier.

Use the lower 32 bit part of ktime_t for the authsys_parms
stamp field.
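
The split can be sketched as below, assuming the boot time is available
as a signed 64-bit nanosecond count. The function names are hypothetical
and any byte-order conversion for the wire format is left out.

```c
#include <stdint.h>

/* Split a 64-bit nanosecond timestamp into two 32-bit verifier words. */
static void demo_boot_verifier(int64_t boot_time_ns, uint32_t verf[2])
{
	uint64_t ns = (uint64_t)boot_time_ns;

	verf[0] = (uint32_t)(ns >> 32);	/* upper 32 bits */
	verf[1] = (uint32_t)ns;		/* lower 32 bits */
}

/* The stamp field takes only the lower 32 bits. */
static uint32_t demo_auth_stamp(int64_t boot_time_ns)
{
	return (uint32_t)(uint64_t)boot_time_ns;
}
```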

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frank Sorenson [Thu, 29 Sep 2016 15:44:41 +0000 (10:44 -0500)]
sunrpc: replace generic auth_cred hash with auth-specific function

Replace the generic code to hash the auth_cred with the call to
the auth-specific hash function in the rpc_authops struct.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frank Sorenson [Thu, 29 Sep 2016 15:44:40 +0000 (10:44 -0500)]
sunrpc: add RPCSEC_GSS hash_cred() function

Add a hash_cred() function for RPCSEC_GSS, using only the
uid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frank Sorenson [Thu, 29 Sep 2016 15:44:39 +0000 (10:44 -0500)]
sunrpc: add auth_unix hash_cred() function

Add a hash_cred() function for auth_unix, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frank Sorenson [Thu, 29 Sep 2016 15:44:38 +0000 (10:44 -0500)]
sunrpc: add generic_auth hash_cred() function

Add a hash_cred() function for generic_auth, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frank Sorenson [Thu, 29 Sep 2016 15:44:37 +0000 (10:44 -0500)]
sunrpc: add hash_cred() function to rpc_authops struct

Currently, a single hash algorithm is used to hash the auth_cred for
the credcache for all rpc_auth types.  Add a hash_cred() function to
the rpc_authops struct to allow a hash function specific to each
auth flavor.
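
A simplified sketch of the pattern, with hypothetical demo_ types in
place of the real rpc_authops and auth_cred, and a plain multiplicative
hash standing in for the kernel's hash helpers. The unix-style flavor
hashes uid and gid, the GSS-style flavor hashes only the uid.

```c
#include <stdint.h>

struct demo_cred {
	uint32_t uid;
	uint32_t gid;
};

struct demo_authops {
	const char *name;
	/* Map a credential to a bucket index in [0, 1 << hashbits). */
	unsigned int (*hash_cred)(const struct demo_cred *cred,
				  unsigned int hashbits);
};

/* Multiplicative 32-bit hash; 'bits' must be between 1 and 31. */
static unsigned int demo_hash32(uint32_t val, unsigned int bits)
{
	return (uint32_t)(val * 0x9E3779B1u) >> (32 - bits);
}

static unsigned int demo_unix_hash_cred(const struct demo_cred *cred,
					unsigned int hashbits)
{
	return demo_hash32(cred->uid * 31u + cred->gid, hashbits);
}

static unsigned int demo_gss_hash_cred(const struct demo_cred *cred,
				       unsigned int hashbits)
{
	return demo_hash32(cred->uid, hashbits);
}

static const struct demo_authops demo_unix_ops = { "unix", demo_unix_hash_cred };
static const struct demo_authops demo_gss_ops = { "gss", demo_gss_hash_cred };

/* The cache lookup asks the flavor for the bucket instead of hashing itself. */
static unsigned int demo_cache_bucket(const struct demo_authops *ops,
				      const struct demo_cred *cred,
				      unsigned int hashbits)
{
	return ops->hash_cred(cred, hashbits);
}
```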

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Olga Kornievskaia [Fri, 23 Sep 2016 21:24:03 +0000 (17:24 -0400)]
Retry operation on EREMOTEIO on an interrupted slot

If an operation was interrupted, we do not know whether the server
processed it or not, so we keep the seq#. If, upon reuse of the slot
and seq#, we get a reply from the cache (i.e. EREMOTEIO), then we
need to retry the operation after bumping the seq#.
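
One way to picture the rule, as a sketch only: demo_slot and
demo_sequence_done are hypothetical, and DEMO_EREMOTEIO is a local
stand-in for the EREMOTEIO status mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_EREMOTEIO (-1)	/* stand-in for a "reply came from the cache" status */

struct demo_slot {
	uint32_t seq_nr;
	bool interrupted;	/* the previous call on this slot was interrupted */
};

/*
 * Handle the status of a call that reused the slot.
 * Returns true if the caller should resend the operation.
 */
static bool demo_sequence_done(struct demo_slot *slot, int status)
{
	if (status == DEMO_EREMOTEIO && slot->interrupted) {
		slot->seq_nr++;		/* move past the interrupted call's seq# */
		slot->interrupted = false;
		return true;
	}
	if (status == 0) {
		slot->seq_nr++;
		slot->interrupted = false;
	}
	return false;
}
```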

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 15 Sep 2016 22:26:05 +0000 (18:26 -0400)]
pNFS: Fix atime updates on pNFS clients

Fix the code so that we always mark the atime as invalid in nfs4_read_done().
Currently, the expectation appears to be that the pNFS drivers should always
do this, with the result that most of them don't.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ke Wang [Thu, 1 Sep 2016 07:30:26 +0000 (15:30 +0800)]
sunrpc: queue work on system_power_efficient_wq

sunrpc uses a workqueue to clean its cache regularly. There is no real
dependency on executing the work on the cpu that queued it.

On an idle system, especially a heterogeneous one like big.LITTLE,
it is observed that the big idle cpu was woken up many times just to service
this work, which goes against the principle of power saving. It would be
better if we could schedule it on a cpu which the scheduler believes to be
the most appropriate one.

After applying this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:21 +0000 (13:39 -0400)]
NFSv4.1: Even if the stateid is OK, we may need to recover the open modes

TEST_STATEID only tells you that you have a valid open stateid. It doesn't
tell the client anything about whether or not it holds the required share
locks.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
[Anna: Wrap nfs_open_stateid_recover_openmode in CONFIG_NFS_V4_1 checks]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:20 +0000 (13:39 -0400)]
NFSv4: If recovery failed for a specific open stateid, then don't retry

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:19 +0000 (13:39 -0400)]
NFSv4: Fix retry issues with nfs41_test/free_stateid

_nfs41_free_stateid() needs to be cached by the session, but
nfs41_test_stateid() may return NFS4ERR_RETRY_UNCACHED_REP (in which
case we should just retry).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:18 +0000 (13:39 -0400)]
NFSv4: Open state recovery must account for file permission changes

If the file permissions change on the server, then we may not be able to
recover open state. If so, we need to ensure that we mark the file
descriptor appropriately.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:17 +0000 (13:39 -0400)]
NFSv4: Mark the lock and open stateids as invalid after freeing them

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:16 +0000 (13:39 -0400)]
NFSv4: Don't test open_stateid unless it is set

We need to test the NFS_OPEN_STATE flag for whether or not the
open_stateid is valid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:15 +0000 (13:39 -0400)]
NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid

If we're not yet sure that all state has expired or been revoked, we
should try to do a minimal recovery on just the one stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:14 +0000 (13:39 -0400)]
NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation

Don't rely on nfs_inode_detach_delegation() succeeding. That can race...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:13 +0000 (13:39 -0400)]
NFSv4: Fix a race when updating an open_stateid

If we're replacing an old stateid which has a different 'other' field,
then we probably need to free the old stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:12 +0000 (13:39 -0400)]
NFSv4: Fix a race in nfs_inode_reclaim_delegation()

If we race with a delegreturn before taking the spin lock, we
currently end up dropping the delegation stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:11 +0000 (13:39 -0400)]
NFSv4: Pass the stateid to the exception handler in nfs4_read/write_done_cb

The actual stateid used in the READ or WRITE can represent a delegation,
a lock or an open stateid, so it is useful to pass it as an argument to the
exception handler when an expired/revoked response is received from the
server. It also ensures that we don't re-label the state as needing
recovery if that has already occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:10 +0000 (13:39 -0400)]
NFSv4.1: nfs4_layoutget_handle_exception handle revoked state

Handle revoked open/lock/delegation stateids when LAYOUTGET tells us
the state was revoked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:09 +0000 (13:39 -0400)]
NFSv4: nfs4_handle_setlk_error() handle expiration as revoke case

If the server tells us our stateid has expired, then handle that as if
it was revoked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:08 +0000 (13:39 -0400)]
NFSv4: nfs4_handle_delegation_recall_error() handle expiration as revoke case

If the server tells us our stateid has expired, then handle that as if
it was revoked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:07 +0000 (13:39 -0400)]
NFSv4: nfs_inode_find_state_and_recover() should check all stateids

Modify the helper nfs_inode_find_state_and_recover() so that it
can check all open/lock/delegation state trackers on that inode for
whether or not they are affected by a revoked stateid error.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:06 +0000 (13:39 -0400)]
NFSv4: Ensure we don't re-test revoked and freed stateids

This fixes a potential infinite loop in nfs_reap_expired_delegations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:05 +0000 (13:39 -0400)]
NFSv4.1: Ensure we call FREE_STATEID if needed on close/delegreturn/locku

If a server returns NFS4ERR_ADMIN_REVOKED, NFS4ERR_DELEG_REVOKED
or NFS4ERR_EXPIRED on a call to close, open_downgrade, delegreturn, or
locku, we should call FREE_STATEID before attempting to recover.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:04 +0000 (13:39 -0400)]
NFSv4.1: FREE_STATEID can be asynchronous

Nothing should need to be serialised with FREE_STATEID on the client,
so let's make the RPC call always asynchronous. Also constify the
stateid argument.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:03 +0000 (13:39 -0400)]
NFSv4.1: Ensure we always run TEST/FREE_STATEID on locks

Right now, we're only running TEST/FREE_STATEID on the locks if
the open stateid recovery succeeds. The protocol requires us to
always do so.
The fix would be to move the call to TEST/FREE_STATEID and do it
before we attempt open recovery.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:02 +0000 (13:39 -0400)]
NFSv4.1: Allow revoked stateids to skip the call to TEST_STATEID

In some cases (e.g. when the SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED sequence
flag is set) we may already know that the stateid was revoked and that the
only valid operation we can call is FREE_STATEID. In those cases, allow
the stateid to carry the information in the type field, so that we skip
the redundant call to TEST_STATEID.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:01 +0000 (13:39 -0400)]
NFSv4.1: Don't recheck delegations that have already been checked

Ensure we don't spam the server with test_stateid() calls for
delegations that have already been checked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:39:00 +0000 (13:39 -0400)]
NFSv4.1: Deal with server reboots during delegation expiration recovery

Ensure that if the server reboots while we're testing and recovering
from revoked delegations, we exit to allow the state manager to
handle matters.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:59 +0000 (13:38 -0400)]
NFSv4.1: Test delegation stateids when server declares "some state revoked"

According to RFC5661, if any of the SEQUENCE status bits
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED,
or SEQ4_STATUS_RECALLABLE_STATE_REVOKED are set, then we need to use
TEST_STATEID to figure out which stateids have been revoked, so we
can acknowledge the loss of state using FREE_STATEID.

While we already do this for open and lock state, we have not been doing
so for all the delegations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:58 +0000 (13:38 -0400)]
NFSv4.x: Allow callers of nfs_remove_bad_delegation() to specify a stateid

Allow the callers of nfs_remove_bad_delegation() to specify the stateid
that needs to be marked as bad.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:57 +0000 (13:38 -0400)]
NFSv4.1: Add a helper function to deal with expired stateids

In NFSv4.1 and newer, if the server decides to revoke some or all of
the protocol state, the client is required to iterate through all the
stateids that it holds and call TEST_STATEID to determine which stateids
still correspond to valid state, and then call FREE_STATEID on the
others.
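
The procedure above can be sketched as a simple loop. Everything here is
hypothetical: demo_state stands in for a held stateid, and the two
callbacks stand in for the TEST_STATEID and FREE_STATEID RPCs (the mock
test callback arbitrarily treats odd ids as revoked).

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_state {
	int id;
	bool valid;	/* cleared once the stateid has been freed */
};

/* Callbacks standing in for the TEST_STATEID and FREE_STATEID calls. */
typedef bool (*demo_test_fn)(const struct demo_state *st);
typedef void (*demo_free_fn)(struct demo_state *st);

/*
 * Test every stateid that is still held and free the ones the server
 * no longer recognizes. Returns the number of stateids freed.
 */
static size_t demo_reap_revoked(struct demo_state *states, size_t n,
				demo_test_fn test, demo_free_fn free_fn)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!states[i].valid)
			continue;	/* already freed, do not re-test */
		if (test(&states[i]))
			continue;	/* still valid on the server */
		free_fn(&states[i]);
		states[i].valid = false;
		freed++;
	}
	return freed;
}

/* Mock server behaviour for illustration: odd ids have been revoked. */
static bool demo_mock_test(const struct demo_state *st)
{
	return st->id % 2 == 0;
}

static int demo_free_calls;

static void demo_mock_free(struct demo_state *st)
{
	(void)st;
	demo_free_calls++;
}
```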

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:56 +0000 (13:38 -0400)]
NFSv4.1: Allow test_stateid to handle session errors without waiting

If the server crashes while we're testing stateids for validity, then
we want to initiate session recovery. Usually, we will be calling from
a state manager thread, though, so we don't really want to wait.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:55 +0000 (13:38 -0400)]
NFSv4.1: Don't check delegations that are already marked as revoked

If the delegation has been marked as revoked, we don't have to test
it, because we should already have called FREE_STATEID on it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:54 +0000 (13:38 -0400)]
NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid

We must not allow the use of delegations that have been revoked or are
being returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 869f9dfa4d6d ("NFSv4: Fix races between nfs_remove_bad_delegation()...")
Cc: stable@vger.kernel.org # v3.19+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:53 +0000 (13:38 -0400)]
NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()

If the delegation is revoked, then it can't be used for caching.

Fixes: 869f9dfa4d6d ("NFSv4: Fix races between nfs_remove_bad_delegation()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.19+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:52 +0000 (13:38 -0400)]
NFS: Fix inode corruption in nfs_prime_dcache()

Due to inode number reuse in filesystems, we can end up corrupting the
inode on our client if we apply the file attributes without ensuring that
the filehandle matches.
Typical symptoms include spurious "mode changed" reports in the syslog.

We still do want to ensure that we don't invalidate the dentry if the
inode number matches, but we don't have a filehandle.
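
A sketch of the decision described above, under stated assumptions: the
demo_ types are hypothetical simplifications, and the three outcomes are
labels chosen for this example rather than kernel return values.

```c
#include <stdint.h>
#include <string.h>

struct demo_fh {
	unsigned short size;	/* 0 means no filehandle was supplied */
	unsigned char data[16];
};

struct demo_inode {
	uint64_t fileid;
	struct demo_fh fh;
};

/* What a directory entry from the server tells us about the file. */
struct demo_entry {
	uint64_t fileid;
	struct demo_fh fh;
};

enum demo_action {
	DEMO_REFRESH_ATTRS,	/* same file: safe to apply the attributes */
	DEMO_KEEP_DENTRY,	/* inode number matches, no filehandle: leave as is */
	DEMO_INVALIDATE,	/* different file: drop the dentry */
};

static enum demo_action demo_prime(const struct demo_inode *inode,
				   const struct demo_entry *entry)
{
	if (entry->fileid != inode->fileid)
		return DEMO_INVALIDATE;
	if (entry->fh.size == 0)
		return DEMO_KEEP_DENTRY;
	if (entry->fh.size != inode->fh.size ||
	    memcmp(entry->fh.data, inode->fh.data, entry->fh.size) != 0)
		return DEMO_INVALIDATE;	/* inode number was reused */
	return DEMO_REFRESH_ATTRS;
}
```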

Fixes: fa9233699cc1 ("NFS: Don't require a filehandle to refresh...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trond Myklebust [Thu, 22 Sep 2016 17:38:51 +0000 (13:38 -0400)]
NFSv4.1: Don't deadlock the state manager on the SEQUENCE status flags

As described in RFC5661, section 18.46, some of the status flags exist
in order to tell the client when it needs to acknowledge the existence of
revoked state on the server and/or to recover state.
Those flags will then remain set until the recovery procedure is done.

In order to avoid looping, the client therefore needs to ignore
those particular flags while recovering.
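
A minimal sketch of such masking. The flag values follow the
SEQ4_STATUS_* bit definitions in RFC 5661; which flags are masked during
recovery is an assumption made for this example, and
demo_filter_sequence_flags is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* SEQUENCE status bits, values as defined in RFC 5661. */
#define DEMO_SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED	0x00000008u
#define DEMO_SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED	0x00000010u
#define DEMO_SEQ4_STATUS_ADMIN_STATE_REVOKED		0x00000020u
#define DEMO_SEQ4_STATUS_RECALLABLE_STATE_REVOKED	0x00000040u
#define DEMO_SEQ4_STATUS_LEASE_MOVED			0x00000080u

/* Flags that stay set until recovery finishes (assumed set). */
#define DEMO_RECOVERY_FLAGS \
	(DEMO_SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED | \
	 DEMO_SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED | \
	 DEMO_SEQ4_STATUS_ADMIN_STATE_REVOKED | \
	 DEMO_SEQ4_STATUS_RECALLABLE_STATE_REVOKED)

/*
 * Return the flags that should still trigger action. While recovery is
 * already running, the sticky recovery flags are dropped so that they
 * do not schedule the same recovery again.
 */
static uint32_t demo_filter_sequence_flags(uint32_t flags, bool recovering)
{
	if (recovering)
		flags &= ~DEMO_RECOVERY_FLAGS;
	return flags;
}
```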

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Daniel Wagner [Fri, 23 Sep 2016 08:41:57 +0000 (10:41 +0200)]
xprtrdma: use complete() instead of complete_all()

There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

The usage pattern of the completion is:

waiter context                          waker context

frwr_op_unmap_sync()
  reinit_completion()
  ib_post_send()
  wait_for_completion()

                                        frwr_wc_localinv_wake()
                                          complete()
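
The difference between the two calls can be pictured with a simplified,
single-threaded model of the counter behind a completion. It assumes
that complete() releases exactly one waiter while complete_all()
releases every current and future waiter until the completion is
reinitialized. The demo_ names are hypothetical, there is no blocking
or locking, and the kernel's implementation differs in detail.

```c
#include <limits.h>
#include <stdbool.h>

struct demo_completion {
	unsigned int done;	/* number of pending wakeups */
};

static void demo_init_completion(struct demo_completion *x)
{
	x->done = 0;
}

/* Release one waiter. */
static void demo_complete(struct demo_completion *x)
{
	if (x->done != UINT_MAX)
		x->done++;
}

/* Release all waiters, now and in the future. */
static void demo_complete_all(struct demo_completion *x)
{
	x->done = UINT_MAX;
}

/* Non-blocking wait: consume one wakeup if there is one. */
static bool demo_try_wait(struct demo_completion *x)
{
	if (!x->done)
		return false;
	if (x->done != UINT_MAX)
		x->done--;
	return true;
}
```

With a single waiter, one demo_complete() call is enough, and the
completion is left in a clean state for the next use.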

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Daniel Wagner [Thu, 22 Sep 2016 11:54:29 +0000 (13:54 +0200)]
NFS: cache_lib: use complete() instead of complete_all()

There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

The generic caching code from sunrpc is calling revisit() only once.

The usage pattern of the completion is:

waiter context                          waker context

do_cache_lookup_wait()
  nfs_cache_defer_req_alloc()
    init_completion()
  do_cache_lookup()
  nfs_cache_wait_for_upcall()
    wait_for_completion_timeout()

                                        nfs_dns_cache_revisit()
                                          complete()

  nfs_cache_defer_req_put()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoNFS: direct: use complete() instead of complete_all()
Daniel Wagner [Thu, 22 Sep 2016 11:54:28 +0000 (13:54 +0200)]
NFS: direct: use complete() instead of complete_all()

There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

nfs_file_direct_write() or nfs_file_direct_read() allocates a request
object via nfs_direct_req_alloc(), which initializes the
completion. The request object is freed later in the exit path.
Between the initialization and the release, either
nfs_direct_write_schedule_iovec() or
nfs_direct_read_schedule_iovec() is called, which asynchronously
processes the request. The calling function waits via nfs_direct_wait()
until the async work has been done. Thus there is only one waiter on
the completion.

nfs_direct_pgio_init() and nfs_direct_read_completion() are passed via
function pointers to nfs pageio. The first function does reference
counting (get_dreq() and put_dreq()), which ensures that
nfs_direct_read_completion() and nfs_direct_read_schedule_iovec()
call the completion path only once.

The usage pattern of the completion is:

waiter context                          waker context

nfs_file_direct_write()
  dreq = nfs_direct_req_alloc()
    init_completion()
  nfs_direct_write_schedule_iovec()
  nfs_direct_wait()
    wait_for_completion_killable()

                                        nfs_direct_write_schedule_work()
                                          nfs_direct_complete()
                                            complete()

nfs_file_direct_read()
  dreq = nfs_direct_req_alloc()
    init_completion()
  nfs_direct_read_schedule_iovec()
  nfs_direct_wait()
    wait_for_completion_killable()
                                        nfs_direct_read_schedule_iovec()
                                          nfs_direct_complete()
                                            complete()

                                        nfs_direct_read_completion()
                                          nfs_direct_complete()
                                            complete()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
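
The reference counting argument in the direct I/O change above can be sketched as follows. This is a toy model with hypothetical toy_* names and a plain int instead of an atomic counter; it only shows why the completion path runs once, whichever context drops the last reference.

```c
#include <stdbool.h>

/* Toy request: io_count holds outstanding references. */
struct toy_dreq {
    int io_count;
    int completions;   /* how often the completion path ran */
};

/* Take a reference, as when another piece of I/O is started. */
void toy_get_dreq(struct toy_dreq *d)
{
    d->io_count++;
}

/* Drop a reference; true only for the caller that drops the last one. */
bool toy_put_dreq(struct toy_dreq *d)
{
    return --d->io_count == 0;
}

/* Stand-in for the path that ends in complete(). */
void toy_direct_complete(struct toy_dreq *d)
{
    d->completions++;
}

/*
 * Called both by the scheduling function when it has issued all I/O
 * and by each I/O completion. Only the last caller completes.
 */
void toy_put_and_maybe_complete(struct toy_dreq *d)
{
    if (toy_put_dreq(d))
        toy_direct_complete(d);
}
```

The scheduling function holds its own reference while it issues I/O, so an early I/O completion cannot finish the request prematurely.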
8 years agoSUNRPC: Fix setting of buffer length in xdr_set_next_buffer()
Trond Myklebust [Tue, 20 Sep 2016 18:33:43 +0000 (14:33 -0400)]
SUNRPC: Fix setting of buffer length in xdr_set_next_buffer()

Use xdr->nwords to tell us how much buffer remains.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch
Trond Myklebust [Tue, 20 Sep 2016 18:33:42 +0000 (14:33 -0400)]
SUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch

When we copy the first part of the data, we need to ensure that the
value of xdr->nwords is updated as well. Do so by calling
__xdr_inline_decode().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
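
The nwords accounting that the two SUNRPC fixes above rely on can be sketched like this. It is a simplified model with hypothetical toy_* names, not the kernel's xdr_stream: every successful decode must reduce the remaining word count, and the usable length of the next buffer is bounded by that count.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy decode stream over 4-byte XDR words. */
struct toy_xdr {
    const uint32_t *p;    /* next unread word in the current buffer */
    const uint32_t *end;  /* end of the current buffer */
    size_t nwords;        /* words remaining in the whole stream */
};

/* Round a byte count up to whole XDR words. */
size_t toy_xdr_words(size_t nbytes)
{
    return (nbytes + 3) >> 2;
}

/*
 * Return a pointer to nbytes of data in the current buffer and advance
 * past it, or NULL if it does not fit. nwords is updated on success.
 */
const uint32_t *toy_inline_decode(struct toy_xdr *x, size_t nbytes)
{
    size_t n = toy_xdr_words(nbytes);
    const uint32_t *q = x->p;

    if (n > x->nwords || n > (size_t)(x->end - x->p))
        return NULL;
    x->p += n;
    x->nwords -= n;
    return q;
}

/* Usable bytes of the next buffer: never more than the stream has left. */
size_t toy_next_buffer_len(const struct toy_xdr *x, size_t buflen)
{
    size_t remaining = x->nwords << 2;

    return buflen < remaining ? buflen : remaining;
}
```

If any path consumes data without going through the accounting helper, nwords drifts and later length calculations become wrong.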
8 years agoNFS: nfs_prime_dcache must validate the filename
Trond Myklebust [Tue, 20 Sep 2016 18:34:24 +0000 (14:34 -0400)]
NFS: nfs_prime_dcache must validate the filename

Before we try to stash it in the dcache, we need to at least check
that the filename passed to us by the server is non-empty and doesn't
contain any illegal '\0' or '/' characters.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
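
The checks listed in the nfs_prime_dcache change above amount to something like the following sketch. The function name toy_name_is_valid() is hypothetical, and the real code may apply further checks; this only covers the three conditions stated in the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Validate a server-supplied filename of the given length: it must be
 * non-empty and must not contain an embedded NUL or a '/' character.
 */
bool toy_name_is_valid(const char *name, size_t len)
{
    if (len == 0)
        return false;
    if (memchr(name, '\0', len) != NULL)
        return false;
    if (memchr(name, '/', len) != NULL)
        return false;
    return true;
}
```

The length is passed explicitly because a name received over the wire is not guaranteed to be NUL-terminated.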
8 years agonfs: allow blocking locks to be awoken by lock callbacks
Jeff Layton [Sat, 17 Sep 2016 22:17:39 +0000 (18:17 -0400)]
nfs: allow blocking locks to be awoken by lock callbacks

Add a waitqueue head to the client structure. Have clients set a wait
on that queue prior to requesting a lock from the server. If the lock
is blocked, then we can use that to wait for wakeups.

Note that we do need to do this "manually" since we need to set the
wait on the waitqueue prior to requesting the lock, but requesting a
lock can involve activities that can block.

However, only do that for NFSv4.1 locks, either by compiling out
all of the waitqueue handling when CONFIG_NFS_V4_1 is disabled, or
skipping all of it at runtime if we're dealing with v4.0, or v4.1
servers that don't send lock callbacks.

Note too that even when we expect to get a lock callback, RFC5661
section 20.11.4 is pretty clear that we still need to poll for them,
so we do still sleep on a timeout. We do however always poll at the
longest interval in that case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
[Anna: nfs4_retry_setlk() "status" should default to -ERESTARTSYS]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
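
The polling policy described in the blocking-lock change above can be sketched as below. The constants and the doubling backoff are assumptions made for illustration (the message only says that the longest interval is always used when a callback is expected); the toy_* names are hypothetical and the values are in arbitrary ticks.

```c
/* Illustrative bounds for the retry interval, in arbitrary ticks. */
#define TOY_LOCK_MINTIMEOUT 1UL
#define TOY_LOCK_MAXTIMEOUT 30UL

/*
 * Pick the next sleep interval before retrying a blocked LOCK.
 * If the server may send a lock callback, always sleep for the longest
 * interval and rely on the callback for a prompt wakeup. Otherwise
 * start short and back off (assumed here: doubling) up to the maximum.
 */
unsigned long toy_next_timeout(unsigned long cur, int expect_callback)
{
    if (expect_callback)
        return TOY_LOCK_MAXTIMEOUT;
    if (cur == 0)
        return TOY_LOCK_MINTIMEOUT;
    cur *= 2;
    return cur > TOY_LOCK_MAXTIMEOUT ? TOY_LOCK_MAXTIMEOUT : cur;
}
```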
8 years agonfs: move nfs4 lock retry attempt loop to a separate function
Jeff Layton [Sat, 17 Sep 2016 22:17:38 +0000 (18:17 -0400)]
nfs: move nfs4 lock retry attempt loop to a separate function

This also consolidates the waiting logic into a single function,
instead of having it spread across two like it is now.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: move nfs4_set_lock_state call into caller
Jeff Layton [Sat, 17 Sep 2016 22:17:37 +0000 (18:17 -0400)]
nfs: move nfs4_set_lock_state call into caller

We need to have this info set up before adding the waiter to the
waitqueue, so move this out of the _nfs4_proc_setlk and into the
caller. That's more efficient anyway since we don't need to do
this more than once if we end up waiting on the lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: add handling for CB_NOTIFY_LOCK in client
Jeff Layton [Sat, 17 Sep 2016 22:17:36 +0000 (18:17 -0400)]
nfs: add handling for CB_NOTIFY_LOCK in client

For now, the callback doesn't do anything. Support for that will be
added in later patches.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: track whether server sets MAY_NOTIFY_LOCK flag
Jeff Layton [Sat, 17 Sep 2016 22:17:35 +0000 (18:17 -0400)]
nfs: track whether server sets MAY_NOTIFY_LOCK flag

We want to handle the two cases differently, such that we poll more
aggressively when we don't expect a callback.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
Jeff Layton [Sat, 17 Sep 2016 22:17:34 +0000 (18:17 -0400)]
nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant

As defined in RFC 5661, section 18.16.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: use safe, interruptible sleeps when waiting to retry LOCK
Jeff Layton [Sat, 17 Sep 2016 22:17:33 +0000 (18:17 -0400)]
nfs: use safe, interruptible sleeps when waiting to retry LOCK

We actually want to use TASK_INTERRUPTIBLE sleeps when we're in the
process of polling for an NFSv4 lock. If there is a signal pending when
the task wakes up, then we'll be returning an error anyway. So, we might
as well wake up immediately for non-fatal signals as well. That allows
us to return to userland more quickly in that case, but won't change the
error that userland sees.

Also, there is no need to use the *_unsafe sleep variants here, as no
vfs-layer locks should be held at this point.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: eliminate pointless and confusing do_vfs_lock wrappers
Jeff Layton [Sat, 17 Sep 2016 22:17:32 +0000 (18:17 -0400)]
nfs: eliminate pointless and confusing do_vfs_lock wrappers

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: the length argument to read_buf should be unsigned
Jeff Layton [Sat, 17 Sep 2016 22:17:31 +0000 (18:17 -0400)]
nfs: the length argument to read_buf should be unsigned

Since it gets passed through to xdr_inline_decode, we might as well
have read_buf expect what it expects -- a size_t.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs: cover ->migratepage with CONFIG_MIGRATION
Chao Yu [Tue, 20 Sep 2016 05:59:07 +0000 (13:59 +0800)]
nfs: cover ->migratepage with CONFIG_MIGRATION

It is cleaner to guard nfs' private .migratepage in nfs_file_aops
with CONFIG_MIGRATION, as we do in other parts of the nfs
operations.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agosunrpc: fix write space race causing stalls
David Vrabel [Mon, 19 Sep 2016 12:58:30 +0000 (13:58 +0100)]
sunrpc: fix write space race causing stalls

Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attempts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().

Suggested-by: Trond Myklebust <trondmy@primarydata.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
cc: stable@vger.kernel.org # 4.4
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
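
The lost wakeup in the write space change above can be replayed with a small single-threaded model. This is a sketch only: the toy_* names are hypothetical, the boolean stands in for the SOCKWQ_ASYNC_NOSPACE bit, and the model assumes that buffer space is available by the time the sender re-checks.

```c
#include <stdbool.h>

struct toy_xprt {
    bool nospace_bit;  /* stands in for SOCKWQ_ASYNC_NOSPACE */
    bool task_queued;  /* the sending task is asleep on the queue */
    int wakeups;       /* wakeups actually delivered */
};

/* Socket callback: does nothing unless the NOSPACE bit is set. */
void toy_write_space(struct toy_xprt *x)
{
    if (!x->nospace_bit)
        return;
    x->nospace_bit = false;
    if (x->task_queued) {
        x->task_queued = false;
        x->wakeups++;
    }
}

/*
 * Sender path: queue the task, then call the socket callback again as
 * a race breaker. With reset_bit set (the fix), the bit is set again
 * first, so the callback really does wake the queued task.
 */
void toy_nospace(struct toy_xprt *x, bool reset_bit)
{
    x->task_queued = true;
    if (reset_bit)
        x->nospace_bit = true;
    toy_write_space(x);
}
```

Calling toy_write_space() before toy_nospace() reproduces steps [2] to [5] of the trace: without the reset, the task stays queued with no wakeup pending.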
8 years agopnfs: add a new mechanism to select a layout driver according to an ordered list
Jeff Layton [Thu, 15 Sep 2016 18:40:49 +0000 (14:40 -0400)]
pnfs: add a new mechanism to select a layout driver according to an ordered list

Currently, the layout driver selection code always chooses the first one
from the list. That's not really ideal however, as the server can send
the list of layout types in any order that it likes. It's up to the
client to select the best one for its needs.

This patch adds an ordered list of preferred driver types and has the
selection code sort the list of available layout drivers according to it.
Any unrecognized layout type is sorted to the end of the list.

For now, the order of preference is hardcoded, but it should be possible
to make this configurable in the future.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
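
The ordered selection described in the pnfs change above can be sketched as a sort keyed on the position of each layout type in a preference list. The toy_* names are hypothetical and the preference order shown is an assumption chosen for illustration, not necessarily the order hardcoded by the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout type numbers; names here are illustrative. */
enum {
    TOY_LAYOUT_FILES = 1,
    TOY_LAYOUT_OSD2  = 2,
    TOY_LAYOUT_BLOCK = 3,
    TOY_LAYOUT_FLEX  = 4,
    TOY_LAYOUT_SCSI  = 5,
};

/* Assumed preference order, most preferred first. */
static const uint32_t toy_pref[] = {
    TOY_LAYOUT_SCSI, TOY_LAYOUT_BLOCK, TOY_LAYOUT_OSD2,
    TOY_LAYOUT_FLEX, TOY_LAYOUT_FILES,
};
#define TOY_NPREF (sizeof(toy_pref) / sizeof(toy_pref[0]))

/* Rank of a layout type; unrecognized types rank after all known ones. */
static size_t toy_rank(uint32_t id)
{
    for (size_t i = 0; i < TOY_NPREF; i++)
        if (toy_pref[i] == id)
            return i;
    return TOY_NPREF;
}

static int toy_cmp(const void *a, const void *b)
{
    size_t ra = toy_rank(*(const uint32_t *)a);
    size_t rb = toy_rank(*(const uint32_t *)b);

    return (ra > rb) - (ra < rb);
}

/* Sort the server's list so the most preferred layout type comes first. */
void toy_sort_layouts(uint32_t *ids, size_t n)
{
    qsort(ids, n, sizeof(*ids), toy_cmp);
}
```

After sorting, the client can walk the list from the front and pick the first type for which a driver is available.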
8 years agoxprtrdma: Eliminate rpcrdma_receive_worker()
Chuck Lever [Thu, 15 Sep 2016 14:57:57 +0000 (10:57 -0400)]
xprtrdma: Eliminate rpcrdma_receive_worker()

Clean up: the extra layer of indirection doesn't add value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Rename rpcrdma_receive_wc()
Chuck Lever [Thu, 15 Sep 2016 14:57:49 +0000 (10:57 -0400)]
xprtrdma: Rename rpcrdma_receive_wc()

Clean up: When converting xprtrdma to use the new CQ API, I missed a
spot. The naming convention elsewhere is:

  {svc_rdma,rpcrdma}_wc_{operation}

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Report address of frmr, not mw
Chuck Lever [Thu, 15 Sep 2016 14:57:40 +0000 (10:57 -0400)]
xprtrdma: Report address of frmr, not mw

Tie frwr debugging messages together by always reporting the address
of the frwr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Support larger inline thresholds
Chuck Lever [Thu, 15 Sep 2016 14:57:32 +0000 (10:57 -0400)]
xprtrdma: Support larger inline thresholds

The Version One default inline threshold is still 1KB. But allow
testing with thresholds up to 64KB.

This maximum is somewhat arbitrary. There's no fundamental
architectural limit I'm aware of, but it's good to keep the size of
Receive buffers reasonable. Now that Send can use a s/g list, a
Send buffer is only as large as each RPC requires. Receive buffers
are always the size of the inline threshold, however.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Use gathered Send for large inline messages
Chuck Lever [Thu, 15 Sep 2016 14:57:24 +0000 (10:57 -0400)]
xprtrdma: Use gathered Send for large inline messages

An RPC Call message that is sent inline but that has a data payload
(i.e., one or more items in rq_snd_buf's page list) must be "pulled
up":

- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
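
The gather-list approach in the gathered Send change above can be sketched as follows. This is a userspace illustration with hypothetical toy_* types: it builds one (address, length) element per component of the send buffer instead of copying the pages and tail into the head. DMA mapping of each element is left out.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* One element of the gather list. */
struct toy_sge {
    const void *addr;
    size_t len;
};

/* Simplified send buffer: head, page list, tail. */
struct toy_snd_buf {
    const void *head;
    size_t head_len;
    const void *const *pages;
    size_t npages;
    size_t page_len;   /* payload bytes held in the page list */
    const void *tail;
    size_t tail_len;
};

/* Append one element; returns the new count, or -1 if the list is full. */
static int toy_add_sge(struct toy_sge *sge, int n, int max,
                       const void *addr, size_t len)
{
    if (n == max)
        return -1;
    sge[n].addr = addr;
    sge[n].len = len;
    return n + 1;
}

/* Build the gather list in place; returns the element count or -1. */
int toy_build_gather_list(const struct toy_snd_buf *buf,
                          struct toy_sge *sge, int max)
{
    size_t remaining = buf->page_len;
    int n = 0;

    if (buf->head_len &&
        (n = toy_add_sge(sge, n, max, buf->head, buf->head_len)) < 0)
        return -1;
    for (size_t i = 0; remaining > 0; i++) {
        size_t len = remaining < TOY_PAGE_SIZE ? remaining : TOY_PAGE_SIZE;

        if (i == buf->npages)
            return -1;   /* page_len exceeds the page list */
        if ((n = toy_add_sge(sge, n, max, buf->pages[i], len)) < 0)
            return -1;
        remaining -= len;
    }
    if (buf->tail_len &&
        (n = toy_add_sge(sge, n, max, buf->tail, buf->tail_len)) < 0)
        return -1;
    return n;
}
```

No payload byte is touched by the CPU here; the cost moves to mapping and unmapping each element, which is the trade-off the message describes for very small payloads.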
8 years agoxprtrdma: Basic support for Remote Invalidation
Chuck Lever [Thu, 15 Sep 2016 14:57:16 +0000 (10:57 -0400)]
xprtrdma: Basic support for Remote Invalidation

Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
as part of a Receive completion. Local invalidation can be skipped
for that rkey.

Use an out-of-band signaling mechanism to indicate to the server
that the client is prepared to receive RDMA Send With Invalidate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Client-side support for rpcrdma_connect_private
Chuck Lever [Thu, 15 Sep 2016 14:57:07 +0000 (10:57 -0400)]
xprtrdma: Client-side support for rpcrdma_connect_private

Send an RDMA-CM private message on connect, and look for one during
a connection-established event.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Once the client knows the server's inline threshold maxima, it can
adjust the use of Reply chunks, and eliminate most use of Position
Zero Read chunks. Moderately-sized I/O can be done using a pure
inline RDMA Send instead of RDMA operations that require memory
registration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agorpcrdma: RDMA/CM private message data structure
Chuck Lever [Thu, 15 Sep 2016 14:56:59 +0000 (10:56 -0400)]
rpcrdma: RDMA/CM private message data structure

Introduce data structure used by both client and server to exchange
implementation details during RDMA/CM connection establishment.

This is an experimental out-of-band exchange between Linux
RPC-over-RDMA Version One implementations, replacing the deprecated
CCP (see RFC 5666bis). The purpose of this extension is to enable
prototyping of features that might be introduced in a subsequent
version of RPC-over-RDMA.

Suggested by Christoph Hellwig and Devesh Sharma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Move recv_wr to struct rpcrdma_rep
Chuck Lever [Thu, 15 Sep 2016 14:56:51 +0000 (10:56 -0400)]
xprtrdma: Move recv_wr to struct rpcrdma_rep

Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Move send_wr to struct rpcrdma_req
Chuck Lever [Thu, 15 Sep 2016 14:56:43 +0000 (10:56 -0400)]
xprtrdma: Move send_wr to struct rpcrdma_req

Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Simplify rpcrdma_ep_post_recv()
Chuck Lever [Thu, 15 Sep 2016 14:56:35 +0000 (10:56 -0400)]
xprtrdma: Simplify rpcrdma_ep_post_recv()

Clean up.

Since commit fc66448549bb ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf
Chuck Lever [Thu, 15 Sep 2016 14:56:26 +0000 (10:56 -0400)]
xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf

Clean up. The "ia" argument is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Delay DMA mapping Send and Receive buffers
Chuck Lever [Thu, 15 Sep 2016 14:56:18 +0000 (10:56 -0400)]
xprtrdma: Delay DMA mapping Send and Receive buffers

Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
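
The "allocate early, map late" scheme in the delayed DMA mapping change above can be sketched like this. The toy_* names are hypothetical and the "mapping" is only a flag and a counter; the point is that allocation does not need a device, while mapping is done on first use and can be undone without freeing memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a device that performs DMA mappings. */
struct toy_device {
    int maps;
    int unmaps;
};

/* A registered buffer that remembers whether it is currently mapped. */
struct toy_regbuf {
    void *base;
    size_t len;
    bool mapped;
};

/* Allocate at transport creation: memory only, no mapping yet. */
struct toy_regbuf *toy_alloc_regbuf(size_t len)
{
    struct toy_regbuf *rb = calloc(1, sizeof(*rb));

    if (!rb)
        return NULL;
    rb->base = malloc(len);
    if (!rb->base) {
        free(rb);
        return NULL;
    }
    rb->len = len;
    return rb;
}

/* Called on the send path: map only if not mapped already. */
bool toy_dma_map_regbuf(struct toy_device *dev, struct toy_regbuf *rb)
{
    if (rb->mapped)
        return true;
    if (!dev)
        return false;   /* no device set up yet */
    dev->maps++;
    rb->mapped = true;
    return true;
}

/* Device removal: drop the mapping but keep the memory. */
void toy_dma_unmap_regbuf(struct toy_device *dev, struct toy_regbuf *rb)
{
    if (!rb->mapped)
        return;
    dev->unmaps++;
    rb->mapped = false;
}

void toy_free_regbuf(struct toy_regbuf *rb)
{
    free(rb->base);
    free(rb);
}
```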
8 years agoxprtrdma: Replace DMA_BIDIRECTIONAL
Chuck Lever [Thu, 15 Sep 2016 14:56:10 +0000 (10:56 -0400)]
xprtrdma: Replace DMA_BIDIRECTIONAL

The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of a Reply
chunk, which is mapped and registered via ->ro_map. So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Use smaller buffers for RPC-over-RDMA headers
Chuck Lever [Thu, 15 Sep 2016 14:56:02 +0000 (10:56 -0400)]
xprtrdma: Use smaller buffers for RPC-over-RDMA headers

Commit 949317464bc2 ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Initialize separate RPC call and reply buffers
Chuck Lever [Thu, 15 Sep 2016 14:55:53 +0000 (10:55 -0400)]
xprtrdma: Initialize separate RPC call and reply buffers

RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
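
The cacheline concern in the separate call and reply buffers change above can be made concrete with a small address check. This is an illustrative sketch with hypothetical toy_* helpers; it assumes the cacheline size is a power of two and works on plain integer addresses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if the last byte of buffer A (a, alen) and the first byte of
 * buffer B (b) fall within the same cacheline-sized block.
 */
bool toy_shares_cacheline(uintptr_t a, size_t alen, uintptr_t b,
                          size_t cacheline)
{
    uintptr_t mask = ~((uintptr_t)cacheline - 1);

    return alen > 0 && ((a + alen - 1) & mask) == (b & mask);
}

/* Round an address up to the next cacheline boundary. */
uintptr_t toy_align_up(uintptr_t addr, size_t cacheline)
{
    return (addr + cacheline - 1) & ~((uintptr_t)cacheline - 1);
}
```

Two back-to-back buffers share a cacheline whenever the first does not end on a boundary; allocating them separately (or aligning the second) avoids mixing DMA_TO_DEVICE and DMA_FROM_DEVICE data in one line.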
8 years agoSUNRPC: Add a transport-specific private field in rpc_rqst
Chuck Lever [Thu, 15 Sep 2016 14:55:45 +0000 (10:55 -0400)]
SUNRPC: Add a transport-specific private field in rpc_rqst

Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbufs work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC: Separate buffer pointers for RPC Call and Reply messages
Chuck Lever [Thu, 15 Sep 2016 14:55:37 +0000 (10:55 -0400)]
SUNRPC: Separate buffer pointers for RPC Call and Reply messages

For xprtrdma, the RPC Call and Reply buffers are involved in real
I/O operations.

To start with, the DMA direction of the I/O for a Call is opposite
that of a Reply.

In the current arrangement, the Reply buffer address is on a
four-byte alignment just past the call buffer. It would be friendlier
on some platforms if that were at a DMA cache alignment instead.

Because the current arrangement allocates a single memory region
which contains both buffers, the RPC Reply buffer often contains a
page boundary in it when the Call buffer is large enough (which is
frequent).

It would be a little nicer for setting up DMA operations (and
possible registration of the Reply buffer) if the two buffers were
separated, well-aligned, and contained as few page boundaries as
possible.

Now, I could just pad out the single memory region used for the pair
of buffers. But frequently that would mean a lot of unused space to
ensure the Reply buffer did not have a page boundary.

Add a separate pointer to rpc_rqst that points right to the RPC
Reply buffer. This makes no difference to xprtsock, but it will help
xprtrdma in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC: Generalize the RPC buffer release API
Chuck Lever [Thu, 15 Sep 2016 14:55:29 +0000 (10:55 -0400)]
SUNRPC: Generalize the RPC buffer release API

xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC: Generalize the RPC buffer allocation API
Chuck Lever [Thu, 15 Sep 2016 14:55:20 +0000 (10:55 -0400)]
SUNRPC: Generalize the RPC buffer allocation API

xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Transports that want to allocate separate Call and Reply buffers
will ignore the "size" argument anyway.  Don't bother passing it.

The buf_alloc method can't return two pointers. Instead, make the
method's return value an error code, and set the rq_buffer pointer
in the method itself.

This gives call_allocate an opportunity to terminate an RPC instead
of looping forever when a permanent problem occurs. If a request is
just bogus, or the transport is in a state where it can't allocate
resources for any request, there needs to be a way to kill the RPC
right there and not loop.

This immediately fixes a rare problem in the backchannel send path,
which loops if the server happens to send a CB request whose
call+reply size is larger than a page (which it shouldn't do yet).

One more issue: looks like xprt_inject_disconnect was incorrectly
placed in the failure path in call_allocate. It needs to be in the
success path, as it is for other call-sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
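
The error-code convention introduced by the buffer allocation API change above can be sketched as follows. The toy_* names, the TOY_MAX_BUF limit and the simulate_oom parameter are hypothetical; the sketch only shows how returning a status lets the caller tell a temporary failure (retry) from a permanent one (terminate the RPC).

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_BUF 4096u   /* hypothetical per-request limit */

struct toy_rqst {
    size_t callsize;
    size_t rcvsize;
    void *buffer;
};

/*
 * Allocate the request's buffer. Returns 0 and sets rq->buffer on
 * success, -EINVAL for a request that can never be satisfied, and
 * -ENOMEM for a failure that may succeed later.
 */
int toy_buf_alloc(struct toy_rqst *rq, int simulate_oom)
{
    size_t size = rq->callsize + rq->rcvsize;

    if (size > TOY_MAX_BUF)
        return -EINVAL;
    if (simulate_oom)
        return -ENOMEM;
    rq->buffer = malloc(size);
    return rq->buffer ? 0 : -ENOMEM;
}

enum toy_action { TOY_PROCEED, TOY_RETRY, TOY_EXIT };

/* Caller side: retry only when the failure is temporary. */
enum toy_action toy_call_allocate(struct toy_rqst *rq, int simulate_oom)
{
    int status = toy_buf_alloc(rq, simulate_oom);

    if (status == 0)
        return TOY_PROCEED;
    if (status == -ENOMEM)
        return TOY_RETRY;
    return TOY_EXIT;   /* permanent problem: do not loop */
}
```

With a bare pointer return, both failure kinds looked the same (NULL), which is what made an oversized request loop forever.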
8 years agoSUNRPC: Refactor rpc_xdr_buf_init()
Chuck Lever [Thu, 15 Sep 2016 14:55:12 +0000 (10:55 -0400)]
SUNRPC: Refactor rpc_xdr_buf_init()

Clean up: there is some XDR initialization logic that is common
to the forward channel and backchannel. Move it to an XDR header
so it can be shared.

rpc_rqst::rq_buffer points to a buffer containing big-endian data.
Update its annotation as part of the clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoxprtrdma: Eliminate INLINE_THRESHOLD macros
Chuck Lever [Thu, 15 Sep 2016 14:55:04 +0000 (10:55 -0400)]
xprtrdma: Eliminate INLINE_THRESHOLD macros

Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoNFS pnfs data server multipath session trunking
Andy Adamson [Fri, 9 Sep 2016 13:22:29 +0000 (09:22 -0400)]
NFS pnfs data server multipath session trunking

Try all multipath addresses for a data server. The first address that
successfully connects and creates a session is the DS mount address.
All subsequent addresses are tested for session trunking and
added as aliases.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoNFS test session trunking with exchange id
Andy Adamson [Fri, 9 Sep 2016 13:22:28 +0000 (09:22 -0400)]
NFS test session trunking with exchange id

Use an async exchange id call to test for session trunking.

To conform with RFC 5661 section 18.35.4, the Non-Update on
Existing Clientid case, save the exchange id verifier in
cl_confirm and use it for the session trunking exchange id test.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoNFS add xprt switch addrs test to match client
Andy Adamson [Fri, 9 Sep 2016 13:22:27 +0000 (09:22 -0400)]
NFS add xprt switch addrs test to match client

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC: rpc_clnt_add_xprt setup function for NFS layer
Andy Adamson [Fri, 9 Sep 2016 13:22:26 +0000 (09:22 -0400)]
SUNRPC: rpc_clnt_add_xprt setup function for NFS layer

Use a setup function to call into the NFS layer to test an rpc_xprt
for session trunking so as to not leak the rpc_xprt_switch into
the nfs layer.

Search for the address in the rpc_xprt_switch first so as not to
put an unnecessary EXCHANGE_ID on the wire.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC search xprt switch for sockaddr
Andy Adamson [Fri, 9 Sep 2016 13:22:25 +0000 (09:22 -0400)]
SUNRPC search xprt switch for sockaddr

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC rpc_clnt_xprt_switch_add_xprt
Andy Adamson [Fri, 9 Sep 2016 13:22:24 +0000 (09:22 -0400)]
SUNRPC rpc_clnt_xprt_switch_add_xprt

Give the NFS layer access to the rpc_xprt_switch_add_xprt function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC rpc_clnt_xprt_switch_put
Andy Adamson [Fri, 9 Sep 2016 13:22:23 +0000 (09:22 -0400)]
SUNRPC rpc_clnt_xprt_switch_put

Give the NFS layer access to the xprt_switch_put function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoSUNRPC remove rpc_task_release_client from rpc_task_set_client
Andy Adamson [Fri, 9 Sep 2016 13:22:22 +0000 (09:22 -0400)]
SUNRPC remove rpc_task_release_client from rpc_task_set_client

rpc_task_set_client is only called from rpc_run_task after
rpc_new_task, so rpc_task_release_client is not needed because the
task is new.

When called from rpc_new_task, rpc_task_set_client also removed the
assigned rpc_xprt which is not desired.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>