platform/kernel/linux-starfive.git
7 years agoNFSv4: Fix an rcu lock leak
Trond Myklebust [Fri, 5 May 2017 17:02:42 +0000 (13:02 -0400)]
NFSv4: Fix an rcu lock leak

The intention in the original patch was to release the lock when
we put the inode, however something got screwed up.

Reported-by: Jason Yan <yanaijie@huawei.com>
Fixes: 7b410d9ce460f ("pNFS: Delay getting the layout header in..")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs: use kmap/kunmap directly
Fabian Frederick [Wed, 3 May 2017 18:52:21 +0000 (20:52 +0200)]
nfs: use kmap/kunmap directly

This patch removes useless nfs_readdir_get_array() and
nfs_readdir_release_array() as suggested by Trond Myklebust

nfs_readdir() calls nfs_revalidate_mapping() before
readdir_search_pagecache() , nfs_do_filldir(), uncached_readdir()
so mapping should be correct.

While kmap() can't fail, all subsequent error checks were removed
as well as unused labels.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: always treat the invocation of nfs_getattr as cache hit when noac is on
Hou Tao [Fri, 28 Apr 2017 10:35:19 +0000 (18:35 +0800)]
NFS: always treat the invocation of nfs_getattr as cache hit when noac is on

When using 'ls -l' to display a large directory, if noac option is used,
in function nfs_getattr() nfs_need_revalidate_inode() will always be true
for NFSv3 and the nfs_entry cache of the directory will be flushed. The
flush will lead to a fully reread of the directory entries from server.

To prevent the unnecessary RPCs, we need to check whether or not the
noac option is used, and always report the invocation of nfs_getattr()
as cache hit instead cache miss when it's on.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoFix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_as...
Dave Wysochanski [Thu, 27 Apr 2017 14:45:15 +0000 (10:45 -0400)]
Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew

If memory allocation fails for the callback data, we need to put the nfs_client
or we end up with an elevated refcount.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
Trond Myklebust [Thu, 4 May 2017 17:44:04 +0000 (13:44 -0400)]
NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION

If the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION because we
are trunking, then RECLAIM_COMPLETE must handle that by calling
nfs4_schedule_session_recovery() and then retrying.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
7 years agopNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
Fred Isaman [Tue, 2 May 2017 20:53:36 +0000 (16:53 -0400)]
pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Fix a typo in pnfs_generic_alloc_ds_commits
Trond Myklebust [Mon, 1 May 2017 21:35:58 +0000 (17:35 -0400)]
pNFS: Fix a typo in pnfs_generic_alloc_ds_commits

If the layout segment is invalid, we want to just resend the remaining
writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Fix a deadlock when coalescing writes and returning the layout
Trond Myklebust [Mon, 1 May 2017 21:06:56 +0000 (17:06 -0400)]
pNFS: Fix a deadlock when coalescing writes and returning the layout

Consider the following deadlock:

Process P1 Process P2 Process P3
========== ========== ==========
lock_page(page)

lseg = pnfs_update_layout(inode)

lo = NFS_I(inode)->layout
pnfs_error_mark_layout_for_return(lo)

lock_page(page)

lseg = pnfs_update_layout(inode)

In this scenario,
- P1 has declared the layout to be in error, but P2 holds a reference to
  a layout segment on that inode, so the layoutreturn is deferred.
- P2 is waiting for a page lock held by P3.
- P3 is asking for a new layout segment, but is blocked waiting
  for the layoutreturn.

The fix is to ensure that pnfs_error_mark_layout_for_return() does
not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow
the latter to call LAYOUTGET so that it can make progress and unblock
P2.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Don't clear the layout return info if there are segments to return
Trond Myklebust [Mon, 1 May 2017 21:03:44 +0000 (17:03 -0400)]
pNFS: Don't clear the layout return info if there are segments to return

In pnfs_clear_layoutreturn_info, ensure that we don't clear the layout
return info if there are new segments queued for return due to, for
instance, a race between a LAYOUTRETURN and a failed I/O attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Ensure we commit the layout if it has been invalidated
Trond Myklebust [Sat, 29 Apr 2017 14:10:17 +0000 (10:10 -0400)]
pNFS: Ensure we commit the layout if it has been invalidated

If the layout is being invalidated on the server, then we must
invoke nfs_commit_inode() to ensure any commits to the DS get
cleared out.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Don't send COMMITs to the DSes if the server invalidated our layout
Trond Myklebust [Sat, 29 Apr 2017 14:27:18 +0000 (10:27 -0400)]
pNFS: Don't send COMMITs to the DSes if the server invalidated our layout

If the layout was invalidated, then assume we should requeue all the
pending writes for the DS in question.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
Trond Myklebust [Sat, 29 Apr 2017 04:02:37 +0000 (00:02 -0400)]
pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path

If the attempt to write through pNFS fails, we need to use the same
failure semantics as for the read path: If the FF_FLAGS_NO_IO_THRU_MDS
flag is set or we have sufficient valid DSes, then we must retry through
pNFS

Fixes: d67ae825a59d ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Ensure we check layout validity before marking it for return
Trond Myklebust [Thu, 27 Apr 2017 19:30:00 +0000 (15:30 -0400)]
pNFS: Ensure we check layout validity before marking it for return

pnfs_error_mark_layout_for_return needs to check that the layout is
valid before calling pnfs_set_plh_return_info().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS4.1 handle interrupted slot reuse from ERR_DELAY
Olga Kornievskaia [Wed, 26 Apr 2017 18:21:22 +0000 (14:21 -0400)]
NFS4.1 handle interrupted slot reuse from ERR_DELAY

If the RPC slot was interrupted and server replied to the next
operation on the "reused" slot with ERR_DELAY, don't clear out
the "interrupted" flag until we properly recover.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: check return value of xdr_inline_decode
Pan Bian [Sun, 23 Apr 2017 06:49:41 +0000 (14:49 +0800)]
NFSv4: check return value of xdr_inline_decode

Function xdr_inline_decode() will return a NULL pointer if the input
buffer does not have long enough buffer to decode nbytes of data.
However, in function decode_op_map(), the return value of
xdr_inline_decode() is not validated before it is used. This patch adds
a check to the return value of xdr_inline_decode().

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
Artem Savkov [Fri, 21 Apr 2017 19:35:51 +0000 (21:35 +0200)]
nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()

Calling pnfs_put_lset on an IS_ERR pointer results in a NULL pointer
dereference like the one below. At the same time the check of retvalue
of filelayout_check_deviceid() sets lseg to error, but does not free it
before that.

[ 3000.636161] BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
[ 3000.636970] IP: pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.637420] PGD 4f23b067
[ 3000.637421] PUD 4a0f4067
[ 3000.637679] PMD 0
[ 3000.637937]
[ 3000.638287] Oops: 0000 [#1] SMP
[ 3000.638591] Modules linked in: nfs_layout_nfsv41_files nfsv3 nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill rpcsec_gss_krb5 nfsv4 nfs fscache binfmt_misc arc4 md4 nls_utf8 cifs ccm dns_resolver rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr virtio_balloon ppdev virtio_rng parport_pc i2c_piix4 parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c ata_generic pata_acpi virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ata_piix ttm libata drm serio_raw
[ 3000.645245]  i2c_core virtio_pci virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: xt_u32]
[ 3000.646360] CPU: 1 PID: 26402 Comm: date Not tainted 4.11.0-rc7.1.el7.test.x86_64 #1
[ 3000.647092] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3000.647638] task: ffff8800415ada00 task.stack: ffffc90000ff0000
[ 3000.648207] RIP: 0010:pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.648696] RSP: 0018:ffffc90000ff39b8 EFLAGS: 00010246
[ 3000.649193] RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: 00000000000d43be
[ 3000.649859] RDX: 00000000000d43bd RSI: 0000000000000000 RDI: fffffffffffffff4
[ 3000.650530] RBP: ffffc90000ff39d8 R08: 000000000001e320 R09: ffffffffa05c35ce
[ 3000.651203] R10: ffff88007fd1e320 R11: ffffea0001283d80 R12: 0000000001400040
[ 3000.651875] R13: ffff88004f77d9f0 R14: ffffc90000ff3cd8 R15: ffff8800417ade00
[ 3000.652546] FS:  00007fac4d5cd740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 3000.653304] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3000.653849] CR2: 000000000000003c CR3: 000000004f080000 CR4: 00000000000406e0
[ 3000.654527] Call Trace:
[ 3000.654771]  fl_pnfs_update_layout.constprop.20+0x10c/0x150 [nfs_layout_nfsv41_files]
[ 3000.655505]  filelayout_pg_init_write+0x21d/0x270 [nfs_layout_nfsv41_files]
[ 3000.656195]  __nfs_pageio_add_request+0x11c/0x490 [nfs]
[ 3000.656698]  nfs_pageio_add_request+0xac/0x260 [nfs]
[ 3000.657180]  nfs_do_writepage+0x109/0x2e0 [nfs]
[ 3000.657616]  nfs_writepages_callback+0x16/0x30 [nfs]
[ 3000.658096]  write_cache_pages+0x26f/0x510
[ 3000.658495]  ? nfs_do_writepage+0x2e0/0x2e0 [nfs]
[ 3000.658946]  ? _raw_spin_unlock_bh+0x1e/0x20
[ 3000.659357]  ? wb_wakeup_delayed+0x5f/0x70
[ 3000.659748]  ? __mark_inode_dirty+0x2eb/0x360
[ 3000.660170]  nfs_writepages+0x84/0xd0 [nfs]
[ 3000.660575]  ? nfs_updatepage+0x571/0xb70 [nfs]
[ 3000.661012]  do_writepages+0x1e/0x30
[ 3000.661358]  __filemap_fdatawrite_range+0xc6/0x100
[ 3000.661819]  filemap_write_and_wait_range+0x41/0x90
[ 3000.662292]  nfs_file_fsync+0x34/0x1f0 [nfs]
[ 3000.662704]  vfs_fsync_range+0x3d/0xb0
[ 3000.663065]  vfs_fsync+0x1c/0x20
[ 3000.663385]  nfs4_file_flush+0x57/0x80 [nfsv4]
[ 3000.663813]  filp_close+0x2f/0x70
[ 3000.664132]  __close_fd+0x9a/0xc0
[ 3000.664453]  SyS_close+0x23/0x50
[ 3000.664785]  do_syscall_64+0x67/0x180
[ 3000.665162]  entry_SYSCALL64_slow_path+0x25/0x25
[ 3000.665600] RIP: 0033:0x7fac4d0e1e90
[ 3000.665946] RSP: 002b:00007ffd54e90c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 3000.666679] RAX: ffffffffffffffda RBX: 00007fac4d3b5400 RCX: 00007fac4d0e1e90
[ 3000.667349] RDX: 0000000000000000 RSI: 00007fac4d5d9000 RDI: 0000000000000001
[ 3000.668031] RBP: 0000000000000000 R08: 00007fac4d3b6a00 R09: 00007fac4d5cd740
[ 3000.668709] R10: 00007ffd54e909e0 R11: 0000000000000246 R12: 0000000000000000
[ 3000.669385] R13: 00007fac4d3b5e80 R14: 0000000000000000 R15: 0000000000000000
[ 3000.670061] Code: 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 56 41 55 41 54 53 48 89 fb 0f 84 97 00 00 00 f6 05 16 8f bc ff 10 0f 85 a6 00 00 00 <4c> 8b 63 48 48 8d 7b 38 49 8b 84 24 90 00 00 00 4c 8d a8 88 00
[ 3000.671831] RIP: pnfs_put_lseg+0x29/0x100 [nfsv4] RSP: ffffc90000ff39b8
[ 3000.672462] CR2: 000000000000003c

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: Don't special case "launder"
Trond Myklebust [Wed, 26 Apr 2017 16:26:22 +0000 (12:26 -0400)]
NFSv4: Don't special case "launder"

If the client receives a fatal server error from nfs_pageio_add_request(),
then we should always truncate the page on which the error occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Add a few more fatal I/O errors to nfs_error_is_fatal()
Trond Myklebust [Wed, 26 Apr 2017 16:21:49 +0000 (12:21 -0400)]
NFS: Add a few more fatal I/O errors to nfs_error_is_fatal()

EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS
I/O is concerned. They need to be reported back to the application.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoMerge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Tue, 25 Apr 2017 22:42:48 +0000 (18:42 -0400)]
Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFS over RDMA Client Side Changes

New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver

Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.

7 years agoNFSv3: nfs3_nlm_alloc_call should be declared static
Trond Myklebust [Tue, 25 Apr 2017 20:25:06 +0000 (16:25 -0400)]
NFSv3: nfs3_nlm_alloc_call should be declared static

Fix compiler warnings.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoxprtrdma: Remove rpcrdma_buffer::rb_pool
Chuck Lever [Tue, 11 Apr 2017 17:24:07 +0000 (13:24 -0400)]
xprtrdma: Remove rpcrdma_buffer::rb_pool

Since commit 1e465fd4ff47 ("xprtrdma: Replace send and receive
arrays"), this field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agosunrpc: Fix xdr_init_decode_pages() documenting comment
Chuck Lever [Tue, 11 Apr 2017 17:23:59 +0000 (13:23 -0400)]
sunrpc: Fix xdr_init_decode_pages() documenting comment

Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Squelch ENOBUFS warnings
Chuck Lever [Tue, 11 Apr 2017 17:23:51 +0000 (13:23 -0400)]
xprtrdma: Squelch ENOBUFS warnings

When ro_map is out of buffers, that's not a permanent error, so
don't report a problem.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Annotate receive workqueue
Chuck Lever [Tue, 11 Apr 2017 17:23:43 +0000 (13:23 -0400)]
xprtrdma: Annotate receive workqueue

Micro-optimize the receive workqueue by marking it's anchor "read-
mostly."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Revert commit d0f36c46deea
Chuck Lever [Tue, 11 Apr 2017 17:23:34 +0000 (13:23 -0400)]
xprtrdma: Revert commit d0f36c46deea

Device removal is now adequately supported. Pinning the underlying
device driver to prevent removal while an NFS mount is active is no
longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Restore transport after device removal
Chuck Lever [Tue, 11 Apr 2017 17:23:26 +0000 (13:23 -0400)]
xprtrdma: Restore transport after device removal

After a device removal, enable the transport connect worker to
restore normal operation if there is another device with
connectivity to the server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Refactor rpcrdma_ep_connect
Chuck Lever [Tue, 11 Apr 2017 17:23:18 +0000 (13:23 -0400)]
xprtrdma: Refactor rpcrdma_ep_connect

I'm about to add another arm to

    if (ep->rep_connected != 0)

It will be cleaner to use a switch statement here. We'll be looking
for a couple of specific errnos, or "anything else," basically to
sort out the difference between a normal reconnect and recovery from
device removal.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Support unplugging an HCA from under an NFS mount
Chuck Lever [Tue, 11 Apr 2017 17:23:10 +0000 (13:23 -0400)]
xprtrdma: Support unplugging an HCA from under an NFS mount

The device driver for the underlying physical device associated
with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
transports are still in use (ie, while NFS filesystems are still
mounted and active). The IB core performs a connection event upcall
to request that consumers free all RDMA resources associated with
a transport.

There may be pending RPCs when this occurs. Care must be taken to
release associated resources without leaving references that can
trigger a subsequent crash if a signal or soft timeout occurs. We
rely on the caller of the transport's ->close method to ensure that
the previous RPC task has invoked xprt_release but the transport
remains write-locked.

A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
is invoked, it destroys the transport's H/W resources, then wakes
the upcall, which completes and allows the core driver unload to
continue.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Use same device when mapping or syncing DMA buffers
Chuck Lever [Tue, 11 Apr 2017 17:23:02 +0000 (13:23 -0400)]
xprtrdma: Use same device when mapping or syncing DMA buffers

When the underlying device driver is reloaded, ia->ri_device will be
replaced. All cached copies of that device pointer have to be
updated as well.

Commit 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and Receive
buffers") added the rg_device field to each regbuf. As part of
handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
all regbufs for a transport.

Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
the driver has been reloaded should reinitialize rg_device correctly
for every case except rpcrdma_wc_receive, which still uses
rpcrdma_rep::rr_device.

Ensure the same device that was used to map a Receive buffer is also
used to sync it in rpcrdma_wc_receive by using rg_device there
instead of rr_device.

This is the only use of rr_device, so it can be removed.

The use of regbufs in the send path is also updated, for
completeness.

Fixes: 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Refactor rpcrdma_ia_open()
Chuck Lever [Tue, 11 Apr 2017 17:22:54 +0000 (13:22 -0400)]
xprtrdma: Refactor rpcrdma_ia_open()

In order to unload a device driver and reload it, xprtrdma will need
to close a transport's interface adapter, and then call
rpcrdma_ia_open again, possibly finding a different interface
adapter.

Make rpcrdma_ia_open safe to call on the same transport multiple
times.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Detect unreachable NFS/RDMA servers more reliably
Chuck Lever [Tue, 11 Apr 2017 17:22:46 +0000 (13:22 -0400)]
xprtrdma: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agosunrpc: Export xprt_force_disconnect()
Chuck Lever [Tue, 11 Apr 2017 17:22:38 +0000 (13:22 -0400)]
sunrpc: Export xprt_force_disconnect()

xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Cancel refresh worker during buffer shutdown
Chuck Lever [Tue, 11 Apr 2017 17:22:29 +0000 (13:22 -0400)]
xprtrdma: Cancel refresh worker during buffer shutdown

Trying to create MRs while the transport is being torn down can
cause a crash.

Fixes: e2ac236c0b65 ("xprtrdma: Allocate MRs on demand")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoNFS: Don't write back further requests if there is a pending write error
Trond Myklebust [Tue, 25 Apr 2017 16:06:00 +0000 (12:06 -0400)]
NFS: Don't write back further requests if there is a pending write error

If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Fix use after free issues in pnfs_do_read()
Trond Myklebust [Tue, 25 Apr 2017 15:26:53 +0000 (11:26 -0400)]
pNFS: Fix use after free issues in pnfs_do_read()

The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Ensure we check layout segment validity in the pg_init() callback
Trond Myklebust [Tue, 25 Apr 2017 14:56:19 +0000 (10:56 -0400)]
pNFS: Ensure we check layout segment validity in the pg_init() callback

If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Always wait for I/O completion before unlock
Benjamin Coddington [Tue, 11 Apr 2017 16:50:12 +0000 (12:50 -0400)]
NFS: Always wait for I/O completion before unlock

NFS attempts to wait for read and write completion before unlocking in
order to ensure that the data returned was protected by the lock.  When
this waiting is interrupted by a signal, the unlock may be skipped, and
messages similar to the following are seen in the kernel ring buffer:

[20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
[20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
[20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185

For NFSv3, the missing unlock will cause the server to refuse conflicting
locks indefinitely.  For NFSv4, the leftover lock will be removed by the
server after the lease timeout.

This patch fixes this issue by skipping the usual wait in
nfs_iocounter_wait if the FL_CLOSE flag is set when signaled.  Instead, the
wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.

For NFSv3, use lockd's new nlmclnt_operations along with
nfs_async_iocounter_wait to defer NLM's unlock task until the lock
context's iocounter reaches zero.

For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
current rpc_call_prepare.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agolockd: Introduce nlmclnt_operations
Benjamin Coddington [Tue, 11 Apr 2017 16:50:11 +0000 (12:50 -0400)]
lockd: Introduce nlmclnt_operations

NFS would enjoy the ability to modify the behavior of the NLM client's
unlock RPC task in order to delay the transmission of the unlock until IO
that was submitted under that lock has completed.  This ability can ensure
that the NLM client will always complete the transmission of an unlock even
if the waiting caller has been interrupted with fatal signal.

For this purpose, a pointer to a struct nlmclnt_operations can be assigned
in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on
the nlm_host.  The struct nlmclnt_operations defines three callback
operations that will be used in a following patch:

nlmclnt_alloc_call - used to call back after a successful allocation of
a struct nlm_rqst in nlmclnt_proc().

nlmclnt_unlock_prepare - used to call back during NLM unlock's
rpc_call_prepare.  The NLM client defers calling rpc_call_start()
until this callback returns false.

nlmclnt_release_call - used to call back when the NLM client's struct
nlm_rqst is freed.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Add an iocounter wait function for async RPC tasks
Benjamin Coddington [Tue, 11 Apr 2017 16:50:10 +0000 (12:50 -0400)]
NFS: Add an iocounter wait function for async RPC tasks

By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for
a lock context's iocounter to reach zero.  The rpc waitqueue is only woken
when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to
mitigate spurious wake-ups for any iocounter reaching zero.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agolocks: Set FL_CLOSE when removing flock locks on close()
Benjamin Coddington [Tue, 11 Apr 2017 16:50:09 +0000 (12:50 -0400)]
locks: Set FL_CLOSE when removing flock locks on close()

Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Move the flock open mode check into nfs_flock()
Benjamin Coddington [Tue, 11 Apr 2017 16:50:08 +0000 (12:50 -0400)]
NFS: Move the flock open mode check into nfs_flock()

We only need to check lock exclusive/shared types against open mode when
flock() is used on NFS, so move it into the flock-specific path instead of
checking it for all locks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS4: remove a redundant lock range check
Benjamin Coddington [Tue, 11 Apr 2017 16:50:07 +0000 (12:50 -0400)]
NFS4: remove a redundant lock range check

flock64_to_posix_lock() is already doing this check

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: unexport nfs4_pnfs_v3_ds_connect_unload
Trond Myklebust [Thu, 20 Apr 2017 20:58:50 +0000 (16:58 -0400)]
pNFS: unexport nfs4_pnfs_v3_ds_connect_unload

It is not used outside the NFSv4 module.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Unexport pnfs_put_lseg_locked and _pnfs_return_layout
Trond Myklebust [Thu, 20 Apr 2017 20:53:58 +0000 (16:53 -0400)]
pNFS: Unexport pnfs_put_lseg_locked and _pnfs_return_layout

They are not used outside the NFSv4 module.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Remove unused layout driver callbacks
Trond Myklebust [Thu, 20 Apr 2017 20:48:14 +0000 (16:48 -0400)]
pNFS: Remove unused layout driver callbacks

encode_layoutreturn and encode_layoutcommit are now unused. Let's
remove them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs: remove the objlayout driver
Christoph Hellwig [Wed, 12 Apr 2017 16:01:08 +0000 (18:01 +0200)]
nfs: remove the objlayout driver

The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS/flexfiles: Check the result of nfs4_pnfs_ds_connect
Trond Myklebust [Thu, 20 Apr 2017 18:33:06 +0000 (14:33 -0400)]
pNFS/flexfiles: Check the result of nfs4_pnfs_ds_connect

The check in nfs4_ff_layout_prepare_ds() seems to be missing.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: a33e4b036d461 ("pNFS: return status from nfs4_pnfs_ds_connect")
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.11
7 years agoNFSv4: Fix a hang in OPEN related to server reboot
Trond Myklebust [Sat, 15 Apr 2017 23:20:01 +0000 (19:20 -0400)]
NFSv4: Fix a hang in OPEN related to server reboot

If the server fails to return the attributes as part of an OPEN
reply, and then reboots, we can end up hanging. The reason is that
the client attempts to send a GETATTR in order to pick up the
missing OPEN call, but fails to release the slot first, causing
reboot recovery to deadlock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 2e80dbe7ac51a ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
7 years agoNFS: move rw_mode to nfs_pageio_header
Benjamin Coddington [Wed, 19 Apr 2017 14:11:35 +0000 (10:11 -0400)]
NFS: move rw_mode to nfs_pageio_header

Let's try to have it in a cacheline in nfs4_proc_pgio_rpc_prepare().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: move nfs_pgarray_set() to open code
Benjamin Coddington [Wed, 19 Apr 2017 14:11:34 +0000 (10:11 -0400)]
NFS: move nfs_pgarray_set() to open code

Since commit 00bfa30abe86 ("NFS: Create a common pgio_alloc and
pgio_release function"), nfs_pgarray_set() has only a single caller.  Let's
open code it.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Use GFP_NOIO for two allocations in writeback
Benjamin Coddington [Wed, 19 Apr 2017 14:11:33 +0000 (10:11 -0400)]
NFS: Use GFP_NOIO for two allocations in writeback

Prevent a deadlock that can occur if we wait on allocations
that try to write back our pages.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 00bfa30abe869 ("NFS: Create a common pgio_alloc and pgio_release...")
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Fix use after free in write error path
Fred Isaman [Fri, 14 Apr 2017 18:24:28 +0000 (14:24 -0400)]
NFS: Fix use after free in write error path

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Fixes: 0bcbf039f6b2b ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()
Benjamin Coddington [Fri, 14 Apr 2017 16:29:54 +0000 (12:29 -0400)]
NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()

Commit a7d42ddb3099727f58366fa006f850a219cce6c8 ("nfs: add mirroring
support to pgio layer") moved pg_cleanup out of the path when there was
non-sequental I/O that needed to be flushed.  The result is that for
layouts that have more than one layout segment per file, the pg_lseg is not
cleared, so we can end up hitting the WARN_ON_ONCE(req_start >= seg_end) in
pnfs_generic_pg_test since the pg_lseg will be pointing to that
previously-flushed layout segment.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agosunrpc: don't check for failure from mempool_alloc()
NeilBrown [Mon, 10 Apr 2017 02:19:40 +0000 (12:19 +1000)]
sunrpc: don't check for failure from mempool_alloc()

When mempool_alloc() is allowed to sleep (GFP_NOIO allows
sleeping) it cannot fail.
So rpc_alloc_task() cannot fail, so rpc_new_task doesn't need
to test for failure.
Consequently rpc_new_task() cannot fail, so the callers
don't need to test.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: fix usage of mempools.
NeilBrown [Mon, 10 Apr 2017 02:22:09 +0000 (12:22 +1000)]
NFS: fix usage of mempools.

When passed GFP flags that allow sleeping (such as
GFP_NOIO), mempool_alloc() will never return NULL, it will
wait until memory is available.

This means that we don't need to handle failure, but that we
do need to ensure one thread doesn't call mempool_alloc()
twice on the one pool without queuing or freeing the first
allocation.  If multiple threads did this during times of
high memory pressure, the pool could be exhausted and a
deadlock could result.

pnfs_generic_alloc_ds_commits() attempts to allocate from
the nfs_commit_mempool while already holding an allocation
from that pool.  This is not safe.  So change
nfs_commitdata_alloc() to take a flag that indicates whether
failure is acceptable.

In pnfs_generic_alloc_ds_commits(), accept failure and
handle it as we currently do.  Else where, do not accept
failure, and do not handle it.

Even when failure is acceptable, we want to succeed if
possible.  That means both
 - using an entry from the pool if there is one
 - waiting for direct reclaim is there isn't.

We call mempool_alloc(GFP_NOWAIT) to achieve the first, then
kmem_cache_alloc(GFP_NOIO|__GFP_NORETRY) to achieve the
second.  Each of these can fail, but together they do the
best they can without blocking indefinitely.

The objects returned by kmem_cache_alloc() will still be freed
by mempool_free().  This is safe as mempool_alloc() uses
exactly the same function to allocate objects (since the mempool
was created with mempool_create_slab_pool()).  The object returned
by mempool_alloc() and kmem_cache_alloc() are indistinguishable
so mempool_free() will handle both identically, either adding to the
pool or calling kmem_cache_free().

Also, don't test for failure when allocating from
nfs_wdata_mempool.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_proc_get_lease_time()
Anna Schumaker [Fri, 7 Apr 2017 18:15:23 +0000 (14:15 -0400)]
NFS: Clean up nfs4_proc_get_lease_time()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up _nfs4_proc_exchange_id()
Anna Schumaker [Fri, 7 Apr 2017 18:15:22 +0000 (14:15 -0400)]
NFS: Clean up _nfs4_proc_exchange_id()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_proc_bind_one_conn_to_session()
Anna Schumaker [Fri, 7 Apr 2017 18:15:21 +0000 (14:15 -0400)]
NFS: Clean up nfs4_proc_bind_one_conn_to_session()

Returning errors directly even lets us remove the goto

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove extra dprintk()s from nfs4namespace.c
Anna Schumaker [Fri, 7 Apr 2017 18:15:20 +0000 (14:15 -0400)]
NFS: Remove extra dprintk()s from nfs4namespace.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_get_rootfh()
Anna Schumaker [Fri, 7 Apr 2017 18:15:19 +0000 (14:15 -0400)]
NFS: Clean up nfs4_get_rootfh()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove extra dprintk()s from nfs4client.c
Anna Schumaker [Fri, 7 Apr 2017 18:15:18 +0000 (14:15 -0400)]
NFS: Remove extra dprintk()s from nfs4client.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_init_server()
Anna Schumaker [Fri, 7 Apr 2017 18:15:17 +0000 (14:15 -0400)]
NFS: Clean up nfs4_init_server()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_set_client()
Anna Schumaker [Fri, 7 Apr 2017 18:15:16 +0000 (14:15 -0400)]
NFS: Clean up nfs4_set_client()

If we cut out the dprintk()s, then we can return error codes directly
and cut out the goto.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_check_server_scope()
Anna Schumaker [Fri, 7 Apr 2017 18:15:15 +0000 (14:15 -0400)]
NFS: Clean up nfs4_check_server_scope()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_check_serverowner_major_id()
Anna Schumaker [Fri, 7 Apr 2017 18:15:14 +0000 (14:15 -0400)]
NFS: Clean up nfs4_check_serverowner_major_id()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Create a common nfs4_match_client() function
Anna Schumaker [Fri, 7 Apr 2017 18:15:13 +0000 (14:15 -0400)]
NFS: Create a common nfs4_match_client() function

This puts all the common code in a single place for the
walk_client_list() functions.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_check_serverowner_minor_id()
Anna Schumaker [Fri, 7 Apr 2017 18:15:12 +0000 (14:15 -0400)]
NFS: Clean up nfs4_check_serverowner_minor_id()

Once again, we can remove the function and compare integer values
directly.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_match_clientids()
Anna Schumaker [Fri, 7 Apr 2017 18:15:11 +0000 (14:15 -0400)]
NFS: Clean up nfs4_match_clientids()

If we cut out the dprintk()s, then we don't even need this to be a
separate function.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs42_layoutstat_done()
Anna Schumaker [Fri, 7 Apr 2017 18:15:10 +0000 (14:15 -0400)]
NFS: Clean up nfs42_layoutstat_done()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove extra dprintk()s from namespace.c
Anna Schumaker [Fri, 7 Apr 2017 18:15:09 +0000 (14:15 -0400)]
NFS: Remove extra dprintk()s from namespace.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs_direct_commit_complete()
Anna Schumaker [Fri, 7 Apr 2017 18:15:08 +0000 (14:15 -0400)]
NFS: Clean up nfs_direct_commit_complete()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove nfs_direct_readpage_release()
Anna Schumaker [Fri, 7 Apr 2017 18:15:07 +0000 (14:15 -0400)]
NFS: Remove nfs_direct_readpage_release()

Just remove the function and have the caller use nfs_release_request()
instead.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up extra dprintk()s in client.c
Anna Schumaker [Fri, 7 Apr 2017 18:15:06 +0000 (14:15 -0400)]
NFS: Clean up extra dprintk()s in client.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs_init_client()
Anna Schumaker [Fri, 7 Apr 2017 18:15:05 +0000 (14:15 -0400)]
NFS: Clean up nfs_init_client()

We always call nfs_mark_client_ready() even if nfs_create_rpc_client()
returns an error, so we can rearrange nfs_init_client() to mark the
client ready from a single place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove extra dprintk()s from callback_xdr.c
Anna Schumaker [Fri, 7 Apr 2017 18:15:04 +0000 (14:15 -0400)]
NFS: Remove extra dprintk()s from callback_xdr.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up encode_cb_sequence_res()
Anna Schumaker [Fri, 7 Apr 2017 18:15:03 +0000 (14:15 -0400)]
NFS: Clean up encode_cb_sequence_res()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up decode_notify_lock_args()
Anna Schumaker [Fri, 7 Apr 2017 18:15:02 +0000 (14:15 -0400)]
NFS: Clean up decode_notify_lock_args()

Let's cut out the goto and return any errors immedately

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up decode_cb_sequence_args()
Anna Schumaker [Fri, 7 Apr 2017 18:15:01 +0000 (14:15 -0400)]
NFS: Clean up decode_cb_sequence_args()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up decode_layoutrecall_args()
Anna Schumaker [Fri, 7 Apr 2017 18:15:00 +0000 (14:15 -0400)]
NFS: Clean up decode_layoutrecall_args()

Additionally, this change lets us cut out the goto by returning errors
immediately.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up decode_recall_args()
Anna Schumaker [Fri, 7 Apr 2017 18:14:59 +0000 (14:14 -0400)]
NFS: Clean up decode_recall_args()

Removing the dprintk() lets us simplify the function by returning status
codes directly, rather than using a goto.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up decode_getattr_args()
Anna Schumaker [Fri, 7 Apr 2017 18:14:58 +0000 (14:14 -0400)]
NFS: Clean up decode_getattr_args()

Removing the dprintk() lets us return the status value directly, rather
than jumping to a label if an error occurs.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Remove extra dprintk()s from callback_proc.c
Anna Schumaker [Fri, 7 Apr 2017 18:14:57 +0000 (14:14 -0400)]
NFS: Remove extra dprintk()s from callback_proc.c

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up nfs4_callback_layoutrecall()
Anna Schumaker [Fri, 7 Apr 2017 18:14:56 +0000 (14:14 -0400)]
NFS: Clean up nfs4_callback_layoutrecall()

In addition to removing the dprintk(), this patch also initializes "res"
to the default return value instead of doing this through an else
condition.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Clean up do_callback_layoutrecall()
Anna Schumaker [Fri, 7 Apr 2017 18:14:55 +0000 (14:14 -0400)]
NFS: Clean up do_callback_layoutrecall()

Removing the dprintk()s lets us simplify the function by removing the
else condition entirely and returning the status of
initiate_{file,bulk}_draining() directly.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs: flexfilelayout: remove v3-only data server limitation
Tigran Mkrtchyan [Tue, 4 Apr 2017 13:12:51 +0000 (15:12 +0200)]
nfs: flexfilelayout: remove v3-only data server limitation

Flexfilelayout supports data servers which talk NFS v3 and v4.{0,1,2}.
However, this code path is disabled and v3 only servers are accepted.
This change removes this limitation.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: switch back to to ->iterate()
Benjamin Coddington [Fri, 10 Mar 2017 22:07:46 +0000 (17:07 -0500)]
NFS: switch back to to ->iterate()

NFS has some optimizations for readdir to choose between using READDIR or
READDIRPLUS based on workload, and which NFS operation to use is determined
by subsequent interactions with lookup, d_revalidate, and getattr.

Concurrent use of nfs_readdir() via ->iterate_shared() can cause those
optimizations to repeatedly invalidate the pagecache used to store
directory entries during readdir(), which causes some very bad performance
for directories with many entries (more than about 10000).

There's a couple ways to fix this in NFS, but no fix would be as simple as
going back to ->iterate() to serialize nfs_readdir(), and neither fix I
tested performed as well as going back to ->iterate().

The first required taking the directory's i_lock for each entry, with the
result of terrible contention.

The second way adds another flag to the nfs_inode, and so keeps the
optimizations working for large directories.  The difference from using
->iterate() here is that much more memory is consumed for a given workload
without any performance gain.

The workings of nfs_readdir() are such that concurrent users are serialized
within read_cache_page() waiting to retrieve pages of entries from the
server.  By serializing this work in iterate_dir() instead, contention for
cache pages is reduced.  Waiting processes can have an uncontended pass at
the entirety of the directory's pagecache once previous processes have
completed filling it.

v2 - Keep the bits needed for parallel lookup

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoLinux 4.11-rc7
Linus Torvalds [Sun, 16 Apr 2017 20:00:18 +0000 (13:00 -0700)]
Linux 4.11-rc7

7 years agoMerge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Linus Torvalds [Sun, 16 Apr 2017 19:38:17 +0000 (12:38 -0700)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:
 "Again, a batch that's been sitting a couple of weeks, mostly because
  I anticipated a bit more material but it didn't show up -- which is
  good.

  These are all your garden variety fixes for ARM platforms.

  The most visible issue fixed here is probably the SMP reset issue on
  OMAP, the rest are minor stuff"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: allwinner: a64: add pmu0 regs for USB PHY
  ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
  reset: add exported __reset_control_get, return NULL if optional
  ARM: orion5x: only call into phylib when available
  ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
  ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
  ARM: dts: ti: fix PCI bus dtc warnings
  ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
  ARM: dts: OMAP3: Fix MFG ID EEPROM
  ARM: sun8i: a33: add operating-points-v2 property to all nodes
  ARM: sun8i: a33: remove highest OPP to fix CPU crashes

7 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Sun, 16 Apr 2017 19:05:09 +0000 (12:05 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Four small fixes.

  Three of them fix the same error in NVMe, in loop, fc, and rdma
  respectively.  The last fix from Ming fixes a regression in this
  series, where our bvec gap logic was wrong and causes an oops on
  NVMe for certain conditions"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix bio_will_gap() for first bvec with offset
  nvme-fc: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-rdma: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-loop: Fix sqsize wrong assignment based on ctrl MQES capability

7 years agoMerge tag 'omap-for-v4.11/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux...
Olof Johansson [Sun, 16 Apr 2017 18:52:26 +0000 (11:52 -0700)]
Merge tag 'omap-for-v4.11/fixes-rc6-signed' of git://git./linux/kernel/git/tmlind/linux-omap into fixes

Regression fix for omap interconnect code for deferred probe.
Without this fix we can get PM related warnings for devices that
use deferred probe. If necessary, this fix can wait for the
v4.12 merge window no problem.

* tag 'omap-for-v4.11/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
  ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
  ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
  ARM: dts: ti: fix PCI bus dtc warnings
  ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
  ARM: dts: OMAP3: Fix MFG ID EEPROM

Signed-off-by: Olof Johansson <olof@lixom.net>
7 years agoMerge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj...
Linus Torvalds [Sun, 16 Apr 2017 18:48:10 +0000 (11:48 -0700)]
Merge branch 'for-4.11-fixes' of git://git./linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
 "Unfortunately, the commit to fix the cgroup mount race in the previous
  pull request can lead to hangs.

  The original bug has been around for a while and isn't too likely to
  be triggered in usual use cases. Revert the commit for now"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: avoid attaching a cgroup root to two different superblocks"

7 years agoMerge tag 'tty-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 16 Apr 2017 18:35:34 +0000 (11:35 -0700)]
Merge tag 'tty-4.11-rc7' of git://git./linux/kernel/git/gregkh/tty

Pull tty fix from Greg KH:
 "Here is a single tty core revert for a patch that was reported to
  cause problems.

  The original issue is one that we have lived with for decades, so
  trying to scramble to fix the fix in time for 4.11-final does not make
  sense due to the fragility of the tty ldisc layer. Just reverting it
  makes sense for now"

* tag 'tty-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "tty: don't panic on OOM in tty_set_ldisc()"

7 years agoMerge tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rosted...
Linus Torvalds [Sun, 16 Apr 2017 17:01:34 +0000 (10:01 -0700)]
Merge tag 'trace-v4.11-rc5-2' of git://git./linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "While rewriting the function probe code, I stumbled over a long
  standing bug. This bug has been there sinc function tracing was added
  way back when. But my new development depends on this bug being fixed,
  and it should be fixed regardless as it causes ftrace to disable
  itself when triggered, and a reboot is required to enable it again.

  The bug is that the function probe does not disable itself properly if
  there's another probe of its type still enabled. For example:

     # cd /sys/kernel/debug/tracing
     # echo schedule:traceoff > set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter
     # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter

  The above registers two traceoff probes (one for schedule and one for
  do_IRQ, and then removes do_IRQ.

  But since there still exists one for schedule, it is not done
  properly. When adding do_IRQ back, the breakage in the accounting is
  noticed by the ftrace self tests, and it causes a warning and disables
  ftrace"

* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix removing of second function probe

7 years agoRevert "cgroup: avoid attaching a cgroup root to two different superblocks"
Tejun Heo [Sun, 16 Apr 2017 14:17:37 +0000 (23:17 +0900)]
Revert "cgroup: avoid attaching a cgroup root to two different superblocks"

This reverts commit bfb0b80db5f9dca5ac0a5fd0edb765ee555e5a8e.

Andrei reports CRIU test hangs with the patch applied.  The bug fixed
by the patch isn't too likely to trigger in actual uses.  Revert the
patch for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
7 years agoMerge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdim...
Linus Torvalds [Sat, 15 Apr 2017 21:07:03 +0000 (14:07 -0700)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull nvdimm fixes from Dan Williams:
 "A small crop of lockdep, sleeping while atomic, and other fixes /
  band-aids in advance of the full-blown reworks targeting the next
  merge window. The largest change here is "libnvdimm: fix blk free
  space accounting" which deletes a pile of buggy code that better
  testing would have caught before merging. The next change that is
  borderline too big for a late rc is switching the device-dax locking
  from rcu to srcu, I couldn't think of a smaller way to make that fix.

  The __copy_user_nocache fix will have a full replacement in 4.12 to
  move those pmem special case considerations into the pmem driver. The
  "libnvdimm: band aid btt vs clear poison locking" commit admits that
  our error clearing support for btt went in broken, so we just disable
  it in 4.11 and -stable. A replacement / full fix is in the pipeline
  for 4.12

  Some of these would have been caught earlier had DEBUG_ATOMIC_SLEEP
  been enabled on my development station. I wonder if we should have:

      config DEBUG_ATOMIC_SLEEP
        default PROVE_LOCKING

  ...since I mistakenly thought I got both with PROVE_LOCKING=y.

  These have received a build success notification from the 0day robot,
  and some have appeared in a -next release with no reported issues"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
  device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation
  libnvdimm: band aid btt vs clear poison locking
  libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
  libnvdimm: fix blk free space accounting
  acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)

7 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 15 Apr 2017 16:42:14 +0000 (09:42 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is seven small fixes which are all for user visible issues that
  fortunately only occur in rare circumstances.

  The most serious is the sr one in which QEMU can cause us to read
  beyond the end of a buffer (I don't think it's exploitable, but just
  in case).

  The next is the sd capacity fix which means all non 512 byte sector
  drives greater than 2TB fail to be correctly sized.

  The rest are either in new drivers (qedf) or on error legs"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
  scsi: aacraid: fix PCI error recovery path
  scsi: sd: Fix capacity calculation with 32-bit sector_t
  scsi: qla2xxx: Add fix to read correct register value for ISP82xx.
  scsi: qedf: Fix crash due to unsolicited FIP VLAN response.
  scsi: sr: Sanity check returned mode data
  scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable

7 years agoMerge branch 'parisc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Sat, 15 Apr 2017 16:40:35 +0000 (09:40 -0700)]
Merge branch 'parisc-4.11-4' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc fix from Helge Deller:
 "Mikulas Patocka fixed a few bugs in our new pa_memcpy() assembler
  function, e.g. one bug made the kernel unbootable if source and
  destination address are the same"

* 'parisc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: fix bugs in pa_memcpy

7 years agoorangefs: free superblock when mount fails
Martin Brandenburg [Fri, 14 Apr 2017 18:22:41 +0000 (14:22 -0400)]
orangefs: free superblock when mount fails

Otherwise lockdep says:

[ 1337.483798] ================================================
[ 1337.483999] [ BUG: lock held when returning to user space! ]
[ 1337.484252] 4.11.0-rc6 #19 Not tainted
[ 1337.484423] ------------------------------------------------
[ 1337.484626] mount/14766 is leaving the kernel with locks still held!
[ 1337.484841] 1 lock held by mount/14766:
[ 1337.485017]  #0:  (&type->s_umount_key#33/1){+.+.+.}, at: [<ffffffff8124171f>] sget_userns+0x2af/0x520

Caught by xfstests generic/413 which tried to mount with the unsupported
mount option dax.  Then xfstests generic/422 ran sync which deadlocks.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agovfs: don't do RCU lookup of empty pathnames
Linus Torvalds [Mon, 3 Apr 2017 00:10:08 +0000 (17:10 -0700)]
vfs: don't do RCU lookup of empty pathnames

Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoparisc: fix bugs in pa_memcpy
Mikulas Patocka [Fri, 14 Apr 2017 18:15:20 +0000 (14:15 -0400)]
parisc: fix bugs in pa_memcpy

The patch 554bfeceb8a22d448cd986fc9efce25e833278a1 ("parisc: Fix access
fault handling in pa_memcpy()") reimplements the pa_memcpy function.
Unfortunatelly, it makes the kernel unbootable. The crash happens in the
function ide_complete_cmd where memcpy is called with the same source
and destination address.

This patch fixes a few bugs in pa_memcpy:

* When jumping to .Lcopy_loop_16 for the first time, don't skip the
  instruction "ldi 31,t0" (this bug made the kernel unbootable)
* Use the COND macro when comparing length, so that the comparison is
  64-bit (a theoretical issue, in case the length is greater than
  0xffffffff)
* Don't use the COND macro after the "extru" instruction (the PA-RISC
  specification says that the upper 32-bits of extru result are undefined,
  although they are set to zero in practice)
* Fix exception addresses in .Lcopy16_fault and .Lcopy8_fault
* Rename .Lcopy_loop_4 to .Lcopy_loop_8 (so that it is consistent with
  .Lcopy8_fault)

Cc: <stable@vger.kernel.org> # v4.9+
Fixes: 554bfeceb8a2 ("parisc: Fix access fault handling in pa_memcpy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Helge Deller <deller@gmx.de>