platform/kernel/linux-rpi.git
9 years agoMerge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Thu, 23 Apr 2015 19:16:37 +0000 (15:16 -0400)]
Merge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Changes

This patch series creates an operation vector for each of the different
memory registration modes.  This should make it easier to one day increase
credit limit, rsize, and wsize.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoMerge branch 'bugfixes'
Trond Myklebust [Thu, 23 Apr 2015 19:16:27 +0000 (15:16 -0400)]
Merge branch 'bugfixes'

* bugfixes:
  NFSv4: Return delegations synchronously in evict_inode
  SUNRPC: Fix a regression when reconnecting
  NFS: remount with security change should return EINVAL
  nfs: do not export discarded symbols
  NFSv4.1: don't export static symbol

9 years agofs/nfs: fix new compiler warning about boolean in switch
Andre Przywara [Thu, 23 Apr 2015 16:17:40 +0000 (17:17 +0100)]
fs/nfs: fix new compiler warning about boolean in switch

The brand new GCC 5.1.0 warns by default on using a boolean in the
switch condition. This results in the following warning:

fs/nfs/nfs4proc.c: In function 'nfs4_proc_get_rootfh':
fs/nfs/nfs4proc.c:3100:10: warning: switch condition has boolean value [-Wswitch-bool]
  switch (auth_probe) {
          ^

This code was obviously using switch to make use of the fall-through
semantics (without the usual comment, though).
Rewrite that code using if statements to avoid the warning and make
the code a bit more readable on the way.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Remove unneeded casts in nfs
Firo Yang [Thu, 23 Apr 2015 09:17:51 +0000 (17:17 +0800)]
nfs: Remove unneeded casts in nfs

Don't unnecessarily cast allocation return value in
fs/nfs/inode.c::nfs_alloc_inode().

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Don't attempt to decode missing directory entries
Benjamin Coddington [Tue, 21 Apr 2015 18:17:35 +0000 (14:17 -0400)]
NFS: Don't attempt to decode missing directory entries

If a READDIR reply comes back without any page data, avoid a NULL pointer
dereference in xdr_copy_to_scratch().

BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
IP: [<ffffffff813a378d>] memcpy+0xd/0x110
...
Call Trace:
? xdr_inline_decode+0x7a/0xb0 [sunrpc]
nfs3_decode_dirent+0x73/0x320 [nfsv3]
nfs_readdir_page_filler+0xd5/0x4e0 [nfs]
? nfs3_rpc_wrapper.constprop.9+0x42/0xc0 [nfsv3]
nfs_readdir_xdr_to_array+0x1fa/0x330 [nfs]
? mem_cgroup_commit_charge+0xac/0x160
? nfs_readdir_xdr_to_array+0x330/0x330 [nfs]
nfs_readdir_filler+0x22/0x90 [nfs]
do_read_cache_page+0x7e/0x1a0
read_cache_page+0x1c/0x20
nfs_readdir+0x18e/0x660 [nfs]
? nfs3_xdr_dec_getattr3res+0x80/0x80 [nfsv3]
iterate_dir+0x97/0x130
SyS_getdents+0x94/0x120
? fillonedir+0xd0/0xd0
system_call_fastpath+0x12/0x17

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoRevert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
Nicolas Iooss [Thu, 16 Apr 2015 10:48:39 +0000 (18:48 +0800)]
Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"

This reverts commit 5a254d08b086d80cbead2ebcee6d2a4b3a15587a.

Since commit 5a254d08b086 ("nfs: replace nfs_add_stats with
nfs_inc_stats when add one"), nfs_readpage and nfs_do_writepage use
nfs_inc_stats to increment NFSIOS_READPAGES and NFSIOS_WRITEPAGES
instead of nfs_add_stats.

However nfs_inc_stats does not do the same thing as nfs_add_stats with
value 1 because these functions work on distinct stats:
nfs_inc_stats increments stats from "enum nfs_stat_eventcounters" (in
server->io_stats->events) and nfs_add_stats those from "enum
nfs_stat_bytecounters" (in server->io_stats->bytes).

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Fixes: 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats...")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Rename idmap.c to nfs4idmap.c
Anna Schumaker [Wed, 15 Apr 2015 17:00:06 +0000 (13:00 -0400)]
NFS: Rename idmap.c to nfs4idmap.c

I added the nfs4 prefix to make it obvious that this file is built into
the NFS v4 module, and not the generic client.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Move nfs_idmap.h into fs/nfs/
Anna Schumaker [Wed, 15 Apr 2015 17:00:05 +0000 (13:00 -0400)]
NFS: Move nfs_idmap.h into fs/nfs/

This file is only used internally to the NFS v4 module, so it doesn't
need to be in the global include path.  I also renamed it from
nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
Anna Schumaker [Wed, 15 Apr 2015 17:00:04 +0000 (13:00 -0400)]
NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h

The idmapper is completely internal to the NFS v4 module, so this macro
will always evaluate to true.  This patch also removes unnecessary
includes of this file from the generic NFS client.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Add a stub for GETDEVICELIST
Anna Schumaker [Tue, 14 Apr 2015 14:34:20 +0000 (10:34 -0400)]
NFS: Add a stub for GETDEVICELIST

d4b18c3e (pnfs: remove GETDEVICELIST implementation) removed the
GETDEVICELIST operation from the NFS client, but left a "hole" in the
nfs4_procedures array.  This caused /proc/self/mountstats to report an
operation named "51" where GETDEVICELIST used to be.  This patch adds a
stub to fix mountstats.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fixes: d4b18c3e (pnfs: remove GETDEVICELIST implementation)
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
Peng Tao [Thu, 9 Apr 2015 15:02:17 +0000 (23:02 +0800)]
nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes

For flexfiles driver, we might choose to read from mirror index other
than 0 while mirror_count is always 1 for read.

Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: fix DIO good bytes calculation
Peng Tao [Thu, 9 Apr 2015 15:02:16 +0000 (23:02 +0800)]
nfs: fix DIO good bytes calculation

For direct read that has IO size larger than rsize, we'll split
it into several READ requests and nfs_direct_good_bytes() would
count completed bytes incorrectly by eating last zero count reply.

Fix it by handling mirror and non-mirror cases differently such that
we only count mirrored writes differently.

This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring").

Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Fetch MOUNTED_ON_FILEID when updating an inode
Anna Schumaker [Fri, 3 Apr 2015 18:35:59 +0000 (14:35 -0400)]
nfs: Fetch MOUNTED_ON_FILEID when updating an inode

2ef47eb1 (NFS: Fix use of nfs_attr_use_mounted_on_fileid()) was a good
start to fixing a circular directory structure warning for NFS v4
"junctioned" mountpoints.  Unfortunately, further testing continued to
generate this error.

My server is configured like this:

anna@nfsd ~ % df
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.1G  2.0G  6.5G  24% /
/dev/vdc1      1014M   33M  982M   4% /exports
/dev/vdc2      1014M   33M  982M   4% /exports/vol1
/dev/vdc3      1014M   33M  982M   4% /exports/vol1/vol2

anna@nfsd ~ % cat /etc/exports
/exports/          *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/     *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/vol2 *(rw,async,no_subtree_check,no_root_squash)

I've been running chown across the entire mountpoint twice in a row to
hit this problem.  The first run succeeds, but the second one fails with
the circular directory warning along with:

anna@client ~ % dmesg
[Apr 3 14:28] NFS: server 192.168.100.204 error: fileid changed
              fsid 0:39: expected fileid 0x100080, got 0x80

WHere 0x80 is the mountpoint's fileid and 0x100080 is the mounted-on
fileid.

This patch fixes the issue by requesting an updated mounted-on fileid
from the server during nfs_update_inode(), and then checking that the
fileid stored in the nfs_inode matches either the fileid or mounted-on
fileid returned by the server.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: make debugfs file creation failure non-fatal
Jeff Layton [Tue, 31 Mar 2015 16:03:28 +0000 (12:03 -0400)]
sunrpc: make debugfs file creation failure non-fatal

v2: gracefully handle the case where some dentry pointers end up NULL
    and be more dilligent about zeroing out dentry pointers

We currently have a problem that SELinux policy is being enforced when
creating debugfs files. If a debugfs file is created as a side effect of
doing some syscall, then that creation can fail if the SELinux policy
for that process prevents it.

This seems wrong. We don't do that for files under /proc, for instance,
so Bruce has proposed a patch to fix that.

While discussing that patch however, Greg K.H. stated:

    "No kernel code should care / fail if a debugfs function fails, so
     please fix up the sunrpc code first."

This patch converts all of the sunrpc debugfs setup code to be void
return functins, and the callers to not look for errors from those
functions.

This should allow rpc_clnt and rpc_xprt creation to work, even if the
kernel fails to create debugfs files for some reason.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: fix high load average due to callback thread sleeping
Jeff Layton [Fri, 20 Mar 2015 19:15:14 +0000 (15:15 -0400)]
nfs: fix high load average due to callback thread sleeping

Chuck pointed out a problem that crept in with commit 6ffa30d3f734 (nfs:
don't call blocking operations while !TASK_RUNNING). Linux counts tasks
in uninterruptible sleep against the load average, so this caused the
system's load average to be pinned at at least 1 when there was a
NFSv4.1+ mount active.

Not a huge problem, but it's probably worth fixing before we get too
many complaints about it. This patch converts the code back to use
TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
iteration. In practice no one should really be signalling this thread at
all, so I think this is reasonably safe.

With this change, there's also no need to game the hung task watchdog so
we can also convert the schedule_timeout call back to a normal schedule.

Cc: <stable@vger.kernel.org>
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: commit 6ffa30d3f734 (“nfs: don't call blocking . . .”)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Reduce time spent holding the i_mutex during fallocate()
Anna Schumaker [Mon, 16 Mar 2015 18:06:24 +0000 (14:06 -0400)]
NFS: Reduce time spent holding the i_mutex during fallocate()

At the very least, we should not be taking the i_mutex until after
checking if the server even supports ALLOCATE or DEALLOCATE, allowing
v4.0 or v4.1 to exit without potentially waiting on a lock.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Don't zap caches on fallocate()
Anna Schumaker [Mon, 16 Mar 2015 18:06:23 +0000 (14:06 -0400)]
NFS: Don't zap caches on fallocate()

This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
operations so we can set the updated inode size and change attribute
directly.  DEALLOCATE will still need to release pagecache pages, so
nfs42_proc_deallocate() now calls truncate_pagecache_range() before
contacting the server.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoxprtrdma: Make rpcrdma_{un}map_one() into inline functions
Chuck Lever [Mon, 30 Mar 2015 18:35:44 +0000 (14:35 -0400)]
xprtrdma: Make rpcrdma_{un}map_one() into inline functions

These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Handle non-SEND completions via a callout
Chuck Lever [Mon, 30 Mar 2015 18:35:35 +0000 (14:35 -0400)]
xprtrdma: Handle non-SEND completions via a callout

Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add "open" memreg op
Chuck Lever [Mon, 30 Mar 2015 18:35:26 +0000 (14:35 -0400)]
xprtrdma: Add "open" memreg op

The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add "destroy MRs" memreg op
Chuck Lever [Mon, 30 Mar 2015 18:35:17 +0000 (14:35 -0400)]
xprtrdma: Add "destroy MRs" memreg op

Memory Region objects associated with a transport instance are
destroyed before the instance is shutdown and destroyed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add "reset MRs" memreg op
Chuck Lever [Mon, 30 Mar 2015 18:35:07 +0000 (14:35 -0400)]
xprtrdma: Add "reset MRs" memreg op

This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add "init MRs" memreg op
Chuck Lever [Mon, 30 Mar 2015 18:34:58 +0000 (14:34 -0400)]
xprtrdma: Add "init MRs" memreg op

This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are no-ops for that mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add a "deregister_external" op for each memreg mode
Chuck Lever [Mon, 30 Mar 2015 18:34:48 +0000 (14:34 -0400)]
xprtrdma: Add a "deregister_external" op for each memreg mode

There is very little common processing among the different external
memory deregistration functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add a "register_external" op for each memreg mode
Chuck Lever [Mon, 30 Mar 2015 18:34:39 +0000 (14:34 -0400)]
xprtrdma: Add a "register_external" op for each memreg mode

There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add a "max_payload" op for each memreg mode
Chuck Lever [Mon, 30 Mar 2015 18:34:30 +0000 (14:34 -0400)]
xprtrdma: Add a "max_payload" op for each memreg mode

The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPC_MAX_DATA_SEGS and the number of
data segments that can be transmitted in an inline buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Add vector of ops for each memory registration strategy
Chuck Lever [Mon, 30 Mar 2015 18:34:21 +0000 (14:34 -0400)]
xprtrdma: Add vector of ops for each memory registration strategy

Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Start by defining a vector of operations for each supported memory
registration mode, and by adding a source file for each mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Prevent infinite loop in rpcrdma_ep_create()
Chuck Lever [Mon, 30 Mar 2015 18:34:12 +0000 (14:34 -0400)]
xprtrdma: Prevent infinite loop in rpcrdma_ep_create()

If a provider advertizes a zero max_fast_reg_page_list_len, FRWR
depth detection loops forever. Instead of just failing the mount,
try other memory registration modes.

Fixes: 0fc6c4e7bb28 ("xprtrdma: mind the device's max fast . . .")
Reported-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Byte-align FRWR registration
Chuck Lever [Mon, 30 Mar 2015 18:34:02 +0000 (14:34 -0400)]
xprtrdma: Byte-align FRWR registration

The RPC/RDMA transport's FRWR registration logic registers whole
pages. This means areas in the first and last pages that are not
involved in the RDMA I/O are needlessly exposed to the server.

Buffered I/O is typically page-aligned, so not a problem there. But
for direct I/O, which can be byte-aligned, and for reply chunks,
which are nearly always smaller than a page, the transport could
expose memory outside the I/O buffer.

FRWR allows byte-aligned memory registration, so let's use it as
it was intended.

Reported-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Perform a full marshal on retransmit
Chuck Lever [Mon, 30 Mar 2015 18:33:53 +0000 (14:33 -0400)]
xprtrdma: Perform a full marshal on retransmit

Commit 6ab59945f292 ("xprtrdma: Update rkeys after transport
reconnect" added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f292, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f292, while also performing pullup
during retransmits.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Display IPv6 addresses and port numbers correctly
Chuck Lever [Mon, 30 Mar 2015 18:33:43 +0000 (14:33 -0400)]
xprtrdma: Display IPv6 addresses and port numbers correctly

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoSUNRPC: Introduce missing well-known netids
Chuck Lever [Mon, 30 Mar 2015 18:33:33 +0000 (14:33 -0400)]
SUNRPC: Introduce missing well-known netids

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoLinux 4.0-rc6
Linus Torvalds [Sun, 29 Mar 2015 22:26:31 +0000 (15:26 -0700)]
Linux 4.0-rc6

9 years agoMerge tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm...
Linus Torvalds [Sun, 29 Mar 2015 22:09:31 +0000 (15:09 -0700)]
Merge tag 'armsoc-for-linus' of git://git./linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:
 "The latest and greatest fixes for ARM platform code.  Worth pointing
  out are:

   - Lines-wise, largest is a PXA fix for dealing with interrupts on DT
     that was quite broken.  It's still newish code so while we could
     have held this off, it seemed appropriate to include now

   - Some GPIO fixes for OMAP platforms added a few lines.  This was
     also fixes for code recently added (this release).

   - Small OMAP timer fix to behave better with partially upstreamed
     platforms, which is quite welcome.

   - Allwinner fixes about operating point control, reducing
     overclocking in some cases for better stability.

  plus a handful of other smaller fixes across the map"

* tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: juno: Fix misleading name of UART reference clock
  ARM: dts: sunxi: Remove overclocked/overvoltaged OPP
  ARM: dts: sun4i: a10-lime: Override and remove 1008MHz OPP setting
  ARM: socfpga: dts: fix spi1 interrupt
  ARM: dts: Fix gpio interrupts for dm816x
  ARM: dts: dra7: remove ti,hwmod property from pcie phy
  ARM: OMAP: dmtimer: disable pm runtime on remove
  ARM: OMAP: dmtimer: check for pm_runtime_get_sync() failure
  ARM: OMAP2+: Fix socbus family info for AM33xx devices
  ARM: dts: omap3: Add missing dmas for crypto
  ARM: dts: rockchip: disable gmac by default in rk3288.dtsi
  MAINTAINERS: add rockchip regexp to the ARM/Rockchip entry
  ARM: pxa: fix pxa interrupts handling in DT
  ARM: pxa: Fix typo in zeus.c
  ARM: sunxi: Have ARCH_SUNXI select RESET_CONTROLLER for clock driver usage

9 years agoMerge tag 'sunxi-fixes-for-4.0' of https://git.kernel.org/pub/scm/linux/kernel/git...
Olof Johansson [Sun, 29 Mar 2015 21:00:53 +0000 (14:00 -0700)]
Merge tag 'sunxi-fixes-for-4.0' of https://git./linux/kernel/git/mripard/linux into fixes

Allwinner fixes for 4.0

There's a few fixes to merge for 4.0, one to add a select in the machine
Kconfig option to fix a potential build failure, and two fixing cpufreq related
issues.

* tag 'sunxi-fixes-for-4.0' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux:
  ARM: dts: sunxi: Remove overclocked/overvoltaged OPP
  ARM: dts: sun4i: a10-lime: Override and remove 1008MHz OPP setting
  ARM: sunxi: Have ARCH_SUNXI select RESET_CONTROLLER for clock driver usage

Signed-off-by: Olof Johansson <olof@lixom.net>
9 years agoMerge tag 'fixes-v4.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind...
Olof Johansson [Sun, 29 Mar 2015 20:58:54 +0000 (13:58 -0700)]
Merge tag 'fixes-v4.0-rc4' of git://git./linux/kernel/git/tmlind/linux-omap into fixes

Fixes for omaps for the -rc cycle:

- Fix a device tree based booting vs legacy booting regression for
  omap3 crypto hardware by adding the missing DMA channels.

- Fix /sys/bus/soc/devices/soc0/family for am33xx devices.

- Fix two timer issues that can cause hangs if the timer related
  hwmod data is missing like it often initially is for new SoCs.

- Remove pcie hwmods entry from dts as that causes runtime PM to
  fail for the PHYs.

- A paper bag type dts configuration fix for dm816x GPIO
  interrupts that I just noticed. This is most of the changes
  diffstat wise, but as it's a basic feature for connecting
  devices and things work otherwise, it should be fixed.

* tag 'fixes-v4.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: dts: Fix gpio interrupts for dm816x
  ARM: dts: dra7: remove ti,hwmod property from pcie phy
  ARM: OMAP: dmtimer: disable pm runtime on remove
  ARM: OMAP: dmtimer: check for pm_runtime_get_sync() failure
  ARM: OMAP2+: Fix socbus family info for AM33xx devices
  ARM: dts: omap3: Add missing dmas for crypto

Signed-off-by: Olof Johansson <olof@lixom.net>
9 years agoMerge tag 'socfpga_fix_for_v4.0_2' of git://git.rocketboards.org/linux-socfpga-next...
Olof Johansson [Sun, 29 Mar 2015 20:58:04 +0000 (13:58 -0700)]
Merge tag 'socfpga_fix_for_v4.0_2' of git://git.rocketboards.org/linux-socfpga-next into fixes

Late fix for v4.0 on the SoCFPGA platform:
- Fix interrupt number for SPI1 interface

* tag 'socfpga_fix_for_v4.0_2' of git://git.rocketboards.org/linux-socfpga-next:
  ARM: socfpga: dts: fix spi1 interrupt

Signed-off-by: Olof Johansson <olof@lixom.net>
9 years agoarm64: juno: Fix misleading name of UART reference clock
Dave Martin [Tue, 17 Mar 2015 12:35:41 +0000 (12:35 +0000)]
arm64: juno: Fix misleading name of UART reference clock

The UART reference clock speed is 7273.8 kHz, not 72738 kHz.

Dots aren't usually used in node names even though ePAPR permits
them.  However, this can easily be avoided by expressing the
frequency in Hz, not kHz.

This patch changes the name to refclk7273800hz, reflecting the
actual clock speed.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
9 years agoMerge tag 'fixes-for-v4.0-rc5' of https://github.com/rjarzmik/linux into fixes
Olof Johansson [Sun, 29 Mar 2015 20:47:21 +0000 (13:47 -0700)]
Merge tag 'fixes-for-v4.0-rc5' of https://github.com/rjarzmik/linux into fixes

arm: pxa: fixes for v4.0-rc5

There are only 2 fixes, one for the zeus board about the regulator changes,
where a typo prevented the zeus board from having a working can regulator,
and one regression triggered by the interrupts IRQ shift of 16 affecting all
boards.

* tag 'fixes-for-v4.0-rc5' of https://github.com/rjarzmik/linux:
  ARM: pxa: fix pxa interrupts handling in DT
  ARM: pxa: Fix typo in zeus.c

Signed-off-by: Olof Johansson <olof@lixom.net>
9 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 Mar 2015 18:25:04 +0000 (11:25 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
 "Fix x86 syscall exit code bug that resulted in spurious non-execution
  of TIF-driven user-return worklets, causing big trouble for things
  like KVM that rely on user notifiers for correctness of their vcpu
  model, causing crashes like double faults"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm/entry: Check for syscall exit work with IRQs disabled

9 years agoMerge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 Mar 2015 18:21:23 +0000 (11:21 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Two clocksource driver fixes, and an idle loop RCU warning fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock()
  clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors
  timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop

9 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 Mar 2015 18:17:32 +0000 (11:17 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "A single sched/rt corner case fix for RLIMIT_RTIME correctness"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix RLIMIT_RTTIME when PI-boosting to RT

9 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 Mar 2015 18:12:08 +0000 (11:12 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Ingo Molnar:
 "A perf kernel side fix for a fuzzer triggered lockup"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix irq_work 'tail' recursion

9 years agoMerge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 Mar 2015 18:05:03 +0000 (11:05 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "A module unload lockdep race fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix the module unload key range freeing logic

9 years agoMerge branch 'parisc-4.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Sat, 28 Mar 2015 17:58:53 +0000 (10:58 -0700)]
Merge branch 'parisc-4.0-1' of git://git./linux/kernel/git/deller/parisc-linux

Pull parsic fixes from Helge Deller:
 "One patch from Mikulas fixes a bug on parisc by artifically
  incrementing the counter in pmd_free when the kernel tries to free
  the preallocated pmd.

  Other than that we now prevent that syscalls gets added without
  incrementing __NR_Linux_syscalls and fix the initial pmd setup code
  if a default page size greater than 4k has been selected"

* 'parisc-4.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix pmd code to depend on PT_NLEVELS value, not on CONFIG_64BIT
  parisc: mm: don't count preallocated pmds
  parisc: Add compile-time check when adding new syscalls

9 years agoMerge git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sat, 28 Mar 2015 17:54:59 +0000 (10:54 -0700)]
Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm ppc bugfixes from Marcelo Tosatti.

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: PPC: Book3S HV: Fix instruction emulation
  KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count
  KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in kvmppc_set_lpcr()

9 years agoMerge tag 'arc-4.0-fixes-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 28 Mar 2015 17:47:27 +0000 (09:47 -0800)]
Merge tag 'arc-4.0-fixes-part-2' of git://git./linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:
 "We found some issues with signal handling taking down the system.  I
  know its late, but these are important and all marked for stable.

  ARC signal handling related fixes uncovered during recent testing of
  NPTL tools"

* tag 'arc-4.0-fixes-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: signal handling robustify
  ARC: SA_SIGINFO ucontext regs off-by-one

9 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris...
Linus Torvalds [Sat, 28 Mar 2015 17:41:22 +0000 (10:41 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security

Pull selinux bugfix from James Morris.

Fix broken return value.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  selinux: fix sel_write_enforce broken return value

9 years agoMerge git://www.linux-watchdog.org/linux-watchdog
Linus Torvalds [Fri, 27 Mar 2015 21:45:42 +0000 (14:45 -0700)]
Merge git://www.linux-watchdog.org/linux-watchdog

Pull watchdog fixes from Wim Van Sebroeck:

 - mtk_wdt: signedness bug in mtk_wdt_start()

 - imgpdc: Fix NULL pointer dereference during probe and fix the default
   heartbeat

* git://www.linux-watchdog.org/linux-watchdog:
  watchdog: imgpdc: Fix default heartbeat
  watchdog: imgpdc: Fix probe NULL pointer dereference
  watchdog: mtk_wdt: signedness bug in mtk_wdt_start()

9 years agoMerge tag 'sound-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 27 Mar 2015 21:38:02 +0000 (14:38 -0700)]
Merge tag 'sound-4.0-rc6' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Three trivial oneliner fixes for HD-audio.

  Two are device-specific quirks while one is a generic fix for recent
  Realtek codecs"

* tag 'sound-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Add one more node in the EAPD supporting candidate list
  ALSA: hda_intel: apply the Seperate stream_tag for Sunrise Point
  ALSA: hda - Add dock support for Thinkpad T450s (17aa:5036)

9 years agoNFS: Block new writes while syncing data in nfs_getattr()
Trond Myklebust [Wed, 25 Mar 2015 20:40:07 +0000 (16:40 -0400)]
NFS: Block new writes while syncing data in nfs_getattr()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1/pnfs: Separate out metadata and data consistency for pNFS
Trond Myklebust [Wed, 25 Mar 2015 18:14:42 +0000 (14:14 -0400)]
NFSv4.1/pnfs: Separate out metadata and data consistency for pNFS

The LAYOUTCOMMIT operation means different things to different layout types.
For blocks and objects, it is both a data and metadata consistency operation.
For files and flexfiles, it is only a metadata consistency operation.

This patch separates out the 2 cases, allowing the files/flexfiles layout
drivers to optimise away the data consistency calls to layoutcommit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1/pnfs: Ensure we send layoutcommit before return-on-close
Trond Myklebust [Wed, 25 Mar 2015 16:36:13 +0000 (12:36 -0400)]
NFSv4.1/pnfs: Ensure we send layoutcommit before return-on-close

We must not send a close or delegreturn that would result in a
return-on-close of the layout without ensuring that we've also
sent the necessary layoutcommit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1/pnfs: Ensure that writes respect the O_SYNC flag when doing O_DIRECT
Trond Myklebust [Wed, 25 Mar 2015 22:58:53 +0000 (18:58 -0400)]
NFSv4.1/pnfs: Ensure that writes respect the O_SYNC flag when doing O_DIRECT

If the caller does not specify the O_SYNC flag, then it is legitimate
to return from O_DIRECT without doing a pNFS layoutcommit operation.
However if the file is opened O_DIRECT|O_SYNC then we'd better get it
right.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4: Truncating file opens should also sync O_DIRECT writes
Trond Myklebust [Wed, 25 Mar 2015 21:23:31 +0000 (17:23 -0400)]
NFSv4: Truncating file opens should also sync O_DIRECT writes

We don't just want to sync out buffered writes, but also O_DIRECT ones.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: File unlock needs to be a metadata synchronisation point
Trond Myklebust [Thu, 26 Mar 2015 19:55:49 +0000 (15:55 -0400)]
NFS: File unlock needs to be a metadata synchronisation point

File unlock needs to update both data and metadata on the NFS server
in order to act as a synchronisation point for other clients.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Add a helper to sync both O_DIRECT and buffered writes
Trond Myklebust [Wed, 25 Mar 2015 20:38:33 +0000 (16:38 -0400)]
NFS: Add a helper to sync both O_DIRECT and buffered writes

Then apply it to nfs_setattr() and nfs_getattr().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1/pnfs: Refactor pnfs_set_layoutcommit()
Trond Myklebust [Thu, 26 Mar 2015 00:40:38 +0000 (20:40 -0400)]
NFSv4.1/pnfs: Refactor pnfs_set_layoutcommit()

pnfs_set_layoutcommit() and pnfs_commit_set_layoutcommit() are 100% identical
except for the function arguments. Refactor to eliminate the difference.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1/pnfs: Fix setting of layoutcommit last write byte
Trond Myklebust [Thu, 26 Mar 2015 00:10:20 +0000 (20:10 -0400)]
NFSv4.1/pnfs: Fix setting of layoutcommit last write byte

If the NFS_INO_LAYOUTCOMMIT flag was unset, then we _must_ ensure that
we also reset the last write byte (lwb) for that layout. The current
code depends on us clearing the lwb when we clear NFS_INO_LAYOUTCOMMIT,
which is not the case when we call pnfs_clear_layoutcommit().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4: Return the delegation before returning the layout in evict_inode()
Trond Myklebust [Wed, 25 Mar 2015 17:26:18 +0000 (13:26 -0400)]
NFSv4: Return the delegation before returning the layout in evict_inode()

Minor optimisation for the case where the layout has return-on-close
enabled.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4: Allow tracing of NFSv4 fsync calls
Trond Myklebust [Wed, 25 Mar 2015 23:09:08 +0000 (19:09 -0400)]
NFSv4: Allow tracing of NFSv4 fsync calls

I appear to have missed this when adding the ftrace probes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Fix free_deveiceid -> free_deviceid
Trond Myklebust [Mon, 9 Mar 2015 21:25:14 +0000 (17:25 -0400)]
NFS: Fix free_deveiceid -> free_deviceid

Make it easier to grep for these functions by name.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: Don't cache deviceids that have no notifications
Trond Myklebust [Mon, 9 Mar 2015 18:48:32 +0000 (14:48 -0400)]
NFSv4.1: Don't cache deviceids that have no notifications

The spec says that once all layouts that reference a given deviceid
have been returned, then we are only allowed to continue to cache
the deviceid if the metadata server supports notifications.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: Allow getdeviceinfo to return notification info back to caller
Trond Myklebust [Mon, 9 Mar 2015 18:01:25 +0000 (14:01 -0400)]
NFSv4.1: Allow getdeviceinfo to return notification info back to caller

We are only allowed to cache deviceinfo if the server supports notifications
and actually promises to call us back when changes occur. Right now, we
request those notifications, but then we don't check the server's reply.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: Cleanup - don't opencode nfs4_put_deviceid_node()
Trond Myklebust [Mon, 9 Mar 2015 19:37:00 +0000 (15:37 -0400)]
NFSv4.1: Cleanup - don't opencode nfs4_put_deviceid_node()

There really is no reason to do so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: Convert pNFS deviceid to use kfree_rcu()
Trond Myklebust [Mon, 9 Mar 2015 19:23:35 +0000 (15:23 -0400)]
NFSv4.1: Convert pNFS deviceid to use kfree_rcu()

Use of synchronize_rcu() when unmounting and potentially freeing a lot
of deviceids is problematic. There really is no reason why we can't just
use kfree_rcu() here.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4: Return delegations synchronously in evict_inode
Trond Myklebust [Wed, 25 Mar 2015 17:19:42 +0000 (13:19 -0400)]
NFSv4: Return delegations synchronously in evict_inode

Kinglong Mee reports that asynchronous delegations are being killed
by the call to rpc_shutdown_client() when unmounting. This can lead
to state leakage on the server until the client lease expires.

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Fix a regression when reconnecting
Trond Myklebust [Mon, 23 Mar 2015 20:10:00 +0000 (16:10 -0400)]
SUNRPC: Fix a regression when reconnecting

If the task needs to give up the socket lock in order to allow a
reconnect to occur, then it must also clear the 'rq_bytes_sent' field
so that when it retransmits, it knows to start from the beginning.

Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from racing")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoMerge branch 'upstream' of git://git.infradead.org/users/pcmoore/selinux into for...
James Morris [Fri, 27 Mar 2015 09:33:27 +0000 (20:33 +1100)]
Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/selinux into for-linus

9 years agowatchdog: imgpdc: Fix default heartbeat
James Hogan [Fri, 20 Feb 2015 23:45:45 +0000 (23:45 +0000)]
watchdog: imgpdc: Fix default heartbeat

The IMG PDC watchdog driver heartbeat module parameter has no default so
it is initialised to zero. This results in the following warning during
probe:

imgpdc-wdt 2006000.wdt: Initial timeout out of range! setting max timeout

The module parameter description implies that the default value should
be PDC_WDT_DEF_TIMEOUT, which isn't yet used, so initialise it to that.

Also tweak the heartbeat module parameter description for consistency.

Fixes: 93937669e9b5 ("watchdog: ImgTec PDC Watchdog Timer Driver")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ezequiel Garcia <ezequiel.garcia@imgtec.com>
Cc: Naidu Tellapati <Naidu.Tellapati@imgtec.com>
Cc: Jude Abraham <Jude.Abraham@imgtec.com>
Cc: linux-watchdog@vger.kernel.org
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
9 years agowatchdog: imgpdc: Fix probe NULL pointer dereference
James Hogan [Fri, 20 Feb 2015 23:45:44 +0000 (23:45 +0000)]
watchdog: imgpdc: Fix probe NULL pointer dereference

The IMG PDC watchdog probe function calls pdc_wdt_stop() prior to
watchdog_set_drvdata(), causing a NULL pointer dereference when
pdc_wdt_stop() retrieves the struct pdc_wdt_dev pointer using
watchdog_get_drvdata() and reads the register base address through it.

Fix by moving the watchdog_set_drvdata() call earlier, to where various
other pdc_wdt->wdt_dev fields are initialised.

Fixes: 93937669e9b5 ("watchdog: ImgTec PDC Watchdog Timer Driver")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ezequiel Garcia <ezequiel.garcia@imgtec.com>
Cc: Naidu Tellapati <Naidu.Tellapati@imgtec.com>
Cc: Jude Abraham <Jude.Abraham@imgtec.com>
Cc: linux-watchdog@vger.kernel.org
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
9 years agowatchdog: mtk_wdt: signedness bug in mtk_wdt_start()
Dan Carpenter [Wed, 11 Feb 2015 10:26:21 +0000 (13:26 +0300)]
watchdog: mtk_wdt: signedness bug in mtk_wdt_start()

"ret" should be signed for the error handling to work correctly.  This
doesn't matter much in real life since mtk_wdt_set_timeout() always
succeeds.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
9 years agoMerge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Thu, 26 Mar 2015 22:04:05 +0000 (15:04 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm refcounting fixes from Dave Airlie:
 "Here is the complete set of i915 bug/warn/refcounting fixes"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/i915: Fixup legacy plane->crtc link for initial fb config
  drm/i915: Fix atomic state when reusing the firmware fb
  drm/i915: Keep ring->active_list and ring->requests_list consistent
  drm/i915: Don't try to reference the fb in get_initial_plane_config()
  drm: Fixup racy refcounting in plane_force_disable

9 years agoMerge tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device...
Linus Torvalds [Thu, 26 Mar 2015 21:53:47 +0000 (14:53 -0700)]
Merge tag 'dm-4.0-fix-2' of git://git./linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix DM core device cleanup regression -- due to a latent race that was
  exposed by the bdi changes that were introduced during the 4.0 merge"

* tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix add_disk() NULL pointer due to race with free_dev()

9 years agoMerge tag 'linux-kselftest-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 26 Mar 2015 21:43:42 +0000 (14:43 -0700)]
Merge tag 'linux-kselftest-4.0-rc6' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull kselftest fix from Shuah Khan.

* tag 'linux-kselftest-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: Fix build failures when invoked from kselftest target

9 years agoMerge tag 'drm-intel-fixes-2015-03-26' of git://anongit.freedesktop.org/drm-intel...
Dave Airlie [Thu, 26 Mar 2015 21:39:45 +0000 (07:39 +1000)]
Merge tag 'drm-intel-fixes-2015-03-26' of git://anongit.freedesktop.org/drm-intel into drm-fixes

This should cover the final warnings in -rc5 with two more backports
from our development branch (drm-intel-next-queued). They're the ones
from Daniel and Damien, with references to the reports.

This is on top of drm-fixes because of the dependency on the two earlier
fixes not yet in Linus' tree.

There's an additional regression fix from Chris.

* tag 'drm-intel-fixes-2015-03-26' of git://anongit.freedesktop.org/drm-intel:
  drm/i915: Fixup legacy plane->crtc link for initial fb config
  drm/i915: Fix atomic state when reusing the firmware fb
  drm/i915: Keep ring->active_list and ring->requests_list consistent

9 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Thu, 26 Mar 2015 21:11:17 +0000 (14:11 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Martin Schwidefsky:
 "A couple of bug fixes for s390.

  The ftrace comile fix is quite large for a -rc6 release, but it would
  be nice to have it in 4.0"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/smp: reenable smt after resume
  s390/mm: limit STACK_RND_MASK for compat tasks
  s390/ftrace: fix compile error if CONFIG_KPROBES is disabled
  s390/cpum_sf: add diagnostic sampling event only if it is authorized

9 years agodrm/i915: Fixup legacy plane->crtc link for initial fb config
Daniel Vetter [Wed, 25 Mar 2015 17:30:38 +0000 (18:30 +0100)]
drm/i915: Fixup legacy plane->crtc link for initial fb config

This is a very similar bug in the load detect code fixed in

commit 9128b040eb774e04bc23777b005ace2b66ab2a85
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Mar 3 17:31:21 2015 +0100

    drm/i915: Fix modeset state confusion in the load detect code

But this time around it was the initial fb code that forgot to update
the plane->crtc pointer. Otherwise it's the exact same bug, with the
exact same restrains (any set_config call/ioctl that doesn't disable
the pipe papers over the bug for free, so fairly hard to hit in normal
testing). So if you want the full explanation just go read that one
over there - it's rather long ...

Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[Jani: backported to drm-intel-fixes for v4.0-rc]
Reference: http://mid.gmane.org/CA+5PVA7ChbtJrknqws1qvZcbrg1CW2pQAFkSMURWWgyASRyGXg@mail.gmail.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
9 years agodrm/i915: Fix atomic state when reusing the firmware fb
Damien Lespiau [Thu, 5 Feb 2015 19:24:25 +0000 (19:24 +0000)]
drm/i915: Fix atomic state when reusing the firmware fb

Right now, we get a warning when taking over the firmware fb:

  [drm:drm_atomic_plane_check] FB set but no CRTC

with the following backtrace:

  [<ffffffffa010339d>] drm_atomic_check_only+0x35d/0x510 [drm]
  [<ffffffffa0103567>] drm_atomic_commit+0x17/0x60 [drm]
  [<ffffffffa00a6ccd>] drm_atomic_helper_plane_set_property+0x8d/0xd0 [drm_kms_helper]
  [<ffffffffa00f1fed>] drm_mode_plane_set_obj_prop+0x2d/0x90 [drm]
  [<ffffffffa00a8a1b>] restore_fbdev_mode+0x6b/0xf0 [drm_kms_helper]
  [<ffffffffa00aa969>] drm_fb_helper_restore_fbdev_mode_unlocked+0x29/0x80 [drm_kms_helper]
  [<ffffffffa00aa9e2>] drm_fb_helper_set_par+0x22/0x50 [drm_kms_helper]
  [<ffffffffa050a71a>] intel_fbdev_set_par+0x1a/0x60 [i915]
  [<ffffffff813ad444>] fbcon_init+0x4f4/0x580

That's because we update the plane state with the fb from the firmware, but we
never associate the plane to that CRTC.

We don't quite have the full DRM take over from HW state just yet, so
fake enough of the plane atomic state to pass the checks.

v2: Fix the state on which we set the CRTC in the case we're sharing the
    initial fb with another pipe. (Matt)

Signed-off-by: Damien Lespiau <damien.lespiau@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[Jani: backported to drm-intel-fixes for v4.0-rc]
Reference: http://mid.gmane.org/CA+5PVA7yXH=U757w8V=Zj2U1URG4nYNav20NpjtQ4svVueyPNw@mail.gmail.com
Reference: http://lkml.kernel.org/r/CA+55aFweWR=nDzc2Y=rCtL_H8JfdprQiCimN5dwc+TgyD4Bjsg@mail.gmail.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
9 years agoALSA: hda - Add one more node in the EAPD supporting candidate list
Hui Wang [Thu, 26 Mar 2015 09:14:55 +0000 (17:14 +0800)]
ALSA: hda - Add one more node in the EAPD supporting candidate list

We have a HP machine which use the codec node 0x17 connecting the
internal speaker, and from the node capability, we saw the EAPD,
if we don't set the EAPD on for this node, the internal speaker
can't output any sound.

Cc: <stable@vger.kernel.org>
BugLink: https://bugs.launchpad.net/bugs/1436745
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
9 years agoclocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock()
Maxime Ripard [Thu, 26 Mar 2015 09:27:09 +0000 (10:27 +0100)]
clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock()

The sun5i timer is used as the sched-clock on certain systems, and ever
since we started using cpufreq, the cpu clock (that is one of the
timer's clock indirect parent) now changes as well, along with the
actual sched_clock() rate.

This is not accurate and not desirable.

We can safely remove the sun5i sched-clock on those systems, since we
have other reliable sched_clock() sources in the system.

Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ Improved the changelog. ]
Cc: richard@nod.at
Link: http://lkml.kernel.org/r/1427362029-6511-4-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
9 years agoclocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors
Richard Weinberger [Thu, 26 Mar 2015 09:27:06 +0000 (10:27 +0100)]
clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors

Fix !CONFIG_HAS_IOMEM related build failures in three clocksource drivers.

The build failures have the pattern of:

  drivers/clocksource/sh_cmt.c: In function ‘sh_cmt_map_memory’: drivers/clocksource/sh_cmt.c:920:2:
  error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]   cmt->mapbase = ioremap_nocache(mem->start, resource_size(mem));

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: maxime.ripard@free-electrons.com
Link: http://lkml.kernel.org/r/1427362029-6511-1-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
9 years agodrm/i915: Keep ring->active_list and ring->requests_list consistent
Chris Wilson [Wed, 18 Mar 2015 18:19:22 +0000 (18:19 +0000)]
drm/i915: Keep ring->active_list and ring->requests_list consistent

If we retire requests last, we may use a later seqno and so clear
the requests lists without clearing the active list, leading to
confusion. Hence we should retire requests first for consistency with
the early return. The order used to be important as the lifecycle for
the object on the active list was determined by request->seqno. However,
the requests themselves are now reference counted removing the
constraint from the order of retirement.

Fixes regression from

commit 1b5a433a4dd967b125131da42b89b5cc0d5b1f57
Author: John Harrison <John.C.Harrison@Intel.com>
Date:   Mon Nov 24 18:49:42 2014 +0000

    drm/i915: Convert 'i915_seqno_passed' calls into 'i915_gem_request_completed
'

and a

WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
WARN_ON(!list_empty(&vm->active_list))

Identified by updating WATCH_LISTS:

[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
WARN_ON(i915_verify_lists(ring->dev))

Note that this is only a problem in evict_vm where the following happens
after a retire_request has cleaned out all requests, but not all active
bo:
- intel_ring_idle called from i915_gpu_idle notices that no requests are
  outstanding and immediately returns.
- i915_gem_retire_requests_ring called from i915_gem_retire_requests also
  immediately returns when there's no request, still leaving the bo on the
  active list.
- evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
  all active objects that there's still stuff left that shouldn't be
  there.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
9 years agoALSA: hda_intel: apply the Seperate stream_tag for Sunrise Point
Libin Yang [Thu, 26 Mar 2015 05:28:39 +0000 (13:28 +0800)]
ALSA: hda_intel: apply the Seperate stream_tag for Sunrise Point

The total stream number of Sunrise Point's input and output stream
exceeds 15, which will cause some streams do not work because
of the overflow on SDxCTL.STRM field if using the legacy
stream tag allocation method.

This patch uses the new stream tag allocation method by add
the flag AZX_DCAPS_SEPARATE_STREAM_TAG for Skylake platform.

Signed-off-by: Libin Yang <libin.yang@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
9 years agoARC: signal handling robustify
Vineet Gupta [Thu, 26 Mar 2015 05:44:41 +0000 (11:14 +0530)]
ARC: signal handling robustify

A malicious signal handler / restorer can DOS the system by fudging the
user regs saved on stack, causing weird things such as sigreturn returning
to user mode PC but cpu state still being kernel mode....

Ensure that in sigreturn path status32 always has U bit; any other bogosity
(gargbage PC etc) will be taken care of by normal user mode exceptions mechanisms.

Reproducer signal handler:

    void handle_sig(int signo, siginfo_t *info, void *context)
    {
ucontext_t *uc = context;
struct user_regs_struct *regs = &(uc->uc_mcontext.regs);

regs->scratch.status32 = 0;
    }

Before the fix, kernel would go off to weeds like below:

    --------->8-----------
    [ARCLinux]$ ./signal-test
    Path: /signal-test
    CPU: 0 PID: 61 Comm: signal-test Not tainted 4.0.0-rc5+ #65
    task: 8f177880 ti: 5ffe6000 task.ti: 8f15c000

    [ECR   ]: 0x00220200 => Invalid Write @ 0x00000010 by insn @ 0x00010698
    [EFA   ]: 0x00000010
    [BLINK ]: 0x2007c1ee
    [ERET  ]: 0x10698
    [STAT32]: 0x00000000 :                                   <--------
    BTA: 0x00010680  SP: 0x5ffe7e48  FP: 0x00000000
    LPS: 0x20003c6c LPE: 0x20003c70 LPC: 0x00000000
    ...
    --------->8-----------

Reported-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
9 years agoARC: SA_SIGINFO ucontext regs off-by-one
Vineet Gupta [Thu, 26 Mar 2015 03:55:44 +0000 (09:25 +0530)]
ARC: SA_SIGINFO ucontext regs off-by-one

The regfile provided to SA_SIGINFO signal handler as ucontext was off by
one due to pt_regs gutter cleanups in 2013.

Before handling signal, user pt_regs are copied onto user_regs_struct and copied
back later. Both structs are binary compatible. This was all fine until
commit 2fa919045b72 (ARC: pt_regs update #2) which removed the empty stack slot
at top of pt_regs (corresponding to first pad) and made the corresponding
fixup in struct user_regs_struct (the pad in there was moved out of
@scratch - not removed altogether as it is part of ptrace ABI)

 struct user_regs_struct {
+       long pad;
        struct {
-               long pad;
                long bta, lp_start, lp_end,....
        } scratch;
 ...
 }

This meant that now user_regs_struct was off by 1 reg w.r.t pt_regs and
signal code needs to user_regs_struct.scratch to reflect it as pt_regs,
which is what this commit does.

This problem was hidden for 2 years, because both save/restore, despite
using wrong location, were using the same location. Only an interim
inspection (reproducer below) exposed the issue.

     void handle_segv(int signo, siginfo_t *info, void *context)
     {
  ucontext_t *uc = context;
struct user_regs_struct *regs = &(uc->uc_mcontext.regs);

printf("regs %x %x\n",               <=== prints 7 8 (vs. 8 9)
               regs->scratch.r8, regs->scratch.r9);
     }

     int main()
     {
struct sigaction sa;

sa.sa_sigaction = handle_segv;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sigaction(SIGSEGV, &sa, NULL);

asm volatile(
"mov r7, 7 \n"
"mov r8, 8 \n"
"mov r9, 9 \n"
"mov r10, 10 \n"
:::"r7","r8","r9","r10");

*((unsigned int*)0x10) = 0;
     }

Fixes: 2fa919045b72ec892e "ARC: pt_regs update #2: Remove unused gutter at start of pt_regs"
CC: <stable@vger.kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
9 years agoMerge tag 'metag-fixes-v4.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhoga...
Linus Torvalds [Wed, 25 Mar 2015 23:52:53 +0000 (16:52 -0700)]
Merge tag 'metag-fixes-v4.0-2' of git://git./linux/kernel/git/jhogan/metag

Pull arch/metag fix from James Hogan:
 "Another metag architecture fix for v4.0

  This is another single fix, for an include dependency problem when
  using ioremap_wc() from asm/io.h without also including asm/pgtable.h"

* tag 'metag-fixes-v4.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
  metag: Fix ioremap_wc/ioremap_cached build errors

9 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Wed, 25 Mar 2015 23:21:17 +0000 (16:21 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: numa: mark huge PTEs young when clearing NUMA hinting faults
  mm: numa: slow PTE scan rate if migration failures occur
  mm: numa: preserve PTE write permissions across a NUMA hinting fault
  mm: numa: group related processes based on VMA flags instead of page table flags
  hfsplus: fix B-tree corruption after insertion at position 0
  MAINTAINERS: add Jan as DMI/SMBIOS support maintainer
  fs/affs/file.c: unlock/release page on error
  mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
  mm/slub: fix lockups on PREEMPT && !SMP kernels
  mm/memory hotplug: postpone the reset of obsolete pgdat
  MAINTAINERS: correct rtc armada38x pattern entry
  mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
  mm: fix anon_vma->degree underflow in anon_vma endless growing prevention
  drivers/rtc/rtc-mrst: fix suspend/resume
  aoe: update aoe maintainer information

9 years agoMerge tag 'signed-for-4.0' of git://github.com/agraf/linux-2.6
Marcelo Tosatti [Wed, 25 Mar 2015 23:20:31 +0000 (20:20 -0300)]
Merge tag 'signed-for-4.0' of git://github.com/agraf/linux-2.6

Patch queue for 4.0 - 2015-03-25

A few bug fixes for Book3S HV KVM:

  - Fix spinlock ordering
  - Fix idle guests on LE hosts
  - Fix instruction emulation

9 years agomm: numa: mark huge PTEs young when clearing NUMA hinting faults
Mel Gorman [Wed, 25 Mar 2015 22:55:45 +0000 (15:55 -0700)]
mm: numa: mark huge PTEs young when clearing NUMA hinting faults

Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.

Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced.  This
could be addressed but beyond the scope of this series which is aimed at
Dave Chinners shrink workload that is unlikely to be affected by this
issue.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: numa: slow PTE scan rate if migration failures occur
Mel Gorman [Wed, 25 Mar 2015 22:55:42 +0000 (15:55 -0700)]
mm: numa: slow PTE scan rate if migration failures occur

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: numa: preserve PTE write permissions across a NUMA hinting fault
Mel Gorman [Wed, 25 Mar 2015 22:55:40 +0000 (15:55 -0700)]
mm: numa: preserve PTE write permissions across a NUMA hinting fault

Protecting a PTE to trap a NUMA hinting fault clears the writable bit
and further faults are needed after trapping a NUMA hinting fault to set
the writable bit again.  This patch preserves the writable bit when
trapping NUMA hinting faults.  The impact is obvious from the number of
minor faults trapped during the basis balancing benchmark and the system
CPU usage;

  autonumabench
                                             4.0.0-rc4             4.0.0-rc4
                                              baseline              preserve
  Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
  Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
  Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
  Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
  Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
  Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
  Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
  Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)

               4.0.0-rc4   4.0.0-rc4
                baseline    preserve
  User          44383.95    43971.89
  System          252.61      201.24
  Elapsed         998.68     1000.94

  Minor Faults   2597249     1981230
  Major Faults       365         364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
workload

                                      4.0.0-rc4             4.0.0-rc4
                                       baseline              preserve
  Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
  Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse.  The tidest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse.  It's not generally
safe to just mark the page writable during the fault if it's a write
fault as it may have been read-only for COW so that approach was
discarded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: numa: group related processes based on VMA flags instead of page table flags
Mel Gorman [Wed, 25 Mar 2015 22:55:37 +0000 (15:55 -0700)]
mm: numa: group related processes based on VMA flags instead of page table flags

These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd").  It was known that the
performance in 3.19 was still better even if is far less safe.  This
series aims to restore the performance without compromising on safety.

For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top

  autonumabench
                                                3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                               vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
  Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
  Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
  Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
  Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
  Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
  Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
  Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)

                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
               vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  User        46334.60    46391.94    44383.95    43971.89    44372.12
  System        252.84      284.66      252.61      201.24      249.00
  Elapsed      1062.14     1050.96      998.68     1000.94     1026.78

Overall the system CPU usage is comparable and the test is naturally a
bit variable.  The slowing of the scanner hurts numa01 but on this
machine it is an adverse workload and patches that dramatically help it
often hurt absolutely everything else.

Due to patch 2, the fault activity is interesting

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                   2097811     2656646     2597249     1981230     1636841
  Major Faults                       362         450         365         364         365

Note the impact preserving the write bit across protection updates and
fault reduces faults.

  NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
  NUMA alloc miss                      0           0           0           0           0
  NUMA interleave hit                  0           0           0           0           0
  NUMA alloc local               1228514     1216317     1190871     1177448     1199021
  NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
  NUMA huge PMD updates           479530      468448      464868      477573      224487
  NUMA page range updates      491225557   479886983   476207932   489222218   229950144
  NUMA hint faults                659753      656503      641678      656926      294842
  NUMA hint local faults          381604      373963      360478      337585      186249
  NUMA hint local percent             57          56          56          51          63
  NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
  AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner on migratrion failures is
obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.

As xfsrepair was the reported workload here is the impact of the series
on it.

  xfsrepair
                                         3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                        vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
  Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
  Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
  Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
  Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
  Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
  Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
  Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
  Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
  Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
  Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
  Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
  CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
  CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
  CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
  CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
  Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
  Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
  Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
  Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)

The really relevant lines as syst-xfsrepair which is the system CPU
usage when running xfsrepair.  Note that on my machine the overhead was
45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
preserve the write bit across faults, it's only 2.51% higher on average.
With the full series applied, system CPU usage is 24.6% lower on
average.

Again, the impact of preserving the write bit on minor faults is obvious
and the impact of slowing scanning after migration failures is obvious
on the PTE updates.  Note also that the number of pages migrated is much
reduced even though the headline performance is comparable.

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                 153466827   254507978   249163829   153501373   105737890
  Major Faults                       610         702         690         649         724
  NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
  NUMA huge PMD updates           129294       85044      106921      127246       79887
  NUMA pages migrated           21938995    29705270    28594162    22687324    16258075

                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
  Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
  Mean sdb-await         25.92        5.09        5.33        5.02        5.22
  Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
  Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
  Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
  Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
  Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
  Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
  Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
  Max  sdb-await        168.24       39.28       49.22       44.64       65.62
  Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
  Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
  Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
  Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
  Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15

FWIW, I also checked SPECjbb in different configurations but it's
similar observations -- minor faults lower, PTE update activity lower
and performance is roughly comparable against 3.19.

This patch (of 3):

Threads that share writable data within pages are grouped together as
related tasks.  This decision is based on whether the PTE is marked
dirty which is subject to timing races between the PTE scanner update
and when the application writes the page.  If the page is file-backed,
then background flushes and sync also affect placement.  This is
unpredictable behaviour which is impossible to reason about so this
patch makes grouping decisions based on the VMA flags.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agohfsplus: fix B-tree corruption after insertion at position 0
Sergei Antonov [Wed, 25 Mar 2015 22:55:34 +0000 (15:55 -0700)]
hfsplus: fix B-tree corruption after insertion at position 0

Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().  In this case a hfs_brec_update_parent() is
called to update the parent index node (if exists) and it is passed
hfs_find_data with a search_key containing a newly inserted key instead
of the key to be updated.  This results in an inconsistent index node.
The bug reproduces on my machine after an extents overflow record for
the catalog file (CNID=4) is inserted into the extents overflow B-tree.
Because of a low (reserved) value of CNID=4, it has to become the first
record in the first leaf node.

The resulting first leaf node is correct:

  ----------------------------------------------------
  | key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... |
  ----------------------------------------------------

But the parent index key0 still contains the previous key CNID=123:

  -----------------------
  | key0.CNID=123 | ... |
  -----------------------

A change in hfs_brec_insert() makes hfs_brec_update_parent() work
correctly by preventing it from getting fd->record=-1 value from
__hfs_brec_find().

Along the way, I removed duplicate code with unification of the if
condition.  The resulting code is equivalent to the original code
because node is never 0.

Also hfs_brec_update_parent() will now return an error after getting a
negative fd->record value.  However, the return value of
hfs_brec_update_parent() is not checked anywhere in the file and I'm
leaving it unchanged by this patch.  brec.c lacks error checking after
some other calls too, but this issue is of less importance than the one
being fixed by this patch.

Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoMAINTAINERS: add Jan as DMI/SMBIOS support maintainer
Jean Delvare [Wed, 25 Mar 2015 22:55:31 +0000 (15:55 -0700)]
MAINTAINERS: add Jan as DMI/SMBIOS support maintainer

I am familiar with these drivers and I care about them so let me add
myself as their maintainer.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agofs/affs/file.c: unlock/release page on error
Taesoo Kim [Wed, 25 Mar 2015 22:55:29 +0000 (15:55 -0700)]
fs/affs/file.c: unlock/release page on error

When affs_bread_ino() fails, correctly unlock the page and release the
page cache with proper error value.  All write_end() should
unlock/release the page that was locked by write_beg().

Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
Laura Abbott [Wed, 25 Mar 2015 22:55:26 +0000 (15:55 -0700)]
mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate

Commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on
isolated pageblock") changed the logic of unset_migratetype_isolate to
check the buddy allocator and explicitly call __free_pages to merge.

The page that is being freed in this path never had prep_new_page called
so set_page_refcounted is called explicitly but there is no call to
kernel_map_pages.  With the default kernel_map_pages this is mostly
harmless but if kernel_map_pages does any manipulation of the page
tables (unmapping or setting pages to read only) this may trigger a
fault:

    alloc_contig_range test_pages_isolated(ceb00, ced00) failed
    Unable to handle kernel paging request at virtual address ffffffc0cec00000
    pgd = ffffffc045fc4000
    [ffffffc0cec00000] *pgd=0000000000000000
    Internal error: Oops: 9600004f [#1] PREEMPT SMP
    Modules linked in: exfatfs
    CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
    task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
    PC is at memset+0xc8/0x1c0
    LR is at kernel_map_pages+0x1ec/0x244

Fix this by calling kernel_map_pages to ensure the page is set in the
page table properly

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/slub: fix lockups on PREEMPT && !SMP kernels
Mark Rutland [Wed, 25 Mar 2015 22:55:23 +0000 (15:55 -0700)]
mm/slub: fix lockups on PREEMPT && !SMP kernels

Commit 9aabf810a67c ("mm/slub: optimize alloc/free fastpath by removing
preemption on/off") introduced an occasional hang for kernels built with
CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload).  On arm64 the generated assembly
looks like:

         ldr     x4, [x0,#8]
  loop:
         ldr     x1, [x0,#8]
         cmp     x1, x4
         b.ne    loop

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations/frees on the same CPU.

This patch changes the loop condition to access c->tid with READ_ONCE.
This ensures that the value is reloaded even when the compiler would
otherwise assume it could cache the value, and also ensures that the
load will not be torn.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/memory hotplug: postpone the reset of obsolete pgdat
Gu Zheng [Wed, 25 Mar 2015 22:55:20 +0000 (15:55 -0700)]
mm/memory hotplug: postpone the reset of obsolete pgdat

Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A: offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
offline cpu and memory, then try_offline_node()
node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat->node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoMAINTAINERS: correct rtc armada38x pattern entry
Joe Perches [Wed, 25 Mar 2015 22:55:17 +0000 (15:55 -0700)]
MAINTAINERS: correct rtc armada38x pattern entry

Commit c6a95dbee793 ("MAINTAINERS: add the RTC driver for the
Armada38x") typoed the pattern, fix it.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>