Andy Adamson [Tue, 4 Mar 2014 17:31:09 +0000 (12:31 -0500)]
NFSv4.1 Fail data server I/O if stateid represents a lost lock
Signed-off-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 4 Mar 2014 18:12:03 +0000 (13:12 -0500)]
NFSv4: Fix the return value of nfs4_select_rw_stateid
In commit
5521abfdcf4d6 (NFSv4: Resend the READ/WRITE RPC call
if a stateid change causes an error), we overloaded the return value of
nfs4_select_rw_stateid() to cause it to return -EWOULDBLOCK if an RPC
call is outstanding that would cause the NFSv4 lock or open stateid
to change.
That is all redundant when we actually copy the stateid used in the
read/write RPC call that failed, and check that against the current
stateid. It is doubly so, when we consider that in the NFSv4.1 case,
we also set the stateid's seqid to the special value '0', which means
'match the current valid stateid'.
Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 5 Mar 2014 13:44:23 +0000 (08:44 -0500)]
NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid
When nfs4_set_rw_stateid() can fails by returning EIO to indicate that
the stateid is completely invalid, then it makes no sense to have it
trigger a retry of the READ or WRITE operation. Instead, we should just
have it fall through and attempt a recovery.
This fixes an infinite loop in which the client keeps replaying the same
bad stateid back to the server.
Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Mon, 3 Mar 2014 03:03:12 +0000 (22:03 -0500)]
NFS: Fix a delegation callback race
The clean-up in commit
36281caa839f ended up removing a NULL pointer check
that is needed in order to prevent an Oops in
nfs_async_inode_return_delegation().
Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/5313E9F6.2020405@intel.com
Fixes:
36281caa839f (NFSv4: Further clean-ups of delegation stateid validation)
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 26 Feb 2014 19:19:14 +0000 (11:19 -0800)]
NFSv4: Fix another nfs4_sequence corruptor
nfs4_release_lockowner needs to set the rpc_message reply to point to
the nfs4_sequence_res in order to avoid another Oopsable situation
in nfs41_assign_slot.
Fixes:
fbd4bfd1d9d21 (NFS: Add nfs4_sequence calls for RELEASE_LOCKOWNER)
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Andy Adamson [Tue, 18 Feb 2014 15:36:05 +0000 (10:36 -0500)]
NFS fix error return in nfs4_select_rw_stateid
Do not return an error when nfs4_copy_delegation_stateid succeeds.
Signed-off-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1392737765-41942-1-git-send-email-andros@netapp.com
Fixes:
ef1820f9be27b (NFSv4: Don't try to recover NFSv4 locks when...)
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Mon, 17 Feb 2014 02:42:56 +0000 (21:42 -0500)]
NFSv4: Use the correct net namespace in nfs4_update_server
We need to use the same net namespace that was used to resolve
the hostname and sockaddr arguments.
Fixes:
32e62b7c3ef09 (NFS: Add nfs4_update_server)
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 16 Feb 2014 18:28:01 +0000 (13:28 -0500)]
SUNRPC: Fix a pipe_version reference leak
In gss_alloc_msg(), if the call to gss_encode_v1_msg() fails, we
want to release the reference to the pipe_version that was obtained
earlier in the function.
Fixes:
9d3a2260f0f4b (SUNRPC: Fix buffer overflow checking in...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 16 Feb 2014 17:14:13 +0000 (12:14 -0500)]
SUNRPC: Ensure that gss_auth isn't freed before its upcall messages
Fix a race in which the RPC client is shutting down while the
gss daemon is processing a downcall. If the RPC client manages to
shut down before the gss daemon is done, then the struct gss_auth
used in gss_release_msg() may have already been freed.
Link: http://lkml.kernel.org/r/1392494917.71728.YahooMailNeo@web140002.mail.bf1.yahoo.com
Reported-by: John <da_audiophile@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 11 Feb 2014 18:56:54 +0000 (13:56 -0500)]
SUNRPC: Fix potential memory scribble in xprt_free_bc_request()
The call to xprt_free_allocation() will call list_del() on
req->rq_bc_pa_list, which is not attached to a list.
This patch moves the list_del() out of xprt_free_allocation()
and into those callers that need it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 11 Feb 2014 14:15:54 +0000 (09:15 -0500)]
SUNRPC: Fix races in xs_nospace()
When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.
Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown
Fixes:
a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Mon, 10 Feb 2014 21:28:52 +0000 (16:28 -0500)]
SUNRPC: Don't create a gss auth cache unless rpc.gssd is running
An infinite loop is caused when nfs4_establish_lease() fails
with -EACCES. This causes nfs4_handle_reclaim_lease_error()
to sleep a bit and resets the NFS4CLNT_LEASE_EXPIRED bit.
This in turn causes nfs4_state_manager() to try and
reestablished the lease, again, again, again...
The problem is a valid RPCSEC_GSS client is being created when
rpc.gssd is not running.
Link: http://lkml.kernel.org/r/1392066375-16502-1-git-send-email-steved@redhat.com
Fixes:
0ea9de0ea6a4 (sunrpc: turn warn_gssd() log message into a dprintk())
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 6 Feb 2014 19:38:53 +0000 (14:38 -0500)]
NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFS
Commit
aa9c2669626c (NFS: Client implementation of Labeled-NFS) introduces
a performance regression. When nfs_zap_caches_locked is called, it sets
the NFS_INO_INVALID_LABEL flag irrespectively of whether or not the
NFS server supports security labels. Since that flag is never cleared,
it means that all calls to nfs_revalidate_inode() will now trigger
an on-the-wire GETATTR call.
This patch ensures that we never set the NFS_INO_INVALID_LABEL unless the
server advertises support for labeled NFS.
It also causes nfs_setsecurity() to clear NFS_INO_INVALID_LABEL when it
has successfully set the security label for the inode.
Finally it gets rid of the NFS_INO_INVALID_LABEL cruft from nfs_update_inode,
which has nothing to do with labeled NFS.
Reported-by: Neil Brown <neilb@suse.de>
Cc: stable@vger.kernel.org # 3.11+
Tested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Linus Torvalds [Mon, 10 Feb 2014 02:15:47 +0000 (18:15 -0800)]
Linux 3.14-rc2
Linus Torvalds [Mon, 10 Feb 2014 02:14:53 +0000 (18:14 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security
Pull SELinux fixes from James Morris.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
SELinux: Fix kernel BUG on empty security contexts.
selinux: add SOCK_DIAG_BY_FAMILY to the list of netlink message types
Linus Torvalds [Mon, 10 Feb 2014 02:12:07 +0000 (18:12 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes, both -stable fodder. The O_SYNC bug is fairly
old..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a kmap leak in virtio_console
fix O_SYNC|O_APPEND syncing the wrong range on write()
James Morris [Mon, 10 Feb 2014 00:48:21 +0000 (11:48 +1100)]
Merge branch 'stable-3.14' of git://git.infradead.org/users/pcmoore/selinux into for-linus
Al Viro [Sun, 2 Feb 2014 12:05:05 +0000 (07:05 -0500)]
fix a kmap leak in virtio_console
While we are at it, don't do kmap() under kmap_atomic(), *especially*
for a page we'd allocated with GFP_KERNEL. It's spelled "page_address",
and had that been more than that, we'd have a real trouble - kmap_high()
can block, and doing that while holding kmap_atomic() is a Bad Idea(tm).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 9 Feb 2014 20:18:09 +0000 (15:18 -0500)]
fix O_SYNC|O_APPEND syncing the wrong range on write()
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
pos_before_write .. pos_before_write + written - 1
instead. Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().
All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().
The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sun, 9 Feb 2014 19:12:26 +0000 (11:12 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix data corruption when reading/updating compressed extents
Btrfs: don't loop forever if we can't run because of the tree mod log
btrfs: reserve no transaction units in btrfs_ioctl_set_features
btrfs: commit transaction after setting label and features
Btrfs: fix assert screwup for the pending move stuff
Linus Torvalds [Sun, 9 Feb 2014 18:09:49 +0000 (10:09 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Tooling fixes, mostly related to the KASLR fallout, but also other
fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf buildid-cache: Check relocation when checking for existing kcore
perf tools: Adjust kallsyms for relocated kernel
perf tests: No need to set up ref_reloc_sym
perf symbols: Prevent the use of kcore if the kernel has moved
perf record: Get ref_reloc_sym from kernel map
perf machine: Set up ref_reloc_sym in machine__create_kernel_maps()
perf machine: Add machine__get_kallsyms_filename()
perf tools: Add kallsyms__get_function_start()
perf symbols: Fix symbol annotation for relocated kernel
perf tools: Fix include for non x86 architectures
perf tools: Fix AAAAARGH64 memory barriers
perf tools: Demangle kernel and kernel module symbols too
perf/doc: Remove mention of non-existent set_perf_event_pending() from design.txt
Filipe David Borba Manana [Sat, 8 Feb 2014 15:47:46 +0000 (15:47 +0000)]
Btrfs: fix data corruption when reading/updating compressed extents
When using a mix of compressed file extents and prealloc extents, it
is possible to fill a page of a file with random, garbage data from
some unrelated previous use of the page, instead of a sequence of zeroes.
A simple sequence of steps to get into such case, taken from the test
case I made for xfstests, is:
_scratch_mkfs
_scratch_mount "-o compress-force=lzo"
$XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
This results in the following file items in the fs tree:
item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
inode generation 6 transid 6 size 542872 block group 0 mode 100600
item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
inode ref index 2 namelen 6 name: foobar
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte 0 nr 0 gen 6
extent data offset 0 nr 24576 ram 266240
extent compression 0
item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
prealloc data disk byte
12849152 nr 241664 gen 6
prealloc data offset 0 nr 241664
item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
extent data disk byte
12845056 nr 4096 gen 6
extent data offset 0 nr 20480 ram 20480
extent compression 2
item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
prealloc data disk byte
13090816 nr 405504 gen 6
prealloc data offset 0 nr 258048
The on disk extent at offset 266240 (which corresponds to 1 single disk block),
contains 5 compressed chunks of file data. Each of the first 4 compress 4096
bytes of file data, while the last one only compresses 3024 bytes of file data.
Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
1072 bytes) should always return zeroes (our next extent is a prealloc one).
The solution here is the compression code path to zero the remaining (untouched)
bytes of the last page it uncompressed data into, as the information about how
much space the file data consumes in the last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
the remainder of the page but only if it corresponds to the last page of the inode
and if the inode's size is not a multiple of the page size.
This would cause not only returning random data on reads, but also permanently
storing random data when updating parts of the region that should be zeroed.
For the example above, it means updating a single byte in the region [285648 ; 286720[
would store that byte correctly but also store random data on disk.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 7 Feb 2014 18:57:59 +0000 (13:57 -0500)]
Btrfs: don't loop forever if we can't run because of the tree mod log
A user reported a 100% cpu hang with my new delayed ref code. Turns out I
forgot to increase the count check when we can't run a delayed ref because of
the tree mod log. If we can't run any delayed refs during this there is no
point in continuing to look, and we need to break out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Fri, 7 Feb 2014 13:34:04 +0000 (14:34 +0100)]
btrfs: reserve no transaction units in btrfs_ioctl_set_features
Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Jeff Mahoney [Fri, 7 Feb 2014 13:33:57 +0000 (14:33 +0100)]
btrfs: commit transaction after setting label and features
The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 5 Feb 2014 21:19:21 +0000 (16:19 -0500)]
Btrfs: fix assert screwup for the pending move stuff
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't
reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
which meant that a key part of Filipe's original patch was not being built in.
This appears to be a mess up with merging Filipe's patch as it does not exist in
his original patch. Fix this by changing how we make sure del_waiting_dir_move
asserts that it did not error and take the function out of the ifdef check.
This makes btrfs/030 pass with the assert on or off. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Linus Torvalds [Sat, 8 Feb 2014 22:31:39 +0000 (14:31 -0800)]
Merge tag 'pinctrl-v3.14-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pinctrl fixes from Linus Walleij:
"First round of pin control fixes for v3.14:
- Protect pinctrl_list_add() with the proper mutex. This was
identified by RedHat. Caused nasty locking warnings was rootcased
by Stanislaw Gruszka.
- Avoid adding dangerous debugfs files when either half of the
subsystem is unused: pinmux or pinconf.
- Various fixes to various drivers: locking, hardware particulars, DT
parsing, error codes"
* tag 'pinctrl-v3.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: tegra: return correct error type
pinctrl: do not init debugfs entries for unimplemented functionalities
pinctrl: protect pinctrl_list add
pinctrl: sirf: correct the pin index of ac97_pins group
pinctrl: imx27: fix offset calculation in imx_read_2bit
pinctrl: vt8500: Change devicetree data parsing
pinctrl: imx27: fix wrong offset to ICONFB
pinctrl: at91: use locked variant of irq_set_handler
Linus Torvalds [Sat, 8 Feb 2014 20:08:48 +0000 (12:08 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"Add a missing Kconfig dependency"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Generic irq chip requires IRQ_DOMAIN
Linus Torvalds [Sat, 8 Feb 2014 19:54:43 +0000 (11:54 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"Quite a varied little collection of fixes. Most of them are
relatively small or isolated; the biggest one is Mel Gorman's fixes
for TLB range flushing.
A couple of AMD-related fixes (including not crashing when given an
invalid microcode image) and fix a crash when compiled with gcov"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, AMD: Unify valid container checks
x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
x86/efi: Allow mapping BGRT on x86-32
x86: Fix the initialization of physnode_map
x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
x86/intel/mid: Fix X86_INTEL_MID dependencies
arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
x86: mm: change tlb_flushall_shift for IvyBridge
x86/mm: Eliminate redundant page table walk during TLB range flushing
x86/mm: Clean up inconsistencies when flushing TLB ranges
mm, x86: Account for TLB flushes only when debugging
x86/AMD/NB: Fix amd_set_subcaches() parameter type
x86/quirks: Add workaround for AMD F16h Erratum792
x86, doc, kconfig: Fix dud URL for Microcode data
Linus Torvalds [Sat, 8 Feb 2014 18:13:47 +0000 (10:13 -0800)]
Merge tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy
Pull jfs fix from David Kleikamp:
"Fix regression"
* tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy:
jfs: fix generic posix ACL regression
Dave Kleikamp [Fri, 7 Feb 2014 20:36:10 +0000 (14:36 -0600)]
jfs: fix generic posix ACL regression
I missed a couple errors in reviewing the patches converting jfs
to use the generic posix ACL function. Setting ACL's currently
fails with -EOPNOTSUPP.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Richard Weinberger [Fri, 31 Jan 2014 12:47:34 +0000 (13:47 +0100)]
watchdog: dw_wdt: Add dependency on HAS_IOMEM
On archs like S390 or um this driver cannot build nor work.
Make it depend on HAS_IOMEM to bypass build failures.
drivers/built-in.o: In function `dw_wdt_drv_probe':
drivers/watchdog/dw_wdt.c:302: undefined reference to `devm_ioremap_resource'
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
Linus Torvalds [Fri, 7 Feb 2014 22:17:18 +0000 (14:17 -0800)]
Merge tag 'driver-core-3.14-rc2' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here is a single kernfs fix to resolve a much-reported lockdep issue
with the removal of entries in sysfs"
* tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
Linus Torvalds [Fri, 7 Feb 2014 20:35:56 +0000 (12:35 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
"There is an RBD fix for a crash due to the immutable bio changes, an
error path fix, and a locking fix in the recent redirect support"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: do not dereference a NULL bio pointer
libceph: take map_sem for read in handle_reply()
libceph: factor out logic from ceph_osdc_start_request()
libceph: fix error handling in ceph_osdc_init()
Linus Torvalds [Fri, 7 Feb 2014 20:19:50 +0000 (12:19 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Relax VDSO alignment requirements so that the kernel-picked one (4K)
does not conflict with the dynamic linker's one (64K)
- VDSO gettimeofday fix
- Barrier fixes for atomic operations and cache flushing
- TLB invalidation when overriding early page mappings during boot
- Wired up new 32-bit arm (compat) syscalls
- LSM_MMAP_MIN_ADDR when COMPAT is enabled
- defconfig update
- Clean-up (comments, pgd_alloc).
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: defconfig: Expand default enabled features
arm64: asm: remove redundant "cc" clobbers
arm64: atomics: fix use of acquire + release for full barrier semantics
arm64: barriers: allow dsb macro to take option parameter
security: select correct default LSM_MMAP_MIN_ADDR on arm on arm64
arm64: compat: Wire up new AArch32 syscalls
arm64: vdso: update wtm fields for CLOCK_MONOTONIC_COARSE
arm64: vdso: fix coarse clock handling
arm64: simplify pgd_alloc
arm64: fix typo: s/SERRROR/SERROR/
arm64: Invalidate the TLB when replacing pmd entries during boot
arm64: Align CMA sizes to PAGE_SIZE
arm64: add DSB after icache flush in __flush_icache_all()
arm64: vdso: prevent ld from aligning PT_LOAD segments to 64k
Linus Torvalds [Fri, 7 Feb 2014 20:19:06 +0000 (12:19 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
"hree minor patches. All have sat in -next for a few days"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: fpu.h: Fix build when CONFIG_BUG is not set
MIPS: Wire up sched_setattr/sched_getattr syscalls
MIPS: Alchemy: Fix DB1100 GPIO registration
Linus Torvalds [Fri, 7 Feb 2014 20:16:36 +0000 (12:16 -0800)]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A series of small fixes. Mostly driver ones. There is one core
regression fix on a patch that was meant to fix some race issues on
vb2, but that actually caused more harm than good. So, we're just
reverting it for now"
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] adv7842: Composite free-run platfrom-data fix
[media] v4l2-dv-timings: fix GTF calculation
[media] hdpvr: Fix memory leak in debug
[media] af9035: add ID [2040:f900] Hauppauge WinTV-MiniStick 2
[media] mxl111sf: Fix compile when CONFIG_DVB_USB_MXL111SF is unset
[media] mxl111sf: Fix unintentional garbage stack read
[media] cx24117: use a valid dev pointer for dev_err printout
[media] cx24117: remove dead code in always 'false' if statement
[media] update Michael Krufky's email address
[media] vb2: Check if there are buffers before streamon
[media] Revert "[media] videobuf_vm_{open,close} race fixes"
[media] go7007-loader: fix usb_dev leak
[media] media: bt8xx: add missing put_device call
[media] exynos4-is: Compile in fimc-lite runtime PM callbacks conditionally
[media] exynos4-is: Compile in fimc runtime PM callbacks conditionally
[media] exynos4-is: Fix error paths in probe() for !pm_runtime_enabled()
[media] s5p-jpeg: Fix wrong NV12 format parameters
[media] s5k5baf: allow to handle arbitrary long i2c sequences
Linus Torvalds [Fri, 7 Feb 2014 20:14:24 +0000 (12:14 -0800)]
Merge tag 'hwmon-for-linus' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Fix PMBus driver problem with some multi-page voltage sensors and fix
da9055 interrupt initialization"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (da9055) Remove use of regmap_irq_get_virq()
hwmon: (pmbus) Support per-page exponent in linear mode
Linus Torvalds [Fri, 7 Feb 2014 20:12:21 +0000 (12:12 -0800)]
Merge tag 'pm+acpi-3.14-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These include a fix for a recent ACPI hotplug regression, four
concurrency related fixes and one PCI device removal fix for
ACPI-based PCI hotplug (ACPIPHP), intel_pstate fix that should go into
stable, three simple ACPI cleanups and a new entry for the ACPI video
blacklist.
Specifics:
- Fix for a recent ACPI hotplug regression causing a NULL pointer
dereference to occur while handling ACPI eject notifications for
already ejected devices. From Toshi Kani.
- Four concurrency-related fixes for ACPIPHP. Two of them add
missing locking and the other two fix race conditions related to
reference counting.
- ACPIPHP fix to avoid NULL pointer dereferences during device
removal involving Virtual Funcions.
- intel_pstate fix to make it compute the percentage of time the CPU
is busy properly. From Dirk Brandewie.
- Removal of two unnecessary NULL pointer checks in ACPI code and a
fix for sscanf() format string from Dan Carpenter and Luis G.F.
- New ACPI video blacklist entry for HP EliteBook Revolve 810 from
Mika Westerberg"
* tag 'pm+acpi-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / hotplug: Fix panic on eject to ejected device
ACPI / battery: Fix incorrect sscanf() string in acpi_battery_init_alarm()
ACPI / proc: remove unneeded NULL check
ACPI / utils: remove a pointless NULL check
ACPI / video: Add HP EliteBook Revolve 810 to the blacklist
intel_pstate: Take core C0 time into account for core busy calculation
ACPI / hotplug / PCI: Fix bridge removal race vs dock events
ACPI / hotplug / PCI: Fix bridge removal race in handle_hotplug_event()
ACPI / hotplug / PCI: Scan root bus under the PCI rescan-remove lock
ACPI / hotplug / PCI: Move PCI rescan-remove locking to hotplug_event()
ACPI / hotplug / PCI: Remove entries from bus->devices in reverse order
Ilya Dryomov [Wed, 5 Feb 2014 13:19:55 +0000 (15:19 +0200)]
libceph: do not dereference a NULL bio pointer
Commit
f38a5181d9f3 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1. Fix it.
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
H. Peter Anvin [Fri, 7 Feb 2014 19:27:30 +0000 (11:27 -0800)]
Merge tag 'efi-urgent' into x86/urgent
* Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit).
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Ilya Dryomov [Mon, 3 Feb 2014 11:56:33 +0000 (13:56 +0200)]
libceph: take map_sem for read in handle_reply()
Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case. (Lock ordering is:
map_sem, request_mutex, crush_mutex.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Fri, 31 Jan 2014 17:33:39 +0000 (19:33 +0200)]
libceph: factor out logic from ceph_osdc_start_request()
Factor out logic from ceph_osdc_start_request() into a new helper,
__ceph_osdc_start_request(). ceph_osdc_start_request() now amounts to
taking locks and calling __ceph_osdc_start_request().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Mark Rutland [Fri, 7 Feb 2014 17:12:45 +0000 (17:12 +0000)]
arm64: defconfig: Expand default enabled features
FPGA implementations of the Cortex-A57 and Cortex-A53 are now available
in the form of the SMM-A57 and SMM-A53 Soft Macrocell Models (SMMs) for
Versatile Express. As these attach to a Motherboard Express V2M-P1 it
would be useful to have support for some V2M-P1 peripherals enabled by
default.
Additionally a couple of of features have been introduced since the last
defconfig update (CMA, jump labels) that would be good to have enabled
by default to ensure they are build and boot tested.
This patch updates the arm64 defconfig to enable support for these
devices and features. The arm64 Kconfig is modified to select
HAVE_PATA_PLATFORM, which is required to enable support for the
CompactFlash controller on the V2M-P1.
A few options which don't need to appear in defconfig are trimmed:
* BLK_DEV - selected by default
* EXPERIMENTAL - otherwise gone from the kernel
* MII - selected by drivers which require it
* USB_SUPPORT - selected by default
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Will Deacon [Tue, 4 Feb 2014 12:29:13 +0000 (12:29 +0000)]
arm64: asm: remove redundant "cc" clobbers
cbnz/tbnz don't update the condition flags, so remove the "cc" clobbers
from inline asm blocks that only use these instructions to implement
conditional branches.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Will Deacon [Tue, 4 Feb 2014 12:29:12 +0000 (12:29 +0000)]
arm64: atomics: fix use of acquire + release for full barrier semantics
Linux requires a number of atomic operations to provide full barrier
semantics, that is no memory accesses after the operation can be
observed before any accesses up to and including the operation in
program order.
On arm64, these operations have been incorrectly implemented as follows:
// A, B, C are independent memory locations
<Access [A]>
// atomic_op (B)
1: ldaxr x0, [B] // Exclusive load with acquire
<op(B)>
stlxr w1, x0, [B] // Exclusive store with release
cbnz w1, 1b
<Access [C]>
The assumption here being that two half barriers are equivalent to a
full barrier, so the only permitted ordering would be A -> B -> C
(where B is the atomic operation involving both a load and a store).
Unfortunately, this is not the case by the letter of the architecture
and, in fact, the accesses to A and C are permitted to pass their
nearest half barrier resulting in orderings such as Bl -> A -> C -> Bs
or Bl -> C -> A -> Bs (where Bl is the load-acquire on B and Bs is the
store-release on B). This is a clear violation of the full barrier
requirement.
The simple way to fix this is to implement the same algorithm as ARMv7
using explicit barriers:
<Access [A]>
// atomic_op (B)
dmb ish // Full barrier
1: ldxr x0, [B] // Exclusive load
<op(B)>
stxr w1, x0, [B] // Exclusive store
cbnz w1, 1b
dmb ish // Full barrier
<Access [C]>
but this has the undesirable effect of introducing *two* full barrier
instructions. A better approach is actually the following, non-intuitive
sequence:
<Access [A]>
// atomic_op (B)
1: ldxr x0, [B] // Exclusive load
<op(B)>
stlxr w1, x0, [B] // Exclusive store with release
cbnz w1, 1b
dmb ish // Full barrier
<Access [C]>
The simple observations here are:
- The dmb ensures that no subsequent accesses (e.g. the access to C)
can enter or pass the atomic sequence.
- The dmb also ensures that no prior accesses (e.g. the access to A)
can pass the atomic sequence.
- Therefore, no prior access can pass a subsequent access, or
vice-versa (i.e. A is strictly ordered before C).
- The stlxr ensures that no prior access can pass the store component
of the atomic operation.
The only tricky part remaining is the ordering between the ldxr and the
access to A, since the absence of the first dmb means that we're now
permitting re-ordering between the ldxr and any prior accesses.
From an (arbitrary) observer's point of view, there are two scenarios:
1. We have observed the ldxr. This means that if we perform a store to
[B], the ldxr will still return older data. If we can observe the
ldxr, then we can potentially observe the permitted re-ordering
with the access to A, which is clearly an issue when compared to
the dmb variant of the code. Thankfully, the exclusive monitor will
save us here since it will be cleared as a result of the store and
the ldxr will retry. Notice that any use of a later memory
observation to imply observation of the ldxr will also imply
observation of the access to A, since the stlxr/dmb ensure strict
ordering.
2. We have not observed the ldxr. This means we can perform a store
and influence the later ldxr. However, that doesn't actually tell
us anything about the access to [A], so we've not lost anything
here either when compared to the dmb variant.
This patch implements this solution for our barriered atomic operations,
ensuring that we satisfy the full barrier requirements where they are
needed.
Cc: <stable@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Adam Thomson [Thu, 6 Feb 2014 18:03:17 +0000 (18:03 +0000)]
hwmon: (da9055) Remove use of regmap_irq_get_virq()
Remove use of regmap_irq_get_virq() in driver probe which was
conflicting with use of platform_get_irq_byname().
platform_get_irq_byname() already returns the VIRQ number due
to MFD core translation so using regmap_irq_get_virq() on that
returned value results in an incorrect IRQ being requested.
The driver probes then fail because of this.
Signed-off-by: Adam Thomson <Adam.Thomson.Opensource@diasemi.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Rafael J. Wysocki [Thu, 6 Feb 2014 22:08:54 +0000 (23:08 +0100)]
Merge branches 'acpi-cleanup' and 'acpi-video'
* acpi-cleanup:
ACPI / battery: Fix incorrect sscanf() string in acpi_battery_init_alarm()
ACPI / proc: remove unneeded NULL check
ACPI / utils: remove a pointless NULL check
* acpi-video:
ACPI / video: Add HP EliteBook Revolve 810 to the blacklist
Rafael J. Wysocki [Thu, 6 Feb 2014 22:08:27 +0000 (23:08 +0100)]
Merge branch 'pm-cpufreq'
* pm-cpufreq:
intel_pstate: Take core C0 time into account for core busy calculation
Rafael J. Wysocki [Thu, 6 Feb 2014 22:07:55 +0000 (23:07 +0100)]
Merge branches 'acpi-pci-hotplug' and 'acpi-hotplug'
* acpi-pci-hotplug:
ACPI / hotplug / PCI: Fix bridge removal race vs dock events
ACPI / hotplug / PCI: Fix bridge removal race in handle_hotplug_event()
ACPI / hotplug / PCI: Scan root bus under the PCI rescan-remove lock
ACPI / hotplug / PCI: Move PCI rescan-remove locking to hotplug_event()
ACPI / hotplug / PCI: Remove entries from bus->devices in reverse order
* acpi-hotplug:
ACPI / hotplug: Fix panic on eject to ejected device
Linus Torvalds [Thu, 6 Feb 2014 21:49:03 +0000 (13:49 -0800)]
Merge branch 'akpm' (patches from Andrew Morton)
Merge a bunch of fixes from Andrew Morton:
"Commit
579f82901f6f ("swap: add a simple detector for inappropriate
swapin readahead") is a feature. No probs if you decide to defer it
until the next merge window.
It has been sitting in my tree for over a year because of my dislike
of all the magic numbers, but recent discussion with Hugh has made me
give up"
* emailed patches fron Andrew Morton <akpm@linux-foundation.org>:
mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq
arch/x86/mm/numa.c: fix array index overflow when synchronizing nid to memblock.reserved.
arch/x86/mm/numa.c: initialize numa_kernel_nodes in numa_clear_kernel_node_hotplug()
mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()
mm/swap: fix race on swap_info reuse between swapoff and swapon
swap: add a simple detector for inappropriate swapin readahead
ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters
Documentation/kernel-parameters.txt: fix memmap= language
KOSAKI Motohiro [Thu, 6 Feb 2014 20:04:28 +0000 (12:04 -0800)]
mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq
To use spin_{un}lock_irq is dangerous if caller disabled interrupt.
During aio buffer migration, we have a possibility to see the following
call stack.
aio_migratepage [disable interrupt]
migrate_page_copy
clear_page_dirty_for_io
set_page_dirty
__set_page_dirty_buffers
__set_page_dirty
spin_lock_irq
This mean, current aio migration is a deadlockable. spin_lock_irqsave
is a safer alternative and we should use it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Rientjes rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tang Chen [Thu, 6 Feb 2014 20:04:27 +0000 (12:04 -0800)]
arch/x86/mm/numa.c: fix array index overflow when synchronizing nid to memblock.reserved.
The following path will cause array out of bound.
memblock_add_region() will always set nid in memblock.reserved to
MAX_NUMNODES. In numa_register_memblks(), after we set all nid to
correct valus in memblock.reserved, we called setup_node_data(), and
used memblock_alloc_nid() to allocate memory, with nid set to
MAX_NUMNODES.
The nodemask_t type can be seen as a bit array. And the index is 0 ~
MAX_NUMNODES-1.
After that, when we call node_set() in numa_clear_kernel_node_hotplug(),
the nodemask_t got an index of value MAX_NUMNODES, which is out of [0 ~
MAX_NUMNODES-1].
See below:
numa_init()
|---> numa_register_memblks()
| |---> memblock_set_node(memory) set correct nid in memblock.memory
| |---> memblock_set_node(reserved) set correct nid in memblock.reserved
| |......
| |---> setup_node_data()
| |---> memblock_alloc_nid() here, nid is set to MAX_NUMNODES (1024)
|......
|---> numa_clear_kernel_node_hotplug()
|---> node_set() here, we have an index 1024, and overflowed
This patch moves nid setting to numa_clear_kernel_node_hotplug() to fix
this problem.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tang Chen [Thu, 6 Feb 2014 20:04:25 +0000 (12:04 -0800)]
arch/x86/mm/numa.c: initialize numa_kernel_nodes in numa_clear_kernel_node_hotplug()
On-stack variable numa_kernel_nodes in numa_clear_kernel_node_hotplug()
was not initialized. So we need to initialize it.
[akpm@linux-foundation.org: use NODE_MASK_NONE, per David]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Thu, 6 Feb 2014 20:04:24 +0000 (12:04 -0800)]
mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()
During aio stress test, we observed the following lockdep warning. This
mean AIO+numa_balancing is currently deadlockable.
The problem is, aio_migratepage disable interrupt, but
__set_page_dirty_nobuffers unintentionally enable it again.
Generally, all helper function should use spin_lock_irqsave() instead of
spin_lock_irq() because they don't know caller at all.
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&ctx->completion_lock)->rlock);
<Interrupt>
lock(&(&ctx->completion_lock)->rlock);
*** DEADLOCK ***
dump_stack+0x19/0x1b
print_usage_bug+0x1f7/0x208
mark_lock+0x21d/0x2a0
mark_held_locks+0xb9/0x140
trace_hardirqs_on_caller+0x105/0x1d0
trace_hardirqs_on+0xd/0x10
_raw_spin_unlock_irq+0x2c/0x50
__set_page_dirty_nobuffers+0x8c/0xf0
migrate_page_copy+0x434/0x540
aio_migratepage+0xb1/0x140
move_to_new_page+0x7d/0x230
migrate_pages+0x5e5/0x700
migrate_misplaced_page+0xbc/0xf0
do_numa_page+0x102/0x190
handle_pte_fault+0x241/0x970
handle_mm_fault+0x265/0x370
__do_page_fault+0x172/0x5a0
do_page_fault+0x1a/0x70
page_fault+0x28/0x30
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Weijie Yang [Thu, 6 Feb 2014 20:04:23 +0000 (12:04 -0800)]
mm/swap: fix race on swap_info reuse between swapoff and swapon
swapoff clear swap_info's SWP_USED flag prematurely and free its
resources after that. A concurrent swapon will reuse this swap_info
while its previous resources are not cleared completely.
These late freed resources are:
- p->percpu_cluster
- swap_cgroup_ctrl[type]
- block_device setting
- inode->i_flags &= ~S_SWAPFILE
This patch clears the SWP_USED flag after all its resources are freed,
so that swapon can reuse this swap_info by alloc_swap_info() safely.
[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Thu, 6 Feb 2014 20:04:21 +0000 (12:04 -0800)]
swap: add a simple detector for inappropriate swapin readahead
This is a patch to improve swap readahead algorithm. It's from Hugh and
I slightly changed it.
Hugh's original changelog:
swapin readahead does a blind readahead, whether or not the swapin is
sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages? But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in caused
significant overhead.
This patch adds very simplistic random read detection. Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly. There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.
The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the harddisk tests had taken painfully too long when I
used mem=500M, but SSD shows similar results for that).
Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).
HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.
HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 73921 76210 75611 76904 78191 121542
Seq Shmem 73601 73176 73855 72947 74543 118322
Rand Anon 895392 831243 871569 845197 846496 841680
Rand Shmem 1058375 1053486 827935 764955 764376 756489
SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 24634 24198 24673 25107 21614 70018
Seq Shmem 24959 24932 25052 25703 22030 69678
Rand Anon 43014 26146 28075 25989 26935 25901
Rand Shmem 45349 45215 28249 24268 24138 24332
These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.
Shaohua Li:
Test shows Vanilla is slightly better in sequential workload than Hugh's
patch. I observed with Hugh's patch sometimes the readahead size is
shrinked too fast (from 8 to 1 immediately) in sequential workload if
there is no hit. And in such case, continuing doing readahead is good
actually.
I don't prepare a sophisticated algorithm for the sequential workload
because so far we can't guarantee sequential accessed pages are swap out
sequentially. So I slightly change Hugh's heuristic - don't shrink
readahead size too fast.
Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444
Attached graph is the swapin/swapout throughput I collected with 'vmstat
2'. The first part is running a random workload (till around 1200 of
the x-axis) and the second part is running a sequential workload.
swapin and swapout throughput are almost identical in steady state in
both workloads. These are expected behavior. while in Vanilla, swapin
is much bigger than swapout especially in random workload (because wrong
readahead).
Original patches by: Shaohua Li and Konstantin Khlebnikov.
[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zongxun Wang [Thu, 6 Feb 2014 20:04:20 +0000 (12:04 -0800)]
ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters
Even if using the same jbd2 handle, we cannot rollback a transaction.
So once some error occurs after successfully allocating clusters, the
allocated clusters will never be used and it means they are lost. For
example, call ocfs2_claim_clusters successfully when expanding a file,
but failed in ocfs2_insert_extent. So we need free the allocated
clusters if they are not used indeed.
Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Thu, 6 Feb 2014 20:04:19 +0000 (12:04 -0800)]
Documentation/kernel-parameters.txt: fix memmap= language
Clean up descriptions of memmap= boot options.
Add periods (full stops), drop commas, change "used" to "reserved" or
"marked".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andiry Xu <andiry.xu@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 6 Feb 2014 21:32:38 +0000 (13:32 -0800)]
Merge tag 'sound-3.14-rc2' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A few HD-audio fixes and one USB-audio kconfig dependency fix. All
small and device-specific changes marked with Cc to stable"
* tag 'sound-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Improve loopback path lookups for AD1983
ALSA: hda - Fix missing VREF setup for Mac Pro 1,1
ALSA: hda - Add missing mixer widget for AD1983
ALSA: hda/realtek - Avoid invalid COEFs for ALC271X
ALSA: hda - Fix silent output on Toshiba Satellite L40
ALSA: usb-audio: Add missing kconfig dependecy
Linus Torvalds [Thu, 6 Feb 2014 21:31:42 +0000 (13:31 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A few regression fixes already, one for my own stupidity, and mgag200
typo fix, vmwgfx fixes and ttm regression fixes, and a radeon register
checker update for older cards to handle geom shaders"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: allow geom rings to be setup on r600/r700 (v2)
drm/mgag200,ast,cirrus: fix regression with drm_can_sleep conversion
drm/ttm: Don't clear page metadata of imported sg pages
drm/ttm: Fix TTM object open regression
vmwgfx: Fix unitialized stack read in vmw_setup_otable_base
drm/vmwgfx: Reemit context bindings when necessary v2
drm/vmwgfx: Detect old user-space drivers and set up legacy emulation v2
drm/vmwgfx: Emulate legacy shaders on guest-backed devices v2
drm/vmwgfx: Fix legacy surface reference size copyback
drm/vmwgfx: Fix SET_SHADER_CONST emulation on guest-backed devices
drm/vmwgfx: Fix regression caused by "drm/ttm: make ttm reservation calls behave like reservation calls"
drm/vmwgfx: Don't commit staged bindings if execbuf fails
drm/mgag200: fix typo causing bw limits to be ignored on some chips
Borislav Petkov [Mon, 3 Feb 2014 20:41:44 +0000 (21:41 +0100)]
x86, microcode, AMD: Unify valid container checks
For additional coverage, BorisO and friends unknowlingly did swap AMD
microcode with Intel microcode blobs in order to see what happens. What
did happen on 32-bit was
[ 5.722656] BUG: unable to handle kernel paging request at
be3a6008
[ 5.722693] IP: [<
c106d6b4>] load_microcode_amd+0x24/0x3f0
[ 5.722716] *pdpt =
0000000000000000 *pde =
0000000000000000
because there was a valid initrd there but without valid microcode in it
and the container check happened *after* the relocated ramdisk handling
on 32-bit, which was clearly wrong.
While at it, take care of the ramdisk relocation on both 32- and 64-bit
as it is done on both. Also, comment what we're doing because this code
is a bit tricky.
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1391460104-7261-1-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Peter Oberparleiter [Thu, 6 Feb 2014 14:58:20 +0000 (15:58 +0100)]
x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
Commit
d61931d89b, "x86: Add optimized popcnt variants" introduced
compile flag -fcall-saved-rdi for lib/hweight.c. When combined with
options -fprofile-arcs and -O2, this flag causes gcc to generate
broken constructor code. As a result, a 64 bit x86 kernel compiled
with CONFIG_GCOV_PROFILE_ALL=y prints message "gcov: could not create
file" and runs into sproadic BUGs during boot.
The gcc people indicate that these kinds of problems are endemic when
using ad hoc calling conventions. It is therefore best to treat any
file compiled with ad hoc calling conventions as an isolated
environment and avoid things like profiling or coverage analysis,
since those subsystems assume a "normal" calling conventions.
This patch avoids the bug by excluding lib/hweight.o from coverage
profiling.
Reported-by: Meelis Roos <mroos@linux.ee>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Laxman Dewangan [Wed, 5 Feb 2014 13:41:34 +0000 (19:11 +0530)]
pinctrl: tegra: return correct error type
When memory allocation failed, drive should return error as ENOMEM.
Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Florian Vaussard [Wed, 5 Feb 2014 06:51:22 +0000 (07:51 +0100)]
pinctrl: do not init debugfs entries for unimplemented functionalities
Commit c420619 "pinctrl: pinconf: remove checks on ops->pin_config_get"
removed the check on (ops != NULL) when performing pinconf_pins_show() or
pinconf_groups_show(). As these entries are always enabled, even if
pinconf is not supported, reading will result in an oops due to NULL
ops.
Instead of checking for ops, remove the corresponding debugfs entries if
pinconf and/or pinmux are not implemented.
Tested on OMAP3 (pinctrl-single).
Cc: stable@vger.kernel.org
Signed-off-by: Florian Vaussard <florian.vaussard@epfl.ch>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Aaro Koskinen [Wed, 5 Feb 2014 20:05:44 +0000 (22:05 +0200)]
MIPS: fpu.h: Fix build when CONFIG_BUG is not set
__enable_fpu produces a build failure when CONFIG_BUG is not set:
In file included from arch/mips/kernel/cpu-probe.c:24:0:
arch/mips/include/asm/fpu.h: In function '__enable_fpu':
arch/mips/include/asm/fpu.h:77:1: error: control reaches end of non-void function [-Werror=return-type]
This is regression introduced in 3.14-rc1. Fix that.
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Acked-by: Paul Burton <paul.burton@imgtec.com>
Cc: John Crispin <blogic@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/6504/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Will Deacon [Thu, 6 Feb 2014 11:30:48 +0000 (11:30 +0000)]
arm64: barriers: allow dsb macro to take option parameter
The dsb instruction takes an option specifying both the target access
types and shareability domain.
This patch allows such an option to be passed to the dsb macro,
resulting in potentially more efficient code. Currently the option is
ignored until all callers are updated (unlike ARM, the option is
mandated by the assembler).
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Dave Airlie [Thu, 30 Jan 2014 04:11:12 +0000 (14:11 +1000)]
drm/radeon: allow geom rings to be setup on r600/r700 (v2)
the evergreen CS parser has allowed this for a while, just port
the code to the r600 one.
This is required before geom shaders can be made work.
v2: agd5f: minor cleanup and add additional 7xx reg.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Thu, 6 Feb 2014 02:04:31 +0000 (12:04 +1000)]
Merge tag 'vmwgfx-fixes-3.14-2014-02-05' of git://people.freedesktop.org/~thomash/linux into drm-next
A couple of vmwgfx fixes together with missing bits of legacy device
emulation to facilitate old user-space drivers on new devices.
The shader emulation bits are a bit large, but since they mostly touch the
new device code, regressions are unlikely. I figure the gain of having
this from the start clearly outweighs the risc of adding these bits at
this point.
Pull request of 2014-02-05
* tag 'vmwgfx-fixes-3.14-2014-02-05' of git://people.freedesktop.org/~thomash/linux:
vmwgfx: Fix unitialized stack read in vmw_setup_otable_base
drm/vmwgfx: Reemit context bindings when necessary v2
drm/vmwgfx: Detect old user-space drivers and set up legacy emulation v2
drm/vmwgfx: Emulate legacy shaders on guest-backed devices v2
drm/vmwgfx: Fix legacy surface reference size copyback
drm/vmwgfx: Fix SET_SHADER_CONST emulation on guest-backed devices
drm/vmwgfx: Fix regression caused by "drm/ttm: make ttm reservation calls behave like reservation calls"
drm/vmwgfx: Don't commit staged bindings if execbuf fails
Dave Airlie [Thu, 6 Feb 2014 01:50:48 +0000 (11:50 +1000)]
Merge tag 'ttm-fixes-3.14-2014-02-05' of git://people.freedesktop.org/~thomash/linux into drm-next
Two ttm regression fixes.
Pull request of 2014-02-05
* tag 'ttm-fixes-3.14-2014-02-05' of git://people.freedesktop.org/~thomash/linux:
drm/ttm: Don't clear page metadata of imported sg pages
drm/ttm: Fix TTM object open regression
Dave Airlie [Wed, 5 Feb 2014 04:47:45 +0000 (14:47 +1000)]
drm/mgag200,ast,cirrus: fix regression with drm_can_sleep conversion
I totally sign inverted my way out of this one.
Cc: stable@vger.kernel.org
Reported-by: "Sabrina Dubroca" <sd@queasysnail.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Linus Torvalds [Thu, 6 Feb 2014 00:02:53 +0000 (16:02 -0800)]
Merge branch 'irq-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"This lot provides:
* Bugfixes for armada irq controller
* Updates to renesas irq chip
* Support for the TI-NSPIRE irq controller
Not strictly a bug fix only pull request, but important updates for
some of the arm Socs which I completely forgot to send last week.
Seems like my obliviousness is getting worse, I just can't remember
when it started"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Add support for TI-NSPIRE irqchip
irqchip: renesas-irqc: Enable mask on suspend
irqchip: renesas-irqc: Use lazy disable
irqchip: armada-370-xp: fix MSI race condition
irqchip: armada-370-xp: fix IPI race condition
Linus Torvalds [Thu, 6 Feb 2014 00:01:11 +0000 (16:01 -0800)]
Merge tag 'stable/for-linus-3.14-rc1-tag' of git://git./linux/kernel/git/xen/tip
Pull Xen fixes from Konrad Rzeszutek Wilk:
"Bug-fixes:
- Revert "xen/grant-table: Avoid m2p_override during mapping" as it
broke Xen ARM build.
- Fix CR4 not being set on AP processors in Xen PVH mode"
* tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pvh: set CR4 flags for APs
Revert "xen/grant-table: Avoid m2p_override during mapping"
Linus Torvalds [Thu, 6 Feb 2014 00:00:27 +0000 (16:00 -0800)]
Merge tag 'please-pull-ia64-syscalls' of git://git./linux/kernel/git/aegl/linux
Pull ia64 update from Tony Luck:
"Wire up new sched_setattr and sched_getattr syscalls"
* tag 'please-pull-ia64-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
[IA64] Wire up new sched_setattr and sched_getattr syscalls
Linus Torvalds [Wed, 5 Feb 2014 23:53:26 +0000 (15:53 -0800)]
Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
"Looks like I missed the merge window ... but these are almost all
bugfixes anyway (the ones that aren't have been baking for months)"
* git://git.infradead.org/users/willy/linux-nvme:
NVMe: Namespace use after free on surprise removal
NVMe: Correct uses of INIT_WORK
NVMe: Include device and queue numbers in interrupt name
NVMe: Add a pci_driver shutdown method
NVMe: Disable admin queue on init failure
NVMe: Dynamically allocate partition numbers
NVMe: Async IO queue deletion
NVMe: Surprise removal handling
NVMe: Abort timed out commands
NVMe: Schedule reset for failed controllers
NVMe: Device resume error handling
NVMe: Cache dev->pci_dev in a local pointer
NVMe: Fix lockdep warnings
NVMe: compat SG_IO ioctl
NVMe: remove deprecated IRQF_DISABLED
NVMe: Avoid shift operation when writing cq head doorbell
Linus Torvalds [Wed, 5 Feb 2014 23:52:26 +0000 (15:52 -0800)]
Merge tag 'regulator-v3.14-rc1' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of driver fixes here but the main thing is a fix to the
checks for deferred probe non-DT systems with fully specified
regulators which had been broken by a device tree fix which meant that
we wouldn't insert optional regulators.
This had slipped through the cracks since very few systems do that in
the first place and those that do it in mainline don't need optional
regulators anyway"
* tag 'regulator-v3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: s2mps11: Fix NULL pointer of_node value when using platform data
regulator: core: Correct default return value for full constraints
regulator: ab3100: cast fix
Linus Torvalds [Wed, 5 Feb 2014 23:51:42 +0000 (15:51 -0800)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes a number of concurrency issues on s390 where multiple users
of the same crypto transform may clobber each other's results"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: s390 - fix des and des3_ede ctr concurrency issue
crypto: s390 - fix des and des3_ede cbc concurrency issue
crypto: s390 - fix concurrency issue in aes-ctr mode
Matt Fleming [Tue, 14 Jan 2014 12:40:09 +0000 (12:40 +0000)]
x86/efi: Allow mapping BGRT on x86-32
CONFIG_X86_32 doesn't map the boot services regions into the EFI memory
map (see commit
700870119f49 ("x86, efi: Don't map Boot Services on
i386")), and so efi_lookup_mapped_addr() will fail to return a valid
address. Executing the ioremap() path in efi_bgrt_init() causes the
following warning on x86-32 because we're trying to ioremap() RAM,
WARNING: CPU: 0 PID: 0 at arch/x86/mm/ioremap.c:102 __ioremap_caller+0x2ad/0x2c0()
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-0.rc5.git0.1.2.fc21.i686 #1
Hardware name: DellInc. Venue 8 Pro 5830/09RP78, BIOS A02 10/17/2013
00000000 00000000 c0c0df08 c09a5196 00000000 c0c0df38 c0448c1e c0b41310
00000000 00000000 c0b37bc1 00000066 c043bbfd c043bbfd 00e7dfe0 00073eff
00073eff c0c0df48 c0448ce2 00000009 00000000 c0c0df9c c043bbfd 00078d88
Call Trace:
[<
c09a5196>] dump_stack+0x41/0x52
[<
c0448c1e>] warn_slowpath_common+0x7e/0xa0
[<
c043bbfd>] ? __ioremap_caller+0x2ad/0x2c0
[<
c043bbfd>] ? __ioremap_caller+0x2ad/0x2c0
[<
c0448ce2>] warn_slowpath_null+0x22/0x30
[<
c043bbfd>] __ioremap_caller+0x2ad/0x2c0
[<
c0718f92>] ? acpi_tb_verify_table+0x1c/0x43
[<
c0719c78>] ? acpi_get_table_with_size+0x63/0xb5
[<
c087cd5e>] ? efi_lookup_mapped_addr+0xe/0xf0
[<
c043bc2b>] ioremap_nocache+0x1b/0x20
[<
c0cb01c8>] ? efi_bgrt_init+0x83/0x10c
[<
c0cb01c8>] efi_bgrt_init+0x83/0x10c
[<
c0cafd82>] efi_late_init+0x8/0xa
[<
c0c9bab2>] start_kernel+0x3ae/0x3c3
[<
c0c9b53b>] ? repair_env_string+0x51/0x51
[<
c0c9b378>] i386_start_kernel+0x12e/0x131
Switch to using early_memremap(), which won't trigger this warning, and
has the added benefit of more accurately conveying what we're trying to
do - map a chunk of memory.
This patch addresses the following bug report,
https://bugzilla.kernel.org/show_bug.cgi?id=67911
Reported-by: Adam Williamson <awilliam@redhat.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Ingo Molnar [Wed, 5 Feb 2014 05:51:37 +0000 (06:51 +0100)]
x86: Disable CONFIG_X86_DECODER_SELFTEST in allmod/allyesconfigs
It can take some time to validate the image, make sure
{allyes|allmod}config doesn't enable it.
I'd say randconfig will cover it often enough, and the failure is also
borderline build coverage related: you cannot really make the decoder
test fail via source level changes, only with changes in the build
environment, so I agree with Andi that we can disable this one too.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Paul Gortmaker paul.gortmaker@windriver.com>
Suggested-and-acked-by: Andi Kleen andi@firstfloor.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 5 Feb 2014 20:54:53 +0000 (12:54 -0800)]
execve: use 'struct filename *' for executable name passing
This changes 'do_execve()' to get the executable name as a 'struct
filename', and to free it when it is done. This is what the normal
users want, and it simplifies and streamlines their error handling.
The controlled lifetime of the executable name also fixes a
use-after-free problem with the trace_sched_process_exec tracepoint: the
lifetime of the passed-in string for kernel users was not at all
obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
the pathname allocation lifetime with the execve() having finished,
which in turn meant that the trace point that happened after
mm_release() of the old process VM ended up using already free'd memory.
To solve the kernel string lifetime issue, this simply introduces
"getname_kernel()" that works like the normal user-space getname()
function, except with the source coming from kernel memory.
As Oleg points out, this also means that we could drop the tcomm[] array
from 'struct linux_binprm', since the pathname lifetime now covers
setup_new_exec(). That would be a separate cleanup.
Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 29 Jan 2014 17:04:03 +0000 (12:04 -0500)]
kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding
uninitialized lockdep_map to lockdep triggering warning like the
following on USB stick hotunplug.
usb 1-2: USB disconnect, device number 2
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
Call Trace:
[<
ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
[<
ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
[<
ffffffff810f931a>] lock_acquire+0x9a/0x1d0
[<
ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
[<
ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
[<
ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
[<
ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
[<
ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
[<
ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
[<
ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
[<
ffffffff8177e25e>] device_del+0x12e/0x1d0
[<
ffffffff819722c2>] usb_disconnect+0x122/0x1a0
[<
ffffffff819749b5>] hub_thread+0x3c5/0x1290
[<
ffffffff810c6a6d>] kthread+0xed/0x110
[<
ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0
Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Jiri Kosina <jkosina@suse.cz>
Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stephen Smalley [Thu, 30 Jan 2014 16:26:59 +0000 (11:26 -0500)]
SELinux: Fix kernel BUG on empty security contexts.
Setting an empty security context (length=0) on a file will
lead to incorrectly dereferencing the type and other fields
of the security context structure, yielding a kernel BUG.
As a zero-length security context is never valid, just reject
all such security contexts whether coming from userspace
via setxattr or coming from the filesystem upon a getxattr
request by SELinux.
Setting a security context value (empty or otherwise) unknown to
SELinux in the first place is only possible for a root process
(CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only
if the corresponding SELinux mac_admin permission is also granted
to the domain by policy. In Fedora policies, this is only allowed for
specific domains such as livecd for setting down security contexts
that are not defined in the build host policy.
Reproducer:
su
setenforce 0
touch foo
setfattr -n security.selinux foo
Caveat:
Relabeling or removing foo after doing the above may not be possible
without booting with SELinux disabled. Any subsequent access to foo
after doing the above will also trigger the BUG.
BUG output from Matthew Thode:
[ 473.893141] ------------[ cut here ]------------
[ 473.962110] kernel BUG at security/selinux/ss/services.c:654!
[ 473.995314] invalid opcode: 0000 [#6] SMP
[ 474.027196] Modules linked in:
[ 474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G D I
3.13.0-grsec #1
[ 474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0
07/29/10
[ 474.149768] task:
ffff8805f50cd010 ti:
ffff8805f50cd488 task.ti:
ffff8805f50cd488
[ 474.183707] RIP: 0010:[<
ffffffff814681c7>] [<
ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[ 474.219954] RSP: 0018:
ffff8805c0ac3c38 EFLAGS:
00010246
[ 474.252253] RAX:
0000000000000000 RBX:
ffff8805c0ac3d94 RCX:
0000000000000100
[ 474.287018] RDX:
ffff8805e8aac000 RSI:
00000000ffffffff RDI:
ffff8805e8aaa000
[ 474.321199] RBP:
ffff8805c0ac3cb8 R08:
0000000000000010 R09:
0000000000000006
[ 474.357446] R10:
0000000000000000 R11:
ffff8805c567a000 R12:
0000000000000006
[ 474.419191] R13:
ffff8805c2b74e88 R14:
00000000000001da R15:
0000000000000000
[ 474.453816] FS:
00007f2e75220800(0000) GS:
ffff88061fc00000(0000)
knlGS:
0000000000000000
[ 474.489254] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 474.522215] CR2:
00007f2e74716090 CR3:
00000005c085e000 CR4:
00000000000207f0
[ 474.556058] Stack:
[ 474.584325]
ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98
ffff8805f1190a40
[ 474.618913]
ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990
ffff8805e8aac860
[ 474.653955]
ffff8805c0ac3cb8 000700068113833a ffff880606c75060
ffff8805c0ac3d94
[ 474.690461] Call Trace:
[ 474.723779] [<
ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a
[ 474.778049] [<
ffffffff81468824>] security_compute_av+0xf4/0x20b
[ 474.811398] [<
ffffffff8196f419>] avc_compute_av+0x2a/0x179
[ 474.843813] [<
ffffffff8145727b>] avc_has_perm+0x45/0xf4
[ 474.875694] [<
ffffffff81457d0e>] inode_has_perm+0x2a/0x31
[ 474.907370] [<
ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e
[ 474.938726] [<
ffffffff81455cf6>] security_inode_getattr+0x1b/0x22
[ 474.970036] [<
ffffffff811b057d>] vfs_getattr+0x19/0x2d
[ 475.000618] [<
ffffffff811b05e5>] vfs_fstatat+0x54/0x91
[ 475.030402] [<
ffffffff811b063b>] vfs_lstat+0x19/0x1b
[ 475.061097] [<
ffffffff811b077e>] SyS_newlstat+0x15/0x30
[ 475.094595] [<
ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3
[ 475.148405] [<
ffffffff8197791e>] system_call_fastpath+0x16/0x1b
[ 475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48
8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7
75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8
[ 475.255884] RIP [<
ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[ 475.296120] RSP <
ffff8805c0ac3c38>
[ 475.328734] ---[ end trace
f076482e9d754adc ]---
Reported-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Paul Moore [Tue, 28 Jan 2014 19:45:41 +0000 (14:45 -0500)]
selinux: add SOCK_DIAG_BY_FAMILY to the list of netlink message types
The SELinux AF_NETLINK/NETLINK_SOCK_DIAG socket class was missing the
SOCK_DIAG_BY_FAMILY definition which caused SELINUX_ERR messages when
the ss tool was run.
# ss
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 0 0 * 14189 * 14190
u_str ESTAB 0 0 * 14145 * 14144
u_str ESTAB 0 0 * 14151 * 14150
{...}
# ausearch -m SELINUX_ERR
----
time->Thu Jan 23 11:11:16 2014
type=SYSCALL msg=audit(
1390493476.445:374):
arch=
c000003e syscall=44 success=yes exit=40
a0=3 a1=
7fff03aa11f0 a2=28 a3=0 items=0 ppid=1852 pid=1895
auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0
tty=pts0 ses=1 comm="ss" exe="/usr/sbin/ss"
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=SELINUX_ERR msg=audit(
1390493476.445:374):
SELinux: unrecognized netlink message type=20 for sclass=32
Signed-off-by: Paul Moore <pmoore@redhat.com>
Paul Moore [Tue, 28 Jan 2014 19:44:16 +0000 (14:44 -0500)]
Merge tag 'v3.13' into stable-3.14
Linux 3.13
Conflicts:
security/selinux/hooks.c
Trivial merge issue in selinux_inet_conn_request() likely due to me
including patches that I sent to the stable folks in my next tree
resulting in the patch hitting twice (I think). Thankfully it was an
easy fix this time, but regardless, lesson learned, I will not do that
again.
Thomas Hellstrom [Wed, 5 Feb 2014 08:18:26 +0000 (09:18 +0100)]
drm/ttm: Don't clear page metadata of imported sg pages
These page pointers shouldn't be visible to TTM in the first place, but
until we fix that up, don't clear the page metadata because that
will upset the exporter.
Reported-and-tested-by: Cristoph Haag <haagch.christoph@googleemail.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Jakob Bornecrantz <jakob@vmware.com>
Colin Cross [Tue, 4 Feb 2014 02:15:32 +0000 (02:15 +0000)]
security: select correct default LSM_MMAP_MIN_ADDR on arm on arm64
Binaries compiled for arm may run on arm64 if CONFIG_COMPAT is
selected. Set LSM_MMAP_MIN_ADDR to 32768 if ARM64 && COMPAT to
prevent selinux failures launching 32-bit static executables that
are mapped at 0x8000.
Signed-off-by: Colin Cross <ccross@android.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Wed, 5 Feb 2014 12:03:52 +0000 (12:03 +0000)]
arm64: compat: Wire up new AArch32 syscalls
This patch enables sys_compat, sys_finit_module, sys_sched_setattr and
sys_sched_getattr for compat (AArch32) applications.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Nathan Lynch [Mon, 3 Feb 2014 19:48:52 +0000 (19:48 +0000)]
arm64: vdso: update wtm fields for CLOCK_MONOTONIC_COARSE
Update wall-to-monotonic fields in the VDSO data page
unconditionally. These are used to service CLOCK_MONOTONIC_COARSE,
which is not guarded by use_syscall.
Signed-off-by: Nathan Lynch <nathan_lynch@mentor.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Nathan Lynch [Wed, 5 Feb 2014 05:53:04 +0000 (05:53 +0000)]
arm64: vdso: fix coarse clock handling
When __kernel_clock_gettime is called with a CLOCK_MONOTONIC_COARSE or
CLOCK_REALTIME_COARSE clock id, it returns incorrectly to whatever the
caller has placed in x2 ("ret x2" to return from the fast path). Fix
this by saving x30/LR to x2 only in code that will call
__do_get_tspec, restoring x30 afterward, and using a plain "ret" to
return from the routine.
Also: while the resulting tv_nsec value for CLOCK_REALTIME and
CLOCK_MONOTONIC must be computed using intermediate values that are
left-shifted by cs_shift (x12, set by __do_get_tspec), the results for
coarse clocks should be calculated using unshifted values
(xtime_coarse_nsec is in units of actual nanoseconds). The current
code shifts intermediate values by x12 unconditionally, but x12 is
uninitialized when servicing a coarse clock. Fix this by setting x12
to 0 once we know we are dealing with a coarse clock id.
Signed-off-by: Nathan Lynch <nathan_lynch@mentor.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Toshi Kani [Wed, 5 Feb 2014 00:48:28 +0000 (17:48 -0700)]
ACPI / hotplug: Fix panic on eject to ejected device
When an eject request is sent to an ejected ACPI device, the following
panic occurs:
ACPI: \_SB_.SCK3.CPU3: ACPI_NOTIFY_EJECT_REQUEST event
BUG: unable to handle kernel NULL pointer dereference at
0000000000000070
IP: [<
ffffffff813a7cfe>] acpi_device_hotplug+0x10b/0x33b
:
Call Trace:
[<
ffffffff813a24da>] acpi_hotplug_work_fn+0x1c/0x27
[<
ffffffff8109cbe5>] process_one_work+0x175/0x430
[<
ffffffff8109d7db>] worker_thread+0x11b/0x3a0
This is becase device->handler is NULL in acpi_device_hotplug().
This case was used to fail in acpi_hotplug_notify_cb() as the target
had no acpi_deivce. However, acpi_device now exists after ejection.
Added a check to verify if acpi_device->handler is valid for an
eject request in acpi_hotplug_notify_cb(). Note that handler passed
from an argument is still valid while acpi_device->handler is NULL.
Fixes:
202317a573b2 (ACPI / scan: Add acpi_device objects for all device nodes in the namespace)
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Mark Rutland [Wed, 5 Feb 2014 10:24:13 +0000 (10:24 +0000)]
arm64: simplify pgd_alloc
Currently pgd_alloc has a redundant NULL check in its return path that
can be removed with no ill effects. With that removed it's also possible
to return early and eliminate the new_pgd temporary variable.
This patch applies said modifications, making the logic of pgd_alloc
correspond 1-1 with that of pgd_free.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 5 Feb 2014 10:24:12 +0000 (10:24 +0000)]
arm64: fix typo: s/SERRROR/SERROR/
Somehow SERROR has acquired an additional 'R' in a couple of headers.
This patch removes them before they spread further. As neither instance
is in use yet, no other sites need to be fixed up.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Tue, 4 Feb 2014 16:01:31 +0000 (16:01 +0000)]
arm64: Invalidate the TLB when replacing pmd entries during boot
With the 64K page size configuration, __create_page_tables in head.S
maps enough memory to get started but using 64K pages rather than 512M
sections with a single pgd/pud/pmd entry pointing to a pte table.
create_mapping() may override the pgd/pud/pmd table entry with a block
(section) one if the RAM size is more than 512MB and aligned correctly.
For the end of this block to be accessible, the old TLB entry must be
invalidated.
Cc: <stable@vger.kernel.org>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Laura Abbott [Tue, 4 Feb 2014 23:08:57 +0000 (23:08 +0000)]
arm64: Align CMA sizes to PAGE_SIZE
dma_alloc_from_contiguous takes number of pages for a size.
Align up the dma size passed in to page size to avoid truncation
and allocation failures on sizes less than PAGE_SIZE.
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Vinayak Kale [Wed, 5 Feb 2014 09:34:36 +0000 (09:34 +0000)]
arm64: add DSB after icache flush in __flush_icache_all()
Add DSB after icache flush to complete the cache maintenance operation.
The function __flush_icache_all() is used only for user space mappings
and an ISB is not required because of an exception return before executing
user instructions. An exception return would behave like an ISB.
Signed-off-by: Vinayak Kale <vkale@apm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Nitin A Kamble [Fri, 31 Jan 2014 00:50:10 +0000 (16:50 -0800)]
genirq: Generic irq chip requires IRQ_DOMAIN
The generic_chip.c uses interfaces from irq_domain.c which is
controlled by the IRQ_DOMAIN config option, but there is no Kconfig
dependency so the build can fail:
linux/kernel/irq/generic-chip.c:400:11: error:
'irq_domain_xlate_onetwocell' undeclared here (not in a function)
Select IRQ_DOMAIN when GENERIC_IRQ_CHIP is selected.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Link: http://lkml.kernel.org/r/1391129410-54548-2-git-send-email-nitin.a.kamble@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # 3.11+
Thomas Hellstrom [Fri, 24 Jan 2014 07:49:45 +0000 (08:49 +0100)]
drm/ttm: Fix TTM object open regression
Commit drm/ttm: ttm object security fixes for render nodes introduced a
regression where, if a TTM object was opened multiple times from the same
open file, the caller would spin uninterruptibly in the kernel.
Fix this.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Jakob Bornecrantz <jakob@vmware.com>
Dave Jones [Fri, 31 Jan 2014 02:27:25 +0000 (21:27 -0500)]
vmwgfx: Fix unitialized stack read in vmw_setup_otable_base
One of the error paths in vmw_setup_otable_base causes us to return with
'ret' having never been set to anything causing us to return whatever was
on the stack.
Found with Coverity
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Takashi Iwai [Wed, 5 Feb 2014 07:49:41 +0000 (08:49 +0100)]
ALSA: hda - Improve loopback path lookups for AD1983
AD1983 has flexible loopback routes and the generic parser would take
wrong path confusingly instead of taking individual paths via NID 0x0c
and 0x0d. For avoiding it, limit the connections at these widgets so
that the parser can think more straightforwardly. This fixes the
regression of the missing line-in loopback on Dell machine.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70011
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Thomas Hellstrom [Wed, 5 Feb 2014 07:13:56 +0000 (08:13 +0100)]
drm/vmwgfx: Reemit context bindings when necessary v2
When a context is first referenced in the command stream, make sure that all
scrubbed (as a result of eviction) bindings are re-emitted. Also make sure that
all bound resources are put on the resource validate list.
This is needed for legacy emulation, since legacy user-space drivers will
typically not re-emit shader bindings. It also removes the requirement for
user-space drivers to re-emit render-target- and texture bindings.
Makes suspend and hibernate now also work with legacy user-space drivers on
guest-backed devices.
v2: Don't rebind on legacy devices.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Jakob Bornecrantz <jakob@vmware.com>