Miao Xie [Thu, 3 Jul 2014 10:22:12 +0000 (18:22 +0800)]
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Anand Jain [Thu, 3 Jul 2014 10:22:06 +0000 (18:22 +0800)]
btrfs: check generation as replace duplicates devid+uuid
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.
device_list_add() function is called in the following threads,
mount device option
mount argument
ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
they have been unit tested to work fine with this patch.
If the user knows what he is doing and really want to pair with
replaced disk (which is not a standard operation), then he should
first clear the kernel btrfs device list in the memory by doing
the module unload/load and followed with the mount -o device option.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Anand Jain [Thu, 3 Jul 2014 10:22:05 +0000 (18:22 +0800)]
Btrfs: device_list_add() should not update list when mounted
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.
Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.
In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.
Which is to note that old device is neither closed nor gracefully
removed from the btrfs.
The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.
reproducer:
devmgt[1] detach /dev/sdc
replace the missing disk /dev/sdc
btrfs rep start -f 1 /dev/sde /btrfs
Label: none uuid:
5dc0aaf4-4683-4050-b2d6-
5ebe5f5cd120
Total devices 2 FS bytes used 32.00KiB
devid 1 size 958.94MiB used 115.88MiB path /dev/sde
devid 2 size 958.94MiB used 103.88MiB path /dev/sdd
make /dev/sdc to reappear
devmgt attach host2
btrfs dev scan
btrfs fi show -m
Label: none uuid:
5dc0aaf4-4683-4050-b2d6-
5ebe5f5cd120^M
Total devices 2 FS bytes used 32.00KiB^M
devid 1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
devid 2 size 958.94MiB used 103.88MiB path /dev/sdd
since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.
[1] github.com/anajain/devmgt.git
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
chandan [Tue, 1 Jul 2014 06:34:28 +0000 (12:04 +0530)]
Btrfs: fill_holes: Fix slot number passed to hole_mergeable() call.
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot
where the key could have been present, which in this case would be the slot
containing the extent item which would be the next neighbor of the file range
being punched. The current code passes an incremented path->slots[0] and we
skip to the wrong file extent item. This would mean that we would fail to
merge the "yet to be created" hole with the next neighboring hole (if one
exists). Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Tue, 17 Jun 2014 10:58:59 +0000 (18:58 +0800)]
Btrfs: fix put dio bio twice when we submit dio bio fail
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Tue, 12 Aug 2014 17:47:42 +0000 (10:47 -0700)]
btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions. Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.
Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.
This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held. It's rare, but these things can
deadlock.
This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Sat, 9 Aug 2014 20:22:27 +0000 (21:22 +0100)]
Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.
The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.
For example, consider the following scenario, where we have:
sums->bytenr:
40157184, sums->len: 16384, sums end:
40173568
four 4kb file data blocks with offsets
40157184,
40161280,
40165376,
40169472
Leaf N:
slot = 0 slot = btrfs_header_nritems() - 1
|-------------------------------------------------------------------|
| [(CSUM CSUM
39239680), size 8] ... [(CSUM CSUM
40116224), size 4] |
|-------------------------------------------------------------------|
Leaf N + 1:
slot = 0 slot = btrfs_header_nritems() - 1
|--------------------------------------------------------------------|
| [(CSUM CSUM
40161280), size 32] ... [((CSUM CSUM
40615936), size 8 |
|--------------------------------------------------------------------|
Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM
40161280). This new version of leaf N
is then:
slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
|----------------------------------------------------------------------------------------------------|
| [(CSUM CSUM
39239680), size 8] ... [(CSUM CSUM
40116224), size 4] [(CSUM CSUM
40161280), size 32] |
|----------------------------------------------------------------------------------------------------|
And incorrecly using slot 0, makes us set next_offset to
39239680 and we jump
into the "insert:" label, which will set tmp to:
tmp = min((sums->len - total_bytes) >> blocksize_bits,
(next_offset - file_key.offset) >> blocksize_bits) =
min((16384 - 0) >> 12, (
39239680 -
40157184) >> 12) =
min(4, (u64)-917504 =
18446744073708634112 >> 12) = 4
and
ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY
40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM
40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.
So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:
Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of
40161280,
40165376
or
40169472 (the last 3 4kb blocks of file data), will get the old checksums.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Takashi Iwai [Mon, 28 Jul 2014 08:57:04 +0000 (10:57 +0200)]
Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
BUG: unable to handle kernel NULL pointer dereference at
00000004
IP: [<
f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
*pde =
00000000
Oops: 0000 [#1] SMP
CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1
Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
task:
f1478130 ti:
f147c000 task.ti:
f147c000
EIP: 0060:[<
f9234590>] EFLAGS:
00010213 CPU: 0
EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
EAX:
f147dda8 EBX:
f147ddb0 ECX:
00000011 EDX:
00000000
ESI:
00000000 EDI:
f147dda4 EBP:
f147ddf8 ESP:
f147dd38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0:
8005003b CR2:
00000004 CR3:
00bf3000 CR4:
00000690
Stack:
00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
Call Trace:
[<
f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
[<
f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
[<
f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
[<
c025e38b>] process_one_work+0x11b/0x390
[<
c025eea1>] worker_thread+0x101/0x340
[<
c026432b>] kthread+0x9b/0xb0
[<
c0712a71>] ret_from_kernel_thread+0x21/0x30
[<
c0264290>] kthread_create_on_node+0x110/0x110
This indicates a NULL corruption in prefs_delayed list. The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.
ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux. The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux. That is, the function overwrites
64bit value on 32bit pointer. This caused a NULL in the adjacent
variable, in this case, prefs_delayed.
Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64. There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly. But, it's
safer than explicit cast to u64, anyway.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Cc: <stable@vger.kernel.org> [v3.11+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo [Thu, 24 Jul 2014 14:48:05 +0000 (22:48 +0800)]
Btrfs: fix compressed write corruption on enospc
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
Mark Fasheh [Thu, 17 Jul 2014 19:39:04 +0000 (12:39 -0700)]
btrfs: correctly handle return from ulist_add
ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
doesn't take into account. As a result, that value can be bubbled up to
callers, causing an error to be printed. Fix this by only returning the
value of ulist_add() when it indicates an error.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
Mark Fasheh [Thu, 17 Jul 2014 19:39:01 +0000 (12:39 -0700)]
btrfs: qgroup: account shared subtrees during snapshot delete
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.
In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.
This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Wed, 2 Jul 2014 19:07:54 +0000 (20:07 +0100)]
Btrfs: read lock extent buffer while walking backrefs
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 2 Jul 2014 17:54:25 +0000 (10:54 -0700)]
Btrfs: __btrfs_mod_ref should always use no_quota
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there. With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Tue, 1 Jul 2014 14:21:33 +0000 (16:21 +0200)]
btrfs: adjust statfs calculations according to raid profiles
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
Works fine for "clean" raid profiles where the raid factor correction
does the right job. Otherwise it's pessimistic and may show low space
although there's still some left.
The df nubmers are lightly wrong in case of mixed block groups, but this
is not a major usecase and can be addressed later.
The RAID56 numbers are wrong almost the same way as before and will be
addressed separately.
CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Linus Torvalds [Sun, 3 Aug 2014 22:25:02 +0000 (15:25 -0700)]
Linux 3.16
Linus Torvalds [Sun, 3 Aug 2014 16:58:20 +0000 (09:58 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes in the timer area:
- a long-standing lock inversion due to a printk
- suspend-related hrtimer corruption in sched_clock"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
sched_clock: Avoid corrupting hrtimer tree during suspend
Linus Torvalds [Sat, 2 Aug 2014 17:57:39 +0000 (10:57 -0700)]
Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
"A few fixes for ARM. Some of these are correctness issues:
- TLBs must be flushed after the old mappings are removed by the DMA
mapping code, but before the new mappings are established.
- An off-by-one entry error in the Keystone LPAE setup code.
Fixes include:
- ensuring that the identity mapping for LPAE does not remove the
kernel image from the identity map.
- preventing userspace from trapping into kgdb.
- fixing a preemption issue in the Intel iwmmxt code.
- fixing a build error with nommu.
Other changes include:
- Adding a note about which areas of memory are expected to be
accessible while the identity mapping tables are in place"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: 8124/1: don't enter kgdb when userspace executes a kgdb break instruction
ARM: idmap: add identity mapping usage note
ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
ARM: fix alignment of keystone page table fixup
ARM: 8112/1: only select ARM_PATCH_PHYS_VIRT if MMU is enabled
ARM: 8100/1: Fix preemption disable in iwmmxt_task_enable()
ARM: DMA: ensure that old section mappings are flushed from the TLB
Omar Sandoval [Fri, 1 Aug 2014 17:14:06 +0000 (18:14 +0100)]
ARM: 8124/1: don't enter kgdb when userspace executes a kgdb break instruction
The kgdb breakpoint hooks (kgdb_brk_fn and kgdb_compiled_brk_fn)
should only be entered when a kgdb break instruction is executed
from the kernel. Otherwise, if kgdb is enabled, a userspace program
can cause the kernel to drop into the debugger by executing either
KGDB_BREAKINST or KGDB_COMPILED_BREAK.
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Russell King [Tue, 29 Jul 2014 11:18:34 +0000 (12:18 +0100)]
ARM: idmap: add identity mapping usage note
Add a note about the usage of the identity mapping; we do not support
accesses outside of the identity map region and kernel image while a
CPU is using the identity map. This is because the identity mapping
may overwrite vmalloc space, IO mappings, the vectors pages, etc.
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Linus Torvalds [Sat, 2 Aug 2014 01:01:41 +0000 (18:01 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"This contains a couple of fixes - one is the aio fix from Christoph,
the other a fallocate() one from Eric"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: fix check for fallocate on active swapfile
direct-io: fix AIO regression
Linus Torvalds [Sat, 2 Aug 2014 00:37:01 +0000 (17:37 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Peter Anvin:
"A single fix to not invoke the espfix code on Xen PV, as it turns out
to oops the guest when invoked after all. This patch leaves some
amount of dead code, in particular unnecessary initialization of the
espfix stacks when they won't be used, but in the interest of keeping
the patch minimal that cleanup can wait for the next cycle"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86_64/entry/xen: Do not invoke espfix64 on Xen
Linus Torvalds [Sat, 2 Aug 2014 00:16:05 +0000 (17:16 -0700)]
Merge tag 'staging-3.16-rc8' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver bugfixes from Greg KH:
"Here are some tiny staging driver bugfixes that I've had in my tree
for the past week that resolve some reported issues. Nothing major at
all, but it would be good to get them merged for 3.16-rc8 or -final"
* tag 'staging-3.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: vt6655: Fix disassociated messages every 10 seconds
staging: vt6655: Fix Warning on boot handle_irq_event_percpu.
staging: rtl8723au: rtw_resume(): release semaphore before exit on error
iio:bma180: Missing check for frequency fractional part
iio:bma180: Fix scale factors to report correct acceleration units
iio: buffer: Fix demux table creation
Linus Torvalds [Fri, 1 Aug 2014 19:50:05 +0000 (12:50 -0700)]
Merge tag 'dm-3.16-fixes-3' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Fix dm bufio shrinker to properly zero-fill all fields.
Fix race in dm cache that caused improper reporting of the number of
dirty blocks in the cache"
* tag 'dm-3.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache: fix race affecting dirty block count
dm bufio: fully initialize shrinker
Linus Torvalds [Fri, 1 Aug 2014 19:49:02 +0000 (12:49 -0700)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM straggler SoC fix from Olof Johansson:
"A DT bugfix for Nomadik that had an ambigouos double-inversion of a
gpio line, and one MAINTAINER URL update that might as well go in now.
We could hold off until the merge window, but then we'll just have to
mark the DT fix for stable and it just seems like in total causing
more work"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
MAINTAINERS: Update Tegra Git URL
ARM: nomadik: fix up double inversion in DT
Anssi Hannula [Fri, 1 Aug 2014 15:55:47 +0000 (11:55 -0400)]
dm cache: fix race affecting dirty block count
nr_dirty is updated without locking, causing it to drift so that it is
non-zero (either a small positive integer, or a very large one when an
underflow occurs) even when there are no actual dirty blocks. This was
due to a race between the workqueue and map function accessing nr_dirty
in parallel without proper protection.
People were seeing under runs due to a race on increment/decrement of
nr_dirty, see: https://lkml.org/lkml/2014/6/3/648
Fix this by using an atomic_t for nr_dirty.
Reported-by: roma1390@gmail.com
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Greg Thelen [Thu, 31 Jul 2014 16:07:19 +0000 (09:07 -0700)]
dm bufio: fully initialize shrinker
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to
struct shrinker assuming that all shrinkers were zero filled. The dm
bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
in flags. So far the only defined flags bit is SHRINKER_NUMA_AWARE.
But there are proposed patches which add other bits to shrinker.flags
(e.g. memcg awareness).
Rather than simply initializing the shrinker, this patch uses kzalloc()
when allocating the dm_bufio_client to ensure that the embedded shrinker
and any other similar structures are zeroed.
This fixes theoretical over aggressive shrinking of dm bufio objects.
If the uninitialized dm_bufio_client.shrinker.flags contains
SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
each numa node rather than just once. This has been broken since 3.12.
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
Jan Kara [Fri, 1 Aug 2014 10:20:02 +0000 (12:20 +0200)]
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
(&port_lock_key){-.....}, at: [<
811c60be>] serial8250_console_write+0x8c/0x10c
but task is already holding lock:
(hrtimer_bases.lock){-.-...}, at: [<
8103caeb>] hrtimer_try_to_cancel+0x13/0x66
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (hrtimer_bases.lock){-.-...}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
8103c918>] __hrtimer_start_range_ns+0x1c/0x197
[<
8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
[<
81080792>] task_clock_event_start+0x3a/0x3f
[<
810807a4>] task_clock_event_add+0xd/0x14
[<
8108259a>] event_sched_in+0xb6/0x17a
[<
810826a2>] group_sched_in+0x44/0x122
[<
81082885>] ctx_sched_in.isra.67+0x105/0x11f
[<
810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
[<
81082bf6>] __perf_install_in_context+0x8b/0xa3
[<
8107eb8e>] remote_function+0x12/0x2a
[<
8105f5af>] smp_call_function_single+0x2d/0x53
[<
8107e17d>] task_function_call+0x30/0x36
[<
8107fb82>] perf_install_in_context+0x87/0xbb
[<
810852c9>] SYSC_perf_event_open+0x5c6/0x701
[<
810856f9>] SyS_perf_event_open+0x17/0x19
[<
8142f8ee>] syscall_call+0x7/0xb
-> #4 (&ctx->lock){......}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f04c>] _raw_spin_lock+0x21/0x30
[<
81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
8142cae0>] schedule+0xf/0x11
[<
8142f9a6>] work_resched+0x5/0x30
-> #3 (&rq->lock){-.-.-.}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f04c>] _raw_spin_lock+0x21/0x30
[<
81040873>] __task_rq_lock+0x33/0x3a
[<
8104184c>] wake_up_new_task+0x25/0xc2
[<
8102474b>] do_fork+0x15c/0x2a0
[<
810248a9>] kernel_thread+0x1a/0x1f
[<
814232a2>] rest_init+0x1a/0x10e
[<
817af949>] start_kernel+0x303/0x308
[<
817af2ab>] i386_start_kernel+0x79/0x7d
-> #2 (&p->pi_lock){-.-...}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
810413dd>] try_to_wake_up+0x1d/0xd6
[<
810414cd>] default_wake_function+0xb/0xd
[<
810461f3>] __wake_up_common+0x39/0x59
[<
81046346>] __wake_up+0x29/0x3b
[<
811b8733>] tty_wakeup+0x49/0x51
[<
811c3568>] uart_write_wakeup+0x17/0x19
[<
811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<
811c5f28>] serial8250_handle_irq+0x54/0x6a
[<
811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<
811c56d8>] serial8250_interrupt+0x38/0x9e
[<
810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<
81051296>] handle_irq_event+0x2c/0x43
[<
81052cee>] handle_level_irq+0x57/0x80
[<
81002a72>] handle_irq+0x46/0x5c
[<
810027df>] do_IRQ+0x32/0x89
[<
8143036e>] common_interrupt+0x2e/0x33
[<
8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<
811c25a4>] uart_start+0x2d/0x32
[<
811c2c04>] uart_write+0xc7/0xd6
[<
811bc6f6>] n_tty_write+0xb8/0x35e
[<
811b9beb>] tty_write+0x163/0x1e4
[<
811b9cd9>] redirected_tty_write+0x6d/0x75
[<
810b6ed6>] vfs_write+0x75/0xb0
[<
810b7265>] SyS_write+0x44/0x77
[<
8142f8ee>] syscall_call+0x7/0xb
-> #1 (&tty->write_wait){-.....}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
81046332>] __wake_up+0x15/0x3b
[<
811b8733>] tty_wakeup+0x49/0x51
[<
811c3568>] uart_write_wakeup+0x17/0x19
[<
811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<
811c5f28>] serial8250_handle_irq+0x54/0x6a
[<
811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<
811c56d8>] serial8250_interrupt+0x38/0x9e
[<
810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<
81051296>] handle_irq_event+0x2c/0x43
[<
81052cee>] handle_level_irq+0x57/0x80
[<
81002a72>] handle_irq+0x46/0x5c
[<
810027df>] do_IRQ+0x32/0x89
[<
8143036e>] common_interrupt+0x2e/0x33
[<
8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<
811c25a4>] uart_start+0x2d/0x32
[<
811c2c04>] uart_write+0xc7/0xd6
[<
811bc6f6>] n_tty_write+0xb8/0x35e
[<
811b9beb>] tty_write+0x163/0x1e4
[<
811b9cd9>] redirected_tty_write+0x6d/0x75
[<
810b6ed6>] vfs_write+0x75/0xb0
[<
810b7265>] SyS_write+0x44/0x77
[<
8142f8ee>] syscall_call+0x7/0xb
-> #0 (&port_lock_key){-.....}:
[<
8104a62d>] __lock_acquire+0x9ea/0xc6d
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
811c60be>] serial8250_console_write+0x8c/0x10c
[<
8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<
8104f5d5>] console_unlock+0x1d7/0x398
[<
8104fb70>] vprintk_emit+0x3da/0x3e4
[<
81425f76>] printk+0x17/0x19
[<
8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<
8105c548>] clockevents_program_event+0xe7/0xf3
[<
8105cc1c>] tick_program_event+0x1e/0x23
[<
8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<
8103c49e>] __remove_hrtimer+0x5b/0x79
[<
8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<
8103cb4b>] hrtimer_cancel+0xd/0x18
[<
8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<
81080705>] task_clock_event_stop+0x20/0x64
[<
81080756>] task_clock_event_del+0xd/0xf
[<
81081350>] event_sched_out+0xab/0x11e
[<
810813e0>] group_sched_out+0x1d/0x66
[<
81081682>] ctx_sched_out+0xaf/0xbf
[<
81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
8142cae0>] schedule+0xf/0x11
[<
8142f9a6>] work_resched+0x5/0x30
other info that might help us debug this:
Chain exists of:
&port_lock_key --> &ctx->lock --> hrtimer_bases.lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(hrtimer_bases.lock);
lock(&ctx->lock);
lock(hrtimer_bases.lock);
lock(&port_lock_key);
*** DEADLOCK ***
4 locks held by trinity-main/74:
#0: (&rq->lock){-.-.-.}, at: [<
8142c6f3>] __schedule+0xed/0x4cb
#1: (&ctx->lock){......}, at: [<
81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
#2: (hrtimer_bases.lock){-.-...}, at: [<
8103caeb>] hrtimer_try_to_cancel+0x13/0x66
#3: (console_lock){+.+...}, at: [<
8104fb5d>] vprintk_emit+0x3c7/0x3e4
stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
[<
81426f69>] dump_stack+0x16/0x18
[<
81425a99>] print_circular_bug+0x18f/0x19c
[<
8104a62d>] __lock_acquire+0x9ea/0xc6d
[<
8104a942>] lock_acquire+0x92/0x101
[<
811c60be>] ? serial8250_console_write+0x8c/0x10c
[<
811c6032>] ? wait_for_xmitr+0x76/0x76
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
811c60be>] ? serial8250_console_write+0x8c/0x10c
[<
811c60be>] serial8250_console_write+0x8c/0x10c
[<
8104af87>] ? lock_release+0x191/0x223
[<
811c6032>] ? wait_for_xmitr+0x76/0x76
[<
8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<
8104f5d5>] console_unlock+0x1d7/0x398
[<
8104fb70>] vprintk_emit+0x3da/0x3e4
[<
81425f76>] printk+0x17/0x19
[<
8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<
8105cc1c>] tick_program_event+0x1e/0x23
[<
8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<
8103c49e>] __remove_hrtimer+0x5b/0x79
[<
8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<
8103cb4b>] hrtimer_cancel+0xd/0x18
[<
8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<
81080705>] task_clock_event_stop+0x20/0x64
[<
81080756>] task_clock_event_del+0xd/0xf
[<
81081350>] event_sched_out+0xab/0x11e
[<
810813e0>] group_sched_out+0x1d/0x66
[<
81081682>] ctx_sched_out+0xaf/0xbf
[<
81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<
8104416d>] ? __dequeue_entity+0x23/0x27
[<
81044505>] ? pick_next_task_fair+0xb1/0x120
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
[<
810475b0>] ? trace_hardirqs_off+0xb/0xd
[<
81056346>] ? rcu_irq_exit+0x64/0x77
Fix the problem by using printk_deferred() which does not call into the
scheduler.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Eric Biggers [Wed, 25 Jun 2014 04:45:08 +0000 (23:45 -0500)]
vfs: fix check for fallocate on active swapfile
Fix the broken check for calling sys_fallocate() on an active swapfile,
introduced by commit
0790b31b69374ddadefe ("fs: disallow all fallocate
operation on active swapfile").
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Wed, 30 Jul 2014 11:18:48 +0000 (07:18 -0400)]
direct-io: fix AIO regression
The direct-io.c rewrite to use the iov_iter infrastructure stopped updating
the size field in struct dio_submit, and thus rendered the check for
allowing asynchronous completions to always return false. Fix this by
comparing it to the count of bytes in the iov_iter instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Linus Torvalds [Thu, 31 Jul 2014 23:42:10 +0000 (16:42 -0700)]
Merge tag 'pm+acpi-3.16-rc8' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"One commit that fixes a problem causing PNP devices to be associated
with wrong ACPI device objects sometimes during device enumeration due
to an incorrect check in a matching function.
That problem was uncovered by the ACPI device enumeration rework in
3.14"
* tag 'pm+acpi-3.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PNP: Fix acpi_pnp_match()
Linus Torvalds [Thu, 31 Jul 2014 17:02:15 +0000 (10:02 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux
Pull clock driver fix from Mike Turquette:
"A single patch to re-enable audio which is broken on all DRA7
SoC-based platforms. Missed this one from the last set of fixes"
* tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux:
clk: ti: clk-7xx: Correct ABE DPLL configuration
Linus Torvalds [Thu, 31 Jul 2014 17:01:34 +0000 (10:01 -0700)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This adds missing SELinux labeling to AF_ALG sockets which apparently
causes SELinux (or at least the SELinux people) to misbehave :)"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: af_alg - properly label AF_ALG socket
Linus Torvalds [Thu, 31 Jul 2014 17:00:42 +0000 (10:00 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI barrier fix from James Bottomley:
"This is a potential data corruption fix: If we get an error sending
down a barrier, we simply ignore it meaning the barrier semantics get
violated without anyone being any the wiser. If the system crashes at
this point, the filesystem potentially becomes corrupt. Fix is to
report errors on failed barriers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: handle flush errors properly
Peter Ujfalusi [Wed, 2 Apr 2014 13:48:45 +0000 (16:48 +0300)]
clk: ti: clk-7xx: Correct ABE DPLL configuration
ABE DPLL frequency need to be lowered from
361267200
to
180633600 to facilitate the ATL requironments.
The dpll_abe_m2x2_ck clock need to be set to double
of ABE DPLL rate in order to have correct clocks
for audio.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Acked-by: Tero Kristo <t-kristo@ti.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
Milan Broz [Tue, 29 Jul 2014 18:41:09 +0000 (18:41 +0000)]
crypto: af_alg - properly label AF_ALG socket
Th AF_ALG socket was missing a security label (e.g. SELinux)
which means that socket was in "unlabeled" state.
This was recently demonstrated in the cryptsetup package
(cryptsetup v1.6.5 and later.)
See https://bugzilla.redhat.com/show_bug.cgi?id=1115120
This patch clones the sock's label from the parent sock
and resolves the issue (similar to AF_BLUETOOTH protocol family).
Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
David Rientjes [Thu, 31 Jul 2014 02:05:55 +0000 (19:05 -0700)]
kexec: fix build error when hugetlbfs is disabled
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() page is such a configuration either, so avoid
exporting the symbol to fix a build error:
In file included from kernel/kexec.c:14:0:
kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
VMCOREINFO_SYMBOL(free_huge_page);
^
Introduced by commit
8f1d26d0e59b ("kexec: export free_huge_page to
VMCOREINFO")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 31 Jul 2014 00:16:36 +0000 (17:16 -0700)]
Merge branch 'akpm' (patches from Andrew Morton)
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
Josh has moved
kexec: export free_huge_page to VMCOREINFO
mm: fix filemap.c pagecache_get_page() kernel-doc warnings
mm: debugfs: move rounddown_pow_of_two() out from do_fault path
memcg: oom_notify use-after-free fix
hwpoison: call action_result() in failure path of hwpoison_user_mappings()
hwpoison: fix hugetlbfs/thp precheck in hwpoison_user_mappings()
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
mm, thp: do not allow thp faults to avoid cpuset restrictions
mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()
Josh Triplett [Wed, 30 Jul 2014 23:08:42 +0000 (16:08 -0700)]
Josh has moved
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.
Update my GPG key fingerprint; I moved to 4096R a long time ago.
Update description.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Atsushi Kumagai [Wed, 30 Jul 2014 23:08:39 +0000 (16:08 -0700)]
kexec: export free_huge_page to VMCOREINFO
PG_head_mask was added into VMCOREINFO to filter huge pages in
b3acc56bfe1
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
another symbol to filter *hugetlbfs* pages.
If a user hope to filter user pages, makedumpfile tries to exclude them by
checking the condition whether the page is anonymous, but hugetlbfs pages
aren't anonymous while they also be user pages.
We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():
int PageHuge(struct page *page)
{
if (!PageCompound(page))
return 0;
page = compound_head(page);
return get_compound_page_dtor(page) == free_huge_page;
}
For that reason, this patch changes free_huge_page() into public
to export it to VMCOREINFO.
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Wed, 30 Jul 2014 23:08:37 +0000 (16:08 -0700)]
mm: fix filemap.c pagecache_get_page() kernel-doc warnings
Fix kernel-doc warnings in mm/filemap.c: pagecache_get_page():
Warning(..//mm/filemap.c:1054): No description found for parameter 'cache_gfp_mask'
Warning(..//mm/filemap.c:1054): No description found for parameter 'radix_gfp_mask'
Warning(..//mm/filemap.c:1054): Excess function parameter 'gfp_mask' description in 'pagecache_get_page'
Fixes:
2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible")
[mgorman@suse.de: change everything]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 30 Jul 2014 23:08:35 +0000 (16:08 -0700)]
mm: debugfs: move rounddown_pow_of_two() out from do_fault path
do_fault_around() expects fault_around_bytes rounded down to nearest page
order. Instead of calling rounddown_pow_of_two every time in
fault_around_pages()/fault_around_mask() we could do round down when user
changes fault_around_bytes via debugfs interface.
This also fixes bug when user set fault_around_bytes to 0. Result of
rounddown_pow_of_two(0) is not defined, therefore fault_around_bytes == 0
doesn't work without this patch.
Let's set fault_around_bytes to PAGE_SIZE if user sets to something less
than PAGE_SIZE
[akpm@linux-foundation.org: tweak code layout]
Fixes:
a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 30 Jul 2014 23:08:33 +0000 (16:08 -0700)]
memcg: oom_notify use-after-free fix
Paul Furtado has reported the following GPF:
general protection fault: 0000 [#1] SMP
Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
task:
ffff8801cfe8f170 ti:
ffff8801d2ec4000 task.ti:
ffff8801d2ec4000
RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
RSP: e02b:
ffff8801d2ec7d48 EFLAGS:
00010283
RAX:
0000000000000001 RBX:
ffff88009d633800 RCX:
000000000000000e
RDX:
fffffffffffffffe RSI:
ffff88009d630200 RDI:
ffff88009d630200
RBP:
ffff8801d2ec7da8 R08:
0000000000000012 R09:
00000000fffffffe
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff88009d633800
R13:
ffff8801d2ec7d48 R14:
dead000000100100 R15:
ffff88009d633a30
FS:
00007f1748bb4700(0000) GS:
ffff8801def80000(0000) knlGS:
0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007f4110300308 CR3:
00000000c05f7000 CR4:
0000000000002660
Call Trace:
pagefault_out_of_memory+0x18/0x90
mm_fault_error+0xa9/0x1a0
__do_page_fault+0x478/0x4c0
do_page_fault+0x2c/0x40
page_fault+0x28/0x30
Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
RIP mem_cgroup_oom_synchronize+0x140/0x240
Commit
fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and
wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
assuming it is protected by the hierarchical OOM-lock.
Although this is true for the notification part the protection doesn't
cover unregistration of event which can happen in parallel now so
mem_cgroup_oom_notify can see already unlinked and/or freed
mem_cgroup_eventfd_list.
Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881
Fixes:
fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Tested-by: Paul Furtado <paulfurtado91@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 30 Jul 2014 23:08:30 +0000 (16:08 -0700)]
hwpoison: call action_result() in failure path of hwpoison_user_mappings()
hwpoison_user_mappings() could fail for various reasons, so printk()s to
print out the reasons should be done in each failure check inside
hwpoison_user_mappings().
And currently we don't call action_result() when hwpoison_user_mappings()
fails, which is not consistent with other exit points of memory error
handler. So this patch fixes these messaging problems.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 30 Jul 2014 23:08:28 +0000 (16:08 -0700)]
hwpoison: fix hugetlbfs/thp precheck in hwpoison_user_mappings()
A recent fix from Chen Yucong, commit
0bc1f8b0682c ("hwpoison: fix the
handling path of the victimized page frame that belong to non-LRU")
rejects going into unmapping operation for hugetlbfs/thp pages, which
results in failing error containing on such pages. This patch fixes it.
With this patch, hwpoison functional tests in mce-test testsuite pass.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 30 Jul 2014 23:08:26 +0000 (16:08 -0700)]
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
This is a bug fix for the situation when function tsi721_desc_get() fails
to obtain a free transaction descriptor.
The bug usually results in a memory access crash dump when data transfer
scatter-gather list has more entries than size of hardware buffer
descriptors ring. This fix ensures that error is properly returned to a
caller instead of an invalid entry.
This patch is applicable to kernel versions starting from v3.5.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 30 Jul 2014 23:08:24 +0000 (16:08 -0700)]
mm, thp: do not allow thp faults to avoid cpuset restrictions
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Maxim Patlasov [Wed, 30 Jul 2014 23:08:21 +0000 (16:08 -0700)]
mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()
Under memory pressure, it is possible for dirty_thresh, calculated by
global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if
strictlimit is true, bdi_dirty_limits() tries to resolve the proportion:
bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh
by dividing by zero.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andreas Färber [Mon, 28 Jul 2014 18:06:26 +0000 (12:06 -0600)]
MAINTAINERS: Update Tegra Git URL
swarren/linux-tegra.git is a stale location; it has moved to
tegra/linux.git.
While the git protocol re-directs to the new location, HTTP does not.
Besides, MAINTAINERS should contain the canonical URL.
Signed-off-by: Andreas Färber <afaerber@suse.de>
[swarren, updated commit message]
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
Linus Walleij [Fri, 25 Jul 2014 10:18:42 +0000 (12:18 +0200)]
ARM: nomadik: fix up double inversion in DT
The GPIO pin connected to card detect was inverted twice: once by
the argument to the GPIO line itself where it was magically marked
as active low by the flag GPIO_ACTIVE_LOW (0x01) in the third cell,
and also marked active low AGAIN by explicitly stating
"cd-inverted" (a deprecated method).
After commit
78f87df2b4f8760954d7d80603d0cfcbd4759683
"mmc: mmci: Use the common mmc DT parser" this results in the
line being inverted twice so it was effectively uninverted, while
the old code would not have this effect, instead disregarding the
flag on the GPIO line altogether, which is a bug. I admit the
semantics may be unclear but inverting twice is as good a
definition as any on how this should work.
So fix up the buggy device tree. Use proper #includes so the DTS
is clear and readable.
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
Linus Torvalds [Wed, 30 Jul 2014 16:01:04 +0000 (09:01 -0700)]
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull Exynos platform DT fix from Grant Likely:
"Device tree Exynos bug fix for v3.16-rc7
This bug fix has been brewing for a while. I hate sending it to you
so late, but I only got confirmation that it solves the problem this
past weekend. The diff looks big for a bug fix, but the majority of
it is only executed in the Exynos quirk case. Unfortunately it
required splitting early_init_dt_scan() in two and adding quirk
handling in the middle of it on ARM.
Exynos has buggy firmware that puts bad data into the memory node.
Commit
1c2f87c22566 ("ARM: Get rid of meminfo") exposed the bug by
dropping the artificial upper bound on the number of memory banks that
can be added. Exynos fails to boot after that commit. This branch
fixes it by splitting the early DT parse function and inserting a
fixup hook. Exynos uses the hook to correct the DT before parsing
memory regions"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
arm: Add devicetree fixup machine function
of: Add memory limiting function for flattened devicetrees
of: Split early_init_dt_scan into two parts
Linus Torvalds [Wed, 30 Jul 2014 16:00:20 +0000 (09:00 -0700)]
Merge tag 'stable/for-linus-3.16-rc7-tag' of git://git./linux/kernel/git/xen/tip
Pull Xen fix from David Vrabel:
"Fix BUG when trying to expand the grant table. This seems to occur
often during boot with Ubuntu 14.04 PV guests"
* tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: safely map and unmap grant frames when in atomic context
Linus Torvalds [Wed, 30 Jul 2014 15:59:15 +0000 (08:59 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
"Fix a bug which allows KVM guests to bring down the entire system on
some 64K enabled ARM64 hosts"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform
Linus Torvalds [Wed, 30 Jul 2014 15:56:23 +0000 (08:56 -0700)]
Revert "cdc_subset: deal with a device that needs reset for timeout"
This reverts commit
20fbe3ae990fd54fc7d1f889d61958bc8b38f254.
As reported by Stephen Rothwell, it causes compile failures in certain
configurations:
drivers/net/usb/cdc_subset.c:360:15: error: 'dummy_prereset' undeclared here (not in a function)
.pre_reset = dummy_prereset,
^
drivers/net/usb/cdc_subset.c:361:16: error: 'dummy_postreset' undeclared here (not in a function)
.post_reset = dummy_postreset,
^
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: David Miller <davem@davemloft.net>
Cc: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 30 Jul 2014 15:54:17 +0000 (08:54 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Make fragmentation IDs less predictable, from Eric Dumazet.
2) TSO tunneling can crash in bnx2x driver, fix from Dmitry Kravkov.
3) Don't allow NULL msg->msg_name just because msg->msg_namelen is
non-zero, from Andrey Ryabinin.
4) ndm->ndm_type set using wrong macros, from Jun Zhao.
5) cdc-ether devices can come up with entries in their address filter,
so explicitly clear the filter after the device initializes. From
Oliver Neukum.
6) Forgotten refcount bump in xfrm_lookup(), from Steffen Klassert.
7) Short packets not padded properly, exposing random data, in bcmgenet
driver. Fix from Florian Fainelli.
8) xgbe_probe() doesn't return an error code, but rather zero, when
netif_set_real_num_tx_queues() fails. Fix from Wei Yongjun.
9) USB speed not probed properly in r8152 driver, from Hayes Wang.
10) Transmit logic choosing the outgoing port in the sunvnet driver
needs to consider a) is the port actually up and b) whether it is a
switch port. Fix from David L Stevens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
net: phy: re-apply PHY fixups during phy_register_device
cdc-ether: clean packet filter upon probe
cdc_subset: deal with a device that needs reset for timeout
net: sendmsg: fix NULL pointer dereference
isdn/bas_gigaset: fix a leak on failure path in gigaset_probe()
ip: make IP identifiers less predictable
neighbour : fix ndm_type type error issue
sunvnet: only use connected ports when sending
can: c_can_platform: Fix raminit, use devm_ioremap() instead of devm_ioremap_resource()
bnx2x: fix crash during TSO tunneling
r8152: fix the checking of the usb speed
net: phy: Ensure the MDIO bus module is held
net: phy: Set the driver when registering an MDIO bus device
bnx2x: fix set_setting for some PHYs
hyperv: Fix error return code in netvsc_init_buf()
amd-xgbe: Fix error return code in xgbe_probe()
ath9k: fix aggregation session lockup
net: bcmgenet: correctly pad short packets
net: sctp: inherit auth_capable on INIT collisions
mac80211: fix crash on getting sta info with uninitialized rate control
...
David Vrabel [Fri, 11 Jul 2014 15:42:34 +0000 (16:42 +0100)]
x86/xen: safely map and unmap grant frames when in atomic context
arch_gnttab_map_frames() and arch_gnttab_unmap_frames() are called in
atomic context but were calling alloc_vm_area() which might sleep.
Also, if a driver attempts to allocate a grant ref from an interrupt
and the table needs expanding, then the CPU may already by in lazy MMU
mode and apply_to_page_range() will BUG when it tries to re-enable
lazy MMU mode.
These two functions are only used in PV guests.
Introduce arch_gnttab_init() to allocates the virtual address space in
advance.
Avoid the use of apply_to_page_range() by using saving and using the
array of PTE addresses from the alloc_vm_area() call (which ensures
that the required page tables are pre-allocated).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Will Deacon [Fri, 25 Jul 2014 15:29:12 +0000 (16:29 +0100)]
kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform
If the physical address of GICV isn't page-aligned, then we end up
creating a stage-2 mapping of the page containing it, which causes us to
map neighbouring memory locations directly into the guest.
As an example, consider a platform with GICV at physical 0x2c02f000
running a 64k-page host kernel. If qemu maps this into the guest at
0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
physical regions may cause UNPREDICTABLE behaviour, for example, on the
Juno platform this will cause an SError exception to EL3, which brings
down the entire physical CPU resulting in RCU stalls / HYP panics / host
crashing / wasted weeks of debugging.
SBSA recommends that systems alias the 4k GICV across the bounding 64k
region, in which case GICV physical could be described as 0x2c020000 in
the above scenario.
This patch fixes the problem by failing the vgic probe if the physical
base address or the size of GICV aren't page-aligned. Note that this
generated a warning in dmesg about freeing enabled IRQs, so I had to
move the IRQ enabling later in the probe.
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joel Schopp <joel.schopp@amd.com>
Cc: Don Dutile <ddutile@redhat.com>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: Joel Schopp <joel.schopp@amd.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Laura Abbott [Tue, 15 Jul 2014 17:03:36 +0000 (10:03 -0700)]
arm: Add devicetree fixup machine function
Commit
1c2f87c22566cd057bc8cde10c37ae9da1a1bb76
(ARM: 8025/1: Get rid of meminfo) dropped the upper bound on
the number of memory banks that can be added as there was no
technical need in the kernel. It turns out though, some bootloaders
(specifically the arndale-octa exynos boards) may pass invalid memory
information and rely on the kernel to not parse this data. This is a
bug in the bootloader but we still need to work around this.
Work around this by introducing a dt_fixup function. This function
gets called before the flattened devicetree is scanned for memory
and the like. In this fixup function for exynos, limit the maximum
number of memory regions in the devicetree.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Tested-by: Andreas Färber <afaerber@suse.de>
[glikely: Added a comment and fixed up function name]
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Laura Abbott [Tue, 15 Jul 2014 17:03:35 +0000 (10:03 -0700)]
of: Add memory limiting function for flattened devicetrees
Buggy bootloaders may pass bogus memory entries in the devicetree.
Add of_fdt_limit_memory to add an upper bound on the number of
entries that can be present in the devicetree.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Laura Abbott [Tue, 15 Jul 2014 17:03:34 +0000 (10:03 -0700)]
of: Split early_init_dt_scan into two parts
Currently, early_init_dt_scan validates the header, sets the
boot params, and scans for chosen/memory all in one function.
Split this up into two separate functions (validation/setting
boot params in one, scanning in another) to allow for
additional setup between boot params and scanning the memory.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Tested-by: Andreas Färber <afaerber@suse.de>
[glikely: s/early_init_dt_scan_all/early_init_dt_scan_nodes/]
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Rafael J. Wysocki [Tue, 29 Jul 2014 22:23:09 +0000 (00:23 +0200)]
ACPI / PNP: Fix acpi_pnp_match()
The acpi_pnp_match() function is used for finding the ACPI device
object that should be associated with the given PNP device.
Unfortunately, the check used by that function is not strict enough
and may cause success to be returned for a wrong ACPI device object.
To fix that, use the observation that the pointer to the ACPI
device object in question is already stored in the data field
in struct pnp_dev, so acpi_pnp_match() can simply use that
field to do its job.
This problem was uncovered in 3.14 by commit
202317a573b2 (ACPI / scan:
Add acpi_device objects for all device nodes in the namespace).
Fixes:
202317a573b2 (ACPI / scan: Add acpi_device objects for all device nodes in the namespace)
Reported-and-tested-by: Vinson Lee <vlee@twopensource.com>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Florian Fainelli [Mon, 28 Jul 2014 23:28:07 +0000 (16:28 -0700)]
net: phy: re-apply PHY fixups during phy_register_device
Commit
87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()")
moved the call to phy_scan_fixups() in phy_init_hw() after a software
reset is performed.
By the time phy_init_hw() is called in phy_device_register(), no driver
has been bound to this PHY yet, so all the checks in phy_init_hw()
against the PHY driver and the PHY driver's config_init function will
return 0. We will therefore never call phy_scan_fixups() as we should.
Fix this by calling phy_scan_fixups() and check for its return value to
restore the intended functionality.
This broke PHY drivers which do register an early PHY fixup callback to
intercept the PHY probing and do things like changing the 32-bits unique
PHY identifier when a pseudo-PHY address has been used, as well as
board-specific PHY fixups that need to be applied during driver probe
time.
Reported-by: Hauke Merthens <hauke-m@hauke-m.de>
Reported-by: Jonas Gorski <jogo@openwrt.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Neukum [Mon, 28 Jul 2014 08:56:36 +0000 (10:56 +0200)]
cdc-ether: clean packet filter upon probe
There are devices that don't do reset all the way. So the packet filter should
be set to a sane initial value. Failure to do so leads to intermittent failures
of DHCP on some systems under some conditions.
Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Neukum [Mon, 28 Jul 2014 08:12:34 +0000 (10:12 +0200)]
cdc_subset: deal with a device that needs reset for timeout
This device needs to be reset to recover from a timeout.
Unfortunately this can be handled only at the level of
the subdrivers.
Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Ryabinin [Sat, 26 Jul 2014 17:26:58 +0000 (21:26 +0400)]
net: sendmsg: fix NULL pointer dereference
Sasha's report:
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel with the KASAN patchset, I've stumbled on the following spew:
>
> [ 4448.949424] ==================================================================
> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
> [ 4448.952988] Read of size 2 by thread T19638:
> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-
20140711-sasha-00046-g07d3099-dirty #813
> [ 4448.956823]
ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
> [ 4448.958233]
ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
> [ 4448.959552]
0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
> [ 4448.961266] Call Trace:
> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
> [ 4448.970103] sock_sendmsg (net/socket.c:654)
> [ 4448.971584] ? might_fault (mm/memory.c:3741)
> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
> [ 4448.988929] ==================================================================
This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
This bug was introduced in
f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address."
But in fact this affects sendto when address 0 is mapped and contains
socket address structure in it. In such case copy-in address will succeed,
verify_iovec() function will successfully exit with msg->msg_namelen > 0
and msg->msg_name == NULL.
This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Khoroshilov [Fri, 25 Jul 2014 22:34:31 +0000 (02:34 +0400)]
isdn/bas_gigaset: fix a leak on failure path in gigaset_probe()
There is a lack of usb_put_dev(udev) on failure path in gigaset_probe().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 29 Jul 2014 17:28:38 +0000 (10:28 -0700)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"A nice small set of bug fixes for arm-soc:
- two incorrect register addresses in DT files on shmobile and hisilicon
- one revert for a regression on omap
- one bug fix for a newly introduced pin controller binding
- one regression fix for the memory controller on omap
- one patch to avoid a harmless WARN_ON"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: dts: Revert enabling of twl configuration for n900
ARM: dts: fix L2 address in Hi3620
ARM: OMAP2+: gpmc: fix gpmc_hwecc_bch_capable()
pinctrl: dra: dt-bindings: Fix pull enable/disable
ARM: shmobile: r8a7791: Fix SD2CKCR register address
ARM: OMAP2+: l2c: squelch warning dump on power control setting
David Howells [Tue, 29 Jul 2014 16:53:23 +0000 (17:53 +0100)]
AFS: Correctly assemble the client UUID
Correctly assemble the client UUID by OR'ing in the flags rather than
assigning them over the other components.
Reported-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Sun, 27 Jul 2014 21:15:33 +0000 (14:15 -0700)]
mm: fix page_alloc.c kernel-doc warnings
Fix kernel-doc warnings and function name in mm/page_alloc.c:
Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Fri, 25 Jul 2014 08:17:12 +0000 (09:17 +0100)]
ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
On LPAE, each level 1 (pgd) page table entry maps 1GiB, and the level 2
(pmd) entries map 2MiB.
When the identity mapping is created on LPAE, the pgd pointers are copied
from the swapper_pg_dir. If we find that we need to modify the contents
of a pmd, we allocate a new empty pmd table and insert it into the
appropriate 1GB slot, before then filling it with the identity mapping.
However, if the 1GB slot covers the kernel lowmem mappings, we obliterate
those mappings.
When replacing a PMD, first copy the old PMD contents to the new PMD, so
that we preserve the existing mappings, particularly the mappings of the
kernel itself.
[rewrote commit message and added code comment -- rmk]
Fixes:
ae2de101739c ("ARM: LPAE: Add identity mapping support for the 3-level page table format")
Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Arnd Bergmann [Tue, 29 Jul 2014 11:04:27 +0000 (13:04 +0200)]
Merge tag 'omap-for-v3.16/n900-regression' of git://git./linux/kernel/git/tmlind/linux-omap into fixes
Merge "omap n900 regression fix for v3.16 rc series" from Tony Lindgren:
Minimal regression fix for n900 display that got broken with
enabling of twl4030 PM features. Turns out more work is needed
before we can enable twl4030 PM on n900.
I did not notice this earlier as I have my n900 in a rack
and the display did not get enabled for device tree based booting
until for v3.16.
* tag 'omap-for-v3.16/n900-regression' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: dts: Revert enabling of twl configuration for n900
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Russell King [Tue, 29 Jul 2014 08:24:47 +0000 (09:24 +0100)]
ARM: fix alignment of keystone page table fixup
If init_mm.brk is not section aligned, the LPAE fixup code will miss
updating the final PMD. Fix this by aligning map_end.
Fixes:
a77e0c7b2774 ("ARM: mm: Recreate kernel mappings in early_paging_init()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tony Lindgren [Fri, 25 Jul 2014 11:41:25 +0000 (04:41 -0700)]
ARM: dts: Revert enabling of twl configuration for n900
Commit
9188883fd66e9 (ARM: dts: Enable twl4030 off-idle configuration
for selected omaps) allowed n900 to cut off core voltages during
off-idle. This however caused a regression where twl regulator
vaux1 was not getting enabled for the LCD panel as we are not
requesting it for the panel.
Turns out quite a few devices on n900 are using vaux1, and we need
to either stop idling it, or add proper regulator_get calls for all
users. But until we have a proper solution implemented and tested,
let's just disable the twl off-idle configuration for now for n900.
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Fixes:
9188883fd66e9 (ARM: dts: Enable twl4030 off-idle configuration for selected omaps)
Signed-off-by: Tony Lindgren <tony@atomide.com>
Eric Dumazet [Sat, 26 Jul 2014 06:58:10 +0000 (08:58 +0200)]
ip: make IP identifiers less predictable
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.
With commit
73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.
This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.
Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.
This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.
We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.
For IPv6, adds saddr into the hash as well, but not nexthdr.
If I ping the patched target, we can see ID are now hard to predict.
21:57:11.008086 IP (...)
A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
target > A: ICMP echo reply, seq 1, length 64
21:57:12.013133 IP (...)
A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
target > A: ICMP echo reply, seq 2, length 64
21:57:13.016580 IP (...)
A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
target > A: ICMP echo reply, seq 3, length 64
[1] TCP sessions uses a per flow ID generator not changed by this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jun Zhao [Fri, 25 Jul 2014 16:38:59 +0000 (00:38 +0800)]
neighbour : fix ndm_type type error issue
ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David L Stevens [Fri, 25 Jul 2014 14:30:11 +0000 (10:30 -0400)]
sunvnet: only use connected ports when sending
The sunvnet driver doesn't check whether or not a port is connected when
transmitting packets, which results in failures if a port fails to connect
(e.g., due to a version mismatch). The original code also assumes
unnecessarily that the first port is up and a switch, even though there is
a flag for switch ports.
This patch only matches a port if it is connected, and otherwise uses the
switch_port flag to send the packet to a switch port that is up.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 29 Jul 2014 00:01:01 +0000 (17:01 -0700)]
Merge tag 'linux-can-fixes-for-3.16-
20140725' of git://gitorious.org/linux-can/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2014-07-25
this is a pull request of one patch for the net tree, hoping to get into the
3.16 release.
The patch by George Cherian fixes a regression in the c_can platform driver.
When using two interfaces the regression leads to a non function second
interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Lutomirski [Wed, 23 Jul 2014 15:34:11 +0000 (08:34 -0700)]
x86_64/entry/xen: Do not invoke espfix64 on Xen
This moves the espfix64 logic into native_iret. To make this work,
it gets rid of the native patch for INTERRUPT_RETURN:
INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.
This changes the 16-bit SS behavior on Xen from OOPSing to leaking
some bits of the Xen hypervisor's RSP (I think).
[ hpa: this is a nonzero cost on native, but probably not enough to
measure. Xen needs to fix this in their own code, probably doing
something equivalent to espfix64. ]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Linus Torvalds [Mon, 28 Jul 2014 18:35:30 +0000 (11:35 -0700)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
Pull ARM AES crypto fixes from Herbert Xu:
"This push fixes a regression on ARM where odd-sized blocks supplied to
AES may cause crashes"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: arm-aes - fix encryption of unaligned data
crypto: arm64-aes - fix encryption of unaligned data
Linus Torvalds [Mon, 28 Jul 2014 18:34:31 +0000 (11:34 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
Pull powerpc fixes from Ben Herrenschmidt:
"Here are 3 more small powerpc fixes that should still go into .16.
One is a recent regression (MMCR2 business), the other is a trivial
endian fix without which FW updates won't work on LE in IBM machines,
and the 3rd one turns a BUG_ON into a WARN_ON which is definitely a
LOT more friendly especially when the whole thing is about retrieving
error logs ..."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Fix endianness of flash_block_list in rtas_flash
powerpc/powernv: Change BUG_ON to WARN_ON in elog code
powerpc/perf: Fix MMCR2 handling for EBB
Mikulas Patocka [Fri, 25 Jul 2014 23:42:30 +0000 (19:42 -0400)]
crypto: arm-aes - fix encryption of unaligned data
Fix the same alignment bug as in arm64 - we need to pass residue
unprocessed bytes as the last argument to blkcipher_walk_done.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Mikulas Patocka [Fri, 25 Jul 2014 23:40:20 +0000 (19:40 -0400)]
crypto: arm64-aes - fix encryption of unaligned data
cryptsetup fails on arm64 when using kernel encryption via AF_ALG socket.
See https://bugzilla.redhat.com/show_bug.cgi?id=1122937
The bug is caused by incorrect handling of unaligned data in
arch/arm64/crypto/aes-glue.c. Cryptsetup creates a buffer that is aligned
on 8 bytes, but not on 16 bytes. It opens AF_ALG socket and uses the
socket to encrypt data in the buffer. The arm64 crypto accelerator causes
data corruption or crashes in the scatterwalk_pagedone.
This patch fixes the bug by passing the residue bytes that were not
processed as the last parameter to blkcipher_walk_done.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thomas Falcon [Fri, 25 Jul 2014 17:47:42 +0000 (12:47 -0500)]
powerpc: Fix endianness of flash_block_list in rtas_flash
The function rtas_flash_firmware passes the address of a data structure,
flash_block_list, when making the update-flash-64-and-reboot rtas call.
While the endianness of the address is handled correctly, the endianness
of the data is not. This patch ensures that the data in flash_block_list
is big endian when passed to rtas on little endian hosts.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Vasant Hegde [Wed, 23 Jul 2014 09:22:39 +0000 (14:52 +0530)]
powerpc/powernv: Change BUG_ON to WARN_ON in elog code
We can continue to read the error log (up to MAX size) even if
we get the elog size more than MAX size. Hence change BUG_ON to
WARN_ON.
Also updated error message.
Reported-by: Gopesh Kumar Chaudhary <gopchaud@in.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Sun, 27 Jul 2014 19:41:55 +0000 (12:41 -0700)]
Linux 3.16-rc7
Linus Torvalds [Sun, 27 Jul 2014 16:57:16 +0000 (09:57 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
- revert a commit that caused a perf group regression
- silence dmesg spam
- fix kprobe probing errors on ia64 and ppc64
- filter kprobe faults from userspace
- lockdep fix for perf exit path
- prevent perf #GP in KVM guest
- correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
kprobes/x86: Don't try to resolve kprobe faults from userspace
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
perf/x86/intel: Protect LBR and extra_regs against KVM lying
perf: Fix lockdep warning on process exit
perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
perf: Revert ("perf: Always destroy groups on exit")
Linus Torvalds [Sun, 27 Jul 2014 16:53:01 +0000 (09:53 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"A couple of crash fixes, plus a fix that on 32 bits would cause a
missing -ENOSYS for nonexistent system calls"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu: Fix cache topology for early P4-SMT
x86_32, entry: Store badsys error code in %eax
x86, MCE: Robustify mcheck_init_device
Linus Torvalds [Sun, 27 Jul 2014 16:43:41 +0000 (09:43 -0700)]
Merge branch 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfs
Pull vfs fixes from Christoph Hellwig:
"A vfsmount leak fix, and a compile warning fix"
* 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfs:
fs: umount on symlink leaks mnt count
direct-io: fix uninitialized warning in do_direct_IO()
Linus Torvalds [Sun, 27 Jul 2014 16:42:06 +0000 (09:42 -0700)]
Merge tag 'firewire-fix-vt6315' of git://git./linux/kernel/git/ieee1394/linux1394
Pull firewire regression fix from Stefan Richter:
"IEEE 1394 (FireWire) subsystem fix: MSI don't work on VIA PCIe
controllers with some isochronous workloads (regression since
v3.16-rc1)"
* tag 'firewire-fix-vt6315' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: ohci: disable MSI for VIA VT6315 again
Linus Torvalds [Sat, 26 Jul 2014 21:52:01 +0000 (14:52 -0700)]
Fix gcc-4.9.0 miscompilation of load_balance() in scheduler
Michel Dänzer and a couple of other people reported inexplicable random
oopses in the scheduler, and the cause turns out to be gcc mis-compiling
the load_balance() function when debugging is enabled. The gcc bug
apparently goes back to gcc-4.5, but slight optimization changes means
that it now showed up as a problem in 4.9.0 and 4.9.1.
The instruction scheduling problem causes gcc to schedule a spill
operation to before the stack frame has been created, which in turn can
corrupt the spilled value if an interrupt comes in. There may be other
effects of this bug too, but that's the code generation problem seen in
Michel's case.
This is fixed in current gcc HEAD, but the workaround as suggested by
Markus Trippelsdorf is pretty simple: use -fno-var-tracking-assignments
when compiling the kernel, which disables the gcc code that causes the
problem. This can result in slightly worse debug information for
variable accesses, but that is infinitely preferable to actual code
generation problems.
Doing this unconditionally (not just for CONFIG_DEBUG_INFO) also allows
non-debug builds to verify that the debug build would be identical: we
can do
export GCC_COMPARE_DEBUG=1
to make gcc internally verify that the result of the build is
independent of the "-g" flag (it will make the compiler build everything
twice, toggling the debug flag, and compare the results).
Without the "-fno-var-tracking-assignments" option, the build would fail
(even with 4.8.3 that didn't show the actual stack frame bug) with a gcc
compare failure.
See also gcc bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61801
Reported-by: Michel Dänzer <michel@daenzer.net>
Suggested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Sat, 26 Jul 2014 19:58:23 +0000 (12:58 -0700)]
mm: fix direct reclaim writeback regression
Shortly before 3.16-rc1, Dave Jones reported:
WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
xfs_vm_writepage+0x5ce/0x630 [xfs]()
CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
etc.
970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
971 PF_MEMALLOC))
I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.
However, my own /var/log/messages now shows similar complaints
WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
from stressing some mmotm trees during July.
Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?
Yes, 3.16-rc1's commit
68711a746345 ("mm, migration: add destination
page freeing callback") has provided such a way to compaction: if
migrating a SwapBacked page fails, its newpage may be put back on the
list for later use with PageSwapBacked still set, and nothing will clear
it.
Whether that can do anything worse than issue WARN_ON_ONCEs, and get
some statistics wrong, is unclear: easier to fix than to think through
the consequences.
Fixing it here, before the put_new_page(), addresses the bug directly,
but is probably the worst place to fix it. Page migration is doing too
many parts of the job on too many levels: fixing it in
move_to_new_page() to complement its SetPageSwapBacked would be
preferable, except why is it (and newpage->mapping and newpage->index)
done there, rather than down in migrate_page_move_mapping(), once we are
sure of success? Not a cleanup to get into right now, especially not
with memcg cleanups coming in 3.17.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Haojian Zhuang [Wed, 2 Apr 2014 13:31:50 +0000 (21:31 +0800)]
ARM: dts: fix L2 address in Hi3620
Fix the address of L2 controler register in hi3620 SoC.
This has been wrong from the point that the file was merged
in v3.14.
Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Acked-by: Wei Xu <xuwei5@hisilicon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Linus Torvalds [Sat, 26 Jul 2014 01:17:38 +0000 (18:17 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"This is radeon and intel fixes, and is a small bit larger than I'm
guessing you'd like it to be.
- i915: fixes 32-bit highmem i915 blank screen, semaphore hang and
runtime pm fix
- radeon: gpuvm stability fix for hangs since 3.15, and hang/reboot
regression on TN/RL devices,
The only slightly controversial one is the change to use GB for the
vm_size, which I'm letting through as its a new interface we defined
in this merge window, and I'd prefer to have the released kernel have
the final interface rather than changing it later"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: fix cut and paste issue for hawaii.
drm/radeon: fix irq ring buffer overflow handling
drm/i915: Simplify i915_gem_release_all_mmaps()
drm/radeon: fix error handling in radeon_vm_bo_set_addr
drm/i915: fix freeze with blank screen booting highmem
drm/i915: Reorder the semaphore deadlock check, again
drm/radeon/TN: only enable bapm on MSI systems
drm/radeon: fix VM IB handling
drm/radeon: fix handling of radeon_vm_bo_rmv v3
drm/radeon: let's use GB for vm_size (v2)
Linus Torvalds [Sat, 26 Jul 2014 01:03:45 +0000 (18:03 -0700)]
Merge tag 'sound-3.16-rc7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Here contains only the fixes for the new FireWire bebob driver. All
fairly trivial and local fixes, so safe to apply"
* tag 'sound-3.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: bebob: Correction for return value of special_clk_ctl_put() in error
ALSA: bebob: Correction for return value of .put callback
ALSA: bebob: Use different labels for digital input/output
ALSA: bebob: Fix a missing to unlock mutex in error handling case
Linus Torvalds [Sat, 26 Jul 2014 01:03:05 +0000 (18:03 -0700)]
Merge tag 'hwmon-for-linus' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Fixes to temperature limit and vrm write operations in smsc47m192
driver"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (smsc47m192) Fix temperature limit and vrm write operations
Randy Dunlap [Sat, 26 Jul 2014 00:07:47 +0000 (17:07 -0700)]
parport: fix menu breakage
Do not split the PARPORT-related symbols with the new kconfig
symbol ARCH_MIGHT_HAVE_PC_PARPORT. The split was causing incorrect
display of these symbols -- they were not being displayed together
as they should be.
Fixes:
d90c3eb31535 "Kconfig cleanup (PARPORT_PC dependencies)"
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org # for 3.13, 3.14, 3.15
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 26 Jul 2014 00:59:50 +0000 (17:59 -0700)]
Merge tag 'blackfin-3.16-fixes' of git://git./linux/kernel/git/realmz6/blackfin-linux
Pull blackfin fixes from Steven Miao:
"smc nor flash PM fix, pinctrl group fix, update defconfig, and build
fixes"
* tag 'blackfin-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux:
blackfin: vmlinux.lds.S: reserve 32 bytes space at the end of data section for XIP kernel
defconfig: BF609: update spi config name
irq: blackfin sec: drop duplicated sec priority set
blackfin: bind different groups of one pinmux function to different state name
blackfin: fix some bf5xx boards build for missing <linux/gpio.h>
pm: bf609: cleanup smc nor flash
Steven Miao [Wed, 23 Jul 2014 09:28:25 +0000 (17:28 +0800)]
blackfin: vmlinux.lds.S: reserve 32 bytes space at the end of data section for XIP kernel
to collect some undefined section to the end of the data section and avoid section overlap
Signed-off-by: Steven Miao <realmz6@gmail.com>
Steven Miao [Fri, 25 Jul 2014 02:31:16 +0000 (10:31 +0800)]
defconfig: BF609: update spi config name
Signed-off-by: Steven Miao <realmz6@gmail.com>
Steven Miao [Thu, 24 Jul 2014 08:10:19 +0000 (16:10 +0800)]
irq: blackfin sec: drop duplicated sec priority set
Signed-off-by: Steven Miao <realmz6@gmail.com>
Sonic Zhang [Thu, 13 Feb 2014 10:52:34 +0000 (18:52 +0800)]
blackfin: bind different groups of one pinmux function to different state name
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Steven Miao <realmz6@gmail.com>