Stefan Schmidt [Thu, 15 Dec 2016 17:40:14 +0000 (18:40 +0100)]
ieee802154: atusb: do not use the stack for buffers to make them DMA able
From 4.9 we should really avoid using the stack here as this will not be DMA
able on various platforms. This changes the buffers already being present in
time of 4.9 being released. This should go into stable as well.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
David Ahern [Wed, 11 Jan 2017 23:42:17 +0000 (15:42 -0800)]
net: ipv4: fix table id in getroute response
rtm_table is an 8-bit field while table ids are allowed up to u32. Commit
709772e6e065 ("net: Fix routing tables with id > 255 for legacy software")
added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the
table id is > 255. The table id returned on get route requests should do
the same.
Fixes:
c36ba6603a11 ("net: Allow user to get table id from route lookup")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Wed, 11 Jan 2017 22:45:51 +0000 (16:45 -0600)]
net: qcom/emac: grab a reference to the phydev on ACPI systems
Commit
6ffe1c4cd0a7 ("net: qcom/emac: fix of_node and phydev leaks")
fixed the problem with reference leaks on phydev, but the fix is
device-tree specific. When the driver unloads, the reference is
dropped only on DT systems.
Instead, it's cleaner if up grab an reference on ACPI systems.
When the driver unloads, we can drop the reference without having
to check whether we're on a DT system.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 11 Jan 2017 22:29:54 +0000 (14:29 -0800)]
net: lwtunnel: Handle lwtunnel_fill_encap failure
Handle failure in lwtunnel_fill_encap adding attributes to skb.
Fixes:
571e722676fe ("ipv4: support for fib route lwtunnel encap attributes")
Fixes:
19e42e451506 ("ipv6: support for fib route lwtunnel encap attributes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kazuya Mizuguchi [Thu, 12 Jan 2017 12:21:06 +0000 (13:21 +0100)]
ravb: Remove Rx overflow log messages
Remove Rx overflow log messages as in an environment where logging results
in network traffic logging may cause further overflows.
Fixes:
c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
[simon: reworked changelog]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 12 Jan 2017 14:25:56 +0000 (09:25 -0500)]
Merge branch 'mlxsw-fixes'
Jiri Pirko says:
====================
mlxsw: Couple of fixes
Couple of simple fixes from Arkadi and Elad.
Please queue these up for stable. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Elad Raz [Thu, 12 Jan 2017 08:10:39 +0000 (09:10 +0100)]
mlxsw: pci: Fix EQE structure definition
The event_data starts from address 0x00-0x0C and not from 0x08-0x014. This
leads to duplication with other fields in the Event Queue Element such as
sub-type, cqn and owner.
Fixes:
eda6500a987a0 ("mlxsw: Add PCI bus implementation")
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 12 Jan 2017 08:10:38 +0000 (09:10 +0100)]
mlxsw: switchx2: Fix memory leak at skb reallocation
During transmission the skb is checked for headroom in order to
add vendor specific header. In case the skb needs to be re-allocated,
skb_realloc_headroom() is called to make a private copy of the original,
but doesn't release it. Current code assumes that the original skb is
released during reallocation and only releases it at the error path
which causes a memory leak.
Fix this by adding the original skb release to the main path.
Fixes:
d003462a50de ("mlxsw: Simplify mlxsw_sx_port_xmit function")
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 12 Jan 2017 08:10:37 +0000 (09:10 +0100)]
mlxsw: spectrum: Fix memory leak at skb reallocation
During transmission the skb is checked for headroom in order to
add vendor specific header. In case the skb needs to be re-allocated,
skb_realloc_headroom() is called to make a private copy of the original,
but doesn't release it. Current code assumes that the original skb is
released during reallocation and only releases it at the error path
which causes a memory leak.
Fix this by adding the original skb release to the main path.
Fixes:
56ade8fe3fe1 ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Wed, 11 Jan 2017 17:16:32 +0000 (09:16 -0800)]
netvsc: add rcu_read locking to netvsc callback
The receive callback (in tasklet context) is using RCU to get reference
to associated VF network device but this is not safe. RCU read lock
needs to be held. Found by running with full lockdep debugging
enabled.
Fixes:
f207c10d9823 ("hv_netvsc: use RCU to protect vf_netdev")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martynas Pumputis [Wed, 11 Jan 2017 15:18:53 +0000 (15:18 +0000)]
vxlan: Set ports in flow key when doing route lookups
Otherwise, a xfrm policy with sport/dport being set cannot be matched.
Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Wed, 11 Jan 2017 08:25:34 +0000 (16:25 +0800)]
r8152: fix the sw rx checksum is unavailable
Fix the hw rx checksum is always enabled, and the user couldn't switch
it to sw rx checksum.
Note that the RTL_VER_01 only support sw rx checksum only. Besides,
the hw rx checksum for RTL_VER_02 is disabled after
commit
b9a321b48af4 ("r8152: Fix broken RX checksums."). Re-enable it.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 11 Jan 2017 19:15:15 +0000 (11:15 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"27 fixes.
There are three patches that aren't actually fixes. They're simple
function renamings which are nice-to-have in mainline as ongoing net
development depends on them."
* akpm: (27 commits)
timerfd: export defines to userspace
mm/hugetlb.c: fix reservation race when freeing surplus pages
mm/slab.c: fix SLAB freelist randomization duplicate entries
zram: support BDI_CAP_STABLE_WRITES
zram: revalidate disk under init_lock
mm: support anonymous stable page
mm: add documentation for page fragment APIs
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
mm: don't dereference struct page fields of invalid pages
mailmap: add codeaurora.org names for nameless email commits
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
mm: pmd dirty emulation in page fault handler
ipc/sem.c: fix incorrect sem_lock pairing
lib/Kconfig.debug: fix frv build failure
mm: get rid of __GFP_OTHER_NODE
mm: fix remote numa hits statistics
mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
ocfs2: fix crash caused by stale lvb with fsdlm plugin
...
Linus Torvalds [Wed, 11 Jan 2017 17:52:12 +0000 (09:52 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix rtlwifi crash, from Larry Finger.
2) Memory disclosure in appletalk ipddp routing code, from Vlad
Tsyrklevich.
3) r8152 can erroneously split an RX packet into multiple URBs if the
Rx FIFO is not empty when we suspend. Fix this by waiting for the
FIFO to empty before suspending. From Hayes Wang.
4) Two GRO fixes (enter slow path when not enough SKB tail room exists,
disable frag0 optimizations when there are IPV6 extension headers)
from Eric Dumazet and Herbert Xu.
5) A series of mlx5e bug fixes (do source udp port offloading for
tunnels properly, Ip fragment matching fixes, handling firmware
errors properly when installing TC rules, etc.) from Saeed Mahameed,
Or Gerlitz, Roi Dayan, Hadar Hen Zion, Gil Rockah, and Daniel
Jurgens.
6) Two VRF fixes from David Ahern (don't skip multipath selection for
VRF paths, disallow VRF to be configured with table ID 0).
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
net: vrf: do not allow table id 0
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
sctp: Fix spelling mistake: "Atempt" -> "Attempt"
net: ipv4: Fix multipath selection with vrf
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
gro: use min_t() in skb_gro_reset_offset()
net/mlx5: Only cancel recovery work when cleaning up device
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
net/mlx5e: Un-register uplink representor on nic_disable
net/mlx5e: Properly handle FW errors while adding TC rules
net/mlx5e: Fix kbuild warnings for uninitialized parameters
net/mlx5e: Set inline mode requirements for matching on IP fragments
net/mlx5e: Properly get address type of encapsulation IP headers
net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
gro: Disable frag0 optimization on IPv6 ext headers
gro: Enter slow-path if there is no tailroom
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
net/af_iucv: don't use paged skbs for TX on HiperSockets
...
Linus Torvalds [Wed, 11 Jan 2017 17:28:13 +0000 (09:28 -0800)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This fixes a regression in aesni that renders it useless if it's
built-in with a modular pcbc configuration"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aesni - Fix failure when built-in with modular pcbc
David Ahern [Tue, 10 Jan 2017 23:22:25 +0000 (15:22 -0800)]
net: vrf: do not allow table id 0
Frank reported that vrf devices can be created with a table id of 0.
This breaks many of the run time table id checks and should not be
allowed. Detect this condition at create time and fail with EINVAL.
Fixes:
193125dbd8eb ("net: Introduce VRF device driver")
Reported-by: Frank Kellermann <frank.kellermann@atos.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 10 Jan 2017 23:13:45 +0000 (23:13 +0000)]
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
When an Marvell 88E1512 PHY is connected to a nic in SGMII mode, the
fiber page is used for the SGMII host-side connection. The PHY driver
notices that SUPPORTED_FIBRE is set, so it tries reading the fiber page
for the link status, and ends up reading the MAC-side status instead of
the outgoing (copper) link. This leads to incorrect results reported
via ethtool.
If the PHY is connected via SGMII to the host, ignore the fiber page.
However, continue to allow the existing power management code to
suspend and resume the fiber page.
Fixes:
6cfb3bcc0641 ("Marvell phy: check link status in case of fiber link.")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 10 Jan 2017 22:53:06 +0000 (22:53 +0000)]
sctp: Fix spelling mistake: "Atempt" -> "Attempt"
Trivial fix to spelling mistake in WARN_ONCE message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 10 Jan 2017 22:37:35 +0000 (14:37 -0800)]
net: ipv4: Fix multipath selection with vrf
fib_select_path does not call fib_select_multipath if oif is set in the
flow struct. For VRF use cases oif is always set, so multipath route
selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
check similar to what is done in fib_table_lookup.
Add saddr and proto to the flow struct for the fib lookup done by the
VRF driver to better match hash computation for a flow.
Fixes:
613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 10 Jan 2017 12:08:06 +0000 (13:08 +0100)]
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
not right when CONFIG_NET is disabled and there is no socket interface:
warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)
I don't know what the correct solution for this is, but simply removing
the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
'if NET' section avoids the warning and does not produce other build
errors.
Fixes:
483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 11 Jan 2017 03:52:43 +0000 (19:52 -0800)]
gro: use min_t() in skb_gro_reset_offset()
On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
so we shall use min_t() instead of min() to avoid a compiler error.
Fixes:
1272ce87fa01 ("gro: Enter slow-path if there is no tailroom")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 11 Jan 2017 02:34:03 +0000 (21:34 -0500)]
Merge branch 'mlx5-fixes'
Saeed Mahameed says:
====================
Mellanox mlx5 fixes and cleanups 2017-01-10
This series includes some mlx5e general cleanups from Daniel, Gil, Hadar
and myself.
Also it includes some critical mlx5e TC offloads fixes from Or Gerlitz.
For -stable:
- net/mlx5e: Remove WARN_ONCE from adaptive moderation code
Although this fix doesn't affect any functionality, I thought it is
better to clean this -WARN_ONCE- up for -stable in case someone hits
such corner case.
Please apply and let me know if there's any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Jurgens [Tue, 10 Jan 2017 20:33:39 +0000 (22:33 +0200)]
net/mlx5: Only cancel recovery work when cleaning up device
Do not attempt to drain the health workqueue when unloading the device in
the recovery flow, this can cause a deadlock when the recovery work
tries to cancel itself with sync.
Because the work is no longer unconditionally canceled when unloading, it
must be explicitly canceled in the AER flow.
fixes:
689a248df83b ("net/mlx5: Cancel recovery work in remove flow")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gil Rockah [Tue, 10 Jan 2017 20:33:38 +0000 (22:33 +0200)]
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
When trying to do interface down or changing interface configuration
under heavy traffic, some of the adaptive moderation corner cases can
occur and leave a WARN_ONCE call trace in the kernel log.
Those WARN_ONCE are meant for debug only, and should have been inserted
only under debug. We avoid such call traces by removing those WARN_ONCE.
Fixes:
cb3c7fd4f839 ("net/mlx5e: Support adaptive RX coalescing")
Signed-off-by: Gil Rockah <gilr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Tue, 10 Jan 2017 20:33:37 +0000 (22:33 +0200)]
net/mlx5e: Un-register uplink representor on nic_disable
The code before this patch registered uplink e-Switch representor
on nic_enable and unregistered on nic_cleanup, the right place
for this unregister is in nic_disable.
Fixes:
127ea380acc9 ("net/mlx5: Add Representors registration API")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:36 +0000 (22:33 +0200)]
net/mlx5e: Properly handle FW errors while adding TC rules
When the firmware returns an error (common example is an attempt to
add twice the same rule which is refused by the some FWs), we are not
properly derefing/cleaning few resources allocated on the way.
Examples are vport vlan deref under eswitch vlan offloads, and encap
entry/neighbour deref under eswitch encapsulation offloads, fix that.
Fixes:
a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes:
8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Tue, 10 Jan 2017 20:33:35 +0000 (22:33 +0200)]
net/mlx5e: Fix kbuild warnings for uninitialized parameters
kbuild warn about parameters that may be used uninitialized, fix it.
Fixes:
a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:34 +0000 (22:33 +0200)]
net/mlx5e: Set inline mode requirements for matching on IP fragments
For e-switch level matching on packets being an IP fragment, we
need to make sure the source vport inline mode is L3, fix that.
Fixes:
3f7d0eb42d59 ('net/mlx5e: Offload TC matching on packets being IP fragments')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:33 +0000 (22:33 +0200)]
net/mlx5e: Properly get address type of encapsulation IP headers
As done elsewhere in our TC/flower offload code, the address type of
the encapsulation IP headers should be realized accroding to the
addr_type field of the encapsulation control dissector key, do that.
Fixes:
bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:32 +0000 (22:33 +0200)]
net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
When the route lookup fails we should return the actual error.
When the neigh isn't valid, we should return -EOPNOTSUPP as done
in similar cases along the code.
When the offload can't take place as of invalid neigh etc, we
must release the neigh.
Fixes:
a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:31 +0000 (22:33 +0200)]
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
We silently reject offloading of IPv6 tunnels, non vxlan tunnels,
vxlan tunnels where the dst port to match is not provided, etc.
Be a bit more verbose and print a warning so the user better
realizes what went wrong here and can fix it.
Fixes:
a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes:
bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 10 Jan 2017 20:33:30 +0000 (22:33 +0200)]
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
We can offload the matching on source udp port of ip tunnels for
decapsulation. We can not offload setting source udp port for tunnels
as part of encapsulation. Fix both the code that deals with matching
offload (decap) and the code that deal with encap offload to align with
that.
Fixes:
a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes:
bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Frysinger [Wed, 11 Jan 2017 00:58:30 +0000 (16:58 -0800)]
timerfd: export defines to userspace
Since userspace is expected to call timerfd syscalls directly with these
flags/ioctls, make sure we export them so they don't have to duplicate
the values themselves.
Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 11 Jan 2017 00:58:27 +0000 (16:58 -0800)]
mm/hugetlb.c: fix reservation race when freeing surplus pages
return_unused_surplus_pages() decrements the global reservation count,
and frees any unused surplus pages that were backing the reservation.
Commit
7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
return_unused_surplus_pages()") added a call to cond_resched_lock in the
loop freeing the pages.
As a result, the hugetlb_lock could be dropped, and someone else could
use the pages that will be freed in subsequent iterations of the loop.
This could result in inconsistent global hugetlb page state, application
api failures (such as mmap) failures or application crashes.
When dropping the lock in return_unused_surplus_pages, make sure that
the global reservation count (resv_huge_pages) remains sufficiently
large to prevent someone else from claiming pages about to be freed.
Analyzed by Paul Cassella.
Fixes:
7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Paul Cassella <cassella@cray.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Sperbeck [Wed, 11 Jan 2017 00:58:24 +0000 (16:58 -0800)]
mm/slab.c: fix SLAB freelist randomization duplicate entries
This patch fixes a bug in the freelist randomization code. When a high
random number is used, the freelist will contain duplicate entries. It
will result in different allocations sharing the same chunk.
It will result in odd behaviours and crashes. It should be uncommon but
it depends on the machines. We saw it happening more often on some
machines (every few hours of running tests).
Fixes:
c7ce4f60ac19 ("mm: SLAB freelist randomization")
Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 11 Jan 2017 00:58:21 +0000 (16:58 -0800)]
zram: support BDI_CAP_STABLE_WRITES
zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.
In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below
https://bugzilla.suse.com/show_bug.cgi?id=997574
This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".
Fixes:
da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 11 Jan 2017 00:58:18 +0000 (16:58 -0800)]
zram: revalidate disk under init_lock
Commit
b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit
08eee69fcf6b ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.
This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.
Fixes:
da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 11 Jan 2017 00:58:15 +0000 (16:58 -0800)]
mm: support anonymous stable page
During developemnt for zram-swap asynchronous writeback, I found strange
corruption of compressed page, resulting in:
Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task:
ffff88007620b840 task.stack:
ffff880078090000
RIP: set_freeobj.part.43+0x1c/0x1f
RSP: 0018:
ffff880078093ca8 EFLAGS:
00010246
RAX:
0000000000000018 RBX:
ffff880076798d88 RCX:
ffffffff81c408c8
RDX:
0000000000000018 RSI:
0000000000000000 RDI:
0000000000000246
RBP:
ffff880078093cb0 R08:
0000000000000000 R09:
0000000000000000
R10:
ffff88005bc43030 R11:
0000000000001df3 R12:
ffff880076798d88
R13:
000000000005bc43 R14:
ffff88007819d1b8 R15:
0000000000000001
FS:
0000000000000000(0000) GS:
ffff88007e380000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007fc934048f20 CR3:
0000000077b01000 CR4:
00000000000406e0
Call Trace:
obj_malloc+0x22b/0x260
zs_malloc+0x1e4/0x580
zram_bvec_rw+0x4cd/0x830 [zram]
page_requests_rw+0x9c/0x130 [zram]
zram_thread+0xe6/0x173 [zram]
kthread+0xca/0xe0
ret_from_fork+0x25/0x30
With investigation, it reveals currently stable page doesn't support
anonymous page. IOW, reuse_swap_page can reuse the page without waiting
writeback completion so it can overwrite page zram is compressing.
Unfortunately, zram has used per-cpu stream feature from v4.7.
It aims for increasing cache hit ratio of scratch buffer for
compressing. Downside of that approach is that zram should ask
memory space for compressed page in per-cpu context which requires
stricted gfp flag which could be failed. If so, it retries to
allocate memory space out of per-cpu context so it could get memory
this time and compress the data again, copies it to the memory space.
In this scenario, zram assumes the data should never be changed
but it is not true unless stable page supports. So, If the data is
changed under us, zram can make buffer overrun because second
compression size could be bigger than one we got in previous trial
and blindly, copy bigger size object to smaller buffer which is
buffer overrun. The overrun breaks zsmalloc free object chaining
so system goes crash like above.
I think below is same problem.
https://bugzilla.suse.com/show_bug.cgi?id=997574
Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
writeback in there so the approach in this patch is simply return false if
we found it needs stable page. Although it increases memory footprint
temporarily, it happens rarely and it should be reclaimed easily althoug
it happened. Also, It would be better than waiting of IO completion,
which is critial path for application latency.
Fixes:
da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Duyck [Wed, 11 Jan 2017 00:58:12 +0000 (16:58 -0800)]
mm: add documentation for page fragment APIs
This is a first pass at trying to add documentation for the page_frag
APIs. They may still change over time but for now I thought I would try
to get these documented so that as more network drivers and stack calls
make use of them we have one central spot to document how they are meant
to be used.
Link: http://lkml.kernel.org/r/20170104024157.13451.6758.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Duyck [Wed, 11 Jan 2017 00:58:09 +0000 (16:58 -0800)]
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
This patch does two things.
First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or
refilling the cache, not the frags themselves.
Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.
Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Duyck [Wed, 11 Jan 2017 00:58:06 +0000 (16:58 -0800)]
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
Patch series "Page fragment updates", v4.
This patch series takes care of a few cleanups for the page fragments
API.
First we do some renames so that things are much more consistent. First
we move the page_frag_ portion of the name to the front of the functions
names. Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.
Finally I added a bit of documentation that will hopefully help to
explain some of this. I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.
This patch (of 3):
This patch renames the page frag functions to be more consistent with
other APIs. Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix. This makes it a bit clearer in terms of naming.
In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.
Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 11 Jan 2017 00:58:04 +0000 (16:58 -0800)]
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels
kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.
The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts 0 and return
false. This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.
Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.
We are losing empty LRU but non-zero lru size detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.
Fixes:
f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Nils Holland <nholland@tisys.org>
Tested-by: Nils Holland <nholland@tisys.org>
Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ard Biesheuvel [Wed, 11 Jan 2017 00:58:00 +0000 (16:58 -0800)]
mm: don't dereference struct page fields of invalid pages
The VM_BUG_ON() check in move_freepages() checks whether the node id of
a page matches the node id of its zone. However, it does this before
having checked whether the struct page pointer refers to a valid struct
page to begin with. This is guaranteed in most cases, but may not be
the case if CONFIG_HOLES_IN_ZONE=y.
So reorder the VM_BUG_ON() with the pfn_valid_within() check.
Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Robert Richter <rrichter@cavium.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Boyd [Wed, 11 Jan 2017 00:57:57 +0000 (16:57 -0800)]
mailmap: add codeaurora.org names for nameless email commits
Some codeaurora.org emails have crept in but the names don't exist for
them. Add the names for the emails so git can match everyone up.
Link: http://lkml.kernel.org/r/20170104194611.25933-1-sboyd@codeaurora.org
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Sarangdhar Joshi <spjoshi@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Subhash Jadavani <subhashj@codeaurora.org>
Cc: Thomas Pedersen <twp@codeaurora.org>
Cc: Andy Gross <andy.gross@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jamie Iles [Wed, 11 Jan 2017 00:57:54 +0000 (16:57 -0800)]
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
Since commit
00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 11 Jan 2017 00:57:51 +0000 (16:57 -0800)]
mm: pmd dirty emulation in page fault handler
Andreas reported [1] made a test in jemalloc hang in THP mode in arm64:
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
The problem is currently page fault handler doesn't supports dirty bit
emulation of pmd for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.
How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pmd dirty and arch code for
setting pmd will clear PTE_RDONLY for application to proceed.
IOW, if VM doesn't mark the pmd dirty, application hangs forever by
repeated fault(i.e., store op but the pmd is PTE_RDONLY).
This patch enables pmd dirty-bit emulation for those architectures.
[1]
b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
Fixes:
b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Andreas Schwab <schwab@suse.de>
Tested-by: Andreas Schwab <schwab@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <je@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Wed, 11 Jan 2017 00:57:48 +0000 (16:57 -0800)]
ipc/sem.c: fix incorrect sem_lock pairing
Based on the syzcaller test case from dvyukov:
https://gist.githubusercontent.com/dvyukov/
d0e5efefe4d7d6daed829f5c3ca26a40/raw/
08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt
The slow (i.e.: failure to acquire) syscall exit from semtimedop()
incorrectly assumed that the the same lock is acquired as it was at the
initial syscall entry.
This is wrong:
- thread A: single semop semop(), sleeps
- thread B: multi semop semop(), sleeps
- thread A: woken up by signal/timeout
With this sequence, the initial sem_lock() call locks the per-semaphore
spinlock, and it is unlocked with sem_unlock(). The call at the syscall
return locks the global spinlock. Because locknum is not updated, the
following sem_unlock() call unlocks the per-semaphore spinlock, which is
actually not locked.
The fix is trivial: Use the return value from sem_lock.
Fixes:
370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Johanna Abrahamsson <johanna@mjao.org>
Tested-by: Johanna Abrahamsson <johanna@mjao.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Wed, 11 Jan 2017 00:57:45 +0000 (16:57 -0800)]
lib/Kconfig.debug: fix frv build failure
The build of frv allmodconfig was failing with the errors like:
/tmp/cc0JSPc3.s: Assembler messages:
/tmp/cc0JSPc3.s:1839: Error: symbol `.LSLT0' is already defined
/tmp/cc0JSPc3.s:1842: Error: symbol `.LASLTP0' is already defined
/tmp/cc0JSPc3.s:1969: Error: symbol `.LELTP0' is already defined
/tmp/cc0JSPc3.s:1970: Error: symbol `.LELT0' is already defined
Commit
866ced950bcd ("kbuild: Support split debug info v4") introduced
splitting the debug info and keeping that in a separate file. Somehow,
the frv-linux gcc did not like that and I am guessing that instead of
splitting it started copying. The first report about this is at:
https://lists.01.org/pipermail/kbuild-all/2015-July/010527.html.
I will try and see if this can work with frv and if still fails I will
open a bug report with gcc. But meanwhile this is the easiest option to
solve build failure of frv.
Fixes:
866ced950bcd ("kbuild: Support split debug info v4")
Link: http://lkml.kernel.org/r/1482062348-5352-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 11 Jan 2017 00:57:42 +0000 (16:57 -0800)]
mm: get rid of __GFP_OTHER_NODE
The flag was introduced by commit
78afd5612deb ("mm: add
__GFP_OTHER_NODE flag") to allow proper accounting of remote node
allocations done by kernel daemons on behalf of a process - e.g.
khugepaged.
After "mm: fix remote numa hits statistics" we do not need and actually
use the flag so we can safely remove it because all allocations which
are satisfied from their "home" node are accounted properly.
[mhocko@suse.com: fix build]
Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 11 Jan 2017 00:57:39 +0000 (16:57 -0800)]
mm: fix remote numa hits statistics
Jia He has noticed that commit
b9f00e147f27 ("mm, page_alloc: reduce
branches in zone_statistics") has an unintentional side effect that
remote node allocation requests are accounted as NUMA_MISS rathat than
NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.
There are many of these potentially because the flag is used very rarely
while we have many users of __alloc_pages_node.
Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
follow up patch) and treat all allocations that were satisfied from the
preferred zone's node as NUMA_HITS because this is the same node we
requested the allocation from in most cases. If this is not the local
node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
One downsize would be that an allocation request for a node which is
outside of the mempolicy nodemask would be reported as a hit which is a
bit weird but that was the case before
b9f00e147f27 already.
Fixes:
b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Jia He <hejianet@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz> # with cbmc[1] superpowers
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Wed, 11 Jan 2017 00:57:36 +0000 (16:57 -0800)]
mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
Both arch_add_memory() and arch_remove_memory() expect a single threaded
context.
For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
not hold any locks over this check and branch:
if (pgd_val(*pgd)) {
pud = (pud_t *)pgd_page_vaddr(*pgd);
paddr_last = phys_pud_init(pud, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}
pud = alloc_low_page();
paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
page_size_mask);
The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization. This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:
BUG: unable to handle kernel paging request at
ffff888ebfff0000
IP: memcpy_erms+0x6/0x10
PGD
2f8e8fc067 PUD 0 /* <---- Invalid PUD */
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
task:
ffff882fac290040 ti:
ffff882f887a4000 task.ti:
ffff882f887a4000
RIP: memcpy_erms+0x6/0x10
[..]
Call Trace:
? pmem_do_bvec+0x205/0x370 [nd_pmem]
? blk_queue_enter+0x3a/0x280
pmem_rw_page+0x38/0x80 [nd_pmem]
bdev_read_page+0x84/0xb0
Hold the standard memory hotplug mutex over calls to
arch_{add,remove}_memory().
Fixes:
41e94a851304 ("add devm_memremap_pages")
Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Ren [Wed, 11 Jan 2017 00:57:33 +0000 (16:57 -0800)]
ocfs2: fix crash caused by stale lvb with fsdlm plugin
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm:
C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm:
C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node
318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node
318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task:
ffff88004ff6d240 ti:
ffff880074e68000 task.ti:
ffff880074e68000
RIP: 0010:[<
ffffffffa05c8c30>] [<
ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:
ffff880074e6bd50 EFLAGS:
00010282
RAX:
0000000000000074 RBX:
000000000000029e RCX:
0000000000000000
RDX:
0000000000000001 RSI:
0000000000000246 RDI:
0000000000000246
RBP:
ffff880074e6bda8 R08:
000000003675dc7a R09:
ffffffff82013414
R10:
0000000000034c50 R11:
0000000000000000 R12:
ffff88003aab3448
R13:
00000000000002dc R14:
0000000000046e11 R15:
0000000000000020
FS:
00007f839f965700(0000) GS:
ffff88007fc80000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007f839f97e000 CR3:
0000000036723000 CR4:
00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<
ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 11 Jan 2017 00:57:30 +0000 (16:57 -0800)]
bpf: do not use KMALLOC_SHIFT_MAX
Commit
01b3f52157ff ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.
While this is not incorrect it is not very clean because we already have
KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
KMALLOC_MAX_SIZE instead.
The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".
Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 11 Jan 2017 00:57:27 +0000 (16:57 -0800)]
mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
Andrey Konovalov has reported the following warning triggered by the
syzkaller fuzzer.
WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x159c/0x1e20
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__alloc_pages_slowpath mm/page_alloc.c:3511
__alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
alloc_pages include/linux/gfp.h:469
kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
kmalloc_large include/linux/slab.h:422
__kmalloc+0x210/0x2d0 mm/slub.c:3723
kmalloc include/linux/slab.h:495
ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
new_sync_write fs/read_write.c:499
__vfs_write+0x483/0x760 fs/read_write.c:512
vfs_write+0x170/0x4e0 fs/read_write.c:560
SYSC_write fs/read_write.c:607
SyS_write+0xfb/0x230 fs/read_write.c:599
entry_SYSCALL_64_fastpath+0x1f/0xc2
The issue is caused by a lack of size check for the request size in
ep_write_iter which should be fixed. It, however, points to another
problem, that SLUB defines KMALLOC_MAX_SIZE too large because the its
KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT) which means that the
resulting page allocator request might be MAX_ORDER which is too large
(see __alloc_pages_slowpath).
The same applies to the SLOB allocator which allows even larger sizes.
Make sure that they are capped properly and never request more than
MAX_ORDER order.
Link: http://lkml.kernel.org/r/20161220130659.16461-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 11 Jan 2017 00:57:24 +0000 (16:57 -0800)]
dax: wrprotect pmd_t in dax_mapping_entry_mkclean
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
Fixes:
4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 11 Jan 2017 00:57:21 +0000 (16:57 -0800)]
mm: add follow_pte_pmd()
Patch series "Write protect DAX PMDs in *sync path".
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss, as detailed in patch 2.
This series is based on Dan's "libnvdimm-pending" branch, which is the
current home for Jan's "dax: Page invalidation fixes" series. You can
find a working tree here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
This patch (of 2):
Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
huge page PMD leaf to be found and returned.
Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Wed, 11 Jan 2017 00:57:18 +0000 (16:57 -0800)]
mm/thp/pagecache/collapse: free the pte page table on collapse for thp page cache.
With THP page cache, when trying to build a huge page from regular pte
pages, we just clear the pmd entry. We will take another fault and at
that point we will find the huge page in the radix tree, thereby using
the huge page to complete the page fault
The second fault path will allocate the needed pgtable_t page for archs
like ppc64. So no need to deposit the same in collapse path.
Depositing them in the collapse path resulting in a pgtable_t memory
leak also giving errors like
BUG: non-zero nr_ptes on freeing mm: 3
Fixes:
953c66c2b22a ("mm: THP page cache support for ppc64")
Link: http://lkml.kernel.org/r/20161212163428.6780-2-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 11 Jan 2017 00:57:15 +0000 (16:57 -0800)]
dax: fix deadlock with DAX 4k holes
Currently in DAX if we have three read faults on the same hole address we
can end up with the following:
Thread 0 Thread 1 Thread 2
-------- -------- --------
dax_iomap_fault
grab_mapping_entry
lock_slot
<locks empty DAX entry>
dax_iomap_fault
grab_mapping_entry
get_unlocked_mapping_entry
<sleeps on empty DAX entry>
dax_iomap_fault
grab_mapping_entry
get_unlocked_mapping_entry
<sleeps on empty DAX entry>
dax_load_hole
find_or_create_page
...
page_cache_tree_insert
dax_wake_mapping_entry_waiter
<wakes one sleeper>
__radix_tree_replace
<swaps empty DAX entry with 4k zero page>
<wakes>
get_page
lock_page
...
put_locked_mapping_entry
unlock_page
put_page
<sleeps forever on the DAX
wait queue>
The crux of the problem is that once we insert a 4k zero page, all
locking from then on is done in terms of that 4k zero page and any
additional threads sleeping on the empty DAX entry will never be woken.
Fix this by waking all sleepers when we replace the DAX radix tree entry
with a 4k zero page. This will allow all sleeping threads to
successfully transition from locking based on the DAX empty entry to
locking on the 4k zero page.
With the test case reported by Xiong this happens very regularly in my
test setup, with some runs resulting in 9+ threads in this deadlocked
state. With this fix I've been able to run that same test dozens of
times in a loop without issue.
Fixes:
ac401cc78242 ("dax: New fault locking")
Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Xiong Zhou <xzhou@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 11 Jan 2017 00:57:12 +0000 (16:57 -0800)]
MAINTAINERS: remove duplicate bug filling description
I have noticed that two different descriptions for B: entries in
MAINTAINERS were merged: commit
686564434e88 ("MAINTAINERS: Add bug
tracking system location entry type") and
2de2bd95f456 ("MAINTAINERS:
add "B:" for URI where to file bugs").
This patch keeps the description from
2de2bd95f456. There has been a
discussion [1] about whether this more detailed description is useful
and what it exactly implies. I find it more useful and general, and the
author of
686564434e88 agreed in the end that either is fine.
[1] https://lkml.org/lkml/2016/12/8/71
Link: http://lkml.kernel.org/r/20161219085158.12114-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Herbert Xu [Tue, 10 Jan 2017 20:24:15 +0000 (12:24 -0800)]
gro: Disable frag0 optimization on IPv6 ext headers
The GRO fast path caches the frag0 address. This address becomes
invalid if frag0 is modified by pskb_may_pull or its variants.
So whenever that happens we must disable the frag0 optimization.
This is usually done through the combination of gro_header_hard
and gro_header_slow, however, the IPv6 extension header path did
the pulling directly and would continue to use the GRO fast path
incorrectly.
This patch fixes it by disabling the fast path when we enter the
IPv6 extension header path.
Fixes:
78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address")
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Tue, 10 Jan 2017 20:24:01 +0000 (12:24 -0800)]
gro: Enter slow-path if there is no tailroom
The GRO path has a fast-path where we avoid calling pskb_may_pull
and pskb_expand by directly accessing frag0. However, this should
only be done if we have enough tailroom in the skb as otherwise
we'll have to expand it later anyway.
This patch adds the check by capping frag0_len with the skb tailroom.
Fixes:
cb18978cbf45 ("gro: Open-code final pskb_may_pull")
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 10 Jan 2017 17:41:49 +0000 (09:41 -0800)]
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
In commit
b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs"),
it changed EOPNOTSUPP to ENOTSUPP by mistake. This patch fixes it.
Fixes:
b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Tue, 10 Jan 2017 16:10:34 +0000 (17:10 +0100)]
net/af_iucv: don't use paged skbs for TX on HiperSockets
With commit
e53743994e21
("af_iucv: use paged SKBs for big outbound messages"),
we transmit paged skbs for both of AF_IUCV's transport modes
(IUCV or HiperSockets).
The qeth driver for Layer 3 HiperSockets currently doesn't
support NETIF_F_SG, so these skbs would just be linearized again
by the stack.
Avoid that overhead by using paged skbs only for IUCV transport.
cc stable, since this also circumvents a significant skb leak when
sending large messages (where the skb then needs to be linearized).
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v4.8+
Fixes:
e53743994e21 ("af_iucv: use paged SKBs for big outbound messages")
Signed-off-by: David S. Miller <davem@davemloft.net>
Anna, Suman [Tue, 10 Jan 2017 03:48:56 +0000 (21:48 -0600)]
net: add the AF_QIPCRTR entries to family name tables
Commit
bdabad3e363d ("net: Add Qualcomm IPC router") introduced a
new address family. Update the family name tables accordingly so
that the lockdep initialization can use the proper names for this
family.
Cc: Courtney Cavin <courtney.cavin@sonymobile.com>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Suman Anna <s-anna@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Boyd [Mon, 9 Jan 2017 22:31:58 +0000 (14:31 -0800)]
net: qrtr: Mark 'buf' as little endian
Failure to mark this pointer as __le32 causes checkers like
sparse to complain:
net/qrtr/qrtr.c:274:16: warning: incorrect type in assignment (different base types)
net/qrtr/qrtr.c:274:16: expected unsigned int [unsigned] [usertype] <noident>
net/qrtr/qrtr.c:274:16: got restricted __le32 [usertype] <noident>
net/qrtr/qrtr.c:275:16: warning: incorrect type in assignment (different base types)
net/qrtr/qrtr.c:275:16: expected unsigned int [unsigned] [usertype] <noident>
net/qrtr/qrtr.c:275:16: got restricted __le32 [usertype] <noident>
net/qrtr/qrtr.c:276:16: warning: incorrect type in assignment (different base types)
net/qrtr/qrtr.c:276:16: expected unsigned int [unsigned] [usertype] <noident>
net/qrtr/qrtr.c:276:16: got restricted __le32 [usertype] <noident>
Silence it.
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 9 Jan 2017 19:58:34 +0000 (11:58 -0800)]
net: dsa: Ensure validity of dst->ds[0]
It is perfectly possible to have non zero indexed switches being present
in a DSA switch tree, in such a case, we will be deferencing a NULL
pointer while dsa_cpu_port_ethtool_{setup,restore}. Be more defensive
and ensure that dst->ds[0] is valid before doing anything with it.
Fixes:
0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 10 Jan 2017 21:51:54 +0000 (13:51 -0800)]
Merge tag 'linux-kselftest-4.10-rc4-fixes' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"This update consists of fixes to use shell instead of bash to run
tests in embedded devices where the only shell available is the
busybox ash.
Also included is a typo fix to a test result message"
* tag 'linux-kselftest-4.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: x86/pkeys: fix spelling mistake: "itertation" -> "iteration"
selftests: do not require bash to run netsocktests testcase
selftests: do not require bash to run bpf tests
selftests: do not require bash for the generated test
Eric Dumazet [Mon, 9 Jan 2017 19:18:01 +0000 (11:18 -0800)]
net: skb_flow_get_be16() can be static
Removes following sparse complain :
net/core/flow_dissector.c:70:8: warning: symbol 'skb_flow_get_be16'
was not declared. Should it be static?
Fixes:
972d3876faa8 ("flow dissector: ICMP support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Jan 2017 16:46:47 +0000 (11:46 -0500)]
Merge branch 'r8152-fix-autosuspend'
Hayes Wang says:
====================
r8152: fix autosuspend issue
Avoid rx is split into two parts when runtime suspend occurs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 10 Jan 2017 09:04:07 +0000 (17:04 +0800)]
r8152: fix rx issue for runtime suspend
Pause the rx and make sure the rx fifo is empty when the autosuspend
occurs.
If the rx data comes when the driver is canceling the rx urb, the host
controller would stop getting the data from the device and continue
it after next rx urb is submitted. That is, one continuing data is
split into two different urb buffers. That let the driver take the
data as a rx descriptor, and unexpected behavior happens.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 10 Jan 2017 09:04:06 +0000 (17:04 +0800)]
r8152: split rtl8152_suspend function
Split rtl8152_suspend() into rtl8152_system_suspend() and
rtl8152_rumtime_suspend().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Tue, 10 Jan 2017 08:30:51 +0000 (09:30 +0100)]
net: socket: Make unnecessarily global sockfs_setattr() static
Make sockfs_setattr() static as it is not used outside of net/socket.c
This fixes the following GCC warning:
net/socket.c:534:5: warning: no previous prototype for ‘sockfs_setattr’ [-Wmissing-prototypes]
Fixes:
86741ec25462 ("net: core: Add a UID field to struct sock.")
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Jan 2017 16:27:46 +0000 (11:27 -0500)]
Merge tag 'wireless-drivers-for-davem-2017-01-10' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.10
Only two fixes at this time. The rtlwifi fix is an important one as it
fixes a reported oops and Linus was already asking about it. The
orinoco fix is not tested on a real device, because it's old legacy
hardware and hardly no-one use it, but it should fix a (theoretical)
issue with VMAP_STACK.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 9 Jan 2017 16:51:32 +0000 (08:51 -0800)]
tcp: do not export tcp_peer_is_proven()
After commit
1fb6f159fd21 ("tcp: add tcp_conn_request"),
tcp_peer_is_proven() no longer needs to be exported.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jean Delvare [Mon, 9 Jan 2017 14:17:27 +0000 (15:17 +0100)]
net: phy: Add Meson GXL PHY hardware dependency
As I understand it the Meson GXL PHY driver is only useful on one
architecture so only make it visible on that architecture.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes:
7334b3e47aee ("net: phy: Add Meson GXL Internal PHY driver")
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Tsyrklevich [Mon, 9 Jan 2017 13:57:48 +0000 (20:57 +0700)]
net/appletalk: Fix kernel memory disclosure
ipddp_route structs contain alignment padding so kernel heap memory
is leaked when they are copied to user space in
ipddp_ioctl(SIOCFINDIPDDPRT). Change kmalloc() to kzalloc() to clear
that memory.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Tikhomirov [Mon, 9 Jan 2017 07:45:49 +0000 (10:45 +0300)]
ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-1
> echo
4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat
-bash: echo: write error: Invalid argument
> echo -
2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-
2147483648
but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER"
v2: simplify to just proc_douintvec
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Alemayhu [Sat, 7 Jan 2017 22:53:00 +0000 (23:53 +0100)]
ipv6: fix typos
o s/approriate/appropriate
o s/discouvery/discovery
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 9 Jan 2017 20:54:20 +0000 (12:54 -0800)]
Merge tag 'drm-fixes-for-v4.10-rc4' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"amdgpu, radeon, msm, meson, tilcdc, drm fixes.
Just back online for a couple of days, gathered up the remaining fixes
pull requests.
This contains fixes for a few ARM platforms (msm, tilcdc, meson), and
one core atomic fix. The AMD pull has some new hardware support
(Polaris12) in it, but this is pretty limited to just hw enablement
and shouldn't cause any problems"
* tag 'drm-fixes-for-v4.10-rc4' of git://people.freedesktop.org/~airlied/linux:
drm/amdgpu: drop verde dpm quirks
drm/radeon: drop verde dpm quirks
drm/radeon: update smc firmware selection for SI
drm/amdgpu: update si kicker smc firmware
drm/amd/powerplay: extend smu's response timeout time.
drm/amdgpu: remove static integer for uvd pp state
drm/amd/amdgpu: add Polaris12 PCI ID
drm/amdgpu/powerplay: add Polaris12 support
drm/amd/amdgpu: add Polaris12 support (v3)
MAINTAINERS: Update mailing list for radeon and amdgpu
drm/meson: Fix CVBS VDAC disable
drm/meson: Fix CVBS initialization when HDMI is configured by bootloader
drm: Clean up planes in atomic commit helper failure path
drm: tilcdc: simplify the recovery from sync lost error on rev1
drm/meson: Fix plane atomic check when no crtc for the plane
drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
drm/msm: Put back the vaddr in submit_reloc()
drm/msm: Ensure that the hardware write pointer is valid
Linus Torvalds [Mon, 9 Jan 2017 20:50:33 +0000 (12:50 -0800)]
Merge tag 'gpio-v4.10-2' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
- move freeing of GPIO hogs to after freeing the device to get rid of a
warning state.
- a small compile warning fix
* tag 'gpio-v4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: Move freeing of GPIO hogs before numbing of the device
gpio: mxs: remove __init annotation
Zefir Kurtisi [Fri, 6 Jan 2017 11:14:48 +0000 (12:14 +0100)]
phy state machine: failsafe leave invalid RUNNING state
While in RUNNING state, phy_state_machine() checks for link changes by
comparing phydev->link before and after calling phy_read_status().
This works as long as it is guaranteed that phydev->link is never
changed outside the phy_state_machine().
If in some setups this happens, it causes the state machine to miss
a link loss and remain RUNNING despite phydev->link being 0.
This has been observed running a dsa setup with a process continuously
polling the link states over ethtool each second (SNMPD RFC-1213
agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
call phy_read_status() and with that modify the link status - and
with that bricking the phy state machine.
This patch adds a fail-safe check while in RUNNING, which causes to
move to CHANGELINK when the link is gone and we are still RUNNING.
Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 9 Jan 2017 19:58:28 +0000 (11:58 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.
2) Fix out of bounds access in nf_tables discovered by KASAN, from
Florian Westphal.
3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.
4) Fix unicast filtering in be2net driver, from Ivan Vecera.
5) tg3_get_stats64() can race with driver close and ethtool
reconfigurations, fix from Michael Chan.
6) Fix error handling when pass limit is reached in bpf code gen on
x86. From Daniel Borkmann.
7) Don't clobber switch ops and use proper MDIO nested reads and writes
in bcm_sf2 driver, from Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
net: dsa: bcm_sf2: Utilize nested MDIO read/write
net: dsa: bcm_sf2: Do not clobber b53_switch_ops
net: stmmac: fix maxmtu assignment to be within valid range
bpf: change back to orig prog on too many passes
tg3: Fix race condition in tg3_get_stats64().
be2net: fix unicast list filling
be2net: fix accesses to unicast list
netlabel: add CALIPSO to the list of built-in protocols
vti6: fix device register to report IFLA_INFO_KIND
net: phy: dp83867: fix irq generation
amd-xgbe: Fix IRQ processing when running in single IRQ mode
sh_eth: R8A7740 supports packet shecksumming
sh_eth: fix EESIPR values for SH77{34|63}
r8169: fix the typo in the comment
nl80211: fix sched scan netlink socket owner destruction
bridge: netfilter: Fix dropping packets that moving through bridge interface
netfilter: ipt_CLUSTERIP: check duplicate config when initializing
netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
netfilter: nf_tables: fix oob access
netfilter: nft_queue: use raw_smp_processor_id()
...
David S. Miller [Mon, 9 Jan 2017 03:01:22 +0000 (22:01 -0500)]
Merge branch 'bcm_sf2-fixes'
Florian Fainelli says:
====================
net: dsa: bcm_sf2: Couple fixes
Here are a couple of fixes for bcm_sf2, please queue these up for -stable
as well, thank you very much!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 05:01:57 +0000 (21:01 -0800)]
net: dsa: bcm_sf2: Utilize nested MDIO read/write
We are implementing a MDIO bus which is behind another one, so use the
nested version of the accessors to get lockdep annotations correct.
Fixes:
461cd1b03e32 ("net: dsa: bcm_sf2: Register our slave MDIO bus")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 05:01:56 +0000 (21:01 -0800)]
net: dsa: bcm_sf2: Do not clobber b53_switch_ops
We make the bcm_sf2 driver override ds->ops which points to
b53_switch_ops since b53_switch_alloc() did the assignent. This is all
well and good until a second b53 switch comes in, and ends up using the
bcm_sf2 operations. Make a proper local copy, substitute the ds->ops
pointer and then override the operations.
Fixes:
f458995b9ad8 ("net: dsa: bcm_sf2: Utilize core B53 driver when possible")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Sun, 8 Jan 2017 23:47:19 +0000 (09:47 +1000)]
Merge branch 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Fixes for 4.10:
- Polaris 12 support
- Add new amd-gfx mailing list to MAINTAINERS file
- UVD clockgating fix
- SI dpm fixes
* 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: drop verde dpm quirks
drm/radeon: drop verde dpm quirks
drm/radeon: update smc firmware selection for SI
drm/amdgpu: update si kicker smc firmware
drm/amd/powerplay: extend smu's response timeout time.
drm/amdgpu: remove static integer for uvd pp state
drm/amd/amdgpu: add Polaris12 PCI ID
drm/amdgpu/powerplay: add Polaris12 support
drm/amd/amdgpu: add Polaris12 support (v3)
MAINTAINERS: Update mailing list for radeon and amdgpu
Kweh, Hock Leong [Sat, 7 Jan 2017 09:32:03 +0000 (17:32 +0800)]
net: stmmac: fix maxmtu assignment to be within valid range
There is no checking valid value of maxmtu when getting it from
device tree. This resolution added the checking condition to
ensure the assignment is made within a valid range.
Signed-off-by: Kweh, Hock Leong <hock.leong.kweh@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Sun, 8 Jan 2017 23:13:20 +0000 (09:13 +1000)]
Merge branch 'msm-fixes-4.10' of git://people.freedesktop.org/~robclark/linux into drm-fixes
A few fixes for 4.10.. the first fixes a long-standing logic bug, that
by luck (ie. size of packets written into RB for a submit) wasn't hit
on a3xx/a4xx but was causing intermittent GPU lockups on a5xx. And a
couple other robustness issues that Jordan noticed.
* 'msm-fixes-4.10' of git://people.freedesktop.org/~robclark/linux:
drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
drm/msm: Put back the vaddr in submit_reloc()
drm/msm: Ensure that the hardware write pointer is valid
Dave Airlie [Sun, 8 Jan 2017 23:13:13 +0000 (09:13 +1000)]
Merge tag 'meson-drm-fixes-for-4.10' of git://people.freedesktop.org/~narmstrong/linux into drm-fixes
- plan atomic check oops fix
- fix CVBS init when HDMI is configured by bootloader
- fix CVBS VDAC disable
* tag 'meson-drm-fixes-for-4.10' of git://people.freedesktop.org/~narmstrong/linux:
drm/meson: Fix CVBS VDAC disable
drm/meson: Fix CVBS initialization when HDMI is configured by bootloader
drm/meson: Fix plane atomic check when no crtc for the plane
Dave Airlie [Sun, 8 Jan 2017 23:13:06 +0000 (09:13 +1000)]
Merge tag 'tilcdc-4.10-fixes' of https://github.com/jsarha/linux into drm-fixes
tilcdc fixes for v4.10.
* tag 'tilcdc-4.10-fixes' of https://github.com/jsarha/linux:
drm: tilcdc: simplify the recovery from sync lost error on rev1
Dave Airlie [Sun, 8 Jan 2017 23:12:45 +0000 (09:12 +1000)]
Merge tag 'drm-misc-fixes-2017-01-04' of git://anongit.freedesktop.org/git/drm-misc into drm-fixes
single drm fix.
* tag 'drm-misc-fixes-2017-01-04' of git://anongit.freedesktop.org/git/drm-misc:
drm: Clean up planes in atomic commit helper failure path
Linus Torvalds [Sun, 8 Jan 2017 22:18:17 +0000 (14:18 -0800)]
Linux 4.10-rc3
Daniel Borkmann [Fri, 6 Jan 2017 23:26:33 +0000 (00:26 +0100)]
bpf: change back to orig prog on too many passes
If after too many passes still no image could be emitted, then
swap back to the original program as we do in all other cases
and don't use the one with blinding.
Fixes:
959a75791603 ("bpf, x86: add support for constant blinding")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 8 Jan 2017 19:42:04 +0000 (11:42 -0800)]
Merge tag 'usb-4.10-rc3' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a bunch of USB fixes for 4.10-rc3. Yeah, it's a lot, an
artifact of the holiday break I think.
Lots of gadget and the usual XHCI fixups for reported issues (one day
that driver will calm down...) Also included are a bunch of usb-serial
driver fixes, and for good measure, a number of much-reported MUSB
driver issues have finally been resolved.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (72 commits)
USB: fix problems with duplicate endpoint addresses
usb: ohci-at91: use descriptor-based gpio APIs correctly
usb: storage: unusual_uas: Add JMicron JMS56x to unusual device
usb: hub: Move hub_port_disable() to fix warning if PM is disabled
usb: musb: blackfin: add bfin_fifo_offset in bfin_ops
usb: musb: fix compilation warning on unused function
usb: musb: Fix trying to free already-free IRQ 4
usb: musb: dsps: implement clear_ep_rxintr() callback
usb: musb: core: add clear_ep_rxintr() to musb_platform_ops
USB: serial: ti_usb_3410_5052: fix NULL-deref at open
USB: serial: spcp8x5: fix NULL-deref at open
USB: serial: quatech2: fix sleep-while-atomic in close
USB: serial: pl2303: fix NULL-deref at open
USB: serial: oti6858: fix NULL-deref at open
USB: serial: omninet: fix NULL-derefs at open and disconnect
USB: serial: mos7840: fix misleading interrupt-URB comment
USB: serial: mos7840: remove unused write URB
USB: serial: mos7840: fix NULL-deref at open
USB: serial: mos7720: remove obsolete port initialisation
USB: serial: mos7720: fix parallel probe
...
Linus Torvalds [Sun, 8 Jan 2017 19:37:44 +0000 (11:37 -0800)]
Merge tag 'char-misc-4.10-rc3' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are a few small char/misc driver fixes for 4.10-rc3.
Two MEI driver fixes, and three NVMEM patches for reported issues, and
a new Hyper-V driver MAINTAINER update. Nothing major at all, all have
been in linux-next with no reported issues"
* tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
hyper-v: Add myself as additional MAINTAINER
nvmem: fix nvmem_cell_read() return type doc
nvmem: imx-ocotp: Fix wrong register size
nvmem: qfprom: Allow single byte accesses for read/write
mei: move write cb to completion on credentials failures
mei: bus: fix mei_cldev_enable KDoc
Linus Torvalds [Sun, 8 Jan 2017 19:22:00 +0000 (11:22 -0800)]
Merge tag 'staging-4.10-rc3' of git://git./linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Greg KH:
"Here are some staging and IIO driver fixes for 4.10-rc3.
Most of these are minor IIO fixes of reported issues, along with one
network driver fix to resolve an issue. And a MAINTAINERS update with
a new mailing list. All of these, except the MAINTAINERS file update,
have been in linux-next with no reported issues (the MAINTAINERS patch
happened on Friday...)"
* tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
MAINTAINERS: add greybus subsystem mailing list
staging: octeon: Call SET_NETDEV_DEV()
iio: accel: st_accel: fix LIS3LV02 reading and scaling
iio: common: st_sensors: fix channel data parsing
iio: max44000: correct value in illuminance_integration_time_available
iio: adc: TI_AM335X_ADC should depend on HAS_DMA
iio: bmi160: Fix time needed to sleep after command execution
iio: 104-quad-8: Fix active level mismatch for the preset enable option
iio: 104-quad-8: Fix off-by-one errors when addressing IOR
iio: 104-quad-8: Fix index control configuration
Michael Chan [Fri, 6 Jan 2017 21:18:53 +0000 (16:18 -0500)]
tg3: Fix race condition in tg3_get_stats64().
The driver's ndo_get_stats64() method is not always called under RTNL.
So it can race with driver close or ethtool reconfigurations. Fix the
race condition by taking tp->lock spinlock in tg3_free_consistent()
when freeing the tp->hw_stats memory block. tg3_get_stats64() is
already taking tp->lock.
Reported-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Fri, 6 Jan 2017 20:59:30 +0000 (21:59 +0100)]
be2net: fix unicast list filling
The adapter->pmac_id[0] item is used for primary MAC address but
this is not true for adapter->uc_list[0] as is assumed in
be_set_uc_list(). There are N UC addresses copied first from net_device
to adapter->uc_list[1..N] and then N UC addresses from
adapter->uc_list[0..N-1] are sent to HW. So the last UC address is never
stored into HW and address 00:00:00:00;00:00 (from uc_list[0]) is used
instead.
Cc: Sathya Perla <sathya.perla@broadcom.com>
Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Cc: Somnath Kotur <somnath.kotur@broadcom.com>
Fixes: b717241 be2net: replace polling with sleeping in the FW completion path
Signed-off-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Weiner [Sat, 7 Jan 2017 00:21:43 +0000 (19:21 -0500)]
mm: workingset: fix use-after-free in shadow node shrinker
Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked like
use-after-free access from the shadow node shrinker.
Dave Jones managed to reproduce the issue with a debug patch applied,
which confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:
WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
Call Trace:
delete_node+0x1e4/0x200
__radix_tree_delete_node+0xd/0x10
shadow_lru_isolate+0xe6/0x220
__list_lru_walk_one.isra.4+0x9b/0x190
list_lru_walk_one+0x23/0x30
scan_shadow_nodes+0x2e/0x40
shrink_slab.part.44+0x23d/0x5d0
shrink_node+0x22c/0x330
kswapd+0x392/0x8f0
This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
inlined radix_tree_shrink().
The problem is with
14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node.
While the reclaimed shadow node itself is unlinked by the shrinker, its
deletion from the tree can cause the left-most leaf node in the tree to
be shrunk. If that happens to be a shadow node as well, we don't unlink
it from the LRU as we should.
Consider this tree, where the s are shadow entries:
root->rnode
|
[0 n]
| |
[s ] [sssss]
Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:
root->rnode
|
[0 ]
|
[s ]
Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:
root->rnode
|
[s ]
The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.
root->rnode
|
s
Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.
Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.
Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.
Fixes:
14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Leech <cleech@redhat.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Sat, 7 Jan 2017 23:37:31 +0000 (15:37 -0800)]
mm: stop leaking PageTables
4.10-rc loadtest (even on x86, and even without THPCache) fails with
"fork: Cannot allocate memory" or some such; and /proc/meminfo shows
PageTables growing.
Commit
953c66c2b22a ("mm: THP page cache support for ppc64") that got
merged in rc1 removed the freeing of an unused preallocated pagetable
after do_fault_around() has called map_pages().
This is usually a good optimization, so that the followup doesn't have
to reallocate one; but it's not sufficient to shift the freeing into
alloc_set_pte(), since there are failure cases (most commonly
VM_FAULT_RETRY) which never reach finish_fault().
Check and free it at the outer level in do_fault(), then we don't need
to worry in alloc_set_pte(), and can restore that to how it was (I
cannot find any reason to pte_free() under lock as it was doing).
And fix a separate pagetable leak, or crash, introduced by the same
change, that could only show up on some ppc64: why does do_set_pmd()'s
failure case attempt to withdraw a pagetable when it never deposited
one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
Residue of an earlier implementation, perhaps? Delete it.
Fixes:
953c66c2b22a ("mm: THP page cache support for ppc64")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>