Xin Long [Mon, 18 Dec 2017 06:20:56 +0000 (14:20 +0800)]
vxlan: update skb dst pmtu on tx path
Unlike ip tunnels, now vxlan doesn't do any pmtu update for
upper dst pmtu, even if it doesn't match the lower dst pmtu
any more.
The problem can be reproduced when reducing the vxlan lower
dev's pmtu when running netperf. In jianlin's testing, the
performance went to 1/7 of the previous.
This patch is to update the upper dst pmtu to match the lower
dst pmtu on tx path so that packets can be sent out even when
lower dev's pmtu has been changed.
It also works for metadata dst.
Note that this patch doesn't process any pmtu icmp packet.
But even in the future, the support for pmtu icmp packets
process of udp tunnels will also needs this.
The same thing will be done for geneve in another patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Kochetkov [Tue, 19 Dec 2017 11:03:57 +0000 (14:03 +0300)]
net: arc_emac: restart stalled EMAC
Under certain conditions EMAC stop reception of incoming packets and
continuously increment R_MISS register instead of saving data into
provided buffer. The commit implement workaround for such situation.
Then the stall detected EMAC will be restarted.
On device the stall looks like the device lost it's dynamic IP address.
ifconfig shows that interface error counter rapidly increments.
At the same time on the DHCP server we can see continues DHCP-requests
from device.
In real network stalls happen really rarely. To make them frequent the
broadcast storm[1] should be simulated. For simulation it is necessary
to make following connections:
1. connect radxarock to 1st port of switch
2. connect some PC to 2nd port of switch
3. connect two other free ports together using standard ethernet cable,
in order to make a switching loop.
After that, is necessary to make a broadcast storm. For example, running on
PC 'ping' to some IP address triggers ARP-request storm. After some
time (~10sec), EMAC on rk3188 will stall.
Observed and tested on rk3188 radxarock.
[1] https://en.wikipedia.org/wiki/Broadcast_radiation
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Kochetkov [Fri, 15 Dec 2017 17:20:06 +0000 (20:20 +0300)]
net: arc_emac: fix arc_emac_rx() error paths
arc_emac_rx() has some issues found by code review.
In case netdev_alloc_skb_ip_align() or dma_map_single() failure
rx fifo entry will not be returned to EMAC.
In case dma_map_single() failure previously allocated skb became
lost to driver. At the same time address of newly allocated skb
will not be provided to EMAC.
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Wang [Mon, 18 Dec 2017 09:00:17 +0000 (17:00 +0800)]
net: mediatek: setup proper state for disabled GMAC on the default
The current solution would setup fixed and force link of 1Gbps to the both
GMAC on the default. However, The GMAC should always be put to link down
state when the GMAC is disabled on certain target boards. Otherwise,
the driver possibly receives unexpected data from the floating hardware
connection through the unused GMAC. Although the driver had been added
certain protection in RX path to get rid of such kind of unexpected data
sent to the upper stack.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 17 Dec 2017 16:16:43 +0000 (17:16 +0100)]
mlxsw: spectrum_router: Remove batch neighbour deletion causing FW bug
This reverts commit
63dd00fa3e524c27cc0509190084ab147ecc8ae2.
RAUHT DELETE_ALL seems to trigger a bug in FW. That manifests by later
calls to RAUHT ADD of an IPv6 neighbor to fail with "bad parameter"
error code.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Fixes:
63dd00fa3e52 ("mlxsw: spectrum_router: Add batch neighbour deletion")
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brian King [Fri, 15 Dec 2017 21:21:50 +0000 (15:21 -0600)]
tg3: Fix rx hang on MTU change with 5717/5719
This fixes a hang issue seen when changing the MTU size from 1500 MTU
to 9000 MTU on both 5717 and 5719 chips. In discussion with Broadcom,
they've indicated that these chipsets have the same phy as the 57766
chipset, so the same workarounds apply. This has been tested by IBM
on both Power 8 and Power 9 systems as well as by Broadcom on x86
hardware and has been confirmed to resolve the hang issue.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Dec 2017 14:39:11 +0000 (09:39 -0500)]
Merge tag 'mac80211-for-davem-2017-12-19' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few more fixes:
* hwsim:
- set To-DS bit in some frames missing it
- fix sleeping in atomic
* nl80211:
- doc cleanup
- fix locking in an error path
* build:
- don't append to created certs C files
- ship certificate pre-hexdumped
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Dec 2017 08:26:17 +0000 (09:26 +0100)]
cfg80211: ship certificates as hex files
Not only does this remove the need for the hexdump code in most
normal kernel builds (still there for the extra directory), but
it also removes the need to ship binary files, which apparently
is somewhat problematic, as Randy reported.
While at it, also add the generated files to clean-files.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Jonathan Corbet [Mon, 11 Dec 2017 22:37:49 +0000 (15:37 -0700)]
nl80211: Remove obsolete kerneldoc line
Commit
ca986ad9bcd3 (nl80211: allow multiple active scheduled scan
requests) removed WIPHY_FLAG_SUPPORTS_SCHED_SCAN but left the kerneldoc
description in place, leading to this docs-build warning:
./include/net/cfg80211.h:3278: warning: Excess enum value
'WIPHY_FLAG_SUPPORTS_SCHED_SCAN' description in 'wiphy_flags'
Remove the line and gain a bit of peace.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Jia-Ju Bai [Tue, 12 Dec 2017 09:26:36 +0000 (17:26 +0800)]
mac80211_hwsim: Fix a possible sleep-in-atomic bug in hwsim_get_radio_nl
The driver may sleep under a spinlock.
The function call path is:
hwsim_get_radio_nl (acquire the spinlock)
nlmsg_new(GFP_KERNEL) --> may sleep
To fix it, GFP_KERNEL is replaced with GFP_ATOMIC.
This bug is found by my static analysis tool(DSAC) and checked by my code review.
Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Thierry Reding [Thu, 14 Dec 2017 13:33:38 +0000 (14:33 +0100)]
cfg80211: always rewrite generated files from scratch
Currently the certs C code generation appends to the generated files,
which is most likely a leftover from commit
715a12334764 ("wireless:
don't write C files on failures"). This causes duplicate code in the
generated files if the certificates have their timestamps modified
between builds and thereby trigger the generation rules.
Fixes:
715a12334764 ("wireless: don't write C files on failures")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Adiel Aloni [Mon, 18 Dec 2017 10:14:04 +0000 (12:14 +0200)]
mac80211_hwsim: enable TODS BIT in null data frame
Same as in ieee80211_nullfunc_get, enable the TODS bit, otherwise the
nullfunc packet will not be handled in ap rx path.
(will be dropped in ieee80211_accept_frame()).
Signed-off-by: Adiel Aloni <adiel.aloni@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Alexey Khoroshilov [Fri, 15 Dec 2017 21:52:39 +0000 (00:52 +0300)]
net: phy: xgene: disable clk on error paths
There are several error paths in xgene_mdio_probe(),
where clk is left undisabled. The patch fixes them.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Fri, 15 Dec 2017 16:10:20 +0000 (16:10 +0000)]
net: phy: marvell: avoid pause mode on SGMII-to-Copper for 88e151x
Observed on the 88e1512 in SGMII-to-Copper mode, negotiating pause
is unreliable. While the pause bits can be set in the advertisment
register, they clear shortly after negotiation with a link partner
commences irrespective of the cause of the negotiation.
While these bits may be correctly conveyed to the link partner on the
first negotiation, a subsequent negotiation (eg, due to negotiation
restart by the link partner, or reconnection of the cable) will result
in the link partner seeing these bits as zero, while the kernel
believes that it has advertised pause modes.
This leads to the local kernel evaluating (eg) symmetric pause mode,
while the remote end evaluates that we have no pause mode capability.
Since we can't guarantee the advertisment, disable pause mode support
with this PHY when used in SGMII-to-Copper mode.
The 88e1510 in RGMII-to-Copper mode appears to behave correctly.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 18 Dec 2017 15:35:09 +0000 (17:35 +0200)]
net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks
The early call to br_stp_change_bridge_id in bridge's newlink can cause
a memory leak if an error occurs during the newlink because the fdb
entries are not cleaned up if a different lladdr was specified, also
another minor issue is that it generates fdb notifications with
ifindex = 0. Another unrelated memory leak is the bridge sysfs entries
which get added on NETDEV_REGISTER event, but are not cleaned up in the
newlink error path. To remove this special case the call to
br_stp_change_bridge_id is done after netdev register and we cleanup the
bridge on changelink error via br_dev_delete to plug all leaks.
This patch makes netlink bridge destruction on newlink error the same as
dellink and ioctl del which is necessary since at that point we have a
fully initialized bridge device.
To reproduce the issue:
$ ip l add br0 address 00:11:22:33:44:55 type bridge group_fwd_mask 1
RTNETLINK answers: Invalid argument
$ rmmod bridge
[ 1822.142525] =============================================================================
[ 1822.143640] BUG bridge_fdb_cache (Tainted: G O ): Objects remaining in bridge_fdb_cache on __kmem_cache_shutdown()
[ 1822.144821] -----------------------------------------------------------------------------
[ 1822.145990] Disabling lock debugging due to kernel taint
[ 1822.146732] INFO: Slab 0x0000000092a844b2 objects=32 used=2 fp=0x00000000fef011b0 flags=0x1ffff8000000100
[ 1822.147700] CPU: 2 PID: 13584 Comm: rmmod Tainted: G B O 4.15.0-rc2+ #87
[ 1822.148578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1822.150008] Call Trace:
[ 1822.150510] dump_stack+0x78/0xa9
[ 1822.151156] slab_err+0xb1/0xd3
[ 1822.151834] ? __kmalloc+0x1bb/0x1ce
[ 1822.152546] __kmem_cache_shutdown+0x151/0x28b
[ 1822.153395] shutdown_cache+0x13/0x144
[ 1822.154126] kmem_cache_destroy+0x1c0/0x1fb
[ 1822.154669] SyS_delete_module+0x194/0x244
[ 1822.155199] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1822.155773] entry_SYSCALL_64_fastpath+0x23/0x9a
[ 1822.156343] RIP: 0033:0x7f929bd38b17
[ 1822.156859] RSP: 002b:
00007ffd160e9a98 EFLAGS:
00000202 ORIG_RAX:
00000000000000b0
[ 1822.157728] RAX:
ffffffffffffffda RBX:
00005578316ba090 RCX:
00007f929bd38b17
[ 1822.158422] RDX:
00007f929bd9ec60 RSI:
0000000000000800 RDI:
00005578316ba0f0
[ 1822.159114] RBP:
0000000000000003 R08:
00007f929bff5f20 R09:
00007ffd160e8a11
[ 1822.159808] R10:
00007ffd160e9860 R11:
0000000000000202 R12:
00007ffd160e8a80
[ 1822.160513] R13:
0000000000000000 R14:
0000000000000000 R15:
00005578316ba090
[ 1822.161278] INFO: Object 0x000000007645de29 @offset=0
[ 1822.161666] INFO: Object 0x00000000d5df2ab5 @offset=128
Fixes:
30313a3d5794 ("bridge: Handle IFLA_ADDRESS correctly when creating bridge device")
Fixes:
5b8d5429daa0 ("bridge: netlink: register netdevice before executing changelink")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Dec 2017 06:13:17 +0000 (14:13 +0800)]
sctp: add SCTP_CID_RECONF conversion in sctp_cname
Whenever a new type of chunk is added, the corresp conversion in
sctp_cname should be added. Otherwise, in some places, pr_debug
will print it as "unknown chunk".
Fixes:
cc16f00f6529 ("sctp: add support for generating stream reconf ssn reset request chunk")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Dec 2017 06:07:25 +0000 (14:07 +0800)]
sctp: fix the issue that a __u16 variable may overflow in sctp_ulpq_renege
Now when reneging events in sctp_ulpq_renege(), the variable freed
could be increased by a __u16 value twice while freed is of __u16
type. It means freed may overflow at the second addition.
This patch is to fix it by using __u32 type for 'freed', while at
it, also to remove 'if (chunk)' check, as all renege commands are
generated in sctp_eat_data and it can't be NULL.
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hemanth Puranik [Mon, 18 Dec 2017 05:57:47 +0000 (11:27 +0530)]
net: qcom/emac: Change the order of mac up and sgmii open
This patch fixes the order of mac_up and sgmii_open for the
reasons noted below:
- If open takes more time(if the SGMII block is not responding or
if we want to do some delay based task) in this situation we
will hit NETDEV watchdog
- The main reason : We should signal to upper layers that we are
ready to receive packets "only" when the entire path is initialized
not the other way around, this is followed in the reset path where
we do mac_down, sgmii_reset and mac_up. This also makes the driver
uniform across the reset and open paths.
- In the future there may be need for delay based tasks to be done in
sgmii open which will result in NETDEV watchdog
- As per the documentation the order of init should be sgmii, mac, rings
and DMA
Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
Acked-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhao Qiang [Mon, 18 Dec 2017 02:26:43 +0000 (10:26 +0800)]
net: phy: marvell: Limit 88m1101 autoneg errata to 88E1145 as well.
88E1145 also need this autoneg errata.
Fixes:
f2899788353c ("net: phy: marvell: Limit errata to 88m1101")
Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Mon, 18 Dec 2017 17:13:34 +0000 (18:13 +0100)]
tipc: remove leaving group member from all lists
A group member going into state LEAVING should never go back to any
other state before it is finally deleted. However, this might happen
if the socket needs to send out a RECLAIM message during this interval.
Since we forget to remove the leaving member from the group's 'active'
or 'pending' list, the member might be selected for reclaiming, change
state to RECLAIMING, and get stuck in this state instead of being
deleted. This might lead to suppression of the expected 'member down'
event to the receiver.
We fix this by removing the member from all lists, except the RB tree,
at the moment it goes into state LEAVING.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Mon, 18 Dec 2017 16:34:16 +0000 (17:34 +0100)]
tipc: fix lost member events bug
Group messages are not supposed to be returned to sender when the
destination socket disappears. This is done correctly for regular
traffic messages, by setting the 'dest_droppable' bit in the header.
But we forget to do that in group protocol messages. This has the effect
that such messages may sometimes bounce back to the sender, be perceived
as a legitimate peer message, and wreak general havoc for the rest of
the session. In particular, we have seen that a member in state LEAVING
may go back to state RECLAIMED or REMITTED, hence causing suppression
of an otherwise expected 'member down' event to the user.
We fix this by setting the 'dest_droppable' bit even in group protocol
messages.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 18 Dec 2017 15:49:22 +0000 (10:49 -0500)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-17
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a corner case in generic XDP where we have non-linear skbs
but enough tailroom in the skb to not miss to linearizing there,
from Song.
2) Fix BPF JIT bugs in s390x and ppc64 to not recache skb data when
BPF context is not skb, from Daniel.
3) Fix a BPF JIT bug in sparc64 where recaching skb data after helper
call would use the wrong register for the skb, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Kodanev [Thu, 14 Dec 2017 17:20:00 +0000 (20:20 +0300)]
vxlan: restore dev->mtu setting based on lower device
Stefano Brivio says:
Commit
a985343ba906 ("vxlan: refactor verification and
application of configuration") introduced a change in the
behaviour of initial MTU setting: earlier, the MTU for a link
created on top of a given lower device, without an initial MTU
specification, was set to the MTU of the lower device minus
headroom as a result of this path in vxlan_dev_configure():
if (!conf->mtu)
dev->mtu = lowerdev->mtu -
(use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM);
which is now gone. Now, the initial MTU, in absence of a
configured value, is simply set by ether_setup() to ETH_DATA_LEN
(1500 bytes).
This breaks userspace expectations in case the MTU of
the lower device is higher than 1500 bytes minus headroom.
This patch restores the previous behaviour on newlink operation. Since
max_mtu can be negative and we update dev->mtu directly, also check it
for valid minimum.
Reported-by: Junhan Yan <juyan@redhat.com>
Fixes:
a985343ba906 ("vxlan: refactor verification and application of configuration")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brendan McGrath [Wed, 13 Dec 2017 11:14:57 +0000 (22:14 +1100)]
ipv6: icmp6: Allow icmp messages to be looped back
One example of when an ICMPv6 packet is required to be looped back is
when a host acts as both a Multicast Listener and a Multicast Router.
A Multicast Router will listen on address ff02::16 for MLDv2 messages.
Currently, MLDv2 messages originating from a Multicast Listener running
on the same host as the Multicast Router are not being delivered to the
Multicast Router. This is due to dst.input being assigned the default
value of dst_discard.
This results in the packet being looped back but discarded before being
delivered to the Multicast Router.
This patch sets dst.input to ip6_input to ensure a looped back packet
is delivered to the Multicast Router.
Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 16 Dec 2017 21:43:08 +0000 (13:43 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"More fixes from testing done on the rc kernel, including more SELinux
testing. Looking forward, lockdep found regression today in ipoib
which is still being fixed.
Summary:
- Fix for SELinux on the umad SMI path. Some old hardware does not
fill the PKey properly exposing another bug in the newer SELinux
code.
- Check the input port as we can exceed array bounds from this user
supplied value
- Users are unable to use the hash field support as they want due to
incorrect checks on the field restrictions, correct that so the
feature works as intended
- User triggerable oops in the NETLINK_RDMA handler
- cxgb4 driver fix for a bad interaction with CQ flushing in iser
caused by patches in this merge window, and bad CQ flushing during
normal close.
- Unbalanced memalloc_noio in ipoib in an error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/ipoib: Restore MM behavior in case of tx_ring allocation failure
iw_cxgb4: only insert drain cqes if wq is flushed
iw_cxgb4: only clear the ARMED bit if a notification is needed
RDMA/netlink: Fix general protection fault
IB/mlx4: Fix RSS hash fields restrictions
IB/core: Don't enforce PKey security on SMI MADs
IB/core: Bound check alternate path port number
Linus Torvalds [Sat, 16 Dec 2017 21:34:38 +0000 (13:34 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two bugfixes for the AT24 I2C eeprom driver and some minor corrections
for I2C bus drivers"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: piix4: Fix port number check on release
i2c: stm32: Fix copyrights
i2c-cht-wc: constify platform_device_id
eeprom: at24: change nvmem stride to 1
eeprom: at24: fix I2C device selection for runtime PM
Linus Torvalds [Sat, 16 Dec 2017 21:12:53 +0000 (13:12 -0800)]
Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"This has two stable bugfixes, one to fix a BUG_ON() when
nfs_commit_inode() is called with no outstanding commit requests and
another to fix a race in the SUNRPC receive codepath.
Additionally, there are also fixes for an NFS client deadlock and an
xprtrdma performance regression.
Summary:
Stable bugfixes:
- NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
commit in the case that there were no commit requests.
- SUNRPC: Fix a race in the receive code path
Other fixes:
- NFS: Fix a deadlock in nfs client initialization
- xprtrdma: Fix a performance regression for small IOs"
* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Fix a race in the receive code path
nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
xprtrdma: Spread reply processing over more CPUs
nfs: fix a deadlock in nfs client initialization
Linus Torvalds [Sat, 16 Dec 2017 02:53:22 +0000 (18:53 -0800)]
Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"
This reverts commits
5c9d2d5c269c,
c7da82b894e9, and
e7fe7b5cae90.
We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.
Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was. Dave
Hansen says:
"So, the underlying bug here is that we now a get_user_pages_remote()
and then go ahead and do the p*_access_permitted() checks against the
current PKRU. This was introduced recently with the addition of the
new p??_access_permitted() calls.
We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them. This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*. I also didn't
explicitly test it against something where a COW needed to be done"
It's also not entirely clear that it makes sense to check the protection
key bits at this level at all. But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.
We'll see.
Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back. Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit
e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 15 Dec 2017 21:08:37 +0000 (13:08 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
Linus Torvalds [Fri, 15 Dec 2017 21:03:25 +0000 (13:03 -0800)]
Merge tag 'usb-4.15-rc4' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes for 4.15-rc4.
There is the usual handful gadget/dwc2/dwc3 fixes as always, for
reported issues. But the most important things in here is the core fix
from Alan Stern to resolve a nasty security bug (my first attempt is
reverted, Alan's was much cleaner), as well as a number of usbip fixes
from Shuah Khan to resolve those reported security issues.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: core: prevent malicious bNumInterfaces overflow
Revert "USB: core: only clean up what we allocated"
USB: core: only clean up what we allocated
Revert "usb: gadget: allow to enable legacy drivers without USB_ETH"
usb: gadget: webcam: fix V4L2 Kconfig dependency
usb: dwc2: Fix TxFIFOn sizes and total TxFIFO size issues
usb: dwc3: gadget: Fix PCM1 for ISOC EP with ep->mult less than 3
usb: dwc3: of-simple: set dev_pm_ops
usb: dwc3: of-simple: fix missing clk_disable_unprepare
usb: dwc3: gadget: Wait longer for controller to end command processing
usb: xhci: fix TDS for MTK xHCI1.1
xhci: Don't add a virt_dev to the devs array before it's fully allocated
usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
usbip: prevent vhci_hcd driver from leaking a socket pointer address
usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
usbip: fix stub_rx: get_pipe() to validate endpoint number
tools/usbip: fixes potential (minor) "buffer overflow" (detected on recent gcc with -Werror)
USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
usb: musb: da8xx: fix babble condition handling
Linus Torvalds [Fri, 15 Dec 2017 20:59:48 +0000 (12:59 -0800)]
Merge tag 'staging-4.15-rc4' of git://git./linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are some small staging driver fixes for 4.15-rc4.
One patch for the ccree driver to prevent an unitialized value from
being returned to a caller, and the other fixes a logic error in the
pi433 driver"
* tag 'staging-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: pi433: Fixes issue with bit shift in rf69_get_modulation
staging: ccree: Uninitialized return in ssi_ahash_import()
Linus Torvalds [Fri, 15 Dec 2017 20:56:23 +0000 (12:56 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio regression fixes from Michael Tsirkin:
"Fixes two issues in the latest kernel"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_mmio: fix devm cleanup
ptr_ring: fix up after recent ptr_ring changes
Linus Torvalds [Fri, 15 Dec 2017 20:53:37 +0000 (12:53 -0800)]
Merge tag 'for-4.15/dm-fixes' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- fix various targets to dm_register_target after module __init
resources created; otherwise racing lvm2 commands could result in a
NULL pointer during initialization of associated DM kernel module.
- fix regression in bio-based DM multipath queue_if_no_path handling.
- fix DM bufio's shrinker to reclaim more than one buffer per scan.
* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
dm mpath: fix bio-based multipath queue_if_no_path handling
dm: fix various targets to dm_register_target after module __init resources created
dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
Linus Torvalds [Fri, 15 Dec 2017 20:51:42 +0000 (12:51 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The most important one is the bfa fix because it's easy to oops the
kernel with this driver (this includes the commit that corrects the
compiler warning in the original), a regression in the new timespec
conversion in aacraid and a regression in the Fibre Channel ELS
handling patch.
The other three are a theoretical problem with termination in the
vendor/host matching code and a use after free in lpfc.
The additional patches are a fix for an I/O hang in the mq code under
certain circumstances and a rare oops in some debugging code"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: Fix a scsi_show_rq() NULL pointer dereference
scsi: MAINTAINERS: change FCoE list to linux-scsi
scsi: libsas: fix length error in sas_smp_handler()
scsi: bfa: fix type conversion warning
scsi: core: run queue if SCSI device queue isn't ready and queue is idle
scsi: scsi_devinfo: cleanly zero-pad devinfo strings
scsi: scsi_devinfo: handle non-terminated strings
scsi: bfa: fix access to bfad_im_port_s
scsi: aacraid: address UBSAN warning regression
scsi: libfc: fix ELS request handling
scsi: lpfc: Use after free in lpfc_rq_buf_free()
Linus Torvalds [Fri, 15 Dec 2017 20:49:54 +0000 (12:49 -0800)]
Merge tag 'mmc-v4.15-rc2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"A couple of MMC fixes:
- fix use of uninitialized drv_typ variable
- apply NO_CMD23 quirk to some specific SD cards to make them work"
* tag 'mmc-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: apply NO_CMD23 quirk to some specific cards
mmc: core: properly init drv_type
Linus Torvalds [Fri, 15 Dec 2017 20:48:27 +0000 (12:48 -0800)]
Merge tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"CephFS inode trimming fix from Zheng, marked for stable"
* tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client:
ceph: drop negative child dentries before try pruning inode's alias
Linus Torvalds [Fri, 15 Dec 2017 20:46:48 +0000 (12:46 -0800)]
Merge branch 'overlayfs-linus' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
- fix incomplete syncing of filesystem
- fix regression in readdir on ovl over 9p
- only follow redirects when needed
- misc fixes and cleanups
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix overlay: warning prefix
ovl: Use PTR_ERR_OR_ZERO()
ovl: Sync upper dirty data when syncing overlayfs
ovl: update ctx->pos on impure dir iteration
ovl: Pass ovl_get_nlink() parameters in right order
ovl: don't follow redirects if redirect_dir=off
Hemanth Puranik [Fri, 15 Dec 2017 14:35:58 +0000 (20:05 +0530)]
net: qcom/emac: Reduce timeout for mdio read/write
Currently mdio read/write takes around ~115us as the timeout
between status check is set to 100us.
By reducing the timeout to 1us mdio read/write takes ~15us to
complete. This improves the link up event response.
Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
Acked-by: Timur Tabi <timur@codeaurora.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 15 Dec 2017 20:44:49 +0000 (12:44 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There are some significant fixes in here for FP state corruption,
hardware access/dirty PTE corruption and an erratum workaround for the
Falkor CPU.
I'm hoping that things finally settle down now, but never say never...
Summary:
- Fix FPSIMD context switch regression introduced in -rc2
- Fix ABI break with SVE CPUID register reporting
- Fix use of uninitialised variable
- Fixes to hardware access/dirty management and sanity checking
- CPU erratum workaround for Falkor CPUs
- Fix reporting of writeable+executable mappings
- Fix signal reporting for RAS errors"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: fpsimd: Fix copying of FP state from signal frame into task struct
arm64/sve: Report SVE to userspace via CPUID only if supported
arm64: fix CONFIG_DEBUG_WX address reporting
arm64: fault: avoid send SIGBUS two times
arm64: hw_breakpoint: Use linux/uaccess.h instead of asm/uaccess.h
arm64: Add software workaround for Falkor erratum 1041
arm64: Define cputype macros for Falkor CPU
arm64: mm: Fix false positives in set_pte_at access/dirty race detection
arm64: mm: Fix pte_mkclean, pte_mkdirty semantics
arm64: Initialise high_memory global variable earlier
Jiri Pirko [Fri, 15 Dec 2017 11:40:13 +0000 (12:40 +0100)]
net: sched: fix static key imbalance in case of ingress/clsact_init error
Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.
Fixes:
6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 15 Dec 2017 11:40:12 +0000 (12:40 +0100)]
net: sched: fix clsact init error path
Since in qdisc_create, the destroy op is called when init fails, we
don't do cleanup in init and leave it up to destroy.
This fixes use-after-free when trying to put already freed block.
Fixes:
6e40cf2d4dee ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 15 Dec 2017 20:14:33 +0000 (12:14 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- fix the s2ram regression related to confusion around segment
register restoration, plus related cleanups that make the code more
robust
- a guess-unwinder Kconfig dependency fix
- an isoimage build target fix for certain tool chain combinations
- instruction decoder opcode map fixes+updates, and the syncing of
the kernel decoder headers to the objtool headers
- a kmmio tracing fix
- two 5-level paging related fixes
- a topology enumeration fix on certain SMP systems"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
x86/decoder: Fix and update the opcodes map
x86/power: Make restore_processor_context() sane
x86/power/32: Move SYSENTER MSR restoration to fix_processor_context()
x86/power/64: Use struct desc_ptr for the IDT in struct saved_context
x86/unwinder/guess: Prevent using CONFIG_UNWINDER_GUESS=y with CONFIG_STACKDEPOT=y
x86/build: Don't verify mtools configuration file for isoimage
x86/mm/kmmio: Fix mmiotrace for page unaligned addresses
x86/boot/compressed/64: Print error if 5-level paging is not supported
x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
Linus Torvalds [Fri, 15 Dec 2017 19:44:59 +0000 (11:44 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
Linus Torvalds [Fri, 15 Dec 2017 19:40:24 +0000 (11:40 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a crash fix for an ARM SoC platform, and kernel-doc
warnings fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Do not pull from current CPU if only one CPU to pull
sched/core: Fix kernel-doc warnings after code movement
Linus Torvalds [Fri, 15 Dec 2017 19:36:20 +0000 (11:36 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf tooling fix from Ingo Molnar:
"Synchronize kernel <-> tooling headers to resolve two build warnings
in the perf build"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/headers: Synchronize kernel <-> tooling headers
Linus Torvalds [Fri, 15 Dec 2017 19:34:29 +0000 (11:34 -0800)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull early_ioremap fix from Ingo Molnar:
"A boot hang fix when the EFI earlyprintk driver is enabled"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep
Linus Torvalds [Fri, 15 Dec 2017 19:32:09 +0000 (11:32 -0800)]
Merge tag 'for-linus-4.15-rc4-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two minor fixes for running as Xen dom0:
- when built as 32 bit kernel on large machines the Xen LAPIC
emulation should report a rather modern LAPIC in order to support
enough APIC-Ids
- The Xen LAPIC emulation is needed for dom0 only, so build it only
for kernels supporting to run as Xen dom0"
* tag 'for-linus-4.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: XEN_ACPI_PROCESSOR is Dom0-only
x86/Xen: don't report ancient LAPIC version
Trond Myklebust [Fri, 15 Dec 2017 02:24:08 +0000 (21:24 -0500)]
SUNRPC: Fix a race in the receive code path
We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
race with the call to xprt_complete_rqst().
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
Fixes:
ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Scott Mayhew [Fri, 8 Dec 2017 21:00:12 +0000 (16:00 -0500)]
nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:
[ 1917.130762] kernel BUG at fs/inode.c:578!
[ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
[ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G ------------ T 3.10.0-768.el7.ppc64 #1
[ 1917.130810] task:
c0000005ecd88040 ti:
c00000004cea0000 task.ti:
c00000004cea0000
[ 1917.130813] NIP:
c000000000354178 LR:
c000000000354160 CTR:
c00000000012db80
[ 1917.130816] REGS:
c00000004cea3720 TRAP: 0700 Tainted: G ------------ T (3.10.0-768.el7.ppc64)
[ 1917.130820] MSR:
8000000100029032 <SF,EE,ME,IR,DR,RI> CR:
22002822 XER:
20000000
[ 1917.130828] CFAR:
c00000000011f594 SOFTE: 1
GPR00:
c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
GPR04:
000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
GPR08:
0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
GPR12:
c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
GPR16:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24:
0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
GPR28:
d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
[ 1917.130867] NIP [
c000000000354178] .evict+0x1c8/0x350
[ 1917.130871] LR [
c000000000354160] .evict+0x1b0/0x350
[ 1917.130873] Call Trace:
[ 1917.130876] [
c00000004cea39a0] [
c000000000354160] .evict+0x1b0/0x350 (unreliable)
[ 1917.130880] [
c00000004cea3a30] [
c0000000003558cc] .evict_inodes+0x13c/0x270
[ 1917.130884] [
c00000004cea3af0] [
c000000000327d20] .kill_anon_super+0x70/0x1e0
[ 1917.130896] [
c00000004cea3b80] [
d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
[ 1917.130900] [
c00000004cea3c00] [
c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
[ 1917.130903] [
c00000004cea3c80] [
c00000000035ba54] .cleanup_mnt+0xd4/0x180
[ 1917.130907] [
c00000004cea3d10] [
c000000000119034] .task_work_run+0x114/0x150
[ 1917.130912] [
c00000004cea3db0] [
c00000000001ba6c] .do_notify_resume+0xcc/0x100
[ 1917.130916] [
c00000004cea3e30] [
c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
[ 1917.130919] Instruction dump:
[ 1917.130921]
7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
[ 1917.130927]
694a0060 7d4a0074 794ad182 694a0001 <
0b0a0000>
892d02a4 2f890000 40de0134
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Mon, 4 Dec 2017 19:04:04 +0000 (14:04 -0500)]
xprtrdma: Spread reply processing over more CPUs
Commit
d8f532d20ee4 ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.
Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.
The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.
This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.
So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.
Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.
Fixes:
d8f532d20ee4 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Scott Mayhew [Tue, 5 Dec 2017 18:55:44 +0000 (13:55 -0500)]
nfs: fix a deadlock in nfs client initialization
The following deadlock can occur between a process waiting for a client
to initialize in while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:
Process 1 Process 2
--------- ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
&nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTB->cl_share_link,
&nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
mutex_lock(&nfs_clid_init_mutex);
nfs41_walk_client_list(clp, result, cred);
nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)
Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.
This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Haishuang Yan [Fri, 15 Dec 2017 02:46:16 +0000 (10:46 +0800)]
ip_gre: fix wrong return value of erspan_rcv
If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM.
Fixes:
84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniele Palmas [Thu, 14 Dec 2017 15:56:14 +0000 (16:56 +0100)]
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
This patch adds support for Telit ME910 PID 0x1101.
Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 15 Dec 2017 18:35:37 +0000 (13:35 -0500)]
Merge branch 'net-sched-Make-qdisc-offload-uapi-uniform'
Yuval Mintz says:
====================
net: sched: Make qdisc offload uapi uniform
Several qdiscs can already be offloaded to hardware, but there's an
inconsistecy in regard to the uapi through which they indicate such
an offload is taking place - indication is passed to the user via
TCA_OPTIONS where each qdisc retains private logic for setting it.
The recent addition of offloading to RED in
602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") caused
the addition of yet another uapi field for this purpose -
TC_RED_OFFLOADED.
For clarity and prevention of bloat in the uapi we want to eliminate
said added uapi, replacing it with a common mechanism that can be used
to reflect offload status of the various qdiscs.
The first patch introduces TCA_HW_OFFLOAD as the generic message meant
for this purpose. The second changes the current RED implementation into
setting the internal bits necessary for passing it, and the third removes
TC_RED_OFFLOADED as its no longer needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:31 +0000 (15:54 +0200)]
pkt_sched: Remove TC_RED_OFFLOADED from uapi
Following the previous patch, RED is now using the new uniform uapi
for indicating it's offloaded. As a result, TC_RED_OFFLOADED is no
longer utilized by kernel and can be removed [as it's still not
part of any stable release].
Fixes:
602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc")
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:30 +0000 (15:54 +0200)]
net: sched: Move to new offload indication in RED
Let RED utilize the new internal flag, TCQ_F_OFFLOADED,
to mark a given qdisc as offloaded instead of using a dedicated
indication.
Also, change internal logic into looking at said flag when possible.
Fixes:
602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc")
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:29 +0000 (15:54 +0200)]
net: sched: Add TCA_HW_OFFLOAD
Qdiscs can be offloaded to HW, but current implementation isn't uniform.
Instead, qdiscs either pass information about offload status via their
TCA_OPTIONS or omit it altogether.
Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform
uAPI for the offloading status of qdiscs.
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 15 Dec 2017 17:46:43 +0000 (12:46 -0500)]
Merge branch 'aquantia-fixes'
Igor Russkikh says:
====================
net: aquantia: Atlantic driver 12/2017 updates
The patchset contains important hardware fix for machines with large MRRS
and couple of improvement in stats and capabilities reporting
patch v3:
- Fixed patch #7 after Andrew's finding. NIC level stats actually
have to be cleaned only on hw struct creation (and this is done
in kzalloc). On each hwinit we only have to reset link state
to make sure hw stats update will not increment nic stats during init.
patch v2:
- split into more detailed commits
Comment from David on wrong defines case will be submitted separately later
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:48 +0000 (12:34 +0300)]
net: aquantia: Increment driver version
Add a suffix to distinguish kernel mainline version and aquantia releases
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:47 +0000 (12:34 +0300)]
net: aquantia: Fix typo in ethtool statistics names
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:46 +0000 (12:34 +0300)]
net: aquantia: Update hw counters on hw init
On very first start we should read out current HW counter values
to make diff based calculations later.
This also should be done each time NIC gets down/up or wakes up
after sleep state. We reset link state explicitly to prevent diffs
from being summed this first time.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:45 +0000 (12:34 +0300)]
net: aquantia: Improve link state and statistics check interval callback
Reduce timeout from 2 secs to 1 sec. If link is down,
reduce it to 500msec. This speeds up link detection.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:44 +0000 (12:34 +0300)]
net: aquantia: Fill in multicast counter in ndev stats from hardware
This metric comes from HW and is also diff-calculated, like other counters
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:43 +0000 (12:34 +0300)]
net: aquantia: Fill ndev stat couters from hardware
Originally they were filled from ring sw counters.
These sometimes incorrectly calculate byte and packet amounts
when using LRO/LSO and jumboframes. Filling ndev counters from
hardware makes them precise.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:42 +0000 (12:34 +0300)]
net: aquantia: Extend stat counters to 64bit values
Device hardware provides only 32bit counters. Using these directly
causes byte counters to overflow soon. A separate nic level structure
with 64 bit counters is now used to collect incrementally all the stats
and report these counters to ethtool stats and ndev stats.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:41 +0000 (12:34 +0300)]
net: aquantia: Fix hardware DMA stream overload on large MRRS
Systems with large MRRS on device (2K, 4K) with high data rates and/or
large MTU, atlantic observes DMA packet buffer overflow. On some systems
that causes PCIe transaction errors, hardware NMIs or datapath freeze.
This patch
1) Limits MRRS from device side to 2K (thats maximum our hardware supports)
2) Limit maximum size of outstanding TX DMA data read requests. This makes
hardware buffers running fine.
Signed-off-by: Pavel Belous <pavel.belous@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:40 +0000 (12:34 +0300)]
net: aquantia: Fix actual speed capabilities reporting
Different hardware device Ids correspond to different maximum speed
available. Extra checks were added for devices D108 and D109 to
remove unsupported speeds from these device capabilities list.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Fri, 15 Dec 2017 17:19:36 +0000 (09:19 -0800)]
Merge branch 'bpf-jit-fixes'
Daniel Borkmann says:
====================
Two fixes that deal with buggy usage of bpf_helper_changes_pkt_data()
in the sense that they also reload cached skb data when there's no
skb context but xdp one, for example. A fix where skb meta data is
reloaded out of the wrong register on helper call, rest is test cases
and making sure on verifier side that there's always the guarantee
that ctx sits in r1. Thanks!
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 14 Dec 2017 20:07:27 +0000 (21:07 +0100)]
bpf: add test case for ld_abs and helper changing pkt data
Add a test that i) uses LD_ABS, ii) zeroing R6 before call, iii) calls
a helper that triggers reload of cached skb data, iv) uses LD_ABS again.
It's added for test_bpf in order to do runtime testing after JITing as
well as test_verifier to test that the sequence is allowed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 14 Dec 2017 20:07:26 +0000 (21:07 +0100)]
bpf, sparc: fix usage of wrong reg for load_skb_regs after call
When LD_ABS/IND is used in the program, and we have a BPF helper
call that changes packet data (bpf_helper_changes_pkt_data() returns
true), then in case of sparc JIT, we try to reload cached skb data
from bpf2sparc[BPF_REG_6]. However, there is no such guarantee or
assumption that skb sits in R6 at this point, all helpers changing
skb data only have a guarantee that skb sits in R1. Therefore,
store BPF R1 in L7 temporarily and after procedure call use L7 to
reload cached skb data. skb sitting in R6 is only true at the time
when LD_ABS/IND is executed.
Fixes:
7a12b5031c6b ("sparc64: Add eBPF JIT.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 14 Dec 2017 20:07:25 +0000 (21:07 +0100)]
bpf: guarantee r1 to be ctx in case of bpf_helper_changes_pkt_data
Some JITs don't cache skb context on stack in prologue, so when
LD_ABS/IND is used and helper calls yield bpf_helper_changes_pkt_data()
as true, then they temporarily save/restore skb pointer. However,
the assumption that skb always has to be in r1 is a bit of a
gamble. Right now it turned out to be true for all helpers listed
in bpf_helper_changes_pkt_data(), but lets enforce that from verifier
side, so that we make this a guarantee and bail out if the func
proto is misconfigured in future helpers.
In case of BPF helper calls from cBPF, bpf_helper_changes_pkt_data()
is completely unrelevant here (since cBPF is context read-only) and
therefore always false.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 14 Dec 2017 20:07:24 +0000 (21:07 +0100)]
bpf, ppc64: do not reload skb pointers in non-skb context
The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the helper
would enforce a reload such as in case of XDP. Here, we do
have a struct xdp_buff instead of struct sk_buff as context,
thus this will access garbage.
JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB.
Fixes:
156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 14 Dec 2017 20:07:23 +0000 (21:07 +0100)]
bpf, s390x: do not reload skb pointers in non-skb context
The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the
BPF helper would enforce a reload such as in case of XDP.
Here, we do have a struct xdp_buff instead of struct sk_buff
as context, thus this will access garbage.
JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB only. Tested on s390x.
Fixes:
9db7f2b81880 ("s390/bpf: recache skb->data/hlen for skb_vlan_push/pop")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Willem de Bruijn [Wed, 13 Dec 2017 19:41:06 +0000 (14:41 -0500)]
sock: free skb in skb_complete_tx_timestamp on error
skb_complete_tx_timestamp must ingest the skb it is passed. Call
kfree_skb if the skb cannot be enqueued.
Fixes:
b245be1f4db1 ("net-timestamp: no-payload only sysctl")
Fixes:
9ac25fc06375 ("net: fix socket refcounting in skb_complete_tx_timestamp()")
Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 15 Dec 2017 16:29:44 +0000 (11:29 -0500)]
Merge branch 's390-fixes'
Julian Wiedmann says:
====================
s390/qeth: fixes 2017-12-13
some more patches for 4.15, that fix multiple issues with IP Takeover
configuration in qeth.
Please queue them up for stable kernels as well (4.9 and newer).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:32 +0000 (18:56 +0100)]
s390/qeth: update takeover IPs after configuration change
Any modification to the takeover IP-ranges requires that we re-evaluate
which IP addresses are takeover-eligible. Otherwise we might do takeover
for some addresses when we no longer should, or vice-versa.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:31 +0000 (18:56 +0100)]
s390/qeth: lock IP table while applying takeover changes
Modifying the flags of an IP addr object needs to be protected against
eg. concurrent removal of the same object from the IP table.
Fixes:
5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:30 +0000 (18:56 +0100)]
s390/qeth: don't apply takeover changes to RXIP
When takeover is switched off, current code clears the 'TAKEOVER' flag on
all IPs. But the flag is also used for RXIP addresses, and those should
not be affected by the takeover mode.
Fix the behaviour by consistenly applying takover logic to NORMAL
addresses only.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:29 +0000 (18:56 +0100)]
s390/qeth: apply takeover changes when mode is toggled
Just as for an explicit enable/disable, toggling the takeover mode also
requires that the IP addresses get updated. Otherwise all IPs that were
added to the table before the mode-toggle, get registered with the old
settings.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will Deacon [Fri, 15 Dec 2017 16:07:22 +0000 (16:07 +0000)]
arm64: fpsimd: Fix copying of FP state from signal frame into task struct
Commit
9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
state after signals") fixed an issue reported in our FPSIMD signal
restore code but inadvertently introduced another issue which tends to
manifest as random SEGVs in userspace.
The problem is that when we copy the struct fpsimd_state from the kernel
stack (populated from the signal frame) into the struct held in the
current thread_struct, we blindly copy uninitialised stack into the
"cpu" field, which means that context-switching of the FP registers is
no longer reliable.
This patch fixes the problem by copying only the user_fpsimd member of
struct fpsimd_state. We should really rework the function prototypes
to take struct user_fpsimd_state * instead, but let's just get this
fixed for now.
Cc: Dave Martin <Dave.Martin@arm.com>
Fixes:
9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD state after signals")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
David S. Miller [Fri, 15 Dec 2017 16:02:11 +0000 (11:02 -0500)]
Merge tag 'batadv-net-for-davem-
20171215' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- Initialize the fragment headers, by Sven Eckelmann
- Fix a NULL check in BATMAN V, by Sven Eckelmann
- Fix kernel doc for the time_setup() change, by Sven Eckelmann
- Use the right lock in BATMAN IV OGM Update, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Fri, 15 Dec 2017 07:44:21 +0000 (08:44 +0100)]
mlxsw: spectrum: Disable MAC learning for ovs port
Learning is currently enabled for ports which are OVS slaves -
even though OVS doesn't need this indication.
Since we're not associating a fid with the port, HW would continuously
notify driver of learned [& aged] MACs which would be logged as errors.
Fixes:
2b94e58df58c ("mlxsw: spectrum: Allow ports to work under OVS master")
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Rostedt [Sat, 2 Dec 2017 18:04:54 +0000 (13:04 -0500)]
sched/rt: Do not pull from current CPU if only one CPU to pull
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes:
4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Song Liu [Fri, 15 Dec 2017 01:17:56 +0000 (17:17 -0800)]
xdp: linearize skb in netif_receive_generic_xdp()
In netif_receive_generic_xdp(), it is necessary to linearize all
nonlinear skb. However, in current implementation, skb with
troom <= 0 are not linearized. This patch fixes this by calling
skb_linearize() for all nonlinear skb.
Fixes:
de8f3a83b0a0 ("bpf: add meta pointer for direct access")
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Ingo Molnar [Fri, 15 Dec 2017 12:47:51 +0000 (13:47 +0100)]
tools/headers: Synchronize kernel <-> tooling headers
Two kernel headers got modified recently, which are used by tooling as well:
tools/include/uapi/linux/kvm.h
arch/x86/include/asm/cpufeatures.h
None of those changes have an effect on tooling, so do a plain copy.
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Fri, 15 Dec 2017 12:36:56 +0000 (13:36 +0100)]
objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
This fixes the following warning:
warning: objtool: x86 instruction decoder differs from kernel
Note that there are cleanups queued up for v4.16 that will make this
warning more informative and will make the syncing easier as well.
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Randy Dunlap [Mon, 11 Dec 2017 18:38:36 +0000 (10:38 -0800)]
x86/decoder: Fix and update the opcodes map
Update x86-opcode-map.txt based on the October 2017 Intel SDM publication.
Fix INVPID to INVVPID.
Add UD0 and UD1 instruction opcodes.
Also sync the objtool and perf tooling copies of this file.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aac062d7-c0f6-96e3-5c92-ed299e2bd3da@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 14 Dec 2017 21:19:07 +0000 (13:19 -0800)]
x86/power: Make restore_processor_context() sane
My previous attempt to fix a couple of bugs in __restore_processor_context():
5b06bbcfc2c6 ("x86/power: Fix some ordering bugs in __restore_processor_context()")
... introduced yet another bug, breaking suspend-resume.
Rather than trying to come up with a minimal fix, let's try to clean it up
for real. This patch fixes quite a few things:
- The old code saved a nonsensical subset of segment registers.
The only registers that need to be saved are those that contain
userspace state or those that can't be trivially restored without
percpu access working. (On x86_32, we can restore percpu access
by writing __KERNEL_PERCPU to %fs. On x86_64, it's easier to
save and restore the kernel's GSBASE.) With this patch, we
restore hardcoded values to the kernel state where applicable and
explicitly restore the user state after fixing all the descriptor
tables.
- We used to use an unholy mix of inline asm and C helpers for
segment register access. Let's get rid of the inline asm.
This fixes the reported s2ram hangs and make the code all around
more logical.
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reported-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Tested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Fixes:
5b06bbcfc2c6 ("x86/power: Fix some ordering bugs in __restore_processor_context()")
Link: http://lkml.kernel.org/r/398ee68e5c0f766425a7b746becfc810840770ff.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 14 Dec 2017 21:19:06 +0000 (13:19 -0800)]
x86/power/32: Move SYSENTER MSR restoration to fix_processor_context()
x86_64 restores system call MSRs in fix_processor_context(), and
x86_32 restored them along with segment registers. The 64-bit
variant makes more sense, so move the 32-bit code to match the
64-bit code.
No side effects are expected to runtime behavior.
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/65158f8d7ee64dd6bbc6c1c83b3b34aaa854e3ae.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 14 Dec 2017 21:19:05 +0000 (13:19 -0800)]
x86/power/64: Use struct desc_ptr for the IDT in struct saved_context
x86_64's saved_context nonsensically used separate idt_limit and
idt_base fields and then cast &idt_limit to struct desc_ptr *.
This was correct (with -fno-strict-aliasing), but it's confusing,
served no purpose, and required #ifdeffery. Simplify this by
using struct desc_ptr directly.
No change in functionality.
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/967909ce38d341b01d45eff53e278e2728a3a93a.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Fri, 15 Dec 2017 02:25:03 +0000 (18:25 -0800)]
Merge tag 'pm-4.15-rc4' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes an issue in two recent commits that may cause
pm_runtime_enable() to be called for too many times for some devices
during the "thaw" transition belonging to hibernation"
* tag 'pm-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Avoid excess pm_runtime_enable() calls in device_resume()
Linus Torvalds [Fri, 15 Dec 2017 02:21:33 +0000 (18:21 -0800)]
Merge tag 'trace-v4.15-rc1' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fix-ups:
- comment fixes
- build fix
- better memory alloction (don't use NR_CPUS)
- configuration fix
- build warning fix
- enhanced callback parameter (to simplify users of trace hooks)
- give up on stack tracing when RCU isn't watching (it's a lost
cause)"
* tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have stack trace not record if RCU is not watching
tracing: Pass export pointer as argument to ->write()
ring-buffer: Remove unused function __rb_data_page_index()
tracing: make PREEMPTIRQ_EVENTS depend on TRACING
tracing: Allocate mask_str buffer dynamically
tracing: always define trace_{irq,preempt}_{enable_disable}
tracing: Fix code comments in trace.c
Steven Rostedt (VMware) [Tue, 5 Dec 2017 09:41:51 +0000 (04:41 -0500)]
tracing: Have stack trace not record if RCU is not watching
The stack tracer records a stack dump whenever it sees a stack usage that is
more than what it ever saw before. This can happen at any function that is
being traced. If it happens when the CPU is going idle (or other strange
locations), RCU may not be watching, and in this case, the recording of the
stack trace will trigger a warning. There's been lots of efforts to make
hacks to allow stack tracing to proceed even if RCU is not watching, but
this only causes more issues to appear. Simply do not trace a stack if RCU
is not watching. It probably isn't a bad stack anyway.
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Linus Torvalds [Fri, 15 Dec 2017 01:02:39 +0000 (17:02 -0800)]
Merge tag 'pci-v4.15-fixes-1' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- add a pci_get_domain_bus_and_slot() stub for the CONFIG_PCI=n case to
avoid build breakage in the v4.16 merge window if a
pci_get_bus_and_slot() -> pci_get_domain_bus_and_slot() patch gets
merged before the PCI tree (Randy Dunlap)
- fix an AMD boot regression in the 64bit BAR support added in v4.15
(Christian König)
- fix an R-Car use-after-free that causes a crash if no PCIe card is
present (Geert Uytterhoeven)
* tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: rcar: Fix use-after-free in probe error path
x86/PCI: Only enable a 64bit BAR on single-socket AMD Family 15h
x86/PCI: Fix infinite loop in search for 64bit BAR placement
PCI: Add pci_get_domain_bus_and_slot() stub
Linus Torvalds [Fri, 15 Dec 2017 00:35:20 +0000 (16:35 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
arch: define weak abort()
mm, oom_reaper: fix memory corruption
kernel: make groups_sort calling a responsibility group_info allocators
mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'
tools/slabinfo-gnuplot: force to use bash shell
kcov: fix comparison callback signature
mm/slab.c: do not hash pointers when debugging slab
mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()
mm/memory.c: mark wp_huge_pmd() inline to prevent build failure
scripts/faddr2line: fix CROSS_COMPILE unset error
Documentation/vm/zswap.txt: update with same-value filled page feature
exec: avoid gcc-8 warning for get_task_comm
autofs: fix careless error in recent commit
string.h: workaround for increased stack usage
mm/kmemleak.c: make cond_resched() rate-limiting more efficient
lib/rbtree,drm/mm: add rbtree_replace_node_cached()
include/linux/idr.h: add #include <linux/bug.h>
Sudip Mukherjee [Thu, 14 Dec 2017 23:33:19 +0000 (15:33 -0800)]
arch: define weak abort()
gcc toggle -fisolate-erroneous-paths-dereference (default at -O2
onwards) isolates faulty code paths such as null pointer access, divide
by zero etc. If gcc port doesnt implement __builtin_trap, an abort() is
generated which causes kernel link error.
In this case, gcc is generating abort due to 'divide by zero' in
lib/mpi/mpih-div.c.
Currently 'frv' and 'arc' are failing. Previously other arch was also
broken like m32r was fixed by commit
d22e3d69ee1a ("m32r: fix build
failure").
Let's define this weak function which is common for all arch and fix the
problem permanently. We can even remove the arch specific 'abort' after
this is done.
Link: http://lkml.kernel.org/r/1513118956-8718-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 14 Dec 2017 23:33:15 +0000 (15:33 -0800)]
mm, oom_reaper: fix memory corruption
David Rientjes has reported the following memory corruption while the
oom reaper tries to unmap the victims address space
BUG: Bad page map in process oom_reaper pte:
6353826300000000 pmd:
00000000
addr:
00007f50cab1d000 vm_flags:
08100073 anon_vma:
ffff9eea335603f0 mapping: (null) index:
7f50cab1d
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 2 PID: 1001 Comm: oom_reaper
Call Trace:
unmap_page_range+0x1068/0x1130
__oom_reap_task_mm+0xd5/0x16b
oom_reaper+0xff/0x14c
kthread+0xc1/0xe0
Tetsuo Handa has noticed that the synchronization inside exit_mmap is
insufficient. We only synchronize with the oom reaper if
tsk_is_oom_victim which is not true if the final __mmput is called from
a different context than the oom victim exit path. This can trivially
happen from context of any task which has grabbed mm reference (e.g. to
read /proc/<pid>/ file which requires mm etc.).
The race would look like this
oom_reaper oom_victim task
mmget_not_zero
do_exit
mmput
__oom_reap_task_mm mmput
__mmput
exit_mmap
remove_vma
unmap_page_range
Fix this issue by providing a new mm_is_oom_victim() helper which
operates on the mm struct rather than a task. Any context which
operates on a remote mm struct should use this helper in place of
tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
so it is stable in the exit_mmap path.
Debugged by Tetsuo Handa.
Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
Fixes:
212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: <stable@vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thiago Rafael Becker [Thu, 14 Dec 2017 23:33:12 +0000 (15:33 -0800)]
kernel: make groups_sort calling a responsibility group_info allocators
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groups
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christophe JAILLET [Thu, 14 Dec 2017 23:33:08 +0000 (15:33 -0800)]
mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'
A semaphore is acquired before this check, so we must release it before
leaving.
Link: http://lkml.kernel.org/r/20171211211009.4971-1-christophe.jaillet@wanadoo.fr
Fixes:
b7f0554a56f2 ("mm: fail get_vaddr_frames() for filesystem-dax mappings")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Liu, Changcheng [Thu, 14 Dec 2017 23:33:05 +0000 (15:33 -0800)]
tools/slabinfo-gnuplot: force to use bash shell
On some linux distributions, the default link of sh is dash which
deoesn't support split array like "${var//,/ }"
It's better to force to use bash shell directly.
Link: http://lkml.kernel.org/r/20171208093751.GA175471@sofia
Signed-off-by: Liu Changcheng <changcheng.liu@intel.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>