Quentin Schulz [Fri, 23 Nov 2018 18:01:51 +0000 (19:01 +0100)]
net: phy: mscc: fix deadlock in vsc85xx_default_config
The vsc85xx_default_config function called in the vsc85xx_config_init
function which is used by VSC8530, VSC8531, VSC8540 and VSC8541 PHYs
mistakenly calls phy_read and phy_write in-between phy_select_page and
phy_restore_page.
phy_select_page and phy_restore_page actually take and release the MDIO
bus lock and phy_write and phy_read take and release the lock to write
or read to a PHY register.
Let's fix this deadlock by using phy_modify_paged which handles
correctly a read followed by a write in a non-standard page.
Fixes:
6a0bfbbe20b0 ("net: phy: mscc: migrate to phy_select/restore_page functions")
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabio Estevam [Fri, 23 Nov 2018 17:46:50 +0000 (15:46 -0200)]
dt-bindings: dsa: Fix typo in "probed"
The correct form is "can be probed", so fix the typo.
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Fri, 23 Nov 2018 17:28:01 +0000 (18:28 +0100)]
net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue
Reset snd_queue tso_hdrs pointer to NULL in nicvf_free_snd_queue routine
since it is used to check if tso dma descriptor queue has been previously
allocated. The issue can be triggered with the following reproducer:
$ip link set dev enP2p1s0v0 xdpdrv obj xdp_dummy.o
$ip link set dev enP2p1s0v0 xdpdrv off
[ 341.467649] WARNING: CPU: 74 PID: 2158 at mm/vmalloc.c:1511 __vunmap+0x98/0xe0
[ 341.515010] Hardware name: GIGABYTE H270-T70/MT70-HD0, BIOS T49 02/02/2018
[ 341.521874] pstate:
60400005 (nZCv daif +PAN -UAO)
[ 341.526654] pc : __vunmap+0x98/0xe0
[ 341.530132] lr : __vunmap+0x98/0xe0
[ 341.533609] sp :
ffff00001c5db860
[ 341.536913] x29:
ffff00001c5db860 x28:
0000000000020000
[ 341.542214] x27:
ffff810feb5090b0 x26:
ffff000017e57000
[ 341.547515] x25:
0000000000000000 x24:
00000000fbd00000
[ 341.552816] x23:
0000000000000000 x22:
ffff810feb5090b0
[ 341.558117] x21:
0000000000000000 x20:
0000000000000000
[ 341.563418] x19:
ffff000017e57000 x18:
0000000000000000
[ 341.568719] x17:
0000000000000000 x16:
0000000000000000
[ 341.574020] x15:
0000000000000010 x14:
ffffffffffffffff
[ 341.579321] x13:
ffff00008985eb27 x12:
ffff00000985eb2f
[ 341.584622] x11:
ffff0000096b3000 x10:
ffff00001c5db510
[ 341.589923] x9 :
00000000ffffffd0 x8 :
ffff0000086868e8
[ 341.595224] x7 :
3430303030303030 x6 :
00000000000006ef
[ 341.600525] x5 :
00000000003fffff x4 :
0000000000000000
[ 341.605825] x3 :
0000000000000000 x2 :
ffffffffffffffff
[ 341.611126] x1 :
ffff0000096b3728 x0 :
0000000000000038
[ 341.616428] Call trace:
[ 341.618866] __vunmap+0x98/0xe0
[ 341.621997] vunmap+0x3c/0x50
[ 341.624961] arch_dma_free+0x68/0xa0
[ 341.628534] dma_direct_free+0x50/0x80
[ 341.632285] nicvf_free_resources+0x160/0x2d8 [nicvf]
[ 341.637327] nicvf_config_data_transfer+0x174/0x5e8 [nicvf]
[ 341.642890] nicvf_stop+0x298/0x340 [nicvf]
[ 341.647066] __dev_close_many+0x9c/0x108
[ 341.650977] dev_close_many+0xa4/0x158
[ 341.654720] rollback_registered_many+0x140/0x530
[ 341.659414] rollback_registered+0x54/0x80
[ 341.663499] unregister_netdevice_queue+0x9c/0xe8
[ 341.668192] unregister_netdev+0x28/0x38
[ 341.672106] nicvf_remove+0xa4/0xa8 [nicvf]
[ 341.676280] nicvf_shutdown+0x20/0x30 [nicvf]
[ 341.680630] pci_device_shutdown+0x44/0x88
[ 341.684720] device_shutdown+0x144/0x250
[ 341.688640] kernel_restart_prepare+0x44/0x50
[ 341.692986] kernel_restart+0x20/0x68
[ 341.696638] __se_sys_reboot+0x210/0x238
[ 341.700550] __arm64_sys_reboot+0x24/0x30
[ 341.704555] el0_svc_handler+0x94/0x110
[ 341.708382] el0_svc+0x8/0xc
[ 341.711252] ---[ end trace
3f4019c8439959c9 ]---
[ 341.715874] page:
ffff7e0003ef4000 count:0 mapcount:0 mapping:
0000000000000000 index:0x4
[ 341.723872] flags: 0x1fffe000000000()
[ 341.727527] raw:
001fffe000000000 ffff7e0003f1a008 ffff7e0003ef4048 0000000000000000
[ 341.735263] raw:
0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[ 341.742994] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
where xdp_dummy.c is a simple bpf program that forwards the incoming
frames to the network stack (available here:
https://github.com/altoor/xdp_walkthrough_examples/blob/master/sample_1/xdp_dummy.c)
Fixes:
05c773f52b96 ("net: thunderx: Add basic XDP support")
Fixes:
4863dea3fab0 ("net: Adding support for Cavium ThunderX network controller")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yangtao Li [Thu, 22 Nov 2018 12:34:41 +0000 (07:34 -0500)]
net: amd: add missing of_node_put()
of_find_node_by_path() acquires a reference to the node
returned by it and that reference needs to be dropped by its caller.
This place doesn't do that, so fix it.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Thu, 22 Nov 2018 08:15:28 +0000 (16:15 +0800)]
team: no need to do team_notify_peers or team_mcast_rejoin when disabling port
team_notify_peers() will send ARP and NA to notify peers. team_mcast_rejoin()
will send multicast join group message to notify peers. We should do this when
enabling/changed to a new port. But it doesn't make sense to do it when a port
is disabled.
On the other hand, when we set mcast_rejoin_count to 2, and do a failover,
team_port_disable() will increase mcast_rejoin.count_pending to 2 and then
team_port_enable() will increase mcast_rejoin.count_pending to 4. We will send
4 mcast rejoin messages at latest, which will make user confused. The same
with notify_peers.count.
Fix it by deleting team_notify_peers() and team_mcast_rejoin() in
team_port_disable().
Reported-by: Liang Li <liali@redhat.com>
Fixes:
fc423ff00df3a ("team: add peer notification")
Fixes:
492b200efdd20 ("team: add support for sending multicast rejoins")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 22 Nov 2018 06:36:31 +0000 (14:36 +0800)]
virtio-net: fail XDP set if guest csum is negotiated
We don't support partial csumed packet since its metadata will be lost
or incorrect during XDP processing. So fail the XDP set if guest_csum
feature is negotiated.
Fixes:
f600b6905015 ("virtio_net: Add XDP support")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pavel Popa <pashinho1990@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 22 Nov 2018 06:36:30 +0000 (14:36 +0800)]
virtio-net: disable guest csum during XDP set
We don't disable VIRTIO_NET_F_GUEST_CSUM if XDP was set. This means we
can receive partial csumed packets with metadata kept in the
vnet_hdr. This may have several side effects:
- It could be overridden by header adjustment, thus is might be not
correct after XDP processing.
- There's no way to pass such metadata information through
XDP_REDIRECT to another driver.
- XDP does not support checksum offload right now.
So simply disable guest csum if possible in this the case of XDP.
Fixes:
3f93522ffab2d ("virtio-net: switch off offloads on demand if possible on XDP set")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pavel Popa <pashinho1990@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Wed, 21 Nov 2018 17:23:53 +0000 (18:23 +0100)]
net/sched: act_police: add missing spinlock initialization
commit
f2cbd4852820 ("net/sched: act_police: fix race condition on state
variables") introduces a new spinlock, but forgets its initialization.
Ensure that tcf_police_init() initializes 'tcfp_lock' every time a 'police'
action is newly created, to avoid the following lockdep splat:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
<...>
Call Trace:
dump_stack+0x85/0xcb
register_lock_class+0x581/0x590
__lock_acquire+0xd4/0x1330
? tcf_police_init+0x2fa/0x650 [act_police]
? lock_acquire+0x9e/0x1a0
lock_acquire+0x9e/0x1a0
? tcf_police_init+0x2fa/0x650 [act_police]
? tcf_police_init+0x55a/0x650 [act_police]
_raw_spin_lock_bh+0x34/0x40
? tcf_police_init+0x2fa/0x650 [act_police]
tcf_police_init+0x2fa/0x650 [act_police]
tcf_action_init_1+0x384/0x4c0
tcf_action_init+0xf6/0x160
tcf_action_add+0x73/0x170
tc_ctl_action+0x122/0x160
rtnetlink_rcv_msg+0x2a4/0x490
? netlink_deliver_tap+0x99/0x400
? validate_linkmsg+0x370/0x370
netlink_rcv_skb+0x4d/0x130
netlink_unicast+0x196/0x230
netlink_sendmsg+0x2e5/0x3e0
sock_sendmsg+0x36/0x40
___sys_sendmsg+0x280/0x2f0
? _raw_spin_unlock+0x24/0x30
? handle_pte_fault+0xafe/0xf30
? find_held_lock+0x2d/0x90
? syscall_trace_enter+0x1df/0x360
? __sys_sendmsg+0x5e/0xa0
__sys_sendmsg+0x5e/0xa0
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f1841c7cf10
Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24
RSP: 002b:
00007ffcf9df4d68 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
0000000000000001 RCX:
00007f1841c7cf10
RDX:
0000000000000000 RSI:
00007ffcf9df4dc0 RDI:
0000000000000003
RBP:
000000005bf56105 R08:
0000000000000002 R09:
00007ffcf9df8edc
R10:
00007ffcf9df47e0 R11:
0000000000000246 R12:
0000000000671be0
R13:
00007ffcf9df4e84 R14:
0000000000000008 R15:
0000000000000000
Fixes:
f2cbd4852820 ("net/sched: act_police: fix race condition on state variables")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 21 Nov 2018 17:21:35 +0000 (18:21 +0100)]
net: don't keep lonely packets forever in the gro hash
Eric noted that with UDP GRO and NAPI timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at an higher frequency than the NAPI timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the old packets -
those with a NAPI_GRO_CB age before the current jiffy - before scheduling
the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.
v1 -> v2:
- clarified the commit message and comment
RFC -> v1:
- added 'Fixes tags', cleaned-up the wording.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes:
3b47d30396ba ("net: gro: add a per device gro flush timer")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Wed, 21 Nov 2018 13:52:33 +0000 (21:52 +0800)]
net/ipv6: re-do dad when interface has IFF_NOARP flag change
When we add a new IPv6 address, we should also join corresponding solicited-node
multicast address, unless the interface has IFF_NOARP flag, as function
addrconf_join_solict() did. But if we remove IFF_NOARP flag later, we do
not do dad and add the mcast address. So we will drop corresponding neighbour
discovery message that came from other nodes.
A typical example is after creating a ipvlan with mode l3, setting up an ipv6
address and changing the mode to l2. Then we will not be able to ping this
address as the interface doesn't join related solicited-node mcast address.
Fix it by re-doing dad when interface changed IFF_NOARP flag. Then we will add
corresponding mcast group and check if there is a duplicate address on the
network.
Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 20 Nov 2018 18:00:18 +0000 (13:00 -0500)]
packet: copy user buffers before orphan or clone
tpacket_snd sends packets with user pages linked into skb frags. It
notifies that pages can be reused when the skb is released by setting
skb->destructor to tpacket_destruct_skb.
This can cause data corruption if the skb is orphaned (e.g., on
transmit through veth) or cloned (e.g., on mirror to another psock).
Create a kernel-private copy of data in these cases, same as tun/tap
zerocopy transmission. Reuse that infrastructure: mark the skb as
SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
Unlike other zerocopy packets, do not set shinfo destructor_arg to
struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
when the original skb is released and a timestamp is recorded. Do not
change this timestamp behavior. The ubuf_info->callback is not needed
anyway, as no zerocopy notification is expected.
Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
resulting value is not a valid ubuf_info pointer, nor a valid
tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
The fix relies on features introduced in commit
52267790ef52 ("sock:
add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
Tested with from `./in_netns.sh ./txring_overwrite` from
http://github.com/wdebruij/kerneltools/tests
Fixes:
69e3c75f4d54 ("net: TX_RING and packet mmap")
Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Nov 2018 19:53:26 +0000 (11:53 -0800)]
Merge branch 'ibmvnic-Fix-queue-and-buffer-accounting-errors'
Thomas Falcon says:
====================
ibmvnic: Fix queue and buffer accounting errors
This series includes two small fixes. The first resolves a typo bug
in the code to clean up unused RX buffers during device queue removal.
The second ensures that device queue memory is updated to reflect new
supported queue ring sizes after migration to other backing hardware.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 21 Nov 2018 17:17:59 +0000 (11:17 -0600)]
ibmvnic: Update driver queues after change in ring size support
During device reset, queue memory is not being updated to accommodate
changes in ring buffer sizes supported by backing hardware. Track
any differences in ring buffer sizes following the reset and update
queue memory when possible.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 21 Nov 2018 17:17:58 +0000 (11:17 -0600)]
ibmvnic: Fix RX queue buffer cleanup
The wrong index is used when cleaning up RX buffer objects during release
of RX queues. Update to use the correct index counter.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Wed, 21 Nov 2018 15:32:10 +0000 (16:32 +0100)]
net: thunderx: set xdp_prog to NULL if bpf_prog_add fails
Set xdp_prog pointer to NULL if bpf_prog_add fails since that routine
reports the error code instead of NULL in case of failure and xdp_prog
pointer value is used in the driver to verify if XDP is currently
enabled.
Moreover report the error code to userspace if nicvf_xdp_setup fails
Fixes:
05c773f52b96 ("net: thunderx: Add basic XDP support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tal Gilboa [Wed, 21 Nov 2018 14:28:23 +0000 (16:28 +0200)]
net/dim: Update DIM start sample after each DIM iteration
On every iteration of net_dim, the algorithm may choose to
check for the system state by comparing current data sample
with previous data sample. After each of these comparison,
regardless of the action taken, the sample used as baseline
is needed to be updated.
This patch fixes a bug that causes DIM to take wrong decisions,
due to never updating the baseline sample for comparison between
iterations. This way, DIM always compares current sample with
zeros.
Although this is a functional fix, it also improves and stabilizes
performance as the algorithm works properly now.
Performance:
Tested single UDP TX stream with pktgen:
samples/pktgen/pktgen_sample03_burst_single_flow.sh -i p4p2 -d 1.1.1.1
-m 24:8a:07:88:26:8b -f 3 -b 128
ConnectX-5 100GbE packet rate improved from 15-19Mpps to 19-20Mpps.
Also, toggling between profiles is less frequent with the fix.
Fixes:
8115b750dbcb ("net/dim: use struct net_dim_sample as arg to net_dim")
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vincent Chen [Wed, 21 Nov 2018 01:38:11 +0000 (09:38 +0800)]
net: faraday: ftmac100: remove netif_running(netdev) check before disabling interrupts
In the original ftmac100_interrupt(), the interrupts are only disabled when
the condition "netif_running(netdev)" is true. However, this condition
causes kerenl hang in the following case. When the user requests to
disable the network device, kernel will clear the bit __LINK_STATE_START
from the dev->state and then call the driver's ndo_stop function. Network
device interrupts are not blocked during this process. If an interrupt
occurs between clearing __LINK_STATE_START and stopping network device,
kernel cannot disable the interrupts due to the condition
"netif_running(netdev)" in the ISR. Hence, kernel will hang due to the
continuous interruption of the network device.
In order to solve the above problem, the interrupts of the network device
should always be disabled in the ISR without being restricted by the
condition "netif_running(netdev)".
[V2]
Remove unnecessary curly braces.
Signed-off-by: Vincent Chen <vincentc@andestech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Nov 2018 00:14:56 +0000 (16:14 -0800)]
Merge branch 'smc-fixes'
Ursula Braun says:
====================
net/smc: fixes 2018-11-12
here is V4 of some net/smc fixes in different areas for the net tree.
v1->v2:
do not define 8-byte alignment for union smcd_cdc_cursor in
patch 4/5 "net/smc: atomic SMCD cursor handling"
v2->v3:
stay with 8-byte alignment for union smcd_cdc_cursor in
patch 4/5 "net/smc: atomic SMCD cursor handling", but get rid of
__packed for struct smcd_cdc_msg
v3->v4:
get rid of another __packed for struct smc_cdc_msg in
patch 4/5 "net/smc: atomic SMCD cursor handling"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Tue, 20 Nov 2018 15:46:43 +0000 (16:46 +0100)]
net/smc: use after free fix in smc_wr_tx_put_slot()
In smc_wr_tx_put_slot() field pend->idx is used after being
cleared. That means always idx 0 is cleared in the wr_tx_mask.
This results in a broken administration of available WR send
payload buffers.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Tue, 20 Nov 2018 15:46:42 +0000 (16:46 +0100)]
net/smc: atomic SMCD cursor handling
Running uperf tests with SMCD on LPARs results in corrupted cursors.
SMCD cursors should be treated atomically to fix cursor corruption.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hans Wippel [Tue, 20 Nov 2018 15:46:41 +0000 (16:46 +0100)]
net/smc: add SMC-D shutdown signal
When a SMC-D link group is freed, a shutdown signal should be sent to
the peer to indicate that the link group is invalid. This patch adds the
shutdown signal to the SMC code.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 20 Nov 2018 15:46:40 +0000 (16:46 +0100)]
net/smc: use queue pair number when matching link group
When searching for an existing link group the queue pair number is also
to be taken into consideration. When the SMC server sends a new number
in a CLC packet (keeping all other values equal) then a new link group
is to be created on the SMC client side.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hans Wippel [Tue, 20 Nov 2018 15:46:39 +0000 (16:46 +0100)]
net/smc: abort CLC connection in smc_release
In case of a non-blocking SMC socket, the initial CLC handshake is
performed over a blocking TCP connection in a worker. If the SMC socket
is released, smc_release has to wait for the blocking CLC socket
operations (e.g., kernel_connect) inside the worker.
This patch aborts a CLC connection when the respective non-blocking SMC
socket is released to avoid waiting on socket operations or timeouts.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 21 Nov 2018 23:51:16 +0000 (15:51 -0800)]
Merge tag 'wireless-drivers-for-davem-2018-11-20' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.20
First set of fixes for 4.20, this time we have quite a few them but
all very small.
ath9k
* fix a locking regression found by a static checker
wlcore
* fix a crash which was a regression with wakeirq handling
brcm80211
* yet another fix for 160 MHz channel handling
mt76
* fix a longstaning build problem when CONFIG_LEDS_CLASS is disabled
* don't use uninitialised mutex
iwlwifi
* do note that the iwlwifi merge tag (commit
4ec321c14693) seems to
contain wrong list of changes so ignore that
* fix ACPI data handling, a memory leak and other smaller fixes
ath10k
* fix a crash during suspend which was a recent regression
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 20 Nov 2018 13:53:59 +0000 (05:53 -0800)]
tcp: defer SACK compression after DupThresh
Jean-Louis reported a TCP regression and bisected to recent SACK
compression.
After a loss episode (receiver not able to keep up and dropping
packets because its backlog is full), linux TCP stack is sending
a single SACK (DUPACK).
Sender waits a full RTO timer before recovering losses.
While RFC 6675 says in section 5, "Algorithm Details",
(2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
indicating at least three segments have arrived above the current
cumulative acknowledgment point, which is taken to indicate loss
-- go to step (4).
...
(4) Invoke fast retransmit and enter loss recovery as follows:
there are old TCP stacks not implementing this strategy, and
still counting the dupacks before starting fast retransmit.
While these stacks probably perform poorly when receivers implement
LRO/GRO, we should be a little more gentle to them.
This patch makes sure we do not enable SACK compression unless
3 dupacks have been sent since last rcv_nxt update.
Ideally we should even rearm the timer to send one or two
more DUPACK if no more packets are coming, but that will
be work aiming for linux-4.21.
Many thanks to Jean-Louis for bisecting the issue, providing
packet captures and testing this patch.
Fixes:
5d9f4262b7ea ("tcp: add SACK compression")
Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Tue, 20 Nov 2018 11:39:56 +0000 (11:39 +0000)]
net: skb_scrub_packet(): Scrub offload_fwd_mark
When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().
Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.
Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
Fixes:
6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices")
Fixes:
abf4bb6b63d0 ("skbuff: Add the offload_mr_fwd_mark field")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Tue, 20 Nov 2018 21:18:44 +0000 (22:18 +0100)]
net/sched: act_police: fix race condition on state variables
after 'police' configuration parameters were converted to use RCU instead
of spinlock, the state variables used to compute the traffic rate (namely
'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
the traffic path without any protection.
Use a dedicated spinlock to avoid race conditions on these variables, and
ensure proper cache-line alignment. In this way, 'police' is still faster
than what we observed when 'tcf_lock' was used in the traffic path _ i.e.
reverting commit
2d550dbad83c ("net/sched: act_police: don't use spinlock
in the data path"). Moreover, we preserve the throughput improvement that
was obtained after 'police' started using per-cpu counters, when 'avrate'
is used instead of 'rate'.
Changes since v1 (thanks to Eric Dumazet):
- call ktime_get_ns() before acquiring the lock in the traffic path
- use a dedicated spinlock instead of tcf_lock
- improve cache-line usage
Fixes:
2d550dbad83c ("net/sched: act_police: don't use spinlock in the data path")
Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Heiner Kallweit [Tue, 20 Nov 2018 20:22:50 +0000 (21:22 +0100)]
MAINTAINERS: add myself as co-maintainer for r8169
Meanwhile I know the driver quite well and I refactored bigger parts
of it. As a result people contact me already with r8169 questions.
Therefore I'd volunteer to become co-maintainer of the driver also
officially.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Mallon [Tue, 20 Nov 2018 08:15:02 +0000 (19:15 +1100)]
tcp: Fix SOF_TIMESTAMPING_RX_HARDWARE to use the latest timestamp during TCP coalescing
During tcp coalescing ensure that the skb hardware timestamp refers to the
highest sequence number data.
Previously only the software timestamp was updated during coalescing.
Signed-off-by: Stephen Mallon <stephen.mallon@sydney.edu.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Siva Reddy Kallam [Tue, 20 Nov 2018 04:34:04 +0000 (10:04 +0530)]
tg3: Add PHY reset for 5717/5719/5720 in change ring and flow control paths
This patch has the fix to avoid PHY lockup with 5717/5719/5720 in change
ring and flow control paths. This patch solves the RX hang while doing
continuous ring or flow control parameters with heavy traffic from peer.
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Nov 2018 03:04:33 +0000 (19:04 -0800)]
Merge tag 'mlx5-fixes-2018-11-19' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2018-11-19
The following fixes are for mlx5 core and netdev driver.
For -stable v4.16
bc7fda7d4637 ('net/mlx5e: IPoIB, Reset QP after channels are closed')
For -stable v4.17
36917a270395 ('net/mlx5: IPSec, Fix the SA context hash key')
For -stable v4.18
6492a432be3a ('net/mlx5e: Always use the match level enum when parsing TC rule match')
c3f81be236b1 ('net/mlx5e: Removed unnecessary warnings in FEC caps query')
c5ce2e736b64 ('net/mlx5e: Fix selftest for small MTUs')
For -stable v4.19
effcd896b25e ('net/mlx5e: Adjust to max number of channles when re-attaching')
394cbc5acd68 ('net/mlx5e: RX, verify received packet size in Linear Striding RQ')
447cbb3613c8 ('net/mlx5e: Don't match on vlan non-existence if ethertype is wildcarded')
c223c1574612 ('net/mlx5e: Claim TC hw offloads support only under a proper build config')
Please pull and let me know if there's any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Juliet Kim [Mon, 19 Nov 2018 21:59:22 +0000 (15:59 -0600)]
net/ibmnvic: Fix deadlock problem in reset
This patch changes to use rtnl_lock only during a reset to avoid
deadlock that could occur when a thread operating close is holding
rtnl_lock and waiting for reset_lock acquired by another thread,
which is waiting for rtnl_lock in order to set the number of tx/rx
queues during a reset.
Also, we now setting the number of tx/rx queues during a soft reset
for failover or LPM events.
Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Nov 2018 02:38:16 +0000 (18:38 -0800)]
Merge branch 'qed-Fix-Queue-Manager-getters'
Denis Bolotin says:
====================
qed: Fix Queue Manager getters
This patch series fixes various queue manager getter functions. It is
important to make sure the getter's caller will receive a valid queue even
in error case to prevent more serious bugs.
Please consider applying to net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Mon, 19 Nov 2018 14:28:31 +0000 (16:28 +0200)]
qed: Fix QM getters to always return a valid pq
The getter callers doesn't know the valid Physical Queues (PQ) values.
This patch makes sure that a valid PQ will always be returned.
The patch consists of 3 fixes:
- When qed_init_qm_get_idx_from_flags() receives a disabled flag, it
returned PQ 0, which can potentially be another function's pq. Verify
that flag is enabled, otherwise return default start_pq.
- When qed_init_qm_get_idx_from_flags() receives an unknown flag, it
returned NULL and could lead to a segmentation fault. Return default
start_pq instead.
- A modulo operation was added to MCOS/VFS PQ getters to make sure the
PQ returned is in range of the required flag.
Fixes:
b5a9ee7cf3be ("qed: Revise QM cofiguration")
Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Mon, 19 Nov 2018 14:28:30 +0000 (16:28 +0200)]
qed: Fix bitmap_weight() check
Fix the condition which verifies that only one flag is set. The API
bitmap_weight() should receive size in bits instead of bytes.
Fixes:
b5a9ee7cf3be ("qed: Revise QM cofiguration")
Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shay Agroskin [Thu, 8 Nov 2018 12:23:36 +0000 (14:23 +0200)]
net/mlx5e: Fix failing ethtool query on FEC query error
If FEC caps query fails when executing 'ethtool <interface>'
the whole callback fails unnecessarily, fixed that by replacing the
error return code with debug logging only.
Fixes:
6cfa94605091 ("net/mlx5e: Ethtool driver callback for query/set FEC policy")
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Shay Agroskin [Sun, 28 Oct 2018 07:06:11 +0000 (09:06 +0200)]
net/mlx5e: Removed unnecessary warnings in FEC caps query
Querying interface FEC caps with 'ethtool [int]' after link reset
throws warning regading link speed.
This warning is not needed as there is already an indication in
user space that the link is not up.
Fixes:
0696d60853d5 ("net/mlx5e: Receive buffer configuration")
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Shay Agroskin [Sun, 28 Oct 2018 07:17:29 +0000 (09:17 +0200)]
net/mlx5e: Fix wrong field name in FEC related functions
This bug would result in reading wrong FEC capabilities for 10G/40G.
Fixes:
2095b2641477 ("net/mlx5e: Add port FEC get/set functions")
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Shay Agroskin [Sun, 28 Oct 2018 14:13:46 +0000 (16:13 +0200)]
net/mlx5e: Fix a bug in turning off FEC policy in unsupported speeds
Some speeds don't support turning FEC policy off. In case a requested
FEC policy is not supported for a speed (including current speed), its new
FEC policy would be:
no FEC - if disabling FEC is supported for that speed
unchanged - else
Fixes:
2095b2641477 ("net/mlx5e: Add port FEC get/set functions")
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
David S. Miller [Mon, 19 Nov 2018 23:13:00 +0000 (15:13 -0800)]
Merge branch 'ena-hibernation-and-rmmod-bug-fixes'
Arthur Kiyanovski says:
====================
net: ena: hibernation and rmmod bug fixes
This patchset includes 2 bug fixes:
1. A fix to a crash during resume from hibernation.
2. A fix to an illegal memory access during driver removal (e.g. during rmmod)
which might cause a crash in certain systems.
The subminor number in the driver version is also promoted to indicate driver
was changed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Arthur Kiyanovski [Mon, 19 Nov 2018 10:05:22 +0000 (12:05 +0200)]
net: ena: update driver version from 2.0.1 to 2.0.2
Update driver version due to critical bug fixes.
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arthur Kiyanovski [Mon, 19 Nov 2018 10:05:21 +0000 (12:05 +0200)]
net: ena: fix crash during ena_remove()
In ena_remove() we have the following stack call:
ena_remove()
unregister_netdev()
ena_destroy_device()
netif_carrier_off()
Calling netif_carrier_off() causes linkwatch to try to handle the
link change event on the already unregistered netdev, which leads
to a read from an unreadable memory address.
This patch switches the order of the two functions, so that
netif_carrier_off() is called on a regiestered netdev.
To accomplish this fix we also had to:
1. Remove the set bit ENA_FLAG_TRIGGER_RESET
2. Add a sanitiy check in ena_close()
both to prevent double device reset (when calling unregister_netdev()
ena_close is called, but the device was already deleted in
ena_destroy_device()).
3. Set the admin_queue running state to false to avoid using it after
device was reset (for example when calling ena_destroy_all_io_queues()
right after ena_com_dev_reset() in ena_down)
Fixes:
944b28aa2982 ("net: ena: fix missing lock during device destruction")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arthur Kiyanovski [Mon, 19 Nov 2018 10:05:20 +0000 (12:05 +0200)]
net: ena: fix crash during failed resume from hibernation
During resume from hibernation if ena_restore_device fails,
ena_com_dev_reset() is called, and uses the readless read mechanism,
which was already destroyed by the call to
ena_com_mmio_reg_read_request_destroy(). This causes a NULL pointer
reference.
In this commit we switch the call order of the above two functions
to avoid this crash.
Fixes:
d7703ddbd7c9 ("net: ena: fix rare bug when failed restart/resume is followed by driver removal")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 13:59:49 +0000 (21:59 +0800)]
sctp: not increase stream's incnt before sending addstrm_in request
Different from processing the addstrm_out request, The receiver handles
an addstrm_in request by sending back an addstrm_out request to the
sender who will increase its stream's in and incnt later.
Now stream->incnt has been increased since it sent out the addstrm_in
request in sctp_send_add_streams(), with the wrong stream->incnt will
even cause crash when copying stream info from the old stream's in to
the new one's in sctp_process_strreset_addstrm_out().
This patch is to fix it by simply removing the stream->incnt change
from sctp_send_add_streams().
Fixes:
242bd2d519d7 ("sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Valentine Fatiev [Wed, 17 Oct 2018 08:45:07 +0000 (11:45 +0300)]
net/mlx5e: Fix selftest for small MTUs
Loopback test had fixed packet size, which can be bigger than configured
MTU. Shorten the loopback packet size to be bigger than minimal MTU
allowed by the device. Text field removed from struct 'mlx5ehdr'
as redundant to allow send small packets as minimal allowed MTU.
Fixes: d605d66 ("net/mlx5e: Add support for ethtool self diagnostics test")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Moshe Shemesh [Thu, 11 Oct 2018 04:31:10 +0000 (07:31 +0300)]
net/mlx5e: RX, verify received packet size in Linear Striding RQ
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that WQE (RX descriptor) can be used for many packets and so the WQE is
much bigger than MTU. In virtualization setups where the port mtu can
be larger than the vf mtu, if received packet is bigger than MTU, it
won't be dropped by HW on too small receive WQE. If we use linear SKB in
striding RQ, since each stride has room for mtu size payload and skb
info, an oversized packet can lead to crash for crossing allocated page
boundary upon the call to build_skb. So driver needs to check packet
size and drop it.
Introduce new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.
As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
Fixes:
619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Roi Dayan [Tue, 23 Oct 2018 14:30:04 +0000 (17:30 +0300)]
net/mlx5e: Apply the correct check for supporting TC esw rules split
The mirror and not the output count is the one denoting a split.
Fix to condition the offload attempt on the mirror count being > 0
along the firmware to have the related capability.
Fixes:
592d36515969 ("net/mlx5e: Parse mirroring action for offloaded TC eswitch flows")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Yossi Kuperman <yossiku@mellanox.com>
Reviewed-by: Chris Mi <chrism@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Yuval Avnery [Tue, 16 Oct 2018 20:20:20 +0000 (15:20 -0500)]
net/mlx5e: Adjust to max number of channles when re-attaching
When core driver enters deattach/attach flow after pci reset,
Number of logical CPUs may have changed.
As a result we need to update the cpu affiliated resource tables.
1. indirect rqt list
2. eq table
Reproduction (PowerPC):
echo 1000 > /sys/kernel/debug/powerpc/eeh_max_freezes
ppc64_cpu --smt=on
# Restart driver
modprobe -r ... ; modprobe ...
# Link up
ifconfig ...
# Only physical CPUs
ppc64_cpu --smt=off
# Inject PCI errors so PCI will reset - calling the pci error handler
echo 0x8000000000000000 > /sys/kernel/debug/powerpc/<PCI BUS>/err_injct_inboundA
Call trace when trying to add non-existing rqs to an indirect rqt:
mlx5e_redirect_rqt+0x84/0x260 [mlx5_core] (unreliable)
mlx5e_redirect_rqts+0x188/0x190 [mlx5_core]
mlx5e_activate_priv_channels+0x488/0x570 [mlx5_core]
mlx5e_open_locked+0xbc/0x140 [mlx5_core]
mlx5e_open+0x50/0x130 [mlx5_core]
mlx5e_nic_enable+0x174/0x1b0 [mlx5_core]
mlx5e_attach_netdev+0x154/0x290 [mlx5_core]
mlx5e_attach+0x88/0xd0 [mlx5_core]
mlx5_attach_device+0x168/0x1e0 [mlx5_core]
mlx5_load_one+0x1140/0x1210 [mlx5_core]
mlx5_pci_resume+0x6c/0xf0 [mlx5_core]
Create cq will fail when trying to use non-existing EQ.
Fixes:
89d44f0a6c73 ("net/mlx5_core: Add pci error handlers to mlx5_core driver")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 Oct 2018 10:27:29 +0000 (12:27 +0200)]
net/mlx5e: Always use the match level enum when parsing TC rule match
We get the match level (none, l2, l3, l4) while going over the match
dissectors of an offloaded tc rule. When doing this, the match level
enum and the not min inline enum values should be used, fix that.
This worked accidentally b/c both enums have the same numerical values.
Fixes:
d708f902989b ('net/mlx5e: Get the required HW match level while parsing TC flow matches')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Thu, 18 Oct 2018 10:31:27 +0000 (12:31 +0200)]
net/mlx5e: Claim TC hw offloads support only under a proper build config
Currently, we are only supporting tc hw offloads when the eswitch
support is compiled in, but we are not gating the adevertizment
of the NETIF_F_HW_TC feature on this config being set.
Fix it, and while doing that, also avoid dealing with the feature
on ethtool when the config is not set.
Fixes:
e8f887ac6a45 ('net/mlx5e: Introduce tc offload support')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Thu, 25 Oct 2018 15:41:58 +0000 (15:41 +0000)]
net/mlx5e: Don't match on vlan non-existence if ethertype is wildcarded
For the "all" ethertype we should not care whether the packet has
vlans. Besides being wrong, the way we did it caused FW error
for rules such as:
tc filter add dev eth0 protocol all parent ffff: \
prio 1 flower skip_sw action drop
b/c the matching meta-data (outer headers bit in struct mlx5_flow_spec)
wasn't set. Fix that by matching on vlan non-existence only if we were
also told to match on the ethertype.
Fixes:
cee26487620b ('net/mlx5e: Set vlan masks for all offloaded TC rules')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Slava Ovsiienko <viacheslavo@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Denis Drozdov [Thu, 27 Sep 2018 11:17:54 +0000 (14:17 +0300)]
net/mlx5e: IPoIB, Reset QP after channels are closed
The mlx5e channels should be closed before mlx5i_uninit_underlay_qp
puts the QP into RST (reset) state during mlx5i_close. Currently QP
state incorrectly set to RST before channels got deactivated and closed,
since mlx5_post_send request expects QP in RTS (Ready To Send) state.
The fix is to keep QP in RTS state until mlx5e channels get closed
and to reset QP afterwards.
Also this fix is simply correct in order to keep the open/close flow
symmetric, i.e mlx5i_init_underlay_qp() is called first thing at open,
the correct thing to do is to call mlx5i_uninit_underlay_qp() last thing
at close, which is exactly what this patch is doing.
Fixes:
dae37456c8ac ("net/mlx5: Support for attaching multiple underlay QPs to root flow table")
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Raed Salem [Thu, 18 Oct 2018 05:55:21 +0000 (08:55 +0300)]
net/mlx5: IPSec, Fix the SA context hash key
The commit "net/mlx5: Refactor accel IPSec code" introduced a
bug where asynchronous short time change in hash key value
by create/release SA context might happen during an asynchronous
hash resize operation this could cause a subsequent remove SA
context operation to fail as the key value used during resize is
not the same key value used when remove SA context operation is
invoked.
This commit fixes the bug by defining the SA context hash key
such that it includes only fields that never change during the
lifetime of the SA context object.
Fixes:
d6c4f0298cec ("net/mlx5: Refactor accel IPSec code")
Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Xin Long [Sun, 18 Nov 2018 08:14:47 +0000 (16:14 +0800)]
Revert "sctp: remove sctp_transport_pmtu_check"
This reverts commit
22d7be267eaa8114dcc28d66c1c347f667d7878a.
The dst's mtu in transport can be updated by a non sctp place like
in xfrm where the MTU information didn't get synced between asoc,
transport and dst, so it is still needed to do the pmtu check
in sctp_packet_config.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 07:21:53 +0000 (15:21 +0800)]
sctp: not allow to set asoc prsctp_enable by sockopt
As rfc7496#section4.5 says about SCTP_PR_SUPPORTED:
This socket option allows the enabling or disabling of the
negotiation of PR-SCTP support for future associations. For existing
associations, it allows one to query whether or not PR-SCTP support
was negotiated on a particular association.
It means only sctp sock's prsctp_enable can be set.
Note that for the limitation of SCTP_{CURRENT|ALL}_ASSOC, we will
add it when introducing SCTP_{FUTURE|CURRENT|ALL}_ASSOC for linux
sctp in another patchset.
v1->v2:
- drop the params.assoc_id check as Neil suggested.
Fixes:
28aa4c26fce2 ("sctp: add SCTP_PR_SUPPORTED on sctp sockopt")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 07:07:38 +0000 (15:07 +0800)]
sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit
Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the
skb allocked in sctp_packet_transmit and decreases by 1 when freeing
this skb.
But when this skb goes through networking stack, some subcomponents
might change skb->truesize and add the same amount on sk_wmem_alloc.
However sctp doesn't know the amount to decrease by, it would cause
a leak on sk->sk_wmem_alloc and the sock can never be freed.
Xiumei found this issue when it hit esp_output_head() by using sctp
over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.
Since sctp has used sk_wmem_queued to count for writable space since
Commit
cd305c74b0f8 ("sctp: use sk_wmem_queued to check for writable
space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize
in sctp_packet_transmit.
Fixes:
cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 07:01:03 +0000 (08:01 +0100)]
MAINTAINERS: Add myself as third phylib maintainer
Add myself as third phylib maintainer.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 19 Nov 2018 17:24:04 +0000 (09:24 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
Matthew Cover [Sun, 18 Nov 2018 07:46:00 +0000 (00:46 -0700)]
tuntap: fix multiqueue rx
When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.
Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.
The rx traffic can be generated by either of the following.
1. a simple tap program which spins up multiple queues and writes packets
to each of the file descriptors
2. tx from a qemu vm with a tap multiqueue netdev
The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
1. a simple netmap program which opens and reads from per-queue
descriptors
2. configuring RPS and doing per-cpu captures with rxtxcpu
Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.
Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.
Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sun, 18 Nov 2018 18:45:30 +0000 (10:45 -0800)]
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.
Fixes:
ca254490c8df ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <preethir@juniper.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 18 Nov 2018 21:33:44 +0000 (13:33 -0800)]
Linux 4.20-rc3
Linus Torvalds [Sun, 18 Nov 2018 20:21:09 +0000 (12:21 -0800)]
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.
The overflow continuation fix addresses something that has been broken
for several releases. Arguably it could wait even longer, but it's a
one line fix and this finishes the last of the known address range
scrub bug reports. The revert addresses a lockdep regression. The unit
tests are not critical to fix, but no reason to hold this fix back.
Summary:
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error
injection and platform-BIOS support enabled this corner case to be
exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions
triggers a lockdep report. Revert and try again at the next merge
window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)"
* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
Revert "acpi, nfit: Further restrict userspace ARS start requests"
acpi, nfit: Fix ARS overflow continuation
tools/testing/nvdimm: Fix the array size for dimm devices.
Linus Torvalds [Sun, 18 Nov 2018 19:31:26 +0000 (11:31 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
mm, page_alloc: check for max order in hot path
scripts/spdxcheck.py: make python3 compliant
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
mm/vmstat.c: fix NUMA statistics updates
mm/gup.c: fix follow_page_mask() kerneldoc comment
ocfs2: free up write context when direct IO failed
scripts/faddr2line: fix location of start_kernel in comment
mm: don't reclaim inodes with many attached pages
mm, memory_hotplug: check zone_movable in has_unmovable_pages
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
MAINTAINERS: update OMAP MMC entry
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
kernel/sched/psi.c: simplify cgroup_move_task()
z3fold: fix possible reclaim races
Linus Torvalds [Sun, 18 Nov 2018 18:58:20 +0000 (10:58 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
caused by incorrectly calculating load and migrating tasks on exec()
when they shouldn't be"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix cpu_util_wake() for 'execl' type workloads
Linus Torvalds [Sun, 18 Nov 2018 18:54:59 +0000 (10:54 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Fix uncore PMU enumeration for CofeeLake CPUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
Linus Torvalds [Sun, 18 Nov 2018 18:52:26 +0000 (10:52 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Misc fixes: two warning splat fixes, a leak fix and persistent memory
allocation fixes for ARM"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Permit calling efi_mem_reserve_persistent() from atomic context
efi/arm: Defer persistent reservations until after paging_init()
efi/arm/libstub: Pack FDT after populating it
efi/arm: Revert deferred unmap of early memmap mapping
efi: Fix debugobjects warning on 'efi_rts_work'
Linus Torvalds [Sun, 18 Nov 2018 18:45:09 +0000 (10:45 -0800)]
Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM spectre updates from Russell King:
"These are the currently known final bits that resolve the Spectre
issues. big.Little systems used to be sufficiently identical in that
there were no differences between individual CPUs in the system that
mattered to the kernel. With the advent of the Spectre problem, the
CPUs now have differences in how the workaround is applied.
As a result of previous Spectre patches, these systems ended up
reporting quite a lot of:
"CPUx: Spectre v2: incorrect context switching function, system vulnerable"
messages due to the action of the big.Little switcher causing the CPUs
to be re-initialised regularly. This series resolves that issue by
making the CPU vtable unique to each CPU.
However, since this is used very early, before per-cpu is setup,
per-cpu can't be used. We also have a problem that two of the methods
are not called from preempt-safe paths, but thankfully these remain
identical between all CPUs in the system. To make sure, we validate
that these are identical during boot"
* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: spectre-v2: per-CPU vtables to work around big.Little systems
ARM: add PROC_VTABLE and PROC_TABLE macros
ARM: clean up per-processor check_bugs method call
ARM: split out processor lookup
ARM: make lookup_processor_type() non-__init
Chen Chang [Fri, 16 Nov 2018 23:08:57 +0000 (15:08 -0800)]
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.com
Signed-off-by: Chen Chang <rainccrun@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 16 Nov 2018 23:08:53 +0000 (15:08 -0800)]
mm, page_alloc: check for max order in hot path
Konstantin has noticed that kvmalloc might trigger the following
warning:
WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report
UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.
Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uwe Kleine-König [Fri, 16 Nov 2018 23:08:43 +0000 (15:08 -0800)]
scripts/spdxcheck.py: make python3 compliant
Without this change the following happens when using Python3 (3.6.6):
$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 253, in <module>
parser.parse_lines(sys.stdin, args.maxlines, '-')
File "scripts/spdxcheck.py", line 171, in parse_lines
line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'
So as the line is already a string, there is no need to decode it and
the line can be dropped.
/usr/bin/python on Arch is Python 3. So this would indeed be worth
going into 4.19.
Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yufen Yu [Fri, 16 Nov 2018 23:08:39 +0000 (15:08 -0800)]
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.
man 2 lseek says
: EINVAL whence is not valid. Or: the resulting file offset would be
: negative, or beyond the end of a seekable device.
:
: ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
: the end of the file.
Make tmpfs return ENXIO under these circumstances as well. After this,
tmpfs also passes xfstests's generic/448.
[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 16 Nov 2018 23:08:35 +0000 (15:08 -0800)]
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
gcc-8 complains about the prototype for this function:
lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]
This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210
Workaround this by removing the noreturn attribute.
[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Janne Huttunen [Fri, 16 Nov 2018 23:08:32 +0000 (15:08 -0800)]
mm/vmstat.c: fix NUMA statistics updates
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes:
1d90ca897cb0 ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Fri, 16 Nov 2018 23:08:29 +0000 (15:08 -0800)]
mm/gup.c: fix follow_page_mask() kerneldoc comment
Commit
df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.
Update the description to make the code and comments agree again.
While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.
Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wengang Wang [Fri, 16 Nov 2018 23:08:25 +0000 (15:08 -0800)]
ocfs2: free up write context when direct IO failed
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Fri, 16 Nov 2018 23:08:22 +0000 (15:08 -0800)]
scripts/faddr2line: fix location of start_kernel in comment
Fix a source file reference location to the correct path name.
Link: http://lkml.kernel.org/r/1d50bd3d-178e-dcd8-779f-9711887440eb@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Fri, 16 Nov 2018 23:08:18 +0000 (15:08 -0800)]
mm: don't reclaim inodes with many attached pages
Spock reported that commit
172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 16 Nov 2018 23:08:15 +0000 (15:08 -0800)]
mm, memory_hotplug: check zone_movable in has_unmovable_pages
Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state shows
has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
page:
ffffea0437fb0000 count:1 mapcount:1 mapping:
ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).
Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit
15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.
Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes:
15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasily Averin [Fri, 16 Nov 2018 23:08:11 +0000 (15:08 -0800)]
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
Commit
a2468cc9bfdf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes:
a2468cc9bfdf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aaro Koskinen [Fri, 16 Nov 2018 23:08:08 +0000 (15:08 -0800)]
MAINTAINERS: update OMAP MMC entry
Jarkko's e-mail address hasn't worked for a long time. We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so change myself as a maintainer
with "Odd Fixes" status.
Link: http://lkml.kernel.org/r/20181106222750.12939-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 16 Nov 2018 23:08:04 +0000 (15:08 -0800)]
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by the Oracle DB team. The
BUG is in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and
alignment such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork
process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
tables. Copying page tables for hugetlb mappings is done in the
routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the
above test fails. The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page. This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes:
c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Olof Johansson [Fri, 16 Nov 2018 23:08:00 +0000 (15:08 -0800)]
kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes:
2ce7135adc9ad ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Fri, 16 Nov 2018 23:07:56 +0000 (15:07 -0800)]
z3fold: fix possible reclaim races
Reclaim and free can race on an object which is basically fine but in
order for reclaim to be able to map "freed" object we need to encode
object length in the handle. handle_to_chunks() is then introduced to
extract object length from a handle and use it during mapping.
Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.
While at it, fix the page accounting in reclaim function.
This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".
Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Signed-off-by: Jongseok Kim <ks77sj@gmail.com>
Reported-by-by: Jongseok Kim <ks77sj@gmail.com>
Reviewed-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jon Maloy [Sat, 17 Nov 2018 17:17:06 +0000 (12:17 -0500)]
tipc: don't assume linear buffer when reading ancillary data
The code for reading ancillary data from a received buffer is assuming
the buffer is linear. To make this assumption true we have to linearize
the buffer before message data is read.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Fri, 16 Nov 2018 21:55:04 +0000 (16:55 -0500)]
tipc: fix lockdep warning when reinitilaizing sockets
We get the following warning:
[ 47.926140] 32-bit node address hash set to 2010a0a
[ 47.927202]
[ 47.927433] ================================
[ 47.928050] WARNING: inconsistent lock state
[ 47.928661] 4.19.0+ #37 Tainted: G E
[ 47.929346] --------------------------------
[ 47.929954] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 47.930116] swapper/3/0 [HC0[0]:SC1[3]:HE1:SE0] takes:
[ 47.930116]
00000000af8bc31e (&(&ht->lock)->rlock){+.?.}, at: rhashtable_walk_enter+0x36/0xb0
[ 47.930116] {SOFTIRQ-ON-W} state was registered at:
[ 47.930116] _raw_spin_lock+0x29/0x60
[ 47.930116] rht_deferred_worker+0x556/0x810
[ 47.930116] process_one_work+0x1f5/0x540
[ 47.930116] worker_thread+0x64/0x3e0
[ 47.930116] kthread+0x112/0x150
[ 47.930116] ret_from_fork+0x3a/0x50
[ 47.930116] irq event stamp: 14044
[ 47.930116] hardirqs last enabled at (14044): [<
ffffffff9a07fbba>] __local_bh_enable_ip+0x7a/0xf0
[ 47.938117] hardirqs last disabled at (14043): [<
ffffffff9a07fb81>] __local_bh_enable_ip+0x41/0xf0
[ 47.938117] softirqs last enabled at (14028): [<
ffffffff9a0803ee>] irq_enter+0x5e/0x60
[ 47.938117] softirqs last disabled at (14029): [<
ffffffff9a0804a5>] irq_exit+0xb5/0xc0
[ 47.938117]
[ 47.938117] other info that might help us debug this:
[ 47.938117] Possible unsafe locking scenario:
[ 47.938117]
[ 47.938117] CPU0
[ 47.938117] ----
[ 47.938117] lock(&(&ht->lock)->rlock);
[ 47.938117] <Interrupt>
[ 47.938117] lock(&(&ht->lock)->rlock);
[ 47.938117]
[ 47.938117] *** DEADLOCK ***
[ 47.938117]
[ 47.938117] 2 locks held by swapper/3/0:
[ 47.938117] #0:
0000000062c64f90 ((&d->timer)){+.-.}, at: call_timer_fn+0x5/0x280
[ 47.938117] #1:
00000000ee39619c (&(&d->lock)->rlock){+.-.}, at: tipc_disc_timeout+0xc8/0x540 [tipc]
[ 47.938117]
[ 47.938117] stack backtrace:
[ 47.938117] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 4.19.0+ #37
[ 47.938117] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 47.938117] Call Trace:
[ 47.938117] <IRQ>
[ 47.938117] dump_stack+0x5e/0x8b
[ 47.938117] print_usage_bug+0x1ed/0x1ff
[ 47.938117] mark_lock+0x5b5/0x630
[ 47.938117] __lock_acquire+0x4c0/0x18f0
[ 47.938117] ? lock_acquire+0xa6/0x180
[ 47.938117] lock_acquire+0xa6/0x180
[ 47.938117] ? rhashtable_walk_enter+0x36/0xb0
[ 47.938117] _raw_spin_lock+0x29/0x60
[ 47.938117] ? rhashtable_walk_enter+0x36/0xb0
[ 47.938117] rhashtable_walk_enter+0x36/0xb0
[ 47.938117] tipc_sk_reinit+0xb0/0x410 [tipc]
[ 47.938117] ? mark_held_locks+0x6f/0x90
[ 47.938117] ? __local_bh_enable_ip+0x7a/0xf0
[ 47.938117] ? lockdep_hardirqs_on+0x20/0x1a0
[ 47.938117] tipc_net_finalize+0xbf/0x180 [tipc]
[ 47.938117] tipc_disc_timeout+0x509/0x540 [tipc]
[ 47.938117] ? call_timer_fn+0x5/0x280
[ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[ 47.938117] call_timer_fn+0xa1/0x280
[ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[ 47.938117] run_timer_softirq+0x1f2/0x4d0
[ 47.938117] __do_softirq+0xfc/0x413
[ 47.938117] irq_exit+0xb5/0xc0
[ 47.938117] smp_apic_timer_interrupt+0xac/0x210
[ 47.938117] apic_timer_interrupt+0xf/0x20
[ 47.938117] </IRQ>
[ 47.938117] RIP: 0010:default_idle+0x1c/0x140
[ 47.938117] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 53 65 8b 2d d8 2b 74 65 0f 1f 44 00 00 e8 c6 2c 8b ff fb f4 <65> 8b 2d c5 2b 74 65 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 b4 2b
[ 47.938117] RSP: 0018:
ffffaf6ac0207ec8 EFLAGS:
00000206 ORIG_RAX:
ffffffffffffff13
[ 47.938117] RAX:
ffff8f5b3735e200 RBX:
0000000000000003 RCX:
0000000000000001
[ 47.938117] RDX:
0000000000000001 RSI:
0000000000000001 RDI:
ffff8f5b3735e200
[ 47.938117] RBP:
0000000000000003 R08:
0000000000000001 R09:
0000000000000000
[ 47.938117] R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000000
[ 47.938117] R13:
0000000000000000 R14:
ffff8f5b3735e200 R15:
ffff8f5b3735e200
[ 47.938117] ? default_idle+0x1a/0x140
[ 47.938117] do_idle+0x1bc/0x280
[ 47.938117] cpu_startup_entry+0x19/0x20
[ 47.938117] start_secondary+0x187/0x1c0
[ 47.938117] secondary_startup_64+0xa4/0xb0
The reason seems to be that tipc_net_finalize()->tipc_sk_reinit() is
calling the function rhashtable_walk_enter() within a timer interrupt.
We fix this by executing tipc_net_finalize() in work queue context.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 18 Nov 2018 05:57:02 +0000 (21:57 -0800)]
net-gro: reset skb->pkt_type in napi_reuse_skb()
eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.
This is indeed the value right after a fresh skb allocation.
However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.
Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.
napi_reuse_skb() was added in commit
96e93eab2033 ("gro: Add
internal interfaces for VLAN"), but this bug always has
been there.
Fixes:
96e93eab2033 ("gro: Add internal interfaces for VLAN")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 18 Nov 2018 05:54:53 +0000 (21:54 -0800)]
Merge branch 'tdc-fixes'
Lucas Bates says:
====================
Prevent uncaught exceptions in tdc
This patch series addresses two potential bugs in tdc that can
cause exceptions to be raised in certain circumstances. These
exceptions are generally not handled, so instead we will prevent
them from being raised.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Fri, 16 Nov 2018 22:37:56 +0000 (17:37 -0500)]
tc-testing: tdc.py: Guard against lack of returncode in executed command
Add some defensive coding in case one of the subprocesses created by tdc
returns nothing. If no object is returned from exec_cmd, then tdc will
halt with an unhandled exception.
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com>
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 16 Nov 2018 22:37:55 +0000 (17:37 -0500)]
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
Prevent exceptions from being raised while decoding output
from an executed command. There is no impact on tdc's
execution and the verify command phase would fail the pattern
match.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Fri, 16 Nov 2018 15:58:19 +0000 (16:58 +0100)]
ip_tunnel: don't force DF when MTU is locked
The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.
This patch makes setting the DF bit conditional on the route's MTU
locking state.
This issue seems to be older than git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toke Høiland-Jørgensen [Fri, 16 Nov 2018 20:13:59 +0000 (12:13 -0800)]
MAINTAINERS: Add entry for CAKE qdisc
We would like the existing community to be kept in the loop for any new
developments on CAKE; and I certainly plan to keep maintaining it. Reflect
this in MAINTAINERS.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Fri, 16 Nov 2018 16:50:01 +0000 (18:50 +0200)]
net: bridge: fix vlan stats use-after-free on destruction
Syzbot reported a use-after-free of the global vlan context on port vlan
destruction. When I added per-port vlan stats I missed the fact that the
global vlan context can be freed before the per-port vlan rcu callback.
There're a few different ways to deal with this, I've chosen to add a
new private flag that is set only when per-port stats are allocated so
we can directly check it on destruction without dereferencing the global
context at all. The new field in net_bridge_vlan uses a hole.
v2: cosmetic change, move the check to br_process_vlan_info where the
other checks are done
v3: add change log in the patch, add private (in-kernel only) flags in a
hole in net_bridge_vlan struct and use that instead of mixing
user-space flags with private flags
Fixes:
9163a0fc1f0c ("net: bridge: add support for per-port vlan stats")
Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slavomir Kaslev [Fri, 16 Nov 2018 09:27:53 +0000 (11:27 +0200)]
socket: do a generic_file_splice_read when proto_ops has no splice_read
splice(2) fails with -EINVAL when called reading on a socket with no splice_read
set in its proto_ops (such as vsock sockets). Switch this to fallbacks to a
generic_file_splice_read instead.
Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Schiller [Fri, 16 Nov 2018 07:38:36 +0000 (08:38 +0100)]
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Up until commit
7e5fbd1e0700 ("net: mdio-gpio: Convert to use gpiod
functions where possible"), the _cansleep variants of the gpio_ API was
used. After that commit and the change to gpiod_ API, the _cansleep()
was dropped. This then results in WARN_ON() when used with GPIO
devices which do sleep. Add back the _cansleep() to avoid this.
Fixes:
7e5fbd1e0700 ("net: mdio-gpio: Convert to use gpiod functions where possible")
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 17 Nov 2018 07:04:37 +0000 (23:04 -0800)]
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
This reverts commit
dfa0d55ff6be64e7b6881212a291cb95f8da3b08.
Discussion still ongoing, I shouldn't have applied this.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 17 Nov 2018 04:26:30 +0000 (20:26 -0800)]
Merge tag 'batadv-net-for-davem-
20181114' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- Explicitly pad short ELP packets with zeros, by Sven Eckelmann
- Fix packet size calculation when merging fragments,
by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Schiller [Wed, 14 Nov 2018 11:54:49 +0000 (12:54 +0100)]
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
This commit re-enables support for slow GPIO pins. It was initially
introduced by commit
2d6c9091ab76 ("net: mdio-gpio: support access that
may sleep") and got lost by commit
7e5fbd1e0700 ("net: mdio-gpio:
Convert to use gpiod functions where possible").
Also add a warning about slow GPIO pins like it is done in i2c-gpio.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Wed, 14 Nov 2018 11:17:25 +0000 (12:17 +0100)]
net/sched: act_pedit: fix memory leak when IDR allocation fails
tcf_idr_check_alloc() can return a negative value, on allocation failures
(-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.
Fixes:
0190c1d452a9 ("net: sched: atomically check-allocate action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe JAILLET [Tue, 13 Nov 2018 17:42:24 +0000 (18:42 +0100)]
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
Return 'err' in the error handling path instead of 0.
Return explicitly 0 in the normal path, instead of 'err', which is known
to be 0 at this point.
Fixes:
fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 13 Nov 2018 16:48:28 +0000 (00:48 +0800)]
ipv6: fix a dst leak when removing its exception
These is no need to hold dst before calling rt6_remove_exception_rt().
The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(),
which has been removed in Commit
93531c674315 ("net/ipv6: separate
handling of FIB entries from dst based routes"). Otherwise, it will
cause a dst leak.
This patch is to simply remove the dst_hold_safe() call before calling
rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt().
It's safe, because the removal of the exception that holds its dst's
refcnt is protected by rt6_exception_lock.
Fixes:
93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Fixes:
23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>