Sabrina Dubroca [Fri, 25 Aug 2017 11:10:12 +0000 (13:10 +0200)]
tcp: fix refcnt leak with ebpf congestion control
There are a few bugs around refcnt handling in the new BPF congestion
control setsockopt:
- The new ca is assigned to icsk->icsk_ca_ops even in the case where we
cannot get a reference on it. This would lead to a use after free,
since that ca is going away soon.
- Changing the congestion control case doesn't release the refcnt on
the previous ca.
- In the reinit case, we first leak a reference on the old ca, then we
call tcp_reinit_congestion_control on the ca that we have just
assigned, leading to deinitializing the wrong ca (->release of the
new ca on the old ca's data) and releasing the refcount on the ca
that we actually want to use.
This is visible by building (for example) BIC as a module and setting
net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
samples/bpf.
This patch fixes the refcount issues, and moves reinit back into tcp
core to avoid passing a ca pointer back to BPF.
Fixes:
91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Fri, 25 Aug 2017 07:05:42 +0000 (09:05 +0200)]
ipv6: Fix may be used uninitialized warning in rt6_check
rt_cookie might be used uninitialized, fix this by
initializing it.
Fixes:
c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 24 Aug 2017 23:49:16 +0000 (16:49 -0700)]
netvsc: fix deadlock betwen link status and removal
There is a deadlock possible when canceling the link status
delayed work queue. The removal process is run with RTNL held,
and the link status callback is acquring RTNL.
Resolve the issue by using trylock and rescheduling.
If cancel is in process, that block it from happening.
Fixes:
122a5f6410f4 ("staging: hv: use delayed_work for netvsc_send_garp()")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 Aug 2017 04:54:35 +0000 (21:54 -0700)]
Merge branch 'tipc-buffer-reassignment-fixes'
Parthasarathy Bhuvaragan says:
====================
tipc: buffer reassignment fixes
This series contains fixes for buffer reassignments and a context imbalance.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Thu, 24 Aug 2017 14:31:24 +0000 (16:31 +0200)]
tipc: context imbalance at node read unlock
If we fail to find a valid bearer in tipc_node_get_linkname(),
node_read_unlock() is called without holding the node read lock.
This commit fixes this error.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Thu, 24 Aug 2017 14:31:23 +0000 (16:31 +0200)]
tipc: reassign pointers after skb reallocation / linearization
In tipc_msg_reverse(), we assign skb attributes to local pointers
in stack at startup. This is followed by skb_linearize() and for
cloned buffers we perform skb relocation using pskb_expand_head().
Both these methods may update the skb attributes and thus making
the pointers incorrect.
In this commit, we fix this error by ensuring that the pointers
are re-assigned after any of these skb operations.
Fixes:
29042e19f2c60 ("tipc: let function tipc_msg_reverse() expand header
when needed")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Thu, 24 Aug 2017 14:31:22 +0000 (16:31 +0200)]
tipc: perform skb_linearize() before parsing the inner header
In tipc_rcv(), we linearize only the header and usually the packets
are consumed as the nodes permit direct reception. However, if the
skb contains tunnelled message due to fail over or synchronization
we parse it in tipc_node_check_state() without performing
linearization. This will cause link disturbances if the skb was
non linear.
In this commit, we perform linearization for the above messages.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 25 Aug 2017 04:12:28 +0000 (21:12 -0700)]
net_sched: fix a refcount_t issue with noop_qdisc
syzkaller reported a refcount_t warning [1]
Issue here is that noop_qdisc refcnt was never really considered as
a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set :
if (qdisc->flags & TCQ_F_BUILTIN ||
!refcount_dec_and_test(&qdisc->refcnt)))
return;
Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
really needed, but harmless until refcount_t came.
To fix this problem, we simply need to not increment noop_qdisc.refcnt,
since we never decrement it.
[1]
refcount_t: increment on 0; use-after-free.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
panic+0x1e4/0x417 kernel/panic.c:180
__warn+0x1c4/0x1d9 kernel/panic.c:541
report_bug+0x211/0x2d0 lib/bug.c:183
fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
RSP: 0018:
ffff8801c43477a0 EFLAGS:
00010282
RAX:
000000000000002b RBX:
ffffffff86093c14 RCX:
0000000000000000
RDX:
000000000000002b RSI:
ffffffff8159314e RDI:
ffffed0038868ee8
RBP:
ffff8801c43477a8 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
ffffffff86093ac0
R13:
0000000000000001 R14:
ffff8801d0f3bac0 R15:
dffffc0000000000
attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
__dev_open+0x227/0x330 net/core/dev.c:1380
__dev_change_flags+0x695/0x990 net/core/dev.c:6726
dev_change_flags+0x88/0x140 net/core/dev.c:6792
dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
sock_do_ioctl+0x94/0xb0 net/socket.c:968
sock_ioctl+0x2c2/0x440 net/socket.c:1058
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
Fixes:
7b9364050246 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Reshetova, Elena <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 24 Aug 2017 23:01:13 +0000 (16:01 -0700)]
net: systemport: Free DMA coherent descriptors on errors
In case bcm_sysport_init_tx_ring() is not able to allocate ring->cbs, we
would return with an error, and call bcm_sysport_fini_tx_ring() and it
would see that ring->cbs is NULL and do nothing. This would leak the
coherent DMA descriptor area, so we need to free it on error before
returning.
Reported-by: Eric Dumazet <edumazet@gmail.com>
Fixes:
80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 24 Aug 2017 22:56:29 +0000 (15:56 -0700)]
net: bcmgenet: Be drop monitor friendly
There are 3 spots where we call dev_kfree_skb() but we are actually
just doing a normal SKB consumption: __bcmgenet_tx_reclaim() for normal
TX reclamation, bcmgenet_alloc_rx_buffers() during the initial RX ring
setup and bcmgenet_free_rx_buffers() during RX ring cleanup.
Fixes:
d6707bec5986 ("net: bcmgenet: rewrite bcmgenet_rx_refill()")
Fixes:
f48bed16a756 ("net: bcmgenet: Free skb after last Tx frag")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 24 Aug 2017 22:48:21 +0000 (15:48 -0700)]
bpf: fix bpf_setsockopts return value
This patch fixes a bug causing any sock operations to always return EINVAL.
Fixes:
a5192c52377e ("bpf: fix to bpf_setsockops").
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Craig Gallek <kraig@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 24 Aug 2017 22:20:41 +0000 (15:20 -0700)]
net: systemport: Be drop monitor friendly
Utilize dev_consume_skb_any(cb->skb) in bcm_sysport_free_cb() which is
used when a TX packet is completed, as well as when the RX ring is
cleaned on shutdown. None of these two cases are packet drops, so be
drop monitor friendly.
Suggested-by: Eric Dumazet <edumazet@gmail.com>
Fixes:
80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bob Peterson [Wed, 23 Aug 2017 14:43:02 +0000 (10:43 -0400)]
tipc: Fix tipc_sk_reinit handling of -EAGAIN
In
9dbbfb0ab6680c6a85609041011484e6658e7d3c function tipc_sk_reinit
had additional logic added to loop in the event that function
rhashtable_walk_next() returned -EAGAIN. No worries.
However, if rhashtable_walk_start returns -EAGAIN, it does "continue",
and therefore skips the call to rhashtable_walk_stop(). That has
the effect of calling rcu_read_lock() without its paired call to
rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
may not be apparent for a while, especially since resize events may
be rare. But the comments to rhashtable_walk_start() state:
* ...Note that we take the RCU lock in all
* cases including when we return an error. So you must always call
* rhashtable_walk_stop to clean up.
This patch replaces the continue with a goto and label to ensure a
matching call to rhashtable_walk_stop().
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Wed, 23 Aug 2017 13:59:49 +0000 (15:59 +0200)]
qlge: avoid memcpy buffer overflow
gcc-8.0.0 (snapshot) points out that we copy a variable-length string
into a fixed length field using memcpy() with the destination length,
and that ends up copying whatever follows the string:
inlined from 'ql_core_dump' at drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:1106:2:
drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:708:2: error: 'memcpy' reading 15 bytes from a region of size 14 [-Werror=stringop-overflow=]
memcpy(seg_hdr->description, desc, (sizeof(seg_hdr->description)) - 1);
Changing it to use strncpy() will instead zero-pad the destination,
which seems to be the right thing to do here.
The bug is probably harmless, but it seems like a good idea to address
it in stable kernels as well, if only for the purpose of building with
gcc-8 without warnings.
Fixes:
a61f80261306 ("qlge: Add ethtool register dump function.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 24 Aug 2017 16:02:49 +0000 (09:02 -0700)]
virtio_net: be drop monitor friendly
This change is needed to not fool drop monitor.
(perf record ... -e skb:kfree_skb )
Packets were properly sent and are consumed after TX completion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 18:49:19 +0000 (11:49 -0700)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix use after free of struct proc_dir_entry in ipt_CLUSTERIP, patch
from Sabrina Dubroca.
2) Fix spurious EINVAL errors from iptables over nft compatibility layer.
3) Reload pointer to ip header only if there is non-terminal verdict,
ie. XT_CONTINUE, otherwise invalid memory access may happen, patch
from Taehee Yoo.
4) Fix interaction between SYNPROXY and NAT, SYNPROXY adds sequence
adjustment already, however from nf_nat_setup() assumes there's not.
Patch from Xin Long.
5) Fix burst arithmetics in nft_limit as Joe Stringer mentioned during
NFWS in Faro. Patch from Andy Zhou.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
andy zhou [Mon, 21 Aug 2017 19:38:53 +0000 (12:38 -0700)]
netfilter: nf_tables: Fix nft limit burst handling
Current implementation treats the burst configuration the same as
rate configuration. This can cause the per packet cost to be lower
than configured. In effect, this bug causes the token bucket to be
refilled at a higher rate than what user has specified.
This patch changes the implementation so that the token bucket size
is controlled by "rate + burst", while maintain the token bucket
refill rate the same as user specified.
Fixes:
96518518cc41 ("netfilter: add nftables")
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Xin Long [Thu, 10 Aug 2017 02:22:24 +0000 (10:22 +0800)]
netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info
Commit
4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy
and seqadj ct extensions") wanted to drop the packet when it fails to add
seqadj ext due to no memory by checking if nfct_seqadj_ext_add returns
NULL.
But that nfct_seqadj_ext_add returns NULL can also happen when seqadj ext
already exists in a nf_conn. It will cause that userspace protocol doesn't
work when both dnat and snat are configured.
Li Shuang found this issue in the case:
Topo:
ftp client router ftp server
10.167.131.2 <-> 10.167.131.254 10.167.141.254 <-> 10.167.141.1
Rules:
# iptables -t nat -A PREROUTING -i eth1 -p tcp -m tcp --dport 21 -j \
DNAT --to-destination 10.167.141.1
# iptables -t nat -A POSTROUTING -o eth2 -p tcp -m tcp --dport 21 -j \
SNAT --to-source 10.167.141.254
In router, when both dnat and snat are added, nf_nat_setup_info will be
called twice. The packet can be dropped at the 2nd time for DNAT due to
seqadj ext is already added at the 1st time for SNAT.
This patch is to fix it by checking for seqadj ext existence before adding
it, so that the packet will not be dropped if seqadj ext already exists.
Note that as Florian mentioned, as a long term, we should review ext_add()
behaviour, it's better to return a pointer to the existing ext instead.
Fixes:
4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions")
Reported-by: Li Shuang <shuali@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Thu, 24 Aug 2017 05:42:43 +0000 (22:42 -0700)]
Merge branch 'bnxt_en-bug-fixes'
Michael Chan says:
====================
bnxt_en: bug fixes.
3 bug fixes related to XDP ring accounting in bnxt_setup_tc(), freeing
MSIX vectors when bnxt_re unregisters, and preserving the user-administered
PF MAC address when disabling SRIOV.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Wed, 23 Aug 2017 23:34:05 +0000 (19:34 -0400)]
bnxt_en: Do not setup MAC address in bnxt_hwrm_func_qcaps().
bnxt_hwrm_func_qcaps() is called during probe to get all device
resources and it also sets up the factory MAC address. The same function
is called when SRIOV is disabled to reclaim all resources. If
the MAC address has been overridden by a user administered MAC
address, calling this function will overwrite it.
Separate the logic that sets up the default MAC address into a new
function bnxt_init_mac_addr() that is only called during probe time.
Fixes:
4a21b49b34c0 ("bnxt_en: Improve VF resource accounting.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Wed, 23 Aug 2017 23:34:04 +0000 (19:34 -0400)]
bnxt_en: Free MSIX vectors when unregistering the device from bnxt_re.
Take back ownership of the MSIX vectors when unregistering the device
from bnxt_re.
Fixes:
a588e4580a7e ("bnxt_en: Add interface to support RDMA driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Wed, 23 Aug 2017 23:34:03 +0000 (19:34 -0400)]
bnxt_en: Fix .ndo_setup_tc() to include XDP rings.
When the number of TX rings is changed in bnxt_setup_tc(), we need to
include the XDP rings in the total TX ring count.
Fixes:
38413406277f ("bnxt_en: Add support for XDP_TX action.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 23 Aug 2017 21:41:50 +0000 (14:41 -0700)]
nfp: TX time stamp packets before HW doorbell is rung
TX completion may happen any time after HW queue was kicked.
We can't access the skb afterwards. Move the time stamping
before ringing the doorbell.
Fixes:
4c3523623dc0 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Wed, 23 Aug 2017 11:27:13 +0000 (13:27 +0200)]
sctp: Avoid out-of-bounds reads from address storage
inet_diag_msg_sctp{,l}addr_fill() and sctp_get_sctp_info() copy
sizeof(sockaddr_storage) bytes to fill in sockaddr structs used
to export diagnostic information to userspace.
However, the memory allocated to store sockaddr information is
smaller than that and depends on the address family, so we leak
up to 100 uninitialized bytes to userspace. Just use the size of
the source structs instead, in all the three cases this is what
userspace expects. Zero out the remaining memory.
Unused bytes (i.e. when IPv4 addresses are used) in source
structs sctp_sockaddr_entry and sctp_transport are already
cleared by sctp_add_bind_addr() and sctp_transport_new(),
respectively.
Noticed while testing KASAN-enabled kernel with 'ss':
[ 2326.885243] BUG: KASAN: slab-out-of-bounds in inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] at addr
ffff881be8779800
[ 2326.896800] Read of size 128 by task ss/9527
[ 2326.901564] CPU: 0 PID: 9527 Comm: ss Not tainted 4.11.0-22.el7a.x86_64 #1
[ 2326.909236] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[ 2326.917585] Call Trace:
[ 2326.920312] dump_stack+0x63/0x8d
[ 2326.924014] kasan_object_err+0x21/0x70
[ 2326.928295] kasan_report+0x288/0x540
[ 2326.932380] ? inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
[ 2326.938500] ? skb_put+0x8b/0xd0
[ 2326.942098] ? memset+0x31/0x40
[ 2326.945599] check_memory_region+0x13c/0x1a0
[ 2326.950362] memcpy+0x23/0x50
[ 2326.953669] inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
[ 2326.959596] ? inet_diag_msg_sctpasoc_fill+0x460/0x460 [sctp_diag]
[ 2326.966495] ? __lock_sock+0x102/0x150
[ 2326.970671] ? sock_def_wakeup+0x60/0x60
[ 2326.975048] ? remove_wait_queue+0xc0/0xc0
[ 2326.979619] sctp_diag_dump+0x44a/0x760 [sctp_diag]
[ 2326.985063] ? sctp_ep_dump+0x280/0x280 [sctp_diag]
[ 2326.990504] ? memset+0x31/0x40
[ 2326.994007] ? mutex_lock+0x12/0x40
[ 2326.997900] __inet_diag_dump+0x57/0xb0 [inet_diag]
[ 2327.003340] ? __sys_sendmsg+0x150/0x150
[ 2327.007715] inet_diag_dump+0x4d/0x80 [inet_diag]
[ 2327.012979] netlink_dump+0x1e6/0x490
[ 2327.017064] __netlink_dump_start+0x28e/0x2c0
[ 2327.021924] inet_diag_handler_cmd+0x189/0x1a0 [inet_diag]
[ 2327.028045] ? inet_diag_rcv_msg_compat+0x1b0/0x1b0 [inet_diag]
[ 2327.034651] ? inet_diag_dump_compat+0x190/0x190 [inet_diag]
[ 2327.040965] ? __netlink_lookup+0x1b9/0x260
[ 2327.045631] sock_diag_rcv_msg+0x18b/0x1e0
[ 2327.050199] netlink_rcv_skb+0x14b/0x180
[ 2327.054574] ? sock_diag_bind+0x60/0x60
[ 2327.058850] sock_diag_rcv+0x28/0x40
[ 2327.062837] netlink_unicast+0x2e7/0x3b0
[ 2327.067212] ? netlink_attachskb+0x330/0x330
[ 2327.071975] ? kasan_check_write+0x14/0x20
[ 2327.076544] netlink_sendmsg+0x5be/0x730
[ 2327.080918] ? netlink_unicast+0x3b0/0x3b0
[ 2327.085486] ? kasan_check_write+0x14/0x20
[ 2327.090057] ? selinux_socket_sendmsg+0x24/0x30
[ 2327.095109] ? netlink_unicast+0x3b0/0x3b0
[ 2327.099678] sock_sendmsg+0x74/0x80
[ 2327.103567] ___sys_sendmsg+0x520/0x530
[ 2327.107844] ? __get_locked_pte+0x178/0x200
[ 2327.112510] ? copy_msghdr_from_user+0x270/0x270
[ 2327.117660] ? vm_insert_page+0x360/0x360
[ 2327.122133] ? vm_insert_pfn_prot+0xb4/0x150
[ 2327.126895] ? vm_insert_pfn+0x32/0x40
[ 2327.131077] ? vvar_fault+0x71/0xd0
[ 2327.134968] ? special_mapping_fault+0x69/0x110
[ 2327.140022] ? __do_fault+0x42/0x120
[ 2327.144008] ? __handle_mm_fault+0x1062/0x17a0
[ 2327.148965] ? __fget_light+0xa7/0xc0
[ 2327.153049] __sys_sendmsg+0xcb/0x150
[ 2327.157133] ? __sys_sendmsg+0xcb/0x150
[ 2327.161409] ? SyS_shutdown+0x140/0x140
[ 2327.165688] ? exit_to_usermode_loop+0xd0/0xd0
[ 2327.170646] ? __do_page_fault+0x55d/0x620
[ 2327.175216] ? __sys_sendmsg+0x150/0x150
[ 2327.179591] SyS_sendmsg+0x12/0x20
[ 2327.183384] do_syscall_64+0xe3/0x230
[ 2327.187471] entry_SYSCALL64_slow_path+0x25/0x25
[ 2327.192622] RIP: 0033:0x7f41d18fa3b0
[ 2327.196608] RSP: 002b:
00007ffc3b731218 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
[ 2327.205055] RAX:
ffffffffffffffda RBX:
00007ffc3b731380 RCX:
00007f41d18fa3b0
[ 2327.213017] RDX:
0000000000000000 RSI:
00007ffc3b731340 RDI:
0000000000000003
[ 2327.220978] RBP:
0000000000000002 R08:
0000000000000004 R09:
0000000000000040
[ 2327.228939] R10:
00007ffc3b730f30 R11:
0000000000000246 R12:
0000000000000003
[ 2327.236901] R13:
00007ffc3b731340 R14:
00007ffc3b7313d0 R15:
0000000000000084
[ 2327.244865] Object at
ffff881be87797e0, in cache kmalloc-64 size: 64
[ 2327.251953] Allocated:
[ 2327.254581] PID = 9484
[ 2327.257215] save_stack_trace+0x1b/0x20
[ 2327.261485] save_stack+0x46/0xd0
[ 2327.265179] kasan_kmalloc+0xad/0xe0
[ 2327.269165] kmem_cache_alloc_trace+0xe6/0x1d0
[ 2327.274138] sctp_add_bind_addr+0x58/0x180 [sctp]
[ 2327.279400] sctp_do_bind+0x208/0x310 [sctp]
[ 2327.284176] sctp_bind+0x61/0xa0 [sctp]
[ 2327.288455] inet_bind+0x5f/0x3a0
[ 2327.292151] SYSC_bind+0x1a4/0x1e0
[ 2327.295944] SyS_bind+0xe/0x10
[ 2327.299349] do_syscall_64+0xe3/0x230
[ 2327.303433] return_from_SYSCALL_64+0x0/0x6a
[ 2327.308194] Freed:
[ 2327.310434] PID = 4131
[ 2327.313065] save_stack_trace+0x1b/0x20
[ 2327.317344] save_stack+0x46/0xd0
[ 2327.321040] kasan_slab_free+0x73/0xc0
[ 2327.325220] kfree+0x96/0x1a0
[ 2327.328530] dynamic_kobj_release+0x15/0x40
[ 2327.333195] kobject_release+0x99/0x1e0
[ 2327.337472] kobject_put+0x38/0x70
[ 2327.341266] free_notes_attrs+0x66/0x80
[ 2327.345545] mod_sysfs_teardown+0x1a5/0x270
[ 2327.350211] free_module+0x20/0x2a0
[ 2327.354099] SyS_delete_module+0x2cb/0x2f0
[ 2327.358667] do_syscall_64+0xe3/0x230
[ 2327.362750] return_from_SYSCALL_64+0x0/0x6a
[ 2327.367510] Memory state around the buggy address:
[ 2327.372855]
ffff881be8779700: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc
[ 2327.380914]
ffff881be8779780: fb fb fb fb fb fb fb fb fc fc fc fc 00 00 00 00
[ 2327.388972] >
ffff881be8779800: 00 00 00 00 fc fc fc fc fb fb fb fb fb fb fb fb
[ 2327.397031] ^
[ 2327.401792]
ffff881be8779880: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
[ 2327.409850]
ffff881be8779900: 00 00 00 00 00 04 fc fc fc fc fc fc 00 00 00 00
[ 2327.417907] ==================================================================
This fixes CVE-2017-7558.
References: https://bugzilla.redhat.com/show_bug.cgi?id=1480266
Fixes:
8f840e47f190 ("sctp: add the sctp_diag.c file")
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 24 Aug 2017 04:40:32 +0000 (21:40 -0700)]
net: dsa: use consume_skb()
Two kfree_skb() should be consume_skb(), to be friend with drop monitor
(perf record ... -e skb:kfree_skb)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 03:39:44 +0000 (20:39 -0700)]
Merge branch 'nfp-fixes'
Jakub Kicinski says:
====================
nfp: fix SR-IOV deadlock and representor bugs
This series tackles the bug I've already tried to fix in commit
6d48ceb27af1 ("nfp: allocate a private workqueue for driver work").
I created a separate workqueue to avoid possible deadlock, and
the lockdep error disappeared, coincidentally. The way workqueues
are operating, separate workqueue doesn't necessarily mean separate
thread of execution. Luckily we can safely forego the lock.
Second fix changes the order in which vNIC netdevs and representors
are created/destroyed. The fix is kept small and should be sufficient
for net because of how flower uses representors, a more thorough fix
will be targeted at net-next.
Third fix avoids leaking mapped frame buffers if FW sent a frame with
unknown portid.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 23 Aug 2017 06:22:44 +0000 (23:22 -0700)]
nfp: avoid buffer leak when representor is missing
When driver receives a muxed frame, but it can't find the representor
netdev it is destined to it will try to "drop" that frame, i.e. reuse
the buffer. The issue is that the replacement buffer has already been
allocated at this point, and reusing the buffer from received frame
will leak it. Change the code to put the new buffer on the ring
earlier and not reuse the old buffer (make the buffer parameter
to nfp_net_rx_drop() a NULL).
Fixes:
91bf82ca9eed ("nfp: add support for tx/rx with metadata portid")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 23 Aug 2017 06:22:43 +0000 (23:22 -0700)]
nfp: make sure representors are destroyed before their lower netdev
App start/stop callbacks can perform application initialization.
Unfortunately, flower app started using them for creating and
destroying representors. This can lead to a situation where
lower vNIC netdev is destroyed while representors still try
to pass traffic. This will most likely lead to a NULL-dereference
on the lower netdev TX path.
Move the start/stop callbacks, so that representors are created/
destroyed when vNICs are fully initialized.
Fixes:
5de73ee46704 ("nfp: general representor implementation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 23 Aug 2017 06:22:42 +0000 (23:22 -0700)]
nfp: don't hold PF lock while enabling SR-IOV
Enabling SR-IOV VFs will cause the PCI subsystem to schedule a
work and flush its workqueue. Since the nfp driver schedules its
own work we can't enable VFs while holding driver load. Commit
6d48ceb27af1 ("nfp: allocate a private workqueue for driver work")
tried to avoid this deadlock by creating a separate workqueue.
Unfortunately, due to the architecture of workqueue subsystem this
does not guarantee a separate thread of execution. Luckily
we can simply take pci_enable_sriov() from under the driver lock.
Take pci_disable_sriov() from under the lock too for symmetry.
Fixes:
6d48ceb27af1 ("nfp: allocate a private workqueue for driver work")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 03:33:49 +0000 (20:33 -0700)]
Merge branch 'dst-tag-ksz-fix'
Florian Fainelli says:
====================
net: dsa: Fix tag_ksz.c
This implements David's suggestion of providing low-level functions
to control whether skb_pad() and skb_put_padto() should be freeing
the passed skb.
We make use of it to fix a double free in net/dsa/tag_ksz.c that would
occur if we kept using skb_put_padto() in both places.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 22 Aug 2017 22:12:15 +0000 (15:12 -0700)]
net: dsa: skb_put_padto() already frees nskb
The first call of skb_put_padto() will free up the SKB on error, but we
return NULL which tells dsa_slave_xmit() that the original SKB should be
freed so this would lead to a double free here.
The second skb_put_padto() already frees the passed sk_buff reference
upon error, so calling kfree_skb() on it again is not necessary.
Detected by CoverityScan, CID#1416687 ("USE_AFTER_FREE")
Fixes:
e71cb9e00922 ("net: dsa: ksz: fix skb freeing")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 22 Aug 2017 22:12:14 +0000 (15:12 -0700)]
net: core: Specify skb_pad()/skb_put_padto() SKB freeing
Rename skb_pad() into __skb_pad() and make it take a third argument:
free_on_error which controls whether kfree_skb() should be called or
not, skb_pad() directly makes use of it and passes true to preserve its
existing behavior. Do exactly the same thing with __skb_put_padto() and
skb_put_padto().
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephan Gatzka [Tue, 22 Aug 2017 12:25:07 +0000 (14:25 +0200)]
net: stmmac: socfgpa: Ensure emac bit set in sys manager for MII/GMII/SGMII.
When using MII/GMII/SGMII in the Altera SoC, the phy needs to be
wired through the FPGA. To ensure correct behavior, the appropriate
bit in the System Manager FPGA Interface Group register needs to be
set.
Signed-off-by: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 22 Aug 2017 22:24:47 +0000 (15:24 -0700)]
fsl/man: Inherit parent device and of_node
Junote Cai reported that he was not able to get a DSA setup involving the
Freescale DPAA/FMAN driver to work and narrowed it down to
of_find_net_device_by_node(). This function requires the network device's
device reference to be correctly set which is the case here, though we have
lost any device_node association there.
The problem is that dpaa_eth_add_device() allocates a "dpaa-ethernet" platform
device, and later on dpaa_eth_probe() is called but SET_NETDEV_DEV() won't be
propagating &pdev->dev.of_node properly. Fix this by inherenting both the parent
device and the of_node when dpaa_eth_add_device() creates the platform device.
Fixes:
3933961682a3 ("fsl/fman: Add FMan MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 22 Aug 2017 22:06:09 +0000 (00:06 +0200)]
bpf: fix map value attribute for hash of maps
Currently, iproute2's BPF ELF loader works fine with array of maps
when retrieving the fd from a pinned node and doing a selfcheck
against the provided map attributes from the object file, but we
fail to do the same for hash of maps and thus refuse to get the
map from pinned node.
Reason is that when allocating hash of maps, fd_htab_map_alloc() will
set the value size to sizeof(void *), and any user space map creation
requests are forced to set 4 bytes as value size. Thus, selfcheck
will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
returns the value size used to create the map.
Fix it by handling it the same way as we do for array of maps, which
means that we leave value size at 4 bytes and in the allocation phase
round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
calling into htab_map_update_elem() with the pointer of the map
pointer as value. Unlike array of maps where we just xchg(), we're
using the generic htab_map_update_elem() callback also used from helper
calls, which published the key/value already on return, so we need
to ensure to memcpy() the right size.
Fixes:
bcc6b1b7ebf8 ("bpf: Add hash of maps support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 22 Aug 2017 21:26:47 +0000 (14:26 -0700)]
net: phy: Deal with unbound PHY driver in phy_attached_print()
Priit reported that stmmac was crashing with the trace below. This is because
phy_attached_print() is called too early right after the PHY device has been
found, but before it has a driver attached, since that is only done in
phy_probe() which occurs later.
Fix this by dealing with a possibly NULL phydev->drv point since that can
happen here, but could also happen if we voluntarily did an unbind of the
PHY device with the PHY driver.
sun7i-dwmac 1c50000.ethernet: PTP uses main clock
sun7i-dwmac 1c50000.ethernet: no reset control found
sun7i-dwmac 1c50000.ethernet: no regulator found
sun7i-dwmac 1c50000.ethernet: Ring mode enabled
sun7i-dwmac 1c50000.ethernet: DMA HW capability register supported
sun7i-dwmac 1c50000.ethernet: Normal descriptors
libphy: stmmac: probed
Unable to handle kernel NULL pointer dereference at virtual address
00000048
pgd =
c0004000
[
00000048] *pgd=
00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc6-00318-g0065bd7fa384 #1
Hardware name: Allwinner sun7i (A20) Family
task:
ee868000 task.stack:
ee85c000
PC is at phy_attached_print+0x1c/0x8c
LR is at stmmac_mdio_register+0x12c/0x200
pc : [<
c04510ac>] lr : [<
c045e6b4>] psr:
60000013
sp :
ee85ddc8 ip :
00000000 fp :
c07dfb5c
r10:
ee981210 r9 :
00000001 r8 :
eea73000
r7 :
eeaa6dd0 r6 :
eeb49800 r5 :
00000000 r4 :
00000000
r3 :
00000000 r2 :
00000000 r1 :
00000000 r0 :
eeb49800
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control:
10c5387d Table:
4000406a DAC:
00000051
Process swapper/0 (pid: 1, stack limit = 0xee85c210)
Stack: (0xee85ddc8 to 0xee85e000)
ddc0:
00000000 00000002 eeb49400 eea72000 00000000 eeb49400
dde0:
c045e6b4 00000000 ffffffff eeab0810 00000000 c08051f8 ee9292c0 c016d480
de00:
eea725c0 eea73000 eea72000 00000001 eea726c0 c0457d0c 00000040 00000020
de20:
00000000 c045b850 00000001 00000000 ee981200 eeab0810 eeaa6ed0 ee981210
de40:
00000000 c094a4a0 00000000 c0465180 eeaa7550 f08d0000 c9ffb90c 00000032
de60:
fffffffa 00000032 ee981210 ffffffed c0a46620 fffffdfb c0a46620 c03f7be8
de80:
ee981210 c0a9a388 00000000 00000000 c0a46620 c03f63e0 ee981210 c0a46620
dea0:
ee981244 00000000 00000007 000000c6 c094a4a0 c03f6534 00000000 c0a46620
dec0:
c03f6490 c03f49ec ee828a58 ee9217b4 c0a46620 eeaa4b00 c0a43230 c03f59fc
dee0:
c08051f8 c094a49c c0a46620 c0a46620 00000000 c091c668 c093783c c03f6dfc
df00:
ffffe000 00000000 c091c668 c010177c eefe0938 eefe0935 c085e200 000000c6
df20:
00000005 c0136bc8 60000013 c080b3a4 00000006 00000006 c07ce7b4 00000000
df40:
c07d7ddc c07cef28 eefe0938 eefe093e c0a0b2f0 c0a641c0 c0a641c0 c0a641c0
df60:
c0937834 00000007 000000c6 c094a4a0 00000000 c0900d88 00000006 00000006
df80:
00000000 c09005a8 00000000 c060ecf4 00000000 00000000 00000000 00000000
dfa0:
00000000 c060ecfc 00000000 c0107738 00000000 00000000 00000000 00000000
dfc0:
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0:
00000000 00000000 00000000 00000000 00000013 00000000 ffdeffff ffffffff
[<
c04510ac>] (phy_attached_print) from [<
c045e6b4>] (stmmac_mdio_register+0x12c/0x200)
[<
c045e6b4>] (stmmac_mdio_register) from [<
c045b850>] (stmmac_dvr_probe+0x850/0x96c)
[<
c045b850>] (stmmac_dvr_probe) from [<
c0465180>] (sun7i_gmac_probe+0x120/0x180)
[<
c0465180>] (sun7i_gmac_probe) from [<
c03f7be8>] (platform_drv_probe+0x50/0xac)
[<
c03f7be8>] (platform_drv_probe) from [<
c03f63e0>] (driver_probe_device+0x234/0x2e4)
[<
c03f63e0>] (driver_probe_device) from [<
c03f6534>] (__driver_attach+0xa4/0xa8)
[<
c03f6534>] (__driver_attach) from [<
c03f49ec>] (bus_for_each_dev+0x4c/0x9c)
[<
c03f49ec>] (bus_for_each_dev) from [<
c03f59fc>] (bus_add_driver+0x190/0x214)
[<
c03f59fc>] (bus_add_driver) from [<
c03f6dfc>] (driver_register+0x78/0xf4)
[<
c03f6dfc>] (driver_register) from [<
c010177c>] (do_one_initcall+0x44/0x168)
[<
c010177c>] (do_one_initcall) from [<
c0900d88>] (kernel_init_freeable+0x144/0x1d0)
[<
c0900d88>] (kernel_init_freeable) from [<
c060ecfc>] (kernel_init+0x8/0x110)
[<
c060ecfc>] (kernel_init) from [<
c0107738>] (ret_from_fork+0x14/0x3c)
Code:
e59021c8 e59d401c e590302c e3540000 (
e5922048)
---[ end trace
39ae87c7923562d0 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Tested-By: Priit Laes <plaes@plaes.org>
Fixes:
fbca164776e4 ("net: stmmac: Use the right logging function in stmmac_mdio_register")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Aug 2017 21:39:59 +0000 (14:39 -0700)]
Merge branch 'net-sched-couple-of-chain-fixes'
Jiri Pirko says:
====================
net: sched: couple of chain fixes
Jiri Pirko (2):
net: sched: fix use after free when tcf_chain_destroy is called
multiple times
net: sched: don't do tcf_chain_flush from tcf_chain_destroy
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 22 Aug 2017 20:46:50 +0000 (22:46 +0200)]
net: sched: don't do tcf_chain_flush from tcf_chain_destroy
tcf_chain_flush needs to be called with RTNL. However, on
free_tcf->
tcf_action_goto_chain_fini->
tcf_chain_put->
tcf_chain_destroy->
tcf_chain_flush
callpath, it is called without RTNL.
This issue was notified by following warning:
[ 155.599052] WARNING: suspicious RCU usage
[ 155.603165] 4.13.0-rc5jiri+ #54 Not tainted
[ 155.607456] -----------------------------
[ 155.611561] net/sched/cls_api.c:195 suspicious rcu_dereference_protected() usage!
Since on this callpath, the chain is guaranteed to be already empty
by check in tcf_chain_put, move the tcf_chain_flush call out and call it
only where it is needed - into tcf_block_put.
Fixes:
db50514f9a9c ("net: sched: add termination action to allow goto chain")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 22 Aug 2017 20:46:49 +0000 (22:46 +0200)]
net: sched: fix use after free when tcf_chain_destroy is called multiple times
The goto_chain termination action takes a reference of a chain. In that
case, there is an issue when block_put is called tcf_chain_destroy
directly. The follo-up call of tcf_chain_put by goto_chain action free
works with memory that is already freed. This was caught by kasan:
[ 220.337908] BUG: KASAN: use-after-free in tcf_chain_put+0x1b/0x50
[ 220.344103] Read of size 4 at addr
ffff88036d1f2cec by task systemd-journal/261
[ 220.353047] CPU: 0 PID: 261 Comm: systemd-journal Not tainted 4.13.0-rc5jiri+ #54
[ 220.360661] Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[ 220.371784] Call Trace:
[ 220.374290] <IRQ>
[ 220.376355] dump_stack+0xd5/0x150
[ 220.391485] print_address_description+0x86/0x410
[ 220.396308] kasan_report+0x181/0x4c0
[ 220.415211] tcf_chain_put+0x1b/0x50
[ 220.418949] free_tcf+0x95/0xc0
So allow tcf_chain_destroy to be called multiple times, free only in
case the reference count drops to 0.
Fixes:
5bc1701881e3 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 22 Aug 2017 16:39:28 +0000 (09:39 -0700)]
udp: on peeking bad csum, drop packets even if not at head
When peeking, if a bad csum is discovered, the skb is unlinked from
the queue with __sk_queue_drop_skb and the peek operation restarted.
__sk_queue_drop_skb only drops packets that match the queue head.
This fails if the skb was found after the head, using SO_PEEK_OFF
socket option. This causes an infinite loop.
We MUST drop this problematic skb, and we can simply check if skb was
already removed by another thread, by looking at skb->next :
This pointer is set to NULL by the __skb_unlink() operation, that might
have happened only under the spinlock protection.
Many thanks to syzkaller team (and particularly Dmitry Vyukov who
provided us nice C reproducers exhibiting the lockup) and Willem de
Bruijn who provided first version for this patch and a test program.
Fixes:
627d2d6b5500 ("udp: enable MSG_PEEK at non-zero offset")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Tue, 22 Aug 2017 13:36:08 +0000 (15:36 +0200)]
macsec: add genl family module alias
This helps tools such as wpa_supplicant can start even if the macsec
module isn't loaded yet.
Fixes:
c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Aug 2017 21:25:02 +0000 (14:25 -0700)]
Merge branch 'tipc-topology-server-fixes'
Parthasarathy Bhuvaragan says:
====================
tipc: topology server fixes
The following commits fixes two race conditions causing general
protection faults.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Tue, 22 Aug 2017 10:28:41 +0000 (12:28 +0200)]
tipc: fix a race condition of releasing subscriber object
No matter whether a request is inserted into workqueue as a work item
to cancel a subscription or to delete a subscription's subscriber
asynchronously, the work items may be executed in different workers.
As a result, it doesn't mean that one request which is raised prior to
another request is definitely handled before the latter. By contrast,
if the latter request is executed before the former request, below
error may happen:
[ 656.183644] BUG: spinlock bad magic on CPU#0, kworker/u8:0/12117
[ 656.184487] general protection fault: 0000 [#1] SMP
[ 656.185160] Modules linked in: tipc ip6_udp_tunnel udp_tunnel 9pnet_virtio 9p 9pnet virtio_net virtio_pci virtio_ring virtio [last unloaded: ip6_udp_tunnel]
[ 656.187003] CPU: 0 PID: 12117 Comm: kworker/u8:0 Not tainted 4.11.0-rc7+ #6
[ 656.187920] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 656.188690] Workqueue: tipc_rcv tipc_recv_work [tipc]
[ 656.189371] task:
ffff88003f5cec40 task.stack:
ffffc90004448000
[ 656.190157] RIP: 0010:spin_bug+0xdd/0xf0
[ 656.190678] RSP: 0018:
ffffc9000444bcb8 EFLAGS:
00010202
[ 656.191375] RAX:
0000000000000034 RBX:
ffff88003f8d1388 RCX:
0000000000000000
[ 656.192321] RDX:
ffff88003ba13708 RSI:
ffff88003ba0cd08 RDI:
ffff88003ba0cd08
[ 656.193265] RBP:
ffffc9000444bcd0 R08:
0000000000000030 R09:
000000006b6b6b6b
[ 656.194208] R10:
ffff8800bde3e000 R11:
00000000000001b4 R12:
6b6b6b6b6b6b6b6b
[ 656.195157] R13:
ffffffff81a3ca64 R14:
ffff88003f8d1388 R15:
ffff88003f8d13a0
[ 656.196101] FS:
0000000000000000(0000) GS:
ffff88003ba00000(0000) knlGS:
0000000000000000
[ 656.197172] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 656.197935] CR2:
00007f0b3d2e6000 CR3:
000000003ef9e000 CR4:
00000000000006f0
[ 656.198873] Call Trace:
[ 656.199210] do_raw_spin_lock+0x66/0xa0
[ 656.199735] _raw_spin_lock_bh+0x19/0x20
[ 656.200258] tipc_subscrb_subscrp_delete+0x28/0xf0 [tipc]
[ 656.200990] tipc_subscrb_rcv_cb+0x45/0x260 [tipc]
[ 656.201632] tipc_receive_from_sock+0xaf/0x100 [tipc]
[ 656.202299] tipc_recv_work+0x2b/0x60 [tipc]
[ 656.202872] process_one_work+0x157/0x420
[ 656.203404] worker_thread+0x69/0x4c0
[ 656.203898] kthread+0x138/0x170
[ 656.204328] ? process_one_work+0x420/0x420
[ 656.204889] ? kthread_create_on_node+0x40/0x40
[ 656.205527] ret_from_fork+0x29/0x40
[ 656.206012] Code: 48 8b 0c 25 00 c5 00 00 48 c7 c7 f0 24 a3 81 48 81 c1 f0 05 00 00 65 8b 15 61 ef f5 7e e8 9a 4c 09 00 4d 85 e4 44 8b 4b 08 74 92 <45> 8b 84 24 40 04 00 00 49 8d 8c 24 f0 05 00 00 eb 8d 90 0f 1f
[ 656.208504] RIP: spin_bug+0xdd/0xf0 RSP:
ffffc9000444bcb8
[ 656.209798] ---[ end trace
e2a800e6eb0770be ]---
In above scenario, the request of deleting subscriber was performed
earlier than the request of canceling a subscription although the
latter was issued before the former, which means tipc_subscrb_delete()
was called before tipc_subscrp_cancel(). As a result, when
tipc_subscrb_subscrp_delete() called by tipc_subscrp_cancel() was
executed to cancel a subscription, the subscription's subscriber
refcnt had been decreased to 1. After tipc_subscrp_delete() where
the subscriber was freed because its refcnt was decremented to zero,
but the subscriber's lock had to be released, as a consequence, panic
happened.
By contrast, if we increase subscriber's refcnt before
tipc_subscrb_subscrp_delete() is called in tipc_subscrp_cancel(),
the panic issue can be avoided.
Fixes:
d094c4d5f5c7 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 22 Aug 2017 10:28:40 +0000 (12:28 +0200)]
tipc: remove subscription references only for pending timers
In commit,
139bb36f754a ("tipc: advance the time of deleting
subscription from subscriber->subscrp_list"), we delete the
subscription from the subscribers list and from nametable
unconditionally. This leads to the following bug if the timer
running tipc_subscrp_timeout() in another CPU accesses the
subscription list after the subscription delete request.
[39.570] general protection fault: 0000 [#1] SMP
::
[39.574] task:
ffffffff81c10540 task.stack:
ffffffff81c00000
[39.575] RIP: 0010:tipc_subscrp_timeout+0x32/0x80 [tipc]
[39.576] RSP: 0018:
ffff88003ba03e90 EFLAGS:
00010282
[39.576] RAX:
dead000000000200 RBX:
ffff88003f0f3600 RCX:
0000000000000101
[39.577] RDX:
dead000000000100 RSI:
0000000000000201 RDI:
ffff88003f0d7948
[39.578] RBP:
ffff88003ba03ea0 R08:
0000000000000001 R09:
ffff88003ba03ef8
[39.579] R10:
000000000000014f R11:
0000000000000000 R12:
ffff88003f0d7948
[39.580] R13:
ffff88003f0f3618 R14:
ffffffffa006c250 R15:
ffff88003f0f3600
[39.581] FS:
0000000000000000(0000) GS:
ffff88003ba00000(0000) knlGS:
0000000000000000
[39.582] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[39.583] CR2:
00007f831c6e0714 CR3:
000000003d3b0000 CR4:
00000000000006f0
[39.584] Call Trace:
[39.584] <IRQ>
[39.585] call_timer_fn+0x3d/0x180
[39.585] ? tipc_subscrb_rcv_cb+0x260/0x260 [tipc]
[39.586] run_timer_softirq+0x168/0x1f0
[39.586] ? sched_clock_cpu+0x16/0xc0
[39.587] __do_softirq+0x9b/0x2de
[39.587] irq_exit+0x60/0x70
[39.588] smp_apic_timer_interrupt+0x3d/0x50
[39.588] apic_timer_interrupt+0x86/0x90
[39.589] RIP: 0010:default_idle+0x20/0xf0
[39.589] RSP: 0018:
ffffffff81c03e58 EFLAGS:
00000246 ORIG_RAX:
ffffffffffffff10
[39.590] RAX:
0000000000000000 RBX:
ffffffff81c10540 RCX:
0000000000000000
[39.591] RDX:
0000000000000000 RSI:
0000000000000000 RDI:
0000000000000000
[39.592] RBP:
ffffffff81c03e68 R08:
0000000000000000 R09:
0000000000000000
[39.593] R10:
ffffc90001cbbe00 R11:
0000000000000000 R12:
0000000000000000
[39.594] R13:
ffffffff81c10540 R14:
0000000000000000 R15:
0000000000000000
[39.595] </IRQ>
::
[39.603] RIP: tipc_subscrp_timeout+0x32/0x80 [tipc] RSP:
ffff88003ba03e90
[39.604] ---[ end trace
79ce94b7216cb459 ]---
Fixes:
139bb36f754a ("tipc: advance the time of deleting subscription from subscriber->subscrp_list")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nogah Frankel [Tue, 22 Aug 2017 08:28:11 +0000 (10:28 +0200)]
mlxsw: spectrum_switchdev: Fix mrouter flag update
Update the value of the mrouter flag in struct mlxsw_sp_bridge_port when
it is being changed.
Fixes:
c57529e1d5d8 ("mlxsw: spectrum: Replace vPorts with Port-VLAN")
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Jeffery [Tue, 22 Aug 2017 06:36:22 +0000 (16:06 +0930)]
net: ftgmac100: Fix oops in probe on failure to find associated PHY
netif_napi_del() should be paired with netif_napi_add(), however no
such call takes place in ftgmac100_probe(). This triggers a NULL
pointer dereference if e.g. no PHY is found by the MDIO probe:
[ 2.770000] libphy: Fixed MDIO Bus: probed
[ 2.770000] ftgmac100
1e660000.ethernet: Generated random MAC address 66:58:c0:5a:50:b8
[ 2.790000] libphy: ftgmac100_mdio: probed
[ 2.790000] ftgmac100
1e660000.ethernet (unnamed net_device) (uninitialized): eth%d: no PHY found
[ 2.790000] ftgmac100
1e660000.ethernet: MII Probe failed!
[ 2.810000] Unable to handle kernel NULL pointer dereference at virtual address
00000004
[ 2.810000] pgd =
80004000
[ 2.810000] [
00000004] *pgd=
00000000
[ 2.810000] Internal error: Oops: 805 [#1] ARM
[ 2.810000] CPU: 0 PID: 1 Comm: swapper Not tainted 4.10.17-
1a4df30c39cf5ee0e3d2528c409787ccbb4a672a #1
[ 2.810000] Hardware name: ASpeed SoC
[ 2.810000] task:
9e421b60 task.stack:
9e4a0000
[ 2.810000] PC is at netif_napi_del+0x74/0xa4
[ 2.810000] LR is at ftgmac100_probe+0x290/0x674
[ 2.810000] pc : [<
80331004>] lr : [<
80292b30>] psr:
60000013
[ 2.810000] sp :
9e4a1d70 ip :
9e4a1d88 fp :
9e4a1d84
[ 2.810000] r10:
9e565000 r9 :
ffffffed r8 :
00000007
[ 2.810000] r7 :
9e565480 r6 :
9ec072c0 r5 :
00000000 r4 :
9e5654d8
[ 2.810000] r3 :
9e565530 r2 :
00000000 r1 :
00000000 r0 :
9e5654d8
[ 2.810000] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 2.810000] Control:
00c5387d Table:
80004008 DAC:
00000055
[ 2.810000] Process swapper (pid: 1, stack limit = 0x9e4a0188)
[ 2.810000] Stack: (0x9e4a1d70 to 0x9e4a2000)
[ 2.810000] 1d60:
9e565000 9e549e10 9e4a1dcc 9e4a1d88
[ 2.810000] 1d80:
80292b30 80330f9c ffffffff 9e4a1d98 80146058 9ec072c0 00009e10 00000000
[ 2.810000] 1da0:
9e549e18 9e549e10 ffffffed 805f81f4 fffffdfb 00000000 00000000 00000000
[ 2.810000] 1dc0:
9e4a1dec 9e4a1dd0 80243df8 802928ac 9e549e10 8062cbd8 8062cbe0 805f81f4
[ 2.810000] 1de0:
9e4a1e24 9e4a1df0 80242178 80243da4 803001d0 802ffa60 9e4a1e24 9e549e10
[ 2.810000] 1e00:
9e549e44 805f81f4 00000000 00000000 805b8840 8058a6b0 9e4a1e44 9e4a1e28
[ 2.810000] 1e20:
80242434 80241f04 00000000 805f81f4 80242344 00000000 9e4a1e6c 9e4a1e48
[ 2.810000] 1e40:
80240148 80242350 9e425bac 9e4fdc90 9e790e94 805f81f4 9e790e60 805f5640
[ 2.810000] 1e60:
9e4a1e7c 9e4a1e70 802425dc 802400d8 9e4a1ea4 9e4a1e80 80240ba8 802425c0
[ 2.810000] 1e80:
8050b6ac 9e4a1e90 805f81f4 ffffe000 805b8838 80616720 9e4a1ebc 9e4a1ea8
[ 2.810000] 1ea0:
80243068 80240a68 805ab24c ffffe000 9e4a1ecc 9e4a1ec0 80244a38 80242fec
[ 2.810000] 1ec0:
9e4a1edc 9e4a1ed0 805ab264 80244a04 9e4a1f4c 9e4a1ee0 8058ae70 805ab258
[ 2.810000] 1ee0:
80032c68 801e3fd8 8052f800 8041af2c 9e4a1f4c 9e4a1f00 80032f90 8058a6bc
[ 2.810000] 1f00:
9e4a1f2c 9e4a1f10 00000006 00000006 00000000 8052f220 805112f0 00000000
[ 2.810000] 1f20:
9e4a1f4c 00000006 80616720 805cf400 80616720 805b8838 80616720 00000057
[ 2.810000] 1f40:
9e4a1f94 9e4a1f50 8058b040 8058add0 00000006 00000006 00000000 8058a6b0
[ 2.810000] 1f60:
3940bf3d 00000007 f115c2e8 00000000 803fd158 00000000 00000000 00000000
[ 2.810000] 1f80:
00000000 00000000 9e4a1fac 9e4a1f98 803fd170 8058af38 00000000 803fd158
[ 2.810000] 1fa0:
00000000 9e4a1fb0 8000a5e8 803fd164 00000000 00000000 00000000 00000000
[ 2.810000] 1fc0:
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2.810000] 1fe0:
00000000 00000000 00000000 00000000 00000013 00000000 d11dcae8 af8ddec5
[ 2.810000] [<
80331004>] (netif_napi_del) from [<
80292b30>] (ftgmac100_probe+0x290/0x674)
[ 2.810000] [<
80292b30>] (ftgmac100_probe) from [<
80243df8>] (platform_drv_probe+0x60/0xc0)
[ 2.810000] [<
80243df8>] (platform_drv_probe) from [<
80242178>] (driver_probe_device+0x280/0x44c)
[ 2.810000] [<
80242178>] (driver_probe_device) from [<
80242434>] (__driver_attach+0xf0/0x104)
[ 2.810000] [<
80242434>] (__driver_attach) from [<
80240148>] (bus_for_each_dev+0x7c/0xb0)
[ 2.810000] [<
80240148>] (bus_for_each_dev) from [<
802425dc>] (driver_attach+0x28/0x30)
[ 2.810000] [<
802425dc>] (driver_attach) from [<
80240ba8>] (bus_add_driver+0x14c/0x268)
[ 2.810000] [<
80240ba8>] (bus_add_driver) from [<
80243068>] (driver_register+0x88/0x104)
[ 2.810000] [<
80243068>] (driver_register) from [<
80244a38>] (__platform_driver_register+0x40/0x54)
[ 2.810000] [<
80244a38>] (__platform_driver_register) from [<
805ab264>] (ftgmac100_driver_init+0x18/0x20)
[ 2.810000] [<
805ab264>] (ftgmac100_driver_init) from [<
8058ae70>] (do_one_initcall+0xac/0x168)
[ 2.810000] [<
8058ae70>] (do_one_initcall) from [<
8058b040>] (kernel_init_freeable+0x114/0x1cc)
[ 2.810000] [<
8058b040>] (kernel_init_freeable) from [<
803fd170>] (kernel_init+0x18/0x104)
[ 2.810000] [<
803fd170>] (kernel_init) from [<
8000a5e8>] (ret_from_fork+0x14/0x2c)
[ 2.810000] Code:
e594205c e5941058 e2843058 e3a05000 (
e5812004)
[ 3.210000] ---[ end trace
f32811052fd3860c ]---
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 21 Aug 2017 19:59:10 +0000 (12:59 -0700)]
net/hsr: Check skb_put_padto() return value
skb_put_padto() will free the sk_buff passed as reference in case of
errors, but we still need to check its return value and decide what to
do.
Detected by CoverityScan, CID#1416688 ("CHECKED_RETURN")
Fixes:
ee1c27977284 ("net/hsr: Added support for HSR v1")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Mon, 21 Aug 2017 16:47:10 +0000 (09:47 -0700)]
ipv6: add rcu grace period before freeing fib6_node
We currently keep rt->rt6i_node pointing to the fib6_node for the route.
And some functions make use of this pointer to dereference the fib6_node
from rt structure, e.g. rt6_check(). However, as there is neither
refcount nor rcu taken when dereferencing rt->rt6i_node, it could
potentially cause crashes as rt->rt6i_node could be set to NULL by other
CPUs when doing a route deletion.
This patch introduces an rcu grace period before freeing fib6_node and
makes sure the functions that dereference it takes rcu_read_lock().
Note: there is no "Fixes" tag because this bug was there in a very
early stage.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Aug 2017 17:27:26 +0000 (10:27 -0700)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2017-08-21
1) Fix memleaks when ESP takes an error path.
2) Fix null pointer dereference when creating a sub policy
that matches the same outer flow as main policy does.
From Koichiro Den.
3) Fix possible out-of-bound access in xfrm_migrate.
This patch should go to the stable trees too.
From Vladis Dronov.
4) ESP can return positive and negative error values,
so treat both cases as an error.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Fri, 18 Aug 2017 12:40:53 +0000 (14:40 +0200)]
ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt()
A packet length of exactly IPV6_MAXPLEN is allowed, we should
refuse parsing options only if the size is 64KiB or more.
While at it, remove one extra variable and one assignment which
were also introduced by the commit that introduced the size
check. Checking the sum 'offset + len' and only later adding
'len' to 'offset' doesn't provide any advantage over directly
summing to 'offset' and checking it.
Fixes:
6399f1fae4ec ("ipv6: avoid overflow of offset in ip6_find_1stfragopt")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 21 Aug 2017 21:07:48 +0000 (14:07 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
"Just a couple small fixes, two of which have to do with gcc-7:
1) Don't clobber kernel fixed registers in __multi4 libgcc helper.
2) Fix a new uninitialized variable warning on sparc32 with gcc-7,
from Thomas Petazzoni.
3) Adjust pmd_t initializer on sparc32 to make gcc happy.
4) If ATU isn't available, don't bark in the logs. From Tushar Dave"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: kernel/pcic: silence gcc 7.x warning in pcibios_fixup_bus()
sparc64: remove unnecessary log message
sparc64: Don't clibber fixed registers in __multi4.
mm: add pmd_t initializer __pmd() to work around a GCC bug.
Thomas Petazzoni [Sun, 13 Aug 2017 21:14:58 +0000 (23:14 +0200)]
sparc: kernel/pcic: silence gcc 7.x warning in pcibios_fixup_bus()
When building the kernel for Sparc using gcc 7.x, the build fails
with:
arch/sparc/kernel/pcic.c: In function ‘pcibios_fixup_bus’:
arch/sparc/kernel/pcic.c:647:8: error: ‘cmd’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
cmd |= PCI_COMMAND_IO;
^~
The simplified code looks like this:
unsigned int cmd;
[...]
pcic_read_config(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd);
[...]
cmd |= PCI_COMMAND_IO;
I.e, the code assumes that pcic_read_config() will always initialize
cmd. But it's not the case. Looking at pcic_read_config(), if
bus->number is != 0 or if the size is not one of 1, 2 or 4, *val will
not be initialized.
As a simple fix, we initialize cmd to zero at the beginning of
pcibios_fixup_bus.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 21 Aug 2017 20:30:36 +0000 (13:30 -0700)]
Merge tag 'arc-4.13-rc7-fixes' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- PAE40 related updates
- SLC errata for region ops
- intc line masking by default
* tag 'arc-4.13-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
arc: Mask individual IRQ lines during core INTC init
ARCv2: PAE40: set MSB even if !CONFIG_ARC_HAS_PAE40 but PAE exists in SoC
ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses
ARC: dma: implement dma_unmap_page and sg variant
ARCv2: SLC: Make sure busy bit is set properly for region ops
ARC: [plat-sim] Include this platform unconditionally
ARC: [plat-axs10x]: prepare dts files for enabling PAE40 on axs103
ARC: defconfig: Cleanup from old Kconfig options
Linus Torvalds [Mon, 21 Aug 2017 20:27:51 +0000 (13:27 -0700)]
Merge tag 'rtc-4.13-fixes' of git://git./linux/kernel/git/abelloni/linux
Pull RTC fix from Alexandre Belloni:
"Fix regmap configuration for ds1307"
* tag 'rtc-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: ds1307: fix regmap config
Linus Torvalds [Mon, 21 Aug 2017 20:16:27 +0000 (13:16 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix IGMP handling wrt VRF, from David Ahern.
2) Fix timer access to freed object in dccp, from Eric Dumazet.
3) Use kmalloc_array() in ptr_ring to avoid overflow cases which are
triggerable by userspace. Also from Eric Dumazet.
4) Fix infinite loop in unmapping cleanup of nfp driver, from Colin Ian
King.
5) Correct datagram peek handling of empty SKBs, from Matthew Dawson.
6) Fix use after free in TIPC, from Eric Dumazet.
7) When replacing a route in ipv6 we need to reset the round robin
pointer, from Wei Wang.
8) Fix bug in pci_find_pcie_root_port() which was unearthed by the
relaxed ordering changes, from Thierry Redding. I made sure to get
an explicit ACK from Bjorn this time around :-)
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
ipv6: repair fib6 tree in failure case
net_sched: fix order of queue length updates in qdisc_replace()
tools lib bpf: improve warning
switchdev: documentation: minor typo fixes
bpf, doc: also add s390x as arch to sysctl description
net: sched: fix NULL pointer dereference when action calls some targets
rxrpc: Fix oops when discarding a preallocated service call
irda: do not leak initialized list.dev to userspace
net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
PCI: Allow PCI express root ports to find themselves
tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set
ipv6: reset fn->rr_ptr when replacing route
sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
tipc: fix use-after-free
tun: handle register_netdevice() failures properly
datagram: When peeking datagrams with offset < 0 don't skip empty skbs
bpf, doc: improve sysctl knob description
netxen: fix incorrect loop counter decrement
nfp: fix infinite loop on umapping cleanup
...
Oleg Nesterov [Mon, 21 Aug 2017 15:35:02 +0000 (17:35 +0200)]
pids: make task_tgid_nr_ns() safe
This was reported many times, and this was even mentioned in commit
52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.
We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups. Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.
Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiner Kallweit [Tue, 25 Jul 2017 16:44:04 +0000 (18:44 +0200)]
rtc: ds1307: fix regmap config
Current max_register setting breaks reading nvram on certain chips and
also reading the standard registers on RX8130 where register map starts
at 0x10.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Fixes:
11e5890b5342 "rtc: ds1307: convert driver to regmap"
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Wei Wang [Sat, 19 Aug 2017 00:14:49 +0000 (17:14 -0700)]
ipv6: repair fib6 tree in failure case
In fib6_add(), it is possible that fib6_add_1() picks an intermediate
node and sets the node's fn->leaf to NULL in order to add this new
route. However, if fib6_add_rt2node() fails to add the new
route for some reason, fn->leaf will be left as NULL and could
potentially cause crash when fn->leaf is accessed in fib6_locate().
This patch makes sure fib6_repair_tree() is called to properly repair
fn->leaf in the above failure case.
Here is the syzkaller reported general protection fault in fib6_locate:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task:
ffff8801d7d64100 ti:
ffff8801d01a0000 task.ti:
ffff8801d01a0000
RIP: 0010:[<
ffffffff82a3e0e1>] [<
ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline]
RIP: 0010:[<
ffffffff82a3e0e1>] [<
ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline]
RIP: 0010:[<
ffffffff82a3e0e1>] [<
ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline]
RIP: 0010:[<
ffffffff82a3e0e1>] [<
ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233
RSP: 0018:
ffff8801d01a36a8 EFLAGS:
00010202
RAX:
0000000000000020 RBX:
ffff8801bc790e00 RCX:
ffffc90002983000
RDX:
0000000000001219 RSI:
ffff8801d01a37a0 RDI:
0000000000000100
RBP:
ffff8801d01a36f0 R08:
00000000000000ff R09:
0000000000000000
R10:
0000000000000003 R11:
0000000000000000 R12:
0000000000000001
R13:
dffffc0000000000 R14:
ffff8801d01a37a0 R15:
0000000000000000
FS:
00007f6afd68c700(0000) GS:
ffff8801db400000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000004c6340 CR3:
00000000ba41f000 CR4:
00000000001426f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Stack:
ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0
ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0
ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988
Call Trace:
[<
ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109
[<
ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075
[<
ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450
[<
ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281
[<
ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456
[<
ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline]
[<
ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232
[<
ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778
[<
ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline]
[<
ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619
[<
ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834
[<
ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478
[<
ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491
[<
ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538
[<
ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline]
[<
ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577
[<
ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17
Note: there is no "Fixes" tag as this seems to be a bug introduced
very early.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Sat, 19 Aug 2017 12:37:07 +0000 (15:37 +0300)]
net_sched: fix order of queue length updates in qdisc_replace()
This important to call qdisc_tree_reduce_backlog() after changing queue
length. Parent qdisc should deactivate class in ->qlen_notify() called from
qdisc_tree_reduce_backlog() but this happens only if qdisc->q.qlen in zero.
Missed class deactivations leads to crashes/warnings at picking packets
from empty qdisc and corrupting state at reactivating this class in future.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes:
86a7996cc8a0 ("net_sched: introduce qdisc_replace() helper")
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Leblond [Sun, 20 Aug 2017 19:48:14 +0000 (21:48 +0200)]
tools lib bpf: improve warning
Signed-off-by: Eric Leblond <eric@regit.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Sun, 20 Aug 2017 20:52:54 +0000 (08:52 +1200)]
switchdev: documentation: minor typo fixes
Two typos in switchdev.txt
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 20 Aug 2017 22:26:03 +0000 (00:26 +0200)]
bpf, doc: also add s390x as arch to sysctl description
Looks like this was accidentally missed, so still add s390x
as supported eBPF JIT arch to bpf_jit_enable.
Fixes:
014cd0a368dc ("bpf: Update sysctl documentation to list all supported architectures")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 20 Aug 2017 21:13:52 +0000 (14:13 -0700)]
Linux 4.13-rc6
Linus Torvalds [Sun, 20 Aug 2017 20:26:27 +0000 (13:26 -0700)]
Sanitize 'move_pages()' permission checks
The 'move_paghes()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.
So change the access checks to the more common 'ptrace_may_access()'
model instead.
This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we hav ebetter
NUMA placement models these days), so I expect nobody to notice.
Famous last words.
Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Aug 2017 16:36:52 +0000 (09:36 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Another pile of small fixes and updates for x86:
- Plug a hole in the SMAP implementation which misses to clear AC on
NMI entry
- Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line
parameter works correctly again
- Use the proper accessor in the startup64 code for next_early_pgt to
prevent accessing of invalid addresses and faulting in the early
boot code.
- Prevent CPU hotplug lock recursion in the MTRR code
- Unbreak CPU0 hotplugging
- Rename overly long CPUID bits which got introduced in this cycle
- Two commits which mark data 'const' and restrict the scope of data
and functions to file scope by making them 'static'"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Constify attribute_group structures
x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'
x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
x86: Fix norandmaps/ADDR_NO_RANDOMIZE
x86/mtrr: Prevent CPU hotplug lock recursion
x86: Mark various structures and functions as 'static'
x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
x86/smpboot: Unbreak CPU0 hotplug
x86/asm/64: Clear AC on NMI entries
Linus Torvalds [Sun, 20 Aug 2017 16:34:24 +0000 (09:34 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A few small fixes for timer drivers:
- Prevent infinite recursion in the arm architected timer driver with
ftrace
- Propagate error codes to the caller in case of failure in EM STI
driver
- Adjust a bogus loop iteration in the arm architected timer driver
- Add a missing Kconfig dependency to the pistachio clocksource to
prevent build failures
- Correctly check for IS_ERR() instead of NULL in the shared timer-of
code"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/arm_arch_timer: Avoid infinite recursion when ftrace is enabled
clocksource/drivers/Kconfig: Fix CLKSRC_PISTACHIO dependencies
clocksource/drivers/timer-of: Checking for IS_ERR() instead of NULL
clocksource/drivers/em_sti: Fix error return codes in em_sti_probe()
clocksource/drivers/arm_arch_timer: Fix mem frame loop initialization
Linus Torvalds [Sun, 20 Aug 2017 16:20:57 +0000 (09:20 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Two fixes for the perf subsystem:
- Fix an inconsistency of RDPMC mm struct tagging across exec() which
causes RDPMC to fault.
- Correct the timestamp mechanics across IOC_DISABLE/ENABLE which
causes incorrect timestamps and total time calculations"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix time on IOC_ENABLE
perf/x86: Fix RDPMC vs. mm_struct tracking
Linus Torvalds [Sun, 20 Aug 2017 16:07:56 +0000 (09:07 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A pile of smallish changes all over the place:
- Add a missing ISB in the GIC V1 driver
- Remove an ACPI version check in the GIC V3 ITS driver
- Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid
spurious wakeups
- Remove the artifical limitation of ITS instances to the number of
NUMA nodes which prevents utilizing the ITS hardware correctly
- Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code
- Honour the force affinity argument in the GIC-V3 driver which is
required to make perf work correctly
- Correctly report allocation failures in GIC-V2/V3 to avoid using
half allocated and initialized interrupts.
- Fixup checks against nr_cpu_ids in the generic IPI code"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/ipi: Fixup checks against nr_cpu_ids
genirq: Restore trigger settings in irq_modify_status()
MAINTAINERS: Remove Jason Cooper's irqchip git tree
irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop
irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES
irqchip: brcmstb-l2: Define an irq_pm_shutdown function
irqchip/gic: Ensure we have an ISB between ack and ->handle_irq
irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA
irqchip/gic-v3: Honor forced affinity setting
irqchip/gic-v3: Report failures in gic_irq_domain_alloc
irqchip/gic-v2: Report failures in gic_irq_domain_alloc
irqchip/atmel-aic: Remove root argument from ->fixup() prototype
irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup()
irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
Linus Torvalds [Sun, 20 Aug 2017 15:54:30 +0000 (08:54 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner:
"A fix for the hardlockup watchdog to prevent false positives with
extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
the hrtimer which is used to verify.
Slightly larger than the minimal fix, which just would increase the
hrtimer frequency, but comes with extra overhead of more watchdog
timer interrupts and thread wakeups for all users.
With this change we restrict the overhead to the extreme Turbo-Mode
systems"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/watchdog: Prevent false positives with turbo modes
Alexey Dobriyan [Sat, 19 Aug 2017 09:57:51 +0000 (12:57 +0300)]
genirq/ipi: Fixup checks against nr_cpu_ids
Valid CPU ids are [0, nr_cpu_ids-1] inclusive.
Fixes:
3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()")
Fixes:
f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
Xin Long [Fri, 18 Aug 2017 03:01:36 +0000 (11:01 +0800)]
net: sched: fix NULL pointer dereference when action calls some targets
As we know in some target's checkentry it may dereference par.entryinfo
to check entry stuff inside. But when sched action calls xt_check_target,
par.entryinfo is set with NULL. It would cause kernel panic when calling
some targets.
It can be reproduce with:
# tc qd add dev eth1 ingress handle ffff:
# tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
-j ECN --ecn-tcp-remove
It could also crash kernel when using target CLUSTERIP or TPROXY.
By now there's no proper value for par.entryinfo in ipt_init_target,
but it can not be set with NULL. This patch is to void all these
panics by setting it with an ipt_entry obj with all members = 0.
Note that this issue has been there since the very beginning.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 17 Aug 2017 23:19:42 +0000 (00:19 +0100)]
rxrpc: Fix oops when discarding a preallocated service call
rxrpc_service_prealloc_one() doesn't set the socket pointer on any new call
it preallocates, but does add it to the rxrpc net namespace call list.
This, however, causes rxrpc_put_call() to oops when the call is discarded
when the socket is closed. rxrpc_put_call() needs the socket to be able to
reach the namespace so that it can use a lock held therein.
Fix this by setting a call's socket pointer immediately before discarding
it.
This can be triggered by unloading the kafs module, resulting in an oops
like the following:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000030
IP: rxrpc_put_call+0x1e2/0x32d
PGD 0
P4D 0
Oops: 0000 [#1] SMP
Modules linked in: kafs(E-)
CPU: 3 PID: 3037 Comm: rmmod Tainted: G E 4.12.0-fscache+ #213
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
task:
ffff8803fc92e2c0 task.stack:
ffff8803fef74000
RIP: 0010:rxrpc_put_call+0x1e2/0x32d
RSP: 0018:
ffff8803fef77e08 EFLAGS:
00010282
RAX:
0000000000000000 RBX:
ffff8803fab99ac0 RCX:
000000000000000f
RDX:
ffffffff81c50a40 RSI:
000000000000000c RDI:
ffff8803fc92ea88
RBP:
ffff8803fef77e30 R08:
ffff8803fc87b941 R09:
ffffffff82946d20
R10:
ffff8803fef77d10 R11:
00000000000076fc R12:
0000000000000005
R13:
ffff8803fab99c20 R14:
0000000000000001 R15:
ffffffff816c6aee
FS:
00007f915a059700(0000) GS:
ffff88041fb80000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000030 CR3:
00000003fef39000 CR4:
00000000001406e0
Call Trace:
rxrpc_discard_prealloc+0x325/0x341
rxrpc_listen+0xf9/0x146
kernel_listen+0xb/0xd
afs_close_socket+0x3e/0x173 [kafs]
afs_exit+0x1f/0x57 [kafs]
SyS_delete_module+0x10f/0x19a
do_syscall_64+0x8a/0x149
entry_SYSCALL64_slow_path+0x25/0x25
Fixes:
2baec2c3f854 ("rxrpc: Support network namespacing")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 17 Aug 2017 22:14:58 +0000 (23:14 +0100)]
irda: do not leak initialized list.dev to userspace
list.dev has not been initialized and so the copy_to_user is copying
data from the stack back to user space which is a potential
information leak. Fix this ensuring all of list is initialized to
zero.
Detected by CoverityScan, CID#1357894 ("Uninitialized scalar variable")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huy Nguyen [Thu, 17 Aug 2017 15:29:52 +0000 (18:29 +0300)]
net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
enable_4k_uar module parameter was added in patch cited below to
address the backward compatibility issue in SRIOV when the VM has
system's PAGE_SIZE uar implementation and the Hypervisor has 4k uar
implementation.
The above compatibility issue does not exist in the non SRIOV case.
In this patch, we always enable 4k uar implementation if SRIOV
is not enabled on mlx4's supported cards.
Fixes:
76e39ccf9c36 ("net/mlx4_core: Fix backward compatibility on VFs")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thierry Reding [Thu, 17 Aug 2017 11:06:14 +0000 (13:06 +0200)]
PCI: Allow PCI express root ports to find themselves
If the pci_find_pcie_root_port() function is called on a root port
itself, return the root port rather than NULL.
This effectively reverts commit
0e405232871d6 ("PCI: fix oops when
try to find Root Port for a PCI device") which added an extra check
that would now be redundant.
Fixes:
a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported")
Fixes:
c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 Completion erratum")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Shawn Lin <shawn.lin@rock-chips.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Wed, 16 Aug 2017 21:53:36 +0000 (17:53 -0400)]
tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
In some situations tcp_send_loss_probe() can realize that it's unable
to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
realizes that the RTO was eligible to fire immediately or at some
point in the past (delta_us <= 0). Previously in such cases
tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
icsk_rto, which caused needless delays of hundreds of milliseconds
(and non-linear behavior that made reproducible testing
difficult). This commit changes the logic to schedule "overdue" RTOs
ASAP, rather than at now + icsk_rto.
Fixes:
6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 18 Aug 2017 23:06:33 +0000 (16:06 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM
mm/mempolicy: fix use after free when calling get_mempolicy
mm/cma_debug.c: fix stack corruption due to sprintf usage
signal: don't remove SIGNAL_UNKILLABLE for traced tasks.
mm, oom: fix potential data corruption when oom_reaper races with writer
mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
slub: fix per memcg cache leak on css offline
mm: discard memblock data later
test_kmod: fix description for -s -and -c parameters
kmod: fix wait on recursive loop
wait: add wait_event_killable_timeout()
kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog
mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Roopa Prabhu [Wed, 16 Aug 2017 19:38:52 +0000 (12:38 -0700)]
net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set
Syzkaller hit 'general protection fault in fib_dump_info' bug on
commit 4.13-rc5..
Guilty file: net/ipv4/fib_semantics.c
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task:
ffff880078562700 task.stack:
ffff880078110000
RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
RSP: 0018:
ffff880078117010 EFLAGS:
00010206
RAX:
dffffc0000000000 RBX:
00000000000000fe RCX:
0000000000000002
RDX:
0000000000000006 RSI:
ffff880078117084 RDI:
0000000000000030
RBP:
ffff880078117268 R08:
000000000000000c R09:
ffff8800780d80c8
R10:
0000000058d629b4 R11:
0000000067fce681 R12:
0000000000000000
R13:
ffff8800784bd540 R14:
ffff8800780d80b5 R15:
ffff8800780d80a4
FS:
00000000022fa940(0000) GS:
ffff88007fc00000(0000)
knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000004387d0 CR3:
0000000079135000 CR4:
00000000000006f0
Call Trace:
inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
__sys_sendmsg+0xd1/0x170 net/socket.c:2069
SYSC_sendmsg net/socket.c:2080 [inline]
SyS_sendmsg+0x2d/0x50 net/socket.c:2076
entry_SYSCALL_64_fastpath+0x1a/0xa5
RIP: 0033:0x4512e9
RSP: 002b:
00007ffc75584cc8 EFLAGS:
00000216 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
0000000000000002 RCX:
00000000004512e9
RDX:
0000000000000000 RSI:
0000000020f2cfc8 RDI:
0000000000000003
RBP:
000000000000000e R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000216 R12:
fffffffffffffffe
R13:
0000000000718000 R14:
0000000020c44ff0 R15:
0000000000000000
Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03
<0f>
b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
ffff880078117010
---[ end trace
254a7af28348f88b ]---
This patch adds a res->fi NULL check.
example run:
$ip route get 0.0.0.0 iif virt1-0
broadcast 0.0.0.0 dev lo
cache <local,brd> iif virt1-0
$ip route get 0.0.0.0 iif virt1-0 fibmatch
RTNETLINK answers: No route to host
Reported-by: idaifish <idaifish@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes:
b61798130f1b ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Wed, 16 Aug 2017 18:18:09 +0000 (11:18 -0700)]
ipv6: reset fn->rr_ptr when replacing route
syzcaller reported the following use-after-free issue in rt6_select():
BUG: KASAN: use-after-free in rt6_select net/ipv6/route.c:755 [inline] at addr
ffff8800bc6994e8
BUG: KASAN: use-after-free in ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084 at addr
ffff8800bc6994e8
Read of size 4 by task syz-executor1/439628
CPU: 0 PID: 439628 Comm: syz-executor1 Not tainted 4.3.5+ #8
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
0000000000000000 ffff88018fe435b0 ffffffff81ca384d ffff8801d3588c00
ffff8800bc699380 ffff8800bc699500 dffffc0000000000 ffff8801d40a47c0
ffff88018fe435d8 ffffffff81735751 ffff88018fe43660 ffff8800bc699380
Call Trace:
[<
ffffffff81ca384d>] __dump_stack lib/dump_stack.c:15 [inline]
[<
ffffffff81ca384d>] dump_stack+0xc1/0x124 lib/dump_stack.c:51
sctp: [Deprecated]: syz-executor0 (pid 439615) Use of struct sctp_assoc_value in delayed_ack socket option.
Use struct sctp_sack_info instead
[<
ffffffff81735751>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
[<
ffffffff817359c4>] print_address_description mm/kasan/report.c:196 [inline]
[<
ffffffff817359c4>] kasan_report_error+0x1b4/0x4a0 mm/kasan/report.c:285
[<
ffffffff81735d93>] kasan_report mm/kasan/report.c:305 [inline]
[<
ffffffff81735d93>] __asan_report_load4_noabort+0x43/0x50 mm/kasan/report.c:325
[<
ffffffff82a28e39>] rt6_select net/ipv6/route.c:755 [inline]
[<
ffffffff82a28e39>] ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084
[<
ffffffff82a28fb1>] ip6_pol_route_output+0x81/0xb0 net/ipv6/route.c:1203
[<
ffffffff82ab0a50>] fib6_rule_action+0x1f0/0x680 net/ipv6/fib6_rules.c:95
[<
ffffffff8265cbb6>] fib_rules_lookup+0x2a6/0x7a0 net/core/fib_rules.c:223
[<
ffffffff82ab1430>] fib6_rule_lookup+0xd0/0x250 net/ipv6/fib6_rules.c:41
[<
ffffffff82a22006>] ip6_route_output+0x1d6/0x2c0 net/ipv6/route.c:1224
[<
ffffffff829e83d2>] ip6_dst_lookup_tail+0x4d2/0x890 net/ipv6/ip6_output.c:943
[<
ffffffff829e889a>] ip6_dst_lookup_flow+0x9a/0x250 net/ipv6/ip6_output.c:1079
[<
ffffffff82a9f7d8>] ip6_datagram_dst_update+0x538/0xd40 net/ipv6/datagram.c:91
[<
ffffffff82aa0978>] __ip6_datagram_connect net/ipv6/datagram.c:251 [inline]
[<
ffffffff82aa0978>] ip6_datagram_connect+0x518/0xe50 net/ipv6/datagram.c:272
[<
ffffffff82aa1313>] ip6_datagram_connect_v6_only+0x63/0x90 net/ipv6/datagram.c:284
[<
ffffffff8292f790>] inet_dgram_connect+0x170/0x1f0 net/ipv4/af_inet.c:564
[<
ffffffff82565547>] SYSC_connect+0x1a7/0x2f0 net/socket.c:1582
[<
ffffffff8256a649>] SyS_connect+0x29/0x30 net/socket.c:1563
[<
ffffffff82c72032>] entry_SYSCALL_64_fastpath+0x12/0x17
Object at
ffff8800bc699380, in cache ip6_dst_cache size: 384
The root cause of it is that in fib6_add_rt2node(), when it replaces an
existing route with the new one, it does not update fn->rr_ptr.
This commit resets fn->rr_ptr to NULL when it points to a route which is
replaced in fib6_add_rt2node().
Fixes:
27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Potapenko [Wed, 16 Aug 2017 18:16:40 +0000 (20:16 +0200)]
sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
Make sure all fields of an IPv6 address are initialized, which
guarantees that the IPv4 fields are also initialized.
==================================================================
BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
net/sctp/ipv6.c:517
CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
dump_stack+0x172/0x1c0 lib/dump_stack.c:42
is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
__msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg net/socket.c:643 [inline]
SYSC_sendto+0x608/0x710 net/socket.c:1696
SyS_sendto+0x8a/0xb0 net/socket.c:1664
entry_SYSCALL_64_fastpath+0x13/0x94
RIP: 0033:0x44b479
RSP: 002b:
00007f6213f21c08 EFLAGS:
00000286 ORIG_RAX:
000000000000002c
RAX:
ffffffffffffffda RBX:
0000000020000000 RCX:
000000000044b479
RDX:
0000000000000041 RSI:
0000000020edd000 RDI:
0000000000000006
RBP:
00000000007080a8 R08:
0000000020b85fe4 R09:
000000000000001c
R10:
0000000000040005 R11:
0000000000000286 R12:
00000000ffffffff
R13:
0000000000003760 R14:
00000000006e5820 R15:
0000000000ff8000
origin description: ----dst_saddr@sctp_v6_get_dst
local variable created at:
sk_fullsock include/net/sock.h:2321 [inline]
inet6_sk include/linux/ipv6.h:309 [inline]
sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==================================================================
BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
net/sctp/ipv6.c:517
CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
dump_stack+0x172/0x1c0 lib/dump_stack.c:42
is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
__msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg net/socket.c:643 [inline]
SYSC_sendto+0x608/0x710 net/socket.c:1696
SyS_sendto+0x8a/0xb0 net/socket.c:1664
entry_SYSCALL_64_fastpath+0x13/0x94
RIP: 0033:0x44b479
RSP: 002b:
00007f6213f21c08 EFLAGS:
00000286 ORIG_RAX:
000000000000002c
RAX:
ffffffffffffffda RBX:
0000000020000000 RCX:
000000000044b479
RDX:
0000000000000041 RSI:
0000000020edd000 RDI:
0000000000000006
RBP:
00000000007080a8 R08:
0000000020b85fe4 R09:
000000000000001c
R10:
0000000000040005 R11:
0000000000000286 R12:
00000000ffffffff
R13:
0000000000003760 R14:
00000000006e5820 R15:
0000000000ff8000
origin description: ----dst_saddr@sctp_v6_get_dst
local variable created at:
sk_fullsock include/net/sock.h:2321 [inline]
inet6_sk include/linux/ipv6.h:309 [inline]
sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==================================================================
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Aug 2017 16:41:54 +0000 (09:41 -0700)]
tipc: fix use-after-free
syszkaller reported use-after-free in tipc [1]
When msg->rep skb is freed, set the pointer to NULL,
so that caller does not free it again.
[1]
==================================================================
BUG: KASAN: use-after-free in skb_push+0xd4/0xe0 net/core/skbuff.c:1466
Read of size 8 at addr
ffff8801c6e71e90 by task syz-executor5/4115
CPU: 1 PID: 4115 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x24e/0x340 mm/kasan/report.c:409
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
skb_push+0xd4/0xe0 net/core/skbuff.c:1466
tipc_nl_compat_recv+0x833/0x18f0 net/tipc/netlink_compat.c:1209
genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
sock_write_iter+0x31a/0x5d0 net/socket.c:898
call_write_iter include/linux/fs.h:1743 [inline]
new_sync_write fs/read_write.c:457 [inline]
__vfs_write+0x684/0x970 fs/read_write.c:470
vfs_write+0x189/0x510 fs/read_write.c:518
SYSC_write fs/read_write.c:565 [inline]
SyS_write+0xef/0x220 fs/read_write.c:557
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512e9
RSP: 002b:
00007f3bc8184c08 EFLAGS:
00000216 ORIG_RAX:
0000000000000001
RAX:
ffffffffffffffda RBX:
0000000000718000 RCX:
00000000004512e9
RDX:
0000000000000020 RSI:
0000000020fdb000 RDI:
0000000000000006
RBP:
0000000000000086 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000216 R12:
00000000004b5e76
R13:
00007f3bc8184b48 R14:
00000000004b5e86 R15:
0000000000000000
Allocated by task 4115:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
kmem_cache_alloc_node+0x13d/0x750 mm/slab.c:3651
__alloc_skb+0xf1/0x740 net/core/skbuff.c:219
alloc_skb include/linux/skbuff.h:903 [inline]
tipc_tlv_alloc+0x26/0xb0 net/tipc/netlink_compat.c:148
tipc_nl_compat_dumpit+0xf2/0x3c0 net/tipc/netlink_compat.c:248
tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
sock_write_iter+0x31a/0x5d0 net/socket.c:898
call_write_iter include/linux/fs.h:1743 [inline]
new_sync_write fs/read_write.c:457 [inline]
__vfs_write+0x684/0x970 fs/read_write.c:470
vfs_write+0x189/0x510 fs/read_write.c:518
SYSC_write fs/read_write.c:565 [inline]
SyS_write+0xef/0x220 fs/read_write.c:557
entry_SYSCALL_64_fastpath+0x1f/0xbe
Freed by task 4115:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3503 [inline]
kmem_cache_free+0x77/0x280 mm/slab.c:3763
kfree_skbmem+0x1a1/0x1d0 net/core/skbuff.c:622
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb+0x165/0x4c0 net/core/skbuff.c:699
tipc_nl_compat_dumpit+0x36a/0x3c0 net/tipc/netlink_compat.c:260
tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
sock_write_iter+0x31a/0x5d0 net/socket.c:898
call_write_iter include/linux/fs.h:1743 [inline]
new_sync_write fs/read_write.c:457 [inline]
__vfs_write+0x684/0x970 fs/read_write.c:470
vfs_write+0x189/0x510 fs/read_write.c:518
SYSC_write fs/read_write.c:565 [inline]
SyS_write+0xef/0x220 fs/read_write.c:557
entry_SYSCALL_64_fastpath+0x1f/0xbe
The buggy address belongs to the object at
ffff8801c6e71dc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 208 bytes inside of
224-byte region [
ffff8801c6e71dc0,
ffff8801c6e71ea0)
The buggy address belongs to the page:
page:
ffffea00071b9c40 count:1 mapcount:0 mapping:
ffff8801c6e71000 index:0x0
flags: 0x200000000000100(slab)
raw:
0200000000000100 ffff8801c6e71000 0000000000000000 000000010000000c
raw:
ffffea0007224a20 ffff8801d98caf48 ffff8801d9e79040 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801c6e71d80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff8801c6e71e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
ffff8801c6e71e80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff8801c6e71f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801c6e71f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 18 Aug 2017 20:39:56 +0000 (13:39 -0700)]
tun: handle register_netdevice() failures properly
syzkaller reported a double free [1], caused by the fact
that tun driver was not updated properly when priv_destructor
was added.
When/if register_netdevice() fails, priv_destructor() must have been
called already.
[1]
BUG: KASAN: double-free or invalid-free in selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
CPU: 0 PID: 2919 Comm: syzkaller227220 Not tainted 4.13.0-rc4+ #23
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
print_address_description+0x7f/0x260 mm/kasan/report.c:252
kasan_report_double_free+0x55/0x80 mm/kasan/report.c:333
kasan_slab_free+0xa0/0xc0 mm/kasan/kasan.c:514
__cache_free mm/slab.c:3503 [inline]
kfree+0xd3/0x260 mm/slab.c:3820
selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
security_tun_dev_free_security+0x48/0x80 security/security.c:1512
tun_set_iff drivers/net/tun.c:1884 [inline]
__tun_chr_ioctl+0x2ce6/0x3d50 drivers/net/tun.c:2064
tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x443ff9
RSP: 002b:
00007ffc34271f68 EFLAGS:
00000217 ORIG_RAX:
0000000000000010
RAX:
ffffffffffffffda RBX:
00000000004002e0 RCX:
0000000000443ff9
RDX:
0000000020533000 RSI:
00000000400454ca RDI:
0000000000000003
RBP:
0000000000000086 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000217 R12:
0000000000401ce0
R13:
0000000000401d70 R14:
0000000000000000 R15:
0000000000000000
Allocated by task 2919:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x101/0x6f0 mm/slab.c:3627
kmalloc include/linux/slab.h:493 [inline]
kzalloc include/linux/slab.h:666 [inline]
selinux_tun_dev_alloc_security+0x49/0x170 security/selinux/hooks.c:5012
security_tun_dev_alloc_security+0x6d/0xa0 security/security.c:1506
tun_set_iff drivers/net/tun.c:1839 [inline]
__tun_chr_ioctl+0x1730/0x3d50 drivers/net/tun.c:2064
tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe
Freed by task 2919:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3503 [inline]
kfree+0xd3/0x260 mm/slab.c:3820
selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
security_tun_dev_free_security+0x48/0x80 security/security.c:1512
tun_free_netdev+0x13b/0x1b0 drivers/net/tun.c:1563
register_netdevice+0x8d0/0xee0 net/core/dev.c:7605
tun_set_iff drivers/net/tun.c:1859 [inline]
__tun_chr_ioctl+0x1caf/0x3d50 drivers/net/tun.c:2064
tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe
The buggy address belongs to the object at
ffff8801d2843b40
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 0 bytes inside of
32-byte region [
ffff8801d2843b40,
ffff8801d2843b60)
The buggy address belongs to the page:
page:
ffffea000660cea8 count:1 mapcount:0 mapping:
ffff8801d2843000 index:0xffff8801d2843fc1
flags: 0x200000000000100(slab)
raw:
0200000000000100 ffff8801d2843000 ffff8801d2843fc1 000000010000003f
raw:
ffffea0006626a40 ffffea00066141a0 ffff8801dbc00100
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc
>
ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc
^
ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
==================================================================
Fixes:
cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Fri, 18 Aug 2017 22:16:31 +0000 (15:16 -0700)]
mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer. This is a partial revert of:
eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.
The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack. This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).
The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again. Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found. (e.g. always loading PIE into
the mmap region for marked binaries.)
Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast
Fixes:
eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Fixes:
02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Kostya Serebryany <kcc@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laura Abbott [Fri, 18 Aug 2017 22:16:27 +0000 (15:16 -0700)]
mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM
Commit
19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") added
use of __GFP_HIGHMEM for allocations. vmalloc_32 may use
GFP_DMA/GFP_DMA32 which does not play nice with __GFP_HIGHMEM and will
trigger a BUG in gfp_zone.
Only add __GFP_HIGHMEM if we aren't using GFP_DMA/GFP_DMA32.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1482249
Link: http://lkml.kernel.org/r/20170816220705.31374-1-labbott@redhat.com
Fixes:
19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zhong jiang [Fri, 18 Aug 2017 22:16:24 +0000 (15:16 -0700)]
mm/mempolicy: fix use after free when calling get_mempolicy
I hit a use after free issue when executing trinity and repoduced it
with KASAN enabled. The related call trace is as follows.
BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr
ffff8801f582d766
Read of size 2 by task syz-executor1/798
INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
__slab_alloc+0x768/0x970
kmem_cache_alloc+0x2e7/0x450
mpol_new.part.2+0x74/0x160
mpol_new+0x66/0x80
SyS_mbind+0x267/0x9f0
system_call_fastpath+0x16/0x1b
INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
__slab_free+0x495/0x8e0
kmem_cache_free+0x2f3/0x4c0
__mpol_put+0x2b/0x40
SyS_mbind+0x383/0x9f0
system_call_fastpath+0x16/0x1b
INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600
Bytes b4
ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Object
ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object
ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
Redzone
ffff8801f582d778: bb bb bb bb bb bb bb bb ........
Padding
ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Memory state around the buggy address:
ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>
ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc
!shared memory policy is not protected against parallel removal by other
thread which is normally protected by the mmap_sem. do_get_mempolicy,
however, drops the lock midway while we can still access it later.
Early premature up_read is a historical artifact from times when
put_user was called in this path see https://lwn.net/Articles/124754/
but that is gone since
8bccd85ffbaf ("[PATCH] Implement sys_* do_*
layering in the memory policy layer."). but when we have the the
current mempolicy ref count model. The issue was introduced
accordingly.
Fix the issue by removing the premature release.
Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [2.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prakash Gupta [Fri, 18 Aug 2017 22:16:21 +0000 (15:16 -0700)]
mm/cma_debug.c: fix stack corruption due to sprintf usage
name[] in cma_debugfs_add_one() can only accommodate 16 chars including
NULL to store sprintf output. It's common for cma device name to be
larger than 15 chars. This can cause stack corrpution. If the gcc
stack protector is turned on, this can cause a panic due to stack
corruption.
Below is one example trace:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
ffffff8e69a75730
Call trace:
dump_backtrace+0x0/0x2c4
show_stack+0x20/0x28
dump_stack+0xb8/0xf4
panic+0x154/0x2b0
print_tainted+0x0/0xc0
cma_debugfs_init+0x274/0x290
do_one_initcall+0x5c/0x168
kernel_init_freeable+0x1c8/0x280
Fix the short sprintf buffer in cma_debugfs_add_one() by using
scnprintf() instead of sprintf().
Link: http://lkml.kernel.org/r/1502446217-21840-1-git-send-email-guptap@codeaurora.org
Fixes:
f318dd083c81 ("cma: Store a name in the cma structure")
Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
Acked-by: Laura Abbott <labbott@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jamie Iles [Fri, 18 Aug 2017 22:16:18 +0000 (15:16 -0700)]
signal: don't remove SIGNAL_UNKILLABLE for traced tasks.
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
faults, but this is undesirable when tracing. For example, debugging an
init process (whether global or namespace), hitting a breakpoint and
SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE.
Everything continues fine, but then once debugging has finished, the
init process is left killable which is unlikely what the user expects,
resulting in either an accidentally killed init or an init that stops
reaping zombies.
Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 18 Aug 2017 22:16:15 +0000 (15:16 -0700)]
mm, oom: fix potential data corruption when oom_reaper races with writer
Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies, and so the
oom_reaper can tear it down, is not entirely true.
__task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
So there is a race window when some threads won't have
fatal_signal_pending while the oom_reaper could start unmapping the
address space. Moreover some paths might not check for fatal signals
before each PF/g-u-p/copy_from_user.
We already have a protection for oom_reaper vs. PF races by checking
MMF_UNSTABLE. This has been, however, checked only for kernel threads
(use_mm users) which can outlive the oom victim. A simple fix would be
to extend the current check in handle_mm_fault for all tasks but that
wouldn't be sufficient because the current check assumes that a kernel
thread would bail out after EFAULT from get_user*/copy_from_user and
never re-read the same address which would succeed because the PF path
has established page tables already. This seems to be the case for the
only existing use_mm user currently (virtio driver) but it is rather
fragile in general.
This is even more fragile in general for more complex paths such as
generic_perform_write which can re-read the same address more times
(e.g. iov_iter_copy_from_user_atomic to fail and then
iov_iter_fault_in_readable on retry).
Therefore we have to implement MMF_UNSTABLE protection in a robust way
and never make a potentially corrupted content visible. That requires
to hook deeper into the PF path and check for the flag _every time_
before a pte for anonymous memory is established (that means all
!VM_SHARED mappings).
The corruption can be triggered artificially
(http://lkml.kernel.org/r/
201708040646.v746kkhC024636@www262.sakura.ne.jp)
but there doesn't seem to be any real life bug report. The race window
should be quite tight to trigger most of the time.
Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
Fixes:
aac453635549 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 18 Aug 2017 22:16:12 +0000 (15:16 -0700)]
mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
a.out (1169) used greatest stack depth: 11664 bytes left
DEBUG_LOCKS_WARN_ON(depth <= 0)
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-
20170803+ #142
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
RIP: 0010:lock_release+0x172/0x1e0
Call Trace:
up_read+0x1a/0x40
__do_page_fault+0x28e/0x4c0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
the MMF_UNSTABLE path.
We cannot simply add VM_FAULT_SIGBUS to the existing error code because
all arch specific page fault handlers and g-u-p would have to learn a
new error code combination.
Link: http://lkml.kernel.org/r/20170807113839.16695-2-mhocko@kernel.org
Fixes:
3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Fri, 18 Aug 2017 22:16:08 +0000 (15:16 -0700)]
slub: fix per memcg cache leak on css offline
To avoid a possible deadlock, sysfs_slab_remove() schedules an
asynchronous work to delete sysfs entries corresponding to the kmem
cache. To ensure the cache isn't freed before the work function is
called, it takes a reference to the cache kobject. The reference is
supposed to be released by the work function.
However, the work function (sysfs_slab_remove_workfn()) does nothing in
case the cache sysfs entry has already been deleted, leaking the kobject
and the corresponding cache.
This may happen on a per memcg cache destruction, because sysfs entries
of a per memcg cache are deleted on memcg offline if the cache is empty
(see __kmemcg_cache_deactivate()).
The kmemleak report looks like this:
unreferenced object 0xffff9f798a79f540 (size 32):
comm "kworker/1:4", pid 15416, jiffies
4307432429 (age 28687.554s)
hex dump (first 32 bytes):
6b 6d 61 6c 6c 6f 63 2d 31 36 28 31 35 39 39 3a kmalloc-16(1599:
6e 65 77 72 6f 6f 74 29 00 23 6b c0 ff ff ff ff newroot).#k.....
backtrace:
kmemleak_alloc+0x4a/0xa0
__kmalloc_track_caller+0x148/0x2c0
kvasprintf+0x66/0xd0
kasprintf+0x49/0x70
memcg_create_kmem_cache+0xe6/0x160
memcg_kmem_cache_create_func+0x20/0x110
process_one_work+0x205/0x5d0
worker_thread+0x4e/0x3a0
kthread+0x109/0x140
ret_from_fork+0x2a/0x40
unreferenced object 0xffff9f79b6136840 (size 416):
comm "kworker/1:4", pid 15416, jiffies
4307432429 (age 28687.573s)
hex dump (first 32 bytes):
40 fb 80 c2 3e 33 00 00 00 00 00 40 00 00 00 00 @...>3.....@....
00 00 00 00 00 00 00 00 10 00 00 00 10 00 00 00 ................
backtrace:
kmemleak_alloc+0x4a/0xa0
kmem_cache_alloc+0x128/0x280
create_cache+0x3b/0x1e0
memcg_create_kmem_cache+0x118/0x160
memcg_kmem_cache_create_func+0x20/0x110
process_one_work+0x205/0x5d0
worker_thread+0x4e/0x3a0
kthread+0x109/0x140
ret_from_fork+0x2a/0x40
Fix the leak by adding the missing call to kobject_put() to
sysfs_slab_remove_workfn().
Link: http://lkml.kernel.org/r/20170812181134.25027-1-vdavydov.dev@gmail.com
Fixes:
3b7b314053d02 ("slub: make sysfs file removal asynchronous")
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org> [4.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Fri, 18 Aug 2017 22:16:05 +0000 (15:16 -0700)]
mm: discard memblock data later
There is existing use after free bug when deferred struct pages are
enabled:
The memblock_add() allocates memory for the memory array if more than
128 entries are needed. See comment in e820__memblock_setup():
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
This memblock memory is freed here:
free_low_memory_core_early()
We access the freed memblock.memory later in boot when deferred pages
are initialized in this path:
deferred_init_memmap()
for_each_mem_pfn_range()
__next_mem_pfn_range()
type = &memblock.memory;
One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been
exceeded at least on systems where deferred struct pages were enabled.
Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting excuted and that the
freed pages are sane.
Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
Fixes:
7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luis R. Rodriguez [Fri, 18 Aug 2017 22:16:02 +0000 (15:16 -0700)]
test_kmod: fix description for -s -and -c parameters
The descriptions were reversed, correct this.
Link: http://lkml.kernel.org/r/20170809234635.13443-4-mcgrof@kernel.org
Fixes:
64b671204afd71 ("test_sysctl: add generic script to expand on tests")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: Daniel Mentz <danielmentz@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Redfearn <matt.redfearn@imgetc.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Michal Marek <mmarek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luis R. Rodriguez [Fri, 18 Aug 2017 22:15:58 +0000 (15:15 -0700)]
kmod: fix wait on recursive loop
Recursive loops with module loading were previously handled in kmod by
restricting the number of modprobe calls to 50 and if that limit was
breached request_module() would return an error and a user would see the
following on their kernel dmesg:
request_module: runaway loop modprobe binfmt-464c
Starting init:/sbin/init exists but couldn't execute it (error -8)
This issue could happen for instance when a 64-bit kernel boots a 32-bit
userspace on some architectures and has no 32-bit binary format
hanlders. This is visible, for instance, when a CONFIG_MODULES enabled
64-bit MIPS kernel boots a into o32 root filesystem and the binfmt
handler for o32 binaries is not built-in.
After commit
6d7964a722af ("kmod: throttle kmod thread limit") we now
don't have any visible signs of an error and the kernel just waits for
the loop to end somehow.
Although this *particular* recursive loop could also be addressed by
doing a sanity check on search_binary_handler() and disallowing a
modular binfmt to be required for modprobe, a generic solution for any
recursive kernel kmod issues is still needed.
This should catch these loops. We can investigate each loop and address
each one separately as they come in, this however puts a stop gap for
them as before.
Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org
Fixes:
6d7964a722af ("kmod: throttle kmod thread limit")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: Matt Redfearn <matt.redfearn@imgtec.com>
Tested-by: Matt Redfearn <matt.redfearn@imgetc.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luis R. Rodriguez [Fri, 18 Aug 2017 22:15:55 +0000 (15:15 -0700)]
wait: add wait_event_killable_timeout()
These are the few pending fixes I have queued up for v4.13-final. One
is a a generic regression fix for recursive loops on kmod and the other
one is a trivial print out correction.
During the v4.13 development we assumed that recursive kmod loops were
no longer possible. Clearly that is not true. The regression fix makes
use of a new killable wait. We use a killable wait to be paranoid in
how signals might be sent to modprobe and only accept a proper SIGKILL.
The signal will only be available to userspace to issue *iff* a thread
has already entered a wait state, and that happens only if we've already
throttled after 50 kmod threads have been hit.
Note that although it may seem excessive to trigger a failure afer 5
seconds if all kmod thread remain busy, prior to the series of changes
that went into v4.13 we would actually *always* fatally fail any request
which came in if the limit was already reached. The new waiting
implemented in v4.13 actually gives us *more* breathing room -- the wait
for 5 seconds is a wait for *any* kmod thread to finish. We give up and
fail *iff* no kmod thread has finished and they're *all* running
straight for 5 consecutive seconds. If 50 kmod threads are running
consecutively for 5 seconds something else must be really bad.
Recursive loops with kmod are bad but they're also hard to implement
properly as a selftest without currently fooling current userspace tools
like kmod [1]. For instance kmod will complain when you run depmod if
it finds a recursive loop with symbol dependency between modules as such
this type of recursive loop cannot go upstream as the modules_install
target will fail after running depmod.
These tests already exist on userspace kmod upstream though (refer to
the testsuite/module-playground/mod-loop-*.c files). The same is not
true if request_module() is used though, or worst if aliases are used.
Likewise the issue with 64-bit kernels booting 32-bit userspace without
a binfmt handler built-in is also currently not detected and proactively
avoided by userspace kmod tools, or kconfig for all architectures.
Although we could complain in the kernel when some of these individual
recursive issues creep up, proactively avoiding these situations in
userspace at build time is what we should keep striving for.
Lastly, since recursive loops could happen with kmod it may mean
recursive loops may also be possible with other kernel usermode helpers,
this should be investigated and long term if we can come up with a more
sensible generic solution even better!
[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=
20170809-kmod-for-v4.13-final
[1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git
This patch (of 3):
This wait is similar to wait_event_interruptible_timeout() but only
accepts SIGKILL interrupt signal. Other signals are ignored.
Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Matt Redfearn <matt.redfearn@imgetc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicholas Piggin [Fri, 18 Aug 2017 22:15:51 +0000 (15:15 -0700)]
kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog
Commit
05a4a9527931 ("kernel/watchdog: split up config options") lost
the perf-based hardlockup detector's dependency on PERF_EVENTS, which
can result in broken builds with some powerpc configurations.
Restore the dependency. Add it in for x86 too, despite x86 always
selecting PERF_EVENTS it seems reasonable to make the dependency
explicit.
Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
Fixes:
05a4a9527931 ("kernel/watchdog: split up config options")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 18 Aug 2017 22:15:48 +0000 (15:15 -0700)]
mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
to update the memcg stats:
BUG: unable to handle kernel NULL pointer dereference at
00000000000003b0
IP: test_clear_page_writeback+0x12e/0x2c0
[...]
RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
Call Trace:
<IRQ>
end_page_writeback+0x47/0x70
f2fs_write_end_io+0x76/0x180 [f2fs]
bio_endio+0x9f/0x120
blk_update_request+0xa8/0x2f0
scsi_end_request+0x39/0x1d0
scsi_io_completion+0x211/0x690
scsi_finish_command+0xd9/0x120
scsi_softirq_done+0x127/0x150
__blk_mq_complete_request_remote+0x13/0x20
flush_smp_call_function_queue+0x56/0x110
generic_smp_call_function_single_interrupt+0x13/0x30
smp_call_function_single_interrupt+0x27/0x40
call_function_single_interrupt+0x89/0x90
RIP: 0010:native_safe_halt+0x6/0x10
(gdb) l *(test_clear_page_writeback+0x12e)
0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
614 mod_node_page_state(page_pgdat(page), idx, val);
615 if (mem_cgroup_disabled() || !page->mem_cgroup)
616 return;
617 mod_memcg_state(page->mem_cgroup, idx, val);
618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
619 this_cpu_add(pn->lruvec_stat->count[idx], val);
620 }
621
622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
623 gfp_t gfp_mask,
The issue is that writeback doesn't hold a page reference and the page
might get freed after PG_writeback is cleared (and the mapping is
unlocked) in test_clear_page_writeback(). The stat functions looking up
the page's node or zone are safe, as those attributes are static across
allocation and free cycles. But page->mem_cgroup is not, and it will
get cleared if we race with truncation or migration.
It appears this race window has been around for a while, but less likely
to trigger when the memcg stats were updated first thing after
PG_writeback is cleared. Recent changes reshuffled this code to update
the global node stats before the memcg ones, though, stretching the race
window out to an extent where people can reproduce the problem.
Update test_clear_page_writeback() to look up and pin page->mem_cgroup
before clearing PG_writeback, then not use that pointer afterward. It
is a partial revert of
62cccb8c8e7a ("mm: simplify lock_page_memcg()")
but leaves the pageref-holding callsites that aren't affected alone.
Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
Fixes:
62cccb8c8e7a ("mm: simplify lock_page_memcg()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reported-by: Bradley Bolen <bradleybolen@gmail.com>
Tested-by: Brad Bolen <bradleybolen@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Dawson [Fri, 18 Aug 2017 19:04:54 +0000 (15:04 -0400)]
datagram: When peeking datagrams with offset < 0 don't skip empty skbs
Due to commit
e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
headers from UDP packets before queueing"), when udp packets are being
peeked the requested extra offset is always 0 as there is no need to skip
the udp header. However, when the offset is 0 and the next skb is
of length 0, it is only returned once. The behaviour can be seen with
the following python script:
from socket import *;
f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
f.bind(('::', 0));
addr=('::1', f.getsockname()[1]);
g.sendto(b'', addr)
g.sendto(b'b', addr)
print(f.recvfrom(10, MSG_PEEK));
print(f.recvfrom(10, MSG_PEEK));
Where the expected output should be the empty string twice.
Instead, make sk_peek_offset return negative values, and pass those values
to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset
to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
__skb_try_recv_from_queue will then ensure the offset is reset back to 0
if a peek is requested without an offset, unless no packets are found.
Also simplify the if condition in __skb_try_recv_from_queue. If _off is
greater then 0, and off is greater then or equal to skb->len, then
(_off || skb->len) must always be true assuming skb->len >= 0 is always
true.
Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
as it double checked if MSG_PEEK was set in the flags.
V2:
- Moved the negative fixup into __skb_try_recv_from_queue, and remove now
redundant checks
- Fix peeking in udp{,v6}_recvmsg to report the right value when the
offset is 0
V3:
- Marked new branch in __skb_try_recv_from_queue as unlikely.
Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 18 Aug 2017 21:25:50 +0000 (14:25 -0700)]
Merge tag 'xfs-4.13-fixes-5' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"A handful more bug fixes for you today.
Changes since last time:
- Don't leak resources when mount fails
- Don't accidentally clobber variables when looking for free inodes"
* tag 'xfs-4.13-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't leak quotacheck dquots when cow recovery
xfs: clear MS_ACTIVE after finishing log recovery
iomap: fix integer truncation issues in the zeroing and dirtying helpers
xfs: fix inobt inode allocation search optimization
Linus Torvalds [Fri, 18 Aug 2017 21:12:39 +0000 (14:12 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A small set of fixes that should go into this release. This contains:
- An NVMe pull request from Christoph, with a few select fixes.
One of them fix a polling regression in this series, in which it's
trivial to cause the kernel to disable most of the hardware queue
interrupts.
- Fixup for a blk-mq queue usage imbalance on request allocation,
from Keith.
- A xen block pull request from Konrad, fixing two issues with
xen/xen-blkfront"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL
nvme-pci: set cqe_seen on polled completions
nvme-fabrics: fix reporting of unrecognized options
nvmet-fc: eliminate incorrect static markers on local variables
nvmet-fc: correct use after free on list teardown
nvmet: don't overwrite identify sn/fr with 0-bytes
xen-blkfront: use a right index when checking requests
xen: fix bio vec merging
blk-mq: Fix queue usage on failed request allocation
Linus Torvalds [Fri, 18 Aug 2017 19:35:22 +0000 (12:35 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"Fourth set of -rc fixes for 4.13 cycle. This is all of the -rc fixes
that we know of. I suspect this will be the last rc pull request, but
you never know, I could be wrong.
Nothing major here. There are the i40iw patches I mentioned in my last
pull request minus one that I pulled out because it wasn't a fix and
not appropriate for the rc cycle. Then a few other items trickled in
and were added to the pull request. It's fairly small aside from those
five i40iw patches
- Set of five i40iw fixes (the first of these is rather large by line
count consideration, but I decided to send it because if fixes a
legitimate issue and the line count is because it does so by
creating a new function and using it where needed instead of just
patching up a few lines...a smaller fix could probably be done, but
the larger fix is the better code solution)
- One vmw_pvrdma fix
- One hns_roce fix (this silences a checker warning, but can't
actually happen, I expect a patch to remove this from all drivers
that share this same check in for-next)
- One iw_cxgb4 fix
- Two IB core fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/uverbs: Fix NULL pointer dereference during device removal
IB/core: Protect sysfs entry on ib_unregister_device
iw_cxgb4: fix misuse of integer variable
IB/hns: fix memory leak on ah on error return path
i40iw: Fix potential fcn_id_array out of bounds
i40iw: Use correct alignment for CQ0 memory
i40iw: Fix typecast of tcp_seq_num
i40iw: Correct variable names
i40iw: Fix parsing of query/commit FPM buffers
RDMA/vmw_pvrdma: Report CQ missed events