Luis Henriques [Tue, 7 Apr 2020 10:30:20 +0000 (11:30 +0100)]
ceph: allow rename operation under different quota realms
Returning -EXDEV when trying to 'mv' files/directories from different
quota realms results in copy+unlink operations instead of the faster
CEPH_MDS_OP_RENAME. This will occur even when there aren't any quotas
set in the destination directory, or if there's enough space left for
the new file(s).
This patch adds a new helper function to be called on rename operations
which will allow these operations if they can be executed. This patch
mimics userland fuse client commit
b8954e5734b3 ("client:
optimize rename operation under different quota root").
Since ceph_quota_is_same_realm() is now called only from this new
helper, make it static.
URL: https://tracker.ceph.com/issues/44791
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Luis Henriques [Tue, 7 Apr 2020 10:30:19 +0000 (11:30 +0100)]
ceph: normalize 'delta' parameter usage in check_quota_exceeded
Function check_quota_exceeded() uses delta parameter only for the
QUOTA_CHECK_MAX_BYTES_OP operation. Using this parameter also for
MAX_FILES will makes the code cleaner and will be required to support
cross-quota-tree renames.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Fri, 3 Apr 2020 17:09:07 +0000 (13:09 -0400)]
ceph: ceph_kick_flushing_caps needs the s_mutex
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.
ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.
Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 1 Apr 2020 22:27:25 +0000 (18:27 -0400)]
ceph: request expedited service on session's last cap flush
When flushing a lot of caps to the MDS's at once (e.g. for syncfs),
we can end up waiting a substantial amount of time for MDS replies, due
to the fact that it may delay some of them so that it can batch them up
together in a single journal transaction. This can lead to stalls when
calling sync or syncfs.
What we'd really like to do is request expedited service on the _last_
cap we're flushing back to the server. If the CHECK_CAPS_FLUSH flag is
set on the request and the current inode was the last one on the
session->s_cap_dirty list, then mark the request with
CEPH_CLIENT_CAPS_SYNC.
Note that this heuristic is not perfect. New inodes can race onto the
list after we've started flushing, but it does seem to fix some common
use cases.
URL: https://tracker.ceph.com/issues/44744
Reported-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 1 Apr 2020 21:07:52 +0000 (17:07 -0400)]
ceph: convert mdsc->cap_dirty to a per-session list
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.
This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.
If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Yan, Zheng [Mon, 30 Mar 2020 11:56:37 +0000 (19:56 +0800)]
ceph: reset i_requested_max_size if file write is not wanted
write can stuck at waiting for larger max_size in following sequence of
events:
- client opens a file and writes to position 'A' (larger than unit of
max size increment)
- client closes the file handle and updates wanted caps (not wanting
file write caps)
- client opens and truncates the file, writes to position 'A' again.
At the 1st event, client set inode's requested_max_size to 'A'. At the
2nd event, mds removes client's writable range, but client does not reset
requested_max_size. At the 3rd event, client does not request max size
because requested_max_size is already larger than 'A'.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Fri, 20 Mar 2020 21:07:36 +0000 (17:07 -0400)]
ceph: throw a warning if we destroy session with mutex still locked
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Fri, 20 Mar 2020 20:45:45 +0000 (16:45 -0400)]
ceph: fix potential race in ceph_check_caps
Nothing ensures that session will still be valid by the time we
dereference the pointer. Take and put a reference.
In principle, we should always be able to get a reference here, but
throw a warning if that's ever not the case.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 19 Mar 2020 19:34:13 +0000 (15:34 -0400)]
ceph: document what protects i_dirty_item and i_flushing_item
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 19 Mar 2020 16:00:16 +0000 (12:00 -0400)]
ceph: don't take i_ceph_lock in handle_cap_import
Just take it before calling it. This means we have to do a couple of
minor in-memory operations under the spinlock now, but those shouldn't
be an issue.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 18 Mar 2020 20:43:30 +0000 (16:43 -0400)]
ceph: don't release i_ceph_lock in handle_cap_trunc
There's no reason to do this here. Just have the caller handle it.
Also, add a lockdep assertion.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 18 Mar 2020 19:34:20 +0000 (15:34 -0400)]
ceph: add comments for handle_cap_flush_ack logic
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 18 Mar 2020 19:29:34 +0000 (15:29 -0400)]
ceph: split up __finish_cap_flush
This function takes a mdsc argument or ci argument, but if both are
passed in, it ignores the ci arg. Fortunately, nothing does that, but
there's no good reason to have the same function handle both cases.
Also, get rid of some branches and just use |= to set the wake_* vals.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Tue, 17 Mar 2020 12:47:31 +0000 (08:47 -0400)]
ceph: reorganize __send_cap for less spinlock abuse
Get rid of the __releases annotation by breaking it up into two
functions: __prep_cap which is done under the spinlock and __send_cap
that is done outside it. Add new fields to cap_msg_args for the wake
boolean and old_xattr_buf pointer.
Nothing checks the return value from __send_cap, so make it void
return.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Fri, 20 Mar 2020 03:45:02 +0000 (23:45 -0400)]
ceph: add metadata perf metric support
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Fri, 20 Mar 2020 03:45:01 +0000 (23:45 -0400)]
ceph: add read/write latency metric support
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.
Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Fri, 20 Mar 2020 03:45:00 +0000 (23:45 -0400)]
ceph: add caps perf metric for each superblock
Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Fri, 20 Mar 2020 03:44:59 +0000 (23:44 -0400)]
ceph: add dentry lease metric support
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.
For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.
The output looks like:
item total miss hit
-------------------------------------------------
d_lease 11 7 141
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Linus Torvalds [Sun, 31 May 2020 23:49:15 +0000 (16:49 -0700)]
Linux 5.7
Joe Perches [Fri, 29 May 2020 23:12:21 +0000 (16:12 -0700)]
checkpatch/coding-style: deprecate 80-column warning
Yes, staying withing 80 columns is certainly still _preferred_. But
it's not the hard limit that the checkpatch warnings imply, and other
concerns can most certainly dominate.
Increase the default limit to 100 characters. Not because 100
characters is some hard limit either, but that's certainly a "what are
you doing" kind of value and less likely to be about the occasional
slightly longer lines.
Miscellanea:
- to avoid unnecessary whitespace changes in files, checkpatch will no
longer emit a warning about line length when scanning files unless
--strict is also used
- Add a bit to coding-style about alignment to open parenthesis
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 31 May 2020 17:45:11 +0000 (10:45 -0700)]
Merge tag 'x86-urgent-2020-05-31' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A pile of x86 fixes:
- Prevent a memory leak in ioperm which was caused by the stupid
assumption that the exit cleanup is always called for current,
which is not the case when fork fails after taking a reference on
the ioperm bitmap.
- Fix an arithmething overflow in the DMA code on 32bit systems
- Fill gaps in the xstate copy with defaults instead of leaving them
uninitialized
- Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out
that existing user space fails to build"
* tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioperm: Prevent a memory leak when fork fails
x86/dma: Fix max PFN arithmetic overflow on 32 bit systems
copy_xstate_to_kernel(): don't leave parts of destination uninitialized
x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
Linus Torvalds [Sun, 31 May 2020 17:43:17 +0000 (10:43 -0700)]
Merge tag 'sched-urgent-2020-05-31' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix preventing a crash in NUMA balancing.
The current->mm check is not reliable as the mm might be temporary due
to use_mm() in a kthread. Check for PF_KTHREAD explictly"
* tag 'sched-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Don't NUMA balance for kthreads
Linus Torvalds [Sun, 31 May 2020 17:16:53 +0000 (10:16 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
"Another week, another set of bug fixes:
1) Fix pskb_pull length in __xfrm_transport_prep(), from Xin Long.
2) Fix double xfrm_state put in esp{4,6}_gro_receive(), also from Xin
Long.
3) Re-arm discovery timer properly in mac80211 mesh code, from Linus
Lüssing.
4) Prevent buffer overflows in nf_conntrack_pptp debug code, from
Pablo Neira Ayuso.
5) Fix race in ktls code between tls_sw_recvmsg() and
tls_decrypt_done(), from Vinay Kumar Yadav.
6) Fix crashes on TCP fallback in MPTCP code, from Paolo Abeni.
7) More validation is necessary of untrusted GSO packets coming from
virtualization devices, from Willem de Bruijn.
8) Fix endianness of bnxt_en firmware message length accesses, from
Edwin Peer.
9) Fix infinite loop in sch_fq_pie, from Davide Caratti.
10) Fix lockdep splat in DSA by setting lockless TX in netdev features
for slave ports, from Vladimir Oltean.
11) Fix suspend/resume crashes in mlx5, from Mark Bloch.
12) Fix use after free in bpf fmod_ret, from Alexei Starovoitov.
13) ARP retransmit timer guard uses wrong offset, from Hongbin Liu.
14) Fix leak in inetdev_init(), from Yang Yingliang.
15) Don't try to use inet hash and unhash in l2tp code, results in
crashes. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
l2tp: add sk_family checks to l2tp_validate_socket
l2tp: do not use inet_hash()/inet_unhash()
net: qrtr: Allocate workqueue before kernel_bind
mptcp: remove msk from the token container at destruction time.
mptcp: fix race between MP_JOIN and close
mptcp: fix unblocking connect()
net/sched: act_ct: add nat mangle action only for NAT-conntrack
devinet: fix memleak in inetdev_init()
virtio_vsock: Fix race condition in virtio_transport_recv_pkt
drivers/net/ibmvnic: Update VNIC protocol version reporting
NFC: st21nfca: add missed kfree_skb() in an error path
neigh: fix ARP retransmit timer guard
bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones
bpf, selftests: Verifier bounds tests need to be updated
bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
bpf: Fix use-after-free in fmod_ret check
net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
net/mlx5e: Fix MLX5_TC_CT dependencies
net/mlx5e: Properly set default values when disabling adaptive moderation
net/mlx5e: Fix arch depending casting issue in FEC
...
Eric Dumazet [Fri, 29 May 2020 18:32:25 +0000 (11:32 -0700)]
l2tp: add sk_family checks to l2tp_validate_socket
syzbot was able to trigger a crash after using an ISDN socket
and fool l2tp.
Fix this by making sure the UDP socket is of the proper family.
BUG: KASAN: slab-out-of-bounds in setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
Write of size 1 at addr
ffff88808ed0c590 by task syz-executor.5/3018
CPU: 0 PID: 3018 Comm: syz-executor.5 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:382
__kasan_report.cold+0x20/0x38 mm/kasan/report.c:511
kasan_report+0x33/0x50 mm/kasan/common.c:625
setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
l2tp_tunnel_register+0xb15/0xdd0 net/l2tp/l2tp_core.c:1523
l2tp_nl_cmd_tunnel_create+0x4b2/0xa60 net/l2tp/l2tp_netlink.c:249
genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e6/0x810 net/socket.c:2352
___sys_sendmsg+0x100/0x170 net/socket.c:2406
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2439
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007effe76edc78 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
00000000004fe1c0 RCX:
000000000045ca29
RDX:
0000000000000000 RSI:
0000000020000240 RDI:
0000000000000005
RBP:
000000000078bf00 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
00000000ffffffff
R13:
000000000000094e R14:
00000000004d5d00 R15:
00007effe76ee6d4
Allocated by task 3018:
save_stack+0x1b/0x40 mm/kasan/common.c:49
set_track mm/kasan/common.c:57 [inline]
__kasan_kmalloc mm/kasan/common.c:495 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:468
__do_kmalloc mm/slab.c:3656 [inline]
__kmalloc+0x161/0x7a0 mm/slab.c:3665
kmalloc include/linux/slab.h:560 [inline]
sk_prot_alloc+0x223/0x2f0 net/core/sock.c:1612
sk_alloc+0x36/0x1100 net/core/sock.c:1666
data_sock_create drivers/isdn/mISDN/socket.c:600 [inline]
mISDN_sock_create+0x272/0x400 drivers/isdn/mISDN/socket.c:796
__sock_create+0x3cb/0x730 net/socket.c:1428
sock_create net/socket.c:1479 [inline]
__sys_socket+0xef/0x200 net/socket.c:1521
__do_sys_socket net/socket.c:1530 [inline]
__se_sys_socket net/socket.c:1528 [inline]
__x64_sys_socket+0x6f/0xb0 net/socket.c:1528
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Freed by task 2484:
save_stack+0x1b/0x40 mm/kasan/common.c:49
set_track mm/kasan/common.c:57 [inline]
kasan_set_free_info mm/kasan/common.c:317 [inline]
__kasan_slab_free+0xf7/0x140 mm/kasan/common.c:456
__cache_free mm/slab.c:3426 [inline]
kfree+0x109/0x2b0 mm/slab.c:3757
kvfree+0x42/0x50 mm/util.c:603
__free_fdtable+0x2d/0x70 fs/file.c:31
put_files_struct fs/file.c:420 [inline]
put_files_struct+0x248/0x2e0 fs/file.c:413
exit_files+0x7e/0xa0 fs/file.c:445
do_exit+0xb04/0x2dd0 kernel/exit.c:791
do_group_exit+0x125/0x340 kernel/exit.c:894
get_signal+0x47b/0x24e0 kernel/signal.c:2739
do_signal+0x81/0x2240 arch/x86/kernel/signal.c:784
exit_to_usermode_loop+0x26c/0x360 arch/x86/entry/common.c:161
prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
do_syscall_64+0x6b1/0x7d0 arch/x86/entry/common.c:305
entry_SYSCALL_64_after_hwframe+0x49/0xb3
The buggy address belongs to the object at
ffff88808ed0c000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1424 bytes inside of
2048-byte region [
ffff88808ed0c000,
ffff88808ed0c800)
The buggy address belongs to the page:
page:
ffffea00023b4300 refcount:1 mapcount:0 mapping:
0000000000000000 index:0x0
flags: 0xfffe0000000200(slab)
raw:
00fffe0000000200 ffffea0002838208 ffffea00015ba288 ffff8880aa000e00
raw:
0000000000000000 ffff88808ed0c000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88808ed0c480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88808ed0c500: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
>
ffff88808ed0c580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88808ed0c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88808ed0c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes:
6b9f34239b00 ("l2tp: fix races in tunnel creation")
Fixes:
fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Guillaume Nault <gnault@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 29 May 2020 18:20:53 +0000 (11:20 -0700)]
l2tp: do not use inet_hash()/inet_unhash()
syzbot recently found a way to crash the kernel [1]
Issue here is that inet_hash() & inet_unhash() are currently
only meant to be used by TCP & DCCP, since only these protocols
provide the needed hashinfo pointer.
L2TP uses a single list (instead of a hash table)
This old bug became an issue after commit
610236587600
("bpf: Add new cgroup attach type to enable sock modifications")
since after this commit, sk_common_release() can be called
while the L2TP socket is still considered 'hashed'.
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 7063 Comm: syz-executor654 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00
RSP: 0018:
ffffc90001777d30 EFLAGS:
00010202
RAX:
dffffc0000000000 RBX:
ffff88809a6df940 RCX:
ffffffff8697c242
RDX:
0000000000000001 RSI:
ffffffff8697c251 RDI:
0000000000000008
RBP:
0000000000000000 R08:
ffff88809f3ae1c0 R09:
fffffbfff1514cc1
R10:
ffffffff8a8a6607 R11:
fffffbfff1514cc0 R12:
ffff88809a6df9b0
R13:
0000000000000007 R14:
0000000000000000 R15:
ffffffff873a4d00
FS:
0000000001d2b880(0000) GS:
ffff8880ae600000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000006cd090 CR3:
000000009403a000 CR4:
00000000001406f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
sk_common_release+0xba/0x370 net/core/sock.c:3210
inet_create net/ipv4/af_inet.c:390 [inline]
inet_create+0x966/0xe00 net/ipv4/af_inet.c:248
__sock_create+0x3cb/0x730 net/socket.c:1428
sock_create net/socket.c:1479 [inline]
__sys_socket+0xef/0x200 net/socket.c:1521
__do_sys_socket net/socket.c:1530 [inline]
__se_sys_socket net/socket.c:1528 [inline]
__x64_sys_socket+0x6f/0xb0 net/socket.c:1528
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x441e29
Code: e8 fc b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007ffdce184148 EFLAGS:
00000246 ORIG_RAX:
0000000000000029
RAX:
ffffffffffffffda RBX:
0000000000000003 RCX:
0000000000441e29
RDX:
0000000000000073 RSI:
0000000000000002 RDI:
0000000000000002
RBP:
0000000000000000 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
0000000000000000
R13:
0000000000402c30 R14:
0000000000000000 R15:
0000000000000000
Modules linked in:
---[ end trace
23b6578228ce553e ]---
RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00
RSP: 0018:
ffffc90001777d30 EFLAGS:
00010202
RAX:
dffffc0000000000 RBX:
ffff88809a6df940 RCX:
ffffffff8697c242
RDX:
0000000000000001 RSI:
ffffffff8697c251 RDI:
0000000000000008
RBP:
0000000000000000 R08:
ffff88809f3ae1c0 R09:
fffffbfff1514cc1
R10:
ffffffff8a8a6607 R11:
fffffbfff1514cc0 R12:
ffff88809a6df9b0
R13:
0000000000000007 R14:
0000000000000000 R15:
ffffffff873a4d00
FS:
0000000001d2b880(0000) GS:
ffff8880ae600000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000006cd090 CR3:
000000009403a000 CR4:
00000000001406f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Fixes:
0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Andrii Nakryiko <andriin@fb.com>
Reported-by: syzbot+3610d489778b57cc8031@syzkaller.appspotmail.com
Chris Lew [Thu, 28 May 2020 23:05:26 +0000 (16:05 -0700)]
net: qrtr: Allocate workqueue before kernel_bind
A null pointer dereference in qrtr_ns_data_ready() is seen if a client
opens a qrtr socket before qrtr_ns_init() can bind to the control port.
When the control port is bound, the ENETRESET error will be broadcasted
and clients will close their sockets. This results in DEL_CLIENT
packets being sent to the ns and qrtr_ns_data_ready() being called
without the workqueue being allocated.
Allocate the workqueue before setting sk_data_ready and binding to the
control port. This ensures that the work and workqueue structs are
allocated and initialized before qrtr_ns_data_ready can be called.
Fixes:
0c2204a4ad71 ("net: qrtr: Migrate nameservice to kernel from userspace")
Signed-off-by: Chris Lew <clew@codeaurora.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 31 May 2020 04:39:13 +0000 (21:39 -0700)]
Merge branch 'mptcp-a-bunch-of-fixes'
Paolo Abeni says:
====================
mptcp: a bunch of fixes
This patch series pulls together a few bugfixes for MPTCP bug observed while
doing stress-test with apache bench - forced to use MPTCP and multiple
subflows.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 29 May 2020 15:43:31 +0000 (17:43 +0200)]
mptcp: remove msk from the token container at destruction time.
Currently we remote the msk from the token container only
via mptcp_close(). The MPTCP master socket can be destroyed
also via other paths (e.g. if not yet accepted, when shutting
down the listener socket). When we hit the latter scenario,
dangling msk references are left into the token container,
leading to memory corruption and/or UaF.
This change addresses the issue by moving the token removal
into the msk destructor.
Fixes:
79c0949e9a09 ("mptcp: Add key generation and token tree")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 29 May 2020 15:43:30 +0000 (17:43 +0200)]
mptcp: fix race between MP_JOIN and close
If a MP_JOIN subflow completes the 3whs while another
CPU is closing the master msk, we can hit the
following race:
CPU1 CPU2
close()
mptcp_close
subflow_syn_recv_sock
mptcp_token_get_sock
mptcp_finish_join
inet_sk_state_load
mptcp_token_destroy
inet_sk_state_store(TCP_CLOSE)
__mptcp_flush_join_list()
mptcp_sock_graft
list_add_tail
sk_common_release
sock_orphan()
<socket free>
The MP_JOIN socket will be leaked. Additionally we can hit
UaF for the msk 'struct socket' referenced via the 'conn'
field.
This change try to address the issue introducing some
synchronization between the MP_JOIN 3whs and mptcp_close
via the join_list spinlock. If we detect the msk is closing
the MP_JOIN socket is closed, too.
Fixes:
f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 29 May 2020 15:43:29 +0000 (17:43 +0200)]
mptcp: fix unblocking connect()
Currently unblocking connect() on MPTCP sockets fails frequently.
If mptcp_stream_connect() is invoked to complete a previously
attempted unblocking connection, it will still try to create
the first subflow via __mptcp_socket_create(). If the 3whs is
completed and the 'can_ack' flag is already set, the latter
will fail with -EINVAL.
This change addresses the issue checking for pending connect and
delegating the completion to the first subflow. Additionally
do msk addresses and sk_state changes only when needed.
Fixes:
2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wenxu [Sat, 30 May 2020 05:54:51 +0000 (13:54 +0800)]
net/sched: act_ct: add nat mangle action only for NAT-conntrack
Currently add nat mangle action with comparing invert and orig tuple.
It is better to check IPS_NAT_MASK flags first to avoid non necessary
memcmp for non-NAT conntrack.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Sat, 30 May 2020 03:34:33 +0000 (11:34 +0800)]
devinet: fix memleak in inetdev_init()
When devinet_sysctl_register() failed, the memory allocated
in neigh_parms_alloc() should be freed.
Fixes:
20e61da7ffcf ("ipv4: fail early when creating netdev named all or default")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jia He [Sat, 30 May 2020 01:38:28 +0000 (09:38 +0800)]
virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server on the guest, there will be a panic on a ThunderX2 (armv8a server):
[ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address
0000000000000000
[ 463.718848] Mem abort info:
[ 463.718849] ESR = 0x96000044
[ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits
[ 463.718853] SET = 0, FnV = 0
[ 463.718854] EA = 0, S1PTW = 0
[ 463.718855] Data abort info:
[ 463.718856] ISV = 0, ISS = 0x00000044
[ 463.718857] CM = 0, WnR = 1
[ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=
0000008f6f6e9000
[ 463.718861] [
0000000000000000] pgd=
0000000000000000
[ 463.718866] Internal error: Oops:
96000044 [#1] SMP
[...]
[ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139
[ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018
[ 463.718982] pstate:
60400009 (nZCv daif +PAN -UAO)
[ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common]
[ 463.719000] sp :
ffff80002dbe3c40
[...]
[ 463.719025] Call trace:
[ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[ 463.719041] vhost_worker+0x100/0x1a0 [vhost]
[ 463.719048] kthread+0x128/0x130
[ 463.719052] ret_from_fork+0x10/0x18
The race condition is as follows:
Task1 Task2
===== =====
__sock_release virtio_transport_recv_pkt
__vsock_release vsock_find_bound_socket (found sk)
lock_sock_nested
vsock_remove_sock
sock_orphan
sk_set_socket(sk, NULL)
sk->sk_shutdown = SHUTDOWN_MASK
...
release_sock
lock_sock
virtio_transport_recv_connecting
sk->sk_socket->state (panic!)
The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If __vsock_release() is running in another task,
sk->sk_socket will be set to NULL inadvertently.
This fixes it by checking sk->sk_shutdown(suggested by Stefano) after
lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the
protection of lock_sock_nested.
Signed-off-by: Jia He <justin.he@arm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 30 May 2020 19:28:44 +0000 (12:28 -0700)]
Merge tag 'powerpc-5.7-6' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- a fix for the recent change to how we restore non-volatile GPRs,
which broke our emulation of reading from the DSCR (Data Stream
Control Register).
- a fix for the recent rewrite of interrupt/syscall exit in C, we need
to exclude KCOV from that code, otherwise it can lead to
unrecoverable faults.
Thanks to Daniel Axtens.
* tag 'powerpc-5.7-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Disable sanitisers for C syscall/interrupt entry/exit code
powerpc/64s: Fix restore of NV GPRs after facility unavailable exception
Linus Torvalds [Sat, 30 May 2020 19:26:21 +0000 (12:26 -0700)]
Merge tag 'gpio-v5.7-3' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Here are some (very) late fixes for GPIO, none of them very serious
except the one tagged for stable for enabling IRQ on open drain lines:
- Fix probing of mvebu chips without PWM
- Fix error path on ida_get_simple() on the exar driver
- Notify userspace properly about line status changes when flags are
changed on lines.
- Fix a sleeping while holding spinlock in the mellanox driver.
- Fix return value of the PXA and Kona probe calls.
- Fix IRQ locking of open drain lines, it is fine to have IRQs on
open drain lines flagged for output"
* tag 'gpio-v5.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: fix locking open drain IRQ lines
gpio: bcm-kona: Fix return value of bcm_kona_gpio_probe()
gpio: pxa: Fix return value of pxa_gpio_probe()
gpio: mlxbf2: Fix sleeping while holding spinlock
gpiolib: notify user-space about line status changes after flags are set
gpio: exar: Fix bad handling for ida_simple_get error path
gpio: mvebu: Fix probing for chips without PWM
Thomas Falcon [Thu, 28 May 2020 16:19:17 +0000 (11:19 -0500)]
drivers/net/ibmvnic: Update VNIC protocol version reporting
VNIC protocol version is reported in big-endian format, but it
is not byteswapped before logging. Fix that, and remove version
comparison as only one protocol version exists at this time.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chuhong Yuan [Thu, 28 May 2020 10:20:37 +0000 (18:20 +0800)]
NFC: st21nfca: add missed kfree_skb() in an error path
st21nfca_tm_send_atr_res() misses to call kfree_skb() in an error path.
Add the missed function call to fix it.
Fixes:
1892bf844ea0 ("NFC: st21nfca: Adding P2P support to st21nfca in Initiator & Target mode")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Thu, 28 May 2020 07:15:13 +0000 (15:15 +0800)]
neigh: fix ARP retransmit timer guard
In commit
19e16d220f0a ("neigh: support smaller retrans_time settting")
we add more accurate control for ARP and NS. But for ARP I forgot to
update the latest guard in neigh_timer_handler(), then the next
retransmit would be reset to jiffies + HZ/2 if we set the retrans_time
less than 500ms. Fix it by setting the time_before() check to HZ/100.
IPv6 does not have this issue.
Reported-by: Jianwen Ji <jiji@redhat.com>
Fixes:
19e16d220f0a ("neigh: support smaller retrans_time settting")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 May 2020 23:31:22 +0000 (16:31 -0700)]
Merge tag 'mlx5-fixes-2020-05-28' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2020-05-28
This series introduces some fixes to mlx5 driver.
v1->v2:
- Fix bad sha1, Jakub.
- Added one more patch by Pablo.
net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
Nothing major, the only patch worth mentioning is the suspend/resume crash
fix by adding the missing pci device handlers, the fix is very straight
forward and as Dexuan already expressed, the patch is important for Azure
users to avoid crash on VM hibernation, patch is marked for -stable v4.6
below.
Conflict note:
('net/mlx5e: Fix MLX5_TC_CT dependencies') has a trivial one line conflict
with current net-next, which can be resolved by simply using the line from
net-next.
Please pull and let me know if there is any problem.
For -stable v4.6
('net/mlx5: Fix crash upon suspend/resume')
For -stable v5.6
('net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()')
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 29 May 2020 23:10:07 +0000 (16:10 -0700)]
Merge tag 'armsoc-fixes-v5.7' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"This time there is one fix for the error path in the mediatek cmdq
driver (used by their video driver) and a couple of devicetree fixes,
mostly for 32-bit ARM, and fairly harmless:
- On OMAP2 there were a few regressions in the ethernet drivers, one
of them leading to an external abort trap
- One Raspberry Pi version had a misconfigured LED
- Interrupts on Broadcom NSP were slightly misconfigured
- One i.MX6q board had issues with graphics mode setting
- On mmp3 there are some minor fixes that were submitted for v5.8
with a cc:stable tag, so I ended up picking them up here as well
- The Mediatek Video Codec needs to run at a higher frequency than
configured originally"
* tag 'armsoc-fixes-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: dts: mmp3: Drop usb-nop-xceiv from HSIC phy
ARM: dts: mmp3-dell-ariel: Fix the SPI devices
ARM: dts: mmp3: Use the MMP3 compatible string for /clocks
ARM: dts: bcm: HR2: Fix PPI interrupt types
ARM: dts: bcm2835-rpi-zero-w: Fix led polarity
ARM: dts/imx6q-bx50v3: Set display interface clock parents
soc: mediatek: cmdq: return send msg error code
arm64: dts: mt8173: fix vcodec-enc clock
ARM: dts: Fix wrong mdio clock for dm814x
ARM: dts: am437x: fix networking on boards with ksz9031 phy
ARM: dts: am57xx: fix networking on boards with ksz9031 phy
David S. Miller [Fri, 29 May 2020 22:59:08 +0000 (15:59 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:
====================
pull-request: bpf 2020-05-29
The following pull-request contains BPF updates for your *net* tree.
We've added 6 non-merge commits during the last 7 day(s) which contain
a total of 4 files changed, 55 insertions(+), 34 deletions(-).
The main changes are:
1) minor verifier fix for fmod_ret progs, from Alexei.
2) af_xdp overflow check, from Bjorn.
3) minor verifier fix for 32bit assignment, from John.
4) powerpc has non-overlapping addr space, from Petr.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 29 May 2020 20:59:54 +0000 (13:59 -0700)]
Merge tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Cache tiering and cap handling fixups, both marked for stable"
* tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client:
ceph: flush release queue when handling caps for unknown inode
libceph: ignore pool overlay and cache logic on redirects
Linus Torvalds [Fri, 29 May 2020 20:58:13 +0000 (13:58 -0700)]
Merge tag 'gfs2-v5.7-rc7.fixes' of git://git./linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"Fix the previous, flawed gfs2_find_jhead commit"
* tag 'gfs2-v5.7-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Even more gfs2_find_jhead fixes
Linus Torvalds [Fri, 29 May 2020 20:51:52 +0000 (13:51 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Ensure __cpu_up() returns an error if cpu_online() is false after
waiting for completion on cpu_running"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/kernel: Fix return value when cpu_online() fails in __cpu_up()
Linus Torvalds [Fri, 29 May 2020 20:50:31 +0000 (13:50 -0700)]
Merge branch 'parisc-5.7-2' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"Fix a kernel panic at boot time for some HP-PARISC machines"
* 'parisc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix kernel panic in mem_init()
Linus Torvalds [Fri, 29 May 2020 20:41:33 +0000 (13:41 -0700)]
Merge tag 'iommu-fixes-v5.7-rc7' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Two build fixes for issues introduced during the merge window
- A fix for a reference count leak in an error path of
iommu_group_alloc()
* tag 'iommu-fixes-v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu: Fix reference count leak in iommu_group_alloc.
x86: Hide the archdata.iommu field behind generic IOMMU_API
ia64: Hide the archdata.iommu field behind generic IOMMU_API
Linus Torvalds [Fri, 29 May 2020 20:39:26 +0000 (13:39 -0700)]
Merge tag 'block-5.7-2020-05-29' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Two small fixes:
- Revert a block change that mixed up the return values for non-mq
devices
- NVMe poll race fix"
* tag 'block-5.7-2020-05-29' of git://git.kernel.dk/linux-block:
Revert "block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT"
nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
Linus Torvalds [Fri, 29 May 2020 20:35:45 +0000 (13:35 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Nothing profound here, just a last set of long standing bug fixes:
- Incorrect error unwind in qib and pvrdma
- User triggerable NULL pointer crash in mlx5 with ODP prefetch
- syzkaller RCU race in uverbs
- Rare double free crash in ipoib"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/ipoib: Fix double free of skb in case of multicast traffic in CM mode
RDMA/core: Fix double destruction of uobject
RDMA/pvrdma: Fix missing pci disable in pvrdma_pci_probe()
RDMA/mlx5: Fix NULL pointer dereference in destroy_prefetch_work
IB/qib: Call kobject_put() when kobject_init_and_add() fails
John Fastabend [Fri, 29 May 2020 17:29:18 +0000 (10:29 -0700)]
bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones
Added a verifier test for assigning 32bit reg states to
64bit where 32bit reg holds a constant value of 0.
Without previous kernel verifier.c fix, the test in
this patch will fail.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159077335867.6014.2075350327073125374.stgit@john-Precision-5820-Tower
John Fastabend [Fri, 29 May 2020 17:28:59 +0000 (10:28 -0700)]
bpf, selftests: Verifier bounds tests need to be updated
After previous fix for zero extension test_verifier tests #65 and #66 now
fail. Before the fix we can see the alu32 mov op at insn 10
10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=invP(id=0,
smin_value=
4294967168,smax_value=
4294967423,
umin_value=
4294967168,umax_value=
4294967423,
var_off=(0x0; 0x1ffffffff),
s32_min_value=-
2147483648,s32_max_value=
2147483647,
u32_min_value=0,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
10: (bc) w1 = w1
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=invP(id=0,
smin_value=0,smax_value=
2147483647,
umin_value=0,umax_value=
4294967295,
var_off=(0x0; 0xffffffff),
s32_min_value=-
2147483648,s32_max_value=
2147483647,
u32_min_value=0,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
After the fix at insn 10 because we have 's32_min_value < 0' the following
step 11 now has 'smax_value=U32_MAX' where before we pulled the s32_max_value
bound into the smax_value as seen above in 11 with smax_value=
2147483647.
10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=inv(id=0,
smin_value=
4294967168,smax_value=
4294967423,
umin_value=
4294967168,umax_value=
4294967423,
var_off=(0x0; 0x1ffffffff),
s32_min_value=-
2147483648, s32_max_value=
2147483647,
u32_min_value=0,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
10: (bc) w1 = w1
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=inv(id=0,
smin_value=0,smax_value=
4294967295,
umin_value=0,umax_value=
4294967295,
var_off=(0x0; 0xffffffff),
s32_min_value=-
2147483648, s32_max_value=
2147483647,
u32_min_value=0, u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
The fall out of this is by the time we get to the failing instruction at
step 14 where previously we had the following:
14: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=inv(id=0,
smin_value=
72057594021150720,smax_value=
72057594029539328,
umin_value=
72057594021150720,umax_value=
72057594029539328,
var_off=(0xffffffff000000; 0xffffff),
s32_min_value=-
16777216,s32_max_value=-1,
u32_min_value=-
16777216,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
14: (0f) r0 += r1
We now have,
14: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=inv(id=0,
smin_value=0,smax_value=
72057594037927935,
umin_value=0,umax_value=
72057594037927935,
var_off=(0x0; 0xffffffffffffff),
s32_min_value=-
2147483648,s32_max_value=
2147483647,
u32_min_value=0,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
14: (0f) r0 += r1
In the original step 14 'smin_value=
72057594021150720' this trips the logic
in the verifier function check_reg_sane_offset(),
if (smin >= BPF_MAX_VAR_OFF || smin <= -BPF_MAX_VAR_OFF) {
verbose(env, "value %lld makes %s pointer be out of bounds\n",
smin, reg_type_str[type]);
return false;
}
Specifically, the 'smin <= -BPF_MAX_VAR_OFF' check. But with the fix
at step 14 we have bounds 'smin_value=0' so the above check is not tripped
because BPF_MAX_VAR_OFF=1<<29.
We have a smin_value=0 here because at step 10 the smaller smin_value=0 means
the subtractions at steps 11 and 12 bring the smin_value negative.
11: (17) r1 -=
2147483584
12: (17) r1 -=
2147483584
13: (77) r1 >>= 8
Then the shift clears the top bit and smin_value is set to 0. Note we still
have the smax_value in the fixed code so any reads will fail. An alternative
would be to have reg_sane_check() do both smin and smax value tests.
To fix the test we can omit the 'r1 >>=8' at line 13. This will change the
err string, but keeps the intention of the test as suggseted by the title,
"check after truncation of boundary-crossing range". If the verifier logic
changes a different value is likely to be thrown in the error or the error
will no longer be thrown forcing this test to be examined. With this change
we see the new state at step 13.
13: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
R1_w=invP(id=0,
smin_value=-
4294967168,smax_value=127,
umin_value=0,umax_value=
18446744073709551615,
s32_min_value=-
2147483648,s32_max_value=
2147483647,
u32_min_value=0,u32_max_value=-1)
R10=fp0 fp-8_w=mmmmmmmm
Giving the expected out of bounds error, "value -
4294967168 makes map_value
pointer be out of bounds" However, for unpriv case we see a different error
now because of the mixed signed bounds pointer arithmatic. This seems OK so
I've only added the unpriv_errstr for this. Another optino may have been to
do addition on r1 instead of subtraction but I favor the approach above
slightly.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159077333942.6014.14004320043595756079.stgit@john-Precision-5820-Tower
John Fastabend [Fri, 29 May 2020 17:28:40 +0000 (10:28 -0700)]
bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
With the latest trunk llvm (llvm 11), I hit a verifier issue for
test_prog subtest test_verif_scale1.
The following simplified example illustrate the issue:
w9 = 0 /* R9_w=inv0 */
r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */
r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */
......
w2 = w9 /* R2_w=inv0 */
r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
r6 += r2 /* R6_w=inv(id=0) */
r3 = r6 /* R3_w=inv(id=0) */
r3 += 14 /* R3_w=inv(id=0) */
if r3 > r8 goto end
r5 = *(u32 *)(r6 + 0) /* R6_w=inv(id=0) */
<== error here: R6 invalid mem access 'inv'
...
end:
In real test_verif_scale1 code, "w9 = 0" and "w2 = w9" are in
different basic blocks.
In the above, after "r6 += r2", r6 becomes a scalar, which eventually
caused the memory access error. The correct register state should be
a pkt pointer.
The inprecise register state starts at "w2 = w9".
The 32bit register w9 is 0, in __reg_assign_32_into_64(),
the 64bit reg->smax_value is assigned to be U32_MAX.
The 64bit reg->smin_value is 0 and the 64bit register
itself remains constant based on reg->var_off.
In adjust_ptr_min_max_vals(), the verifier checks for a known constant,
smin_val must be equal to smax_val. Since they are not equal,
the verifier decides r6 is a unknown scalar, which caused later failure.
The llvm10 does not have this issue as it generates different code:
w9 = 0 /* R9_w=inv0 */
r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */
r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */
......
r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
r6 += r9 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
r3 = r6 /* R3_w=pkt(id=0,off=0,r=0,imm=0) */
r3 += 14 /* R3_w=pkt(id=0,off=14,r=0,imm=0) */
if r3 > r8 goto end
...
To fix the above issue, we can include zero in the test condition for
assigning the s32_max_value and s32_min_value to their 64-bit equivalents
smax_value and smin_value.
Further, fix the condition to avoid doing zero extension bounds checks
when s32_min_value <= 0. This could allow for the case where bounds
32-bit bounds (-1,1) get incorrectly translated to (0,1) 64-bit bounds.
When in-fact the -1 min value needs to force U32_MAX bound.
Fixes:
3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159077331983.6014.5758956193749002737.stgit@john-Precision-5820-Tower
Linus Torvalds [Fri, 29 May 2020 20:34:01 +0000 (13:34 -0700)]
Merge tag 'mmc-v5.7-rc6' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Fix use-after-free issue for rpmb partition
MMC host:
- Fix quirk for broken CQE support"
* tag 'mmc-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: block: Fix use-after-free issue for rpmb
mmc: sdhci: Fix SDHCI_QUIRK_BROKEN_CQE
Linus Torvalds [Fri, 29 May 2020 20:31:01 +0000 (13:31 -0700)]
Merge tag 'sound-5.7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Only a few last-minute small fixes: the change in ALSA core hwdep is
about the undefined behavior of bit shift, which is almost harmless
but still worth to pick up quickly.
The rest are all device-specific fixes for HD-audio and USB-audio, and
safe to apply at the late stage"
* tag 'sound-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Add new codec supported for ALC287
ALSA: usb-audio: Quirks for Gigabyte TRX40 Aorus Master onboard audio
ALSA: usb-audio: mixer: volume quirk for ESS Technology Asus USB DAC
ALSA: hda/realtek - Add a model for Thinkpad T570 without DAC workaround
ALSA: hwdep: fix a left shifting 1 by 31 UB bug
Linus Torvalds [Fri, 29 May 2020 20:29:20 +0000 (13:29 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Two fixes for the new SM8150 and SM8250 Qualcomm clk drivers to fix a
randconfig build error and an incorrect parent mapping"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gcc: Fix parent for gpll0_out_even
clk: qcom: sm8250 gcc depends on QCOM_GDSC
Alexei Starovoitov [Fri, 29 May 2020 04:38:36 +0000 (21:38 -0700)]
bpf: Fix use-after-free in fmod_ret check
Fix the following issue:
[ 436.749342] BUG: KASAN: use-after-free in bpf_trampoline_put+0x39/0x2a0
[ 436.749995] Write of size 4 at addr
ffff8881ef38b8a0 by task kworker/3:5/2243
[ 436.750712]
[ 436.752677] Workqueue: events bpf_prog_free_deferred
[ 436.753183] Call Trace:
[ 436.756483] bpf_trampoline_put+0x39/0x2a0
[ 436.756904] bpf_prog_free_deferred+0x16d/0x3d0
[ 436.757377] process_one_work+0x94a/0x15b0
[ 436.761969]
[ 436.762130] Allocated by task 2529:
[ 436.763323] bpf_trampoline_lookup+0x136/0x540
[ 436.763776] bpf_check+0x2872/0xa0a8
[ 436.764144] bpf_prog_load+0xb6f/0x1350
[ 436.764539] __do_sys_bpf+0x16d7/0x3720
[ 436.765825]
[ 436.765988] Freed by task 2529:
[ 436.767084] kfree+0xc6/0x280
[ 436.767397] bpf_trampoline_put+0x1fd/0x2a0
[ 436.767826] bpf_check+0x6832/0xa0a8
[ 436.768197] bpf_prog_load+0xb6f/0x1350
[ 436.768594] __do_sys_bpf+0x16d7/0x3720
prog->aux->trampoline = tr should be set only when prog is valid.
Otherwise prog freeing will try to put trampoline via prog->aux->trampoline,
but it may not point to a valid trampoline.
Fixes:
6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200529043839.15824-2-alexei.starovoitov@gmail.com
Pablo Neira Ayuso [Sun, 19 Apr 2020 12:12:35 +0000 (14:12 +0200)]
net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
The drivers reports EINVAL to userspace through netlink on invalid meta
match. This is confusing since EINVAL is usually reserved for malformed
netlink messages. Replace it by more meaningful codes.
Fixes:
6d65bc64e232 ("net/mlx5e: Add mlx5e_flower_parse_meta support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Vlad Buslov [Mon, 25 May 2020 13:57:51 +0000 (16:57 +0300)]
net/mlx5e: Fix MLX5_TC_CT dependencies
Change MLX5_TC_CT config dependencies to include MLX5_ESWITCH instead of
MLX5_CORE_EN && NET_SWITCHDEV, which are already required by MLX5_ESWITCH.
Without this change mlx5 fails to compile if user disables MLX5_ESWITCH
without also manually disabling MLX5_TC_CT.
Fixes:
4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Tal Gilboa [Thu, 23 Apr 2020 10:23:06 +0000 (13:23 +0300)]
net/mlx5e: Properly set default values when disabling adaptive moderation
Add a call to mlx5e_reset_rx/tx_moderation() when enabling/disabling
adaptive moderation, in order to select the proper default values.
In order to do so, we separate the logic of selecting the moderation values
and setting moderion mode (CQE/EQE based).
Fixes:
0088cbbc4b66 ("net/mlx5e: Enable CQE based moderation on TX CQ")
Fixes:
9908aa292971 ("net/mlx5e: CQE based moderation")
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aya Levin [Mon, 13 Apr 2020 08:31:00 +0000 (11:31 +0300)]
net/mlx5e: Fix arch depending casting issue in FEC
Change type of active_fec to u32 to match the type expected by
mlx5e_get_fec_mode. Copy active_fec and configured_fec values to
unsigned long before preforming bitwise manipulations.
Take the same approach when configuring FEC over 50G link modes: copy
the policy into an unsigned long and only than preform bitwise
operations.
Fixes:
2132b71f78d2 ("net/mlx5e: Advertise globaly supported FEC modes")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Maor Dickman [Sun, 24 May 2020 06:45:44 +0000 (09:45 +0300)]
net/mlx5e: Remove warning "devices are not on same switch HW"
On tunnel decap rule insertion, the indirect mechanism will attempt to
offload the rule on all uplink representors which will trigger the
"devices are not on same switch HW, can't offload forwarding" message
for the uplink which isn't on the same switch HW as the VF representor.
The above flow is valid and shouldn't cause warning message,
fix by removing the warning and only report this flow using extack.
Fixes:
321348475d54 ("net/mlx5e: Fix allowed tc redirect merged eswitch offload cases")
Signed-off-by: Maor Dickman <maord@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Roi Dayan [Wed, 27 May 2020 18:46:09 +0000 (21:46 +0300)]
net/mlx5e: Fix stats update for matchall classifier
It's bytes, packets, lastused.
Fixes:
fcb64c0f5640 ("net/mlx5: E-Switch, add ingress rate support")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Wed, 20 May 2020 17:32:08 +0000 (17:32 +0000)]
net/mlx5: Fix crash upon suspend/resume
Currently a Linux system with the mlx5 NIC always crashes upon
hibernation - suspend/resume.
Add basic callbacks so the NIC could be suspended and resumed.
Fixes:
9603b61de1ee ("mlx5: Move pci device handling from mlx5_ib to mlx5_core")
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
David S. Miller [Fri, 29 May 2020 20:05:56 +0000 (13:05 -0700)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2020-05-29
1) Several fixes for ESP gro/gso in transport and beet mode when
IPv6 extension headers are present. From Xin Long.
2) Fix a wrong comment on XFRMA_OFFLOAD_DEV.
From Antony Antony.
3) Fix sk_destruct callback handling on ESP in TCP encapsulation.
From Sabrina Dubroca.
4) Fix a use after free in xfrm_output_gso when used with vxlan.
From Xin Long.
5) Fix secpath handling of VTI when used wiuth IPCOMP.
From Xin Long.
6) Fix an oops when deleting a x-netns xfrm interface.
From Nicolas Dichtel.
7) Fix a possible warning on policy updates. We had a case where it was
possible to add two policies with the same lookup keys.
From Xin Long.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 29 May 2020 19:32:46 +0000 (12:32 -0700)]
Merge tag 'drm-fixes-2020-05-29-1' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"A couple of amdgpu fixes and minor ingenic fixes:
amdgpu:
- display atomic test fix
- Fix soft hang in display vupdate code
ingenic:
- fix pointer cast
- fix crtc atomic check callback"
* tag 'drm-fixes-2020-05-29-1' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Fix potential integer wraparound resulting in a hang
drm/amd/display: drop cursor position check in atomic test
gpu/drm: Ingenic: Fix opaque pointer casted to wrong type
gpu/drm: ingenic: Fix bogus crtc_atomic_check callback
Andreas Gruenbacher [Tue, 26 May 2020 18:11:51 +0000 (20:11 +0200)]
gfs2: Even more gfs2_find_jhead fixes
Fix several issues in the previous gfs2_find_jhead fix:
* When updating @blocks_submitted, @block refers to the first block block not
submitted yet, not the last block submitted, so fix an off-by-one error.
* We want to ensure that @blocks_submitted is far enough ahead of @blocks_read
to guarantee that there is in-flight I/O. Otherwise, we'll eventually end up
waiting for pages that haven't been submitted, yet.
* It's much easier to compare the number of blocks added with the number of
blocks submitted to limit the maximum bio size.
* Even with bio chaining, we can keep adding blocks until we reach the maximum
bio size, as long as we stop at a page boundary. This simplifies the logic.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Helge Deller [Thu, 28 May 2020 20:29:25 +0000 (22:29 +0200)]
parisc: Fix kernel panic in mem_init()
The Debian kernel v5.6 triggers this kernel panic:
Kernel panic - not syncing: Bad Address (null pointer deref?)
Bad Address (null pointer deref?): Code=26 (Data memory access rights trap) at addr
0000000000000000
CPU: 0 PID: 0 Comm: swapper Not tainted 5.6.0-2-parisc64 #1 Debian 5.6.14-1
IAOQ[0]: mem_init+0xb0/0x150
IAOQ[1]: mem_init+0xb4/0x150
RP(r2): start_kernel+0x6c8/0x1190
Backtrace:
[<
0000000040101ab4>] start_kernel+0x6c8/0x1190
[<
0000000040108574>] start_parisc+0x158/0x1b8
on a HP-PARISC rp3440 machine with this memory layout:
Memory Ranges:
0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
1) Start 0x0000004040000000 End 0x00000040ffdfffff Size 3070 MB
Fix the crash by avoiding virt_to_page() and similar functions in
mem_init() until the memory zones have been fully set up.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v5.0+
Qiushi Wu [Wed, 27 May 2020 21:00:19 +0000 (16:00 -0500)]
iommu: Fix reference count leak in iommu_group_alloc.
kobject_init_and_add() takes reference even when it fails.
Thus, when kobject_init_and_add() returns an error,
kobject_put() must be called to properly clean up the kobject.
Fixes:
d72e31c93746 ("iommu: IOMMU Groups")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Link: https://lore.kernel.org/r/20200527210020.6522-1-wu000273@umn.edu
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Linus Walleij [Wed, 27 May 2020 14:07:58 +0000 (16:07 +0200)]
gpio: fix locking open drain IRQ lines
We provided the right semantics on open drain lines being
by definition output but incidentally the irq set up function
would only allow IRQs on lines that were "not output".
Fix the semantics to allow output open drain lines to be used
for IRQs.
Reported-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Tested-by: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Russell King <linux@armlinux.org.uk>
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/r/20200527140758.162280-1-linus.walleij@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Daniel Axtens [Fri, 29 May 2020 06:14:46 +0000 (16:14 +1000)]
powerpc/64s: Disable sanitisers for C syscall/interrupt entry/exit code
syzkaller is picking up a bunch of crashes that look like this:
Unrecoverable exception 380 at
c00000000037ed60 (msr=
8000000000001031)
Oops: Unrecoverable exception, sig: 6 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 874 Comm: syz-executor.0 Not tainted 5.7.0-rc7-syzkaller-00016-gb0c3ba31be3e #0
NIP:
c00000000037ed60 LR:
c00000000004bac8 CTR:
c000000000030990
REGS:
c0000000555a7230 TRAP: 0380 Not tainted (5.7.0-rc7-syzkaller-00016-gb0c3ba31be3e)
MSR:
8000000000001031 <SF,ME,IR,DR,LE> CR:
48222882 XER:
20000000
CFAR:
c00000000004bac4 IRQMASK: 0
GPR00:
c00000000004bb68 c0000000555a74c0 c0000000024b3500 0000000000000005
GPR04:
0000000000000000 0000000000000000 c00000000004bb88 c008000000910000
GPR08:
00000000000b0000 c00000000004bac8 0000000000016000 c000000002503500
GPR12:
c000000000030990 c000000003190000 00000000106a5898 00000000106a0000
GPR16:
00000000106a5890 c000000007a92000 c000000008180e00 c000000007a8f700
GPR20:
c000000007a904b0 0000000010110000 c00000000259d318 5deadbeef0000100
GPR24:
5deadbeef0000122 c000000078422700 c000000009ee88b8 c000000078422778
GPR28:
0000000000000001 800000000280b033 0000000000000000 c0000000555a75a0
NIP [
c00000000037ed60] __sanitizer_cov_trace_pc+0x40/0x50
LR [
c00000000004bac8] interrupt_exit_kernel_prepare+0x118/0x310
Call Trace:
[
c0000000555a74c0] [
c00000000004bb68] interrupt_exit_kernel_prepare+0x1b8/0x310 (unreliable)
[
c0000000555a7530] [
c00000000000f9a8] interrupt_return+0x118/0x1c0
--- interrupt: 900 at __sanitizer_cov_trace_pc+0x0/0x50
...<random previous call chain>...
This is caused by __sanitizer_cov_trace_pc() causing an SLB fault
after MSR[RI] has been cleared by __hard_EE_RI_disable(), which we
can not recover from.
Do not instrument the new syscall/interrupt entry/exit code with KCOV,
GCOV or UBSAN.
Reported-by: syzbot-ppc64 <ozlabsyz@au1.ibm.com>
Fixes:
68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Xin Long [Tue, 26 May 2020 09:41:46 +0000 (17:41 +0800)]
xfrm: fix a NULL-ptr deref in xfrm_local_error
This patch is to fix a crash:
[ ] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ ] general protection fault: 0000 [#1] SMP KASAN PTI
[ ] RIP: 0010:ipv6_local_error+0xac/0x7a0
[ ] Call Trace:
[ ] xfrm6_local_error+0x1eb/0x300
[ ] xfrm_local_error+0x95/0x130
[ ] __xfrm6_output+0x65f/0xb50
[ ] xfrm6_output+0x106/0x46f
[ ] udp_tunnel6_xmit_skb+0x618/0xbf0 [ip6_udp_tunnel]
[ ] vxlan_xmit_one+0xbc6/0x2c60 [vxlan]
[ ] vxlan_xmit+0x6a0/0x4276 [vxlan]
[ ] dev_hard_start_xmit+0x165/0x820
[ ] __dev_queue_xmit+0x1ff0/0x2b90
[ ] ip_finish_output2+0xd3e/0x1480
[ ] ip_do_fragment+0x182d/0x2210
[ ] ip_output+0x1d0/0x510
[ ] ip_send_skb+0x37/0xa0
[ ] raw_sendmsg+0x1b4c/0x2b80
[ ] sock_sendmsg+0xc0/0x110
This occurred when sending a v4 skb over vxlan6 over ipsec, in which case
skb->protocol == htons(ETH_P_IPV6) while skb->sk->sk_family == AF_INET in
xfrm_local_error(). Then it will go to xfrm6_local_error() where it tries
to get ipv6 info from a ipv4 sk.
This issue was actually fixed by Commit
628e341f319f ("xfrm: make local
error reporting more robust"), but brought back by Commit
844d48746e4b
("xfrm: choose protocol family by skb protocol").
So to fix it, we should call xfrm6_local_error() only when skb->protocol
is htons(ETH_P_IPV6) and skb->sk->sk_family is AF_INET6.
Fixes:
844d48746e4b ("xfrm: choose protocol family by skb protocol")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Ingo Molnar [Fri, 29 May 2020 09:36:49 +0000 (11:36 +0200)]
Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs into x86/urgent
Pick up FPU register dump fixes from Al Viro.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Airlie [Fri, 29 May 2020 02:11:07 +0000 (12:11 +1000)]
Merge tag 'drm-misc-fixes-2020-05-28' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Two ingenic fixes, one for a wrong cast, the other for a typo in a
comparison
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20200528110944.hanv4qgc6w7whnj3@gilmour.lan
Eric Dumazet [Thu, 28 May 2020 21:57:47 +0000 (14:57 -0700)]
net: be more gentle about silly gso requests coming from user
Recent change in virtio_net_hdr_to_skb() broke some packetdrill tests.
When --mss=XXX option is set, packetdrill always provide gso_type & gso_size
for its inbound packets, regardless of packet size.
if (packet->tcp && packet->mss) {
if (packet->ipv4)
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
gso.gso_size = packet->mss;
}
Since many other programs could do the same, relax virtio_net_hdr_to_skb()
to no longer return an error, but instead ignore gso settings.
This keeps Willem intent to make sure no malicious packet could
reach gso stack.
Note that TCP stack has a special logic in tcp_set_skb_tso_segs()
to clear gso_size for small packets.
Fixes:
6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 28 May 2020 20:04:25 +0000 (13:04 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"5 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
include/asm-generic/topology.h: guard cpumask_of_node() macro argument
fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()
mm,thp: stop leaking unreleased file pages
mm/z3fold: silence kmemleak false positives of slots
Jonas Falkevik [Wed, 27 May 2020 09:56:40 +0000 (11:56 +0200)]
sctp: check assoc before SCTP_ADDR_{MADE_PRIM, ADDED} event
Make sure SCTP_ADDR_{MADE_PRIM,ADDED} are sent only for associations
that have been established.
These events are described in rfc6458#section-6.1
SCTP_PEER_ADDR_CHANGE:
This tag indicates that an address that is
part of an existing association has experienced a change of
state (e.g., a failure or return to service of the reachability
of an endpoint via a specific transport address).
Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 28 May 2020 19:41:11 +0000 (12:41 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"Just a few random driver fixups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: synaptics - add a second working PNP_ID for Lenovo T470s
Input: applespi - replace zero-length array with flexible-array
Input: axp20x-pek - always register interrupt handlers
Input: lm8333 - update contact email
Input: synaptics-rmi4 - fix error return code in rmi_driver_probe()
Input: synaptics-rmi4 - really fix attn_data use-after-free
Input: i8042 - add ThinkPad S230u to i8042 reset list
Revert "Input: i8042 - add ThinkPad S230u to i8042 nomux list"
Input: dlink-dir685-touchkeys - fix a typo in driver name
Input: xpad - add custom init packet for Xbox One S controllers
Input: evdev - call input_flush_device() on release(), not flush()
Input: i8042 - add ThinkPad S230u to i8042 nomux list
Input: usbtouchscreen - add support for BonXeon TP
Input: cros_ec_keyb - use cros_ec_cmd_xfer_status helper
Input: mms114 - fix handling of mms345l
Input: elants_i2c - support palm detection
Jay Lang [Sun, 24 May 2020 16:27:39 +0000 (12:27 -0400)]
x86/ioperm: Prevent a memory leak when fork fails
In the copy_process() routine called by _do_fork(), failure to allocate
a PID (or further along in the function) will trigger an invocation to
exit_thread(). This is done to clean up from an earlier call to
copy_thread_tls(). Naturally, the child task is passed into exit_thread(),
however during the process, io_bitmap_exit() nullifies the parent's
io_bitmap rather than the child's.
As copy_thread_tls() has been called ahead of the failure, the reference
count on the calling thread's io_bitmap is incremented as we would expect.
However, io_bitmap_exit() doesn't accept any arguments, and thus assumes
it should trash the current thread's io_bitmap reference rather than the
child's. This is pretty sneaky in practice, because in all instances but
this one, exit_thread() is called with respect to the current task and
everything works out.
A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to
get a bitmap allocated, and force a clone3() syscall to fail by passing
in a zeroed clone_args structure. The kernel handles the erroneous struct
and the buggy code path is followed, and even though the parent's reference
to the io_bitmap is trashed, the child still holds a reference and thus
the structure will never be freed.
Fix this by tweaking io_bitmap_exit() and its subroutines to accept a
task_struct argument which to operate on.
Fixes:
ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped")
Signed-off-by: Jay Lang <jaytlang@mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable#@vger.kernel.org
Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
Linus Torvalds [Thu, 28 May 2020 19:32:56 +0000 (12:32 -0700)]
Merge tag 'csky-for-linus-5.7-rc8' of git://github.com/c-sky/csky-linux
Pull csky fixes from Guo Ren:
"Another four fixes for csky:
- fix req_syscall debug
- fix abiv2 syscall_trace
- fix preempt enable
- clean up regs usage in entry.S"
* tag 'csky-for-linus-5.7-rc8' of git://github.com/c-sky/csky-linux:
csky: Fixup CONFIG_DEBUG_RSEQ
csky: Coding convention in entry.S
csky: Fixup abiv2 syscall_trace break a4 & a5
csky: Fixup CONFIG_PREEMPT panic
Jens Axboe [Thu, 28 May 2020 19:19:29 +0000 (13:19 -0600)]
Revert "block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT"
This reverts commit
c58c1f83436b501d45d4050fd1296d71a9760bcb.
io_uring does do the right thing for this case, and we're still returning
-EAGAIN to userspace for the cases we don't support. Revert this change
to avoid doing endless spins of resubmits.
Cc: stable@vger.kernel.org # v5.6
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Thu, 28 May 2020 05:20:55 +0000 (22:20 -0700)]
include/asm-generic/topology.h: guard cpumask_of_node() macro argument
drivers/hwmon/amd_energy.c:195:15: error: invalid operands to binary expression ('void' and 'int')
(channel - data->nr_cpus));
~~~~~~~~~^~~~~~~~~~~~~~~~~
include/asm-generic/topology.h:51:42: note: expanded from macro 'cpumask_of_node'
#define cpumask_of_node(node) ((void)node, cpu_online_mask)
^~~~
include/linux/cpumask.h:618:72: note: expanded from macro 'cpumask_first_and'
#define cpumask_first_and(src1p, src2p) cpumask_next_and(-1, (src1p), (src2p))
^~~~~
Fixes:
f0b848ce6fe9 ("cpumask: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask")
Fixes:
8abee9566b7e ("hwmon: Add amd_energy driver to report energy counters")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/20200527134623.930247-1-arnd@arndb.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Thu, 28 May 2020 05:20:52 +0000 (22:20 -0700)]
fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
KMSAN reported uninitialized data being written to disk when dumping
core. As a result, several kilobytes of kmalloc memory may be written
to the core file and then read by a non-privileged user.
Reported-by: sam <sunhaoyl@outlook.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com
Link: https://github.com/google/kmsan/issues/76
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Thu, 28 May 2020 05:20:47 +0000 (22:20 -0700)]
mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()
Replace superfluous VM_BUG_ON() with comment about correct usage.
Technically reverts commit
1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to
page_mapcount()"), but context lines have changed.
Function isolate_migratepages_block() runs some checks out of lru_lock
when choose pages for migration. After checking PageLRU() it checks
extra page references by comparing page_count() and page_mapcount().
Between these two checks page could be removed from lru, freed and taken
by slab.
As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
Race window is tiny. For certain workload this happens around once a
year.
page:
ffffea0105ca9380 count:1 mapcount:0 mapping:
ffff88ff7712c180 index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw:
0500000000008100 dead000000000100 dead000000000200 ffff88ff7712c180
raw:
0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(PageSlab(page))
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:628!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G W 4.19.109-27 #1
Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019
RIP: 0010:isolate_migratepages_block+0x986/0x9b0
The code in isolate_migratepages_block() was added in commit
119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages") before
adding VM_BUG_ON into page_mapcount().
This race has been predicted in 2015 by Vlastimil Babka (see link
below).
[akpm@linux-foundation.org: comment tweaks, per Hugh]
Fixes:
1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/159032779896.957378.7852761411265662220.stgit@buzz
Link: https://lore.kernel.org/lkml/557710E1.6060103@suse.cz/
Link: https://lore.kernel.org/linux-mm/158937872515.474360.5066096871639561424.stgit@buzz/T/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Thu, 28 May 2020 05:20:43 +0000 (22:20 -0700)]
mm,thp: stop leaking unreleased file pages
When collapse_file() calls try_to_release_page(), it has already isolated
the page: so if releasing buffers happens to fail (as it sometimes does),
remember to putback_lru_page(): otherwise that page is left unreclaimable
and unfreeable, and the file extent uncollapsible.
Fixes:
99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org> [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Thu, 28 May 2020 05:20:40 +0000 (22:20 -0700)]
mm/z3fold: silence kmemleak false positives of slots
Kmemleak reported many leaks while under memory pressue in,
slots = alloc_slots(pool, gfp);
which is referenced by "zhdr" in init_z3fold_page(),
zhdr->slots = slots;
However, "zhdr" could be gone without freeing slots as the later will be
freed separately when the last "handle" off of "handles" array is freed.
It will be within "slots" which is always aligned.
unreferenced object 0xc000000fdadc1040 (size 104):
comm "oom04", pid 140476, jiffies
4295359280 (age 3454.970s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
z3fold_zpool_malloc+0x7b0/0xe10
alloc_slots at mm/z3fold.c:214
(inlined by) init_z3fold_page at mm/z3fold.c:412
(inlined by) z3fold_alloc at mm/z3fold.c:1161
(inlined by) z3fold_zpool_malloc at mm/z3fold.c:1735
zpool_malloc+0x34/0x50
zswap_frontswap_store+0x60c/0xda0
zswap_frontswap_store at mm/zswap.c:1093
__frontswap_store+0x128/0x330
swap_writepage+0x58/0x110
pageout+0x16c/0xa40
shrink_page_list+0x1ac8/0x25c0
shrink_inactive_list+0x270/0x730
shrink_lruvec+0x444/0xf30
shrink_node+0x2a4/0x9c0
do_try_to_free_pages+0x158/0x640
try_to_free_pages+0x1bc/0x5f0
__alloc_pages_slowpath.constprop.60+0x4dc/0x15a0
__alloc_pages_nodemask+0x520/0x650
alloc_pages_vma+0xc0/0x420
handle_mm_fault+0x1174/0x1bf0
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vitaly Wool <vitaly.wool@konsulko.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/r/20200522220052.2225-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Dahl [Tue, 26 May 2020 17:57:49 +0000 (19:57 +0200)]
x86/dma: Fix max PFN arithmetic overflow on 32 bit systems
The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is
4 294 967 296 or 0x100000000 which is no problem on 64 bit systems.
The patch does not change the later overall result of 0x100000 for
MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new
calculation yields the same result, but does not require 64 bit
arithmetic.
On 32 bit systems the old calculation suffers from an arithmetic
overflow in that intermediate term in braces: 4UL aka unsigned long int
is 4 byte wide and an arithmetic overflow happens (the 0x100000000 does
not fit in 4 bytes), the in braces result is truncated to zero, the
following right shift does not alter that, so MAX_DMA32_PFN evaluates to
0 on 32 bit systems.
That wrong value is a problem in a comparision against MAX_DMA32_PFN in
the init code for swiotlb in pci_swiotlb_detect_4gb() to decide if
swiotlb should be active. That comparison yields the opposite result,
when compiling on 32 bit systems.
This was not possible before
1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too")
when that MAX_DMA32_PFN was first made visible to x86_32 (and which
landed in v3.0).
In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on
x86-32.
However if one has set CONFIG_IOMMU_INTEL, since
c5a5dc4cbbf4 ("iommu/vt-d: Don't switch off swiotlb if bounce page is used")
there's a dependency on CONFIG_SWIOTLB, which was not necessarily
active before. That landed in v5.4, where we noticed it in the fli4l
Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64
bit kernel configs there (I could not find out why, so let's just say
historical reasons).
The effect is at boot time 64 MiB (default size) were allocated for
bounce buffers now, which is a noticeable amount of memory on small
systems like pcengines ALIX 2D3 with 256 MiB memory, which are still
frequently used as home routers.
We noticed this effect when migrating from kernel v4.19 (LTS) to v5.4
(LTS) in fli4l and got that kernel messages for example:
Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018
…
Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss , 78272K reserved, 0K cma-reserved, 0K highmem)
…
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB)
The initial analysis and the suggested fix was done by user 'sourcejedi'
at stackoverflow and explicitly marked as GPLv2 for inclusion in the
Linux kernel:
https://unix.stackexchange.com/a/520525/50007
The new calculation, which does not suffer from that overflow, is the
same as for arch/mips now as suggested by Robin Murphy.
The fix was tested by fli4l users on round about two dozen different
systems, including both 32 and 64 bit archs, bare metal and virtualized
machines.
[ bp: Massage commit message. ]
Fixes:
1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too")
Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Alexander Dahl <post@lespocky.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://unix.stackexchange.com/q/520065/50007
Link: https://web.nettworks.org/bugs/browse/FFL-2560
Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
Qiushi Wu [Thu, 28 May 2020 03:10:29 +0000 (22:10 -0500)]
bonding: Fix reference count leak in bond_sysfs_slave_add.
kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object. Previous
commit "
b8eb718348b8" fixed a similar problem.
Fixes:
07699f9a7c8d ("bonding: add sysfs /slave dir for bond slave devices.")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 May 2020 17:54:18 +0000 (10:54 -0700)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Uninitialized when used in __nf_conntrack_update(), from
Nathan Chancellor.
2) Comparison of unsigned expression in nf_confirm_cthelper().
3) Remove 'const' type qualifier with no effect.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Mladek [Wed, 27 May 2020 12:28:44 +0000 (14:28 +0200)]
powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
The commit
0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
to archs where they work") caused that bpf_probe_read{, str}() functions
were not longer available on architectures where the same logical address
might have different content in kernel and user memory mapping. These
architectures should use probe_read_{user,kernel}_str helpers.
For backward compatibility, the problematic functions are still available
on architectures where the user and kernel address spaces are not
overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
At the moment, these backward compatible functions are enabled only on x86_64,
arm, and arm64. Let's do it also on powerpc that has the non overlapping
address space as well.
Fixes:
0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/lkml/20200527122844.19524-1-pmladek@suse.com
Jens Axboe [Thu, 28 May 2020 14:48:12 +0000 (08:48 -0600)]
Merge branch 'nvme-5.7' of git://git.infradead.org/nvme into block-5.7
Pull NVMe poll fix from Christoph.
* 'nvme-5.7' of git://git.infradead.org/nvme:
nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
Nobuhiro Iwamatsu [Wed, 27 May 2020 23:34:57 +0000 (08:34 +0900)]
arm64/kernel: Fix return value when cpu_online() fails in __cpu_up()
If boot_secondary() was successful, and cpu_online() was an error in
__cpu_up(), -EIO was returned, but 0 is returned by commit
d22b115cbfbb7
("arm64/kernel: Simplify __cpu_up() by bailing out early").
Therefore, bringup_wait_for_ap() causes the primary core to wait for a
long time, which may cause boot failure.
This commit sets -EIO to return code under the same conditions.
Fixes:
d22b115cbfbb ("arm64/kernel: Simplify __cpu_up() by bailing out early")
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Tested-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp>
Acked-by: Will Deacon <will@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20200527233457.2531118-1-nobuhiro1.iwamatsu@toshiba.co.jp
[catalin.marinas@arm.com: return -EIO at the end of the function]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Dave Airlie [Thu, 28 May 2020 05:38:42 +0000 (15:38 +1000)]
Merge tag 'amd-drm-fixes-5.7-2020-05-27' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
amd-drm-fixes-5.7-2020-05-27:
amdgpu:
- Display atomic test fix
- Fix soft hang in display vupdate code
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200527222700.4378-1-alexander.deucher@amd.com
Guo Ren [Tue, 26 May 2020 06:34:50 +0000 (06:34 +0000)]
csky: Fixup CONFIG_DEBUG_RSEQ
Put the rseq_syscall check point at the prologue of the syscall
will break the a0 ... a7. This will casue system call bug when
DEBUG_RSEQ is enabled.
So move it to the epilogue of syscall, but before syscall_trace.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Guo Ren [Sun, 24 May 2020 12:14:11 +0000 (12:14 +0000)]
csky: Coding convention in entry.S
There is no fixup or feature in the patch, we only cleanup with:
- Remove unnecessary reg used (r11, r12), just use r9 & r10 &
syscallid regs as temp useage.
- Add _TIF_SYSCALL_WORK and _TIF_WORK_MASK to gather macros.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Guo Ren [Sun, 24 May 2020 10:44:38 +0000 (10:44 +0000)]
csky: Fixup abiv2 syscall_trace break a4 & a5
Current implementation could destory a4 & a5 when strace, so we need to get them
from pt_regs by SAVE_ALL.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Guo Ren [Sun, 24 May 2020 08:03:07 +0000 (08:03 +0000)]
csky: Fixup CONFIG_PREEMPT panic
log:
[ 0.
13373200] Calibrating delay loop...
[ 0.
14077600] ------------[ cut here ]------------
[ 0.
14116700] WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:3790 preempt_count_add+0xc8/0x11c
[ 0.
14348000] DEBUG_LOCKS_WARN_ON((preempt_count() < 0))Modules linked in:
[ 0.
14395100] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0 #7
[ 0.
14410800]
[ 0.
14427400] Call Trace:
[ 0.
14450700] [<
807cd226>] dump_stack+0x8a/0xe4
[ 0.
14473500] [<
80072792>] __warn+0x10e/0x15c
[ 0.
14495900] [<
80072852>] warn_slowpath_fmt+0x72/0xc0
[ 0.
14518600] [<
800a5240>] preempt_count_add+0xc8/0x11c
[ 0.
14544900] [<
807ef918>] _raw_spin_lock+0x28/0x68
[ 0.
14572600] [<
800e0eb8>] vprintk_emit+0x84/0x2d8
[ 0.
14599000] [<
800e113a>] vprintk_default+0x2e/0x44
[ 0.
14625100] [<
800e2042>] vprintk_func+0x12a/0x1d0
[ 0.
14651300] [<
800e1804>] printk+0x30/0x48
[ 0.
14677600] [<
80008052>] lockdep_init+0x12/0xb0
[ 0.
14703800] [<
80002080>] start_kernel+0x558/0x7f8
[ 0.
14730000] [<
800052bc>] csky_start+0x58/0x94
[ 0.
14756600] irq event stamp: 34
[ 0.
14775100] hardirqs last enabled at (33): [<
80067370>] ret_from_exception+0x2c/0x72
[ 0.
14793700] hardirqs last disabled at (34): [<
800e0eae>] vprintk_emit+0x7a/0x2d8
[ 0.
14812300] softirqs last enabled at (32): [<
800655b0>] __do_softirq+0x578/0x6d8
[ 0.
14830800] softirqs last disabled at (25): [<
8007b3b8>] irq_exit+0xec/0x128
The preempt_count of reg could be destroyed after csky_do_IRQ without reload
from memory.
After reference to other architectures (arm64, riscv), we move preempt entry
into ret_from_exception and disable irq at the beginning of
ret_from_exception instead of RESTORE_ALL.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Reported-by: Lu Baoquan <lu.baoquan@intellif.com>
Valentine Fatiev [Wed, 27 May 2020 13:47:05 +0000 (16:47 +0300)]
IB/ipoib: Fix double free of skb in case of multicast traffic in CM mode
When connected mode is set, and we have connected and datagram traffic in
parallel, ipoib might crash with double free of datagram skb.
The current mechanism assumes that the order in the completion queue is
the same as the order of sent packets for all QPs. Order is kept only for
specific QP, in case of mixed UD and CM traffic we have few QPs (one UD and
few CM's) in parallel.
The problem:
----------------------------------------------------------
Transmit queue:
-----------------
UD skb pointer kept in queue itself, CM skb kept in spearate queue and
uses transmit queue as a placeholder to count the number of total
transmitted packets.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL ud1 UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^
tail head
Completion queue (problematic scenario) - the order not the same as in
the transmit queue:
1 2 3 4 5 6 7 8 9
------------------------------------
ud1 CM1 UD2 ud3 cm2 cm3 ud4 cm4 ud5
------------------------------------
1. CM1 'wc' processing
- skb freed in cm separate ring.
- tx_tail of transmit queue increased although UD2 is not freed.
Now driver assumes UD2 index is already freed and it could be used for
new transmitted skb.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL NL UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^ ^
(Bad)tail head
(Bad - Could be used for new SKB)
In this case (due to heavy load) UD2 skb pointer could be replaced by new
transmitted packet UD_NEW, as the driver assumes its free. At this point
we will have to process two 'wc' with same index but we have only one
pointer to free.
During second attempt to free the same skb we will have NULL pointer
exception.
2. UD2 'wc' processing
- skb freed according the index we got from 'wc', but it was already
overwritten by mistake. So actually the skb that was released is the
skb of the new transmitted packet and not the original one.
3. UD_NEW 'wc' processing
- attempt to free already freed skb. NUll pointer exception.
The fix:
-----------------------------------------------------------------------
The fix is to stop using the UD ring as a placeholder for CM packets, the
cyclic ring variables tx_head and tx_tail will manage the UD tx_ring, a
new cyclic variables global_tx_head and global_tx_tail are introduced for
managing and counting the overall outstanding sent packets, then the send
queue will be stopped and waken based on these variables only.
Note that no locking is needed since global_tx_head is updated in the xmit
flow and global_tx_tail is updated in the NAPI flow only. A previous
attempt tried to use one variable to count the outstanding sent packets,
but it did not work since xmit and NAPI flows can run at the same time and
the counter will be updated wrongly. Thus, we use the same simple cyclic
head and tail scheme that we have today for the UD tx_ring.
Fixes:
2c104ea68350 ("IB/ipoib: Get rid of the tx_outstanding variable in all modes")
Link: https://lore.kernel.org/r/20200527134705.480068-1-leon@kernel.org
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Aric Cyr [Tue, 12 May 2020 15:48:48 +0000 (11:48 -0400)]
drm/amd/display: Fix potential integer wraparound resulting in a hang
[Why]
If VUPDATE_END is before VUPDATE_START the delay calculated can become
very large, causing a soft hang.
[How]
Take the absolute value of the difference between START and END.
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Reviewed-by: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Simon Ser [Sat, 23 May 2020 11:53:41 +0000 (11:53 +0000)]
drm/amd/display: drop cursor position check in atomic test
get_cursor_position already handles the case where the cursor has
negative off-screen coordinates by not setting
dc_cursor_position.enabled.
Signed-off-by: Simon Ser <contact@emersion.fr>
Fixes:
626bf90fe03f ("drm/amd/display: add basic atomic check for cursor plane")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Vladimir Oltean [Wed, 27 May 2020 18:08:05 +0000 (21:08 +0300)]
net: dsa: declare lockless TX feature for slave ports
Be there a platform with the following layout:
Regular NIC
|
+----> DSA master for switch port
|
+----> DSA master for another switch port
After changing DSA back to static lockdep class keys in commit
1a33e10e4a95 ("net: partially revert dynamic lockdep key changes"), this
kernel splat can be seen:
[ 13.361198] ============================================
[ 13.366524] WARNING: possible recursive locking detected
[ 13.371851] 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988 Not tainted
[ 13.377874] --------------------------------------------
[ 13.383201] swapper/0/0 is trying to acquire lock:
[ 13.388004]
ffff0000668ff298 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.397879]
[ 13.397879] but task is already holding lock:
[ 13.403727]
ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.413593]
[ 13.413593] other info that might help us debug this:
[ 13.420140] Possible unsafe locking scenario:
[ 13.420140]
[ 13.426075] CPU0
[ 13.428523] ----
[ 13.430969] lock(&dsa_slave_netdev_xmit_lock_key);
[ 13.435946] lock(&dsa_slave_netdev_xmit_lock_key);
[ 13.440924]
[ 13.440924] *** DEADLOCK ***
[ 13.440924]
[ 13.446860] May be due to missing lock nesting notation
[ 13.446860]
[ 13.453668] 6 locks held by swapper/0/0:
[ 13.457598] #0:
ffff800010003de0 ((&idev->mc_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x0/0x400
[ 13.466593] #1:
ffffd4d3fb478700 (rcu_read_lock){....}-{1:2}, at: mld_sendpack+0x0/0x560
[ 13.474803] #2:
ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: ip6_finish_output2+0x64/0xb10
[ 13.483886] #3:
ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[ 13.492793] #4:
ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.503094] #5:
ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[ 13.512000]
[ 13.512000] stack backtrace:
[ 13.516369] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988
[ 13.530421] Call trace:
[ 13.532871] dump_backtrace+0x0/0x1d8
[ 13.536539] show_stack+0x24/0x30
[ 13.539862] dump_stack+0xe8/0x150
[ 13.543271] __lock_acquire+0x1030/0x1678
[ 13.547290] lock_acquire+0xf8/0x458
[ 13.550873] _raw_spin_lock+0x44/0x58
[ 13.554543] __dev_queue_xmit+0x84c/0xbe0
[ 13.558562] dev_queue_xmit+0x24/0x30
[ 13.562232] dsa_slave_xmit+0xe0/0x128
[ 13.565988] dev_hard_start_xmit+0xf4/0x448
[ 13.570182] __dev_queue_xmit+0x808/0xbe0
[ 13.574200] dev_queue_xmit+0x24/0x30
[ 13.577869] neigh_resolve_output+0x15c/0x220
[ 13.582237] ip6_finish_output2+0x244/0xb10
[ 13.586430] __ip6_finish_output+0x1dc/0x298
[ 13.590709] ip6_output+0x84/0x358
[ 13.594116] mld_sendpack+0x2bc/0x560
[ 13.597786] mld_ifc_timer_expire+0x210/0x390
[ 13.602153] call_timer_fn+0xcc/0x400
[ 13.605822] run_timer_softirq+0x588/0x6e0
[ 13.609927] __do_softirq+0x118/0x590
[ 13.613597] irq_exit+0x13c/0x148
[ 13.616918] __handle_domain_irq+0x6c/0xc0
[ 13.621023] gic_handle_irq+0x6c/0x160
[ 13.624779] el1_irq+0xbc/0x180
[ 13.627927] cpuidle_enter_state+0xb4/0x4d0
[ 13.632120] cpuidle_enter+0x3c/0x50
[ 13.635703] call_cpuidle+0x44/0x78
[ 13.639199] do_idle+0x228/0x2c8
[ 13.642433] cpu_startup_entry+0x2c/0x48
[ 13.646363] rest_init+0x1ac/0x280
[ 13.649773] arch_call_rest_init+0x14/0x1c
[ 13.653878] start_kernel+0x490/0x4bc
Lockdep keys themselves were added in commit
ab92d68fc22f ("net: core:
add generic lockdep keys"), and it's very likely that this splat existed
since then, but I have no real way to check, since this stacked platform
wasn't supported by mainline back then.
>From Taehee's own words:
This patch was considered that all stackable devices have LLTX flag.
But the dsa doesn't have LLTX, so this splat happened.
After this patch, dsa shares the same lockdep class key.
On the nested dsa interface architecture, which you illustrated,
the same lockdep class key will be used in __dev_queue_xmit() because
dsa doesn't have LLTX.
So that lockdep detects deadlock because the same lockdep class key is
used recursively although actually the different locks are used.
There are some ways to fix this problem.
1. using NETIF_F_LLTX flag.
If possible, using the LLTX flag is a very clear way for it.
But I'm so sorry I don't know whether the dsa could have LLTX or not.
2. using dynamic lockdep again.
It means that each interface uses a separate lockdep class key.
So, lockdep will not detect recursive locking.
But this way has a problem that it could consume lockdep class key
too many.
Currently, lockdep can have 8192 lockdep class keys.
- you can see this number with the following command.
cat /proc/lockdep_stats
lock-classes: 1251 [max: 8192]
...
The [max: 8192] means that the maximum number of lockdep class keys.
If too many lockdep class keys are registered, lockdep stops to work.
So, using a dynamic(separated) lockdep class key should be considered
carefully.
In addition, updating lockdep class key routine might have to be existing.
(lockdep_register_key(), lockdep_set_class(), lockdep_unregister_key())
3. Using lockdep subclass.
A lockdep class key could have 8 subclasses.
The different subclass is considered different locks by lockdep
infrastructure.
But "lock-classes" is not counted by subclasses.
So, it could avoid stopping lockdep infrastructure by an overflow of
lockdep class keys.
This approach should also have an updating lockdep class key routine.
(lockdep_set_subclass())
4. Using nonvalidate lockdep class key.
The lockdep infrastructure supports nonvalidate lockdep class key type.
It means this lockdep is not validated by lockdep infrastructure.
So, the splat will not happen but lockdep couldn't detect real deadlock
case because lockdep really doesn't validate it.
I think this should be used for really special cases.
(lockdep_set_novalidate_class())
Further discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/
20200503052220.4536-2-xiyou.wangcong@gmail.com/
There appears to be no negative side-effect to declaring lockless TX for
the DSA virtual interfaces, which means they handle their own locking.
So that's what we do to make the splat go away.
Patch tested in a wide variety of cases: unicast, multicast, PTP, etc.
Fixes:
ab92d68fc22f ("net: core: add generic lockdep keys")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro [Tue, 26 May 2020 22:39:49 +0000 (18:39 -0400)]
copy_xstate_to_kernel(): don't leave parts of destination uninitialized
copy the corresponding pieces of init_fpstate into the gaps instead.
Cc: stable@kernel.org
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>