Alexei Starovoitov [Tue, 19 Jan 2021 20:42:13 +0000 (12:42 -0800)]
Merge branch 'bpf,x64: implement jump padding in jit'
Gary Lin says:
====================
This patch series implements jump padding to x64 jit to cover some
corner cases that used to consume more than 20 jit passes and caused
failure.
v4:
- Add the detailed comments about the possible padding bytes
- Add the second test case which triggers jmp_cond padding and imm32 nop
jmp padding.
- Add the new test case as another subprog
v3:
- Copy the instructions of prologue separately or the size calculation
of the first BPF instruction would include the prologue.
- Replace WARN_ONCE() with pr_err() and EFAULT
- Use MAX_PASSES in the for loop condition check
- Remove the "padded" flag from x64_jit_data. For the extra pass of
subprogs, padding is always enabled since it won't hurt the images
that converge without padding.
v2:
- Simplify the sample code in the commit description and provide the
jit code
- Check the expected padding bytes with WARN_ONCE
- Move the 'padded' flag to 'struct x64_jit_data'
- Remove the EXPECTED_FAIL flag from bpf_fill_maxinsns11() in test_bpf
- Add 2 verifier tests
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Gary Lin [Tue, 19 Jan 2021 10:25:01 +0000 (18:25 +0800)]
selftests/bpf: Add verifier tests for x64 jit jump padding
There are 3 tests added into verifier's jit tests to trigger x64
jit jump padding.
The first test can be represented as the following assembly code:
1: bpf_call bpf_get_prandom_u32
2: if r0 == 1 goto pc+128
3: if r0 == 2 goto pc+128
...
129: if r0 == 128 goto pc+128
130: goto pc+128
131: goto pc+127
...
256: goto pc+2
257: goto pc+1
258: r0 = 1
259: ret
We first store a random number to r0 and add the corresponding
conditional jumps (2~129) to make verifier believe that those jump
instructions from 130 to 257 are reachable. When the program is sent to
x64 jit, it starts to optimize out the NOP jumps backwards from 257.
Since there are 128 such jumps, the program easily reaches 15 passes and
triggers jump padding.
Here is the x64 jit code of the first test:
0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5: 66 90 xchg ax,ax
7: 55 push rbp
8: 48 89 e5 mov rbp,rsp
b: e8 4c 90 75 e3 call 0xffffffffe375905c
10: 48 83 f8 01 cmp rax,0x1
14: 0f 84 fe 04 00 00 je 0x518
1a: 48 83 f8 02 cmp rax,0x2
1e: 0f 84 f9 04 00 00 je 0x51d
...
f6: 48 83 f8 18 cmp rax,0x18
fa: 0f 84 8b 04 00 00 je 0x58b
100: 48 83 f8 19 cmp rax,0x19
104: 0f 84 86 04 00 00 je 0x590
10a: 48 83 f8 1a cmp rax,0x1a
10e: 0f 84 81 04 00 00 je 0x595
...
500: 0f 84 83 01 00 00 je 0x689
506: 48 81 f8 80 00 00 00 cmp rax,0x80
50d: 0f 84 76 01 00 00 je 0x689
513: e9 71 01 00 00 jmp 0x689
518: e9 6c 01 00 00 jmp 0x689
...
5fe: e9 86 00 00 00 jmp 0x689
603: e9 81 00 00 00 jmp 0x689
608: 0f 1f 00 nop DWORD PTR [rax]
60b: eb 7c jmp 0x689
60d: eb 7a jmp 0x689
...
683: eb 04 jmp 0x689
685: eb 02 jmp 0x689
687: 66 90 xchg ax,ax
689: b8 01 00 00 00 mov eax,0x1
68e: c9 leave
68f: c3 ret
As expected, a 3 bytes NOPs is inserted at 608 due to the transition
from imm32 jmp to imm8 jmp. A 2 bytes NOPs is also inserted at 687 to
replace a NOP jump.
The second test case is tricky. Here is the assembly code:
1: bpf_call bpf_get_prandom_u32
2: if r0 == 1 goto pc+2048
3: if r0 == 2 goto pc+2048
...
2049: if r0 == 2048 goto pc+2048
2050: goto pc+2048
2051: goto pc+16
2052: goto pc+15
...
2064: goto pc+3
2065: goto pc+2
2066: goto pc+1
...
[repeat "goto pc+16".."goto pc+1" 127 times]
...
4099: r0 = 2
4100: ret
There are 4 major parts of the program.
1) 1~2049: Those are instructions to make 2050~4098 reachable. Some of
them also could generate the padding for jmp_cond.
2) 2050: This is the target instruction for the imm32 nop jmp padding.
3) 2051~4098: The repeated "goto 1~16" instructions are designed to be
consumed by the nop jmp optimization. In the end, those
instrucitons become 128 continuous 0 offset jmp and are
optimized out in 1 pass, and this make insn 2050 an imm32
nop jmp in the next pass, so that we can trigger the
5 bytes padding.
4) 4099~4100: Those are the instructions to end the program.
The x64 jit code is like this:
0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5: 66 90 xchg ax,ax
7: 55 push rbp
8: 48 89 e5 mov rbp,rsp
b: e8 bc 7b d5 d3 call 0xffffffffd3d57bcc
10: 48 83 f8 01 cmp rax,0x1
14: 0f 84 7e 66 00 00 je 0x6698
1a: 48 83 f8 02 cmp rax,0x2
1e: 0f 84 74 66 00 00 je 0x6698
24: 48 83 f8 03 cmp rax,0x3
28: 0f 84 6a 66 00 00 je 0x6698
2e: 48 83 f8 04 cmp rax,0x4
32: 0f 84 60 66 00 00 je 0x6698
38: 48 83 f8 05 cmp rax,0x5
3c: 0f 84 56 66 00 00 je 0x6698
42: 48 83 f8 06 cmp rax,0x6
46: 0f 84 4c 66 00 00 je 0x6698
...
666c: 48 81 f8 fe 07 00 00 cmp rax,0x7fe
6673: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
6677: 74 1f je 0x6698
6679: 48 81 f8 ff 07 00 00 cmp rax,0x7ff
6680: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
6684: 74 12 je 0x6698
6686: 48 81 f8 00 08 00 00 cmp rax,0x800
668d: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
6691: 74 05 je 0x6698
6693: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
6698: b8 02 00 00 00 mov eax,0x2
669d: c9 leave
669e: c3 ret
Since insn 2051~4098 are optimized out right before the padding pass,
there are several conditional jumps from the first part are replaced with
imm8 jmp_cond, and this triggers the 4 bytes padding, for example at
6673, 6680, and 668d. On the other hand, Insn 2050 is replaced with the
5 bytes nops at 6693.
The third test is to invoke the first and second tests as subprogs to test
bpf2bpf. Per the system log, there was one more jit happened with only
one pass and the same jit code was produced.
v4:
- Add the second test case which triggers jmp_cond padding and imm32 nop
jmp padding.
- Add the new test case as another subprog
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-4-glin@suse.com
Gary Lin [Tue, 19 Jan 2021 10:25:00 +0000 (18:25 +0800)]
test_bpf: Remove EXPECTED_FAIL flag from bpf_fill_maxinsns11
With NOPs padding, x64 jit now can handle the jump cases like
bpf_fill_maxinsns11().
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-3-glin@suse.com
Gary Lin [Tue, 19 Jan 2021 10:24:59 +0000 (18:24 +0800)]
bpf,x64: Pad NOPs to make images converge more easily
The x64 bpf jit expects bpf images converge within the given passes, but
it could fail to do so with some corner cases. For example:
l0: ja 40
l1: ja 40
[... repeated ja 40 ]
l39: ja 40
l40: ret #0
This bpf program contains 40 "ja 40" instructions which are effectively
NOPs and designed to be replaced with valid code dynamically. Ideally,
bpf jit should optimize those "ja 40" instructions out when translating
the bpf instructions into x64 machine code. However, do_jit() can only
remove one "ja 40" for offset==0 on each pass, so it requires at least
40 runs to eliminate those JMPs and exceeds the current limit of
passes(20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
is set even though it's legit as a classic socket filter.
To make bpf images more likely converge within 20 passes, this commit
pads some instructions with NOPs in the last 5 passes:
1. conditional jumps
A possible size variance comes from the adoption of imm8 JMP. If the
offset is imm8, we calculate the size difference of this BPF instruction
between the previous and the current pass and fill the gap with NOPs.
To avoid the recalculation of jump offset, those NOPs are inserted before
the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
calculating the NOP number.
2. BPF_JA
There are two conditions for BPF_JA.
a.) nop jumps
If this instruction is not optimized out in the previous pass,
instead of removing it, we insert the equivalent size of NOPs.
b.) label jumps
Similar to condition jumps, we prepend NOPs right before the JMP
code.
To make the code concise, emit_nops() is modified to use the signed len and
return the number of inserted NOPs.
For bpf-to-bpf, we always enable padding for the extra pass since there
is only one extra run and the jump padding doesn't affected the images
that converge without padding.
After applying this patch, the corner case was loaded with the following
jit code:
flen=45 proglen=77 pass=17 image=
ffffffffc03367d4 from=jump pid=10097
JIT code:
00000000: 0f 1f 44 00 00 55 48 89 e5 53 41 55 31 c0 45 31
JIT code:
00000010: ed 48 89 fb eb 30 eb 2e eb 2c eb 2a eb 28 eb 26
JIT code:
00000020: eb 24 eb 22 eb 20 eb 1e eb 1c eb 1a eb 18 eb 16
JIT code:
00000030: eb 14 eb 12 eb 10 eb 0e eb 0c eb 0a eb 08 eb 06
JIT code:
00000040: eb 04 eb 02 66 90 31 c0 41 5d 5b c9 c3
0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5: 55 push rbp
6: 48 89 e5 mov rbp,rsp
9: 53 push rbx
a: 41 55 push r13
c: 31 c0 xor eax,eax
e: 45 31 ed xor r13d,r13d
11: 48 89 fb mov rbx,rdi
14: eb 30 jmp 0x46
16: eb 2e jmp 0x46
...
3e: eb 06 jmp 0x46
40: eb 04 jmp 0x46
42: eb 02 jmp 0x46
44: 66 90 xchg ax,ax
46: 31 c0 xor eax,eax
48: 41 5d pop r13
4a: 5b pop rbx
4b: c9 leave
4c: c3 ret
At the 16th pass, 15 jumps were already optimized out, and one jump was
replaced with NOPs at 44 and the image converged at the 17th pass.
v4:
- Add the detailed comments about the possible padding bytes
v3:
- Copy the instructions of prologue separately or the size calculation
of the first BPF instruction would include the prologue.
- Replace WARN_ONCE() with pr_err() and EFAULT
- Use MAX_PASSES in the for loop condition check
- Remove the "padded" flag from x64_jit_data. For the extra pass of
subprogs, padding is always enabled since it won't hurt the images
that converge without padding.
v2:
- Simplify the sample code in the description and provide the jit code
- Check the expected padding bytes with WARN_ONCE
- Move the 'padded' flag to 'struct x64_jit_data'
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-2-glin@suse.com
Lukas Bulwahn [Mon, 18 Jan 2021 08:00:04 +0000 (09:00 +0100)]
docs, bpf: Add minimal markup to address doc warning
Commit
91c960b00566 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") modified the BPF documentation, but missed some ReST
markup.
Hence, make htmldocs warns on Documentation/networking/filter.rst:1053:
WARNING: Inline emphasis start-string without end-string.
Add some minimal markup to address this warning.
Fixes: 91c960b00566 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20210118080004.6367-1-lukas.bulwahn@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Björn Töpel [Mon, 18 Jan 2021 09:17:53 +0000 (10:17 +0100)]
samples/bpf: Add BPF_ATOMIC_OP macro for BPF samples
Brendan Jackman added extend atomic operations to the BPF instruction
set in commit
7064a7341a0d ("Merge branch 'Atomics for eBPF'"), which
introduces the BPF_ATOMIC_OP macro. However, that macro was missing
for the BPF samples. Fix that by adding it into bpf_insn.h.
Fixes: 91c960b00566 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20210118091753.107572-1-bjorn.topel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Bianconi [Tue, 12 Jan 2021 18:26:13 +0000 (19:26 +0100)]
net, xdp: Introduce xdp_build_skb_from_frame utility routine
Introduce xdp_build_skb_from_frame utility routine to build the skb
from xdp_frame. Respect to __xdp_build_skb_from_frame,
xdp_build_skb_from_frame will allocate the skb object. Rely on
xdp_build_skb_from_frame in veth driver.
Introduce missing xdp metadata support in veth_xdp_rcv_one routine.
Add missing metadata support in veth_xdp_rcv_one().
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Bianconi [Tue, 12 Jan 2021 18:26:12 +0000 (19:26 +0100)]
net, xdp: Introduce __xdp_build_skb_from_frame utility routine
Introduce __xdp_build_skb_from_frame utility routine to build
the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in
cpumap code.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Carlos Neira [Thu, 14 Jan 2021 14:10:36 +0000 (11:10 -0300)]
bpf, selftests: Fold test_current_pid_tgid_new_ns into test_progs.
Currently tests for bpf_get_ns_current_pid_tgid() are outside test_progs.
This change folds test cases into test_progs.
Changes from v11:
- Fixed test failure is not detected.
- Removed EXIT(3) call as it will stop test_progs execution.
Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210114141033.GA17348@localhost
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Wed, 20 Jan 2021 20:16:11 +0000 (12:16 -0800)]
Merge git://git./linux/kernel/git/netdev/net
Conflicts:
drivers/net/can/dev.c
commit
03f16c5075b2 ("can: dev: can_restart: fix use after free bug")
commit
3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit
8e4052c32d6b ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit
b7a9e0da2d1c ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 20 Jan 2021 19:52:21 +0000 (11:52 -0800)]
Merge tag 'net-5.11-rc5' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and
can trees.
Current release - regressions:
- nfc: nci: fix the wrong NCI_CORE_INIT parameters
Current release - new code bugs:
- bpf: allow empty module BTFs
Previous releases - regressions:
- bpf: fix signed_{sub,add32}_overflows type handling
- tcp: do not mess with cloned skbs in tcp_add_backlog()
- bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach
- bpf: don't leak memory in bpf getsockopt when optlen == 0
- tcp: fix potential use-after-free due to double kfree()
- mac80211: fix encryption issues with WEP
- devlink: use right genl user_ptr when handling port param get/set
- ipv6: set multicast flag on the multicast route
- tcp: fix TCP_USER_TIMEOUT with zero window
Previous releases - always broken:
- bpf: local storage helpers should check nullness of owner ptr passed
- mac80211: fix incorrect strlen of .write in debugfs
- cls_flower: call nla_ok() before nla_next()
- skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too"
* tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
net: systemport: free dev before on error path
net: usb: cdc_ncm: don't spew notifications
net: mscc: ocelot: Fix multicast to the CPU port
tcp: Fix potential use-after-free due to double kfree()
bpf: Fix signed_{sub,add32}_overflows type handling
can: peak_usb: fix use after free bugs
can: vxcan: vxcan_xmit: fix use after free bug
can: dev: can_restart: fix use after free bug
tcp: fix TCP socket rehash stats mis-accounting
net: dsa: b53: fix an off by one in checking "vlan->vid"
tcp: do not mess with cloned skbs in tcp_add_backlog()
selftests: net: fib_tests: remove duplicate log test
net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
sh_eth: Fix power down vs. is_opened flag ordering
net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
netfilter: rpfilter: mask ecn bits before fib lookup
udp: mask TOS bits in udp_v4_early_demux()
xsk: Clear pool even for inactive queues
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
sh_eth: Make PHY access aware of Runtime PM to fix reboot crash
...
Linus Torvalds [Wed, 20 Jan 2021 19:46:38 +0000 (11:46 -0800)]
Merge tag 'for-linus-5.11-rc5-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"A fix for build failure showing up in some configurations"
* tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: fix 'nopvspin' build error
Tianjia Zhang [Tue, 19 Jan 2021 00:13:19 +0000 (00:13 +0000)]
X.509: Fix crash caused by NULL pointer
On the following call path, `sig->pkey_algo` is not assigned
in asymmetric_key_verify_signature(), which causes runtime
crash in public_key_verify_signature().
keyctl_pkey_verify
asymmetric_key_verify_signature
verify_signature
public_key_verify_signature
This patch simply check this situation and fixes the crash
caused by NULL pointer.
Fixes: 215525639631 ("X.509: support OSCCA SM2-with-SM3 certificate verification")
Reported-by: Tobias Markus <tobias@markus-regensburg.de>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: João Fonseca <jpedrofonseca@ua.pt>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Takashi Iwai [Wed, 20 Jan 2021 16:11:12 +0000 (16:11 +0000)]
cachefiles: Drop superfluous readpages aops NULL check
After the recent actions to convert readpages aops to readahead, the
NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
hit falsely. More badly, it's an ASSERT() call, and this panics.
Drop the superfluous NULL checks for fixing this regression.
[DH: Note that cachefiles never actually used readpages, so this check was
never actually necessary]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jakub Kicinski [Wed, 20 Jan 2021 17:16:01 +0000 (09:16 -0800)]
Merge tag 'linux-can-fixes-for-5.11-
20210120' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
linux-can-fixes-for-5.11-
20210120
All three patches are by Vincent Mailhol and fix a potential use after free bug
in the CAN device infrastructure, the vxcan driver, and the peak_usk driver. In
the TX-path the skb is used to read from after it was passed to the networking
stack with netif_rx_ni().
* tag 'linux-can-fixes-for-5.11-
20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: peak_usb: fix use after free bugs
can: vxcan: vxcan_xmit: fix use after free bug
can: dev: can_restart: fix use after free bug
====================
Link: https://lore.kernel.org/r/20210120125202.2187358-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pan Bian [Wed, 20 Jan 2021 04:44:23 +0000 (20:44 -0800)]
net: systemport: free dev before on error path
On the error path, it should goto the error handling label to free
allocated memory rather than directly return.
Fixes: 31bc72d97656 ("net: systemport: fetch and use clock resources")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210120044423.1704-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grant Grundler [Wed, 20 Jan 2021 01:12:08 +0000 (17:12 -0800)]
net: usb: cdc_ncm: don't spew notifications
RTL8156 sends notifications about every 32ms.
Only display/log notifications when something changes.
This issue has been reported by others:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/
1832472
https://lkml.org/lkml/2020/8/27/1083
...
[785962.779840] usb 1-1: new high-speed USB device number 5 using xhci_hcd
[785962.929944] usb 1-1: New USB device found, idVendor=0bda, idProduct=8156, bcdDevice=30.00
[785962.929949] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[785962.929952] usb 1-1: Product: USB 10/100/1G/2.5G LAN
[785962.929954] usb 1-1: Manufacturer: Realtek
[785962.929956] usb 1-1: SerialNumber:
000000001
[785962.991755] usbcore: registered new interface driver cdc_ether
[785963.017068] cdc_ncm 1-1:2.0: MAC-Address: 00:24:27:88:08:15
[785963.017072] cdc_ncm 1-1:2.0: setting rx_max = 16384
[785963.017169] cdc_ncm 1-1:2.0: setting tx_max = 16384
[785963.017682] cdc_ncm 1-1:2.0 usb0: register 'cdc_ncm' at usb-0000:00:14.0-1, CDC NCM, 00:24:27:88:08:15
[785963.019211] usbcore: registered new interface driver cdc_ncm
[785963.023856] usbcore: registered new interface driver cdc_wdm
[785963.025461] usbcore: registered new interface driver cdc_mbim
[785963.038824] cdc_ncm 1-1:2.0 enx002427880815: renamed from usb0
[785963.089586] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.121673] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.153682] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
...
This is about 2KB per second and will overwrite all contents of a 1MB
dmesg buffer in under 10 minutes rendering them useless for debugging
many kernel problems.
This is also an extra 180 MB/day in /var/logs (or 1GB per week) rendering
the majority of those logs useless too.
When the link is up (expected state), spew amount is >2x higher:
...
[786139.600992] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.632997] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
[786139.665097] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.697100] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
[786139.729094] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.761108] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
...
Chrome OS cannot support RTL8156 until this is fixed.
Signed-off-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Hayes Wang <hayeswang@realtek.com>
Link: https://lore.kernel.org/r/20210120011208.3768105-1-grundler@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alban Bedel [Tue, 19 Jan 2021 14:06:38 +0000 (15:06 +0100)]
net: mscc: ocelot: Fix multicast to the CPU port
Multicast entries in the MAC table use the high bits of the MAC
address to encode the ports that should get the packets. But this port
mask does not work for the CPU port, to receive these packets on the
CPU port the MAC_CPU_COPY flag must be set.
Because of this IPv6 was effectively not working because neighbor
solicitations were never received. This was not apparent before commit
9403c158 (net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb
entries) as the IPv6 entries were broken so all incoming IPv6
multicast was then treated as unknown and flooded on all ports.
To fix this problem rework the ocelot_mact_learn() to set the
MAC_CPU_COPY flag when a multicast entry that target the CPU port is
added. For this we have to read back the ports endcoded in the pseudo
MAC address by the caller. It is not a very nice design but that avoid
changing the callers and should make backporting easier.
Signed-off-by: Alban Bedel <alban.bedel@aerq.com>
Fixes: 9403c158b872 ("net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb entries")
Link: https://lore.kernel.org/r/20210119140638.203374-1-alban.bedel@aerq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Mon, 18 Jan 2021 05:59:20 +0000 (14:59 +0900)]
tcp: Fix potential use-after-free due to double kfree()
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets NULL to ireq_opt. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
socket.
The commit
01770a1661657 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one cores
create full sockets for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, resulting
in that both sock_put() and reqsk_put() call kfree() for the same memory:
sock_put
sk_free
__sk_free
sk_destruct
__sk_destruct
sk->sk_destruct/inet_sock_destruct
kfree(rcu_dereference_protected(inet->inet_opt, 1));
reqsk_put
reqsk_free
__reqsk_free
req->rsk_ops->destructor/tcp_v4_reqsk_destructor
kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
Calling kmalloc() between the double kfree() can lead to use-after-free, so
this patch fixes it by setting NULL to inet_opt before sock_put().
As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.
Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 20 Jan 2021 16:47:34 +0000 (08:47 -0800)]
Merge https://git./linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2021-01-20
1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu.
2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann.
3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix signed_{sub,add32}_overflows type handling
xsk: Clear pool even for inactive queues
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
====================
Link: https://lore.kernel.org/r/20210120163439.8160-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Tue, 19 Jan 2021 23:24:24 +0000 (00:24 +0100)]
bpf: Fix signed_{sub,add32}_overflows type handling
Fix incorrect signed_{sub,add32}_overflows() input types (and a related buggy
comment). It looks like this might have slipped in via copy/paste issue, also
given prior to
3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
the signature of signed_sub_overflows() had s64 a and s64 b as its input args
whereas now they are truncated to s32. Thus restore proper types. Also, the case
of signed_add32_overflows() is not consistent to signed_sub32_overflows(). Both
have s32 as inputs, therefore align the former.
Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: De4dCr0w <sa516203@mail.ustc.edu.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Vincent Mailhol [Wed, 20 Jan 2021 11:41:37 +0000 (20:41 +0900)]
can: peak_usb: fix use after free bugs
After calling peak_usb_netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the can_frame cf which aliases skb memory is accessed
after the peak_usb_netif_rx_ni().
Reordering the lines solves the issue.
Fixes: 0a25e1f4f185 ("can: peak_usb: add support for PEAK new CANFD USB adapters")
Link: https://lore.kernel.org/r/20210120114137.200019-4-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Vincent Mailhol [Wed, 20 Jan 2021 11:41:36 +0000 (20:41 +0900)]
can: vxcan: vxcan_xmit: fix use after free bug
After calling netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the canfd_frame cfd which aliases skb memory is accessed
after the netif_rx_ni().
Fixes: a8f820a380a2 ("can: add Virtual CAN Tunnel driver (vxcan)")
Link: https://lore.kernel.org/r/20210120114137.200019-3-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Vincent Mailhol [Wed, 20 Jan 2021 11:41:35 +0000 (20:41 +0900)]
can: dev: can_restart: fix use after free bug
After calling netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the can_frame cf which aliases skb memory is accessed
after the netif_rx_ni() in:
stats->rx_bytes += cf->len;
Reordering the lines solves the issue.
Fixes: 39549eef3587 ("can: CAN Network device driver and Netlink interface")
Link: https://lore.kernel.org/r/20210120114137.200019-2-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Yuchung Cheng [Tue, 19 Jan 2021 19:26:19 +0000 (11:26 -0800)]
tcp: fix TCP socket rehash stats mis-accounting
The previous commit
32efcc06d2a1 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:
a. During handshake of an active open, only counts the first
SYN timeout
b. After handshake of passive and active open, stop updating
after (roughly) TCP_RETRIES1 recurring RTOs
c. After the socket aborts, over count timeout_rehash by 1
This patch fixes this by checking the rehash result from sk_rethink_txhash.
Fixes: 32efcc06d2a1 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dan Carpenter [Tue, 19 Jan 2021 14:48:03 +0000 (17:48 +0300)]
net: dsa: b53: fix an off by one in checking "vlan->vid"
The > comparison should be >= to prevent accessing one element beyond
the end of the dev->vlans[] array in the caller function, b53_vlan_add().
The "dev->vlans" array is allocated in the b53_switch_init() function
and it has "dev->num_vlans" elements.
Fixes: a2482d2ce349 ("net: dsa: b53: Plug in VLAN support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/YAbxI97Dl/pmBy5V@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jarod Wilson [Tue, 19 Jan 2021 01:09:27 +0000 (20:09 -0500)]
bonding: add a vlan+srcmac tx hashing option
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.
Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.
Unlike bond_eth_hash(), this hash function is using the full source MAC
address instead of just the last byte, as there are so few components to
the hash, and in the no-vlan case, we would be returning just the last
byte of the source MAC as the hash value. It's entirely possible to have
two NICs in a bond with the same last byte of their MAC, but not the same
MAC, so this adjustment should guarantee distinct hashes in all cases.
This has been rudimetarily tested to provide similar results to the
proprietary solution it is aiming to replace. A patch for iproute2 is also
posted, to properly support the new mode there as well.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Thomas Davis <tadavis@lbl.gov>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 19 Jan 2021 16:56:56 +0000 (17:56 +0100)]
net: fix GSO for SG-enabled devices
The commit
dbd50f238dec ("net: move the hsize check to the else
block in skb_segment") introduced a data corruption for devices
supporting scatter-gather.
The problem boils down to signed/unsigned comparison given
unexpected results: if signed 'hsize' is negative, it will be
considered greater than a positive 'len', which is unsigned.
This commit addresses resorting to the old checks order, so that
'hsize' never has a negative value when compared with 'len'.
v1 -> v2:
- reorder hsize checks instead of explicit cast (Alex)
Bisected-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Fixes: dbd50f238dec ("net: move the hsize check to the else block in skb_segment")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/861947c2d2d087db82af93c21920ce8147d15490.1611074818.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 19 Jan 2021 16:49:00 +0000 (08:49 -0800)]
tcp: do not mess with cloned skbs in tcp_add_backlog()
Heiner Kallweit reported that some skbs were sent with
the following invalid GSO properties :
- gso_size > 0
- gso_type == 0
This was triggerring a WARN_ON_ONCE() in rtl8169_tso_csum_v2.
Juerg Haefliger was able to reproduce a similar issue using
a lan78xx NIC and a workload mixing TCP incoming traffic
and forwarded packets.
The problem is that tcp_add_backlog() is writing
over gso_segs and gso_size even if the incoming packet will not
be coalesced to the backlog tail packet.
While skb_try_coalesce() would bail out if tail packet is cloned,
this overwriting would lead to corruptions of other packets
cooked by lan78xx, sharing a common super-packet.
The strategy used by lan78xx is to use a big skb, and split
it into all received packets using skb_clone() to avoid copies.
The drawback of this strategy is that all the small skb share a common
struct skb_shared_info.
This patch rewrites TCP gso_size/gso_segs handling to only
happen on the tail skb, since skb_try_coalesce() made sure
it was not cloned.
Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Juerg Haefliger <juergh@canonical.com>
Tested-by: Juerg Haefliger <juergh@canonical.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423
Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xu Wang [Tue, 19 Jan 2021 07:50:59 +0000 (07:50 +0000)]
octeontx2-af: Remove unneeded semicolons
fix semicolon.cocci warnings:
./drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c:272:2-3: Unneeded semicolon
drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c:1788:3-4: Unneeded semicolon
drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c:1809:3-4: Unneeded semicolon
drivers/net/ethernet/marvell/octeontx2/af/rvu.c:1326:2-3: Unneeded semicolon
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Link: https://lore.kernel.org/r/20210119075059.17493-1-vulab@iscas.ac.cn
Link: https://lore.kernel.org/r/20210119075507.17699-1-vulab@iscas.ac.cn
Link: https://lore.kernel.org/r/20210119080037.17931-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geert Uytterhoeven [Mon, 18 Jan 2021 15:08:57 +0000 (16:08 +0100)]
net: smsc911x: Make Runtime PM handling more fine-grained
Currently the smsc911x driver has mininal power management: during
driver probe, the device is powered up, and during driver remove, it is
powered down.
Improve power management by making it more fine-grained:
1. Power the device down when driver probe is finished,
2. Power the device (down) when it is opened (closed),
3. Make sure the device is powered during PHY access.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20210118150857.796943-1-geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Mon, 18 Jan 2021 11:19:02 +0000 (11:19 +0000)]
selftests: forwarding: Fix spelling mistake "succeded" -> "succeeded"
There are two spelling mistakes in check_fail messages. Fix them.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210118111902.71096-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Menglong Dong [Mon, 18 Jan 2021 11:15:39 +0000 (03:15 -0800)]
net: tun: fix misspellings using codespell tool
Some typos are found out by codespell tool:
$ codespell -w -i 3 ./drivers/net/tun.c
aovid ==> avoid
Fix typos found by codespell.
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210118111539.35886-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiapeng Zhong [Mon, 18 Jan 2021 08:31:02 +0000 (16:31 +0800)]
taprio: boolean values to a bool variable
Fix the following coccicheck warnings:
./net/sched/sch_taprio.c:393:3-16: WARNING: Assignment of 0/1 to bool
variable.
./net/sched/sch_taprio.c:375:2-15: WARNING: Assignment of 0/1 to bool
variable.
./net/sched/sch_taprio.c:244:4-19: WARNING: Assignment of 0/1 to bool
variable.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Link: https://lore.kernel.org/r/1610958662-71166-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 20 Jan 2021 01:32:36 +0000 (17:32 -0800)]
Merge branch 'net-ethernet-ti-am65-cpsw-nuss-introduce-support-for-am64x-cpsw3g'
Grygorii Strashko says:
====================
net: ethernet: ti: am65-cpsw-nuss: introduce support for am64x cpsw3g
This series introduces basic support for recently introduced TI K3 AM642x SoC [1]
which contains 3 port (2 external ports) CPSW3g module. The CPSW3g integrated
in MAIN domain and can be configured in multi port or switch modes.
In this series only multi port mode is enabled. The initial version of switchdev
support was introduced by Vignesh Raghavendra [2] and work is in progress.
The overall functionality and DT bindings are similar to other K3 CPSWxg
versions, so DT binding changes are minimal and driver is mostly re-used for
TI K3 AM642x CPSW3g.
The main difference is that TI K3 AM642x SoC is not fully DMA coherent and
instead DMA coherency is supported per DMA channel.
Patches 1-2 - DT bindings update
Patches 3-4 - Update driver to support changed DMA coherency model
Patches 5-6 - adds TI K3 AM642x SoC platform data and so enable CPSW3g
[1] https://www.ti.com/lit/pdf/spruim2
[2] https://patchwork.ozlabs.org/project/netdev/cover/
20201130082046.16292-1-vigneshr@ti.com/
====================
Link: https://lore.kernel.org/r/20210115192853.5469-1-grygorii.strashko@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vignesh Raghavendra [Fri, 15 Jan 2021 19:28:53 +0000 (21:28 +0200)]
net: ethernet: ti: am65-cpsw: add support for am64x cpsw3g
The TI AM64x SoCs Gigabit Ethernet Switch subsystem (CPSW3g NUSS) has three
ports (2 ext. ports) and provides Ethernet packet communication for the
device and can be configured in multi port mode or as an Ethernet switch.
This patch adds support for the corresponding CPSW3g version.
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vignesh Raghavendra [Fri, 15 Jan 2021 19:28:52 +0000 (21:28 +0200)]
net: ti: cpsw_ale: add driver data for AM64 CPSW3g
The AM642x CPSW3g is similar to j721e-cpswxg except its ALE table size is
512 entries. Add entry for the same.
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Peter Ujfalusi [Fri, 15 Jan 2021 19:28:51 +0000 (21:28 +0200)]
net: ethernet: ti: am65-cpsw-nuss: Support for transparent ASEL handling
Use the glue layer's functions to convert the dma_addr_t to and from CPPI5
address (with the ASEL bits), which should be used within the descriptors
and data buffers.
- Per channel coherency support
The DMAs use the 'ASEL' bits to select data and configuration fetch path.
The ASEL bits are placed at the unused parts of any address field used by
the DMAs (pointers to descriptors, addresses in descriptors, ring base
addresses). The ASEL is not part of the address (the DMAs can address
48bits). Individual channels can be configured to be coherent (via ACP
port) or non coherent individually by configuring the ASEL to appropriate
value.
[1] https://lore.kernel.org/patchwork/cover/
1350756/
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Co-developed-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Peter Ujfalusi [Fri, 15 Jan 2021 19:28:50 +0000 (21:28 +0200)]
net: ethernet: ti: am65-cpsw-nuss: Use DMA device for DMA API
For DMA API the DMA device should be used as cpsw does not accesses to
descriptors or data buffers in any ways. The DMA does.
Also, drop dma_coerce_mask_and_coherent() setting on CPSW device, as it
should be done by DMA driver which does data movement.
This is required for adding AM64x CPSW3g support where DMA coherency
supported per DMA channel.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Co-developed-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grygorii Strashko [Fri, 15 Jan 2021 19:28:49 +0000 (21:28 +0200)]
dt-binding: net: ti: k3-am654-cpsw-nuss: update bindings for am64x cpsw3g
Update DT binding for recently introduced TI K3 AM642x SoC [1] which
contains 3 port (2 external ports) CPSW3g module. The CPSW3g integrated
in MAIN domain and can be configured in multi port or switch modes.
The overall functionality and DT bindings are similar to other K3 CPSWxg
versions, so DT binding changes are minimal:
- reword description
- add new compatible 'ti,am642-cpsw-nuss'
- allow 2 external ports child nodes
- add missed 'assigned-clock' props
[1] https://www.ti.com/lit/pdf/spruim2
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grygorii Strashko [Fri, 15 Jan 2021 19:28:48 +0000 (21:28 +0200)]
dt-binding: ti: am65x-cpts: add assigned-clock and power-domains props
The CPTS clock is usually a clk-mux which allows to select CPTS reference
clock by using 'assigned-clock-parents', 'assigned-clocks' DT properties.
Also depending on integration the power-domains has to be specified to
enable CPTS IP.
Hence add 'assigned-clock-parents', 'assigned-clocks' and 'power-domains'
properties to the CPTS DT bindings to avoid dtbs_check warnings:
cpts@
310d0000: 'assigned-clock-parents', 'assigned-clocks' do not match any of the regexes: 'pinctrl-[0-9]+'
cpts@
310d0000: 'power-domains' does not match any of the regexes: 'pinctrl-[0-9]+'
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu [Tue, 19 Jan 2021 02:59:30 +0000 (10:59 +0800)]
selftests: net: fib_tests: remove duplicate log test
The previous test added an address with a specified metric and check if
correspond route was created. I somehow added two logs for the same
test. Remove the duplicated one.
Reported-by: Antoine Tenart <atenart@redhat.com>
Fixes: 0d29169a708b ("selftests/net/fib_tests: update addr_metric_test for peer route testing")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210119025930.2810532-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bongsu Jeon [Mon, 18 Jan 2021 20:55:22 +0000 (05:55 +0900)]
net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
Fix the code because NCI_CORE_INIT_CMD includes two parameters in NCI2.0
but there is no parameters in NCI1.x.
Fixes: bcd684aace34 ("net/nfc/nci: Support NCI 2.x initial sequence")
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20210118205522.317087-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geert Uytterhoeven [Mon, 18 Jan 2021 15:08:12 +0000 (16:08 +0100)]
sh_eth: Fix power down vs. is_opened flag ordering
sh_eth_close() does a synchronous power down of the device before
marking it closed. Revert the order, to make sure the device is never
marked opened while suspended.
While at it, use pm_runtime_put() instead of pm_runtime_put_sync(), as
there is no reason to do a synchronous power down.
Fixes: 7fa2955ff70ce453 ("sh_eth: Fix sleeping function called from invalid context")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://lore.kernel.org/r/20210118150812.796791-1-geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 15:15:38 +0000 (17:15 +0200)]
net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be
logically done when RXCSUM offload is off.
Fixes: 14136564c8ee ("net: Add TLS RX offload feature")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 19 Jan 2021 22:31:28 +0000 (14:31 -0800)]
Merge branch 'net-support-sctp-crc-csum-offload-for-tunneling-packets-in-some-drivers'
Xin Long says:
====================
net: support SCTP CRC csum offload for tunneling packets in some drivers
This patchset introduces inline function skb_csum_is_sctp(), and uses it
to validate it's a sctp CRC csum offload packet, to make SCTP CRC csum
offload for tunneling packets supported in some HW drivers.
====================
Link: https://lore.kernel.org/r/cover.1610777159.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:42 +0000 (14:13 +0800)]
net: ixgbevf: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes ixgbevf support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:41 +0000 (14:13 +0800)]
net: ixgbe: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes ixgbe support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:40 +0000 (14:13 +0800)]
net: igc: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes igc support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:39 +0000 (14:13 +0800)]
net: igbvf: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes igbvf support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:38 +0000 (14:13 +0800)]
net: igb: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP
CRC checksum offload packet, and there is no need to parse the
packet to check its proto field, especially when it's a UDP or
GRE encapped packet.
So this patch also makes igb support SCTP CRC checksum offload
for UDP and GRE encapped packets.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:37 +0000 (14:13 +0800)]
net: add inline function skb_csum_is_sctp
This patch is to define a inline function skb_csum_is_sctp(), and
also replace all places where it checks if it's a SCTP CSUM skb.
This function would be used later in many networking drivers in
the following patches.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 19 Jan 2021 21:54:32 +0000 (13:54 -0800)]
Merge branch 'ipv4-ensure-ecn-bits-don-t-influence-source-address-validation'
Guillaume Nault says:
====================
ipv4: Ensure ECN bits don't influence source address validation
Functions that end up calling fib_table_lookup() should clear the ECN
bits from the TOS, otherwise ECT(0) and ECT(1) packets can be treated
differently.
Most functions already clear the ECN bits, but there are a few cases
where this is not done. This series only fixes the ones related to
source address validation.
====================
Link: https://lore.kernel.org/r/cover.1610790904.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Sat, 16 Jan 2021 10:44:26 +0000 (11:44 +0100)]
netfilter: rpfilter: mask ecn bits before fib lookup
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.
Reproducer:
Create two netns, connected with a veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11/32 dev veth10
Add a route to ns1 in ns0:
$ ip -netns ns0 route add 192.0.2.11/32 dev veth01
In ns1, only packets with TOS 4 can be routed to ns0:
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10
Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
is 4:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 0% packet loss ...
Now use iptable's rpfilter module in ns1:
$ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP
Not-ECT and ECT(1) packets still pass:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
But ECT(0) and ECN packets are dropped:
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 100% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 100% packet loss ...
After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.
Fixes: 8f97339d3feb ("netfilter: add ipv4 reverse path filter match")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Sat, 16 Jan 2021 10:44:22 +0000 (11:44 +0100)]
udp: mask TOS bits in udp_v4_early_demux()
udp_v4_early_demux() is the only function that calls
ip_mc_validate_source() with a TOS that hasn't been masked with
IPTOS_RT_MASK.
This results in different behaviours for incoming multicast UDPv4
packets, depending on if ip_mc_validate_source() is called from the
early-demux path (udp_v4_early_demux) or from the regular input path
(ip_route_input_noref).
ECN would normally not be used with UDP multicast packets, so the
practical consequences should be limited on that side. However,
IPTOS_RT_MASK is used to also masks the TOS' high order bits, to align
with the non-early-demux path behaviour.
Reproducer:
Setup two netns, connected with veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10
In ns0, add route to multicast address 224.0.2.0/24 using source
address 198.51.100.10:
$ ip -netns ns0 address add 198.51.100.10/32 dev lo
$ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10
In ns1, define route to 198.51.100.10, only for packets with TOS 4:
$ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10
Also activate rp_filter in ns1, so that incoming packets not matching
the above route get dropped:
$ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1
Now try to receive packets on 224.0.2.11:
$ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -
In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
tos 6 for socat):
$ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test0" message is properly received by socat in ns1, because
early-demux has no cached dst to use, so source address validation
is done by ip_route_input_mc(), which receives a TOS that has the
ECN bits masked.
Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
$ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test1" message isn't received by socat in ns1, because, now,
early-demux has a cached dst to use and calls ip_mc_validate_source()
immediately, without masking the ECN bits.
Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Maxim Mikityanskiy [Mon, 18 Jan 2021 16:03:33 +0000 (18:03 +0200)]
xsk: Clear pool even for inactive queues
The number of queues can change by other means, rather than ethtool. For
example, attaching an mqprio qdisc with num_tc > 1 leads to creating
multiple sets of TX queues, which may be then destroyed when mqprio is
deleted. If an AF_XDP socket is created while mqprio is active,
dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may
decrease with deletion of mqprio, which will mean that the pool won't be
NULLed, and a further increase of the number of TX queues may expose a
dangling pointer.
To avoid any potential misbehavior, this commit clears pool for RX and
TX queues, regardless of real_num_*_queues, still taking into
consideration num_*_queues to avoid overflows.
Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
Fixes: a41b4f3c58dd ("xsk: simplify xdp_clear_umem_at_qid implementation")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com
Linus Torvalds [Tue, 19 Jan 2021 21:26:05 +0000 (13:26 -0800)]
Merge tag 'task_work-2021-01-19' of git://git.kernel.dk/linux-block
Pull task_work fix from Jens Axboe:
"The TIF_NOTIFY_SIGNAL change inadvertently removed the unconditional
task_work run we had in get_signal().
This caused a regression for some setups, since we're relying on eg
____fput() being run to close and release, for example, a pipe and
wake the other end.
For 5.11, I prefer the simple solution of just reinstating the
unconditional run, even if it conceptually doesn't make much sense -
if you need that kind of guarantee, you should be using TWA_SIGNAL
instead of TWA_NOTIFY. But it's the trivial fix for 5.11, and would
ensure that other potential gotchas/assumptions for task_work don't
regress for 5.11.
We're looking into further simplifying the task_work notifications for
5.12 which would resolve that too"
* tag 'task_work-2021-01-19' of git://git.kernel.dk/linux-block:
task_work: unconditionally run task_work from get_signal()
Mircea Cirjaliu [Tue, 19 Jan 2021 20:53:18 +0000 (21:53 +0100)]
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
I assume this was obtained by copy/paste. Point it to bpf_map_peek_elem()
instead of bpf_map_pop_elem(). In practice it may have been less likely
hit when under JIT given shielded via
84430d4232c3 ("bpf, verifier: avoid
retpoline for map push/pop/peek operation").
Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mauricio Vasquez <mauriciovasquezbernal@gmail.com>
Link: https://lore.kernel.org/bpf/AM7PR02MB6082663DFDCCE8DA7A6DD6B1BBA30@AM7PR02MB6082.eurprd02.prod.outlook.com
Linus Torvalds [Tue, 19 Jan 2021 21:01:50 +0000 (13:01 -0800)]
Merge tag 'nfsd-5.11-2' of git://git./linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Avoid exposing parent of root directory in NFSv3 READDIRPLUS results
- Fix a tracepoint change that went in the initial 5.11 merge
* tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Move the svc_xdr_recvfrom tracepoint again
nfsd4: readdirplus shouldn't return parent of export
Linus Torvalds [Tue, 19 Jan 2021 20:58:55 +0000 (12:58 -0800)]
Merge tag 'hyperv-fixes-signed-
20210119' of git://git./linux/kernel/git/hyperv/linux
Pull hyperv fix from Wei Liu:
"One patch from Dexuan to fix clockevent initialization"
* tag 'hyperv-fixes-signed-
20210119' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Initialize clockevents after LAPIC is initialized
Jakub Kicinski [Tue, 19 Jan 2021 20:02:22 +0000 (12:02 -0800)]
Merge branch 'sh_eth-fix-reboot-crash'
Geert Uytterhoeven says:
====================
sh_eth: Fix reboot crash
This patch fixes a regression v5.11-rc1, where rebooting while a sh_eth
device is not opened will cause a crash.
Changes compared to v1:
- Export mdiobb_{read,write}(),
- Call mdiobb_{read,write}() now they are exported,
- Use mii_bus.parent to avoid bb_info.dev copy,
- Drop RFC state.
Alternatively, mdio-bitbang could provide Runtime PM-aware wrappers
itself, and use them either manually (through a new parameter to
alloc_mdio_bitbang(), or a new alloc_mdio_bitbang_*() function), or
automatically (e.g. if pm_runtime_enabled() returns true). Note that
the latter requires a "struct device *" parameter to operate on.
Currently there are only two drivers that call alloc_mdio_bitbang() and
use Runtime PM: the Renesas sh_eth and ravb drivers. This series fixes
the former, while the latter is not affected (it keeps the device
powered all the time between driver probe and driver unbind, and
changing that seems to be non-trivial).
====================
Link: https://lore.kernel.org/r/20210118150656.796584-1-geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geert Uytterhoeven [Mon, 18 Jan 2021 15:06:56 +0000 (16:06 +0100)]
sh_eth: Make PHY access aware of Runtime PM to fix reboot crash
Wolfram reports that his R-Car H2-based Lager board can no longer be
rebooted in v5.11-rc1, as it crashes with an imprecise external abort.
The issue can be reproduced on other boards (e.g. Koelsch with R-Car
M2-W) too, if CONFIG_IP_PNP is disabled, and the Ethernet interface is
down at reboot time:
Unhandled fault: imprecise external abort (0x1406) at 0x00000000
pgd = (ptrval)
[
00000000] *pgd=
422b6835, *pte=
00000000, *ppte=
00000000
Internal error: : 1406 [#1] ARM
Modules linked in:
CPU: 0 PID: 1105 Comm: init Tainted: G W
5.10.0-rc1-00402-ge2f016cf7751 #1048
Hardware name: Generic R-Car Gen2 (Flattened Device Tree)
PC is at sh_mdio_ctrl+0x44/0x60
LR is at sh_mmd_ctrl+0x20/0x24
...
Backtrace:
[<
c0451f30>] (sh_mdio_ctrl) from [<
c0451fd4>] (sh_mmd_ctrl+0x20/0x24)
r7:
0000001f r6:
00000020 r5:
00000002 r4:
c22a1dc4
[<
c0451fb4>] (sh_mmd_ctrl) from [<
c044fc18>] (mdiobb_cmd+0x38/0xa8)
[<
c044fbe0>] (mdiobb_cmd) from [<
c044feb8>] (mdiobb_read+0x58/0xdc)
r9:
c229f844 r8:
c0c329dc r7:
c221e000 r6:
00000001 r5:
c22a1dc4 r4:
00000001
[<
c044fe60>] (mdiobb_read) from [<
c044c854>] (__mdiobus_read+0x74/0xe0)
r7:
0000001f r6:
00000001 r5:
c221e000 r4:
c221e000
[<
c044c7e0>] (__mdiobus_read) from [<
c044c9d8>] (mdiobus_read+0x40/0x54)
r7:
0000001f r6:
00000001 r5:
c221e000 r4:
c221e458
[<
c044c998>] (mdiobus_read) from [<
c044d678>] (phy_read+0x1c/0x20)
r7:
ffffe000 r6:
c221e470 r5:
00000200 r4:
c229f800
[<
c044d65c>] (phy_read) from [<
c044d94c>] (kszphy_config_intr+0x44/0x80)
[<
c044d908>] (kszphy_config_intr) from [<
c044694c>] (phy_disable_interrupts+0x44/0x50)
r5:
c229f800 r4:
c229f800
[<
c0446908>] (phy_disable_interrupts) from [<
c0449370>] (phy_shutdown+0x18/0x1c)
r5:
c229f800 r4:
c229f804
[<
c0449358>] (phy_shutdown) from [<
c040066c>] (device_shutdown+0x168/0x1f8)
[<
c0400504>] (device_shutdown) from [<
c013de44>] (kernel_restart_prepare+0x3c/0x48)
r9:
c22d2000 r8:
c0100264 r7:
c0b0d034 r6:
00000000 r5:
4321fedc r4:
00000000
[<
c013de08>] (kernel_restart_prepare) from [<
c013dee0>] (kernel_restart+0x1c/0x60)
[<
c013dec4>] (kernel_restart) from [<
c013e1d8>] (__do_sys_reboot+0x168/0x208)
r5:
4321fedc r4:
01234567
[<
c013e070>] (__do_sys_reboot) from [<
c013e2e8>] (sys_reboot+0x18/0x1c)
r7:
00000058 r6:
00000000 r5:
00000000 r4:
00000000
[<
c013e2d0>] (sys_reboot) from [<
c0100060>] (ret_fast_syscall+0x0/0x54)
As of commit
e2f016cf775129c0 ("net: phy: add a shutdown procedure"),
system reboot calls phy_disable_interrupts() during shutdown. As this
happens unconditionally, the PHY registers may be accessed while the
device is suspended, causing undefined behavior, which may crash the
system.
Fix this by wrapping the PHY bitbang accessors in the sh_eth driver by
wrappers that take care of Runtime PM, to resume the device when needed.
Reported-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geert Uytterhoeven [Mon, 18 Jan 2021 15:06:55 +0000 (16:06 +0100)]
mdio-bitbang: Export mdiobb_{read,write}()
Export mdiobb_read() and mdiobb_write(), so Ethernet controller drivers
can call them from their MDIO read/write wrappers.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Lobakin [Sat, 16 Jan 2021 16:13:22 +0000 (16:13 +0000)]
mdio, phy: fix -Wshadow warnings triggered by nested container_of()
container_of() macro hides a local variable '__mptr' inside. This
becomes a problem when several container_of() are nested in each
other within single line or plain macros.
As C preprocessor doesn't support generating random variable names,
the sole solution is to avoid defining macros that consist only of
container_of() calls, or they will self-shadow '__mptr' each time:
In file included from ./include/linux/bitmap.h:10,
from drivers/net/phy/phy_device.c:12:
drivers/net/phy/phy_device.c: In function ‘phy_device_release’:
./include/linux/kernel.h:693:8: warning: declaration of ‘__mptr’ shadows a previous local [-Wshadow]
693 | void *__mptr = (void *)(ptr); \
| ^~~~~~
./include/linux/phy.h:647:26: note: in expansion of macro ‘container_of’
647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
| ^~~~~~~~~~~~
./include/linux/mdio.h:52:27: note: in expansion of macro ‘container_of’
52 | #define to_mdio_device(d) container_of(d, struct mdio_device, dev)
| ^~~~~~~~~~~~
./include/linux/phy.h:647:39: note: in expansion of macro ‘to_mdio_device’
647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
| ^~~~~~~~~~~~~~
drivers/net/phy/phy_device.c:217:8: note: in expansion of macro ‘to_phy_device’
217 | kfree(to_phy_device(dev));
| ^~~~~~~~~~~~~
./include/linux/kernel.h:693:8: note: shadowed declaration is here
693 | void *__mptr = (void *)(ptr); \
| ^~~~~~
./include/linux/phy.h:647:26: note: in expansion of macro ‘container_of’
647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
| ^~~~~~~~~~~~
drivers/net/phy/phy_device.c:217:8: note: in expansion of macro ‘to_phy_device’
217 | kfree(to_phy_device(dev));
| ^~~~~~~~~~~~~
As they are declared in header files, these warnings are highly
repetitive and very annoying (along with the one from linux/pci.h).
Convert the related macros from linux/{mdio,phy}.h to static inlines
to avoid self-shadowing and potentially improve bug-catching.
No functional changes implied.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210116161246.67075-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksandr Mazur [Tue, 19 Jan 2021 08:53:33 +0000 (10:53 +0200)]
net: core: devlink: use right genl user_ptr when handling port param get/set
Fix incorrect user_ptr dereferencing when handling port param get/set:
idx [0] stores the 'struct devlink' pointer;
idx [1] stores the 'struct devlink_port' pointer;
Fixes: 637989b5d77e ("devlink: Always use user_ptr[0] for devlink and simplify post_doit")
CC: Parav Pandit <parav@mellanox.com>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yunjian Wang [Fri, 15 Jan 2021 04:46:20 +0000 (12:46 +0800)]
vhost_net: avoid tx queue stuck when sendmsg fails
Currently the driver doesn't drop a packet which can't be sent by tun
(e.g bad packet). In this case, the driver will always process the
same packet lead to the tx queue stuck.
To fix this issue:
1. in the case of persistent failure (e.g bad packet), the driver
can skip this descriptor by ignoring the error.
2. in the case of transient failure (e.g -ENOBUFS, -EAGAIN and -ENOMEM),
the driver schedules the worker to try again.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/1610685980-38608-1-git-send-email-wangyunjian@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tom Rix [Sun, 17 Jan 2021 19:10:44 +0000 (11:10 -0800)]
net: hns: fix variable used when DEBUG is defined
When DEBUG is defined this error occurs
drivers/net/ethernet/hisilicon/hns/hns_enet.c:1505:36: error:
‘struct net_device’ has no member named ‘ae_handle’;
did you mean ‘rx_handler’?
assert(skb->queue_mapping < ndev->ae_handle->q_num);
^~~~~~~~~
ae_handle is an element of struct hns_nic_priv, so change
ndev to priv.
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210117191044.533725-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tom Rix [Sun, 17 Jan 2021 18:15:19 +0000 (10:15 -0800)]
arcnet: fix macro name when DEBUG is defined
When DEBUG is defined this error occurs
drivers/net/arcnet/com20020_cs.c:70:15: error: ‘com20020_REG_W_ADDR_HI’
undeclared (first use in this function);
did you mean ‘COM20020_REG_W_ADDR_HI’?
ioaddr, com20020_REG_W_ADDR_HI);
^~~~~~~~~~~~~~~~~~~~~~
From reviewing the context, the suggestion is what is meant.
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/20210117181519.527625-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 19 Jan 2021 04:48:42 +0000 (20:48 -0800)]
Merge branch 'tls-device-offload-for-bond'
Tariq Toukan says:
====================
TLS device offload for Bond
This series opens TX and RX TLS device offload for bond interfaces.
This allows bond interfaces to benefit from capable lower devices.
We add a new ndo_sk_get_lower_dev() to be used to get the lower dev that
corresponds to a given socket.
The TLS module uses it to interact directly with the lowest device in
chain, and invoke the control operations in tlsdev_ops. This means that the
bond interface doesn't have his own struct tlsdev_ops instance and
derived logic/callbacks.
To keep simple track of the HW and SW TLS contexts, we bind each socket to
a specific lower device for the socket's whole lifetime. This is logically
valid (and similar to the SW kTLS behavior) in the following bond configuration,
so we restrict the offload support to it:
((mode == balance-xor) or (mode == 802.3ad))
and xmit_hash_policy == layer3+4.
In this design, TLS TX/RX offload feature flags of the bond device are
independent from the lower devices. They reflect the current features state,
but are not directly controllable.
This is because the bond driver is bypassed by the call to
ndo_sk_get_lower_dev(), without him knowing who the caller is.
The bond TLS feature flags are set/cleared only according to the configuration
of the mode and xmit_hash_policy.
Bypass is true only for the control flow. Packets in fast path still go through
the bond logic.
The design here differs from the xfrm/ipsec offload, where the bond driver
has his own copy of struct xfrmdev_ops and callbacks.
====================
Link: https://lore.kernel.org/r/20210117145949.8632-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:49 +0000 (16:59 +0200)]
net/tls: Except bond interface from some TLS checks
In the tls_dev_event handler, ignore tlsdev_ops requirement for bond
interfaces, they do not exist as the interaction is done directly with
the lower device.
Also, make the validate function pass when it's called with the upper
bond interface.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:48 +0000 (16:59 +0200)]
net/tls: Device offload to use lowest netdevice in chain
Do not call the tls_dev_ops of upper devices. Instead, ask them
for the proper lowest device and communicate with it directly.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:47 +0000 (16:59 +0200)]
net/bonding: Declare TLS RX device offload support
Following the description in previous patch (for TX):
As the bond interface is being bypassed by the TLS module, interacting
directly against the lower devs, there is no way for the bond interface
to disable its device offload capabilities, as long as the mode/policy
config allows it.
Hence, the feature flag is not directly controllable, but just reflects
the offload status based on the logic under bond_sk_check().
Here we just declare RX device offload support, and expose it via the
NETIF_F_HW_TLS_RX flag.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:46 +0000 (16:59 +0200)]
net/bonding: Implement TLS TX device offload
Implement TLS TX device offload for bonding interfaces.
This allows kTLS sockets running on a bond to benefit from the
device offload on capable lower devices.
To allow a simple and fast maintenance of the TLS context in SW and
lower devices, we bind the TLS socket to a specific lower dev.
To achieve a behavior similar to SW kTLS, we support only balance-xor
and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
in bond_sk_check(), done in a previous patch.
For the above configuration, the SW implementation keeps picking the
same exact lower dev for all the socket's SKBs. The device offload
behaves similarly, making the decision once at the connection creation.
Per socket, the TLS module should work directly with the lowest netdev
in chain, to call the tls_dev_ops operations.
As the bond interface is being bypassed by the TLS module, interacting
directly against the lower devs, there is no way for the bond interface
to disable its device offload capabilities, as long as the mode/policy
config allows it.
Hence, the feature flag is not directly controllable, but just reflects
the current offload status based on the logic under bond_sk_check().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:45 +0000 (16:59 +0200)]
net/bonding: Take update_features call out of XFRM funciton
In preparation for more cases that call netdev_update_features().
While here, move the features logic to the stage where struct bond
is already updated, and pass it as the only parameter to function
bond_set_xfrm_features().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:44 +0000 (16:59 +0200)]
net/bonding: Implement ndo_sk_get_lower_dev
Add ndo_sk_get_lower_dev() implementation for bond interfaces.
Support only for the cases where the socket's and SKBs' hash
yields identical value for the whole connection lifetime.
Here we restrict it to L3+4 sockets only, with
xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:43 +0000 (16:59 +0200)]
net/bonding: Take IP hash logic into a helper
Hash logic on L3 will be used in a downstream patch for one more use
case.
Take it to a function for a better code reuse.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Sun, 17 Jan 2021 14:59:42 +0000 (16:59 +0200)]
net: netdevice: Add operation ndo_sk_get_lower_dev
ndo_sk_get_lower_dev returns the lower netdev that corresponds to
a given socket.
Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
the lowest one in chain.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Christophe JAILLET [Sun, 17 Jan 2021 08:15:42 +0000 (09:15 +0100)]
net/qla3xxx: switch from 'pci_' to 'dma_' API
The wrappers in include/linux/pci-dma-compat.h should go away.
The patch has been generated with the coccinelle script below and has been
hand modified to replace GFP_ with a correct flag.
It has been compile tested.
When memory is allocated in 'ql_alloc_net_req_rsp_queues()' GFP_KERNEL can
be used because it is only called from 'ql_alloc_mem_resources()' which
already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see below)
When memory is allocated in 'ql_alloc_buffer_queues()' GFP_KERNEL can be
used because this flag is already used just a few line above.
When memory is allocated in 'ql_alloc_small_buffers()' GFP_KERNEL can
be used because it is only called from 'ql_alloc_mem_resources()' which
already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see above)
When memory is allocated in 'ql_alloc_mem_resources()' GFP_KERNEL can be
used because this function already calls 'ql_alloc_buffer_queues()' which
uses GFP_KERNEL. (see above)
While at it, use 'dma_set_mask_and_coherent()' instead of 'dma_set_mask()/
dma_set_coherent_mask()' in order to slightly simplify code.
@@
@@
- PCI_DMA_BIDIRECTIONAL
+ DMA_BIDIRECTIONAL
@@
@@
- PCI_DMA_TODEVICE
+ DMA_TO_DEVICE
@@
@@
- PCI_DMA_FROMDEVICE
+ DMA_FROM_DEVICE
@@
@@
- PCI_DMA_NONE
+ DMA_NONE
@@
expression e1, e2, e3;
@@
- pci_alloc_consistent(e1, e2, e3)
+ dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
@@
expression e1, e2, e3;
@@
- pci_zalloc_consistent(e1, e2, e3)
+ dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
@@
expression e1, e2, e3, e4;
@@
- pci_free_consistent(e1, e2, e3, e4)
+ dma_free_coherent(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_map_single(e1, e2, e3, e4)
+ dma_map_single(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_single(e1, e2, e3, e4)
+ dma_unmap_single(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4, e5;
@@
- pci_map_page(e1, e2, e3, e4, e5)
+ dma_map_page(&e1->dev, e2, e3, e4, e5)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_page(e1, e2, e3, e4)
+ dma_unmap_page(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_map_sg(e1, e2, e3, e4)
+ dma_map_sg(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_sg(e1, e2, e3, e4)
+ dma_unmap_sg(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_single_for_cpu(e1, e2, e3, e4)
+ dma_sync_single_for_cpu(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_single_for_device(e1, e2, e3, e4)
+ dma_sync_single_for_device(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_sg_for_cpu(e1, e2, e3, e4)
+ dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_sg_for_device(e1, e2, e3, e4)
+ dma_sync_sg_for_device(&e1->dev, e2, e3, e4)
@@
expression e1, e2;
@@
- pci_dma_mapping_error(e1, e2)
+ dma_mapping_error(&e1->dev, e2)
@@
expression e1, e2;
@@
- pci_set_dma_mask(e1, e2)
+ dma_set_mask(&e1->dev, e2)
@@
expression e1, e2;
@@
- pci_set_consistent_dma_mask(e1, e2)
+ dma_set_coherent_mask(&e1->dev, e2)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20210117081542.560021-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Sun, 17 Jan 2021 00:56:57 +0000 (16:56 -0800)]
net_sched: fix RTNL deadlock again caused by request_module()
tcf_action_init_1() loads tc action modules automatically with
request_module() after parsing the tc action names, and it drops RTNL
lock and re-holds it before and after request_module(). This causes a
lot of troubles, as discovered by syzbot, because we can be in the
middle of batch initializations when we create an array of tc actions.
One of the problem is deadlock:
CPU 0 CPU 1
rtnl_lock();
for (...) {
tcf_action_init_1();
-> rtnl_unlock();
-> request_module();
rtnl_lock();
for (...) {
tcf_action_init_1();
-> tcf_idr_check_alloc();
// Insert one action into idr,
// but it is not committed until
// tcf_idr_insert_many(), then drop
// the RTNL lock in the _next_
// iteration
-> rtnl_unlock();
-> rtnl_lock();
-> a_o->init();
-> tcf_idr_check_alloc();
// Now waiting for the same index
// to be committed
-> request_module();
-> rtnl_lock()
// Now waiting for RTNL lock
}
rtnl_unlock();
}
rtnl_unlock();
This is not easy to solve, we can move the request_module() before
this loop and pre-load all the modules we need for this netlink
message and then do the rest initializations. So the loop breaks down
to two now:
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
struct tc_action_ops *a_o;
a_o = tc_action_load_ops(name, tb[i]...);
ops[i - 1] = a_o;
}
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
act = tcf_action_init_1(ops[i - 1]...);
}
Although this looks serious, it only has been reported by syzbot, so it
seems hard to trigger this by humans. And given the size of this patch,
I'd suggest to make it to net-next and not to backport to stable.
This patch has been tested by syzbot and tested with tdc.py by me.
Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together")
Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tom Rix [Fri, 15 Jan 2021 23:53:46 +0000 (15:53 -0800)]
net: phy: national: remove definition of DEBUG
Defining DEBUG should only be done in development.
So remove DEBUG.
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210115235346.289611-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enke Chen [Fri, 15 Jan 2021 22:30:58 +0000 (14:30 -0800)]
tcp: fix TCP_USER_TIMEOUT with zero window
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.
The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():
RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.
This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.
In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 19 Jan 2021 03:57:04 +0000 (19:57 -0800)]
Merge branch 'net-make-udp-tunnel-devices-support-fraglist'
Xin Long says:
====================
net: make udp tunnel devices support fraglist
Like GRE device, UDP tunnel devices should also support fraglist, so
that some protocol (like SCTP) HW GSO that requires NETIF_F_FRAGLIST
in the dev can work. Especially when the lower device support both
NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_SCTP.
====================
Link: https://lore.kernel.org/r/cover.1610704037.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Fri, 15 Jan 2021 09:47:47 +0000 (17:47 +0800)]
bareudp: add NETIF_F_FRAGLIST flag for dev features
Like vxlan and geneve, bareudp also needs this dev feature
to support some protocol's HW GSO.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Fri, 15 Jan 2021 09:47:46 +0000 (17:47 +0800)]
geneve: add NETIF_F_FRAGLIST flag for dev features
Some protocol HW GSO requires fraglist supported by the device, like
SCTP. Without NETIF_F_FRAGLIST set in the dev features of geneve, it
would have to do SW GSO before the packets enter the driver, even
when the geneve dev and lower dev (like veth) both have the feature
of NETIF_F_GSO_SCTP.
So this patch is to add it for geneve.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Fri, 15 Jan 2021 09:47:45 +0000 (17:47 +0800)]
vxlan: add NETIF_F_FRAGLIST flag for dev features
Some protocol HW GSO requires fraglist supported by the device, like
SCTP. Without NETIF_F_FRAGLIST set in the dev features of vxlan, it
would have to do SW GSO before the packets enter the driver, even
when the vxlan dev and lower dev (like veth) both have the feature
of NETIF_F_GSO_SCTP.
So this patch is to add it for vxlan.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 19 Jan 2021 03:53:05 +0000 (19:53 -0800)]
Merge branch 'ipv6-fixes-for-the-multicast-routes'
Matteo Croce says:
====================
ipv6: fixes for the multicast routes
Fix two wrong flags in the IPv6 multicast routes created
by the autoconf code.
====================
Link: https://lore.kernel.org/r/20210115184209.78611-1-mcroce@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matteo Croce [Fri, 15 Jan 2021 18:42:09 +0000 (19:42 +0100)]
ipv6: set multicast flag on the multicast route
The multicast route ff00::/8 is created with type RTN_UNICAST:
$ ip -6 -d route
unicast ::1 dev lo proto kernel scope global metric 256 pref medium
unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium
Set the type to RTN_MULTICAST which is more appropriate.
Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matteo Croce [Fri, 15 Jan 2021 18:42:08 +0000 (19:42 +0100)]
ipv6: create multicast route with RTPROT_KERNEL
The ff00::/8 multicast route is created without specifying the fc_protocol
field, so the default RTPROT_BOOT value is used:
$ ip -6 -d route
unicast ::1 dev lo proto kernel scope global metric 256 pref medium
unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium
As the documentation says, this value identifies routes installed during
boot, but the route is created when interface is set up.
Change the value to RTPROT_KERNEL which is a better value.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Parri (Microsoft) [Thu, 14 Jan 2021 20:26:28 +0000 (21:26 +0100)]
hv_netvsc: Add (more) validation for untrusted Hyper-V values
For additional robustness in the face of Hyper-V errors or malicious
behavior, validate all values that originate from packets that Hyper-V
has sent to the guest. Ensure that invalid values cannot cause indexing
off the end of an array, or subvert an existing validation via integer
overflow. Ensure that outgoing packets do not have any leftover guest
memory that has not been zeroed out.
Reported-by: Juan Vazquez <juvazq@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20210114202628.119541-1-parri.andrea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Menglong Dong [Sun, 17 Jan 2021 08:09:50 +0000 (16:09 +0800)]
net: bridge: check vlan with eth_type_vlan() method
Replace some checks for ETH_P_8021Q and ETH_P_8021AD with
eth_type_vlan().
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210117080950.122761-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Mon, 18 Jan 2021 22:23:57 +0000 (14:23 -0800)]
Merge tag 'mac80211-for-net-2021-01-18.2' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Various fixes:
* kernel-doc parsing fixes
* incorrect debugfs string checks
* locking fix in regulatory
* some encryption-related fixes
* tag 'mac80211-for-net-2021-01-18.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
mac80211: check if atf has been disabled in __ieee80211_schedule_txq
mac80211: do not drop tx nulldata packets on encrypted links
mac80211: fix encryption key selection for 802.3 xmit
mac80211: fix fast-rx encryption check
mac80211: fix incorrect strlen of .write in debugfs
cfg80211: fix a kerneldoc markup
cfg80211: Save the regulatory domain with a lock
cfg80211/mac80211: fix kernel-doc for SAR APIs
====================
Link: https://lore.kernel.org/r/20210118204750.7243-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rasmus Villemoes [Sat, 16 Jan 2021 02:39:35 +0000 (03:39 +0100)]
net: dsa: mv88e6xxx: also read STU state in mv88e6250_g1_vtu_getnext
mv88e6xxx_port_vlan_join checks whether the VTU already contains an
entry for the given vid (via mv88e6xxx_vtu_getnext), and if so, merely
changes the relevant .member[] element and loads the updated entry
into the VTU.
However, at least for the mv88e6250, the on-stack struct
mv88e6xxx_vtu_entry vlan never has its .state[] array explicitly
initialized, neither in mv88e6xxx_port_vlan_join() nor inside the
getnext implementation. So the new entry has random garbage for the
STU bits, breaking VLAN filtering.
When the VTU entry is initially created, those bits are all zero, and
we should make sure to keep them that way when the entry is updated.
Fixes: 92307069a96c (net: dsa: mv88e6xxx: Avoid VTU corruption on 6097)
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Mon, 18 Jan 2021 19:51:08 +0000 (11:51 -0800)]
Merge branch 'net-ipa-interconnect-improvements'
Alex Elder says:
====================
net: ipa: interconnect improvements
The main outcome of this series is to allow the number of
interconnects used by the IPA to differ from the three that
are implemented now. With this series in place, any number
of interconnects can now be used, all specified in the
configuration data for a specific platform.
A few minor interconnect-related cleanups are implemented as well.
====================
Link: https://lore.kernel.org/r/20210115125050.20555-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:50 +0000 (06:50 -0600)]
net: ipa: allow arbitrary number of interconnects
Currently we assume that the IPA hardware has exactly three
interconnects. But that won't be guaranteed for all platforms,
so allow any number of interconnects to be specified in the
configuration data.
For each platform, define an array of interconnect data entries
(still associated with the IPA clock structure), and record the
number of entries initialized in that array.
Loop over all entries in this array when initializing, enabling,
disabling, or tearing down the set of interconnects.
With this change we no longer need the ipa_interconnect_id
enumerated type, so get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:49 +0000 (06:50 -0600)]
net: ipa: clean up interconnect initialization
Pass an the address of an IPA interconnect structure and its
configuration data to ipa_interconnect_init_one() and have that
function initialize all the structure's fields. Change the function
to simply return an error code.
Introduce ipa_interconnect_exit_one() to encapsulate the cleanup of
an IPA interconnect structure.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:48 +0000 (06:50 -0600)]
net: ipa: add interconnect name to configuration data
Add the name to the configuration data for each interconnect. Use
this information rather than a constant string during initialization.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:47 +0000 (06:50 -0600)]
net: ipa: store average and peak interconnect bandwidth
Add fields in the ipa_interconnect structure to hold the average and
peak bandwidth values for the interconnect. Pass the configuring
data for interconnects to ipa_interconnect_init() so these values
can be recorded, and use them when enabling the interconnects.
There's no longer any need to keep a copy of the interconnect data
after initialization.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:46 +0000 (06:50 -0600)]
net: ipa: introduce an IPA interconnect structure
Rather than having separate pointers for the memory, imem, and
config interconnect paths, maintain an array of ipa_interconnect
structures each of which contains a pointer to a path.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:45 +0000 (06:50 -0600)]
net: ipa: don't return an error from ipa_interconnect_disable()
If disabling interconnects fails there's not a lot we can do. The
only two callers of ipa_interconnect_disable() ignore the return
value, so just give the function a void return type.
Print an error message if disabling any of the interconnects is not
successful. Return (and print) only the first error seen.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Fri, 15 Jan 2021 12:50:44 +0000 (06:50 -0600)]
net: ipa: rename interconnect settings
Use "bandwidth" rather than "rate" in describing the average and
peak values to use for IPA interconnects. They should have been
named that way to begin with.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>