review.tizen.org Git - platform/kernel/linux-rpi.git/log

r8169: factor out redundant RTL8168d PHY config functionality to rtl8168d_1_common()

rtl8168d_2_hw_phy_config() shares quite some functionality with
rtl8168d_1_hw_phy_config(), so let's factor out the common part to a
new function rtl8168d_1_common(). In addition improve the code a little.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: fix use-after-free in ip6mr_sk_done()

Apparently addrconf_exit_net() is called before igmp6_net_exit()
and ndisc_net_exit() at netns dismantle time:

net_namespace: call ip6table_mangle_net_exit()
net_namespace: call ip6_tables_net_exit()
net_namespace: call ipv6_sysctl_net_exit()
net_namespace: call ioam6_net_exit()
net_namespace: call seg6_net_exit()
net_namespace: call ping_v6_proc_exit_net()
net_namespace: call tcpv6_net_exit()
ip6mr_sk_done sk=ffffa354c78a74c0
net_namespace: call ipv6_frags_exit_net()
net_namespace: call addrconf_exit_net()
net_namespace: call ip6addrlbl_net_exit()
net_namespace: call ip6_flowlabel_net_exit()
net_namespace: call ip6_route_net_exit_late()
net_namespace: call fib6_rules_net_exit()
net_namespace: call xfrm6_net_exit()
net_namespace: call fib6_net_exit()
net_namespace: call ip6_route_net_exit()
net_namespace: call ipv6_inetpeer_exit()
net_namespace: call if6_proc_net_exit()
net_namespace: call ipv6_proc_exit_net()
net_namespace: call udplite6_proc_exit_net()
net_namespace: call raw6_exit_net()
net_namespace: call igmp6_net_exit()
ip6mr_sk_done sk=ffffa35472b2a180
ip6mr_sk_done sk=ffffa354c78a7980
net_namespace: call ndisc_net_exit()
ip6mr_sk_done sk=ffffa35472b2ab00
net_namespace: call ip6mr_net_exit()
net_namespace: call inet6_net_exit()

This was fine because ip6mr_sk_done() would not reach the point decreasing
net->ipv6.devconf_all->mc_forwarding until my patch in ip6mr_sk_done().

To fix this without changing struct pernet_operations ordering,
we can clear net->ipv6.devconf_dflt and net->ipv6.devconf_all
when they are freed from addrconf_exit_net()

BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
BUG: KASAN: use-after-free in ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
Read of size 4 at addr ffff88801ff08688 by task kworker/u4:4/963

CPU: 0 PID: 963 Comm: kworker/u4:4 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e5305 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
rawv6_close+0x58/0x80 net/ipv6/raw.c:1201
inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478
__sock_release net/socket.c:650 [inline]
sock_release+0x87/0x1b0 net/socket.c:678
inet_ctl_sock_destroy include/net/inet_common.h:65 [inline]
igmp6_net_exit+0x6b/0x170 net/ipv6/mcast.c:3173
ops_exit_list+0xb0/0x170 net/core/net_namespace.c:168
cleanup_net+0x4ea/0xb00 net/core/net_namespace.c:600
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

Fixes: f2f2325ec799 ("ip6mr: ip6mr_sk_done() can exit early in common cases")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

caif: cleanup double word in comment

Replace the second 'so' with 'free'.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-dip-sip-mangling'

Ido Schimmel says:

====================
mlxsw: Add SIP and DIP mangling support

Danielle says:

On Spectrum-2 onwards, it is possible to overwrite SIP and DIP address
of an IPv4 or IPv6 packet in the ACL engine. That corresponds to pedit
munges of, respectively, ip src and ip dst fields, and likewise for ip6.
Offload these munges on the systems where they are supported.

Patchset overview:
Patch #1: introduces SIP_DIP_ACTION and its fields.
Patch #2-#3: adds the new pedit fields, and dispatches on them on
Spectrum-2 and above.
Patch #4 adds a selftest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: Add a test for pedit munge SIP and DIP

Add a test that checks that pedit adjusts source and destination
addresses of IPv4 and IPv6 packets.

Output example:

$ ./pedit_ip.sh
TEST: ping                                                          [ OK ]
TEST: ping6                                                         [ OK ]
TEST: dev swp2 ingress pedit ip src set 198.51.100.1                [ OK ]
TEST: dev swp3 egress pedit ip src set 198.51.100.1                 [ OK ]
TEST: dev swp2 ingress pedit ip dst set 198.51.100.1                [ OK ]
TEST: dev swp3 egress pedit ip dst set 198.51.100.1                 [ OK ]
TEST: dev swp2 ingress pedit ip6 src set 2001:db8:2::1              [ OK ]
TEST: dev swp3 egress pedit ip6 src set 2001:db8:2::1               [ OK ]
TEST: dev swp2 ingress pedit ip6 dst set 2001:db8:2::1              [ OK ]
TEST: dev swp3 egress pedit ip6 dst set 2001:db8:2::1               [ OK ]

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Support FLOW_ACTION_MANGLE for SIP and DIP IPv6 addresses

Spectrum-2 supports an ACL action SIP_DIP, which allows IPv4 and IPv6
source and destination addresses change. Offload suitable mangles to
the IPv6 address change action.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Support FLOW_ACTION_MANGLE for SIP and DIP IPv4 addresses

Spectrum-2 supports an ACL action SIP_DIP, which allows IPv4 and IPv6
source and destination addresses change. Offload suitable mangles to
the IPv4 address change action.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core_acl_flex_actions: Add SIP_DIP_ACTION

Add fields related to SIP_DIP_ACTION, which is used for changing of SIP
and DIP addresses.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-kfree_skb_reason'

Menglong Dong says:

====================
net: use kfree_skb_reason() for ip/udp packet receive

In this series patches, kfree_skb() is replaced with kfree_skb_reason()
during ipv4 and udp4 packet receiving path, and following drop reasons
are introduced:

SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_NETFILTER_DROP
SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM

TCP is more complex, so I left it in the next series.

I just figure out how __print_symbolic() works. It doesn't base on the
array index, but searching for symbols by loop. So I'm a little afraid
it's performance.

Changes since v3:
- fix some small problems in the third patch (net: ipv4: use
kfree_skb_reason() in ip_rcv_core()), as David Ahern said

Changes since v2:
- use SKB_DROP_REASON_PKT_TOO_SMALL for a path in ip_rcv_core()

Changes since v1:
- add document for all drop reasons, as David advised
- remove unreleated cleanup
- remove EARLY_DEMUX and IP_ROUTE_INPUT drop reason
- replace {UDP, TCP}_FILTER with SOCKET_FILTER
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb()

Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:

SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb()

Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: use kfree_skb_reason() in ip_protocol_deliver_rcu()

Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:

SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: use kfree_skb_reason() in ip_rcv_finish_core()

Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:

SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: use kfree_skb_reason() in ip_rcv_core()

Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:

SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netfilter: use kfree_drop_reason() for NF_DROP

Replace kfree_skb() with kfree_skb_reason() in nf_hook_slow() when
skb is dropped by reason of NF_DROP. Following new drop reasons
are introduced:

SKB_DROP_REASON_NETFILTER_DROP

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: skb_drop_reason: add document for drop reasons

Add document for following existing drop reasons:

SKB_DROP_REASON_NOT_SPECIFIED
SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_UDP_CSUM

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ref_tracker: remove filter_irq_stacks() call

After commit e94006608949 ("lib/stackdepot: always do filter_irq_stacks()
in stack_depot_save()") it became unnecessary to filter the stack
before calling stack_depot_save().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: initialize init_net earlier

While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.

This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);

This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&init_net)

We had too many prior patches [1] trying to work around the root cause.

Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.

Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()

As a bonus, this patch saves around 4KB in .data section.

[1]

f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df8019a ("net: Initialise init_net.count to 1")
734b65417b24 ("net: Statically initialize init_net.dev_base_head")

v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hsr: use hlist_head instead of list_head for mac addresses

Currently, HSR manages mac addresses of known HSR nodes by using list_head.
It takes a lot of time when there are a lot of registered nodes due to
finding specific mac address nodes by using linear search. We can be
reducing the time by using hlist. Thus, this patch moves list_head to
hlist_head for mac addresses and this allows for further improvement of
network performance.

    Condition: registered 10,000 known HSR nodes
    Before:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 59442 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.49   sec  3.75 MBytes  21.1 Mbits/sec    0    158 KBytes
    [  5]   1.49-2.05   sec  1.25 MBytes  18.7 Mbits/sec    0    166 KBytes
    [  5]   2.05-3.06   sec  2.44 MBytes  20.3 Mbits/sec   56   16.9 KBytes
    [  5]   3.06-4.08   sec  1.43 MBytes  11.7 Mbits/sec   11   38.0 KBytes
    [  5]   4.08-5.00   sec   951 KBytes  8.49 Mbits/sec    0   56.3 KBytes

    After:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 36460 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  7.39 MBytes  62.0 Mbits/sec    3    130 KBytes
    [  5]   1.00-2.00   sec  5.06 MBytes  42.4 Mbits/sec   16    113 KBytes
    [  5]   2.00-3.00   sec  8.58 MBytes  72.0 Mbits/sec   42   94.3 KBytes
    [  5]   3.00-4.00   sec  7.44 MBytes  62.4 Mbits/sec    2    131 KBytes
    [  5]   4.00-5.07   sec  8.13 MBytes  63.5 Mbits/sec   38   92.9 KBytes

Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

skmsg: convert struct sk_msg_sg::copy to a bitmap

We have plans for increasing MAX_SKB_FRAGS, but sk_msg_sg::copy
is currently an unsigned long, limiting MAX_SKB_FRAGS to 30 on 32bit arches.

Convert it to a bitmap, as Jakub suggested.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: typhoon: implement ndo_features_check method

Instead of disabling TSO at compile time if MAX_SKB_FRAGS > 32,
implement ndo_features_check() method for this driver for
a more dynamic handling.

If skb has more than 32 frags and is a GSO packet, force
software segmentation.

Most locally generated packets will use a small number
of fragments anyway.

For forwarding workloads, we can limit gro_max_size at ingress,
we might also implement gro_max_segs if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sundance: Replace one-element array with non-array object

It seems this one-element array is not actually being used as an
array of variable size, so we can just replace it with just a
non-array object of type struct desc_frag and refactor a bit the
rest of the code.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

This issue was found with the help of Coccinelle and audited and fixed,
manually.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Replace one-element array with flexible-array member

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

This issue was found with the help of Coccinelle and audited and fixed,
manually.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-mana-next'

Haiyang Zhang says:

====================
net: mana: Add handling of CQE_RX_TRUNCATED and a cleanup

Add handling of CQE_RX_TRUNCATED and a cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mana: Remove unnecessary check of cqe_type in mana_process_rx_cqe()

The switch statement already ensures cqe_type == CQE_RX_OKAY at that
point.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mana: Add handling of CQE_RX_TRUNCATED

The proper way to drop this kind of CQE is advancing rxq tail
without indicating the packet to the upper network layer.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-dev-tracking-improvements'

Eric Dumazet says:

====================
net: device tracking improvements

Main goal of this series is to be able to detect the following case
which apparently is still haunting us.

dev_hold_track(dev, tracker_1, GFP_ATOMIC);
    dev_hold(dev);
    dev_put(dev);
    dev_put(dev);              // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here (as before this series)

v2: third patch:
  I replaced the dev_put() in linkwatch_do_dev() with __dev_put().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: refine dev_put()/dev_hold() debugging

We are still chasing some syzbot reports where we think a rogue dev_put()
is called with no corresponding prior dev_hold().
Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
dev_hold_track(), meaning that the refcount saturation splat comes
too late to be useful.

Make sure that 'not tracked' dev_put() and dev_hold() better use
CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:

Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and to use a separate refcount
only to detect too many put() even in the following case:

dev_hold_track(dev, tracker_1, GFP_ATOMIC);
dev_hold(dev);
dev_put(dev);
dev_put(dev); // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here

Add clarification about netdev_tracker_alloc() role.

v2: I replaced the dev_put() in linkwatch_do_dev()
with __dev_put() because callers called netdev_tracker_free().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ref_tracker: add a count of untracked references

We are still chasing a netdev refcount imbalance, and we suspect
we have one rogue dev_put() that is consuming a reference taken
from a dev_hold_track()

To detect this case, allow ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and use a dedicated
refcount_t just for them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ref_tracker: implement use-after-free detection

Whenever ref_tracker_dir_init() is called, mark the struct ref_tracker_dir
as dead.

Test the dead status from ref_tracker_alloc() and ref_tracker_free()

This should detect buggy dev_put()/dev_hold() happening too late
in netdevice dismantle process.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-mc_forwarding-changes'

Eric Dumazet says:

====================
ipv6: mc_forwarding changes

First patch removes minor data-races, as mc_forwarding can
be locklessly read in fast path.

Second patch adds a short cut in ip6mr_sk_done()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: ip6mr_sk_done() can exit early in common cases

In many cases, ip6mr_sk_done() is called while no ipmr socket
has been registered.

This removes 4 rtnl acquisitions per netns dismantle,
with following callers:

igmp6_net_exit(), tcpv6_net_exit(), ndisc_net_exit()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: make mc_forwarding atomic

This fixes minor data-races in ip6_mc_input() and
batadv_mcast_mla_rtr_flags_softif_get_ipv6()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: realtek: don't default Kconfigs to y

We generally default the vendor to y and the drivers itself
to n. NET_DSA_REALTEK, however, selects a whole bunch of things,
so it's not a pure "vendor selection" knob. Let's default it all
to n.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sparx5: remove phylink_config.pcs_poll usage

Phylink will use PCS polling whenever phylink_config.pcs_poll or the
phylink_pcs poll member is set. As this driver sets both, remove the
former.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: remove phylink_set_10g_modes()

phylink_set_10g_modes() is no longer used with the conversion of
drivers to phylink_generic_validate(), so we can remove it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'gro-minor-opts'

Paolo Abeni says:

====================
gro: a couple of minor optimization

This series collects a couple of small optimizations for the GRO engine,
reducing slightly the number of cycles for dev_gro_receive().
The delta is within noise range in tput tests, but with big TCP coming
every cycle saved from the GRO engine will count - I hope ;)

v1 -> v2:
- a few cleanup suggested from Alexander(s)
- moved away the more controversial 3rd patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: gro: minor optimization for dev_gro_receive()

While inspecting some perf report, I noticed that the compiler
emits suboptimal code for the napi CB initialization, fetching
and storing multiple times the memory for flags bitfield.
This is with gcc 10.3.1, but I observed the same with older compiler
versions.

We can help the compiler to do a nicer work clearing several
fields at once using an u32 alias. The generated code is quite
smaller, with the same number of conditional.

Before:
objdump -t net/core/gro.o | grep " F .text"
0000000000000bb0 l     F .text 0000000000000357 dev_gro_receive

After:
0000000000000bb0 l     F .text 000000000000033c dev_gro_receive

v1  -> v2:
- use struct_group (Alexander and Alex)

RFC -> v1:
- use __struct_group to delimit the zeroed area (Alexander)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: gro: avoid re-computing truesize twice on recycle

After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
carring sock reference") and commit af352460b465 ("net: fix GRO
skb truesize update") the truesize of the skb with stolen head is
properly updated by the GRO engine, we don't need anymore resetting
it at recycle time.

v1 -> v2:
- clarify the commit message (Alexander)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: qca8k: check correct variable in qca8k_phy_eth_command()

This is a copy and paste bug. It was supposed to check "clear_skb"
instead of "write_skb".

Fixes: 2cd548566384 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'lan966x-mcast-snooping'

Horatiu Vultur says:

====================
net: lan966x: add support for mcast snooping

Implement the switchdev callback SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED
to allow to enable/disable multicast snooping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Update mdb when enabling/disabling mcast_snooping

When the multicast snooping is disabled, the mdb entries should be
removed from the HW, but they still need to be kept in memory for when
the mcast_snooping will be enabled again.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Implement the callback SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED

The callback allows to enable/disable multicast snooping.
When the snooping is enabled, all IGMP and MLD frames are redirected to
the CPU, therefore make sure not to set the skb flag 'offload_fwd_mark'.
The HW will not flood multicast ipv4/ipv6 data frames.
When the snooping is disabled, the HW will flood IGMP, MLD and multicast
ipv4/ipv6 frames according to the mcast_flood flag.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Update the PGID used by IPV6 data frames

When enabling the multicast snooping, the forwarding of the IPV6 frames
has it's own forwarding mask.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Enable tc skb ext allocation on chain miss only when needed

Currently tc skb extension is used to send miss info from
tc to ovs datapath module, and driver to tc. For the tc to ovs
miss it is currently always allocated even if it will not
be used by ovs datapath (as it depends on a requested feature).

Export the static key which is used by openvswitch module to
guard this code path as well, so it will be skipped if ovs
datapath doesn't need it. Enable this code path once
ovs datapath needs it.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-improve-set-flags-command-and-update-self-tests'

Mat Martineau says:

====================
mptcp: Improve set-flags command and update self tests

Patches 1-3 allow more flexibility in the combinations of features and
flags allowed with the MPTCP_PM_CMD_SET_FLAGS netlink command, and add
self test case coverage for the new functionality.

Patches 4-6 and 9 refactor the mptcp_join.sh self tests to allow them to
configure all of the test cases using either the pm_nl_ctl utility (part
of the mptcp self tests) or the 'ip mptcp' command (from iproute2). The
default remains to use pm_nl_ctl.

Patches 7 and 8 update the pm_netlink.sh self tests to cover the use of
endpoint ids to set endpoint flags (instead of just addresses).
====================

Link: https://lore.kernel.org/r/20220205000337.187292-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: set ip_mptcp in command line

This patch added a command line option '-i' for mptcp_join.sh to use
'ip mptcp' commands instead of using 'pm_nl_ctl' commands to deal with
PM netlink.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add set_flags tests in pm_netlink.sh

This patch added the setting flags test cases, using both addr-based and
id-based lookups for the setting address.

The output looks like this:

set flags (backup)                                 [ OK ]
           (nobackup)                               [ OK ]
           (fullmesh)                               [ OK ]
           (nofullmesh)                             [ OK ]
           (backup,fullmesh)                        [ OK ]

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add the id argument for set_flags

This patch added the id argument for setting the address flags in
pm_nl_ctl.

Usage:

pm_nl_ctl set id 1 flags backup

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add wrapper for setting flags

This patch implemented a new function named pm_nl_set_endpoint(), wrapped
the PM netlink commands 'ip mptcp endpoint change flags' and 'pm_nl_ctl
set flags' in it, and used a new argument 'ip_mptcp' to choose which one
to use to set the flags of the PM endpoint.

'ip mptcp' used the ID number argument to find out the address to change
flags, while 'pm_nl_ctl' used the address and port number arguments. So
we need to parse the address ID from the PM dump output as well as the
address and port number.

Used this wrapper in do_transfer() instead of using the pm_nl_ctl command
directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add wrapper for showing addrs

This patch implemented a new function named pm_nl_show_endpoints(), wrapped
the PM netlink commands 'ip mptcp endpoint show' and 'pm_nl_ctl dump' in
it, used a new argument 'ip_mptcp' to choose which one to use to show all
the PM endpoints.

Used this wrapper in do_transfer() instead of using the pm_nl_ctl commands
directly.

The original 'pos+=5' in the remoing tests only works for the output of
'pm_nl_ctl show':

id 1 flags subflow 10.0.1.1

It doesn't work for the output of 'ip mptcp endpoint show':

10.0.1.1 id 1 subflow

So implemented a more flexible approach to get the address ID from the PM
dump output to fit for both commands.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add ip mptcp wrappers

This patch added four basic 'ip mptcp' wrappers:

pm_nl_set_limits()
pm_nl_add_endpoint()
pm_nl_del_endpoint()
pm_nl_flush_endpoint().

Wrapped the PM netlink commands 'ip mptcp' and 'pm_nl_ctl' in them, and
used a new argument 'ip_mptcp' to choose which one to use for setting the
PM limits, adding or deleting the PM endpoint.

Used the wrappers in all the selftests in mptcp_join.sh instead of using
the pm_nl_ctl commands directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add backup with port testcase

This patch added the backup testcase using an address with a port number.

The original backup tests only work for the output of 'pm_nl_ctl dump'
without the port number. It chooses the last item in the dump to parse
the address in it, and in this case, the address is showed at the end
of the item.

But it doesn't work for the dump with the port number, in this case, the
port number is showed at the end of the item, not the address.

So implemented a more flexible approach to get the address and the port
number from the dump to fit for the port number case.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add the port argument for set_flags

This patch added the port argument for setting the address flags in
pm_nl_ctl.

Usage:

pm_nl_ctl set 10.0.2.1 flags backup port 10100

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: allow to use port and non-signal in set_flags

It's illegal to use both port and non-signal flags for adding address.
But it's legal to use both of them for setting flags, which always uses
non-signal flags, backup or fullmesh.

This patch moves this non-signal flag with port check from
mptcp_pm_parse_addr() to mptcp_nl_cmd_add_addr(). Do the check only when
adding addresses, not setting flags or deleting addresses.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'support-for-the-ioam-insertion-frequency'

Justin Iurman says:

====================
Support for the IOAM insertion frequency

The insertion frequency is represented as "k/n", meaning IOAM will be
added to {k} packets over {n} packets, with 0 < k <= n and 1 <= {k,n} <=
1000000. Therefore, it provides the following percentages of insertion
frequency: [0.0001% (min) ... 100% (max)].

Not only this solution allows an operator to apply dynamic frequencies
based on the current traffic load, but it also provides some
flexibility, i.e., by distinguishing similar cases (e.g., "1/2" and
"2/4").

"1/2" = Y N Y N Y N Y N ...
"2/4" = Y Y N N Y Y N N ...
====================

Link: https://lore.kernel.org/r/20220202142554.9691-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: ioam: Insertion frequency in lwtunnel output

Add support for the IOAM insertion frequency inside its lwtunnel output
function. This patch introduces a new (atomic) counter for packets,
based on which the algorithm will decide if IOAM should be added or not.

Default frequency is "1/1" (i.e., applied to all packets) for backward
compatibility. The iproute2 patch is ready and will be submitted as soon
as this one is accepted.

Previous iproute2 command:
ip -6 ro ad fc00::1/128 encap ioam6 [ mode ... ] ...

New iproute2 command:
ip -6 ro ad fc00::1/128 encap ioam6 [ freq k/n ] [ mode ... ] ...

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

uapi: ioam: Insertion frequency

Add the insertion frequency uapi for IOAM lwtunnels.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: don't include ndisc.h from ipv6.h

Nothing in ipv6.h needs ndisc.h, drop it.

Link: https://lore.kernel.org/r/20220203043457.2222388-1-kuba@kernel.org
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Link: https://lore.kernel.org/r/20220203231240.2297588-1-kuba@kernel.org
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipa-RX-replenish'

Alex Elder says:

====================
net: ipa: improve RX buffer replenishing

This series revises the algorithm used for replenishing receive
buffers on RX endpoints.  Currently there are two atomic variables
that track how many receive buffers can be sent to the hardware.
The new algorithm obviates the need for those, by just assuming we
always want to provide the hardware with buffers until it can hold
no more.

The first patch eliminates an atomic variable that's not required.
The next moves some code into the main replenish function's caller,
making one of the called function's arguments unnecessary.   The
next six refactor things a bit more, adding a new helper function
that allows us to eliminate an additional atomic variable.  And the
final two implement two more minor improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: determine replenish doorbell differently

Rather than tracking the number of receive buffer transactions that
have been submitted without a doorbell, just track the total number
of transactions that have been issued. Then ring the doorbell when
that number modulo the replenish batch size is 0.

The effect is roughly the same, but the new count is slightly more
interesting, and this approach will someday allow the replenish
batch size to be tuned at runtime.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: replenish after delivering payload

Replenishing is now solely driven by whether transactions are
available for a channel, and it doesn't really matter whether
we replenish before or after we deliver received packets to the
network stack.

Replenishing before delivering the payload adds a little latency.
Eliminate that by requesting a replenish after the payload is
delivered.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: kill replenish_backlog

We no longer use the replenish_backlog atomic variable to decide
when we've got work to do providing receive buffers to hardware.
Basically, we try to keep the hardware as full as possible, all the
time. We keep supplying buffers until the hardware has no more
space for them.

As a result, we can get rid of the replenish_backlog field and the
atomic operations performed on it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: introduce gsi_channel_trans_idle()

Create a new function that returns true if all transactions for a
channel are available for use.

Use it in ipa_endpoint_replenish_enable() to see whether to start
replenishing, and in ipa_endpoint_replenish() to determine whether
it's necessary after a failure to schedule delayed work to ensure a
future replenish attempt occurs.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: don't use replenish_backlog

Rather than determining when to stop replenishing using the
replenish backlog, just stop when we have exhausted all available
transactions.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: allocate transaction in replenish loop

When replenishing, have ipa_endpoint_replenish() allocate a
transaction, and pass that to ipa_endpoint_replenish_one() to fill.
Then, if that produces no error, commit the transaction within the
replenish loop as well. In this way we can distinguish between
transaction failures and buffer allocation/mapping failures.

Failure to allocate a transaction simply means the hardware already
has as many receive buffers as it can hold. In that case we can
break out of the replenish loop because there's nothing more to do.

If we fail to allocate or map pages for the receive buffer, just
try again later.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: decide on doorbell in replenish loop

Decide whether the doorbell should be signaled when committing a
replenish transaction in the main replenish loop, rather than in
ipa_endpoint_replenish_one(). This is a step to facilitate the
next patch.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: increment backlog in replenish caller

Three spots call ipa_endpoint_replenish(), and just one of those
requests that the backlog be incremented after completing the
replenish operation.

Instead, have the caller increment the backlog, and get rid of the
add_one argument to ipa_endpoint_replenish().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: allocate transaction before pages when replenishing

A transaction failure only occurs if no more transactions are
available for an endpoint. It's a very cheap test.

When replenishing an RX endpoint buffer, there's no point in
allocating pages if transactions are exhausted. So don't bother
doing so unless the transaction allocation succeeds.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: kill replenish_saved

The replenish_saved field keeps track of the number of times a new
buffer is added to the backlog when replenishing is disabled. We
don't really use it though, so there's no need for us to track it
separately. Whether replenishing is enabled or not, we can simply
increment the backlog.

Get rid of replenish_saved, and initialize and increment the backlog
where it would have otherwise been used.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: cap the output scatter list to something reasonable

TLS recvmsg() passes user pages as destination for decrypt.
The decrypt operation is repeated record by record, each
record being 16kB, max. TLS allocates an sg_table and uses
iov_iter_get_pages() to populate it with enough pages to
fit the decrypted record.

Even though we decrypt a single message at a time we size
the sg_table based on the entire length of the iovec.
This leads to unnecessarily large allocations, risking
triggering OOM conditions.

Use iov_iter_truncate() / iov_iter_reexpand() to construct
a "capped" version of iov_iter_npages(). Alternatively we
could parametrize iov_iter_npages() to take the size as
arg instead of using i->count, or do something else..

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: realtek: convert to phylink_generic_validate()

Populate the supported interfaces and MAC capabilities for the Realtek
rtl8365 DSA switch and remove the old validate implementation to allow
DSA to use phylink_generic_validate() for this switch driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2022-02-03

This series contains updates to the i40e client header file and driver.

Mateusz disables HW TC offload by default.

Joe Damato removes a no longer used statistic.

Jakub Kicinski removes an unused enum from the client header file.

Jedrzej changes some admin queue commands to occur under atomic context
and adds new functions for admin queue MAC VLAN filters to avoid a
potential race that could occur due storing results in a structure that
could be overwritten by the next admin queue call.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: use .mac_select_pcs() interface

Convert lan966x to use the mac_select_interface instead of
phylink_set_pcs.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20220202114949.833075-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rtnetlink: Use more sensible tos values

Using tos 0x1 with 'ip route get <IPv4 address> ...' doesn't test much
of the tos option handling: 0x1 just sets an ECN bit, which is cleared
by inet_rtm_getroute() before doing the fib lookup. Let's use 0x10
instead, which is actually taken into account in the route lookup (and
is less surprising for the reader).

For consistency, use 0x10 for the IPv6 route lookup too (IPv6 currently
doesn't clear ECN bits, but might do so in the future).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/d61119e68d01ba7ef3ba50c1345a5123a11de123.1643815297.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: fib offload: use sensible tos values

Although both iproute2 and the kernel accept 1 and 2 as tos values for
new routes, those are invalid. These values only set ECN bits, which
are ignored during IPv4 fib lookups. Therefore, no packet can actually
match such routes. This selftest therefore only succeeds because it
doesn't verify that the new routes do actually work in practice (it
just checks if the routes are offloaded or not).

It makes more sense to use tos values that don't conflict with ECN.
This way, the selftest won't be affected if we later decide to warn or
even reject invalid tos configurations for new routes.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/5e43b343720360a1c0e4f5947d9e917b26f30fbf.1643826556.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: minor __dev_alloc_name() optimization

__dev_alloc_name() allocates a private zeroed page,
then sets bits in it while iterating through net devices.

It can use __set_bit() to avoid unnecessary locked operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220203064609.3242863-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gcc-plugins/stackleak: Use noinstr in favor of notrace

While the stackleak plugin was already using notrace, objtool is now a
bit more picky.  Update the notrace uses to noinstr.  Silences the
following objtool warnings when building with:

CONFIG_DEBUG_ENTRY=y
CONFIG_STACK_VALIDATION=y
CONFIG_VMLINUX_VALIDATION=y
CONFIG_GCC_PLUGIN_STACKLEAK=y

  vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section

Note that the plugin's addition of calls to stackleak_track_stack() from
noinstr functions is expected to be safe, as it isn't runtime
instrumentation and is self-contained.

Cc: Alexander Popov <alex.popov@linux.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.17-rc3' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, netfilter, and ieee802154.

  Current release - regressions:

   - Partially revert "net/smc: Add netlink net namespace support", fix
     uABI breakage

   - netfilter:
      - nft_ct: fix use after free when attaching zone template
      - nft_byteorder: track register operations

  Previous releases - regressions:

   - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback

   - phy: qca8081: fix speeds lower than 2.5Gb/s

   - sched: fix use-after-free in tc_new_tfilter()

  Previous releases - always broken:

   - tcp: fix mem under-charging with zerocopy sendmsg()

   - tcp: add missing tcp_skb_can_collapse() test in
     tcp_shift_skb_data()

   - neigh: do not trigger immediate probes on NUD_FAILED from
     neigh_managed_work, avoid a deadlock

   - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
     false-positives

   - netfilter: nft_reject_bridge: fix for missing reply from prerouting

   - smc: forward wakeup to smc socket waitqueue after fallback

   - ieee802154:
      - return meaningful error codes from the netlink helpers
      - mcr20a: fix lifs/sifs periods
      - at86rf230, ca8210: stop leaking skbs on error paths

   - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent

   - ax25: add refcount in ax25_dev to avoid UAF bugs

   - eth: mlx5e:
      - fix SFP module EEPROM query
      - fix broken SKB allocation in HW-GRO
      - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows

   - eth: amd-xgbe:
      - fix skb data length underflow
      - ensure reset of the tx_timer_active flag, avoid Tx timeouts

   - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()

   - eth: e1000e: handshake with CSME starts from Alder Lake platforms"

* tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  ax25: fix reference count leaks of ax25_dev
  net: stmmac: ensure PTP time register reads are consistent
  net: ipa: request IPA register values be retained
  dt-bindings: net: qcom,ipa: add optional qcom,qmp property
  tools/resolve_btfids: Do not print any commands when building silently
  bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
  net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
  tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
  net: sparx5: do not refer to skb after passing it on
  Partially revert "net/smc: Add netlink net namespace support"
  net/mlx5e: Avoid field-overflowing memcpy()
  net/mlx5e: Use struct_group() for memcpy() region
  net/mlx5e: Avoid implicit modify hdr for decap drop rule
  net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
  net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
  net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
  net/mlx5: E-Switch, Fix uninitialized variable modact
  net/mlx5e: Fix handling of wrong devices during bond netevent
  net/mlx5e: Fix broken SKB allocation in HW-GRO
  net/mlx5e: Fix wrong calculation of header index in HW_GRO
  ...

Merge tag 'selinux-pr-20220203' of git://git./linux/kernel/git/pcmoore/selinux

Pull selinux fix from Paul Moore:
"One small SELinux patch to ensure that a policy structure field is
  properly reset after freeing so that we don't inadvertently do a
  double-free on certain error conditions"

* tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: fix double free of cond_list on error paths

Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:
"Important fixes to several tests and documentation clarification on
  running mainline kselftest on stable releases. A few notable fixes:

   - fix kselftest run hang due to child processes that haven't been
     terminated. Fix signals all child processes

   - fix false pass/fail results from vdso_test_abi, openat2, mincore

   - build failures when using -j (multiple jobs) option

   - exec test build failure due to incorrect build rule for a run-time
     created "pipe"

   - zram test fixes related to interaction with zram-generator to make
     sure zram test to coordinate deleted with zram-generator

   - zram test compression ratio calculation fix and skipping
     max_comp_streams.

   - increasing rtc test timeout

   - cpufreq test to write test results to stdout which will necessary
     on automated test systems"

* tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kselftest: Fix vdso_test_abi return status
  selftests: skip mincore.check_file_mmap when fs lacks needed support
  selftests: openat2: Skip testcases that fail with EOPNOTSUPP
  selftests: openat2: Add missing dependency in Makefile
  selftests: openat2: Print also errno in failure messages
  selftests: futex: Use variable MAKE instead of make
  selftests/exec: Remove pipe from TEST_GEN_FILES
  selftests/zram: Adapt the situation that /dev/zram0 is being used
  selftests/zram01.sh: Fix compression ratio calculation
  selftests/zram: Skip max_comp_streams interface on newer kernel
  docs/kselftest: clarify running mainline tests on stables
  kselftest: signal all child processes
  selftests: cpufreq: Write test output to stdout as well
  selftests: rtc: Increase test timeout so that all tests run

ax25: fix reference count leaks of ax25_dev

The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev
to avoid UAF bugs") introduces refcount into ax25_dev, but there
are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().

This patch uses ax25_dev_put() and adjusts the position of
ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.

Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: ensure PTP time register reads are consistent

Even if protected from preemption and interrupts, a small time window
remains when the 2 register reads could return inconsistent values,
each time the "seconds" register changes. This could lead to an about
1-second error in the reported time.

Add logic to ensure the "seconds" and "nanoseconds" values are consistent.

Fixes: 92ba6888510c ("stmmac: add the support for PTP hw clock driver")
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge https://git./linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2022-02-03

We've added 6 non-merge commits during the last 10 day(s) which contain
a total of 7 files changed, 11 insertions(+), 236 deletions(-).

The main changes are:

1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
   flag which otherwise trips over KASAN, from Hou Tao.

2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
   rename, from Alexei Starovoitov.

3) Fix a possible race in inc_misses_counter() when IRQ would trigger
   during counter update, from He Fengqing.

4) Fix tooling infra for cross-building with clang upon probing whether
   gcc provides the standard libraries, from Jean-Philippe Brucker.

5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.

6) Drop unneeded and outdated lirc.h header copy from tooling infra as
   BPF does not require it anymore, from Sean Young.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  tools/resolve_btfids: Do not print any commands when building silently
  bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
  tools: Ignore errors from `which' when searching a GCC toolchain
  tools headers UAPI: remove stale lirc.h
  bpf: Fix possible race in inc_misses_counter
  bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
====================

Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i40e: Fix race condition while adding/deleting MAC/VLAN filters

There was a race condition in access to hw->aq.asq_last_status
while adding and deleting MAC/VLAN filters causing
incorrect error status to be printed as ERROR OK instead of
the correct error.

Change calls to i40e_aq_add_macvlan in i40e_aqc_add_filters
and i40e_aq_remove_macvlan in i40e_aqc_del_filters
to _v2 versions that return Admin Queue status on the stack
to avoid race conditions in access to hw->aq.asq_last_status.

Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add new version of i40e_aq_add_macvlan function

ASQ send command functions are returning only i40e status codes
yet some calling functions also need Admin Queue status
that is stored in hw->aq.asq_last_status. Since hw object
is stored on a heap it introduces a possibility for
a race condition in access to hw if calling function is not
fast enough to read hw->aq.asq_last_status before next
send ASQ command is executed.

Add new _v2 version of i40e_aq_add_macvlan that is using
new _v2 versions of ASQ send command functions and returns
the Admin Queue status on the stack.

Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add new versions of send ASQ command functions

ASQ send command functions are returning only i40e status codes
yet some calling functions also need Admin Queue status
that is stored in hw->aq.asq_last_status. Since hw object
is stored on a heap it introduces a possibility for
a race condition in access to hw if calling function is not
fast enough to read hw->aq.asq_last_status before next
send ASQ command is executed.

Add new versions of send ASQ command functions that return
Admin Queue status on the stack to avoid race conditions
in access to hw->aq.asq_last_status.
Add new _v2 version of i40e_aq_remove_macvlan that is using
new _v2 versions of ASQ send command functions and returns
the Admin Queue status on the stack.

Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add sending commands in atomic context

Change functions:
- i40e_aq_add_macvlan
- i40e_aq_remove_macvlan
- i40e_aq_delete_element
- i40e_aq_add_vsi
- i40e_aq_update_vsi_params
to explicitly use i40e_asq_send_command_atomic(..., true)
instead of i40e_asq_send_command, as they use mutexes and do some
work in an atomic context.
Without this change setting vlan via netdev will fail with
call trace cased by bug "BUG: scheduling while atomic".

Signed-off-by: Witold Fijalkowski <witoldx.fijalkowski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: remove enum i40e_client_state

It's not used.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Remove unused RX realloc stat

After commit 1a557afc4dd5 ("i40e: Refactor receive routine"),
rx_stats.realloc_count is no longer being incremented, so remove it.

The debugfs string was left, but hardcoded to 0. This is intended to
prevent breaking any existing code / scripts that are parsing debugfs
for i40e.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Disable hw-tc-offload feature on driver load

After loading driver hw-tc-offload is enabled by default.
Change the behaviour of driver to disable hw-tc-offload by default as
this is the expected state. Additionally since this impacts ntuple
feature state change the way of checking NETIF_F_HW_TC flag.

Signed-off-by: Norbert Zulinski <norbertx.zulinski@intel.com>
Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin()

The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
kernel/printk/sysctl.c introduced an incorrect __user attribute to the
buffer argument.  I spotted this change in [1] as well as the kernel
test robot.  Revert this change to please sparse:

  kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
  kernel/printk/sysctl.c:20:51:    expected void *
  kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer

Fixes: faaa357a55e0 ("printk: move printk sysctl to printk/sysctl.c")
Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "module, async: async_synchronize_full() on module init iff async is used"

This reverts commit 774a1221e862b343388347bac9b318767336b20b.

We need to finish all async code before the module init sequence is
done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule().  Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked.  This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().

For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:

if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);

We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread.  As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.

The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)

modprobe pm80xx                      worker
...
  do_init_module()
  ...
    pci_call_probe()
      work_on_cpu(local_pci_probe)
                                     local_pci_probe()
                                       pm8001_pci_probe()
                                         scsi_scan_host()
                                           async_schedule()
                                           worker->flags |= PF_USED_ASYNC;
                                     ...
      < return from worker >
  ...
  if (current->flags & PF_USED_ASYNC) <--- false
   async_synchronize_full();

Commit 21c3c5d28007 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e862
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.

Since commit 0fdff3ec6d87 ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.

Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.

Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-5.17-fixes' of git://git./linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

- Eric's fix for a long standing cgroup1 permission issue where it only
   checks for uid 0 instead of CAP which inadvertently allows
   unprivileged userns roots to modify release_agent userhelper

- Fixes for the fallout from Waiman's recent cpuset work

* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
  cgroup-v1: Require capabilities to set release_agent
  cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
  cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy

Merge branch 'net-ipa-enable-register-retention'

Alex Elder says:

====================
net: ipa: enable register retention

With runtime power management in place, we sometimes need to issue
a command to enable retention of IPA register values before power
collapse. This requires a new Device Tree property, whose presence
will also be used to signal that the command is required.
====================

Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: request IPA register values be retained

In some cases, the IPA hardware needs to request the always-on
subsystem (AOSS) to coordinate with the IPA microcontroller to
retain IPA register values at power collapse.  This is done by
issuing a QMP request to the AOSS microcontroller.  A similar
request ondoes that request.

We must get and hold the "QMP" handle early, because we might get
back EPROBE_DEFER for that.  But the actual request should be sent
while we know the IPA clock is active, and when we know the
microcontroller is operational.

Fixes: 1aac309d3207 ("net: ipa: use autosuspend")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: qcom,ipa: add optional qcom,qmp property

For some systems, the IPA driver must make a request to ensure that
its registers are retained across power collapse of the IPA hardware.
On such systems, we'll use the existence of the "qcom,qmp" property
as a signal that this request is required.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning

It was found that a "suspicious RCU usage" lockdep warning was issued
with the rcu_read_lock() call in update_sibling_cpumasks(). It is
because the update_cpumasks_hier() function may sleep. So we have
to release the RCU lock, call update_cpumasks_hier() and reacquire
it afterward.

Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
instead of stating that in the comment.

Fixes: 4716909cc5c5 ("cpuset: Track cpusets that use parent's effective_cpus")
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

tools/resolve_btfids: Do not print any commands when building silently

When building with 'make -s', there is some output from resolve_btfids:

$ make -sj"$(nproc)" oldconfig prepare
  MKDIR     .../tools/bpf/resolve_btfids/libbpf/
  MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
  LINK     resolve_btfids

Silent mode means that no information should be emitted about what is
currently being done. Use the $(silent) variable from Makefile.include
to avoid defining the msg macro so that there is no information printed.

Fixes: fbbb68de80a4 ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org