platform/kernel/linux-rpi.git
13 months agonet: introduce and use skb_frag_fill_page_desc()
Yunsheng Lin [Thu, 11 May 2023 01:12:12 +0000 (09:12 +0800)]
net: introduce and use skb_frag_fill_page_desc()

Most users use __skb_frag_set_page()/skb_frag_off_set()/
skb_frag_size_set() to fill the page desc for a skb frag.

Introduce skb_frag_fill_page_desc() to do that.

net/bpf/test_run.c does not call skb_frag_off_set() to
set the offset, "copy_from_user(page_address(page), ...)"
and 'shinfo' being part of the 'data' kzalloced in
bpf_test_init() suggest that it is assuming offset to be
initialized as zero, so call skb_frag_fill_page_desc()
with offset being zero for this case.

Also, skb_frag_set_page() is not used anymore, so remove
it.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: net: vxlan: Add tests for vxlan nolocalbypass option.
Vladimir Nikishkin [Fri, 12 May 2023 03:40:34 +0000 (11:40 +0800)]
selftests: net: vxlan: Add tests for vxlan nolocalbypass option.

Add test to make sure that the localbypass option is on by default.

Add test to change vxlan localbypass to nolocalbypass and check
that packets are delivered to userspace.

Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: vxlan: Add nolocalbypass option to vxlan.
Vladimir Nikishkin [Fri, 12 May 2023 03:40:33 +0000 (11:40 +0800)]
net: vxlan: Add nolocalbypass option to vxlan.

If a packet needs to be encapsulated towards a local destination IP, the
packet will undergo a "local bypass" and be injected into the Rx path as
if it was received by the target VXLAN device without undergoing
encapsulation. If such a device does not exist, the packet will be
dropped.

There are scenarios where we do not want to perform such a bypass, but
instead want the packet to be encapsulated and locally received by a
user space program for post-processing.

To that end, add a new VXLAN device attribute that controls whether a
"local bypass" is performed or not. Default to performing a bypass to
maintain existing behavior.

Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'broadcom-phy-wol'
David S. Miller [Sat, 13 May 2023 15:56:29 +0000 (16:56 +0100)]
Merge branch 'broadcom-phy-wol'

Florian Fainelli says:

====================
Support for Wake-on-LAN for Broadcom PHYs

This patch series adds support for Wake-on-LAN to the Broadcom PHY
driver. Specifically the BCM54210E/B50212E are capable of supporting
Wake-on-LAN using an external pin typically wired up to a system's GPIO.

These PHY operate a programmable Ethernet MAC destination address
comparator which will fire up an interrupt whenever a match is received.
Because of that, it was necessary to introduce patch #1 which allows the
PHY driver's ->suspend() routine to be called unconditionally. This is
necessary in our case because we need a hook point into the device
suspend/resume flow to enable the wake-up interrupt as late as possible.

Patch #2 adds support for the Broadcom PHY library and driver for
Wake-on-LAN proper with the WAKE_UCAST, WAKE_MCAST, WAKE_BCAST,
WAKE_MAGIC and WAKE_MAGICSECURE. Note that WAKE_FILTER is supportable,
however this will require further discussions and be submitted as a RFC
series later on.

Patch #3 updates the GENET driver to defer to the PHY for Wake-on-LAN if
the PHY supports it, thus allowing the MAC to be powered down to
conserve power.

Changes in v3:

- collected Reviewed-by tags
- explicitly use return 0 in bcm54xx_phy_probe() (Paolo)

Changes in v2:

- introduce PHY_ALWAYS_CALL_SUSPEND and only have the Broadcom PHY
  driver set this flag to minimize changes to the suspend flow to only
  drivers that need it

- corrected possibly uninitialized variable in bcm54xx_set_wakeup_irq
  (Simon)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: bcmgenet: Add support for PHY-based Wake-on-LAN
Florian Fainelli [Thu, 11 May 2023 17:21:10 +0000 (10:21 -0700)]
net: bcmgenet: Add support for PHY-based Wake-on-LAN

If available, interrogate the PHY to find out whether we can use it for
Wake-on-LAN. This can be a more power efficient way of implementing
that feature, especially when the MAC is powered off in low power
states.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: phy: broadcom: Add support for Wake-on-LAN
Florian Fainelli [Thu, 11 May 2023 17:21:09 +0000 (10:21 -0700)]
net: phy: broadcom: Add support for Wake-on-LAN

Add support for WAKE_UCAST, WAKE_MCAST, WAKE_BCAST, WAKE_MAGIC and
WAKE_MAGICSECURE. This is only supported with the BCM54210E and
compatible Ethernet PHYs. Using the in-band interrupt or an out of band
GPIO interrupts are supported.

Broadcom PHYs will generate a Wake-on-LAN level low interrupt on LED4 as
soon as one of the supported patterns is being matched. That includes
generating such an interrupt even if the PHY is operated during normal
modes. If WAKE_UCAST is selected, this could lead to the LED4 interrupt
firing up for every packet being received which is absolutely
undesirable from a performance point of view.

Because the Wake-on-LAN configuration can be set long before the system
is actually put to sleep, we cannot have an interrupt service routine to
clear on read the interrupt status register and ensure that new packet
matches will be detected.

It is desirable to enable the Wake-on-LAN interrupt as late as possible
during the system suspend process such that we limit the number of
interrupts to be handled by the system, but also conversely feed into
the Linux's system suspend way of dealing with interrupts in and around
the points of no return.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: phy: Allow drivers to always call into ->suspend()
Florian Fainelli [Thu, 11 May 2023 17:21:08 +0000 (10:21 -0700)]
net: phy: Allow drivers to always call into ->suspend()

A few PHY drivers are currently attempting to not suspend the PHY when
Wake-on-LAN is enabled, however that code is not currently executing at
all due to an early check in phy_suspend().

This prevents PHY drivers from making an appropriate decisions and put
the hardware into a low power state if desired.

In order to allow the PHY drivers to opt into getting their ->suspend
routine to be called, add a PHY_ALWAYS_CALL_SUSPEND bit which can be
set. A boolean that tracks whether the PHY or the attached MAC has
Wake-on-LAN enabled is also provided for convenience.

If phydev::wol_enabled then the PHY shall not prevent its own
Wake-on-LAN detection logic from working and shall not prevent the
Ethernet MAC from receiving packets for matching.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'sfc-decap'
David S. Miller [Fri, 12 May 2023 09:37:02 +0000 (10:37 +0100)]
Merge branch 'sfc-decap'

Edward Cree says:

====================
sfc: more flexible encap matches on TC decap rules

This series extends the TC offload support on EF100 to support optionally
 matching on the IP ToS and UDP source port of the outer header in rules
 performing tunnel decapsulation.  Both of these fields allow masked
 matches if the underlying hardware supports it (current EF100 hardware
 supports masking on ToS, but only exact-match on source port).
Given that the source port is typically populated from a hash of inner
 header entropy, it's not clear whether filtering on it is useful, but
 since we can support it we may as well expose the capability.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosfc: support TC decap rules matching on enc_src_port
Edward Cree [Thu, 11 May 2023 19:47:31 +0000 (20:47 +0100)]
sfc: support TC decap rules matching on enc_src_port

Allow efx_tc_encap_match entries to include a udp_sport and a
 udp_sport_mask.  As with enc_ip_tos, use pseudos to enforce that all
 encap matches within a given <src_ip,dst_ip,udp_dport> tuple have
 the same udp_sport_mask.
Note that since we use a single layer of pseudos for both fields, two
 matches that differ in (say) udp_sport value aren't permitted to have
 different ip_tos_mask, even though this would technically be safe.
Current userland TC does not support setting enc_src_port; this patch
 was tested with an iproute2 patched to support it.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosfc: support TC decap rules matching on enc_ip_tos
Edward Cree [Thu, 11 May 2023 19:47:30 +0000 (20:47 +0100)]
sfc: support TC decap rules matching on enc_ip_tos

Allow efx_tc_encap_match entries to include an ip_tos and ip_tos_mask.
To avoid partially-overlapping Outer Rules (which can lead to undefined
 behaviour in the hardware), store extra "pseudo" entries in our
 encap_match hashtable, which are used to enforce that all Outer Rule
 entries within a given <src_ip,dst_ip,udp_dport> tuple (or IPv6
 equivalent) have the same ip_tos_mask.
The "direct" encap_match entry takes a reference on the "pseudo",
 allowing it to be destroyed when all "direct" entries using it are
 removed.
efx_tc_em_pseudo_type is an enum rather than just a bool because in
 future an additional pseudo-type will be added to support Conntrack
 offload.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosfc: populate enc_ip_tos matches in MAE outer rules
Edward Cree [Thu, 11 May 2023 19:47:29 +0000 (20:47 +0100)]
sfc: populate enc_ip_tos matches in MAE outer rules

Currently tc.c will block them before they get here, but following
 patch will change that.
Use the extack message from efx_mae_check_encap_match_caps() instead
 of writing a new one, since there's now more being fed in than just
 an IP version.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosfc: release encap match in efx_tc_flow_free()
Edward Cree [Thu, 11 May 2023 19:47:28 +0000 (20:47 +0100)]
sfc: release encap match in efx_tc_flow_free()

When force-freeing leftover entries from our match_action_ht, call
 efx_tc_delete_rule(), which releases all the rule's resources, rather
 than open-coding it.  The open-coded version was missing a call to
 release the rule's encap match (if any).
It probably doesn't matter as everything's being torn down anyway, but
 it's cleaner this way and prevents further error messages potentially
 being logged by efx_tc_encap_match_free() later on.
Move efx_tc_flow_free() further down the file to avoid introducing a
 forward declaration of efx_tc_delete_rule().

Fixes: 17654d84b47c ("sfc: add offloading of 'foreign' TC (decap) rules")
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: liquidio: lio_main: Remove unnecessary (void*) conversions
wuych [Fri, 12 May 2023 02:44:03 +0000 (10:44 +0800)]
net: liquidio: lio_main: Remove unnecessary (void*) conversions

Pointer variables of void * type do not require type cast.

Signed-off-by: wuych <yunchuan@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosctp: add bpf_bypass_getsockopt proto callback
Alexander Mikhalitsyn [Thu, 11 May 2023 13:25:06 +0000 (15:25 +0200)]
sctp: add bpf_bypass_getsockopt proto callback

Implement ->bpf_bypass_getsockopt proto callback and filter out
SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3
socket options from running eBPF hook on them.

SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(),
and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of
the original handler sctp_getsockopt(...), userspace will receive an error
from getsockopt syscall and will be not aware that fd was successfully
installed into a fdtable.

As pointed by Marcelo Ricardo Leitner it seems reasonable to skip
bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too.
Because internaly, it triggers connect() and if error is masked
then userspace will be confused.

This patch was born as a result of discussion around a new SCM_PIDFD interface:
https://lore.kernel.org/all/20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com/

Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Suggested-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'selftests-fcnal'
David S. Miller [Fri, 12 May 2023 08:43:56 +0000 (09:43 +0100)]
Merge branch 'selftests-fcnal'

Guillaume Nault says:

====================
selftests: fcnal: Test SO_DONTROUTE socket option.

The objective is to cover kernel paths that use the RTO_ONLINK flag
in .flowi4_tos. This way we'll be able to safely remove this flag in
the future by properly setting .flowi4_scope instead. With these
selftests in place, we can make sure this won't introduce regressions.

For more context, the final objective is to convert .flowi4_tos to
dscp_t, to ensure that ECN bits don't influence route and fib-rule
lookups (see commit a410a0cf9885 ("ipv6: Define dscp_t and stop taking
ECN bits into account in fib6-rules")).

These selftests only cover IPv4, as SO_DONTROUTE has no effect on IPv6
sockets.

v2:
  - Use two different nettest options for setting SO_DONTROUTE either
    on the server or on the client socket.

  - Use the above feature to run a single 'nettest -B' instance per
    test (instead of having two nettest processes for server and
    client).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: fcnal: Test SO_DONTROUTE on raw and ping sockets.
Guillaume Nault [Thu, 11 May 2023 14:39:46 +0000 (16:39 +0200)]
selftests: fcnal: Test SO_DONTROUTE on raw and ping sockets.

Use ping -r to test the kernel behaviour with raw and ping sockets
having the SO_DONTROUTE option.

Since ipv4_ping_novrf() is called with different values of
net.ipv4.ping_group_range, then it tests both raw and ping sockets
(ping uses ping sockets if its user ID belongs to ping_group_range
and raw sockets otherwise).

With both socket types, sending packets to a neighbour (on link) host,
should work. When the host is behind a router, sending should fail.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: fcnal: Test SO_DONTROUTE on UDP sockets.
Guillaume Nault [Thu, 11 May 2023 14:39:39 +0000 (16:39 +0200)]
selftests: fcnal: Test SO_DONTROUTE on UDP sockets.

Use nettest --client-dontroute to test the kernel behaviour with UDP
sockets having the SO_DONTROUTE option. Sending packets to a neighbour
(on link) host, should work. When the host is behind a router, sending
should fail.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: fcnal: Test SO_DONTROUTE on TCP sockets.
Guillaume Nault [Thu, 11 May 2023 14:39:32 +0000 (16:39 +0200)]
selftests: fcnal: Test SO_DONTROUTE on TCP sockets.

Use nettest --{client,server}-dontroute to test the kernel behaviour
with TCP sockets having the SO_DONTROUTE option. Sending packets to a
neighbour (on link) host, should work. When the host is behind a
router, sending should fail.

Client and server sockets are tested independently, so that we can
cover different TCP kernel paths.

SO_DONTROUTE also affects the syncookies path. So ipv4_tcp_dontroute()
is made to work with or without syncookies, to cover both paths.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: Add SO_DONTROUTE option to nettest.
Guillaume Nault [Thu, 11 May 2023 14:39:25 +0000 (16:39 +0200)]
selftests: Add SO_DONTROUTE option to nettest.

Add --client-dontroute and --server-dontroute options to nettest. They
allow to set the SO_DONTROUTE option to the client and server sockets
respectively. This will be used by the following patches to test
the SO_DONTROUTE kernel behaviour with TCP and UDP.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agobonding: Always assign be16 value to vlan_proto
Simon Horman [Thu, 11 May 2023 15:07:12 +0000 (17:07 +0200)]
bonding: Always assign be16 value to vlan_proto

The type of the vlan_proto field is __be16.
And most users of the field use it as such.

In the case of setting or testing the field for the special VLAN_N_VID
value, host byte order is used. Which seems incorrect.

It also seems somewhat odd to store a VLAN ID value in a field that is
otherwise used to store Ether types.

Address this issue by defining BOND_VLAN_PROTO_NONE, a big endian value.
0xffff was chosen somewhat arbitrarily. What is important is that it
doesn't overlap with any valid VLAN Ether types.

I don't believe the problems described above are a bug because
VLAN_N_VID in both little-endian and big-endian byte order does not
conflict with any supported VLAN Ether types in big-endian byte order.

Reported by sparse as:

 .../bond_main.c:2857:26: warning: restricted __be16 degrades to integer
 .../bond_main.c:2863:20: warning: restricted __be16 degrades to integer
 .../bond_main.c:2939:40: warning: incorrect type in assignment (different base types)
 .../bond_main.c:2939:40:    expected restricted __be16 [usertype] vlan_proto
 .../bond_main.c:2939:40:    got int

No functional changes intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'net-handshake-fixes'
David S. Miller [Fri, 12 May 2023 08:24:08 +0000 (09:24 +0100)]
Merge branch 'net-handshake-fixes'

Chuck Lever says:

====================
Bug fixes for net/handshake

Please consider these for merge via net-next.

Paolo observed that there is a possible leak of sock->file. I
haven't looked into that yet, but it seems to be separate from
the fixes in this series, so no need to hold these up.

Changes since v2:
- Address Paolo comment regarding handshake_dup()

Changes since v1:
- Rework "Fix handshake_dup() ref counting"
- Unpin sock->file when a handshake is cancelled
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: Enable the SNI extension to work properly
Chuck Lever [Thu, 11 May 2023 15:49:50 +0000 (11:49 -0400)]
net/handshake: Enable the SNI extension to work properly

Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.

Fixes: 2fd5532044a8 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: Unpin sock->file if a handshake is cancelled
Chuck Lever [Thu, 11 May 2023 15:49:17 +0000 (11:49 -0400)]
net/handshake: Unpin sock->file if a handshake is cancelled

If user space never calls DONE, sock->file's reference count remains
elevated. Enable sock->file to be freed eventually in this case.

Reported-by: Jakub Kacinski <kuba@kernel.org>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: handshake_genl_notify() shouldn't ignore @flags
Chuck Lever [Thu, 11 May 2023 15:48:45 +0000 (11:48 -0400)]
net/handshake: handshake_genl_notify() shouldn't ignore @flags

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: Fix uninitialized local variable
Chuck Lever [Thu, 11 May 2023 15:48:13 +0000 (11:48 -0400)]
net/handshake: Fix uninitialized local variable

trace_handshake_cmd_done_err() simply records the pointer in @req,
so initializing it to NULL is sufficient and safe.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: Fix handshake_dup() ref counting
Chuck Lever [Thu, 11 May 2023 15:47:40 +0000 (11:47 -0400)]
net/handshake: Fix handshake_dup() ref counting

If get_unused_fd_flags() fails, we ended up calling fput(sock->file)
twice.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet/handshake: Remove unneeded check from handshake_dup()
Chuck Lever [Thu, 11 May 2023 15:47:09 +0000 (11:47 -0400)]
net/handshake: Remove unneeded check from handshake_dup()

handshake_req_submit() now verifies that the socket has a file.

Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipvlan: Remove NULL check before dev_{put, hold}
Yang Li [Thu, 11 May 2023 07:21:19 +0000 (15:21 +0800)]
ipvlan: Remove NULL check before dev_{put, hold}

The call netdev_{put, hold} of dev_{put, hold} will check NULL,
so there is no need to check before using dev_{put, hold},
remove it to silence the warning:

./drivers/net/ipvlan/ipvlan_core.c:559:3-11: WARNING: NULL check before dev_{put, hold} functions is not needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4930
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoocteontx2-pf: mcs: Offload extended packet number(XPN) feature
Subbaraya Sundeep [Thu, 11 May 2023 06:17:12 +0000 (11:47 +0530)]
octeontx2-pf: mcs: Offload extended packet number(XPN) feature

The macsec hardware block supports XPN cipher suites also.
Hence added changes to offload XPN feature. Changes include
configuring SecY policy to XPN cipher suite, Salt and SSCI values.
64 bit packet number is passed instead of 32 bit packet number.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: samsung: sxgbe: Make sxgbe_drv_remove() return void
Uwe Kleine-König [Wed, 10 May 2023 20:02:47 +0000 (22:02 +0200)]
net: samsung: sxgbe: Make sxgbe_drv_remove() return void

sxgbe_drv_remove() returned zero unconditionally, so it can be converted
to return void without losing anything. The upside is that it becomes
more obvious in its callers that there is no error to handle.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: enc28j60: Use threaded interrupt instead of workqueue
Philipp Rosenberger [Tue, 9 May 2023 04:28:56 +0000 (06:28 +0200)]
net: enc28j60: Use threaded interrupt instead of workqueue

The Microchip ENC28J60 SPI Ethernet driver schedules a work item from
the interrupt handler because accesses to the SPI bus may sleep.

On PREEMPT_RT (which forces interrupt handling into threads) this
old-fashioned approach unnecessarily increases latency because an
interrupt results in first waking the interrupt thread, then scheduling
the work item.  So, a double indirection to handle an interrupt.

Avoid by converting the driver to modern threaded interrupt handling.

Signed-off-by: Philipp Rosenberger <p.rosenberger@kunbus.com>
Signed-off-by: Zhi Han <hanzhi09@gmail.com>
[lukas: rewrite commit message, linewrap request_threaded_irq() call]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com>
Link: https://lore.kernel.org/r/342380d989ce26bc49f0e5d45fbb0416a5f7809f.1683606193.git.lukas@wunner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 11 May 2023 16:06:26 +0000 (09:06 -0700)]
Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes. No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 11 May 2023 13:42:47 +0000 (08:42 -0500)]
Merge tag 'net-6.4-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  Current release - regressions:

   - mtk_eth_soc: fix NULL pointer dereference

  Previous releases - regressions:

   - core:
      - skb_partial_csum_set() fix against transport header magic value
      - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
      - annotate sk->sk_err write from do_recvmmsg()
      - add vlan_get_protocol_and_depth() helper

   - netlink: annotate accesses to nlk->cb_running

   - netfilter: always release netdev hooks from notifier

  Previous releases - always broken:

   - core: deal with most data-races in sk_wait_event()

   - netfilter: fix possible bug_on with enable_hooks=1

   - eth: bonding: fix send_peer_notif overflow

   - eth: xpcs: fix incorrect number of interfaces

   - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb

   - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register"

* tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
  af_unix: Fix data races around sk->sk_shutdown.
  af_unix: Fix a data race of sk->sk_receive_queue->qlen.
  net: datagram: fix data-races in datagram_poll()
  net: mscc: ocelot: fix stat counter register values
  ipvlan:Fix out-of-bounds caused by unclear skb->cb
  docs: networking: fix x25-iface.rst heading & index order
  gve: Remove the code of clearing PBA bit
  tcp: add annotations around sk->sk_shutdown accesses
  net: add vlan_get_protocol_and_depth() helper
  net: pcs: xpcs: fix incorrect number of interfaces
  net: deal with most data-races in sk_wait_event()
  net: annotate sk->sk_err write from do_recvmmsg()
  netlink: annotate accesses to nlk->cb_running
  kselftest: bonding: add num_grat_arp test
  selftests: forwarding: lib: add netns support for tc rule handle stats get
  Documentation: bonding: fix the doc of peer_notif_delay
  bonding: fix send_peer_notif overflow
  net: ethernet: mtk_eth_soc: fix NULL pointer dereference
  selftests: nft_flowtable.sh: check ingress/egress chain too
  selftests: nft_flowtable.sh: monitor result file sizes
  ...

13 months agoMerge tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Thu, 11 May 2023 13:35:52 +0000 (08:35 -0500)]
Merge tag 'media/v6.4-2' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

 - fix some unused-variable warning in mtk-mdp3

 - ignore unused suspend operations in nxp

 - some driver fixes in rcar-vin

* tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: platform: mtk-mdp3: work around unused-variable warning
  media: nxp: ignore unused suspend operations
  media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
  media: rcar-vin: Fix NV12 size alignment
  media: rcar-vin: Gen3 can not scale NV12

13 months agoMerge branch 'net-mvneta-reduce-size-of-tso-header-allocation'
Paolo Abeni [Thu, 11 May 2023 11:05:19 +0000 (13:05 +0200)]
Merge branch 'net-mvneta-reduce-size-of-tso-header-allocation'

Russell King says:

====================
net: mvneta: reduce size of TSO header allocation

With reference to
https://forum.turris.cz/t/random-kernel-exceptions-on-hbl-tos-7-0/18865/
https://github.com/openwrt/openwrt/pull/12375#issuecomment-1528842334

It appears that mvneta attempts an order-6 allocation for the TSO
header memory. While this succeeds early on in the system's life time,
trying order-6 allocations later can result in failure due to memory
fragmentation.

Firstly, the reason it's so large is that we take the number of
transmit descriptors, and allocate a TSO header buffer for each, and
each TSO header is 256 bytes. The driver uses a simple mechanism to
determine the address - it uses the transmit descriptor index as an
index into the TSO header memory.

(The first obvious question is: do there need to be this
many? Won't each TSO header always have at least one bit
of data to go with it? In other words, wouldn't the maximum
number of TSO headers that a ring could accept be the number
of ring entries divided by 2?)

There is no real need for this memory to be an order-6 allocation,
since nothing in hardware requires this buffer to be contiguous.

Therefore, this series splits this order-6 allocation up into 32
order-1 allocations (8k pages on 4k page platforms), each giving
32 TSO headers per page.

In order to do this, these patches:

1) fix a horrible transmit path error-cleanup bug - the existing
   code unmaps from the first descriptor that was allocated at
   interface bringup, not the first descriptor that the packet
   is using, resulting in the wrong descriptors being unmapped.

2) since xdp support was added, we now have buf->type which indicates
   what this transmit buffer contains. Use this to mark TSO header
   buffers.

3) get rid of IS_TSO_HEADER(), instead using buf->type to determine
   whether this transmit buffer needs to be DMA-unmapped.

4) move tso_build_hdr() into mvneta_tso_put_hdr() to keep all the
   TSO header building code together.

5) split the TSO header allocation into chunks of order-1 pages.

This has now been tested by the Turris folk and has been found to fix
the allocation error.
====================

Link: https://lore.kernel.org/r/ZFtuhJOC03qpASt2@shell.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: mvneta: allocate TSO header DMA memory in chunks
Russell King (Oracle) [Wed, 10 May 2023 10:16:03 +0000 (11:16 +0100)]
net: mvneta: allocate TSO header DMA memory in chunks

Now that we no longer need to check whether the DMA address is within
the TSO header DMA memory range for the queue, we can allocate the TSO
header DMA memory in chunks rather than one contiguous order-6 chunk,
which can stress the kernel's memory subsystems to allocate.

Instead, use order-1 (8k) allocations, which will result in 32 order-1
pages containing 32 TSO headers.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: mvneta: move tso_build_hdr() into mvneta_tso_put_hdr()
Russell King (Oracle) [Wed, 10 May 2023 10:15:58 +0000 (11:15 +0100)]
net: mvneta: move tso_build_hdr() into mvneta_tso_put_hdr()

Move tso_build_hdr() into mvneta_tso_put_hdr() so that all the TSO
header building code is in one place.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: mvneta: use buf->type to determine whether to dma-unmap
Russell King (Oracle) [Wed, 10 May 2023 10:15:53 +0000 (11:15 +0100)]
net: mvneta: use buf->type to determine whether to dma-unmap

Now that we use a different buffer type for TSO headers, we can use
buf->type to determine whether the original buffer was DMA-mapped or
not. The rules are:

MVNETA_TYPE_XDP_TX - from a DMA pool, no unmap is required
MVNETA_TYPE_XDP_NDO - dma_map_single()'d
MVNETA_TYPE_SKB - normal skbuff, dma_map_single()'d
MVNETA_TYPE_TSO - from the TSO buffer area

This means we only need to call dma_unmap_single() on the XDP_NDO and
SKB types of buffer, and we no longer need the private IS_TSO_HEADER()
which relies on the TSO region being contiguously allocated.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: mvneta: mark mapped and tso buffers separately
Russell King (Oracle) [Wed, 10 May 2023 10:15:48 +0000 (11:15 +0100)]
net: mvneta: mark mapped and tso buffers separately

Mark dma-mapped skbs and TSO buffers separately, so we can use
buf->type to identify their differences.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: mvneta: fix transmit path dma-unmapping on error
Russell King (Oracle) [Wed, 10 May 2023 10:15:42 +0000 (11:15 +0100)]
net: mvneta: fix transmit path dma-unmapping on error

The transmit code assumes that the transmit descriptors that are used
begin with the first descriptor in the ring, but this may not be the
case. Fix this by providing a new function that dma-unmaps a range of
numbered descriptor entries, and use that to do the unmapping.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agotcp: make the first N SYN RTO backoffs linear
David Morley [Tue, 9 May 2023 18:05:58 +0000 (18:05 +0000)]
tcp: make the first N SYN RTO backoffs linear

Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: wwan: iosm: clean up unused struct members
M Chetan Kumar [Tue, 9 May 2023 16:36:35 +0000 (22:06 +0530)]
net: wwan: iosm: clean up unused struct members

Below members are unused.
- td_tag member defined in struct ipc_pipe.
- adb_finish_timer & params defined in struct iosm_mux.

Remove it to avoid unexpected usage.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/92ee483d79dfc871ed7408da8fec60b395ff3a9c.1683649868.git.m.chetan.kumar@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: wwan: iosm: remove unused enum definition
M Chetan Kumar [Tue, 9 May 2023 16:36:22 +0000 (22:06 +0530)]
net: wwan: iosm: remove unused enum definition

ipc_time_unit enum is defined but not used.
Remove it to avoid unexpected usage.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/8295a6138f13c686590ee4021384ee992f717408.1683649868.git.m.chetan.kumar@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: wwan: iosm: remove unused macro definition
M Chetan Kumar [Tue, 9 May 2023 16:35:55 +0000 (22:05 +0530)]
net: wwan: iosm: remove unused macro definition

IOSM_IF_ID_PAYLOAD is defined but not used.
Remove it to avoid unexpected usage.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/0697e811cb7f10b4fd8f99e66bda1329efdd3d1d.1683649868.git.m.chetan.kumar@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Jakub Kicinski [Thu, 11 May 2023 02:08:58 +0000 (19:08 -0700)]
Merge tag 'nf-23-05-10' of git://git./linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter updates for net

The following patchset contains Netfilter fixes for net:

1) Fix UAF when releasing netnamespace, from Florian Westphal.

2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks,
   from Florian Westphal.

3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko.

4) Extend nft_flowtable.sh selftest to cover integration with
   ingress/egress hooks, from Florian Westphal.

* tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: nft_flowtable.sh: check ingress/egress chain too
  selftests: nft_flowtable.sh: monitor result file sizes
  selftests: nft_flowtable.sh: wait for specific nc pids
  selftests: nft_flowtable.sh: no need for ps -x option
  selftests: nft_flowtable.sh: use /proc for pid checking
  netfilter: conntrack: fix possible bug_on with enable_hooks=1
  netfilter: nf_tables: always release netdev hooks from notifier
====================

Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge branch 'af_unix-fix-two-data-races-reported-by-kcsan'
Jakub Kicinski [Thu, 11 May 2023 02:06:55 +0000 (19:06 -0700)]
Merge branch 'af_unix-fix-two-data-races-reported-by-kcsan'

Kuniyuki Iwashima says:

====================
af_unix: Fix two data races reported by KCSAN.

KCSAN reported data races around these two fields for AF_UNIX sockets.

  * sk->sk_receive_queue->qlen
  * sk->sk_shutdown

Let's annotate them properly.
====================

Link: https://lore.kernel.org/r/20230510003456.42357-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoaf_unix: Fix data races around sk->sk_shutdown.
Kuniyuki Iwashima [Wed, 10 May 2023 00:34:56 +0000 (17:34 -0700)]
af_unix: Fix data races around sk->sk_shutdown.

KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
and unix_dgram_poll() read it locklessly.

We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().

BUG: KCSAN: data-race in unix_poll / unix_release_sock

write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
 unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
 unix_release+0x59/0x80 net/unix/af_unix.c:1042
 __sock_release+0x7d/0x170 net/socket.c:653
 sock_close+0x19/0x30 net/socket.c:1397
 __fput+0x179/0x5e0 fs/file_table.c:321
 ____fput+0x15/0x20 fs/file_table.c:349
 task_work_run+0x116/0x1a0 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
 syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
 unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
 sock_poll+0xcf/0x2b0 net/socket.c:1385
 vfs_poll include/linux/poll.h:88 [inline]
 ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
 ep_send_events fs/eventpoll.c:1694 [inline]
 ep_poll fs/eventpoll.c:1823 [inline]
 do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
 __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x00 -> 0x03

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoaf_unix: Fix a data race of sk->sk_receive_queue->qlen.
Kuniyuki Iwashima [Wed, 10 May 2023 00:34:55 +0000 (17:34 -0700)]
af_unix: Fix a data race of sk->sk_receive_queue->qlen.

KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
updates qlen under the queue lock and sendmsg() checks qlen under
unix_state_sock(), not the queue lock, so the reader side needs
READ_ONCE().

BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer

write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
 __skb_unlink include/linux/skbuff.h:2347 [inline]
 __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
 __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
 __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
 unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
 sock_recvmsg_nosec net/socket.c:1019 [inline]
 ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
 ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
 do_recvmmsg+0x182/0x560 net/socket.c:2858
 __sys_recvmmsg net/socket.c:2937 [inline]
 __do_sys_recvmmsg net/socket.c:2960 [inline]
 __se_sys_recvmmsg net/socket.c:2953 [inline]
 __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
 skb_queue_len include/linux/skbuff.h:2127 [inline]
 unix_recvq_full net/unix/af_unix.c:229 [inline]
 unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
 unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg+0x148/0x160 net/socket.c:747
 ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
 ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
 __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
 __do_sys_sendmmsg net/socket.c:2672 [inline]
 __se_sys_sendmmsg net/socket.c:2669 [inline]
 __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x0000000b -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: datagram: fix data-races in datagram_poll()
Eric Dumazet [Tue, 9 May 2023 17:31:31 +0000 (17:31 +0000)]
net: datagram: fix data-races in datagram_poll()

datagram_poll() runs locklessly, we should add READ_ONCE()
annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMAINTAINERS: re-sort all entries and fields
Linus Torvalds [Thu, 11 May 2023 01:04:04 +0000 (20:04 -0500)]
MAINTAINERS: re-sort all entries and fields

It's been a few years since we've sorted this thing, and the end result
is that we've added MAINTAINERS entries in the wrong order, and a number
of entries have their fields in non-canonical order too.

So roll this boulder up the hill one more time by re-running

   ./scripts/parse-maintainers.pl --order

on it.

This file ends up being fairly painful for merge conflicts even
normally, since unlike almost all other kernel files it's one of those
"everybody touches the same thing", and re-ordering all entries is only
going to make that worse.  But the alternative is to never do it at all,
and just let it all rot..

The rc2 week is likely the quietest and least painful time to do this.

Requested-by: Randy Dunlap <rdunlap@infradead.org>
Requested-by: Joe Perches <joe@perches.com> # "Please use --order"
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 months agoMerge tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 10 May 2023 22:07:42 +0000 (17:07 -0500)]
Merge tag 'fsnotify_for_v6.4-rc2' of git://git./linux/kernel/git/jack/linux-fs

Pull inotify fix from Jan Kara:
 "A fix for possibly reporting invalid watch descriptor with inotify
  event"

* tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Avoid reporting event with invalid wd

13 months agoMerge tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux...
Linus Torvalds [Wed, 10 May 2023 21:48:35 +0000 (16:48 -0500)]
Merge tag 'gfs2-v6.3-fix' of git://git./linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Fix a NULL pointer dereference when mounting corrupted filesystems

* tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't deref jdesc in evict

13 months agogfs2: Don't deref jdesc in evict
Bob Peterson [Fri, 28 Apr 2023 16:07:46 +0000 (12:07 -0400)]
gfs2: Don't deref jdesc in evict

On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:

init_journal()
...
fail_jindex:
   gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
      if (gfs2_holder_initialized(&ji_gh))
         gfs2_glock_dq_uninit(&ji_gh);
fail:
   iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
      evict()
         gfs2_evict_inode()
            evict_linked_inode()
               ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.

The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.

This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.

Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
13 months agoMerge tag 'platform-drivers-x86-v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 10 May 2023 14:36:42 +0000 (09:36 -0500)]
Merge tag 'platform-drivers-x86-v6.4-2' of git://git./linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:
 "Nothing special to report just various small fixes:

   - thinkpad_acpi: Fix profile (performance/bal/low-power) regression
     on T490

   - misc other small fixes / hw-id additions"

* tag 'platform-drivers-x86-v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/mellanox: fix potential race in mlxbf-tmfifo driver
  platform/x86: touchscreen_dmi: Add info for the Dexp Ursus KX210i
  platform/x86: touchscreen_dmi: Add upside-down quirk for GDIX1002 ts on the Juno Tablet
  platform/x86: thinkpad_acpi: Add profile force ability
  platform/x86: thinkpad_acpi: Fix platform profiles on T490
  platform/x86: hp-wmi: add micmute to hp_wmi_keymap struct
  platform/x86/intel-uncore-freq: Return error on write frequency
  platform/x86: intel_scu_pcidrv: Add back PCI ID for Medfield

13 months agonet: mscc: ocelot: fix stat counter register values
Colin Foster [Wed, 10 May 2023 04:48:51 +0000 (21:48 -0700)]
net: mscc: ocelot: fix stat counter register values

Commit d4c367650704 ("net: mscc: ocelot: keep ocelot_stat_layout by reg
address, not offset") organized the stats counters for Ocelot chips, namely
the VSC7512 and VSC7514. A few of the counter offsets were incorrect, and
were caught by this warning:

WARNING: CPU: 0 PID: 24 at drivers/net/ethernet/mscc/ocelot_stats.c:909
ocelot_stats_init+0x1fc/0x2d8
reg 0x5000078 had address 0x220 but reg 0x5000079 has address 0x214,
bulking broken!

Fix these register offsets.

Fixes: d4c367650704 ("net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset")
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agosctp: fix a potential OOB access in sctp_sched_set_sched()
Ilia.Gavrilov [Wed, 10 May 2023 09:23:40 +0000 (09:23 +0000)]
sctp: fix a potential OOB access in sctp_sched_set_sched()

The 'sched' index value must be checked before accessing an element
of the 'sctp_sched_ops' array. Otherwise, it can lead to OOB access.

Note that it's harmless since the 'sched' parameter is checked before
calling 'sctp_sched_set_sched'.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: liquidio: lio_vf_main: Remove unnecessary (void*) conversions
wuych [Wed, 10 May 2023 06:06:49 +0000 (14:06 +0800)]
net: liquidio: lio_vf_main: Remove unnecessary (void*) conversions

Pointer variables of void * type do not require type cast.

Signed-off-by: wuych <yunchuan@nfschina.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agomacsec: Use helper macsec_netdev_priv for offload drivers
Subbaraya Sundeep [Wed, 10 May 2023 08:28:09 +0000 (13:58 +0530)]
macsec: Use helper macsec_netdev_priv for offload drivers

Now macsec on top of vlan can be offloaded to macsec offloading
devices so that VLAN tag is sent in clear text on wire i.e,
packet structure is DMAC|SMAC|VLAN|SECTAG. Offloading devices can
simply enable NETIF_F_HW_MACSEC feature in netdev->vlan_features for
this to work. But the logic in offloading drivers to retrieve the
private structure from netdev needs to be changed to check whether
the netdev received is real device or a vlan device and get private
structure accordingly. This patch changes the offloading drivers to
use helper macsec_netdev_priv instead of netdev_priv.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipvlan:Fix out-of-bounds caused by unclear skb->cb
t.feng [Wed, 10 May 2023 03:50:44 +0000 (11:50 +0800)]
ipvlan:Fix out-of-bounds caused by unclear skb->cb

If skb enqueue the qdisc, fq_skb_cb(skb)->time_to_send is changed which
is actually skb->cb, and IPCB(skb_in)->opt will be used in
__ip_options_echo. It is possible that memcpy is out of bounds and lead
to stack overflow.
We should clear skb->cb before ip_local_out or ip6_local_out.

v2:
1. clean the stack info
2. use IPCB/IP6CB instead of skb->cb

crash on stable-5.10(reproduce in kasan kernel).
Stack info:
[ 2203.651571] BUG: KASAN: stack-out-of-bounds in
__ip_options_echo+0x589/0x800
[ 2203.653327] Write of size 4 at addr ffff88811a388f27 by task
swapper/3/0
[ 2203.655460] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted
5.10.0-60.18.0.50.h856.kasan.eulerosv2r11.x86_64 #1
[ 2203.655466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
[ 2203.655475] Call Trace:
[ 2203.655481]  <IRQ>
[ 2203.655501]  dump_stack+0x9c/0xd3
[ 2203.655514]  print_address_description.constprop.0+0x19/0x170
[ 2203.655530]  __kasan_report.cold+0x6c/0x84
[ 2203.655586]  kasan_report+0x3a/0x50
[ 2203.655594]  check_memory_region+0xfd/0x1f0
[ 2203.655601]  memcpy+0x39/0x60
[ 2203.655608]  __ip_options_echo+0x589/0x800
[ 2203.655654]  __icmp_send+0x59a/0x960
[ 2203.655755]  nf_send_unreach+0x129/0x3d0 [nf_reject_ipv4]
[ 2203.655763]  reject_tg+0x77/0x1bf [ipt_REJECT]
[ 2203.655772]  ipt_do_table+0x691/0xa40 [ip_tables]
[ 2203.655821]  nf_hook_slow+0x69/0x100
[ 2203.655828]  __ip_local_out+0x21e/0x2b0
[ 2203.655857]  ip_local_out+0x28/0x90
[ 2203.655868]  ipvlan_process_v4_outbound+0x21e/0x260 [ipvlan]
[ 2203.655931]  ipvlan_xmit_mode_l3+0x3bd/0x400 [ipvlan]
[ 2203.655967]  ipvlan_queue_xmit+0xb3/0x190 [ipvlan]
[ 2203.655977]  ipvlan_start_xmit+0x2e/0xb0 [ipvlan]
[ 2203.655984]  xmit_one.constprop.0+0xe1/0x280
[ 2203.655992]  dev_hard_start_xmit+0x62/0x100
[ 2203.656000]  sch_direct_xmit+0x215/0x640
[ 2203.656028]  __qdisc_run+0x153/0x1f0
[ 2203.656069]  __dev_queue_xmit+0x77f/0x1030
[ 2203.656173]  ip_finish_output2+0x59b/0xc20
[ 2203.656244]  __ip_finish_output.part.0+0x318/0x3d0
[ 2203.656312]  ip_finish_output+0x168/0x190
[ 2203.656320]  ip_output+0x12d/0x220
[ 2203.656357]  __ip_queue_xmit+0x392/0x880
[ 2203.656380]  __tcp_transmit_skb+0x1088/0x11c0
[ 2203.656436]  __tcp_retransmit_skb+0x475/0xa30
[ 2203.656505]  tcp_retransmit_skb+0x2d/0x190
[ 2203.656512]  tcp_retransmit_timer+0x3af/0x9a0
[ 2203.656519]  tcp_write_timer_handler+0x3ba/0x510
[ 2203.656529]  tcp_write_timer+0x55/0x180
[ 2203.656542]  call_timer_fn+0x3f/0x1d0
[ 2203.656555]  expire_timers+0x160/0x200
[ 2203.656562]  run_timer_softirq+0x1f4/0x480
[ 2203.656606]  __do_softirq+0xfd/0x402
[ 2203.656613]  asm_call_irq_on_stack+0x12/0x20
[ 2203.656617]  </IRQ>
[ 2203.656623]  do_softirq_own_stack+0x37/0x50
[ 2203.656631]  irq_exit_rcu+0x134/0x1a0
[ 2203.656639]  sysvec_apic_timer_interrupt+0x36/0x80
[ 2203.656646]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2203.656654] RIP: 0010:default_idle+0x13/0x20
[ 2203.656663] Code: 89 f0 5d 41 5c 41 5d 41 5e c3 cc cc cc cc cc cc cc
cc cc cc cc cc cc 0f 1f 44 00 00 0f 1f 44 00 00 0f 00 2d 9f 32 57 00 fb
f4 <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 be 08
[ 2203.656668] RSP: 0018:ffff88810036fe78 EFLAGS: 00000256
[ 2203.656676] RAX: ffffffffaf2a87f0 RBX: ffff888100360000 RCX:
ffffffffaf290191
[ 2203.656681] RDX: 0000000000098b5e RSI: 0000000000000004 RDI:
ffff88811a3c4f60
[ 2203.656686] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffff88811a3c4f63
[ 2203.656690] R10: ffffed10234789ec R11: 0000000000000001 R12:
0000000000000003
[ 2203.656695] R13: ffff888100360000 R14: 0000000000000000 R15:
0000000000000000
[ 2203.656729]  default_idle_call+0x5a/0x150
[ 2203.656735]  cpuidle_idle_call+0x1c6/0x220
[ 2203.656780]  do_idle+0xab/0x100
[ 2203.656786]  cpu_startup_entry+0x19/0x20
[ 2203.656793]  secondary_startup_64_no_verify+0xc2/0xcb

[ 2203.657409] The buggy address belongs to the page:
[ 2203.658648] page:0000000027a9842f refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x11a388
[ 2203.658665] flags:
0x17ffffc0001000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
[ 2203.658675] raw: 0017ffffc0001000 ffffea000468e208 ffffea000468e208
0000000000000000
[ 2203.658682] raw: 0000000000000000 0000000000000000 00000001ffffffff
0000000000000000
[ 2203.658686] page dumped because: kasan: bad access detected

To reproduce(ipvlan with IPVLAN_MODE_L3):
Env setting:
=======================================================
modprobe ipvlan ipvlan_default_mode=1
sysctl net.ipv4.conf.eth0.forwarding=1
iptables -t nat -A POSTROUTING -s 20.0.0.0/255.255.255.0 -o eth0 -j
MASQUERADE
ip link add gw link eth0 type ipvlan
ip -4 addr add 20.0.0.254/24 dev gw
ip netns add net1
ip link add ipv1 link eth0 type ipvlan
ip link set ipv1 netns net1
ip netns exec net1 ip link set ipv1 up
ip netns exec net1 ip -4 addr add 20.0.0.4/24 dev ipv1
ip netns exec net1 route add default gw 20.0.0.254
ip netns exec net1 tc qdisc add dev ipv1 root netem loss 10%
ifconfig gw up
iptables -t filter -A OUTPUT -p tcp --dport 8888 -j REJECT --reject-with
icmp-port-unreachable
=======================================================
And then excute the shell(curl any address of eth0 can reach):

for((i=1;i<=100000;i++))
do
        ip netns exec net1 curl x.x.x.x:8888
done
=======================================================

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: "t.feng" <fengtao40@huawei.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agodocs: networking: fix x25-iface.rst heading & index order
Randy Dunlap [Wed, 10 May 2023 02:29:14 +0000 (19:29 -0700)]
docs: networking: fix x25-iface.rst heading & index order

Fix the chapter heading for "X.25 Device Driver Interface" so that it
does not contain a trailing '-' character, which makes Sphinx
omit this heading from the contents.

Reverse the order of the x25.rst and x25-iface.rst files in the index
so that the project introduction (x25.rst) comes first.

Fixes: 883780af7209 ("docs: networking: convert x25-iface.txt to ReST")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: linux-x25@vger.kernel.org
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agogve: Remove the code of clearing PBA bit
Ziwei Xiao [Tue, 9 May 2023 22:51:23 +0000 (15:51 -0700)]
gve: Remove the code of clearing PBA bit

Clearing the PBA bit from the driver is race prone and it may lead to
dropped interrupt events. This could potentially lead to the traffic
being completely halted.

Fixes: 5e8c5adf95f8 ("gve: DQO: Add core netdev features")
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agotcp: add annotations around sk->sk_shutdown accesses
Eric Dumazet [Tue, 9 May 2023 20:36:56 +0000 (20:36 +0000)]
tcp: add annotations around sk->sk_shutdown accesses

Now sk->sk_shutdown is no longer a bitfield, we can add
standard READ_ONCE()/WRITE_ONCE() annotations to silence
KCSAN reports like the following:

BUG: KCSAN: data-race in tcp_disconnect / tcp_poll

write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1:
tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121
__inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715
inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727
__sys_connect_file net/socket.c:2001 [inline]
__sys_connect+0x19b/0x1b0 net/socket.c:2018
__do_sys_connect net/socket.c:2028 [inline]
__se_sys_connect net/socket.c:2025 [inline]
__x64_sys_connect+0x41/0x50 net/socket.c:2025
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0:
tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562
sock_poll+0x253/0x270 net/socket.c:1383
vfs_poll include/linux/poll.h:88 [inline]
io_poll_check_events io_uring/poll.c:281 [inline]
io_poll_task_func+0x15a/0x820 io_uring/poll.c:333
handle_tw_list io_uring/io_uring.c:1184 [inline]
tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246
task_work_run+0x123/0x160 kernel/task_work.c:179
get_signal+0xe64/0xff0 kernel/signal.c:2635
arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168
exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x03 -> 0x00

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: add vlan_get_protocol_and_depth() helper
Eric Dumazet [Tue, 9 May 2023 13:18:57 +0000 (13:18 +0000)]
net: add vlan_get_protocol_and_depth() helper

Before blamed commit, pskb_may_pull() was used instead
of skb_header_pointer() in __vlan_get_protocol() and friends.

Few callers depended on skb->head being populated with MAC header,
syzbot caught one of them (skb_mac_gso_segment())

Add vlan_get_protocol_and_depth() to make the intent clearer
and use it where sensible.

This is a more generic fix than commit e9d3f80935b6
("net/af_packet: make sure to pull mac header") which was
dealing with a similar issue.

kernel BUG at include/linux/skbuff.h:2655 !
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline]
RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136
Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286
RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9
R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012
R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7
FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
[<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419
[<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline]
[<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725
[<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313
[<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029
[<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline]
[<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142
[<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline]
[<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline]
[<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139
[<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline]
[<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline]
[<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
[<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
[<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 469aceddfa3e ("vlan: consolidate VLAN parsing code and limit max parsing depth")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: pcs: xpcs: fix incorrect number of interfaces
Russell King (Oracle) [Tue, 9 May 2023 11:50:04 +0000 (12:50 +0100)]
net: pcs: xpcs: fix incorrect number of interfaces

In synopsys_xpcs_compat[], the DW_XPCS_2500BASEX entry was setting
the number of interfaces using the xpcs_2500basex_features array
rather than xpcs_2500basex_interfaces. This causes us to overflow
the array of interfaces. Fix this.

Fixes: f27abde3042a ("net: pcs: add 2500BASEX support for Intel mGbE controller")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: bonding: delete unnecessary line
Liang Li [Tue, 9 May 2023 09:09:19 +0000 (09:09 +0000)]
selftests: bonding: delete unnecessary line

"ip link set dev "$devbond1" nomaster"
This line code in bond-eth-type-change.sh is unnecessary.
Because $devbond1 was not added to any master device.

Signed-off-by: Liang Li <liali@redhat.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: deal with most data-races in sk_wait_event()
Eric Dumazet [Tue, 9 May 2023 18:29:48 +0000 (18:29 +0000)]
net: deal with most data-races in sk_wait_event()

__condition is evaluated twice in sk_wait_event() macro.

First invocation is lockless, and reads can race with writes,
as spotted by syzbot.

BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect

write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1:
tcp_disconnect+0x2cd/0xdb0
inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911
__sys_shutdown_sock net/socket.c:2343 [inline]
__sys_shutdown net/socket.c:2355 [inline]
__do_sys_shutdown net/socket.c:2363 [inline]
__se_sys_shutdown+0xf8/0x140 net/socket.c:2361
__x64_sys_shutdown+0x31/0x40 net/socket.c:2361
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0:
sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75
tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x246/0x300 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00000000 -> 0x00000068

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: annotate sk->sk_err write from do_recvmmsg()
Eric Dumazet [Tue, 9 May 2023 16:35:53 +0000 (16:35 +0000)]
net: annotate sk->sk_err write from do_recvmmsg()

do_recvmmsg() can write to sk->sk_err from multiple threads.

As said before, many other points reading or writing sk_err
need annotations.

Fixes: 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: veth: make PAGE_POOL_STATS optional
Lorenzo Bianconi [Tue, 9 May 2023 09:05:16 +0000 (11:05 +0200)]
net: veth: make PAGE_POOL_STATS optional

Since veth is very likely to be enabled and there are some drivers
(e.g. mlx5) where CONFIG_PAGE_POOL_STATS is optional, make
CONFIG_PAGE_POOL_STATS optional for veth too in order to keep it
optional instead of required.

Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'lan966x-es0-vcap'
David S. Miller [Wed, 10 May 2023 08:51:11 +0000 (09:51 +0100)]
Merge branch 'lan966x-es0-vcap'

Horatiu Vultur says:

====================
net: lan966x: Add support for ES0 VCAP

Provide the Egress Stage 0 (ES0) VCAP (Versatile Content-Aware
Processor) support for the lan966x platform.

The ES0 VCAP has only 1 lookup which is accessible with a TC chain
id 10000000.

Currently only one action is support which is vlan pop. Also it is
possible to link the IS1 to ES0 using 'goto chain 10000000'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: lan966x: Add TC support for ES0 VCAP
Horatiu Vultur [Tue, 9 May 2023 07:26:45 +0000 (09:26 +0200)]
net: lan966x: Add TC support for ES0 VCAP

Enable the TC command to use the lan966x ES0 VCAP. Currently support
only one action which is vlan pop, other will be added later.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: lan966x: Add ES0 VCAP keyset configuration for lan966x
Horatiu Vultur [Tue, 9 May 2023 07:26:44 +0000 (09:26 +0200)]
net: lan966x: Add ES0 VCAP keyset configuration for lan966x

Add ES0 VCAP port keyset configuration for lan966x and also update
debugfs to show the keyset configuration.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: lan966x: Add ES0 VCAP model
Horatiu Vultur [Tue, 9 May 2023 07:26:43 +0000 (09:26 +0200)]
net: lan966x: Add ES0 VCAP model

Provide ES0 (egress stage 0) VCAP model for lan966x.
This provides rewriting functionality in the gress path.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetlink: annotate accesses to nlk->cb_running
Eric Dumazet [Tue, 9 May 2023 16:56:34 +0000 (16:56 +0000)]
netlink: annotate accesses to nlk->cb_running

Both netlink_recvmsg() and netlink_native_seq_show() read
nlk->cb_running locklessly. Use READ_ONCE() there.

Add corresponding WRITE_ONCE() to netlink_dump() and
__netlink_dump_start()

syzbot reported:
BUG: KCSAN: data-race in __netlink_dump_start / netlink_recvmsg

write to 0xffff88813ea4db59 of 1 bytes by task 28219 on cpu 0:
__netlink_dump_start+0x3af/0x4d0 net/netlink/af_netlink.c:2399
netlink_dump_start include/linux/netlink.h:308 [inline]
rtnetlink_rcv_msg+0x70f/0x8c0 net/core/rtnetlink.c:6130
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2577
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6192
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
sock_write_iter+0x1aa/0x230 net/socket.c:1138
call_write_iter include/linux/fs.h:1851 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x463/0x760 fs/read_write.c:584
ksys_write+0xeb/0x1a0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x42/0x50 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88813ea4db59 of 1 bytes by task 28222 on cpu 1:
netlink_recvmsg+0x3b4/0x730 net/netlink/af_netlink.c:2022
sock_recvmsg_nosec+0x4c/0x80 net/socket.c:1017
____sys_recvmsg+0x2db/0x310 net/socket.c:2718
___sys_recvmsg net/socket.c:2762 [inline]
do_recvmmsg+0x2e5/0x710 net/socket.c:2856
__sys_recvmmsg net/socket.c:2935 [inline]
__do_sys_recvmmsg net/socket.c:2958 [inline]
__se_sys_recvmmsg net/socket.c:2951 [inline]
__x64_sys_recvmmsg+0xe2/0x160 net/socket.c:2951
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00 -> 0x01

Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'bonding-overflow'
David S. Miller [Wed, 10 May 2023 08:27:21 +0000 (09:27 +0100)]
Merge branch 'bonding-overflow'

Hangbin Liu says:

====================
bonding: fix send_peer_notif overflow

Bonding send_peer_notif was defined as u8. But the value is
num_peer_notif multiplied by peer_notif_delay, which is u8 * u32.
This would cause the send_peer_notif overflow.

Before the fix:
TEST: num_grat_arp (active-backup miimon num_grat_arp 10)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 20)           [ OK ]
4 garp packets sent on active slave eth1
TEST: num_grat_arp (active-backup miimon num_grat_arp 30)           [FAIL]
24 garp packets sent on active slave eth1
TEST: num_grat_arp (active-backup miimon num_grat_arp 50)           [FAIL]

After the fix:
TEST: num_grat_arp (active-backup miimon num_grat_arp 10)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 20)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 30)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 50)           [ OK ]
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agokselftest: bonding: add num_grat_arp test
Hangbin Liu [Tue, 9 May 2023 03:12:00 +0000 (11:12 +0800)]
kselftest: bonding: add num_grat_arp test

TEST: num_grat_arp (active-backup miimon num_grat_arp 10)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 20)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 30)           [ OK ]
TEST: num_grat_arp (active-backup miimon num_grat_arp 50)           [ OK ]

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: forwarding: lib: add netns support for tc rule handle stats get
Hangbin Liu [Tue, 9 May 2023 03:11:59 +0000 (11:11 +0800)]
selftests: forwarding: lib: add netns support for tc rule handle stats get

When run the test in netns, it's not easy to get the tc stats via
tc_rule_handle_stats_get(). With the new netns parameter, we can get
stats from specific netns like

  num=$(tc_rule_handle_stats_get "dev eth0 ingress" 101 ".packets" "-n ns")

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoDocumentation: bonding: fix the doc of peer_notif_delay
Hangbin Liu [Tue, 9 May 2023 03:11:58 +0000 (11:11 +0800)]
Documentation: bonding: fix the doc of peer_notif_delay

Bonding only supports setting peer_notif_delay with miimon set.

Fixes: 0307d589c4d6 ("bonding: add documentation for peer_notif_delay")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agobonding: fix send_peer_notif overflow
Hangbin Liu [Tue, 9 May 2023 03:11:57 +0000 (11:11 +0800)]
bonding: fix send_peer_notif overflow

Bonding send_peer_notif was defined as u8. Since commit 07a4ddec3ce9
("bonding: add an option to specify a delay between peer notifications").
the bond->send_peer_notif will be num_peer_notif multiplied by
peer_notif_delay, which is u8 * u32. This would cause the send_peer_notif
overflow easily. e.g.

  ip link add bond0 type bond mode 1 miimon 100 num_grat_arp 30 peer_notify_delay 1000

To fix the overflow, let's set the send_peer_notif to u32 and limit
peer_notif_delay to 300s.

Reported-by: Liang Li <liali@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2090053
Fixes: 07a4ddec3ce9 ("bonding: add an option to specify a delay between peer notifications")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: ethernet: mtk_eth_soc: fix NULL pointer dereference
Daniel Golle [Tue, 9 May 2023 01:20:06 +0000 (03:20 +0200)]
net: ethernet: mtk_eth_soc: fix NULL pointer dereference

Check for NULL pointer to avoid kernel crashing in case of missing WO
firmware in case only a single WEDv2 device has been initialized, e.g. on
MT7981 which can connect just one wireless frontend.

Fixes: 86ce0d09e424 ("net: ethernet: mtk_eth_soc: use WO firmware for MT7981")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: ipconfig: Allow DNS to be overwritten by DHCPACK
Martin Wetterwald [Mon, 8 May 2023 17:44:47 +0000 (19:44 +0200)]
net: ipconfig: Allow DNS to be overwritten by DHCPACK

Some DHCP server implementations only send the important requested DHCP
options in the final BOOTP reply (DHCPACK).
One example is systemd-networkd.
However, RFC2131, in section 4.3.1 states:

> The server MUST return to the client:
> [...]
> o Parameters requested by the client, according to the following
>   rules:
>
>      -- IF the server has been explicitly configured with a default
>         value for the parameter, the server MUST include that value
>         in an appropriate option in the 'option' field, ELSE

I've reported the issue here:
https://github.com/systemd/systemd/issues/27471

Linux PNP DHCP client implementation only takes into account the DNS
servers received in the first BOOTP reply (DHCPOFFER).
This usually isn't an issue as servers are required to put the same
values in the DHCPOFFER and DHCPACK.
However, RFC2131, in section 4.3.2 states:

> Any configuration parameters in the DHCPACK message SHOULD NOT
> conflict with those in the earlier DHCPOFFER message to which the
> client is responding.  The client SHOULD use the parameters in the
> DHCPACK message for configuration.

When making Linux PNP DHCP client (cmdline ip=dhcp) interact with
systemd-networkd DHCP server, an interesting "protocol misunderstanding"
happens:
Because DNS servers were only specified in the DHCPACK and not in the
DHCPOFFER, Linux will not catch the correct DNS servers: in the first
BOOTP reply (DHCPOFFER), it sees that there is no DNS, and sets as
fallback the IP of the DHCP server itself. When the second BOOTP reply
comes (DHCPACK), it's already too late: the kernel will not overwrite
the fallback setting it has set previously.

This patch makes the kernel overwrite its DNS fallback by DNS servers
specified in the DHCPACK if any.

Signed-off-by: Martin Wetterwald <martin@wetterwald.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: nft_flowtable.sh: check ingress/egress chain too
Florian Westphal [Tue, 9 May 2023 14:47:24 +0000 (16:47 +0200)]
selftests: nft_flowtable.sh: check ingress/egress chain too

Make sure flowtable interacts correctly with ingress and egress
chains, i.e. those get handled before and after flow table respectively.

Adds three more tests:
1. repeat flowtable test, but with 'ip dscp set cs3' done in
   inet forward chain.

Expect that some packets have been mangled (before flowtable offload
became effective) while some pass without mangling (after offload
succeeds).

2. repeat flowtable test, but with 'ip dscp set cs3' done in
   veth0:ingress.

Expect that all packets pass with cs3 dscp field.

3. same as 2, but use veth1:egress.  Expect the same outcome.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agoselftests: nft_flowtable.sh: monitor result file sizes
Boris Sukholitko [Thu, 4 May 2023 08:48:14 +0000 (11:48 +0300)]
selftests: nft_flowtable.sh: monitor result file sizes

When running nft_flowtable.sh in VM on a busy server we've found that
the time of the netcat file transfers vary wildly.

Therefore replace hardcoded 3 second sleep with the loop checking for
a change in the file sizes. Once no change in detected we test the results.

Nice side effect is that we shave 1 second sleep in the fast case
(hard-coded 3 second sleep vs two 1 second sleeps).

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agoselftests: nft_flowtable.sh: wait for specific nc pids
Boris Sukholitko [Thu, 4 May 2023 08:48:13 +0000 (11:48 +0300)]
selftests: nft_flowtable.sh: wait for specific nc pids

Doing wait with no parameters may interfere with some of the tests
having their own background processes.

Although no such test is currently present, the cleanup is useful
to rely on the nft_flowtable.sh for local development (e.g. running
background tcpdump command during the tests).

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agoselftests: nft_flowtable.sh: no need for ps -x option
Boris Sukholitko [Thu, 4 May 2023 08:48:12 +0000 (11:48 +0300)]
selftests: nft_flowtable.sh: no need for ps -x option

Some ps commands (e.g. busybox derived) have no -x option. For the
purposes of hash calculation of the list of processes this option is
inessential.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agoselftests: nft_flowtable.sh: use /proc for pid checking
Boris Sukholitko [Thu, 4 May 2023 08:48:11 +0000 (11:48 +0300)]
selftests: nft_flowtable.sh: use /proc for pid checking

Some ps commands (e.g. busybox derived) have no -p option. Use /proc for
pid existence check.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agonetfilter: conntrack: fix possible bug_on with enable_hooks=1
Florian Westphal [Thu, 4 May 2023 12:55:02 +0000 (14:55 +0200)]
netfilter: conntrack: fix possible bug_on with enable_hooks=1

I received a bug report (no reproducer so far) where we trip over

712         rcu_read_lock();
713         ct_hook = rcu_dereference(nf_ct_hook);
714         BUG_ON(ct_hook == NULL);  // here

In nf_conntrack_destroy().

First turn this BUG_ON into a WARN.  I think it was triggered
via enable_hooks=1 flag.

When this flag is turned on, the conntrack hooks are registered
before nf_ct_hook pointer gets assigned.
This opens a short window where packets enter the conntrack machinery,
can have skb->_nfct set up and a subsequent kfree_skb might occur
before nf_ct_hook is set.

Call nf_conntrack_init_end() to set nf_ct_hook before we register the
pernet ops.

Fixes: ba3fbe663635 ("netfilter: nf_conntrack: provide modparam to always register conntrack hooks")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agonetfilter: nf_tables: always release netdev hooks from notifier
Florian Westphal [Thu, 4 May 2023 12:20:21 +0000 (14:20 +0200)]
netfilter: nf_tables: always release netdev hooks from notifier

This reverts "netfilter: nf_tables: skip netdev events generated on netns removal".

The problem is that when a veth device is released, the veth release
callback will also queue the peer netns device for removal.

Its possible that the peer netns is also slated for removal.  In this
case, the device memory is already released before the pre_exit hook of
the peer netns runs:

BUG: KASAN: slab-use-after-free in nf_hook_entry_head+0x1b8/0x1d0
Read of size 8 at addr ffff88812c0124f0 by task kworker/u8:1/45
Workqueue: netns cleanup_net
Call Trace:
 nf_hook_entry_head+0x1b8/0x1d0
 __nf_unregister_net_hook+0x76/0x510
 nft_netdev_unregister_hooks+0xa0/0x220
 __nft_release_hook+0x184/0x490
 nf_tables_pre_exit_net+0x12f/0x1b0
 ..

Order is:
1. First netns is released, veth_dellink() queues peer netns device
   for removal
2. peer netns is queued for removal
3. peer netns device is released, unreg event is triggered
4. unreg event is ignored because netns is going down
5. pre_exit hook calls nft_netdev_unregister_hooks but device memory
   might be free'd already.

Fixes: 68a3765c659f ("netfilter: nf_tables: skip netdev events generated on netns removal")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
13 months agonet: phy: bcm7xx: Correct read from expansion register
Florian Fainelli [Mon, 8 May 2023 23:17:49 +0000 (16:17 -0700)]
net: phy: bcm7xx: Correct read from expansion register

Since the driver works in the "legacy" addressing mode, we need to write
to the expansion register (0x17) with bits 11:8 set to 0xf to properly
select the expansion register passed as argument.

Fixes: f68d08c437f9 ("net: phy: bcm7xxx: Add EPHY entry for 72165")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230508231749.1681169-1-f.fainelli@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: veth: rely on napi_build_skb in veth_convert_skb_to_xdp_buff
Lorenzo Bianconi [Mon, 8 May 2023 20:45:23 +0000 (22:45 +0200)]
net: veth: rely on napi_build_skb in veth_convert_skb_to_xdp_buff

Since veth_convert_skb_to_xdp_buff routine runs in veth_poll() NAPI,
rely on napi_build_skb() instead of build_skb() to reduce skb allocation
cost.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/0f822c0b72f8b71555c11745cb8fb33399d02de9.1683578488.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: Fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
Kuniyuki Iwashima [Mon, 8 May 2023 17:55:43 +0000 (10:55 -0700)]
net: Fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().

KCSAN found a data race in sock_recv_cmsgs() where the read access
to sk->sk_stamp needs READ_ONCE().

BUG: KCSAN: data-race in packet_recvmsg / packet_recvmsg

write (marked) to 0xffff88803c81f258 of 8 bytes by task 19171 on cpu 0:
 sock_write_timestamp include/net/sock.h:2670 [inline]
 sock_recv_cmsgs include/net/sock.h:2722 [inline]
 packet_recvmsg+0xb97/0xd00 net/packet/af_packet.c:3489
 sock_recvmsg_nosec net/socket.c:1019 [inline]
 sock_recvmsg+0x11a/0x130 net/socket.c:1040
 sock_read_iter+0x176/0x220 net/socket.c:1118
 call_read_iter include/linux/fs.h:1845 [inline]
 new_sync_read fs/read_write.c:389 [inline]
 vfs_read+0x5e0/0x630 fs/read_write.c:470
 ksys_read+0x163/0x1a0 fs/read_write.c:613
 __do_sys_read fs/read_write.c:623 [inline]
 __se_sys_read fs/read_write.c:621 [inline]
 __x64_sys_read+0x41/0x50 fs/read_write.c:621
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff88803c81f258 of 8 bytes by task 19183 on cpu 1:
 sock_recv_cmsgs include/net/sock.h:2721 [inline]
 packet_recvmsg+0xb64/0xd00 net/packet/af_packet.c:3489
 sock_recvmsg_nosec net/socket.c:1019 [inline]
 sock_recvmsg+0x11a/0x130 net/socket.c:1040
 sock_read_iter+0x176/0x220 net/socket.c:1118
 call_read_iter include/linux/fs.h:1845 [inline]
 new_sync_read fs/read_write.c:389 [inline]
 vfs_read+0x5e0/0x630 fs/read_write.c:470
 ksys_read+0x163/0x1a0 fs/read_write.c:613
 __do_sys_read fs/read_write.c:623 [inline]
 __se_sys_read fs/read_write.c:621 [inline]
 __x64_sys_read+0x41/0x50 fs/read_write.c:621
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0xffffffffc4653600 -> 0x0000000000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19183 Comm: syz-executor.5 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 6c7c98bad488 ("sock: avoid dirtying sk_stamp, if possible")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230508175543.55756-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: xgmac: add ethtool per-queue irq statistic support
Teoh Ji Sheng [Mon, 8 May 2023 14:43:40 +0000 (22:43 +0800)]
net: stmmac: xgmac: add ethtool per-queue irq statistic support

Commit af9bf70154eb ("net: stmmac: add ethtool per-queue irq statistic
support") introduced ethtool per-queue statistics support to display
number of interrupts generated by DMA tx and DMA rx for DWMAC4 core.
This patch extend the support to XGMAC core.

Signed-off-by: Teoh Ji Sheng <ji.sheng.teoh@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230508144339.3014402-1-ji.sheng.teoh@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge branch 'net-stmmac-convert-to-platform-remove-callback-returning-void'
Jakub Kicinski [Wed, 10 May 2023 02:55:32 +0000 (19:55 -0700)]
Merge branch 'net-stmmac-convert-to-platform-remove-callback-returning-void'

Uwe Kleine-König says:

====================
net: stmmac: Convert to platform remove callback returning void

(implicit) v1 of this series is available at
https://lore.kernel.org/netdev/20230402143025.2524443-1-u.kleine-koenig@pengutronix.de

Changes since then:

 - Added various Reviewed-by: and Acked-by: tags received for v1
 - Removed a variable in an earlier patch to make all intermediate steps
   compilable, spotted by Simon Horman
 - Rebased to v6.4-rc1 (which needed a slight adaption to cope for
   4bd3bb7b4526 ("net: stmmac: Add glue layer for StarFive JH7110 SoC"))
====================

Link: https://lore.kernel.org/r/20230508142637.1449363-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-tegra: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:37 +0000 (16:26 +0200)]
net: stmmac: dwmac-tegra: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-sun8i: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:36 +0000 (16:26 +0200)]
net: stmmac: dwmac-sun8i: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-stm32: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:35 +0000 (16:26 +0200)]
net: stmmac: dwmac-stm32: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-sti: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:34 +0000 (16:26 +0200)]
net: stmmac: dwmac-sti: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-rk: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:33 +0000 (16:26 +0200)]
net: stmmac: dwmac-rk: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-qcom-ethqos: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:32 +0000 (16:26 +0200)]
net: stmmac: dwmac-qcom-ethqos: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-dwc-qos-eth: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:31 +0000 (16:26 +0200)]
net: stmmac: dwmac-dwc-qos-eth: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: stmmac: dwmac-visconti: Convert to platform remove callback returning void
Uwe Kleine-König [Mon, 8 May 2023 14:26:30 +0000 (16:26 +0200)]
net: stmmac: dwmac-visconti: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>