review.tizen.org Git - platform/kernel/linux-rpi.git/log

inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Kconfig: fix spellos

Fix spelling in net/ Kconfig files.
(reported by codespell)

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: fix NULL pointer dereference in pause_prepare_data()

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> pause_prepare_data

struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: fix NULL pointer dereference in stats_prepare_data()

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> stats_prepare_data

struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 's390-ism-generalized-interface'

Jan Karcher says:

====================
drivers/s390/net/ism: Add generalized interface

Previously, there was no clean separation between SMC-D code and the ISM
device driver.This patch series addresses the situation to make ISM available
for uses outside of SMC-D.
In detail: SMC-D offers an interface via struct smcd_ops, which only the
ISM module implements so far. However, there is no real separation between
the smcd and ism modules, which starts right with the ISM device
initialization, which calls directly into the SMC-D code.
This patch series introduces a new API in the ISM module, which allows
registration of arbitrary clients via include/linux/ism.h: struct ism_client.
Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of
dependencies on SMC-D in the device structure), and adds a number of API
calls for data transfers via ISM (see ism_register_dmb() & friends).
Still, the ISM module implements the SMC-D API, and therefore has a number
of internal helper functions for that matter.
Note that the ISM API is consciously kept thin for now (as compared to the
SMC-D API calls), as a number of API calls are only used with SMC-D and
hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.

v1 -> v2:
Removed s390x dependency which broke config for other archs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: De-tangle ism and smc device initialization

The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/ism: Consolidate SMC-D-related code

The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Separate SMC-D and ISM APIs

We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Register SMC-D as ISM client

Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ism: Add new API for client registration

Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/ism: Introduce struct ism_dmb

Conceptually, a DMB is a structure that belongs to ISM devices. However,
SMC currently 'owns' this structure. So future exploiters of ISM devices
would be forced to include SMC headers to work - which is just weird.
Therefore, we switch ISM to struct ism_dmb, introduce a new public header
with the definition (will be populated with further API calls later on),
and, add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb
are identical, we can simply convert between the two for now.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ism: Add missing calls to disable bus-mastering

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Terminate connections prior to device removal

Removing an ISM device prior to terminating its associated connections
doesn't end well.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: Reduce debug name field size to 16 bytes

virtio queue index can be maximum of 65535. 16 bytes are enough to store
the vq name with the existing string prefix.

With this change, send queue struct saves 24 bytes and receive
queue saves whole cache line worth 64 bytes per structure
due to saving in alignment bytes.

Pahole results before:

pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1112    0
receive_queue   1280    1

Pahole results after:
pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1088    0
receive_queue   1216    1

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: remove a dubious assumption in fmsg dumping

Build bot detects that err may be returned uninitialized in
devlink_fmsg_prepare_skb(). This is not really true because
all fmsgs users should create at least one outer nest, and
therefore fmsg can't be completely empty.

That said the assumption is not trivial to confirm, so let's
follow the bots advice, anyway.

This code does not seem to have changed since its inception in
commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API")

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm()

We don't read the verify_enabled variable from hardware in the MAC Merge
layer state GET operation, instead we always leave it set to "false".
The user may think something is wrong if they set verify_enabled to
true, then read it back and see it's still false, even though the
configuration took place.

Fixes: 6505b6805655 ("net: mscc: ocelot: add MAC Merge layer support for VSC9959")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230123184538.3420098-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: flower: change get/set_eeprom logic and enable for flower reps

The changes in this patch are as follows:

- Alter the logic of get/set_eeprom functions to use the helper function
nfp_app_from_netdev() which handles differentiating between an nfp_net
and a nfp_repr. This allows us to get an agnostic backpointer to the
pdev.

- Enable the various eeprom commands by adding the 'get_eeprom_len',
'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct.
This allows the eeprom commands to work on representor interfaces,
similar to a previous patch which added it to the vnics.
Currently these are being used to configure persistent MAC addresses for
the physical ports on the nfp.

Signed-off-by: James Hershaw <james.hershaw@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230123134135.293278-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Make ip6_route_output_flags_noref() static.

This function is only used in net/ipv6/route.c and has no reason to be
visible outside of it.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: fix spelling mistake in dump size assert

Commit 2c7bc10d0f7b ("netlink: add macro for checking dump ctx size")
misspelled the name of the assert as asset, missing an R.

Reported-by: Ido Schimmel <idosch@idosch.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netlink-protocol-specs'

Jakub Kicinski says:

====================
Netlink protocol specs

I think the Netlink proto specs are far along enough to merge.
Filling in all attribute types and quirks will be an ongoing
effort but we have enough to cover FOU so it's somewhat complete.

I fully intend to continue polishing the code but at the same
time I'd like to start helping others base their work on the
specs (e.g. DPLL) and need to start working on some new families
myself.

That's the progress / motivation for merging. The RFC [1] has more
of a high level blurb, plus I created a lot of documentation, I'm
not going to repeat it here. There was also the talk at LPC [2].

[1] https://lore.kernel.org/all/20220811022304.583300-1-kuba@kernel.org/
[2] https://youtu.be/9QkXIQXkaQk?t=2562
v2: https://lore.kernel.org/all/20220930023418.1346263-1-kuba@kernel.org/
v3: https://lore.kernel.org/all/20230119003613.111778-1-kuba@kernel.org/1

v4:
- spec improvements (patch 2)
- Python cleanup (patch 3)
- rename auto-gen files and use the right comment style
====================

Link: https://lore.kernel.org/r/20230120175041.342573-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tools: ynl: add a completely generic client

Add a CLI sample which can take in arbitrary request
in JSON format, convert it to Netlink and do the inverse
for output.

It's meant as a development tool primarily and perhaps
for selftests which need to tickle netlink in a special way.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: use policy and operation tables generated from the spec

Generate and plug in the spec-based tables.

A little bit of renaming is needed in the FOU code.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: rename the source for linking

We'll need to link two objects together to form the fou module.
This means the source can't be called fou, the build system expects
fou.o to be the combined object.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: regenerate the uAPI from the spec

Regenerate the FOU uAPI header from the YAML spec.

The flags now come before attributes which use them,
and the comments for type disappear (coders should look
at the spec instead).

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: add a proto specification for FOU

FOU has a reasonably modern Genetlink family. Add a spec.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add basic C code generators for Netlink

Code generators to turn Netlink specs into C code.
I'm definitely not proud of it.

The main generator is in Python, there's a bash script
to regen all code-gen'ed files in tree after making
spec changes.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: add schemas for YAML specs

Add schemas for Netlink spec files. As described in the docs
we have 4 "protocols" or compatibility levels, and each one
comes with its own schema, but the more general / legacy
schemas are superset of more modern ones: genetlink is
the smallest followed by genetlink-c and genetlink-legacy.
There is no schema for raw netlink, yet, I haven't found the time..

I don't know enough jsonschema to do inheritance or something
but the repetition is not too bad. I hope.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

docs: add more netlink docs (incl. spec docs)

Add documentation about the upcoming Netlink protocol specs.

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-use-the-backlog-for-nested-mirred-ingress'

Davide Caratti says:

====================
net/sched: use the backlog for nested mirred ingress

TC mirred has a protection against excessive stack growth, but that
protection doesn't really guarantee the absence of recursion, nor
it guards against loops. Patch 1/2 rewords "recursion" to "nesting" to
make this more clear.
We can leverage on this existing mechanism to prevent TCP / SCTP from doing
soft lock-up in some specific scenarios that uses mirred egress->ingress:
patch 2 changes mirred so that the networking backlog is used for nested
mirred ingress actions.
====================

Link: https://lore.kernel.org/r/cover.1674233458.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

act_mirred: use the backlog for nested calls to mirred ingress

William reports kernel soft-lockups on some OVS topologies when TC mirred
egress->ingress action is hit by local TCP traffic [1].
The same can also be reproduced with SCTP (thanks Xin for verifying), when
client and server reach themselves through mirred egress to ingress, and
one of the two peers sends a "heartbeat" packet (from within a timer).

Enqueueing to backlog proved to fix this soft lockup; however, as Cong
noticed [2], we should preserve - when possible - the current mirred
behavior that counts as "overlimits" any eventual packet drop subsequent to
the mirred forwarding action [3]. A compromise solution might use the
backlog only when tcf_mirred_act() has a nest level greater than one:
change tcf_mirred_forward() accordingly.

Also, add a kselftest that can reproduce the lockup and verifies TC mirred
ability to account for further packet drops after TC mirred egress->ingress
(when the nest level is 1).

[1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/
[3] such behavior is not guaranteed: for example, if RPS or skb RX
     timestamping is enabled on the mirred target device, the kernel
     can defer receiving the skb and return NET_RX_SUCCESS inside
     tcf_mirred_forward().

Reported-by: William Zhao <wizhao@redhat.com>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_mirred: better wording on protection against excessive stack growth

with commit e2ca070f89ec ("net: sched: protect against stack overflow in
TC act_mirred"), act_mirred protected itself against excessive stack growth
using per_cpu counter of nested calls to tcf_mirred_act(), and capping it
to MIRRED_RECURSION_LIMIT. However, such protection does not detect
recursion/loops in case the packet is enqueued to the backlog (for example,
when the mirred target device has RPS or skb timestamping enabled). Change
the wording from "recursion" to "nesting" to make it more clear to readers.

CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'fix-cpts-release-action-in-am65-cpts-driver'

Siddharth Vadapalli says:

====================
Fix CPTS release action in am65-cpts driver

Delete unreachable code in am65_cpsw_init_cpts() function, which was
Reported-by: Leon Romanovsky <leon@kernel.org>
at:
https://lore.kernel.org/r/Y8aHwSnVK9+sAb24@unreal

Remove the devm action associated with am65_cpts_release() and invoke the
function directly on the cleanup and exit paths.

v4:
https://lore.kernel.org/r/20230120044201.357950-1-s-vadapalli@ti.com/
v3:
https://lore.kernel.org/r/20230118095439.114222-1-s-vadapalli@ti.com/
v2:
https://lore.kernel.org/r/20230116044517.310461-1-s-vadapalli@ti.com/
v1:
https://lore.kernel.org/r/20230113104816.132815-1-s-vadapalli@ti.com/
====================

Link: https://lore.kernel.org/r/20230120070731.383729-1-s-vadapalli@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: ti: am65-cpsw/cpts: Fix CPTS release action

The am65_cpts_release() function is registered as a devm_action in the
am65_cpts_create() function in am65-cpts driver. When the am65-cpsw driver
invokes am65_cpts_create(), am65_cpts_release() is added in the set of devm
actions associated with the am65-cpsw driver's device.

In the event of probe failure or probe deferral, the platform_drv_probe()
function invokes dev_pm_domain_detach() which powers off the CPSW and the
CPSW's CPTS hardware, both of which share the same power domain. Since the
am65_cpts_disable() function invoked by the am65_cpts_release() function
attempts to reset the CPTS hardware by writing to its registers, the CPTS
hardware is assumed to be powered on at this point. However, the hardware
is powered off before the devm actions are executed.

Fix this by getting rid of the devm action for am65_cpts_release() and
invoking it directly on the cleanup and exit paths.

Fixes: f6bd59526ca5 ("net: ethernet: ti: introduce am654 common platform time sync driver")
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: ti: am65-cpsw: Delete unreachable error handling code

The am65_cpts_create() function returns -EOPNOTSUPP only when the config
"CONFIG_TI_K3_AM65_CPTS" is disabled. Also, in the am65_cpsw_init_cpts()
function, am65_cpts_create() can only be invoked if the config
"CONFIG_TI_K3_AM65_CPTS" is enabled. Thus, the error handling code for the
case in which the return value of am65_cpts_create() is -EOPNOTSUPP, is
unreachable. Hence delete it.

Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: microchip: run phy initialization during each link update

PHY initialization is supposed to run on every mode changes.
"lan87xx_config_aneg()" verifies every mode change using
"phy_modify_changed()" function. Earlier code had phy_modify_changed()
followed by genphy_soft_reset. But soft_reset resets all the
pre-configured register values to default state, and lost all the
initialization done. With this reason gen_phy_reset was removed.
But it need to go through init sequence each time the mode changed.
Update lan87xx_config_aneg() to invoke phy_init once successful mode
update is detected.

PHY init sequence added in lan87xx_phy_init() have slave init
commands executed every time. Update the init sequence to run
slave init only if phydev is in slave mode.

Test setup contains LAN9370 EVB connected to SAMA5D3 (Running DSA),
and issue can be reproduced by connecting link to any of the available
ports after SAMA5D3 boot-up. With this issue, port will fail to
update link state. But once the SAMA5D3 is reset with LAN9370 link in
connected state itself, on boot-up link state will be reported as UP. But
Again after some time, if link is moved to DOWN state, it will not get
reported.

Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230120104733.724701-1-rakesh.sankaranarayanan@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-microchip-add-support-for-credit-based-shaper'

Arun Ramadoss says:

====================
net: dsa: microchip: add support for credit based shaper

LAN937x switch family, KSZ9477, KSZ9567, KSZ9563 and KSZ8563 supports
the credit based shaper. But there were few difference between LAN937x and KSZ
switch like
- number of queues for LAN937x is 8 and for others it is 4.
- size of credit increment register for LAN937x is 24 and for other is 16-bit.
This patch series add the credit based shaper with common implementation for
LAN937x and KSZ swithes.
====================

Link: https://lore.kernel.org/r/20230120052135.32120-1-arun.ramadoss@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: add support for credit based shaper

KSZ9477, KSZ9567, KSZ9563, KSZ8563 and LAN937x supports Credit based
shaper. To differentiate the chip supporting cbs, tc_cbs_supported
flag is introduced in ksz_chip_data.
And KSZ series has 16bit Credit increment registers whereas LAN937x has
24bit register. The value to be programmed in the credit increment is
determined using the successive multiplication method to convert decimal
fraction to hexadecimal fraction.
For example: if idleslope is 10000 and sendslope is -90000, then
bandwidth is 10000 - (-90000) = 100000.
The 10% bandwidth of 100Mbps means 10/100 = 0.1(decimal). This value has
to be converted to hexa.
1) 0.1 * 16 = 1.6  --> fraction 0.6 Carry = 1 (MSB)
2) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
3) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
4) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
5) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
6) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9 (LSB)
Now 0.1(decimal) becomes 0.199999(Hex).
If it is LAN937x, 24 bit value will be programmed to Credit Inc
register, 0x199999. For others 16 bit value will be prgrammed, 0x1999.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: enable port queues for tc mqprio

LAN937x family of switches has 8 queues per port where the KSZ switches
has 4 queues per port. By default, only one queue per port is enabled.
The queues are configurable in 2, 4 or 8. This patch add 8 number of
queues for LAN937x and 4 for other switches.
In the tag_ksz.c file, prioirty of the packet is queried using the skb
buffer and the corresponding value is updated in the tag.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: avoid irqsave in skb_defer_free_flush

The spin_lock irqsave/restore API variant in skb_defer_free_flush can
be replaced with the faster spin_lock irq variant, which doesn't need
to read and restore the CPU flags.

Using the unconditional irq "disable/enable" API variant is safe,
because the skb_defer_free_flush() function is only called during
NAPI-RX processing in net_rx_action(), where it is known the IRQs
are enabled.

Expected gain is 14 cycles from avoiding reading and restoring CPU
flags in a spin_lock_irqsave/restore operation, measured via a
microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz.

Microbenchmark overhead of spin_lock+unlock:
- spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns
- spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns

We don't expect to see a measurable packet performance gain, as
skb_defer_free_flush() is called infrequently once per NIC device NAPI
bulk cycle and conditionally only if SKBs have been deferred by other
CPUs via skb_attempt_defer_free().

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Document that max_size sysctl is deprecated

v4: fix deprecated typo.

Document that max_size is deprecated due to:

commit af6d10345ca7 ("ipv6: remove max_size check inline with ipv4")

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://lore.kernel.org/r/20230120232331.1273881-1-jmaxwell37@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fix kfree_skb_list use of skb_mark_not_on_list

A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use
kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via
invoking skb_mark_not_on_list().

In this patch we choose to remove the skb_mark_not_on_list() call as it
isn't necessary. It would be possible and correct to call
skb_mark_not_on_list() only when __kfree_skb_reason() returns true,
meaning the SKB is ready to be free'ed, as it calls/check skb_unref().

This fix is needed as kfree_skb_list() is also invoked on skb_shared_info
frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can
have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(),
which takes a reference on all SKBs in the list. This implies the
invariant that all SKBs in the list must have the same refcnt, when using
kfree_skb_list().

Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: sparx5: Fix uninitialized variable in vcap_path_exist()

The "eport" variable needs to be initialized to NULL for this code to
work.

Fixes: 814e7693207f ("net: microchip: vcap api: Add a storage state to a VCAP rule")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Link: https://lore.kernel.org/r/Y8qbYAb+YSXo1DgR@kili
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: warn once if addr parameter is invalid in mdiobus_get_phy()

If mdiobus_get_phy() is called with an invalid addr parameter, then the
caller has a bug. Print a call trace to help identifying the caller.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/daec3f08-6192-ba79-f74b-5beb436cab6c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2023-01-23' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.3

First set of patches for v6.3. The most important change here is that
the old Wireless Extension user space interface is not supported on
Wi-Fi 7 devices at all. We also added a warning if anyone with modern
drivers (ie. cfg80211 and mac80211 drivers) tries to use Wireless
Extensions, everyone should switch to using nl80211 interface instead.

Static WEP support is removed, there wasn't any driver using that
anyway so there's no user impact. Otherwise it's smaller features and
fixes as usual.

Note: As mt76 had tricky conflicts due to the fixes in wireless tree,
we decided to merge wireless into wireless-next to solve them easily.
There should not be any merge problems anymore.

Major changes:

cfg80211
- remove never used static WEP support
- warn if Wireless Extention interface is used with cfg80211/mac80211 drivers
- stop supporting Wireless Extensions with Wi-Fi 7 devices
- support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting

rfkill
- add GPIO DT support

bitfield
- add FIELD_PREP_CONST()

mt76
- per-PHY LED support

rtw89
- support new Bluetooth co-existance version

rtl8xxxu
- support RTL8188EU

* tag 'wireless-next-2023-01-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (123 commits)
  wifi: wireless: deny wireless extensions on MLO-capable devices
  wifi: wireless: warn on most wireless extension usage
  wifi: mac80211: drop extra 'e' from ieeee80211... name
  wifi: cfg80211: Deduplicate certificate loading
  bitfield: add FIELD_PREP_CONST()
  wifi: mac80211: add kernel-doc for EHT structure
  mac80211: support minimal EHT rate reporting on RX
  wifi: mac80211: Add HE MU-MIMO related flags in ieee80211_bss_conf
  wifi: mac80211: Add VHT MU-MIMO related flags in ieee80211_bss_conf
  wifi: cfg80211: Use MLD address to indicate MLD STA disconnection
  wifi: cfg80211: Support 32 bytes KCK key in GTK rekey offload
  wifi: cfg80211: Fix extended KCK key length check in nl80211_set_rekey_data()
  wifi: cfg80211: remove support for static WEP
  wifi: rtl8xxxu: Dump the efuse only for untested devices
  wifi: rtl8xxxu: Print the ROM version too
  wifi: rtw88: Use non-atomic sta iterator in rtw_ra_mask_info_update()
  wifi: rtw88: Use rtw_iterate_vifs() for rtw_vif_watch_dog_iter()
  wifi: rtw88: Move register access from rtw_bf_assoc() outside the RCU
  wifi: rtl8xxxu: Use a longer retry limit of 48
  wifi: rtl8xxxu: Report the RSSI to the firmware
  ...
====================

Link: https://lore.kernel.org/r/20230123103338.330CBC433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: asix,ax88796c: allow SPI peripheral properties

The AX88796C device node on SPI bus can use SPI peripheral properties in
certain configurations:

exynos3250-artik5-eval.dtb: ethernet@0: 'controller-data' does not match any of the regexes: 'pinctrl-[0-9]+'

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20230120144329.305655-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: tcp_mmap: populate pages in send path

In commit 72653ae5303c ("selftests: net: tcp_mmap:
Use huge pages in send path") I made a change to use hugepages
for the buffer used by the client (tx path)

Today, I understood that the cause for poor zerocopy
performance was that after a mmap() for a 512KB memory
zone, kernel uses a single zeropage, mapped 128 times.

This was really the reason for poor tx path performance
in zero copy mode, because this zero page refcount is
under high pressure, especially when TCP ACK packets
are processed on another cpu.

We need either to force a COW on all the memory range,
or use MAP_POPULATE so that a zero page is not abused.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120181136.3764521-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: mux-meson-g12a: use devm_clk_get_enabled to simplify the code

Use devm_clk_get_enabled() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mdiobus: Convert to use fwnode_device_is_compatible()

Replace open coded fwnode_device_is_compatible() in the driver.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: Add and use ethnl_update_bool.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'enetc-mac-merge-prep'

Vladimir Oltean says:

====================
ENETC MAC Merge cleanup

This is a preparatory patch set for MAC Merge layer support in enetc via
ethtool. It does the following:

- consolidates a software lockstep register write procedure for the pMAC
- detects per-port frame preemption capability and only writes pMAC
registers if a pMAC exists
- stops enabling the pMAC by default

Additionally, I noticed some build warnings in the driver which are new
in this kernel version, so patch 1/6 fixes those.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: stop auto-configuring the port pMAC

The pMAC (ENETC_PFPMR_PMACE) is probably unconditionally enabled in the
enetc driver to allow RX of preemptible packets and not see them as
error frames. I don't know why TX preemption (ENETC_MMCSR_ME) is enabled
though. With no way to say which traffic classes are preemptible (all
are express by default), no preemptible frames would be transmitted
anyway.

Lastly, it may have been believed that the register write lock-step mode
(now deleted) needed the pMAC to be enabled at all times. I don't know
if that's true. However, I've checked that driver writes to PM1
registers do propagate through to the ENETC IP even when the pMAC is
disabled.

With such incomplete support for frame preemption, it's best to just
remove whatever exists right now and come with something more coherent
later.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: implement software lockstep for port MAC registers

Currently the enetc driver duplicates its writes to the PM0 registers
also to PM1, but it doesn't do this consistently - for example we write
to ENETC_PM0_MAXFRM but not to ENETC_PM1_MAXFRM.

Create enetc_port_mac_wr() which writes both the PM0 and PM1 register
with the same value (if frame preemption is supported on this port).
Also create enetc_port_mac_rd() which reads from PM0 - the assumption
being that PM1 contains just the same value.

This will be necessary when we enable the MAC Merge layer properly, and
the pMAC becomes operational.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: stop configuring pMAC in lockstep with eMAC

The MWLM bit (MAC write lock-step mode) allows register writes to the
pMAC to be auto-performed whenever the corresponding eMAC register is
written by the driver. This allows their configuration to remain
in sync.

The driver has set this bit since the initial commit, but it doesn't do
anything, since the hardware feature doesn't work (and the bit has been
removed from more recent versions of the documentation).

The driver does attempt, more or less, to keep those MAC registers in
sync by writing the same value once to e.g. ENETC_PM0_CMD_CFG (eMAC) and
once to ENETC_PM1_CMD_CFG (pMAC). Because the lockstep feature doesn't
work, that's what it will stick to.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: add definition for offset between eMAC and pMAC regs

This is a preliminary patch which replaces the hardcoded 0x1000 present
in other PM1 (port MAC 1, aka pMAC) register definitions, which is an
offset to the PM0 (port MAC 0, aka eMAC) equivalent register.
This definition will be used in more places by future code.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: detect frame preemption hardware capability

Similar to other TSN features, query the Station Interface capability
register to see whether preemption is supported on this port or not.
On LS1028A, preemption is available on ports 0 and 2, but not on 1
and 3.

This will allow us in the future to write the pMAC registers only on the
ENETC ports where a pMAC actually exists.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: build common object files into a separate module

The build system is complaining about the following:

enetc.o is added to multiple modules: fsl-enetc fsl-enetc-vf
enetc_cbdr.o is added to multiple modules: fsl-enetc fsl-enetc-vf
enetc_ethtool.o is added to multiple modules: fsl-enetc fsl-enetc-vf

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ethtool-mac-merge'

Vladimir Oltean says:

====================
ethtool support for IEEE 802.3 MAC Merge layer

Change log
----------

v3->v4:
- add missing opening bracket in ocelot_port_mm_irq()
- moved cfg.verify_time range checking so that it actually takes place
  for the updated rather than old value
v3 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230117085947.2176464-1-vladimir.oltean@nxp.com/

v2->v3:
- made get_mm return int instead of void
- deleted ETHTOOL_A_MM_SUPPORTED
- renamed ETHTOOL_A_MM_ADD_FRAG_SIZE to ETHTOOL_A_MM_TX_MIN_FRAG_SIZE
- introduced ETHTOOL_A_MM_RX_MIN_FRAG_SIZE
- cleaned up documentation
- rebased on top of PLCA changes
- renamed ETHTOOL_STATS_SRC_* to ETHTOOL_MAC_STATS_SRC_*
v2 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230111161706.1465242-1-vladimir.oltean@nxp.com/

v1->v2:
I've decided to focus just on the MAC Merge layer for now, which is why
I am able to submit this patch set as non-RFC.
v1 (RFC) at:
https://patchwork.kernel.org/project/netdevbpf/cover/20220816222920.1952936-1-vladimir.oltean@nxp.com/

What is being introduced
------------------------

TL;DR: a MAC Merge layer as defined by IEEE 802.3-2018, clause 99
(interspersing of express traffic). This is controlled through ethtool
netlink (ETHTOOL_MSG_MM_GET, ETHTOOL_MSG_MM_SET). The raw ethtool
commands are posted here:
https://patchwork.kernel.org/project/netdevbpf/cover/20230111153638.1454687-1-vladimir.oltean@nxp.com/

The MAC Merge layer has its own statistics counters
(ethtool --include-statistics --show-mm swp0) as well as two member
MACs, the statistics of which can be queried individually, through a new
ethtool netlink attribute, corresponding to:

$ ethtool -I --show-pause eno2 --src aggregate
$ ethtool -S eno2 --groups eth-mac eth-phy eth-ctrl rmon -- --src pmac

The core properties of the MAC Merge layer are described in great detail
in patches 02/12 and 03/12. They can be viewed in "make htmldocs" format.

Devices for which the API is supported
--------------------------------------

I decided to start with the Ethernet switch on NXP LS1028A (Felix)
because of the smaller patch set. I also have support for the ENETC
controller pending.

I would like to get confirmation that the UAPI being proposed here will
not restrict any use cases known by other hardware vendors.

Why is support for preemptible traffic classes not here?
--------------------------------------------------------

There is legitimate concern whether the 802.1Q portion of the standard
(which traffic classes go to the eMAC and which to the pMAC) should be
modeled in Linux using tc or using another UAPI. I think that is
stalling the entire series, but should be discussed separately instead.
Removing FP adminStatus support makes me confident enough to submit this
patch set without an RFC tag (meaning: I wouldn't mind if it was merged
as is).

What is submitted here is sufficient for an LLDP daemon to do its job.
I've patched openlldp to advertise and configure frame preemption:
https://github.com/vladimiroltean/openlldp/tree/frame-preemption-v3

In case someone wants to try it out, here are some commands I've used.

# Configure the interfaces to receive and transmit LLDP Data Units
lldptool -L -i eno0 adminStatus=rxtx
lldptool -L -i swp0 adminStatus=rxtx
# Enable the transmission of certain TLVs on switch's interface
lldptool -T -i eno0 -V addEthCap enableTx=yes
lldptool -T -i swp0 -V addEthCap enableTx=yes
# Query LLDP statistics on switch's interface
lldptool -S -i swp0
# Query the received neighbor TLVs
lldptool -i swp0 -t -n -V addEthCap
Additional Ethernet Capabilities TLV
         Preemption capability supported
         Preemption capability enabled
         Preemption capability active
         Additional fragment size: 60 octets

So using this patch set, lldpad will be able to advertise and configure
frame preemption, but still, no data packet will be sent as preemptible
over the link, because there is no UAPI to control which traffic classes
are sent as preemptible and which as express.

Preemptable or preemptible?
---------------------------

IEEE 802.3 uses "preemptable" throughout. IEEE 802.1Q uses "preemptible"
throughout. Because the definition of "preemptible" falls under 802.1Q's
jurisdiction and 802.3 just references it, I went with the 802.1Q naming
even where supporting an 802.3 feature. Also, checkpatch agrees with this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: add MAC Merge layer support for VSC9959

Felix (VSC9959) has a DEV_GMII:MM_CONFIG block composed of 2 registers
(ENABLE_CONFIG and VERIF_CONFIG). Because the MAC Merge statistics and
pMAC statistics are already in the Ocelot switch lib even if just Felix
supports them, I'm adding support for the whole MAC Merge layer in the
common Ocelot library too.

There is an interrupt (shared with the PTP interrupt) which signals
changes to the MM verification state. This is done because the
preemptible traffic classes should be committed to hardware only once
the verification procedure has declared the link partner of being
capable of receiving preemptible frames.

We implement ethtool getters and setters for the MAC Merge layer state.
The "TX enabled" and "verify status" are taken from the IRQ handler,
using a mutex to ensure serialized access.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: export ethtool MAC Merge stats for Felix VSC9959

The Felix VSC9959 switch supports frame preemption and has a MAC Merge
layer. In addition to the structured stats that exist for the eMAC,
export the counters associated with its pMAC (pause, RMON, MAC, PHY,
control) plus the high-level MAC Merge layer stats. The unstructured
ethtool counters, as well as the rtnl_link_stats64 were left to report
only the eMAC counters.

Because statistics processing is quite self-contained in ocelot_stats.c
now, I've opted for introducing an ocelot->mm_supported bool, based on
which the common switch lib does everything, rather than pushing the
TSN-specific code in felix_vsc9959.c, as happens for other TSN stuff.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: hide access to ocelot_stats_layout behind a helper

Some hardware instances of the ocelot driver support the MAC Merge
layer, which gives access to an extra preemptible MAC. This has
implications upon the statistics. There will be a stats layout when MM
isn't supported, and a different one when it is.
The ocelot_stats_layout() helper will return the correct one.
In preparation of that, refactor the existing code to use this helper.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: allow ocelot_stat_layout elements with no name

We will add support for pMAC counters and MAC merge layer counters,
which are only reported via the structured stats, and the current
ocelot_get_strings() stands in our way, because it expects that the
statistics should be placed in the data array at the same index as found
in the ocelot_stats_layout array.

That is not true. Statistics which don't have a name should not be
exported to the unstructured ethtool -S, so we need to have different
indices into the ocelot_stats_layout array (i) and into the data array
(data itself).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: add plumbing for changing and getting MAC merge layer state

The DSA core is in charge of the ethtool_ops of the net devices
associated with switch ports, so in case a hardware driver supports the
MAC merge layer, DSA must pass the callbacks through to the driver.
Add support for precisely that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: add helpers for MM fragment size translation

We deliberately make the Linux UAPI pass the minimum fragment size in
octets, even though IEEE 802.3 defines it as discrete values, and
addFragSize is just the multiplier. This is because there is nothing
impossible in operating with an in-between value for the fragment size
of non-final preempted fragments, and there may even appear hardware
which supports the in-between sizes.

For the hardware which just understands the addFragSize multiplier,
create two helpers which translate back and forth the values passed in
octets.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: add helpers for aggregate statistics

When a pMAC exists but the driver is unable to atomically query the
aggregate eMAC+pMAC statistics, the user should be given back at least
the sum of eMAC and pMAC counters queried separately.

This is a generic problem, so add helpers in ethtool to do this
operation, if the driver doesn't have a better way to report aggregate
stats. Do this in a way that does not require changes to these functions
when new stats are added (basically treat the structures as an array of
u64 values, except for the first element which is the stats source).

In include/linux/ethtool.h, there is already a section where helper
function prototypes should be placed. The trouble is, this section is
too early, before the definitions of struct ethtool_eth_mac_stats et.al.
Move that section at the end and append these new helpers to it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: ethtool: document ETHTOOL_A_STATS_SRC and ETHTOOL_A_PAUSE_STATS_SRC

Two new netlink attributes were added to PAUSE_GET and STATS_GET and
their replies. Document them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)

IEEE 802.3-2018 clause 99 defines a MAC Merge sublayer which contains an
Express MAC and a Preemptible MAC. Both MACs are hidden to higher and
lower layers and visible as a single MAC (packet classification to eMAC
or pMAC on TX is done based on priority; classification on RX is done
based on SFD).

For devices which support a MAC Merge sublayer, it is desirable to
retrieve individual packet counters from the eMAC and the pMAC, as well
as aggregate statistics (their sum).

Introduce a new ETHTOOL_A_STATS_SRC attribute which is part of the
policy of ETHTOOL_MSG_STATS_GET and, and an ETHTOOL_A_PAUSE_STATS_SRC
which is part of the policy of ETHTOOL_MSG_PAUSE_GET (accepted when
ETHTOOL_FLAG_STATS is set in the common ethtool header). Both of these
take values from enum ethtool_mac_stats_src, defaulting to "aggregate"
in the absence of the attribute.

Existing drivers do not need to pay attention to this enum which was
added to all driver-facing structures, just the ones which report the
MAC merge layer as supported.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: ethtool-netlink: document interface for MAC Merge layer

Show details about the structures passed back and forth related to MAC
Merge layer configuration, state and statistics. The rendered htmldocs
will be much more verbose due to the kerneldoc references.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: add support for MAC Merge layer

The MAC merge sublayer (IEEE 802.3-2018 clause 99) is one of 2
specifications (the other being Frame Preemption; IEEE 802.1Q-2018
clause 6.7.2), which work together to minimize latency caused by frame
interference at TX. The overall goal of TSN is for normal traffic and
traffic with a bounded deadline to be able to cohabitate on the same L2
network and not bother each other too much.

The standards achieve this (partly) by introducing the concept of
preemptible traffic, i.e. Ethernet frames that have a custom value for
the Start-of-Frame-Delimiter (SFD), and these frames can be fragmented
and reassembled at L2 on a link-local basis. The non-preemptible frames
are called express traffic, they are transmitted using a normal SFD, and
they can preempt preemptible frames, therefore having lower latency,
which can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo
frames around 9K). Preemption is not recursive, i.e. a P frame cannot
preempt another P frame. Preemption also does not depend upon priority,
or otherwise said, an E frame with prio 0 will still preempt a P frame
with prio 7.

In terms of implementation, the standards talk about the presence of an
express MAC (eMAC) which handles express traffic, and a preemptible MAC
(pMAC) which handles preemptible traffic, and these MACs are multiplexed
on the same MII by a MAC merge layer.

To support frame preemption, the definition of the SFD was generalized
to SMD (Start-of-mPacket-Delimiter), where an mPacket is essentially an
Ethernet frame fragment, or a complete frame. Stations unaware of an SMD
value different from the standard SFD will treat P frames as error
frames. To prevent that from happening, a negotiation process is
defined.

On RX, packets are dispatched to the eMAC or pMAC after being filtered
by their SMD. On TX, the eMAC/pMAC classification decision is taken by
the 802.1Q spec, based on packet priority (each of the 8 user priority
values may have an admin-status of preemptible or express).

The MAC Merge layer and the Frame Preemption parameters have some degree
of independence in terms of how software stacks are supposed to deal
with them. The activation of the MM layer is supposed to be controlled
by an LLDP daemon (after it has been communicated that the link partner
also supports it), after which a (hardware-based or not) verification
handshake takes place, before actually enabling the feature. So the
process is intended to be relatively plug-and-play. Whereas FP settings
are supposed to be coordinated across a network using something
approximating NETCONF.

The support contained here is exclusively for the 802.3 (MAC Merge)
portions and not for the 802.1Q (Frame Preemption) parts. This API is
sufficient for an LLDP daemon to do its job. The FP adminStatus variable
from 802.1Q is outside the scope of an LLDP daemon.

I have taken a few creative licenses and augmented the Linux kernel UAPI
compared to the standard managed objects recommended by IEEE 802.3.
These are:

- ETHTOOL_A_MM_PMAC_ENABLED: According to Figure 99-6: Receive
  Processing state diagram, a MAC Merge layer is always supposed to be
  able to receive P frames. However, this implies keeping the pMAC
  powered on, which will consume needless power in applications where FP
  will never be used. If LLDP is used, the reception of an Additional
  Ethernet Capabilities TLV from the link partner is sufficient
  indication that the pMAC should be enabled. So my proposal is that in
  Linux, we keep the pMAC turned off by default and that user space
  turns it on when needed.

- ETHTOOL_A_MM_VERIFY_ENABLED: The IEEE managed object is called
  aMACMergeVerifyDisableTx. I opted for consistency (positive logic) in
  the boolean netlink attributes offered, so this is also positive here.
  Other than the meaning being reversed, they correspond to the same
  thing.

- ETHTOOL_A_MM_MAX_VERIFY_TIME: I found it most reasonable for a LLDP
  daemon to maximize the verifyTime variable (delay between SMD-V
  transmissions), to maximize its chances that the LP replies. IEEE says
  that the verifyTime can range between 1 and 128 ms, but the NXP ENETC
  stupidly keeps this variable in a 7 bit register, so the maximum
  supported value is 127 ms. I could have chosen to hardcode this in the
  LLDP daemon to a lower value, but why not let the kernel expose its
  supported range directly.

- ETHTOOL_A_MM_TX_MIN_FRAG_SIZE: the standard managed object is called
  aMACMergeAddFragSize, and expresses the "additional" fragment size
  (on top of ETH_ZLEN), whereas this expresses the absolute value of the
  fragment size.

- ETHTOOL_A_MM_RX_MIN_FRAG_SIZE: there doesn't appear to exist a managed
  object mandated by the standard, but user space clearly needs to know
  what is the minimum supported fragment size of our local receiver,
  since LLDP must advertise a value no lower than that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sock: Introduce trace_sk_data_ready()

As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations.  For example:

<...>
  iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
  iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-add-support-of-latency-tlv'

Petr Machata says:

====================
mlxsw: Add support of latency TLV

Amit Cohen writes:

Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
the driver and device's firmware. They are used to pass various
configurations to the device, but also to get events (e.g., port up)
from it. After the Ethernet header, these packets are built in a TLV
format.

This is the structure of EMADs:
* Ethernet header
* Operation TLV
* String TLV (optional)
* Latency TLV (optional)
* Reg TLV
* End TLV

The latency of each EMAD is measured by firmware. The driver can get the
measurement via latency TLV which can be added to each EMAD. This TLV is
optional, when EMAD is sent with this TLV, the EMAD's response will include
the TLV and will contain the firmware measurement.

Add support for Latency TLV and use it by default for all EMADs (see
more information in commit messages). The latency measurements can be
processed using BPF program for example, to create a histogram and average
of the latency per register. In addition, it is possible to measure the
end-to-end latency, so then the latency of the software overhead can be
calculated. This information can be useful to improve the driver
performance.

See an example of output of BPF tool which presents these measurements:

$ ./emadlatency -f -a
    Tracing EMADs... Hit Ctrl-C to end.
    Register write = RALUE (0x8013)
    E2E Measurements:
    average = 23 usecs, total = 32052693 usecs, count = 1337061
         usecs               : count    distribution
             0 -> 1          : 0        |                                 |
             2 -> 3          : 0        |                                 |
             4 -> 7          : 0        |                                 |
             8 -> 15         : 0        |                                 |
            16 -> 31         : 1290814  |*********************************|
            32 -> 63         : 45339    |*                                |
            64 -> 127        : 532      |                                 |
           128 -> 255        : 247      |                                 |
           256 -> 511        : 57       |                                 |
           512 -> 1023       : 26       |                                 |
          1024 -> 2047       : 33       |                                 |
          2048 -> 4095       : 0        |                                 |
          4096 -> 8191       : 10       |                                 |
          8192 -> 16383      : 1        |                                 |
         16384 -> 32767      : 1        |                                 |
         32768 -> 65535      : 1        |                                 |

    Firmware Measurements:
    average = 10 usecs, total = 13884128 usecs, count = 1337061
         usecs               : count    distribution
             0 -> 1          : 0        |                                 |
             2 -> 3          : 0        |                                 |
             4 -> 7          : 0        |                                 |
             8 -> 15         : 1337035  |*********************************|
            16 -> 31         : 17       |                                 |
            32 -> 63         : 7        |                                 |
            64 -> 127        : 0        |                                 |
           128 -> 255        : 2        |                                 |

    Diff between measurements: 13 usecs

Patch set overview:
Patches #1-#3 add support for querying MGIR, to know if string TLV and
latency TLV are supported
Patches #4-#5 add some relevant fields to support latency TLV
Patch #6 adds support of latency TLV
====================

Link: https://lore.kernel.org/r/cover.1674123673.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: Add support of latency TLV

The latency of each EMAD can be measured by firmware. The driver can get
the measurement via latency TLV which can be added to each EMAD. This TLV
is optional, when EMAD is sent with this TLV, the EMAD's response will
include the TLV and the field 'latency_time' will contain the firmware
measurement.

This information can be processed using BPF program for example, to
create a histogram and average of the latency per register. In addition,
it is possible to measure the end-to-end latency, and then reduce firmware
measurement, which will result in the latency of the software overhead.
This information can be useful to improve the driver performance.

Add support for latency TLV by default for all EMADs. First we planned to
enable latency TLV per demand, using devlink-param. After some tests, we
know that the usage of latency TLV does not impact the end-to-end latency,
so it is OK to enable it by default.

Note that similar to string TLV, the latency TLV is not supported in all
firmware versions. Enable the usage of this TLV only after verifying it is
supported by the current firmware version by querying the Management
General Information Register (MGIR).

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core: Define latency TLV fields

The next patch will add support for latency TLV as part of EMAD (Ethernet
Management Datagrams) packets. As preparation, add the relevant fields.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: emad: Add support for latency TLV

The next patches will add support for latency TLV as part of EMAD (Ethernet
Management Datagrams) packets. As preparation, add the relevant values.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core: Do not worry about changing 'enable_string_tlv' while sending EMADs

Till now, the field 'mlxsw_core->emad.enable_string_tlv' is set as part
of mlxsw_sp_init(), this means that it can be changed during
emad_reg_access(). To avoid such change, this field is read once in
emad_reg_access() and the value is used all the way.

The previous patch sets this value according to MGIR output, as part of
mlxsw_emad_init(), so now it cannot be changed while sending EMADs.

Do not save 'enable_string_tlv' and do not pass it to functions, just pass
'struct mlxsw_core' and use the value directly from it.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: Enable string TLV usage according to MGIR output

String TLV is not supported by old firmware versions, therefore
'struct mlxsw_core' stores the field 'emad.enable_string_tlv', which is
set to true only after firmware version check.

Instead of assuming that firmware version check is enough to enable
string TLV, a better solution is to query if this TLV is supported from
MGIR register. Add such query and initialize 'emad.enable_string_tlv'
accordingly.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Add TLV related fields to MGIR register

MGIR (Management General Information Register) allows software to query the
hardware and firmware general information. As part of firmware information,
the driver can query if string TLV and latency TLV are supported. These
TLVs are part of EMAD's header and are used to provide information per
EMAD packet to software.

Currently, string TLV is already used by the driver, but it does not
query if this TLV is supported from MGIR. The next patches will add support
of latency TLV. Add the relevant fields to MGIR, so then the driver will
query them to know if the TLVs are supported before using them.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_qoriq: fix latency in ptp_qoriq_adjtime() operation

1588 driver loses about 1us in adjtime operation at PTP slave
This is because adjtime operation uses a slow non-atomic tmr_cnt_read()
followed by tmr_cnt_write() operation.

In the above sequence, since the timer counter operation keeps
incrementing, it leads to latency. The tmr_offset register
(which is added to TMR_CNT_H/L register giving the current time)
must be programmed with the delta nanoseconds.

Signed-off-by: Nikhil Gupta <nikhil.gupta@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230119204034.7969-1-nikhil.gupta@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: vcap: use kmemdup() to allocate memory

Use kmemdup() helper instead of open-coding to simplify
the code when allocating newckf and newcaf.

Generated by: scripts/coccinelle/api/memdup.cocci

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Link: https://lore.kernel.org/r/20230119092210.3607634-1-yangyingliang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mdio-remove-support-for-building-c45-muxed-addresses'

Michael Walle says:

====================
net: mdio: Remove support for building C45 muxed addresses

I've picked this older series from Andrew up and rebased it onto
the latest net-next.

With all drivers which support c45 now being converted to a seperate c22
and c45 access op, we can now remove the old MII_ADDR_C45 flag.
====================

Link: https://lore.kernel.org/r/20230119130700.440601-1-michael@walle.cc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: Remove support for building C45 muxed addresses

The old way of performing a C45 bus transfer created a special
register value and passed it to the MDIO bus driver, in the hope it
would see the MII_ADDR_C45 bit set, and perform a C45 transfer. Now
that there is a clear separation of C22 and C45, this scheme is no
longer used. Remove all the #defines and helpers, to prevent any code
being added which tries to use it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Remove C45 check in C22 only MDIO bus drivers

The MDIO core should not pass a C45 request via the C22 API call any
more. So remove the tests from the drivers.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ngbe: Drop mdiobus_c45_regad()

With the new C45 MDIO access API, there is no encoding of the register
number anymore and thus the masking isn't necessary anymore. Remove it.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Remove fallback to old C45 method

Now that all MDIO bus drivers which support C45 implement the c45
specific ops, remove the fallback to the old method.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'r8152-improve-the-code'

Hayes Wang says:

====================
r8152: improve the code

These are some minor improvements depending on commit ec51fbd1b8a2 ("r8152:
add USB device driver for config selection").
====================

Link: https://lore.kernel.org/r/20230119074043.10021-397-nic_swsd@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8152: reduce the control transfer of rtl8152_get_version()

Reduce the control transfer by moving calling rtl8152_get_version() in
rtl8152_probe(). This could prevent from calling rtl8152_get_version()
for unnecessary situations. For example, after setting config #2 for the
device, there are two interfaces and rtl8152_probe() may be called
twice. However, we don't need to call rtl8152_get_version() for this
situation.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8152: remove rtl_vendor_mode function

After commit ec51fbd1b8a2 ("r8152: add USB device driver for
config selection"), the code about changing USB configuration
in rtl_vendor_mode() wouldn't be run anymore. Therefore, the
function could be removed.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a30853 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f092b ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed3558719 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.2-rc5-2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, bluetooth, bpf and netfilter.

  Current release - regressions:

   - Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6
     addrconf", fix nsna_ping mode of team

   - wifi: mt76: fix bugs in Rx queue handling and DMA mapping

   - eth: mlx5:
      - add missing mutex_unlock in error reporter
      - protect global IPsec ASO with a lock

  Current release - new code bugs:

   - rxrpc: fix wrong error return in rxrpc_connect_call()

  Previous releases - regressions:

   - bluetooth: hci_sync: fix use of HCI_OP_LE_READ_BUFFER_SIZE_V2

   - wifi:
      - mac80211: fix crashes on Rx due to incorrect initialization of
        rx->link and rx->link_sta
      - mac80211: fix bugs in iTXQ conversion - Tx stalls, incorrect
        aggregation handling, crashes
      - brcmfmac: fix regression for Broadcom PCIe wifi devices
      - rndis_wlan: prevent buffer overflow in rndis_query_oid

   - netfilter: conntrack: handle tcp challenge acks during connection
     reuse

   - sched: avoid grafting on htb_destroy_class_offload when destroying

   - virtio-net: correctly enable callback during start_xmit, fix stalls

   - tcp: avoid the lookup process failing to get sk in ehash table

   - ipa: disable ipa interrupt during suspend

   - eth: stmmac: enable all safety features by default

  Previous releases - always broken:

   - bpf:
      - fix pointer-leak due to insufficient speculative store bypass
        mitigation (Spectre v4)
      - skip task with pid=1 in send_signal_common() to avoid a splat
      - fix BPF program ID information in BPF_AUDIT_UNLOAD as well as
        PERF_BPF_EVENT_PROG_UNLOAD events
      - fix potential deadlock in htab_lock_bucket from same bucket
        index but different map_locked index

   - bluetooth:
      - fix a buffer overflow in mgmt_mesh_add()
      - hci_qca: fix driver shutdown on closed serdev
      - ISO: fix possible circular locking dependency
      - CIS: hci_event: fix invalid wait context

   - wifi: brcmfmac: fixes for survey dump handling

   - mptcp: explicitly specify sock family at subflow creation time

   - netfilter: nft_payload: incorrect arithmetics when fetching VLAN
     header bits

   - tcp: fix rate_app_limited to default to 1

   - l2tp: close all race conditions in l2tp_tunnel_register()

   - eth: mlx5: fixes for QoS config and eswitch configuration

   - eth: enetc: avoid deadlock in enetc_tx_onestep_tstamp()

   - eth: stmmac: fix invalid call to mdiobus_get_phy()

  Misc:

   - ethtool: add netlink attr in rss get reply only if the value is not
     empty"

* tag 'net-6.2-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  Revert "Merge branch 'octeontx2-af-CPT'"
  tcp: fix rate_app_limited to default to 1
  bnxt: Do not read past the end of test names
  net: stmmac: enable all safety features by default
  octeontx2-af: add mbox to return CPT_AF_FLT_INT info
  octeontx2-af: update cpt lf alloc mailbox
  octeontx2-af: restore rxc conf after teardown sequence
  octeontx2-af: optimize cpt pf identification
  octeontx2-af: modify FLR sequence for CPT
  octeontx2-af: add mbox for CPT LF reset
  octeontx2-af: recover CPT engine when it gets fault
  net: dsa: microchip: ksz9477: port map correction in ALU table entry register
  selftests/net: toeplitz: fix race on tpacket_v3 block close
  net/ulp: use consistent error code when blocking ULP
  octeontx2-pf: Fix the use of GFP_KERNEL in atomic context on rt
  tcp: avoid the lookup process failing to get sk in ehash table
  Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf"
  MAINTAINERS: add networking entries for Willem
  net: sched: gred: prevent races when adding offloads to stats
  l2tp: prevent lockdep issue in l2tp_tunnel_register()
  ...

Revert "Merge branch 'octeontx2-af-CPT'"

This reverts commit b4fbf0b27fa9dd2594b3371532341bd4636a00f9, reversing
changes made to 6c977c5c2e4c5d8ad1b604724cc344e38f96fe9b.

This seems like net-next material.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'octeontx2-af-miscellaneous-changes-for-cpt'

Srujana Challa says:

====================
octeontx2-af: Miscellaneous changes for CPT

This patchset consists of miscellaneous changes for CPT.
- Adds a new mailbox to reset the requested CPT LF.
- Modify FLR sequence as per HW team suggested.
- Adds support to recover CPT engines when they gets fault.
- Updates CPT inbound inline IPsec configuration mailbox,
as per new generation of the OcteonTX2 chips.
- Adds a new mailbox to return CPT FLT Interrupt info.
====================

Link: https://lore.kernel.org/r/20230118120354.1017961-1-schalla@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: add mbox to return CPT_AF_FLT_INT info

CPT HW would trigger the CPT AF FLT interrupt when CPT engines
hits some uncorrectable errors and AF is the one which receives
the interrupt and recovers the engines.
This patch adds a mailbox for CPT VFs to request for CPT faulted
and recovered engines info.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: update cpt lf alloc mailbox

The CN10K CPT coprocessor contains a context processor
to accelerate updates to the IPsec security association
contexts. The context processor contains a context cache.
This patch updates CPT LF ALLOC mailbox to config ctx_ilen
requested by VFs. CPT_LF_ALLOC:ctx_ilen is the size of
initial context fetch.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: restore rxc conf after teardown sequence

CN10K CPT coprocessor includes a component named RXC which
is responsible for reassembly of inner IP packets. RXC has
the feature to evict oldest entries based on age/threshold.
The age/threshold is being set to minimum values to evict
all entries at the time of teardown.
This patch adds code to restore timeout and threshold config
after teardown sequence is complete as it is global config.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: optimize cpt pf identification

Optimize CPT PF identification in mbox handling for faster
mbox response by doing it at AF driver probe instead of
every mbox message.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: modify FLR sequence for CPT

On OcteonTX2 platform CPT instruction enqueue is only
possible via LMTST operations.
The existing FLR sequence mentioned in HRM requires
a dummy LMTST to CPT but LMTST can't be submitted from
AF driver. So, HW team provided a new sequence to avoid
dummy LMTST. This patch adds code for the same.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: add mbox for CPT LF reset

On OcteonTX2 SoC, the admin function (AF) is the only one with all
priviliges to configure HW and alloc resources, PFs and it's VFs
have to request AF via mailbox for all their needs.
This patch adds a new mailbox for CPT VFs to request for CPT LF
reset.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: recover CPT engine when it gets fault

When CPT engine has uncorrectable errors, it will get halted and
must be disabled and re-enabled. This patch adds code for the same.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: use GNSS subsystem instead of TTY

Previously support for GNSS was implemented as a TTY driver, it allowed
to access GNSS receiver on /dev/ttyGNSS_<bus><func>.

Use generic GNSS subsystem API instead of implementing own TTY driver.
The receiver is accessible on /dev/gnss<id>. In case of multiple receivers
in the OS, correct device can be found by enumerating either:
- /sys/class/net/<eth port>/device/gnss/
- /sys/class/gnss/gnss<id>/device/

Using GNSS subsystem is superior to implementing own TTY driver, as the
GNSS subsystem was designed solely for this purpose. It also implements
TTY driver but in a common and defined way.

From user perspective, there is no difference in communicating with a
device, except new path to the device shall be used. The device will
provide same information to the userspace as the old one, and can be used
in the same way, i.e.:
old # gpsmon /dev/ttyGNSS_2100_0
new # gpsmon /dev/gnss0
There is no other impact on userspace tools.

User expecting onboard GNSS receiver support is required to enable
CONFIG_GNSS=y/m in kernel config.

Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: Switch to use acpi_evaluate_dsm_typed()

The acpi_evaluate_dsm_typed() provides a way to check the type of the
object evaluated by _DSM call. Use it instead of open coded variant.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ACPI: utils: Add acpi_evaluate_dsm_typed() and acpi_check_dsm() stubs

When the ACPI part of a driver is optional the methods used in it
are expected to be available even if CONFIG_ACPI=n. This is not
the case for _DSM related methods. Add stubs for
acpi_evaluate_dsm_typed() and acpi_check_dsm() methods.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>