review.tizen.org Git - platform/kernel/linux-starfive.git/log

octeontx2-af: Add devlink option to adjust mcam high prio zone entries

The NPC MCAM entries are currently divided into three priority zones
in AF driver: high, mid, and low. The high priority zone and low priority
zone take up 1/8th (each) of the available MCAM entries, and remaining
going to the mid priority zone.

The current allocation scheme may not meet certain requirements, such
as when a requester needs more high priority zone entries than are
reserved. This patch adds a devlink configurable option to increase the
number of high priority zone entries that can be allocated by requester.
The max number of entries that can be reserved for high priority usage
is 100% of available MCAM entries.

Usage:
1) Change high priority zone percentage to 75%:
devlink -p dev param set pci/0002:01:00.0 name npc_mcam_high_zone_percent \
value 75 cmode runtime

2) Read high priority zone percentage:
devlink -p dev param show pci/0002:01:00.0 name npc_mcam_high_zone_percent

The devlink set configuration is only permitted when no MCAM entries
are assigned, i.e., all MCAM entries are free, indicating that no PF/VF
driver is loaded. So user must unload/unbind PF/VF driver/devices before
modifying the high priority zone percentage.

Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: Fix layer 2 miss test syntax

The test currently specifies "l2_miss" as "true" / "false", but the
version that eventually landed in iproute2 uses "1" / "0" [1]. Align the
test accordingly.

[1] https://lore.kernel.org/netdev/20230607153550.3829340-1-idosch@nvidia.com/

Fixes: 8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'splice-net-some-miscellaneous-msg_splice_pages-changes'

David Howells says:

====================
splice, net: Some miscellaneous MSG_SPLICE_PAGES changes

Now that the splice_to_socket() has been rewritten so that nothing now uses
the ->sendpage() file op[1], some further changes can be made, so here are
some miscellaneous changes that can now be done.

(1) Remove the ->sendpage() file op.

(2) Remove hash_sendpage*() from AF_ALG.

(3) Make sunrpc send multiple pages in single sendmsg() call rather than
calling sendpage() in TCP (or maybe TLS).

(4) Make tcp_bpf_sendpage() a wrapper around tcp_bpf_sendmsg().

(5) Make AF_KCM use sendmsg() when calling down to TCP and then make it
send entire fragment lists in single sendmsg calls.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=fd5f4d7da29218485153fd8b4c08da7fc130c79f
====================

Link: https://lore.kernel.org/r/20230609100221.2620633-1-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

kcm: Send multiple frags in one sendmsg()

Rewrite the AF_KCM transmission loop to send all the fragments in a single
skb or frag_list-skb in one sendmsg() with MSG_SPLICE_PAGES set.  The list
of fragments in each skb is conveniently a bio_vec[] that can just be
attached to a BVEC iter.

Note: I'm working out the size of each fragment-skb by adding up bv_len for
all the bio_vecs in skb->frags[] - but surely this information is recorded
somewhere?  For the skbs in head->frag_list, this is equal to
skb->data_len, but not for the head.  head->data_len includes all the tail
frags too.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tom Herbert <tom@herbertland.com>
cc: Tom Herbert <tom@quantonium.net>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

kcm: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

When transmitting data, call down into the transport socket using sendmsg
with MSG_SPLICE_PAGES to indicate that content should be spliced rather
than using sendpage.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tom Herbert <tom@herbertland.com>
cc: Tom Herbert <tom@quantonium.net>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES)

Make tcp_bpf_sendpage() a wrapper around tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
rather than a loop calling tcp_sendpage(). sendpage() will be removed in
the future.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

When transmitting data, call down into TCP using sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing sendpage calls to transmit header, data pages and trailer.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

algif: Remove hash_sendpage*()

Remove hash_sendpage*() as nothing should now call it since the rewrite of
splice_to_socket()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2dc334f1a63a8839b88483a3e73c0f27c9c1791c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Remove file->f_op->sendpage

Remove file->f_op->sendpage as splicing to a socket now calls sendmsg
rather than sendpage.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-flower-add-cfm-support'

Zahari Doychev says:

====================
net: flower: add cfm support

The first patch adds cfm support to the flow dissector.
The second adds the flower classifier support.
The third adds a selftest for the flower cfm functionality.
====================

Link: https://lore.kernel.org/r/20230608105648.266575-1-zahari.doychev@linux.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add tc flower cfm test

New cfm flower test case is added to the net forwarding selfttests.

Example output:

# ./tc_flower_cfm.sh p1 p2
TEST: CFM opcode match test                                         [ OK ]
TEST: CFM level match test                                          [ OK ]
TEST: CFM opcode and level match test                               [ OK ]

Signed-off-by: Zahari Doychev <zdoychev@maxlinear.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: flower: add support for matching cfm fields

Add support to the tc flower classifier to match based on fields in CFM
information elements like level and opcode.

tc filter add dev ens6 ingress protocol 802.1q \
flower vlan_id 698 vlan_ethtype 0x8902 cfm mdl 5 op 46 \
action drop

Signed-off-by: Zahari Doychev <zdoychev@maxlinear.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: flow_dissector: add support for cfm packets

Add support for dissecting cfm packets. The cfm packet header
fields maintenance domain level and opcode can be dissected.

Signed-off-by: Zahari Doychev <zdoychev@maxlinear.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mlxsw: i2c: Switch back to use struct i2c_driver's .probe()

After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
call-back type"), all drivers being converted to .probe_new() and then
commit 03c835f498b5 ("i2c: Switch .probe() to not take an id parameter")
convert back to (the new) .probe() to be able to eventually drop
.probe_new() from struct i2c_driver.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add driver for MediaTek SoC built-in GE PHYs

Some of MediaTek's Filogic SoCs come with built-in gigabit Ethernet
PHYs which require calibration data from the SoC's efuse.
Despite the similar design the driver doesn't share any code with the
existing mediatek-ge.c.
Add support for such PHYs by introducing a new driver with basic
support for MediaTek SoCs MT7981 and MT7988 built-in 1GE PHYs.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2023-06-09' of git://git./linux/kernel/git/saeed/linux

mlx5-updates-2023-06-09

1) Embedded CPU Virtual Functions
2) Lightweight local SFs

Daniel Jurgens says:
====================
Embedded CPU Virtual Functions

This series enables the creation of virtual functions on Bluefield (the
embedded CPU platform). Embedded CPU virtual functions (EC VFs). EC VF
creation, deletion and management interfaces are the same as those for
virtual functions in a server with a Connect-X NIC.

When using EC VFs on the ARM the creation of virtual functions on the
host system is still supported. Host VFs eswitch vports occupy a range
of 1..max_vfs, the EC VF vport range is max_vfs+1..max_ec_vfs.

Every function (PF, ECPF, VF, EC VF, and subfunction) has a function ID
associated with it. Prior to this series the function ID and the eswitch
vport were the same. That is no longer the case, the EC VF function ID
range is 1..max_ec_vfs. When querying or setting the capabilities of an
EC VF function an new bit must be set in the query/set HCA cap
structure.

This is a high level overview of the changes made:
- Allocate vports for EC VFs if they are enabled.
- Create representors and devlink ports for the EC VF vports.
- When querying/setting HCA caps by vport break the assumption
  that function ID is the same a vport number and adjust
  accordingly.
- Create a new type of page, so that when SRIOV on the ARM is
  disabled, but remains enabled on the host, the driver can
  wait for the correct pages.
- Update SRIOV code to support EC VF creation/deletion.

===================

Lightweight local SFs:

Last 3 patches form Shay Drory:

SFs are heavy weight and by default they come with the full package of
ConnectX features. Usually users want specialized SFs for one specific
purpose and using devlink users will almost always override the set of
advertises features of an SF and reload it.

Shay Drory says:
================
In order to avoid the wasted time and resources on the reload, local SFs
will probe without any auxiliary sub-device, so that the SFs can be
configured prior to its full probe.

The defaults of the enable_* devlink params of these SFs are set to
false.

Usage example:
Create SF:
$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
$ devlink port function set pci/0000:08:00.0/32768 \
               hw_addr 00:00:00:00:00:11 state active

Enable ETH auxiliary device:
$ devlink dev param set auxiliary/mlx5_core.sf.1 \
              name enable_eth value true cmode driverinit

Now, in order to fully probe the SF, use devlink reload:
$ devlink dev reload auxiliary/mlx5_core.sf.1

At this point the user have SF devlink instance with auxiliary device
for the Ethernet functionality only.

================

Merge branch 'tcp-tx-headless'

Eric Dumazet says:

====================
tcp: tx path fully headless

This series completes transition of TCP stack tx path
to headless packets : All payload now reside in page frags,
never in skb->head.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove size parameter from tcp_stream_alloc_skb()

Now all tcp_stream_alloc_skb() callers pass @size == 0, we can
remove this parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove some dead code

Now all skbs in write queue do not contain any payload in skb->head,
we can remove some dead code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: let tcp_send_syn_data() build headless packets

tcp_send_syn_data() is the last component in TCP transmit
path to put payload in skb->head.

Switch it to use page frags, so that we can remove dead
code later.

This allows to put more payload than previous implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ethtool-extack'

Jakub Kicinski says:

====================
net: support extack in dump and simplify ethtool uAPI

Ethtool currently requires header nest to be always present even if
it doesn't have to carry any attr for a given request. This inflicts
unnecessary pain on the users.

What makes it worse is that extack was not working in dump's ->start()
callback. Address both of those issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: don't require empty header nests

Ethtool currently requires a header nest (which is used to carry
the common family options) in all requests including dumps.

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 64 (48) nl_flags = 0x300 nl_type = 2
error: -22      extack: {'msg': 'request header missing'}

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get \
           --json '{"header":{}}';  )
  [{'combined-count': 1,
    'combined-max': 1,
    'header': {'dev-index': 2, 'dev-name': 'enp1s0'}}]

Requiring the header nest to always be there may seem nice
from the consistency perspective, but it's not serving any
practical purpose. We shouldn't burden the user like this.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: support extack in dump ->start()

Commit 4a19edb60d02 ("netlink: Pass extack to dump handlers")
added extack support to netlink dumps. It was focused on rtnl
and since rtnl does not use ->start(), ->done() callbacks
it ignored those. Genetlink on the other hand uses ->start()
extensively, for parsing and input validation.

Pass the extact in via struct netlink_dump_control and link
it to cb for the time of ->start(). Both struct netlink_dump_control
and extack itself live on the stack so we can't keep the same
extack for the duration of the dump. This means that the extack
visible in ->start() and each ->dump() callbacks will be different.
Corner cases like reporting a warning message in DONE across dump
calls are still not supported.

We could put the extack (for dumps) in the socket struct,
but layering makes it slightly awkward (extack pointer is decided
before the DO / DUMP split).

The genetlink dump error extacks are now surfaced:

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 64 (48) nl_flags = 0x300 nl_type = 2
error: -22 extack: {'msg': 'request header missing'}

Previously extack was missing:

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 36 (20) nl_flags = 0x100 nl_type = 2
error: -22

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ynl-ethtool'

Jakub Kicinski says:

====================
tools: ynl: generate code for the ethtool family

And finally ethtool support. Thanks to Stan's work the ethtool family
spec is quite complete, so there is a lot of operations to support.

I chickened out of stats-get support, they require at the very least
type-value support on a u64 scalar. Type-value is an arrangement where
a u16 attribute is encoded directly in attribute type. Code gen can
support this if the inside is a nest, we just throw in an extra
field into that nest to carry the attr type. But a little more coding
is needed to for a scalar, because first we need to turn the scalar
into a struct with one member, then we can add the attr type.

Other than that ethtool required event support (notification which
does not share contents with any GET), but the previous series
already added that to the codegen.

I haven't tested all the ops here, and a few I tried seem to work.
====================

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: add sample for ethtool

Configuring / reading ring sizes and counts is a fairly common
operation for ethtool netlink. Present a sample doing that with
YNL:

$ ./ethtool
Channels:
    enp1s0: combined 1
   eni1np1: combined 1
   eni2np1: combined 1
Rings:
    enp1s0: rx 256 tx 256
   eni1np1: rx 0 tx 0
   eni2np1: rx 0 tx 0

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: generate code for the ethtool family

Generate the protocol code for ethtool. Skip the stats
for now, they are the only outlier in terms of complexity.
Stats are a sort-of semi-polymorphic (attr space of a nest
depends on value of another attr) or a type-value-scalar,
depending on how one wants to look at it...
A challenge for another time.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: ethtool: mark pads as pads

Pad is a separate type. Even though in practice they can
only be a u32 the value should be discarded.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: ethtool: untangle stats-get

Code gen for stats is a bit of a challenge, but from looking
at the attrs I think that the format isn't quite right.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: ethtool: untangle UDP tunnels and cable test a bit

UDP tunnel and cable test messages have a lot of nests,
which do not match the names of the enum entries in C uAPI.
Some of the structure / nesting also looks wrong.

Untangle this a little bit based on the names, comments and
educated guesses, I haven't actually tested the results.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: ethtool: add empty enum stringset

C does not allow defining structures and enums with the same name.
Since enum ethtool_stringset exists in the uAPI we need to include
at least a stub of it in the spec. This will trigger name collision
avoidance in the code gen.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl-gen: resolve enum vs struct name conflicts

Ethtool has an attribute set called stringset, from which
we'll generate struct ethtool_stringset. Unfortunately,
the old ethtool header declares enum ethtool_stringset
(the same name), to which compilers object.

This seems unavoidable. Check struct names against known
constants and append an underscore if conflict is detected.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl-gen: don't generate enum types if unnamed

If attr set or enum has empty enum name we need to use u32 or int
as function arguments and struct members.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: ethtool: add C render hints

Most of the C enum names are guessed correctly, but there
is a handful of corner cases we need to name explicitly.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: support setting prefix-name per attribute

Ethtool's PSE PoDL has a attr nest with different prefixes:

/* Power Sourcing Equipment */
enum {
ETHTOOL_A_PSE_UNSPEC,
ETHTOOL_A_PSE_HEADER, /* nest - _A_HEADER_* */
ETHTOOL_A_PODL_PSE_ADMIN_STATE, /* u32 */
ETHTOOL_A_PODL_PSE_ADMIN_CONTROL, /* u32 */
ETHTOOL_A_PODL_PSE_PW_D_STATUS, /* u32 */

Header has a prefix of ETHTOOL_A_PSE_ and other attrs prefix of
ETHTOOL_A_PODL_PSE_ we can't cover them uniformly.
If PODL was after PSE life would be easy.

Now we either need to add prefixes to attr names which is yucky
or support setting prefix name per attr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl-gen: record extra args for regen

ynl-regen needs to know the arguments used to generate a file.
Record excluded ops and, while at it, user headers.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl-gen: support excluding tricky ops

The ethtool family has a small handful of quite tricky ops
and a lot of simple very useful ops. Teach ynl-gen to skip
ops so that we can bypass the tricky ones.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

mdio: mdio-mux-mmioreg: Use of_property_read_reg() to parse "reg"

Use the recently added of_property_read_reg() helper to get the
untranslated "reg" address value.

Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: drop unneeded quotes

Cleanup bindings dropping unneeded quotes. Once all these are fixed,
checking for this can be enabled in yamllint.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'SCM_PIDFD-SCM_PEERPIDFD'

Alexander Mikhalitsyn says:

====================
Add SCM_PIDFD and SO_PEERPIDFD

1. Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.

2. Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.

3. Add SCM_PIDFD / SO_PEERPIDFD kselftest

Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/

Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this and Luca Boccassi for testing and reviewing this.

=== Motivation behind this patchset

Eric Dumazet raised a question:
> It seems that we already can use pidfd_open() (since linux-5.3), and
> pass the resulting fd in af_unix SCM_RIGHTS message ?

Yes, it's possible, but it means that from the receiver side we need
to trust the sent pidfd (in SCM_RIGHTS),
or always use combination of SCM_RIGHTS+SCM_CREDENTIALS, then we can
extract pidfd from SCM_RIGHTS,
then acquire plain pid from pidfd and after compare it with the pid
from SCM_CREDENTIALS.

A few comments from other folks regarding this.

Christian Brauner wrote:

>Let me try and provide some of the missing background.

>There are a range of use-cases where we would like to authenticate a
>client through sockets without being susceptible to PID recycling
>attacks. Currently, we can't do this as the race isn't fully fixable.
>We can only apply mitigations.

>What this patchset will allows us to do is to get a pidfd without the
>client having to send us an fd explicitly via SCM_RIGHTS. As that's
>already possibly as you correctly point out.

>But for protocols like polkit this is quite important. Every message is
>standalone and we would need to force a complete protocol change where
>we would need to require that every client allocate and send a pidfd via
>SCM_RIGHTS. That would also mean patching through all polkit users.

>For something like systemd-journald where we provide logging facilities
>and want to add metadata to the log we would also immensely benefit from
>being able to get a receiver-side controlled pidfd.

>With the message type we envisioned we don't need to change the sender
>at all and can be safe against pid recycling.

Link: https://gitlab.freedesktop.org/polkit/polkit/-/merge_requests/154
Link: https://uapi-group.org/kernel-features
Lennart Poettering wrote:

>So yes, this is of course possible, but it would mean the pidfd would
>have to be transported as part of the user protocol, explicitly sent
>by the sender. (Moreover, the receiver after receiving the pidfd would
>then still have to somehow be able to prove that the pidfd it just
>received actually refers to the peer's process and not some random
>process. – this part is actually solvable in userspace, but ugly)

>The big thing is simply that we want that the pidfd is associated
>*implicity* with each AF_UNIX connection, not explicitly. A lot of
>userspace already relies on this, both in the authentication area
>(polkit) as well as in the logging area (systemd-journald). Right now
>using the PID field from SO_PEERCREDS/SCM_CREDENTIALS is racy though
>and very hard to get right. Making this available as pidfd too, would
>solve this raciness, without otherwise changing semantics of it all:
>receivers can still enable the creds stuff as they wish, and the data
>is then implicitly appended to the connections/datagrams the sender
>initiates.

>Or to turn this around: things like polkit are typically used to
>authenticate arbitrary dbus methods calls: some service implements a
>dbus method call, and when an unprivileged client then issues that
>call, it will take the client's info, go to polkit and ask it if this
>is ok. If we wanted to send the pidfd as part of the protocol we
>basically would have to extend every single method call to contain the
>client's pidfd along with it as an additional argument, which would be
>a massive undertaking: it would change the prototypes of basically
>*all* methods a service defines… And that's just ugly.

>Note that Alex' patch set doesn't expose anything that wasn't exposed
>before, or attach, propagate what wasn't before. All it does, is make
>the field already available anyway (the struct ucred .pid field)
>available also in a better way (as a pidfd), to solve a variety of
>races, with no effect on the protocol actually spoken within the
>AF_UNIX transport. It's a seamless improvement of the status quo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

af_unix: Kconfig: make CONFIG_UNIX bool

Let's make CONFIG_UNIX a bool instead of a tristate.
We've decided to do that during discussion about SCM_PIDFD patchset [1].

[1] https://lore.kernel.org/lkml/20230524081933.44dc8bea@kernel.org/

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: add SCM_PIDFD / SO_PEERPIDFD test

Basic test to check consistency between:
- SCM_CREDENTIALS and SCM_PIDFD
- SO_PEERCRED and SO_PEERPIDFD

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: add getsockopt SO_PEERPIDFD

Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

scm: add SO_PASSPIDFD and SCM_PIDFD

Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.

We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because
it depends on a pidfd_prepare() API which is not exported to the kernel
modules.

Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/

Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-cleanups'

Petr Machata says:

====================
mlxsw: Cleanups in router code

This patchset moves some router-related code from spectrum.c to
spectrum_router.c where it should be. It also simplifies handlers of
netevent notifications.

- Patch #1 caches router pointer in a dedicated variable. This obviates the
  need to access the same as mlxsw_sp->router, making lines shorter, and
  permitting a future patch to add code that fits within 80 character
  limit.

- Patch #2 moves IP / IPv6 validation notifier blocks from spectrum.c
  to spectrum_router, where the handlers are anyway.

- In patch #3, pass router pointer to scheduler of deferred work directly,
  instead of having it deduce it on its own.

- This makes the router pointer available in the handler function
  mlxsw_sp_router_netevent_event(), so in patch #4, use it directly,
  instead of finding it through mlxsw_sp_port.

- In patch #5, extend mlxsw_sp_router_schedule_work() so that the
  NETEVENT_NEIGH_UPDATE handler can use it directly instead of inlining
  equivalent code.

- In patches #6 and #7, add helpers for two common operations involving
  a backing netdev of a RIF. This makes it unnecessary for the function
  mlxsw_sp_rif_dev() to be visible outside of the router module, so in
  patch #8, hide it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Privatize mlxsw_sp_rif_dev()

Now that the external users of mlxsw_sp_rif_dev() have been converted in
the preceding patches, make the function static.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Convert does-RIF-have-this-netdev queries to a dedicated helper

In a number of places, a netdevice underlying a RIF is obtained only to
compare it to another pointer. In order to clean up the interface between
the router and the other modules, add a new helper to specifically answer
this question, and convert the relevant uses to this new interface.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Convert RIF-has-netdevice queries to a dedicated helper

In a number of places, a netdevice underlying a RIF is obtained only to
check if it a NULL pointer. In order to clean up the interface between the
router and the other modules, add a new helper to specifically answer this
question, and convert the relevant uses to this new interface.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Reuse work neighbor initialization in work scheduler

After the struct mlxsw_sp_netevent_work.n field initialization is moved
here, the body of code that handles NETEVENT_NEIGH_UPDATE is almost
identical to the one in the helper function. Therefore defer to the helper
instead of inlining the equivalent.

Note that previously, the code took and put a reference of the netdevice.
The new code defers to mlxsw_sp_dev_lower_is_port() to obviate the need for
taking the reference.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Use the available router pointer for netevent handling

This code handles NETEVENT_DELAY_PROBE_TIME_UPDATE, which is invoked every
time the delay_probe_time changes. mlxsw router currently only maintains
one timer, so the last delay_probe_time set wins.

Currently, mlxsw uses mlxsw_sp_port_lower_dev_hold() to find a reference to
the router. This is no longer necessary. But as a side effect, this makes
sure that only updates to "interesting netdevices" (ones that have a
physical netdevice lower) are projected.

Retain that side effect by calling mlxsw_sp_port_dev_lower_find_rcu() and
punting if there is none. Then just proceed using the router pointer that's
already at hand in the helper.

Note that previously, the code took and put a reference of the netdevice.
Because the mlxsw_sp pointer is now obtained from the notifier block, the
port pointer (non-) NULL-ness is all that's relevant, and the reference
does not need to be taken anymore.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Pass router to mlxsw_sp_router_schedule_work() directly

Instead of passing a notifier block and deducing the router pointer from
that in the helper, do that in the caller, and pass the result. In the
following patches, the pointer will also be made useful in the caller.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Move here inetaddr validator notifiers

The validation logic is already in the router code. Move there the notifier
blocks themselves as well.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: mlxsw_sp_router_fini(): Extract a helper variable

Make mlxsw_sp_router_fini() more similar to the _init() function (and more
concise) by extracting the `router' handle to a named variable and using
that throughout. The availability of a dedicated `router' variable will
come in handy in following patches.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: openvswitch: add support for l4 symmetric hashing

Since its introduction, the ovs module execute_hash action allowed
hash algorithms other than the skb->l4_hash to be used.  However,
additional hash algorithms were not implemented.  This means flows
requiring different hash distributions weren't able to use the
kernel datapath.

Now, introduce support for symmetric hashing algorithm as an
alternative hash supported by the ovs module using the flow
dissector.

Output of flow using l4_sym hash:

    recirc_id(0),in_port(3),eth(),eth_type(0x0800),
    ipv4(dst=64.0.0.0/192.0.0.0,proto=6,frag=no), packets:30473425,
    bytes:45902883702, used:0.000s, flags:SP.,
    actions:hash(sym_l4(0)),recirc(0xd)

Some performance testing with no GRO/GSO, two veths, single flow:

    hash(l4(0)):      4.35 GBits/s
    hash(l4_sym(0)):  4.24 GBits/s

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'taprio-xstats'

Vladimir Oltean says:

====================
Fixes for taprio xstats

1. Taprio classes correspond to TXQs, and thus, class stats are TXQ
stats not TC stats.
2. Drivers reporting taprio xstats should report xstats for *this*
taprio, not for a previous one.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: reset taprio stats when taprio is deleted

Currently, the window_drop stats persist even if an incorrect Qdisc was
removed from the interface and a new one is installed. This is because
the enetc driver keeps the state, and that is persistent across multiple
Qdiscs.

To resolve the issue, clear all win_drop counters from all TX queues
when the currently active Qdisc is removed. These counters are zero
by default. The counters visible in ethtool -S are also affected,
but I don't care very much about preserving those enough to keep them
monotonically incrementing.

Fixes: 4802fca8d1af ("net: enetc: report statistics counters for taprio")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: taprio: report class offload stats per TXQ, not per TC

The taprio Qdisc creates child classes per netdev TX queue, but
taprio_dump_class_stats() currently reports offload statistics per
traffic class. Traffic classes are groups of TXQs sharing the same
dequeue priority, so this is incorrect and we shouldn't be bundling up
the TXQ stats when reporting them, as we currently do in enetc.

Modify the API from taprio to drivers such that they report TXQ offload
stats and not TC offload stats.

There is no change in the UAPI or in the global Qdisc stats.

Fixes: 6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfc: nxp-nci: store __be16 value in __be16 variable

Use a __be16 variable to store the big endian value of header in
nxp_nci_i2c_fw_read().

Flagged by Sparse as:

.../i2c.c:113:22: warning: cast to restricted __be16

No functional changes intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mana: Add support for vlan tagging

To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
skb. Then extract the vlan tag from the skb struct, and save it to
tx_oob for the NIC to transmit. For vlan tags on the payload, they
are accepted by the NIC too.

For RX, extract the vlan tag from CQE and put it into skb.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: Add devlink dev info support for EF10

Reuse the work done for EF100 to add devlink support for EF10.
There is no devlink port support for EF10.

Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy

./net/sched/act_pedit.c:245:21-28: WARNING opportunity for kmemdup.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5478
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: add support for ethtool extended stat link_down_count

Following the example of 'commit 9a0f830f8026 ("ethtool: linkstate:
add a statistic for PHY down events")', added support for link down
events.

Add callback ionic_get_link_ext_stats to ionic_ethtool.c to support
link_down_count, a property of netdev that gets reported exclusively
on physical link down events.

Run ethtool -I <devname> to display the device link down count.

Signed-off-by: Nitya Sunkad <nitya.sunkad@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: Improve miscellaneous interrupt code

Jacob Keller says:

This series improves the driver's use of the threaded IRQ and the
communication between ice_misc_intr() and the ice_misc_intr_thread_fn()
which was previously introduced by commit 1229b33973c7 ("ice: Add low
latency Tx timestamp read").

First, a new custom enumerated return value is used instead of a boolean for
ice_ptp_process_ts(). This significantly reduces the cognitive burden when
reviewing the logic for this function, as the expected action is clear from
the return value name.

Second, the unconditional loop in ice_misc_intr_thread_fn() is removed,
replacing it with a write to the Other Interrupt Cause register. This causes
the MAC to trigger the Tx timestamp interrupt again. This makes it possible
to safely use the ice_misc_intr_thread_fn() to handle other tasks beyond
just the Tx timestamps. It is also easier to reason about since the thread
function will exit cleanly if we do something like disable the interrupt and
call synchronize_irq().

Third, refactor the handling for external timestamp events to use the
miscellaneous thread function. This resolves an issue with the external
time stamps getting blocked while processing the periodic work function
task.

Fourth, a simplification of the ice_misc_intr() function to always return
IRQ_WAKE_THREAD, and schedule the ice service task in the
ice_misc_intr_thread_fn() instead.

Finally, the Other Interrupt Cause is kept disabled over the thread function
processing, rather than immediately re-enabled.

Special thanks to Michal Schmidt for the careful review of the series and
pointing out my misunderstandings of the kernel IRQ code. It has been
determined that the race outlined as being fixed in previous series was
actually introduced by this series itself, which I've since corrected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: wwan: iosm: enable runtime pm support for 7560

Adds runtime pm support for 7560.

As part of probe procedure auto suspend is enabled and auto suspend
delay is set to 5000 ms for runtime pm use. Later auto flag is set
to power manage the device at run time.

On successful communication establishment between host and device the
device usage counter is dropped and request to put the device into
sleep state (suspend).

In TX path, the device usage counter is raised and device is moved out
of sleep(resume) for data transmission. In RX path, if the device has
some data to be sent it request host platform to change the power state
by giving PCI PME message.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: xlnx,axi-ethernet: convert bindings document to yaml

Convert the bindings document for Xilinx AXI Ethernet Subsystem
from txt to yaml. No changes to existing binding description.

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Sarath Babu Naidu Gaddam <sarath.babu.naidu.gaddam@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: vxlan: Fix selftest regression after changes in iproute2.

The iproute2 output that eventually landed upstream is different than
the one used in this test, resulting in failures. Fix by adjusting the
test to use iproute2's JSON output, which is more stable than regular
output.

Fixes: 305c04189997 ("selftests: net: vxlan: Add tests for vxlan nolocalbypass option.")
Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'renesas-rswitch-perf'

Yoshihiro Shimoda says:

====================
net: renesas: rswitch: Improve perfromance of TX/RX

This patch series is based on net-next.git / main branch [1]. This patch
series can improve perfromance of TX in a specific condition. The previous code
used "global rate limiter" feature so that this is possible to cause
performance down if we use multiple ports at the same time. To resolve this
issue, use "hardware pause" features of GWCA and COMA. Note that this is not
related to the ethernet PAUSE frames.

< UDP TX by iperf3 >
before: about 450Mbps on both tsn0 and tsn1
after:  about 950Mbps on both tsn0 and tsn1

Also, this patch series can improve performance of RX by using
napi_gro_receive().

< TCP RX by iperf >
before: about 670Mbps on tsn0
after:  about 840Mbps on tsn0

[1]
The commit e06bd5e3adae ("Merge branch 'followup-fixes-for-the-dwmac-and-altera-lynx-conversion'")

Changes from v3:
https://lore.kernel.org/all/20230607015641.1724057-1-yoshihiro.shimoda.uh@renesas.com/
- Rebased on the latest net-next.git / main branch.
- Added Reviewed-by in the patch 2/2. (Maciej, thanks!)
- Fix typos in the commit description in the patch 2/2.

Changes from v2:
https://lore.kernel.org/all/20230606085558.1708766-1-yoshihiro.shimoda.uh@renesas.com/
- Rebased on the latest net-next.git / main branch.
- Added Reviewed-by in the patch 1/2. (Maciej, thanks!)
- Revise the commit description in the patch 2/2.
- Add definition to remove magic hardcoded numbers in the patch 2/2.

Changes from v1:
https://lore.kernel.org/all/20230529080840.1156458-1-yoshihiro.shimoda.uh@renesas.com/
- Rebased on the latest net-next.git / main branch.
- Use "hardware pause" feature instead of "per-queue limiter" feature.
- Drop refactaring for "per-queue limiter".
- Drop dt-bindings update because "hardware pause" doesn't need additional
   clock information.
- Use napi_gro_receive() to improve RX performance.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: renesas: rswitch: Use hardware pause features

Since this driver used the "global rate limiter" feature of GWCA,
the TX performance of each port was reduced when multiple ports
transmitted frames simultaneously. To improve performance, remove
the use of the "global rate limiter" feature and use "hardware pause"
features of the following:
- "per priority pause" of GWCA
- "global pause" of COMA

Note that these features are not related to the ethernet PAUSE frame.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: renesas: rswitch: Use napi_gro_receive() in RX

This hardware can receive multiple frames so that using
napi_gro_receive() instead of netif_receive_skb() gets good
performance of RX.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfc-tc-encap-actions-offload'

Edward Cree says:

====================
sfc: TC encap actions offload

This series adds support for offloading TC tunnel_key set actions to the
EF100 driver, supporting VxLAN and GENEVE tunnels over IPv4 or IPv6.
====================

Link: https://lore.kernel.org/r/cover.1686240142.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: generate encap headers for TC offload

Support constructing VxLAN and GENEVE headers, on either IPv4 or IPv6,
using the neighbouring information obtained in encap->neigh to
populate the Ethernet header.
Note that the ef100 hardware does not insert UDP checksums when
performing encap, so for IPv6 the remote endpoint will need to be
configured with udp6zerocsumrx or equivalent.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: neighbour lookup for TC encap action offload

For each neighbour we're interested in, create a struct efx_neigh_binder
object which has a list of all the encap_actions using it. When we
receive a neighbouring update (through the netevent notifier), find the
corresponding efx_neigh_binder and update all its users.
Since the actual generation of encap headers is still only a stub, the
resulting rules still get left on fallback actions.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: MAE functions to create/update/delete encap headers

Besides the raw header data, also pass the tunnel type, so that the
hardware knows it needs to update the IP Total Length and UDP Length
fields (and corresponding checksums) for each packet.
Also, populate the ENCAP_HEADER_ID field in efx_mae_alloc_action_set()
with the fw_id returned from efx_mae_allocate_encap_md().

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: add function to atomically update a rule in the MAE

efx_mae_update_rule() changes the action-set-list attached to an MAE
flow rule in the Action Rule Table.
We will use this when neighbouring updates change encap actions.

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: some plumbing towards TC encap action offload

Create software objects to manage the metadata for encap actions that
can be attached to TC rules. However, since we don't yet have the
neighbouring information (needed to generate the Ethernet header),
all rules with encap actions are marked as "unready" and thus insert
the fallback action into hardware rather than actually offloading the
encapsulation action.

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: add fallback action-set-lists for TC offload

When offloading a TC encap action, the action information for the
hardware might not be "ready": if there's currently no neighbour entry
available for the destination address, we can't construct the Ethernet
header to prepend to the packet. In this case, we still offload the
flow rule, but with its action-set-list ID pointing at a "fallback"
action which simply delivers the packet to its default destination (as
though no flow rule had matched), thus allowing software TC to handle
it. Later, when we receive a neighbouring update that allows us to
construct the encap header, the rule will become "ready" and we will
update its action-set-list ID in hardware to point at the actual
offloaded actions.
This patch sets up these fallback ASLs, but does not yet use them.

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move gso declarations and functions to their own files

Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-unify-pm-interfaces'

Matthieu Baerts says:

====================
mptcp: unify PM interfaces

These patches from Geliang better isolate the two MPTCP path-managers by
avoiding calling userspace PM functions from the in-kernel PM. Instead,
new functions declared in pm.c directly dispatch to the right PM.

In addition to have a clearer code, this also avoids a bit of duplicated
checks.

This is a refactoring, there is no behaviour change intended here.
====================

Link: https://lore.kernel.org/r/20230608-upstream-net-next-20230608-mptcp-unify-pm-interfaces-v1-0-b301717c9ff5@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: unify pm set_flags interfaces

This patch unifies the three PM set_flags() interfaces:

mptcp_pm_nl_set_flags() in mptcp/pm_netlink.c for the in-kernel PM and
mptcp_userspace_pm_set_flags() in mptcp/pm_userspace.c for the
userspace PM.

They'll be switched in the common PM infterface mptcp_pm_set_flags() in
mptcp/pm.c based on whether token is NULL or not.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: unify pm get_flags_and_ifindex_by_id

This patch unifies the three PM get_flags_and_ifindex_by_id() interfaces:

mptcp_pm_nl_get_flags_and_ifindex_by_id() in mptcp/pm_netlink.c for the
in-kernel PM and mptcp_userspace_pm_get_flags_and_ifindex_by_id() in
mptcp/pm_userspace.c for the userspace PM.

They'll be switched in the common PM infterface
mptcp_pm_get_flags_and_ifindex_by_id() in mptcp/pm.c based on whether
mptcp_pm_is_userspace() or not.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: unify pm get_local_id interfaces

This patch unifies the three PM get_local_id() interfaces:

mptcp_pm_nl_get_local_id() in mptcp/pm_netlink.c for the in-kernel PM and
mptcp_userspace_pm_get_local_id() in mptcp/pm_userspace.c for the
userspace PM.

They'll be switched in the common PM infterface mptcp_pm_get_local_id()
in mptcp/pm.c based on whether mptcp_pm_is_userspace() or not.

Also put together the declarations of these three functions in protocol.h.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: export local_address

Rename local_address() with "mptcp_" prefix and export it in protocol.h.

This function will be re-used in the common PM code (pm.c) in the
following commit.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2023-06-09' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.5

The second pull request for v6.5. We have support for three new
Realtek chipsets, all from different generations. Shows how active
Realtek development is right now, even older generations are being
worked on.

Note: We merged wireless into wireless-next to avoid complex conflicts
between the trees.

Major changes:

rtl8xxxu
- RTL8192FU support

rtw89
- RTL8851BE support

rtw88
- RTL8723DS support

ath11k
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID
   Advertisement (EMA) support in AP mode

iwlwifi
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature

cfg80211/mac80211
- more Multi-Link Operation (MLO) support such as hardware restart
- fixes for a potential work/mutex deadlock and with it beginnings of
   the previously discussed locking simplifications

* tag 'wireless-next-2023-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (162 commits)
  wifi: rtlwifi: remove misused flag from HAL data
  wifi: rtlwifi: remove unused dualmac control leftovers
  wifi: rtlwifi: remove unused timer and related code
  wifi: rsi: Do not set MMC_PM_KEEP_POWER in shutdown
  wifi: rsi: Do not configure WoWlan in shutdown hook if not enabled
  wifi: brcmfmac: Detect corner error case earlier with log
  wifi: rtw89: 8852c: update RF radio A/B parameters to R63
  wifi: rtw89: 8852c: update TX power tables to R63 with 6 GHz power type (3 of 3)
  wifi: rtw89: 8852c: update TX power tables to R63 with 6 GHz power type (2 of 3)
  wifi: rtw89: 8852c: update TX power tables to R63 with 6 GHz power type (1 of 3)
  wifi: rtw89: process regulatory for 6 GHz power type
  wifi: rtw89: regd: update regulatory map to R64-R40
  wifi: rtw89: regd: judge 6 GHz according to chip and BIOS
  wifi: rtw89: refine clearing supported bands to check 2/5 GHz first
  wifi: rtw89: 8851b: configure CRASH_TRIGGER feature for 8851B
  wifi: rtw89: set TX power without precondition during setting channel
  wifi: rtw89: debug: txpwr table access only valid page according to chip
  wifi: rtw89: 8851b: enable hw_scan support
  wifi: cfg80211: move scan done work to wiphy work
  wifi: cfg80211: move sched scan stop to wiphy work
  ...
====================

Link: https://lore.kernel.org/r/87bkhohkbg.fsf@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Remove a useless function call

'handle' is known to be NULL here. There is no need to kfree() it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Light probe local SFs

In case user wants to configure the SFs, for example: to use only vdpa
functionality, he needs to fully probe a SF, configure what he wants,
and afterward reload the SF.

In order to save the time of the reload, local SFs will probe without
any auxiliary sub-device, so that the SFs can be configured prior to
its full probe.

The defaults of the enable_* devlink params of these SFs are set to
false.

Usage example:
Create SF:
$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
$ devlink port function set pci/0000:08:00.0/32768 \
hw_addr 00:00:00:00:00:11 state active

Enable ETH auxiliary device:
$ devlink dev param set auxiliary/mlx5_core.sf.1 \
name enable_eth value true cmode driverinit

Now, in order to fully probe the SF, use devlink reload:
$ devlink dev reload auxiliary/mlx5_core.sf.1

At this point the user have SF devlink instance with auxiliary device
for the Ethernet functionality only.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Move esw multiport devlink param to eswitch code

Move the param registration and handling code into the eswitch
code as they are related to each other. No point in having the
devlink param registration done in separate file.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Split function_setup() to enable and open functions

mlx5_cmd_init_hca() is taking ~0.2 seconds. In case of a user who
desire to disable some of the SF aux devices, and with large scale-1K
SFs for example, this user will waste more than 3 minutes on
mlx5_cmd_init_hca() which isn't needed at this stage.

Downstream patch will change SFs which are probe over the E-switch,
local SFs, to be probed without any aux dev. In order to support this,
split function_setup() to avoid executing mlx5_cmd_init_hca().

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Set max number of embedded CPU VFs

Set the maximum number of embedded cpu VF functions available.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Update SRIOV enable/disable to handle EC/VFs

Previously on the embedded CPU platform SRIOV was never enabled/disabled
via mlx5_core_sriov_configure. Host VF updates are provided by an event
handler. Now in the disable flow it must be known if this is a disable
due to driver unload or SRIOV detach, or if the user updated the number
of VFs. If due to change in the number of VFs only wait for the pages of
ECVFs.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Query correct caps for min msix vectors

The VFs on the host and the embedded CPU platform share function
numbers. Set the ec_vf_function field to query the caps for the correct
function.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Use correct vport when restoring GUIDs

Prior to enabling EC VF functionality the vport number and function ID
were always the same. That's not the case now. Use the correct vport
number to modify the HCA vport context.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add new page type for EC VF pages

When the embedded cpu supports SRIOV it can be enabled and disabled
independently from the host SRIOV. Track the pages separately so we can
properly wait for returned VF pages.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add/remove peer miss rules for EC VFs

Add and remove the peer miss rules for EC VFs. It's possible that there
are different amounts of total VFs per function so only create rules for
the minimum number of max VFs.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add management of EC VF vports

Add init, load, unload, and cleanup of the EC VF vports. This includes
changes in how eswitch SRIOV is managed. Previous on an embedded CPU
platform the number of VFs provided when enabling the eswitch was always
0, host VFs vports are handled in the eswitch functions change event
handler. Now track the number of EC VFs as well, so they can be handled
properly in the enable/disable flows.

There are only 3 marks available for use in xarrays, all 3 were already
in use for this use case. EC VF vports are in a known range so we can
access them by index instead of marks.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Update vport caps query/set for EC VFs

These functions are for query/set by vport, there was an underlying
assumption that vport was equal to function ID. That's not the case for
EC VF functions. Set the ec_vf_function bit accordingly.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Enable devlink port for embedded cpu VF vports

Enable creation of a devlink port for EC VF vports.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: mlx5_ifc updates for embedded CPU SRIOV

Add ec_vf_vport_base to HCA Capabilities 2. This indicates the base vport
of embedded CPU virtual functions that are connected to the eswitch.

Add ec_vf_function to query/set_hca_caps. If set this indicates
accessing a virtual function on the embedded CPU by function ID. This
should only be used with other_function set to 1.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Simplify unload all rep code

Instead of using type specific iterators which are only used in one place
just traverse the xarray. It will provide suitable ordering based on the
vport numbers. This will also eliminate the need for changes here when
new types are added.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Merge branch 'tools-ynl-gen-code-gen-improvements-before-ethtool'

Jakub Kicinski says:

====================
tools: ynl-gen: code gen improvements before ethtool

I was going to post ethtool but I couldn't stand the ugliness
of the if conditions which were previously generated.
So I cleaned that up and improved a number of other things
ethtool will benefit from.
====================

Link: https://lore.kernel.org/r/20230608211200.1247213-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: support / skip pads on the way to kernel

Kernel does not have padding requirements for 64b attrs.
We can ignore pad attrs.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: don't pass op_name to RenderInfo

The op_name argument is barely used and identical to op.name
in all cases.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>