review.tizen.org Git - platform/kernel/linux-starfive.git/log

Merge branch 'rtl8366-cleanups'

Linus Walleij says:

====================
RTL8366(RB) cleanups part 1

This is a first set of patches making the RTL8366RB work out of
the box with a default OpenWrt userspace.

We achieve bridge port isolation with the first patch, and the
next 5 patches removes the very weird VLAN set-up with one
VLAN with PVID per port that has been in this driver in all
vendor trees and in OpenWrt for years.

The switch is now managed the way a modern bridge/DSA switch
shall be managed.

After these patches are merged, I will send the next set which
adds new features, some which have circulated before.

ChangeLog v4->v5:
- Drop the patch disabling 4K VLAN.
- Drop the patch forcing VLAN0 untagged.
- Fix a semantic bug in the filer enablement code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366: Drop and depromote pointless prints

We don't need a message for every VLAN association, dbg
is fine. The message about adding the DSA or CPU
port to a VLAN is directly misleading, this is perfectly
fine.

Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366: Fix a bug in deleting VLANs

We were checking that the MC (member config) was != 0
for some reason, all we need to check is that the config
has no ports, i.e. no members. Then it can be recycled.
This must be some misunderstanding.

Fixes: 4ddcaf1ebb5e ("net: dsa: rtl8366: Properly clear member config")
Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366rb: Fix off-by-one bug

The max VLAN number with non-4K VLAN activated is 15, and the
range is 0..15. Not 16.

The impact should be low since we by default have 4K VLAN and
thus have 4095 VLANs to play with in this switch. There will
not be a problem unless the code is rewritten to only use
16 VLANs.

Fixes: d8652956cf37 ("net: dsa: realtek-smi: Add Realtek SMI driver")
Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366rb: Rewrite weird VLAN filering enablement

While we were defining one VLAN per port for isolating the ports
the port_vlan_filtering() callback was implemented to enable a
VLAN on the port + 1. This function makes no sense, not only is
it incomplete as it only enables the VLAN, it doesn't do what
the callback is supposed to do, which is to selectively enable
and disable filtering on a certain port.

Implement the correct callback: we have two registers dealing
with filtering on the RTL9366RB, so we implement an ASIC-specific
callback and implement filering using the register bit that makes
the switch drop frames if the port is not in the VLAN member set.

The DSA documentation Documentation/networking/switchdev.rst states:

  When the bridge has VLAN filtering enabled and a PVID is not
  configured on the ingress port, untagged and 802.1p tagged
  packets must be dropped. When the bridge has VLAN filtering
  enabled and a PVID exists on the ingress port, untagged and
  priority-tagged packets must be accepted and forwarded according
  to the bridge's port membership of the PVID VLAN. When the
  bridge has VLAN filtering disabled, the presence/lack of a
  PVID should not influence the packet forwarding decision.

To comply with this, we add two arrays of bool in the RTL8366RB
state that keeps track of if filtering and PVID is enabled or
not for each port. We then add code such that whenever filtering
or PVID changes, we update the filter according to the
specification.

Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: Alvin Šipraga <alsi@bang-olufsen.dk>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366: Drop custom VLAN set-up

This hacky default VLAN setup was done in order to direct
packets to the right ports and provide port isolation, both
which we now support properly using custom tags and proper
bridge port isolation.

We can drop the custom VLAN code and leave all VLAN handling
alone, as users expect things to be. We can also drop
ds->configure_vlan_while_not_filtering = false; and let
the core deal with any VLANs it wants.

Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rtl8366rb: Support bridge offloading

Use port isolation registers to configure bridge offloading.

Tested on the D-Link DIR-685, switching between ports and
sniffing ports to make sure no packets leak.

Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Mauri Sandberg <sandberg@mailfence.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'devlink_register-last'

Leon Romanovsky says:

====================
Move devlink_register to be last devlink command

This is second version of patch series
https://lore.kernel.org/netdev/cover.1628599239.git.leonro@nvidia.com/

The main change is addition of delayed notification logic that will
allowed us to delete devlink_params_publish API (future series will
remove it completely) and conversion of all drivers to have devlink_register
being last commend.

The series itself is pretty straightforward, except liquidio driver
which performs initializations in various workqueues without proper
locks. That driver doesn't hole device_lock and it is clearly broken
for any parallel driver core flows (modprobe + devlink + PCI reset will
100% crash it).

In order to annotate devlink_register() will lockdep of holding
device_lock, I added workaround in this driver.

Thanks

----------------------
From previous cover letter:
Hi Dave and Jakub,

This series prepares code to remove devlink_reload_enable/_disable API
and in order to do, we move all devlink_register() calls to be right
before devlink_reload_enable().

The best place for such a call should be right before exiting from
the probe().

This is done because devlink_register() opens devlink netlink to the
users and gives them a venue to issue commands before initialization
is finished.

1. Some drivers were aware of such "functionality" and tried to protect
themselves with extra locks, state machines and devlink_reload_enable().
Let's assume that it worked for them, but I'm personally skeptical about
it.

2. Some drivers copied that pattern, but without locks and state
machines. That protected them from reload flows, but not from any _set_
routines.

3. And all other drivers simply didn't understand the implications of early
devlink_register() and can be seen as "broken".
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink
is fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

staging: qlge: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wwan: iosm: Move devlink_register to be last devlink command

This change prevents from users to access device before devlink is
fully configured. Indirectly this change fixes the commit mentioned
below where devlink_unregister() was prematurely removed.

Fixes: db4278c55fa5 ("devlink: Make devlink_register to be void")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevsim: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is
fully configured.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: Move delink_register to be last command

Open user space access to the devlink after driver is probed.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: delay devlink registration to the end

Open access to the devlink interface when the driver fully initialized.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Register devlink instance last

Make sure that devlink is open to receive user input when all
parameters are initialized.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Accept devlink user input after driver initialization complete

The change of devlink_alloc() to accept device makes sure that device
is fully initialized and device_register() does nothing except allowing
users to use that devlink instance.

Such change ensures that no user input will be usable till that point and
it eliminates the need to worry about internal locking as long as devlink_register
is called last since all accesses to the devlink are during initialization.

This change fixes the following lockdep warning.

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #27 Not tainted
------------------------------------------------------
devlink/265 is trying to acquire lock:
ffff8880133c2bc0 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x1e/0xa0 [mlx5_core]
but task is already holding lock:
ffffffff8362b468 (devlink_mutex){+.+.}-{3:3}, at: devlink_nl_pre_doit+0x2b/0x8d0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (devlink_mutex){+.+.}-{3:3}:
        __mutex_lock+0x149/0x1310
        devlink_register+0xe7/0x280
        mlx5_devlink_register+0x118/0x480 [mlx5_core]
        mlx5_init_one+0x34b/0x440 [mlx5_core]
        probe_one+0x480/0x6e0 [mlx5_core]
        pci_device_probe+0x2a0/0x4a0
        really_probe+0x1cb/0xba0
        __driver_probe_device+0x18f/0x470
        driver_probe_device+0x49/0x120
        __driver_attach+0x1ce/0x400
        bus_for_each_dev+0x11e/0x1a0
        bus_add_driver+0x309/0x570
        driver_register+0x20f/0x390
        0xffffffffa04a0062
        do_one_initcall+0xd5/0x400
        do_init_module+0x1c8/0x760
        load_module+0x7d9d/0xa4b0
        __do_sys_finit_module+0x118/0x1a0
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&dev->intf_state_mutex){+.+.}-{3:3}:
        __lock_acquire+0x2999/0x5a40
        lock_acquire+0x1a9/0x4a0
        __mutex_lock+0x149/0x1310
        mlx5_unload_one+0x1e/0xa0 [mlx5_core]
        mlx5_devlink_reload_down+0x185/0x2b0 [mlx5_core]
        devlink_reload+0x1f2/0x640
        devlink_nl_cmd_reload+0x6c3/0x10d0
        genl_family_rcv_msg_doit+0x1e9/0x2f0
        genl_rcv_msg+0x27f/0x4a0
        netlink_rcv_skb+0x11e/0x340
        genl_rcv+0x24/0x40
        netlink_unicast+0x433/0x700
        netlink_sendmsg+0x6fb/0xbe0
        sock_sendmsg+0xb0/0xe0
        __sys_sendto+0x192/0x240
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(devlink_mutex);
                                lock(&dev->intf_state_mutex);
                                lock(devlink_mutex);
   lock(&dev->intf_state_mutex);

  *** DEADLOCK ***

3 locks held by devlink/265:
  #0: ffffffff836371d0 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
  #1: ffffffff83637288 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x31a/0x4a0
  #2: ffffffff8362b468 (devlink_mutex){+.+.}-{3:3}, at: devlink_nl_pre_doit+0x2b/0x8d0

stack backtrace:
CPU: 0 PID: 265 Comm: devlink Not tainted 5.14.0-rc2+ #27
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
  dump_stack_lvl+0x45/0x59
  check_noncircular+0x268/0x310
  ? print_circular_bug+0x460/0x460
  ? __kernel_text_address+0xe/0x30
  ? alloc_chain_hlocks+0x1e6/0x5a0
  __lock_acquire+0x2999/0x5a40
  ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
  ? add_lock_to_list.constprop.0+0x6c/0x530
  lock_acquire+0x1a9/0x4a0
  ? mlx5_unload_one+0x1e/0xa0 [mlx5_core]
  ? lock_release+0x6c0/0x6c0
  ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
  ? lock_is_held_type+0x98/0x110
  __mutex_lock+0x149/0x1310
  ? mlx5_unload_one+0x1e/0xa0 [mlx5_core]
  ? lock_is_held_type+0x98/0x110
  ? mlx5_unload_one+0x1e/0xa0 [mlx5_core]
  ? find_held_lock+0x2d/0x110
  ? mutex_lock_io_nested+0x1160/0x1160
  ? mlx5_lag_is_active+0x72/0x90 [mlx5_core]
  ? lock_downgrade+0x6d0/0x6d0
  ? do_raw_spin_lock+0x12e/0x270
  ? rwlock_bug.part.0+0x90/0x90
  ? mlx5_unload_one+0x1e/0xa0 [mlx5_core]
  mlx5_unload_one+0x1e/0xa0 [mlx5_core]
  mlx5_devlink_reload_down+0x185/0x2b0 [mlx5_core]
  ? netlink_broadcast_filtered+0x308/0xac0
  ? mlx5_devlink_info_get+0x1f0/0x1f0 [mlx5_core]
  ? __build_skb_around+0x110/0x2b0
  ? __alloc_skb+0x113/0x2b0
  devlink_reload+0x1f2/0x640
  ? devlink_unregister+0x1e0/0x1e0
  ? security_capable+0x51/0x90
  devlink_nl_cmd_reload+0x6c3/0x10d0
  ? devlink_nl_cmd_get_doit+0x1e0/0x1e0
  ? devlink_nl_pre_doit+0x72/0x8d0
  genl_family_rcv_msg_doit+0x1e9/0x2f0
  ? __lock_acquire+0x15e2/0x5a40
  ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
  ? mutex_lock_io_nested+0x1160/0x1160
  ? security_capable+0x51/0x90
  genl_rcv_msg+0x27f/0x4a0
  ? genl_get_cmd+0x3c0/0x3c0
  ? lock_acquire+0x1a9/0x4a0
  ? devlink_nl_cmd_get_doit+0x1e0/0x1e0
  ? lock_release+0x6c0/0x6c0
  netlink_rcv_skb+0x11e/0x340
  ? genl_get_cmd+0x3c0/0x3c0
  ? netlink_ack+0x930/0x930
  genl_rcv+0x24/0x40
  netlink_unicast+0x433/0x700
  ? netlink_attachskb+0x750/0x750
  ? __alloc_skb+0x113/0x2b0
  netlink_sendmsg+0x6fb/0xbe0
  ? netlink_unicast+0x700/0x700
  ? netlink_unicast+0x700/0x700
  sock_sendmsg+0xb0/0xe0
  __sys_sendto+0x192/0x240
  ? __x64_sys_getpeername+0xb0/0xb0
  ? do_sys_openat2+0x10a/0x370
  ? down_write_nested+0x150/0x150
  ? do_user_addr_fault+0x215/0xd50
  ? __x64_sys_openat+0x11f/0x1d0
  ? __x64_sys_open+0x1a0/0x1a0
  __x64_sys_sendto+0xdc/0x1b0
  ? syscall_enter_from_user_mode+0x1d/0x50
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f50b50b6b3a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:00007fff6c0d3f38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f50b50b6b3a
RDX: 0000000000000038 RSI: 000055763ac08440 RDI: 0000000000000003
RBP: 000055763ac08410 R08: 00007f50b5192200 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000055763ac08410 R15: 000055763ac08440
mlx5_core 0000:00:09.0: firmware version: 4.8.9999
mlx5_core 0000:00:09.0: 0.000 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x255 link)
mlx5_core 0000:00:09.0 eth1: Link up

Fixes: a6f3b62386a0 ("net/mlx5: Move devlink registration before interfaces load")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4: Move devlink_register to be the last initialization command

Refactor the code to make sure that devlink_register() is the last
command during initialization stage.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/prestera: Split devlink and traps registrations to separate routines

Separate devlink registrations and traps registrations so devlink will
be registered when driver is fully initialized.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2: Move devlink registration to be last devlink command

This change prevents from users to access device before devlink is fully
configured. This change allows us to delete call to devlink_params_publish()
and impossible check during unregister flow.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ice: Open devlink when device is ready

Move devlink_registration routine to be the last command, when the
device is fully initialized.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hinic: Open device for the user access when it is ready

Move devlink registration to be the last command in device activation,
so it opens the driver to accept such devlink commands from the user
when it is fully initialized.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: Register devlink instance at the end of probe

Move devlink_register to be the last command in the initialization
sequence.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: Overcome missing device lock protection in init/remove flows

The liquidio driver is broken by design. It initialize PCI devices
in separate delayed works. It causes to the situation where device lock
is dropped during initialize and remove sequences.

That lock is part of driver/core and needed to protect from races during
init, destroy and bus invocations.

In addition to lack of locking protection, it has incorrect order of
destroy flows and very questionable synchronization scheme based on
atomic_t.

This change doesn't fix that driver but makes sure that rest of the
netdev subsystem doesn't suffer from such basic protection by adding
device_lock over devlink_*() APIs and by moving devlink_register()
to be last command in setup_nic_devices().

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Register devlink instance at the end devlink configuration

Move devlink_register() to be last command in devlink configuration
sequence, so no user space access will be possible till devlink instance
is fully operable. As part of this change, the devlink_params_publish
call is removed as not needed.

This change fixes forgotten devlink_params_unpublish() too.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Notify users when objects are accessible

The devlink core code notified users about add/remove objects without
relation if this object can be accessible or not. In this patch we unify
such user visible notifications in one place.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb: avoid open-coded offsetof()

clang-14 does not like the custom offsetof() macro in vsc7326:

drivers/net/ethernet/chelsio/cxgb/vsc7326.c:597:3: error: performing pointer subtraction with a null pointer has undefined behavior [-Werror,-Wnull-pointer-subtraction]
                HW_STAT(RxUnicast, RxUnicastFramesOK),
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/chelsio/cxgb/vsc7326.c:594:56: note: expanded from macro 'HW_STAT'
        { reg, (&((struct cmac_statistics *)NULL)->stat_name) - (u64 *)NULL }

Rewrite this to use the version provided by the kernel.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: fix gcc-10 -Wrestrict warning

gcc-10 and later warn about a theoretical array overrun when
accessing priv->int_name_rx_irq[i] with an out of bounds value
of 'i':

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_request_irq_multi_msi':
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:3528:17: error: 'snprintf' argument 4 may overlap destination object 'dev' [-Werror=restrict]
3528 |                 snprintf(int_name, int_name_len, "%s:%s-%d", dev->name, "tx", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:3404:60: note: destination object referenced by 'restrict'-qualified argument 1 was declared here
3404 | static int stmmac_request_irq_multi_msi(struct net_device *dev)
      |                                         ~~~~~~~~~~~~~~~~~~~^~~

The warning is a bit strange since it's not actually about the array
bounds but rather about possible string operations with overlapping
arguments, but it's not technically wrong.

Avoid the warning by adding an extra bounds check.

Fixes: 8532f613bc78 ("net: stmmac: introduce MSI Interrupt routines for mac, safety, RX & TX")
Link: https://lore.kernel.org/all/20210421134743.3260921-1-arnd@kernel.org/
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: emac: utilize of_net's of_get_mac_address()

of_get_mac_address() reads the same "local-mac-address" property.
... But goes above and beyond by checking the MAC value properly.

printk+message seems outdated too,
so let's put dev_err in the queue.

Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sparx5: fix resource_size.cocci warnings

Use resource_size function on resource object
instead of explicit computation.

Clean up coccicheck warning:
./drivers/net/ethernet/microchip/sparx5/sparx5_main.c:237:19-22: ERROR:
Missing resource_size with iores [ idx ]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmveth: Use dma_alloc_coherent() instead of kmalloc/dma_map_single()

Replacing kmalloc/kfree/dma_map_single/dma_unmap_single()
with dma_alloc_coherent/dma_free_coherent() helps to reduce
code size, and simplify the code, and coherent DMA will not
clear the cache every time.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: cisco: Fix a function name in comments

Use dma_alloc_coherent() instead of pci_alloc_consistent(),
because only dma_alloc_coherent() is called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv4/tcp_nv.c: remove superfluous header files from tcp_nv.c

tcp_nv.c hasn't use any macro or function declared in mm.h. Thus, these files
can be removed from tcp_nv.c safely without affecting the compilation
of the net module.

Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: smsc: Fix function names in print messages and comments

Use dma_xxx_xxx() instead of pci_xxx_xxx(),
because the pci function wrappers are not called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sis: Fix a function name in comments

Use dma_alloc_coherent() instead of pci_alloc_consistent(),
because only dma_alloc_coherent() is called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: broadcom: Fix a function name in comments

Use dma_alloc_coherent() instead of pci_alloc_consistent(),
because only dma_alloc_coherent() is called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atl1c: Fix a function name in print messages

Use dma_map_single() instead of pci_map_single(),
because the pci function wrappers are not called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fddi: skfp: Fix a function name in comments

Use dma_map_single() instead of pci_map_single(),
because only dma_map_single() is called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

FDDI: defxx: Fix function names in coments

Use dma_xxx_xxx() instead of pci_xxx_xxx(),
because the pci function wrappers are not called here.

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: make napi_disable() symmetric with enable

Commit 3765996e4f0b ("napi: fix race inside napi_enable") fixed
an ordering bug in napi_enable() and made the napi_enable() diverge
from napi_disable(). The state transitions done on disable are
not symmetric to enable.

There is no known bug in napi_disable() this is just refactoring.

Eric suggests we can also replace msleep(1) with a more opportunistic
usleep_range().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: clockmatrix: use rsmu driver to access i2c/spi bus

rsmu (Renesas Synchronization Management Unit ) driver is located in
drivers/mfd and responsible for creating multiple devices including
clockmatrix phc, which will then use the exposed regmap and mutex
handle to access i2c/spi bus.

Signed-off-by: Min Li <min.li.xe@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: fib_nexthops: Wait before checking reported idle time

The purpose of this test is to verify that after a short activity passes,
the reported time is reasonable: not zero (which could be reported by
mistake), and not something outrageous (which would be indicative of an
issue in used units).

However, the idle time is reported in units of clock_t, or hundredths of
second. If the initial sequence of commands is very quick, it is possible
that the idle time is reported as just flat-out zero. When this test was
recently enabled in our nightly regression, we started seeing spurious
failures for exactly this reason.

Therefore buffer the delay leading up to the test with a sleep, to make
sure there is no legitimate way of reporting 0.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'octeontx2-af-kpu'

Kiran Kumar K says:

====================
adding KPU profile changes for GTPU and custom

Adding changes to limit the KPU processing for GTPU headers to parse
packet up to L4 and added changes to variable length headers to parse LA
as part of PKIND action.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Optimize KPU1 processing for variable-length headers

Optimized KPU1 entry processing for variable-length custom L2 headers
of size 24B, 90B by
- Moving LA LTYPE parsing for 24B and 90B headers to PKIND.
- Removing LA flags assignment for 24B and 90B headers.
- Reserving a PKIND 55 to parse variable length headers.

Also, new mailbox(NPC_SET_PKIND) added to configure PKIND with
corresponding variable-length offset, mask, and shift count
(NPC_AF_KPUX_ENTRYX_ACTION0).

Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Limit KPU parsing for GTPU packets

With current KPU profile, while parsing GTPU packets, GTPU payload
is also being parsed and GTPU PDU payload is being treated as IPV4
data, which is not correct. In case of GTPU packets, parsing should
be stopped after identifying the GTPU. Adding changes to limit KPU
profile parsing for GTPU payload.

Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-fixes'

Mat Martineau says:

====================
mptcp: Miscellaneous fixes

Here are five changes we've collected and tested in the mptcp-tree:

Patch 1 changes handling of the MPTCP-level snd_next value during the
recovery phase after a subflow link failure.

Patches 2 and 3 are some small refactoring changes to replace some
open-coded bits.

Patch 4 removes an unused field in struct mptcp_sock.

Patch 5 restarts the MPTCP retransmit timer when there is
not-yet-transmitted data to send and all previously sent data has been
acknowledged. This prevents some sending stalls.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: re-arm retransmit timer if data is pending

The retransmit head will be NULL in case there is no in-flight data
(meaning all data injected into network has been acked).

In that case the retransmit timer is stopped.

This is only correct if there is no more pending, not-yet-sent data.

If there is, the retransmit timer needs to set the PENDING bit again so
that mptcp tries to send the remaining (new) data once a subflow can accept
more data.

Also, mptcp_subflow_get_retrans() has to be called unconditionally.

This function checks for subflows that have become unresponsive and marks
them as stale, so in the case where the rtx queue is empty, subflows
will never be marked stale which prevents available backup subflows from
becoming eligible for transmit.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/226
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: remove tx_pending_data

The update on recovery is not correct.

msk->tx_pending_data += msk->snd_nxt - rtx_head->data_seq;

will update tx_pending_data multiple times when a subflow is declared
stale while earlier recovery is still in progress.
This means that tx_pending_data will still be positive even after
all data as has been transmitted.

Rather than fix it, remove this field: there are no consumers.
The outstanding data byte count can be computed either via

"msk->write_seq - rtx_head->data_seq" or
"msk->write_seq - msk->snd_una".

The latter is more recent/accurate estimate as rtx_head adjustment
is deferred until mptcp lock can be acquired.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: use lockdep_assert_held_once() instead of open-coding it

We have a few more places where the mptcp code duplicates
lockdep_assert_held_once(). Let's use the existing macro and
avoid a bunch of compiler's conditional.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: use OPTIONS_MPTCP_MPC

Since OPTIONS_MPTCP_MPC has been defined, use it instead of open-coding.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: do not shrink snd_nxt when recovering

When recovering after a link failure, snd_nxt should not be set to a
lower value. Else, update of snd_nxt is broken because:

msk->snd_nxt += ret; (where ret is number of bytes sent)

assumes that snd_nxt always moves forward.
After reduction, its possible that snd_nxt update gets out of sync:
dfrag we just sent might have had a data sequence number even past
recovery_snd_nxt.

This change factors the common msk state update to a helper
and updates snd_nxt based on the current dfrag data sequence number.

The conditional is required for the recovery phase where we may
re-transmit old dfrags that are before current snd_nxt.

After this change, snd_nxt only moves forward and covers all in-sequence
data that was transmitted.

recovery_snd_nxt is retained to detect when recovery has completed.

Fixes: 1e1d9d6f119c5 ("mptcp: handle pending data on closed subflow")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2021-09-24' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-09-24

mlx5 misc updates and fixes to net-next branch:

1) Roi Dayan provided some cleanups in mlx5e TC module, and some
code improvements to fwd/drop actions handling.

2) Tariq, Add TX max rate support for MQPRIO channel mode

3) Dima adds the support for TC egress/ingress offload of macvlan
interfaces

4) misc cleanup
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Enable TC offload for ingress MACVLAN

Support offloading of TC rules that filter ingress traffic from a MACVLAN
device, which is attached to uplink representor.

Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Enable TC offload for egress MACVLAN

Support offloading of TC rules that mirror/redirect egress traffic to a
MACVLAN device, which is attached to mlx5 representor net device.

Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: loopback test is not supported in switchdev mode

In switchdev mode we insert steering rules to eswitch that
make sure packets can't be looped back.
Modify the self tests infra and have flags per test.
Add a flag for tests that needs to be skipped in switchdev mode.

Before this commit:

$ ethtool --test enp8s0f0
The test result is FAIL
The test extra info:
Link Test        0
Speed Test       0
Health Test      0
Loopback Test    1

After this commit:

$ ethtool --test enp8s0f0
The test result is PASS
The test extra info:
Link Test        0
Speed Test       0
Health Test      0

Example output in dmesg:

enp8s0f0: Self test begin..
enp8s0f0:         [0] Link Test start..
enp8s0f0:         [0] Link Test end: result(0)
enp8s0f0:         [1] Speed Test start..
enp8s0f0:         [1] Speed Test end: result(0)
enp8s0f0:         [2] Health Test start..
enp8s0f0:         [2] Health Test end: result(0)
enp8s0f0: Self test out: status flags(0x1)

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Use NL_SET_ERR_MSG_MOD() for errors parsing tunnel attributes

This to be consistent and adds the module name to the error message.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Use tc sample stubs instead of ifdefs in source file

Instead of having sparse ifdefs in source files use a single
ifdef in the tc sample header file and use stubs.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Remove redundant priv arg from parse_pedit_to_reformat()

The priv argument is not being used. remove it.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Check action fwd/drop flag exists also for nic flows

The driver should add offloaded rules with either a fwd or drop action.
The check existed in parsing fdb flows but not when parsing nic flows.
Move the test into actions_match_supported() which is called for
checking nic flows and fdb flows.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Set action fwd flag when parsing tc action goto

Do it when parsing like in other actions instead of when
checking if goto is supported in current scenario.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Remove incorrect addition of action fwd flag

A user is expected to explicit request a fwd or drop action.
It is not correct to implicit add a fwd action for the user,
when modify header action flag exists.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Use correct return type

modify_header_match_supported() should return type bool but
it returns the value returned by is_action_keys_supported()
which is type int.

is_action_keys_supported() always returns either -EOPNOTSUPP
or 0 and it shouldn't change as the purpose of the function
is checking for support. so just make the function return
a bool type.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Add error flow for ethtool -X command

Prior to this patch, ethtool -X fail but the user receives a success
status. Try to roll-back when failing and return success status
accordingly.

Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Fix code indentation in dr_ste_v1

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

rxrpc: Fix _usecs_to_jiffies() by using usecs_to_jiffies()

Directly using _usecs_to_jiffies() might be unsafe, so it's
better to use usecs_to_jiffies() instead.
Because we can see that the result of _usecs_to_jiffies()
could be larger than MAX_JIFFY_OFFSET values without the
check of the input.

Fixes: c410bf01933e ("Fix the excessive initial retransmission timeout")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tracking packets with CE marks in BW rate sample

In order to track CE marks per rate sample (one round trip), TCP needs a
per-skb header field to record the tp->delivered_ce count when the skb
was sent. To make space, we replace the "last_in_flight" field which is
used exclusively for NV congestion control. The stat needed by NV can be
alternatively approximated by existing stats tcp_sock delivered and
mss_cache.

This patch counts the number of packets delivered which have CE marks in
the rate sample, using similar approach of delivery accounting.

Cc: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: Fix PHY_BRCM_IDDQ_SUSPEND definition

An extraneous number was added during the inclusion of that change,
correct that such that we use a single bit as is expected by the PHY
driver.

Reported-by: Justin Chen <justinpopo6@gmail.com>
Fixes: d6da08ed1425 ("net: phy: broadcom: Add IDDQ-SR mode")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'devlink-fixes'

Leon Romanovsky says:

====================
Batch of devlink related fixes

I'm asking to apply this batch of devlink fixes to net-next and not to
net, because most if not all fixes are for old code or/and can be considered
as cleanup.

It will cancel the need to deal with merge conflicts for my next devlink series :).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Don't ignore devlink allocation failures

devlink is a software interface that doesn't depend on any hardware
capabilities. The failure in SW means memory issues, wrong parameters,
programmer error e.t.c.

Like any other such interface in the kernel, the returned status of
devlink APIs should be checked and propagated further and not ignored.

Fixes: 755f982bb1ff ("qed/qede: make devlink survive recovery")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ice: Delete always true check of PF pointer

PF pointer is always valid when PCI core calls its .shutdown() and
.remove() callbacks. There is no need to check it again.

Fixes: 837f08fdecbe ("ice: Add basic driver framework for Intel(R) E800 Series")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Remove single line function obfuscations

There is no need in extra one line functions to call relevant
functions only once.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Delete not used port parameters APIs

There is no in-kernel users for the devlink port parameters API,
so let's remove it.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Properly remove port parameter support

This driver doesn't have any port parameters and registers
devlink port parameters with empty table. Remove the useless
calls to devlink_port_params_register and _unregister.

Fixes: da203dfa89ce ("Revert "devlink: Add a generic wake_on_lan port parameter"")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Check devlink allocation and registration status

devlink is a software interface that doesn't depend on any hardware
capabilities. The failure in SW means memory issues, wrong parameters,
programmer error e.t.c.

Like any other such interface in the kernel, the returned status of
devlink APIs should be checked and propagated further and not ignored.

Fixes: 4ab0c6a8ffd7 ("bnxt_en: add support to enable VF-representors")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mlx4: Add support for XDP_REDIRECT

Signed-off-by: Joshua Roys <roysjosh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: iosm: Use hrtimer_forward_now()

hrtimer_forward_now() is providing the same functionality. Preparation for
making hrtimer_forward() timer core code only.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Loic Poulain <loic.poulain@linaro.org>
Cc: netdev@vger.kernel.org
Cc: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: M Chetan Kumar <m.chetan.kumar@intel.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Intel Corporation <linuxwwan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: accept "ethernet-ports" OF node name

Since both forms are accepted, let's search for both when we
pre-validate the PHY modes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-next'

Ido Schimmel says:

====================
mlxsw: Add support for IP-in-IP with IPv6 underlay

Currently, mlxsw only supports IP-in-IP with IPv4 underlay. Traffic
routed through 'gre' netdevs is encapsulated with IPv4 and GRE headers.
Similarly, incoming IPv4 GRE packets are decapsulated and routed in the
overlay VRF (which can be the same as the underlay VRF).

This patchset adds support for IPv6 underlay using the 'ip6gre' netdev.
Due to architectural differences between Spectrum-1 and later ASICs,
this functionality is only supported on Spectrum-2 onwards (the software
data path is used for Spectrum-1).

Patchset overview:

Patches #1-#5 are preparations.

Patches #6-#9 add and extend required device registers.

Patches #10-#14 gradually add IPv6 underlay support.

A follow-up patchset will add net/forwarding/ selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Add support for IP-in-IP with IPv6 underlay for Spectrum-2 and above

Currently, mlxsw driver supports IP-in-IP only with IPv4 underlay.
Add support for IPv6 underlay for Spectrum-2 and above.

Most of the configurations are same to IPv4, the main difference between
IPv4 and IPv6 is related to saving IP addresses.
IPv6 addresses are saved as part of KVD and the relevant registers hold
pointer to them.
Add API for that as part of ipip_ops, so then only Spectrum-2 and above
will save IPv6 addresses in this way.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Increase parsing depth for IPv6 decapsulation

The Spectrum ASIC has a configurable limit on how deep into the packet
it parses. By default, the limit is 96 bytes.

For IP-in-IP packets, with IPv6 outer and inner headers, the default
parsing depth is not enough and without increasing it such packets cannot
be properly decapsulated.

Use the existing API to set parsing depth, call it once for each
decapsulation entry when it is created/destroyed.
There is no need to protect the code with new mutex because 'router->lock'
is already taken in these code paths.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Add IPV6_ADDRESS kvdl entry type

Add support for allocating and freeing KVD entries for IPv6 addresses.

These addresses are programmed by the RIPS register and referenced by
the RATR and RTDP registers for IPv6 underlay encapsulation and
decapsulation, respectively.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Add mlxsw_sp_ipip_gre6_ops

Add operations for IP-in-IP GRE6. For now the function can_offload()
returns false and the other functions warn as they should never be called.

A later patch will add dedicated operations for Spectrum-2 which will
support IPv6 underlay, so use 'mlxsw_sp1_' prefix for the new operations.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Create separate ipip_ops_arr for different ASICs

Currently, there is support for IP-in-IP only with IPv4 underlay for all
supported Spectrum ASICs.
The next patches will add support for IPv6 underlay only for Spectrum-2
and above.

Add infrastructure for splitting IP-in-IP support between different
ASICs - create separate ipip_ops_arr and add ipips_init function to set the
right ops.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add support for ritr_loopback_ipip6_pack()

The RITR register is used to configure the router interface table.

For IP-in-IP, it stores the underlay source IP address for encapsulation
and also the ingress RIF for the underlay lookup.

Add support for IPv6 IP-in-IP configuration.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add support for ratr_ipip6_entry_pack()

The RATR register is used to configure the Router Adjacency (next-hop)
Table.

For IP-in-IP entry, underlay destination IPv4 is saved as part of this
register and underlay destination IPv6 is saved by RIPS register and RATR
saves pointer to it.

Add function for setting IPv6 IP-in-IP configuration.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add support for rtdp_ipip6_pack()

The RTDP register is used for configuring the tunnel decapsulation
properties of NVE and IP-in-IP.

Linux tunnels verify packets before decapsulation based on the packet's
source IP, which must match tunnel remote IP.
RTDP is used to configure decapsulation so that it filters out packets that
are not IPv6 or have the wrong source IP or wrong GRE key.

For IP-in-IP entry, source IPv4 is saved as part of this register and
source IPv6 is saved by RIPS register and RTDP saves pointer to it.

Create common function for configuring both IPv4 and IPv6 and add
dedicated functions for each protocol.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add Router IP version Six Register

The RIPS register is used to store IPv6 addresses for use by the NVE and
IP-in-IP.

For IPv6 underlay support, RATR register needs to hold a pointer to the
remote IPv6 address for encapsulation and RTDP register needs to hold a
pointer to the local IPv6 address for decapsulation check.

Add the required register for saving IPv6 addresses.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Take tunnel's type into account when searching underlay device

The function __mlxsw_sp_ipip_netdev_ul_dev_get() returns the underlay
device that corresponds to the overlay device that it gets.
Currently, this function assumes that the tunnel is IPv4 GRE, because it
is the only one that is supported by mlxsw driver.

This assumption will no longer be correct when IPv6 GRE support is added,
resulting in wrong underlay device being returned.
Instead, check 'ol_dev->type' and return the underlay device accordingly.

Move the function to spectrum_ipip.c because spectrum_router.c should not
be aware to tunnel type.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Create common function for mlxsw_sp_ipip_ol_netdev_change_gre()

The function mlxsw_sp_ipip_ol_netdev_change_gre4() contains code that
can be shared between IPv4 and IPv6.
The only difference is the way that arguments are taken from tunnel
parameters, which are different between IPv4 and IPv6.

For that, add structure 'mlxsw_sp_ipip_parms' to hold all the required
parameters for the function and save it as part of
'struct mlxsw_sp_ipip_entry' instead of the existing structure that is
not shared between IPv4 and IPv6. Add new operation as part of
'mlxsw_sp_ipip_ops' to initialize the new structure.

Then mlxsw_sp_ipip_ol_netdev_change_gre{4,6}() will prepare the new
structure and both will call the same function.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Fix arguments alignment

Suppress the following checkpatch.pl check [1] by adding a variable to
store the IP-in-IP options. Noticed while adding equivalent IPv6 code in
subsequent patches.

[1]
CHECK: Alignment should match open parenthesis
+ mlxsw_reg_ritr_loopback_ipip4_pack(ritr_pl, lb_cf.lb_ipipt,
+
+ MLXSW_REG_RITR_LOOPBACK_IPIP_OPTIONS_GRE_KEY_PRESET,

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Pass IP tunnel parameters by reference and as 'const'

Currently, there are several functions that handle 'struct ip_tunnel_parm'
and 'struct __ip6_tnl_parm'.

Change the mentioned functions to get IP tunnel parameters by reference
and as 'const'.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Create common function for fib_entry_type_unset() code

mlxsw_sp_fib4_entry_type_unset() is not specific for IPv4 FIB entry,
move the code to mlxsw_sp_fib_entry_type_unset(), and call this function
from mlxsw_sp_fib4_entry_type_unset() so then it will be used for IPv6
also.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/netdev/net

net/mptcp/protocol.c
977d293e23b4 ("mptcp: ensure tx skbs always have the MPTCP ext")
efe686ffce01 ("mptcp: ensure tx skbs always have the MPTCP ext")

same patch merged in both trees, keep net-next.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-5.15-rc3' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Current release - regressions:

   - dsa: bcm_sf2: fix array overrun in bcm_sf2_num_active_ports()

  Previous releases - regressions:

   - introduce a shutdown method to mdio device drivers, and make DSA
     switch drivers compatible with masters disappearing on shutdown;
     preventing infinite reference wait

   - fix issues in mdiobus users related to ->shutdown vs ->remove

   - virtio-net: fix pages leaking when building skb in big mode

   - xen-netback: correct success/error reporting for the
     SKB-with-fraglist

   - dsa: tear down devlink port regions when tearing down the devlink
     port on error

   - nexthop: fix division by zero while replacing a resilient group

   - hns3: check queue, vf, vlan ids range before using

  Previous releases - always broken:

   - napi: fix race against netpoll causing NAPI getting stuck

   - mlx4_en: ensure link operstate is updated even if link comes up
     before netdev registration

   - bnxt_en: fix TX timeout when TX ring size is set to the smallest

   - enetc: fix illegal access when reading affinity_hint; prevent oops
     on sysfs access

   - mtk_eth_soc: avoid creating duplicate offload entries

  Misc:

   - core: correct the sock::sk_lock.owned lockdep annotations"

* tag 'net-5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  atlantic: Fix issue in the pm resume flow.
  net/mlx4_en: Don't allow aRFS for encapsulated packets
  net: mscc: ocelot: fix forwarding from BLOCKING ports remaining enabled
  net: ethernet: mtk_eth_soc: avoid creating duplicate offload entries
  nfc: st-nci: Add SPI ID matching DT compatible
  MAINTAINERS: remove Guvenc Gulce as net/smc maintainer
  nexthop: Fix memory leaks in nexthop notification chain listeners
  mptcp: ensure tx skbs always have the MPTCP ext
  qed: rdma - don't wait for resources under hw error recovery flow
  s390/qeth: fix deadlock during failing recovery
  s390/qeth: Fix deadlock in remove_discipline
  s390/qeth: fix NULL deref in qeth_clear_working_pool_list()
  net: dsa: realtek: register the MDIO bus under devres
  net: dsa: don't allocate the slave_mii_bus using devres
  Doc: networking: Fox a typo in ice.rst
  net: dsa: fix dsa_tree_setup error path
  net/smc: fix 'workqueue leaked lock' in smc_conn_abort_work
  net/smc: add missing error check in smc_clc_prfx_set()
  net: hns3: fix a return value error in hclge_get_reset_status()
  net: hns3: check vlan id before using it
  ...

memcg: flush lruvec stats in the refault

Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat")
and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg
stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
32) at worst and for unbounded amount of time.  The commit aa48e47e3906
moved the lruvec stats to rstat infrastructure and the commit
7e1c0d6f5820 bounded the error for all the lruvec stats to (nr_cpus *
32) at worst for at most 2 seconds.  More specifically it decoupled the
number of stats and the number of cgroups from the error rate.

However this reduction in error comes with the cost of triggering the
slowpath of stats update more frequently.  Previously in the slowpath
the kernel adds the stats up the memcg tree.  After aa48e47e3906, the
kernel triggers the asyn lruvec stats flush through queue_work().  This
causes regression reports from 0day kernel bot [1] as well as from
phoronix test suite [2].

We tried two options to fix the regression:

1) Increase the threshold to trigger the slowpath in lruvec stats
    update codepath from 32 to 512.

2) Remove the slowpath from lruvec stats update codepath and instead
    flush the stats in the page refault codepath. The assumption is that
    the kernel timely flush the stats, so, the update tree would be
    small in the refault codepath to not cause the preformance impact.

Following are the results of will-it-scale/page_fault[1|2|3] benchmark
on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1
(4) 5.15-rc1 with option-2.

  test       (1)      (2)               (3)               (4)
  pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
  pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
  pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)

From the above result, it seems like the option-2 not only solves the
regression but also improves the performance for at least these
benchmarks.

Feng Tang (intel) ran the aim7 benchmark with these two options and
confirms that option-1 reduces the regression but option-2 removes the
regression.

Michael Larabel (phoronix) ran multiple benchmarks with these options
and reported the results at [3] and it shows for most benchmarks
option-2 removes the regression introduced by the commit aa48e47e3906
("memcg: infrastructure to flush memcg stats").

Based on the experiment results, this patch proposed the option-2 as the
solution to resolve the regression.

Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020
Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress
Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104
Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Tested-by: Michael Larabel <Michael@phoronix.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>,
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

atlantic: Fix issue in the pm resume flow.

After fixing hibernation resume flow, another usecase was found which
should be explicitly handled - resume when device is in "down" state.
Invoke aq_nic_init jointly with aq_nic_start only if ndev was already
up during suspend/hibernate. We still need to perform nic_deinit() if
caller requests for it, to handle the freeze/resume scenarios.

Fixes: 57f780f1c433 ("atlantic: Fix driver resume flow.")
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Don't allow aRFS for encapsulated packets

Driver doesn't support aRFS for encapsulated packets, return early error
in such a case.

Fixes: 1eb8c695bda9 ("net/mlx4_en: Add accelerated RFS support")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>