review.tizen.org Git - platform/kernel/linux-starfive.git/log

projects / platform / kernel / linux-starfive.git / log

Arnd Bergmann [Tue, 3 Aug 2021 11:40:41 +0000 (13:40 +0200)]

3c509: stop calling netdev_boot_setup_check

This driver never uses the information returned by
netdev_boot_setup_check, and is not called by the boot-time probing from
driver/net/Space.c, so just remove these stale references.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Arnd Bergmann [Tue, 3 Aug 2021 11:40:40 +0000 (13:40 +0200)]

appletalk: ltpc: remove static probing

This driver never relies on the netdev_boot_setup_check()
to get its configuration, so it can just as well do its
own probing all the time.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Arnd Bergmann [Tue, 3 Aug 2021 11:40:39 +0000 (13:40 +0200)]

natsemi: sonic: stop calling netdev_boot_setup_check

The data from the kernel command line is no longer used since the
probe function gets it from the platform device resources instead.

The jazz version was changed to be like this in 2007, the xtensa
version apparently copied the code from there.

Fixes: ed9f0e0bf3ce ("remove setup of platform device from jazzsonic.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Arnd Bergmann [Tue, 3 Aug 2021 11:40:38 +0000 (13:40 +0200)]

bcmgenet: remove call to netdev_boot_setup_check

The driver has never used the netdev->{irq,base_addr,mem_start}
members, so this call is completely unnecessary.

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Tue, 3 Aug 2021 11:58:22 +0000 (12:58 +0100)]

Merge branch 'ethtool-runtime-pm'

Heiner Kallweit says:

====================
ethtool: runtime-resume netdev parent before ethtool ops

If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if
  not accessing the device) will fail because netif_device_present()
  returns false
- ethtool ops may fail because device is not accessible (e.g. because being
  in D3 in case of a PCI device)

It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing
ethtool ops in kernel.

This patch series covers the typical case that the netdev parent is power-
managed, e.g. a PCI device. Not sure whether cases exist where the netdev
itself is power-managed. If yes then we may need an extension for this.
But the series as-is at least shouldn't cause problems in that case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Heiner Kallweit [Sun, 1 Aug 2021 10:41:31 +0000 (12:41 +0200)]

ethtool: runtime-resume netdev parent in ethnl_ops_begin

If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
  accessing the device) will fail because netif_device_present() returns
  false
- ethtool ops may fail because device is not accessible (e.g. because being
  in D3 in case of a PCI device)

It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Heiner Kallweit [Sun, 1 Aug 2021 10:40:05 +0000 (12:40 +0200)]

ethtool: move netif_device_present check from ethnl_parse_header_dev_get to ethnl_ops_begin

If device is runtime-suspended and not accessible then it may be
flagged as not present. If checking whether device is present is
done too early then we may bail out before we have the chance to
runtime-resume the device. Therefore move this check to
ethnl_ops_begin(). This is in preparation of a follow-up patch
that tries to runtime-resume the device before executing ethtool
ops.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Heiner Kallweit [Sun, 1 Aug 2021 10:37:39 +0000 (12:37 +0200)]

ethtool: move implementation of ethnl_ops_begin/complete to netlink.c

In preparation of subsequent extensions to both functions move the
implementations from netlink.h to netlink.c.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Heiner Kallweit [Sun, 1 Aug 2021 10:36:48 +0000 (12:36 +0200)]

ethtool: runtime-resume netdev parent before ethtool ioctl ops

If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
  accessing the device) will fail because netif_device_present() returns
  false
- ethtool ops may fail because device is not accessible (e.g. because being
  in D3 in case of a PCI device)

It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jakub Kicinski [Mon, 2 Aug 2021 17:57:29 +0000 (10:57 -0700)]

virtio-net: realign page_to_skb() after merges

We ended up merging two versions of the same patch set:

commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
commit 5c37711d9f27 ("virtio-net: fix for unable to handle page fault for address")

into net, and

commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
commit 6c66c147b9a4 ("virtio-net: fix for unable to handle page fault for address")

into net-next. Redo the merge from commit 126285651b7f ("Merge
ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net"), so that
the most recent code remains.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Tue, 3 Aug 2021 11:38:47 +0000 (12:38 +0100)]

Merge branch 'bnxt_en-rx-ring'

Michael Chan says:

====================
bnxt_en: Increase maximum RX ring size when jumbo ring is unused

The RX jumbo ring is automatically enabled when HW GRO/LRO is enabled or
when the MTU exceeds the page size.  The RX jumbo ring provides a lot
more RX buffer space when it is in use.  When the RX jumbo ring is not
in use, some users report that the current maximum of 2K buffers is
too limiting.  This patchset increases the maximum to 8K buffers when
the RX jumbo ring is not used.  The default RX ring size is unchanged
at 511.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Michael Chan [Mon, 2 Aug 2021 14:52:39 +0000 (10:52 -0400)]

bnxt_en: Increase maximum RX ring size if jumbo ring is not used

The current maximum RX ring size is defined assuming the RX jumbo ring
(aka aggregation ring) is used.  The RX jumbo ring is automicatically used
when the MTU exceeds a threshold or when rx-gro-hw/lro is enabled.  The RX
jumbo ring is automatically sized up to 4 times the size of the RX ring
size.

The BNXT_MAX_RX_DESC_CNT constant is the upper limit on the size of the
RX ring whether or not the RX jumbo ring is used.  Obviously, the
maximum amount of RX buffer space is significantly less when the RX jumbo
ring is not used.

To increase flexibility for the user who does not use the RX jumbo ring,
we now define a bigger maximum RX ring size when the RX jumbo ring is not
used.  The maximum RX ring size is now up to 8K when the RX jumbo ring
is not used.  The maximum completion ring size also needs to be scaled
up to accomodate the larger maximum RX ring size.

Note that when the RX jumbo ring is re-enabled, the RX ring size will
automatically drop if it exceeds the maximum.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Michael Chan [Mon, 2 Aug 2021 14:52:38 +0000 (10:52 -0400)]

bnxt_en: Don't use static arrays for completion ring pages

We currently store these page addresses and DMA addreses in static
arrays.  On systems with 4K pages, we support up to 64 pages per
completion ring.  The actual number of pages for each completion ring
may be much less than 64.  For example, when the RX ring size is set
to the default 511 entries, only 16 completion ring pages are needed
per ring.

In the next patch, we'll be doubling the maximum number of completion
pages.  So we convert to allocate these arrays as needed instead of
declaring them statically.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yajun Deng [Mon, 2 Aug 2021 08:05:08 +0000 (16:05 +0800)]

net: Keep vertical alignment

Those files under /proc/net/stat/ don't have vertical alignment, it looks
very difficult. Modify the seq_printf statement, keep vertical alignment.

v2:
- Use seq_puts() and seq_printf() correctly.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Hangbin Liu [Mon, 2 Aug 2021 03:02:19 +0000 (11:02 +0800)]

bonding: add new option lacp_active

Add an option lacp_active, which is similar with team's runner.active.
This option specifies whether to send LACPDU frames periodically. If set
on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames acts as "speak when spoken to".

Note, the LACPDU state frames still will be sent when init or unbind port.

v2: remove module parameter

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

zhouchuangao [Mon, 2 Aug 2021 02:18:38 +0000 (19:18 -0700)]

qed: Remove duplicated include of kernel.h

Duplicate include header file <linux/kernel.h>
line 4: #include <linux/kernel.h>
line 7: #include <linux/kernel.h>

Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Len Baker [Sun, 1 Aug 2021 17:12:26 +0000 (19:12 +0200)]

drivers/net/usb: Remove all strcpy() uses

strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors. The safe replacement is strscpy().

Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shai Malin [Sun, 1 Aug 2021 10:28:40 +0000 (13:28 +0300)]

qed: Remove redundant prints from the iWARP SYN handling

Remove redundant prints from the iWARP SYN handling.

Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shai Malin [Sun, 1 Aug 2021 10:26:38 +0000 (13:26 +0300)]

qed: Skip DORQ attention handling during recovery

The device recovery flow will reset the entire HW device, in that case
the DORQ HW block attention is redundant.

Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shai Malin [Sun, 1 Aug 2021 10:23:40 +0000 (13:23 +0300)]

qed: Avoid db_recovery during recovery

Avoid calling the qed doorbell recovery - qed_db_rec_handler()
during device recovery.

Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Tue, 3 Aug 2021 10:21:39 +0000 (11:21 +0100)]

Merge branch 'skb_expand_head'

Vasily Averin says:

====================
skbuff: introduce skb_expand_head()

currently if skb does not have enough headroom skb_realloc_headrom is called.
It is not optimal because it creates new skb.

this patch set introduces new helper skb_expand_head()
Unlike skb_realloc_headroom, it does not allocate a new skb if possible;
copies skb->sk on new skb when as needed and frees original skb in case of failures.

This helps to simplify ip[6]_finish_output2(), ip6_xmit() and few other
functions in vrf, ax25 and bpf.

There are few other cases where this helper can be used
but it requires an additional investigations.

v3 changes:
- ax25 compilation warning fixed
- v5.14-rc4 rebase
- now it does not depend on non-committed pathces

v2 changes:
- helper's name was changed to skb_expand_head
- fixed few mistakes inside skb_expand_head():
    skb_set_owner_w should set sk on nskb
    kfree was replaced by kfree_skb()
    improved warning message
- added minor refactoring in changed functions in vrf and bpf patches
- removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:54 +0000 (11:52 +0300)]

bpf: use skb_expand_head in bpf_out_neigh_v4/6

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:47 +0000 (11:52 +0300)]

ax25: use skb_expand_head

Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:40 +0000 (11:52 +0300)]

vrf: use skb_expand_head in vrf_finish_output

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:35 +0000 (11:52 +0300)]

ipv4: use skb_expand_head in ip_finish_output2

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:29 +0000 (11:52 +0300)]

ipv6: use skb_expand_head in ip6_xmit

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:22 +0000 (11:52 +0300)]

ipv6: use skb_expand_head in ip6_finish_output2

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vasily Averin [Mon, 2 Aug 2021 08:52:15 +0000 (11:52 +0300)]

skbuff: introduce skb_expand_head()

Like skb_realloc_headroom(), new helper increases headroom of specified skb.
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
copies skb->sk on new skb when as needed and frees original skb in case
of failures.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Tue, 3 Aug 2021 10:16:13 +0000 (11:16 +0100)]

Merge tag 'mlx5-updates-2021-08-02' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
This patch-set changes the TTC (Traffic Type Classification) logic
to be independent from the mlx5 ethernet driver by renaming the traffic
types enums and making the TTC API generic to the mlx5 core driver.

It allows to decouple TTC logic from mlx5e and reused by other parts
of mlx5 drivers, namely ADQ and lag TX steering hashing.

Patches overview:
1 - Rename traffic type enums to be mlx5 generic.
2 - Rename related TTC arguments and functions.
3 - Remove dependency in the mlx5e driver from the TTC implementation.
4 - Move TTC logic to fs_ttc.
5 - Embed struct mlx5_ttc_table in fs_ttc.

The refactoring series is followed by misc' cleanup patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiapeng Chong [Thu, 22 Jul 2021 09:58:16 +0000 (17:58 +0800)]

net/mlx5: Fix missing return value in mlx5_devlink_eswitch_inline_mode_set()

The return value is missing in this code scenario, add the return value
'0' to the return value 'err'.

Eliminate the follow smatch warning:

drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3083
mlx5_devlink_eswitch_inline_mode_set() warn: missing error code 'err'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 8e0aa4bc959c ("net/mlx5: E-switch, Protect eswitch mode changes")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Mon, 26 Jul 2021 12:17:31 +0000 (15:17 +0300)]

net/mlx5e: Return -EOPNOTSUPP if more relevant when parsing tc actions

Instead of returning -EINVAL.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Mon, 26 Jul 2021 12:13:35 +0000 (15:13 +0300)]

net/mlx5e: Remove redundant assignment of counter to null

counter is being initialized before being used.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Mon, 26 Jul 2021 12:11:43 +0000 (15:11 +0300)]

net/mlx5e: Remove redundant parse_attr arg

Passing parse_attr is redundant in parse_tc_nic_actions() and
mlx5e_tc_add_nic_flow() as we can get it from flow.
This is the same as with parse_tc_fdb_actions() and mlx5e_tc_add_fdb_flow().

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Sun, 25 Jul 2021 13:04:00 +0000 (16:04 +0300)]

net/mlx5e: Remove redundant cap check for flow counter

The cap is very old and today will always exists.
The cap is not being checked anywhere else. Remove the check from
drop action when parsing tc rules in nic mode.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Thu, 22 Jul 2021 09:10:54 +0000 (12:10 +0300)]

net/mlx5e: Remove redundant filter_dev arg from parse_tc_fdb_actions()

filter_dev is saved in parse_attr. and being used in other cases from
there. use it also for the leftover case.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Roi Dayan [Thu, 22 Jul 2021 08:58:08 +0000 (11:58 +0300)]

net/mlx5e: Remove redundant tc act includes

Since the code changed to use the flow action infra
there is no usage of tcf values from those includes.
Remove those.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maor Gottlieb [Sun, 18 Jul 2021 12:53:53 +0000 (15:53 +0300)]

net/mlx5: Embed mlx5_ttc_table

mlx5_ttc_table struct shouldn't be exposed to the users so
this patch make it internal to ttc.

In addition add a getter function to get the TTC flow table for users
that need to add a rule which points on it.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>

commit | commitdiff | tree

Maor Gottlieb [Fri, 2 Jul 2021 07:38:32 +0000 (10:38 +0300)]

net/mlx5: Move TTC logic to fs_ttc

Now that TTC logic is not dependent on mlx5e structs, move it to
lib/fs_ttc.c so it could be used other part of the mlx5 driver.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maor Gottlieb [Fri, 2 Jul 2021 11:25:14 +0000 (14:25 +0300)]

net/mlx5e: Decouple TTC logic from mlx5e

Remove dependency in the mlx5e driver from the TTC implementation
by changing the TTC related functions to receive mlx5 generic arguments.
It allows to decouple TTC logic from mlx5e and reused by other parts of
mlx5 driver.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maor Gottlieb [Fri, 2 Jul 2021 11:49:37 +0000 (14:49 +0300)]

net/mlx5e: Rename some related TTC args and functions

Since TTC logic is going to be moved to a separate file, make the
relevant functions and arguments that used by TTC to be mlx5 generic.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maor Gottlieb [Fri, 2 Jul 2021 08:42:28 +0000 (11:42 +0300)]

net/mlx5e: Rename traffic type enums

Rename traffic type enums as part of the preparation for moving
the traffic type logic to a separate file.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maxim Mikityanskiy [Mon, 12 Apr 2021 16:10:17 +0000 (19:10 +0300)]

net/mlx5e: Allocate the array of channels according to the real max_nch

The channels array in struct mlx5e_rx_res is converted to a dynamic one,
which will use the dynamic value of max_nch instead of
implementation-defined maximum of MLX5E_MAX_NUM_CHANNELS.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maxim Mikityanskiy [Fri, 9 Apr 2021 16:01:51 +0000 (19:01 +0300)]

net/mlx5e: Hide all implementation details of mlx5e_rx_res

This commit moves all implementation details of struct mlx5e_rx_res
under en/rx_res.c. All access to RX resources is now done using methods.
Encapsulating RX resources into an object allows for better
manageability, because all the implementation details are now in a
single place, and external code can use only a limited set of API
methods to init/teardown the whole thing, reconfigure RSS and LRO
parameters, connect TIRs to flow steering and activate/deactivate TIRs.

mlx5e_rx_res is self-contained and doesn't depend on struct mlx5e_priv
or include en.h.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maxim Mikityanskiy [Fri, 9 Apr 2021 14:31:09 +0000 (17:31 +0300)]

net/mlx5e: Introduce mlx5e_channels API to get RQNs

Currently, struct mlx5e_channels is defined in en.h, along with a lot of
other stuff. In the following commit mlx5e_rx_res will need to get RQNs
(RQ hardware IDs), given a pointer to mlx5e_channels and the channel
index. In order to make it possible without including the whole en.h,
this commit introduces functions that will hide the implementation
details of mlx5e_channels.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Maxim Mikityanskiy [Thu, 8 Apr 2021 09:16:34 +0000 (12:16 +0300)]

net/mlx5e: Use a new initializer to build uniform indir table

Replace mlx5e_build_default_indir_rqt with a new initializer of struct
mlx5e_rss_params_indir that works directly with the struct, rather than
its internals.

The new initializer is called mlx5e_rss_params_indir_init_uniform, which
also reflects the purpose (uniform spreading) better.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:37:42 +0000 (16:37 +0100)]

net/mlx4: make the array states static const, makes object smaller

Don't populate the array states on the stack but instead it
static const. Makes the object code smaller by 79 bytes.

Before:
   text   data   bss    dec    hex filename
  21309   8304   192  29805   746d drivers/net/ethernet/mellanox/mlx4/qp.o

After:
   text   data   bss    dec    hex filename
  21166   8368   192  29726   741e drivers/net/ethernet/mellanox/mlx4/qp.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210801153742.147304-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:26:50 +0000 (16:26 +0100)]

net: 3c509: make the array if_names static const, makes object smaller

Don't populate the array if_names on the stack but instead it
static const. Makes the object code smaller by 99 bytes.

Before:
   text    data     bss     dec     hex filename
  27886   10752     672   39310    998e ./drivers/net/ethernet/3com/3c509.o

After:
   text    data     bss     dec     hex filename
  27723   10816     672   39211    992b ./drivers/net/ethernet/3com/3c509.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152650.146572-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:22:09 +0000 (16:22 +0100)]

dpaa2-eth: make the array faf_bits static const, makes object smaller

Don't populate the array faf_bits on the stack but instead it
static const. Makes the object code smaller by 175 bytes.

Before:
   text  data   bss     dec   hex filename
   9645  4552     0   14197  3775 ../freescale/dpaa2/dpaa2-eth-devlink.o

After:
   text  data   bss     dec   hex filename
   9406  4616     0   14022  36c6 ../freescale/dpaa2/dpaa2-eth-devlink.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152209.146359-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:16:59 +0000 (16:16 +0100)]

qlcnic: make the array random_data static const, makes object smaller

Don't populate the array random_data on the stack but instead it
static const. Makes the object code smaller by 66 bytes.

Before:
   text    data     bss     dec     hex filename
  52895   10976       0   63871    f97f ../qlogic/qlcnic/qlcnic_ethtool.o

After:
   text    data     bss     dec     hex filename
  52701   11104       0   63805    f93d ../qlogic//qlcnic/qlcnic_ethtool.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151659.146113-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:06:47 +0000 (16:06 +0100)]

net: marvell: make the array name static, makes object smaller

Don't populate the const array name on the stack but instead it
static. Makes the object code smaller by 28 bytes. Add a missing
const to clean up a checkpatch warning.

Before:
   text    data   bss     dec     hex filename
124565   31565   384  156514   26362 drivers/net/ethernet/marvell/sky2.o

After:
   text    data   bss     dec     hex filename
124441   31661   384  156486   26346 drivers/net/ethernet/marvell/sky2.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801150647.145728-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 15:12:05 +0000 (16:12 +0100)]

cxgb4: make the array match_all_mac static, makes object smaller

Don't populate the array match_all_mac on the stack but instead it
static const. Makes the object code smaller by 75 bytes.

Before:
   text    data     bss     dec     hex filename
  46701    8960      64   55725    d9ad ../chelsio/cxgb4/cxgb4_filter.o

After:
   text    data     bss     dec     hex filename
  46338    9120     192   55650    d962 ../chelsio/cxgb4/cxgb4_filter.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151205.145924-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Ahern [Mon, 2 Aug 2021 16:02:21 +0000 (10:02 -0600)]

ipv4: Fix refcount warning for new fib_info

Ioana reported a refcount warning when booting over NFS:

[    5.042532] ------------[ cut here ]------------
[    5.047184] refcount_t: addition on 0; use-after-free.
[    5.052324] WARNING: CPU: 7 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0xa4/0x150
...
[    5.167201] Call trace:
[    5.169635]  refcount_warn_saturate+0xa4/0x150
[    5.174067]  fib_create_info+0xc00/0xc90
[    5.177982]  fib_table_insert+0x8c/0x620
[    5.181893]  fib_magic.isra.0+0x110/0x11c
[    5.185891]  fib_add_ifaddr+0xb8/0x190
[    5.189629]  fib_inetaddr_event+0x8c/0x140

fib_treeref needs to be set after kzalloc. The old code had a ++ which
led to the confusion when the int was replaced by a refcount_t.

Fixes: 79976892f7ea ("net: convert fib_treeref from int to refcount_t")
Signed-off-by: David Ahern <dsahern@kernel.org>
Reported-by: Ioana Ciornei <ciorneiioana@gmail.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20210802160221.27263-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 07:01:55 +0000 (08:01 +0100)]

net: phy: mscc: make some arrays static const, makes object smaller

Don't populate arrays on the stack but instead them static const.
Makes the object code smaller by 280 bytes.

Before:
   text    data     bss     dec     hex filename
  24142    4368     192   28702    701e ./drivers/net/phy/mscc/mscc_ptp.o

After:
   text    data     bss     dec     hex filename
  23830    4400     192   28422    6f06 ./drivers/net/phy/mscc/mscc_ptp.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801070155.139057-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Sun, 1 Aug 2021 06:53:28 +0000 (07:53 +0100)]

netdevsim: make array res_ids static const, makes object smaller

Don't populate the array res_ids on the stack but instead it
static const. Makes the object code smaller by 14 bytes.

Before:
   text    data     bss     dec     hex filename
  50833    8314     256   59403    e80b ./drivers/net/netdevsim/fib.o

After:
   text    data     bss     dec     hex filename
  50755    8378     256   59389    e7fd ./drivers/net/netdevsim/fib.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801065328.138906-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gustavo A. R. Silva [Sat, 31 Jul 2021 17:08:30 +0000 (12:08 -0500)]

net/ipv4: Replace one-element array with flexible-array member

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

Use an anonymous union with a couple of anonymous structs in order to
keep userspace unchanged:

$ pahole -C ip_msfilter net/ipv4/ip_sockglue.o
struct ip_msfilter {
union {
struct {
__be32     imsf_multiaddr_aux;   /*     0     4 */
__be32     imsf_interface_aux;   /*     4     4 */
__u32      imsf_fmode_aux;       /*     8     4 */
__u32      imsf_numsrc_aux;      /*    12     4 */
__be32     imsf_slist[1];        /*    16     4 */
};                                       /*     0    20 */
struct {
__be32     imsf_multiaddr;       /*     0     4 */
__be32     imsf_interface;       /*     4     4 */
__u32      imsf_fmode;           /*     8     4 */
__u32      imsf_numsrc;          /*    12     4 */
__be32     imsf_slist_flex[0];   /*    16     0 */
};                                       /*     0    16 */
};                                               /*     0    20 */

/* size: 20, cachelines: 1, members: 1 */
/* last cacheline: 20 bytes */
};

Also, refactor the code accordingly and make use of the struct_size()
and flex_array_size() helpers.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vladimir Oltean [Sat, 31 Jul 2021 14:14:32 +0000 (17:14 +0300)]

net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv()

No tagging driver uses this.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Krzysztof Kozlowski [Sat, 31 Jul 2021 10:21:44 +0000 (12:21 +0200)]

nfc: hci: pass callback data param as pointer in nci_request()

The nci_request() receives a callback function and unsigned long data
argument "opt" which is passed to the callback.  Almost all of the
nci_request() callers pass pointer to a stack variable as data argument.
Only few pass scalar value (e.g. u8).

All such callbacks do not modify passed data argument and in previous
commit they were made as const.  However passing pointers via unsigned
long removes the const annotation.  The callback could simply cast
unsigned long to a pointer to writeable memory.

Use "const void *" as type of this "opt" argument to solve this and
prevent modifying the pointed contents.  This is also consistent with
generic pattern of passing data arguments - via "void *".  In few places
which pass scalar values, use casts via "unsigned long" to suppress any
warnings.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Christophe JAILLET [Sat, 31 Jul 2021 07:10:00 +0000 (09:10 +0200)]

cavium: switch from 'pci_' to 'dma_' API

The wrappers in include/linux/pci-dma-compat.h should go away.

The patch has been generated with the coccinelle script below. It has been
hand modified to use 'dma_set_mask_and_coherent()' instead of
'pci_set_dma_mask()/pci_set_consistent_dma_mask()' when applicable.

It has been compile tested.

@@
@@
-    PCI_DMA_BIDIRECTIONAL
+    DMA_BIDIRECTIONAL

@@
@@
-    PCI_DMA_TODEVICE
+    DMA_TO_DEVICE

@@
@@
-    PCI_DMA_FROMDEVICE
+    DMA_FROM_DEVICE

@@
@@
-    PCI_DMA_NONE
+    DMA_NONE

@@
expression e1, e2, e3;
@@
-    pci_alloc_consistent(e1, e2, e3)
+    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)

@@
expression e1, e2, e3;
@@
-    pci_zalloc_consistent(e1, e2, e3)
+    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)

@@
expression e1, e2, e3, e4;
@@
-    pci_free_consistent(e1, e2, e3, e4)
+    dma_free_coherent(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_map_single(e1, e2, e3, e4)
+    dma_map_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_single(e1, e2, e3, e4)
+    dma_unmap_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4, e5;
@@
-    pci_map_page(e1, e2, e3, e4, e5)
+    dma_map_page(&e1->dev, e2, e3, e4, e5)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_page(e1, e2, e3, e4)
+    dma_unmap_page(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_map_sg(e1, e2, e3, e4)
+    dma_map_sg(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_sg(e1, e2, e3, e4)
+    dma_unmap_sg(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_single_for_cpu(e1, e2, e3, e4)
+    dma_sync_single_for_cpu(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_single_for_device(e1, e2, e3, e4)
+    dma_sync_single_for_device(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_sg_for_cpu(e1, e2, e3, e4)
+    dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_sg_for_device(e1, e2, e3, e4)
+    dma_sync_sg_for_device(&e1->dev, e2, e3, e4)

@@
expression e1, e2;
@@
-    pci_dma_mapping_error(e1, e2)
+    dma_mapping_error(&e1->dev, e2)

@@
expression e1, e2;
@@
-    pci_set_dma_mask(e1, e2)
+    dma_set_mask(&e1->dev, e2)

@@
expression e1, e2;
@@
-    pci_set_consistent_dma_mask(e1, e2)
+    dma_set_coherent_mask(&e1->dev, e2)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vladimir Oltean [Fri, 30 Jul 2021 22:57:14 +0000 (01:57 +0300)]

net: dsa: mt7530: drop paranoid checks in .get_tag_protocol()

It is desirable to reduce the surface of DSA_TAG_PROTO_NONE as much as
we can, because we now have options for switches without hardware
support for DSA tagging, and the occurrence in the mt7530 driver is in
fact quite gratuitout and easy to remove. Since ds->ops->get_tag_protocol()
is only called for CPU ports, the checks for a CPU port in
mtk_get_tag_protocol() are redundant and can be removed.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Mon, 2 Aug 2021 09:47:12 +0000 (10:47 +0100)]

Merge branch 'octeon-drr-config'

Sunil Goutham says:

====================
cn10k: DWRR MTU and weights configuration

On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.

On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum.

This patch series addresses this HW change on CN10K silicons,
both admin function and PF/VF drivers are modified.

Also added support to program DWRR MTU via devlink params.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Sunil Goutham [Fri, 30 Jul 2021 11:49:14 +0000 (17:19 +0530)]

octeontx2-pf: cn10k: Config DWRR weight based on MTU

Program SQ, MDQ, TL4 to TL2 transmit scheduler queues' DWRR
weight based on DWRR MTU programmed at NIX_AF_DWRR_RPM_MTU.
The DWRR MTU from admin function is retrieved via mbox.

On OcteaonTx2 silicon, admin function driver responds with DWRR
MTU as '1'. This helps to avoid silicon specific transmit
scheduler DWRR quantum/weight configuration logic.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Sunil Goutham [Fri, 30 Jul 2021 11:49:13 +0000 (17:19 +0530)]

octeontx2-af: cn10k: DWRR MTU configuration

On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.

On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum. For LBK traffic, value
programmed into NIX_AF_DWRR_RPM_MTU register is considered as
DWRR MTU.

This patch programs a default DWRR MTU of 8192 into HW and also
provides a way to change this via devlink params.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Dust Li [Fri, 30 Jul 2021 03:41:55 +0000 (11:41 +0800)]

selftests/net: remove min gso test in packet_snd

This patch removed the 'raw gso min size - 1' test which
always fails now:
./in_netns.sh ./psock_snd -v -c -g -l "${mss}"
  raw gso min size - 1 (expected to fail)
  tx: 1524
  rx: 1472
  OK

After commit 7c6d2ecbda83 ("net: be more gentle about silly
gso requests coming from user"), we relaxed the min gso_size
check in virtio_net_hdr_to_skb().
So when a packet which is smaller then the gso_size,
GSO for this packet will not be set, the packet will be
send/recv successfully.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yufeng Mo [Fri, 30 Jul 2021 02:19:11 +0000 (10:19 +0800)]

bonding: 3ad: fix the concurrency between __bond_release_one() and bond_3ad_state_machine_handler()

Some time ago, I reported a calltrace issue
"did not find a suitable aggregator", please see[1].
After a period of analysis and reproduction, I find
that this problem is caused by concurrency.

Before the problem occurs, the bond structure is like follows:

bond0 - slaver0(eth0) - agg0.lag_ports -> port0 - port1
                      \
                        port0
      \
        slaver1(eth1) - agg1.lag_ports -> NULL
                      \
                        port1

If we run 'ifenslave bond0 -d eth1', the process is like below:

excuting __bond_release_one()
|
bond_upper_dev_unlink()[step1]
|                       |                       |
|                       |                       bond_3ad_lacpdu_recv()
|                       |                       ->bond_3ad_rx_indication()
|                       |                       spin_lock_bh()
|                       |                       ->ad_rx_machine()
|                       |                       ->__record_pdu()[step2]
|                       |                       spin_unlock_bh()
|                       |                       |
|                       bond_3ad_state_machine_handler()
|                       spin_lock_bh()
|                       ->ad_port_selection_logic()
|                       ->try to find free aggregator[step3]
|                       ->try to find suitable aggregator[step4]
|                       ->did not find a suitable aggregator[step5]
|                       spin_unlock_bh()
|                       |
|                       |
bond_3ad_unbind_slave() |
spin_lock_bh()
spin_unlock_bh()

step1: already removed slaver1(eth1) from list, but port1 remains
step2: receive a lacpdu and update port0
step3: port0 will be removed from agg0.lag_ports. The struct is
       "agg0.lag_ports -> port1" now, and agg0 is not free. At the
   same time, slaver1/agg1 has been removed from the list by step1.
   So we can't find a free aggregator now.
step4: can't find suitable aggregator because of step2
step5: cause a calltrace since port->aggregator is NULL

To solve this concurrency problem, put bond_upper_dev_unlink()
after bond_3ad_unbind_slave(). In this way, we can invalid the port
first and skip this port in bond_3ad_state_machine_handler(). This
eliminates the situation that the slaver has been removed from the
list but the port is still valid.

[1]https://lore.kernel.org/netdev/10374.1611947473@famine/

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Cong Wang [Thu, 29 Jul 2021 23:12:14 +0000 (16:12 -0700)]

net_sched: refactor TC action init API

TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().

This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:

TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.

TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.

TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
means we are replacing an existing action.

TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.

The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.

I have tested this patch with tdc and I do not see any
failure related to this patch.

Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Martin Kaiser [Thu, 29 Jul 2021 07:43:54 +0000 (09:43 +0200)]

niu: read property length only if we use it

In three places, the driver calls of_get_property and reads the property
length although the length is not used. Update the calls to not request
the length.

Signed-off-by: Martin Kaiser <martin@kaiser.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jakub Kicinski [Sat, 31 Jul 2021 18:23:25 +0000 (11:23 -0700)]

Merge https://git./linux/kernel/git/bpf/bpf-next

Andrii Nakryiko says:

====================
bpf-next 2021-07-30

We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).

The main changes are:

1) BTF-guided binary data dumping libbpf API, from Alan.

2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.

3) Ambient BPF run context and cgroup storage cleanup, from Andrii.

4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.

5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.

6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.

7) BPF map pinning improvements in libbpf, from Martynas.

8) Improved module BTF support in libbpf and bpftool, from Quentin.

9) Bpftool cleanups and documentation improvements, from Quentin.

10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.

11) Increased maximum cgroup storage size, from Stanislav.

12) Small fixes and improvements to BPF tests and samples, from various folks.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
  tools: bpftool: Complete metrics list in "bpftool prog profile" doc
  tools: bpftool: Document and add bash completion for -L, -B options
  selftests/bpf: Update bpftool's consistency script for checking options
  tools: bpftool: Update and synchronise option list in doc and help msg
  tools: bpftool: Complete and synchronise attach or map types
  selftests/bpf: Check consistency between bpftool source, doc, completion
  tools: bpftool: Slightly ease bash completion updates
  unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
  libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
  tools: bpftool: Support dumping split BTF by id
  libbpf: Add split BTF support for btf__load_from_kernel_by_id()
  tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
  tools: Free BTF objects at various locations
  libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
  libbpf: Rename btf__load() as btf__load_into_kernel()
  libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
  bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
  tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
  bpf: Increase supported cgroup storage value size
  libbpf: Fix race when pinning maps in parallel
  ...
====================

Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Sat, 31 Jul 2021 16:14:46 +0000 (09:14 -0700)]

Merge git://git./linux/kernel/git/netdev/net

Conflicting commits, all resolutions pretty trivial:

drivers/bus/mhi/pci_generic.c
  5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU")
  56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean")

drivers/nfc/s3fwrn5/firmware.c
  a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label")
  46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
  801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")

MAINTAINERS
  7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")
  8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 23:01:36 +0000 (16:01 -0700)]

Merge tag 'net-5.14-rc4' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
  (mac80211) and netfilter trees.

  Current release - regressions:

   - mac80211: fix starting aggregation sessions on mesh interfaces

  Current release - new code bugs:

   - sctp: send pmtu probe only if packet loss in Search Complete state

   - bnxt_en: add missing periodic PHC overflow check

   - devlink: fix phys_port_name of virtual port and merge error

   - hns3: change the method of obtaining default ptp cycle

   - can: mcba_usb_start(): add missing urb->transfer_dma initialization

  Previous releases - regressions:

   - set true network header for ECN decapsulation

   - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
     LRO

   - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
     PHY

   - sctp: fix return value check in __sctp_rcv_asconf_lookup

  Previous releases - always broken:

   - bpf:
       - more spectre corner case fixes, introduce a BPF nospec
         instruction for mitigating Spectre v4
       - fix OOB read when printing XDP link fdinfo
       - sockmap: fix cleanup related races

   - mac80211: fix enabling 4-address mode on a sta vif after assoc

   - can:
       - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
       - j1939: j1939_session_deactivate(): clarify lifetime of session
         object, avoid UAF
       - fix number of identical memory leaks in USB drivers

   - tipc:
       - do not blindly write skb_shinfo frags when doing decryption
       - fix sleeping in tipc accept routine"

* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  gve: Update MAINTAINERS list
  can: esd_usb2: fix memory leak
  can: ems_usb: fix memory leak
  can: usb_8dev: fix memory leak
  can: mcba_usb_start(): add missing urb->transfer_dma initialization
  can: hi311x: fix a signedness bug in hi3110_cmd()
  MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
  bpf: Fix leakage due to insufficient speculative store bypass mitigation
  bpf: Introduce BPF nospec instruction for mitigating Spectre v4
  sis900: Fix missing pci_disable_device() in probe and remove
  net: let flow have same hash in two directions
  nfc: nfcsim: fix use after free during module unload
  tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
  sctp: fix return value check in __sctp_rcv_asconf_lookup
  nfc: s3fwrn5: fix undefined parameter values in dev_err()
  net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
  net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
  net/mlx5: Unload device upon firmware fatal error
  net/mlx5e: Fix page allocation failure for ptp-RQ over SF
  net/mlx5e: Fix page allocation failure for trap-RQ over SF
  ...

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 22:56:24 +0000 (15:56 -0700)]

Merge tag 'acpi-5.14-rc4' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These revert a recent IRQ resources handling modification that turned
  out to be problematic, fix suspend-to-idle handling on AMD platforms
  to take upcoming systems into account properly and fix the retrieval
  of the DPTF attributes of the PCH FIVR.

  Specifics:

   - Revert recent change of the ACPI IRQ resources handling that
     attempted to improve the ACPI IRQ override selection logic, but
     introduced serious regressions on some systems (Hui Wang).

   - Fix up quirks for AMD platforms in the suspend-to-idle support code
     so as to take upcoming systems using uPEP HID AMDI007 into account
     as appropriate (Mario Limonciello).

   - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
     agrees on the return data type with the ACPI control method
     evaluated for this purpose (Srinivas Pandruvada)"

* tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: DPTF: Fix reading of attributes
  Revert "ACPI: resources: Add checks for ACPI IRQ override"
  ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 22:42:34 +0000 (15:42 -0700)]

pipe: make pipe writes always wake up readers

Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
"The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/
Link: https://github.com/realm/realm-core
Link: https://github.com/realm/realm-core/issues/4666
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Andrii Nakryiko [Fri, 30 Jul 2021 22:40:28 +0000 (15:40 -0700)]

Merge branch 'tools: bpftool: update, synchronise and validate types and options'

Quentin Monnet says:

====================

To work with the different program types, map types, attach types etc.
supported by eBPF, bpftool needs occasional updates to learn about the new
features supported by the kernel. When such types translate into new
keyword for the command line, updates are expected in several locations:
typically, the help message displayed from bpftool itself, the manual page,
and the bash completion file should be updated. The options used by the
different commands for bpftool should also remain synchronised at those
locations.

Several omissions have occurred in the past, and a number of types are
still missing today. This set is an attempt to improve the situation. It
brings up-to-date the lists of types or options in bpftool, and also adds a
Python script to the BPF selftests to automatically check that most of
these lists remain synchronised.

v2:
- Reformat some lines in the bash completion file.
- Do not reformat attach types, to preserve git-blame history.
- Do not call Python script from tools/testing/selftests/bpf/Makefile.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:35 +0000 (22:54 +0100)]

tools: bpftool: Complete metrics list in "bpftool prog profile" doc

Profiling programs with bpftool was extended some time ago to support
two new metrics, namely itlb_misses and dtlb_misses (misses for the
instruction/data translation lookaside buffer). Update the manual page
and bash completion accordingly.

Fixes: 450d060e8f75 ("bpftool: Add {i,d}tlb_misses support for bpftool profile")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-8-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:34 +0000 (22:54 +0100)]

tools: bpftool: Document and add bash completion for -L, -B options

The -L|--use-loader option for using loader programs when loading, or
when generating a skeleton, did not have any documentation or bash
completion. Same thing goes for -B|--base-btf, used to pass a path to a
base BTF object for split BTF such as BTF for kernel modules.

This patch documents and adds bash completion for those options.

Fixes: 75fa1777694c ("tools/bpftool: Add bpftool support for split BTF")
Fixes: d510296d331a ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-7-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:33 +0000 (22:54 +0100)]

selftests/bpf: Update bpftool's consistency script for checking options

Update the script responsible for checking that the different types used
at various places in bpftool are synchronised, and extend it to check
the consistency of options between the help messages in the source code
and the manual pages.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-6-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:32 +0000 (22:54 +0100)]

tools: bpftool: Update and synchronise option list in doc and help msg

All bpftool commands support the options for JSON output and debug from
libbpf. In addition, some commands support additional options
corresponding to specific use cases.

The list of options described in the man pages for the different
commands are not always accurate. The messages for interactive help are
mostly limited to HELP_SPEC_OPTIONS, and are even less representative of
the actual set of options supported for the commands.

Let's update the lists:

- HELP_SPEC_OPTIONS is modified to contain the "default" options (JSON
  and debug), and to be extensible (no ending curly bracket).
- All commands use HELP_SPEC_OPTIONS in their help message, and then
  complete the list with their specific options.
- The lists of options in the man pages are updated.
- The formatting of the list for bpftool.rst is adjusted to match
  formatting for the other man pages. This is for consistency, and also
  because it will be helpful in a future patch to automatically check
  that the files are synchronised.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-5-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:31 +0000 (22:54 +0100)]

tools: bpftool: Complete and synchronise attach or map types

Update bpftool's list of attach type names to tell it about the latest
attach types, or the "ringbuf" map. Also update the documentation, help
messages, and bash completion when relevant.

These missing items were reported by the newly added Python script used
to help maintain consistency in bpftool.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-4-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:30 +0000 (22:54 +0100)]

selftests/bpf: Check consistency between bpftool source, doc, completion

Whenever the eBPF subsystem gains new elements, such as new program or
map types, it is necessary to update bpftool if we want it able to
handle the new items.

In addition to the main arrays containing the names of these elements in
the source code, there are also multiple locations to update:

- The help message in the do_help() functions in bpftool's source code.
- The RST documentation files.
- The bash completion file.

This has led to omissions multiple times in the past. This patch
attempts to address this issue by adding consistency checks for all
these different locations. It also verifies that the bpf_prog_type,
bpf_map_type and bpf_attach_type enums from the UAPI BPF header have all
their members present in bpftool.

The script requires no argument to run, it reads and parses the
different files to check, and prints the mismatches, if any. It
currently reports a number of missing elements, which will be fixed in a
later patch:

  $ ./test_bpftool_synctypes.py
  Comparing [...]/linux/tools/bpf/bpftool/map.c (map_type_name) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_MAP_CREATE_TYPES): {'ringbuf'}
  Comparing BPF header (enum bpf_attach_type) and [...]/linux/tools/bpf/bpftool/common.c (attach_type_name): {'BPF_TRACE_ITER', 'BPF_XDP_DEVMAP', 'BPF_XDP', 'BPF_SK_REUSEPORT_SELECT', 'BPF_XDP_CPUMAP', 'BPF_SK_REUSEPORT_SELECT_OR_MIGRATE'}
  Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/prog.c (do_help() ATTACH_TYPE): {'skb_verdict'}
  Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/Documentation/bpftool-prog.rst (ATTACH_TYPE): {'skb_verdict'}
  Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_PROG_ATTACH_TYPES): {'skb_verdict'}

Note that the script does NOT check for consistency between the list of
program types that bpftool claims it accepts and the actual list of
keywords that can be used. This is because bpftool does not "see" them,
they are ELF section names parsed by libbpf. It is not hard to parse the
section_defs[] array in libbpf, but some section names are associated
with program types that bpftool cannot load at the moment. For example,
some programs require a BTF target and an attach target that bpftool
cannot handle. The script may be extended to parse the array and check
only relevant values in the future.

The script is not added to the selftests' Makefile, because doing so
would require all patches with BPF UAPI change to also update bpftool.
Instead it is to be added to the CI.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-3-quentin@isovalent.com

commit | commitdiff | tree

Quentin Monnet [Fri, 30 Jul 2021 21:54:29 +0000 (22:54 +0100)]

tools: bpftool: Slightly ease bash completion updates

Bash completion for bpftool gets two minor improvements in this patch.

Move the detection of attach types for "bpftool cgroup attach" outside
of the "case/esac" bloc, where we cannot reuse our variable holding the
list of supported attach types as a pattern list. After the change, we
have only one list of cgroup attach types to update when new types are
added, instead of the former two lists.

Also rename the variables holding lists of names for program types, map
types, and attach types, to make them more unique. This can make it
slightly easier to point people to the relevant variables to update, but
the main objective here is to help run a script to check that bash
completion is up-to-date with bpftool's source code.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-2-quentin@isovalent.com

commit | commitdiff | tree

Jakub Kicinski [Fri, 30 Jul 2021 20:16:40 +0000 (13:16 -0700)]

Merge branch 'clean-devlink-net-namespace-operations'

Leon Romanovsky says:

====================
Clean devlink net namespace operations

This short series continues my work on devlink core code to make devlink
reload less prone to errors and harden it from API abuse.

Despite first patch being a clear fix, I would ask you to apply it to
net-next anyway, because the fixed patch is anyway old and it will
help us to eliminate merge conflicts that will arise for following
patches or even for the second one.
====================

Link: https://lore.kernel.org/r/cover.1627578998.git.leonro@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Leon Romanovsky [Thu, 29 Jul 2021 17:19:25 +0000 (20:19 +0300)]

devlink: Allocate devlink directly in requested net namespace

There is no need in extra call indirection and check from impossible
flow where someone tries to set namespace without prior call
to devlink_alloc().

Instead of this extra logic and additional EXPORT_SYMBOL, use specialized
devlink allocation function that receives net namespace as an argument.

Such specialized API allows clear view when devlink initialized in wrong
net namespace and/or kernel users don't try to change devlink namespace
under the hood.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Leon Romanovsky [Thu, 29 Jul 2021 17:19:24 +0000 (20:19 +0300)]

devlink: Break parameter notification sequence to be before/after unload/load driver

The change of namespaces during devlink reload calls to driver unload
before it accesses devlink parameters. The commands below causes to
use-after-free bug when trying to get flow steering mode.

* ip netns add n1
* devlink dev reload pci/0000:00:09.0 netns n1

==================================================================
BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
Read of size 4 at addr ffff888009d04308 by task devlink/275

CPU: 6 PID: 275 Comm: devlink Not tainted 5.12.0-rc2+ #2853
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
  dump_stack+0x93/0xc2
  print_address_description.constprop.0+0x18/0x140
  ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
  ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
  kasan_report.cold+0x7c/0xd8
  ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
  mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
  devlink_nl_param_fill+0x1c8/0xe80
  ? __free_pages_ok+0x37a/0x8a0
  ? devlink_flash_update_timeout_notify+0xd0/0xd0
  ? lock_acquire+0x1a9/0x6d0
  ? fs_reclaim_acquire+0xb7/0x160
  ? lock_is_held_type+0x98/0x110
  ? 0xffffffff81000000
  ? lock_release+0x1f9/0x6c0
  ? fs_reclaim_release+0xa1/0xf0
  ? lock_downgrade+0x6d0/0x6d0
  ? lock_is_held_type+0x98/0x110
  ? lock_is_held_type+0x98/0x110
  ? memset+0x20/0x40
  ? __build_skb_around+0x1f8/0x2b0
  devlink_param_notify+0x6d/0x180
  devlink_reload+0x1c3/0x520
  ? devlink_remote_reload_actions_performed+0x30/0x30
  ? mutex_trylock+0x24b/0x2d0
  ? devlink_nl_cmd_reload+0x62b/0x1070
  devlink_nl_cmd_reload+0x66d/0x1070
  ? devlink_reload+0x520/0x520
  ? devlink_get_from_attrs+0x1bc/0x260
  ? devlink_nl_pre_doit+0x64/0x4d0
  genl_family_rcv_msg_doit+0x1e9/0x2f0
  ? mutex_lock_io_nested+0x1130/0x1130
  ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
  ? security_capable+0x51/0x90
  genl_rcv_msg+0x27f/0x4a0
  ? genl_get_cmd+0x3c0/0x3c0
  ? lock_acquire+0x1a9/0x6d0
  ? devlink_reload+0x520/0x520
  ? lock_release+0x6c0/0x6c0
  netlink_rcv_skb+0x11d/0x340
  ? genl_get_cmd+0x3c0/0x3c0
  ? netlink_ack+0x9f0/0x9f0
  ? lock_release+0x1f9/0x6c0
  genl_rcv+0x24/0x40
  netlink_unicast+0x433/0x700
  ? netlink_attachskb+0x730/0x730
  ? _copy_from_iter_full+0x178/0x650
  ? __alloc_skb+0x113/0x2b0
  netlink_sendmsg+0x6f1/0xbd0
  ? netlink_unicast+0x700/0x700
  ? lock_is_held_type+0x98/0x110
  ? netlink_unicast+0x700/0x700
  sock_sendmsg+0xb0/0xe0
  __sys_sendto+0x193/0x240
  ? __x64_sys_getpeername+0xb0/0xb0
  ? do_sys_openat2+0x10b/0x370
  ? __up_read+0x1a1/0x7b0
  ? do_user_addr_fault+0x219/0xdc0
  ? __x64_sys_openat+0x120/0x1d0
  ? __x64_sys_open+0x1a0/0x1a0
  __x64_sys_sendto+0xdd/0x1b0
  ? syscall_enter_from_user_mode+0x1d/0x50
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc69d0af14a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:00007ffc1d8292f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fc69d0af14a
RDX: 0000000000000038 RSI: 0000555f57c56440 RDI: 0000000000000003
RBP: 0000555f57c56410 R08: 00007fc69d17b200 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 146:
  kasan_save_stack+0x1b/0x40
  __kasan_kmalloc+0x99/0xc0
  mlx5_init_fs+0xf0/0x1c50 [mlx5_core]
  mlx5_load+0xd2/0x180 [mlx5_core]
  mlx5_init_one+0x2f6/0x450 [mlx5_core]
  probe_one+0x47d/0x6e0 [mlx5_core]
  pci_device_probe+0x2a0/0x4a0
  really_probe+0x20a/0xc90
  driver_probe_device+0xd8/0x380
  device_driver_attach+0x1df/0x250
  __driver_attach+0xff/0x240
  bus_for_each_dev+0x11e/0x1a0
  bus_add_driver+0x309/0x570
  driver_register+0x1ee/0x380
  0xffffffffa06b8062
  do_one_initcall+0xd5/0x410
  do_init_module+0x1c8/0x760
  load_module+0x6d8b/0x9650
  __do_sys_finit_module+0x118/0x1b0
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 275:
  kasan_save_stack+0x1b/0x40
  kasan_set_track+0x1c/0x30
  kasan_set_free_info+0x20/0x30
  __kasan_slab_free+0x102/0x140
  slab_free_freelist_hook+0x74/0x1b0
  kfree+0xd7/0x2a0
  mlx5_unload+0x16/0xb0 [mlx5_core]
  mlx5_unload_one+0xae/0x120 [mlx5_core]
  mlx5_devlink_reload_down+0x1bc/0x380 [mlx5_core]
  devlink_reload+0x141/0x520
  devlink_nl_cmd_reload+0x66d/0x1070
  genl_family_rcv_msg_doit+0x1e9/0x2f0
  genl_rcv_msg+0x27f/0x4a0
  netlink_rcv_skb+0x11d/0x340
  genl_rcv+0x24/0x40
  netlink_unicast+0x433/0x700
  netlink_sendmsg+0x6f1/0xbd0
  sock_sendmsg+0xb0/0xe0
  __sys_sendto+0x193/0x240
  __x64_sys_sendto+0xdd/0x1b0
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888009d04300
  which belongs to the cache kmalloc-128 of size 128
The buggy address is located 8 bytes inside of
  128-byte region [ffff888009d04300, ffff888009d04380)
The buggy address belongs to the page:
page:0000000086a64ecc refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888009d04000 pfn:0x9d04
head:0000000086a64ecc order:1 compound_mapcount:0
flags: 0x4000000000010200(slab|head)
raw: 4000000000010200 ffffea0000203980 0000000200000002 ffff8880050428c0
raw: ffff888009d04000 000000008020001d 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888009d04200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888009d04280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888009d04300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
  ffff888009d04380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff888009d04400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

The right solution to devlink reload is to notify about deletion of
parameters, unload driver, change net namespaces, load driver and notify
about addition of parameters.

Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload")
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Cong Wang [Fri, 23 Jul 2021 18:36:30 +0000 (11:36 -0700)]

unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()

As Eric noticed, __unix_dgram_recvmsg() may acquire u->iolock
too, so we have to release it before calling this function.

Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>

commit | commitdiff | tree

Hengqi Chen [Fri, 30 Jul 2021 11:40:12 +0000 (19:40 +0800)]

libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf

Add two new APIs: btf__load_vmlinux_btf and btf__load_module_btf.
btf__load_vmlinux_btf is just an alias to the existing API named
libbpf_find_kernel_btf, rename to be more precisely and consistent
with existing BTF APIs. btf__load_module_btf can be used to load
module BTF, add it for completeness. These two APIs are useful for
implementing tracing tools and introspection tools. This is part
of the effort towards libbpf 1.0 ([0]).

[0] Closes: https://github.com/libbpf/libbpf/issues/280

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730114012.494408-1-hengqi.chen@gmail.com

commit | commitdiff | tree

Paolo Abeni [Fri, 30 Jul 2021 16:30:53 +0000 (18:30 +0200)]

sk_buff: avoid potentially clearing 'slow_gro' field

If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro'
field is cleared, too. That could lead to wrong behavior if
the skb later enters the GRO stage.

Fix the potential issue replacing preserving a non-zero value of
the 'slow_gro' field.

Additionally, fix a comment typo.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Rafael J. Wysocki [Fri, 30 Jul 2021 18:26:38 +0000 (20:26 +0200)]

Merge branches 'acpi-resources' and 'acpi-dptf'

* acpi-resources:
Revert "ACPI: resources: Add checks for ACPI IRQ override"

* acpi-dptf:
ACPI: DPTF: Fix reading of attributes

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 18:08:12 +0000 (11:08 -0700)]

Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- gendisk freeing fix (Christoph)

- blk-iocost wake ordering fix (Tejun)

- tag allocation error handling fix (John)

- loop locking fix. While this isn't the prettiest fix in the world,
   nobody has any good alternatives for 5.14. Something to likely
   revisit for 5.15. (Tetsuo)

* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  block: delay freeing the gendisk
  blk-iocost: fix operation ordering in iocg_wake_fn()
  blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
  loop: reintroduce global lock for safe loop_validate_file() traversal

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 18:01:47 +0000 (11:01 -0700)]

Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- A fix for block backed reissue (me)

- Reissue context hardening (me)

- Async link locking fix (Pavel)

* tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  io_uring: fix poll requests leaking second poll entries
  io_uring: don't block level reissue off completion path
  io_uring: always reissue from task_work context
  io_uring: fix race in unified task_work running
  io_uring: fix io_prep_async_link locking

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 17:56:47 +0000 (10:56 -0700)]

Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull libata fixlets from Jens Axboe:

- A fix for PIO highmem (Christoph)

- Kill HAVE_IDE as it's now unused (Lukas)

* tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
arch: Kconfig: clean up obsolete use of HAVE_IDE
libata: fix ata_pio_sector for CONFIG_HIGHMEM

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 17:50:09 +0000 (10:50 -0700)]

Merge tag 'for-5.14-rc3-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix -Warray-bounds warning, to help external patchset to make it
   default treewide

- fix writeable device accounting (syzbot report)

- fix fsync and log replay after a rename and inode eviction

- fix potentially lost error code when submitting multiple bios for
   compressed range

* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: calculate number of eb pages properly in csum_tree_block
  btrfs: fix rw device counting in __btrfs_free_extra_devids
  btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  btrfs: mark compressed range uptodate only if all bio succeed

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 17:36:36 +0000 (10:36 -0700)]

Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:

- resume timing fix for intel-ish driver (Ye Xiang)

- fix for using incorrect MMIO register in amd_sfh driver (Dylan
   MacKenzie)

- Cintiq 24HDT / 27QHDT regression fix and touch processing fix for
   Wacom driver (Jason Gerecke)

- device removal bugfix for ft260 driver (Michael Zaidman)

- other small assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: ft260: fix device removal due to USB disconnect
  HID: wacom: Skip processing of touches with negative slot values
  HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
  HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible"
  HID: apple: Add support for Keychron K1 wireless keyboard
  HID: fix typo in Kconfig
  HID: ft260: fix format type warning in ft260_word_show()
  HID: amd_sfh: Use correct MMIO register for DMA address
  HID: asus: Remove check for same LED brightness on set
  HID: intel-ish-hid: use async resume function

commit | commitdiff | tree

Linus Torvalds [Fri, 30 Jul 2021 17:29:58 +0000 (10:29 -0700)]

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"7 patches.

  Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
  migration, and memcg)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
  slub: fix unreclaimable slab stat for bulk free
  mm/migrate: fix NR_ISOLATED corruption on 64-bit
  mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
  ocfs2: issue zeroout to EOF blocks
  ocfs2: fix zero out valid data
  lib/test_string.c: move string selftest in the Runtime Testing menu

commit | commitdiff | tree

Jakub Kicinski [Fri, 30 Jul 2021 17:29:52 +0000 (19:29 +0200)]

Merge tag 'linux-can-fixes-for-5.14-20210730' of git://git./linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2021-07-30

The first patch is by me and adds Yasushi SHOJI as a reviewer for the
Microchip CAN BUS Analyzer Tool driver.

Dan Carpenter's patch fixes a signedness bug in the hi311x driver.

Pavel Skripkin provides 4 patches, the first targets the mcba_usb
driver by adding the missing urb->transfer_dma initialization, which
was broken in a previous commit. The last 3 patches fix a memory leak
in the usb_8dev, ems_usb and esd_usb2 driver.

* tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: esd_usb2: fix memory leak
  can: ems_usb: fix memory leak
  can: usb_8dev: fix memory leak
  can: mcba_usb_start(): add missing urb->transfer_dma initialization
  can: hi311x: fix a signedness bug in hi3110_cmd()
  MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
====================

Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Wang Hai [Thu, 29 Jul 2021 21:53:54 +0000 (14:53 -0700)]

mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()

When I use kfree_rcu() to free a large memory allocated by kmalloc_node(),
the following dump occurs.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  [...]
  Oops: 0000 [#1] SMP
  [...]
  Workqueue: events kfree_rcu_work
  RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
  RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
  RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
  [...]
  Call Trace:
    kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
    kfree_bulk include/linux/slab.h:413 [inline]
    kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
    process_one_work+0x207/0x530 kernel/workqueue.c:2276
    worker_thread+0x320/0x610 kernel/workqueue.c:2422
    kthread+0x13d/0x160 kernel/kthread.c:313
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

When kmalloc_node() a large memory, page is allocated, not slab, so when
freeing memory via kfree_rcu(), this large memory should not be used by
memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for
slab.

Using page_objcgs_check() instead of page_objcgs() in
memcg_slab_free_hook() to fix this bug.

Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Shakeel Butt [Thu, 29 Jul 2021 21:53:50 +0000 (14:53 -0700)]

slub: fix unreclaimable slab stat for bulk free

SLUB uses page allocator for higher order allocations and update
unreclaimable slab stat for such allocations.  At the moment, the bulk
free for SLUB does not share code with normal free code path for these
type of allocations and have missed the stat update.  So, fix the stat
update by common code.  The user visible impact of the bug is the
potential of inconsistent unreclaimable slab stat visible through
meminfo and vmstat.

Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Aneesh Kumar K.V [Thu, 29 Jul 2021 21:53:47 +0000 (14:53 -0700)]

mm/migrate: fix NR_ISOLATED corruption on 64-bit

Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages. With
unsigned int type the large unsigned int converts to a large positive
signed long.

Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Johannes Weiner [Thu, 29 Jul 2021 21:53:44 +0000 (14:53 -0700)]

mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code

Dan Carpenter reports:

    The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr
    29, 2021, leads to the following static checker warning:

    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
    warn: sleeping in atomic context

    mm/memcontrol.c
      3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
      3573  {
      3574          unsigned long val;
      3575
      3576          if (mem_cgroup_is_root(memcg)) {
      3577                  cgroup_rstat_flush(memcg->css.cgroup);
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    This is from static analysis and potentially a false positive.  The
    problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
    which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
    can sleep.

      3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
      3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
      3580                  if (swap)
      3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
      3582          } else {
      3583                  if (!swap)
      3584                          val = page_counter_read(&memcg->memory);
      3585                  else
      3586                          val = page_counter_read(&memcg->memsw);
      3587          }
      3588          return val;
      3589  }

__mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
thresholding code is invoked during stat changes, and those contexts
have irqs disabled as well.  If the lock breaking occurs inside the
flush function, it will result in a sleep from an atomic context.

Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.

Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Junxiao Bi [Thu, 29 Jul 2021 21:53:41 +0000 (14:53 -0700)]

ocfs2: issue zeroout to EOF blocks

For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster.  But since ->writepage will ignore EOF
pages, those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't.  The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again.  That made
writeback ignore them forever.  That will cause data corruption.  Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more.  The fix is to avoid zero EOF blocks with buffer
write.

The following code snippet from qemu-img could trigger the corruption.

  656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
  ...
  660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
  660   fallocate(11, 0, 2275868672, 327680) = 0
  658   pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Junxiao Bi [Thu, 29 Jul 2021 21:53:38 +0000 (14:53 -0700)]

ocfs2: fix zero out valid data

If append-dio feature is enabled, direct-io write and fallocate could
run in parallel to extend file size, fallocate used "orig_isize" to
record i_size before taking "ip_alloc_sem", when
ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
out.

Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Matteo Croce [Thu, 29 Jul 2021 21:53:35 +0000 (14:53 -0700)]

lib/test_string.c: move string selftest in the Runtime Testing menu

STRING_SELFTEST is presented in the "Library routines" menu.  Move it in
Kernel hacking > Kernel Testing and Coverage > Runtime Testing together
with other similar tests found in lib/

--- Runtime Testing
<*>   Test functions located in the hexdump module at runtime
<*>   Test string functions (NEW)
<*>   Test functions located in the string_helpers module at runtime
<*>   Test strscpy*() family of functions at runtime
<*>   Test kstrto*() family of functions at runtime
<*>   Test printf() family of functions at runtime
<*>   Test scanf() family of functions at runtime

Link: https://lkml.kernel.org/r/20210719185158.190371-1-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Peter Rosin <peda@axentia.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Domain: System / Kernel;

RSS Atom