Arnd Bergmann [Tue, 3 Aug 2021 11:40:41 +0000 (13:40 +0200)]
3c509: stop calling netdev_boot_setup_check
This driver never uses the information returned by
netdev_boot_setup_check, and is not called by the boot-time probing from
driver/net/Space.c, so just remove these stale references.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 3 Aug 2021 11:40:40 +0000 (13:40 +0200)]
appletalk: ltpc: remove static probing
This driver never relies on the netdev_boot_setup_check()
to get its configuration, so it can just as well do its
own probing all the time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 3 Aug 2021 11:40:39 +0000 (13:40 +0200)]
natsemi: sonic: stop calling netdev_boot_setup_check
The data from the kernel command line is no longer used since the
probe function gets it from the platform device resources instead.
The jazz version was changed to be like this in 2007, the xtensa
version apparently copied the code from there.
Fixes: ed9f0e0bf3ce ("remove setup of platform device from jazzsonic.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 3 Aug 2021 11:40:38 +0000 (13:40 +0200)]
bcmgenet: remove call to netdev_boot_setup_check
The driver has never used the netdev->{irq,base_addr,mem_start}
members, so this call is completely unnecessary.
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 Aug 2021 11:58:22 +0000 (12:58 +0100)]
Merge branch 'ethtool-runtime-pm'
Heiner Kallweit says:
====================
ethtool: runtime-resume netdev parent before ethtool ops
If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if
not accessing the device) will fail because netif_device_present()
returns false
- ethtool ops may fail because device is not accessible (e.g. because being
in D3 in case of a PCI device)
It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing
ethtool ops in kernel.
This patch series covers the typical case that the netdev parent is power-
managed, e.g. a PCI device. Not sure whether cases exist where the netdev
itself is power-managed. If yes then we may need an extension for this.
But the series as-is at least shouldn't cause problems in that case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sun, 1 Aug 2021 10:41:31 +0000 (12:41 +0200)]
ethtool: runtime-resume netdev parent in ethnl_ops_begin
If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
accessing the device) will fail because netif_device_present() returns
false
- ethtool ops may fail because device is not accessible (e.g. because being
in D3 in case of a PCI device)
It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sun, 1 Aug 2021 10:40:05 +0000 (12:40 +0200)]
ethtool: move netif_device_present check from ethnl_parse_header_dev_get to ethnl_ops_begin
If device is runtime-suspended and not accessible then it may be
flagged as not present. If checking whether device is present is
done too early then we may bail out before we have the chance to
runtime-resume the device. Therefore move this check to
ethnl_ops_begin(). This is in preparation of a follow-up patch
that tries to runtime-resume the device before executing ethtool
ops.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sun, 1 Aug 2021 10:37:39 +0000 (12:37 +0200)]
ethtool: move implementation of ethnl_ops_begin/complete to netlink.c
In preparation of subsequent extensions to both functions move the
implementations from netlink.h to netlink.c.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sun, 1 Aug 2021 10:36:48 +0000 (12:36 +0200)]
ethtool: runtime-resume netdev parent before ethtool ioctl ops
If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
accessing the device) will fail because netif_device_present() returns
false
- ethtool ops may fail because device is not accessible (e.g. because being
in D3 in case of a PCI device)
It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 2 Aug 2021 17:57:29 +0000 (10:57 -0700)]
virtio-net: realign page_to_skb() after merges
We ended up merging two versions of the same patch set:
commit
8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
commit
5c37711d9f27 ("virtio-net: fix for unable to handle page fault for address")
into net, and
commit
7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
commit
6c66c147b9a4 ("virtio-net: fix for unable to handle page fault for address")
into net-next. Redo the merge from commit
126285651b7f ("Merge
ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net"), so that
the most recent code remains.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 Aug 2021 11:38:47 +0000 (12:38 +0100)]
Merge branch 'bnxt_en-rx-ring'
Michael Chan says:
====================
bnxt_en: Increase maximum RX ring size when jumbo ring is unused
The RX jumbo ring is automatically enabled when HW GRO/LRO is enabled or
when the MTU exceeds the page size. The RX jumbo ring provides a lot
more RX buffer space when it is in use. When the RX jumbo ring is not
in use, some users report that the current maximum of 2K buffers is
too limiting. This patchset increases the maximum to 8K buffers when
the RX jumbo ring is not used. The default RX ring size is unchanged
at 511.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Mon, 2 Aug 2021 14:52:39 +0000 (10:52 -0400)]
bnxt_en: Increase maximum RX ring size if jumbo ring is not used
The current maximum RX ring size is defined assuming the RX jumbo ring
(aka aggregation ring) is used. The RX jumbo ring is automicatically used
when the MTU exceeds a threshold or when rx-gro-hw/lro is enabled. The RX
jumbo ring is automatically sized up to 4 times the size of the RX ring
size.
The BNXT_MAX_RX_DESC_CNT constant is the upper limit on the size of the
RX ring whether or not the RX jumbo ring is used. Obviously, the
maximum amount of RX buffer space is significantly less when the RX jumbo
ring is not used.
To increase flexibility for the user who does not use the RX jumbo ring,
we now define a bigger maximum RX ring size when the RX jumbo ring is not
used. The maximum RX ring size is now up to 8K when the RX jumbo ring
is not used. The maximum completion ring size also needs to be scaled
up to accomodate the larger maximum RX ring size.
Note that when the RX jumbo ring is re-enabled, the RX ring size will
automatically drop if it exceeds the maximum.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Mon, 2 Aug 2021 14:52:38 +0000 (10:52 -0400)]
bnxt_en: Don't use static arrays for completion ring pages
We currently store these page addresses and DMA addreses in static
arrays. On systems with 4K pages, we support up to 64 pages per
completion ring. The actual number of pages for each completion ring
may be much less than 64. For example, when the RX ring size is set
to the default 511 entries, only 16 completion ring pages are needed
per ring.
In the next patch, we'll be doubling the maximum number of completion
pages. So we convert to allocate these arrays as needed instead of
declaring them statically.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yajun Deng [Mon, 2 Aug 2021 08:05:08 +0000 (16:05 +0800)]
net: Keep vertical alignment
Those files under /proc/net/stat/ don't have vertical alignment, it looks
very difficult. Modify the seq_printf statement, keep vertical alignment.
v2:
- Use seq_puts() and seq_printf() correctly.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Mon, 2 Aug 2021 03:02:19 +0000 (11:02 +0800)]
bonding: add new option lacp_active
Add an option lacp_active, which is similar with team's runner.active.
This option specifies whether to send LACPDU frames periodically. If set
on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames acts as "speak when spoken to".
Note, the LACPDU state frames still will be sent when init or unbind port.
v2: remove module parameter
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zhouchuangao [Mon, 2 Aug 2021 02:18:38 +0000 (19:18 -0700)]
qed: Remove duplicated include of kernel.h
Duplicate include header file <linux/kernel.h>
line 4: #include <linux/kernel.h>
line 7: #include <linux/kernel.h>
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Len Baker [Sun, 1 Aug 2021 17:12:26 +0000 (19:12 +0200)]
drivers/net/usb: Remove all strcpy() uses
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors. The safe replacement is strscpy().
Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shai Malin [Sun, 1 Aug 2021 10:28:40 +0000 (13:28 +0300)]
qed: Remove redundant prints from the iWARP SYN handling
Remove redundant prints from the iWARP SYN handling.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shai Malin [Sun, 1 Aug 2021 10:26:38 +0000 (13:26 +0300)]
qed: Skip DORQ attention handling during recovery
The device recovery flow will reset the entire HW device, in that case
the DORQ HW block attention is redundant.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shai Malin [Sun, 1 Aug 2021 10:23:40 +0000 (13:23 +0300)]
qed: Avoid db_recovery during recovery
Avoid calling the qed doorbell recovery - qed_db_rec_handler()
during device recovery.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 Aug 2021 10:21:39 +0000 (11:21 +0100)]
Merge branch 'skb_expand_head'
Vasily Averin says:
====================
skbuff: introduce skb_expand_head()
currently if skb does not have enough headroom skb_realloc_headrom is called.
It is not optimal because it creates new skb.
this patch set introduces new helper skb_expand_head()
Unlike skb_realloc_headroom, it does not allocate a new skb if possible;
copies skb->sk on new skb when as needed and frees original skb in case of failures.
This helps to simplify ip[6]_finish_output2(), ip6_xmit() and few other
functions in vrf, ax25 and bpf.
There are few other cases where this helper can be used
but it requires an additional investigations.
v3 changes:
- ax25 compilation warning fixed
- v5.14-rc4 rebase
- now it does not depend on non-committed pathces
v2 changes:
- helper's name was changed to skb_expand_head
- fixed few mistakes inside skb_expand_head():
skb_set_owner_w should set sk on nskb
kfree was replaced by kfree_skb()
improved warning message
- added minor refactoring in changed functions in vrf and bpf patches
- removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:54 +0000 (11:52 +0300)]
bpf: use skb_expand_head in bpf_out_neigh_v4/6
Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.
Additionally this patch replaces commonly used dereferencing with variables.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:47 +0000 (11:52 +0300)]
ax25: use skb_expand_head
Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:40 +0000 (11:52 +0300)]
vrf: use skb_expand_head in vrf_finish_output
Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:35 +0000 (11:52 +0300)]
ipv4: use skb_expand_head in ip_finish_output2
Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:29 +0000 (11:52 +0300)]
ipv6: use skb_expand_head in ip6_xmit
Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.
Additionally this patch replaces commonly used dereferencing with variables.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:22 +0000 (11:52 +0300)]
ipv6: use skb_expand_head in ip6_finish_output2
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.
Additionally this patch replaces commonly used dereferencing with variables.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 2 Aug 2021 08:52:15 +0000 (11:52 +0300)]
skbuff: introduce skb_expand_head()
Like skb_realloc_headroom(), new helper increases headroom of specified skb.
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
copies skb->sk on new skb when as needed and frees original skb in case
of failures.
This helps to simplify ip[6]_finish_output2() and a few other similar cases.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 Aug 2021 10:16:13 +0000 (11:16 +0100)]
Merge tag 'mlx5-updates-2021-08-02' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
This patch-set changes the TTC (Traffic Type Classification) logic
to be independent from the mlx5 ethernet driver by renaming the traffic
types enums and making the TTC API generic to the mlx5 core driver.
It allows to decouple TTC logic from mlx5e and reused by other parts
of mlx5 drivers, namely ADQ and lag TX steering hashing.
Patches overview:
1 - Rename traffic type enums to be mlx5 generic.
2 - Rename related TTC arguments and functions.
3 - Remove dependency in the mlx5e driver from the TTC implementation.
4 - Move TTC logic to fs_ttc.
5 - Embed struct mlx5_ttc_table in fs_ttc.
The refactoring series is followed by misc' cleanup patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiapeng Chong [Thu, 22 Jul 2021 09:58:16 +0000 (17:58 +0800)]
net/mlx5: Fix missing return value in mlx5_devlink_eswitch_inline_mode_set()
The return value is missing in this code scenario, add the return value
'0' to the return value 'err'.
Eliminate the follow smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3083
mlx5_devlink_eswitch_inline_mode_set() warn: missing error code 'err'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 8e0aa4bc959c ("net/mlx5: E-switch, Protect eswitch mode changes")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Mon, 26 Jul 2021 12:17:31 +0000 (15:17 +0300)]
net/mlx5e: Return -EOPNOTSUPP if more relevant when parsing tc actions
Instead of returning -EINVAL.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Mon, 26 Jul 2021 12:13:35 +0000 (15:13 +0300)]
net/mlx5e: Remove redundant assignment of counter to null
counter is being initialized before being used.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Mon, 26 Jul 2021 12:11:43 +0000 (15:11 +0300)]
net/mlx5e: Remove redundant parse_attr arg
Passing parse_attr is redundant in parse_tc_nic_actions() and
mlx5e_tc_add_nic_flow() as we can get it from flow.
This is the same as with parse_tc_fdb_actions() and mlx5e_tc_add_fdb_flow().
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Sun, 25 Jul 2021 13:04:00 +0000 (16:04 +0300)]
net/mlx5e: Remove redundant cap check for flow counter
The cap is very old and today will always exists.
The cap is not being checked anywhere else. Remove the check from
drop action when parsing tc rules in nic mode.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Thu, 22 Jul 2021 09:10:54 +0000 (12:10 +0300)]
net/mlx5e: Remove redundant filter_dev arg from parse_tc_fdb_actions()
filter_dev is saved in parse_attr. and being used in other cases from
there. use it also for the leftover case.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Thu, 22 Jul 2021 08:58:08 +0000 (11:58 +0300)]
net/mlx5e: Remove redundant tc act includes
Since the code changed to use the flow action infra
there is no usage of tcf values from those includes.
Remove those.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Gottlieb [Sun, 18 Jul 2021 12:53:53 +0000 (15:53 +0300)]
net/mlx5: Embed mlx5_ttc_table
mlx5_ttc_table struct shouldn't be exposed to the users so
this patch make it internal to ttc.
In addition add a getter function to get the TTC flow table for users
that need to add a rule which points on it.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Maor Gottlieb [Fri, 2 Jul 2021 07:38:32 +0000 (10:38 +0300)]
net/mlx5: Move TTC logic to fs_ttc
Now that TTC logic is not dependent on mlx5e structs, move it to
lib/fs_ttc.c so it could be used other part of the mlx5 driver.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Gottlieb [Fri, 2 Jul 2021 11:25:14 +0000 (14:25 +0300)]
net/mlx5e: Decouple TTC logic from mlx5e
Remove dependency in the mlx5e driver from the TTC implementation
by changing the TTC related functions to receive mlx5 generic arguments.
It allows to decouple TTC logic from mlx5e and reused by other parts of
mlx5 driver.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Gottlieb [Fri, 2 Jul 2021 11:49:37 +0000 (14:49 +0300)]
net/mlx5e: Rename some related TTC args and functions
Since TTC logic is going to be moved to a separate file, make the
relevant functions and arguments that used by TTC to be mlx5 generic.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Gottlieb [Fri, 2 Jul 2021 08:42:28 +0000 (11:42 +0300)]
net/mlx5e: Rename traffic type enums
Rename traffic type enums as part of the preparation for moving
the traffic type logic to a separate file.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maxim Mikityanskiy [Mon, 12 Apr 2021 16:10:17 +0000 (19:10 +0300)]
net/mlx5e: Allocate the array of channels according to the real max_nch
The channels array in struct mlx5e_rx_res is converted to a dynamic one,
which will use the dynamic value of max_nch instead of
implementation-defined maximum of MLX5E_MAX_NUM_CHANNELS.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maxim Mikityanskiy [Fri, 9 Apr 2021 16:01:51 +0000 (19:01 +0300)]
net/mlx5e: Hide all implementation details of mlx5e_rx_res
This commit moves all implementation details of struct mlx5e_rx_res
under en/rx_res.c. All access to RX resources is now done using methods.
Encapsulating RX resources into an object allows for better
manageability, because all the implementation details are now in a
single place, and external code can use only a limited set of API
methods to init/teardown the whole thing, reconfigure RSS and LRO
parameters, connect TIRs to flow steering and activate/deactivate TIRs.
mlx5e_rx_res is self-contained and doesn't depend on struct mlx5e_priv
or include en.h.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maxim Mikityanskiy [Fri, 9 Apr 2021 14:31:09 +0000 (17:31 +0300)]
net/mlx5e: Introduce mlx5e_channels API to get RQNs
Currently, struct mlx5e_channels is defined in en.h, along with a lot of
other stuff. In the following commit mlx5e_rx_res will need to get RQNs
(RQ hardware IDs), given a pointer to mlx5e_channels and the channel
index. In order to make it possible without including the whole en.h,
this commit introduces functions that will hide the implementation
details of mlx5e_channels.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maxim Mikityanskiy [Thu, 8 Apr 2021 09:16:34 +0000 (12:16 +0300)]
net/mlx5e: Use a new initializer to build uniform indir table
Replace mlx5e_build_default_indir_rqt with a new initializer of struct
mlx5e_rss_params_indir that works directly with the struct, rather than
its internals.
The new initializer is called mlx5e_rss_params_indir_init_uniform, which
also reflects the purpose (uniform spreading) better.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Colin Ian King [Sun, 1 Aug 2021 15:37:42 +0000 (16:37 +0100)]
net/mlx4: make the array states static const, makes object smaller
Don't populate the array states on the stack but instead it
static const. Makes the object code smaller by 79 bytes.
Before:
text data bss dec hex filename
21309 8304 192 29805 746d drivers/net/ethernet/mellanox/mlx4/qp.o
After:
text data bss dec hex filename
21166 8368 192 29726 741e drivers/net/ethernet/mellanox/mlx4/qp.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210801153742.147304-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 15:26:50 +0000 (16:26 +0100)]
net: 3c509: make the array if_names static const, makes object smaller
Don't populate the array if_names on the stack but instead it
static const. Makes the object code smaller by 99 bytes.
Before:
text data bss dec hex filename
27886 10752 672 39310 998e ./drivers/net/ethernet/3com/3c509.o
After:
text data bss dec hex filename
27723 10816 672 39211 992b ./drivers/net/ethernet/3com/3c509.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152650.146572-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 15:22:09 +0000 (16:22 +0100)]
dpaa2-eth: make the array faf_bits static const, makes object smaller
Don't populate the array faf_bits on the stack but instead it
static const. Makes the object code smaller by 175 bytes.
Before:
text data bss dec hex filename
9645 4552 0 14197 3775 ../freescale/dpaa2/dpaa2-eth-devlink.o
After:
text data bss dec hex filename
9406 4616 0 14022 36c6 ../freescale/dpaa2/dpaa2-eth-devlink.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152209.146359-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 15:16:59 +0000 (16:16 +0100)]
qlcnic: make the array random_data static const, makes object smaller
Don't populate the array random_data on the stack but instead it
static const. Makes the object code smaller by 66 bytes.
Before:
text data bss dec hex filename
52895 10976 0 63871 f97f ../qlogic/qlcnic/qlcnic_ethtool.o
After:
text data bss dec hex filename
52701 11104 0 63805 f93d ../qlogic//qlcnic/qlcnic_ethtool.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151659.146113-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 15:06:47 +0000 (16:06 +0100)]
net: marvell: make the array name static, makes object smaller
Don't populate the const array name on the stack but instead it
static. Makes the object code smaller by 28 bytes. Add a missing
const to clean up a checkpatch warning.
Before:
text data bss dec hex filename
124565 31565 384 156514 26362 drivers/net/ethernet/marvell/sky2.o
After:
text data bss dec hex filename
124441 31661 384 156486 26346 drivers/net/ethernet/marvell/sky2.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801150647.145728-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 15:12:05 +0000 (16:12 +0100)]
cxgb4: make the array match_all_mac static, makes object smaller
Don't populate the array match_all_mac on the stack but instead it
static const. Makes the object code smaller by 75 bytes.
Before:
text data bss dec hex filename
46701 8960 64 55725 d9ad ../chelsio/cxgb4/cxgb4_filter.o
After:
text data bss dec hex filename
46338 9120 192 55650 d962 ../chelsio/cxgb4/cxgb4_filter.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151205.145924-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Ahern [Mon, 2 Aug 2021 16:02:21 +0000 (10:02 -0600)]
ipv4: Fix refcount warning for new fib_info
Ioana reported a refcount warning when booting over NFS:
[ 5.042532] ------------[ cut here ]------------
[ 5.047184] refcount_t: addition on 0; use-after-free.
[ 5.052324] WARNING: CPU: 7 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0xa4/0x150
...
[ 5.167201] Call trace:
[ 5.169635] refcount_warn_saturate+0xa4/0x150
[ 5.174067] fib_create_info+0xc00/0xc90
[ 5.177982] fib_table_insert+0x8c/0x620
[ 5.181893] fib_magic.isra.0+0x110/0x11c
[ 5.185891] fib_add_ifaddr+0xb8/0x190
[ 5.189629] fib_inetaddr_event+0x8c/0x140
fib_treeref needs to be set after kzalloc. The old code had a ++ which
led to the confusion when the int was replaced by a refcount_t.
Fixes: 79976892f7ea ("net: convert fib_treeref from int to refcount_t")
Signed-off-by: David Ahern <dsahern@kernel.org>
Reported-by: Ioana Ciornei <ciorneiioana@gmail.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20210802160221.27263-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 07:01:55 +0000 (08:01 +0100)]
net: phy: mscc: make some arrays static const, makes object smaller
Don't populate arrays on the stack but instead them static const.
Makes the object code smaller by 280 bytes.
Before:
text data bss dec hex filename
24142 4368 192 28702 701e ./drivers/net/phy/mscc/mscc_ptp.o
After:
text data bss dec hex filename
23830 4400 192 28422 6f06 ./drivers/net/phy/mscc/mscc_ptp.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801070155.139057-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Sun, 1 Aug 2021 06:53:28 +0000 (07:53 +0100)]
netdevsim: make array res_ids static const, makes object smaller
Don't populate the array res_ids on the stack but instead it
static const. Makes the object code smaller by 14 bytes.
Before:
text data bss dec hex filename
50833 8314 256 59403 e80b ./drivers/net/netdevsim/fib.o
After:
text data bss dec hex filename
50755 8378 256 59389 e7fd ./drivers/net/netdevsim/fib.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801065328.138906-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gustavo A. R. Silva [Sat, 31 Jul 2021 17:08:30 +0000 (12:08 -0500)]
net/ipv4: Replace one-element array with flexible-array member
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Use an anonymous union with a couple of anonymous structs in order to
keep userspace unchanged:
$ pahole -C ip_msfilter net/ipv4/ip_sockglue.o
struct ip_msfilter {
union {
struct {
__be32 imsf_multiaddr_aux; /* 0 4 */
__be32 imsf_interface_aux; /* 4 4 */
__u32 imsf_fmode_aux; /* 8 4 */
__u32 imsf_numsrc_aux; /* 12 4 */
__be32 imsf_slist[1]; /* 16 4 */
}; /* 0 20 */
struct {
__be32 imsf_multiaddr; /* 0 4 */
__be32 imsf_interface; /* 4 4 */
__u32 imsf_fmode; /* 8 4 */
__u32 imsf_numsrc; /* 12 4 */
__be32 imsf_slist_flex[0]; /* 16 0 */
}; /* 0 16 */
}; /* 0 20 */
/* size: 20, cachelines: 1, members: 1 */
/* last cacheline: 20 bytes */
};
Also, refactor the code accordingly and make use of the struct_size()
and flex_array_size() helpers.
This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 31 Jul 2021 14:14:32 +0000 (17:14 +0300)]
net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv()
No tagging driver uses this.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Sat, 31 Jul 2021 10:21:44 +0000 (12:21 +0200)]
nfc: hci: pass callback data param as pointer in nci_request()
The nci_request() receives a callback function and unsigned long data
argument "opt" which is passed to the callback. Almost all of the
nci_request() callers pass pointer to a stack variable as data argument.
Only few pass scalar value (e.g. u8).
All such callbacks do not modify passed data argument and in previous
commit they were made as const. However passing pointers via unsigned
long removes the const annotation. The callback could simply cast
unsigned long to a pointer to writeable memory.
Use "const void *" as type of this "opt" argument to solve this and
prevent modifying the pointed contents. This is also consistent with
generic pattern of passing data arguments - via "void *". In few places
which pass scalar values, use casts via "unsigned long" to suppress any
warnings.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe JAILLET [Sat, 31 Jul 2021 07:10:00 +0000 (09:10 +0200)]
cavium: switch from 'pci_' to 'dma_' API
The wrappers in include/linux/pci-dma-compat.h should go away.
The patch has been generated with the coccinelle script below. It has been
hand modified to use 'dma_set_mask_and_coherent()' instead of
'pci_set_dma_mask()/pci_set_consistent_dma_mask()' when applicable.
It has been compile tested.
@@
@@
- PCI_DMA_BIDIRECTIONAL
+ DMA_BIDIRECTIONAL
@@
@@
- PCI_DMA_TODEVICE
+ DMA_TO_DEVICE
@@
@@
- PCI_DMA_FROMDEVICE
+ DMA_FROM_DEVICE
@@
@@
- PCI_DMA_NONE
+ DMA_NONE
@@
expression e1, e2, e3;
@@
- pci_alloc_consistent(e1, e2, e3)
+ dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
@@
expression e1, e2, e3;
@@
- pci_zalloc_consistent(e1, e2, e3)
+ dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
@@
expression e1, e2, e3, e4;
@@
- pci_free_consistent(e1, e2, e3, e4)
+ dma_free_coherent(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_map_single(e1, e2, e3, e4)
+ dma_map_single(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_single(e1, e2, e3, e4)
+ dma_unmap_single(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4, e5;
@@
- pci_map_page(e1, e2, e3, e4, e5)
+ dma_map_page(&e1->dev, e2, e3, e4, e5)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_page(e1, e2, e3, e4)
+ dma_unmap_page(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_map_sg(e1, e2, e3, e4)
+ dma_map_sg(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_unmap_sg(e1, e2, e3, e4)
+ dma_unmap_sg(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_single_for_cpu(e1, e2, e3, e4)
+ dma_sync_single_for_cpu(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_single_for_device(e1, e2, e3, e4)
+ dma_sync_single_for_device(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_sg_for_cpu(e1, e2, e3, e4)
+ dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4)
@@
expression e1, e2, e3, e4;
@@
- pci_dma_sync_sg_for_device(e1, e2, e3, e4)
+ dma_sync_sg_for_device(&e1->dev, e2, e3, e4)
@@
expression e1, e2;
@@
- pci_dma_mapping_error(e1, e2)
+ dma_mapping_error(&e1->dev, e2)
@@
expression e1, e2;
@@
- pci_set_dma_mask(e1, e2)
+ dma_set_mask(&e1->dev, e2)
@@
expression e1, e2;
@@
- pci_set_consistent_dma_mask(e1, e2)
+ dma_set_coherent_mask(&e1->dev, e2)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 30 Jul 2021 22:57:14 +0000 (01:57 +0300)]
net: dsa: mt7530: drop paranoid checks in .get_tag_protocol()
It is desirable to reduce the surface of DSA_TAG_PROTO_NONE as much as
we can, because we now have options for switches without hardware
support for DSA tagging, and the occurrence in the mt7530 driver is in
fact quite gratuitout and easy to remove. Since ds->ops->get_tag_protocol()
is only called for CPU ports, the checks for a CPU port in
mtk_get_tag_protocol() are redundant and can be removed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 2 Aug 2021 09:47:12 +0000 (10:47 +0100)]
Merge branch 'octeon-drr-config'
Sunil Goutham says:
====================
cn10k: DWRR MTU and weights configuration
On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.
On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum.
This patch series addresses this HW change on CN10K silicons,
both admin function and PF/VF drivers are modified.
Also added support to program DWRR MTU via devlink params.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Fri, 30 Jul 2021 11:49:14 +0000 (17:19 +0530)]
octeontx2-pf: cn10k: Config DWRR weight based on MTU
Program SQ, MDQ, TL4 to TL2 transmit scheduler queues' DWRR
weight based on DWRR MTU programmed at NIX_AF_DWRR_RPM_MTU.
The DWRR MTU from admin function is retrieved via mbox.
On OcteaonTx2 silicon, admin function driver responds with DWRR
MTU as '1'. This helps to avoid silicon specific transmit
scheduler DWRR quantum/weight configuration logic.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Fri, 30 Jul 2021 11:49:13 +0000 (17:19 +0530)]
octeontx2-af: cn10k: DWRR MTU configuration
On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.
On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum. For LBK traffic, value
programmed into NIX_AF_DWRR_RPM_MTU register is considered as
DWRR MTU.
This patch programs a default DWRR MTU of 8192 into HW and also
provides a way to change this via devlink params.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Fri, 30 Jul 2021 03:41:55 +0000 (11:41 +0800)]
selftests/net: remove min gso test in packet_snd
This patch removed the 'raw gso min size - 1' test which
always fails now:
./in_netns.sh ./psock_snd -v -c -g -l "${mss}"
raw gso min size - 1 (expected to fail)
tx: 1524
rx: 1472
OK
After commit
7c6d2ecbda83 ("net: be more gentle about silly
gso requests coming from user"), we relaxed the min gso_size
check in virtio_net_hdr_to_skb().
So when a packet which is smaller then the gso_size,
GSO for this packet will not be set, the packet will be
send/recv successfully.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yufeng Mo [Fri, 30 Jul 2021 02:19:11 +0000 (10:19 +0800)]
bonding: 3ad: fix the concurrency between __bond_release_one() and bond_3ad_state_machine_handler()
Some time ago, I reported a calltrace issue
"did not find a suitable aggregator", please see[1].
After a period of analysis and reproduction, I find
that this problem is caused by concurrency.
Before the problem occurs, the bond structure is like follows:
bond0 - slaver0(eth0) - agg0.lag_ports -> port0 - port1
\
port0
\
slaver1(eth1) - agg1.lag_ports -> NULL
\
port1
If we run 'ifenslave bond0 -d eth1', the process is like below:
excuting __bond_release_one()
|
bond_upper_dev_unlink()[step1]
| | |
| | bond_3ad_lacpdu_recv()
| | ->bond_3ad_rx_indication()
| | spin_lock_bh()
| | ->ad_rx_machine()
| | ->__record_pdu()[step2]
| | spin_unlock_bh()
| | |
| bond_3ad_state_machine_handler()
| spin_lock_bh()
| ->ad_port_selection_logic()
| ->try to find free aggregator[step3]
| ->try to find suitable aggregator[step4]
| ->did not find a suitable aggregator[step5]
| spin_unlock_bh()
| |
| |
bond_3ad_unbind_slave() |
spin_lock_bh()
spin_unlock_bh()
step1: already removed slaver1(eth1) from list, but port1 remains
step2: receive a lacpdu and update port0
step3: port0 will be removed from agg0.lag_ports. The struct is
"agg0.lag_ports -> port1" now, and agg0 is not free. At the
same time, slaver1/agg1 has been removed from the list by step1.
So we can't find a free aggregator now.
step4: can't find suitable aggregator because of step2
step5: cause a calltrace since port->aggregator is NULL
To solve this concurrency problem, put bond_upper_dev_unlink()
after bond_3ad_unbind_slave(). In this way, we can invalid the port
first and skip this port in bond_3ad_state_machine_handler(). This
eliminates the situation that the slaver has been removed from the
list but the port is still valid.
[1]https://lore.kernel.org/netdev/10374.
1611947473@famine/
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Thu, 29 Jul 2021 23:12:14 +0000 (16:12 -0700)]
net_sched: refactor TC action init API
TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().
This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:
TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.
TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.
TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
means we are replacing an existing action.
TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.
The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.
I have tested this patch with tdc and I do not see any
failure related to this patch.
Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Kaiser [Thu, 29 Jul 2021 07:43:54 +0000 (09:43 +0200)]
niu: read property length only if we use it
In three places, the driver calls of_get_property and reads the property
length although the length is not used. Update the calls to not request
the length.
Signed-off-by: Martin Kaiser <martin@kaiser.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 31 Jul 2021 18:23:25 +0000 (11:23 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says:
====================
bpf-next 2021-07-30
We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
The main changes are:
1) BTF-guided binary data dumping libbpf API, from Alan.
2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
7) BPF map pinning improvements in libbpf, from Martynas.
8) Improved module BTF support in libbpf and bpftool, from Quentin.
9) Bpftool cleanups and documentation improvements, from Quentin.
10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
11) Increased maximum cgroup storage size, from Stanislav.
12) Small fixes and improvements to BPF tests and samples, from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================
Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 31 Jul 2021 16:14:46 +0000 (09:14 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Conflicting commits, all resolutions pretty trivial:
drivers/bus/mhi/pci_generic.c
5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU")
56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean")
drivers/nfc/s3fwrn5/firmware.c
a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label")
46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
MAINTAINERS
7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")
8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Fri, 30 Jul 2021 23:01:36 +0000 (16:01 -0700)]
Merge tag 'net-5.14-rc4' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
(mac80211) and netfilter trees.
Current release - regressions:
- mac80211: fix starting aggregation sessions on mesh interfaces
Current release - new code bugs:
- sctp: send pmtu probe only if packet loss in Search Complete state
- bnxt_en: add missing periodic PHC overflow check
- devlink: fix phys_port_name of virtual port and merge error
- hns3: change the method of obtaining default ptp cycle
- can: mcba_usb_start(): add missing urb->transfer_dma initialization
Previous releases - regressions:
- set true network header for ECN decapsulation
- mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
LRO
- phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
PHY
- sctp: fix return value check in __sctp_rcv_asconf_lookup
Previous releases - always broken:
- bpf:
- more spectre corner case fixes, introduce a BPF nospec
instruction for mitigating Spectre v4
- fix OOB read when printing XDP link fdinfo
- sockmap: fix cleanup related races
- mac80211: fix enabling 4-address mode on a sta vif after assoc
- can:
- raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
- j1939: j1939_session_deactivate(): clarify lifetime of session
object, avoid UAF
- fix number of identical memory leaks in USB drivers
- tipc:
- do not blindly write skb_shinfo frags when doing decryption
- fix sleeping in tipc accept routine"
* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
gve: Update MAINTAINERS list
can: esd_usb2: fix memory leak
can: ems_usb: fix memory leak
can: usb_8dev: fix memory leak
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: hi311x: fix a signedness bug in hi3110_cmd()
MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
bpf: Fix leakage due to insufficient speculative store bypass mitigation
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
sis900: Fix missing pci_disable_device() in probe and remove
net: let flow have same hash in two directions
nfc: nfcsim: fix use after free during module unload
tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
sctp: fix return value check in __sctp_rcv_asconf_lookup
nfc: s3fwrn5: fix undefined parameter values in dev_err()
net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
net/mlx5: Unload device upon firmware fatal error
net/mlx5e: Fix page allocation failure for ptp-RQ over SF
net/mlx5e: Fix page allocation failure for trap-RQ over SF
...
Linus Torvalds [Fri, 30 Jul 2021 22:56:24 +0000 (15:56 -0700)]
Merge tag 'acpi-5.14-rc4' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert a recent IRQ resources handling modification that turned
out to be problematic, fix suspend-to-idle handling on AMD platforms
to take upcoming systems into account properly and fix the retrieval
of the DPTF attributes of the PCH FIVR.
Specifics:
- Revert recent change of the ACPI IRQ resources handling that
attempted to improve the ACPI IRQ override selection logic, but
introduced serious regressions on some systems (Hui Wang).
- Fix up quirks for AMD platforms in the suspend-to-idle support code
so as to take upcoming systems using uPEP HID AMDI007 into account
as appropriate (Mario Limonciello).
- Fix the code retrieving DPTF attributes of the PCH FIVR so that it
agrees on the return data type with the ACPI control method
evaluated for this purpose (Srinivas Pandruvada)"
* tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: DPTF: Fix reading of attributes
Revert "ACPI: resources: Add checks for ACPI IRQ override"
ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
Linus Torvalds [Fri, 30 Jul 2021 22:42:34 +0000 (15:42 -0700)]
pipe: make pipe writes always wake up readers
Since commit
1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit
1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit
f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/
Link: https://github.com/realm/realm-core
Link: https://github.com/realm/realm-core/issues/4666
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrii Nakryiko [Fri, 30 Jul 2021 22:40:28 +0000 (15:40 -0700)]
Merge branch 'tools: bpftool: update, synchronise and validate types and options'
Quentin Monnet says:
====================
To work with the different program types, map types, attach types etc.
supported by eBPF, bpftool needs occasional updates to learn about the new
features supported by the kernel. When such types translate into new
keyword for the command line, updates are expected in several locations:
typically, the help message displayed from bpftool itself, the manual page,
and the bash completion file should be updated. The options used by the
different commands for bpftool should also remain synchronised at those
locations.
Several omissions have occurred in the past, and a number of types are
still missing today. This set is an attempt to improve the situation. It
brings up-to-date the lists of types or options in bpftool, and also adds a
Python script to the BPF selftests to automatically check that most of
these lists remain synchronised.
v2:
- Reformat some lines in the bash completion file.
- Do not reformat attach types, to preserve git-blame history.
- Do not call Python script from tools/testing/selftests/bpf/Makefile.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Quentin Monnet [Fri, 30 Jul 2021 21:54:35 +0000 (22:54 +0100)]
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
Profiling programs with bpftool was extended some time ago to support
two new metrics, namely itlb_misses and dtlb_misses (misses for the
instruction/data translation lookaside buffer). Update the manual page
and bash completion accordingly.
Fixes: 450d060e8f75 ("bpftool: Add {i,d}tlb_misses support for bpftool profile")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-8-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:34 +0000 (22:54 +0100)]
tools: bpftool: Document and add bash completion for -L, -B options
The -L|--use-loader option for using loader programs when loading, or
when generating a skeleton, did not have any documentation or bash
completion. Same thing goes for -B|--base-btf, used to pass a path to a
base BTF object for split BTF such as BTF for kernel modules.
This patch documents and adds bash completion for those options.
Fixes: 75fa1777694c ("tools/bpftool: Add bpftool support for split BTF")
Fixes: d510296d331a ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-7-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:33 +0000 (22:54 +0100)]
selftests/bpf: Update bpftool's consistency script for checking options
Update the script responsible for checking that the different types used
at various places in bpftool are synchronised, and extend it to check
the consistency of options between the help messages in the source code
and the manual pages.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-6-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:32 +0000 (22:54 +0100)]
tools: bpftool: Update and synchronise option list in doc and help msg
All bpftool commands support the options for JSON output and debug from
libbpf. In addition, some commands support additional options
corresponding to specific use cases.
The list of options described in the man pages for the different
commands are not always accurate. The messages for interactive help are
mostly limited to HELP_SPEC_OPTIONS, and are even less representative of
the actual set of options supported for the commands.
Let's update the lists:
- HELP_SPEC_OPTIONS is modified to contain the "default" options (JSON
and debug), and to be extensible (no ending curly bracket).
- All commands use HELP_SPEC_OPTIONS in their help message, and then
complete the list with their specific options.
- The lists of options in the man pages are updated.
- The formatting of the list for bpftool.rst is adjusted to match
formatting for the other man pages. This is for consistency, and also
because it will be helpful in a future patch to automatically check
that the files are synchronised.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-5-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:31 +0000 (22:54 +0100)]
tools: bpftool: Complete and synchronise attach or map types
Update bpftool's list of attach type names to tell it about the latest
attach types, or the "ringbuf" map. Also update the documentation, help
messages, and bash completion when relevant.
These missing items were reported by the newly added Python script used
to help maintain consistency in bpftool.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-4-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:30 +0000 (22:54 +0100)]
selftests/bpf: Check consistency between bpftool source, doc, completion
Whenever the eBPF subsystem gains new elements, such as new program or
map types, it is necessary to update bpftool if we want it able to
handle the new items.
In addition to the main arrays containing the names of these elements in
the source code, there are also multiple locations to update:
- The help message in the do_help() functions in bpftool's source code.
- The RST documentation files.
- The bash completion file.
This has led to omissions multiple times in the past. This patch
attempts to address this issue by adding consistency checks for all
these different locations. It also verifies that the bpf_prog_type,
bpf_map_type and bpf_attach_type enums from the UAPI BPF header have all
their members present in bpftool.
The script requires no argument to run, it reads and parses the
different files to check, and prints the mismatches, if any. It
currently reports a number of missing elements, which will be fixed in a
later patch:
$ ./test_bpftool_synctypes.py
Comparing [...]/linux/tools/bpf/bpftool/map.c (map_type_name) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_MAP_CREATE_TYPES): {'ringbuf'}
Comparing BPF header (enum bpf_attach_type) and [...]/linux/tools/bpf/bpftool/common.c (attach_type_name): {'BPF_TRACE_ITER', 'BPF_XDP_DEVMAP', 'BPF_XDP', 'BPF_SK_REUSEPORT_SELECT', 'BPF_XDP_CPUMAP', 'BPF_SK_REUSEPORT_SELECT_OR_MIGRATE'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/prog.c (do_help() ATTACH_TYPE): {'skb_verdict'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/Documentation/bpftool-prog.rst (ATTACH_TYPE): {'skb_verdict'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_PROG_ATTACH_TYPES): {'skb_verdict'}
Note that the script does NOT check for consistency between the list of
program types that bpftool claims it accepts and the actual list of
keywords that can be used. This is because bpftool does not "see" them,
they are ELF section names parsed by libbpf. It is not hard to parse the
section_defs[] array in libbpf, but some section names are associated
with program types that bpftool cannot load at the moment. For example,
some programs require a BTF target and an attach target that bpftool
cannot handle. The script may be extended to parse the array and check
only relevant values in the future.
The script is not added to the selftests' Makefile, because doing so
would require all patches with BPF UAPI change to also update bpftool.
Instead it is to be added to the CI.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-3-quentin@isovalent.com
Quentin Monnet [Fri, 30 Jul 2021 21:54:29 +0000 (22:54 +0100)]
tools: bpftool: Slightly ease bash completion updates
Bash completion for bpftool gets two minor improvements in this patch.
Move the detection of attach types for "bpftool cgroup attach" outside
of the "case/esac" bloc, where we cannot reuse our variable holding the
list of supported attach types as a pattern list. After the change, we
have only one list of cgroup attach types to update when new types are
added, instead of the former two lists.
Also rename the variables holding lists of names for program types, map
types, and attach types, to make them more unique. This can make it
slightly easier to point people to the relevant variables to update, but
the main objective here is to help run a script to check that bash
completion is up-to-date with bpftool's source code.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-2-quentin@isovalent.com
Jakub Kicinski [Fri, 30 Jul 2021 20:16:40 +0000 (13:16 -0700)]
Merge branch 'clean-devlink-net-namespace-operations'
Leon Romanovsky says:
====================
Clean devlink net namespace operations
This short series continues my work on devlink core code to make devlink
reload less prone to errors and harden it from API abuse.
Despite first patch being a clear fix, I would ask you to apply it to
net-next anyway, because the fixed patch is anyway old and it will
help us to eliminate merge conflicts that will arise for following
patches or even for the second one.
====================
Link: https://lore.kernel.org/r/cover.1627578998.git.leonro@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Romanovsky [Thu, 29 Jul 2021 17:19:25 +0000 (20:19 +0300)]
devlink: Allocate devlink directly in requested net namespace
There is no need in extra call indirection and check from impossible
flow where someone tries to set namespace without prior call
to devlink_alloc().
Instead of this extra logic and additional EXPORT_SYMBOL, use specialized
devlink allocation function that receives net namespace as an argument.
Such specialized API allows clear view when devlink initialized in wrong
net namespace and/or kernel users don't try to change devlink namespace
under the hood.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Romanovsky [Thu, 29 Jul 2021 17:19:24 +0000 (20:19 +0300)]
devlink: Break parameter notification sequence to be before/after unload/load driver
The change of namespaces during devlink reload calls to driver unload
before it accesses devlink parameters. The commands below causes to
use-after-free bug when trying to get flow steering mode.
* ip netns add n1
* devlink dev reload pci/0000:00:09.0 netns n1
==================================================================
BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
Read of size 4 at addr
ffff888009d04308 by task devlink/275
CPU: 6 PID: 275 Comm: devlink Not tainted 5.12.0-rc2+ #2853
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x93/0xc2
print_address_description.constprop.0+0x18/0x140
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
kasan_report.cold+0x7c/0xd8
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
devlink_nl_param_fill+0x1c8/0xe80
? __free_pages_ok+0x37a/0x8a0
? devlink_flash_update_timeout_notify+0xd0/0xd0
? lock_acquire+0x1a9/0x6d0
? fs_reclaim_acquire+0xb7/0x160
? lock_is_held_type+0x98/0x110
? 0xffffffff81000000
? lock_release+0x1f9/0x6c0
? fs_reclaim_release+0xa1/0xf0
? lock_downgrade+0x6d0/0x6d0
? lock_is_held_type+0x98/0x110
? lock_is_held_type+0x98/0x110
? memset+0x20/0x40
? __build_skb_around+0x1f8/0x2b0
devlink_param_notify+0x6d/0x180
devlink_reload+0x1c3/0x520
? devlink_remote_reload_actions_performed+0x30/0x30
? mutex_trylock+0x24b/0x2d0
? devlink_nl_cmd_reload+0x62b/0x1070
devlink_nl_cmd_reload+0x66d/0x1070
? devlink_reload+0x520/0x520
? devlink_get_from_attrs+0x1bc/0x260
? devlink_nl_pre_doit+0x64/0x4d0
genl_family_rcv_msg_doit+0x1e9/0x2f0
? mutex_lock_io_nested+0x1130/0x1130
? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
? security_capable+0x51/0x90
genl_rcv_msg+0x27f/0x4a0
? genl_get_cmd+0x3c0/0x3c0
? lock_acquire+0x1a9/0x6d0
? devlink_reload+0x520/0x520
? lock_release+0x6c0/0x6c0
netlink_rcv_skb+0x11d/0x340
? genl_get_cmd+0x3c0/0x3c0
? netlink_ack+0x9f0/0x9f0
? lock_release+0x1f9/0x6c0
genl_rcv+0x24/0x40
netlink_unicast+0x433/0x700
? netlink_attachskb+0x730/0x730
? _copy_from_iter_full+0x178/0x650
? __alloc_skb+0x113/0x2b0
netlink_sendmsg+0x6f1/0xbd0
? netlink_unicast+0x700/0x700
? lock_is_held_type+0x98/0x110
? netlink_unicast+0x700/0x700
sock_sendmsg+0xb0/0xe0
__sys_sendto+0x193/0x240
? __x64_sys_getpeername+0xb0/0xb0
? do_sys_openat2+0x10b/0x370
? __up_read+0x1a1/0x7b0
? do_user_addr_fault+0x219/0xdc0
? __x64_sys_openat+0x120/0x1d0
? __x64_sys_open+0x1a0/0x1a0
__x64_sys_sendto+0xdd/0x1b0
? syscall_enter_from_user_mode+0x1d/0x50
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc69d0af14a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:
00007ffc1d8292f8 EFLAGS:
00000246 ORIG_RAX:
000000000000002c
RAX:
ffffffffffffffda RBX:
0000000000000005 RCX:
00007fc69d0af14a
RDX:
0000000000000038 RSI:
0000555f57c56440 RDI:
0000000000000003
RBP:
0000555f57c56410 R08:
00007fc69d17b200 R09:
000000000000000c
R10:
0000000000000000 R11:
0000000000000246 R12:
0000000000000000
R13:
0000000000000000 R14:
0000000000000000 R15:
0000000000000000
Allocated by task 146:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc+0x99/0xc0
mlx5_init_fs+0xf0/0x1c50 [mlx5_core]
mlx5_load+0xd2/0x180 [mlx5_core]
mlx5_init_one+0x2f6/0x450 [mlx5_core]
probe_one+0x47d/0x6e0 [mlx5_core]
pci_device_probe+0x2a0/0x4a0
really_probe+0x20a/0xc90
driver_probe_device+0xd8/0x380
device_driver_attach+0x1df/0x250
__driver_attach+0xff/0x240
bus_for_each_dev+0x11e/0x1a0
bus_add_driver+0x309/0x570
driver_register+0x1ee/0x380
0xffffffffa06b8062
do_one_initcall+0xd5/0x410
do_init_module+0x1c8/0x760
load_module+0x6d8b/0x9650
__do_sys_finit_module+0x118/0x1b0
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 275:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0x102/0x140
slab_free_freelist_hook+0x74/0x1b0
kfree+0xd7/0x2a0
mlx5_unload+0x16/0xb0 [mlx5_core]
mlx5_unload_one+0xae/0x120 [mlx5_core]
mlx5_devlink_reload_down+0x1bc/0x380 [mlx5_core]
devlink_reload+0x141/0x520
devlink_nl_cmd_reload+0x66d/0x1070
genl_family_rcv_msg_doit+0x1e9/0x2f0
genl_rcv_msg+0x27f/0x4a0
netlink_rcv_skb+0x11d/0x340
genl_rcv+0x24/0x40
netlink_unicast+0x433/0x700
netlink_sendmsg+0x6f1/0xbd0
sock_sendmsg+0xb0/0xe0
__sys_sendto+0x193/0x240
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at
ffff888009d04300
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 8 bytes inside of
128-byte region [
ffff888009d04300,
ffff888009d04380)
The buggy address belongs to the page:
page:
0000000086a64ecc refcount:1 mapcount:0 mapping:
0000000000000000 index:0xffff888009d04000 pfn:0x9d04
head:
0000000086a64ecc order:1 compound_mapcount:0
flags: 0x4000000000010200(slab|head)
raw:
4000000000010200 ffffea0000203980 0000000200000002 ffff8880050428c0
raw:
ffff888009d04000 000000008020001d 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888009d04200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888009d04280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>
ffff888009d04300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888009d04380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888009d04400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The right solution to devlink reload is to notify about deletion of
parameters, unload driver, change net namespaces, load driver and notify
about addition of parameters.
Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload")
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Fri, 23 Jul 2021 18:36:30 +0000 (11:36 -0700)]
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
As Eric noticed, __unix_dgram_recvmsg() may acquire u->iolock
too, so we have to release it before calling this function.
Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Hengqi Chen [Fri, 30 Jul 2021 11:40:12 +0000 (19:40 +0800)]
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
Add two new APIs: btf__load_vmlinux_btf and btf__load_module_btf.
btf__load_vmlinux_btf is just an alias to the existing API named
libbpf_find_kernel_btf, rename to be more precisely and consistent
with existing BTF APIs. btf__load_module_btf can be used to load
module BTF, add it for completeness. These two APIs are useful for
implementing tracing tools and introspection tools. This is part
of the effort towards libbpf 1.0 ([0]).
[0] Closes: https://github.com/libbpf/libbpf/issues/280
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730114012.494408-1-hengqi.chen@gmail.com
Paolo Abeni [Fri, 30 Jul 2021 16:30:53 +0000 (18:30 +0200)]
sk_buff: avoid potentially clearing 'slow_gro' field
If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro'
field is cleared, too. That could lead to wrong behavior if
the skb later enters the GRO stage.
Fix the potential issue replacing preserving a non-zero value of
the 'slow_gro' field.
Additionally, fix a comment typo.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rafael J. Wysocki [Fri, 30 Jul 2021 18:26:38 +0000 (20:26 +0200)]
Merge branches 'acpi-resources' and 'acpi-dptf'
* acpi-resources:
Revert "ACPI: resources: Add checks for ACPI IRQ override"
* acpi-dptf:
ACPI: DPTF: Fix reading of attributes
Linus Torvalds [Fri, 30 Jul 2021 18:08:12 +0000 (11:08 -0700)]
Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- gendisk freeing fix (Christoph)
- blk-iocost wake ordering fix (Tejun)
- tag allocation error handling fix (John)
- loop locking fix. While this isn't the prettiest fix in the world,
nobody has any good alternatives for 5.14. Something to likely
revisit for 5.15. (Tetsuo)
* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
block: delay freeing the gendisk
blk-iocost: fix operation ordering in iocg_wake_fn()
blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
loop: reintroduce global lock for safe loop_validate_file() traversal
Linus Torvalds [Fri, 30 Jul 2021 18:01:47 +0000 (11:01 -0700)]
Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- A fix for block backed reissue (me)
- Reissue context hardening (me)
- Async link locking fix (Pavel)
* tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
io_uring: fix poll requests leaking second poll entries
io_uring: don't block level reissue off completion path
io_uring: always reissue from task_work context
io_uring: fix race in unified task_work running
io_uring: fix io_prep_async_link locking
Linus Torvalds [Fri, 30 Jul 2021 17:56:47 +0000 (10:56 -0700)]
Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block
Pull libata fixlets from Jens Axboe:
- A fix for PIO highmem (Christoph)
- Kill HAVE_IDE as it's now unused (Lukas)
* tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
arch: Kconfig: clean up obsolete use of HAVE_IDE
libata: fix ata_pio_sector for CONFIG_HIGHMEM
Linus Torvalds [Fri, 30 Jul 2021 17:50:09 +0000 (10:50 -0700)]
Merge tag 'for-5.14-rc3-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix -Warray-bounds warning, to help external patchset to make it
default treewide
- fix writeable device accounting (syzbot report)
- fix fsync and log replay after a rename and inode eviction
- fix potentially lost error code when submitting multiple bios for
compressed range
* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: calculate number of eb pages properly in csum_tree_block
btrfs: fix rw device counting in __btrfs_free_extra_devids
btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
btrfs: mark compressed range uptodate only if all bio succeed
Linus Torvalds [Fri, 30 Jul 2021 17:36:36 +0000 (10:36 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- resume timing fix for intel-ish driver (Ye Xiang)
- fix for using incorrect MMIO register in amd_sfh driver (Dylan
MacKenzie)
- Cintiq 24HDT / 27QHDT regression fix and touch processing fix for
Wacom driver (Jason Gerecke)
- device removal bugfix for ft260 driver (Michael Zaidman)
- other small assorted fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: ft260: fix device removal due to USB disconnect
HID: wacom: Skip processing of touches with negative slot values
HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible"
HID: apple: Add support for Keychron K1 wireless keyboard
HID: fix typo in Kconfig
HID: ft260: fix format type warning in ft260_word_show()
HID: amd_sfh: Use correct MMIO register for DMA address
HID: asus: Remove check for same LED brightness on set
HID: intel-ish-hid: use async resume function
Linus Torvalds [Fri, 30 Jul 2021 17:29:58 +0000 (10:29 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"7 patches.
Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
migration, and memcg)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
slub: fix unreclaimable slab stat for bulk free
mm/migrate: fix NR_ISOLATED corruption on 64-bit
mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
ocfs2: issue zeroout to EOF blocks
ocfs2: fix zero out valid data
lib/test_string.c: move string selftest in the Runtime Testing menu
Jakub Kicinski [Fri, 30 Jul 2021 17:29:52 +0000 (19:29 +0200)]
Merge tag 'linux-can-fixes-for-5.14-
20210730' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2021-07-30
The first patch is by me and adds Yasushi SHOJI as a reviewer for the
Microchip CAN BUS Analyzer Tool driver.
Dan Carpenter's patch fixes a signedness bug in the hi311x driver.
Pavel Skripkin provides 4 patches, the first targets the mcba_usb
driver by adding the missing urb->transfer_dma initialization, which
was broken in a previous commit. The last 3 patches fix a memory leak
in the usb_8dev, ems_usb and esd_usb2 driver.
* tag 'linux-can-fixes-for-5.14-
20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: esd_usb2: fix memory leak
can: ems_usb: fix memory leak
can: usb_8dev: fix memory leak
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: hi311x: fix a signedness bug in hi3110_cmd()
MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
====================
Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wang Hai [Thu, 29 Jul 2021 21:53:54 +0000 (14:53 -0700)]
mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
When I use kfree_rcu() to free a large memory allocated by kmalloc_node(),
the following dump occurs.
BUG: kernel NULL pointer dereference, address:
0000000000000020
[...]
Oops: 0000 [#1] SMP
[...]
Workqueue: events kfree_rcu_work
RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
[...]
Call Trace:
kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
kfree_bulk include/linux/slab.h:413 [inline]
kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
process_one_work+0x207/0x530 kernel/workqueue.c:2276
worker_thread+0x320/0x610 kernel/workqueue.c:2422
kthread+0x13d/0x160 kernel/kthread.c:313
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
When kmalloc_node() a large memory, page is allocated, not slab, so when
freeing memory via kfree_rcu(), this large memory should not be used by
memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for
slab.
Using page_objcgs_check() instead of page_objcgs() in
memcg_slab_free_hook() to fix this bug.
Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Thu, 29 Jul 2021 21:53:50 +0000 (14:53 -0700)]
slub: fix unreclaimable slab stat for bulk free
SLUB uses page allocator for higher order allocations and update
unreclaimable slab stat for such allocations. At the moment, the bulk
free for SLUB does not share code with normal free code path for these
type of allocations and have missed the stat update. So, fix the stat
update by common code. The user visible impact of the bug is the
potential of inconsistent unreclaimable slab stat visible through
meminfo and vmstat.
Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Thu, 29 Jul 2021 21:53:47 +0000 (14:53 -0700)]
mm/migrate: fix NR_ISOLATED corruption on 64-bit
Similar to commit
2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages. With
unsigned int type the large unsigned int converts to a large positive
signed long.
Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".
Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 29 Jul 2021 21:53:44 +0000 (14:53 -0700)]
mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
Dan Carpenter reports:
The patch
2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr
29, 2021, leads to the following static checker warning:
kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
warn: sleeping in atomic context
mm/memcontrol.c
3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
3573 {
3574 unsigned long val;
3575
3576 if (mem_cgroup_is_root(memcg)) {
3577 cgroup_rstat_flush(memcg->css.cgroup);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is from static analysis and potentially a false positive. The
problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
which holds an rcu_read_lock(). And the cgroup_rstat_flush() function
can sleep.
3578 val = memcg_page_state(memcg, NR_FILE_PAGES) +
3579 memcg_page_state(memcg, NR_ANON_MAPPED);
3580 if (swap)
3581 val += memcg_page_state(memcg, MEMCG_SWAP);
3582 } else {
3583 if (!swap)
3584 val = page_counter_read(&memcg->memory);
3585 else
3586 val = page_counter_read(&memcg->memsw);
3587 }
3588 return val;
3589 }
__mem_cgroup_threshold() indeed holds the rcu lock. In addition, the
thresholding code is invoked during stat changes, and those contexts
have irqs disabled as well. If the lock breaking occurs inside the
flush function, it will result in a sleep from an atomic context.
Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.
Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Thu, 29 Jul 2021 21:53:41 +0000 (14:53 -0700)]
ocfs2: issue zeroout to EOF blocks
For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster. But since ->writepage will ignore EOF
pages, those zeros will not be flushed.
This "looks" ok as commit
6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't. The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.
When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again. That made
writeback ignore them forever. That will cause data corruption. Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.
To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more. The fix is to avoid zero EOF blocks with buffer
write.
The following code snippet from qemu-img could trigger the corruption.
656 open("
6b3711ae-3306-4bdd-823c-
cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
2275868672, 327680 <unfinished ...>
660 fallocate(11, 0,
2275868672, 327680) = 0
658 pwrite64(11, "
Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Thu, 29 Jul 2021 21:53:38 +0000 (14:53 -0700)]
ocfs2: fix zero out valid data
If append-dio feature is enabled, direct-io write and fallocate could
run in parallel to extend file size, fallocate used "orig_isize" to
record i_size before taking "ip_alloc_sem", when
ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
out.
Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matteo Croce [Thu, 29 Jul 2021 21:53:35 +0000 (14:53 -0700)]
lib/test_string.c: move string selftest in the Runtime Testing menu
STRING_SELFTEST is presented in the "Library routines" menu. Move it in
Kernel hacking > Kernel Testing and Coverage > Runtime Testing together
with other similar tests found in lib/
--- Runtime Testing
<*> Test functions located in the hexdump module at runtime
<*> Test string functions (NEW)
<*> Test functions located in the string_helpers module at runtime
<*> Test strscpy*() family of functions at runtime
<*> Test kstrto*() family of functions at runtime
<*> Test printf() family of functions at runtime
<*> Test scanf() family of functions at runtime
Link: https://lkml.kernel.org/r/20210719185158.190371-1-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Peter Rosin <peda@axentia.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>