review.tizen.org Git - platform/kernel/linux-starfive.git/log

wireguard: ratelimiter: use kvcalloc() instead of kvzalloc()

Use 2-factor argument form kvcalloc() instead of kvzalloc().

Link: https://github.com/KSPP/linux/issues/162
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
[Jason: Gustavo's link above is for KSPP, but this isn't actually a
security fix, as table_size is bounded to 8192 anyway, and gcc realizes
this, so the codegen comes out to be about the same.]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: receive: drop handshakes if queue lock is contended

If we're being delivered packets from multiple CPUs so quickly that the
ring lock is contended for CPU tries, then it's safe to assume that the
queue is near capacity anyway, so just drop the packet rather than
spinning. This helps deal with multicore DoS that can interfere with
data path performance. It _still_ does not completely fix the issue, but
it again chips away at it.

Reported-by: Streun Fabio <fstreun@student.ethz.ch>
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: receive: use ring buffer for incoming handshakes

Apparently the spinlock on incoming_handshake's skb_queue is highly
contended, and a torrent of handshake or cookie packets can bring the
data plane to its knees, simply by virtue of enqueueing the handshake
packets to be processed asynchronously. So, we try switching this to a
ring buffer to hopefully have less lock contention. This alleviates the
problem somewhat, though it still isn't perfect, so future patches will
have to improve this further. However, it at least doesn't completely
diminish the data plane.

Reported-by: Streun Fabio <fstreun@student.ethz.ch>
Reported-by: Joel Wanner <joel.wanner@inf.ethz.ch>
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: device: reset peer src endpoint when netns exits

Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.

However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.

This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.

Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: selftests: rename DEBUG_PI_LIST to DEBUG_PLIST

DEBUG_PI_LIST was renamed to DEBUG_PLIST since 8e18faeac3 ("lib/plist:
rename DEBUG_PI_LIST to DEBUG_PLIST").

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Fixes: 8e18faeac3e4 ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: main: rename 'mod_init' & 'mod_exit' functions to be module-specific

Rename module_init & module_exit functions that are named
"mod_init" and "mod_exit" so that they are unique in both the
System.map file and in initcall_debug output instead of showing
up as almost anonymous "mod_init".

This is helpful for debugging and in determining how long certain
module_init calls take to execute.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: selftests: actually test for routing loops

We previously removed the restriction on looping to self, and then added
a test to make sure the kernel didn't blow up during a routing loop. The
kernel didn't blow up, thankfully, but on certain architectures where
skb fragmentation is easier, such as ppc64, the skbs weren't actually
being discarded after a few rounds through. But the test wasn't catching
this. So actually test explicitly for massive increases in tx to see if
we have a routing loop. Note that the actual loop problem will need to
be addressed in a different commit.

Fixes: b673e24aad36 ("wireguard: socket: remove errant restriction on looping to self")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: selftests: increase default dmesg log size

The selftests currently parse the kernel log at the end to track
potential memory leaks. With these tests now reading off the end of the
buffer, due to recent optimizations, some creation messages were lost,
making the tests think that there was a free without an alloc. Fix this
by increasing the kernel log size.

Fixes: 24b70eeeb4f4 ("wireguard: use synchronize_net rather than synchronize_rcu")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wireguard: allowedips: add missing __rcu annotation to satisfy sparse

A __rcu annotation got lost during refactoring, which caused sparse to
become enraged.

Fixes: bf7b042dc62a ("wireguard: allowedips: free empty intermediate nodes when removing single node")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix memory leak in fib6_rule_suppress

The kernel leaks memory when a `fib` rule is present in IPv6 nftables
firewall rules and a suppress_prefix rule is present in the IPv6 routing
rules (used by certain tools such as wg-quick). In such scenarios, every
incoming packet will leak an allocation in `ip6_dst_cache` slab cache.

After some hours of `bpftrace`-ing and source code reading, I tracked
down the issue to ca7a03c41753 ("ipv6: do not free rt if
FIB_LOOKUP_NOREF is set on suppress rule").

The problem with that change is that the generic `args->flags` always have
`FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag
`RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not
decreasing the refcount when needed.

How to reproduce:
- Add the following nftables rule to a prerouting chain:
     meta nfproto ipv6 fib saddr . mark . iif oif missing drop
   This can be done with:
     sudo nft create table inet test
     sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }'
     sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop
- Run:
     sudo ip -6 rule add table main suppress_prefixlength 0
- Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase
   with every incoming ipv6 packet.

This patch exposes the protocol-specific flags to the protocol
specific `suppress` function, and check the protocol-specific `flags`
argument for RT6_LOOKUP_F_DST_NOREF instead of the generic
FIB_LOOKUP_NOREF when decreasing the refcount, like this.

[1]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L71
[2]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L99

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105
Fixes: ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'atlantic-fixes'

Sudarsana Reddy Kalluru says:

====================
net: atlantic: 11-2021 fixes

The patch series contains fixes for atlantic driver to improve support
of latest AQC113 chipset.

Please consider applying it to 'net' tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: Remove warn trace message.

Remove the warn trace message - it's not a correct check here, because
the function can still be called on the device in DOWN state

Fixes: 508f2e3dce454 ("net: atlantic: split rx and tx per-queue stats")
Signed-off-by: Sameer Saurabh <ssaurabh@marvell.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: Fix statistics logic for production hardware

B0 is the main and widespread device revision of atlantic2 HW. In the
current state, driver will incorrectly fetch the statistics for this
revision.

Fixes: 5cfd54d7dc186 ("net: atlantic: minimal A2 fw_ops")
Signed-off-by: Dmitry Bogdanov <dbezrukov@marvell.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Remove Half duplex mode speed capabilities.

Since Half Duplex mode has been deprecated by the firmware, driver should
not advertise Half Duplex speed in ethtool support link speed values.

Fixes: 071a02046c262 ("net: atlantic: A2: half duplex support")
Signed-off-by: Sameer Saurabh <ssaurabh@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: Add missing DIDs and fix 115c.

At the late production stages new dev ids were introduced. These are
now in production, so its important for the driver to recognize these.
And also fix the board caps for AQC115C adapter.

Fixes: b3f0c79cba206 ("net: atlantic: A2 hw_ops skeleton")
Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: Fix to display FW bundle version instead of FW mac version.

The correct way to reflect firmware version is to use bundle version.
Hence populating the same instead of MAC fw version.

Fixes: c1be0bf092bd2 ("net: atlantic: common functions needed for basic A2 init/deinit hw_ops")
Signed-off-by: Sameer Saurabh <ssaurabh@marvell.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlatnic: enable Nbase-t speeds with base-t

When 2.5G is advertised, N-Base should be advertised against the T-base
caps. N5G is out of use in baseline code and driver should treat both 5G
and N5G (and also 2.5G and N2.5G) equally from user perspective.

Fixes: 5cfd54d7dc186 ("net: atlantic: minimal A2 fw_ops")
Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: Increase delay for fw transactions

The max waiting period (of 1 ms) while reading the data from FW shared
buffer is too small for certain types of data (e.g., stats). There's a
chance that FW could be updating buffer at the same time and driver
would be unsuccessful in reading data. Firmware manual recommends to
have 1 sec timeout to fix this issue.

Fixes: 5cfd54d7dc186 ("net: atlantic: minimal A2 fw_ops")
Signed-off-by: Dmitry Bogdanov <dbezrukov@marvell.com>
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Update reported link modes for 1/10G

When link modes were initially added in commit 2c762679435dc
("net/mlx4_en: Use PTYS register to query ethtool settings") and
later updated for the new ethtool API in commit 3d8f7cc78d0eb
("net: mlx4: use new ETHTOOL_G/SSETTINGS API") the only 1/10G non-baseT
link modes configured were 1000baseKX, 10000baseKX4 and 10000baseKR.
It looks like these got picked to represent other modes since nothing
better was available.

Switch to using more specific link modes added in commit 5711a98221443
("net: ethtool: add support for 1000BaseX and missing 10G link modes").

Tested with MCX311A-XCAT connected via DAC.
Before:

% sudo ethtool enp3s0
Settings for enp3s0:
Supported ports: [ FIBRE ]
Supported link modes:   1000baseKX/Full
                        10000baseKR/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes:  1000baseKX/Full
                        10000baseKR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
Link detected: yes

With this change:

% sudo ethtool enp3s0
Settings for enp3s0:
Supported ports: [ FIBRE ]
Supported link modes:   1000baseX/Full
                        10000baseCR/Full
                        10000baseSR/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes:  1000baseX/Full
                        10000baseCR/Full
                        10000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
Link detected: yes

Tested-by: Michael Stapelberg <michael@stapelberg.ch>
Signed-off-by: Erik Ekman <erik@kryo.se>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: test: fix skb free in test device tx

In our test device, we're currently freeing skbs in the transmit path
with kfree(), rather than kfree_skb(). This change uses the correct
kfree_skb() instead.

Fixes: ded21b722995 ("mctp: Add test utils")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tls: Fix authentication failure in CCM mode

When the TLS cipher suite uses CCM mode, including AES CCM and
SM4 CCM, the first byte of the B0 block is flags, and the real
IV starts from the second byte. The XOR operation of the IV and
rec_seq should be skip this byte, that is, add the iv_offset.

Fixes: f295b3ae9f59 ("net/tls: Add support of AES128-CCM based ciphers")
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Cc: Vakul Garg <vakul.garg@nxp.com>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mpls-notifications'

Benjamin Poirier says:

====================
net: mpls: Netlink notification fixes

fix missing or inaccurate route notifications when devices used in
nexthops are deleted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mpls: Remove rcu protection from nh_dev

Following the previous commit, nh_dev can no longer be accessed and
modified concurrently.

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mpls: Fix notifications when deleting a device

There are various problems related to netlink notifications for mpls route
changes in response to interfaces being deleted:
* delete interface of only nexthop
DELROUTE notification is missing RTA_OIF attribute
* delete interface of non-last nexthop
NEWROUTE notification is missing entirely
* delete interface of last nexthop
DELROUTE notification is missing nexthop

All of these problems stem from the fact that existing routes are modified
in-place before sending a notification. Restructure mpls_ifdown() to avoid
changing the route in the DELROUTE cases and to create a copy in the
NEWROUTE case.

Fixes: f8efb73c97e2 ("mpls: multipath route support")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: lan78xx: lan78xx_phy_init(): use PHY_POLL instead of "0" if no IRQ is available

On most systems request for IRQ 0 will fail, phylib will print an error message
and fall back to polling. To fix this set the phydev->irq to PHY_POLL if no IRQ
is available.

Fixes: cc89c323a30e ("lan78xx: Use irq_domain for phy interrupt from USB Int. EP")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

USB: NO_LPM quirk Lenovo Powered USB-C Travel Hub

This is another branded 8153 device that doesn't work well with LPM:
r8152 2-2.1:1.0 enp0s13f0u2u1: Stop submitting intr, status -71

Disable LPM to resolve the issue.

Signed-off-by: Ole Ernst <olebowle@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: realtek-smi: fix indirect reg access for ports>3

This switch family can have up to 8 UTP ports {0..7}. However,
INDIRECT_ACCESS_ADDRESS_PHYNUM_MASK was using 2 bits instead of 3,
dropping the most significant bit during indirect register reads and
writes. Reading or writing ports 4, 5, 6, and 7 registers was actually
manipulating, respectively, ports 0, 1, 2, and 3 registers.

This is not sufficient but necessary to support any variant with more
than 4 UTP ports, like RTL8367S.

rtl8365mb_phy_{read,write} will now returns -EINVAL if phy is greater
than 7.

Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix page frag corruption on page fault

Steffen reported a TCP stream corruption for HTTP requests
served by the apache web-server using a cifs mount-point
and memory mapping the relevant file.

The root cause is quite similar to the one addressed by
commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
memory reclaim"). Here the nested access to the task page frag
is caused by a page fault on the (mmapped) user-space memory
buffer coming from the cifs file.

The page fault handler performs an smb transaction on a different
socket, inside the same process context. Since sk->sk_allaction
for such socket does not prevent the usage for the task_frag,
the nested allocation modify "under the hood" the page frag
in use by the outer sendmsg call, corrupting the stream.

The overall relevant stack trace looks like the following:

httpd 78268 [001] 3461630.850950:      probe:tcp_sendmsg_locked:
        ffffffff91461d91 tcp_sendmsg_locked+0x1
        ffffffff91462b57 tcp_sendmsg+0x27
        ffffffff9139814e sock_sendmsg+0x3e
        ffffffffc06dfe1d smb_send_kvec+0x28
        [...]
        ffffffffc06cfaf8 cifs_readpages+0x213
        ffffffff90e83c4b read_pages+0x6b
        ffffffff90e83f31 __do_page_cache_readahead+0x1c1
        ffffffff90e79e98 filemap_fault+0x788
        ffffffff90eb0458 __do_fault+0x38
        ffffffff90eb5280 do_fault+0x1a0
        ffffffff90eb7c84 __handle_mm_fault+0x4d4
        ffffffff90eb8093 handle_mm_fault+0xc3
        ffffffff90c74f6d __do_page_fault+0x1ed
        ffffffff90c75277 do_page_fault+0x37
        ffffffff9160111e page_fault+0x1e
        ffffffff9109e7b5 copyin+0x25
        ffffffff9109eb40 _copy_from_iter_full+0xe0
        ffffffff91462370 tcp_sendmsg_locked+0x5e0
        ffffffff91462370 tcp_sendmsg_locked+0x5e0
        ffffffff91462b57 tcp_sendmsg+0x27
        ffffffff9139815c sock_sendmsg+0x4c
        ffffffff913981f7 sock_write_iter+0x97
        ffffffff90f2cc56 do_iter_readv_writev+0x156
        ffffffff90f2dff0 do_iter_write+0x80
        ffffffff90f2e1c3 vfs_writev+0xa3
        ffffffff90f2e27c do_writev+0x5c
        ffffffff90c042bb do_syscall_64+0x5b
        ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65

The cifs filesystem rightfully sets sk_allocations to GFP_NOFS,
we can avoid the nesting using the sk page frag for allocation
lacking the __GFP_FS flag. Do not define an additional mm-helper
for that, as this is strictly tied to the sk page frag usage.

v1 -> v2:
- use a stricted sk_page_frag() check instead of reordering the
   code (Eric)

Reported-by: Steffen Froemer <sfroemer@redhat.com>
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Avoid DMA_CHAN_CONTROL write if no Split Header support

The driver assumes that split headers can be enabled/disabled without
stopping/starting the device, so it writes DMA_CHAN_CONTROL from
stmmac_set_features(). However, on my system (IP v5.10a without Split
Header support), simply writing DMA_CHAN_CONTROL when DMA is running
(for example, with the commands below) leads to a TX watchdog timeout.

host$ socat TCP-LISTEN:1024,fork,reuseaddr - &
device$ ethtool -K eth0 tso off
device$ ethtool -K eth0 tso on
device$ dd if=/dev/zero bs=1M count=10 | socat - TCP4:host:1024
<tx watchdog timeout>

Note that since my IP is configured without Split Header support, the
driver always just reads and writes the same value to the
DMA_CHAN_CONTROL register.

I don't have access to any platforms with Split Header support so I
don't know if these writes to the DMA_CHAN_CONTROL while DMA is running
actually work properly on such systems. I could not find anything in
the databook that says that DMA_CHAN_CONTROL should not be written when
the DMA is running.

But on systems without Split Header support, there is in any case no
need to call enable_sph() in stmmac_set_features() at all since SPH can
never be toggled, so we can avoid the watchdog timeout there by skipping
this call.

Fixes: 8c6fc097a2f4acf ("net: stmmac: gmac4+: Add Split Header support")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'net-5.16-rc3' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter.

  Current release - regressions:

   - r8169: fix incorrect mac address assignment

   - vlan: fix underflow for the real_dev refcnt when vlan creation
     fails

   - smc: avoid warning of possible recursive locking

  Current release - new code bugs:

   - vsock/virtio: suppress used length validation

   - neigh: fix crash in v6 module initialization error path

  Previous releases - regressions:

   - af_unix: fix change in behavior in read after shutdown

   - igb: fix netpoll exit with traffic, avoid warning

   - tls: fix splice_read() when starting mid-record

   - lan743x: fix deadlock in lan743x_phy_link_status_change()

   - marvell: prestera: fix bridge port operation

  Previous releases - always broken:

   - tcp_cubic: fix spurious Hystart ACK train detections for
     not-cwnd-limited flows

   - nexthop: fix refcount issues when replacing IPv6 groups

   - nexthop: fix null pointer dereference when IPv6 is not enabled

   - phylink: force link down and retrigger resolve on interface change

   - mptcp: fix delack timer length calculation and incorrect early
     clearing

   - ieee802154: handle iftypes as u32, prevent shift-out-of-bounds

   - nfc: virtual_ncidev: change default device permissions

   - netfilter: ctnetlink: fix error codes and flags used for kernel
     side filtering of dumps

   - netfilter: flowtable: fix IPv6 tunnel addr match

   - ncsi: align payload to 32-bit to fix dropped packets

   - iavf: fix deadlock and loss of config during VF interface reset

   - ice: avoid bpf_prog refcount underflow

   - ocelot: fix broken PTP over IP and PTP API violations

  Misc:

   - marvell: mvpp2: increase MTU limit when XDP enabled"

* tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
  net: dsa: microchip: implement multi-bridge support
  net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
  net: mscc: ocelot: set up traps for PTP packets
  net: ptp: add a definition for the UDP port for IEEE 1588 general messages
  net: mscc: ocelot: create a function that replaces an existing VCAP filter
  net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
  net: hns3: fix incorrect components info of ethtool --reset command
  net: hns3: fix one incorrect value of page pool info when queried by debugfs
  net: hns3: add check NULL address for page pool
  net: hns3: fix VF RSS failed problem after PF enable multi-TCs
  net: qed: fix the array may be out of bound
  net/smc: Don't call clcsock shutdown twice when smc shutdown
  net: vlan: fix underflow for the real_dev refcnt
  ptp: fix filter names in the documentation
  ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
  nfc: virtual_ncidev: change default device permissions
  net/sched: sch_ets: don't peek at classes beyond 'nbands'
  net: stmmac: Disable Tx queues when reconfiguring the interface
  selftests: tls: test for correct proto_ops
  tls: fix replacing proto_ops
  ...

net: dsa: microchip: implement multi-bridge support

Current driver version is able to handle only one bridge at time.
Configuring two bridges on two different ports would end up shorting this
bridges by HW. To reproduce it:

ip l a name br0 type bridge
ip l a name br1 type bridge
ip l s dev br0 up
ip l s dev br1 up
ip l s lan1 master br0
ip l s dev lan1 up
ip l s lan2 master br1
ip l s dev lan2 up

Ping on lan1 and get response on lan2, which should not happen.

This happened, because current driver version is storing one global "Port VLAN
Membership" and applying it to all ports which are members of any
bridge.
To solve this issue, we need to handle each port separately.

This patch is dropping the global port member storage and calculating
membership dynamically depending on STP state and bridge participation.

Note: STP support was broken before this patch and should be fixed
separately.

Fixes: c2e866911e25 ("net: dsa: microchip: break KSZ9477 DSA driver into two files")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20211126123926.2981028-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'acpi-5.16-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix a NULL pointer dereference in the CPPC library code and a
  locking issue related to printing the names of ACPI device nodes in
  the device properties framework.

  Specifics:

   - Fix NULL pointer dereference in the CPPC library code occuring on
     hybrid systems without CPPC support (Rafael Wysocki).

   - Avoid attempts to acquire a semaphore with interrupts off when
     printing the names of ACPI device nodes and clean up code on top of
     that fix (Sakari Ailus)"

* tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: CPPC: Add NULL pointer check to cppc_get_perf()
  ACPI: Make acpi_node_get_parent() local
  ACPI: Get acpi_device's parent from the parent field

Merge tag 'pm-5.16-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These address three issues in the intel_pstate driver and fix two
  problems related to hibernation.

  Specifics:

   - Make intel_pstate work correctly on Ice Lake server systems with
     out-of-band performance control enabled (Adamos Ttofari).

   - Fix EPP handling in intel_pstate during CPU offline and online in
     the active mode (Rafael Wysocki).

   - Make intel_pstate support ITMT on asymmetric systems with
     overclocking enabled (Srinivas Pandruvada).

   - Fix hibernation image saving when using the user space interface
     based on the snapshot special device file (Evan Green).

   - Make the hibernation code release the snapshot block device using
     the same mode that was used when acquiring it (Thomas Zeitlhofer)"

* tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: Fix snapshot partial write lengths
  PM: hibernate: use correct mode for swsusp_close()
  cpufreq: intel_pstate: ITMT support for overclocked system
  cpufreq: intel_pstate: Fix active mode offline/online EPP handling
  cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs

Merge tag 'fuse-fixes-5.16-rc3' of git://git./linux/kernel/git/mszeredi/fuse

Pull fuse fix from Miklos Szeredi:
"Fix a regression caused by a bugfix in the previous release. The
  symptom is a VM_BUG_ON triggered from splice to the fuse device.

  Unfortunately the original bugfix was already backported to a number
  of stable releases, so this fix-fix will need to be backported as
  well"

* tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: release pipe buf after last use

Merge branch 'fix-broken-ptp-over-ip-on-ocelot-switches'

Vladimir Oltean says:

====================
Fix broken PTP over IP on Ocelot switches

Changes in v2: added patch 5, added Richard's ack for the whole series
sans patch 5 which is new.

Po Liu reported recently that timestamping PTP over IPv4 is broken using
the felix driver on NXP LS1028A. This has been known for a while, of
course, since it has always been broken. The reason is because IP PTP
packets are currently treated as unknown IP multicast, which is not
flooded to the CPU port in the ocelot driver design, so packets don't
reach the ptp4l program.

The series solves the problem by installing packet traps per port when
the timestamping ioctl is called, depending on the RX filter selected
(L2, L4 or both).
====================

Link: https://lore.kernel.org/r/20211126172845.3149260-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: correctly report the timestamping RX filters in ethtool

The driver doesn't support RX timestamping for non-PTP packets, but it
declares that it does. Restrict the reported RX filters to PTP v2 over
L2 and over L4.

Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: set up traps for PTP packets

IEEE 1588 support was declared too soon for the Ocelot switch. Out of
reset, this switch does not apply any special treatment for PTP packets,
i.e. when an event message is received, the natural tendency is to
forward it by MAC DA/VLAN ID. This poses a problem when the ingress port
is under a bridge, since user space application stacks (written
primarily for endpoint ports, not switches) like ptp4l expect that PTP
messages are always received on AF_PACKET / AF_INET sockets (depending
on the PTP transport being used), and never being autonomously
forwarded. Any forwarding, if necessary (for example in Transparent
Clock mode) is handled in software by ptp4l. Having the hardware forward
these packets too will cause duplicates which will confuse endpoints
connected to these switches.

So PTP over L2 barely works, in the sense that PTP packets reach the CPU
port, but they reach it via flooding, and therefore reach lots of other
unwanted destinations too. But PTP over IPv4/IPv6 does not work at all.
This is because the Ocelot switch have a separate destination port mask
for unknown IP multicast (which PTP over IP is) flooding compared to
unknown non-IP multicast (which PTP over L2 is) flooding. Specifically,
the driver allows the CPU port to be in the PGID_MC port group, but not
in PGID_MCIPV4 and PGID_MCIPV6. There are several presentations from
Allan Nielsen which explain that the embedded MIPS CPU on Ocelot
switches is not very powerful at all, so every penny they could save by
not allowing flooding to the CPU port module matters. Unknown IP
multicast did not make it.

The de facto consensus is that when a switch is PTP-aware and an
application stack for PTP is running, switches should have some sort of
trapping mechanism for PTP packets, to extract them from the hardware
data path. This avoids both problems:
(a) PTP packets are no longer flooded to unwanted destinations
(b) PTP over IP packets are no longer denied from reaching the CPU since
they arrive there via a trap and not via flooding

It is not the first time when this change is attempted. Last time, the
feedback from Allan Nielsen and Andrew Lunn was that the traps should
not be installed by default, and that PTP-unaware switching may be
desired for some use cases:
https://patchwork.ozlabs.org/project/netdev/patch/20190813025214.18601-5-yangbo.lu@nxp.com/

To address that feedback, the present patch adds the necessary packet
traps according to the RX filter configuration transmitted by user space
through the SIOCSHWTSTAMP ioctl. Trapping is done via VCAP IS2, where we
keep 5 filters, which are amended each time RX timestamping is enabled
or disabled on a port:
- 1 for PTP over L2
- 2 for PTP over IPv4 (UDP ports 319 and 320)
- 2 for PTP over IPv6 (UDP ports 319 and 320)

The cookie by which these filters (invisible to tc) are identified is
strategically chosen such that it does not collide with the filters used
for the ocelot-8021q tagging protocol by the Felix driver, or with the
MRP traps set up by the Ocelot library.

Other alternatives were considered, like patching user space to do
something, but there are so many ways in which PTP packets could be made
to reach the CPU, generically speaking, that "do what?" is a very valid
question. The ptp4l program from the linuxptp stack already attempts to
do something: it calls setsockopt(IP_ADD_MEMBERSHIP) (and
PACKET_ADD_MEMBERSHIP, respectively) which translates in both cases into
a dev_mc_add() on the interface, in the kernel:
https://github.com/richardcochran/linuxptp/blob/v3.1.1/udp.c#L73
https://github.com/richardcochran/linuxptp/blob/v3.1.1/raw.c

Reality shows that this is not sufficient in case the interface belongs
to a switchdev driver, as dev_mc_add() does not show the intention to
trap a packet to the CPU, but rather the intention to not drop it (it is
strictly for RX filtering, same as promiscuous does not mean to send all
traffic to the CPU, but to not drop traffic with unknown MAC DA). This
topic is a can of worms in itself, and it would be great if user space
could just stay out of it.

On the other hand, setting up PTP traps privately within the driver is
not new by any stretch of the imagination:
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c#L833
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/dsa/hirschmann/hellcreek.c#L1050
https://elixir.bootlin.com/linux/v5.16-rc2/source/include/linux/dsa/sja1105.h#L21

So this is the approach taken here as well. The difference here being
that we prepare and destroy the traps per port, dynamically at runtime,
as opposed to driver init time, because apparently, PTP-unaware
forwarding is a use case.

Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support")
Reported-by: Po Liu <po.liu@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ptp: add a definition for the UDP port for IEEE 1588 general messages

As opposed to event messages (Sync, PdelayReq etc) which require
timestamping, general messages (Announce, FollowUp etc) do not.
In PTP they are part of different streams of data.

IEEE 1588-2008 Annex D.2 "UDP port numbers" states that the UDP
destination port assigned by IANA is 319 for event messages, and 320 for
general messages. Yet the kernel seems to be missing the definition for
general messages. This patch adds it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: create a function that replaces an existing VCAP filter

VCAP (Versatile Content Aware Processor) is the TCAM-based engine behind
tc flower offload on ocelot, among other things. The ingress port mask
on which VCAP rules match is present as a bit field in the actual key of
the rule. This means that it is possible for a rule to be shared among
multiple source ports. When the rule is added one by one on each desired
port, that the ingress port mask of the key must be edited and rewritten
to hardware.

But the API in ocelot_vcap.c does not allow for this. For one thing,
ocelot_vcap_filter_add() and ocelot_vcap_filter_del() are not symmetric,
because ocelot_vcap_filter_add() works with a preallocated and
prepopulated filter and programs it to hardware, and
ocelot_vcap_filter_del() does both the job of removing the specified
filter from hardware, as well as kfreeing it. That is to say, the only
option of editing a filter in place, which is to delete it, modify the
structure and add it back, does not work because it results in
use-after-free.

This patch introduces ocelot_vcap_filter_replace, which trivially
reprograms a VCAP entry to hardware, at the exact same index at which it
existed before, without modifying any list or allocating any memory.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP

The ocelot driver, when asked to timestamp all receiving packets, 1588
v1 or NTP, says "nah, here's 1588 v2 for you".

According to this discussion:
https://patchwork.kernel.org/project/netdevbpf/patch/20211104133204.19757-8-martin.kaistra@linutronix.de/#24577647
drivers that downgrade from a wider request to a narrower response (or
even a response where the intersection with the request is empty) are
buggy, and should return -ERANGE instead. This patch fixes that.

Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support")
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-hns3-add-some-fixes-for-net'

Guangbin Huang says:

====================
net: hns3: add some fixes for -net

This series adds some fixes for the HNS3 ethernet driver.
====================

Link: https://lore.kernel.org/r/20211126120318.33921-1-huangguangbin2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: fix incorrect components info of ethtool --reset command

Currently, HNS3 driver doesn't clear the reset flags of components after
successfully executing reset, it causes userspace info of
"Components reset" and "Components not reset" is incorrect.

So fix this problem by clear corresponding reset flag after reset process.

Fixes: ddccc5e368a3 ("net: hns3: add support for triggering reset by ethtool")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: fix one incorrect value of page pool info when queried by debugfs

Currently, when user queries page pool info by debugfs command
"cat page_pool_info", the cnt of allocated page for page pool may be
incorrect because of memory inconsistency problem caused by compiler
optimization.

So this patch uses READ_ONCE() to read value of pages_state_hold_cnt to
fix this problem.

Fixes: 850bfb912a6d ("net: hns3: debugfs add support dumping page pool info")
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add check NULL address for page pool

When page pool is not enabled, its address value is still NULL and page
pool should not be accessed, so add a check for it.

Fixes: 850bfb912a6d ("net: hns3: debugfs add support dumping page pool info")
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: fix VF RSS failed problem after PF enable multi-TCs

When PF is set to multi-TCs and configured mapping relationship between
priorities and TCs, the hardware will active these settings for this PF
and its VFs.

In this case when VF just uses one TC and its rx packets contain priority,
and if the priority is not mapped to TC0, as other TCs of VF is not valid,
hardware always put this kind of packets to the queue 0. It cause this kind
of packets of VF can not be used RSS function.

To fix this problem, set tc mode of all unused TCs of VF to the setting of
TC0, then rx packet with priority which map to unused TC will be direct to
TC0.

Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: qed: fix the array may be out of bound

If the variable 'p_bit->flags' is always 0,
the loop condition is always 0.

The variable 'j' may be greater than or equal to 32.

At this time, the array 'p_aeu->bits[32]' may be out
of bound.

Signed-off-by: zhangyue <zhangyue1@kylinos.cn>
Link: https://lore.kernel.org/r/20211125113610.273841-1-zhangyue1@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-5.16-rc2-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"One more fix to the lzo code, a missing put_page causing memory leaks
when some error branches are taken"

* tag 'for-5.16-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the memory leak caused in lzo_compress_pages()

net/smc: Don't call clcsock shutdown twice when smc shutdown

When applications call shutdown() with SHUT_RDWR in userspace,
smc_close_active() calls kernel_sock_shutdown(), and it is called
twice in smc_shutdown().

This fixes this by checking sk_state before do clcsock shutdown, and
avoids missing the application's call of smc_shutdown().

Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/
Fixes: 606a63c9783a ("net/smc: Ensure the active closing peer first closes clcsock")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20211126024134.45693-1-tonylu@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: vlan: fix underflow for the real_dev refcnt

Inject error before dev_hold(real_dev) in register_vlan_dev(),
and execute the following testcase:

ip link add dev dummy1 type dummy
ip link add name dummy1.100 link dummy1 type vlan id 100
ip link del dev dummy1

When the dummy netdevice is removed, we will get a WARNING as following:

=======================================================================
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0

and an endless loop of:

=======================================================================
unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824

That is because dev_put(real_dev) in vlan_dev_free() be called without
dev_hold(real_dev) in register_vlan_dev(). It makes the refcnt of real_dev
underflow.

Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of
ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev
symmetrical.

Fixes: 563bcbae3ba2 ("net: vlan: fix a UAF in vlan_dev_real_dev()")
Reported-by: Petr Machata <petrm@nvidia.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: fix filter names in the documentation

All the filter names are missing _PTP in them.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20211126031921.2466944-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()

ethtool_set_coalesce() now uses both the .get_coalesce() and
.set_coalesce() callbacks. But the check for their availability is
buggy, so changing the coalesce settings on a device where the driver
provides only _one_ of the callbacks results in a NULL pointer
dereference instead of an -EOPNOTSUPP.

Fix the condition so that the availability of both callbacks is
ensured. This also matches the netlink code.

Note that reproducing this requires some effort - it only affects the
legacy ioctl path, and needs a specific combination of driver options:
- have .get_coalesce() and .coalesce_supported but no
.set_coalesce(), or
- have .set_coalesce() but no .get_coalesce(). Here eg. ethtool doesn't
cause the crash as it first attempts to call ethtool_get_coalesce()
and bails out on error.

Fixes: f3ccfda19319 ("ethtool: extend coalesce setting uAPI with CQE mode")
Cc: Yufeng Mo <moyufeng@huawei.com>
Cc: Huazhong Tan <tanhuazhong@huawei.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20211126175543.28000-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfc: virtual_ncidev: change default device permissions

Device permissions is S_IALLUGO, with many unnecessary bits. Remove them
and also remove read and write permissions from group and others.

Before the change:
crwsrwsrwt 1 0 0 10, 125 Nov 25 13:59 /dev/virtual_nci

After the change:
crw------- 1 0 0 10, 125 Nov 25 14:05 /dev/virtual_nci

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20211125141457.716921-1-cascardo@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_ets: don't peek at classes beyond 'nbands'

when the number of DRR classes decreases, the round-robin active list can
contain elements that have already been freed in ets_qdisc_change(). As a
consequence, it's possible to see a NULL dereference crash, caused by the
attempt to call cl->qdisc->ops->peek(cl->qdisc) when cl->qdisc is NULL:

BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 910 Comm: mausezahn Not tainted 5.16.0-rc1+ #475
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
RIP: 0010:ets_qdisc_dequeue+0x129/0x2c0 [sch_ets]
Code: c5 01 41 39 ad e4 02 00 00 0f 87 18 ff ff ff 49 8b 85 c0 02 00 00 49 39 c4 0f 84 ba 00 00 00 49 8b ad c0 02 00 00 48 8b 7d 10 <48> 8b 47 18 48 8b 40 38 0f ae e8 ff d0 48 89 c3 48 85 c0 0f 84 9d
RSP: 0000:ffffbb36c0b5fdd8 EFLAGS: 00010287
RAX: ffff956678efed30 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffffff9b938dc9 RDI: 0000000000000000
RBP: ffff956678efed30 R08: e2f3207fe360129c R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff956678efeac0
R13: ffff956678efe800 R14: ffff956611545000 R15: ffff95667ac8f100
FS:  00007f2aa9120740(0000) GS:ffff95667b800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 000000011070c000 CR4: 0000000000350ee0
Call Trace:
  <TASK>
  qdisc_peek_dequeued+0x29/0x70 [sch_ets]
  tbf_dequeue+0x22/0x260 [sch_tbf]
  __qdisc_run+0x7f/0x630
  net_tx_action+0x290/0x4c0
  __do_softirq+0xee/0x4f8
  irq_exit_rcu+0xf4/0x130
  sysvec_apic_timer_interrupt+0x52/0xc0
  asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7f2aa7fc9ad4
Code: b9 ff ff 48 8b 54 24 18 48 83 c4 08 48 89 ee 48 89 df 5b 5d e9 ed fc ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa <53> 48 83 ec 10 48 8b 05 10 64 33 00 48 8b 00 48 85 c0 0f 85 84 00
RSP: 002b:00007ffe5d33fab8 EFLAGS: 00000202
RAX: 0000000000000002 RBX: 0000561f72c31460 RCX: 0000561f72c31720
RDX: 0000000000000002 RSI: 0000561f72c31722 RDI: 0000561f72c31720
RBP: 000000000000002a R08: 00007ffe5d33fa40 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 0000561f7187e380
R13: 0000000000000000 R14: 0000000000000000 R15: 0000561f72c31460
  </TASK>
Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt intel_rapl_msr iTCO_vendor_support intel_rapl_common joydev virtio_balloon lpc_ich i2c_i801 i2c_smbus pcspkr ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel serio_raw libata virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod
CR2: 0000000000000018

Ensuring that 'alist' was never zeroed [1] was not sufficient, we need to
remove from the active list those elements that are no more SP nor DRR.

[1] https://lore.kernel.org/netdev/60d274838bf09777f0371253416e8af71360bc08.1633609148.git.dcaratti@redhat.com/

v3: fix race between ets_qdisc_change() and ets_qdisc_dequeue() delisting
    DRR classes beyond 'nbands' in ets_qdisc_change() with the qdisc lock
    acquired, thanks to Cong Wang.

v2: when a NULL qdisc is found in the DRR active list, try to dequeue skb
    from the next list item.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7a5c496eed2d62241620bdbb83eb03fb9d571c99.1637762721.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'acpi-properties'

Merge fix and cleanup related to the management of ACPI device
properties for 5.16-rc3.

* acpi-properties:
ACPI: Make acpi_node_get_parent() local
ACPI: Get acpi_device's parent from the parent field

Merge branch 'pm-sleep'

Merge hibernation-related fixes for 5.16-rc3.

* pm-sleep:
PM: hibernate: Fix snapshot partial write lengths
PM: hibernate: use correct mode for swsusp_close()

net: stmmac: Disable Tx queues when reconfiguring the interface

The Tx queues were not disabled in situations where the driver needed to
stop the interface to apply a new configuration. This could result in a
kernel panic when doing any of the 3 following actions:
* reconfiguring the number of queues (ethtool -L)
* reconfiguring the size of the ring buffers (ethtool -G)
* installing/removing an XDP program (ip l set dev ethX xdp)

Prevent the panic by making sure netif_tx_disable is called when stopping
an interface.

Without this patch, the following kernel panic can be observed when doing
any of the actions above:

Unable to handle kernel paging request at virtual address ffff80001238d040
[....]
Call trace:
  dwmac4_set_addr+0x8/0x10
  dev_hard_start_xmit+0xe4/0x1ac
  sch_direct_xmit+0xe8/0x39c
  __dev_queue_xmit+0x3ec/0xaf0
  dev_queue_xmit+0x14/0x20
[...]
[ end trace 0000000000000002 ]---

Fixes: 5fabb01207a2d ("net: stmmac: Add initial XDP support")
Fixes: aa042f60e4961 ("net: stmmac: Add support to Ethtool get/set ring parameters")
Fixes: 0366f7e06a6be ("net: stmmac: add ethtool support for get/set channels")
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Link: https://lore.kernel.org/r/20211124154731.1676949-1-yannick.vignon@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'char-misc-5.16-rc3' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fix from Greg KH:
"Here is a single binder driver fix for 5.16-rc3.

  It resolves a problem reported in the set of binder fixes that went
  into 5.16-rc1. It has been in linux-next for a while with no reported
  problems"

* tag 'char-misc-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  binder: fix test regression due to sender_euid change

Merge tag 'staging-5.16-rc3' of git://git./linux/kernel/git/gregkh/staging

Pull staging fixes from Greg KH:
"Here are some small staging driver fixes and one driver removal for
  5.16-rc3.

  The fixes resolve a number of small issues found in 5.16-rc1, nothing
  huge at all. The driver removal was due to a platform being removed in
  5.16-rc1, but this driver was forgotten about. It wasn't being built
  anymore so it's safe to delete.

  All have been in linux-next for a while with no reported problems"

* tag 'staging-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect()
  staging: greybus: Add missing rwsem around snd_ctl_remove() calls
  staging: Remove Netlogic XLP network driver
  staging: r8188eu: fix a memory leak in rtw_wx_read32()
  staging: r8188eu: use GFP_ATOMIC under spinlock
  staging: r8188eu: Use kzalloc() with GFP_ATOMIC in atomic context
  staging/fbtft: Fix backlight
  staging: r8188eu: Fix breakage introduced when 5G code was removed

Merge tag 'usb-5.16-rc1' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are a number of small USB fixes for reported problems for
  5.16-rc3

  They include:

   - typec driver fixes

   - new usb-serial driver ids

   - usb hub enumeration issues that were much reported

   - gadget driver fixes

   - dwc3 driver fix

   - chipidea driver fixe

  All of these have been in linux-next with no reported problems"

* tag 'usb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: serial: option: add Fibocom FM101-GL variants
  usb: typec: tipd: Fix initialization sequence for cd321x
  usb: typec: tipd: Fix typo in cd321x_switch_power_state
  usb: hub: Fix locking issues with address0_mutex
  USB: serial: pl2303: fix GC type detection
  USB: serial: option: add Telit LE910S1 0x9200 composition
  usb: chipidea: ci_hdrc_imx: fix potential error pointer dereference in probe
  usb: hub: Fix usb enumeration issue due to address0 race
  usb: typec: fusb302: Fix masking of comparator and bc_lvl interrupts
  usb: dwc3: leave default DMA for PCI devices
  usb: dwc2: hcd_queue: Fix use of floating point literal
  usb: dwc3: gadget: Fix null pointer exception
  usb: gadget: udc-xilinx: Fix an error handling path in 'xudc_probe()'
  usb: xhci: tegra: Check padctrl interrupt presence in device tree
  usb: dwc2: gadget: Fix ISOC flow for elapsed frames
  usb: dwc3: gadget: Check for L1/L2/U3 for Start Transfer
  usb: dwc3: gadget: Ignore NoStream after End Transfer
  usb: dwc3: core: Revise GHWPARAMS9 offset

Merge tag 'mmc-v5.16-rc1' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC host fixes from Ulf Hansson:

- mmc_spi: Add SPI IDs to silence warning

- sdhci: Fix ADMA for PAGE_SIZE >= 64KiB

- sdhci-esdhc-imx: Disable broken CMDQ for imx8qm/imx8qxp/imx8mm

* tag 'mmc-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: spi: Add device-tree SPI IDs
  mmc: sdhci: Fix ADMA for PAGE_SIZE >= 64KiB
  mmc: sdhci-esdhc-imx: disable CMDQ support

Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"I2C has an interrupt storm fix for the i801, better timeout handling
  for the new virtio driver, and some documentation fixes this time"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  docs: i2c: smbus-protocol: mention the repeated start condition
  i2c: virtio: disable timeout handling
  i2c: i801: Fix interrupt storm from SMB_ALERT signal
  i2c: i801: Restore INTREN on unload
  dt-bindings: i2c: imx-lpi2c: Fix i.MX 8QM compatible matching

Merge tag 'for-linus-5.16c-rc3-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

- Kconfig fix to make it possible to control building of the privcmd
   driver

- three fixes for issues identified by the kernel test robot

- a five-patch series to simplify timeout handling for Xen PV driver
   initialization

- two patches to fix error paths in xenstore/xenbus driver
   initialization

* tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: make HYPERVISOR_set_debugreg() always_inline
  xen: make HYPERVISOR_get_debugreg() always_inline
  xen: detect uninitialized xenbus in xenbus_init
  xen: flag xen_snd_front to be not essential for system boot
  xen: flag pvcalls-front to be not essential for system boot
  xen: flag hvc_xen to be not essential for system boot
  xen: flag xen_drm_front to be not essential for system boot
  xen: add "not_essential" flag to struct xenbus_driver
  xen/pvh: add missing prototype to header
  xen: don't continue xenstore initialization in case of errors
  xen/privcmd: make option visible in Kconfig

Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
"Three arm64 fixes.

  The main one is a fix to the way in which we evaluate the macro
  arguments to our uaccess routines, which we _think_ might be the root
  cause behind some unkillable tasks we've seen in the Android arm64 CI
  farm (testing is ongoing). In any case, it's worth fixing.

  Other than that, we've toned down an over-zealous VM_BUG_ON() and
  fixed ftrace stack unwinding in a bunch of cases.

  Summary:

   - Evaluate uaccess macro arguments outside of the critical section

   - Tighten up VM_BUG_ON() in pmd_populate_kernel() to avoid false positive

   - Fix ftrace stack unwinding using HAVE_FUNCTION_GRAPH_RET_ADDR_PTR"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: uaccess: avoid blocking within critical sections
  arm64: mm: Fix VM_BUG_ON(mm != &init_mm) for trans_pgd
  arm64: ftrace: use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR

btrfs: fix the memory leak caused in lzo_compress_pages()

[BUG]
Fstests generic/027 is pretty easy to trigger a slow but steady memory
leak if run with "-o compress=lzo" mount option.

Normally one single run of generic/027 is enough to eat up at least 4G ram.

[CAUSE]
In commit d4088803f511 ("btrfs: subpage: make lzo_compress_pages()
compatible") we changed how @page_in is released.

But that refactoring makes @page_in only released after all pages being
compressed.

This leaves error path not releasing @page_in. And by "error path"
things like incompressible data will also be treated as an error
(-E2BIG).

Thus it can cause a memory leak if even nothing wrong happened.

[FIX]
Add check under @out label to release @page_in when needed, so when we
hit any error, the input page is properly released.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: d4088803f511 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Merge branch 'tls-splice_read-fixes'

Jakub Kicinski says:

====================
tls: splice_read fixes

As I work my way to unlocked and zero-copy TLS Rx the obvious bugs
in the splice_read implementation get harder and harder to ignore.
This is to say the fixes here are discovered by code inspection,
I'm not aware of anyone actually using splice_read.
====================

Link: https://lore.kernel.org/r/20211124232557.2039757-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: test for correct proto_ops

Previous patch fixes overriding callbacks incorrectly. Triggering
the crash in sendpage_locked would be more spectacular but it's
hard to get to, so take the easier path of proving this is broken
and call getname. We're currently getting IPv4 socket info on an
IPv6 socket.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: fix replacing proto_ops

We replace proto_ops whenever TLS is configured for RX. But our
replacement also overrides sendpage_locked, which will crash
unless TX is also configured. Similarly we plug both of those
in for TLS_HW (NIC crypto offload) even tho TLS_HW has a completely
different implementation for TX.

Last but not least we always plug in something based on inet_stream_ops
even though a few of the callbacks differ for IPv6 (getname, release,
bind).

Use a callback building method similar to what we do for struct proto.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Fixes: d4ffb02dee2f ("net/tls: enable sk_msg redirect to tls socket egress")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: test splicing decrypted records

Add tests for half-received and peeked records.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: splice_read: fix accessing pre-processed records

recvmsg() will put peek()ed and partially read records onto the rx_list.
splice_read() needs to consult that list otherwise it may miss data.
Align with recvmsg() and also put partially-read records onto rx_list.
tls_sw_advance_skb() is pretty pointless now and will be removed in
net-next.

Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: test splicing cmsgs

Make sure we correctly reject splicing non-data records.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: splice_read: fix record type check

We don't support splicing control records. TLS 1.3 changes moved
the record type check into the decrypt if(). The skb may already
be decrypted and still be an alert.

Note that decrypt_skb_update() is idempotent and updates ctx->decrypted
so the if() is pointless.

Reorder the check for decryption errors with the content type check
while touching them. This part is not really a bug, because if
decryption failed in TLS 1.3 content type will be DATA, and for
TLS 1.2 it will be correct. Nevertheless its strange to touch output
before checking if the function has failed.

Fixes: fedf201e1296 ("net: tls: Refactor control message handling on recv")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: add tests for handling of bad records

Test broken records.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: factor out cmsg send/receive

Add helpers for sending and receiving special record types.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: add helper for creating sock pairs

We have the same code 3 times, about to add a fourth copy.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'drm-fixes-2021-11-26' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"No idea if turkey comes before pull request processing, but here's the
  regular week's fixes. A bunch for amdgpu, nouveau adds support for a
  new GPU (like a PCI ID addition), and a scattering of fixes across
  i915/hyperv/aspeed/vc4.

  Specifics:

  amdgpu:
   - SRIOV fixes
   - dma-buf double free fix
   - Display fixes for GPU resets
   - Fix DSC powergating regression
   - GPU TSC fixes
   - Interrupt handler overflow fixes
   - Endian fix in IP discovery table handling
   - Aldebaran ASPM fix
   - Fix overclocking regression on older asics
   - Backlight/ACPI fix

  amdkfd:
   - SVM fixes
   - VMA removal race fix

  hyperv:
   - removal fix

  aspeed:
   - vga_pw sysfs file fix

  vc4:
   - error checking fix

  nouveau:
   - support GA106
   - fix a few error checks

  i915:
   - fix wakeref handling around PXP suspend"

* tag 'drm-fixes-2021-11-26' of git://anongit.freedesktop.org/drm/drm: (25 commits)
  drm/amd/display: update bios scratch when setting backlight
  drm/amdgpu/pm: fix powerplay OD interface
  drm/amdgpu: Skip ASPM programming on aldebaran
  drm/amdgpu: fix byteorder error in amdgpu discovery
  drm/amdgpu: enable Navi retry fault wptr overflow
  drm/amdgpu: enable Navi 48-bit IH timestamp counter
  drm/amdkfd: simplify drain retry fault
  drm/amdkfd: handle VMA remove race
  drm/amdkfd: process exit and retry fault race
  drm/amdgpu: IH process reset count when restart
  drm/amdgpu/gfx9: switch to golden tsc registers for renoir+
  drm/amdgpu/gfx10: add wraparound gpu counter check for APUs as well
  drm/amdgpu: move kfd post_reset out of reset_sriov function
  drm/amd/display: Fixed DSC would not PG after removing DSC stream
  drm/amd/display: Reset link encoder assignments for GPU reset
  drm/amd/display: Set plane update flags for all planes in reset
  drm/amd/display: Fix DPIA outbox timeout after GPU reset
  drm/amdgpu: Fix double free of dmabuf
  drm/amdgpu: Fix MMIO HDP flush on SRIOV
  drm/i915/gt: Hold RPM wakelock during PXP suspend
  ...

Merge tag 'drm-intel-fixes-2021-11-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

Fix wakeref handling of PXP suspend.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/YZ65bsPOK+6JLv0d@intel.com

Merge tag 'drm-misc-fixes-2021-11-25' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

One removal fix for hyperv, one fix in aspeed for the vga_pw sysfs file
content, one error-checking fix for vc4 and two fixes for nouveau, one
to support a new device and another one to properly check for errors.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20211125101819.ynu7zgbs7yfwedri@houat

Merge tag 'amd-drm-fixes-5.16-2021-11-24' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-5.16-2021-11-24:

amdgpu:
- SRIOV fixes
- dma-buf double free fix
- Display fixes for GPU resets
- Fix DSC powergating regression
- GPU TSC fixes
- Interrupt handler overflow fixes
- Endian fix in IP discovery table handling
- Aldebaran ASPM fix
- Fix overclocking regression on older asics
- Backlight/ACPI fix

amdkfd:
- SVM fixes
- VMA removal race fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211124212056.6327-1-alexander.deucher@amd.com

Merge tag 'block-5.16-2021-11-25' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request via Christoph:
      - Add a NO APST quirk for a Kioxia device (Enzo Matsumiya)
      - Fix write zeroes pi (Klaus Jensen)
      - Various TCP transport fixes (Maurizio Lombardi and Varun
        Prakash)
      - Ignore invalid fast_io_fail_tmo values (Maurizio Lombardi)
      - Use IOCB_NOWAIT only if the filesystem supports it (Maurizio
        Lombardi)

- Module loading fix (Ming)

- Kerneldoc warning fix (Yang)

* tag 'block-5.16-2021-11-25' of git://git.kernel.dk/linux-block:
  block: fix parameter not described warning
  nvmet: use IOCB_NOWAIT only if the filesystem supports it
  nvme: fix write zeroes pi
  nvme-fabrics: ignore invalid fast_io_fail_tmo values
  nvme-pci: add NO APST quirk for Kioxia device
  nvme-tcp: fix memory leak when freeing a queue
  nvme-tcp: validate R2T PDU in nvme_tcp_handle_r2t()
  nvmet-tcp: fix incomplete data digest send
  nvmet-tcp: fix memory leak when performing a controller reset
  nvmet-tcp: add an helper to free the cmd buffers
  nvmet-tcp: fix a race condition between release_queue and io_work
  block: avoid to touch unloaded module instance when opening bdev

Merge tag 'io_uring-5.16-2021-11-25' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"A locking fix for link traversal, and fixing up an outdated function
  name in a comment"

* tag 'io_uring-5.16-2021-11-25' of git://git.kernel.dk/linux-block:
  io_uring: correct link-list traversal locking
  io_uring: fix missed comment from *task_file rename

Merge tag '5.16-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Four small cifs/smb3 fixes:

   - two multichannel fixes

   - fix problem noted by kernel test robot

   - update internal version number"

* tag '5.16-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal version number
  smb2: clarify rc initialization in smb2_reconnect
  cifs: populate server_hostname for extra channels
  cifs: nosharesock should be set on new server

Merge tag 'asm-generic-5.16-2' of git://git./linux/kernel/git/arnd/asm-generic

Pull asm-generic syscall table update from Arnd Bergmann:
"André Almeida sends an update for the newly added futex_waitv syscall
  that was initially only added to a few architectures.

  Some additional ones have since made it through architecture
  maintainer trees, this finishes the remaining ones"

* tag 'asm-generic-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  futex: Wireup futex_waitv syscall

Merge tag 'arm-fixes-5.16-2' of git://git./linux/kernel/git/soc/soc

Pull ARM SoC fixes from Arnd Bergmann:
"There are only a few devicetree fixes this time:

   - one outdated devicetree property that slipped into the newly added
     ExynosAutov9 support

   - three changes to Broadcom SoCs that had incorrect number values for
     interrupts or irqchips.

  In the MAINTAINERS file, Nishanth Menon gets listed for TI K3 SoCs,
  while Taichi Sugaya and Takao Orito take ownership of the Socionext
  Milbeaut platform.

  All other changes are for SoC specific drivers, fixing:

   - A missing NULL pointer check in the mediatek memory driver

   - An integer overflow issue in the Arm smccc firwmare interface

   - A false-positive fortify-source check

   - Error handling fixes for optee and smci

   - Incorrect message format in one SCMI call"

* tag 'arm-fixes-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  memory: mtk-smi: Fix a null dereference for the ostd
  arm64: dts: exynos: drop samsung,ufs-shareability-reg-offset in ExynosAutov9
  MAINTAINERS: Update maintainer entry for keystone platforms
  MAINTAINERS: Add entry to MAINTAINERS for Milbeaut
  firmware: smccc: Fix check for ARCH_SOC_ID not implemented
  ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE
  firmware: arm_scmi: Fix type error assignment in voltage protocol
  firmware: arm_scmi: Fix type error in sensor protocol
  firmware: arm_scmi: pm: Propagate return value to caller
  firmware: arm_scmi: Fix base agent discover response
  optee: fix kfree NULL pointer
  ARM: dts: bcm2711: Fix PCIe interrupts
  ARM: dts: BCM5301X: Add interrupt properties to GPIO node
  ARM: dts: BCM5301X: Fix I2C controller interrupt
  firmware: arm_scmi: Fix null de-reference on error path

Merge tag 'folio-5.16b' of git://git.infradead.org/users/willy/pagecache

Pull folio fixes from Matthew Wilcox:
"In the course of preparing the folio changes for iomap for next merge
  window, we discovered some problems that would be nice to address now:

   - Renaming multi-page folios to large folios.

     mapping_multi_page_folio_support() is just a little too long, so we
     settled on mapping_large_folio_support(). That meant renaming, eg
     folio_test_multi() to folio_test_large().

     Rename AS_THP_SUPPORT to match

   - I hadn't included folio wrappers for zero_user_segments(), etc.
     Also, multi-page^W^W large folio support is now independent of
     CONFIG_TRANSPARENT_HUGEPAGE, so machines with HIGHMEM always need
     to fall back to the out-of-line zero_user_segments().

     Remove FS_THP_SUPPORT to match

   - The build bots finally got round to telling me that I missed a
     couple of architectures when adding flush_dcache_folio(). Christoph
     suggested that we just add linux/cacheflush.h and not rely on
     asm-generic/cacheflush.h"

* tag 'folio-5.16b' of git://git.infradead.org/users/willy/pagecache:
  mm: Add functions to zero portions of a folio
  fs: Rename AS_THP_SUPPORT and mapping_thp_support
  fs: Remove FS_THP_SUPPORT
  mm: Remove folio_test_single
  mm: Rename folio_test_multi to folio_test_large
  Add linux/cacheflush.h

block: fix parameter not described warning

The build warning:
block/blk-core.c:968: warning: Function parameter or member 'iob'
not described in 'bio_poll'.

Fixes: 5a72e899ceb4 ("block: add a struct io_comp_batch argument to fops->iopoll()")
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mdio: aspeed: Fix "Link is Down" issue

The issue happened randomly in runtime. The message "Link is Down" is
popped but soon it recovered to "Link is Up".

The "Link is Down" results from the incorrect read data for reading the
PHY register via MDIO bus. The correct sequence for reading the data
shall be:
1. fire the command
2. wait for command done (this step was missing)
3. wait for data idle
4. read data from data register

Cc: stable@vger.kernel.org
Fixes: f160e99462c6 ("net: phy: Add mdio-aspeed")
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Dylan Hung <dylan_hung@aspeedtech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20211125024432.15809-1-dylan_hung@aspeedtech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

igb: fix netpoll exit with traffic

Oleksandr brought a bug report where netpoll causes trace
messages in the log on igb.

Danielle brought this back up as still occurring, so we'll try
again.

[22038.710800] ------------[ cut here ]------------
[22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll
[22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155 netpoll_poll_dev+0x18a/0x1a0

As Alex suggested, change the driver to return work_done at the
exit of napi_poll, which should be safe to do in this driver
because it is not polling multiple queues in this single napi
context (multiple queues attached to one MSI-X vector). Several
other drivers contain the same simple sequence, so I hope
this will not create new problems.

Fixes: 16eb8815c235 ("igb: Refactor clean_rx_irq to reduce overhead and improve performance")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Danielle Ratson <danieller@nvidia.com>
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Danielle Ratson <danieller@nvidia.com>
Link: https://lore.kernel.org/r/20211123204000.1597971-1-jesse.brandeburg@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xen: make HYPERVISOR_set_debugreg() always_inline

HYPERVISOR_set_debugreg() is being called from noinstr code, so it
should be attributed "always_inline".

Fixes: 7361fac0465ba96ec8f ("x86/xen: Make set_debugreg() noinstr")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20211125092056.24758-3-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

xen: make HYPERVISOR_get_debugreg() always_inline

HYPERVISOR_get_debugreg() is being called from noinstr code, so it
should be attributed "always_inline".

Fixes: f4afb713e5c3a4419ba ("x86/xen: Make get_debugreg() noinstr")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20211125092056.24758-2-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Merge tag 'nvme-5.16-2021-11-25' of git://git.infradead.org/nvme into block-5.16

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.16

- add a NO APST quirk for a Kioxia device (Enzo Matsumiya)
- fix write zeroes pi (Klaus Jensen)
- various TCP transport fixes (Maurizio Lombardi and Varun Prakash)
- ignore invalid fast_io_fail_tmo values (Maurizio Lombardi)
- use IOCB_NOWAIT only if the filesystem supports it (Maurizio Lombardi)"

* tag 'nvme-5.16-2021-11-25' of git://git.infradead.org/nvme:
  nvmet: use IOCB_NOWAIT only if the filesystem supports it
  nvme: fix write zeroes pi
  nvme-fabrics: ignore invalid fast_io_fail_tmo values
  nvme-pci: add NO APST quirk for Kioxia device
  nvme-tcp: fix memory leak when freeing a queue
  nvme-tcp: validate R2T PDU in nvme_tcp_handle_r2t()
  nvmet-tcp: fix incomplete data digest send
  nvmet-tcp: fix memory leak when performing a controller reset
  nvmet-tcp: add an helper to free the cmd buffers
  nvmet-tcp: fix a race condition between release_queue and io_work

nvmet: use IOCB_NOWAIT only if the filesystem supports it

Submit I/O requests with the IOCB_NOWAIT flag set only if
the underlying filesystem supports it.

Fixes: 50a909db36f2 ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

memory: mtk-smi: Fix a null dereference for the ostd

We add the ostd setting for mt8195. It introduces a KE for the
previous SoC which doesn't have ostd setting. This is the log:

Unable to handle kernel NULL pointer dereference at virtual address
0000000000000080
...
pc : mtk_smi_larb_config_port_gen2_general+0x64/0x130
lr : mtk_smi_larb_resume+0x54/0x98
...
Call trace:
mtk_smi_larb_config_port_gen2_general+0x64/0x130
pm_generic_runtime_resume+0x2c/0x48
__genpd_runtime_resume+0x30/0xa8
genpd_runtime_resume+0x94/0x2c8
__rpm_callback+0x44/0x150
rpm_callback+0x6c/0x78
rpm_resume+0x310/0x558
__pm_runtime_resume+0x3c/0x88

In the code: larbostd = larb->larb_gen->ostd[larb->larbid],
if "larb->larb_gen->ostd" is null, the "larbostd" is the offset(e.g.
0x80 above), it's also a valid value, then accessing "larbostd[i]" in the
"for" loop will cause the KE above. To avoid this issue, initialize
"larbostd" to NULL when the SoC doesn't have ostd setting.

Fixes: fe6dd2a4017d ("memory: mtk-smi: mt8195: Add initial setting for smi-larb")
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211108082429.15080-1-yong.wu@mediatek.com
Link: https://lore.kernel.org/r/20211124085042.9649-3-krzysztof.kozlowski@canonical.com'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

arm64: dts: exynos: drop samsung,ufs-shareability-reg-offset in ExynosAutov9

samsung,ufs-shareability-reg-offset is not necessary anymore since it
was integrated into the second argument of samsung,sysreg.

Fixes: 31bbac5263aa ("arm64: dts: exynos: add initial support for exynosautov9 SoC")
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211102064826.15796-1-chanho61.park@samsung.com
Link: https://lore.kernel.org/r/20211124085042.9649-2-krzysztof.kozlowski@canonical.com'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

MAINTAINERS: Update maintainer entry for keystone platforms

Switch the kernel tree for keystone to the consolidated ti tree and add
myself as primary maintainer for keystone platforms to offset Santosh's
workload.

Signed-off-by: Nishanth Menon <nm@ti.com>
Acked-by: Santosh Shilimkar <ssantosh@kernel.org>
Link: https://lore.kernel.org/r/20211123001725.21422-1-nm@ti.com'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

futex: Wireup futex_waitv syscall

Wireup futex_waitv syscall for all remaining archs.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

fuse: release pipe buf after last use

Checking buf->flags should be done before the pipe_buf_release() is called
on the pipe buffer, since releasing the buffer might modify the flags.

This is exactly what page_cache_pipe_buf_release() does, and which results
in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was
trying to fix.

Reported-by: Justin Forbes <jmforbes@linuxtx.org>
Fixes: 712a951025c0 ("fuse: fix page stealing")
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Merge tag 'usb-serial-5.16-rc3' of https://git./linux/kernel/git/johan/usb-serial into usb-linus

Johan writes:

USB-serial fixes for 5.16-rc3

Here's a fix for a pl2303 type-detection issue and some new modem
devices ids.

All but the last commit have been in linux-next, and with no reported
issues.

* tag 'usb-serial-5.16-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
  USB: serial: option: add Fibocom FM101-GL variants
  USB: serial: pl2303: fix GC type detection
  USB: serial: option: add Telit LE910S1 0x9200 composition

Merge branch 'net-smc-fixes-2021-11-24'

Karsten Graul says:

====================
net/smc: fixes 2021-11-24

Patch 1 from DaXing fixes a possible loop in smc_listen().
Patch 2 prevents a NULL pointer dereferencing while iterating
over the lower network devices.
====================

Link: https://lore.kernel.org/r/20211124123238.471429-1-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: Fix loop in smc_listen

The kernel_listen function in smc_listen will fail when all the available
ports are occupied. At this point smc->clcsock->sk->sk_data_ready has
been changed to smc_clcsock_data_ready. When we call smc_listen again,
now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point
to the smc_clcsock_data_ready function.

The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which
now points to itself resulting in an infinite loop.

This patch restores smc->clcsock->sk->sk_data_ready with the old value.

Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Guo DaXing <guodaxing@huawei.com>
Acked-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk()

Coverity reports a possible NULL dereferencing problem:

in smc_vlan_by_tcpsk():
6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
1623 ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
1624 if (is_vlan_dev(ndev)) {

Remove the manual implementation and use netdev_walk_all_lower_dev() to
iterate over the lower devices. While on it remove an obsolete function
parameter comment.

Fixes: cb9d43f67754 ("net/smc: determine vlan_id of stacked net_device")
Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>