Filipe Manana [Tue, 15 Mar 2022 15:22:39 +0000 (15:22 +0000)]
btrfs: lock the inode first before flushing range when punching hole
When doing hole punching we are flushing delalloc and waiting for ordered
extents to complete before locking the inode (VFS lock and the btrfs
specific i_mmap_lock). This is fine because even if a write happens after
we call btrfs_wait_ordered_range() and before we lock the inode (call
btrfs_inode_lock()), we will notice the write at
btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered
extent.
We can however make this simpler by locking first the inode an then call
btrfs_wait_ordered_range(), which will allow us to remove the ordered
extent lookup logic from btrfs_punch_hole_lock_range() in the next patch.
It also makes the behaviour the same as plain fallocate, hole punching
and reflinks.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 15 Mar 2022 15:22:38 +0000 (15:22 +0000)]
btrfs: remove ordered extent check and wait during fallocate
For fallocate() we have this loop that checks if we have ordered extents
after locking the file range, and if so unlock the range, wait for ordered
extents, and retry until we don't find more ordered extents.
This logic was needed in the past because:
1) Direct IO writes within the i_size boundary did not take the inode's
VFS lock. This was because that lock used to be a mutex, then some
years ago it was switched to a rw semaphore (commit
9902af79c01a8e
("parallel lookups: actual switch to rwsem")), and then btrfs was
changed to take the VFS inode's lock in shared mode for writes that
don't cross the i_size boundary (commit
e9adabb9712ef9 ("btrfs: use
shared lock for direct writes within EOF"));
2) We could race with memory mapped writes, because memory mapped writes
don't acquire the inode's VFS lock. We don't have that race anymore,
as we have a rw semaphore to synchronize memory mapped writes with
fallocate (and reflinking too). That change happened with commit
8d9b4a162a37ce ("btrfs: exclude mmap from happening during all
fallocate operations").
So stop looking for ordered extents after locking the file range when
doing a plain fallocate.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 15 Mar 2022 15:22:37 +0000 (15:22 +0000)]
btrfs: remove inode_dio_wait() calls when starting reflink operations
When starting a reflink operation we have these calls to inode_dio_wait()
which used to be needed because direct IO writes that don't cross the
i_size boundary did not take the inode's VFS lock, so we could race with
them and end up with ordered extents in target range after calling
btrfs_wait_ordered_range().
However that is not the case anymore, because the inode's VFS lock was
changed from a mutex to a rw semaphore, by commit
9902af79c01a8e
("parallel lookups: actual switch to rwsem"), and several years later we
started to lock the inode's VFS lock in shared mode for direct IO writes
that don't cross the i_size boundary (commit
e9adabb9712ef9 ("btrfs: use
shared lock for direct writes within EOF")).
So remove those inode_dio_wait() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 15 Mar 2022 15:22:36 +0000 (15:22 +0000)]
btrfs: remove useless dio wait call when doing fallocate zero range
When starting a fallocate zero range operation, before getting the first
extent map for the range, we make a call to inode_dio_wait().
This logic was needed in the past because direct IO writes within the
i_size boundary did not take the inode's VFS lock. This was because that
lock used to be a mutex, then some years ago it was switched to a rw
semaphore (by commit
9902af79c01a8e ("parallel lookups: actual switch to
rwsem")), and then btrfs was changed to take the VFS inode's lock in
shared mode for writes that don't cross the i_size boundary (done in
commit
e9adabb9712ef9 ("btrfs: use shared lock for direct writes within
EOF")). The lockless direct IO writes could result in a race with the
zero range operation, resulting in the later getting a stale extent
map for the range.
So remove this no longer needed call to inode_dio_wait(), as fallocate
takes the inode's VFS lock in exclusive mode and direct IO writes within
i_size take that same lock in shared mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 15 Mar 2022 15:22:35 +0000 (15:22 +0000)]
btrfs: only reserve the needed data space amount during fallocate
During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space then we need, which can result in several
undesired behaviours:
1) We fail with -ENOSPC. For example the passed range has a length
of 1G, but there's only one hole with a size of 1M in that range;
2) We temporarily reserve excessive data space that could be used by
other operations happening concurrently;
3) By reserving much more data space then we need, we can end up
doing expensive things like triggering dellaloc for other inodes,
waiting for the ordered extents to complete, trigger transaction
commits, allocate new block groups, etc.
Example:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f -b 1g $DEV
mount $DEV $MNT
# Create a file with a size of 600M and two holes, one at [200M, 201M[
# and another at [401M, 402M[
xfs_io -f -c "pwrite -S 0xab 0 200M" \
-c "pwrite -S 0xcd 201M 200M" \
-c "pwrite -S 0xef 402M 198M" \
$MNT/foobar
# Now call fallocate against the whole file range, see if it fails
# with -ENOSPC or not - it shouldn't since we only need to allocate
# 2M of data space.
xfs_io -c "falloc 0 600M" $MNT/foobar
umount $MNT
$ ./test.sh
(...)
wrote
209715200/
209715200 bytes at offset 0
200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
wrote
209715200/
209715200 bytes at offset
210763776
200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
wrote
207618048/
207618048 bytes at offset
421527552
198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
fallocate: No space left on device
$
So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc
in the range because we've taken the inode's VFS lock in exclusive mode,
we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.
This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which
is referenced by the Link tag below.
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sweet Tea Dorminy [Fri, 8 Apr 2022 17:15:07 +0000 (13:15 -0400)]
btrfs: restore inode creation before xattr setting
According to the tree checker, "all xattrs with a given objectid follow
the inode with that objectid in the tree" is an invariant. This was
broken by the recent change "btrfs: move common inode creation code into
btrfs_create_new_inode()", which moved acl creation and property
inheritance (stored in xattrs) to before inode insertion into the tree.
As a result, under certain timings, the xattrs could be written to the
tree before the inode, causing the tree checker to report violation of
the invariant.
Move property inheritance and acl creation back to their old ordering
after the inode insertion.
Suggested-by: Omar Sandoval <osandov@osandov.com>
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 15 Mar 2022 01:12:35 +0000 (18:12 -0700)]
btrfs: move common inode creation code into btrfs_create_new_inode()
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:
1. It reduces code duplication.
2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().
3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.
4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 15 Mar 2022 01:12:34 +0000 (18:12 -0700)]
btrfs: reserve correct number of items for inode creation
The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.
btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 15 Mar 2022 01:12:33 +0000 (18:12 -0700)]
btrfs: factor out common part of btrfs_{mknod,create,mkdir}()
btrfs_{mknod,create,mkdir}() are now identical other than the inode
initialization and some inconsequential function call order differences.
Factor out the common code to reduce code duplication.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 15 Mar 2022 01:12:32 +0000 (18:12 -0700)]
btrfs: allocate inode outside of btrfs_new_inode()
Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.
This also removes the comments about Smack checking i_op, which are no
longer true since commit
5d6c31910bc0 ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 15 Mar 2022 10:01:33 +0000 (18:01 +0800)]
btrfs: warn when extent buffer leak test fails
Although we have btrfs_extent_buffer_leak_debug_check() (enabled by
CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have
some extent buffer leakage, it's just pr_err(), not noisy enough for
fstests to cache.
So here we trigger a WARN_ON() if the allocated_ebs list is not empty.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 14 Mar 2022 02:09:29 +0000 (10:09 +0800)]
btrfs: use a local variable for fs_devices pointer in btrfs_dev_replace_finishing
In the function btrfs_dev_replace_finishing, we dereferenced
fs_info->fs_devices 6 times. Use keep local variable for that.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:51 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in btrfs_listxattr
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:50 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:49 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:48 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in process_all_extents
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:47 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in process_all_new_xattrs
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:46 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in process_all_refs
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:45 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in is_ancestor
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:44 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in can_rmdir
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:43 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in did_create_dir
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:42 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in btrfs_real_readdir
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:41 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:40 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in mark_block_group_to_copy
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:39 +0000 (14:50 +0100)]
btrfs: use btrfs_for_each_slot in find_first_block_group
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gabriel Niebler [Wed, 9 Mar 2022 13:50:38 +0000 (14:50 +0100)]
btrfs: introduce btrfs_for_each_slot iterator macro
There is a common pattern when searching for a key in btrfs:
* Call btrfs_search_slot to find the slot for the key
* Enter an endless loop:
* If the found slot is larger than the no. of items in the current
leaf, check the next leaf
* If it's still not found in the next leaf, terminate the loop
* Otherwise do something with the found key
* Increment the current slot and continue
To reduce code duplication, we can replace this code pattern with an
iterator macro, similar to the existing for_each_X macros found
elsewhere in the kernel. This also makes the code easier to understand
for newcomers by putting a name to the encapsulated functionality.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 13 Mar 2022 10:40:02 +0000 (18:40 +0800)]
btrfs: scrub: rename scrub_bio::pagev and related members
Since the subpage support for scrub, one page no longer always represents
one sector, thus scrub_bio::pagev and scrub_bio::sector_count are no
longer accurate.
Rename them to scrub_bio::sectors and scrub_bio::sector_count respectively.
This also involves scrub_ctx::pages_per_bio and other macros involved.
Now the renaming of pages involved in scrub is be finished.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 13 Mar 2022 10:40:01 +0000 (18:40 +0800)]
btrfs: scrub: rename scrub_page to scrub_sector
Since the subpage support of scrub, scrub_sector is in fact just
representing one sector.
Thus the name scrub_page is no longer correct, rename it to
scrub_sector.
This also involves the following renames:
- spage -> sector
Normally we would just replace "page" with "sector" and result
something like "ssector".
But the repeating 's' is not really eye friendly.
So here we just simple use "sector", as there is nothing from MM layer
called "sector" to cause any confusion.
- scrub_parity::spages -> sectors_list
Normally we use plural to indicate an array, not a list.
Rename it to @sectors_list to be more explicit on the list part.
- Also reformat and update comments that get changed
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 13 Mar 2022 10:40:00 +0000 (18:40 +0800)]
btrfs: scrub: rename members related to scrub_block::pagev
The following will be renamed in this patch:
- scrub_block::pagev -> sectors
- scrub_block::page_count -> sector_count
- SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK
- page_num -> sector_num to iterate scrub_block::sectors
For now scrub_page is not yet renamed to keep the patch reasonable and
it will be updated in a followup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Mar 2022 11:35:34 +0000 (11:35 +0000)]
btrfs: remove trivial wrapper btrfs_read_buffer()
The function btrfs_read_buffer() is useless, it just calls
btree_read_extent_buffer_pages() with exactly the same arguments.
So remove it and rename btree_read_extent_buffer_pages() to
btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
prefix (since it's used outside disk-io.c) and the name is clear enough
about what it does.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Mar 2022 11:35:33 +0000 (11:35 +0000)]
btrfs: update outdated comment for read_block_for_search()
The comment at the top of read_block_for_search() is very outdated, as it
refers to the blocking versus spinning path locking modes. We no longer
have these two locking modes after we switched the btree locks from custom
code to rw semaphores. So update the comment to stop referring to the
blocking mode and put it more up to date.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Mar 2022 11:35:32 +0000 (11:35 +0000)]
btrfs: release upper nodes when reading stale btree node from disk
When reading a btree node (or leaf), at read_block_for_search(), if we
can't find its extent buffer in the cache (the fs_info->buffer_radix
radix tree), then we unlock all upper level nodes before reading the
btree node/leaf from disk, to prevent blocking other tasks for too long.
However if we find that the extent buffer is in the cache but it is not
up to date, we don't unlock upper level nodes before reading it from disk,
potentially blocking other tasks on upper level nodes for too long.
Fix this inconsistent behaviour by unlocking upper level nodes if we need
to read a node/leaf from disk because its in-memory extent buffer is not
up to date. If we unlocked upper level nodes then we must return -EAGAIN
to the caller, just like the case where the extent buffer is not cached in
memory. And like that case, we determine if upper level nodes are locked
by checking only if the parent node is locked - if it isn't, then no other
upper level nodes are locked.
This is actually a rare case, as if we have an extent buffer in memory,
it typically has the uptodate flag set and passes all the checks done by
btrfs_buffer_uptodate().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Mar 2022 11:35:31 +0000 (11:35 +0000)]
btrfs: avoid unnecessary btree search restarts when reading node
When reading a btree node, at read_block_for_search(), if we don't find
the node's (or leaf) extent buffer in the cache, we will read it from
disk. Since that requires waiting on IO, we release all upper level nodes
from our path before reading the target node/leaf, and then return -EAGAIN
to the caller, which will make the caller restart the while btree search.
However we are causing the restart of btree search even for cases where
it is not necessary:
1) We have a path with ->skip_locking set to true, typically when doing
a search on a commit root, so we are never holding locks on any node;
2) We are doing a read search (the "ins_len" argument passed to
btrfs_search_slot() is 0), or we are doing a search to modify an
existing key (the "cow" argument passed to btrfs_search_slot() has
a value of 1 and "ins_len" is 0), in which case we never hold locks
for upper level nodes;
3) We are doing a search to insert or delete a key, in which case we may
or may not have upper level nodes locked. That depends on the current
minimum write lock levels at btrfs_search_slot(), if we had to split
or merge parent nodes, if we had to COW upper level nodes and if
we ever visited slot 0 of an upper level node. It's still common to
not have upper level nodes locked, but our current node must be at
least at level 1, for insertions, or at least at level 2 for deletions.
In these cases when we have locks on upper level nodes, they are always
write locks.
These cases where we are not holding locks on upper level nodes far
outweigh the cases where we are holding locks, so it's completely wasteful
to retry the whole search when we have no upper nodes locked.
So change the logic to not return -EAGAIN, and make the caller retry the
search, when we don't have the parent node locked - when it's not locked
it means no other upper level nodes are locked as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:42 +0000 (17:31 -0800)]
btrfs: set inode flags earlier in btrfs_new_inode()
btrfs_new_inode() inherits the inode flags from the parent directory and
the mount options _after_ we fill the inode item. This works because all
of the callers of btrfs_new_inode() make further changes to the inode
and then call btrfs_update_inode(). It'd be better to fully initialize
the inode once to avoid the extra update, so as a first step, set the
inode flags _before_ filling the inode item.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:41 +0000 (17:31 -0800)]
btrfs: move btrfs_get_free_objectid() call into btrfs_new_inode()
Every call of btrfs_new_inode() is immediately preceded by a call to
btrfs_get_free_objectid(). Since getting an inode number is part of
creating a new inode, this is better off being moved into
btrfs_new_inode(). While we're here, get rid of the comment about
reclaiming inode numbers, since we only did that when using the ino
cache, which was removed by commit
5297199a8bca ("btrfs: remove inode
number cache feature").
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:40 +0000 (17:31 -0800)]
btrfs: don't pass parent objectid to btrfs_new_inode() explicitly
For everything other than a subvolume root inode, we get the parent
objectid from the parent directory. For the subvolume root inode, the
parent objectid is the same as the inode's objectid. We can find this
within btrfs_new_inode() instead of passing it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:39 +0000 (17:31 -0800)]
btrfs: remove redundant name and name_len parameters to create_subvol
The passed dentry already contains the name.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:38 +0000 (17:31 -0800)]
btrfs: remove unused mnt_userns parameter from __btrfs_set_acl
Commit
4a8b34afa9c9 ("btrfs: handle ACLs on idmapped mounts") added this
parameter but didn't use it. __btrfs_set_acl() is the low-level helper
that writes an ACL to disk. The higher-level btrfs_set_acl() is the one
that translates the ACL based on the user namespace.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:37 +0000 (17:31 -0800)]
btrfs: remove unnecessary set_nlink() in btrfs_create_subvol_root()
btrfs_new_inode() already returns an inode with nlink set to 1 (via
inode_init_always()). Get rid of the unnecessary set.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:36 +0000 (17:31 -0800)]
btrfs: remove unnecessary inode_set_bytes(0) call
new_inode() always returns an inode with i_blocks and i_bytes set to 0
(via inode_init_always()). Remove the unnecessary call to
inode_set_bytes() in btrfs_new_inode().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:35 +0000 (17:31 -0800)]
btrfs: remove unnecessary btrfs_i_size_write(0) calls
btrfs_new_inode() always returns an inode with i_size and disk_i_size
set to 0 (via inode_init_always() and btrfs_alloc_inode(),
respectively). Remove the unnecessary calls to btrfs_i_size_write() in
btrfs_mkdir() and btrfs_create_subvol_root().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:34 +0000 (17:31 -0800)]
btrfs: get rid of btrfs_add_nondir()
This is a trivial wrapper around btrfs_add_link(). The only thing it
does other than moving arguments around is translating a > 0 return
value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
be broken). The check itself dates back to commit
2c90e5d65842 ("Btrfs:
still corruption hunting"), so it's probably left over from debugging.
Let's just get rid of btrfs_add_nondir().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:33 +0000 (17:31 -0800)]
btrfs: fix anon_dev leak in create_subvol()
When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or
btrfs_insert_root() fail in create_subvol(), we return without freeing
anon_dev. Reorganize the error handling in create_subvol() to fix this.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:32 +0000 (17:31 -0800)]
btrfs: reserve correct number of items for rename
btrfs_rename() and btrfs_rename_exchange() don't account for enough
items. Replace the incorrect explanations with a specific breakdown of
the number of items and account them accurately.
Note that this glosses over RENAME_WHITEOUT because the next commit is
going to rework that, too.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Mar 2022 01:31:31 +0000 (17:31 -0800)]
btrfs: reserve correct number of items for unlink and rmdir
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent
directory in order to update its size and sequence number. Make sure we
account for it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Mon, 16 May 2022 01:08:58 +0000 (18:08 -0700)]
Linux 5.18-rc7
Linus Torvalds [Sun, 15 May 2022 15:08:51 +0000 (08:08 -0700)]
Merge tag 'driver-core-5.18-rc7' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here is one fix, and three documentation updates for 5.18-rc7.
The fix is for the firmware loader which resolves a long-reported
problem where the credentials of the firmware loader could be set to a
userspace process without enough permissions to actually load the
firmware image. Many Android vendors have been reporting this for
quite some time.
The documentation updates are for the embargoed-hardware-issues.rst
file to add a new entry, change an existing one, and sort the list to
make changes easier in the future.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Documentation/process: Update ARM contact for embargoed hardware issues
Documentation/process: Add embargoed HW contact for Ampere Computing
Documentation/process: Make groups alphabetical and use tabs consistently
firmware_loader: use kernel credentials when reading firmware
Linus Torvalds [Sun, 15 May 2022 15:07:07 +0000 (08:07 -0700)]
Merge tag 'char-misc-5.18-rc7' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are two small driver fixes for 5.18-rc7 that resolve reported
problems:
- slimbus driver irq bugfix
- interconnect sync state bugfix
Both of these have been in linux-next with no reported problems"
* tag 'char-misc-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
slimbus: qcom: Fix IRQ check in qcom_slim_probe
interconnect: Restore sync state by ignoring ipa-virt in provider count
Linus Torvalds [Sun, 15 May 2022 15:05:04 +0000 (08:05 -0700)]
Merge tag 'tty-5.18-rc7' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are some small tty n_gsm and serial driver fixes for 5.18-rc7
that resolve reported problems. They include:
- n_gsm fixes for reported issues
- 8250_mtk driver fixes for some platforms
- fsl_lpuart driver fix for reported problem.
- digicolor driver fix for reported problem.
All have been in linux-next for a while with no reported problems"
* tag 'tty-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
fsl_lpuart: Don't enable interrupts too early
tty: n_gsm: fix invalid gsmtty_write_room() result
tty: n_gsm: fix mux activation issues in gsm_config()
tty: n_gsm: fix buffer over-read in gsm_dlci_data()
serial: 8250_mtk: Fix register address for XON/XOFF character
serial: 8250_mtk: Make sure to select the right FEATURE_SEL
serial: 8250_mtk: Fix UART_EFR register address
tty/serial: digicolor: fix possible null-ptr-deref in digicolor_uart_probe()
Linus Torvalds [Sun, 15 May 2022 15:03:24 +0000 (08:03 -0700)]
Merge tag 'usb-5.18-rc7' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small fixes for reported issues with some USB drivers.
They include:
- xhci fixes for xhci-mtk platform driver
- typec driver fixes for reported problems.
- cdc-wdm read-stuck fix
- gadget driver fix for reported race condition
- new usb-serial driver ids
All of these have been in linux-next with no reported problems"
* tag 'usb-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: xhci-mtk: remove bandwidth budget table
usb: xhci-mtk: fix fs isoc's transfer error
usb: gadget: fix race when gadget driver register via ioctl
usb: typec: tcpci_mt6360: Update for BMC PHY setting
usb: gadget: uvc: allow for application to cleanly shutdown
usb: typec: tcpci: Don't skip cleanup in .remove() on error
usb: cdc-wdm: fix reading stuck on device close
USB: serial: qcserial: add support for Sierra Wireless EM7590
USB: serial: option: add Fibocom MA510 modem
USB: serial: option: add Fibocom L610 modem
USB: serial: pl2303: add device id for HP LM930 Display
Linus Torvalds [Sun, 15 May 2022 13:46:03 +0000 (06:46 -0700)]
Merge tag 'powerpc-5.18-5' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
- Fix KVM PR on 32-bit, which was broken by some MMU code refactoring.
Thanks to: Alexander Graf, and Matt Evans.
* tag 'powerpc-5.18-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S PR: Enable MSR_DR for switch_mmu_context()
Linus Torvalds [Sun, 15 May 2022 13:42:40 +0000 (06:42 -0700)]
Merge tag 'x86-urgent-2022-05-15' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the handling of unpopulated sub-pmd spaces.
The copy & pasta from the corresponding s390 code screwed up the
address calculation for marking the sub-pmd ranges via memset by
omitting the ALIGN_DOWN() to calculate the proper start address.
It's a mystery why this code is not generic and shared because there
is nothing architecture specific in there, but that's too intrusive
for a backportable fix"
* tag 'x86-urgent-2022-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix marking of unused sub-pmd ranges
Linus Torvalds [Sun, 15 May 2022 13:40:11 +0000 (06:40 -0700)]
Merge tag 'sched-urgent-2022-05-15' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"The recent expansion of the sched switch tracepoint inserted a new
argument in the middle of the arguments. This reordering broke BPF
programs which relied on the old argument list.
While tracepoints are not considered stable ABI, it's not trivial to
make BPF cope with such a change, but it's being worked on. For now
restore the original argument order and move the new argument to the
end of the argument list"
* tag 'sched-urgent-2022-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/tracing: Append prev_state to tp args instead
Linus Torvalds [Sun, 15 May 2022 13:37:05 +0000 (06:37 -0700)]
Merge tag 'irq-urgent-2022-05-15' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for a recent (introduced in 5.16) regression in the core
interrupt code.
The consolidation of the interrupt handler invocation code added an
unconditional warning when generic_handle_domain_irq() is invoked from
outside hard interrupt context. That's overbroad as the requirement
for invoking these handlers in hard interrupt context is only required
for certain interrupt types. The subsequently called code already
contains a warning which triggers conditionally for interrupt chips
which indicate this requirement in their properties.
Remove the overbroad one"
* tag 'irq-urgent-2022-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove WARN_ON_ONCE() in generic_handle_domain_irq()
Linus Torvalds [Sat, 14 May 2022 18:43:47 +0000 (11:43 -0700)]
Merge tag 'perf-tools-fixes-for-v5.18-2022-05-14' of git://git./linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix two NDEBUG warnings in 'perf bench numa'
- Fix ARM coresight `perf test` failure
- Sync linux/kvm.h with the kernel sources
- Add James and Mike as Arm64 performance events reviewers
* tag 'perf-tools-fixes-for-v5.18-2022-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
MAINTAINERS: Add James and Mike as Arm64 performance events reviewers
tools headers UAPI: Sync linux/kvm.h with the kernel sources
perf tests: Fix coresight `perf test` failure.
perf bench: Fix two numa NDEBUG warnings
Linus Torvalds [Fri, 13 May 2022 23:20:25 +0000 (16:20 -0700)]
Merge tag 'drm-fixes-2022-05-14' of git://anongit.freedesktop.org/drm/drm
Pull more drm fixes from Dave Airlie:
"Turns out I was right, some fixes hadn't made it to me yet. The vmwgfx
ones also popped up later, but all seem like bad enough things to fix.
The dma-buf, vc4 and nouveau ones are all pretty small.
The fbdev fixes are a bit more complicated: a fix to cleanup fbdev
devices properly, uncovered some use-after-free bugs in existing
drivers. Then the fix for those bugs wasn't correct. This reverts that
fix, and puts the proper fixes in place in the drivers to avoid the
use-after-frees.
This has had a fair number of eyes on it at this stage, and I'm
confident enough that it puts things in the right place, and is less
dangerous than reverting our way out of the initial change at this
stage.
fbdev:
- revert NULL deref fix that turned into a use-after-free
- prevent use-after-free in fbdev
- efifb/simplefb/vesafb: fix cleanup paths to avoid use-after-frees
dma-buf:
- fix panic in stats setup
vc4:
- fix hdmi build
nouveau:
- tegra iommu present fix
- fix leak in backlight name
vmwgfx:
- Black screen due to fences using FIFO checks on SVGA3
- Random black screens on boot due to uninitialized drm_mode_fb_cmd2
- Hangs on SVGA3 due to command buffers being used with gbobjects"
* tag 'drm-fixes-2022-05-14' of git://anongit.freedesktop.org/drm/drm:
drm/vmwgfx: Disable command buffers on svga3 without gbobjects
drm/vmwgfx: Initialize drm_mode_fb_cmd2
drm/vmwgfx: Fix fencing on SVGAv3
drm/vc4: hdmi: Fix build error for implicit function declaration
dma-buf: call dma_buf_stats_setup after dmabuf is in valid list
fbdev: efifb: Fix a use-after-free due early fb_info cleanup
drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name()
drm/nouveau/tegra: Stop using iommu_present()
fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: Prevent possible use-after-free in fb_release()
Revert "fbdev: Make fb_release() return -ENODEV if fbdev was unregistered"
Dave Airlie [Fri, 13 May 2022 22:34:01 +0000 (08:34 +1000)]
Merge tag 'drm-misc-fixes-2022-05-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Multiple fixes to fbdev to address a regression at unregistration, an
iommu detection improvement for nouveau, a memory leak fix for nouveau,
pointer dereference fix for dma_buf_file_release(), and a build breakage
fix for vc4
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20220513073044.ymayac7x7bzatrt7@houat
Dave Airlie [Fri, 13 May 2022 22:29:41 +0000 (08:29 +1000)]
Merge tag 'vmwgfx-drm-fixes-5.18-2022-05-13' of https://gitlab.freedesktop.org/zack/vmwgfx into drm-fixes
vmwgfx fixes for:
- Black screen due to fences using FIFO checks on SVGA3
- Random black screens on boot due to uninitialized drm_mode_fb_cmd2
- Hangs on SVGA3 due to command buffers being used with gbobjects
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Zack Rusin <zackr@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/a1d32799e4c74b8540216376d7576bb783ca07ba.camel@vmware.com
Linus Torvalds [Fri, 13 May 2022 21:32:53 +0000 (14:32 -0700)]
Merge tag 'gfs2-v5.18-rc4-fix3' of git://git./linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"We've finally identified commit
dc732906c245 ("gfs2: Introduce flag
for glock holder auto-demotion") to be the other cause of the
filesystem corruption we've been seeing. This feature isn't strictly
necessary anymore, so we've decided to stop using it for now.
With this and the gfs_iomap_end rounding fix you've already seen
("gfs2: Fix filesystem block deallocation for short writes" in this
pull request), we're corruption free again now.
- Fix filesystem block deallocation for short writes.
- Stop using glock holder auto-demotion for now.
- Get rid of buffered writes inefficiencies due to page faults being
disabled.
- Minor other cleanups"
* tag 'gfs2-v5.18-rc4-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Stop using glock holder auto-demotion for now
gfs2: buffered write prefaulting
gfs2: Align read and write chunks to the page cache
gfs2: Pull return value test out of should_fault_in_pages
gfs2: Clean up use of fault_in_iov_iter_{read,write}able
gfs2: Variable rename
gfs2: Fix filesystem block deallocation for short writes
Andreas Gruenbacher [Wed, 11 May 2022 16:27:12 +0000 (18:27 +0200)]
gfs2: Stop using glock holder auto-demotion for now
We're having unresolved issues with the glock holder auto-demotion mechanism
introduced in commit
dc732906c245. This mechanism was assumed to be essential
for avoiding frequent short reads and writes until commit
296abc0d91d8
("gfs2: No short reads or writes upon glock contention"). Since then,
when the inode glock is lost, it is simply re-acquired and the operation
is resumed. This means that apart from the performance penalty, we
might as well drop the inode glock before faulting in pages, and
re-acquire it afterwards.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Andreas Gruenbacher [Wed, 4 May 2022 21:37:30 +0000 (23:37 +0200)]
gfs2: buffered write prefaulting
In gfs2_file_buffered_write, to increase the likelihood that all the
user memory we're trying to write will be resident in memory, carry out
the write in chunks and fault in each chunk of user memory before trying
to write it. Otherwise, some workloads will trigger frequent short
"internal" writes, causing filesystem blocks to be allocated and then
partially deallocated again when writing into holes, which is wasteful
and breaks reservations.
Neither the chunked writes nor any of the short "internal" writes are
user visible.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Linus Torvalds [Fri, 13 May 2022 20:13:48 +0000 (13:13 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four fixes, all in drivers.
These patches mosly fix error legs and exceptional conditions
(scsi_dh_alua, qla2xxx). The lpfc fixes are for coding issues with
lpfc features"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Correct BDE DMA address assignment for GEN_REQ_WQE
scsi: lpfc: Fix split code for FLOGI on FCoE
scsi: qla2xxx: Fix missed DMA unmap for aborted commands
scsi: scsi_dh_alua: Properly handle the ALUA transitioning state
Andreas Gruenbacher [Thu, 5 May 2022 11:32:23 +0000 (13:32 +0200)]
gfs2: Align read and write chunks to the page cache
Align the chunks that reads and writes are carried out in to the page
cache rather than the user buffers. This will be more efficient in
general, especially for allocating writes. Optimizing the case that the
user buffer is gfs2 backed isn't very useful; we only need to make sure
we won't deadlock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Andreas Gruenbacher [Thu, 5 May 2022 10:53:26 +0000 (12:53 +0200)]
gfs2: Pull return value test out of should_fault_in_pages
Pull the return value test of the previous read or write operation out
of should_fault_in_pages(). In a following patch, we'll fault in pages
before the I/O and there will be no return value to check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Andreas Gruenbacher [Thu, 5 May 2022 10:37:49 +0000 (12:37 +0200)]
gfs2: Clean up use of fault_in_iov_iter_{read,write}able
No need to store the return value of the fault_in functions in separate
variables.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Andreas Gruenbacher [Wed, 27 Apr 2022 11:53:42 +0000 (13:53 +0200)]
gfs2: Variable rename
Instead of counting the number of bytes read from the filesystem,
functions gfs2_file_direct_read and gfs2_file_read_iter count the number
of bytes written into the user buffer. Conversely, functions
gfs2_file_direct_write and gfs2_file_buffered_write count the number of
bytes read from the user buffer. This is nothing but confusing, so
change the read functions to count how many bytes they have read, and
the write functions to count how many bytes they have written.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Andreas Gruenbacher [Thu, 14 Apr 2022 15:52:39 +0000 (17:52 +0200)]
gfs2: Fix filesystem block deallocation for short writes
When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.
To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.
In addition, use round_up() for computing the start of the unused range.
Fixes:
64bc06bb32ee ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Linus Torvalds [Fri, 13 May 2022 18:12:04 +0000 (11:12 -0700)]
Merge tag 'ceph-for-5.18-rc7' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"Two fixes to properly maintain xattrs on async creates and thus
preserve SELinux context on newly created files and to avoid improper
usage of folio->private field which triggered BUG_ONs.
Both marked for stable"
* tag 'ceph-for-5.18-rc7' of https://github.com/ceph/ceph-client:
ceph: check folio PG_private bit instead of folio->private
ceph: fix setting of xattrs on async created inodes
Linus Torvalds [Fri, 13 May 2022 18:04:37 +0000 (11:04 -0700)]
Merge tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"One more pull request. There was a bug in the fix to ensure that gss-
proxy continues to work correctly after we fixed the AF_LOCAL socket
leak in the RPC code. This therefore reverts that broken patch, and
replaces it with one that works correctly.
Stable fixes:
- SUNRPC: Ensure that the gssproxy client can start in a connected
state
Bugfixes:
- Revert "SUNRPC: Ensure gss-proxy connects on setup"
- nfs: fix broken handling of the softreval mount option"
* tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: fix broken handling of the softreval mount option
SUNRPC: Ensure that the gssproxy client can start in a connected state
Revert "SUNRPC: Ensure gss-proxy connects on setup"
Linus Torvalds [Fri, 13 May 2022 17:22:37 +0000 (10:22 -0700)]
Merge tag 'mm-hotfixes-stable-2022-05-11' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Seven MM fixes, three of which address issues added in the most recent
merge window, four of which are cc:stable.
Three non-MM fixes, none very serious"
[ And yes, that's a real pull request from Andrew, not me creating a
branch from emailed patches. Woo-hoo! ]
* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add a mailing list for DAMON development
selftests: vm: Makefile: rename TARGETS to VMTARGETS
mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
mailmap: add entry for martyna.szapar-mudlaw@intel.com
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
procfs: prevent unprivileged processes accessing fdinfo dir
mm: mremap: fix sign for EFAULT error return value
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
mm/huge_memory: do not overkill when splitting huge_zero_page
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
Linus Torvalds [Fri, 13 May 2022 17:17:39 +0000 (10:17 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- TLB invalidation workaround for Qualcomm Kryo-4xx "gold" CPUs
- Fix broken dependency in the vDSO Makefile
- Fix pointer authentication overrides in ISAR2 ID register
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs
arm64: cpufeature: remove duplicate ID_AA64ISAR2_EL1 entry
arm64: vdso: fix makefile dependency on vdso.so
Linus Torvalds [Fri, 13 May 2022 17:10:07 +0000 (10:10 -0700)]
Merge tag 'hwmon-for-v5.18-rc7' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Restrict ltq-cputemp to SOC_XWAY to fix build failure
- Add OF device ID table to tmp401 driver to enable auto-load
* tag 'hwmon-for-v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (ltq-cputemp) restrict it to SOC_XWAY
hwmon: (tmp401) Add OF device ID table
Linus Torvalds [Fri, 13 May 2022 17:00:37 +0000 (10:00 -0700)]
Merge tag 'drm-fixes-2022-05-13' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Pretty quiet week on the fixes front, 4 amdgpu and one i915 fix.
I think there might be a few misc fbdev ones outstanding, but I'll see
if they are necessary and pass them on if so.
amdgpu:
- Disable ASPM for VI boards on ADL platforms
- S0ix DCN3.1 display fix
- Resume regression fix
- Stable pstate fix
i915:
- fix for kernel memory corruption when running a lot of OpenCL tests
in parallel"
* tag 'drm-fixes-2022-05-13' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu/ctx: only reset stable pstate if the user changed it (v2)
Revert "drm/amd/pm: keep the BACO feature enabled for suspend"
drm/i915: Fix race in __i915_vma_remove_closed
drm/amd/display: undo clearing of z10 related function pointers
drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
Zack Rusin [Fri, 18 Mar 2022 17:43:31 +0000 (13:43 -0400)]
drm/vmwgfx: Disable command buffers on svga3 without gbobjects
With very limited vram on svga3 it's difficult to handle all the surface
migrations. Without gbobjects, i.e. the ability to store surfaces in
guest mobs, there's no reason to support intermediate svga2 features,
especially because we can fall back to fb traces and svga3 will never
support those in-between features.
On svga3 we wither want to use fb traces or screen targets
(i.e. gbobjects), nothing in between. This fixes presentation on a lot
of fusion/esxi tech previews where the exposed svga3 caps haven't been
finalized yet.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Fixes:
2cd80dbd3551 ("drm/vmwgfx: Add basic support for SVGA3")
Cc: <stable@vger.kernel.org> # v5.14+
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220318174332.440068-5-zack@kde.org
Zack Rusin [Wed, 2 Mar 2022 15:24:24 +0000 (10:24 -0500)]
drm/vmwgfx: Initialize drm_mode_fb_cmd2
Transition to drm_mode_fb_cmd2 from drm_mode_fb_cmd left the structure
unitialized. drm_mode_fb_cmd2 adds a few additional members, e.g. flags
and modifiers which were never initialized. Garbage in those members
can cause random failures during the bringup of the fbcon.
Initializing the structure fixes random blank screens after bootup due
to flags/modifiers mismatches during the fbcon bring up.
Fixes:
dabdcdc9822a ("drm/vmwgfx: Switch to mode_cmd2")
Signed-off-by: Zack Rusin <zackr@vmware.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: <stable@vger.kernel.org> # v4.10+
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Reviewed-by: Maaz Mombasawala <mombasawalam@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220302152426.885214-7-zack@kde.org
Zack Rusin [Wed, 2 Mar 2022 15:24:22 +0000 (10:24 -0500)]
drm/vmwgfx: Fix fencing on SVGAv3
Port of the vmwgfx to SVGAv3 lacked support for fencing. SVGAv3 removed
FIFO's and replaced them with command buffers and extra registers.
The initial version of SVGAv3 lacked support for most advanced features
(e.g. 3D) which made fences unnecessary. That is no longer the case,
especially as 3D support is being turned on.
Switch from FIFO commands and capabilities to command buffers and extra
registers to enable fences on SVGAv3.
Fixes:
2cd80dbd3551 ("drm/vmwgfx: Add basic support for SVGA3")
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Reviewed-by: Maaz Mombasawala <mombasawalam@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220302152426.885214-5-zack@kde.org
Greg Kroah-Hartman [Fri, 13 May 2022 14:15:28 +0000 (16:15 +0200)]
Merge tag 'icc-5.18-rc6' of git://git./linux/kernel/git/djakov/icc into char-misc-linus
Pull interconnect fixes from Georgi:
"interconnect fixes for v5.18-rc
This contains an additional fix for sc7180 and sdx55 platforms that helps
them to enter suspend even on devices that don't have the most recent DT
changes.
- interconnect: Restore sync state by ignoring ipa-virt in provider count
Signed-off-by: Georgi Djakov <djakov@kernel.org>"
* tag 'icc-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc:
interconnect: Restore sync state by ignoring ipa-virt in provider count
Adrian-Ken Rueegsegger [Mon, 9 May 2022 09:06:37 +0000 (11:06 +0200)]
x86/mm: Fix marking of unused sub-pmd ranges
The unused part precedes the new range spanned by the start, end parameters
of vmemmap_use_new_sub_pmd(). This means it actually goes from
ALIGN_DOWN(start, PMD_SIZE) up to start.
Use the correct address when applying the mark using memset.
Fixes:
8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220509090637.24152-2-ken@codelabs.ch
Greg Kroah-Hartman [Fri, 13 May 2022 06:29:40 +0000 (08:29 +0200)]
Merge tag 'usb-serial-5.18-rc7' of https://git./linux/kernel/git/johan/usb-serial
Johan writes:
USB-serial fixes for 5.18-rc7
Here are some new device ids.
All have been in linux-next with no reported issues.
* tag 'usb-serial-5.18-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: qcserial: add support for Sierra Wireless EM7590
USB: serial: option: add Fibocom MA510 modem
USB: serial: option: add Fibocom L610 modem
USB: serial: pl2303: add device id for HP LM930 Display
Dave Airlie [Fri, 13 May 2022 00:40:55 +0000 (10:40 +1000)]
Merge tag 'amd-drm-fixes-5.18-2022-05-11' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-5.18-2022-05-11:
amdgpu:
- Disable ASPM for VI boards on ADL platforms
- S0ix DCN3.1 display fix
- Resume regression fix
- Stable pstate fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220511174422.5769-1-alexander.deucher@amd.com
Dave Airlie [Thu, 12 May 2022 23:24:44 +0000 (09:24 +1000)]
Merge tag 'drm-intel-fixes-2022-05-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
Fix for #5732: (Cc stable) kernel memory corruption when running a lot of OpenCL tests in parallel
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/YnykW6L4e7vD3yl3@jlahtine-mobl.ger.corp.intel.com
Linus Torvalds [Thu, 12 May 2022 18:51:45 +0000 (11:51 -0700)]
Merge tag 'net-5.18-rc7' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, and bluetooth.
No outstanding fires.
Current release - regressions:
- eth: atlantic: always deep reset on pm op, fix null-deref
Current release - new code bugs:
- rds: use maybe_get_net() when acquiring refcount on TCP sockets
[refinement of a previous fix]
- eth: ocelot: mark traps with a bool instead of guessing type based
on list membership
Previous releases - regressions:
- net: fix skipping features in for_each_netdev_feature()
- phy: micrel: fix null-derefs on suspend/resume and probe
- bcmgenet: check for Wake-on-LAN interrupt probe deferral
Previous releases - always broken:
- ipv4: drop dst in multicast routing path, prevent leaks
- ping: fix address binding wrt vrf
- net: fix wrong network header length when BPF protocol translation
is used on skbs with a fraglist
- bluetooth: fix the creation of hdev->name
- rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
- wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing
- wifi: ath11k: reduce the wait time of 11d scan and hw scan while
adding an interface
- mac80211: fix rx reordering with non explicit / psmp ack policy
- mac80211: reset MBSSID parameters upon connection
- nl80211: fix races in nl80211_set_tx_bitrate_mask()
- tls: fix context leak on tls_device_down
- sched: act_pedit: really ensure the skb is writable
- batman-adv: don't skb_split skbuffs with frag_list
- eth: ocelot: fix various issues with TC actions (null-deref; bad
stats; ineffective drops; ineffective filter removal)"
* tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
tls: Fix context leak on tls_device_down
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
mlxsw: Avoid warning during ip6gre device removal
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
Bluetooth: Fix the creation of hdev->name
i40e: i40e_main: fix a missing check on list iterator
net/sched: act_pedit: really ensure the skb is writable
s390/lcs: fix variable dereferenced before check
s390/ctcm: fix potential memory leak
s390/ctcm: fix variable dereferenced before check
net: atlantic: verify hw_head_ lies within TX buffer ring
net: atlantic: add check for MAX_SKB_FRAGS
net: atlantic: reduce scope of is_rsc_complete
net: atlantic: fix "frag[0] not initialized"
net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
net: phy: micrel: Fix incorrect variable type in micrel
decnet: Use container_of() for struct dn_neigh casts
...
Linus Torvalds [Thu, 12 May 2022 17:42:56 +0000 (10:42 -0700)]
Merge branch 'for-5.18-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"Waiman's fix for a cgroup2 cpuset bug where it could miss nodes which
were hot-added"
* 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
Linus Torvalds [Thu, 12 May 2022 17:21:44 +0000 (10:21 -0700)]
Merge tag 'fixes_for_v5.18-rc7' of git://git./linux/kernel/git/jack/linux-fs
Pull fs fixes from Jan Kara:
"Three fixes that I'd still like to get to 5.18:
- add a missing sanity check in the fanotify FAN_RENAME feature
(added in 5.17, let's fix it before it gets wider usage in
userspace)
- udf fix for recently introduced filesystem corruption issue
- writeback fix for a race in inode list handling that can lead to
delayed writeback and possible dirty throttling stalls"
* tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Avoid using stale lengthOfImpUse
writeback: Avoid skipping inode writeback
fanotify: do not allow setting dirent events in mask of non-dir
Maxim Mikityanskiy [Thu, 12 May 2022 09:18:30 +0000 (12:18 +0300)]
tls: Fix context leak on tls_device_down
The commit cited below claims to fix a use-after-free condition after
tls_device_down. Apparently, the description wasn't fully accurate. The
context stayed alive, but ctx->netdev became NULL, and the offload was
torn down without a proper fallback, so a bug was present, but a
different kind of bug.
Due to misunderstanding of the issue, the original patch dropped the
refcount_dec_and_test line for the context to avoid the alleged
premature deallocation. That line has to be restored, because it matches
the refcount_inc_not_zero from the same function, otherwise the contexts
that survived tls_device_down are leaked.
This patch fixes the described issue by restoring refcount_dec_and_test.
After this change, there is no leak anymore, and the fallback to
software kTLS still works.
Fixes:
c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Thu, 12 May 2022 05:47:09 +0000 (05:47 +0000)]
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
In the NIC ->probe() callback, ->mtd_probe() callback is called.
If NIC has 2 ports, ->probe() is called twice and ->mtd_probe() too.
In the ->mtd_probe(), which is efx_ef10_mtd_probe() it allocates and
initializes mtd partiion.
But mtd partition for sfc is shared data.
So that allocated mtd partition data from last called
efx_ef10_mtd_probe() will not be used.
Therefore it must be freed.
But it doesn't free a not used mtd partition data in efx_ef10_mtd_probe().
kmemleak reports:
unreferenced object 0xffff88811ddb0000 (size 63168):
comm "systemd-udevd", pid 265, jiffies
4294681048 (age 348.586s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<
ffffffffa3767749>] kmalloc_order_trace+0x19/0x120
[<
ffffffffa3873f0e>] __kmalloc+0x20e/0x250
[<
ffffffffc041389f>] efx_ef10_mtd_probe+0x11f/0x270 [sfc]
[<
ffffffffc0484c8a>] efx_pci_probe.cold.17+0x3df/0x53d [sfc]
[<
ffffffffa414192c>] local_pci_probe+0xdc/0x170
[<
ffffffffa4145df5>] pci_device_probe+0x235/0x680
[<
ffffffffa443dd52>] really_probe+0x1c2/0x8f0
[<
ffffffffa443e72b>] __driver_probe_device+0x2ab/0x460
[<
ffffffffa443e92a>] driver_probe_device+0x4a/0x120
[<
ffffffffa443f2ae>] __driver_attach+0x16e/0x320
[<
ffffffffa4437a90>] bus_for_each_dev+0x110/0x190
[<
ffffffffa443b75e>] bus_add_driver+0x39e/0x560
[<
ffffffffa4440b1e>] driver_register+0x18e/0x310
[<
ffffffffc02e2055>] 0xffffffffc02e2055
[<
ffffffffa3001af3>] do_one_initcall+0xc3/0x450
[<
ffffffffa33ca574>] do_init_module+0x1b4/0x700
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Fixes:
8127d661e77f ("sfc: Add support for Solarflare SFC9100 family")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20220512054709.12513-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guangguan Wang [Thu, 12 May 2022 03:08:20 +0000 (11:08 +0800)]
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.
Fixes:
846e344eb722 ("net/smc: add receive timeout check")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Fainelli [Thu, 12 May 2022 02:17:31 +0000 (19:17 -0700)]
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
After commit
2d1f90f9ba83 ("net: dsa/bcm_sf2: fix incorrect usage of
state->link") the interface suspend path would call our mac_link_down()
call back which would forcibly set the link down, thus preventing
Wake-on-LAN packets from reaching our management port.
Fix this by looking at whether the port is enabled for Wake-on-LAN and
not clearing the link status in that case to let packets go through.
Fixes:
2d1f90f9ba83 ("net: dsa/bcm_sf2: fix incorrect usage of state->link")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220512021731.2494261-1-f.fainelli@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chunfeng Yun [Thu, 12 May 2022 06:49:31 +0000 (14:49 +0800)]
usb: xhci-mtk: remove bandwidth budget table
The bandwidth budget table is introduced to trace ideal bandwidth used
by each INT/ISOC endpoint, but in fact the endpoint may consume more
bandwidth and cause data transfer error, so it's better to leave some
margin. Obviously it's difficult to find the best margin for all cases,
instead take use of the worst-case scenario.
Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
Link: https://lore.kernel.org/r/20220512064931.31670-2-chunfeng.yun@mediatek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chunfeng Yun [Thu, 12 May 2022 06:49:30 +0000 (14:49 +0800)]
usb: xhci-mtk: fix fs isoc's transfer error
Due to the scheduler allocates the optimal bandwidth for FS ISOC endpoints,
this may be not enough actually and causes data transfer error, so come up
with an estimate that is no less than the worst case bandwidth used for
any one mframe, but may be an over-estimate.
Fixes:
451d3912586a ("usb: xhci-mtk: update fs bus bandwidth by bw_budget_table")
Cc: stable@vger.kernel.org
Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
Link: https://lore.kernel.org/r/20220512064931.31670-1-chunfeng.yun@mediatek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Schspa Shi [Sun, 8 May 2022 15:02:47 +0000 (23:02 +0800)]
usb: gadget: fix race when gadget driver register via ioctl
The usb_gadget_register_driver can be called multi time by to
threads via USB_RAW_IOCTL_RUN ioctl syscall, which will lead
to multiple registrations.
Call trace:
driver_register+0x220/0x3a0 drivers/base/driver.c:171
usb_gadget_register_driver_owner+0xfb/0x1e0
drivers/usb/gadget/udc/core.c:1546
raw_ioctl_run drivers/usb/gadget/legacy/raw_gadget.c:513 [inline]
raw_ioctl+0x1883/0x2730 drivers/usb/gadget/legacy/raw_gadget.c:1220
ioctl USB_RAW_IOCTL_RUN
This routine allows two processes to register the same driver instance
via ioctl syscall. which lead to a race condition.
Please refer to the following scenarios.
T1 T2
------------------------------------------------------------------
usb_gadget_register_driver_owner
driver_register driver_register
driver_find driver_find
bus_add_driver bus_add_driver
priv alloced <context switch>
drv->p = priv;
<schedule out>
kobject_init_and_add // refcount = 1;
//couldn't find an available UDC or it's busy
<context switch>
priv alloced
drv->priv = priv;
kobject_init_and_add
---> refcount = 1 <------
// register success
<context switch>
===================== another ioctl/process ======================
driver_register
driver_find
k = kset_find_obj()
---> refcount = 2 <------
<context out>
driver_unregister
// drv->p become T2's priv
---> refcount = 1 <------
<context switch>
kobject_put(k)
---> refcount = 0 <------
return priv->driver;
--------UAF here----------
There will be UAF in this scenario.
We can fix it by adding a new STATE_DEV_REGISTERING device state to
avoid double register.
Reported-by: syzbot+dc7c3ca638e773db07f6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000e66c2805de55b15a@google.com/
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Schspa Shi <schspa@gmail.com>
Link: https://lore.kernel.org/r/20220508150247.38204-1-schspa@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ChiYuan Huang [Tue, 10 May 2022 05:13:00 +0000 (13:13 +0800)]
usb: typec: tcpci_mt6360: Update for BMC PHY setting
Update MT6360 BMC PHY Tx/Rx setting for the compatibility.
Macpaul reported this CtoDP cable attention message cannot be received from
MT6360 TCPC. But actually, attention message really sent from UFP_D
device.
After RD's comment, there may be BMC PHY Tx/Rx setting causes this issue.
Below's the detailed TCPM log and DP attention message didn't received from 6360
TCPCI.
[ 1206.367775] Identity: 0000:0000.0000
[ 1206.416570] Alternate mode 0: SVID 0xff01, VDO 1: 0x00000405
[ 1206.447378] AMS DFP_TO_UFP_ENTER_MODE start
[ 1206.447383] PD TX, header: 0x1d6f
[ 1206.449393] PD TX complete, status: 0
[ 1206.454110] PD RX, header: 0x184f [1]
[ 1206.456867] Rx VDM cmd 0xff018144 type 1 cmd 4 len 1
[ 1206.456872] AMS DFP_TO_UFP_ENTER_MODE finished
[ 1206.456873] cc:=4
[ 1206.473100] AMS STRUCTURED_VDMS start
[ 1206.473103] PD TX, header: 0x2f6f
[ 1206.475397] PD TX complete, status: 0
[ 1206.480442] PD RX, header: 0x2a4f [1]
[ 1206.483145] Rx VDM cmd 0xff018150 type 1 cmd 16 len 2
[ 1206.483150] AMS STRUCTURED_VDMS finished
[ 1206.483151] cc:=4
[ 1206.505643] AMS STRUCTURED_VDMS start
[ 1206.505646] PD TX, header: 0x216f
[ 1206.507933] PD TX complete, status: 0
[ 1206.512664] PD RX, header: 0x1c4f [1]
[ 1206.515456] Rx VDM cmd 0xff018151 type 1 cmd 17 len 1
[ 1206.515460] AMS STRUCTURED_VDMS finished
[ 1206.515461] cc:=4
Fixes:
e1aefcdd394fd ("usb typec: mt6360: Add support for mt6360 Type-C driver")
Cc: stable <stable@vger.kernel.org>
Reported-by: Macpaul Lin <macpaul.lin@mediatek.com>
Tested-by: Macpaul Lin <macpaul.lin@mediatek.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: ChiYuan Huang <cy_huang@richtek.com>
Signed-off-by: Fabien Parent <fparent@baylibre.com>
Link: https://lore.kernel.org/r/1652159580-30959-1-git-send-email-u0084500@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Shreyas K K [Thu, 12 May 2022 11:01:34 +0000 (16:31 +0530)]
arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs
Add KRYO4XX gold/big cores to the list of CPUs that need the
repeat TLBI workaround. Apply this to the affected
KRYO4XX cores (rcpe to rfpe).
The variant and revision bits are implementation defined and are
different from the their Cortex CPU counterparts on which they are
based on, i.e., (r0p0 to r3p0) is equivalent to (rcpe to rfpe).
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
Reviewed-by: Sai Prakash Ranjan <quic_saipraka@quicinc.com>
Link: https://lore.kernel.org/r/20220512110134.12179-1-quic_shrekk@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Amit Cohen [Wed, 11 May 2022 11:57:47 +0000 (14:57 +0300)]
mlxsw: Avoid warning during ip6gre device removal
IPv6 addresses which are used for tunnels are stored in a hash table
with reference counting. When a new GRE tunnel is configured, the driver
is notified and configures it in hardware.
Currently, any change in the tunnel is not applied in the driver. It
means that if the remote address is changed, the driver is not aware of
this change and the first address will be used.
This behavior results in a warning [1] in scenarios such as the
following:
# ip link add name gre1 type ip6gre local 2000::3 remote 2000::fffe tos inherit ttl inherit
# ip link set name gre1 type ip6gre local 2000::3 remote 2000::ffff ttl inherit
# ip link delete gre1
The change of the address is not applied in the driver. Currently, the
driver uses the remote address which is stored in the 'parms' of the
overlay device. When the tunnel is removed, the new IPv6 address is
used, the driver tries to release it, but as it is not aware of the
change, this address is not configured and it warns about releasing non
existing IPv6 address.
Fix it by using the IPv6 address which is cached in the IPIP entry, this
address is the last one that the driver used, so even in cases such the
above, the first address will be released, without any warning.
[1]:
WARNING: CPU: 1 PID: 2197 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2920 mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
...
CPU: 1 PID: 2197 Comm: ip Not tainted 5.17.0-rc8-custom-95062-gc1e5ded51a9a #84
Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 07/12/2021
RIP: 0010:mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
...
Call Trace:
<TASK>
mlxsw_sp2_ipip_rem_addr_unset_gre6+0xf1/0x120 [mlxsw_spectrum]
mlxsw_sp_netdevice_ipip_ol_event+0xdb/0x640 [mlxsw_spectrum]
mlxsw_sp_netdevice_event+0xc4/0x850 [mlxsw_spectrum]
raw_notifier_call_chain+0x3c/0x50
call_netdevice_notifiers_info+0x2f/0x80
unregister_netdevice_many+0x311/0x6d0
rtnl_dellink+0x136/0x360
rtnetlink_rcv_msg+0x12f/0x380
netlink_rcv_skb+0x49/0xf0
netlink_unicast+0x233/0x340
netlink_sendmsg+0x202/0x440
____sys_sendmsg+0x1f3/0x220
___sys_sendmsg+0x70/0xb0
__sys_sendmsg+0x54/0xa0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes:
e846efe2737b ("mlxsw: spectrum: Add hash table for IPv6 address mapping")
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220511115747.238602-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kristina Martsenko [Wed, 11 May 2022 16:20:30 +0000 (17:20 +0100)]
arm64: cpufeature: remove duplicate ID_AA64ISAR2_EL1 entry
The ID register table should have one entry per ID register but
currently has two entries for ID_AA64ISAR2_EL1. Only one entry has an
override, and get_arm64_ftr_reg() can end up choosing the other, causing
the override to be ignored. Fix this by removing the duplicate entry.
While here, also make the check in sort_ftr_regs() more strict so that
duplicate entries can't be added in the future.
Fixes:
def8c222f054 ("arm64: Add support of PAuth QARMA3 architected algorithm")
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220511162030.1403386-1-kristina.martsenko@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Hui Tang [Tue, 10 May 2022 13:51:48 +0000 (21:51 +0800)]
drm/vc4: hdmi: Fix build error for implicit function declaration
drivers/gpu/drm/vc4/vc4_hdmi.c: In function ‘vc4_hdmi_connector_detect’:
drivers/gpu/drm/vc4/vc4_hdmi.c:228:7: error: implicit declaration of function ‘gpiod_get_value_cansleep’; did you mean ‘gpio_get_value_cansleep’? [-Werror=implicit-function-declaration]
if (gpiod_get_value_cansleep(vc4_hdmi->hpd_gpio))
^~~~~~~~~~~~~~~~~~~~~~~~
gpio_get_value_cansleep
CC [M] drivers/gpu/drm/vc4/vc4_validate.o
CC [M] drivers/gpu/drm/vc4/vc4_v3d.o
CC [M] drivers/gpu/drm/vc4/vc4_validate_shaders.o
CC [M] drivers/gpu/drm/vc4/vc4_debugfs.o
drivers/gpu/drm/vc4/vc4_hdmi.c: In function ‘vc4_hdmi_bind’:
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:23: error: implicit declaration of function ‘devm_gpiod_get_optional’; did you mean ‘devm_clk_get_optional’? [-Werror=implicit-function-declaration]
vc4_hdmi->hpd_gpio = devm_gpiod_get_optional(dev, "hpd", GPIOD_IN);
^~~~~~~~~~~~~~~~~~~~~~~
devm_clk_get_optional
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:59: error: ‘GPIOD_IN’ undeclared (first use in this function); did you mean ‘GPIOF_IN’?
vc4_hdmi->hpd_gpio = devm_gpiod_get_optional(dev, "hpd", GPIOD_IN);
^~~~~~~~
GPIOF_IN
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:59: note: each undeclared identifier is reported only once for each function it appears in
cc1: all warnings being treated as errors
Fixes:
6800234ceee0 ("drm/vc4: hdmi: Convert to gpiod")
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20220510135148.247719-1-tanghui20@huawei.com
Florian Fainelli [Wed, 11 May 2022 03:17:51 +0000 (20:17 -0700)]
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
The interrupt controller supplying the Wake-on-LAN interrupt line maybe
modular on some platforms (irq-bcm7038-l1.c) and might be probed at a
later time than the GENET driver. We need to specifically check for
-EPROBE_DEFER and propagate that error to ensure that we eventually
fetch the interrupt descriptor.
Fixes:
9deb48b53e7f ("bcmgenet: add WOL IRQ check")
Fixes:
5b1f0e62941b ("net: bcmgenet: Avoid touching non-existent interrupt")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/r/20220511031752.2245566-1-f.fainelli@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yang Yingliang [Wed, 11 May 2022 03:08:29 +0000 (11:08 +0800)]
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
'foe_table' is a pointer, the real size of struct mtk_foe_entry
should be pass to memset().
Fixes:
ba37b7caf1ed ("net: ethernet: mtk_eth_soc: add support for initializing the PPE")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20220511030829.3308094-1-yangyingliang@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Thu, 12 May 2022 00:40:39 +0000 (17:40 -0700)]
Merge tag 'for-net-2022-05-11' of git://git./linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix the creation of hdev->name when index is greater than 9999
* tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: Fix the creation of hdev->name
====================
Link: https://lore.kernel.org/r/20220512002901.823647-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 12 May 2022 00:33:01 +0000 (17:33 -0700)]
Merge tag 'wireless-2022-05-11' of git://git./linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v5.18
Second set of fixes for v5.18 and hopefully the last one. We have a
new iwlwifi maintainer, a fix to rfkill ioctl interface and important
fixes to both stack and two drivers.
* tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
nl80211: fix locking in nl80211_set_tx_bitrate_mask()
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
mac80211_hwsim: fix RCU protected chanctx access
mailmap: update Kalle Valo's email
mac80211: Reset MBSSID parameters upon connection
cfg80211: retrieve S1G operating channel number
nl80211: validate S1G channel width
mac80211: fix rx reordering with non explicit / psmp ack policy
ath11k: reduce the wait time of 11d scan and hw scan while add interface
MAINTAINERS: update iwlwifi driver maintainer
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
====================
Link: https://lore.kernel.org/r/20220511154535.A1A12C340EE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>