Naohiro Aota [Mon, 7 Aug 2023 16:12:34 +0000 (01:12 +0900)]
btrfs: zoned: defer advancing meta write pointer
We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer failed. Instead, we can advance it just before
sending the buffer.
Also, this is necessary for the following commit. In the commit, it needs
to release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following extent
buffer can pass the meta_write_pointer check, resulting in an unaligned
write failure.
Advancing the pointer is still thread-safe as the extent buffer is locked.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Mon, 7 Aug 2023 16:12:33 +0000 (01:12 +0900)]
btrfs: zoned: return int from btrfs_check_meta_write_pointer
Now that we have writeback_control passed to
btrfs_check_meta_write_pointer(), we can move the wbc condition in
submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Mon, 7 Aug 2023 16:12:32 +0000 (01:12 +0900)]
btrfs: zoned: introduce block group context to btrfs_eb_write_context
For metadata write out on the zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be written
is aligned to the write pointer.
We look up a block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the writing extent buffers are
mostly contiguous.
Introduce "zoned_bg" to cache the block group working on. Also, while
at it, rename "cache" to "block_group".
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Mon, 7 Aug 2023 16:12:31 +0000 (01:12 +0900)]
btrfs: introduce struct to consolidate extent buffer write context
Introduce btrfs_eb_write_context to consolidate writeback_control and the
exntent buffer context. This will help adding a block group context as
well.
While at it, move the eb context setting before
btrfs_check_meta_write_pointer(). We can set it here because we anyway need
to skip pages in the same eb if that eb is rejected by
btrfs_check_meta_write_pointer().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:13 +0000 (16:57 +0100)]
btrfs: avoid start and commit empty transaction when flushing qgroups
When flushing qgroups, we try to join a running transaction, with
btrfs_join_transaction(), and then commit the transaction. However using
btrfs_join_transaction() will result in creating a new transaction in case
there isn't any running or if there's an existing one already committing.
This is pointless as we only need to attach to an existing one that is
not committing and in case there's an existing one committing, wait for
its commit to complete. Creating and committing an empty transaction is
wasteful, pointless IO and unnecessary rotation of the backup roots.
So use btrfs_attach_transaction_barrier() instead, to avoid creating and
committing empty transactions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:12 +0000 (16:57 +0100)]
btrfs: avoid start and commit empty transaction when starting qgroup rescan
When starting a qgroup rescan, we try to join a running transaction, with
btrfs_join_transaction(), and then commit the transaction. However using
btrfs_join_transaction() will result in creating a new transaction in case
there isn't any running or if there's an existing one already committing.
This is pointless as we only need to attach to an existing one that is
not committing and in case there's an existing one committing, wait for
its commit to complete. Creating and committing an empty transaction is
wasteful, pointless IO and unnecessary rotation of the backup roots.
So use btrfs_attach_transaction_barrier() instead, to avoid creating and
committing empty transactions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:11 +0000 (16:57 +0100)]
btrfs: avoid starting and committing empty transaction when flushing space
When flushing space and we are in the COMMIT_TRANS state, we join a
transaction with btrfs_join_transaction() and then commit the returned
transaction. However btrfs_join_transaction() starts a new transaction if
there is none currently open, which is pointless since comitting a new,
empty transaction, doesn't achieve anything, it only wastes time, IO and
creates an unnecessary rotation of the backup roots.
So use btrfs_attach_transaction_barrier() to avoid starting a new
transaction. This also waits for any ongoing transaction that is
committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and
therefore wait for all the extents that were pinned during the
transaction's lifetime to be unpinned.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:10 +0000 (16:57 +0100)]
btrfs: avoid starting new transaction when flushing delayed items and refs
When flushing space we join a transaction to flush delayed items and
delayed references, in order to try to release space. However using
btrfs_join_transaction() not only joins an existing transaction as well
as it starts a new transaction if there is none open. If there is no
transaction open, we don't have neither delayed items nor delayed
references, so creating a new transaction is a waste of time, IO and
creates an unnecessary rotation of the backup roots without gaining any
benefits (including releasing space).
So use btrfs_join_transaction_nostart() when attempting to flush delayed
items and references.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:09 +0000 (16:57 +0100)]
btrfs: merge find_free_dev_extent() and find_free_dev_extent_start()
There is no point in having find_free_dev_extent() because it's just a
simple wrapper around find_free_dev_extent_start() which always passes a
value of 0 for the search_start argument. Since there are no other callers
of find_free_dev_extent_start(), remove find_free_dev_extent() and rename
find_free_dev_extent_start() to find_free_dev_extent(), removing its
search_start argument because it's always 0.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:08 +0000 (16:57 +0100)]
btrfs: make find_free_dev_extent() static
The function find_free_dev_extent() is only used within volumes.c, so make
it static and remove its prototype from volumes.h.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:07 +0000 (16:57 +0100)]
btrfs: make btrfs_cleanup_fs_roots() static
btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static,
remove its prototype from disk-io.h and move its definition above the
where it's used in disk-io.c
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:06 +0000 (16:57 +0100)]
btrfs: fail priority metadata ticket with real fs error
At priority_reclaim_metadata_space(), if we were not able to satisfy the
the ticket after going through the various flushing states and we notice
the fs went into an error state, likely due to a transaction abort during
the flushing, set the ticket's error to the error that caused the
transaction abort instead of an unconditional -EROFS.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:05 +0000 (16:57 +0100)]
btrfs: return real error when orphan cleanup fails due to a transaction abort
During mount we will call btrfs_orphan_cleanup() to remove any inodes that
were previously deleted (have a link count of 0) but for which we were not
able before to remove their items from the subvolume tree. The removal of
the items will happen by triggering eviction, when we do the final iput()
on them at btrfs_orphan_cleanup(), which will end in the loop at
btrfs_evict_inode() that truncates inode items.
In a dire situation we may have a transaction abort due to -ENOSPC when
attempting to truncate the inode items, and in that case the orphan item
(key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and
when we hit the next iteration of the while loop at btrfs_orphan_cleanup()
we will find the same orphan item as before, and then we will return
-EINVAL from btrfs_orphan_cleanup() through the following if statement:
if (found_key.offset == last_objectid) {
btrfs_err(fs_info,
"Error removing orphan entry, stopping orphan cleanup");
ret = -EINVAL;
goto out;
}
This makes the mount operation fail with -EINVAL, when it should have been
-ENOSPC. This is confusing because -EINVAL might lead a user into thinking
it provided invalid mount options for example.
An example where this happens:
$ mount test.img /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
$ dmesg
[ 2542.356934] BTRFS: device fsid
977fff75-1181-4d2b-a739-
384fa710d16e devid 1 transid
47409973 /dev/loop0 scanned by mount (4459)
[ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
[ 2542.357461] BTRFS info (device loop0): disk space caching is enabled
[ 2542.742287] BTRFS info (device loop0): auto enabling async discard
[ 2542.764554] BTRFS info (device loop0): checking UUID tree
[ 2551.743065] ------------[ cut here ]------------
[ 2551.743068] BTRFS: Transaction aborted (error -28)
[ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
[ 2551.743311] Modules linked in: btrfs blake2b_generic (...)
[ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1
[ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
[ 2551.743449] Code: 8b 43 0c (...)
[ 2551.743451] RSP: 0018:
ffff982c005a7c40 EFLAGS:
00010286
[ 2551.743452] RAX:
0000000000000000 RBX:
ffff88fc6e44b400 RCX:
0000000000000000
[ 2551.743453] RDX:
0000000000000002 RSI:
ffffffff8dff0878 RDI:
00000000ffffffff
[ 2551.743454] RBP:
ffff88fc51817208 R08:
0000000000000000 R09:
ffff982c005a7ae0
[ 2551.743455] R10:
0000000000000001 R11:
0000000000000001 R12:
ffff88fc43d2e570
[ 2551.743456] R13:
ffff88fc43d2e400 R14:
ffff88fc8fb08ee0 R15:
ffff88fc6e44b530
[ 2551.743457] FS:
0000000000000000(0000) GS:
ffff89035fbc0000(0000) knlGS:
0000000000000000
[ 2551.743458] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 2551.743459] CR2:
00007fa8cdf2f6f4 CR3:
0000000124850003 CR4:
0000000000370ee0
[ 2551.743462] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 2551.743463] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 2551.743464] Call Trace:
[ 2551.743472] <TASK>
[ 2551.743474] ? __warn+0x80/0x130
[ 2551.743478] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
[ 2551.743520] ? report_bug+0x1f4/0x200
[ 2551.743523] ? handle_bug+0x42/0x70
[ 2551.743526] ? exc_invalid_op+0x14/0x70
[ 2551.743528] ? asm_exc_invalid_op+0x16/0x20
[ 2551.743532] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
[ 2551.743574] ? _raw_spin_unlock+0x15/0x30
[ 2551.743576] ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs]
[ 2551.743609] commit_cowonly_roots+0x1e9/0x260 [btrfs]
[ 2551.743652] btrfs_commit_transaction+0x42e/0xfa0 [btrfs]
[ 2551.743693] ? __pfx_autoremove_wake_function+0x10/0x10
[ 2551.743697] flush_space+0xf1/0x5d0 [btrfs]
[ 2551.743743] ? _raw_spin_unlock+0x15/0x30
[ 2551.743745] ? finish_task_switch+0x91/0x2a0
[ 2551.743748] ? _raw_spin_unlock+0x15/0x30
[ 2551.743750] ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs]
[ 2551.743793] btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs]
[ 2551.743837] process_one_work+0x1d9/0x3e0
[ 2551.743844] worker_thread+0x4a/0x3b0
[ 2551.743847] ? __pfx_worker_thread+0x10/0x10
[ 2551.743849] kthread+0xee/0x120
[ 2551.743852] ? __pfx_kthread+0x10/0x10
[ 2551.743854] ret_from_fork+0x29/0x50
[ 2551.743860] </TASK>
[ 2551.743861] ---[ end trace
0000000000000000 ]---
[ 2551.743863] BTRFS info (device loop0: state A): dumping space info:
[ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full
[ 2551.743868] BTRFS info (device loop0: state A): space_info total=
13458472960, used=
13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0
[ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -
51625984 free, is full
[ 2551.743872] BTRFS info (device loop0: state A): space_info total=
771751936, used=
770146304, pinned=1605632, reserved=0, may_use=
51625984, readonly=0 zone_unusable=0
[ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has
14663680 free, is not full
[ 2551.743875] BTRFS info (device loop0: state A): space_info total=
14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size
53231616 reserved
51544064
[ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0
[ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0
[ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0
[ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0
[ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left
[ 2551.743911] BTRFS info (device loop0: state EA): forced readonly
[ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount
[ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup
[ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of aborted transaction.
[ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22
So make the btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(),
if it's set, and -EINVAL otherwise.
For that same example, after this change, the mount operation fails with
-ENOSPC:
$ mount test.img /mnt
mount: /mnt: mount(2) system call failed: No space left on device.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:04 +0000 (16:57 +0100)]
btrfs: store the error that turned the fs into error state
Currently when we turn the fs into an error state, typically after a
transaction abort, we don't store the error anywhere, we just set a bit
(BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the
error state.
There are cases where it would be useful to have access to the specific
error in order to provide a more meaningful error to users/applications.
This change adds a member to struct btrfs_fs_info to store the error and
removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new
member (fs_error) has a value of 0, otherwise its value is a negative
errno value.
Followup changes will make use of this new member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:03 +0000 (16:57 +0100)]
btrfs: don't steal space from global rsv after a transaction abort
When doing a priority metadata space reclaim, while we are going through
the flush states and running their respective operations, it's possible
that a transaction abort happened, for example when running delayed refs
we hit -ENOSPC or in the critical section of transaction commit we failed
with -ENOSPC or some other error. In these cases a transaction was aborted
and the fs turned into error state. If that happened, then it makes no
sense to steal from the global block reserve and return success to the
caller if the stealing was successful - the caller will later get an
error when attempting to modify the fs. Instead make the ticket fail if
we have the fs in error state and don't attempt to steal from the global
rsv, as it's not only it's pointless, it also simplifies debugging some
-ENOSPC problems.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:02 +0000 (16:57 +0100)]
btrfs: print available space across all block groups when dumping space info
When dumping a space info also sum the available space for all block
groups and then print it. This often useful for debugging -ENOSPC
related problems.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:01 +0000 (16:57 +0100)]
btrfs: print available space for a block group when dumping a space info
When dumping a space info, we iterate over all its block groups and then
print their size and the amounts of bytes used, reserved, pinned, etc.
When debugging -ENOSPC problems it's also useful to know how much space
is available (free), so calculate that and print it as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:57:00 +0000 (16:57 +0100)]
btrfs: print block group super and delalloc bytes when dumping space info
When dumping a space info's block groups, also print the number of bytes
used for super blocks and delalloc. This is often useful for debugging
-ENOSPC problems.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:56:59 +0000 (16:56 +0100)]
btrfs: print target number of bytes when dumping free space
When dumping free space, with btrfs_dump_free_space(), we pass a bytes
argument in order to count how many free space entries in the block group
have a size greater than or equal to that number of bytes. We then print
how many suitable entries we found, but we don't print the target number
of bytes, we just say "bytes". Change the message to actually print the
number of bytes, which makes debugging -ENOSPC issues a bit easier.
Also sligthly change the odd grammar and terminology: the sentence is
ending with 'is', which doesn't make sense, and the term 'blocks' is
confusing as we are referring to free space entries within the block
group's free space cache.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:56:58 +0000 (16:56 +0100)]
btrfs: update comment for btrfs_join_transaction_nostart()
Update the comment for btrfs_join_transaction_nostart() to be more clear
about how it works and how it's different from btrfs_attach_transaction().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Jul 2023 15:56:57 +0000 (16:56 +0100)]
btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART
When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
running transaction we end up creating one. This goes against the purpose
of TRANS_JOIN_NOSTART which is to join a running transaction if its state
is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
-ENOENT error and don't start a new transaction. So fix this to not create
a new transaction if there's no running transaction at or below that
state.
CC: stable@vger.kernel.org # 4.14+
Fixes:
a6d155d2e363 ("Btrfs: fix deadlock between fiemap and transaction commits")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:34 +0000 (19:08 +0800)]
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:33 +0000 (19:08 +0800)]
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:32 +0000 (19:08 +0800)]
btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer()
btrfs_clone_extent_buffer() calls copy_page() at each iteration but we
can copy all pages at the end in one go if there were no errors.
This would make later conversion to folios easier.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:31 +0000 (19:08 +0800)]
btrfs: refactor main loop in copy_extent_buffer_full()
[BACKGROUND]
copy_extent_buffer_full() currently does different handling for regular
and subpage cases, for regular cases it does a page by page copying.
For subpage cases, it just copies the content.
This is fine for the page based extent buffer code, but for the incoming
folio conversion, it can be a burden to add a new branch just to handle
all the different combinations (subpage vs regular, one single folio vs
multi pages).
[ENHANCE]
Instead of handling the different combinations, just go one single
handling for all cases, utilizing write_extent_buffer() to do the
copying.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:30 +0000 (19:08 +0800)]
btrfs: use write_extent_buffer() to implement write_extent_buffer_*id()
Helpers write_extent_buffer_chunk_tree_uuid() and
write_extent_buffer_fsid(), they can be implemented by
write_extent_buffer().
These two helpers are not that frequently used, they only get called
during initialization of a new tree block. There is not much need for
those slightly optimized versions. And since they can be easily
converted to one write_extent_buffer() call, define them as inline
helpers.
This would make later page/folio switch much easier, as all change only
need to happen in write_extent_buffer().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:29 +0000 (19:08 +0800)]
btrfs: refactor extent buffer bitmaps operations
[BACKGROUND]
Currently we handle extent bitmaps manually in
extent_buffer_bitmap_set() and extent_buffer_bitmap_clear().
Although with various helpers like eb_bitmap_offset() it's still a little
messy to read. The code seems to be a copy of bitmap_set(), but with
all the cross-page handling embedded into the code.
[ENHANCEMENT]
This patch would enhance the readability by introducing two helpers:
- memset_extent_buffer()
To handle the byte aligned range, thus all the cross-page handling is
done there.
- extent_buffer_get_byte()
This for the first and the last byte operations, which only need to
grab one byte, thus no need for any cross-page handling.
So we can split both extent_buffer_bitmap_set() and
extent_buffer_bitmap_clear() into 3 parts:
- Handle the first byte
If the range fits inside the first byte, we can exit early.
- Handle the byte aligned part
This is the part which can have cross-page operations, and it would
be handled by memset_extent_buffer().
- Handle the last byte
This refactoring does not only make the code a little easier to read,
but also makes later folio/page switch much easier, as the switch only
needs to be done inside memset_extent_buffer() and extent_buffer_get_byte().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:28 +0000 (19:08 +0800)]
btrfs: tests: add self tests for extent buffer memory operations
The new self tests would populate a memory range with random bytes, then
copy it to the extent buffer, so that we can verify if the extent buffer
memory operation and memmove()/memcopy() are resulting the same
contents.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 15 Jul 2023 11:08:27 +0000 (19:08 +0800)]
btrfs: tests: enhance extent buffer bitmap tests
Enhance extent bitmap tests for the following aspects:
- Remove unnecessary @len from __test_eb_bitmaps()
We can fetch the length from extent buffer
- Explicitly distinguish bit and byte length
Now every start/len inside bitmap tests would have either "byte_" or
"bit_" prefix to make it more explicit.
- Better error reporting
If we have mismatch bits, the error report would dump the following
contents:
* start bytenr
* bit number
* the full byte from bitmap
* the full byte from the extent
This is to save developers time so obvious problem can be found
immediately
- Extract bitmap set/clear and check operation into two helpers
This is to save some code lines, as we will have more tests to do.
- Add new tests
The following tests are added, mostly for the incoming extent bitmap
accessor refactoring:
* Set bits inside the same byte
* Clear bits inside the same byte
* Cross byte boundary set
* Cross byte boundary clear
* Cross multi-byte boundary set
* Cross multi-byte boundary clear
Those new tests have already saved my backend for the incoming extent
buffer bitmap refactoring.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jul 2023 20:12:15 +0000 (16:12 -0400)]
btrfs: move comments to btrfs_loop_type definition
Some of these loop types aren't described, and they should be with the
definitions to make it easier to tell what each of them do.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 27 Jul 2023 13:53:03 +0000 (21:53 +0800)]
btrfs: print name and pid when device scanning processes race
There is a race between systemd and mount, as both of them try to register
the device in the kernel. When systemd loses the race, it prints the
following message:
BTRFS error: device /dev/sdb7 belongs to fsid
1b3bacbf-14db-49c9-a3ef-
547998aacc4e, and the fs is already mounted.
The 'btrfs dev scan' registers one device at a time, so there is no way
for the mount thread to wait in the kernel for all the devices to have
registered as it won't know if all the devices are discovered.
For now, improve the error log by printing the command name and process
ID along with the error message.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:44 +0000 (17:31 +0200)]
mm: remove folio_account_redirty
Fold folio_account_redirty into folio_redirty_for_writepage now
that all other users except for the also unused account_page_redirty
wrapper are gone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:43 +0000 (17:31 +0200)]
btrfs: fix zoned handling in submit_uncompressed_range
For zoned file systems we need to use run_delalloc_zoned to submit
writeback, as we need to write out partial allocations when running into
zone active limits.
submit_uncompressed_range currently always calls cow_file_range to
allocate blocks and thus misses the active zone limits handling. Fix
this by passing the pages_dirty argument to run_delalloc_zoned and always
using it from submit_uncompressed_range as it does the right thing for
zoned and non-zoned file systems.
To account for the fact that run_delalloc_zoned is now also used for
non-zoned file systems rename it to run_delalloc_cow, and add comment
describing it.
Fixes:
42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:42 +0000 (17:31 +0200)]
btrfs: don't redirty locked_page in run_delalloc_zoned
extent_write_locked_range currently expects that either all or no
pages are dirty when it is called. Bur run_delalloc_zoned is called
directly in the writepages path, and has the dirty bit cleared only
for locked_page and which the extent_write_cache_pages currently
operates. It currently works around this by redirtying locked_page,
but that is a bit inefficient and cumbersome. Pass a locked_page
argument to run_delalloc_zoned so that clearing the dirty bit can
be skipped on just that page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:41 +0000 (17:31 +0200)]
btrfs: refactor the zoned device handling in cow_file_range
Handling of the done_offset to cow_file_range is a bit confusing, as
it is not updated at all when the function succeeds, and the -EAGAIN
status is used bother for the case where we need to wait for a zone
finish and the one where the allocation was partially successful.
Change the calling convention so that done_offset is always updated,
and 0 is returned if some allocation was successful (partial allocation
can still only happen for zoned devices), and waiting for a zone
finish is done internally in cow_file_range instead of the caller.
Also write a comment explaining the logic.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:40 +0000 (17:31 +0200)]
btrfs: don't redirty pages in compress_file_range
compress_file_range needs to clear the dirty bit before handing off work
to the compression worker threads to prevent processes coming in through
mmap and changing the file contents while the compression is accessing
the data (See commit
4adaa611020f ("Btrfs: fix race between mmap writes
and compression").
But when compress_file_range decides to not compress the data, it falls
back to submit_uncompressed_range which uses extent_write_locked_range
to write the uncompressed data. extent_write_locked_range currently
expects all pages to be marked dirty so that it can clear the dirty
bit itself, and thus compress_file_range has to redirty the page range.
Redirtying the page range is rather inefficient and also pointless,
so instead pass a pages_dirty parameter to extent_write_locked_range
and skip the redirty game entirely.
Note that compress_file_range was even redirtying the locked_page twice
given that extent_range_clear_dirty_for_io already redirties all pages
in the range, which must include locked_page if there is one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:39 +0000 (17:31 +0200)]
btrfs: share the code to free the page array in compress_file_range
compress_file_range has two code blocks to free the page array for the
compressed data. Share the code using a goto label.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:38 +0000 (17:31 +0200)]
btrfs: use a separate label for the incompressible case in compress_file_range
compress_file_range can fail to compress either because of resource or
alignment constraints or because the data is incompressible. In the latter
case the inode is marked so that compression isn't tried again. Currently
that check is based on the condition that the pages array has been allocated
which is rather cryptic. Use a separate label to clearly distinguish this
case.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:37 +0000 (17:31 +0200)]
btrfs: further simplify the compress or not logic in compress_file_range
Currently the logic whether to compress or not in compress_file_range is
a bit convoluted because it tries to share code for creating inline
extents for the compressible [1] path and the bail to uncompressed path.
But the latter isn't needed at all, because cow_file_range as called by
submit_uncompressed_range will already create inline extents as needed,
so there is no need to have special handling for it if we can live with
the fact that it will be called a bit later in the ->ordered_func of the
workqueue instead of right now.
[1] there is undocumented logic that creates an uncompressed inline
extent outside of the shall not compress logic if total_in is too small.
This logic isn't explained in comments or any commit log I could find,
so I've preserved it. Documentation explaining it would be appreciated
if anyone understands this code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:36 +0000 (17:31 +0200)]
btrfs: streamline compress_file_range
Reorder compress_file_range so that the main compression flow happens
straight line and not in branches. To do this ensure that pages is
always zeroed before a page allocation happens, which allows the
cleanup_and_bail_uncompressed label to clean up the page allocations
as needed.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:35 +0000 (17:31 +0200)]
btrfs: merge submit_compressed_extents and async_cow_submit
The code in submit_compressed_extents just loops over the async_extents,
and doesn't need to be conditional on an inode being present, as there
won't be any async_extent in the list if we created and inline extent.
Merge the two functions to simplify the logic.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:34 +0000 (17:31 +0200)]
btrfs: merge async_cow_start and compress_file_range
There is no good reason to have the simple async_cow_start wrapper,
merge the argument conversion into the main compress_file_range function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:33 +0000 (17:31 +0200)]
btrfs: don't clear async_chunk->inode in async_cow_start
Now that the ->inode check isn't needed in submit_compressed_extents
any more, there is no reason to clear the field early. Always keep
the inode around until the work item is finished and remove the special
casing, and the counting of compressed extents in compress_file_range.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:32 +0000 (17:31 +0200)]
btrfs: clean up the check for uncompressed ranges in submit_one_async_extent
Instead of checking for a NULL !pages and explaining this with a cryptic
comment, just check the compression type for BTRFS_COMPRESS_NONE to make
the check self-explanatory.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:31 +0000 (17:31 +0200)]
btrfs: reduce the number of arguments to btrfs_run_delalloc_range
Instead of a separate page_started argument that tells the callers that
btrfs_run_delalloc_range already started writeback by itself, overload
the return value with a positive 1 in additio to 0 and a negative error
code to indicate that is has already started writeback, and remove the
nr_written argument as that caller can calculate it directly based on
the range, and in fact already does so for the case where writeback
wasn't started yet.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:30 +0000 (17:31 +0200)]
btrfs: improve the delalloc_to_write calculation in writepage_delalloc
Currently writepage_delalloc adds to delalloc_to_write in every loop
operation. That is not only more work than doing it once after the
loop, but can also over-increment the counter due to rounding errors
when a new loop iteration starts with an offset into a page.
Add a new page_start variable instead of recaculation that value over
and over, move the delalloc_to_write calculation out of the loop, use
the DIV_ROUND_UP helper instead of open coding it and remove the pointless
found local variable.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:29 +0000 (17:31 +0200)]
btrfs: remove the return value from extent_write_locked_range
The return value from extent_write_locked_range is ignored, and that's
fine because the error reporting happens through the mapping and
ordered_extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:28 +0000 (17:31 +0200)]
btrfs: remove the return value from submit_uncompressed_range
The return value from submit_uncompressed_range is ignored, and that's
fine because the error reporting happens through the mapping and
ordered_extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:27 +0000 (17:31 +0200)]
btrfs: reduce debug spam from submit_compressed_extents
Move the printk that is supposed to help to debug failures in
submit_one_async_extent into submit_one_async_extent and make it
coniditonal on actually having an error condition instead of spamming
the log unconditionally.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:26 +0000 (17:31 +0200)]
btrfs: remove end_extent_writepage
end_extent_writepage is a small helper that combines a call to
btrfs_mark_ordered_io_finished with conditional error-only calls to
btrfs_page_clear_uptodate and mapping_set_error with a somewhat
unfortunate calling convention that passes and inclusive end instead
of the len expected by the underlying functions.
Remove end_extent_writepage and open code it in the 4 callers. Out
of those two already are error-only and thus don't need the extra
conditional, and one already has the mapping_set_error, so a duplicate
call can be avoided.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:25 +0000 (17:31 +0200)]
btrfs: remove btrfs_writepage_endio_finish_ordered
btrfs_writepage_endio_finish_ordered is a small wrapper around
btrfs_mark_ordered_io_finished that just changs the argument passing
slightly, and adds a tracepoint.
Move the tracpoint to btrfs_mark_ordered_io_finished, which means
it now also covers the error handling in btrfs_cleanup_ordered_extent
and switch all callers to just call btrfs_mark_ordered_io_finished
directly.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:24 +0000 (17:31 +0200)]
btrfs: split page locking out of __process_pages_contig
There is a lot of complexity in __process_pages_contig to deal with the
PAGE_LOCK case that can return an error unlike all the other actions.
Open code the page iteration for page locking in lock_delalloc_pages and
remove all the now unused code from __process_pages_contig.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:23 +0000 (17:31 +0200)]
btrfs: don't create inline extents in fallback_to_cow
For NOCOW files, run_delalloc_nocow can still fall back to COW
allocations when required and calls to fallback_to_cow helper for
that. For such an allocation we can have multiple ordered_extents
for existing extents that NOCOW overwrites and new allocations that
fallback_to_cow creates. If one of the new extents is an inline
extent, the writepages could would have to avoid normal page writeback
for them as indicated by the page_started return argument, which
run_delalloc_nocow can't return. Fix this by never creating inline
extents from fallback_to_cow.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 28 Jun 2023 15:31:22 +0000 (17:31 +0200)]
btrfs: pass a flags argument to cow_file_range
The int used as bool unlock is not a very good way to describe the
behavior, and the next patch will have to add another behavior modifier.
We'll do that by two bool parameters instead of adding bit flags. Now
specifies that the pages should always be kept locked. This is the
inverse of the old unlock argument.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch flags to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Fri, 21 Jul 2023 16:02:07 +0000 (09:02 -0700)]
btrfs: fix start transaction qgroup rsv double free
btrfs_start_transaction reserves metadata space of the PERTRANS type
before it identifies a transaction to start/join. This allows flushing
when reserving that space without a deadlock. However, it results in a
race which temporarily breaks qgroup rsv accounting.
T1 T2
start_transaction
do_stuff
start_transaction
qgroup_reserve_meta_pertrans
commit_transaction
qgroup_free_meta_all_pertrans
hit an error starting txn
goto reserve_fail
qgroup_free_meta_pertrans (already freed!)
The basic issue is that there is nothing preventing another commit from
committing before start_transaction finishes (in fact sometimes we
intentionally wait for it) so any error path that frees the reserve is
at risk of this race.
While this exact space was getting freed anyway, and it's not a huge
deal to double free it (just a warning, the free code catches this), it
can result in incorrectly freeing some other pertrans reservation in
this same reservation, which could then lead to spuriously granting
reservations we might not have the space for. Therefore, I do believe it
is worth fixing.
To fix it, use the existing prealloc->pertrans conversion mechanism.
When we first reserve the space, we reserve prealloc space and only when
we are sure we have a transaction do we convert it to pertrans. This way
any racing commits do not blow away our reservation, but we still get a
pertrans reservation that is freed when _this_ transaction gets committed.
This issue can be reproduced by running generic/269 with either qgroups
or squotas enabled via mkfs on the scratch device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Fri, 21 Jul 2023 16:02:06 +0000 (09:02 -0700)]
btrfs: free qgroup rsv on io failure
If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.
When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Goldwyn Rodrigues [Fri, 14 Jul 2023 15:18:19 +0000 (10:18 -0500)]
btrfs: remove duplicate free_async_extent_pages() on reservation error
While performing compressed writes, if the extent reservation fails, the
async extent pages are first freed in the error check for return value
ret, and then again at out_free label.
Remove the first call to free_async_extent_pages().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 9 Jul 2023 07:08:18 +0000 (15:08 +0800)]
btrfs: move eb subpage preallocation out of the loop
Initially we preallocate btrfs_subpage structure in the main loop of
alloc_extent_buffer().
But later commit
fbca46eb46ec ("btrfs: make nodesize >= PAGE_SIZE case
to reuse the non-subpage routine") has made sure we only go subpage
routine if our nodesize is smaller than PAGE_SIZE.
This means for that case, we only need to allocate the subpage structure
once anyway.
So this patch would make the preallocation out of the main loop. This
would slightly reduce the workload when we hold the page lock, and make
code a little easier to read.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Tue, 27 Jun 2023 06:13:24 +0000 (08:13 +0200)]
btrfs: simplify the no-bioc fast path condition in btrfs_map_block
nr_alloc_stripes can't be one if we are writing to a replacement device,
as it is incremented for that case right above. Remove the duplicate
checks.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 14 Jun 2023 06:39:55 +0000 (14:39 +0800)]
btrfs: scrub: remove unused btrfs_path in scrub_simple_mirror()
The @path in scrub_simple_mirror() is no longer utilized after commit
e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe
infrastructure").
Before that commit, we call find_first_extent_item() directly, which
needs a path and that path can be reused. But after that switch commit,
the extent search is done inside queue_scrub_stripe(), which will no
longer accept a path from outside.
So the @path variable can be safely removed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ remove the stale comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Minjie Du [Mon, 17 Jul 2023 07:16:22 +0000 (15:16 +0800)]
btrfs: use folio_next_index() helper in extent_write_cache_pages
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 13 Jul 2023 14:10:26 +0000 (16:10 +0200)]
btrfs: use helper sizeof_field in struct accessors
There's a helper for obtaining size of a struct member, we can use it
instead of open coding the pointer magic.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 26 Jun 2023 09:55:25 +0000 (17:55 +0800)]
btrfs: deprecate integrity checker feature
The integrity checker feature needs to be enabled at compile time
(BTRFS_FS_CHECK_INTEGRITY) and then enabled by mount options check_int*.
Although it provides some unique features which can not be provided by
any other sanity checks like tree-checker, it does not only have high
CPU and memory overhead, but is also a maintenance burden.
For example, it's the only caller of btrfs_map_block() with
@need_raid_map = 0.
Considering most btrfs developers are not even testing this feature, I'm
here to propose deprecation of this feature.
For now only warning messages will be printed, the feature itself would
still work.
Removal time has been set to 6.7 release.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:51 +0000 (16:03 +0100)]
btrfs: move btrfs_free_excluded_extents() into block-group.c
The function btrfs_free_excluded_extents() is only used by block-group.c,
so move it into block-group.c and make it static. Also removed unnecessary
variables that are used only once.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:50 +0000 (16:03 +0100)]
btrfs: open code trivial btrfs_add_excluded_extent()
The code for btrfs_add_excluded_extent() is trivial, it's just a
set_extent_bit() call. However it's defined in extent-tree.c but it is
only used (twice) in block-group.c. So open code it in block-group.c,
reducing the need to export a trivial function.
Also since the only caller btrfs_add_excluded_extent() is prepared to
deal with errors, stop ignoring errors from the set_extent_bit() call.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:49 +0000 (16:03 +0100)]
btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:48 +0000 (16:03 +0100)]
btrfs: make btrfs_destroy_pinned_extent() return void
Currently btrfs_destroy_pinned_extent() is always returning 0 no matter
what and its caller ignores its return value (as well everything up in
the call chain). This is because this is called in the transaction abort
path, where we can't even deal with any errors since we are in a critical
situation already and cleanup of resources is done in a best effort
fashion.
So make btrfs_destroy_pinned_extent() return void.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:47 +0000 (16:03 +0100)]
btrfs: make btrfs_destroy_marked_extents() return void
Currently btrfs_destroy_marked_extents() is returning the value of the
last call to find_first_extent_bit(), which returns a value of 1 meaning
no more ranges found the dirty pages io tree. This value is useless to the
single caller of btrfs_destroy_marked_extents(), which ignores any return
value from btrfs_destroy_marked_extents(). This is because it's only used
in the transaction abort path, where we can't even deal with any errors
since we are in a critical situation already and cleanup of resources is
done in a best effort fashion.
So make btrfs_destroy_marked_extents() return void.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:46 +0000 (16:03 +0100)]
btrfs: rename add_new_free_space() to btrfs_add_new_free_space()
Since add_new_free_space() is exported, used outside block-group.c, rename
it to include the 'btrfs_' prefix.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jun 2023 15:03:45 +0000 (16:03 +0100)]
btrfs: update documentation for add_new_free_space()
The documentation for add_new_free_space() is stale and no longer correct:
1) It's no longer used only when caching a block group. It's also called
when creating a block group (btrfs_make_block_group()), when reading
a block group at mount time (read_one_block_group()) and when reading
the free space tree for a block group (typically the first time we
attempt to allocate from the block group);
2) It has nothing to do with pinned extents. It only deals with the
excluded extents io tree, which is used to track the locations of
super blocks in order to make sure we never add the location of a
super block to the free space cache of a block group.
So update the documention and also add a description of the arguments
and return values.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 20 Jun 2023 07:06:05 +0000 (15:06 +0800)]
btrfs: tracepoints: simplify raid56 events
After commit
6bfd0133bee2 ("btrfs: raid56: switch scrub path to use a
single function"), the raid56 implementation no longer uses different
endio functions for RMW/recover/scrub.
All read operations end in submit_read_wait_bio_list(), while all write
operations end in submit_write_bios(). This means quite some trace
events are out-of-date and no longer utilized.
This patch would unify the trace events into just two:
- trace_raid56_read()
Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and
trace_raid56_scrub_read_recover().
- trace_raid56_write()
Replaces trace_raid56_write_stripe() and
trace_raid56_scrub_write_stripe().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 20 Jun 2023 08:55:09 +0000 (16:55 +0800)]
btrfs: sysfs: show if ACL support has been compiled in
ACL support depends on the compile-time configuration option
CONFIG_BTRFS_FS_POSIX_ACL. Prior to mounting a btrfs filesystem, it is not
possible to determine whether ACL support has been compiled in. To address
this, add a sysfs interface, /sys/fs/btrfs/features/acl, and check for ACL
support in the system's btrfs.
To determine ACL support:
Return 0 indicates ACL is not supported:
$ cat /sys/fs/btrfs/features/acl
0
Return 1 indicates ACL is supported:
$ cat /sys/fs/btrfs/features/acl
1
IMO, this is a better approach, so that we also know if kernel is older.
On an older kernel
$ ls /sys/fs/btrfs/features/acl
ls: cannot access '/sys/fs/btrfs/features/acl': No such file or directory
mount a btrfs filesystem
$ cat /proc/self/mounts | grep btrfs | grep -q noacl
$ echo $?
0
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 28 Jun 2023 08:11:15 +0000 (16:11 +0800)]
btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING
Commit
aca43fe839e4 ("btrfs: remove unused raid56 functions which were
dedicated for scrub") removed the special handling of RAID56 scrub for
missing device.
As scrub goes full mirror_num based recovery, that means if it hits a
missing device in RAID56, it would just try the next mirror, which would
go through the BTRFS_RBIO_READ_REBUILD operation.
This means there is no longer any use of BTRFS_RBIO_REBUILD_MISSING
operation and we can safely remove it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 27 Jun 2023 07:34:31 +0000 (15:34 +0800)]
btrfs: add comments for btrfs_map_block()
The function btrfs_map_block() is a critical part of the btrfs storage
layer, which handles mapping of logical ranges to physical ranges.
Thus it's better to have some basic explanation, especially on the
following points:
- Segment split by various boundaries
As a continuous logical range may be split into different segments,
due to various factors like zones and RAID0/5/6/10 boundaries.
- The meaning of @mirror_num
- The possible single stripe optimization
- One deprecated parameter @need_raid_map
Just explicitly mark it deprecated so we're aware of the problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Colin Ian King [Thu, 22 Jun 2023 07:54:30 +0000 (08:54 +0100)]
btrfs: remove redundant initialization of variables in log_new_ancestors
The variables leaf and slot are initialized when declared but the values
assigned to them are never read as they are being re-assigned later on.
The initializations are redundant and can be removed. Cleans up clang
scan build warnings:
fs/btrfs/tree-log.c:6797:25: warning: Value stored to 'leaf' during its
initialization is never read [deadcode.DeadStores]
fs/btrfs/tree-log.c:6798:7: warning: Value stored to 'slot' during its
initialization is never read [deadcode.DeadStores]
It's been there since
b8aa330d2acb ("Btrfs: improve performance on fsync
of files with multiple hardlinks") without any usage so it's safe to be
removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Colin Ian King [Wed, 21 Jun 2023 13:09:37 +0000 (14:09 +0100)]
btrfs: scrub: remove redundant division of stripe_nr
Variable stripe_nr is being divided by map->num_stripes however the
result is never read. The division and assignment are redundant and
can be removed. Cleans up clang scan build warning:
fs/btrfs/scrub.c:1264:3: warning: Value stored to 'stripe_nr' is
never read [deadcode.DeadStores]
The code is a leftover from
6ded22c1bfe6 ("btrfs: reduce div64 calls by
limiting the number of stripes of a chunk to u32") that converted div64
to normal division, it's the same but previous version did not trigger a
warning.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Julia Lawall [Fri, 23 Jun 2023 21:14:43 +0000 (23:14 +0200)]
btrfs: zoned: use vcalloc instead of for vzalloc in btrfs_get_dev_zone_info
Use vcalloc that checks potential multiplication overflows. The changes
were done using Coccinelle semantic patch.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 20 Aug 2023 13:02:52 +0000 (15:02 +0200)]
Linux 6.5-rc7
Linus Torvalds [Sun, 20 Aug 2023 06:26:51 +0000 (08:26 +0200)]
Merge tag 'tty-6.5-rc7' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small tty and serial core fixes for 6.5-rc7 that resolve
a lot of reported issues.
Primarily in here are the fixes for the serial bus code from Tony that
came in -rc1, as it hit wider testing with the huge number of
different types of systems and serial ports. All of the reported
issues with duplicate names and other issues with this code are now
resolved.
Other than that included in here is:
- n_gsm fix for a previous fix
- 8250 lockdep annotation fix
- fsl_lpuart serial driver fix
- TIOCSTI documentation update for previous CAP_SYS_ADMIN change
All of these have been in linux-next for a while with no reported
problems"
* tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: core: Fix serial core port id, including multiport devices
serial: 8250: drop lockdep annotation from serial8250_clear_IER()
tty: n_gsm: fix the UAF caused by race condition in gsm_cleanup_mux
serial: core: Revert port_id use
TIOCSTI: Document CAP_SYS_ADMIN behaviour in Kconfig
serial: 8250: Fix oops for port->pm on uart_change_pm()
serial: 8250: Reinit port_id when adding back serial8250_isa_devs
serial: core: Fix kmemleak issue for serial core device remove
MAINTAINERS: Merge TTY layer and serial drivers
serial: core: Fix serial_base_match() after fixing controller port name
serial: core: Fix serial core controller port name to show controller id
serial: core: Fix serial core port id to not use port->line
serial: core: Controller id cannot be negative
tty: serial: fsl_lpuart: Clear the error flags by writing 1 for lpuart32 platforms
Linus Torvalds [Sun, 20 Aug 2023 06:18:58 +0000 (08:18 +0200)]
Merge tag 'rust-fixes-6.5-rc7' of https://github.com/Rust-for-Linux/linux
Pull rust fix from Miguel Ojeda:
- Macros: fix 'HAS_*' redefinition by the '#[vtable]' macro
under conditional compilation
* tag 'rust-fixes-6.5-rc7' of https://github.com/Rust-for-Linux/linux:
rust: macros: vtable: fix `HAS_*` redefinition (`gen_const_name`)
Linus Torvalds [Sat, 19 Aug 2023 17:22:41 +0000 (19:22 +0200)]
Merge tag 'i2c-for-6.5-rc7' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Usual set of driver fixes. A bit more than usual because I was
unavailable for a while"
* tag 'i2c-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: bcm-iproc: Fix bcm_iproc_i2c_isr deadlock issue
i2c: Update documentation to use .probe() again
i2c: sun6i-p2wi: Fix an error message in probe()
i2c: hisi: Only handle the interrupt of the driver's transfer
i2c: tegra: Fix i2c-tegra DMA config option processing
i2c: tegra: Fix failure during probe deferral cleanup
i2c: designware: Handle invalid SMBus block data response length value
i2c: designware: Correct length byte validation logic
i2c: imx-lpi2c: return -EINVAL when i2c peripheral clk doesn't work
Linus Torvalds [Sat, 19 Aug 2023 15:57:07 +0000 (17:57 +0200)]
Merge tag 'for-6.5-rc6-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix infinite loop in readdir(), could happen in a big directory when
files get renamed during enumeration
- fix extent map handling of skipped pinned ranges
- fix a corner case when handling ordered extent length
- fix a potential crash when balance cancel races with pause
- verify correct uuid when starting scrub or device replace
* tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
btrfs: fix BUG_ON condition in btrfs_cancel_balance
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
btrfs: fix replace/scrub failure with metadata_uuid
btrfs: fix infinite directory reads
Linus Torvalds [Sat, 19 Aug 2023 15:43:55 +0000 (17:43 +0200)]
Merge tag 'fbdev-for-6.5-rc7' of git://git./linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes and cleanups from Helge Deller:
- various code cleanups in amifb, atmel_lcdfb, ssd1307fb, kyro and
goldfishfb
* tag 'fbdev-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: goldfishfb: Do not check 0 for platform_get_irq()
fbdev: atmel_lcdfb: Remove redundant of_match_ptr()
fbdev: kyro: Remove unused declarations
fbdev: ssd1307fb: Print the PWM's label instead of its number
fbdev: mmp: fix value check in mmphw_probe()
fbdev: amifb: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
Linus Torvalds [Sat, 19 Aug 2023 15:31:46 +0000 (17:31 +0200)]
Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Main thing here is the fix for the regression in flush handling which
caused IO hangs/stalls for a few reporters. Hopefully that should all
be sorted out now. Outside of that, just a few minor fixes for issues
that were introduced in this cycle"
* tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux:
blk-mq: release scheduler resource when request completes
blk-crypto: dynamically allocate fallback profile
blk-cgroup: hold queue_lock when removing blkg->q_node
drivers/rnbd: restore sysfs interface to rnbd-client
Chengming Zhou [Sun, 13 Aug 2023 15:23:25 +0000 (23:23 +0800)]
blk-mq: release scheduler resource when request completes
Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected to commit
615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests").
We analysed the IO hang problem, found there are two postflush requests
waiting for each other.
The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
failed to blk_kick_flush() because of the second postflush request which
is inflight waiting in scheduler queue.
The second postflush waiting in scheduler queue can't be dispatched because
the first postflush hasn't released scheduler resource even though it has
completed by itself.
Fix it by releasing scheduler resource when the first postflush request
completed, so the second postflush can be dispatched and completed, then
make blk_kick_flush() succeed.
While at it, remove the check for e->ops.finish_request, as all
schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE()
at scheduler registration time, just like we do for insert_requests and
dispatch_request.
[1] https://lore.kernel.org/all/
7A57C7AE-A51A-4254-888B-
FE15CA21F9E9@oracle.com/
Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/
202308172100.
8ce4b853-oliver.sang@intel.com
Fixes:
615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev
[axboe: folded in incremental fix and added tags]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 19 Aug 2023 11:13:55 +0000 (13:13 +0200)]
Merge tag 'media/v6.5-3' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"Three driver fixes"
* tag 'media/v6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: imx: imx7-media-csi: Fix applying format constraints
media: uvcvideo: Fix menu count handling for userspace XU mappings
media: mtk-jpeg: Set platform driver data earlier
Linus Torvalds [Sat, 19 Aug 2023 08:46:02 +0000 (10:46 +0200)]
Merge tag 'x86_urgent_for_v6.5_rc7' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"Extraordinary embargoed times call for extraordinary measures. That's
why this week's x86/urgent branch is larger than usual, containing all
the known fallout fixes after the SRSO mitigation got merged.
I know, it is a bit late in the game but everyone who has reported a
bug stemming from the SRSO pile, has tested that branch and has
confirmed that it fixes their bug.
Also, I've run it on every possible hardware I have and it is looking
good. It is running on this very machine while I'm typing, for 2 days
now without an issue. Famous last words...
- Use LEA ...%rsp instead of ADD %rsp in the Zen1/2 SRSO return
sequence as latter clobbers flags which interferes with fastop
emulation in KVM, leading to guests freezing during boot
- A fix for the DIV(0) quotient data leak on Zen1 to clear the
divider buffers at the right time
- Disable the SRSO mitigation on unaffected configurations as it got
enabled there unnecessarily
- Change .text section name to fix CONFIG_LTO_CLANG builds
- Improve the optprobe indirect jmp check so that certain
configurations can still be able to use optprobes at all
- A serious and good scrubbing of the untraining routines by PeterZ:
- Add proper speculation stopping traps so that objtool is happy
- Adjust objtool to handle the new thunks
- Make the thunk pointer assignable to the different untraining
sequences at runtime, thus avoiding the alternative at the
return thunk. It simplifies the code a bit too.
- Add a entry_untrain_ret() main entry point which selects the
respective untraining sequence
- Rename things so that they're more clear
- Fix stack validation with FRAME_POINTER=y builds
- Fix static call patching to handle when a JMP to the return thunk
is the last insn on the very last module memory page
- Add more documentation about what each untraining routine does and
why"
* tag 'x86_urgent_for_v6.5_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Correct the mitigation status when SMT is disabled
x86/static_call: Fix __static_call_fixup()
objtool/x86: Fixup frame-pointer vs rethunk
x86/srso: Explain the untraining sequences a bit more
x86/cpu/kvm: Provide UNTRAIN_RET_VM
x86/cpu: Cleanup the untrain mess
x86/cpu: Rename srso_(.*)_alias to srso_alias_\1
x86/cpu: Rename original retbleed methods
x86/cpu: Clean up SRSO return thunk mess
x86/alternative: Make custom return thunk unconditional
objtool/x86: Fix SRSO mess
x86/cpu: Fix up srso_safe_ret() and __x86_return_thunk()
x86/cpu: Fix __x86_return_thunk symbol type
x86/retpoline,kprobes: Skip optprobe check for indirect jumps with retpolines and IBT
x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG
x86/srso: Disable the mitigation on unaffected configurations
x86/CPU/AMD: Fix the DIV(0) initial fix attempt
x86/retpoline: Don't clobber RFLAGS during srso_safe_ret()
Linus Torvalds [Sat, 19 Aug 2023 06:32:16 +0000 (08:32 +0200)]
Merge tag 'powerpc-6.5-6' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
- Fix hardened usercopy BUG when using /proc based firmware update
interface
Thanks to Nathan Lynch and Kees Cook.
* tag 'powerpc-6.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/rtas_flash: allow user copy to flash block cache objects
Sweet Tea Dorminy [Thu, 17 Aug 2023 14:15:56 +0000 (10:15 -0400)]
blk-crypto: dynamically allocate fallback profile
blk_crypto_profile_init() calls lockdep_register_key(), which warns and
does not register if the provided memory is a static object.
blk-crypto-fallback currently has a static blk_crypto_profile and calls
blk_crypto_profile_init() thereupon, resulting in the warning and
failure to register.
Fortunately it is simple enough to use a dynamically allocated profile
and make lockdep function correctly.
Fixes:
2fb48d88e77f ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock")
Cc: stable@vger.kernel.org
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 17 Aug 2023 14:17:51 +0000 (22:17 +0800)]
blk-cgroup: hold queue_lock when removing blkg->q_node
When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock
has to be held, otherwise, all kinds of bugs(list corruption, hard lockup,
..) can be triggered from blkg_destroy_all().
Fixes:
f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: xiaoli feng <xifeng@redhat.com>
Cc: Chunyu Hu <chuhu@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Zhijian [Wed, 16 Aug 2023 02:22:10 +0000 (10:22 +0800)]
drivers/rnbd: restore sysfs interface to rnbd-client
Commit
137380c0ec40 renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.
CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes:
137380c0ec40 ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 18 Aug 2023 18:52:25 +0000 (20:52 +0200)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
"Two more SME fixes related to ptrace(): ensure that the SME is
properly set up for the target thread and that the thread sees
the ZT registers set via ptrace"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/ptrace: Ensure that the task sees ZT writes on first use
arm64/ptrace: Ensure that SME is set up for target when writing SSVE state
Linus Torvalds [Fri, 18 Aug 2023 18:36:49 +0000 (20:36 +0200)]
Merge tag 'gpio-fixes-for-v6.5-rc7' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix a regression in the sysfs interface
- fix a reference counting bug that's been around for years
- MAINTAINERS update
* tag 'gpio-fixes-for-v6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: fix reference leaks when removing GPIO chips still in use
gpiolib: sysfs: Do unexport GPIO when user asks for it
MAINTAINERS: add content regex for gpio-regmap
Linus Torvalds [Fri, 18 Aug 2023 17:41:12 +0000 (19:41 +0200)]
Merge tag '6.5-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"A small SMB mount option fix, also for stable"
* tag '6.5-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix null auth
Linus Torvalds [Fri, 18 Aug 2023 17:26:53 +0000 (19:26 +0200)]
Merge tag 'riscv-for-linus-6.5-rc7' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- avoid excessive rejections from seccomp RET_ERRNO rules
- compressed jal/jalr decoding fix
- fixes for independent irq/softirq stacks on kernels built with
CONFIG_FRAME_POINTER=n
- avoid a hang handling uaccess fixups
- another build fix for toolchain ISA strings, this time for Zicsr and
Zifenci on old GNU toolchains
* tag 'riscv-for-linus-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Handle zicsr/zifencei issue between gcc and binutils
riscv: uaccess: Return the number of bytes effectively not copied
riscv: stack: Fixup independent softirq stack for CONFIG_FRAME_POINTER=n
riscv: stack: Fixup independent irq stack for CONFIG_FRAME_POINTER=n
riscv: correct riscv_insn_is_c_jr() and riscv_insn_is_c_jalr()
riscv: entry: set a0 = -ENOSYS only when syscall != -1
Linus Torvalds [Fri, 18 Aug 2023 17:17:01 +0000 (19:17 +0200)]
Merge tag 'sound-6.5-rc7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Slightly bigger than I wished, but here we go, a collection of fixes
for 6.5.
The only change in the core side is the ease for repeated ASoC error
messages, and the rest are all pretty device-specific small fixes
(including regression fixes) for ASoC Intel and HD-audio / USB-audio
quirks"
* tag 'sound-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Remodified 3k pull low procedure
ASoC: rt1308-sdw: fix random louder sound
ALSA: hda/cs8409: Support new Dell Dolphin Variants
ALSA: hda/realtek: Switch Dell Oasis models to use SPI
ALSA: hda/realtek: Add quirks for HP G11 Laptops
ASoC: meson: axg-tdm-formatter: fix channel slot allocation
ASoC: SOF: ipc4-topology: Update the basecfg for copier earlier
ASoC: SOF: intel: hda: Clean up link DMA for IPC3 during stop
ASoC: Intel: sof-sdw-cs42142: fix for codec button mapping
ASoC: Intel: sof-sdw: update jack detection quirk for LunarLake RVP
ASoC: SOF: Fix incorrect use of sizeof in sof_ipc3_do_rx_work()
ASoC: lower "no backend DAIs enabled for ... Port" log severity
ASoC: rt5665: add missed regulator_bulk_disable
ASoC: max98363: don't return on success reading revision ID
ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces.
ASoC: fsl: micfil: Use dual license micfil code
Linus Torvalds [Fri, 18 Aug 2023 17:10:41 +0000 (19:10 +0200)]
Merge tag 'mmc-v6.5-rc1-2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Fix in_flight[issue_type] value error to properly manage requests
MMC host:
- wbsd: Fix double free in the probe error path
- sunplus: Fix error path in probe
- sdhci_f_sdh30: Fix order of function calls in sdhci_f_sdh30_remove"
* tag 'mmc-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: f-sdh30: fix order of function calls in sdhci_f_sdh30_remove
mmc: sunplus: Fix error handling in spmmc_drv_probe()
mmc: sunplus: fix return value check of mmc_add_host()
mmc: wbsd: fix double mmc_free_host() in wbsd_init()
mmc: block: Fix in_flight[issue_type] value error
Linus Torvalds [Fri, 18 Aug 2023 16:37:40 +0000 (18:37 +0200)]
Merge tag 'pinctrl-v6.5-3' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Fixes two issues with the Qualcomm SA8775P platform:
- Some minor device tree binding flunky that is nice to iron out but
more importantly:
- Support the increased interrupt targets mask from 3 to 4 bits,
making interrupts with higher (hardware) numbers work"
* tag 'pinctrl-v6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: qcom: Add intr_target_width field to support increased number of interrupt targets
dt-bindings: pinctrl: qcom,sa8775p-tlmm: add gpio function constant
Linus Torvalds [Fri, 18 Aug 2023 16:24:41 +0000 (18:24 +0200)]
Merge tag 'soc-fixes-6.5-3' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"As usual, mostly DT fixes for the major Arm platforms from Qualcomm
and NXP, plus a bit for Rockchips and others:
The qualcomm fixes mainly deal with their higher-end arm64 devices
trees, fixing issues in L3 interconnect, crypto, thermal, UFS and a
regression for the DSI phy.
NXP i.MX has two correctness fixes for the 64-bit chips, dealing with
the imx93 "anatop" module and the CSI interface. On the 32-bit side,
there are functional fixes for RTC, display and SD card intefaces.
Rockchip fixes are for wifi support on certain boards, a eMMC
stability and DT build warnings.
On TI OMAP, a regulator is described in DT to avoid problems with the
ethernet phy initialization.
The code changes include a missing MMIO serialization on OMAP, plus a
few minor fixes on ASpeed and AMD/Zynq chips"
* tag 'soc-fixes-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
ARM: dts: am335x-bone-common: Add vcc-supply for on-board eeprom
ARM: dts: am335x-bone-common: Add GPIO PHY reset on revision C3 board
soc: aspeed: socinfo: Add kfree for kstrdup
soc: aspeed: uart-routing: Use __sysfs_match_string
ARM: dts: integrator: fix PCI bus dtc warnings
arm64: dts: imx93: Fix anatop node size
arm64: dts: qcom: sc7180: Fix DSI0_PHY reg-names
ARM: dts: imx: Set default tuning step for imx6sx usdhc
arm64: dts: imx8mm: Drop CSI1 PHY reference clock configuration
arm64: dts: imx8mn: Drop CSI1 PHY reference clock configuration
ARM: dts: imx: Set default tuning step for imx7d usdhc
ARM: dts: imx6: phytec: fix RTC interrupt level
ARM: dts: imx6sx: Remove LDB endpoint
arm64: dts: rockchip: Fix Wifi/Bluetooth on ROCK Pi 4 boards
ARM: zynq: Explicitly include correct DT includes
arm64: dts: qcom: sa8775p-ride: Update L4C parameters
arm64: dts: rockchip: minor whitespace cleanup around '='
arm64: dts: rockchip: Disable HS400 for eMMC on ROCK 4C+
arm64: dts: rockchip: Disable HS400 for eMMC on ROCK Pi 4
arm64: dts: rockchip: add missing space before { on indiedroid nova
...
Linus Torvalds [Fri, 18 Aug 2023 16:13:36 +0000 (18:13 +0200)]
Merge tag 'asm-generic-fix-6.5' of git://git./linux/kernel/git/arnd/asm-generic
Pull asm-generic regression fix from Arnd Bergmann:
"Just one partial revert for a commit from the merge window that caused
annoying behavior when building old kernels on arm64 hosts"
* tag 'asm-generic-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: partially revert "Unify uapi bitsperlong.h for arm64, riscv and loongarch"