Christoph Hellwig [Tue, 1 Aug 2023 17:22:00 +0000 (19:22 +0200)]
block: use iomap for writes to block devices
Use iomap in buffer_head compat mode to write to block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Aug 2023 17:21:59 +0000 (19:21 +0200)]
block: stop setting ->direct_IO
Direct I/O on block devices now nevers goes through aops->direct_IO.
Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Aug 2023 17:21:58 +0000 (19:21 +0200)]
block: open code __generic_file_write_iter for blkdev writes
Open code __generic_file_write_iter to remove the indirect call into
->direct_IO and to prepare using the iomap based write code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Aug 2023 17:21:57 +0000 (19:21 +0200)]
fs: rename and move block_page_mkwrite_return
block_page_mkwrite_return is neither block nor mkwrite specific, and
should not be under CONFIG_BLOCK. Move it to mm.h and rename it to
vmf_fs_error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Aug 2023 17:21:56 +0000 (19:21 +0200)]
fs: remove emergency_thaw_bdev
Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c
to be built only when buffer_head support is enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 29 Jul 2023 13:17:50 +0000 (07:17 -0600)]
Merge tag 'md-next-
20230729' of https://git./linux/kernel/git/song/md into for-6.6/block
Pull MD updates from Song:
"1. Deprecate bitmap file support, by Christoph Hellwig;
2. Fix deadlock with md sync thread, by Yu Kuai;
3. Refactor md io accounting, by Yu Kuai;
4. Various non-urgent fixes by Li Nan, Yu Kuai, and Jack Wang."
* tag 'md-next-
20230729' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (36 commits)
md/md-bitmap: hold 'reconfig_mutex' in backlog_store()
md/md-bitmap: remove unnecessary local variable in backlog_store()
md/raid10: use dereference_rdev_and_rrdev() to get devices
md/raid10: factor out dereference_rdev_and_rrdev()
md/raid10: check replacement and rdev to prevent submit the same io twice
md/raid1: Avoid lock contention from wake_up()
md: restore 'noio_flag' for the last mddev_resume()
md: don't quiesce in mddev_suspend()
md: remove redundant check in fix_read_error()
md/raid10: optimize fix_read_error
md/raid1: prioritize adding disk to 'removed' mirror
md/md-faulty: enable io accounting
md/md-linear: enable io accounting
md/md-multipath: enable io accounting
md/raid10: switch to use md_account_bio() for io accounting
md/raid1: switch to use md_account_bio() for io accounting
raid5: fix missing io accounting in raid5_align_endio()
md: also clone new io if io accounting is disabled
md: move initialization and destruction of 'io_acct_set' to md.c
md: deprecate bitmap file support
...
Yu Kuai [Thu, 6 Jul 2023 08:37:27 +0000 (16:37 +0800)]
md/md-bitmap: hold 'reconfig_mutex' in backlog_store()
Several reasons why 'reconfig_mutex' should be held:
1) rdev_for_each() is not safe to be called without the lock, because
rdev can be removed concurrently.
2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not
be called concurrently.
3) mddev_suspend() from mddev_destroy/create_serial_pool() should be
protected by the lock.
Fixes:
10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Yu Kuai [Thu, 6 Jul 2023 08:37:26 +0000 (16:37 +0800)]
md/md-bitmap: remove unnecessary local variable in backlog_store()
Local variable is definied first in the beginning of backlog_store(),
there is no need to define it again.
Fixes:
8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Sat, 1 Jul 2023 08:05:29 +0000 (16:05 +0800)]
md/raid10: use dereference_rdev_and_rrdev() to get devices
Commit
2ae6aaf76912 ("md/raid10: fix io loss while replacement replace
rdev") reads replacement first to prevent io loss. However, there are same
issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by
using dereference_rdev_and_rrdev() to get devices.
Fixes:
d30588b2731f ("md/raid10: improve raid10 discard request")
Fixes:
f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Sat, 1 Jul 2023 08:05:28 +0000 (16:05 +0800)]
md/raid10: factor out dereference_rdev_and_rrdev()
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors.
Just to make code cleaner and prepare to fix the bug of io loss while
'replacement' replace 'rdev'.
There is no functional change.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Sat, 1 Jul 2023 08:05:27 +0000 (16:05 +0800)]
md/raid10: check replacement and rdev to prevent submit the same io twice
After commit
4ca40c2ce099 ("md/raid10: Allow replacement device to be
replace old drive."), 'rdev' and 'replacement' could appear to be
identical. There are already checks for that in wait_blocked_dev() and
raid10_write_request(). Add check for raid10_handle_discard() now.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-2-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Jack Wang [Wed, 5 Jul 2023 11:32:27 +0000 (13:32 +0200)]
md/raid1: Avoid lock contention from wake_up()
wake_up is called unconditionally in a few paths such as make_request(),
which cause lock contention under high concurrency workload like below
raid1_end_write_request
wake_up
__wake_up_common_lock
spin_lock_irqsave
Improve performance by only call wake_up() if waitqueue is not empty
Fio test script:
[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]
Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server.
Before this patch With this patch
READ BW=4621MB/s BW=7337MB/s
WRITE BW=1980MB/s BW=3144MB/s
The patch is inspired by Yu Kuai's change for raid10:
https://lore.kernel.org/r/
20230621105728.1268542-1-yukuai1@huaweicloud.com
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com
Signed-off-by: Song Liu <song@kernel.org>
Yu Kuai [Wed, 28 Jun 2023 01:29:31 +0000 (09:29 +0800)]
md: restore 'noio_flag' for the last mddev_resume()
memalloc_noio_save() is called for the first mddev_suspend(), and
repeated mddev_suspend() only increase 'suspended'. However,
memalloc_noio_restore() is also called for the first mddev_resume(),
which means that memory reclaim will be enabled before the last
mddev_resume() is called, while the array is still suspended.
Fix this problem by restore 'noio_flag' for the last mddev_resume().
Fixes:
78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Yu Kuai [Wed, 28 Jun 2023 01:29:30 +0000 (09:29 +0800)]
md: don't quiesce in mddev_suspend()
Some levels doesn't implement "pers->quiesce", for example
raid0_quiesce() is empty, and now that all levels will drop 'active_io'
until io is done, wait for 'active_io' to be 0 is enough to make sure all
normal io is done, and percpu_ref_kill() for 'active_io' will make sure
no new normal io can be dispatched. There is no need to call
"pers->quiesce" anymore from mddev_suspend().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Fri, 23 Jun 2023 17:32:35 +0000 (01:32 +0800)]
md: remove redundant check in fix_read_error()
In fix_read_error(), 'success' will be checked immediately after assigning
it, if it is set to 1 then the loop will break. Checking it again in
condition of loop is redundant. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Fri, 23 Jun 2023 17:32:34 +0000 (01:32 +0800)]
md/raid10: optimize fix_read_error
We dereference r10_bio->read_slot too many times in fix_read_error().
Optimize it by using a variable to store read_slot.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-2-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Li Nan [Tue, 27 Jun 2023 01:43:32 +0000 (09:43 +0800)]
md/raid1: prioritize adding disk to 'removed' mirror
New disk should be added to "removed" position first instead of to be a
replacement. Commit
6090368abcb4 ("md/raid10: prioritize adding disk to
'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230627014332.3810102-1-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Yu Kuai [Wed, 21 Jun 2023 16:51:10 +0000 (00:51 +0800)]
md/md-faulty: enable io accounting
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-9-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:09 +0000 (00:51 +0800)]
md/md-linear: enable io accounting
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-8-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:08 +0000 (00:51 +0800)]
md/md-multipath: enable io accounting
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-7-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:07 +0000 (00:51 +0800)]
md/raid10: switch to use md_account_bio() for io accounting
Make sure that 'active_io' will represent inflight io instead of io that
is dispatching, and io accounting from all levels will be consistent.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-6-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:06 +0000 (00:51 +0800)]
md/raid1: switch to use md_account_bio() for io accounting
Two problems can be fixed this way:
1) 'active_io' will represent inflight io instead of io that is
dispatching.
2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:05 +0000 (00:51 +0800)]
raid5: fix missing io accounting in raid5_align_endio()
Io will only be accounted as done from raid5_align_endio() if the io
succeeded, and io inflight counter will be leaked if such io failed.
Fix this problem by switching to use md_account_bio() for io accounting.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-4-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:04 +0000 (00:51 +0800)]
md: also clone new io if io accounting is disabled
Currently, 'active_io' is grabbed before make_reqeust() is called, and
it's dropped immediately make_reqeust() returns. Hence 'active_io'
actually means io is dispatching, not io is inflight.
For raid0 and raid456 that io accounting is enabled, 'active_io' will
also be grabbed when bio is cloned for io accounting, and this 'active_io'
is dropped until io is done.
Always clone new bio so that 'active_io' will mean that io is inflight,
raid1 and raid10 will switch to use this method in later patches.
Now that bio will be cloned even if io accounting is disabled, also
rename related structure from '*_acct_*' to '*_clone_*'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:03 +0000 (00:51 +0800)]
md: move initialization and destruction of 'io_acct_set' to md.c
'io_acct_set' is only used for raid0 and raid456, prepare to use it for
raid1 and raid10, so that io accounting from different levels can be
consistent.
By the way, follow up patches will also use this io clone mechanism to
make sure 'active_io' represents in flight io, not io that is dispatching,
so that mddev_suspend will wait for io to be done as designed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
Christoph Hellwig [Thu, 15 Jun 2023 06:48:40 +0000 (08:48 +0200)]
md: deprecate bitmap file support
The support for bitmaps on files is a very bad idea abusing various kernel
APIs, and fundamentally requires the file to not be on the actual array
without a way to check that this is actually the case. Add a deprecation
warning to see if we might be able to eventually drop it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:39 +0000 (08:48 +0200)]
md: make bitmap file support optional
The support for write intent bitmaps in files on an external files in md
is a hot mess that abuses ->bmap to map file offsets into physical device
objects, and also abuses buffer_heads in a creative way.
Make this code optional so that MD can be built into future kernels
without buffer_head support, and so that we can eventually deprecate it.
Note this does not affect the internal bitmap support, which has none of
the problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:38 +0000 (08:48 +0200)]
md-bitmap: don't use ->index for pages backing the bitmap file
The md driver allocates pages for storing the bitmap file data, which
are not page cache pages, and then stores the page granularity file
offset in page->index, which is a field that isn't really valid except
for page cache pages.
Use a separate index for the superblock, and use the scheme used at
read size to recalculate the index for the bitmap pages instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-10-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:37 +0000 (08:48 +0200)]
md-bitmap: account for mddev->bitmap_info.offset in read_sb_page
Diretly apply mddev->bitmap_info.offset to the sector number to read
instead of doing that in both callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-9-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:36 +0000 (08:48 +0200)]
md-bitmap: cleanup read_sb_page
Convert read_sb_page to the normal kernel coding style, calculate the
target sector only once, and add a local iosize variable to make the call
to sync_page_io more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-8-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:35 +0000 (08:48 +0200)]
md-bitmap: refactor md_bitmap_init_from_disk
Split the confusing loop in md_bitmap_init_from_disk that iterates over
all chunks but also needs to read and map the pages into three separate
loops: one that iterates over the pages to read them, a second optional
one to iterate over the pages to mark them invalid if the bitmaps are
out of date, and a final one that actually iterates over the chunks.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/
202306160552.smw0qbmb-lkp@intel.com/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-7-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:34 +0000 (08:48 +0200)]
md-bitmap: rename read_page to read_file_page
Make the difference to read_sb_page clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-6-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:33 +0000 (08:48 +0200)]
md-bitmap: split file writes into a separate helper
Split the file write code out of write_page into a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-5-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:32 +0000 (08:48 +0200)]
md-bitmap: use %pD to print the file name in md_bitmap_file_kick
Don't bother allocating an extra buffer in the I/O failure handler and
instead use the printk built-in format to print the last 4 path name
components.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-4-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:31 +0000 (08:48 +0200)]
md-bitmap: initialize variables at declaration time in md_bitmap_file_unmap
Just a small tidyup to prepare for bigger changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-3-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:30 +0000 (08:48 +0200)]
md-bitmap: set BITMAP_WRITE_ERROR in write_sb_page
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating
the error to the caller and setting it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-2-hch@lst.de
Yu Kuai [Mon, 29 May 2023 13:20:37 +0000 (21:20 +0800)]
md: enhance checking in md_check_recovery()
For md_check_recovery():
1) if 'MD_RECOVERY_RUNING' is not set, register new sync_thread.
2) if 'MD_RECOVERY_RUNING' is set:
a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for
md_do_sync() to be done.
b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code
expects that sync_thread is not NULL, otherwise new sync_thread will
be registered, which will corrupt the array.
Make sure md_check_recovery() won't register new sync_thread if
'MD_RECOVERY_RUNING' is still set, and a new WARN_ON_ONCE() is added for
the above corruption,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:36 +0000 (21:20 +0800)]
md: wake up 'resync_wait' at last in md_reap_sync_thread()
md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...)
from action_store(), just make sure action_store() will still wait for
everything to be done in md_reap_sync_thread().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:35 +0000 (21:20 +0800)]
md: refactor idle/frozen_sync_thread() to fix deadlock
Our test found a following deadlock in raid10:
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
raid10_sync_request
raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
action_store
mddev_lock
md_unregister_thread
kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence cyclic dependency exist, in order to fix the problem, we must
break one of them. Dependency 1 and 2 can't be broken because they are
foundation design. Dependency 4 may be possible if it can be guaranteed
that no io can be inflight, however, this requires a new mechanism which
seems complex. Dependency 3 is a good choice, because idle/frozen only
requires sync thread to finish, which can be done asynchronously that is
already implemented, and 'reconfig_mutex' is not needed anymore.
This patch switch 'idle' and 'frozen' to wait sync thread to be done
asynchronously, and this patch also add a sequence counter to record how
many times sync thread is done, so that 'idle' won't keep waiting on new
started sync thread.
Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
deadlock can be fixed by this patch as well.
[1] https://lore.kernel.org/linux-raid/
5ed54ffc-ce82-bf66-4eff-
390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/
e9067438-d713-f5f3-0d3d-
9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:34 +0000 (21:20 +0800)]
md: add a mutex to synchronize idle and frozen in action_store()
Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop sync thread, however, this will
cause deadlock (explained in the next patch). In order to fix the
problem, following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() does.
Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems, for example,
frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
might starve in progress frozen. A mutex is added to synchronize idle
and frozen from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:33 +0000 (21:20 +0800)]
md: refactor action_store() for 'idle' and 'frozen'
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there
are no functional changes except that MD_RECOVERY_RUNNING is checked
again after 'reconfig_mutex' is held.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:32 +0000 (21:20 +0800)]
Revert "md: unlock mddev before reap sync_thread in action_store"
This reverts commit
9dfbdafda3b34e262e43e786077bab8e476a89d1.
Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:
list_add corruption. prev->next should be next (
ffff0001ac1daba0), but was
ffff0000ce1a02a0. (prev=
ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18
This is because work is requeued while it's still inside workqueue:
t1: t2:
action_store
mddev_lock
if (mddev->sync_thread)
mddev_unlock
md_unregister_thread
// first sync_thread is done
md_check_recovery
mddev_try_lock
/*
* once MD_RECOVERY_DONE is set, new sync_thread
* can start.
*/
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
INIT_WORK(&mddev->del_work, md_start_sync)
queue_work(md_misc_wq, &mddev->del_work)
test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
// set pending bit
insert_work
list_add_tail
mddev_unlock
mddev_lock_nointr
md_reap_sync_thread
// MD_RECOVERY_RUNNING is cleared
mddev_unlock
t3:
// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted
The above commit is reverted to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
Jinyoung Choi [Tue, 25 Jul 2023 05:18:39 +0000 (14:18 +0900)]
block: cleanup bio_integrity_prep
If a problem occurs in the process of creating an integrity payload, the
status of bio is always BLK_STS_RESOURCE.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:30 +0000 (10:27 -0700)]
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:29 +0000 (10:27 -0700)]
scsi: Remove a blk_mq_run_hw_queues() call
blk_mq_kick_requeue_list() calls blk_mq_run_hw_queues() asynchronously.
Leave out the direct blk_mq_run_hw_queues() call. This patch causes
scsi_run_queue() to call blk_mq_run_hw_queues() asynchronously instead
of synchronously. Since scsi_run_queue() is not called from the hot I/O
submission path, this patch does not affect the hot path.
This patch prepares for allowing blk_mq_run_hw_queue() to sleep if
BLK_MQ_F_BLOCKING has been set. scsi_run_queue() may be called from
atomic context and must not sleep. Hence the removal of the
blk_mq_run_hw_queues(q, false) call. See also scsi_unblock_requests().
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230721172731.955724-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:28 +0000 (10:27 -0700)]
scsi: Inline scsi_kick_queue()
Inline scsi_kick_queue() to prepare for modifying the second argument
passed to blk_mq_run_hw_queues().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230721172731.955724-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:33 +0000 (09:54 -0700)]
block: don't pass a bio to bio_try_merge_hw_seg
There is no good reason to pass the bio to bio_try_merge_hw_seg. Just
pass the current bvec and rename the function to bvec_try_merge_hw_page.
This will allow reusing this function for supporting multi-page integrity
payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:32 +0000 (09:54 -0700)]
block: move the bi_size update out of __bio_try_merge_page
The update of bi_size is the only thing in __bio_try_merge_page that
needs a bio. Move it to the callers, and merge __bio_try_merge_page
and page_is_mergeable into a single bvec_try_merge_page that only takes
the current bvec instead of a full bio. This will allow reusing this
function for supporting multi-page integrity payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:31 +0000 (09:54 -0700)]
block: downgrade a bio_full call in bio_add_page
bio_add_page already checks that there is space in bi_size a little
earlier. So after we failed to add to an existing segment, just check
that there is another one available instead of duplicating the bi_size
check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:30 +0000 (09:54 -0700)]
block: move the bi_size overflow check in __bio_try_merge_page
Checking for availability in bi_size in a function that attempts to
merge into an existing segment is a bit odd, as the limit also applies
when adding a new segment. This code works fine as we always call
__bio_try_merge_page, but contributes to sub-optimal calling conventions
and doesn't lead to clear code.
Move it to two of the callers instead, the third one already has a more
strict check that includes max_hw_segments anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:29 +0000 (09:54 -0700)]
block: move the bi_vcnt check out of __bio_try_merge_page
Move the bi_vcnt out of __bio_try_merge_page and into the two callers
that don't already have it in preparation for additional changes to
__bio_try_merge_page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:28 +0000 (09:54 -0700)]
block: move the BIO_CLONED checks out of __bio_try_merge_page
__bio_try_merge_page is a way too low-level helper to assert that the
bio is not cloned. Move the check into bio_add_page and
bio_iov_iter_get_pages instead, which are the high level entry points
that should enforce this variant. bio_add_hw_page already this
check, coverig the third (indirect) caller of __bio_try_merge_page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:27 +0000 (09:54 -0700)]
block: use SECTOR_SHIFT bio_add_hw_page
Use the SECTOR_SHIFT magic constant instead of the magic number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:26 +0000 (09:54 -0700)]
block: tidy up the bio full checks in bio_add_hw_page
bio_add_hw_page already checks if the number of bytes trying to be added
even fit into max_hw_sectors limit of the queue. Remove the call to
bio_full and just do a check for the smaller of the number of segments
in the bio and the queue max segments limit, and do this cheap check
before the more expensive gap to previous check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nitesh Shetty [Wed, 19 Jul 2023 12:16:08 +0000 (17:46 +0530)]
block: refactor to use helper
Reduce some code by making use of bio_integrity_bytes().
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230719121608.32105-1-nj.shetty@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:58 +0000 (12:00 +0800)]
blk-flush: reuse rq queuelist in flush state machine
Since we don't need to maintain inflight flush_data requests list
anymore, we can reuse rq->queuelist for flush pending list.
Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist
before reusing it in the state machine when end, since the rq->rq_next
also reuse it, may have corrupted rq->queuelist by the driver.
This patch decrease the size of struct request by 16 bytes.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:57 +0000 (12:00 +0800)]
blk-flush: count inflight flush_data requests
The flush state machine use a double list to link all inflight
flush_data requests, to avoid issuing separate post-flushes for
these flush_data requests which shared PREFLUSH.
So we can't reuse rq->queuelist, this is why we need rq->flush.list
In preparation of the next patch that reuse rq->queuelist for flush
state machine, we change the double linked list to unsigned long
counter, which count all inflight flush_data requests.
This is ok since we only need to know if there is any inflight
flush_data request, so unsigned long counter is good.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:56 +0000 (12:00 +0800)]
blk-flush: fix rq->flush.seq for post-flush requests
If the policy == (REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH), it means that the
data sequence and post-flush sequence need to be done for this request.
The rq->flush.seq should record what sequences have been done (or don't
need to be done). So in this case, pre-flush doesn't need to be done,
we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH.
Fixes:
615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:55 +0000 (12:00 +0800)]
blk-mq: use percpu csd to remote complete instead of per-rq csd
If request need to be completed remotely, we insert it into percpu llist,
and smp_call_function_single_async() if llist is empty previously.
We don't need to use per-rq csd, percpu csd is enough. And the size of
struct request is decreased by 24 bytes.
This way is cleaner, and looks correct, given block softirq is guaranteed
to be scheduled to consume the list if one new request is added to this
percpu list, either smp_call_function_single_async() returns -EBUSY or 0.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 7 Jul 2023 09:42:39 +0000 (11:42 +0200)]
block: don't allow enabling a cache on devices that don't support it
Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC
flag on devices that never claimed the capability.
Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by
blk_queue_write_cache and guards re-enabling the cache through sysfs.
Note that any rescan that calls blk_queue_write_cache will still
re-enable the write cache as in the current code.
Fixes:
93e9d8e836cb ("block: add ability to flag write back caching on a device")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 7 Jul 2023 09:42:38 +0000 (11:42 +0200)]
block: cleanup queue_wc_store
Get rid of the local queue_wc_store variable and handling setting and
clearing the QUEUE_FLAG_WC flag diretly instead the if / else if.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thomas Weißschuh [Thu, 13 Jul 2023 19:29:35 +0000 (21:29 +0200)]
nbd: automatically load module on genl access
Add a module alias to nbd.ko that allows the generic netlink core to
automatically load the module when netlink messages for nbd are
received.
This frees the user from manually having to load the module before using
nbd functionality via netlink.
If the system policy allows it this can even be used to load the nbd
module from containers which would otherwise not have access to the
necessary module files to do a normal "modprobe nbd".
For example this avoids the following error when using nbd-client:
$ nbd-client localhost 10809 /dev/nbd0
...
Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Josef Bacik <josef@toxicpadna.com>
Link: https://lore.kernel.org/r/20230713-b4-nbd-genl-v3-1-226cbddba04b@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Azeem Shaikh [Mon, 3 Jul 2023 17:21:59 +0000 (17:21 +0000)]
blk-wbt: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703172159.3668349-3-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Azeem Shaikh [Mon, 3 Jul 2023 17:21:58 +0000 (17:21 +0000)]
kyber: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703172159.3668349-2-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 16 Jul 2023 22:10:37 +0000 (15:10 -0700)]
Linux 6.5-rc2
Linus Torvalds [Sun, 16 Jul 2023 21:12:49 +0000 (14:12 -0700)]
Merge tag 'xtensa-
20230716' of https://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:
- fix interaction between unaligned exception handler and load/store
exception handler
- fix parsing ISS network interface specification string
- add comment about etherdev freeing to ISS network driver
* tag 'xtensa-
20230716' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix unaligned and load/store configuration interaction
xtensa: ISS: fix call to split_if_spec
xtensa: ISS: add comment about etherdev freeing
Linus Torvalds [Sun, 16 Jul 2023 20:46:08 +0000 (13:46 -0700)]
Merge tag 'perf_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a lockdep warning when the event given is the first one, no event
group exists yet but the code still goes and iterates over event
siblings
* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
Linus Torvalds [Sun, 16 Jul 2023 20:34:29 +0000 (13:34 -0700)]
Merge tag 'objtool_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Mark copy_iovec_from_user() __noclone in order to prevent gcc from
doing an inter-procedural optimization and confuse objtool
- Initialize struct elf fully to avoid build failures
* tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iov_iter: Mark copy_iovec_from_user() noclone
objtool: initialize all of struct elf
Linus Torvalds [Sun, 16 Jul 2023 20:22:08 +0000 (13:22 -0700)]
Merge tag 'sched_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: use kernfs polling functions for PSI trigger polling
sched/fair: Use recent_used_cpu to test p->cpus_ptr
Linus Torvalds [Sun, 16 Jul 2023 19:55:31 +0000 (12:55 -0700)]
Merge tag 'pinctrl-v6.5-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"I'm mostly on vacation but what would vacation be without a few
critical fixes so people can use their gaming laptops when hiding away
from the sun (or rain)?
- Fix a really annoying interrupt storm in the AMD driver affecting
Asus TUF gaming notebooks
- Fix device tree parsing in the Renesas driver"
* tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: Unify debounce handling into amd_pinconf_set()
pinctrl: amd: Drop pull up select configuration
pinctrl: amd: Use amd_pinconf_set() for all config options
pinctrl: amd: Only use special debounce behavior for GPIO 0
pinctrl: renesas: rzg2l: Handle non-unique subnode names
pinctrl: renesas: rzv2m: Handle non-unique subnode names
Linus Torvalds [Sun, 16 Jul 2023 19:49:05 +0000 (12:49 -0700)]
Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Two reconnect fixes: important fix to address inFlight count to leak
(which can leak credits), and fix for better handling a deleted share
- DFS fix
- SMB1 cleanup fix
- deferred close fix
* tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix mid leak during reconnection after timeout threshold
cifs: is_network_name_deleted should return a bool
smb: client: fix missed ses refcounting
smb: client: Fix -Wstringop-overflow issues
cifs: if deferred close is disabled then close files immediately
Linus Torvalds [Sun, 16 Jul 2023 19:28:04 +0000 (12:28 -0700)]
Merge tag 'powerpc-6.5-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix Speculation_Store_Bypass reporting in /proc/self/status on
Power10
- Fix HPT with 4K pages since recent changes by implementing pmd_same()
- Fix 64-bit native_hpte_remove() to be irq-safe
Thanks to Aneesh Kumar K.V, Nageswara R Sastry, and Russell Currey.
* tag 'powerpc-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
powerpc/64e: Fix obtool warnings in exceptions-64e.S
powerpc/security: Fix Speculation_Store_Bypass reporting on Power10
powerpc/64s: Fix native_hpte_remove() to be irq-safe
Linus Torvalds [Sun, 16 Jul 2023 19:18:18 +0000 (12:18 -0700)]
Merge tag 'hardening-v6.5-rc2' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Remove LTO-only suffixes from promoted global function symbols
(Yonghong Song)
- Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)
- Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)
- Claim maintainership of string routines
* tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
sparc: mark __arch_xchg() as __always_inline
MAINTAINERS: Foolishly claim maintainership of string routines
kallsyms: strip LTO-only suffixes from promoted global functions
vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount
Linus Torvalds [Sun, 16 Jul 2023 19:13:51 +0000 (12:13 -0700)]
Merge tag 'probes-fixes-v6.5-rc1-2' of git://git./linux/kernel/git/trace/linux-trace
Pull probe fixes from Masami Hiramatsu:
- fprobe: Add a comment why fprobe will be skipped if another kprobe is
running in fprobe_kprobe_handler().
- probe-events: Fix some issues related to fetch-arguments:
- Fix double counting of the string length for user-string and
symstr. This will require longer buffer in the array case.
- Fix not to count error code (minus value) for the total used
length in array argument. This makes the total used length
shorter.
- Fix to update dynamic used data size counter only if fetcharg uses
the dynamic size data. This may mis-count the used dynamic data
size and corrupt data.
- Revert "tracing: Add "(fault)" name injection to kernel probes"
because that did not work correctly with a bug, and we agreed the
current '(fault)' output (instead of '"(fault)"' like a string)
explains what happened more clearly.
- Fix to record 0-length (means fault access) data_loc data in fetch
function itself, instead of store_trace_args(). If we record an
array of string, this will fix to save fault access data on each
entry of the array correctly.
* tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Revert "tracing: Add "(fault)" name injection to kernel probes"
tracing/probes: Fix to update dynamic data counter if fetcharg uses it
tracing/probes: Fix not to count error code to total length
tracing/probes: Fix to avoid double count of the string length on the array
fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
Linus Torvalds [Sat, 15 Jul 2023 15:51:02 +0000 (08:51 -0700)]
Merge tag 'spi-fix-v6.5-rc1' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A couple of fairly minor driver specific fixes here, plus a bunch of
maintainership and admin updates. Nothing too remarkable"
* tag 'spi-fix-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
mailmap: add entry for Jonas Gorski
MAINTAINERS: add myself for spi-bcm63xx
spi: s3c64xx: clear loopback bit after loopback test
spi: bcm63xx: fix max prepend length
MAINTAINERS: Add myself as a maintainer for Microchip SPI
Linus Torvalds [Sat, 15 Jul 2023 15:46:09 +0000 (08:46 -0700)]
Merge tag 'regmap-fix-v6.5-rc1' of git://git./linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
"One fix for an out of bounds access in the interupt code here"
* tag 'regmap-fix-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap-irq: Fix out-of-bounds access when allocating config buffers
Linus Torvalds [Sat, 15 Jul 2023 15:40:00 +0000 (08:40 -0700)]
Merge tag 'iommu-fixes-v6.5-rc1' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Fix a regression causing a crash on sysfs access of iommu-group
specific files
- Fix signedness bug in SVA code
* tag 'iommu-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()
iommu: Fix crash during syfs iommu_groups/N/type
Linus Torvalds [Sat, 15 Jul 2023 03:19:25 +0000 (20:19 -0700)]
Merge tag 'x86_urgent_for_6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull x86 CFI fixes from Peter Zijlstra:
"Fix kCFI/FineIBT weaknesses
The primary bug Alyssa noticed was that with FineIBT enabled function
prologues have a spurious ENDBR instruction:
__cfi_foo:
endbr64
subl $hash, %r10d
jz 1f
ud2
nop
1:
foo:
endbr64 <--- *sadface*
This means that any indirect call that fails to target the __cfi
symbol and instead targets (the regular old) foo+0, will succeed due
to that second ENDBR.
Fixing this led to the discovery of a single indirect call that was
still doing this: ret_from_fork(). Since that's an assembly stub the
compiler would not generate the proper kCFI indirect call magic and it
would not get patched.
Brian came up with the most comprehensive fix -- convert the thing to
C with only a very thin asm wrapper. This ensures the kernel thread
boostrap is a proper kCFI call.
While discussing all this, Kees noted that kCFI hashes could/should be
poisoned to seal all functions whose address is never taken, further
limiting the valid kCFI targets -- much like we already do for IBT.
So what was a 'simple' observation and fix cascaded into a bunch of
inter-related CFI infrastructure fixes"
* tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y
x86/fineibt: Poison ENDBR at +0
x86: Rewrite ret_from_fork() in C
x86/32: Remove schedule_tail_wrapper()
x86/cfi: Extend ENDBR sealing to kCFI
x86/alternative: Rename apply_ibt_endbr()
x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
Linus Torvalds [Sat, 15 Jul 2023 02:57:29 +0000 (19:57 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a bunch of small driver fixes and a larger rework of zone disk
handling (which reaches into blk and nvme).
The aacraid array-bounds fix is now critical since the security people
turned on -Werror for some build tests, which now fail without it"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: storvsc: Handle SRB status value 0x30
scsi: block: Improve checks in blk_revalidate_disk_zones()
scsi: block: virtio_blk: Set zone limits before revalidating zones
scsi: block: nullblk: Set zone limits before revalidating zones
scsi: nvme: zns: Set zone limits before revalidating zones
scsi: sd_zbc: Set zone limits before revalidating zones
scsi: ufs: core: Add support for qTimestamp attribute
scsi: aacraid: Avoid -Warray-bounds warning
scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER
scsi: ufs: core: Update contact email for monitor sysfs nodes
scsi: scsi_debug: Remove dead code
scsi: qla2xxx: Use vmalloc_array() and vcalloc()
scsi: fnic: Use vmalloc_array() and vcalloc()
scsi: qla2xxx: Fix error code in qla2x00_start_sp()
scsi: qla2xxx: Silence a static checker warning
scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()
Linus Torvalds [Sat, 15 Jul 2023 02:52:18 +0000 (19:52 -0700)]
Merge tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Don't require quirk to use duplicate namespace identifiers
(Christoph, Sagi)
- One more BOGUS_NID quirk (Pankaj)
- IO timeout and error hanlding fixes for PCI (Keith)
- Enhanced metadata format mask fix (Ankit)
- Association race condition fix for fibre channel (Michael)
- Correct debugfs error checks (Minjie)
- Use PAGE_SECTORS_SHIFT where needed (Damien)
- Reduce kernel logs for legacy nguid attribute (Keith)
- Use correct dma direction when unmapping metadata (Ming)
- Fix for a flush handling regression in this release (Christoph)
- Fix for batched request time stamping (Chengming)
- Fix for a regression in the mq-deadline position calculation (Bart)
- Lockdep fix for blk-crypto (Eric)
- Fix for a regression in the Amiga partition handling changes
(Michael)
* tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux:
block: queue data commands from the flush state machine at the head
blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
nvme-pci: fix DMA direction of unmapping integrity data
nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
block/mq-deadline: Fix a bug in deadline_from_pos()
nvme: ensure disabling pairs with unquiesce
nvme-fc: fix race between error recovery and creating association
nvme-fc: return non-zero status code when fails to create association
nvme: fix parameter check in nvme_fault_inject_init()
nvme: warn only once for legacy uuid attribute
block: remove dead struc request->completion_data field
nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
nvmet: use PAGE_SECTORS_SHIFT
nvme: add BOGUS_NID quirk for Samsung SM953
blk-crypto: use dynamic lock class for blk_crypto_profile::lock
block/partition: fix signedness issue for Amiga partitions
Linus Torvalds [Sat, 15 Jul 2023 02:46:54 +0000 (19:46 -0700)]
Merge tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
"Just a single tweak for the wait logic in io_uring"
* tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux:
io_uring: Use io_schedule* in cqring wait
Linus Torvalds [Fri, 14 Jul 2023 18:14:07 +0000 (11:14 -0700)]
Merge tag 'riscv-for-linus-6.5-rc2' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- fix a formatting error in the hwprobe documentation
- fix a spurious warning in the RISC-V PMU driver
- fix memory detection on rv32 (problem does not manifest on any known
system)
- avoid parsing legacy parsing of I in ACPI ISA strings
* tag 'riscv-for-linus-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: Don't include Zicsr or Zifencei in I from ACPI
riscv: mm: fix truncation warning on RV32
perf: RISC-V: Remove PERF_HES_STOPPED flag checking in riscv_pmu_start()
Documentation: RISC-V: hwprobe: Fix a formatting error
Linus Torvalds [Fri, 14 Jul 2023 18:07:04 +0000 (11:07 -0700)]
Merge tag 'pm-6.5-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix hibernation (after recent changes), frequency QoS and the
sparc cpufreq driver.
Specifics:
- Unbreak the /sys/power/resume interface after recent changes (Azat
Khuzhin).
- Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
Yang).
- Remove __init from cpufreq callbacks in the sparc driver, because
they may be called after initialization too (Viresh Kumar)"
* tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: sparc: Don't mark cpufreq callbacks with __init
PM: QoS: Restore support for default value on frequency QoS
PM: hibernate: Fix writing maj:min to /sys/power/resume
Rafael J. Wysocki [Fri, 14 Jul 2023 17:13:21 +0000 (19:13 +0200)]
Merge branches 'pm-sleep' and 'pm-qos'
Merge a PM QoS fix and a hibernation fix for 6.5-rc2.
- Unbreak the /sys/power/resume interface after recent changes (Azat
Khuzhin).
- Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
Yang).
* pm-sleep:
PM: hibernate: Fix writing maj:min to /sys/power/resume
* pm-qos:
PM: QoS: Restore support for default value on frequency QoS
Shyam Prasad N [Fri, 14 Jul 2023 08:56:33 +0000 (08:56 +0000)]
cifs: fix mid leak during reconnection after timeout threshold
When the number of responses with status of STATUS_IO_TIMEOUT
exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect
the connection. But we do not return the mid, or the credits
returned for the mid, or reduce the number of in-flight requests.
This bug could result in the server->in_flight count to go bad,
and also cause a leak in the mids.
This change moves the check to a few lines below where the
response is decrypted, even of the response is read from the
transform header. This way, the code for returning the mids
can be reused.
Also, the cifs_reconnect was reconnecting just the transport
connection before. In case of multi-channel, this may not be
what we want to do after several timeouts. Changed that to
reconnect the session and the tree too.
Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name
MAX_STATUS_IO_TIMEOUT.
Fixes:
8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Fri, 14 Jul 2023 08:56:34 +0000 (08:56 +0000)]
cifs: is_network_name_deleted should return a bool
Currently, is_network_name_deleted and it's implementations
do not return anything if the network name did get deleted.
So the function doesn't fully achieve what it advertizes.
Changed the function to return a bool instead. It will now
return true if the error returned is STATUS_NETWORK_NAME_DELETED
and the share (tree id) was found to be connected. It returns
false otherwise.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Fri, 14 Jul 2023 16:10:28 +0000 (09:10 -0700)]
Merge tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"There were a bunch of fixes lined up for 2 weeks, so we have quite a
few scattered fixes, mostly amdgpu and i915, but ttm has a bunch and
nouveau makes an appearance.
So a bit busier than usual for rc2, but nothing seems out of the
ordinary.
fbdev:
- dma: Fix documented default preferred_bpp value
ttm:
- fix warning that we shouldn't mix && and ||
- never consider pinned BOs for eviction&swap
- Don't leak a resource on eviction error
- Don't leak a resource on swapout move error
- fix bulk_move corruption when adding a entry
client:
- Send hotplug event after registering a client
dma-buf:
- keep the signaling time of merged fences v3
- fix an error pointer vs NULL bug
sched:
- wait for all deps in kill jobs
- call set fence parent from scheduled
i915:
- Don't preserve dpll_hw_state for slave crtc in Bigjoiner
- Consider OA buffer boundary when zeroing out reports
- Remove dead code from gen8_pte_encode
- Fix one wrong caching mode enum usage
amdgpu:
- SMU i2c locking fix
- Fix a possible deadlock in process restoration for ROCm apps
- Disable PCIe lane/speed switching on Intel platforms (the platforms
don't support it)
nouveau:
- disp: fix HDMI on gt215+
- disp/g94: enable HDMI
- acr: Abort loading ACR if no firmware was found
- bring back blit subchannel for pre nv50 GPUs
- Fix drm_dp_remove_payload() invocation
ivpu:
- Fix VPU register access in irq disable
- Clear specific interrupt status bits on C0
bridge:
- dw_hdmi: fix connector access for scdc
- ti-sn65dsi86: Fix auxiliary bus lifetime
panel:
- simple: Add connector_type for innolux_at043tn24
- simple: Add Powertip PH800480T013 drm_display_mode flags"
* tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drm: (32 commits)
drm/nouveau: bring back blit subchannel for pre nv50 GPUs
drm/nouveau/acr: Abort loading ACR if no firmware was found
drm/amd: Align SMU11 SMU_MSG_OverridePcieParameters implementation with SMU13
drm/amd: Move helper for dynamic speed switch check out of smu13
drm/amd/pm: conditionally disable pcie lane/speed switching for SMU13
drm/amd/pm: share the code around SMU13 pcie parameters update
drm/amdgpu: avoid restore process run into dead loop.
drm/amd/pm: fix smu i2c data read risk
drm/nouveau/disp/g94: enable HDMI
drm/nouveau/disp: fix HDMI on gt215+
drm/client: Send hotplug event after registering a client
drm/i915: Fix one wrong caching mode enum usage
drm/i915: Remove dead code from gen8_pte_encode
drm/i915/perf: Consider OA buffer boundary when zeroing out reports
drm/i915: Don't preserve dpll_hw_state for slave crtc in Bigjoiner
drm/ttm: never consider pinned BOs for eviction&swap
drm/fbdev-dma: Fix documented default preferred_bpp value
dma-buf: fix an error pointer vs NULL bug
accel/ivpu: Clear specific interrupt status bits on C0
accel/ivpu: Fix VPU register access in irq disable
...
Linus Torvalds [Fri, 14 Jul 2023 16:05:15 +0000 (09:05 -0700)]
Merge tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix to prevent a potential buffer overrun in the messenger, marked
for stable"
* tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client:
libceph: harden msgr2.1 frame segment length checks
Christoph Hellwig [Fri, 14 Jul 2023 14:30:14 +0000 (16:30 +0200)]
block: queue data commands from the flush state machine at the head
We used to insert the data commands following a pre-flush to the head
of the queue until commit
1e82fadfc6b ("blk-mq: do not do head insertions
post-pre-flush commands"). Not doing this seems to cause hangs of
such commands on NFS workloads when exported from file systems with
SATA SSDs. I have no idea why this would starve these workloads,
but doing a semantic revert of this patch (which looks quite different
due to various other changes) fixes the hangs.
Fixes:
1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Thu, 6 Apr 2023 08:55:31 +0000 (11:55 +0300)]
iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()
The ida_alloc_range() function returns negative error codes on error.
On success it returns values in the min to max range (inclusive). It
never returns more then INT_MAX even if "max" is higher. It never
returns values in the 0 to (min - 1) range.
The bug is that "min" is an unsigned int so negative error codes will
be promoted to high positive values errors treated as success.
Fixes:
1a14bf0fc7ed ("iommu/sva: Use GFP_KERNEL for pasid allocation")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/6b32095d-7491-4ebb-a850-12e96209eaaf@kili.mountain
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Jason Gunthorpe [Mon, 26 Jun 2023 15:13:11 +0000 (12:13 -0300)]
iommu: Fix crash during syfs iommu_groups/N/type
The err_restore_domain flow was accidently inserted into the success path
in commit
1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on
ARM"). It should only happen if iommu_create_device_direct_mappings()
fails. This caused the domains the be wrongly changed and freed whenever
the sysfs is used, resulting in an oops:
BUG: kernel NULL pointer dereference, address:
0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 3417 Comm: avocado Not tainted 6.4.0-rc4-next-
20230602 #3
Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021
RIP: 0010:__iommu_attach_device+0xc/0xa0
Code: c0 c3 cc cc cc cc 48 89 f0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 8b 47 08 <48> 8b 00 48 85 c0 74 74 48 89 f5 e8 64 12 49 00 41 89 c4 85 c0 74
RSP: 0018:
ffffabae0220bd48 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
ffff9ac04f70e410 RCX:
0000000000000001
RDX:
ffff9ac044db20c0 RSI:
ffff9ac044fa50d0 RDI:
ffff9ac04f70e410
RBP:
ffff9ac044fa50d0 R08:
1000000100209001 R09:
00000000000002dc
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff9ac043d54700
R13:
ffff9ac043d54700 R14:
0000000000000001 R15:
0000000000000001
FS:
00007f02e30ae000(0000) GS:
ffff9afeb2440000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000000 CR3:
000000012afca006 CR4:
0000000000770ee0
PKRU:
55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x82/0x150
? __iommu_queue_command_sync+0x80/0xc0
? exc_page_fault+0x69/0x150
? asm_exc_page_fault+0x26/0x30
? __iommu_attach_device+0xc/0xa0
? __iommu_attach_device+0x1c/0xa0
__iommu_device_set_domain+0x42/0x80
__iommu_group_set_domain_internal+0x5d/0x160
iommu_setup_default_domain+0x318/0x400
iommu_group_store_type+0xb1/0x200
kernfs_fop_write_iter+0x12f/0x1c0
vfs_write+0x2a2/0x3b0
ksys_write+0x63/0xe0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f02e2f14a6f
Reorganize the error flow so that the success branch and error branches
are clearer.
Fixes:
1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM")
Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Tested-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-5bd8cc969d9e+1f1-iommu_set_def_fix_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:16:07 +0000 (23:16 +0900)]
tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails
to get the string data.
Currently those expect that the data_loc is updated by store_trace_args() if
it returns the error code. However, that does not work correctly if the
argument is an array of strings. In that case, store_trace_args() only clears
the first entry of the array (which may have no error) and leaves other
entries. So it should be cleared by fetch_store_string*() itself.
Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated
only if it is used (ret > 0 and argument is a dynamic data.)
Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/
Fixes:
40b53b771806 ("tracing: probeevent: Add array type support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Dave Airlie [Fri, 14 Jul 2023 01:44:54 +0000 (11:44 +1000)]
Merge tag 'amd-drm-fixes-6.5-2023-07-12' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.5-2023-07-12:
amdgpu:
- SMU i2c locking fix
- Fix a possible deadlock in process restoration for ROCm apps
- Disable PCIe lane/speed switching on Intel platforms (the platforms don't support it)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230712184009.7740-1-alexander.deucher@amd.com
Dave Airlie [Fri, 14 Jul 2023 01:13:14 +0000 (11:13 +1000)]
Merge tag 'drm-intel-fixes-2023-07-13' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Don't preserve dpll_hw_state for slave crtc in Bigjoiner (Stanislav Lisovskiy)
- Consider OA buffer boundary when zeroing out reports [perf] (Umesh Nerlige Ramappa)
- Remove dead code from gen8_pte_encode (Tvrtko Ursulin)
- Fix one wrong caching mode enum usage (Tvrtko Ursulin)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZK+nHLCltaxoxVw/@tursulin-desk
Dave Airlie [Fri, 14 Jul 2023 00:37:18 +0000 (10:37 +1000)]
Merge tag 'drm-misc-fixes-2023-07-13' of ssh://git.freedesktop.org/git/drm/drm-misc into drm-fixes
A couple of nouveau patches addressing improving HDMI support and
firmware handling, a fix for TTM to skip pinned BO when evicting, and a
fix for the fbdev documentation.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/nq3ke75juephbex5acfyi5t6bxv22nhmfcpfhru55haj2nv3us@gehrlmjbqgjk
Linus Torvalds [Thu, 13 Jul 2023 21:35:02 +0000 (14:35 -0700)]
Merge tag 'erofs-for-6.5-rc2-fixes' of git://git./linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Three patches address regressions related to post-EOF unexpected
behaviors and fsdax unavailability of chunk-based regular files.
The other two patches mainly get rid of kmap_atomic() and simplify
z_erofs_transform_plain().
- Fix two unexpected loop cases when reading beyond EOF
- Fix fsdax unavailability for chunk-based regular files
- Get rid of the remaining kmap_atomic()
- Minor cleanups"
* tag 'erofs-for-6.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix fsdax unavailability for chunk-based regular files
erofs: avoid infinite loop in z_erofs_do_read_page() when reading beyond EOF
erofs: avoid useless loops in z_erofs_pcluster_readmore() when reading beyond EOF
erofs: simplify z_erofs_transform_plain()
erofs: get rid of the remaining kmap_atomic()
Linus Torvalds [Thu, 13 Jul 2023 21:21:22 +0000 (14:21 -0700)]
Merge tag 'net-6.5-rc2' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter, wireless and ebpf.
Current release - regressions:
- netfilter: conntrack: gre: don't set assured flag for clash entries
- wifi: iwlwifi: remove 'use_tfh' config to fix crash
Previous releases - regressions:
- ipv6: fix a potential refcount underflow for idev
- icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in
icmp6_dev()
- bpf: fix max stack depth check for async callbacks
- eth: mlx5e:
- check for NOT_READY flag state after locking
- fix page_pool page fragment tracking for XDP
- eth: igc:
- fix tx hang issue when QBV gate is closed
- fix corner cases for TSN offload
- eth: octeontx2-af: Move validation of ptp pointer before its usage
- eth: ena: fix shift-out-of-bounds in exponential backoff
Previous releases - always broken:
- core: prevent skb corruption on frag list segmentation
- sched:
- cls_fw: fix improper refcount update leads to use-after-free
- sch_qfq: account for stab overhead in qfq_enqueue
- netfilter:
- report use refcount overflow
- prevent OOB access in nft_byteorder_eval
- wifi: mt7921e: fix init command fail with enabled device
- eth: ocelot: fix oversize frame dropping for preemptible TCs
- eth: fec: recycle pages for transmitted XDP frames"
* tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
selftests: tc-testing: add test for qfq with stab overhead
net/sched: sch_qfq: account for stab overhead in qfq_enqueue
selftests: tc-testing: add tests for qfq mtu sanity check
net/sched: sch_qfq: reintroduce lmax bound check for MTU
wifi: cfg80211: fix receiving mesh packets without RFC1042 header
wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()
net: txgbe: fix eeprom calculation error
net/sched: make psched_mtu() RTNL-less safe
net: ena: fix shift-out-of-bounds in exponential backoff
netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()
net/sched: flower: Ensure both minimum and maximum ports are specified
MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER
docs: netdev: update the URL of the status page
wifi: iwlwifi: remove 'use_tfh' config to fix crash
xdp: use trusted arguments in XDP hints kfuncs
bpf: cpumap: Fix memory leak in cpu_map_update_elem
wifi: airo: avoid uninitialized warning in airo_get_rate()
octeontx2-pf: Add additional check for MCAM rules
net: dsa: Removed unneeded of_node_put in felix_parse_ports_node
net: fec: use netdev_err_once() instead of netdev_err()
...
Linus Torvalds [Thu, 13 Jul 2023 20:44:28 +0000 (13:44 -0700)]
Merge tag 'trace-v6.5-rc1-3' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix some missing-prototype warnings
- Fix user events struct args (did not include size of struct)
When creating a user event, the "struct" keyword is to denote that
the size of the field will be passed in. But the parsing failed to
handle this case.
- Add selftest to struct sizes for user events
- Fix sample code for direct trampolines.
The sample code for direct trampolines attached to handle_mm_fault().
But the prototype changed and the direct trampoline sample code was
not updated. Direct trampolines needs to have the arguments correct
otherwise it can fail or crash the system.
- Remove unused ftrace_regs_caller_ret() prototype.
- Quiet false positive of FORTIFY_SOURCE
Due to backward compatibility, the structure used to save stack
traces in the kernel had a fixed size of 8. This structure is
exported to user space via the tracing format file. A change was made
to allow more than 8 functions to be recorded, and user space now
uses the size field to know how many functions are actually in the
stack.
But the structure still has size of 8 (even though it points into the
ring buffer that has the required amount allocated to hold a full
stack.
This was fine until the fortifier noticed that the
memcpy(&entry->caller, stack, size) was greater than the 8 functions
and would complain at runtime about it.
Hide this by using a pointer to the stack location on the ring buffer
instead of using the address of the entry structure caller field.
- Fix a deadloop in reading trace_pipe that was caused by a mismatch
between ring_buffer_empty() returning false which then asked to read
the data, but the read code uses rb_num_of_entries() that returned
zero, and causing a infinite "retry".
- Fix a warning caused by not using all pages allocated to store ftrace
functions, where this can happen if the linker inserts a bunch of
"NULL" entries, causing the accounting of how many pages needed to be
off.
- Fix histogram synthetic event crashing when the start event is
removed and the end event is still using a variable from it
- Fix memory leak in freeing iter->temp in tracing_release_pipe()
* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix memory leak of iter->temp when reading trace_pipe
tracing/histograms: Add histograms to hist_vars if they have referenced variables
tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
ring-buffer: Fix deadloop issue on reading trace_pipe
tracing: arm64: Avoid missing-prototype warnings
selftests/user_events: Test struct size match cases
tracing/user_events: Fix struct arg size match check
x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
arm64: ftrace: Add direct call trampoline samples support
samples: ftrace: Save required argument registers in sample trampolines
Linus Torvalds [Thu, 13 Jul 2023 20:39:36 +0000 (13:39 -0700)]
Merge tag 'for-linus-6.5-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a cleanup of the Xen related ELF-notes
- a fix for virtio handling in Xen dom0 when running Xen in a VM
* tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent
x86/Xen: tidy xen-head.S
Linus Torvalds [Thu, 13 Jul 2023 20:34:00 +0000 (13:34 -0700)]
Merge tag 'sh-for-v6.5-tag2' of git://git./linux/kernel/git/glaubitz/sh-linux
Pull sh fixes from John Paul Adrian Glaubitz:
"The sh updates introduced multiple regressions.
In particular, the change
a8ac2961148e ("sh: Avoid using IRQ0 on SH3
and SH4") causes several boards to hang during boot due to incorrect
IRQ numbers.
Geert Uytterhoeven has contributed patches that handle the virq offset
in the IRQ code for the dreamcast, highlander and r2d boards while
Artur Rojek has contributed a patch which handles the virq offset for
the hd64461 companion chip"
* tag 'sh-for-v6.5-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQ
sh: mach-dreamcast: Handle virq offset in cascaded IRQ demux
sh: mach-highlander: Handle virq offset in cascaded IRL demux
sh: mach-r2d: Handle virq offset in cascaded IRL demux