Christoph Hellwig [Fri, 3 Feb 2023 15:03:53 +0000 (16:03 +0100)]
blk-rq-qos: move rq_qos_add and rq_qos_del out of line
These two functions are rather larger and not in a fast path, so move
them out of line.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:52 +0000 (16:03 +0100)]
blk-wbt: open code wbt_queue_depth_changed in wbt_init
wbt_queue_depth_changed just updates a field and calls another function.
Open code it in wbt_init, so that the local queue variable can be used
instead of the one stored in the rq_qos. This will allow delaying that
rq_qos->queue assignment in a subsequent patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:51 +0000 (16:03 +0100)]
blk-wbt: move private information from blk-wbt.h to blk-wbt.c
A large part of blk-wbt.h is only used in blk-wbt.c, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:50 +0000 (16:03 +0100)]
blk-wbt: pass a gendisk to wbt_init
Pass a gendisk to wbt_init to prepare for phasing out usage of the
request_queue in the blk-cgroup code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:49 +0000 (16:03 +0100)]
blk-wbt: pass a gendisk to wbt_{enable,disable}_default
Pass a gendisk to wbt_enable_default and wbt_disable_default to
prepare for phasing out usage of the request_queue in the blk-cgroup
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:48 +0000 (16:03 +0100)]
blk-cgroup: store a gendisk to throttle in struct task_struct
Switch from a request_queue pointer and reference to a gendisk once
for the throttle information in struct task_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:47 +0000 (16:03 +0100)]
blk-cgroup: pin the gendisk in struct blkcg_gq
Currently each blkcg_gq holds a request_queue reference, which is what
is used in the policies. But a lot of these interfaces will move over to
use a gendisk, so store a disk in struct blkcg_gq and hold a reference to
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:46 +0000 (16:03 +0100)]
blk-cgroup: remove the !bdi->dev check in blkg_dev_name
bdi_dev_name already performs the same check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:45 +0000 (16:03 +0100)]
blk-cgroup: simplify blkg freeing from initialization failure paths
There is no need to delay freeing a blkg to a workqueue when freeing it
after an initialization failure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:44 +0000 (16:03 +0100)]
blk-cgroup: improve error unwinding in blkg_alloc
Unwind only the previous initialization steps that happened in blkg_alloc
using goto based unwinding. This avoids the need for the !queue special
case in blkg_free and thus ensures that any blkg seens outside of
blkg_alloc is always fully constructed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:43 +0000 (16:03 +0100)]
blk-cgroup: delay blk-cgroup initialization until add_disk
There is no need to initialize the cgroup code before the disk is marked
live. Moving the cgroup initialization earlier will help to have a
fully initialized struct device in the gendisk for the cgroup code to
use in the future. Similarly tear the cgroup information down in
del_gendisk to be symmetric and because none of the cgroup tracking is
needed once non-passthrough I/O stops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 3 Feb 2023 15:03:42 +0000 (16:03 +0100)]
block: don't call blk_throtl_stat_add for non-READ/WRITE commands
blk_throtl_stat_add is called from blk_stat_add explicitly, unlike the
other stats that go through q->stats->callbacks. To prepare for cgroup
data moving to the gendisk, ensure blk_throtl_stat_add is only called
for the plain READ and WRITE commands that it actually handles internally,
as blk_stat_add can also be called for passthrough commands on queues that
do not have a gendisk associated with them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 2 Feb 2023 23:04:54 +0000 (16:04 -0700)]
Merge branch 'md-next' of https://git./linux/kernel/git/song/md into for-6.3/block
Pull MD updates from Song:
"Non-urgent fixes:
md: don't update recovery_cp when curr_resync is ACTIVE
md: Free writes_pending in md_stop
Performance optimization:
md: Change active_io to percpu"
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: use MD_RESYNC_* whenever possible
md: Free writes_pending in md_stop
md: Change active_io to percpu
md: Factor out is_md_suspended helper
md: don't update recovery_cp when curr_resync is ACTIVE
Hou Tao [Wed, 1 Feb 2023 07:59:20 +0000 (15:59 +0800)]
md: use MD_RESYNC_* whenever possible
Just replace magic numbers by MD_RESYNC_* enumerations.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Xiao Ni [Sat, 21 Jan 2023 01:48:10 +0000 (09:48 +0800)]
md: Free writes_pending in md_stop
dm raid calls md_stop to stop the raid device. It needs to
free the writes_pending here.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Xiao Ni [Tue, 31 Jan 2023 05:17:10 +0000 (13:17 +0800)]
md: Change active_io to percpu
Now the type of active_io is atomic. It's used to count how many ios are
in the submitting process and it's added and decreased very time. But it
only needs to check if it's zero when suspending the raid. So we can
switch atomic to percpu to improve the performance.
After switching active_io to percpu type, we use the state of active_io
to judge if the raid device is suspended. And we don't need to wake up
->sb_wait in md_handle_request anymore. It's done in the callback function
which is registered when initing active_io. The argument mddev->suspended
is only used to count how many users are trying to set raid to suspend
state.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Xiao Ni [Tue, 31 Jan 2023 05:17:09 +0000 (13:17 +0800)]
md: Factor out is_md_suspended helper
This helper function will be used in next patch. It's easy for
understanding.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Hou Tao [Tue, 31 Jan 2023 07:07:19 +0000 (15:07 +0800)]
md: don't update recovery_cp when curr_resync is ACTIVE
Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise
md may skip the resync of the first 3 sectors if the resync procedure is
interrupted before the first calling of ->sync_request() as shown below:
md_do_sync thread control thread
// setup resync
mddev->recovery_cp = 0
j = 0
mddev->curr_resync = MD_RESYNC_ACTIVE
// e.g., set array as idle
set_bit(MD_RECOVERY_INTR, &&mddev_recovery)
// resync loop
// check INTR before calling sync_request
!test_bit(MD_RECOVERY_INTR, &mddev->recovery
// resync interrupted
// update recovery_cp from 0 to 3
// the resync of three 3 sectors will be skipped
mddev->recovery_cp = 3
Fixes:
eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync")
Cc: stable@vger.kernel.org # 6.0+
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Bart Van Assche [Mon, 30 Jan 2023 21:13:47 +0000 (13:13 -0800)]
loop: Improve the hw_queue_depth kernel module parameter implementation
Make the following minor changes which were reported by colleagues
while reviewing this code:
- Remove the parentheses from around the LOOP_DEFAULT_HW_Q_DEPTH
definition since these are superfluous.
- Accept other number formats than decimal, e.g. hexadecimal.
- Do not set hw_queue_depth to an out-of-range value, even if that value
won't be used.
- Use the LOOP_DEFAULT_HW_Q_DEPTH macro in the kernel module parameter
description to prevent that the description gets out of sync.
This patch has been tested as follows:
# modprobe -r loop
# modprobe loop hw_queue_depth=-1
modprobe: ERROR: could not insert 'loop': Invalid argument
# modprobe loop hw_queue_depth=0
modprobe: ERROR: could not insert 'loop': Invalid argument
# modprobe loop hw_queue_depth=1; cat /sys/module/loop/parameters/hw_queue_depth
1
# modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=0x10
16
# modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=128
128
# modprobe -r loop; modprobe loop hw_queue_depth=129; cat /sys/module/loop/parameters/hw_queue_depth
129
# modprobe -r loop; modprobe loop hw_queue_depth=$((1<<32))
modprobe: ERROR: could not insert 'loop': Numerical result out of range
See also commit
ef44c50837ab ("loop: allow user to set the queue
depth").
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230130211347.832110-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matthew Wilcox [Tue, 31 Jan 2023 05:01:32 +0000 (05:01 +0000)]
block: Remove mm.h from bvec.h
This was originally added for the definition of nth_page(), but we no
longer use nth_page() in this header, so we can drop the heavyweight
mm.h now.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230131050132.2627124-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 31 Jan 2023 04:04:46 +0000 (12:04 +0800)]
ublk_drv: only allow owner to open unprivileged disk
Owner of one unprivileged ublk device could be one evil user, which
can grant this disk's privilege to other users deliberately, and
this way could be like making one trap and waiting for other users
to be caught.
So only owner to open unprivileged disk even though the owner
grants disk privilege to other user. This way is reasonable too
given anyone can create ublk disk, and no need other's grant.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Fixes:
4093cb5a0634 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230131040446.214583-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ulf Hansson [Mon, 30 Jan 2023 12:12:40 +0000 (13:12 +0100)]
block: Default to use cgroup support for BFQ
Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ are set, we
most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's
make it default y.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:53 +0000 (17:51 +0800)]
block, bfq: remove unused bfq_wr_max_time in struct bfq_data
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed.
It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never
meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:52 +0000 (17:51 +0800)]
block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq
We jump to tag only for returning current rq. Return directly to
remove this tag.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:51 +0000 (17:51 +0800)]
block, bfq: remove redundant check in bfq_put_cooperator
We have already avoided a circular list in bfq_setup_merge (see comments
in bfq_setup_merge() for details), so bfq_queue will not appear in it's
new_bfqq list. Just remove this check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:50 +0000 (17:51 +0800)]
block, bfq: remove unnecessary dereference to get async_bfqq
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:49 +0000 (17:51 +0800)]
block, bfq: use helper macro RQ_BFQQ to get bfqq of request
Use helper macro RQ_BFQQ to get bfqq of request.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:48 +0000 (17:51 +0800)]
block, bfq: initialize bfqq->decrease_time_jif correctly
Inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) or think-time state changes.
decrease_time_jif is initialized to 0 and will be set to current jiffies
when inject limit is updated or reset. If the jiffies is slightly greater
than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as
time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the
think-time state never chages, then the injection will not work as expected
for long time.
To be more specific:
Function bfq_update_inject_limit maybe triggered when jiffies pasts
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
Function bfq_reset_inject_limit are called in two conditions:
1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
function bfq_add_request.
2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or
bfq think-time state change from short to long.
Fix this by initializing bfqq->decrease_time_jif to current jiffies
to trigger service injection soon when service injection conditions
are met.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:47 +0000 (17:51 +0800)]
block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow
Parameter reason is never used, just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:46 +0000 (17:51 +0800)]
block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit
to one request if current inject_limit is 0 before search of the source
queue for injection. However the search below will reset inject limit to
bfqd->in_service_queue which is zero for raised inject limit. Then the
temporarily raised inject limit never works as expected.
Assigment limit to bfqd->in_service_queue in search is needed as limit
maybe overwriten to min_t(unsigned int, 1, limit) for condition that
a large in-flight request is on non-rotational devices in found queue.
So we need to reset limit to bfqd->in_service_queue for normal case.
Actually, we have already make sure bfqd->rq_in_driver is < limit before
search, then
-Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int,
1, limit) is always 1. So we can simply check bfqd->rq_in_driver with
1 instead of result of min_t(unsigned int, 1, limit) for larget request in
non-rotational device case to avoid overwritting limit and the bug is gone.
-For normal case, we have already check bfqd->rq_in_driver is < limit,
so we can return found bfqq unconditionally to remove unncessary check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:59 +0000 (04:50 +0800)]
sbitmap: correct wake_batch recalculation to avoid potential IO hung
Commit
180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth where hctx_max_depth is available tags depth for the
actve hctx(queue). However, the condition wake_batch > hctx_max_depth is
not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch even though there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may not be reached anymore. Commit
180dccb0dba4f ("blk-mq:
fix tag_get wait task can't be awakened") methioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hung. But IO hung for 30
seconds is also not acceptable. Set batch size to small enough that depth
of the shared hctx(queue) is enough to wake up all of the queues like
sbq_calc_wake_batch do to fix this potential IO hung.
Although hctx_max_depth will be clamped to at least 4 while wake_batch
recalculation does not do the clamp, the wake_batch will be always
recalculated to 1 when hctx_max_depth <= 4.
Fixes:
180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:58 +0000 (04:50 +0800)]
sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow
There are three differences between __sbitmap_get and
__sbitmap_get_shallow when searching free bit:
1. __sbitmap_get_shallow limit number of bit to search per word.
__sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set
wrap according to round_robin.
3. __sbitmap_get_shallow always searches from first bit in first word.
__sbitmap_get searches from first bit when round_robin is not set
otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).
Add helper function sbitmap_find_bit function to do common search while
accept "limit depth per word", "wrap flag" and "first bit to
search" from caller to support the need of both __sbitmap_get and
__sbitmap_get_shallow.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:57 +0000 (04:50 +0800)]
sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code
Rewrite sbitmap_find_bit_in_index as following:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
2. Accept "struct sbitmap_word *" directly instead of accepting
"struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller
to support need of both __sbitmap_get_shallow and __sbitmap_get.
With helper function sbitmap_find_bit_in_word, we can remove repeat
code in __sbitmap_get_shallow to find bit considring deferred clear.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:56 +0000 (04:50 +0800)]
sbitmap: remove redundant check in __sbitmap_queue_get_batch
Commit
fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
This commit add check to make sure all masked bits in word before
cmpxchg is zero. Then the existing check after cmpxchg to check any
zero bit is existing in masked bits in word is redundant.
Actually, old value of word before cmpxchg is stored in val and we
will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg.
So we will not reuse busy bits methioned in commit
fbb564a557809
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert
new-added check to remove redundant check.
Fixes:
fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:55 +0000 (04:50 +0800)]
sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
simplify the logic.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:50 +0000 (19:03 +0800)]
blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
Currently parent pd can be freed before child pd:
t1: remove cgroup C1
blkcg_destroy_blkgs
blkg_destroy
list_del_init(&blkg->q_node)
// remove blkg from queue list
percpu_ref_kill(&blkg->refcnt)
blkg_release
call_rcu
t2: from t1
__blkg_release
blkg_free
schedule_work
t4: deactivate policy
blkcg_deactivate_policy
pd_free_fn
// parent of C1 is freed first
t3: from t2
blkg_free_workfn
pd_free_fn
If policy(for example, ioc_timer_fn() from iocost) access parent pd from
child pd after pd_offline_fn(), then UAF can be triggered.
Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:49 +0000 (19:03 +0800)]
blk-cgroup: support to track if policy is online
A new field 'online' is added to blkg_policy_data to fix following
2 problem:
1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT'
failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again
without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with
it, and pd_offline_fn() will be called without pd_init_fn() and
pd_online_fn(). This way null-ptr-deference can be triggered.
2) In order to synchronize pd_free_fn() from blkg_free_workfn() and
blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be
delayed to blkg_free_workfn(), hence pd_offline_fn() can be called
first in blkg_destroy(), and then blkcg_deactivate_policy() will
call it again, we must prevent it.
The new field 'online' will be set after pd_online_fn() and will be
cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only
be called if 'online' is set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:48 +0000 (19:03 +0800)]
blk-cgroup: dropping parent refcount after pd_free_fn() is done
Some cgroup policies will access parent pd through child pd even
after pd_offline_fn() is done. If pd_free_fn() for parent is called
before child, then UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn().
Currently refcount of parent blkg is dropped in __blkg_release(), which
is before pd_free_fn() is called in blkg_free_work_fn() while
blkg_free_work_fn() is called asynchronously.
This patch make sure pd_free_fn() called from removing cgroup is ordered
by delaying dropping parent refcount after calling pd_free_fn() for
child.
BTW, pd_free_fn() will also be called from blkcg_deactivate_policy()
from deleting device, and following patches will guarantee the order.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zhong Jinghua [Sat, 28 Jan 2023 03:04:19 +0000 (11:04 +0800)]
blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store
We found that the blk_mq_hw_sysfs_store interface has no place to use.
The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses
the show method and does not use the store method.
Since this patch:
4a46f05ebf99 ("blk-mq: move hctx and ctx counters from sysfs to debugfs")
moved the store method to debugfs, the store method is not used anymore.
So let me do some tiny work to clean up unused code.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 23 Jan 2023 07:53:56 +0000 (08:53 +0100)]
s390/dcssblk:: don't call bio_split_to_limits
s390 iterates over the bio using bio_for_each_segment and doesn't need
any bio splitting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20230123075356.60847-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 23 Jan 2023 07:47:18 +0000 (08:47 +0100)]
ps3vram: remove bio splitting
ps3vram iterates over the bio one segment, that is page aligned and max
page sized chunk, a time. Because of that there is no point in
calling bio_split_to_limits, or explicitly setting the default limits
that are only used by bio_split_to_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20230123074718.57951-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 20 Jan 2023 14:51:07 +0000 (07:51 -0700)]
block: treat poll queue enter similarly to timeouts
We ran into an issue where a production workload would randomly grind to
a halt and not continue until the pending IO had timed out. This turned
out to be a complicated interaction between queue freezing and polled
IO:
1) You have an application that does polled IO. At any point in time,
there may be polled IO pending.
2) You have a monitoring application that issues a passthrough command,
which is marked with side effects such that it needs to freeze the
queue.
3) Passthrough command is started, which calls blk_freeze_queue_start()
on the device. At this point the queue is marked frozen, and any
attempt to enter the queue will fail (for non-blocking) or block.
4) Now the driver calls blk_mq_freeze_queue_wait(), which will return
when the queue is quiesced and pending IO has completed.
5) The pending IO is polled IO, but any attempt to poll IO through the
normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to
bio_queue_enter() as the queue is frozen. Rather than poll and
complete IO, the polling threads will sit in a tight loop attempting
to poll, but failing to enter the queue to do so.
The end result is that progress for either application will be stalled
until all pending polled IO has timed out. This causes obvious huge
latency issues for the application doing polled IO, but also long delays
for passthrough command.
Fix this by treating queue enter for polled IO just like we do for
timeouts. This allows quick quiesce of the queue as we still poll and
complete this IO, while still disallowing queueing up new IO.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Tue, 17 Jan 2023 07:08:06 +0000 (15:08 +0800)]
blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params()
vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated
by div64_u64. Vrate_min may be 1 greater than vrate_max if the input
values min and max of cost.qos are equal.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Tue, 17 Jan 2023 07:08:05 +0000 (15:08 +0800)]
blk-iocost: fix divide by 0 error in calc_lcoefs()
echo max of u64 to cost.model can cause divide by 0 error.
# echo 8:0 rbps=
18446744073709551615 > /sys/fs/cgroup/io.cost.model
divide error: 0000 [#1] PREEMPT SMP
RIP: 0010:calc_lcoefs+0x4c/0xc0
Call Trace:
<TASK>
ioc_refresh_params+0x2b3/0x4f0
ioc_cost_model_write+0x3cb/0x4c0
? _copy_from_iter+0x6d/0x6c0
? kernfs_fop_write_iter+0xfc/0x270
cgroup_file_write+0xa0/0x200
kernfs_fop_write_iter+0x17d/0x270
vfs_write+0x414/0x620
ksys_write+0x73/0x160
__x64_sys_write+0x1e/0x30
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL,
overflow would happen if bps plus IOC_PAGE_SIZE is greater than
ULLONG_MAX, it can cause divide by 0 error.
Fix the problem by setting basecost
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:04 +0000 (15:08 +0800)]
blk-iocost: read params inside lock in sysfs apis
Otherwise, user might get abnormal values if params is updated
concurrently.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:03 +0000 (15:08 +0800)]
blk-iocost: don't allow to configure bio based device
iocost is based on rq_qos, which can only work for request based device,
thus it doesn't make sense to configure iocost for bio based device.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:02 +0000 (15:08 +0800)]
blk-iocost: check return value of match_u64()
This patch fixs that the return value of match_u64() from ioc_qos_write()
is not checked,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Wed, 18 Jan 2023 08:07:01 +0000 (09:07 +0100)]
blk-iocost: avoid 64-bit division in ioc_timer_fn
The behavior of 'enum' types has changed in gcc-13, so now the
UNBUSY_THR_PCT constant is interpreted as a 64-bit number because
it is defined as part of the same enum definition as some other
constants that do not fit within a 32-bit integer. This in turn
leads to some inefficient code on 32-bit architectures as well
as a link error:
arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn':
blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod'
Split the enum definition to keep the 64-bit timing constants in
a separate enum type from those constants that can clearly fit
within a smaller type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 18 Jan 2023 04:23:18 +0000 (12:23 +0800)]
block: ublk: fix doc build warning
Fix the following warning:
Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Fixes:
56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230118042318.127900-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:35 +0000 (15:36 +0100)]
block: introduce bdev_zone_no helper
Add a generic bdev_zone_no() helper to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.
Use the helper bdev_zone_no() to calculate nr of zones. This lets us
make modifications to the math if needed in one place.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:34 +0000 (15:36 +0100)]
block: add a new helper bdev_{is_zone_start, offset_from_zone_start}
Instead of open coding to check for zone start, add a helper to improve
readability and store the logic in one place.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:33 +0000 (15:36 +0100)]
block: remove superfluous check for request queue in bdev_is_zoned()
Remove the superfluous request queue check in bdev_is_zoned() as
bdev_get_queue() can never return NULL.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anuj Gupta [Tue, 17 Jan 2023 12:06:38 +0000 (17:36 +0530)]
block: extend bio-cache for non-polled requests
This patch modifies the present check, so that bio-cache is not limited
to iopoll.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anuj Gupta [Tue, 17 Jan 2023 12:06:37 +0000 (17:36 +0530)]
nvme: set REQ_ALLOC_CACHE for uring-passthru request
This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests.
This is a prep-patch so that normal / IRQ-driven uring-passthru
I/Os can also leverage bio-cache.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:11 +0000 (12:17 +0800)]
ublk_drv: add mechanism for supporting unprivileged ublk device
unprivileged ublk device is helpful for container use case, such
as: ublk device created in one unprivileged container can be controlled
and accessed by this container only.
Implement this feature by adding flag of UBLK_F_UNPRIVILEGED_DEV, and if
this flag isn't set, any control command has been run from privileged
user. Otherwise, any control command can be sent from any unprivileged
user, but the user has to be permitted to access the ublk char device
to be controlled.
In case of UBLK_F_UNPRIVILEGED_DEV:
1) for command UBLK_CMD_ADD_DEV, it is always allowed, and user needs
to provide owner's uid/gid in this command, so that udev can set correct
ownership for the created ublk device, since the device owner uid/gid
can be queried via command of UBLK_CMD_GET_DEV_INFO.
2) for other control commands, they can only be run successfully if the
current user is allowed to access the specified ublk char device, for
running the permission check, path of the ublk char device has to be
provided by these commands.
Also add one control of command UBLK_CMD_GET_DEV_INFO2 which always
include the char dev path in payload since userspace may not have
knowledge if this device is created in unprivileged mode.
For applying this mechanism, system administrator needs to take
the following policies:
1) chmod 0666 /dev/ublk-control
2) change ownership of ublkcN & ublkbN
- chown owner_uid:owner_gid /dev/ublkcN
- chown owner_uid:owner_gid /dev/ublkbN
Both can be done via one simple udev rule.
Userspace:
https://github.com/ming1/ubdsrv/tree/unprivileged-ublk
'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged
ublk device if the user is un-privileged.
Link: https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:10 +0000 (12:17 +0800)]
ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev
Prepare for supporting unprivileged ublk device by limiting max number
ublk devices added. Otherwise too many ublk devices could be added by
un-trusted user, which can be thought as one DoS.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:09 +0000 (12:17 +0800)]
ublk_drv: add device parameter UBLK_PARAM_TYPE_DEVT
Userspace side only knows device ID, but the associated path of ublkc* and
ublkb* could be changed by udev, and that depends on userspace's policy, so
add parameter of UBLK_PARAM_TYPE_DEVT for retrieving major/minor of the
ublkc* and ublkb*, then user may figure out major/minor of the ublk disks
he/she owns. With major/minor, it is easy to find the device node path.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:08 +0000 (12:17 +0800)]
ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd
It is annoying for each control command handler to get/put ublk
device and deal with failure.
Control command handler is simplified a lot by moving
ublk_get_device_from_id into ublk_ctrl_uring_cmd().
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:07 +0000 (12:17 +0800)]
ublk_drv: don't probe partitions if the ubq daemon isn't trusted
If any ubq daemon is unprivileged, the ublk char device is allowed
for unprivileged user actually, and we can't trust the current user,
so not probe partitions.
Fixes:
71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:06 +0000 (12:17 +0800)]
ublk_drv: remove nr_aborted_queues from ublk_device
No one uses 'nr_aborted_queues' any more, so remove it.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 16 Jan 2023 15:55:53 +0000 (08:55 -0700)]
block: don't allow multiple bios for IOCB_NOWAIT issue
If we're doing a large IO request which needs to be split into multiple
bios for issue, then we can run into the same situation as the below
marked commit fixes - parts will complete just fine, one or more parts
will fail to allocate a request. This will result in a partially
completed read or write request, where the caller gets EAGAIN even though
parts of the IO completed just fine.
Do the same for large bios as we do for splits - fail a NOWAIT request
with EAGAIN. This isn't technically fixing an issue in the below marked
patch, but for stable purposes, we should have either none of them or
both.
This depends on:
613b14884b85 ("block: handle bio_split_to_limits() NULL return")
Cc: stable@vger.kernel.org # 5.15+
Fixes:
9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766
Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Gruenbacher [Fri, 13 Jan 2023 12:35:38 +0000 (13:35 +0100)]
drbd: drbd_insert_interval(): Clarify comment
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-9-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Fri, 13 Jan 2023 12:35:37 +0000 (13:35 +0100)]
drbd: interval tree: make removing an "empty" interval a no-op
Trying to remove an "empty" (just initialized, or "cleared") interval
from the tree, this results in an endless loop.
As we typically protect the tree with a spinlock_irq,
the result is a hung system.
Be nice to error cleanup code paths, ignore removal of empty intervals.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-8-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:36 +0000 (13:35 +0100)]
MAINTAINERS: add drbd headers
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-7-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:35 +0000 (13:35 +0100)]
drbd: remove macros using require_context
This require_context attribute originated in a proposed sparse patch by
Philipp Reisner back in 2008. Johannes Berg had a different solution to
a similar problem, and that patch "won" in the end; so the require_context
thing never got merged. The whole history can be read at [0].
DRBD kept using these annotations anyway for a while. Nowadays, on a
modern unmodified sparse, they obviously do nothing, and they are hardly
used anymore anyway.
So, just remove the definitions of these macros.
[0] https://www.spinics.net/lists/linux-sparse/msg01150.html
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-6-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:34 +0000 (13:35 +0100)]
drbd: remove unnecessary assignment in vli_encode_bits
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-5-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:33 +0000 (13:35 +0100)]
drbd: make limits unsigned
These are almost always used as unsigned integers, so mark them as such.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-4-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Robert Altnoeder [Fri, 13 Jan 2023 12:35:32 +0000 (13:35 +0100)]
drbd: fix DRBD_VOLUME_MAX 65535 -> 65534
The protocol uses -1 as a reserved value for
'no specific volume', and since the protocol field
is a 16 bit unsigned value, -1 is converted to
65535. Therefore, limit the range of valid volume
numbers to [0, 65534].
Signed-off-by: Robert Altnoeder <robert.altnoeder@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-3-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:31 +0000 (13:35 +0100)]
drbd: adjust drbd_limits license header
See also commit
93c68cc46a07 ("drbd: use consistent license"). We only
want to license drbd under GPL-2.0, so use the corresponding SPDX header
consistently.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:06 +0000 (13:35 +0100)]
drbd: split off drbd_config into separate file
To be more similar to what we do in the out-of-tree module and ease the
upstreaming process.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123506.144082-4-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:05 +0000 (13:35 +0100)]
drbd: drop API_VERSION define
Use the genetlink api version as defined in drbd_genl_api.h.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123506.144082-3-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:04 +0000 (13:35 +0100)]
drbd: split off drbd_buildtag into separate file
To be more similar to what we do in the out-of-tree module and ease the
upstreaming process.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123506.144082-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 9 Jan 2023 17:45:56 +0000 (10:45 -0700)]
block: add a BUILD_BUG_ON() for adding more bio flags than we have space
We have BIO_FLAG_LAST in the enum for bio specific flags, but it's
not used to check that we're not exceeding the size of them. Add
such a check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 5 Jan 2023 20:51:46 +0000 (12:51 -0800)]
block: save user max_sectors limit
The user can set the max_sectors limit to any valid value via sysfs
/sys/block/<dev>/queue/max_sectors_kb attribute. If the device limits
are ever rescanned, though, the limit reverts back to the potentially
artificially low BLK_DEF_MAX_SECTORS value.
Preserve the user's setting as the max_sectors limit as long as it's
valid. The user can reset back to defaults by writing 0 to the sysfs
file.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230105205146.3610282-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 5 Jan 2023 20:51:45 +0000 (12:51 -0800)]
block: make BLK_DEF_MAX_SECTORS unsigned
This is used as an unsigned value, so define it that way to avoid
having to cast it.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230105205146.3610282-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Davide Zini [Tue, 3 Jan 2023 14:55:03 +0000 (15:55 +0100)]
block, bfq: balance I/O injection among underutilized actuators
Upon the invocation of its dispatch function, BFQ returns the next I/O
request of the in-service bfq_queue, unless some exception holds. One
such exception is that there is some underutilized actuator, different
from the actuator for which the in-service queue contains I/O, and
that some other bfq_queue happens to contain I/O for such an
actuator. In this case, the next I/O request of the latter bfq_queue,
and not of the in-service bfq_queue, is returned (I/O is injected from
that bfq_queue). To find such an actuator, a linear scan, in
increasing index order, is performed among actuators.
Performing a linear scan entails a prioritization among actuators: an
underutilized actuator may be considered for injection only if all
actuators with a lower index are currently fully utilized, or if there
is no pending I/O for any lower-index actuator that happens to be
underutilized.
This commits breaks this prioritization and tends to distribute
injection uniformly across actuators. This is obtained by adding the
following condition to the linear scan: even if an actuator A is
underutilized, A is however skipped if its load is higher than that of
the next actuator.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Davide Zini [Tue, 3 Jan 2023 14:55:02 +0000 (15:55 +0100)]
block, bfq: inject I/O to underutilized actuators
The main service scheme of BFQ for sync I/O is serving one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems the latter necessary to boost throughput or
to preserve service guarantees. Unfortunately, when BFQ enforces
this policy, only one actuator at a time gets served for a while,
because each bfq_queue contains I/O only for one actuator. The
other actuators may remain underutilized.
Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism may provide the ground for dealing also with
the above actuator-underutilization problem. Yet BFQ does not take
the actuator load into account when choosing which queue to pick
extra I/O from. In addition, BFQ may happen to inject extra I/O
only when the in-service queue is temporarily empty.
In view of these facts, this commit extends the
injection mechanism in such a way that the latter:
(1) takes into account also the actuator load;
(2) checks such a load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.
To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.
We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to be
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possibile to the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).
This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues. Namely if there
are few bfq_queues containing I/O for some actuators and many
bfq_queues containing I/O for other actuators. Or if the
bfq_queues containing I/O for some actuators have lower weights
than the other bfq_queues.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Federico Gavioli [Tue, 3 Jan 2023 14:55:01 +0000 (15:55 +0100)]
block, bfq: retrieve independent access ranges from request queue
This patch implements the code to gather the content of the
independent_access_ranges structure from the request_queue and copy
it into the queue's bfq_data. This copy is done at queue initialization.
We copy the access ranges into the bfq_data to avoid taking the queue
lock each time we access the ranges.
This implementation, however, puts a limit to the maximum independent
ranges supported by the scheduler. Such a limit is equal to the constant
BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of
dynamic memory.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Davide Zini [Tue, 3 Jan 2023 14:55:00 +0000 (15:55 +0100)]
block, bfq: split also async bfq_queues on a per-actuator basis
Similarly to sync bfq_queues, also async bfq_queues need to be split
on a per-actuator basis.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Tue, 3 Jan 2023 14:54:59 +0000 (15:54 +0100)]
block, bfq: turn bfqq_data into an array in bfq_io_cq
When a bfq_queue Q is merged with another queue, several pieces of
information are saved about Q. These pieces are stored in the
bfqq_data field in the bfq_io_cq data structure of the process
associated with Q.
Yet, with a multi-actuator drive, a process may get associated with
multiple bfq_queues: one queue for each of the N actuators. Each of
these queues may undergo a merge. So, the bfq_io_cq data structure
must be able to accommodate the above information for N queues.
This commit solves this problem by turning the bfqq_data scalar field
into an array of N elements (and by changing code so as to handle
this array).
This solution is written under the assumption that bfq_queues
associated with different actuators cannot be cross-merged. This
assumption holds naturally with basic queue merging: the latter is
triggered by spatial locality, and sectors for different actuators are
not close to each other (apart from the corner case of the last
sectors served by a given actuator and the first sectors served by the
next actuator). As for stable cross-merging, the assumption here is
that it is disabled.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Tue, 3 Jan 2023 14:54:58 +0000 (15:54 +0100)]
block, bfq: move io_cq-persistent bfqq data into a dedicated struct
With a multi-actuator drive, a process may get associated with multiple
bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq
data structure must be able to accommodate its per-queue persistent
information for N queues. Currently it stores this information for
just one queue, in several scalar fields.
This is a preparatory commit for moving to accommodating persistent
information for N queues. In particular, this commit packs all the
above scalar fields into a single data structure. Then there is now
only one field, in bfq_io_cq, that stores all the above information. This
scalar field will then be turned into an array by a following commit.
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Tue, 3 Jan 2023 14:54:57 +0000 (15:54 +0100)]
block, bfq: forbid stable merging of queues associated with different actuators
If queues associated with different actuators are merged, then control
is lost on each actuator. Therefore some actuator may be
underutilized, and throughput may decrease. This problem cannot occur
with basic queue merging, because the latter is triggered by spatial
locality, and sectors for different actuators are not close to each
other. Yet it may happen with stable merging. To address this issue,
this commit prevents stable merging from occurring among queues
associated with different actuators.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Tue, 3 Jan 2023 14:54:56 +0000 (15:54 +0100)]
block, bfq: split sync bfq_queues on a per-actuator basis
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator
SATA drives appear as a single device to the I/O subsystem [1]. Yet
they address commands to different actuators internally, as a function
of Logical Block Addressing (LBAs). A given sector is reachable by
only one of the actuators. For example, Seagate’s Serial Advanced
Technology Attachment (SATA) version contains two actuators and maps
the lower half of the SATA LBA space to the lower actuator and the
upper half to the upper actuator.
Evidently, to fully utilize actuators, no actuator must be left idle
or underutilized while there is pending I/O for it. The block layer
must somehow control the load of each actuator individually. This
commit lays the ground for allowing BFQ to provide such a per-actuator
control.
BFQ associates an I/O-request sync bfq_queue with each process doing
synchronous I/O, or with a group of processes, in case of queue
merging. Then BFQ serves one bfq_queue at a time. While in service, a
bfq_queue is emptied in request-position order. Yet the same process,
or group of processes, may generate I/O for different actuators. In
this case, different streams of I/O (each for a different actuator)
get all inserted into the same sync bfq_queue. So there is basically
no individual control on when each stream is served, i.e., on when the
I/O requests of the stream are picked from the bfq_queue and
dispatched to the drive.
This commit enables BFQ to control the service of each actuator
individually for synchronous I/O, by simply splitting each sync
bfq_queue into N queues, one for each actuator. In other words, a sync
bfq_queue is now associated to a pair (process, actuator). As a
consequence of this split, the per-queue proportional-share policy
implemented by BFQ will guarantee that the sync I/O generated for each
actuator, by each process, receives its fair share of service.
This is just a preparatory patch. If the I/O of the same process
happens to be sent to different queues, then each of these queues may
undergo queue merging. To handle this event, the bfq_io_cq data
structure must be properly extended. In addition, stable merging must
be disabled to avoid loss of control on individual actuators. Finally,
also async queues must be split. These issues are described in detail
and addressed in next commits. As for this commit, although multiple
per-process bfq_queues are provided, the I/O of each process or group
of processes is still sent to only one queue, regardless of the
actuator the I/O is for. The forwarding to distinct bfq_queues will be
enabled after addressing the above issues.
[1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 29 Jan 2023 21:59:43 +0000 (13:59 -0800)]
Linux 6.2-rc6
Linus Torvalds [Sun, 29 Jan 2023 19:26:49 +0000 (11:26 -0800)]
Merge tag 'irq_urgent_for_v6.2_rc6' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Cleanup the firmware node for the new IRQ MSI domain properly, to
avoid leaking memory
* tag 'irq_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Free the fwnode created by msi_create_device_irq_domain()
Linus Torvalds [Sun, 29 Jan 2023 19:17:34 +0000 (11:17 -0800)]
Merge tag 'x86_urgent_for_v6.2_rc6' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Start checking for -mindirect-branch-cs-prefix clang support too now
that LLVM 16 will support it
- Fix a NULL ptr deref when suspending with Xen PV
- Have a SEV-SNP guest check explicitly for features enabled by the
hypervisor and fail gracefully if some are unsupported by the guest
instead of failing in a non-obvious and hard-to-debug way
- Fix a MSI descriptor leakage under Xen
- Mark Xen's MSI domain as supporting MSI-X
- Prevent legacy PIC interrupts from being resent in software by
marking them level triggered, as they should be, which lead to a NULL
ptr deref
* tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block
acpi: Fix suspend with Xen PV
x86/sev: Add SEV-SNP guest feature negotiation support
x86/pci/xen: Fixup fallout from the PCI/MSI overhaul
x86/pci/xen: Set MSI_FLAG_PCI_MSIX support in Xen MSI domain
x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
Linus Torvalds [Sun, 29 Jan 2023 19:06:47 +0000 (11:06 -0800)]
Merge tag 'input-for-v6.2-rc5' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- touchpads on HP 15-* laptops switched back to PS/2 emulation mode
- a quirk for Clevo PCX0DX/TUXEDO XP1511 to make sure keyboard is
responding after resume
* tag 'input-for-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: i8042 - add Clevo PCX0DX to i8042 quirk table
Revert "Input: synaptics - switch touchpad on HP Laptop 15-da3001TU to RMI mode"
Linus Torvalds [Sun, 29 Jan 2023 18:47:22 +0000 (10:47 -0800)]
Merge tag 'cxl-fixes-for-6.2-rc6' of git://git./linux/kernel/git/cxl/cxl
Pull cxl fixes from Dan Williams:
"A couple of fixes for bugs introduced during the merge window. One is
a regression, the other was a bug in the CXL AER handler:
- Fix a crash regression due to module load order of cxl_pmem.ko
- Fix wrong register offset read in CXL AER handling path"
* tag 'cxl-fixes-for-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent
cxl: fix cxl_report_and_clear() RAS UE addr mis-assignment
Vlastimil Babka [Fri, 13 Jan 2023 17:33:45 +0000 (18:33 +0100)]
Revert "mm/compaction: fix set skip in fast_find_migrateblock"
This reverts commit
7efc3b7261030da79001c00d92bc3392fd6c664c.
We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
stalling CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck in repeating fast_find_migrateblock()
based migrate page isolation, and then fails to migrate all isolated
pages.
Commit
7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
was suspected as it was merged in 6.1 and in theory can indeed remove a
termination condition for fast_find_migrateblock() under certain
conditions, as it removes a place that always marks a scanned pageblock
from being re-scanned. There are other such places, but those can be
skipped under certain conditions, which seems to match the tracepoint
data.
Testing of revert also appears to have resolved the issue, thus revert
the commit until a more robust solution for the original problem is
developed.
It's also likely this will fix qemu stalls with 6.1 kernel reported in
Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Link: https://lore.kernel.org/lkml/20230125134434.18017-1-mgorman@techsingularity.net/
Fixes:
7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: <stable@vger.kernel.org>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Jan 2023 19:17:57 +0000 (11:17 -0800)]
Fix up more non-executable files marked executable
Joe found another DT file that shouldn't be executable, and that
frustrated me enough that I went hunting with this script:
git ls-files -s |
grep '^100755' |
cut -f2 |
xargs grep -L '^#!'
and that found another file that shouldn't have been marked executable
either, despite being in the scripts directory.
Maybe these two are the last ones at least for now. But I'm sure we'll
be back in a few years, fixing things up again.
Fixes:
8c6789f4e2d4 ("ASoC: dt-bindings: Add Everest ES8326 audio CODEC")
Fixes:
4d8e5cd233db ("locking/atomics: Fix scripts/atomic/ script permissions")
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Jan 2023 18:52:51 +0000 (10:52 -0800)]
Merge tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
"Four smb3 server fixes, all also for stable:
- fix for signing bug
- fix to more strictly check packet length
- add a max connections parm to limit simultaneous connections
- fix error message flood that can occur with newer Samba xattr
format"
* tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: downgrade ndr version error message to debug
ksmbd: limit pdu length size according to connection status
ksmbd: do not sign response to session request for guest login
ksmbd: add max connections parameter
Linus Torvalds [Sat, 28 Jan 2023 01:41:47 +0000 (17:41 -0800)]
Merge tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"Fix for reconnect oops in smbdirect (RDMA), also is marked for stable"
* tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix oops due to uncleared server->smbd_conn in reconnect
Linus Torvalds [Sat, 28 Jan 2023 00:16:57 +0000 (16:16 -0800)]
Merge tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Minor tweaks for this release:
- NVMe pull request via Christoph:
- Flush initial scan_work for async probe (Keith Busch)
- Fix passthrough csi check (Keith Busch)
- Fix nvme-fc initialization order (Ross Lagerwall)
- Fix for tearing down non-started device in ublk (Ming)"
* tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux:
block: ublk: move ublk_chr_class destroying after devices are removed
nvme: fix passthrough csi check
nvme-pci: flush initial scan_work for async probe
nvme-fc: fix initialization order
Linus Torvalds [Sat, 28 Jan 2023 00:15:06 +0000 (16:15 -0800)]
Merge tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Two small fixes for this release:
- Sanitize how async prep is done for drain requests, so we ensure
that it always gets done (Dylan)
- A ring provided buffer recycling fix for multishot receive (me)"
* tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux:
io_uring: always prep_async for drain requests
io_uring/net: cache provided buffer group value for multishot receives
Linus Torvalds [Sat, 28 Jan 2023 00:09:12 +0000 (16:09 -0800)]
Merge tag 'hardening-v6.2-rc6' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
- Reorganize gcc-plugin includes for GCC 13
- Silence bcache memcpy run-time false positive warnings
* tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
bcache: Silence memcpy() run-time false positive warnings
gcc-plugins: Reorganize gimple includes for GCC 13
kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
Linus Torvalds [Sat, 28 Jan 2023 00:03:32 +0000 (16:03 -0800)]
Merge tag 'trace-v6.2-rc5' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix filter memory leak by calling ftrace_free_filter()
- Initialize trace_printk() earlier so that ftrace_dump_on_oops shows
data on early crashes.
- Update the outdated instructions in scripts/tracing/ftrace-bisect.sh
- Add lockdep_is_held() to fix lockdep warning
- Add allocation failure check in create_hist_field()
- Don't initialize pointer that gets set right away in enabled_monitors_write()
- Update MAINTAINER entries
- Fix help messages in Kconfigs
- Fix kernel-doc header for update_preds()
* tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
bootconfig: Update MAINTAINERS file to add tree and mailing list
rv: remove redundant initialization of pointer ptr
ftrace: Maintain samples/ftrace
tracing/filter: fix kernel-doc warnings
lib: Kconfig: fix spellos
trace_events_hist: add check for return value of 'create_hist_field'
tracing/osnoise: Use built-in RCU list checking
tracing: Kconfig: Fix spelling/grammar/punctuation
ftrace/scripts: Update the instructions for ftrace-bisect.sh
tracing: Make sure trace_printk() can output as soon as it can be used
ftrace: Export ftrace_free_filter() to modules
Linus Torvalds [Fri, 27 Jan 2023 21:52:38 +0000 (13:52 -0800)]
Merge tag 'i2c-for-6.2-rc6' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A bunch of driver fixes with a tiny bit of new IDs"
* tag 'i2c-for-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: rk3x: fix a bunch of kernel-doc warnings
i2c: axxia: use 'struct' for kernel-doc notation
dt-bindings: i2c: renesas,rzv2m: Fix SoC specific string
i2c: mxs: suppress probe-deferral error message
i2c: designware-pci: Add new PCI IDs for AMD NAVI GPU
i2c: designware: Fix unbalanced suspended flag
i2c: designware: use casting of u64 in clock multiplication to avoid overflow
Linus Torvalds [Fri, 27 Jan 2023 21:47:40 +0000 (13:47 -0800)]
Merge tag 'gpio-fixes-for-v6.2-rc6' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix the -c option in the gpio-event-mode user-space example program
- fix the irq number translation in gpio-ep93xx and make its irqchip
immutable
- add a missing spin_unlock in error path in gpio-mxc
- fix a suspend breakage on System76 and Lenovo Gen2a introduced in
GPIO ACPI
* tag 'gpio-fixes-for-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
tools: gpio: fix -c option of gpio-event-mon
gpio: ep93xx: remove unused variable
gpio: ep93xx: Make irqchip immutable
gpio: ep93xx: Fix port F hwirq numbers in handler
gpio: mxc: Unlock on error path in mxc_flip_edge()
gpiolib-acpi: Don't set GPIOs for wakeup in S3 mode
Linus Torvalds [Fri, 27 Jan 2023 21:43:46 +0000 (13:43 -0800)]
Merge tag 'regulator-fix-v6.2-rc5' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A fix for the DT binding documentation which dropped a property when
being converted to YAML format causing spurious errors validating
device trees for platforms using the device"
* tag 'regulator-fix-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: dt-bindings: samsung,s2mps14: add lost samsung,ext-control-gpios
Linus Torvalds [Fri, 27 Jan 2023 21:39:30 +0000 (13:39 -0800)]
Merge tag 'ovl-fixes-6.2-rc6' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs, a recent one introduced in the last cycle, and an older
one from v5.11"
* tag 'ovl-fixes-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fail on invalid uid/gid mapping at copy up
ovl: fix tmpfile leak